1 Introduction

Relational learning can be described as the task of learning a first-order logic theory from examples (Džeroski and Lavrač 2001; De Raedt 2008). Differently from propositional learning, relational learning does not use a set of attributes and values. Instead, it is based on objects and relations among objects, which are represented by constants and predicates, respectively. This enables a range of applications of machine learning, for example in Bioinformatics, graph mining and link analysis, serious games, etc. (Bain and Muggleton 1994; Srinivasan and Muggleton 1994; Džeroski and Lavrač 2001; King and Srinivasan 1995; King et al. 2004; Muggleton et al. 2010). Inductive Logic Programming (ILP) (Muggleton and De Raedt 1994; Nienhuys-Cheng and de Wolf 1997) performs relational learning either directly by manipulating first-order clauses or through a method called propositionalization (Lavrač and Džeroski 1994; Železný and Lavrač 2006), which brings the relational task down to the propositional level by representing subsets of relations as features that can then be used as attributes. In comparison with full ILP, propositionalization normally exchanges accuracy for efficiency (Krogel et al. 2003), as it enables the use of fast attribute-value learners such as decision trees or even neural networks (Quinlan 1993; Rumelhart et al. 1994), but could lose information in the translation of first-order clauses into features.

In this paper, we introduce a fast system for relational learning based on a new form of propositionalization, which we call Bottom Clause Propositionalization (BCP). Bottom clauses are boundaries on the hypothesis search space, first introduced by Muggleton (1995) as part of the Progol system, and are built from one random positive example, background knowledge (a set of clauses that describe what is known) and language bias (a set of clauses that define how clauses can be built). A bottom clause is the most specific clause (with most literals) that can be considered as a candidate hypothesis. BCP uses bottom clauses for propositionalization because they carry semantic meaning, and because bottom clause literals can be used directly as features in a truth-table, simplifying the feature extraction process (Muggleton and Tamaddoni-Nezhad 2008; DiMaio and Shavlik 2004; Pitangui and Zaverucha 2012).

The idea of using BCP for learning came from our attempts to represent and learn first-order logic in neural networks (Garcez and Zaverucha 2012). Neural networks (Rumelhart et al. 1994) are attribute-value learners based on gradient descent. Learning in neural networks is achieved by performing small changes to a set of weights, in contrast with ILP, which performs learning at the concept level. Neural networks’ distributed architecture is generally credited as a reason for their robustness; neural networks seem to perform well in continuous domains and when learning from noisy data (Rumelhart et al. 1994). Systems that combine symbolic computation with neural networks are called neural-symbolic systems (Garcez et al. 2002). In neural-symbolic integration, the representation of first-order logic by neural networks is of interest in its own right, since first-order logic learning and reasoning using connectionist systems remains an open research question (Garcez et al. 2008). As a result, we investigate whether neural-symbolic learning is a good match for BCP. The experiments reported below indicate that this is indeed the case, in comparison with standard ILP and a well-known propositionalization method.

The neural-symbolic system C-IL2P has been shown effective at learning and reasoning from propositional data in a number of domains (Garcez and Zaverucha 1999). C-IL2P uses background knowledge in the form of propositional logic programs to build a neural network, which is in turn trained by examples using backpropagation (Rumelhart et al. 1994). We have extended C-IL2P to handle first-order logic by using BCP. The extended system, which we call CILP++, was implemented in C++ and is available to download from Sourceforge at https://sourceforge.net/projects/cilppp/ (the experiments reported in this paper can be reproduced by downloading the datasets and list of parameters from http://soi.city.ac.uk/~abdz937/bcexperiments.zip). CILP++ incorporates BCP as a novel propositionalization method and, differently from C-IL2P, CILP++ networks are first-order in that each neuron denotes a first-order atom. Yet, CILP++ learning uses the same neural model as C-IL2P, by transforming each first-order example into a bottom clause. Experimental evaluations reported in this paper show that such a combination can lead to efficient learning of relational concepts. Given our experimental results, which are summarized in the next paragraph, our long-term goal is to apply and evaluate BCP in other general settings, including discrete and continuous data, noisy environments with missing values, and problems containing errors in the background knowledge.

We have compared CILP++ with Aleph (Srinivasan 2007)—a state-of-the-art ILP system based on Progol—and compared BCP with a well-known propositionalization method, RSD (Železný and Lavrač 2006), using neural networks and the C4.5 decision tree learner (Quinlan 1993), on a number of benchmarks: four Alzheimer’s datasets (King and Srinivasan 1995), the Mutagenesis (Srinivasan and Muggleton 1994), KRK (Bain and Muggleton 1994) and UW-CSE (Richardson and Domingos 2006) datasets. Several aspects were empirically evaluated: standard classification accuracy using cross-validation and runtime measurements, how BCP performs in comparison with RSD, and how CILP++ performs in different settings using feature selection (Guyon and Elisseeff 2003). The CILP++ implementation has not been optimized for performance. We evaluated six different configurations of CILP++ in order to explore some of the capabilities of the approach: three versions of CILP++ trained with standard backpropagation, each one using three sizes of background knowledge, and three versions of CILP++ trained with early stopping (Prechelt 1997), with the same three background knowledge sizes used for standard backpropagation. In the first set of experiments—accuracy vs. runtime—CILP++ achieved results comparable to Aleph and performed faster on most datasets.

Regarding the performance of BCP against RSD, BCP achieved a statistically significant improvement in accuracy in comparison with RSD when running with a neural network, but BCP and RSD showed similar performance when running with C4.5. Nevertheless, BCP was faster than RSD in all cases. Since bottom clauses may have a large number of literals (Muggleton 1995), BCP might generate a large number of features. Hence, we evaluated accuracy also using feature selection, as follows. CILP++ was extended to include a statistical feature selection method called mRMR, which is widely used for visual recognition and audio analysis (Ding and Peng 2005). We applied three-fold cross validation on training data to choose two models and used those on two Alzheimer’s datasets. The results indicate the existence of an optimal variable-depth parameter for generating bottom clauses and that “more is not merrier”. In one CILP++ model, mRMR managed to reduce the number of features by over 90 % with a loss of less than 2 % in accuracy on two Alzheimer’s testbeds, although an increase in runtime was observed. Further experiments in different application domains and comparison with other propositionalization methods, e.g. Kuželka and Železný (2011), are under way.

1.1 Related work

Approaches related to CILP++ can be grouped into three categories: approaches that also use bottom clauses, other propositionalization methods, and other relational learning methods. In the first category, DiMaio and Shavlik (2004) use bottom clauses with neural networks to build an efficient hypothesis evaluator for ILP. Instead, CILP++ uses bottom clauses to classify first-order examples. The QG/GA system (Muggleton and Tamaddoni-Nezhad 2008) introduces a new hypothesis search algorithm for ILP, called Quick Generalization (QG), which performs random single reductions in bottom clauses to generate candidate hypothesis clauses. Additionally, QG/GA proposes the use of Genetic Algorithms (GA) on those candidate clauses to further explore the search space, converting the clauses into numerical patterns. CILP++ does the same, but for use with neural networks instead of discrete GAs.

In the second category—other propositionalizations—LINUS (Lavrač and Džeroski 1994) was the first system to introduce propositionalization. It worked with acyclic and function-free Horn clauses, like Progol and BCP, but differently from BCP, it constrained the first-order language further to only accept clauses where all body variables also appear in the head. Its successor, DINUS (Kramer et al. 2001), allows a larger subset of clauses to be accepted (determinate clauses), allowing clauses with body variables that do not appear in the head literal, but still allowing only one possible instantiation of those variables. Finally, SINUS (Kramer et al. 2001) improved on DINUS by allowing unconstrained clauses, making use of language bias in the feature selection and verifying if it is possible to unify newly found literals with existing ones, while keeping consistency between pre-existing variable naming, thus reducing the final number of features. LINUS and DINUS treat body literals as features, which is similar to BCP. However, BCP can deal with the same language as Progol, thus having none of the language restrictions of LINUS/DINUS. SINUS, on the other hand, propositionalizes similarly to another method, RSD, which is compared with BCP in this work and is explained separately in Sect. 2.3. RSD also has a recent successor, called RelF (Kuželka and Železný 2011), which takes a more classification-driven approach than RSD by only considering features that are interesting for distinguishing between classes (it also discards features that θ-subsume any previously-generated feature). Comparisons with RelF are under way.

Finally, in the third category, we place the body of work on statistical relational learning (Getoor and Taskar 2007; De Raedt et al. 2008), that albeit relevant for comparison, is less directly related to this work, e.g. Markov Logic Networks (MLN) (Richardson and Domingos 2006) and other systems combining relational and probabilistic graphical models (Koller and Friedman 2009; Paes et al. 2005), neural-symbolic systems for learning from first-order data in neural networks such as Basilio et al. (2001), Kijsirikul and Lerdlamnaochai (2005) and Guillame-Bert et al. (2010), and systems that propose to integrate neural networks and first-order logic through relational databases, e.g. Uwents et al. (2011). Those systems differ from CILP++ mainly in that they seek to embed relational data directly into the networks’ structures, which is a difficult task. In contrast, CILP++ seeks to benefit from using a simple network structure as an attribute-value learner, following a propositionalization approach, as discussed earlier.

Summarizing, the contribution of this paper is two-fold. The paper introduces: (i) a novel propositionalization method, BCP, which converts first-order examples into propositional patterns by generating their bottom clauses, treating each body literal as a propositional feature, and (ii) the successor of C-IL2P, the CILP++ system, which reduces C-IL2P’s learning times and system complexity, uses a new weight normalization while maintaining the integrity of first-order background knowledge, and is easily configurable for use with any ILP dataset. CILP++ takes advantage of mode declarations and determinations to generate consistent bottom clauses which share variable names, and is thus applicable to any dataset to which the first-order systems Aleph or Progol are applicable. CILP++ may use C-IL2P’s knowledge extraction algorithm (Garcez et al. 2001) so that interpretable first-order rules can be obtained from the trained network. Currently, first-order rules can be obtained when BCP is used together with C4.5 (since each node represents a first-order literal). This is further discussed in the body of the paper.

The remainder of the paper is organized as follows. In Sect. 2, we introduce the background concepts used throughout the paper: ILP, propositionalization, neural networks and neural-symbolic systems. In Sect. 3, we show how CILP++ builds a neural network from bottom clauses and how the network can be trained using bottom clauses as examples with backpropagation. We also analyze two stopping criteria: standard training error minimization and early stopping, and discuss how knowledge extraction can be carried out. In Sect. 4, we report and discuss all experimental results, and in Sect. 5, we conclude and discuss directions for future work.

2 Background

In this section, both machine learning subfields that are directly related to this work (Inductive Logic Programming and Artificial Neural Networks) are reviewed. This section also introduces notation used throughout the paper. An introduction to C-IL2P is also presented, followed by a review of propositionalization and feature selection.

2.1 ILP and bottom clause

Inductive Logic Programming (Muggleton and De Raedt 1994) is an area of machine learning that makes use of logical languages to induce theory-structured hypotheses. Given a set of labeled examples E and background knowledge B, an ILP system seeks to find a hypothesis H that minimizes a specified loss function. More precisely, an ILP task is defined as 〈E,B,L〉, where \(E = E^{+} \cup E^{-}\) is a set of positive (\(E^{+}\)) and negative (\(E^{-}\)) clauses, called examples, B is a logic program called background knowledge, which is composed of facts and rules, and L is a set of logic theories called language bias.

The set of all possible hypotheses for a given task, which we call \(S_{H}\), can be infinite (Muggleton 1995). One of the features that constrains \(S_{H}\) in ILP is the language bias, L. It is usually composed of specification predicates, which define how the search is done and how far it can go. The most common specification language is called mode declarations, composed of: modeh predicates, which define what can appear as the head of a clause; modeb predicates, which define what predicates can appear in the body of a clause; and determination predicates, which relate body and head literals. The modeb and modeh declarations also specify what is considered to be an input variable, an output variable, a constant, and an upper bound on how many times the predicate they specify can appear in the same clause, called recall. The language bias L, through mode declarations and determination predicates, can restrict \(S_{H}\) during hypothesis search to only allow a smaller set of candidate hypotheses \(H_{c}\) to be searched. Formally, \(H_{c}\) is a candidate hypothesis for a given ILP task 〈E,B,L〉 iff \(H_{c} \in L\), \(B \cup H_{c} \models E^{+} \setminus N\) and \(B \cup H_{c} \not\models E^{-} \setminus N\), where \(N \subseteq E\) is an allowed noise for \(H_{c}\), in order to ameliorate overfitting issues.
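
To make the language bias concrete, the snippet below shows what a set of mode declarations and determinations might look like for the family relationship example of Sect. 3. The declarations are illustrative Aleph/Progol-style assumptions made here for exposition (they are not taken from the paper’s datasets); ‘+’ and ‘-’ mark input and output variables, and the first argument of modeh/modeb is the recall bound.

```python
# Illustrative Aleph/Progol-style language bias for the family relationship
# example of Sect. 3 (an assumption for exposition, not a dataset's actual bias).
LANGUAGE_BIAS = [
    "modeh(1, motherInLaw(+person, +person)).",   # what may appear as a clause head
    "modeb(*, mother(+person, -person)).",        # what may appear in clause bodies
    "modeb(*, wife(+person, -person)).",
    "determination(motherInLaw/2, mother/2).",    # which body predicates may define the head
    "determination(motherInLaw/2, wife/2).",
]
```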

The hypothesis search space of algorithms based on inverse entailment, such as Progol, is further restricted by the most specific (saturated) clause, \(\bot_{e}\). Given an example e, Progol first generates a clause that represents e in the most specific way possible: it searches L for a modeh declaration that unifies with e and, if one is found, creates an initial \(\bot_{e}\). It then repeatedly goes through the determination and modeb declarations to verify which of the body literals they specify can be added to \(\bot_{e}\), until a given number of cycles (known as the variable depth) through the modeb declarations has been reached.

2.2 Artificial neural networks and C-IL2P

An artificial neural network (ANN) is a directed graph with the following structure: a unit (or neuron) in the graph is characterized, at time t, by its input vector \(I_{i}(t)\), its input potential \(U_{i}(t)\), its activation state \(A_{i}(t)\), and its output \(O_{i}(t)\). The units of the network are interconnected via a set of directed and weighted connections such that if there is a connection from unit i to unit j then \(W_{ji}\in\mathbb{R}\) denotes the weight of this connection. The input potential of neuron i at time t, \(U_{i}(t)\), is obtained by computing a weighted sum for neuron i such that \(U_{i}(t)=\sum_{j} W_{ij} I_{j}(t)\). The activation state \(A_{i}(t)\) of neuron i at time t is then given by the neuron’s activation function \(h_{i}\) such that \(A_{i}(t)=h_{i}(U_{i}(t))\). In addition, \(b_{i}\) (an extra weight with input always fixed at 1) is known as the bias of neuron i. We say that neuron i is active at time t if \(A_{i}(t)>-b_{i}\). Finally, the neuron’s output value is given by its activation state \(A_{i}(t)\).

For learning, backpropagation (Rumelhart et al. 1994), based on gradient descent, is the most widely used algorithm. It aims to minimize an error function E that measures the difference between the network’s output and the example’s target classification. Standard training and early stopping (Haykin 2009) are two commonly used stopping criteria for backpropagation training. In standard training, the full training dataset is used to minimize E, while early stopping uses a validation set to measure overfitting: training stops when the validation set error starts to increase. When this happens, the best validation configuration obtained thus far is used as the learned model.

The Connectionist Inductive Learning and Logic Programming system, C-IL2P (Garcez and Zaverucha 1999), is a neural-symbolic system that builds a recursive ANN using background knowledge composed of propositional clauses (building phase). C-IL2P also learns from examples using backpropagation (training phase), performs inference on unknown data by querying the post-training ANN, and extracts a revised knowledge from the trained network (Garcez et al. 2001) to obtain a new propositional theory (extraction phase). Figure 1 illustrates the building phase and shows how to build a recursive ANN N from background knowledge BK.Footnote 1

Fig. 1

C-IL2P building phase example. Starting from background knowledge BK, C-IL2P creates an ANN by creating a hidden layer neuron for each clause in BK. Thus, the C-IL2P network for BK has three hidden neurons. Then, each body literal in BK is associated with an input neuron, and each head literal in BK is associated with an output neuron. For example, for the first clause (A ← B, C), since it has two body literals B and C, two input neurons are created and are connected to a hidden neuron (1), corresponding to the clause, with positive weight W. If a literal is negated (for example, literal not D in the second clause of BK), its corresponding neuron is connected to the hidden neuron using weight −W. Hidden neuron (1) is then connected to the output neuron corresponding to A using connection weight W. Finally, input and output neurons that share the same label are recursively connected from the output to the input of the network with weight 1, so that output values can be propagated back to the input in the next calculation of rule chaining. For example, the head of the second clause in BK (B) is also one of the body literals of the first clause. Therefore, a recursive connection between the output neuron representing B and the input neuron representing B is created. The resulting network N encodes and can compute BK in parallel, as well as be trained from examples having BK as background knowledge

C-IL2P calculates weight and bias values for all neurons. The value of W is constrained by Eq. (2), the values of the biases of the input layer neurons are set as 0, and the biases of the hidden layer neurons n h (\(b_{n_{h}}\)) and of the output layer neurons n o (\(b_{n_{o}}\)) are given by Eqs. (3) and (4) below, respectively. Both W and the biases are functions of A min: this parameter controls the activation of each neuron n in C-IL2P by only allowing activation if the condition shown in Eq. (1) is satisfied, where w i is a network weight that ends in n, x i is an input and h n is the activation function of neuron n, which is linear if it is an input neuron and semi-linear bipolar Footnote 2 if it is not. W and the biases values are set so that the network implements an AND-OR structure with hidden neurons implementing a logical-AND of input neurons, and output neurons implementing a logical-OR of hidden neurons, so that the network can be used to run the logic program. To exemplify the network computation, given BK, if E and C are set to true, and D is set to false in the network (i.e. neurons E and C are activated while neuron D is not), a feedforward propagation activates output neuron B (because of the second clause in BK). Then, a recursive connection carries this activation to input neuron B, and a second feedforward propagation would activate A. This process continues until a stable state is reached, when no change in activation is seen after a feedforward propagation.

$$\begin{aligned} &h_{n}\biggl(\sum_{\forall i}{w_{i} \cdot x_{i}} + b\biggr) \geq A_{\min} \end{aligned}$$
(1)
$$\begin{aligned} &W \geq\frac{2}{\beta} \cdot\frac{\ln(1+A_{\min}) - \ln (1-A_{\min})}{\max(k_{n}, \mu_{n}) \cdot(A_{\min}-1) + A_{\min} + 1} \end{aligned}$$
(2)
$$\begin{aligned} &b_{n_{h}} = \frac{(1+A_{\min})(k_{n_{h}}-1)}{2} \cdot W \end{aligned}$$
(3)
$$\begin{aligned} &b_{n_{o}} = \frac{(1+A_{\min})(1-\mu_{n_{o}})}{2} \cdot W \end{aligned}$$
(4)

In Eqs. (2)–(4): \(k_{n_{h}}\) is the number of body literals in the clause corresponding to the hidden neuron n h (i.e., the number of connections coming from the input layer to n h ); \(\mu_{n_{o}}\) is the number of clauses in the background knowledge with the same head as the head literal mapped by the output neuron n o (i.e., the number of connections coming from the hidden layer to n o ); max(k n ,μ n ) is the maximum value among all k and all μ, for all neurons n; and β is the semi-linear bipolar activation function slope.
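
As a quick numerical illustration of Eqs. (2)–(4), the sketch below computes the lower bound on W and the corresponding hidden and output biases. The values \(A_{\min}=0.5\), β = 1 and max(k n ,μ n ) = 2 are assumptions chosen for illustration, and the helper names are ours, not C-IL2P’s.

```python
import math

# Numeric sketch of the C-IL2P set-up equations (2)-(4). A_min = 0.5, beta = 1
# and max(k_n, mu_n) = 2 are assumed values chosen for illustration only.

def weight_lower_bound(A_min, beta, max_k_mu):
    """Right-hand side of Eq. (2): any W at or above this value is admissible."""
    return (2.0 / beta) * (math.log(1 + A_min) - math.log(1 - A_min)) / (
        max_k_mu * (A_min - 1) + A_min + 1)

def hidden_bias(A_min, k, W):
    """Eq. (3): bias of a hidden (AND) neuron whose clause has k body literals."""
    return ((1 + A_min) * (k - 1) / 2.0) * W

def output_bias(A_min, mu, W):
    """Eq. (4): bias of an output (OR) neuron whose head appears in mu clauses."""
    return ((1 + A_min) * (1 - mu) / 2.0) * W

A_min, beta = 0.5, 1.0
W = weight_lower_bound(A_min, beta, max_k_mu=2)   # approximately 4.39
print(W, hidden_bias(A_min, k=2, W=W), output_bias(A_min, mu=1, W=W))
```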

After the building phase, training can take place. Optionally, more hidden neurons can be added (if this is needed in order to better approximate the training data) and the network is fully connected with near-zero weighted connections. The training algorithm used by C-IL2P is standard backpropagation (Rumelhart et al. 1994). C-IL2P also does not train recursive connections: they are fixed and only used for inference.

Then, inference and knowledge extraction can be done. Garcez et al. (2001) propose a knowledge extraction algorithm for C-IL2P that splits the trained network into “regular” subnetworks, which do not have connections coming from the same neuron with different signs (positive and negative). It is shown that extraction from those networks is sound and complete; in the general case, soundness can be achieved, but not completeness.

2.3 Propositionalization

Propositionalization is the conversion of a relational database into an attribute-value table, amenable to conventional propositional learners (Krogel et al. 2003). Propositionalization algorithms use background knowledge and examples to find distinctive features, which can differentiate subsets of examples. There are two kinds of propositionalization: logic-oriented and database-oriented. The former aims to build a set of relevant first-order features by distinguishing between first-order objects. The latter aims to exploit database relations and functions to generate features for propositionalization. The main representatives of logic-oriented approaches include: LINUS (and its successors), RSD and RelF; and the main representative of database-oriented approaches is RELAGGS (Krogel and Wrobel 2003). BCP is a new logic-oriented propositionalization technique, which consists of generating bottom-clauses for each first-order example and using the set of all body literals that occur in them as possible features (in other words, as columns for an attribute-value table).

In order to evaluate how BCP performs, it will be compared with RSD (Železný and Lavrač 2006), a well-known propositionalization algorithm for which an implementation is available at (http://labe.felk.cvut.cz/~zelezny/rsd). RSD is a system which tackles the Relational Subgroup Discovery problem: given a population of individuals and a property of interest, RSD seeks to find population subgroups that are as large as possible and have the most unusual distribution characteristics. RSD’s input is an Aleph-formatted dataset, with background knowledge, example set and language bias and its output is a list of clauses that describe interesting subgroups of the examples dataset. RSD is composed of two steps: first-order feature construction and rule induction. The first is a propositionalization method that creates higher-level features that are used to replace groups of first-order literals, and the second is an extension of the propositional CN2 rule learner (Clark and Niblett 1989), for use as a solver of the relational subgroup discovery problem. We are interested in the propositionalization component of RSD, which can be further divided into three steps: all expressions that by definition form a first-order feature and comply with the mode declarations are identified; the user can instantiate variables (through instantiate/1 predicates) in the background knowledge and afterwards, irrelevant features are filtered out; and a propositionalization of each example using the generated features is created. From now on, when we refer to RSD, we are referring to the RSD propositionalization method, not the relational subgroup discovery system.

2.4 Feature selection

As stated in Sect. 2.1, bottom clauses are extensive representations of an example, possibly having an infinite size. In order to tackle this problem, at least two approaches have been proposed: reducing the size of the clauses during generation, or applying a statistical approach afterwards. The first can be done as part of the bottom clause generation algorithm (Muggleton 1995), by reducing the variable depth value. The variable depth specifies an upper bound on the number of times that the algorithm can pass through the mode declarations; by reducing its value, it is possible to cut a considerable number of literals, although causing some information loss. Alternatively, statistical methods such as Pearson’s correlation and Principal Component Analysis can be used (a survey of those methods can be found in May et al. 2011), taking advantage of the use of numerical feature vectors as training patterns. A recent method with low computational cost, which loses less information than most common methods, is the mRMR algorithm (Ding and Peng 2005). It balances minimum redundancy and maximum relevance of features, selecting them by using the mutual information I between variables x and y, defined as:

$$ I(x,y) = \sum_{i,j} p(x_{i},y_{j}) \log\dfrac {p(x_{i},y_{j})}{p(x_{i})p(y_{j})}, $$
(5)

where p(x,y) is the joint probability distribution, and p(x) and p(y) are the respective marginal probabilities. Given a subset S of the feature set Ω to be ranked by mRMR, the minimum redundancy condition and the maximum relevance condition, respectively, are:

$$\begin{aligned} &\min \{W_{I}\}, \quad W_{I} = \dfrac{1}{|S|^{2}} \sum _{i,j \in S} I(i,j)\quad \text{and} \end{aligned}$$
(6)
$$\begin{aligned} & \max \{V_{I}\}, \quad V_{I} = \dfrac{1}{|S|} \sum_{i \in S} I(h,i), \end{aligned}$$
(7)

where \(h=\{h_{1},h_{2},\ldots,h_{K}\}\) is the classification variable of a dataset with K possible classes. Let \(\Omega_{S}=\Omega \setminus S\) be the set of unselected features from Ω. There are two ways of combining the two conditions above to select features from \(\Omega_{S}\): Mutual Information Difference (MID), defined as \(\max(V_{I}-W_{I})\), and Mutual Information Quotient (MIQ), defined as \(\max(V_{I}/W_{I})\). Results reported in Ding and Peng (2005) indicate that MIQ usually chooses better features. Thus, MIQ is the function we choose to select features in this work and, for the sake of simplicity, whenever this work refers to mRMR, it is referring to mRMR with MIQ.
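
For concreteness, here is a rough sketch of greedy mRMR selection with the MIQ criterion for discrete (e.g. BCP-style binary) features. scikit-learn’s mutual_info_score is used as the estimator of I, and the incremental greedy scheme below is an assumption of ours for illustration; it is not necessarily the exact procedure of Ding and Peng (2005) or the one implemented in CILP++.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr_miq(X, y, n_select):
    """Return indices of `n_select` features, ranked greedily by V_I / W_I (MIQ)."""
    n_features = X.shape[1]
    # V_I: relevance of each feature w.r.t. the class variable, Eq. (7)
    relevance = [mutual_info_score(X[:, j], y) for j in range(n_features)]
    selected = [int(np.argmax(relevance))]          # start with the most relevant feature
    while len(selected) < n_select:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            # W_I: average redundancy with the features already selected, Eq. (6)
            redundancy = np.mean([mutual_info_score(X[:, j], X[:, s]) for s in selected])
            score = relevance[j] / (redundancy + 1e-12)   # MIQ: maximize V_I / W_I
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

X = np.array([[1, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 1]])   # toy BCP-style vectors
y = np.array([1, -1, 1, -1])
print(mrmr_miq(X, y, n_select=2))
```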

3 Learning with BCP using CILP++

Let us start with a motivating example: consider the well-known family relationship example (Muggleton and De Raedt 1994), with background knowledge B={mother(mom1,daughter1),wife(daughter1, husband1),wife(daughter2, husband2)}, with positive example motherInLaw(mom1, husband1), and negative example motherInLaw(daughter1, husband2). It can be noticed that the relation between mom1 and husband1, which the positive example establishes, can be alternatively described by the sequence of facts mother(mom1, daughter1) and wife(daughter1, husband1) in the background knowledge. This states semantically that mom1 is a mother-in-law because mom1 has a married daughter, namely, daughter1. Applied to this example, the bottom clause generation algorithm of Progol would create a clause ⊥=motherInLaw(A,B)←mother(A,C), wife(C,B). Comparing ⊥ with the sequence of facts above, we notice that ⊥ describes one possible meaning of mother-in-law: “A is a mother-in-law of B if A is a mother of C and C is wife of B”, i.e. the mother of a married daughter is a mother-in-law. This is why, in this paper, we investigate learning from bottom clauses. However, for each learned clause, Progol uses a single random positive example to generate a bottom clause, for limiting the search space. To learn from bottom clauses, BCP generates one bottom clause for each (positive or negative) example e, which we denote as ⊥ e .

In this section, we introduce the CILP++ system, which extends the C-IL2P system to learn from first-order logic using BCP. Each step of this relational learning task is explained in detail in what follows.

3.1 Bottom clause propositionalization

The first step of relational learning with CILP++ is to apply BCP. Each target literal is converted into a numerical vector that an ANN can use as input. In order to achieve this, each example is transformed into a bottom clause and mapped onto features on an attribute-value table, and numerical vectors are generated for each example. Thus, BCP has two steps: bottom clause generation and attribute-value mapping.

In the first step, each example is given to Progol’s bottom clause generation algorithm (Tamaddoni-Nezhad and Muggleton 2009) to create a corresponding bottom clause representation. To do so, a slight modification is needed: the same hash function is shared among all examples, in order to keep consistency between variable associations, and negative examples are allowed to have bottom clauses as well (the original algorithm deals with positive examples only). This modified version is shown in Algorithm 1, whose single parameter, depth, is the variable depth of the bottom clause generation algorithm.

Algorithm 1

Adapted Bottom Clause Generation
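
The published Algorithm 1 appears as a figure and is not reproduced here; the following is a much-simplified, runnable sketch of the saturation loop it describes, restricted to ground, function-free background facts and ignoring recall bounds and types. The mode-declaration encoding and the per-clause variabilization (constants mapped to variables in order of first appearance) are assumptions made for this illustration; the paper’s modification additionally shares the hash function across all examples to keep variable associations consistent. With depth=1, the two calls at the end reproduce, up to the label of the negative example, the set \(E_{\bot}\) shown next.

```python
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def _var(hash_table, constant):
    """Map a constant to a variable name (A, B, C, ...) via a hash table."""
    if constant not in hash_table:
        i = len(hash_table)
        hash_table[constant] = LETTERS[i] if i < len(LETTERS) else f"V{i}"
    return hash_table[constant]

def bottom_clause(example, background, modeb, depth=1):
    """Saturate `example` = (predicate, args) against ground background facts.

    `background` maps predicate names to lists of argument tuples; `modeb` is a
    list of (predicate, argmodes) with '+' (input) / '-' (output) markers.
    Returns (head, body) as literal strings with consistent variable names.
    """
    hash_table = {}
    pred, args = example
    head = f"{pred}({','.join(_var(hash_table, a) for a in args)})"
    known = set(args)                            # terms usable as inputs so far
    body = []
    for _ in range(depth):                       # variable-depth cycles over modeb
        for bpred, argmodes in modeb:
            for fact_args in background.get(bpred, []):
                # a fact is added only if all of its '+' arguments are known terms
                if all(a in known for a, m in zip(fact_args, argmodes) if m == '+'):
                    lit = f"{bpred}({','.join(_var(hash_table, a) for a in fact_args)})"
                    if lit not in body:
                        body.append(lit)
                    known |= {a for a, m in zip(fact_args, argmodes) if m == '-'}
    return head, body

# Family relationship example of Sect. 3; negative examples are saturated too.
B = {"mother": [("mom1", "daughter1")],
     "wife": [("daughter1", "husband1"), ("daughter2", "husband2")]}
modeb = [("mother", ("+", "-")), ("wife", ("+", "-"))]
print(bottom_clause(("motherInLaw", ("mom1", "husband1")), B, modeb, depth=1))
print(bottom_clause(("motherInLaw", ("daughter1", "husband2")), B, modeb, depth=1))
```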

For example, if Algorithm 1 is executed with depth=1 on the positive and negative examples of our motivating (family relationship) example above, motherInLaw(mom1, husband1) and motherInLaw(daughter1, husband2), respectively, it generates the following training set:

$$\begin{aligned} \begin{aligned} E_{\bot} &= \bigl\{\mathit{motherInLaw}(A,B) :- \mathit{mother}(A,C), \mathit{wife}(C,B); \\ &\quad \sim \mathit{motherInLaw}(A,B) :- \mathit{wife}(A,C)\bigr\}. \end{aligned} \end{aligned}$$

After the creation of the \(E_{\bot}\) set, the second step of BCP is as follows: each element of \(E_{\bot}\) (each bottom clause) is converted into an input vector \(v_{i}\), \(0 \leq i \leq n\), that a propositional learner can process. The algorithm for that, implemented by CILP++, is as follows:

1. Let |L| be the number of distinct body literals in \(E_{\bot}\);
2. Let \(E_{v}\) be the set of input vectors, converted from \(E_{\bot}\), initially empty;
3. For each bottom clause \(\bot_{e}\) of \(E_{\bot}\), do:
   (a) Create a numerical vector \(v_{i}\) of size |L| and with 0 in all positions;
   (b) For each position corresponding to a body literal of \(\bot_{e}\), change its value to 1;
   (c) Add \(v_{i}\) to \(E_{v}\);
   (d) Associate a label 1 to \(v_{i}\) if e is a positive example, and −1 otherwise;
4. Return \(E_{v}\).

As an example, for the same (family relationship) bottom clause set \(E_{\bot}\) above, |L| is equal to 3, since the literals are mother(A,C), wife(C,B) and wife(A,C). For the positive bottom clause, a vector \(v_{1}\) of size 3 is created, with its first position, corresponding to mother(A,C), and its second position, corresponding to wife(C,B), receiving value 1, resulting in the vector \(v_{1}=(1,1,0)\). For the negative example, only wife(A,C) is in \(E_{\bot}\) and its vector is \(v_{2}=(0,0,1)\).
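
A minimal sketch of this mapping, reproducing the two vectors of the family example (the tuple-based encoding of bottom clauses is ours, for illustration only):

```python
def bcp_vectors(bottom_clauses):
    """Map a list of (head, body_literals, label) bottom clauses to 0/1 vectors."""
    # step 1: collect the distinct body literals (the feature set, of size |L|)
    features = []
    for _, body, _ in bottom_clauses:
        for lit in body:
            if lit not in features:
                features.append(lit)
    # steps 2-4: one |L|-dimensional binary vector per bottom clause, labelled +1 / -1
    vectors = []
    for _, body, label in bottom_clauses:
        v = [1 if f in body else 0 for f in features]
        vectors.append((v, label))
    return features, vectors

E_bot = [("motherInLaw(A,B)", ["mother(A,C)", "wife(C,B)"], 1),
         ("motherInLaw(A,B)", ["wife(A,C)"], -1)]
features, vectors = bcp_vectors(E_bot)
print(features)   # ['mother(A,C)', 'wife(C,B)', 'wife(A,C)']
print(vectors)    # [([1, 1, 0], 1), ([0, 0, 1], -1)]
```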

3.2 CILP++ building phase

Having created numerical vectors from bottom clauses, CILP++ then creates an initial network for training. Background knowledge (BK) only passes through BCP’s first step (resulting in a bottom clause set \(E_{\bot}\)), i.e. its bottom clauses are generated, but they are not converted into input vectors. CILP++ then maps each body literal onto an input neuron and each head literal onto an output neuron. Following the C-IL2P building step, only bottom clauses generated from positive examples can be used as background knowledge. Let \(E_{\bot}^{+}\) denote the subset of \(E_{\bot}\) containing bottom clauses generated from positive examples only. Thus, any subset \(E^{BK}_{\bot} \subseteq E_{\bot}^{+}\) can be used as background knowledge (or none at all) for the purpose of evaluating CILP++.Footnote 3

The CILP++ algorithm for the building phase is presented below. Following C-IL2P, it uses positive weights W to encode positive literals, and negative weights −W to encode negative literals. The value of W for CILP++ is also constrained by Eq. (2), which guarantees the correctness of the translation, i.e. it can be shown that the network computes an intended meaning of the background knowledge (Garcez and Zaverucha 1999). As in C-IL2P, CILP++ builds so-called AND-OR networks, setting network biases w.r.t. W so that the hidden neurons implement a logical-AND, and the output neurons implement a logical-OR, as discussed in the Background section, as follows:

For each bottom clause \(\bot_{e}\) of \(E^{BK}_{\bot}\), do:

1. Add a neuron h to the hidden layer of a network N and label it \(\bot_{e}\);
2. Add input neurons to N with labels corresponding to each literal in the body of \(\bot_{e}\);
3. Connect the input neurons to h with weight W if the corresponding literals are positive, and −W otherwise;
4. Add an output neuron o to N and label it with the head literal of \(\bot_{e}\);
5. Connect h to o with weight W;
6. Set the biases in the following way: input neurons with bias 0, bias of h with Eq. (3), and bias of o with Eq. (4).

Continuing our example, suppose that the positive example of \(E_{\bot}\):

$$ \mathit{motherInLaw}(A,B) \,{{:}{-}}\, \mathit{mother}(A,C), \mathit{wife}(C,B) $$
(8)

is to be used as background knowledge to build an initial ANN. In step 1 of the CILP++ building algorithm, a hidden neuron is created having Eq. (8) as associated label. In step 2, two input neurons are created, representing the body literals mother(A,C) and wife(C,B). In step 3, two connections are created from each input neuron to the hidden neuron, both having weight W. In step 4, an output neuron representing the head literal motherInLaw(A,B) is created. In step 5, the hidden layer neuron is connected to the output neuron with weight W, and the network biases are set in step 6.Footnote 4
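
The sketch below builds this small network with numpy, reusing the assumed values \(A_{\min}=0.5\) and β = 1 from the earlier sketch. Eqs. (3) and (4) are treated here as thresholds subtracted from the input potential (equivalently, biases with the opposite sign), an assumption made so that the AND/OR behaviour described in Sect. 2.2 comes out directly; the matrix layout is ours, not CILP++’s internal representation.

```python
import numpy as np

A_min, beta = 0.5, 1.0
W = 4.40                     # any value satisfying Eq. (2) for max(k_n, mu_n) = 2

input_labels = ["mother(A,C)", "wife(C,B)"]                       # step 2
hidden_labels = ["motherInLaw(A,B) :- mother(A,C), wife(C,B)"]    # step 1
output_labels = ["motherInLaw(A,B)"]                              # step 4

W_ih = np.array([[W], [W]])          # step 3: both body literals are positive, so +W
W_ho = np.array([[W]])               # step 5: hidden -> output with weight W
theta_h = np.array([(1 + A_min) * (2 - 1) / 2 * W])   # step 6, Eq. (3) with k = 2
theta_o = np.array([(1 + A_min) * (1 - 1) / 2 * W])   # step 6, Eq. (4) with mu = 1

def bipolar(x):
    """Semi-linear bipolar activation with slope beta."""
    return 2.0 / (1.0 + np.exp(-beta * x)) - 1.0

def forward(x):
    h = bipolar(x @ W_ih - theta_h)      # hidden neuron: logical-AND of the body
    return bipolar(h @ W_ho - theta_o)   # output neuron: logical-OR of the clauses

# both body literals true -> head activated; one false (-1) -> head not activated
print(forward(np.array([1.0, 1.0])), forward(np.array([1.0, -1.0])))
```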

In order to evaluate network building, in the next section, we run experiments using different sizes of \(E^{BK}_{\bot}\), including a network configuration with no BK, i.e. where only the input and output layers are built and associated with bottom clause literals, but no specific initial number of hidden neurons is prescribed, as detailed in what follows.

3.3 CILP++ training phase

After BCP is applied and a network is built, CILP++ training is next. As an extension of C-IL2P, CILP++ uses backpropagation. Differently from C-IL2P, CILP++ also has a built-in cross-validation method and an early stopping option (Prechelt 1997). Validation is used to measure generalization error during each training epoch. With early stopping, when an error measure starts to increase, training is stopped. A more permissive version of early stopping, which we use, does not halt training immediately after the validation error increases, but when the criterion in Eq. (9) is satisfied, where α is the stopping criterion parameter, t is the current epoch number, Err va (t) is the average validation error on epoch t and Err opt (t) is the least validation error obtained from epochs 1 up to t. The reason we apply Eq. (9) is that, without feature selection, BCP can generate large networks; early stopping has been shown effective at avoiding overfitting in large networks (Caruana et al. 2000).

$$ GL(t) > \alpha, \quad GL(t) = 0.1 \cdot \biggl(\frac{\mathit{Err}_{\mathit{va}}(t)}{\mathit{Err}_{\mathit{opt}}(t)} - 1 \biggr) $$
(9)
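
A tiny sketch of this stopping test follows; only the rule of Eq. (9) is implemented, and the per-epoch validation errors are hypothetical values used to exercise it.

```python
def should_stop(val_errors, alpha=0.01):
    """Stop when GL(t) = 0.1 * (Err_va(t) / Err_opt(t) - 1) exceeds alpha."""
    err_va = val_errors[-1]                # Err_va(t): current validation error
    err_opt = min(val_errors)              # Err_opt(t): best validation error so far
    gl = 0.1 * (err_va / err_opt - 1.0)
    return gl > alpha

history = [0.40, 0.31, 0.27, 0.26, 0.29]   # hypothetical per-epoch validation errors
print(should_stop(history, alpha=0.01))    # True: 0.1*(0.29/0.26 - 1) is about 0.0115
```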

Given a bottom clause set \(E^{\mathit{train}}_{\bot}\), the steps below are followed for training network N:

1. For each bottom clause \(\bot_{e} \in E^{\mathit{train}}_{\bot}\), \(\bot_{e} = h \,{{:}{-}}\, l_{1},l_{2},\ldots,l_{n}\), do:
   (a) Add all \(l_{i}\), 1 ≤ i ≤ n, that are not yet represented in the input layer of N as new neurons;
   (b) If h does not exist yet in the network, create an output neuron corresponding to it;
2. Add new hidden neurons, if required for convergence;
3. Make the network fully-connected, by adding weights with zero values;
4. Normalize all weights and biases (as explained below);
5. Alter weights and biases slightly, to avoid the symmetry problem;Footnote 5
6. Apply backpropagation using each \(\bot_{e} \in E^{\mathit{train}}_{\bot}\) as training example.

The normalization process of step 4 above is done to solve a problem found while experimenting with C-IL2P: the initial weight values for the connections, depending on the background knowledge that is being mapped, could be excessively large, which makes the derivative of the semi-linear activation function tend to zero, thus not allowing proper training. We used a standard normalization procedure for ANNs, described in Haykin (2009): let w l be a weight in layer l and similarly, let b l be a bias. For each l, the normalized weights and biases (respectively, \(w^{\mathit{norm}}_{l}\) and \(b^{\mathit{norm}}_{l}\)) are defined as:

$$\begin{aligned} w^{\mathit{norm}}_{l} =& w_{l} \cdot\frac{1}{(|l-1|^{\frac{1}{2}}) \cdot \max_{w}} \quad \text{and} \\ b^{\mathit{norm}}_{l} =& b_{l} \cdot\frac{1}{(|l-1|^{\frac{1}{2}}) \cdot \max_{w}} , \end{aligned}$$

where |l| is the number of neurons in layer l and max w is the maximum absolute connection weight value among all weight connections in the network.
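
A short sketch of this normalization step is given below, assuming the incoming weights of each layer are stored as numpy matrices whose rows correspond to the neurons of the preceding layer (i.e. |l−1| is taken to be the fan-in); this storage convention is ours, not CILP++’s.

```python
import numpy as np

def normalize(weights, biases):
    """Scale each layer's weights and biases by 1 / (sqrt(|l-1|) * max_w)."""
    max_w = max(np.abs(W).max() for W in weights)    # largest |weight| in the network
    out_w, out_b = [], []
    for W, b in zip(weights, biases):
        scale = 1.0 / (np.sqrt(W.shape[0]) * max_w)  # W.shape[0] = |l-1| (fan-in)
        out_w.append(W * scale)
        out_b.append(b * scale)
    return out_w, out_b

# e.g. the two layers of the small network built above
weights = [np.array([[4.4], [4.4]]), np.array([[4.4]])]
biases = [np.array([3.3]), np.array([0.0])]
print(normalize(weights, biases))
```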

To illustrate the training phase, assume that the bottom clause set \(E_{\bot}\) is our training data and no background knowledge has been used. In step 1(a), all body literals from both examples (mother(A,C), wife(C,B) and wife(A,C)) cause the generation of three new input neurons in the network, with labels identical to the corresponding literals. In step 1(b), an output neuron labeled motherInLaw(A,B) is added. In step 2, let us assume that two hidden neurons are added. In step 3, zero-weighted connections are added from all three input neurons to both hidden neurons, and from those to the output neuron. Step 4 is only needed when background knowledge is used. In step 5, we add a random non-zero value in [−0.01,0.01] to each weight. Finally, in step 6, backpropagation is applied (see Fig. 2), firing the input neurons mother(A,C) and wife(C,B) when the positive example is being learned (example 1 in the figure), with target output 1, and firing the input neuron wife(A,C) when the negative example is being learned (example 2 in the figure), with target output −1.

Fig. 2

Illustration of CILP++’s training step. L1, L2 and L3 are labels corresponding to each distinct body literal found in \(E_{\bot}\). The output values shown are the labels used for backpropagation training. In the figure, network N appears repeated for each example, for clarity
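
As a rough stand-in for this training setup, the sketch below trains a fully connected network with two hidden tanh units (matching the two hidden neurons assumed in the example above) on the two BCP vectors of the family example, using scikit-learn’s MLPClassifier rather than CILP++’s own backpropagation code. There is no background-knowledge initialization or weight normalization here, so this only illustrates the propositional learning and inference steps.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[1, 1, 0],     # motherInLaw(A,B) :- mother(A,C), wife(C,B)
              [0, 0, 1]])    # ~motherInLaw(A,B) :- wife(A,C)
y = np.array([1, -1])

net = MLPClassifier(hidden_layer_sizes=(2,), activation='tanh', solver='sgd',
                    learning_rate_init=0.1, momentum=0.1, max_iter=2000,
                    random_state=0)
net.fit(X, y)

# inference on a propositionalized test example: input neurons of the body
# literals present in its bottom clause are set to 1, all others to 0
print(net.predict(np.array([[1, 1, 0]])))   # expected: class 1 (the positive pattern)
```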

Additionally, notice that BCP does not combine first-order literals to generate features like RSD or SINUS: it treats each literal of a bottom clause as a feature. The hidden layer of the ANN can be seen as a (weighted) combination of the features provided by the input layer. Thus, ANNs can combine features when processing the data, which allows CILP++ to group features similarly to RSD or SINUS, but doing so dynamically (during learning), due to the small changes of real-valued weights in the network.

After training, CILP++ can be evaluated. Firstly, each test example \(e^{\mathit{test}}\) from a test set \(E^{\mathit{test}}\) is propositionalized with BCP, resulting in a propositional data set \(E^{\mathit{test}}_{\bot}\), where each \(e^{\mathit{test}} \in E^{\mathit{test}}\) has a corresponding \(\bot_{e}^{\mathit{test}} \in E^{\mathit{test}}_{\bot}\). Then, each \(\bot_{e}^{\mathit{test}}\) is tested in CILP++’s ANN: each input neuron corresponding to a body literal of \(\bot_{e}^{\mathit{test}}\) receives input 1 and all other input neurons (input neurons whose labels are not present in \(\bot_{e}^{\mathit{test}}\)) receive input 0. Lastly, a feedforward pass through the network is performed, and the output is CILP++’s answer to \(\bot_{e}^{\mathit{test}}\) and, consequently, to \(e^{\mathit{test}}\).

4 Experimental results

In this section, we present the experimental methodology and results for CILP++ as a first-order neural-symbolic system and for BCP as a standalone propositionalization method. We also compare results with ILP system Aleph and propositionalization method RSD. Before we experiment on ILP problems, though, we have tested CILP++ against its predecessor, C-IL2P, to evaluate whether CILP++ is as good as C-IL2P on propositional problems. We used the Gene Sequences/Promoter Recognition dataset used in Garcez and Zaverucha (1999) with leave-one-out cross validation. CILP++ obtained 92.41 % accuracy, against 92.48 % obtained by C-IL2P; CILP++ took 5:21 minutes to run the entire experiment, while C-IL2P took 5:23 minutes. This suggests that CILP++ and C-IL2P perform similarly on propositional problems.

As mentioned, we have compared results with Aleph and RSD. Aleph is an ILP system, which has several other algorithms built-in, such as Progol (by default). RSD is a well-known propositionalization method capable of obtaining results comparable to full ILP systems. We have used four benchmarks: the Mutagenesis dataset (Srinivasan and Muggleton 1994), the KRK dataset (Bain and Muggleton 1994), the UW-CSE dataset (Richardson and Domingos 2006), and the Alzheimer’s benchmark (King and Srinivasan 1995), which consists of four datasets: Amine, Acetyl, Memory and Toxic. Table 1 reports some general characteristics and the number of BCP features obtained for each dataset.

Table 1 Datasets statistics

We have run CILP++ on the above datasets (all folds are available from http://soi.city.ac.uk/~abdz937/bcexperiments.zip, including our version of the UW-CSE dataset, as explained below), reporting results on six CILP++ configurations. We report: accuracy vs. runtime on all datasets in comparison with Aleph, and a comparison between BCP and RSD on the Mutagenesis and KRK datasets.Footnote 6 We also evaluate feature selection in CILP++ by constraining the clause length when building bottom clauses with BCP and applying mRMR.

Since a varied number of accuracy results have been reported in the literature on the use of Aleph with the Alzheimer’s and Mutagenesis datasets (King and Srinivasan 1995; Landwehr et al. 2007; Paes et al. 2007), we have decided to run both Aleph and CILP++ for our comparisons. We built 10 folds from each dataset (in the case of UW-CSE, we followed Davis et al. 2005 and used 5 folds) and both systems used the exact same training folds. BCP and RSD could not, however, share the exact same training folds, as a result of the way in which the RSD tool was implemented (the RSD tool generates features before the folds are created, while CILP++ creates the folds in the first place). As mentioned earlier, the CILP++ system, the different configurations/parameterizations, and all the data folds are available for download so that the results reported in this paper should be reproducible. The six CILP++ configurations include:

  • st: uses standard backpropagation stopping criteria;

  • es: uses early stopping;

  • n%bk: the network is created using n % of the examples in \(E^{\mathit{train}}_{\bot}\) as BK;Footnote 7

  • 2h: uses no building step and starts with 2 hidden neurons only.

The choice of the 2h configuration is explained in detail in Haykin (2009); ANNs having two neurons in the hidden layer can generalize binary problems approximately as well as any network. Furthermore, if a network has many features to evaluate, as in the case of BCP, i.e. the input layer has many neurons, it already has sufficient degrees of freedom; further increasing them by adding hidden neurons might increase the chances of overfitting. Since bottom clauses are “rough” representations of examples and we would like to model the general characteristics of the examples, a simpler model such as 2h should be preferred (Caruana et al. 2000).

For all experiments with Aleph, the same configurations as in Landwehr et al. (2007) were used for the Alzheimer’s datasets (any parameter not specified below used Aleph’s default values): variable depth=3, least positive coverage of a valid clause=2, least accuracy of an acceptable clause=0.7, minimum score of a valid clause=0.6, maximum number of literals in a valid clause=5 and maximum number of negative examples covered by a valid clause (noise)=300. Regarding Mutagenesis, the parameters were based on Paes et al. (2007) (again, if a parameter is not listed below, the Aleph default value has been used): least positive coverage of a valid clause=4. For KRK, the configuration provided by Aleph in its documentation was used. For UW-CSE, the same configuration as in Davis et al. (2005) was used: variable depth = 3, least positive coverage of a valid clause = 10, least accuracy of an acceptable clause = 0.1, maximum number of literals in a valid clause = 10, maximum number of negative examples covered by a valid clause (noise) = 1000 and evaluation function = m-estimate.

With regards to the UW-CSE dataset, we have used an ILP version of the dataset following Davis et al. (2005). The original UW-CSE dataset contains positive examples only, for use with Markov Logic Networks (Richardson and Domingos 2006). Davis et al. generated negative examples for this dataset using the Closed World Assumption (Davis et al. 2005). This produced an unbalanced dataset containing 113 positive examples and 1772 negative examples. Thus, we have re-balanced the dataset by performing random undersampling, until we had obtained twice as many negative examples as positive examples. Our goal was to cover the distribution of negative examples as well as possible, while avoiding too much imbalance, and to provide a fair comparison with Aleph. Alternative undersamplings and oversamplings have also been investigated, with results reported below.

As for the CILP++ parameters, we used the same variable depth values as Aleph for BCP (except for UW-CSE, where we used variable depth = 1, as discussed below) and the following parameters for backpropagation:Footnote 8 on st configurations, learning rate = 0.1, decay factor = 0.995 and momentum = 0.1; and on es configurations: learning rate = 0.05, decay factor = 0.999, momentum = 0 and alpha (early stopping criterion) = 0.01.

Finally, extra hidden neurons were not added to the network configurations above, i.e. step 2 of CILP++’s training algorithm, Sect. 3.3, was not applied. The networks labeled as 2h have only 2 hidden neurons, and those labeled n%bk have as many hidden neurons as the size of the BK, i.e. n % the size of the set \(E^{\textit{train}}_{\bot}\).

4.1 Accuracy results

In this experiment, CILP++ is evaluated on accuracy vs. runtime against Aleph. Two tables are presented with accuracy averages, standard deviations and complete runtimes over 10-fold cross-validation for Mutagenesis, four Alzheimer’s datasets and KRK, and 5-fold cross-validation for UW-CSE, on the st (Table 2) and es (Table 3) CILP++ configurations. By “complete runtime” we mean the total building, training and testing times for each system. In both tables, accuracy results in bold are the highest ones and the difference between them and the ones marked with an asterisk (*) is statistically significant by a two-tailed, paired t-test. All experiments were run on a 3.2 GHz Intel Core i3-2100 with 4 GB RAM.

Table 2 Test set accuracy (standard deviation) results and runtimes (in % for accuracy and in hh:mm:ss format for runtimes) for st configurations. It can be seen that Aleph and CILP++ present comparable accuracy results for the standard CILP++ configurations, with the st,2.5%bk model winning on three datasets. CILP++ performs faster in most cases, confirming our expectation that relational learning through propositionalization should trade accuracy for efficiency, in comparison with full first-order ILP learners
Table 3 Test set accuracy (standard deviation) results and runtimes for the CILP++ configurations with early stopping (in % for accuracy and in hh:mm:ss format for runtimes). Using es models, CILP++ was much faster than Aleph but with a considerable decrease in accuracy. Aleph won in accuracy in all but the Mutagenesis dataset. This indicates that early stopping is not recommended in general for use with BCP, unless speed is paramount

Notice how CILP++ can achieve runtimes that are considerably faster than Aleph’s. We believe the speed-ups are caused by the following main factors: ILP covering-based search algorithms have well-known efficiency bottlenecks (Paes et al. 2008; DiMaio and Shavlik 2004), while bottom clause generation is fast and standard backpropagation learning is efficient (Rumelhart et al. 1986). Further, propositionalized examples are generally easier to handle computationally than first-order examples (Krogel et al. 2003). Tables 2 and 3, for the st and es configurations respectively, seem to confirm the expected trade-off between speed and accuracy when comparing propositionalization with methods that deal directly with first-order logic. We can also see that the st configurations seem to emphasize accuracy, while the es configurations emphasize speed.

Regarding the CILP++ results on UW-CSE, as mentioned earlier, we have used variable depth = 1 for BCP. The reason is that UW-CSE examples when propositionalized by BCP for variable depths higher than 1 become considerably large. At the same time, variable depth 1 causes serious information loss in the propositionalization procedure. To ameliorate this, we have tried an oversampling method called SMOTE, based on kNN (Chawla et al. 2002). SMOTE suggests a combination with random undersampling for better results, whereby we increased the positive examples five times (from 113 to 565 examples) using SMOTE, and undersampled the class of negative examples until we had the same number (565) of negative examples. The problem with this approach, in what concerns a comparison with Aleph, is that, to the best of our knowledge, no oversampling method exists for Aleph; SMOTE is applicable to numerical or propositional data only, thus we could not compare those results with Aleph. Hence, we do not report those results in the accuracy tables above. Nevertheless, with SMOTE, CILP++ obtained 93.34 %, 90.11 % and 93.58 % accuracy for the es,2h, es,2.5%bk and es,5%bk configurations, respectively. None of the networks took longer than 6 minutes to run (train and test) on all 5 UW-CSE folds, including the SMOTE and undersampling pre-processing. In st configurations, CILP++ obtained 73.44 % for st,2.5%bk, 77.35 % for st,5%bk and 74.2 % for st,2h, with no configuration taking longer than 9 minutes to run completely. Those results indicate that an adequate ANN data pre-processing, enabled by the BCP method, can improve results considerably.

So far, we have explored a number of CILP++ configurations. The use of other configurations and their combination through tuning sets is possible. However, the ILP literature on Aleph generally reports a single optimal configuration per dataset (and not per fold) (Paes et al. 2007; Landwehr et al. 2007). We believe, therefore, that applying tuning sets to CILP++ would give an unfair advantage to the network model in the comparison with Aleph. Nevertheless, an optimal CILP++ configuration would use tuning sets, and we report those results in Table 4 below. A three-fold internal cross validation was applied on the training set of each of the 10 folds used in Tables 2 and 3. The test accuracy of the model chosen with the tuning set was then recorded for that fold. Thus, the dataset accuracy of CILP++ using tuning sets is the average, over the folds, of the test set accuracy obtained with the model that achieved the best tuning set accuracy. We also report the runtimes obtained with this approach and the “best” model for each dataset, which is the one chosen most often across the folds. The best model results shown in the table were used to guide our choice of model in the experiments on feature selection and BCP that follow.

Table 4 Results using tuning sets for CILP++. We report three results in this table, from left to right: CILP++ test set accuracy using tuning sets averaged over the six CILP++ configurations, CILP++ runtime using tuning sets, and best model, i.e. the configuration with most wins on the 10 train/test folds (5 train/test folds, in the case of UW-CSE). Overall, the best st model is the st,2.5%bk configuration, the best es model is the es,2h model, and the best model overall is the st,2.5%bk model

In comparison with the results reported in Tables 2 and 3, the results using tuning sets were slightly lower than the results of the best individual models, but better than most of them. Additionally, we applied tuning sets to the version of UW-CSE to which we applied SMOTE and undersampling, and we obtained 81.12 % test set accuracy, with CILP++ taking less than 8 minutes to finish, which is considerably better than the results obtained with UW-CSE without SMOTE. In the following experiments (feature selection analysis and BCP results), we choose the best models obtained from tuning sets for further analysis.

4.2 Comparative results with propositionalization

In this section, a comparison against RSD is carried out, using the Mutagenesis (named muta in the table below) and KRK datasets (the reason for this choice of datasets is explained in the previous section). In Table 5, accuracy and runtimes are shown. We compare BCP and RSD propositionalization when generating training patterns for CILP++ (labeled ANN in the table) and for the C4.5 decision tree learner. Aleph results are shown as well, as a baseline. We use the CILP++ configuration that obtained the best results in the tuning sets for each dataset. Values in bold are the highest obtained, and the difference between those and the ones marked with (*) is statistically significant by a two-tailed, unpaired t-test (we use an unpaired t-test because of the RSD tool implementation issue mentioned earlier). All experiments were also run on a 3.2 GHz Intel Core i3-2100 with 4 GB RAM.

Table 5 Accuracy and runtime results for the Mutagenesis and KRK datasets (in % for accuracy and in hh:mm:ss format for runtimes). The results show that BCP is faster than RSD in all cases, while showing highly competitive results w.r.t. Aleph. With ANNs, BCP outperformed RSD; with C4.5, RSD performed as well as BCP (on the KRK dataset, RSD with C4.5 showed higher accuracy, although the difference was not statistically significant). The results also show that BCP performs well with both learners (ANN and C4.5), but excels with ANNs. On the other hand, RSD did not perform well with ANNs

In summary, our hypothesis was that BCP, as a standalone propositionalization method, can be fast and is capable of generating accurate features for learning. The results indicate that BCP is a good match for ANN, indicating the promise of the CILP++ system. BCP performs on a par with RSD when integrated with C4.5. BCP is also faster than RSD in all cases, empirically confirming our hypothesis.

4.3 Results with feature selection

In Sect. 2.4, it was discussed that, due to the extensive size of bottom clauses, feature selection techniques may obtain improved results when applied after BCP. Two ways of performing feature selection were discussed: changing the variable depth (see Algorithm 1) and using a statistical method, mRMR. We have chosen two datasets on which to run these experiments with feature selection: Alz-amine and Alz-toxic. We opted for those because CILP++ performed well on them: not outstandingly well (as in Mutagenesis), nor poorly (as in Alz-acetyl). Additionally, we have chosen the best st configuration (st,2.5%bk, chosen by tuning sets) and the best es configuration (es,2h). Even though the results using tuning sets showed st,2h as the best model for the Alz-toxic dataset, we wanted to analyze feature selection on es configurations as well, and so we have chosen the best es configuration.

First, we changed the variable depth in Alz-amine and Alz-toxic, which was 3, to 2 and 5, to analyze how changes in this parameter would affect performance. The results are shown in Fig. 3. Alternatively, we applied mRMR with three levels of selection: 50 %, 25 % and 10 % of the best-ranked features. These results are shown in Fig. 4.

Fig. 3

Accuracy (above) with varying variable depth on Alz-amine (left) and Alz-toxic (right), with runtimes (below) in hh:mm:ss format. The results indicate that the default variable depth is satisfactory: neither increasing it nor decreasing it helped improve performance. As stated in Sect. 2.1, the variable depth controls how far the bottom clause generation algorithm goes when chaining literals, and it is a way of controlling how much information the propositionalization method loses. Intuitively, a higher variable depth should therefore mean better performance, but together with useful features it seems to bring in redundancy as well

Fig. 4

Accuracy (above) when using mRMR on Alz-amine (left) and Alz-toxic (right), with runtimes (below) in hh:mm:ss format. The results show that in both Alz-amine and Alz-toxic datasets, a reduction of 90 % in the number of features caused a loss of less than 2 % in accuracy, albeit with an increase in runtime. The reduction in features caused CILP++ to take more training epochs to converge and mRMR itself also contributed to the increase in runtime. However, at 90 % filtered features, the runtimes approached in general the ones obtained without mRMR filtering. Even with an increase in runtime, feature selection with mRMR seems useful to reduce the size of the network and improve readability, especially if knowledge extraction is to be carried out

In summary, statistical feature selection seems to be useful with BCP. Changes in variable depth did not seem to offer gains, but mRMR offered more than 90 % feature reduction with a loss of less than 2 % in accuracy. The goal of selecting features with mRMR should not be to improve efficiency, although in one case (Alz-amine, es,2h), CILP++ with mRMR at 90 % feature reduction was faster than CILP++ without feature selection, despite a loss of more than 10 % in accuracy.

5 Conclusion and future work

This paper has introduced a fast method and algorithm for ILP learning with ANNs, by extending a neural-symbolic system called C-IL2P. The paper’s two contributions are: a novel propositionalization method, BCP, and the CILP++ system, an open-source, freely distributed neural-symbolic system for relational learning. CILP++ obtained accuracy comparable to Aleph on most standard configurations; on early stopping configurations it fell behind Aleph in accuracy, but was faster. In comparison with RSD, CILP++ has been shown superior, although BCP and RSD present similar results when using C4.5 as learner. Nevertheless, BCP obtained better runtime results overall. Lastly, when using feature selection, results have shown that mRMR is applicable with CILP++ and that it can drastically reduce the number of features with a small loss of accuracy, despite an increase in runtime in some cases. Feature selection with mRMR can be useful to reduce the size of the network and improve readability, especially if knowledge extraction is needed. Propositionalization methods usually show a trade-off between accuracy and efficiency. Our results show that CILP++ can improve on this trade-off by offering considerable speed-up in exchange for a small accuracy loss in some datasets, even achieving better accuracy in some cases.

ILP covering-based hypothesis induction is an efficiency bottleneck in traditional ILP learners such as Aleph (Paes et al. 2007, 2008; DiMaio and Shavlik 2004). On the other hand, bottom clause generation by itself is fast. Thus, we claim that propositionalizing first-order examples with BCP and using an efficient learning algorithm such as backpropagation should offer a faster and reasonably accurate way of dealing with first-order data. Our empirical results seem to confirm this claim.

As future work, there are a number of avenues for research. First, background knowledge translation into ANNs can be explored further in CILP++. A first attempt could be to use the language bias and the definite clauses from background knowledge to build the network. The study of how the last step of C-IL2P’s learning cycle, knowledge extraction, can be done in CILP++ is another area for future work. One option (Craven and Shavlik 1995) would be to create one clause for each class c and add to it antecedents corresponding to the body literals of each bottom clause that belongs to c. Alternatively, the same knowledge extraction procedure which C-IL2P uses can be applied to CILP++, although it is considerably costly and further analysis of the fidelity of the extracted theory is required. This work has shown that learning first-order data with CILP++ is fast, but without considering knowledge extraction. If extraction is to be taken into consideration, faster learning algorithms for ANNs (Jacobs 1988; Møller 1993) can be used to try and keep up with Aleph in terms of runtime. Lastly, regarding other analyses that can be done with CILP++, the work of DiMaio and Shavlik (2004), which uses bottom clauses as training patterns to build a hypothesis scoring function, used several meta-parameters such as the size of the bottom clause and the number of distinct predicates. The same meta-parameters can be useful to CILP++. Furthermore, experiments on datasets with continuous data could be done: it would be interesting to see how CILP++ behaves on this kind of data and to analyze whether this approach inherits the additive noise robustness of traditional backpropagation ANNs (Copelli et al. 1997). Also, due to the results for feature selection with mRMR, it is worth evaluating how our approach deals with very large relational datasets, e.g. CORA or Proteins, which are considered to be challenging for ILP learners (Perlich and Merugu 2005).