Inductive Logic Programming for Bioinformatics in Prova
Adrian Paschke
Michael Schröder
RuleML Inc., Canada
Biotec/Dept. of Computing, TU Dresden,
Germany
adrian.paschke AT gmx.de
ms AT biotec.tu-dresden.de
ABSTRACT
This paper describes the inductive logic programming (ILP)
features of Prova, a state-of-art distributed Semantic Web
and Life Science inference service system and architecture
for multi-relational data mining of complex Life Science phenomena such as complex biological relationships. The proposed novel design artifact implements typical ILP inference
formalisms for rule-based generalization and specialization
and combines them with expressive logic-based formalisms
such as scoped meta-data based reasoning and typed logic in
order to constrain the search space and the level of generality of relevant background knowledge. The tight integration
of declarative rule-based programming with object-oriented
programming (Java) allows outsourcing of computation intensive functionalities such as aggregations and data selections to highly optimized procedural code and query languages such as SQL, XQuery, OWL2Prova RDF, SPARQL.
Parallel processing of ILP tasks is supported by a distributed
service-oriented and event-driven middleware where several
Prova rule engine instances are deployed on the Web as distributed inference services having access to modular data
sources and distributed web-based resources. As a result
our approach preserves the high expressiveness and flexibility of ILP for multi-relational data mining and attempts to
overcome well-known computational and logical problems
of ILP when facing very large and scattered heterogenous
amounts of data with complex relationships published on
the (Semantic) Web.
1. INTRODUCTION
Typical propositional data mining approaches use a simplified assumption that all data is stored in a single relation and that each object of interest is represented by one
row. However, mining biological data, such as in molecular biology, requires expressive, efficient and scalable multirelational data mining algorithms to find highly complex
structural elements in multiple and possibly distributed data
relations. In data mining there exists two main approaches
Permission to copy without fee all or part of this material is granted provided
that the copies are not made or distributed for direct commercial advantage,
the VLDB copyright notice and the title of the publication and its date appear,
and notice is given that copying is by permission of the Very Large Data
Base Endowment. To copy otherwise, or to republish, to post on servers
or to redistribute to lists, requires a fee and/or special permission from the
publisher, ACM.
VLDB ‘07, September 23-28, 2007, Vienna, Austria.
Copyright 2007 VLDB Endowment, ACM 978-1-59593-649-3/07/09.
for handling relational data: Inductive Logic Programming
(ILP) and Propositionalization. [13]
Propositionalization converts the relational complex data
into a flat propositional representation and generates one
single relation out of multiple relations such that typical
propositional learners can be applied. This can be achieved
by using e.g. typical aggregation functions as provided by
relational database languages such as SQL or by generating
features (attributes) by applying logic-oriented propositionalization.
In contrast, ILP systems directly operate on multiple relations where the relational patterns are represented as subsets
of first-order logic as logic programs (LPs) consisting of rules
and facts. They search for regularities by inductively generalizing the specialized individual instances to more general
rules which describe new relations.
Both approaches have pros and cons. Database oriented
propositionalization approaches allow using highly-optimized
queries and aggregations to reduce the number of relations
and apply efficient propositional learning techniques. However, beside the computational costs of joins, they typically produce one huge propositional relation with a large
number of possible redundant features which might negatively impact the performance of learning algorithms. ILP
systems directly operate on multi-relational models, provide expressive declarative representation languages (logic
programming languages) and can handle additional (userdefined) background knowledge to substantially improve the
results of learning in terms of accuracy and efficiency. On
the other hand, large background knowledge bases (KBs)
with many irrelevant information for the problem might
have the opposite effect since the induction algorithm has
to search over all the relations and rules and generalized
model construction might take very long or even be infinite
(depending on the logic class).
In this paper we introduce Prova, a distributed web-based
rule inference system, which combines expressive declarative logic programming techniques with procedural objectoriented programming and distributed web technologies. In
particular, we describe the ILP meta program implementations of Prova which beside the inductive logical inference
algorithms allows utilizing expressive logical formalisms for,
e.g., building constructive scopes on modular (distributed)
KBs, object-oriented (OO) description languages with external OO and Semantic-Web type systems (e.g. meta data
vocabularies, ontologies), integration of multiple external
tools and data from e.g. relational databases, and parallel computation by distributing inference tasks to multiple
web-based Prova inference services deployed on a stable and
highly scalable service-oriented communication middleware.
This novel integrated approach preserves the expressiveness
benefits of ILP and adopts the aggregation and constructive
view approach of relational database systems to logic programming. Moreover, it addresses the heterogeneousness of
complex data and data types in the Life Science domain by
integrating Semantic Web domain ontologies and meta data
and considers computational complexity due to large and
and increasing amounts of data via distribution of computational tasks to multiple Prova inference services (akin to
service grids) for parallel computation.
The further paper is structured as follows: Section 2 describes the relevant background in ILP. Section 3 implements
the ILP formalisms of Prova and elaborates on several expressive formalisms in Prova which can be used to access
and query external data sources using existing highly optimized query languages and construct modular scopes on
the possible distributed knowledge base in order to constrain
the search space on relevant background knowledge. Section
4 extends Prova with a highly scalable and efficient serviceoriented middleware for deploying several Prova rule engines
as distributed inference services on the Web. The middleware features complex event processing and conversationbased messaging for seamless integration of external tools
and resources and for distributing ILP tasks in the Prova
service grids. Finally, section 5 summarizes the novel design artifact for distributed rule-based ILP proposed in this
paper.
2. INDUCTIVE LOGIC PROGRAMMING
In the following, we assume that the reader is familiar
with logic programming techniques [3]. We use the standard
LP notation with an ISO Prolog related scripting syntax, i.e.
variables start with upper-case letters, constants/individuals
with lower-case letters.
ILP is a research area at the intersection of machine learning and logic programming. [13] It allows inductively deriving general information from specific knowledge. Traditionally, ILP has been concerned with finding patterns expressed
as logic programs. In recent years, however, the scope of ILP
has broadened to cover the whole spectrum of data mining tasks (classification, regression, clustering, association
analysis). There are two main directions in ILP: learning
from entailment and learning from interpretations. Learning from entailment is also called explanatory ILP. Most of
ILP systems are learning from entailments (e.g. RDT, Progol, FOIL, SRT, Fors). Learning from interpretations is also
called descriptive ILP. Examples of the ILP systems which
are based on this setting are Claudien, ICL, and Tilde. The
differences between the two ILP approaches are in the way
they represent the examples, the background knowledge and
the way the final hypothesis is induced. The entailment paradigm represents all the data examples together with the
background knowledge as one LP. Background knowledge is
a prior knowledge, provided by the user to be used in the
construction of rules. In ILP background knowledge is expressed in the form of clauses (facts or rules) and is used
in the construction of relations. ILP generalizes from individual instances or observations in the presence of background knowledge, finding regularities or hypotheses about
yet unseen instances. It learns from examples, usually positive ground clauses as positive examples (+ negative exam-
ples) with additionally taking background knowledge into
account. To test the coverage of the learned hypothesis, a
function covers(H, E) returns the value true if E (the examples) is covered by H (the hypothesis), and false otherwise. ILP systems can be differentiated into systems which
only learn one hypothesis or several, systems which know
all examples from the beginning (batch learner, e.g. empirical ILPs such as FOIL, MARKUS, GOLEM, LINUS) or
incrementally learn them (incremental learner, e.g. MIS,
MARVIN, CLINT, CIGOL), and systems which might ask
an additional oracle (interactive) or not (non-interactive).
To enable a direct and efficient search the search space
for hypothesis needs to be structured in a certain way. θsubsumption ordering introduces a syntactic notion of generality: A rule (clause) r (resp. a term t) θ-subsumes another
rule r′ , if there exists a substitution θ, such that r ⊆ r′ ,
i.e. a rule r is as least as general as the rule r′ (r ≤ r′ ), if r
θ-subsumes r′ resp. is more general than r′ (r < r′ ) if r ≤ r′
and r′ r. (see e.g. [12]). Specialization techniques search
the hypothesis space in a top-down manner, from general
to specific hypotheses, using a θ-subsumption-based specialization operator, called refinement operator. Generalization
techniques search the hypothesis space in a bottom-up manner. Bottom-up learners start from the most specific clause
that covers a given example and then generalize the clause
until it cannot further be generalized without covering examples. Two basic generalization techniques are: relative
least general generalization (rlgg) and inverse resolution. A
lgg is the generalization that keeps an generalized term t
(or clause) as special as possible so that every other generalization would increase the number of possible instances of
t in comparison to the possible instances of the lgg. The extension of lgg builds the relative least general generalization
(rlgg), which takes into consideration available background
knowledge. Inverse resolution faces the following ”inverse”
problem: given a clause R and a parent clause C1 , find a second parent clause C2 such that R is an instance of a resolvent
of C1 and c2 . θ-subsumption and rlgg has some nice computational properties and it works for simple terms as well as
for complex terms, e.g. p() : −q(f (a)) is a specialization of
p : −q(X). θ-subsumption and lgg are purely syntactic notions. Their computation is therefore simple, as compared
to inverse resolution or inverse implication, which are both
computationally intractable. Thus, theta-subsumption and
rlgg qualify to be the right framework of generality in the
application of ILP in the domain of bioinformatics data mining.
3. INDUCTIVE LOGIC PROGRAMMING IN
PROVA
Among other application domains the Prova project [1]
is addressing Semantic Web Life Science applications [2].
It follows the spirit and design of the recent W3C Semantic Web initiative and combines declarative rules, ontologies
and inference with dynamic object-oriented Java API calls
and access to external data sources such as relational databases or enterprise applications and IT services. One of the
key advantages of Prova is its elegant separation of logic,
data access, and computation and its tight integration of
Java and Semantic Web technologies. In the following we
first describe the ILP support of Prova and then elaborate
on several expressive extensions of Prova in this context.
3.1
ILP Meta Program
The ContractLog KR [8, 5, 7] of Prova implements a meta
inference engine which allows
• computing the substitution sets of terms and clauses,
• apply the substitutions to compute specializations of
clauses (instantiations of rules),
• generalize clauses/terms and compute the (r)lgg
• compute the coverage
The ILP inference engine is implemented as a meta program (as a Prova LP script). Meta-programming and metainterpreters have their roots in the original von Neumann
computer architecture where program and data treated in
a uniform way and are a popular technique in logic programming for representing knowledge, in particular, knowledge in the domains containing logic programs as objects
of discourse. LPs representing such knowledge are called
meta-programs (a.k.a. meta interpreters) and their design is
referred to as meta-programming. The core inference functions implemented in the meta program are:
Specialization
• substitution(T erm1, T erm2, Subst) - Compute and return the
substitution S of two terms t1 and t2.
• substitute(Clause, ClauseInstance, Subst)
substitute(T erm, T ermInstance, Subst) - Apply the substitutions to a clause/term and compute the specialized instance
• specializations(Goal, Clause, Instances) - Unify (i.e. specialize) a clause (rule) with a goal (set of subgoals)and compute
the specialized top level instances (specializations)
• specialize(Goal, InputLP, OutputLP ) - Specialize an input LP
(set of clauses) with a goal and return the specialization of the
LP (i.e. set of top level clause instances).
Generalization
• lgg(Clause1, Clause2, LGG) - compute (r)lgg of two clauses
• lgg(T erm1, T erm2, LGG) - compute (r)lgg of two terms
• lggs(Clause, LP, LGGs) - compute all (r)lggs of a clause and
a LP (set of clauses)
• generalize(InputLP, OutputLP ) - Generalize an input LP (set
of clauses) and returns the generalized and minimalized (compacted) output LP using relative least general generalization
with the given background knowledge in the input LP.
Cover / Coverage
• cover(LP 1, LP 2, CoveredClause) - Return the covered clause
from both LPs, i.e. the clauses which are variants
• coverage(Goal, LP, CoveredClauses,
N otCoveredClauses, CoverageLevel) - Computes the test coverage for a given hypothesis and a given LP.
The specialization and generalization functions can be
used to define meta reasoning rules for reasoning on top
of the LP/knowledge base and the contained rules, where
a logic program is viewed as a single logical formula. For
example, recursively computing the substitution sets, the
substitutions and continuing this process with the body literals (sub goals) of the computed substitution leads to a
standard top-down derivation. In order to enable processing of clauses and their terms in the ILP meta inference
engine, queries, rules and facts are internally represented
in a list format, e.g. a term p(X) can be equivalently represented as [”p”, X]. That is, rules a represented as a list
starting with the head literal and then the body literals,
e.g. p(X) : −q(X) is written as [[p(X)], [q(X)]]. A fact is a
rule consisting only of the rules’ head, e.g. the fact q(a) is
written as [[p, a]] or equivalently [p(a)].
Here are some examples to illustrate the use of the inductive logic / meta inference functions and the list representation:
% compute the substitution set for the two complex terms
:-solve(substitution(f(g(A),B),f(g(h(a)),i(b)),Subst)).
% substitute a complex term with the substitution set
% {(A / h(a)),(B / h(b))}
:-solve(substitute(f(g(A),A),Instance,
[[A,["h","a"]],[B,["h","b"]]])).
% compute the lgg = f(X, g(Y,Z), c).
:-solve(lgg(f(a, g(b, h(X)), c), f(d, g(j(X), a), c),LGG)).
% Generalize a LP with the rules p(a):-q(a). p(a):-q(a),r(a).
% ... and the facts r(a). q(a). ...
% and return the generalized LP (set of general rules)
:-solve(generalize([
[p(a),q(a)],
[p(a),q(a),r(a)],
[p(b),q(b)],
[p(c),q(c)],
[r(a)],[q(a)],[q(b)],[q(c)]],
Generalization)).
The special built-in predicate metaLP (LP ) (coming with
the ContractLog distribution or the Prova distribution since
2.0) automatically translates the internal rules/facts of the
knowledge base into the list representation format and binds
it to the variable LP . The ”meta” LP can then be used in
further meta reasoning rules, e.g. the rule
clauses(Clause) : −metaLP (LP ), member(Clause, LP ).
returns all clauses of the logic program using the member
function on the list of clauses bound the the variable LP .
The combination of generalization and specializations allows
implementing typical top-down and bottom-up ILP learning
algorithms as well as combinations of both (akin to Muggletons’ unifying framework of generalization which combines
rlgg and inverse resolution.
As it is well-known several problems in pure theta- subsumption and rlgg arise due to the combinatorics of the
search-space (the space is infinite in multi-relational models)
and the determinancy problem. To make the search space
tractable and efficient, it is thus necessary to constrain the
search space in some way. In the following subsections we
will elaborate on several formalisms in Prova which can be
used to limit the number of clauses in an useful way.
3.2
Aggregations and Constructive Scopes
A common technique in logic-based propositionalization
to reduce the number of relations to be considered for feature
generation is aggregation by using aggregation functions as
provided by SQL. Aggregation replaces a set of values by a
suitable single value that summarizes properties of the set.
The data in Prova can either be available locally as facts
in the KB, or dynamically accessed via database queries on
arbitrary external data sources such as relational databases,
XML documents, Semantic Web RDF or RDFS/OWL ontologies which can be queried by several built-in query languages (e.g. SQL, RDF, XQuery, SPARQL) or wrapped via
Java APIs (e.g. local enterprise java beans or distributed
web services):
Prova Java Integration: The tight and natural Java integration of Prova [2] allows dynamically calling external
procedural Java methods during runtime. That is, efficient
procedural code can be integrated into the rule executions
and used for dynamically accessing external data sources
and tools using their programming interfaces (APIs). Methods of classes in arbitrary Java packages can be dynamically
invoked from Prova rules. The method invocations include
calls to Java constructors creating Java variables and calls
to instance and static methods for Java classes. The example below shows how XML Document Object Model (DOM)
is manipulated in the code. Prova provides a special wrapper object for XML DOM with a built-in class XML. The
objects of this class can be constructed from StringReader
objects and can be manipulated with ordinary methods of
the standard Java org.w3c.dom.Document class.
attachResults(Doc,Root,XMLPapers) :element(XMLPaper,XMLPapers),
ResId = XMLPapers.indexOf(XMLPaper),
StringReader = java.io.StringReader(XMLPaper),
Document = XML(StringReader),
ResRoot = Document.getDocumentElement(),
ResRoot.setAttribute("ResId",ResId),
Paper = Doc.importNode(ResRoot,Boolean.TRUE),
Root.appendChild(Paper),
fail().
attachResults(Doc,Root,XMLList).
The Java list XM LP apers contains papers returned from
a query to an external database. The built-in predicate
element non-deterministically enumerates each paper in the
list. The method indexOf invoked on the list XM LP apers
returns ResId as the sequential number of the current paper.
An XML DOM document is imported from the text based
XML representation contained in XM LP aper by first creating a StringReader object from it and then constructing
an XML DOM object. The root attribute is set in the next
two lines and then standard Java XML importN ode and
appendChild methods are used to append the P aper node
to the XML DOM in Doc.
By calling external Java methods, computation intensive
functions can be implemented by highly optimized procedural code and external data sources can be accessed via
calling Java wrappers and Java (web) service APIs. For
typical data sources such as relational databases, Semantic
Web and XML documents, Prova provides specialized query
and update built-ins.
Prova SQL Integration: Provas’ SQL integration has a
crucial role in providing an efficient and flexible mechanism
for relational data integration. Prova offers a seamless integration of predicates with most common SQL queries and
updates. The language goes beyond providing embedded
SQL calls and attempts to achieve a more flexible and natural integration of queries with Prova predicates.
The main format for Prova predicates dynamically mapped
to SQL Select statements is as follows:
or an exception occurs. It accepts a variable number of parameters of which only the first two are required. DB corresponds to an open database connection and F rom is either
the name of a table to be queried or a valid F rom clause
in SQL syntax enclosed in single or double quotes. F rom
can be a variable but it must become instantiated before the
execution of the query. Not only the F rom clause can be determined dynamically, but also all the remaining parameters
can be either variables or constants or even the whole list
of parameters can be dynamically constructed. The most
important part of the syntax of sql select is 0 or more field
name-value pairs [N 1, V 1], ..., [N k, V k]. N 1, ..., N k correspond to field names (with possible modifiers) included in
the query. As opposed to ordinary SQL Select statements,
this list of fields includes both the fields to be returned from
the query and those that can be supplied in the automatically constructed part of a SQL W here clause. Whether a
particular field N i will be returned or used as a constraint
depends on the values V i corresponding to these field. If
V i is a constant at the time of the invocation, it becomes a
constraint in the automatically constructed W here clause.
Otherwise, V i is an un-instantiated (free) variable and will
be returned by the query in each record in the result set.
In addition to simple field names, N 1, ..., N k can be strings
containing special SQL modifiers such as Distinct (for example, distinctname) or group functions such as Count (for
example, count(px)). The remaining parameters are entirely
optional. In the pair [where, W here], where is a reserved
word and W here is a variable or constant containing an
explicit SQL W here clause. An automatically constructed
W here clause part is concatenated via AN D with the explicit W here clause specified in this parameter. This syntax
is useful in situations requiring the use of such constraints
as Like or Rlike, for example, [where, ”pdbi dlike′ %%gs′ ”].
The pair [having, Having] allows specifying a post processing filter on the results returned by the query, for example,
[having, ”count(px) > 1”]. A large variety of other modifiers for the query can be included with the [order, Order]
pair. Queries with joined tables can either be constructed by
combining several single table queries or by using a composite F rom clause and making sure each field name is prefixed
with either the corresponding table name or an alias variable
if a syntax table as alias is used in the F rom clause.
sql_select(DB,cla,[pdb_id,"1alx"],[px,Domain])
sql_select(DB,cla,[pdb_id,PDB_ID],[count(px),2])
sql_select(DB,cla,[pdb_id,PDB_ID],[count(px),Count])
sql_select DB,cla,[pdb_id,PDB_ID],[count(px),Count],
[where,"pdb_id like ’%%gs’"]
sql_select(DB,cla,["distinct pdb_id",PDB_ID],[options,"limit 10"])
sql_select(DB,’cla as c1,cla as c2’,[’c1.px’,PXA],[’c2.px’,PXB],
[’c1.pdb_id’,PDB_ID],[where,’c1.pdb_id=c2.pdb_id and c1.px<c2.px’])
sql_select(DB,From,[N1,V1],...,[Nk,Vk],
[where,Where],[having,Having],[options,Options])
The where clause can be used to define a view on the
relational data base and constrain the number of considered instances. The last rules in the example shows how
two sql select calls can be used to compute an inner join
for table cla finding two different domains P XA and P XB
belonging to the same P DB file. Beside querying a database Prova also supports built-ins for inserting knowledge
and updating databases.
The built-in sql select predicate non-deterministically enumerates over all possible records in the result set corresponding to the query. The predicate fails if the result set is empty
Prova RDF / Ontology Integration: As for SQL, Prova
provides a special RDF query predicate which can be used in
the body of rules to interact with Semantic Web data sources
and explicitly express queries, such as concept membership,
role membership or concept inclusion on the ontologies. [4,
6] The special query predicate rdf is used to query external
ontologies written in RDF(S) or OWL (OWL Lite or OWL
DL).
% Bind all individuals of type "Gene" to the variable "Subject"
%using the owl ontology "gene1.owl" and the "rdfs" reasoner
rdf(
"http://www.gene.com/gene1.owl",
"rdfs",
Subject,"rdf_type","gene1_Gene")
The first argument specifies the URL of the external ontology. The second argument specifies the external reasoner
which is used to infer the ontology model and answer the
query. The hybrid approach provides a technical separation
between the inferences in the ontology (Description Logic)
part which is solved by an optimized external DL reasoner
and the Logic Programming components which is solved by
the rule engine. As a result the combined heterogenous integration approach is robustly decidable, even in case where
the rule language is far more expressive than Datalog. Moreover, the triple-based query language also supports queries
to plain RDF data sources. The following predefined reasoner are supported:
• ”” — ”empty” — null = no reasoner
• default = OWL reasoner
• transitive = transitive reasoner
• rdfs = RDFS rule reasoner
• owl = OWL reasoner
• daml = DAML reasoner
• dl = OWL-DL reasoner
• swrl = SWRL reasoner
• rdfs full = rdfs full reasoner
• rdfs simple = rdfs simple reasoner
• owl mini = owl mini reasoner
• owl micro = owl micro reasoner
User-defined reasoners can be easily configured and used.
By default the specified reasoners are used to query the external models on the fly, i.e. to dynamically answer the
queries using the external reasoner. But, a pre-processing
mode is also supported. Here the reasoners are used to preinfer the ontology model, i.e. build an inferred RDF triple
model where the logical DL entailments such as transitive
subclasses are already resolved at compilation time. Queries
then operate on the inferred model and are hence much fast
to answer, however with the drawback that updates of the
ontology model require a complete recompilation of the inferred model.
Prova Meta-Data LPs and Scoped Reasoning: To
capture the often distributed and open structure of multirelational knowledge/data bases which are deployed on the
Web Prova implements expressive updates and imports of
Prova scripts (knowledge modules) from web URIs, metadata annotated labelled logic programs (LLPs) and scoped
reasoning. [8] Arbitrary meta-data such as rule labels, module labels or Dublin Core annotations (e.g., author, date,
topic) can be attached to rules and facts. These additional
meta-data annotations become in particular interesting when
the knowledge base consists of several (possibly distributed)
rule sets, so called modules, which might be dynamically
imported from different external sources accessible by their
Web-based URIs. The meta-data might be used to to create
constructive (explicitly closed) views on the distributed KB
via scoped reasoning by scoped queries, e.g., ”all rules/facts
to a particular topic” or ”all facts with time stamps after a
certain date/time”. Hence, scoping leads to much smaller
search spaces and allows an explicit management of the level
of generality of queries/goals.
To explicitly annotate clauses in a labelled logic program
(LLP) P with an additional set of meta-data labels Prova
introduces a general n-ary metadata function into the LP
language. The function metadata is a partial injective labelling function that assigns a set of meta data annotations
m (property-value pairs) to a clause cl in the program P ,
i.e., m : cl. It is syntactically defined separated from a clause
(rule/fact/query) by ”::”:
metadata(L1 , .., Ln ) :: H ← B
where Li are a finite set of unary positive literals (positive meta data literals) which denote an arbitrary meta
data property(value) pair, e.g., label(rule1). The explicit
metadata() annotation is optional, i.e., a program P without meta data annotated clauses coincides with a standard
unlabelled LP.
metadata(label(rule1), topic("mutagenesis"), dc_date(2006-11-12))::
p(X):-q(X).
metadata(label(fact1))::q(1).
% scoped query using topic as scope
:-solve(scope(p(X),topic("mutagenesis"))).
The example shows a rule with rule label rule1, a topic
mutagenesis, an additional Dublin Core annotation
dc date(2006−11−12) and a fact with fact label f act1. The
meta annotation of rules and rule sets (modules) enables
(meta) reasoning with the semantic annotations. The meta
data can act as an explicit scope for constructive queries
(creating a view) on the knowledge base. For instance, the
meta data annotations might be used to constrain the level
of generality of a scoped goal literal to a particular module
(defined by the meta data constraints), i.e., to consider only
the set of rules and facts which belong to the specified module. A scoped literal is of the form L : C where L is a positive
or negative atom and C is the scope definition which is a set
of one or more meta data constraints. Scoped literals are
only allowed in the body of a rule. Scoped literals might be
default negated ∼ L : C. Syntactically, the following builtin predicates are used to query the meta data annotations
and define the scope of literals for metadata-based scoped
reasoning on explicitly specified parts of the KB:
% scoped literal
scope(<literal>,<meta data value>)
% query meta data value
metadata(<literal>,<Variable>,<meta data property>)
% constrain scoped goal literal
metadata(<literal>,<meta data value>,<meta data property>)
Scoped reasoning is crucial to explicitly close open and
possibly distributed KBs on the Web. Comparable to database views created by Where-SQL clauses scoped goals can
be used to create constructive views on the KB and reduce
the number of relations and background knowledge which
needs to be considered in ILP. Moreover, more meaningful
and relevant information can be selected from the KB by
the additional meta data of the rules/facts and rule sets
(modules).
3.3
Types and Modes
Type and mode declarations are a common way in many
ILP systems, like in PROGOL, TILDE or WARMR, to constrain the search space and state how clauses can be refined.
Prova provides rich support for modes and external ordersorted type systems, in particular Semantic Web ontologies
and Java class hierarchies by a polymorphic order-sorted
typed unification [6].
In order to type a variable with a Java type the fully qualified name of the Java class to which the variable should belong must be specified as a prefix separated from the variable
by a dot ”.”.
java.lang.Integer.X
java.util.Calendar.T
java.sql.Types.STRUCT.S
variable X is of type Integer
variable T is of type Calendar
variable S is of SQL type Struct
Java objects, as instances of Java classes, can be dynamically constructed by calling their constructors or static methods using highly-expressive procedural attachments.
The returned objects, might then be used as individuals /
constants that are bound by an equality relation (denoting
typed unification equality) to appropriate variables, i.e., the
variables must be of the same type or of a super type of the
Java object. Ad-hoc polymorphic specialized functions can
be implemented based on the type declarations, as can be
seen in the following example showing two variants of the
add function.
add(java.lang.Integer.In1,java.lang.Integer.In2,Result):Result = java.lang.Integer.In1 + java.lang.Integer.In2.
add(In1, In2,Result):I1 = java.lang.Integer(In1),
I2 = java.lang.Integer(In2),
X = I1+I2,
Result = X.toString().
Beside Java class hierarchies Semantic Web taxonomies
and ontologies (e.g. RDFS taxonomies or OWL ontologies)
can be used as external order-sorted type systems in the
multi-sorted Prova rule language. The implementation follows a prescriptive hybrid DL-typing approach with an polymorphic order-sorted unification and incorporates ontology
type information directly into the names of symbols in the
rule language. [8, 4, 6]
sameTranscriptionDirection(patika_P53Protein:A,
patika_MacroMolecule:B) :orfDirection(patika_P53Protein:A,patika_MacroMolecule:D),
orfDirection(patika_MacroMolecule:B,patika_MacroMolecule:D),
patika_P53ProteinA<>patika_MacroMolecule:B.
The example annotates variables (T ype : T erm) with conceptual types such as P 53P rotein or M acroM olecule from
an ontology patika, which denotes the namespace.
In the rlgg computation in ILP types are used to select
relevant facts and rules from the background knowledge.
4. DISTRIBUTED INDUCTIVE LOGIC PROGRAMMING IN PROVA
Biological data mining systems typically accesses large
and distributed web-based data sources and integrates multiple services, tools and resources during runtime. In this
Figure 1: Distributed Prova Web Services
section we will implement Prova as a highly efficient and
scalable inference service architecture with a communication
middleware which supports parallel computation and resource allocation, seamless integration of external tools and
communication of data, tasks and results between the distributed Prova inference services and external data sources
/ tools using an enterprise service bus (ESB) as communication middleware. [9] Figure 1 exemplifies the technical
design of our approach.
The three core design artifacts in our architecture are several instances of Prova rule engines deployed as web-based
inference services (web-based execution environments), a scalable and efficient service-broker and communication middleware (an ESB) and, a common platform-independent rule
interchange format to interchange rules, data and events between arbitrary Prova inference services and with external
tools and data sources.
Several Prova rule engines might be deployed as distributed web-based services. Each service might dynamically
import or pre-compile and load distributed rule bases which
implement ILP theories and background knowledge. External data from data sources such as Web resources or relational databases and external application tools, web services
and object representations can be directly integrated during
runtime or by translation during compile time by the expressive homogenous and heterogenous integration interfaces of
Prova. Furthermore, the ESB can be used to communicate
with external components such as web services via asynchronous publish-subscribe message conversations. The ESB is
used as object broker for the Prova inference services and as
stable and efficient messaging middleware between the ser-
vices [10]. Different transport protocols such as JMS, HTTP
or SOAP (or Rest) can be selected to transport rule sets,
data, queries and answers as payload of Reaction RuleML
event messages between the internal Prova inference services
deployed on the ESB. RuleML/Reaction RuleML [11, 10] is
used as common platform-independent rule interchange format in which the Prova platform-specific execution language
is translated and vice versa.
The main Prova language constructs for rule interchange
are: sendMsg predicates, reaction rcvMsg rules, and rcvMsg
or rcvMult inline reactions:
sendMsg(XID,Protocol,Agent,Performative,Payload |Context)
rcvMsg(XID,Protocol,From,queryref,Paylod|Context)
rcvMult(XID,Protocol,From,queryref,Paylod|Context)
where XID is the conversation identifier (conversation-id)
of the conversation to which the message will belong. Protocol defines the communication protocol. More than 30 protocols such as JMS, HTTP, SOAP, Jade are supported by
the underlying ESB as efficient and scalable object-broker
and communication middleware. Agent denotes the target
(an agent or service wrapping an instance of a Prova rule engine) of the message. Performative describes the pragmatic
context in which the message is send (e.g. a multi-request
in a contract net protocol). Payload represents the message
content sent in the message envelope. It can be a specific
query or answer or a complex interchanged rule base (set of
rules and facts).
% Upload a rule base read from File to the host
% at address Remote via JMS
upload_mobile_code(Remote,File) :% Opening a file returns an instance
% of java.io.BufferedReader in Reader
fopen(File,Reader),
Writer = java.io.StringWriter(),
copy(Reader,Writer),
Text = Writer.toString(),
% SB will encapsulate the whole content of File
SB = StringBuffer(Text),
sendMsg(XID,esb,Remote,eval,consult(SB)).
The example shows a rule that sends a rule base from
an external File to the inference service Remote using the
ESB. The inline sendMsg reaction rules is locally used within
a derivation rule, i.e. only applies in the context of the
derivation rule. The corresponding global receiving rule on
the inference service side could be:
rcvMsg(XID,esb,Sender,eval,[Predicate|Args]):derive([Predicate|Args]).
This rule receives all incoming messages from the ESB
send to the inference service with the pragmatic context
eval and derives the message content. The list notation
[P redicate | Args] will match with arbitrary n-ary predicate
functions, i.e., it denotes a kind of restricted second order
notation since the variable Predicate is always bound, but
matches to all predicates in the signature of the language
with an arbitrary number of arguments Args).
Rules and data are translated and interchange as inbound
and outbound Reaction RuleML messages < M essage >
over the ESB:
<Message mode="outbound" directive="ACL:inform">
<oid> <!-- conversation ID--> </oid>
<protocol> <!-- transport protocol --> </protocol>
<sender> <!-- sender agent/service --> </sender>
<content> <!-- message payload --> </content>
</Message>
• @mode = inbound|outbound – attribute defining the type of a
message
• @directive – attribute defining the pragmatic context of the
message, e.g. a FIPA ACL performative
• < oid > – the conversation id used to distinguish multiple
conversations and conversation states
• < protocol > – a transport protocol such as HTTP, JMS,
SOAP, Jade, Enterprise Service Bus (ESB) ...
• < sender >< receiver > – the sender/receiver agent/service
of the message
• < content > – message payload transporting a RuleML / Reaction RuleML query, answer or rule base
By distributing mobile code to several inference services
parallel computation of ILP tasks becomes possible. Relevant parts of the background knowledge for learning particular hypotheses are bundled to modules using constructive
scopes and distributed to several client inference services in
parallel. The learned rlggs from each client are send back
to the manager node and integrated and aggregate in the
background knowledge removing redundant and irrelevant
clauses. The manager node then constructs new modules
from the updated background knowledge and again sends
out the ILP tasks to the clients for parallel processing. This
process is repeated until a certain fixpoint in the incremental learner is reached for the inductively derived hypotheses
such that no further generalizations can be found. Typical
verification and validation tasks such as coverage proving
the learned hypotheses with negative examples by specialization, or finding and removing failures in the background
knowledge can be also solved in parallel by ”outsourcing”
this processes to the client services.
In summary, the described middleware addresses the needs
for a seamless integration of distributed external data sources,
tools and resources and provides the technical infrastructure
to develop new distributed and service-oriented ILP algorithms which share Web resources and data.
5. CONCLUSION
While previous work in data mining has focused on extracting useful information from large database and on implementing scalable, robust algorithms for propositional and
flat relations, multi-relation data mining operating on heterogenous and distributed data sources on the Web is a
relatively young field. In this paper we have introduced
Prova as a state-of-art distributed Semantic Web inference
service which supports distributed multi-relational inductive
logic programming based on a rule and event-based middleware. Prova combines technologies from declarative rulebased programming with enterprise application technologies for object-oriented programming, relational and semistructured heterogenous data access and novel techniques
for service oriented computing and complex event processing as basis for inference service grids, resource sharing networks and parallel computation. The resulting design artifact addresses real-world requirements in ILP-based mining
of biological data such as: highly complex structural elements with diverse and unusual relational, semi-structured
or object-centered data types, e.g. using Semantic Web ontology languages as semantically rich concept description
languages; large amounts of data stored in distributed heterogenous data sources; seamless integration and combinations of tools and services demanding for efficient interchange of data and events; high computational complexity
of the ILP tasks due to the complex combinatorics of multirelational search space and the open-world assumption of
the open distributed Web knowledge bases.
Our distributed rule-based Prova approach, which is akin
to grid service networks, has the potential to overcome these
problems in standard ILP and establish ILP as a potential
approach to analyze biological data in multi-relational Life
Science data bases published on the Web.
The implementations described in this paper are part of
the Prova / ContractLog open-source distribution [1] and we
have successfully demonstrated the usability and adequacy
of our ILP and enterprise service middleware approach in
various domains of research and industry use cases such as
for test-driven verification and validation of correctness and
quality of rule bases (see RBSLA project [7, 5, 8]), and Semantic Web-based virtual organizations and web service collaborations (see Rule Responder project [9]).
6. REFERENCES
[1] A. Kozlenkov, A. Paschke, and M. Schroeder. Prova,
http://prova.ws, accessed jan. 2006. 2006.
[2] A. Kozlenkov and M. Schroeder. Prova: Rule-based
java-scripting for a bioinformatics semantic web.
Proceedings International Workshop on Data
Integration in the Life Sciences, 2004.
[3] J. W. Lloyd. Foundations of logic programming; (2nd
extended ed.). Springer-Verlag New York, Inc., New
York, NY, USA, 1987.
[4] A. Paschke. Owl2prova: Homogeneous and
heterogeneous integration of description logics into
logic programming,
http://prova.ws/forum/viewtopic.php?t=152,
accessed dec. 2005, 2005.
[5] A. Paschke. Rule based service level agreements,
http://ibis.in.tum.de/projects/rbsla/index.php, 2006.
[6] A. Paschke. A typed hybrid description logic
programming language with polymorphic order-sorted
dl-typed unification for semantic web type systems. In
OWL-2006 (OWLED’06), Athens, Georgia, USA,
2006.
[7] A. Paschke. Verification, validation and integrity of
distributed and interchanged rule based policies and
contracts in the semantic web. In Int. Semantic Web
and Policy Workshop (SWPW’ 06), Athens, Georgia,
USA, 2006.
[8] A. Paschke. Rule-Based Service Level Agreements Knowledge Representation for Automated e-Contract,
SLA and Policy Management. IDEA, Munich, 2007.
[9] A. Paschke, H. Boley, A. Kozlenkov, and B. Craig.
Rule responder: A ruleml-based pragmatic agent web,
www.responder.ruleml.org, 2007.
[10] A. Paschke, A. Kozlenkov, and H. Boley. A
homogenous reaction rules language for complex event
processing. In International Workshop on Event Drive
Architecture for Complex Event Process (EDA-PS
2007), Vienna, Austria, 2007.
[11] A. Paschke, A. Kozlenkov, H. Boley, M. Kifer,
S. Tabet, M. Dean, and K. Barrett. Reaction ruleml,
http://ibis.in.tum.de/research/reactionruleml/, 2006.
[12] G. Plotkin. A note on inductive generalization.
Machine Intelligence, 5, 1970.
[13] S. Wrobel. Inductive Logic Programming for
Knowledge Discovery in Databases. Relational Data
Mining. Springer, Berlin, 2001.