
Component-based Data Mining Frameworks
OLAP vs. OLTP in the middle tier.

Fernando Berzal, Ignacio Blanco, Juan-Carlos Cubero, and Nicolas Marin

Both researchers and practitioners accept that decision support systems (DSSs) have specific needs that cannot be properly addressed by conventional information systems. Online Transaction Processing (OLTP) systems work with relatively small chunks of information at a time, while DSS applications require the analysis of huge amounts of data. Online Analytical Processing (OLAP) [1], data mining [3] and data warehouses [7] emerged during the last decade in order to fulfill the expectations of executives, managers, and analysts (also known as knowledge workers).

We have also witnessed the flourishing of component-based frameworks during the last few years (see Communications’ special section on object-oriented application frameworks, Oct. 1997 and Communications’ special section on component-based enterprise frameworks, Oct. 2000). These frameworks are intended to help developers to build increasingly complex systems, enhancing productivity and promoting component reuse in well-defined patterns. Nowadays those systems are widely used in enterprises throughout the world, but they usually provide only low-level information processing capabilities, since they are OLTP-application-oriented. Here, we propose the development of custom-tailored component-based frameworks to solve DSS problems, although this approach can be extended to a wide range of scientific applications.

A Component-based Data Mining Framework

An open framework for the development of data mining algorithms and DSSs should include capabilities to analyze huge datasets, cluster data, build classification models, and extract associations and patterns from input data. The conceptual model for such a system is shown in Figure 1.

The data miner (the user of the system) has to analyze large datasets and he or she needs to make use of data mining tools to perform a task. Data is gathered and data mining algorithms are used in order to build knowledge models that summarize the input data. Those models may provide the information our user needs, or they may just suggest new ways to explore the available data. Moreover, those knowledge models could be used as input to other mining algorithms in order to solve second-order data mining problems. Both knowledge models and dataset metadata might be stored for later use in the back-end database (for instance, the Object Pool).

Component-based frameworks such as Enterprise JavaBeans (see java.sun.com/products/j2ee) and Microsoft .NET (see www.microsoft.com/com/net) are based on a common architectural pattern, a.k.a. the Enterprise Component Framework [5]. A simplified representation of this pattern is shown in Figure 2.

This pattern, modeled as a parameterized collaboration in UML [6], contains six classifier roles which are depicted as rectangles:

• Client. Any entity that requests a service from a component in the framework. These requests could be performed using special-purpose data mining query languages, for example, OLE DB for Data Mining (see www.microsoft.com/data/oledb). Instead of calling the component directly, the client internally uses a pair of proxies that relay calls from the client to the component. This level of indirection is, however, hidden

COMMUNICATIONS OF THE ACM December 2002/Vol. 45, No. 12 97


Technical Opinion
from the client perspective; it makes location transparency possible and, when needed, it supports message interception.
• Factory proxy. Performs object factory operations, common to all framework components (such as create or find) and facilitates class methods.
• Remote proxy. Handles operations specific to each kind of component (such as inspection, parameter setting, and so forth) and facilitates instance methods.
• Component. Both datasets and knowledge models are components in a data mining framework. Data mining algorithms could also be considered as components on their own, but they are just used through factory proxies to build knowledge models.
• Container. Represents the framework’s runtime environment, and holds the components and both proxy roles. This containment is shown by aggregation links in Figure 2.
  The container supports distributed computing services, such as security, interprocess communication, persistence, and hot deployment. Transactions are also supported by enterprise frameworks (such as EJB/.NET frameworks) but are not needed in a data mining environment. Such an environment, however, needs scheduling, monitoring, and notification mechanisms to manage data mining tasks.
  Commercial application servers may be suitable for e-business applications, but they lack the stricter control of computing resources that data mining systems require. By extension, this shortcoming also applies to a wide range of scientific applications. Custom-tailored frameworks should include capabilities such as CPU/memory/database usage monitoring, resource discovery, reconfigurable load balancing, and a higher degree of freedom for programmers to implement distributed algorithms.
• Persistence service. Permits the storage and retrieval of framework components. This persistence service can be managed and coordinated by the container. Dataset metadata, discovered knowledge models, and user session information are all candidates to be saved for future use in the back-end database (the Object Pool). This back-end database is also useful to provide a reliable computing environment (preserving the system state against power outages, for example).

Figure 1. A conceptual model for DSSs. (Diagram labels: User; Data Mining System Graphical User Interface; Datasets; Data Mining Algorithm; Knowledge Models; Classifiers; Persistence Service; JDBC, ODBC, and ASCII Wrappers; Oracle DBMS, SQL Server DBMS, and ASCII Files; Back-end Database (Object Pool).)

Design Principles

We believe every component-based data mining framework should focus its design efforts on two major objectives:

• Transparency. Both for users and programmers. Users do not need to be aware of the underlying complexity of the system, while programmers should be able to create new framework components just by implementing a core set of well-defined interfaces.
• Usability [4]. Like any other software system, data mining systems are used by people, and system usability is critical for user acceptance, given that knowledge workers are not usually knowledgeable about computers. Users should be able to store anything they can reuse in future sessions, or even share the information they obtain. This groupware focus is especially important in data mining applications, where the discovered knowledge must be properly represented and communicated.

Component-based Dataset Modeling

Let us consider, for example, the case of assembling the datasets



that are used as input to build knowledge models. These datasets may come from heterogeneous information sources. Data mining tools usually work with tables in the relational sense. Each table contains a set of fixed-width tuples that can be obtained either from relational databases or any other information source (ASCII or XML files, for example).

All tabular datasets have a set of columns (also called attributes). Each one of them has a unique identifier and an associated data type (strings, numbers, dates, and so forth). A flexible tool should allow the specification of order relationships among attribute values and the grouping of attribute values to define concept hierarchies.

A data mining system should also be capable of performing heterogeneous queries over different databases and information sources. The independently retrieved datasets, in fact, might be processed further in order to join them with other datasets (data integration), to standardize concept representations and eliminate redundancies (data cleaning), to compute aggregations (data summarizing), or just to discard part of them (data filtering).

Figure 2. A simplified representation of the Enterprise Component Framework. (Diagram labels: Client; Factory proxy; Remote proxy; Component; Container; Persistence Service; Object Pool.)

All the aforementioned operations involving datasets can be performed using powerful formal models and query languages. However, typical users are not prepared to use such models and languages to define the customized datasets they need. They would probably reject a system that requires them to learn any complex formalism. In order to improve system acceptance, we propose a bottom-up approach. A family of dataset-building components should provide users with all the primitives they need to build their own datasets from the available data sources:

• Wrappers are responsible for providing uniform access to different data sources. Data stored as sets of tables in relational databases can be retrieved performing standard SQL queries through call-level interfaces such as JDBC and ODBC. Data stored in other formats (locally as ASCII/XML files or remotely in DSTP servers [see www.ncdm.iuc.edu/dstp], for example) require specific wrappers. In fact, any accessible information source requires its own suitable wrapper.
• Joiners are used to join multiple datasets. They allow the user to combine information coming from different sources. Joiners are also useful to include lookup fields into a given dataset (as in data warehouse star schemas) and to specify relationships between two datasets from the same source (for example, master/detail relationships).
• Aggregators summarize datasets in order to provide a higher-level view of the available data. Aggregations are useful in a wide range of OLAP applications, where trends are much more interesting than particular details. Common aggregation functions include MAX, MIN, TOP, BOTTOM, COUNT, SUM, and AVG.
• Filters perform a selection over the input dataset to obtain subsets of the original input dataset. In data mining applications, filters can be used to perform samplings, to build training datasets (such as when using cross-validation in classification problems), or just to select the data we are interested in for further processing.
• Transformers are also needed to modify dataset columns. Encoders are used to encode input data, for example, providing a uniform encoding schema to deal with data coming from different sources. Encoders are useful to ensure that real-world entities are always represented in the same

way, even when represented differently in different data sources. In fact, the same entity could even have several representations in a given data source. This kind of component is useful for data cleaning and integration.
Extenders are used to append new fields to a given dataset. Those fields, known as calculated fields, are useful for managing dates and converting measurement units. The value of a calculated field is completely determined by the other field values in the same tuple. A calculated field could be specified using a simple arithmetic expression (including operators such as +, -, *, and /) or even more complicated algorithms (involving if statements and table lookups, for example).

These components can be combined easily in tree-like structures to build highly personalized datasets. These datasets are amenable to standard query optimization techniques, therefore improving system performance. Using our approach, even computer-illiterate users are able to use complex data mining systems by linking dataset modeling components.

Commercial enterprise application servers (the containers in the framework pattern) are currently restricted to OLTP applications, and we believe it is time for system architects to focus on higher-level information processing capabilities in order to take advantage of the vast computing resources available with current corporate intranets. Although we have proposed a component-based framework model for building a data mining system here, our approach is extendible to any CPU-intensive computing application.

References
1. Chaudhuri, S. and Dayal, U. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, Mar. 1997.
2. Codd, E.F., Codd, S.B., and Salley, C.T. Providing OLAP to User-Analysts: An IT Mandate. Hyperion Solutions Corporation, Sunnyvale, CA., 1998; www.hyperion.com/products/whitepapers.
3. Han, J. and Kamber, M. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, 2000.
4. Juristo, N., Windl, H. and Constantine, L. (Eds.) Usability engineering. IEEE Software 18, 1 (Jan./Feb. 2001).
5. Kobryn, C. Modeling components and frameworks with UML. Commun. ACM 43, 10, (Oct. 2000), 31–38.
6. Object Management Group. OMG Unified Modeling Language Specification, 1.4, 1st. Ed. (Sept. 2001); www.omg.org/technology/uml/.
7. Widom, J. Research problems in data warehousing. In Proceedings of the 1995 International Conference on Information and Knowledge Management (CIKM'95), Nov. 29–Dec. 2, 1995, Baltimore, MD.

Fernando Berzal (berzal@acm.org) is a researcher at the Intelligent Databases and Information Systems research group in the Department of Computer Science and Artificial Intelligence at the University of Granada, Spain.
Ignacio Blanco (iblanco@ual.es) is an assistant professor at the University of Almeria, Spain.
Juan-Carlos Cubero (jc.cubero@decsai.ugr.es) is an associate professor in the Department of Computer Science and Artificial Intelligence at the University of Granada, Spain.
Nicolas Marin (nicm@descsai.ugr.es) is an assistant professor in the Department of Computer Science and Artificial Intelligence at the University of Granada, Spain.

© 2002 ACM 0002-0782/02/1200 $5.00
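
As an illustration of the dataset-building components described under Component-based Dataset Modeling, the following is a minimal sketch of how wrappers, joiners, filters, extenders, and aggregators might be linked into a tree. It is not the authors' implementation: the class names (`Dataset`, `ListWrapper`, `Joiner`, `Filter`, `Extender`, `Aggregator`), the `rows()` method, and the sample data are all assumptions made for this example, and the wrapper reads in-memory rows where a real one would go through JDBC, ODBC, or a file parser.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]  # one tuple of a tabular dataset, keyed by column name


class Dataset:
    """A node in the dataset tree; subclasses return their rows on demand."""

    def rows(self) -> List[Row]:
        raise NotImplementedError


class ListWrapper(Dataset):
    """Hypothetical wrapper over in-memory rows (stands in for JDBC/ODBC/file access)."""

    def __init__(self, data: List[Row]) -> None:
        self._data = [dict(r) for r in data]

    def rows(self) -> List[Row]:
        return [dict(r) for r in self._data]


class Joiner(Dataset):
    """Combines two datasets on a shared key column (an inner equi-join)."""

    def __init__(self, left: Dataset, right: Dataset, key: str) -> None:
        self.left, self.right, self.key = left, right, key

    def rows(self) -> List[Row]:
        index: Dict[object, List[Row]] = {}
        for r in self.right.rows():
            index.setdefault(r[self.key], []).append(r)
        return [{**l, **r} for l in self.left.rows() for r in index.get(l[self.key], [])]


class Filter(Dataset):
    """Keeps only the rows that satisfy a predicate."""

    def __init__(self, source: Dataset, predicate: Callable[[Row], bool]) -> None:
        self.source, self.predicate = source, predicate

    def rows(self) -> List[Row]:
        return [r for r in self.source.rows() if self.predicate(r)]


class Extender(Dataset):
    """Appends a calculated field derived from other fields of the same tuple."""

    def __init__(self, source: Dataset, name: str, formula: Callable[[Row], object]) -> None:
        self.source, self.name, self.formula = source, name, formula

    def rows(self) -> List[Row]:
        return [{**r, self.name: self.formula(r)} for r in self.source.rows()]


class Aggregator(Dataset):
    """Groups rows by one column and summarizes another with a function such as sum."""

    def __init__(self, source: Dataset, group_by: str, column: str,
                 function: Callable[[List[object]], object], name: str) -> None:
        self.source, self.group_by, self.column = source, group_by, column
        self.function, self.name = function, name

    def rows(self) -> List[Row]:
        groups: Dict[object, List[object]] = {}
        for r in self.source.rows():
            groups.setdefault(r[self.group_by], []).append(r[self.column])
        return [{self.group_by: k, self.name: self.function(v)} for k, v in groups.items()]


# Made-up sample data: order lines and a customer lookup table.
orders = ListWrapper([
    {"customer_id": 1, "units": 3, "unit_price": 2.0},
    {"customer_id": 1, "units": 1, "unit_price": 10.0},
    {"customer_id": 2, "units": 5, "unit_price": 1.0},
    {"customer_id": 2, "units": 0, "unit_price": 4.0},
])
customers = ListWrapper([
    {"customer_id": 1, "region": "North"},
    {"customer_id": 2, "region": "South"},
])

# Tree: aggregate(extend(filter(join(orders, customers)))).
tree = Aggregator(
    Extender(
        Filter(Joiner(orders, customers, "customer_id"), lambda r: r["units"] > 0),
        "amount", lambda r: r["units"] * r["unit_price"]),
    group_by="region", column="amount", function=sum, name="total_amount")

result = tree.rows()
print(result)
# -> [{'region': 'North', 'total_amount': 16.0}, {'region': 'South', 'total_amount': 5.0}]
```

Because every component exposes the same `rows()` interface, any node can serve as the input of any other, which is what allows the tree-like composition. This sketch evaluates each node eagerly; a real framework could instead inspect the tree and push filters or aggregations down to the data source.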
