Towards Flexibility in a Distributed Data Mining Framework
José M. Peña
Ernestina Menasalvas
de Madrid
Campus de Montegancedo S/N, 28660
Madrid, Spain
de Madrid
Campus de Montegancedo S/N, 28660
Madrid, Spain
jmpena@fi.upm.es
emenasalvas@fi.upm.es
ABSTRACT
1. INTRODUCTION
Distributed data mining (DDM) systems must keep the features of current centralized systems. Nevertheless, new features should be added to deal with new trends:
New algorithms should be easy to include.
Integration between relational and data mining operations is a main goal.
State-of-the-art trends in interoperability, like CORBA, should be taken into account.
Factors like the inclusion of algorithms, changes in system topology, different user privileges or system load restrictions have a deep impact on the overall performance of the system. Important decisions like task scheduling, resource management or partition schema should be tuned again almost every time one of these factors is modified.
Nowadays, there is no universal strategy to configure a distributed data mining environment so that it is optimal for the resolution of every query. As an alternative, our contribution presents an architecture to support multiple optimization and configuration strategies in data mining systems. This architecture provides a framework in which control policies can be changed to configure and optimize the
2. CURRENT TRENDS
Distributed and parallel data mining systems are more often experimental systems than commercial products. For this reason, most of them focus on a single data mining problem. Papyrus and JAM are distributed classifiers. Other problems, like association calculation [1, 3, 7] or sequential patterns [12, 13], also have distributed algorithms. Each of these algorithms deals with different factors: (i) data distribution (horizontal or vertical), (ii) communication schemas (broadcast or unicast), (iii) synchronization and coordination (collaborative, centralized control, voting, ...) or (iv) distributed task granularity.
In order to run a number of these algorithms in a common system, many different behaviors should be supported. A
3. MOIRAE ARCHITECTURE
The MOIRAE architecture has been designed to achieve distributed control in complex environments. It is a generic architecture based on the mechanism/policy paradigm.
Control Propagation: When a control plane is unable to solve a problem, it submits the problem description (e.g., the conflict) and any additional information to the control plane immediately above it in the hierarchy.
Control Delegation: After receiving a control propagation from a lower element, the upper element may take one of three alternatives:
If it is also unable to solve the problem, it propagates the conflict upwards as well.
If it can solve the problem, it may reply to the original component with the sequence of actions necessary to solve the problem. The original component then executes these actions.
If it can solve the problem, it may also, instead of replying with the sequence of actions, provide the lower component with the information necessary to solve the problem itself. This information can also be used in any future situation. This alternative is called Control Delegation.
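The propagation and delegation rules above can be illustrated with a minimal Python sketch. It is not the MOIRAE implementation: the names `ControlPlane`, `Policy`, `solve` and the `delegate` flag are hypothetical, and a "policy" is assumed here to be a plain function that maps a problem description to a list of actions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Assumption: a policy maps a problem description to the actions that solve it.
Policy = Callable[[str], List[str]]


@dataclass
class ControlPlane:
    """Hypothetical control plane in a hierarchy (not the MOIRAE API)."""
    name: str
    parent: Optional["ControlPlane"] = None
    policies: Dict[str, Policy] = field(default_factory=dict)
    delegate: bool = False  # if True, hand policies down instead of only actions

    def solve(self, kind: str, description: str,
              origin: Optional["ControlPlane"] = None) -> List[str]:
        """Return the actions the originating component should execute."""
        origin = origin if origin is not None else self
        policy = self.policies.get(kind)
        if policy is None:
            # Control propagation: pass the problem to the next level up.
            if self.parent is None:
                raise LookupError(f"no control plane can solve {kind!r}")
            return self.parent.solve(kind, description, origin)
        if origin is not self and self.delegate:
            # Control delegation: the lower component receives the policy,
            # solves the problem itself, and can reuse the policy later.
            origin.policies[kind] = policy
            return origin.policies[kind](description)
        # Plain reply: the upper level computes the actions for the origin.
        return policy(description)
```

After a delegated resolution, a second problem of the same kind is solved by the lower component without any propagation, which is the practical benefit of the third alternative.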
4.
The MOIRAE architecture is an abstract tool for developing distributed systems of any kind. A distributed data mining system called DIDAMISYS has been designed using this architecture. The system has been divided into different federations. Federations are groups of components that collaborate to achieve a well-defined set of functionalities.
DIDAMISYS has the following federations:
Data Warehouse Federation: Defines data administration tasks, like data integration, retrieval or information gathering.
Sketch of a Solution
Operator load :
Execution control :
The data bus has to be managed to offer quick access to data tables. The internal structure of the tables and the way in which they are stored on secondary devices are closely related, and both must be designed together to obtain optimal access. Data stored on the physical media is loaded into memory by the components on this bus, and these components then act as servers of the original data table they represent. The amount of memory used to load the data is constrained by the management component of this bus, which balances resource usage among the different components. If the memory space assigned to one of these components is not enough to load all the data, a swapping process is executed to load and release data blocks between memory and disk. These data blocks are called pages.
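The swapping behavior described above can be sketched as a data component that serves rows from a bounded set of resident pages. This is an illustrative sketch only: the class `PagedTable`, the `load_page` callback and the least-recently-used eviction rule are assumptions, since the text does not state which pages are released first.

```python
from collections import OrderedDict
from typing import Callable, List


class PagedTable:
    """Hypothetical data component serving rows with a fixed page budget."""

    def __init__(self, load_page: Callable[[int], List[tuple]],
                 page_size: int, max_pages: int) -> None:
        if page_size <= 0 or max_pages <= 0:
            raise ValueError("page_size and max_pages must be positive")
        self._load_page = load_page  # reads one page from secondary storage
        self.page_size = page_size
        self.max_pages = max_pages   # budget granted by the bus manager
        self._resident = OrderedDict()  # page number -> rows, oldest first
        self.loads = 0               # number of pages read from storage

    def row(self, index: int) -> tuple:
        """Return one row, swapping its page in if it is not resident."""
        page_no, offset = divmod(index, self.page_size)
        page = self._resident.get(page_no)
        if page is None:
            if len(self._resident) >= self.max_pages:
                # Assumed policy: release the least recently used page.
                self._resident.popitem(last=False)
            page = self._load_page(page_no)
            self.loads += 1
            self._resident[page_no] = page
        else:
            # Mark the page as most recently used.
            self._resident.move_to_end(page_no)
        return page[offset]
```

With this shape, the management component would only need to change `max_pages` to rebalance memory among data components; the access interface seen by operators stays the same.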
When a process is submitted to this federation from either the Data Mining (algorithm executions) or the Data Warehousing (data management/integration tasks) Federation, it is represented as a group description. This component defines which operator components must be executed and which data sources participate. Both operator and data components are activated by hosting them on specific nodes; once active, operators and data components are linked (by their operational plane).
During their execution, external factors or control strategies might require components to be moved from one node to another. Moving an operator may cause previously local links to become remote. Remote and local links provide the same features, but remote data access requires data transfer over the network; for this reason, operators are strategically located as near as possible to the data they use. Data components can also be moved, but this always requires the original data to be moved as well. For large data tables this takes a lot of time, so it is better to plan these actions when the system is idle and only for the most commonly used tables.
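The effect of moving an operator on its links can be made concrete with a short sketch. All names here (`GroupDescription`, `remote_links`, `move_operator`, `best_node`) are hypothetical, and the placement heuristic, which picks the node holding the largest volume of linked data, is only one possible reading of "as near as possible to the data".

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class GroupDescription:
    """Hypothetical group description: components, hosts and links."""
    operators: Dict[str, str]      # operator name -> hosting node
    data: Dict[str, str]           # data component name -> hosting node
    links: List[Tuple[str, str]]   # (operator, data component) pairs

    def is_remote(self, operator: str, data_name: str) -> bool:
        # A link is remote when both ends are hosted on different nodes.
        return self.operators[operator] != self.data[data_name]

    def remote_links(self) -> List[Tuple[str, str]]:
        return [(op, d) for op, d in self.links if self.is_remote(op, d)]

    def move_operator(self, operator: str, node: str) -> None:
        # Links are unchanged; only their local/remote status may change.
        self.operators[operator] = node


def best_node(group: GroupDescription, operator: str,
              table_sizes: Dict[str, int]) -> str:
    """Assumed heuristic: host the operator where most of its data lives."""
    weight: Dict[str, int] = {}
    for op, d in group.links:
        if op == operator:
            node = group.data[d]
            weight[node] = weight.get(node, 0) + table_sizes[d]
    if not weight:
        return group.operators[operator]  # no links: keep the current host
    return max(weight, key=weight.get)
```

Moving the operator never changes the set of links, only whether each one crosses the network, which matches the statement that remote and local links provide the same features.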
5. CONCLUSIONS
6. REFERENCES