1.1 Introduction To Data Mining: 1.1.1 Moving Toward The Information Age
The early development of data collection and database creation mechanisms served as a
prerequisite for the later development of effective mechanisms for data storage and retrieval,
as well as query and transaction processing. Nowadays, numerous database systems offer query
and transaction processing as common practice. Advanced data analysis has naturally become
the next step.
Since the 1960s, database and information technology has evolved systematically from
primitive file processing systems to sophisticated and powerful database systems.
Since the 1970s, research and development in database systems has progressed from early
hierarchical and network database systems to relational database systems, data modeling tools,
and indexing and accessing methods.
After the establishment of database management systems, database technology moved toward
the development of advanced database systems, web-based databases, and data warehousing
and data mining for advanced data analysis.
Advanced data analysis emerged from the late 1980s onward.
Huge volumes of data have been accumulated beyond databases and data warehouses. During
the 1990s, the World Wide Web and web-based databases began to appear.
Figure 1: The evolution of database system technology
In summary, the abundance of data, coupled with the need for powerful data analysis tools, has
been described as a data-rich but information-poor situation.
1.2 What is Data Mining?
Data mining refers to extracting or “mining” knowledge from large amounts of data. A more
appropriate name would have been “knowledge mining from data”. Many other terms carry a
similar or slightly different meaning to data mining, such as knowledge extraction,
data/pattern analysis, data archaeology, and data
dredging. Many people treat data mining as a synonym for another popularly used term,
Knowledge Discovery from Data, or KDD. Alternatively, others view data mining as simply
an essential step in the process of knowledge discovery.
Figure 2: Data Mining as a step in the process of knowledge discovery
Knowledge discovery as a process is depicted in Figure 2 and consists of an iterative sequence
of the following steps:
1. Data cleaning - to remove noise and inconsistent data.
2. Data integration - where multiple data sources may be combined.
3. Data selection - where data relevant to the analysis task are retrieved from the database.
4. Data transformation - where data are transformed or consolidated into forms appropriate for
mining by performing summary or aggregation operations, for instance.
5. Data mining - an essential process where intelligent methods are applied in order to extract
data patterns.
6. Pattern evaluation - to identify the truly interesting patterns representing knowledge based
on some interestingness measures.
7. Knowledge presentation - where visualization and knowledge representation techniques are
used to present the mined knowledge to the user.
Steps 1 to 4 are different forms of data preprocessing, where the data are prepared for mining.
The data mining step may interact with the user or a knowledge base. The interesting patterns
are presented to the user and may be stored as new knowledge in the knowledge base.
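The seven steps above can be sketched as a small pipeline. This is only an illustrative sketch under stated assumptions: the two purchase-record sources, the function names, the use of item-pair counting as the mining step, and the 0.5 support threshold are all hypothetical choices made for the example, not part of the process description itself.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: two sources of (customer, item) purchase records.
store_a = [("c1", "milk"), ("c1", "bread"), ("c2", "milk"), ("c2", None)]
store_b = [("c2", "bread"), ("c3", "milk"), ("c3", "bread"), ("c3", "eggs")]

def clean(records):
    """Step 1 (data cleaning): drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r)]

def integrate(*sources):
    """Step 2 (data integration): combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, items_of_interest):
    """Step 3 (data selection): keep only records relevant to the task."""
    return [r for r in records if r[1] in items_of_interest]

def transform(records):
    """Step 4 (data transformation): consolidate into one item set per customer."""
    baskets = {}
    for customer, item in records:
        baskets.setdefault(customer, set()).add(item)
    return baskets

def mine(baskets):
    """Step 5 (data mining): count the baskets that contain each item pair."""
    counts = Counter()
    for items in baskets.values():
        counts.update(combinations(sorted(items), 2))
    return counts

def evaluate(pair_counts, n_baskets, min_support):
    """Step 6 (pattern evaluation): keep pairs whose support meets the threshold."""
    supports = {pair: count / n_baskets for pair, count in pair_counts.items()}
    return {pair: s for pair, s in supports.items() if s >= min_support}

def present(patterns):
    """Step 7 (knowledge presentation): print the surviving patterns."""
    for pair, support in sorted(patterns.items()):
        print(f"{pair[0]} + {pair[1]}: support {support:.2f}")

def run_pipeline():
    data = integrate(clean(store_a), clean(store_b))
    data = select(data, {"milk", "bread", "eggs"})
    baskets = transform(data)
    return evaluate(mine(baskets), len(baskets), min_support=0.5)

patterns = run_pipeline()
present(patterns)
```

Support is used here as one possible interestingness measure. In practice the steps are iterative, so the output of pattern evaluation may send the analyst back to an earlier step.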
We adopt a broad view of data mining functionality: Data mining is the process of discovering
interesting patterns and knowledge from large amounts of data. The data sources can include
databases, data warehouses, the Web, other information repositories, or data that are streamed
into the system dynamically.
Figure 5: Three clusters, with each cluster center marked with a “+”
The objects are clustered or grouped based on the principle of maximizing the intraclass
similarity and minimizing the interclass similarity. That is, clusters of objects are formed so
that objects within a cluster have high similarity in comparison to one another, but are very
dissimilar to objects in other clusters.
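The grouping principle can be illustrated with k-means, one common clustering method. The text above does not name a specific algorithm, so k-means is an assumption chosen for the sketch, as are the sample points and the initial centers. Each point is assigned to its nearest center, and each center is then moved to the mean of its assigned points, which tends to increase similarity within a cluster.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Minimal k-means sketch: returns the final centers and the clusters."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # no center moved, so the result is stable
            break
        centers = new_centers
    return centers, clusters

# Hypothetical 2-D objects forming three visually separate groups.
points = [(1, 1), (1, 2), (2, 1),
          (8, 8), (8, 9), (9, 8),
          (1, 8), (2, 9), (1, 9)]
initial_centers = [(1, 1), (8, 8), (1, 8)]  # assumed starting guesses

centers, clusters = kmeans(points, initial_centers)
for center, cluster in zip(centers, clusters):
    print(f"center ({center[0]:.2f}, {center[1]:.2f}): {cluster}")
```

The result of k-means depends on the initial centers, which is why they are fixed here rather than chosen at random.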
Outlier Analysis
A database may contain data objects that do not comply with the general behavior or model of
the data. These data objects are outliers. Most data mining methods discard outliers as noise or
exceptions. However, in some applications such as fraud detection, the rare events can be more
interesting than the more regularly occurring ones. The analysis of outlier data is referred to as
outlier mining.
Rather than using statistical or distance measures, deviation-based methods identify outliers by
examining differences in the main characteristics of objects in a group.
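As a simple illustration of the statistical approach mentioned above, the following sketch flags values whose z-score (distance from the mean in units of standard deviation) exceeds a threshold. The transaction amounts, the function name, and the threshold of 2.0 are hypothetical, and the z-score test is only one of many possible statistical measures.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations
    from the mean. The threshold is an assumed, tunable parameter."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values are identical, so nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical transaction amounts; one is far larger than the rest.
amounts = [20, 22, 19, 21, 23, 20, 18, 22, 21, 500]
suspicious = zscore_outliers(amounts)
print(suspicious)
```

In a fraud detection setting, the flagged amounts would be passed on for inspection rather than discarded as noise. Note that a single extreme value also inflates the mean and standard deviation, so more robust measures are often preferred on small samples.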
Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects whose behavior
changes over time. Although this may include characterization, discrimination, association and
correlation analysis, classification, prediction, or clustering of time related data, distinct
features of such an analysis include time-series data analysis, sequence or periodicity pattern
matching, and similarity-based data analysis.
A data mining study of stock exchange data may identify stock evolution regularities for overall
stocks and for the stocks of particular companies. Such regularities may help predict future
trends in stock market prices, supporting decisions about stock investments.
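A very small example of time-series analysis is a moving average, which smooths short-term fluctuations so that a longer-term trend becomes easier to see. The sketch below is illustrative only: the price series, the window size of 3, and the helper names are assumptions, and comparing the first and last smoothed values is a deliberately crude notion of trend.

```python
def moving_average(series, window):
    """Return the mean of each run of `window` consecutive values."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

def trend(series):
    """Crude trend label: compare the last value with the first."""
    if series[-1] > series[0]:
        return "upward"
    if series[-1] < series[0]:
        return "downward"
    return "flat"

# Hypothetical daily closing prices for one stock.
prices = [10, 11, 12, 11, 13, 14, 15]
smoothed = moving_average(prices, window=3)
print(smoothed, trend(smoothed))
```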