Module 1
Unstructured data: This is data that does not conform to a data model and
is not in a form that a computer program can use easily. About 80% of an
organization's data is in this format; for example, memos, chat-room
transcripts, PowerPoint presentations, images, videos, letters, research
reports, white papers, and the body of an email.
Semi-structured data: Semi-structured data is also referred to as
self-describing data. It does not conform to a formal data model but has
some structure, such as tags or keys that label its fields. However, it is
still not in a form that a computer program can use as easily as structured
data. About 10% of an organization's data is in this format; for example,
HTML, XML, JSON, and email data.
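To make "self-describing" concrete, here is a minimal sketch in Python using the standard json module. The two records and their field names are hypothetical, chosen only for illustration: each record names its own fields, but no schema forces the records to share them.

```python
import json

# Hypothetical records: each one carries its own field names (self-describing),
# but the two records do not follow a common, fixed schema.
raw = """
[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "phone": "555-0100", "tags": ["vip"]}
]
"""

records = json.loads(raw)

# The field names travel with the data, so they can be discovered at run time.
fields_per_record = [sorted(r.keys()) for r in records]

# No schema guarantees that a field exists, so a missing one must be handled
# explicitly; dict.get returns None when the key is absent.
emails = [r.get("email") for r in records]

print(fields_per_record)
print(emails)
```

The program can read the data, but it has to check which fields each record actually contains. That extra checking is what separates semi-structured data from structured data.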
Structured data: When data follows a predefined schema/structure, we say it
is structured data. This data is in an organized form (e.g., in rows and
columns) and can be used easily by a computer program. Relationships exist
between entities of data, such as classes and their objects. About 10% of
an organization's data is in this format. Data stored in databases is an
example of structured data.
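As a small illustration of data that follows a predefined schema, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The student table, its columns, and its rows are assumptions made up for this example.

```python
import sqlite3

# In-memory database; the table name, columns, and rows are hypothetical.
conn = sqlite3.connect(":memory:")

# The schema is declared before any data is stored: every row must have
# exactly these columns with these types.
conn.execute(
    "CREATE TABLE student ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, marks INTEGER NOT NULL)"
)

conn.executemany(
    "INSERT INTO student (id, name, marks) VALUES (?, ?, ?)",
    [(1, "Asha", 82), (2, "Ravi", 67), (3, "Meena", 91)],
)

# Because the structure is known in advance, a program can query the data
# directly by column name.
top = conn.execute(
    "SELECT name FROM student WHERE marks > 80 ORDER BY name"
).fetchall()

conn.close()
print(top)
```

Because every row has the same columns, the query needs no per-record checks for missing fields.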
KDD (Knowledge Discovery from Data)
1. Data Cleaning − In this step, noise and inconsistent data are removed.
2. Data Integration − In this step, multiple data sources are combined.
3. Data Selection − In this step, data relevant to the analysis task are retrieved from
the database.
4. Data Transformation − In this step, data is transformed into forms appropriate for mining,
for example by performing summary or aggregation operations.
5. Data Mining − In this step, intelligent methods are applied in order to extract data
patterns.
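The five steps listed above can be sketched as a tiny pipeline in plain Python. This is only an illustrative sketch under stated assumptions: the sales records, the function names, and the use of simple pair counting as the "intelligent method" are all hypothetical stand-ins, not a prescribed KDD implementation.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical sales records from two separate sources.
store_a = [
    {"customer": "c1", "item": "bread", "amount": 2.0},
    {"customer": "c1", "item": "milk", "amount": 1.5},
    {"customer": "c2", "item": "bread", "amount": None},  # noisy record
    {"customer": "c2", "item": "milk", "amount": 1.5},
]
store_b = [
    {"customer": "c3", "item": "bread", "amount": 2.0},
    {"customer": "c3", "item": "milk", "amount": 1.5},
    {"customer": "c3", "item": "pen", "amount": 0.5},
]

def clean(rows):
    """Step 1: drop records with a missing amount."""
    return [r for r in rows if r["amount"] is not None]

def integrate(*sources):
    """Step 2: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(rows, items):
    """Step 3: keep only records relevant to the analysis task."""
    return [r for r in rows if r["item"] in items]

def transform(rows):
    """Step 4: aggregate rows into one set of items per customer."""
    baskets = defaultdict(set)
    for r in rows:
        baskets[r["customer"]].add(r["item"])
    return dict(baskets)

def mine(baskets):
    """Step 5: count how often each pair of items is bought together."""
    counts = Counter()
    for items in baskets.values():
        counts.update(combinations(sorted(items), 2))
    return counts

rows = integrate(clean(store_a), clean(store_b))
rows = select(rows, {"bread", "milk"})
baskets = transform(rows)
pair_counts = mine(baskets)

print(baskets)
print(pair_counts.most_common(1))
```

Here the pattern found is that bread and milk appear together in two of the three customer baskets. A real mining step would use a proper algorithm, such as association rule mining, instead of raw pair counts.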