
Module 1


Applied Data Science

Data: Computer data is information processed or stored by a computer. This information may be in the form of text documents, images, audio clips, software programs, or other types of data. Computer data may be processed by the computer's CPU and is stored in files and folders on the computer's hard disk.
Computer data is, at its lowest level, a series of ones and zeros, known as binary data.
Because all computer data is in binary format, it can be created,
processed, saved, and stored digitally. This allows data to be
transferred from one computer to another using a network connection
or various media devices.
Data may be:

Unstructured data: This is data which does not conform to a data model or is not in a form which can be used easily by a computer program. About 80% of an organization's data is in this format; for example, memos, chat rooms, PowerPoint presentations, images, videos, letters, research reports, white papers, the body of an email, etc.
Semi-structured data: Semi-structured data is also referred to as having a self-describing structure. This is data which does not conform to a data model but has some structure; however, it is not in a form which can be used easily by a computer program. About 10% of an organization's data is in this format; for example, HTML, XML, JSON, email data, etc.
Structured data: When data follows a pre-defined schema/structure, we say it is structured data. This is data which is in an organized form (e.g., in rows and columns) and can be easily used by a computer program. Relationships exist between entities of data, such as classes and their objects. About 10% of an organization's data is in this format. Data stored in databases is an example of structured data. The short sketch below contrasts semi-structured and structured data.
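To make the distinction concrete, the short Python sketch below (the records and field names are invented for illustration) loads a small semi-structured JSON document and flattens it into a structured, rows-and-columns table with pandas.

```python
# A minimal sketch: turning semi-structured JSON into structured rows and columns.
# The records and field names are illustrative, not from a real system.
import json
import pandas as pd

raw = '''
[
  {"name": "Asha", "email": "asha@example.com", "orders": 3},
  {"name": "Ravi", "email": "ravi@example.com", "orders": 1}
]
'''

records = json.loads(raw)      # semi-structured: self-describing key/value pairs
df = pd.DataFrame(records)     # structured: a fixed schema of rows and columns
print(df.dtypes)               # each column now has a definite type
print(df)
```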
KDD (Knowledge Discovery from Data)
1. Data Cleaning − In this step, noise and inconsistent data are removed.

2. Data Integration − In this step, multiple data sources are combined.

3. Data Selection − In this step, data relevant to the analysis task are retrieved from the database.

4. Data Transformation − In this step, data are transformed into forms appropriate for mining, for example by performing summary or aggregation operations (a short sketch of steps 1-4 follows this list).

5. Data Mining − In this step, intelligent methods are applied in order to extract data patterns.

6. Pattern Evaluation − In this step, the extracted data patterns are evaluated to identify the truly interesting ones.

7. Knowledge Presentation − In this step, the discovered knowledge is represented and presented to the user.
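As a rough illustration, the pandas sketch below walks through the first four KDD steps on invented toy data (the table and column names are made up); the mining, evaluation, and presentation steps would normally involve an actual mining algorithm and a report.

```python
# Illustrative sketch of KDD steps 1-4 on invented toy data.
import pandas as pd

sales = pd.DataFrame({"cust_id": [1, 2, 2, 3, None],
                      "amount":  [120.0, 80.0, 80.0, -5.0, 60.0]})
customers = pd.DataFrame({"cust_id": [1, 2, 3],
                          "region":  ["North", "South", "North"]})

# 1. Data cleaning: drop missing ids, duplicates, and noisy (negative) amounts.
clean = (sales.dropna(subset=["cust_id"])
              .drop_duplicates()
              .astype({"cust_id": int}))
clean = clean[clean["amount"] > 0]

# 2. Data integration: combine the two sources on the common key.
merged = clean.merge(customers, on="cust_id", how="inner")

# 3. Data selection: keep only the attributes relevant to the analysis task.
selected = merged[["region", "amount"]]

# 4. Data transformation: summarise/aggregate into a mining-ready form.
summary = selected.groupby("region")["amount"].agg(["count", "sum", "mean"])

# Steps 5-7 (mining, pattern evaluation, knowledge presentation) would follow.
print(summary)
```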


CHARACTERISTICS OF DATA
Data has three key characteristics:
1. Composition: The composition of data deals with the structure of data, that is, the sources of data, the granularity, the types, and the nature of data as to whether it is static or real-time streaming.
2. Condition: The condition of data deals with the state of data, that is, "Can one use this data as is for analysis?" or "Does it require cleansing for further enhancement and enrichment?"
3. Context: The context of data deals with "Where has this data been generated?", "Why was this data generated?", and "How sensitive is this data?"
DEFINITION OF BIG DATA
• Big data is high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
• Big data refers to datasets whose size is typically beyond the storage capacity of, and too complex to be handled by, traditional database software tools.
• Big data is anything beyond the human and technical infrastructure needed to support storage, processing, and analysis.
• It is data that is big in volume, velocity and variety.
Variety: Data can be structured, semi-structured, or unstructured. Data stored in a database is an example of structured data. HTML data, XML data, email data, and CSV files are examples of semi-structured data. PowerPoint presentations, images, videos, research papers, white papers, the body of an email, etc., are examples of unstructured data.
Velocity: Velocity essentially refers to the speed at which data is being created in real time. We have moved from simple desktop applications, like a payroll application, to real-time processing applications.
Volume: Volume can be in terabytes, petabytes, or zettabytes.
Gartner Glossary: Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight and decision making.
WHY BIG DATA?
The more data we have for analysis, the greater the analytical accuracy and the greater the confidence in our decisions based on these analytical findings. Better analytical accuracy will lead to a greater positive impact in terms of enhancing operational efficiencies, reducing cost and time, developing new products and new services, and optimizing existing services.
Let's start by understanding what data science is.
1. Have you ever wondered how Amazon and eBay suggest items for you to buy?
2. How does Gmail filter your emails into spam and non-spam categories?
3. How does Netflix predict the shows you will like?
How do they do it? These are a few questions we ponder from time to time. In reality, doing such tasks is impossible without the availability of data.
Data science is all about using data to solve problems. The problem could be decision making, such as identifying which email is spam and which is not; a product recommendation, such as which movie to watch; or predicting an outcome, such as who will be the next President of the USA.
So, the core job of a data scientist is to understand the data, extract useful information out of it, and apply it to solving problems.
Data Science is about data gathering, analysis, and decision-making. Data Science is about finding patterns in data through analysis and making future predictions. By using Data Science, companies are able to make better decisions (should we choose A or B?).

Data science is an interdisciplinary science that incorporates scientific fields of data such as data engineering, information science, computer science, statistics, artificial intelligence, machine learning, data mining, and predictive analytics.
Data science uses a wide array of data-oriented technologies, including SQL, Python, R, and Hadoop. However, it also makes
extensive use of statistical analysis, data visualization, distributed
architecture, and more to extract meaning out of sets of data. The
information extracted through data science applications is used to
guide business processes and reach organizational goals.
Data science is a collection of techniques used to extract value from
data. It has become an essential tool for any organization that collects,
stores, and processes data as part of its operations.

Data science techniques rely on finding useful patterns, connections, and relationships within data.
Data science is also commonly referred to as knowledge discovery,
machine learning, predictive analytics, and data mining.
Data science starts with data, which can range from a simple array of a
few numeric observations to a complex matrix of millions of
observations with thousands of variables.
Data science utilizes certain specialized computational methods in
order to discover meaningful and useful structures within a dataset.
The discipline of data science coexists and is closely associated with a
number of related areas such as database systems, data engineering,
visualization, data analysis, experimentation, and business intelligence
(BI).
Applications of Data Science
1.Fraud Detection
2.HealthCare
3.E-Commerce
4.Supply Chain Management
5.Customer Service
6.Marketing
7.Transportation
8.Energy and Utilities
9.Social Media
10.Sports
Data Science Process
The methodical discovery of useful relationships and patterns in data is
enabled by a set of iterative activities collectively known as the data
science process.
The standard data science process involves (1) understanding the
problem, (2) preparing the data samples, (3) developing the model, (4)
applying the model on a dataset to see how the model may work in the
real world, and (5) deploying and maintaining the models.
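As a rough, end-to-end illustration of steps (2) to (5), the sketch below uses scikit-learn's bundled iris dataset (the saved file name is arbitrary): it prepares the data samples, develops a model, applies it to held-out data to estimate real-world performance, and persists it for deployment.

```python
# A minimal sketch of the data science process on scikit-learn's bundled iris data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import joblib

# (2) Prepare the data samples: split into training and test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# (3) Develop the model.
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# (4) Apply the model to unseen data to see how it may work in the real world.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# (5) Deploy/maintain: persist the model so an application can load and reuse it.
joblib.dump(model, "knn_model.joblib")   # file name is arbitrary
```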
Business Understanding: This initial phase concentrates on discovering the data science goals and requirements from a business perspective.
Data Understanding: The data understanding phase starts with an initial data collection and continues with activities to discover the characteristics of the data. This phase identifies data quality complications and yields initial insights into the data.
Data Preparation: The data preparation phase covers all activities needed to construct the final dataset for the modeling tools. This phase is carried out in a cyclical manner together with the modeling phase, to arrive at a complete model. This ensures you have all the data required for your data science task.
Modeling: In this phase, different data science modeling techniques are selected and evaluated for achieving the required outcomes, as per the business requirements. The process returns to the data preparation phase in a cyclical manner until the model is successful.
Evaluation: Before proceeding to final deployment of the data science solution, it must be validated that the proposed solution achieves the business objectives. If this validation fails, the process returns to the data understanding phase to improve the delivery.
Deployment: In this phase, the evaluated model is made production ready or live; the results of the data science process are assimilated into the business process, usually in software applications.
2.1 PRIOR KNOWLEDGE: The prior knowledge step in the data science process helps
to define what problem is being solved, how it fits in the business context, and what
data is needed in order to solve the problem.
• Objective: The data science process starts with a need for analysis, a question, or a
business objective. This is possibly the most important step in the data science
process. Without a well-defined statement of the problem, it is impossible to come
up with the right dataset and pick the right data science algorithm.
• Subject Area: The process of data science uncovers hidden patterns in the dataset by exposing relationships between attributes; making sense of those patterns requires knowledge of the subject area.
• Data: Similar to the prior knowledge in the subject area, prior knowledge in the
data can also be gathered. Understanding how the data is collected, stored,
transformed, reported, and used is essential to the data science process.
• Causation Versus Correlation: A correlation between variables does not automatically mean that the change in one variable is the cause of the change in the values of the other variable. Causation indicates that one event is the result of the occurrence of the other event; i.e., there is a causal relationship between the two events.
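The brief sketch below (the monthly figures are made up) computes a correlation coefficient with pandas; a value near 1 only says the two variables move together, not that one causes the other.

```python
# Correlation is a number; causation is a claim about how the data were generated.
import pandas as pd

# Made-up monthly figures: ice cream sales and drowning incidents.
df = pd.DataFrame({"ice_cream_sales": [20, 35, 50, 70, 90, 95],
                   "drownings":       [1, 2, 3, 5, 7, 8]})

r = df["ice_cream_sales"].corr(df["drownings"])   # Pearson correlation
print(f"correlation = {r:.2f}")
# A high correlation here does NOT mean ice cream causes drownings; a lurking
# variable (hot weather) drives both.
```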
2.2 DATA PREPARATION: Preparing the dataset to suit a data science task is the most time-consuming part of the process. Most data science algorithms require the data to be structured in a tabular format, with records in the rows and attributes in the columns. If the data is in any other format, it needs to be transformed by applying pivot, type conversion, join, or transpose functions, etc., to condition it into the required structure, as in the sketch below.
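The hypothetical pandas sketch below shows one such conditioning step: a "long"-format table (the customer and attribute names are invented) is pivoted so that each record becomes a row and each attribute a column.

```python
# Illustrative pivot: reshaping long-format data into records-in-rows, attributes-in-columns.
import pandas as pd

long_form = pd.DataFrame({
    "customer":  ["A", "A", "B", "B"],
    "attribute": ["age", "credit_score", "age", "credit_score"],
    "value":     [34, 710, 29, 655],
})

# Pivot so each customer is one row and each attribute is one column.
tabular = long_form.pivot(index="customer", columns="attribute", values="value")
print(tabular)
```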
• Data Exploration: Data preparation starts with an in-depth exploration of the data and gaining a
better understanding of the dataset. Data exploration, also known as exploratory data analysis,
provides a set of simple tools to achieve basic understanding of the data. Data exploration
approaches involve computing descriptive statistics and visualization of data.
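A minimal exploration pass in pandas might look like the sketch below (the column names and values are invented); it prints descriptive statistics, pairwise correlations, and a quick scatter plot.

```python
# Minimal exploratory data analysis on invented loan data.
import pandas as pd

df = pd.DataFrame({"credit_score":  [640, 710, 690, 580, 760, 700],
                   "interest_rate": [9.5, 7.2, 7.8, 11.0, 6.5, 7.5]})

print(df.describe())   # count, mean, std, min, quartiles, max for each column
print(df.corr())       # pairwise correlations between numeric attributes
df.plot.scatter(x="credit_score", y="interest_rate")   # quick visual check (needs matplotlib)
```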
• Data Quality: Data quality is an ongoing concern wherever data is collected, processed, and
stored.
• Missing Values: One of the most common data quality issues is that some records have missing attribute values. For example, a credit score may be missing in one of the records.
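The sketch below (continuing the credit-score example with made-up records) shows two common treatments for missing values in pandas: dropping the affected records or imputing a substitute value such as the median.

```python
# Two common treatments for missing attribute values, on made-up records.
import numpy as np
import pandas as pd

df = pd.DataFrame({"applicant":    ["A", "B", "C", "D"],
                   "credit_score": [690, np.nan, 720, np.nan]})

dropped = df.dropna(subset=["credit_score"])                         # option 1: remove records
imputed = df.fillna({"credit_score": df["credit_score"].median()})   # option 2: impute the median

print(dropped)
print(imputed)
```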
• Data Types and Conversion: The attributes in a dataset can be of different types, such as
continuous numeric (interest rate), integer numeric (credit score), or categorical. For example, the
credit score can be expressed as categorical values (poor, good, excellent) or numeric score.
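The sketch below (categories and scores invented) converts a categorical credit rating into an ordered numeric code, and bins a numeric score back into categories; which direction is needed depends on the algorithm.

```python
# Converting between categorical and numeric representations (invented values).
import pandas as pd

df = pd.DataFrame({"rating": ["poor", "good", "excellent", "good"]})

# Categorical -> ordered numeric code (poor=0, good=1, excellent=2).
order = ["poor", "good", "excellent"]
df["rating_code"] = pd.Categorical(df["rating"], categories=order, ordered=True).codes

# Numeric -> categorical, by binning a numeric credit score.
scores = pd.Series([580, 640, 705, 790])
binned = pd.cut(scores, bins=[300, 600, 700, 850], labels=order)

print(df)
print(binned)
```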
• Transformation: In some data science algorithms, like k-NN, the input attributes are expected to be numeric and normalized, because the algorithm compares the values of different attributes and calculates the distance between data points. Normalization prevents one attribute from dominating the distance results because of its large values.
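A minimal normalization sketch (the attribute ranges are invented) scales each attribute to [0, 1] so that a large-valued attribute such as income does not dominate the k-NN distance calculation.

```python
# Min-max normalization so no single attribute dominates k-NN distances.
import pandas as pd

df = pd.DataFrame({"credit_score": [580, 640, 705, 790],             # roughly 300-850
                   "income":       [25000, 40000, 90000, 120000]})   # much larger scale

normalized = (df - df.min()) / (df.max() - df.min())   # each column scaled to [0, 1]
print(normalized)
```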
• Outliers: Outliers are anomalies in a given dataset. Outliers may be genuine extreme values that were captured correctly, or the result of erroneous data capture (e.g., a human height recorded as 1.73 cm instead of 1.73 m). Detecting outliers may be the primary purpose of some data science applications, such as fraud or intrusion detection.
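A simple outlier check (with made-up heights, including the 1.73 cm capture error mentioned above) using the common interquartile-range rule: values far outside the middle 50% of the data are flagged.

```python
# Flagging outliers with the interquartile range (IQR) rule, on made-up heights.
import pandas as pd

heights_m = pd.Series([1.70, 1.65, 1.80, 1.75, 0.0173])   # last value: 1.73 cm entered by mistake

q1, q3 = heights_m.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = heights_m[(heights_m < q1 - 1.5 * iqr) | (heights_m > q3 + 1.5 * iqr)]
print(outliers)   # flags the erroneous 0.0173 m record
```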
• Feature Selection: Feature selection is a way of selecting the subset of the most relevant features from the original feature set by removing the redundant, irrelevant, or noisy features.
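As one concrete approach (among many), the sketch below uses scikit-learn's SelectKBest on its bundled iris data to keep the two features with the strongest statistical relationship to the target.

```python
# Keep the k most relevant features, scored by the ANOVA F-value against the target.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)

print("score per feature:     ", selector.scores_)
print("selected feature index:", selector.get_support(indices=True))
X_reduced = selector.transform(X)   # dataset reduced to the 2 selected features
```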
• Data Sampling :Sampling is a process of selecting a subset of records as a
representation of the original dataset for use in data analysis or modeling.
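A minimal sampling sketch in pandas (the records are invented): a random 10% sample drawn with a fixed seed so the selection is reproducible.

```python
# Random sampling of records as a representative subset (invented data).
import pandas as pd

df = pd.DataFrame({"record_id": range(1000),
                   "amount":    [i % 50 for i in range(1000)]})

sample = df.sample(frac=0.10, random_state=42)   # 10% of the records, reproducible
print(len(sample), "records sampled from", len(df))
```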
2.3 MODELING: A model is the abstract representation of the data and
the relationships in a given dataset.
Different types of data models:
1. Conceptual data models: A conceptual data model is a high-level
representation of the main entities, attributes, and relationships in a
domain of interest. It is usually independent of any specific database
system or technology, and focuses on the meaning and logic of the data
rather than the implementation details. A conceptual data model can help
you understand the scope and context of your data, and communicate it
to stakeholders and users. Some examples of conceptual data models
are entity-relationship diagrams, UML class diagrams, and ontologies.
2. Logical data models: A logical data model is a more detailed and formalized version of a conceptual data model that specifies the data types, constraints, and rules for each entity, attribute, and relationship.
Some examples of logical data models are relational models, hierarchical
models, and network models.
3. Physical data models: A physical data model is a representation of how the data will actually be stored and accessed in a specific database
system or technology. Some examples of physical data models are SQL
schemas, NoSQL schemas, and XML schemas.
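As a small illustration of a physical data model, the sketch below creates a concrete SQL schema with Python's built-in sqlite3 module; the table and column names are invented for the example.

```python
# A tiny physical data model: a concrete SQL schema in SQLite (invented tables).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    email       TEXT UNIQUE
);
CREATE TABLE purchase (
    purchase_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    amount      REAL CHECK (amount >= 0)
);
""")
print([row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")])
conn.close()
```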
2.4 APPLICATION: Deployment is the stage at which the model becomes production ready or live. In business applications, the results of the data science process have to be assimilated into the business process, usually in software applications.
The model deployment stage has to deal with: assessing model
readiness, technical integration, response time, model maintenance, and
assimilation.
2.5 KNOWLEDGE: The data science process provides a framework to
extract nontrivial information from data. The data science process starts
with prior knowledge and ends with posterior knowledge, which is the
incremental insight gained.
Data Analytics Life Cycle
Difference Between Data Science and Data Analytics
Case Study on Global Innovation Network and Analysis (GINA)
