ToolKit 1 - Unit 1 - Introduction To Data Analytics
What is Data?
Data is a formalized representation of facts, concepts, or instructions that is
suitable for transmission, interpretation, or processing by a human or an
electronic system. It is represented by characters such as letters (A-Z, a-z),
digits (0-9), or special characters (+, -, /, *, <, >, =, etc.). More broadly,
data is a collection of information obtained through observation, measurement,
study, or analysis.
Data Analytics
Data analytics is the study of raw data with the purpose of drawing
conclusions from it. It is critical because it allows firms to improve their
performance. Companies that incorporate it into their business models can
cut costs by developing more efficient ways of doing business and of
storing massive volumes of data.
Data Analysis Steps
1 Roadmap and operating model
2 Data Acquisition
3 Data Security
4 Data Governance and Standards
6 Validation
Data Analytics Life Cycle
It involves 6 phases, namely:
1 Discovery
2 Data Prep
3 Plan Model
4 Build Model
5 Communicate Results/publish Insights
6 Measure Effectiveness
Descriptive Analysis
Diagnostic Analytics
Predictive Analysis
Prescriptive Analytics
Data Collection
Data collection is the procedure of collecting, measuring, and analyzing accurate
insights for research using standard, validated techniques.
The most important goal of data collection is to gather information-rich and
accurate data for statistical analysis so that data-driven research decisions
can be made.
Data Collection Methods
1 Primary
This is original, first-hand data collected by the data researchers.
Primary data results are highly accurate provided the researcher
collects the information.
2 Secondary
Secondary data is second-hand data that was collected by other parties
and has already undergone statistical analysis. It is either
information that the researcher has tasked other people to collect
or information the researcher has looked up.
Data Collection Tools
7 Documents and records
8 Focus groups
9 Oral histories
How to deliver value with analytics?
• Enable self-service analytics
• Provide specific goals and their related KPIs to help teams
measure success
• Democratize advanced analysis with intuitive AI
• Support development of data literacy or confidence when working
with data
• Identify subject matter experts in each department
Aspects of Framework
1 Discovery
2 Insights
3 Actions
4 Outcomes
Intelligent
4 Data Preparation 5 Learning 6 Actions
Techniques of Framework
The big data analytics framework is primarily based on two fundamental
frameworks, namely:
Many entrepreneurs all around the world employ data analytics frameworks.
• Apache Cassandra
• KNIME
• Datawrapper
• Lumify
• Apache Storm
• RapidMiner
• Apache Flink
Big Data
Big data is, as the term implies, a "large" quantity of data. It refers to a data
collection that is both huge in volume and complex. Traditional data
processing software cannot manage big data because of its volume and
complexity. Big data simply refers to datasets that contain a significant
quantity of varied data, both structured and unstructured.
5 Vs of Big Data
Variety: It refers to the nature of the data, which may be structured,
semi-structured, or unstructured. It also refers to heterogeneous sources.
1 Structured
2 Unstructured
3 Semi-Structured
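As a rough illustration of the three forms, the Python sketch below reads one small sample of each. The sample values and field names are invented for this example and do not come from any real dataset.

```python
import csv
import io
import json

# Structured: fixed columns, every row follows the same schema.
structured_text = "id,name,amount\n1,Asha,250\n2,Ravi,125\n"
rows = list(csv.DictReader(io.StringIO(structured_text)))

# Semi-structured: self-describing keys, but fields may vary per record.
semi_structured_text = '{"id": 3, "name": "Meena", "tags": ["new", "online"]}'
record = json.loads(semi_structured_text)

# Unstructured: free text with no schema; without further modelling only
# crude processing is possible (here, a simple word count).
unstructured_text = "Customer called to say the delivery arrived late."
word_count = len(unstructured_text.split())

print(rows[0]["name"], record["tags"], word_count)
```

The structured sample can be queried by column name straight away, the semi-structured one by key, while the free text needs extra processing before it can be analyzed.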
Big Data Life Cycle
There are 9 phases involved in the Big Data Life Cycle. They are as
follows:
• Business Case/Problem Definition
• Data Identification
• Data Acquisition and Filtration
• Data Extraction
• Data Munging (Validation and Cleaning)
• Data Aggregation & Representation (Storage)
• Exploratory Data Analysis
• Data Visualization (Preparation for Modeling and Assessment)
• Utilization of Analysis Results
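Three of the phases listed above (munging, aggregation, and exploratory analysis) can be sketched in a few lines of Python. This is a minimal sketch under assumptions: the records, the field names `region` and `sales`, and the helper functions `munge` and `aggregate` are all hypothetical, and a real pipeline would apply far richer validation rules.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw records, as they might look after acquisition and extraction.
raw_records = [
    {"region": "North", "sales": "120"},
    {"region": "South", "sales": "95"},
    {"region": "North", "sales": ""},       # missing value
    {"region": "south", "sales": "105"},    # inconsistent casing
    {"region": "North", "sales": "abc"},    # invalid value
]

def munge(records):
    """Validate and clean: drop rows whose sales value is not a whole number,
    and normalise the casing of the region name."""
    cleaned = []
    for row in records:
        if not row["sales"].isdigit():
            continue
        cleaned.append({"region": row["region"].title(), "sales": int(row["sales"])})
    return cleaned

def aggregate(records):
    """Group the cleaned sales values by region."""
    grouped = defaultdict(list)
    for row in records:
        grouped[row["region"]].append(row["sales"])
    return dict(grouped)

cleaned = munge(raw_records)
by_region = aggregate(cleaned)

# Exploratory step: a simple per-region average.
summary = {region: mean(values) for region, values in by_region.items()}
print(summary)
```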
7 CouchDB 8 RapidMiner
Data Warehouse
A data warehouse is an analytics-focused type of data management system
designed to support and facilitate business intelligence (BI) operations.
Data warehouses are intended solely for running queries and analyses on
large amounts of historical data. The data in a data warehouse is typically
drawn from a variety of sources, such as transactional applications and
application log files.
Data Warehouse Components
1 ETL - Extract/Transform/Load
2 ODS - Operational Data Store
The ODS is where online updates of integrated data are carried out with OLTP
(Online Transaction Processing) response times. It is a hybrid environment in
which application data is brought into an integrated format (often via ETL).
Once data is placed in the ODS, it is available for high-performance
processing, including update processing.
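The extract, transform, and load steps behind ETL can be sketched with the Python standard library alone. Everything specific here is an assumption made for illustration: the CSV text stands in for a source application export, the `orders` table and its columns are hypothetical, and an in-memory SQLite database stands in for the integrated target store.

```python
import csv
import io
import sqlite3

# Extract: read rows from a hypothetical source export (CSV text here).
source_csv = "order_id,customer,amount\n1, alice ,10.50\n2,BOB,4.00\n"
extracted = list(csv.DictReader(io.StringIO(source_csv)))

# Transform: bring the rows into one integrated format
# (typed values, trimmed and lower-cased customer names).
transformed = [
    (int(row["order_id"]), row["customer"].strip().lower(), float(row["amount"]))
    for row in extracted
]

# Load: write the integrated rows into the target table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", transformed)
conn.commit()

loaded = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```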
3 Data Mart
The data mart is designed around one set of users' general expectations of
how data should look and is typically organized by department. Finance, for
example, has its own data mart. Compared with the data warehouse, each data
mart typically contains much less data. Data marts also frequently include a
sizable amount of summarized and aggregated data.
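One way to picture a departmental data mart is as a smaller, pre-aggregated table derived from the warehouse. The sketch below assumes a hypothetical `warehouse_costs` table and builds a `finance_mart` table from it in an in-memory SQLite database; the table names, columns, and figures are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical warehouse table holding detailed rows for every department.
conn.execute("CREATE TABLE warehouse_costs (department TEXT, month TEXT, cost REAL)")
conn.executemany(
    "INSERT INTO warehouse_costs VALUES (?, ?, ?)",
    [
        ("finance", "2024-01", 100.0),
        ("finance", "2024-01", 50.0),
        ("finance", "2024-02", 70.0),
        ("sales", "2024-01", 200.0),
    ],
)

# A finance data mart: only finance rows, summarized by month.
conn.execute(
    """
    CREATE TABLE finance_mart AS
    SELECT month, SUM(cost) AS total_cost, COUNT(*) AS entries
    FROM warehouse_costs
    WHERE department = 'finance'
    GROUP BY month
    """
)

mart_rows = conn.execute(
    "SELECT month, total_cost, entries FROM finance_mart ORDER BY month"
).fetchall()
print(mart_rows)
```

The mart holds two summary rows where the warehouse holds four detailed ones, which mirrors the point that a data mart contains less data, much of it aggregated.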
4 Exploration Warehouse
Inmon’s Approach
Kimball’s Approach
A data warehouse can be mapped onto
different types of architecture, as follows:
Shared memory architecture: The standard way to run an RDBMS on SMP
(symmetric multiprocessing) hardware is to implement it in shared-memory, or
shared-everything, form. The main benefit of this method is that a single
RDBMS server can potentially access all memory, all CPUs, and the whole
database, giving the client a consistent single system image.
Shared disk architecture: Shared-disk architecture implements the idea of
shared ownership of the complete database between RDBMS servers, each of
which runs on a node of a distributed-memory system. Each RDBMS server can
read, write, update, and delete data in the same shared database, which
requires a distributed lock manager (DLM).
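The need for lock management when several servers update the same data can be shown with a single-process analogy. In this sketch, threads stand in for RDBMS servers, a dictionary stands in for the shared database, and a `threading.Lock` is a greatly simplified stand-in for a DLM; a real DLM coordinates locks across separate machines and at much finer granularity.

```python
import threading

shared_database = {"balance": 0}   # stands in for the database on shared disk
lock_manager = threading.Lock()    # simplified stand-in for a DLM

def server(updates):
    """Each 'server' thread updates the same shared data under the lock."""
    for _ in range(updates):
        with lock_manager:
            # Read-modify-write must not interleave with another server's update.
            current = shared_database["balance"]
            shared_database["balance"] = current + 1

threads = [threading.Thread(target=server, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared_database["balance"])
```

Because every read-modify-write happens while the lock is held, no update is lost and the final balance equals the total number of updates made by all four threads.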
Shared nothing architecture: Shared-nothing systems are typically loosely
coupled. In such systems, only one CPU is attached to a specific disk. Access
to any tables or databases stored on that disk depends entirely on the CPU
that owns it.
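The ownership rule of a shared-nothing system can be sketched as a set of nodes that each hold private storage, with every request routed to the owning node. The text does not say how ownership is assigned; hash partitioning on the key is used here purely as an assumption, and the `Node` class, the node names, and the `put`/`get` helpers are hypothetical.

```python
import zlib

class Node:
    """One processing unit with its own private storage."""
    def __init__(self, name):
        self.name = name
        self.storage = {}

nodes = [Node("node-0"), Node("node-1"), Node("node-2")]

def owner_of(key):
    """Pick the owning node deterministically from the key (hash partitioning)."""
    return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]

def put(key, value):
    # Only the owning node ever stores the record.
    owner_of(key).storage[key] = value

def get(key):
    # Reads go through the owner as well; no other node holds the data.
    return owner_of(key).storage.get(key)

customer_ids = ["c1", "c2", "c3", "c4", "c5"]
for customer_id in customer_ids:
    put(customer_id, {"id": customer_id})

print(get("c3"), [len(node.storage) for node in nodes])
```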