
Chapter-1

Introduction to Data Analytics


Prepared by: Assistant Professor Manthan Rankaja
Definition of Data Analytics
• Data Analytics involves the use of specialized systems and software to
analyze data and draw insights from it.
• In the era of big data, analytics helps organizations make informed
decisions, predict trends, and understand customer behavior.
Applications of Data Analytics
• Various industries where data analytics is applied: Healthcare
(predicting disease outbreaks), Finance (fraud detection), Retail
(customer segmentation), and many more.
• Real-world examples of data analytics: Netflix’s recommendation
system, credit card fraud detection, etc.
Types of Data Analytics
• Descriptive: Analyzes historical data to understand what has
happened.
• Diagnostic: Digs deeper into data to understand the root cause of the
outcome.
• Predictive: Uses statistical models and forecasting techniques to
predict what is likely to happen.
• Prescriptive: Uses optimization and simulation algorithms to advise
on possible outcomes.
Descriptive Analytics
• Definition: Descriptive Analytics deals with the analysis of historical
data to understand changes that have occurred in a business.
• Use cases: Sales trend analysis, Social media trend analysis.
• Examples: Monthly revenue report, Social media post reach analysis.
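A minimal sketch of a descriptive analysis like the monthly revenue report above, using only Python's standard library; the revenue figures are hypothetical, invented for illustration:

```python
from statistics import mean

# Hypothetical monthly revenue figures (in thousands)
monthly_revenue = {
    "Jan": 120, "Feb": 135, "Mar": 110,
    "Apr": 150, "May": 145, "Jun": 160,
}

# Descriptive analytics: summarize what has already happened
total = sum(monthly_revenue.values())
average = mean(monthly_revenue.values())
best_month = max(monthly_revenue, key=monthly_revenue.get)

print(f"Total revenue: {total}")            # 820
print(f"Average per month: {average:.1f}")  # 136.7
print(f"Best month: {best_month}")          # Jun
```

Note that nothing here predicts or prescribes; descriptive analytics only summarizes historical data.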
Diagnostic Analytics
• Definition: Diagnostic Analytics is a form of advanced analytics that
examines data to answer the question “Why did it happen?”.
• Use cases: Sales decline analysis, Customer churn analysis.
• Examples: Analyzing customer feedback to understand a drop in
product sales, Studying customer behavior data to understand churn.
Predictive Analytics
• Definition: Predictive Analytics uses statistical techniques and
machine learning algorithms to forecast future outcomes from historical data.
• Use cases: Customer lifetime value prediction, Predictive
maintenance.
• Examples: Using past purchase history to predict a customer’s future
purchase, Predicting machine failure using sensor data.
Prescriptive Analytics
• Definition: Prescriptive Analytics goes beyond predicting future
outcomes by also suggesting actions to benefit from the predictions.
• Use cases: Supply chain optimization, Personalized marketing.
• Examples: Optimizing delivery routes in real-time to save costs,
Personalizing marketing messages based on customer behavior
prediction.
Types of Data
• Structured data: Data that is organized and formatted so it’s easily
readable.
• For example, a database of customer information where data is
organized in rows and columns.
• Unstructured data: Data that doesn’t follow a specified format. For
example, emails, social media posts, etc.
• Semi-structured data: A mix of structured and unstructured data. For
example, a document which contains metadata.
Structured Data
• Definition: Structured data is highly organized and formatted so
that it is easily searchable in relational databases.
• Examples:
Customer databases, Excel spreadsheets, etc.
• Advantages:
Easy to enter, store, query, and analyze.
• Disadvantages:
Requires a lot of time and resources to maintain.
Not suitable for complex, interconnected data.
Unstructured Data
• Definition: Unstructured data is not organized in a pre-defined
manner or does not have a pre-defined data model. It is difficult to
process and analyze.
• Examples: Word documents, PDFs, emails, audio files, etc.
• Advantages: Can capture nuanced information. More flexible as it
does not require a predefined schema.
• Disadvantages: Difficult to analyze and process. Requires more
storage space.
Semi-Structured Data
• Definition: Semi-structured data does not conform to a rigid
relational schema but contains tags or markers that label its
elements, placing it between structured and unstructured data.
• Examples: XML files, JSON files, etc.
• Advantages: More flexible than structured data, while still being
easier to analyze than unstructured data.
• Disadvantages: Can be more complex to work with and manage
compared to structured data.
• XML: eXtensible Markup Language
<person>
  <name>John Doe</name>
  <email>john.doe@example.com</email>
  <age>30</age>
</person>
• JSON: JavaScript Object Notation

{
  "person": {
    "name": "John Doe",
    "email": "john.doe@example.com",
    "age": 30
  }
}
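A short sketch of how both semi-structured formats can be parsed in Python using only the standard library; the records mirror the John Doe examples above:

```python
import json
import xml.etree.ElementTree as ET

# Parse the JSON record
json_text = '{"person": {"name": "John Doe", "email": "john.doe@example.com", "age": 30}}'
person = json.loads(json_text)["person"]
print(person["name"], person["age"])  # John Doe 30

# Parse the equivalent XML record
xml_text = ("<person><name>John Doe</name>"
            "<email>john.doe@example.com</email>"
            "<age>30</age></person>")
root = ET.fromstring(xml_text)
print(root.find("name").text, int(root.find("age").text))  # John Doe 30
```

Note the difference: JSON carries native types (the age parses as an integer), while in XML every value arrives as text and must be converted explicitly.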
Data Sources
• Explanation:
• Data sources are the locations, files, databases, or services where
data comes from.
• Understanding data sources is important as the quality and reliability
of the data can greatly impact the results of data analysis.
Databases
• Explanation: Databases are structured sets of data. They are a
common source of data for analytics.
• Discussion: There are different types of databases,
• such as SQL (relational databases) and
• NoSQL (non-relational databases like MongoDB).
• Examples: Customer information in a SQL database, product
information in a NoSQL database.
Web Data
• Explanation: Web data refers to data that is obtained from the
internet. This can include data scraped from websites, data from
social media platforms, etc.
• Discussion: Different types of web data include text data, user
behavior data, transactional data, etc.
• Examples: Tweets scraped from Twitter for sentiment analysis,
product reviews scraped from e-commerce websites.
Sensor Data
• Explanation: Sensor data is data that is collected by sensors, which
can be anything from temperature sensors to motion sensors.
• Discussion: Different types of sensor data include time series data,
spatial data, etc.
• This data is often used in IoT (Internet of Things) applications.
• Examples: Temperature data from a weather station, accelerometer
data from a smartphone
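Sensor streams are typically noisy time series, so a common first analysis step is smoothing. A small sketch with hypothetical hourly temperature readings:

```python
# Hypothetical temperature readings sampled once per hour (time series data)
temps = [20.1, 20.4, 21.0, 23.5, 22.0, 21.8]

# Smooth the series with a 3-point moving average to reduce sensor noise
window = 3
smoothed = [
    round(sum(temps[i:i + window]) / window, 2)
    for i in range(len(temps) - window + 1)
]
print(smoothed)
```

Each output point averages three consecutive readings, so short spikes (like the 23.5 reading) are damped rather than dominating the trend.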
Data Collection Types
• Primary data collection involves gathering new data directly from the
source,
• while secondary data collection involves using data that already
exists, such as data from existing databases or data collected by
others.
Data Collection Methods
• Explanation: Data collection methods refer to how we obtain data.
• Common methods include surveys, where we ask people for
information;
• experiments, where we observe outcomes under controlled
conditions;
• observations, where we collect data about real-world behavior.
Data Preprocessing
• Definition: Data preprocessing is the process of cleaning and
transforming raw data into an understandable format.
• It’s a crucial step before data analysis or data modeling.
• Overview:
• Preprocessing involves data cleaning (removing noise and
inconsistencies),
• data transformation (normalizing data),
• data integration (combining data from various sources).
Data Cleaning
• Definition: Data cleaning involves handling missing values, removing
duplicates, and treating outliers.
• It ensures the quality of the data and improves the accuracy of the
insights derived from it.
• Discussion: Techniques include imputation for handling missing
values, deduplication for removing duplicate data, and outlier
detection methods for identifying and handling anomalies in the data.
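The three cleaning steps above can be sketched in plain Python; the sensor readings, the mean-imputation choice, and the simple distance-from-mean outlier rule are all illustrative assumptions, not a prescribed method:

```python
from statistics import mean

# Hypothetical readings with duplicates, a missing value, and an outlier
readings = [21.0, 21.5, None, 21.5, 22.0, 98.0, 21.0]

# 1. Deduplication: keep only the first occurrence of each value
deduped = list(dict.fromkeys(readings))

# 2. Imputation: replace missing values with the mean of observed ones
observed = [r for r in deduped if r is not None]
imputed = [r if r is not None else mean(observed) for r in deduped]

# 3. Outlier treatment: drop values far from the mean (naive threshold rule)
m = mean(imputed)
cleaned = [r for r in imputed if abs(r - m) < 30]

print(cleaned)  # the 98.0 outlier is gone, the gap is filled
```

In practice each step needs domain judgment (e.g. whether the outlier is an error or a genuine event); the code only shows the mechanics.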
Data Transformation
• Definition: Data transformation involves changing the format,
structure, or values of data to prepare it for analysis.
• It can involve
• normalization (scaling data to a small, specified range),
• standardization (shifting the distribution of each attribute to have a
mean of zero and a standard deviation of one),
• binning (converting numerical variables into categorical
counterparts).
• Discussion: These techniques help in reducing the complexity of data
and making data compatible for analysis.
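Of the three techniques listed, binning is the simplest to show directly. A sketch that converts numeric ages into categorical groups; the bin edges and labels are invented for illustration:

```python
# Binning: convert a numerical variable (age) into a categorical one
# (the bin edges below are illustrative assumptions)
def bin_age(age):
    if age < 18:
        return "minor"
    elif age < 40:
        return "young adult"
    elif age < 65:
        return "middle-aged"
    return "senior"

ages = [12, 25, 37, 50, 70]
print([bin_age(a) for a in ages])
# ['minor', 'young adult', 'young adult', 'middle-aged', 'senior']
```

Binning trades numeric precision for categories that are easier to report on and to use in some models.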
Normalization
• Normalization involves scaling data to fit within a small, specified
range, typically between 0 and 1. This is useful when you want to
ensure that all features contribute equally to the analysis. The
formula for min-max normalization is:

• x' = (x - min) / (max - min)

• Example: [ 10, 20, 30, 40, 50 ] → [ 0, 0.25, 0.5, 0.75, 1 ]

Standardization
• Standardization transforms data to have a mean of zero and a
standard deviation of one. This is useful when you want to compare
data that have different units or scales. The formula for
standardization (the z-score) is:

• z = (x - μ) / σ

• Example: [ 10, 20, 30, 40, 50 ] → [ -1.41, -0.71, 0, 0.71, 1.41 ]
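The z-score can likewise be computed in a few lines; the example above uses the population standard deviation, which this sketch follows:

```python
from math import sqrt

def standardize(values):
    """Transform values to zero mean and unit standard deviation: (x - mu) / sigma."""
    mu = sum(values) / len(values)
    # Population standard deviation (divide by n, matching the example)
    sigma = sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return [(v - mu) / sigma for v in values]

z = standardize([10, 20, 30, 40, 50])
print([round(v, 2) for v in z])
# [-1.41, -0.71, 0.0, 0.71, 1.41]
```

Here mu = 30 and sigma = sqrt(200) ≈ 14.14, which yields exactly the slide's values after rounding.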


Data Integration
• Definition: Data integration involves combining data from different
sources and providing users with a unified view of the data.
• Discussion: This process becomes significant in a variety of situations,
which include both
• commercial (when two similar companies need to merge their
databases)
• scientific (combining research findings from different bioinformatics
repositories, for example) applications.
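A toy sketch of integration: combining records from two hypothetical sources (a CRM and a billing system, both invented for illustration) into one unified view keyed by customer id:

```python
# Hypothetical records from two separate sources, keyed by customer id
crm = {1: {"name": "Alice"}, 2: {"name": "Bob"}}
billing = {1: {"balance": 250.0}, 2: {"balance": 0.0}, 3: {"balance": 99.0}}

# Unified view: merge fields for every id seen in either source
unified = {}
for cid in sorted(set(crm) | set(billing)):
    record = {"id": cid}
    record.update(crm.get(cid, {}))
    record.update(billing.get(cid, {}))
    unified[cid] = record

print(unified[1])  # {'id': 1, 'name': 'Alice', 'balance': 250.0}
```

Even this toy case surfaces the real issues of integration: customer 3 exists in only one source, so the unified record is incomplete, and real systems must also reconcile conflicting values and mismatched keys.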
Data Analytics Tools
• Data analytics tools are software applications used to process and
analyze data. They help data analysts manage and interpret data from
various sources.
• We will be discussing the features and use cases of popular data
analytics tools like R, Python, and SAS.
SAS
• Introduction to SAS:
SAS (Statistical Analysis System) is a software suite developed by SAS
Institute for advanced analytics, business intelligence, data
management, and predictive analytics.
• Key features and use cases of SAS in data analytics:
SAS provides a graphical point-and-click user interface for non-technical
users and more advanced options through the SAS language.
It is widely used in the corporate world.
R
• Introduction to R:
• R is a programming language and free software environment for
statistical computing and graphics.
It is widely used among statisticians and data miners for developing
statistical software and data analysis.
• Key features and use cases of R in data analytics:
R provides a wide variety of statistical and graphical techniques and is
highly extensible.
It is used in fields like healthcare, finance, academia, etc.
Python
• Python is a high-level, interpreted programming language. It is known
for its simplicity and readability, making it a popular choice for
both beginners and experts in data analytics.
• Python has powerful libraries for data manipulation and analysis like
pandas, NumPy, and SciPy.
• It is used in various domains like web development, machine learning,
AI, and more.
Data Analytics Technologies
• Data analytics technologies refer to the frameworks and systems used
to process and analyze large datasets. They are designed to handle
big data and are essential for advanced analytics.
• Discussion on various technologies such as Hadoop, Spark, etc.: We
will be discussing the features and use cases of popular data analytics
technologies like Hadoop and Spark.
Hadoop
• Hadoop is an open-source software framework for storing data and
running applications on clusters of commodity hardware.
• It provides massive storage for any kind of data, enormous processing
power, and the ability to handle virtually limitless concurrent tasks or
jobs.
• Key features and use cases of Hadoop in data analytics: Hadoop is
known for its scalability, cost-effectiveness, flexibility, and fault
tolerance.
• It is used in various industries like finance, healthcare, media, etc.
Spark
• Introduction to Spark: Spark is an open-source, distributed computing
system used for big data processing and analytics.
• It provides an interface for programming entire clusters with implicit
data parallelism and fault tolerance.
• Key features and use cases of Spark in data analytics: Spark is known
for its speed, ease of use, and versatility.
• It can be used for various tasks like batch processing, real-time data
streaming, machine learning, etc.
