Notes Data Analytics
Data Analytics
Descriptive Statistics and Probability Distributions
Descriptive statistics and probability distributions are two important concepts in statistics that help in summarizing and analyzing data.
Descriptive Statistics
Descriptive statistics involve methods for summarizing and organizing data. They provide a
way to describe the main features of a dataset, such as the mean, median, mode, range, and
measures of variability like standard deviation.
Common descriptive statistics include:
Mean (Average): The sum of all values divided by the number of values.
Median: The middle value of a dataset when it is sorted.
Mode: The most frequently occurring value in a dataset.
Range: The difference between the maximum and minimum values.
Standard Deviation: A measure of the amount of variation or dispersion in a set of values.
Probability Distribution
A probability distribution describes how the values of a random variable are distributed. It
provides the probabilities of different outcomes in a sample space.
There are two main types of probability distributions:
1. Discrete
2. Continuous
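For a concrete, hedged illustration, the short Python sketch below uses SciPy with made-up parameters: a binomial distribution as the discrete example, where a probability mass function assigns a probability to each individual outcome, and a normal distribution as the continuous example, where probabilities come from a density evaluated over ranges of values.

```python
from scipy import stats

# Discrete example: number of heads in 10 fair coin flips (Binomial).
binom = stats.binom(n=10, p=0.5)
print("P(exactly 5 heads) =", binom.pmf(5))   # probability mass function
print("P(at most 3 heads) =", binom.cdf(3))   # cumulative probability

# Continuous example: exam scores modeled as Normal(mean=80, sd=8).
# The assumed mean and standard deviation are illustrative, not from the notes.
normal = stats.norm(loc=80, scale=8)
print("Density at a score of 80 =", normal.pdf(80))   # a density, not a probability
print("P(score <= 90) =", normal.cdf(90))             # probability over a range
```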
Descriptive statistics are concerned with summarizing and organizing data, while probability
distributions deal with the likelihood of different outcomes in random processes. Both are
essential in statistical analysis and help in understanding and interpreting data.
Worked example: consider the exam scores of 20 students in a class.
1. Mean (Average)
Mean = (75 + 82 + ... + 91) / 20
2. Median
Sort the scores and find the middle value.
Sorted Scores: 65, 68, 70, 75, 75, 78, 79, 82, 82, 84, 85, 87, 88, 88, 89, 90, 91
Median = 85
3. Mode
The mode is the most frequent score.
Mode = 75, 82, 88 (multiple modes)
4. Range
Range = Max Score - Min Score = 95 - 65 = 30
5. Standard Deviation
Calculate the standard deviation to measure the spread of scores.
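As a minimal sketch, the Python snippet below computes these five statistics with the standard-library statistics module; the score list is a small hypothetical one, since the full class data is only partially listed in the notes.

```python
import statistics

# Hypothetical exam scores used only to illustrate the calculations.
scores = [65, 68, 70, 75, 75, 78, 82, 82, 88, 88, 91, 95]

print("Mean:", statistics.mean(scores))
print("Median:", statistics.median(scores))
print("Mode(s):", statistics.multimode(scores))           # returns every most-frequent value
print("Range:", max(scores) - min(scores))
print("Standard deviation:", statistics.stdev(scores))    # sample standard deviation
```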
Now, let's talk about the probability distribution. Suppose we want to model the probability
distribution of getting a specific score.
Define a discrete random variable X representing the exam score.
Assign probabilities to each possible outcome.
For example:
P(X=75) = Number of students who scored 75 / Total number of students
A table listing each possible score together with its probability represents the probability distribution of scores in the given class.
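A minimal sketch of that idea in Python, assuming a small hypothetical list of scores (the full class data is not reproduced here): each probability is the relative frequency of a score, and the probabilities sum to 1.

```python
from collections import Counter

# Hypothetical scores for a small class, used only for illustration.
scores = [75, 82, 75, 88, 82, 75, 91, 88, 65, 82]

counts = Counter(scores)
total = len(scores)

# P(X = x) = number of students who scored x / total number of students
distribution = {score: count / total for score, count in counts.items()}

for score in sorted(distribution):
    print(f"P(X = {score}) = {distribution[score]:.2f}")

print("Sum of probabilities:", sum(distribution.values()))   # should be 1.0
```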
In summary, descriptive statistics help us understand the properties of the dataset, while
probability distributions model the likelihood of different outcomes in a random process.
Inferential Statistics
Inferential statistics is a branch of statistics that deals with making inferences or drawing
conclusions about a population based on a sample of data from that population. It involves
using data from a subset of individuals or observations (the sample) to make predictions or
draw generalizations about the larger group (the population) from which the sample is
drawn. Inferential statistics plays a crucial role in scientific research, as well as in various
practical applications in fields such as business, medicine, and social sciences.
2. Hypothesis Testing
Null Hypothesis (H0): A statement suggesting no effect, no difference, or no relationship in
the population.
Alternative Hypothesis (H1 or Ha): A statement that contradicts the null hypothesis,
indicating the presence of an effect, difference, or relationship.
Significance Level (α): The threshold for deciding whether to reject the null hypothesis.
Common values include 0.05 and 0.01.
P-value: The probability of obtaining results at least as extreme as those actually observed, assuming the null hypothesis is true (see the sketch after this list).
3. Confidence Intervals: A range of values that is likely to contain the true population
parameter with a certain level of confidence (e.g., 95% confidence interval).
4. Regression Analysis: Modeling the relationship between a dependent variable and one or
more independent variables to make predictions about the dependent variable.
8. Central Limit Theorem: A fundamental result stating that, for a sufficiently large sample size, the distribution of the sample mean is approximately normal, regardless of the shape of the population distribution.
9. Type I and Type II Errors:
Type I Error (False Positive): Incorrectly rejecting a true null hypothesis.
Type II Error (False Negative): Failing to reject a false null hypothesis.
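The sketch below ties several of these ideas together: a one-sample t-test on a hypothetical list of exam scores (using SciPy), a comparison of the p-value with α = 0.05, and a t-based 95% confidence interval for the population mean. The data and the hypothesized mean of 75 are assumptions for illustration only.

```python
import math
import statistics
from scipy import stats

sample = [78, 82, 75, 90, 68, 85, 88, 79, 91, 84]   # hypothetical exam scores
mu0 = 75                                             # hypothesized mean under H0

# Hypothesis test: H0: mean = 75 vs. H1: mean != 75 (two-sided)
t_stat, p_value = stats.ttest_1samp(sample, popmean=mu0)
alpha = 0.05
print(f"t = {t_stat:.3f}, p-value = {p_value:.3f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")

# 95% confidence interval for the population mean, based on the t distribution
n = len(sample)
mean = statistics.mean(sample)
sd = statistics.stdev(sample)
margin = stats.t.ppf(1 - alpha / 2, df=n - 1) * sd / math.sqrt(n)
print(f"95% CI: ({mean - margin:.2f}, {mean + margin:.2f})")
```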
In summary, inferential statistics provides tools and methodologies for making informed
decisions, drawing conclusions about populations, and quantifying uncertainty. It forms the
basis for hypothesis testing, estimation, and prediction in various fields, contributing to
evidence-based decision-making and scientific inquiry.
Regression
Overview: Regression analysis is used to model the relationship between a dependent
variable and one or more independent variables.
Types
1. Simple Linear Regression: Involves one dependent variable and one independent
variable.
2. Multiple Linear Regression: Involves one dependent variable and multiple
independent variables.
Process
1. Collect Data: Gather data on the dependent and independent variables.
2. Fit the Model: Use statistical techniques to fit the regression model to the data.
3. Assess Model Fit: Evaluate the goodness of fit and statistical significance.
4. Make Predictions: Use the model to make predictions about the dependent variable.
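A minimal sketch of simple linear regression, fitted by ordinary least squares on a tiny hypothetical dataset (e.g. hours studied versus exam score); the data and the prediction point are assumptions for illustration.

```python
# Hypothetical data: hours studied (x) and exam score (y).
x = [2, 3, 5, 7, 9]
y = [65, 70, 78, 85, 91]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares: slope = covariance(x, y) / variance(x)
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
intercept = mean_y - slope * mean_x

print(f"Fitted model: y = {intercept:.2f} + {slope:.2f} * x")
print("Predicted score for 6 hours of study:", round(intercept + slope * 6, 1))
```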
Analysis of Variance (ANOVA)
Overview: ANOVA is used to compare means across different groups to determine if there are statistically significant differences.
Types
1. One-Way ANOVA: Compares means across one factor (independent variable) with
more than two levels or groups.
2. Two-Way ANOVA: Examines the influence of two different independent variables.
Process
1. Formulate Hypotheses: Set up null and alternative hypotheses regarding the means.
2. Collect Data: Gather data from multiple groups.
3. Calculate Variability: Decompose the total variability into between-group and within-
group components.
4. Test Statistic: Calculate the F-statistic and compare it to a critical value.
5. Make a Decision: Decide whether to reject the null hypothesis based on the
comparison.
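As a hedged sketch of this process, the snippet below runs a one-way ANOVA with SciPy on three hypothetical groups (for example, exam scores under three teaching methods) and compares the resulting p-value with α = 0.05; the data are made up for illustration.

```python
from scipy import stats

# Hypothetical exam scores for three groups (e.g. three teaching methods).
group_a = [85, 90, 88, 75, 95]
group_b = [70, 65, 80, 72, 68]
group_c = [88, 92, 85, 91, 89]

# One-way ANOVA: H0 says all group means are equal.
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)

alpha = 0.05
print(f"F = {f_stat:.3f}, p-value = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: at least one group mean differs significantly")
else:
    print("Fail to reject H0: no significant difference detected")
```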
In all these methods, statistical significance is often determined by comparing p-values to a
significance level (commonly 0.05). If the p-value is less than the significance level, the null
hypothesis is rejected. These techniques are powerful tools for making inferences and
understanding relationships in data.
Drivers of Big Data
1. Data Growth: One of the primary drivers for big data is the exponential growth of data. With the proliferation of digital devices, social media, sensors, and other sources, organizations are dealing with vast amounts of data that traditional systems may struggle to handle.
2. Technology Advancements: Advances in technology, particularly in storage,
processing power, and distributed computing, have enabled organizations to
efficiently store, process, and analyze large datasets. Technologies like Hadoop,
Spark, and other distributed computing frameworks play a crucial role in handling big
data.
3. Data Variety: Big data is not just about volume; it also involves diverse types of data,
including structured, semi-structured, and unstructured data. This variety includes
text, images, videos, social media interactions, and more, requiring specialized tools
and techniques for processing.
4. Real-time Data Processing: The need for real-time or near-real-time analytics has
become crucial in many industries. With the advent of technologies like Apache
Kafka and stream processing frameworks, organizations can analyze and respond to
data as it's generated.
5. Cost-effective Storage: Storage solutions have become more cost-effective, allowing
organizations to store massive amounts of data economically. Cloud storage
services, such as Amazon S3, Azure Blob Storage, and Google Cloud Storage, provide
scalable and cost-efficient options.
6. Open Source Software: The availability of open-source tools and frameworks has
significantly contributed to the adoption of big data technologies. Open-source
projects like Apache Hadoop, Apache Spark, and others have become foundational
components in many big data ecosystems.
7. Advanced Analytics: The desire to extract meaningful insights from data has driven
the adoption of advanced analytics techniques, including machine learning and
artificial intelligence. These technologies enable organizations to uncover patterns,
predict trends, and make data-driven decisions.
8. Regulatory Compliance: Compliance requirements, such as GDPR, HIPAA, and other
data protection regulations, have prompted organizations to implement robust big
data solutions to manage and protect sensitive information appropriately.
These drivers collectively contribute to the ongoing evolution and adoption of big data
solutions across various industries. Organizations that effectively harness big data can gain
valuable insights, enhance decision-making processes, and stay competitive in a data-driven
world.
The key characteristics of big data, often referred to as the 3Vs, are Volume, Velocity, and
Variety:
1. Volume: Big data involves large amounts of data that exceed the capacity of
traditional databases and tools. This data can range from terabytes to petabytes and
beyond.
2. Velocity: Big data is generated and processed at high speeds. The data is produced
rapidly and needs to be analyzed in near real-time to derive timely insights.
3. Variety: Big data comes in various formats, including structured data (like
databases), unstructured data (such as text, images, and videos), and semi-
structured data (like JSON or XML files).
Besides the 3Vs, other characteristics like Veracity (dealing with the quality of the data) and
Value (extracting meaningful insights) are also considered in the big data context.
Big Data Analytics Applications:
Big Data Analytics has found applications across various industries, transforming the way
organizations make decisions, gain insights, and solve complex problems. Here are some
notable applications:
Hadoop's Parallel World
1. Hadoop Distributed File System (HDFS): Hadoop's parallel processing begins with its distributed storage system, HDFS. HDFS breaks down large files into smaller blocks (typically 128 MB or 256 MB) and replicates them across multiple nodes in the cluster. This ensures fault tolerance and enables parallel data access.
2. Parallel Data Processing with MapReduce: The MapReduce programming model is
at the core of Hadoop's parallelism. It divides large processing tasks into smaller,
independent sub-tasks that can be executed in parallel across the nodes of the
cluster. The overall process involves two main phases:
3. Map Phase: In this phase, the input data is divided into smaller chunks, and a map
function is applied to each chunk independently. The output of the map phase is a
set of key-value pairs.
4. Shuffle and Sort Phase: The key-value pairs generated by the map phase are
shuffled and sorted based on their keys. This phase ensures that all values associated
with a particular key are grouped together.
5. Reduce Phase: In this phase, the output of the shuffle and sort phase is processed by
a reduce function. The reduce function takes a key and a set of values associated
with that key, combining or aggregating them to produce the final output.
6. Distributed Execution: Hadoop's parallelism is achieved by distributing both data
storage and processing across multiple nodes in a cluster. Each node processes a
subset of the data independently, and the results are combined to produce the final
output. This allows Hadoop to scale horizontally by adding more nodes to the cluster
as the dataset or processing requirements grow.
7. Fault Tolerance: Hadoop is designed to be fault-tolerant. If a node in the cluster fails
during processing, Hadoop redistributes the work to other nodes that have copies of
the data. This ensures that the overall processing continues without loss of data or
interruption.
8. Data Locality: Hadoop strives to maximize data locality, meaning that computation is
performed on the same node where the data resides. This minimizes data transfer
across the network, improving performance.
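To make the map, shuffle-and-sort, and reduce phases concrete, here is a tiny single-process Python simulation of the classic word-count job; real Hadoop executes these phases in parallel across cluster nodes, so this illustrates only the data flow, not Hadoop's API.

```python
from itertools import groupby
from operator import itemgetter

documents = [
    "big data needs parallel processing",
    "hadoop processes big data in parallel",
]

# Map phase: each document independently emits (key, value) pairs, here (word, 1).
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle and sort phase: group all values that share the same key.
mapped.sort(key=itemgetter(0))
grouped = {key: [value for _, value in pairs]
           for key, pairs in groupby(mapped, key=itemgetter(0))}

# Reduce phase: aggregate the grouped values for each key.
word_counts = {word: sum(values) for word, values in grouped.items()}
print(word_counts)
```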
Overall, Hadoop's parallel world enables the processing of massive datasets by harnessing
the power of distributed computing, fault tolerance, and parallel processing paradigms.
While MapReduce has been the traditional processing model, newer frameworks like
Apache Spark have emerged to provide more flexibility and improved performance for
certain types of workloads.
Big Data and Cloud Computing
1. Scalability: Cloud platforms, such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP), offer elastic and scalable resources. This allows organizations to easily scale their infrastructure up or down based on the varying demands of big data processing workloads.
2. Storage: Cloud storage services, like Amazon S3, Azure Blob Storage, and Google Cloud Storage, provide cost-effective and scalable storage solutions for large datasets. These storage services are often used as data lakes, where diverse types of structured and unstructured data can be stored (a short sketch follows this list).
3. Compute Resources: Cloud platforms offer virtualized computing resources, such as
virtual machines (VMs) or containers, which are crucial for running distributed big
data processing frameworks like Apache Hadoop and Apache Spark. Users can
provision the required compute resources without the need for upfront capital
investment.
4. Managed Big Data Services: Cloud providers offer managed big data services that
simplify the deployment and maintenance of big data frameworks. Examples include
Amazon EMR, Azure HDInsight, and Google Cloud Dataproc, which provide fully
managed Hadoop and Spark clusters.
5. Serverless Computing: Serverless computing, exemplified by AWS Lambda, Azure
Functions, and Google Cloud Functions, allows organizations to execute code in
response to events without the need to provision or manage servers. This model is
well-suited for event-driven big data processing.
6. Data Warehousing: Cloud-based data warehouses, such as Amazon Redshift, Azure
Synapse Analytics, and Google BigQuery, enable fast and efficient querying of large
datasets. They are designed to handle analytical workloads and support complex
queries on structured data.
7. Integration with Analytics and Machine Learning: Cloud platforms provide
integrated services for analytics and machine learning. Users can leverage tools like
Google Cloud AI Platform, AWS SageMaker, and Azure Machine Learning to build,
train, and deploy machine learning models on large datasets stored in the cloud.
8. Data Security and Compliance: Cloud providers implement robust security measures
and compliance frameworks, addressing concerns related to data security and
privacy. Encryption, access controls, and compliance certifications contribute to
creating a secure environment for big data processing.
9. Cost Optimization: Cloud platforms offer flexible pricing models, enabling
organizations to optimize costs based on actual resource usage. Pay-as-you-go
models and reserved instances allow for cost-effective utilization of cloud resources
for big data workloads.
10. Global Accessibility: Cloud services facilitate global accessibility to big data
resources. Teams can collaborate and access data and processing resources from
various locations, promoting flexibility and agility in data-driven decision-making.
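As a small illustration of cloud object storage used as a data lake, the sketch below uses the AWS SDK for Python (boto3) to upload a file and list objects; the bucket and key names are hypothetical, and credentials/region are assumed to be configured in the environment.

```python
import boto3

BUCKET = "example-data-lake-bucket"   # hypothetical bucket name

s3 = boto3.client("s3")

# Land a raw data file in the data lake under a date-partitioned prefix.
s3.upload_file("events.json", BUCKET, "raw/events/2024/01/01/events.json")

# List what has been stored under the raw/ prefix.
response = s3.list_objects_v2(Bucket=BUCKET, Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```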
Predictive Analytics
Predictive analytics is a data-driven process that involves collecting, cleaning, and analyzing
historical data to make informed predictions about future events or trends. It begins with
comprehensive data collection from diverse sources, followed by rigorous data cleaning and
preprocessing to ensure data quality. Exploratory Data Analysis and feature selection
provide insights into data characteristics and aid in selecting relevant variables for predictive
modeling. Choosing an appropriate model, such as linear regression or decision trees,
precedes the training phase, where the model learns patterns from historical data.
Evaluation using separate datasets validates the model's performance before deployment
into a production environment. Regular monitoring and maintenance are crucial to ensure
ongoing accuracy, considering changes in data distribution or the business landscape.
Predictive analytics finds widespread applications in business areas, enabling organizations
to forecast customer behavior, manage risks, optimize operations, and make strategic
decisions based on data-driven insights.
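A minimal sketch of this workflow using scikit-learn: hypothetical historical data is split into training and evaluation sets, a linear regression model is trained, its error is checked on held-out data, and it is then used to make a forecast. The feature values and model choice are assumptions for illustration, not a prescription.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical historical records: [ad spend, last month's sales] -> next month's sales.
X = [[10, 200], [15, 220], [20, 260], [25, 300], [30, 310], [35, 350], [40, 380], [45, 400]]
y = [210, 230, 270, 305, 320, 360, 390, 410]

# Hold out part of the history to evaluate the model before deployment.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)   # training phase
predictions = model.predict(X_test)                # evaluation on unseen data

print("Mean absolute error on held-out data:", mean_absolute_error(y_test, predictions))
print("Forecast for [50, 420]:", model.predict([[50, 420]])[0])
```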
Mobile BI and Big Data
1. Real-Time Access and Analysis: Mobile BI enables users to access real-time data analytics on their smartphones or tablets. Integration with Big Data allows organizations to process vast datasets in real-time, providing up-to-the-minute insights for decision-makers.
2. Improved Accessibility: The combination of Mobile BI and Big Data ensures that
decision-makers can access and interact with large datasets on their mobile devices.
This improved accessibility promotes quicker decision-making and responsiveness.
3. Data Visualization: Mobile BI tools often include sophisticated data visualization
capabilities. Integrating with Big Data allows for the visualization of complex
datasets, making it easier for users to interpret and gain insights from large volumes
of information.
4. Location-Based Analytics: Mobile BI can leverage location-based data, and when
combined with Big Data, organizations can analyze geospatial information. This is
particularly valuable for businesses in sectors such as retail, logistics, and healthcare.
5. Offline Access: Many Mobile BI applications support offline access, allowing users to
retrieve and interact with data even without a live internet connection. This is useful
for users who need to access critical information while on the go.
6. Big Data Processing for Mobile Insights: Big Data technologies, such as Apache
Hadoop and Apache Spark, can process and analyze large datasets. Mobile BI
applications can tap into these insights, providing users with comprehensive and
detailed information for decision-making.
7. Data Security: Security is a critical concern, especially when dealing with sensitive
business data. Integrating Mobile BI with Big Data requires robust security measures
to ensure the confidentiality and integrity of data, including encryption and access
controls.
8. Scalability: Big Data technologies provide scalability to handle large datasets
efficiently. This scalability is crucial when dealing with the increasing volume of data
generated and processed through Mobile BI applications.
9. Business Agility: The combination of Mobile BI and Big Data enhances business
agility by providing decision-makers with the flexibility to access, analyze, and act
upon data insights in real-time, regardless of their physical location.
10. Enhanced Decision-Making: Ultimately, the integration of Mobile BI and Big Data
aims to enhance decision-making processes by providing timely, comprehensive, and
actionable insights to users, fostering a more data-driven organizational culture.
Information Management
Information management is a comprehensive process encompassing the systematic
collection, storage, organization, and dissemination of data within an organization. It begins
with the careful collection of relevant data from diverse sources, followed by secure storage
and organization to facilitate easy retrieval. Ensuring the quality of data is paramount,
involving processes for validation and cleansing. Information management extends beyond
data to include knowledge management, collaboration, and metadata management. It plays
a crucial role in maintaining data security, with measures such as access controls and
encryption. Compliance with regulations and standards is prioritized, mitigating legal and
reputational risks. The lifecycle management of information, from creation to archival, and
integration of data from various sources further contribute to the efficiency of
organizational processes. Information management is not static but evolves with emerging
technologies, incorporating analytics, reporting, and advancements like artificial
intelligence. Ultimately, it serves as the backbone for informed decision-making, innovation,
and sustained competitiveness in the dynamic and data-driven landscape of modern
organizations.