Unit 5
Predictive Analytics and Visualizations: Predictive Analytics, Simple linear regression, Multiple
linear regression, Interpretation of regression coefficients, Visualizations, Visual data analysis
techniques, interaction techniques, Systems and application
---------------------------------------------------------------------------------------------------------------
PREDICTIVE ANALYTICS
Predictive analytics in big data refers to the process of extracting meaningful insights and
making predictions about future events or outcomes based on the analysis of large and complex
datasets. This field uses advanced statistical techniques, machine learning algorithms, and data
mining methods to find patterns, trends, and relationships within the data.
Types of predictive modeling
Predictive analytics models are designed to assess historical data, discover patterns,
observe trends, and use that information to predict future trends. Popular predictive analytics
models include classification, clustering, and time series models.
Regression analysis: This is a statistical technique that is used to estimate the relationship
between one or more independent variables and a dependent variable. Regression analysis can be
used to predict a continuous outcome, such as sales figures or customer lifetime value.
Classification: This type of predictive modeling is used to classify data points into one of two or
more categories. For example, a classification model could be used to predict whether a customer
is likely to churn (cancel their service) or not.
Decision trees: This type of predictive modeling is a flowchart-like structure that is used to make
decisions based on a series of questions. Decision trees can be used for both classification and
regression tasks.
Time series analysis: This type of predictive modeling is used to forecast future values based on
past data. Time series analysis is often used in finance to predict stock prices or in retail to predict
demand for products.
Neural Networks: This type of predictive modeling is inspired by the structure of the human brain.
Neural networks are a complex type of model that can be used for a variety of tasks, including
classification, regression, and pattern recognition.
The type of predictive modeling that is best for a particular task will depend on the nature
of the data and the specific question that you are trying to answer.
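To make the time series idea above concrete, here is a minimal moving-average forecast in Python. The monthly demand figures and the function name are hypothetical, chosen only to illustrate forecasting a future value from past data; real forecasting would use richer models (ARIMA, exponential smoothing, etc.).

```python
# A naive time-series forecast: predict the next value as the mean of the
# last `window` observations. The demand figures below are hypothetical.
def moving_average_forecast(series, window=3):
    recent = series[-window:]
    return sum(recent) / len(recent)

monthly_demand = [100, 104, 110, 108, 115, 121]
print(moving_average_forecast(monthly_demand))
```

A sliding window like this is the simplest member of the time-series family: it smooths noise but cannot capture trend or seasonality, which the more advanced models mentioned above address.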
Applications of Predictive Analytics
Predictive analytics has a wide range of applications across various industries. Here are
some use cases that showcase its transformative power:
Customer-Centric Applications:
Predicting Buying Behavior: In retail, e-commerce giants like Amazon and Walmart use
purchase history, browsing behavior, and demographics to anticipate customer needs. This
allows them to personalize product recommendations, optimize promotions, and boost
sales.
Reducing Customer Churn: Predictive models can analyze customer data to identify users
at risk of cancelling subscriptions or services. This enables businesses to take steps to retain
valuable customers.
Up-Selling and Cross-Selling: Predictive analytics can identify products or services a
customer might be interested in based on past purchases and preferences. This empowers
businesses to recommend relevant add-ons, leading to increased revenue per customer.
Healthcare Diagnosis: By analyzing medical records and patient data, healthcare providers
can leverage predictive analytics to identify patients at risk of developing certain diseases.
This enables early intervention and potentially improves patient outcomes.
Public Safety: Law enforcement agencies can use predictive analytics to analyze crime data
and identify areas with high crime rates. This allows them to allocate resources more
effectively and potentially prevent crimes from happening.
Beyond these customer-facing use cases, predictive analytics also strengthens core business operations:
Security: Every modern organization must be concerned with keeping data secure. A
combination of automation and predictive analytics improves security. Suspicious and
unusual end user behavior can trigger specific security procedures.
Risk reduction: In addition to keeping data secure, most businesses are working to reduce
their risk profiles.
For example, a company that extends credit can use data analytics to better
understand if a customer poses a higher-than-average risk of defaulting.
Operational efficiency: More efficient workflows translate to improved profit margins.
For example, knowing that a delivery vehicle will need maintenance before it breaks
down on the side of the road means deliveries are made on time and additional costs are
avoided.
Improved decision making: Running any business involves making calculated decisions. Any
expansion or addition to a product line or other form of growth requires balancing the
inherent risk with the potential outcome. Predictive analytics can provide insight to inform
the decision-making process and offer a competitive advantage.
SIMPLE LINEAR REGRESSION
Simple linear regression is a statistical method that models the relationship between a
single independent variable and a dependent variable, both continuous in nature. This relationship
represents how an input variable is related to the output variable, and it is represented by a
straight line.
To understand this concept, let us have a look at scatter plots. Scatter diagrams or plots
provide a graphical representation of the relationship between two continuous variables.
The strength and direction of a linear relationship are measured by the correlation coefficient r,
which ranges from -1 to +1:
1. Perfect positive relationship: +1
2. Perfect negative relationship: -1
3. No linear relationship: 0
4. Strong correlation: r > 0.85 (the threshold depends on the business scenario)
Steps to Implement Simple Linear Regression:
1. Analyze data (analyze scatter plot for linearity)
2. Get sample data for model building
3. Then design a model that explains the data
4. And use the same developed model on the whole population to make predictions.
The equation that represents how an independent variable X is related to a dependent variable
Y is: Y = β0 + β1X + ε, where β0 is the intercept, β1 is the slope, and ε is the random error.
Example:
Let us understand simple linear regression by considering an example. Suppose we want to
predict weight gain based only on calories consumed, using a sample of observed (calories
consumed, weight gain) pairs.
Now, suppose we want to predict the weight gain when 2,500 calories are consumed. First, we
visualize the data by drawing a scatter plot to confirm that calories consumed is a suitable
independent variable X for predicting the dependent variable Y.
We can also calculate r. As r = 0.9910422, which is greater than 0.85, we take calories
consumed as the independent variable (X) and weight gain as the dependent variable (Y) to predict.
Now, try to imagine a straight line drawn in a way that should be close to every data point
in the scatter diagram.
To predict the weight gain for a consumption of 2,500 calories, simply extend the
straight line to the point where x = 2,500 and read off the corresponding value on the y-axis. This
projected y-value gives the approximate weight gain. This straight line is the regression line.
Equivalently, substituting the x value into the equation of the fitted regression model gives
the prediction directly. The weight gain predicted by our simple linear regression model after
consumption of 2,500 calories is 4.49 kg.
There are two types of linear regression algorithms -
1. Simple - models a single independent variable.
2. Multiple - models two or more independent variables.
What Is Multiple Linear Regression (MLR)?
One of the most common types of predictive analysis is multiple linear regression. This type
of analysis allows you to understand the relationship between a continuous dependent variable
and two or more independent variables.
The independent variables can be either continuous (like age and height) or categorical (like
gender and occupation). It is important to note that categorical independent variables should be
dummy coded before running the analysis.
3. Furthermore, we assume that Y is linearly dependent on the factors according to
Y = β0 + β1x1 + β2x2 + · · · + βkxk + ε
The variable Y is the dependent (predicted) variable.
β0 is the y-intercept: when x1, x2, . . . , xk are all zero, Y equals β0.
The regression coefficients β1 and β2 represent the change in Y resulting from one-unit
changes in x1 and x2, respectively; in general, βp is the slope coefficient of the p-th
independent variable.
The ε term describes the random error (residual) in the model.
4. We have n observations, n typically being much more than k.
5. For the i-th observation, we set the independent variables to the values xi1, xi2, . . . , xik and
measure a value yi for the random variable Yi.
Thus, the model can be described by the equations
Yi = β0 + β1xi1 + β2xi2 + · · · + βkxik + εi for i = 1, 2, . . . , n,
where the errors εi are independent random variables, each with mean 0 and the
same unknown variance σ2.
6. Altogether the model for multiple linear regression has k + 2 unknown parameters: β0, β1, . . . ,
βk, and σ2.
7. When k was equal to 1, we found the least squares line y = β0 + β1x. It was a line in the plane.
8. Now, with k ≥ 1, we'll have a least squares hyperplane
y = β0 + β1x1 + β2x2 + · · · + βkxk in R^(k+1).
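As a sketch of how the least squares estimates β0, β1, . . . , βk can actually be computed, the standard-library Python below solves the normal equations (X^T X)b = X^T y by Gaussian elimination. The dataset is hypothetical (generated exactly from Y = 2 + 3·x1 - x2, so the fit should recover those coefficients); in practice one would use a library such as NumPy or statsmodels:

```python
# Minimal least-squares fit of Y = b0 + b1*x1 + ... + bk*xk via the
# normal equations (X^T X) b = X^T y, standard library only.
def fit_mlr(rows, y):
    # rows: list of [x1, ..., xk]; prepend 1.0 for the intercept column.
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    n = len(X)
    # Build the normal equations A b = c with A = X^T X and c = X^T y.
    A = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)]
         for p in range(k)]
    c = [sum(X[i][p] * y[i] for i in range(n)) for p in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for q in range(col, k):
                A[r][q] -= f * A[col][q]
            c[r] -= f * c[col]
    # Back substitution.
    b = [0.0] * k
    for r in reversed(range(k)):
        b[r] = (c[r] - sum(A[r][q] * b[q] for q in range(r + 1, k))) / A[r][r]
    return b

# Hypothetical data generated exactly from Y = 2 + 3*x1 - 1*x2.
rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 7], [6, 5]]
y = [2 + 3 * x1 - 1 * x2 for x1, x2 in rows]
b0, b1, b2 = fit_mlr(rows, y)
print(round(b0, 6), round(b1, 6), round(b2, 6))
```

Because the data were generated with no error term (ε = 0), the solver recovers the true coefficients; with noisy data the estimates would only approximate them.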
Multiple regression models are used mainly for two purposes:
1. Planning and Control.
2. Prediction or Forecasting.
Estimating relationships between variables can be exciting and useful. As with all other
regression models, the multiple regression model assesses relationships among variables in terms
of their ability to predict the value of the dependent variable.
INTERPRETATION OF REGRESSION COEFFICIENTS
Suppose we fit a regression of final exam score on Hours studied (a continuous predictor)
and Tutor (1 if the student used a tutor, 0 otherwise).
Interpreting the Intercept
The intercept term in a regression table tells us the average expected value for the
response variable when all of the predictor variables are equal to zero.
In this example, the regression coefficient for the intercept is equal to 48.56. This means
that for a student who studied for zero hours (Hours studied = 0) and did not use a tutor (Tutor =
0), the average expected exam score is 48.56.
Interpreting the Coefficient of a Continuous Predictor Variable
For a continuous predictor variable, the regression coefficient represents the difference in
the predicted value of the response variable for each one-unit change in the predictor variable,
assuming all other predictor variables are held constant.
From the regression output, we can see that the regression coefficient for Hours studied is
2.03. This means that, on average, each additional hour studied is associated with an increase of
2.03 points on the final exam, assuming the predictor variable Tutor is held constant.
For example, consider student A who studies for 10 hours and uses a tutor. Also consider
student B who studies for 11 hours and also uses a tutor. According to our regression output,
student B is expected to receive an exam score that is 2.03 points higher than student A.
Interpreting All of the Coefficients At Once
We can use all of the coefficients in the regression table to create the following estimated
regression equation:
Expected exam score = 48.56 + 2.03*(Hours studied) + 8.34*(Tutor)
For example, a student who studied for 10 hours and used a tutor is expected to receive an
exam score of:
Expected exam score = 48.56 + 2.03*(10) + 8.34*(1) = 77.2
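The estimated regression equation can be written directly as a small Python function; the coefficients (48.56, 2.03, 8.34) are the ones from the regression table in the example above.

```python
# Estimated regression equation from the example:
# Expected exam score = 48.56 + 2.03*(Hours studied) + 8.34*(Tutor)
def expected_exam_score(hours_studied, used_tutor):
    tutor = 1 if used_tutor else 0
    return 48.56 + 2.03 * hours_studied + 8.34 * tutor

# Student who studied 10 hours and used a tutor, as in the example.
print(round(expected_exam_score(10, True), 2))  # → 77.2
```

Note how the two coefficient interpretations fall out of the function: adding one hour of study raises the result by 2.03, and switching Tutor from 0 to 1 raises it by 8.34, all else held constant.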
DATA VISUALIZATION
Data visualization is the art of representing information and data in a visual format, such as
charts, graphs, and maps. It is a critical aspect of big data analytics, as it helps us to understand
and communicate complex patterns and trends that would be difficult to see in raw data.
Benefits:
Here's why data visualization is important in big data analytics:
1. Understanding Complex Data Structures: Big data often comes from diverse sources and in
various formats. Data visualization helps analysts and stakeholders comprehend the structure and
relationships within large datasets, making it easier to identify relevant patterns and insights.
2. Detecting Trends and Patterns: With big data, there can be numerous trends and patterns
hidden within the vast volumes of information. Data visualization techniques allow analysts to
efficiently identify these trends, outliers, correlations, and anomalies, enabling organizations to
make better decisions.
3. Real-time Monitoring and Analysis: Big data analytics often involves processing data streams in
real-time. Visualization tools provide dashboards and interactive displays that enable real-time
monitoring of key metrics, allowing organizations to respond promptly to changing conditions and
opportunities or threats.
4. Scalable Visualizations: Traditional data visualization tools may struggle to handle big data.
Therefore, specialized visualization techniques and tools have emerged that can handle large
datasets efficiently.
5. Predictive Analytics Visualization: Visualization plays a crucial role in conveying the results of
predictive analytics models. Visual representations of predictions and forecasts help stakeholders
understand the potential future outcomes based on the analysis of big data.
6. Data Storytelling: Effective data visualization facilitates storytelling with data, enabling analysts
to communicate complex insights in a compelling and understandable manner. Visual narratives
help convey the significance of findings and drive action within organizations.
7. Collaboration and Decision-Making: Big data visualization tools support collaboration among
teams by providing shared platforms where stakeholders can interact with visualizations, and share
findings & insights. This collaborative approach improves decision-making processes by ensuring
that all relevant stakeholders have access to the insights derived from big data analytics.
In essence, data visualization is an indispensable component of big data analytics,
empowering organizations to extract actionable insights from vast and complex datasets and drive
informed decision-making.
VISUAL DATA ANALYSIS TECHNIQUES
Visual data analysis techniques encompass a variety of methods and tools used to explore,
analyze, and interpret data through visualization. Here are some key techniques commonly
employed in visual data analysis:
1. Scatter Plots: Scatter plots display individual data points on a two-dimensional plane, with one
variable plotted on each axis. They are useful for identifying correlations, clusters, and outliers
within datasets.
2. Bar Charts and Histograms: Bar charts represent data using rectangular bars, while histograms
display data distribution by grouping values into bins. These techniques are commonly used to
visualize categorical and numerical data, respectively.
3. Line Charts: Line charts illustrate trends and patterns over time or across ordered categories.
They are particularly effective for visualizing time-series data and tracking changes in variables
over continuous intervals.
4. Heatmaps: Heatmaps use color gradients to represent data values in a matrix format. They are
commonly used to visualize relationships between two variables and highlight patterns or clusters
within large datasets.
5. Box Plots: Box plots, also known as box-and-whisker plots, provide a visual summary of the
distribution of a dataset, including measures such as median, quartiles, and outliers. They are
useful for comparing distributions and identifying variability within data.
6. Bubble Charts: Bubble charts represent data points using circles, where the size and color of
each bubble encode additional information. They are often used to visualize three-dimensional
data and highlight relationships between multiple variables.
7. Tree Maps: Tree maps visualize hierarchical data structures by representing each data category
as a rectangle, with the size of the rectangle proportional to a specific metric. They are useful for
visualizing nested relationships and understanding the composition of datasets.
8. Network Graphs: Network graphs represent relationships between entities as nodes (vertices)
connected by edges (links). They are commonly used to analyze complex systems, such as social
networks, transportation networks, and biological systems.
9. Choropleth Maps: Choropleth maps use color shading or patterns to represent spatial data,
typically aggregated by geographic regions such as countries, states, or districts. They are effective
for visualizing spatial patterns and trends across different regions.
10. Interactive Dashboards: Interactive dashboards combine multiple visualizations into a single
interface, allowing users to explore and analyze data dynamically. They often incorporate filters,
sliders, and drill-down capabilities to facilitate interactive data exploration.
11. Timeline Charts: Timeline charts illustrate events in chronological order — for example, the
progress of a project, an advertising campaign, or an acquisition process — in whatever unit of
time the data was recorded (week, month, quarter, or year). They show the chronological
sequence of past or future events on a timescale.
12. Pie Chart: A pie chart is a circular statistical graph which is divided into slices to illustrate
numerical proportion. The arc length of each slice is proportional to the quantity it represents.
Pie charts are used to compare the parts of a whole and are most effective when there are only a
few components. However, they can be difficult to interpret because the human eye has a hard
time estimating areas and comparing visual angles.
These techniques are some of the wide range of visual data analysis methods available.
Depending on the nature of the data and the analytical goals, analysts may employ a combination
of these techniques to gain insights and make data-driven decisions.
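Two of the techniques above, the scatter plot (1) and the histogram (2), can be sketched in a few lines of Matplotlib (one of the tools discussed later in this unit). The data are randomly generated for illustration only:

```python
# Sketch of a scatter plot and a histogram with Matplotlib; synthetic data.
import matplotlib
matplotlib.use("Agg")            # render off-screen, no display required
import matplotlib.pyplot as plt
import random

random.seed(0)
x = [random.gauss(0, 1) for _ in range(200)]
y = [2 * v + random.gauss(0, 0.5) for v in x]   # correlated with x

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(x, y, s=10)          # technique 1: scatter plot (correlation)
ax1.set(title="Scatter plot", xlabel="x", ylabel="y")
ax2.hist(x, bins=20)             # technique 2: histogram (distribution)
ax2.set(title="Histogram", xlabel="x", ylabel="count")
fig.savefig("techniques.png")
```

The scatter panel makes the positive correlation between x and y visible at a glance, while the histogram shows the roughly bell-shaped distribution of x, which is exactly the division of labour described in items 1 and 2 above.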
DATA VISUALIZATION INTERACTION TECHNIQUES
Data visualization interaction techniques enhance user engagement and facilitate exploration of
data by allowing users to interact with visualizations dynamically.
Here are some common interaction techniques used in data visualization:
1. Selection: Users can select specific data points or regions of interest within a visualization.
Selection allows users to focus on particular subsets of data and perform further analysis or
comparison.
2. Filtering: Filtering enables users to control which data is displayed within a visualization by
applying specific criteria or conditions. Users can filter data based on attributes, ranges, categories,
or other parameters to refine their analysis.
3. Zooming and Panning: Zooming and panning functionalities allow users to navigate and explore
large datasets by adjusting the scale and viewport of the visualization. Users can zoom in to
examine fine-grained details or zoom out for a broader overview of the data.
4. Brushing: Brushing is a technique where users can interactively highlight or select data points by
dragging a "brush" over the visualization. Brushing is often used in conjunction with linked
visualizations to highlight corresponding data across multiple views.
5. Hovering: Hovering over data points or elements in a visualization displays additional
information or tooltips, providing context and details without requiring explicit selection.
6. Drag-and-Drop: Drag-and-drop interaction allows users to rearrange elements within a
visualization or dynamically modify parameters such as axis scales, grouping, or sorting.
7. Animation: Animation can be used to convey changes over time or transitions between different
states of the data. Animated visualizations can enhance understanding by illustrating dynamic
processes, trends, or relationships.
8. Sorting: Users can sort data within a visualization based on specific attributes or criteria, such as
alphabetical order, numerical values, or chronological sequence. Sorting helps users identify
patterns and trends more easily.
9. Linked Visualizations: Linked visualizations establish connections between multiple views of the
same dataset, enabling coordinated interactions. Selections or actions in one visualization can
trigger corresponding updates in linked views, facilitating exploration and analysis from different
perspectives.
10. Search and Highlight: Users can search for specific data points or values within a visualization,
and the corresponding results are highlighted or emphasized to draw attention.
11. Annotation: Users can add annotations, labels, or comments to visualizations to provide
context, insights, or explanations. Annotations help communicate findings and facilitate
collaboration among stakeholders.
12. Drill-Down and Drill-Up: Drill-down and drill-up capabilities enable users to explore
hierarchical data structures by navigating to lower-level details or returning to higher-level
summaries within a visualization.
By incorporating these interaction techniques, data visualization tools empower users to
interactively explore, analyze, and gain insights from data, fostering a deeper understanding and
enabling data-driven decision-making.
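Under the hood, interaction techniques such as filtering, sorting, and brush selection reduce to operations on the underlying records before the view is re-rendered. A standard-library Python sketch with hypothetical sales records (field names are invented for illustration):

```python
# Hypothetical records backing a visualization.
records = [
    {"region": "North", "sales": 120}, {"region": "South", "sales": 95},
    {"region": "North", "sales": 80},  {"region": "East", "sales": 150},
]

# Filtering: keep only records matching a user-chosen criterion.
north = [r for r in records if r["region"] == "North"]

# Sorting: order by a numeric attribute for easier comparison.
by_sales = sorted(records, key=lambda r: r["sales"], reverse=True)

# Brushing/selection: pick the records a brush over sales 90-130 would hit.
brushed = [r for r in records if 90 <= r["sales"] <= 130]

print(len(north), by_sales[0]["sales"], len(brushed))
```

In a real tool these operations run in response to UI events and the affected views are redrawn; in linked visualizations, the same `brushed` subset would be highlighted across every view simultaneously.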
DATA VISUALIZATION TOOLS
Data visualization applications encompass a diverse range of software tools and platforms
designed to create, explore, and communicate insights from data using visual representations.
Here are some notable data visualization applications across various domains:
1. Tableau: Tableau is a widely used business intelligence and analytics platform that offers
intuitive drag-and-drop functionality to create interactive dashboards, reports, and visualizations.
It supports a wide range of data sources and provides advanced analytics capabilities.
2. Microsoft Power BI: Power BI is a suite of business analytics tools that enables users to visualize
and share insights from data. It offers interactive dashboards, customizable reports, and
integration with Microsoft Excel and other data sources.
3. Google Data Studio: Data Studio is a free data visualization and reporting tool provided by
Google. It allows users to create interactive dashboards and reports using data from various
sources, including Google Analytics, Google Sheets, and Google BigQuery.
4. Qlik Sense: Qlik Sense is a self-service data visualization and discovery platform that enables
users to explore and analyze data using interactive visualizations. It supports associative data
modeling and offers advanced analytics capabilities.
5. D3.js: D3.js is a JavaScript library for creating interactive and customizable data visualizations on
the web. It provides powerful tools for manipulating HTML, SVG, and CSS to create dynamic and
data-driven visualizations.
6. Matplotlib: Matplotlib is a Python library for creating static, animated, and interactive
visualizations. It is widely used for scientific computing, data analysis, and machine learning tasks.
7. Plotly: Plotly is a Python and JavaScript library for creating interactive visualizations, including
charts, graphs, and dashboards. It supports a wide range of programming languages and
integration with web frameworks such as Flask and Dash.
8. R Shiny: Shiny is an R package for building interactive web applications and dashboards directly
from R scripts. It enables data scientists and analysts to create and share interactive visualizations
without requiring expertise in web development.
9. Microsoft Excel: Excel is a popular spreadsheet software that includes built-in charting and
visualization capabilities. While not as powerful as dedicated data visualization tools, Excel is
widely used for basic data analysis and reporting tasks.
10. Plotly Dash: Dash is a Python framework for building analytical web applications and
dashboards. It leverages Plotly for creating interactive visualizations and provides a declarative
syntax for defining dashboard layouts and interactivity.
These are some of the data visualization tools available, each offering different features,
capabilities, and target audiences. Depending on the specific requirements and use cases,
organizations may choose to use one or more of these tools to visualize and derive insights from
their data.
APPLICATIONS OF DATA VISUALIZATION
Data visualization plays a crucial role in big data analytics by helping organizations extract
actionable insights from massive and complex datasets. Here are some specific applications of data
visualization in the context of big data analytics:
1. Exploratory Data Analysis (EDA): Data visualization techniques are used to explore and
understand the structure, patterns, and relationships within large datasets. Visualizations such as
scatter plots, histograms, and heatmaps enable analysts to identify trends, outliers, and
correlations, providing valuable insights into the underlying data.
2. Real-time Monitoring and Dashboards: Big data visualization tools allow organizations to
monitor key metrics and performance indicators in real-time through interactive dashboards.
These dashboards provide a comprehensive overview of data streams, enabling timely decision-
making and proactive management of operational processes.
3. Predictive Analytics Visualization: Data visualization is used to visualize the results of predictive
analytics models, such as regression analysis, classification algorithms, and time-series forecasting.
Visualizations help stakeholders understand predicted outcomes, assess model performance, and
make informed decisions based on predictive insights.
4. Fraud Detection and Anomaly Detection: Data visualization techniques are employed to detect
fraudulent activities and anomalies within big data streams. Visualizations help analysts identify
irregular patterns, deviations from normal behavior, and suspicious transactions, enabling
organizations to take corrective actions and mitigate risks.
5. Customer Segmentation and Personalization: Big data visualization tools are used to analyze
customer behavior, segment target audiences, and personalize marketing strategies. Visualizations
such as customer journey maps, cohort analysis, and demographic heatmaps enable organizations
to understand customer preferences, tailor products and services, and enhance customer
engagement.
6. Network Analysis and Social Media Analytics: Data visualization is employed to analyze
networks, such as social networks, communication networks, and cybersecurity networks.
Visualizations such as network graphs, node-link diagrams, and community detection algorithms
help analysts identify influencers, detect network anomalies, and uncover hidden patterns within
complex network structures.
7. Spatial Analysis and Geospatial Visualization: Big data visualization tools enable organizations
to analyze and visualize spatial data, such as geographic information systems (GIS), satellite
imagery, and location-based data. Visualizations such as choropleth maps, 3D terrain models, and
geospatial heatmaps help analysts understand spatial patterns, identify spatial trends, and make
location-based decisions.
8. Text Analytics and Sentiment Analysis: Data visualization techniques are used to analyze text
data from sources such as social media, customer reviews, and online forums. Visualizations such
as word clouds, sentiment analysis charts, and topic models enable analysts to extract insights
from unstructured text data, identify emerging trends, and monitor public opinion.
9. Supply Chain Optimization: Big data visualization tools are used to optimize supply chain
operations by analyzing large volumes of supply chain data, such as inventory levels, transportation
routes, and demand forecasts. Visualizations such as supply chain maps, flow diagrams, and
logistics dashboards help organizations streamline supply chain processes, reduce costs, and
improve efficiency.
10. Healthcare Analytics and Clinical Decision Support: Data visualization is employed in
healthcare analytics to analyze patient data, medical records, and clinical outcomes. Visualizations
such as patient dashboards, medical imaging overlays, and disease heatmaps help healthcare
providers identify patterns, track disease outbreaks, and make data-driven decisions to improve
patient care and outcomes.
Overall, data visualization serves as a powerful tool in big data analytics, enabling
organizations to derive actionable insights, make informed decisions, and gain a competitive
advantage in today's data-driven world.