
Unit V Fds Notes



UNIT V FDS - Notes

Data Science (Mailam Engineering College)



Downloaded by Meganath (megan727803@gmail.com)

CS 3491 - FUNDAMENTALS OF DATA SCIENCE AND ANALYTICS UNIT V

UNIT V PREDICTIVE ANALYTICS


INTRODUCTION
Linear least squares – implementation – goodness of fit – testing a linear model – weighted
resampling. Regression using StatsModels – multiple regression – nonlinear relationships –
logistic regression – estimating parameters – Time series analysis – moving averages –
missing values – serial correlation – autocorrelation. Introduction to survival analysis.

PART A

1. What Is Predictive Analytics?

 The term predictive analytics refers to the use of statistics and
modeling techniques to make predictions about future outcomes and
performance.
 Predictive analytics looks at current and historical data patterns to
determine if those patterns are likely to emerge again.
 It allows businesses and investors to adjust where they use their
resources to take advantage of possible future events.
 Predictive analysis can also be used to improve operational
efficiencies and reduce risk.

2. Understanding Predictive Analytics

 Predictive analytics is a form of technology that makes predictions
about certain unknowns in the future.
 It draws on a series of techniques to make these determinations,
including artificial intelligence (AI), data mining, machine learning,
modeling, and statistics.
 Data mining involves the analysis of large sets of data to detect
patterns in them.

3. What are the uses of Predictive model analysis?

a. Weather forecasts
b. Creating video games
c. Translating voice to text for mobile phone messaging
d. Customer service

Prepared By: Mr.S.Thiyaneswaran AP/CSBS


e. Investment portfolio development

4. What is meant by Forecasting?

 Forecasting is essential in manufacturing because it ensures the
optimal utilization of resources in a supply chain.
 Critical spokes of the supply chain wheel, whether it is inventory
management or the shop floor, require accurate forecasts for
functioning.
 Predictive modelling is often used to clean and optimize the quality of
data used for such forecasts.

5. Define Credit?

 Credit scoring makes extensive use of predictive analytics.
 Example: When a consumer or business applies for credit, data on the
applicant's credit history and the credit record of borrowers with
similar characteristics are used to predict the risk that the applicant
might fail to perform on any credit extended.

6. Define Underwriting?

 Data and predictive analytics play an important role in underwriting.
 Insurance companies examine policy applicants to determine the likelihood of
having to pay out for a future claim based on the current risk pool of similar
policyholders, as well as past events that have resulted in pay-outs.

7. What is meant by Marketing?

 Individuals who work in this field look at how consumers have reacted
to the overall economy when planning on a new campaign.
 They can use these shifts in demographics to determine if the current
mix of products will entice consumers to make a purchase.
 Active traders, meanwhile, look at a variety of metrics based on past
events when deciding whether to buy or sell a security.


8. What are Decision Trees?

Decision trees are the simplest models because they're easy to understand and
dissect. They're also very useful when you need to make a decision in a short
period of time.

9. Define Regression?

 Regression is the model that is used the most in statistical analysis.
 Use it when you want to determine patterns in large sets of data and when
there is a linear relationship between the inputs.
 This method works by figuring out a formula, which represents the
relationship between all the inputs found in the dataset. For example, you
can use regression to figure out how price and other key factors can shape
the performance of a security.

10. Define Neural Networks?

Neural networks were developed as a form of predictive analytics by imitating the
way the human brain works.
This model can deal with complex data relationships using artificial intelligence
and pattern recognition.
Uses:
 When you have several hurdles to overcome, such as having too much data
on hand.
 When you don't have the formula you need to help you find a relationship
between the inputs and outputs in your dataset.

11. What Is Data Analytics?

 Data analytics is the science of analyzing raw data to make conclusions
about that information.
 Many of the techniques and processes of data analytics have been
automated into mechanical processes and algorithms that work over raw
data for human consumption.


12. What are the various steps of Data Analysis?

 The first step is to determine the data requirements.
 The second step is to collect the data.
 Once the data is collected, it must be organized so it can be analyzed.
 The data is then cleaned up before analysis.

13. What is linear least squares?

In statistics and mathematics, linear least squares is an approach to fitting a mathematical or
statistical model to data in cases where the idealized value provided by the model for any data
point is expressed linearly in terms of the unknown parameters of the model.

14. How do you implement linear least square?

Step 1: Draw a table with 4 columns where the first two columns are for the x and y points.
Step 2: In the next two columns, find xy and x².
Step 3: Find ∑x, ∑y, ∑xy, and ∑x².
Step 4: Find the slope m using m = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²), where n is the number of points.
Step 5: Calculate the intercept b using b = (∑y − m∑x) / n.
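
The table method can be sketched in a few lines of Python. This is only an illustration: the sample points and the function name fit_line are made up for the example, and the formulas used are the standard least-squares expressions for the slope and intercept of y = m*x + b.

```python
# Hypothetical sample points, made up for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def fit_line(xs, ys):
    """Return (m, b) of the least-squares line y = m*x + b."""
    n = len(xs)
    sum_x = sum(xs)                               # column total of x
    sum_y = sum(ys)                               # column total of y
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # column total of xy
    sum_x2 = sum(x * x for x in xs)               # column total of x squared
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b

m, b = fit_line(xs, ys)
print(m, b)  # slope is about 0.6 and intercept about 2.2 for this data
```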

15. What is goodness of fit?

Goodness of fit is a measure of how well a statistical model fits a set of observations. When
goodness of fit is high, the values expected based on the model are close to the observed values.
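
One common goodness-of-fit measure for a regression line is the coefficient of determination, R². The sketch below assumes that measure (the notes do not name a specific one) and uses made-up observed values together with predictions from a hypothetical line y = 0.6x + 2.2.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))  # unexplained variation
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)              # total variation
    return 1 - ss_res / ss_tot

# Made-up observations and the predictions of y = 0.6x + 2.2 at x = 1..5
observed = [2, 4, 5, 4, 5]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]

r2 = r_squared(observed, predicted)
print(r2)  # close to 0.6: the line explains about 60% of the variation
```

A value near 1 means the expected values are close to the observed ones; a value near 0 means the model explains little of the variation.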

[Figure: example of goodness of fit]


16. What are the two types of testing a linear model?

In the case of the linear regression model, two types of hypothesis testing are done: t-tests
and F-tests. In other words, two statistics, the t-statistic and the F-statistic, are used to assess
whether a linear regression model captures a real relationship between the response and predictor
variables. The t-test checks whether an individual coefficient differs significantly from zero,
while the F-test checks the significance of the model as a whole.
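
As a rough sketch, the t-statistic for the slope of a simple linear regression can be computed by hand as the slope divided by its standard error. The data and the function name below are made up for illustration; with a single predictor, the F-statistic equals the square of this t-statistic.

```python
import math

def slope_t_statistic(xs, ys):
    """t-statistic for the slope of the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx                      # fitted slope
    c = mean_y - m * mean_x            # fitted intercept
    sse = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))  # residual sum of squares
    se_m = math.sqrt(sse / (n - 2) / sxx)                      # standard error of the slope
    return m / se_m

# Hypothetical data, made up for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

t_stat = slope_t_statistic(xs, ys)
f_stat = t_stat ** 2   # holds for simple (one-predictor) regression
print(t_stat, f_stat)
```

The statistic is then compared with a t distribution with n − 2 degrees of freedom to obtain a p-value.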

17. What is weighted resampling?

Weighted resampling works with a sample in which each sampling unit has been assigned a weight
for use in subsequent analysis. Common uses include survey weights to adjust for intentional
oversampling of some units relative to others.

18. What is the use of the resampling method?

Resampling is a method that involves repeatedly drawing samples from the training dataset.
These samples are then used to refit a specific model to retrieve more information about the
fitted model. The aim is to learn more about the sample, improve the accuracy of estimates,
and quantify their uncertainty.
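
A minimal bootstrap sketch of this idea, assuming the statistic of interest is the sample mean. The data, the function name, and the number of resamples are all made up for illustration; passing a list of weights turns it into a weighted resampling.

```python
import random
import statistics

def bootstrap_means(sample, weights=None, n_resamples=1000, seed=0):
    """Draw resamples with replacement and return the mean of each one."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    n = len(sample)
    return [
        statistics.mean(rng.choices(sample, weights=weights, k=n))
        for _ in range(n_resamples)
    ]

# Hypothetical sample, made up for illustration only
sample = [12, 15, 9, 14, 11, 13, 10, 16]

means = bootstrap_means(sample)
std_error = statistics.stdev(means)   # spread of the resampled means
print(statistics.mean(sample), std_error)
```

The standard deviation of the resampled means serves as an estimate of the standard error of the sample mean.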

19. What are the five types of regression model?

1. Linear Regression.
2. Logistic Regression.
3. Ridge Regression.
4. Lasso Regression.
5. Polynomial Regression.

20. What is Multiple regression?

Multiple linear regression (MLR), also known simply as multiple regression, is a statistical
technique that uses several explanatory variables to predict the outcome of a response variable.

Multiple regression is a statistical technique that can be used to analyze the relationship between
a single dependent variable and several independent variables.

21. What is a nonlinear relationship?

A nonlinear relationship between two variables is one for which the slope of the curve showing
the relationship changes as the value of one of the variables changes. When plotted, such a
relationship gives a curve rather than a straight line.

22. What is meant by logistic regression?

Logistic regression is a data analysis technique that uses mathematics to find the relationships
between two data factors. It then uses this relationship to predict the value of one of those factors
based on the other. The prediction usually has a finite number of outcomes, like yes or no.

23. Define Parameter Estimation.

Parameter estimation is the process of computing a model’s parameter values from measured
data. You can apply parameter estimation to different types of mathematical models, including
statistical models, parametric dynamic models, and data-based Simulink models.

24. Define time series analysis?

Time series analysis is a specific way of analyzing a sequence of data points collected over an
interval of time. In time series analysis, analysts record data points at consistent intervals over a
set period of time rather than just recording the data points intermittently or randomly.

25. What is a moving average?

In statistics, a moving average (rolling average or running average) is a calculation to analyze
data points by creating a series of averages of different selections of the full data set. It is also
called a moving mean (MM) or rolling mean and is a type of finite impulse response filter.

26. What is missing value analysis?


Missing value analysis helps address several concerns caused by incomplete data. If cases with
missing values are systematically different from cases without missing values, the results can be
misleading.

27. What are the types of missing data?

Missing completely at random (MCAR).

Missing at random (MAR).

Missing not at random (MNAR).

28. Define serial correlation in statistics.

Serial correlation is used in statistics to describe the relationship between observations of the
same variable over specific periods. If a variable's serial correlation is measured as zero, there is
no correlation, and each of the observations is independent of one another.

29. Define autocorrelation in statistics.

Autocorrelation refers to the degree of correlation of the same variable between two successive
time intervals. It measures how the lagged version of the value of a variable is related to the
original version of it in a time series. Autocorrelation, as a statistical concept, is also known as
serial correlation.

30. Define Survival analysis.

Survival analysis is a collection of statistical procedures for data analysis where the outcome
variable of interest is time until an event occurs.

PART- B

1. Explain in detail about Linear least squares.

List of topics:
 Introduction
 Linear regression
 Finding the Error
 Least squares method
 Implementing the model
 Making Predictions

Introduction:
 Linear Regression is the simplest form of machine learning out there.
 In this post, we will see how linear regression works and implement it in Python from
scratch.
Linear regression:
 In statistics, linear regression is a linear approach to modelling the relationship between a
dependent variable and one or more independent variables.
 In the case of one independent variable it is called simple linear regression. For more than
one independent variable, the process is called multiple linear regression.
 We will be dealing with simple linear regression in this tutorial.
 Let X be the independent variable and Y be the dependent variable. We will define a
linear relationship between these two variables as follows:

$$Y = mX + c$$

 This is the equation for a line that you studied in high school.
 m is the slope of the line and c is the y intercept.
 Today we will use this equation to train our model with a given dataset and predict the
value of Y for any given value of X.
 Our challenge today is to determine the values of m and c that give the minimum error
for the given dataset.
 We will be doing this by using the Least Squares method.

Finding the Error:


 So to minimize the error we need a way to calculate the error in the first place.
 A loss function in machine learning is simply a measure of how different the predicted
value is from the actual value.
 Today we will be using the Quadratic Loss Function to calculate the loss or error in our model. It
can be defined as:

$$L = \sum_{i=1}^{n} (y_i - p_i)^2$$

where y_i is the actual value and p_i is the predicted value for the i-th data point.
 We are squaring it because, for the points below the regression line, y − p will be
negative and we don't want negative values in our total error.
Least squares method:
 Now that we have determined the loss function, the only thing left to do is minimize it.
 This is done by finding the partial derivatives of L with respect to m and c, equating them
to 0 and then solving for m and c.
 After we do the math, we are left with these equations:

$$m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

$$c = \bar{y} - m\bar{x}$$

 Here x̅ is the mean of all the values in the input X and ȳ is the mean of all the values in
the desired output Y.
 This is the Least Squares method.
 Now we will implement this in Python and make predictions.

Implementing the model:

This is the example for implementing the model using python
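
The listing itself is not reproduced in these notes. As a minimal sketch, assuming NumPy and a small made-up dataset in place of the original data, the slope m is the sum of (x − x̅)(y − ȳ) divided by the sum of (x − x̅)², and the intercept is c = ȳ − m x̅:

```python
import numpy as np

# Hypothetical data, made up for illustration (the original dataset is not shown)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Means of the input and of the desired output
X_mean = np.mean(X)
Y_mean = np.mean(Y)

# Slope: sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
m = np.sum((X - X_mean) * (Y - Y_mean)) / np.sum((X - X_mean) ** 2)
# Intercept: y_mean - m * x_mean
c = Y_mean - m * X_mean

print(m, c)  # about 0.6 and 2.2 for this data
```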

Making Predictions:


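# Making predictions
# X and Y are NumPy arrays of the data; m and c are the fitted slope and intercept
import matplotlib.pyplot as plt

Y_pred = m*X + c  # predicted value for every X
plt.scatter(X, Y) # actual
# plt.scatter(X, Y_pred, color='red')
# Draw the fitted line between the smallest and largest X (also correct for a negative slope)
plt.plot([min(X), max(X)], [m*min(X) + c, m*max(X) + c], color='red') # predicted
plt.show()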

 The accuracy won't be high because we are simply taking a straight line and forcing it
to fit the given data in the best possible way.
 But you can use this to make simple predictions or get an idea about the magnitude/range
of the real value.
 Also, this is a good first step for beginners in Machine Learning.

2. Explain in detail about Regression using statsmodels.


List of topics:
 Introduction
 Linear Regression

Introduction:
 The accuracy won't be high because we are simply taking a straight line and forcing it
to fit the given data in the best possible way.
 But you can use this to make simple predictions or get an idea about the magnitude/range
of the real value.
 Also, this is a good first step for beginners in Machine Learning.

Linear Regression:
 The StatsModels linear regression module covers linear models with independently and
identically distributed errors, and models for errors with heteroscedasticity or autocorrelation.

This module allows estimation by ordinary least squares (OLS), weighted least squares (WLS),
generalized least squares (GLS), and feasible generalized least squares with autocorrelated
AR(p) errors.


3. Explain in detail about Multiple regression.

List of topics:
 Introduction
 Multiple Regression Definition
 Multiple Regression formula
 Multiple Regression Analysis
 Advantages of Stepwise Multiple Regression
Introduction:

Multiple linear regression (MLR), also known simply as multiple regression, is a statistical
technique that uses several explanatory variables to predict the outcome of a response
variable. Multiple regression is an extension of simple linear (OLS) regression, which uses just
one explanatory variable.
 In our daily lives, we come across variables, which are related to each other. To study the
degree of relationships between these variables, we make use of correlation.
 To find the nature of the relationship between the variables, we have another measure,
which is known as regression.
 In this, we use correlation and regression to find equations such that we can estimate the
value of one variable when the values of other variables are given.

Multiple Regression Definition:

 Multiple regression analysis is a statistical technique that analyzes the relationship
between two or more variables and uses the information to estimate the value of the
dependent variable.
 In multiple regression, the objective is to develop a model that relates a dependent
variable y to more than one independent variable.
Multiple Regression formula:

In simple linear regression, there is only one independent variable and one dependent variable.
In the case of multiple regression, there is a set of independent variables that helps us to
better explain or predict the dependent variable y.

The multiple regression equation is given by

$$y = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$

where x_1, x_2, …, x_k are the k independent variables, y is the dependent variable, a is the
intercept, and b_1, b_2, …, b_k are the regression coefficients.
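
As an illustrative sketch, the coefficients of the multiple regression equation can be estimated by least squares with NumPy. The data below are made up and generated exactly from y = 1 + 2·x1 + 3·x2, so the fit should recover a = 1, b1 = 2 and b2 = 3.

```python
import numpy as np

# Made-up data generated exactly from y = 1 + 2*x1 + 3*x2
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y = 1 + 2 * x1 + 3 * x2

# Design matrix: a column of ones for the intercept, then one column per variable
A = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares solution of A @ coef = y
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
a, b1, b2 = coef
print(a, b1, b2)
```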

Multiple Regression Analysis:


 Multiple regression analysis makes it possible to control explicitly for many other circumstances
that concurrently influence the dependent variable.
 The objective of regression analysis is to model the relationship between a dependent
variable and one or more independent variables.
 Let k represent the number of independent variables, denoted by x1, x2, x3, ……, xk. Such an
equation is useful for predicting the value of y when the values of x are known.
Advantages of Stepwise Multiple Regression:

 Only independent variables with non zero regression coefficients are included in the
regression equation.
 The changes in the multiple standard errors of estimate and the coefficient of
determination are shown.
 The stepwise multiple regression is efficient in finding the regression equation with only
significant regression coefficients.


 The steps involved in developing the regression equation are clear.

4. Explain in detail about Non linear relationship.


List of topics:
 Introduction
 Example of Non linear relationship
 How to Implement Nonlinear Regression
Introduction:
 In a nonlinear relationship, changes in the output do not change in direct proportion to
changes in any of the inputs. A linear relationship creates a straight line when plotted on
a graph.
 A nonlinear relationship does not create a straight line but instead creates a curve.
 Nonlinearity is a statistical term used to describe a situation where there is not a straight-line
or direct relationship between an independent variable and a dependent variable.
 For example, let's say we're studying the relationship between the temperature and the number of
visitors to a zoo. At first, as the temperature increases, more people visit the zoo. But at some
point, when the temperature gets too hot, fewer people visit the zoo. This is a nonlinear relationship.

Example of Non linear relationship:


 Nonlinear relationships are relationships between two variables that cannot be described
by a straight line. Instead, they may follow a curve or some other pattern.
 The figure below gives an example of a nonlinear relationship: plant growth rate
versus fertilizer amount.


How to Implement Nonlinear Regression:
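
One simple way to model a curved relationship is polynomial regression. The sketch below assumes NumPy and uses made-up zoo data in which visitor numbers rise with temperature and then fall; it fits a quadratic with np.polyfit and reads off the temperature at which the fitted curve peaks. The numbers are illustrative only.

```python
import numpy as np

# Made-up data: visitors rise with temperature, then fall when it gets too hot
temperature = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0])
visitors = -2.0 * (temperature - 25.0) ** 2 + 900.0

# Fit visitors = b2*t^2 + b1*t + b0 (coefficients come back highest power first)
b2, b1, b0 = np.polyfit(temperature, visitors, deg=2)

# A downward-opening parabola peaks where its slope is zero: t = -b1 / (2*b2)
peak_temperature = -b1 / (2 * b2)
print(b2, b1, b0, peak_temperature)
```

A straight line could not capture the rise and fall in this data, while the quadratic term does.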

5. Explain in detail about Logistic regression.


List of topics:
 Introduction
 Types of logistic regression:
o Binary logistic regression
o Multinomial logistic regression
o Ordinal logistic regression
o Logistic Regression
 Python code for Logistic Regression
 Advantages of Logistic Regression
 Disadvantages of Logistic Regression
Introduction:
 This type of statistical model (also known as logit model) is often used for classification
and predictive analytics.
 Logistic regression estimates the probability of an event occurring, such as voted or
didn't vote, based on a given dataset of independent variables. Since the outcome is a
probability, the dependent variable is bounded between 0 and 1.
 In logistic regression, a logit transformation is applied on the odds, that is, the
probability of success divided by the probability of failure.
 This is also commonly known as the log odds, or the natural logarithm of odds, and the
logistic function is represented by the following formulas:

$$\pi = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}$$

$$\operatorname{logit}(\pi) = \ln\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$

 In this logistic regression equation, logit(π) is the dependent or response variable and
x_1, …, x_k are the independent variables. The beta parameters, or coefficients, in this model are
commonly estimated via maximum likelihood estimation (MLE).
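
A short sketch of the two functions involved, written in plain Python for illustration: the logistic (sigmoid) function maps any real number to a probability between 0 and 1, and the logit (log odds) is its inverse.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real number z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log odds: natural logarithm of p / (1 - p)."""
    return math.log(p / (1.0 - p))

print(sigmoid(0))           # 0.5: even odds
print(logit(0.5))           # 0.0: log odds of even odds
print(logit(sigmoid(1.5)))  # recovers 1.5 up to rounding error
```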

Types of logistic regression:


 Binary logistic regression:

In this approach, the response or dependent variable is dichotomous in nature, i.e. it has
only two possible outcomes (e.g. 0 or 1). Some popular examples of its use include
predicting if an e-mail is spam or not spam or if a tumor is malignant or not malignant.

 Multinomial logistic regression:

In this type of logistic regression model, the dependent variable has three or more
possible outcomes; however, these values have no specified order.

 Ordinal logistic regression:

This type of logistic regression model is leveraged when the response variable has three
or more possible outcomes, but in this case, these values do have a defined order.

 Logistic Regression:


Python code for Logistic Regression:
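
The original listing is not reproduced in these notes. As a minimal sketch, the following plain-Python code fits a binary logistic regression with one predictor by gradient ascent on the log-likelihood, which is one simple way to approximate the maximum likelihood estimates. The data (hours studied versus pass or fail), the function name, the learning rate, and the number of iterations are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Estimate (b0, b1) by gradient ascent on the average log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the log-likelihood: sum of (actual - predicted) * input
        grad0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)) / n
        grad1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys)) / n
        b0 += lr * grad0
        b1 += lr * grad1
    return b0, b1

# Made-up data: hours studied and whether the student passed (1) or failed (0)
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(hours, passed)
p_low = sigmoid(b0 + b1 * 1)    # predicted pass probability after 1 hour
p_high = sigmoid(b0 + b1 * 8)   # predicted pass probability after 8 hours
print(b0, b1, p_low, p_high)
```

A positive b1 indicates that the probability of passing increases with hours studied. In practice a library routine would normally be used instead of a hand-written loop.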

Advantages of Logistic Regression:

 Logistic Regression performs well when the dataset is linearly separable.

 Logistic Regression not only gives a measure of how relevant a predictor (coefficient
size) is, but also its direction of association (positive or negative).

 Logistic regression is easier to implement, interpret and very efficient to train.

Disadvantages of Logistic Regression:

 The main limitation of logistic regression is the assumption of linearity between the log odds
of the dependent variable and the independent variables.


6. Explain in detail about Time series Analysis.


List of topics:
 Introduction
 Components of Time Series Data
 Four main elements make up a time series dataset
 You will import the file to R by the following command

Introduction:
 Time series analysis is indispensable in data science, statistics, and analytics.

 At its core, time series analysis focuses on studying and interpreting a sequence of data
points recorded or collected at consistent time intervals.
 Unlike cross-sectional data, which captures a snapshot in time, time series data is
fundamentally dynamic, evolving over chronological sequences both short and extremely
long.
 This type of analysis is pivotal in uncovering underlying structures within the data, such
as trends, cycles, and seasonal variations.
 Technically, time series analysis seeks to model the inherent structures within the data,
accounting for phenomena like autocorrelation, seasonal patterns, and trends.
 The order of data points is crucial; rearranging them could lose meaningful insights or
distort interpretations.
 Furthermore, time series analysis often requires a substantial dataset to maintain the
statistical significance of the findings.

 This enables analysts to filter out 'noise,' ensuring that observed patterns are not mere
outliers but statistically significant trends or cycles.

Components of Time Series Data:

Time series data is generally comprised of different components that characterize the patterns
and behaviour of the data over time. By analyzing these components, we can better understand
the dynamics of the time series and create more accurate models.


Four main elements make up a time series dataset:

 Trends
 Seasonality
 Cycles
 Noise

In summary, the key components of time series data are:

 Trends: Long-term increases, decreases, or stationary movement


 Seasonality: Predictable patterns at fixed intervals
 Cycles: Fluctuations without a consistent period
 Noise: Residual unexplained variability
Time Series analysis example:


You will import the file to R by the following command:

> # Read the semicolon-separated file; dec="," treats commas as decimal marks
> library_borrowing <- read.csv("C:/Table.1", header=T, dec=",", sep=";")

Note that paths use forward slashes "/" instead of backslashes.

> # Plot the fifth column as a red line of width 2
> plot(library_borrowing[, 5], type="l", lwd=2, col="red", xlab="Years", ylab="Number of books", main="Number of books borrowed from the library")

The result for the above code is:

7. Explain in detail about Moving averages.

List of topics:

 Introduction
 Moving average smoothing:
 Example
 Autoplot code

Introduction:


 The predictive moving average indicators are designed to calculate the price required on
the next bar in order for the market and a moving average, or two moving averages, to
meet and potentially cross.
 The predictive moving average indicators employ three of the most popular moving
averages found in TradeStation.
 Moving averages are usually calculated to identify the direction of a trend. This can be
done in a variety of ways, with the most common being simple and weighted moving
averages.
 Simple moving average forecasting is what we commonly think of as averaging. It can
be used for a single period or multiple periods.

Moving average smoothing:

A moving average of order m can be written as

$$\hat{T}_t = \frac{1}{m} \sum_{j=-k}^{k} y_{t+j}$$

where m = 2k + 1. That is, the estimate of the trend-cycle at time t is obtained by averaging values of the
time series within k periods of t.

Observations that are nearby in time are also likely to be close in value.
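
A minimal sketch of a centered moving average of odd order m = 2k + 1 in plain Python. The function name and the data are made up for illustration; the first and last k points have no estimate because there are not enough neighbours on one side.

```python
def moving_average(series, m):
    """Centered moving average of odd order m = 2k + 1."""
    if m < 1 or m % 2 == 0:
        raise ValueError("m must be a positive odd integer")
    k = m // 2
    # Average the values within k periods of each time t
    return [
        sum(series[t - k : t + k + 1]) / m
        for t in range(k, len(series) - k)
    ]

# Made-up series, for illustration only
data = [2, 4, 6, 8, 10, 12]
smoothed = moving_average(data, 3)
print(smoothed)  # [4.0, 6.0, 8.0, 10.0]
```

A larger m averages over more neighbours and gives a smoother curve.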

Example: Annual electricity sales in Australia


For example, consider the figure above, which shows the volume of electricity sold to residential
customers in South Australia each year from 1989 to 2008 (hot water sales have been excluded). The
data are also shown in the table below.

Autoplot code:


8. Explain in detail about Missing values.


List of topics:
 Introduction
o a. Missing Completely at Random (MCAR)
o b. Missing at Random (MAR)
o c. Missing Not at Random (MNAR)
 Comparison for Handling Missing Data
 Comparison methods to handling missing data
Introduction:
 It is common in data analytics to find missing values in a dataset, where some values
should exist and fail to be observed or recorded [48].

 The problem of missing data is relatively common in almost all studies and can have a
significant effect on conclusions that can be drawn from the data [49].

 In general, there are three types of missing data, classified by the mechanism of loss:

o a. Missing Completely at Random (MCAR)

In this type, the missingness shows no pattern with respect to the data values
[48] [50] [51] [52] [53]. This means that the probability that a value is
missing does not depend on the value of the observed data or the value of
the missing data.


P(Missing|Data Complete) = P(Missing)

o b. Missing at Random (MAR)

Missing at random (MAR) indicates that the probability of lost data relies on
observed data [48] [50] [51] [52] [53]. This means that the probability of the
value of the missing variable depends partly on other data observed.

P(missing|complete data) = P(missing|observed data)

o c. Missing Not at Random (MNAR)

Missing not at random (MNAR) occurs when the probability of a missing value is
directly related to the missing value itself. In other words, the missingness cannot be
explained by the observed data alone [48] [50] [51] [52] [53].

P(missing|complete data) ≠ P(missing|observed data)

Various techniques can be used to deal with missing data. Little and
Rubin (2002) group them into two categories: traditional methods and modern
methods. The complete methodology can be seen in the figure below.

Fig: Method for Handling Missing Data [48].
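
Two widely used simple techniques are listwise deletion (dropping incomplete cases) and mean imputation (replacing each missing value with the mean of the observed values). Whether the figure lists exactly these two is an assumption; the sketch below, with made-up data and None marking a missing value, only illustrates how they differ.

```python
def listwise_delete(values):
    """Drop every missing value (None) and keep only the observed ones."""
    return [v for v in values if v is not None]

def mean_impute(values):
    """Replace every missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

# Made-up ages with two missing entries, for illustration only
ages = [25, None, 31, 40, None, 24]

print(listwise_delete(ages))  # [25, 31, 40, 24]
print(mean_impute(ages))      # [25, 30.0, 31, 40, 30.0, 24]
```

Deletion shrinks the dataset, while mean imputation keeps its size but understates the variability of the variable.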


Comparison for Handling Missing Data:

The choice of method for handling missing data depends heavily on the type of data and the needs of the analysis.
Based on the results of a review of scientific papers that discuss the problem of missing data, to
overcome the missing data

Comparison methods to handling missing data:


 Missing data can have a dramatic effect on the validity of research findings [41]. In
quantitative research, the presence of missing data leads to biased parameter estimates
[16] [17] [18] [19].

 Ignoring missing data has an impact on the results of the analysis [13] [14], learning
outcomes and predictions on the problem of collaborative prediction [15].

 Also, an improper method of handling missing data can degrade the performance of a
predictive model [17] [20].

 Research opportunities remain open for studies of predictive analytics in the presence
of missing data. The question is how to build a predictive analytics model when missing
data, whether from internal or external sources, arises under the following conditions:

1. The dataset is complete, and new data contains missing data.

2. The dataset contains missing data, and new data is complete.

3. Both the dataset and new data contain missing data.
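One simple way to handle all three conditions with a single procedure is to learn fill values from the existing dataset and reuse them on new data. The sketch below uses column-mean imputation purely as an illustration: the source does not prescribe this method, and the helper names `fit_column_means` and `impute` and the toy rows are hypothetical.

```python
from statistics import mean


def fit_column_means(rows):
    """Learn one fill value per column from the observed (non-None) entries."""
    columns = list(zip(*rows))
    return [mean(v for v in col if v is not None) for col in columns]


def impute(rows, fill_values):
    """Replace each None with the fill value learned for its column."""
    return [
        [fill if v is None else v for v, fill in zip(row, fill_values)]
        for row in rows
    ]


# Condition 3: both the dataset and the new data contain missing values.
dataset = [[1.0, 10.0], [2.0, None], [3.0, 30.0]]
new_data = [[None, 20.0], [4.0, None]]

fill_values = fit_column_means(dataset)        # learned from the dataset only
dataset_filled = impute(dataset, fill_values)
new_data_filled = impute(new_data, fill_values)

print(fill_values)
print(dataset_filled)
print(new_data_filled)
```

Learning the fill values from the dataset alone keeps new data from influencing the model, which is why the same `fill_values` are applied to both. Mean imputation is only reasonable when the data is close to MCAR. Other mechanisms generally need the more modern methods.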

9. Explain in detail about serial correlation.


List of topics:
 Introduction
 Serial correlation

Introduction:


 Serial correlation occurs in a time series when a variable and a lagged version of itself
(for instance a variable at times T and at T-1) are observed to be correlated with one
another over periods of time.

 Repeating patterns often show serial correlation when the level of a variable affects its
future level. In finance, this correlation is used by technical analysts to determine how
well the past price of a security predicts the future price.

 Serial correlation is similar to the statistical concepts of autocorrelation or lagged
correlation.

Serial correlation:
a. Serial correlation is the relationship between a given variable and a lagged version of itself
over various time intervals.
b. It measures the relationship between a variable's current value and its past values.
c. A variable that is serially correlated indicates that it may not be random.
d. Technical analysts use serial correlation to validate the profitable patterns of a security
or group of securities and to determine the risk associated with investment opportunities.
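The definition above can be sketched as the Pearson correlation between a series and a copy of itself shifted by a given lag. The function name `serial_correlation` and the two simulated series are assumptions made for illustration. The coefficient 0.8 is an arbitrary choice.

```python
import random
from statistics import mean


def serial_correlation(series, lag=1):
    """Pearson correlation between series[t] and series[t - lag] (lag >= 1)."""
    current = series[lag:]
    lagged = series[:-lag]
    mc, ml = mean(current), mean(lagged)
    cov = sum((c - mc) * (l - ml) for c, l in zip(current, lagged))
    var_c = sum((c - mc) ** 2 for c in current)
    var_l = sum((l - ml) ** 2 for l in lagged)
    return cov / (var_c * var_l) ** 0.5


random.seed(42)

# Independent random noise: past values say nothing about the next value.
noise = [random.gauss(0, 1) for _ in range(2000)]

# A persistent series: each value carries over 80% of the previous level.
persistent = [0.0]
for _ in range(1999):
    persistent.append(0.8 * persistent[-1] + random.gauss(0, 1))

print("noise, lag 1     :", round(serial_correlation(noise), 2))
print("persistent, lag 1:", round(serial_correlation(persistent), 2))
print("persistent, lag 5:", round(serial_correlation(persistent, lag=5), 2))
```

A value near zero is consistent with a random series, while a value far from zero suggests that the level of the variable affects its future level, as in point c.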

10. Explain predictive analysis in detail.

Introduction:

Predictive analytics determines the likelihood of future outcomes using techniques like data
mining, statistics, data modelling, artificial intelligence, and machine learning.

Put simply, predictive analytics interprets an organization’s historical data to make predictions
about the future.

Today’s predictive analytics techniques can discover patterns in the data to identify upcoming
risks and opportunities for an organization.

Importance of predictive analysis:


Predictive analytics allows organizations to be more proactive in the way they do business,
detecting trends to guide informed decision-making.

With the predictive models outlined above, organizations no longer have to rely on educated
guesses because forecasts provide additional insight.

The benefits of predictive analytics vary by industry, but here are some common reasons for
forecasting.

Improve profit margins.

Optimize marketing campaigns.

Reduce risk.

Steps to effectively implement Predictive analysis:

Problem Definition

It may seem obvious, but the very first step in introducing predictive analytics is to define
its scope precisely. The possible applications vary according to their purpose and the
company’s industry.

Some well-known examples are forecasting models, anomaly detection algorithms, and
churn analysis tools.
During this phase it is also important to understand which data are necessary and where they
reside.

Data collection

In this step we take the necessary data (both structured and unstructured) from different sources.
In the ideal scenario there is a Data Lake, designed and maintained for this purpose, or at least a
Data Warehouse with its staging area from which we can retrieve the data.

Data manipulation and descriptive analysis

During this phase, data are organized for their final purpose: use by predictive analytics
models to solve the problem.

Statistical analysis


Once the data are in their final form, a statistical analysis of the parameters can be carried
out, so that earlier hypotheses are tested directly or insights are extracted by visualizing
metrics.

Modeling

Once the data have been thoroughly prepared, predictive models can be tested and the necessary
experiments carried out to obtain a model with satisfactory predictive performance.

Implementation

This is the actual deployment stage. After performing all the required tests, evaluating the
quality of the models, and validating the output data, the predictive analytics tool can be put
into production, where it provides predictions that address the problem stated in the first step.
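The steps above can be sketched end to end on a toy forecasting problem. Everything specific here is an assumption for illustration: the data is simulated rather than collected, the 150/50 split is arbitrary, and the model is a simple least-squares line with hypothetical helper names `fit_line` and `predict`.

```python
import random
from statistics import mean

random.seed(1)

# Data collection (simulated): a hypothetical input x and outcome y.
xs = [random.uniform(0, 10) for _ in range(200)]
data = [(x, 3.0 + 2.0 * x + random.gauss(0, 1.0)) for x in xs]

# Data manipulation: hold back part of the data for later validation.
train, test = data[:150], data[150:]

# Statistical analysis: a basic descriptive summary of the training data.
print("mean x:", round(mean(x for x, _ in train), 2))
print("mean y:", round(mean(y for _, y in train), 2))


# Modeling: fit a straight line by least squares.
def fit_line(pairs):
    """Return (intercept, slope) minimising the squared prediction error."""
    mx = mean(x for x, _ in pairs)
    my = mean(y for _, y in pairs)
    slope = (sum((x - mx) * (y - my) for x, y in pairs)
             / sum((x - mx) ** 2 for x, _ in pairs))
    return my - slope * mx, slope


intercept, slope = fit_line(train)

# Validation before deployment: error on data the model has not seen.
rmse = mean((y - (intercept + slope * x)) ** 2 for x, y in test) ** 0.5
print("slope:", round(slope, 2), "held-out RMSE:", round(rmse, 2))


# Implementation: the function that would be exposed in production.
def predict(x):
    return intercept + slope * x


print("prediction for x = 4:", round(predict(4.0), 2))
```

Keeping a held-out set separate from the training data is what makes the validation step meaningful, since the error is measured on records the model never saw.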

Predictive analytics tools:

Identify the business objective.

Before you do anything else, clearly define the question you want predictive analytics to answer.
Generate a list of queries and prioritize the questions that mean the most to your organization.

Determine the datasets.

Once you outline a list of clear objectives, determine if you have the data available to answer
those queries. Make sure that the datasets are relevant, complete, and large enough for predictive
modeling.

Create processes for sharing and using insights.

Any opportunities or threats you uncover will be useless if there’s not a process in place to act on
those findings. Ensure proper communication channels are in place so that valuable predictions
end up in the right hands.

Choose the right software solutions.

Your organization needs a platform it can depend on and tools that empower people of all skill
levels to ask deeper questions of their data. Tableau’s advanced analytics tools support time-
series analysis, allowing you to run predictive analysis like forecasting within a visual analytics
interface.
