Unit V Fds Notes
PART A
a. Weather forecasts
b. Creating video games
c. Translating voice to text for mobile phone messaging
d. Customer service
5. Define Credit?
6. Define Underwriting?
Individuals who work in this field look at how consumers have reacted
to the overall economy when planning on a new campaign.
They can use these shifts in demographics to determine if the current
mix of products will entice consumers to make a purchase.
Active traders, meanwhile, look at a variety of metrics based on past
events when deciding whether to buy or sell a security.
Decision trees are the simplest models because they're easy to understand and
dissect. They're also very useful when you need to make a decision in a short
period of time.
9. Define Regression?
Step 1: Draw a table with 4 columns, where the first two columns are for the x and y points.
Step 2: In the next two columns, find xy and x².
Step 3: Find ∑x, ∑y, ∑xy, and ∑x².
Step 4: Find the value of the slope m using the formula m = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²).
Step 5: Calculate the value of the intercept b using the formula b = (∑y − m∑x) / n.
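As a quick sketch, the five steps above can be carried out in Python; the x and y values below are invented for illustration, not taken from the notes.

```python
# Least squares fit y = m*x + b using the tabular sums from the steps above.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
sum_x = sum(x)                                   # ∑x
sum_y = sum(y)                                   # ∑y
sum_xy = sum(xi * yi for xi, yi in zip(x, y))    # ∑xy
sum_x2 = sum(xi ** 2 for xi in x)                # ∑x²

# Slope and intercept from the formulas in Steps 4 and 5
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / n
print(m, b)  # -> 0.6 2.2
```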
Goodness of fit is a measure of how well a statistical model fits a set of observations. When
goodness of fit is high, the values expected based on the model are close to the observed values.
In the case of the linear regression model, two types of hypothesis tests are done: t-tests and F-tests. In other words, two kinds of statistics are used to assess whether a linear relationship exists between the response and predictor variables: t-statistics and F-statistics.
A sample in which each sampling unit has been assigned a weight for use in subsequent analysis. Common uses include survey weights that adjust for intentional oversampling of some units relative to others.
Resampling is a method that involves repeatedly drawing samples from the training dataset. These samples are then used to refit a specific model, retrieving more information about the fitted model. The aim is to gather more information about a sample, improve the accuracy of estimates, and quantify their uncertainty.
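A minimal sketch of one common resampling technique, the bootstrap, estimating the uncertainty of a sample mean; the data values are invented for illustration.

```python
# Bootstrap: repeatedly resample the data with replacement and record the
# statistic of interest, to approximate its sampling distribution.
import random

random.seed(0)
data = [4.0, 5.0, 5.5, 6.0, 7.0, 8.0]

means = []
for _ in range(1000):
    # draw a sample of the same size, with replacement
    resample = [random.choice(data) for _ in data]
    means.append(sum(resample) / len(resample))

# The spread of the bootstrap means approximates the uncertainty of the mean.
means.sort()
ci_low, ci_high = means[25], means[975]   # rough 95% interval
```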
1. Linear Regression.
2. Logistic Regression.
3. Ridge Regression.
4. Lasso Regression.
5. Polynomial Regression.
Multiple linear regression (MLR), also known simply as multiple regression, is a statistical
technique that uses several explanatory variables to predict the outcome of a response variable.
Multiple regression is a statistical technique that can be used to analyze the relationship between
a single dependent variable and several independent variables.
A nonlinear relationship between two variables is one for which the slope of the curve showing the relationship changes as the value of one of the variables changes.
Logistic regression is a data analysis technique that uses mathematics to find the relationships
between two data factors. It then uses this relationship to predict the value of one of those factors
based on the other. The prediction usually has a finite number of outcomes, like yes or no.
Parameter estimation is the process of computing a model’s parameter values from measured
data. You can apply parameter estimation to different types of mathematical models, including
statistical models, parametric dynamic models, and data-based Simulink models.
Time series analysis is a specific way of analyzing a sequence of data points collected over an
interval of time. In time series analysis, analysts record data points at consistent intervals over a
set period of time rather than just recording the data points intermittently or randomly.
Missing value analysis helps address several concerns caused by incomplete data. If cases with
missing values are systematically different from cases without missing values, the results can be
misleading.
Serial correlation is used in statistics to describe the relationship between observations of the
same variable over specific periods. If a variable's serial correlation is measured as zero, there is
no correlation, and each of the observations is independent of one another.
Autocorrelation refers to the degree of correlation of the same variables between two successive
time intervals. It measures how the lagged version of the value of a variable is related to the
original version of it in a time series. Autocorrelation, as a statistical concept, is also known as
serial correlation.
Survival analysis is a collection of statistical procedures for data analysis where the outcome
variable of interest is time until an event occurs.
PART- B
Making Predictions
Introduction:
Linear Regression is the simplest form of machine learning out there.
In this post, we will see how linear regression works and implement it in Python from
scratch.
Linear regression:
In statistics, linear regression is a linear approach to modelling the relationship between a
dependent variable and one or more independent variables.
In the case of one independent variable it is called simple linear regression. For more than
one independent variable, the process is called multiple linear regression.
We will be dealing with simple linear regression in this tutorial.
Let X be the independent variable and Y be the dependent variable. We will define a
linear relationship between these two variables as follows:
Y = mX + c
This is the equation of a line that you studied in high school, where m is the slope and c is the y-intercept.
Our challenge today is to determine the values of m and c that give the minimum error
for the given dataset.
We will be doing this by using the Least Squares method.
We square each difference because, for points below the regression line, y − p (actual minus predicted) will be negative, and we don't want negative values cancelling out in our total error.
Least squares method:
Now that we have determined the loss function, the only thing left to do is minimize it.
This is done by finding the partial derivative of L, equating it to 0 and then finding an
expression for m and c.
After we do the math, we are left with these equations:
m = Σ(x − x̅)(y − ȳ) / Σ(x − x̅)²
c = ȳ − m x̅
Here x̅ is the mean of all the values in the input X and ȳ is the mean of all the values in
the desired output Y.
This is the Least Squares method.
Now we will implement this in Python and make predictions.
Making Predictions:
# X, Y, m and c come from the Least Squares step above
import matplotlib.pyplot as plt

# Making predictions
Y_pred = m*X + c

plt.scatter(X, Y)                                                   # actual points
plt.plot([min(X), max(X)], [min(Y_pred), max(Y_pred)], color='red') # predicted line
plt.show()
There won't be much accuracy, because we are simply taking a straight line and forcing it
to fit the given data in the best possible way.
But you can use this to make simple predictions or get an idea about the magnitude/range
of the real value.
Also this is a good first step for beginners in Machine Learning.
Introduction
Linear Regression
Linear Regression:
This module covers linear models with independently and identically distributed errors, as well as errors with heteroscedasticity or autocorrelation.
It allows estimation by ordinary least squares (OLS), weighted least squares (WLS),
generalized least squares (GLS), and feasible generalized least squares with autocorrelated
AR(p) errors.
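As a dependency-light sketch of what OLS estimation does, the fit can be computed with NumPy's least-squares solver; the x and y values below are invented, and the library module described above exposes the same estimate along with WLS and GLS variants.

```python
# OLS via a least-squares solve: build a design matrix with an intercept
# column and find the coefficients minimizing the squared residuals.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

X = np.column_stack([np.ones_like(x), x])      # [1, x] design matrix
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS coefficient estimates
intercept, slope = beta
print(intercept, slope)  # -> 2.2 0.6 (approximately)
```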
List of topics:
Introduction
Multiple Regression Definition
Multiple Regression formula
Multiple Regression Analysis
Advantages of Stepwise Multiple Regression
Introduction:
Multiple linear regression (MLR), also known simply as multiple regression, is a statistical
technique that uses several explanatory variables to predict the outcome of a response
variable. Multiple regression is an extension of linear (OLS) regression, which uses just one
explanatory variable.
In our daily lives, we come across variables, which are related to each other. To study the
degree of relationships between these variables, we make use of correlation.
To find the nature of the relationship between the variables, we have another measure,
which is known as regression.
In this, we use correlation and regression to find equations such that we can estimate the
value of one variable when the values of other variables are given.
In linear regression, there is only one independent variable and one dependent variable. But in the
case of multiple regression, there is a set of independent variables that helps us better explain
or predict the dependent variable y. The multiple regression equation takes the form
y = b0 + b1x1 + b2x2 + … + bkxk + ε
where x1, x2, …, xk are the k independent variables, y is the dependent variable, b0, …, bk are the regression coefficients, and ε is the error term.
Only independent variables with nonzero regression coefficients are included in the
regression equation.
The changes in the multiple standard errors of estimate and the coefficient of
determination are shown.
The stepwise multiple regression is efficient in finding the regression equation with only
significant regression coefficients.
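A minimal sketch of fitting the multiple regression equation with two explanatory variables; the data are constructed from a known relationship (invented coefficients 1, 2, 3) so the recovered estimates can be checked.

```python
# Multiple regression: solve for b0, b1, b2 in y = b0 + b1*x1 + b2*x2
# by least squares on a design matrix with an intercept column.
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = 1.0 + 2.0 * x1 + 3.0 * x2          # exact relationship, no noise

X = np.column_stack([np.ones_like(x1), x1, x2])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)  # -> approximately [1. 2. 3.]
```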
Nonlinearity is a statistical term used to describe a situation where there is not a straight-line
or direct relationship between an independent variable and a dependent variable.
In a nonlinear relationship, changes in the output do not change in direct proportion to
changes in any of the inputs.
For example, let's say we're studying the relationship between the temperature and the number of
visitors to a zoo.
At first, as the temperature increases, more people visit the zoo. But at some point, when
the temperature gets too hot, fewer people visit the zoo. This is a nonlinear relationship.
This is also commonly known as the log odds, or the natural logarithm of odds, and this
logistic function is represented by the following formulas:
ln(p / (1 − p)) = b0 + b1x1 + … + bkxk
p = 1 / (1 + e^−(b0 + b1x1 + … + bkxk))
In the multinomial type of logistic regression model, the dependent variable has three or more
possible outcomes; however, these values have no specified order.
The ordinal type of logistic regression model is leveraged when the response variable has three
or more possible outcomes that do have a defined order.
Logistic Regression:
Logistic regression not only gives a measure of how relevant a predictor is (coefficient
size), but also its direction of association (positive or negative).
The main limitation of logistic regression is the assumption of linearity between the dependent
variable and the independent variables.
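A small sketch of the logistic function and log odds defined above; the coefficient values b0, b1 and the predictor value x are hypothetical, chosen only so the result is easy to check.

```python
# The logistic (sigmoid) function maps log odds to a probability in (0, 1).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

b0, b1 = -3.0, 1.5       # hypothetical intercept and coefficient
x = 2.0                  # hypothetical predictor value
z = b0 + b1 * x          # log odds: ln(p / (1 - p)), here 0.0
p = sigmoid(z)           # predicted probability, here 0.5
print(p)  # -> 0.5
```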
Introduction:
Time series analysis is indispensable in data science, statistics, and analytics.
At its core, time series analysis focuses on studying and interpreting a sequence of data
points recorded or collected at consistent time intervals.
Unlike cross-sectional data, which captures a snapshot in time, time series data is
fundamentally dynamic, evolving over chronological sequences both short and extremely
long.
This type of analysis is pivotal in uncovering underlying structures within the data, such
as trends, cycles, and seasonal variations.
Technically, time series analysis seeks to model the inherent structures within the data,
accounting for phenomena like autocorrelation, seasonal patterns, and trends.
The order of data points is crucial; rearranging them could lose meaningful insights or
distort interpretations.
Furthermore, time series analysis often requires a substantial dataset to maintain the
statistical significance of the findings.
This enables analysts to filter out 'noise,' ensuring that observed patterns are not mere
outliers but statistically significant trends or cycles.
Time series data generally comprises different components that characterize the patterns
and behaviour of the data over time. By analyzing these components, we can better understand
the dynamics of the time series and create more accurate models.
Trends
Seasonality
Cycles
Noise
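The four components above can be illustrated by building a toy series from them; all numbers here (trend slope, seasonal amplitude, noise level) are invented for the sketch.

```python
# A synthetic monthly time series = trend + seasonality + noise.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(48)                              # 48 monthly observations
trend = 0.5 * t                                # steady upward trend
seasonality = 10 * np.sin(2 * np.pi * t / 12)  # repeating yearly pattern
noise = rng.normal(0, 1, size=t.size)          # random, unexplained variation
series = trend + seasonality + noise
```

Decomposition methods work in the opposite direction, recovering these components from an observed series.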
List of topics:
Introduction
Moving average smoothing:
Example
Autoplot code
Introduction:
The predictive moving average indicators are designed to calculate the price required on
the next bar in order for the market and a moving average, or two moving averages, to
meet and potentially cross.
The predictive moving average indicators employ three of the most popular moving
averages found in TradeStation.
Moving averages are usually calculated to identify the direction of a trend. This can be
done in a variety of ways, with the most common being simple and weighted moving
averages.
Simple moving average forecasting is what we commonly think of as averaging. It can
be used for a single period or multiple periods.
The moving average of order m can be written as
T̂(t) = (1/m) Σ y(t+j), with the sum running over j = −k, …, k,
where m = 2k+1. That is, the estimate of the trend-cycle at time t is obtained by averaging values of the time series within k periods of t.
Observations that are nearby in time are also likely to be close in value.
For example, consider the series above, which shows the volume of electricity sold to residential
customers in South Australia each year from 1989 to 2008 (hot water sales have been excluded). The
data are also shown in the table below.
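The centered moving average of order m = 2k + 1 described above can be sketched in a few lines; the input values are invented for illustration.

```python
# Centered moving average: each smoothed value is the mean of the
# observation and its k neighbours on either side (m = 2k + 1 values).
def moving_average(y, k):
    m = 2 * k + 1
    return [sum(y[t - k:t + k + 1]) / m for t in range(k, len(y) - k)]

smoothed = moving_average([2.0, 4.0, 6.0, 8.0, 10.0], 1)
print(smoothed)  # -> [4.0, 6.0, 8.0]
```

Note that the first and last k points have no smoothed value, since a full window of m observations is not available there.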
Autoplot code:
The problem of missing data is relatively common in almost all studies and can have a
significant effect on conclusions that can be drawn from the data [49].
In general, there are three types of missing data, classified by the mechanism of loss:
Missing completely at random (MCAR): this type of missing data has no pattern
[48] [50] [51] [52] [53]. The probability that a value is missing depends
neither on the observed data nor on the value of the missing data itself.
Missing at random (MAR) indicates that the probability of missing data relies on
observed data [48] [50] [51] [52] [53]. This means that the probability that a
value is missing depends partly on other, observed data.
Missing not at random (MNAR) occurs when the probability of a missing value is
directly related to the missing value itself. In other words, the missingness
cannot be explained by the observed data alone [48] [50] [51] [52] [53].
Various techniques can be used to deal with missing data. Little and
Rubin (2002) group them into two classes: traditional methods and modern
methods. The complete methodology can be seen in the figure below.
The choice of method for handling missing data depends heavily on the type of data and the needs of the analysis, as reviews of scientific papers on the problem of missing data show.
Missing data can have a dramatic effect on the validity of research findings [41]. In
quantitative research, the presence of missing data leads to biased parameter estimates
[16] [17] [18] [19].
Ignoring missing data has an impact on the results of the analysis [13] [14], on learning
outcomes, and on predictions in collaborative prediction problems [15].
Also, improper handling of missing data can affect the performance of a
predictive model [17] [20].
Research opportunities are quite open for studies on predictive analytics in the
presence of missing data: how to build a predictive analytics model when data from
internal or external sources are missing, under the following conditions:
Introduction:
Serial correlation occurs in a time series when a variable and a lagged version of itself
(for instance a variable at times T and at T-1) are observed to be correlated with one
another over periods of time.
Repeating patterns often show serial correlation when the level of a variable affects its
future level. In finance, this correlation is used by technical analysts to determine how
well the past price of a security predicts the future price.
Serial correlation:
a. Serial correlation is the relationship between a given variable and a lagged version of itself
over various time intervals.
b. It measures the relationship between a variable's current value given its past values.
c. A variable that is serially correlated indicates that it may not be random.
d. Technical analysts validate the profitable patterns of a security or group of securities and
determine the risk associated with investment opportunities.
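Points a and b above can be sketched by correlating a series with a lagged copy of itself; the series values below are invented for illustration.

```python
# Lag-1 serial correlation: Pearson correlation between y_t and y_{t-1}.
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0])
lag = 1
r = np.corrcoef(y[:-lag], y[lag:])[0, 1]
```

Here r is positive, because a high value tends to be followed by another high value; an r near zero would indicate the observations are independent of one another, as in point c.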
Introduction:
Predictive analytics determines the likelihood of future outcomes using techniques like data
mining, statistics, data modelling, artificial intelligence, and machine learning.
Put simply, predictive analytics interprets an organization’s historical data to make predictions
about the future.
Today’s predictive analytics techniques can discover patterns in the data to identify upcoming
risks and opportunities for an organization.
Predictive analytics allows organizations to be more proactive in the way they do business,
detecting trends to guide informed decision-making.
With the predictive models outlined above, organizations no longer have to rely on educated
guesses because forecasts provide additional insight.
The benefits of predictive analytics vary by industry, but here are some common reasons for
forecasting.
Reduce risk
Problem Definition
It may seem obvious, but the very first step in introducing predictive analytics is to precisely
define its scope. There could be various applications, which change according to their
purpose and the company's industry.
Some well-known examples come from forecasting models, anomaly detection algorithms or
Churn Analysis tools.
During this phase it is also important to understand which data are necessary and where
they exist.
Data collection
In this step we take the necessary data (both structured and unstructured) from different sources.
In the ideal scenario there is a Data Lake, designed and maintained for this purpose, or at least a
Data Warehouse with its staging area from which we can retrieve the data.
During this phase, data are organized for their final purpose: being used by predictive
analytics models to solve problems.
Statistical analysis
Once the final form of the data is obtained, it is possible to proceed with a statistical analysis of
parameters, so that previous hypotheses are directly tested, or insights are extracted through
metrics visualization.
Modeling
Once the data are thoroughly set up, predictive models can be tested, and the necessary
experiments can be carried out to obtain a model with satisfactory predictive power.
Implementation
This is the stage of actual deployment. After performing all the required tests, evaluating the quality
of the models, and validating the output data, it is possible to put the predictive analytics tool into
production, so that it provides predictions that solve the problem stated in the first step.
Before you do anything else, clearly define the question you want predictive analytics to answer.
Generate a list of queries and prioritize the questions that mean the most to your organization.
Once you outline a list of clear objectives, determine if you have the data available to answer
those queries. Make sure that the datasets are relevant, complete, and large enough for predictive
modeling.
Any opportunities or threats you uncover will be useless if there’s not a process in place to act on
those findings. Ensure proper communication channels are in place so that valuable predictions
end up in the right hands.
Your organization needs a platform it can depend on and tools that empower people of all skill
levels to ask deeper questions of their data. Tableau's advanced analytics tools support
time-series analysis, allowing you to run predictive analyses like forecasting within a visual
analytics interface.