Cheat Sheet - BT1101

Business Analytics is the use of:

Data, information technology, statistical analysis, quantitative methods and mathematical or computer-based models

To help managers gain improved insight about their business operations and make better, fact-based decisions

Descriptive analytics → Use of data to understand past and current business performance and make informed decisions

Predictive analytics → Predict the future by examining historical data, detecting patterns or relationships in these data, and then extrapolating these relationships forward in time

Prescriptive analytics → Identify the best alternatives to minimize or maximize some objective

Recognizing a problem → Defining the problem → Structuring the problem → Analyzing the problem → Interpreting results and making a decision → Implementing the solution

Structured vs Unstructured data


Structured → Information has a high degree of organization and is related.
Unstructured → Text-heavy information.

Big Data → Massive amount of business data from a wide variety of sources, much of which is available in real time and much of which is uncertain or unpredictable.
4 Vs → Volume, Variety, Velocity and Veracity

Metric → a unit of measurement that provides a way to objectively quantify performance.


Measurement → act of obtaining data associated with a metric
Measures → numerical values associated with a metric

Discrete Metric → One that is derived from counting something (whole numbers, yes/no)
Continuous Metric → Based on a continuous scale of measurement (dollars, length, weight, time, volume)

Categorical (nominal) → Sorted into categories according to specified characteristics (e.g. gender)

Ordinal data → Can be ordered or ranked according to some relationship to one another (order, rating, rank)
Interval data → Ordinal, but with constant differences between observations and an arbitrary zero point (dates, months)
Ratio data → Continuous and have a natural zero (dollars and time)
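
As a rough illustration of the four measurement scales above, the sketch below stores one example of each in R. All values and variable names are hypothetical and are not taken from the cheat sheet.

```r
# Hypothetical example values, used only to illustrate the four scales.
supplier   <- factor(c("Acme", "Bolt", "Acme"))                 # nominal
rating     <- factor(c("low", "high", "medium"),
                     levels = c("low", "medium", "high"),
                     ordered = TRUE)                              # ordinal
order_date <- as.Date(c("2024-01-10", "2024-01-25"))             # interval
cost       <- c(120.50, 241.00)                                   # ratio

# Ordinal data supports ranking comparisons; nominal data does not.
rating[2] > rating[1]      # TRUE: "high" ranks above "low"

# Interval data supports differences, but ratios are not meaningful.
diff(order_date)           # Time difference of 15 days

# Ratio data has a natural zero, so ratios are meaningful.
cost[2] / cost[1]          # 2
```

An ordered factor is one reasonable way to hold ordinal data in R because it keeps the ranking without pretending that the gaps between levels are equal.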

Reliability → Data is accurate and consistent


Validity → Data correctly measures what it is supposed to measure

Overview to Business Analytics

Question Answer

Q: What is the difference between validity and reliability?
A: Validity means that data correctly measure what they are supposed to measure, while reliability means that data are accurate and consistent.

Q: Assume that you have been given a dataset containing all items that an aircraft component manufacturing company has purchased over the past 3 months. The data provide the supplier; order number; item number, description, and cost; quantity ordered; cost per order; and the order and arrival dates. How would you classify the following types of data? i) Supplier data; ii) order number; iii) item cost; iv) cost per order; and v) order date
A: i) Categorical; ii) Ordinal; iii) Ratio; iv) Ratio; and v) Interval

Q: In what way does big data provide an opportunity for organizations to gain a competitive advantage?
i. If the data can be understood and analyzed effectively to make better business decisions.
ii. If organizations employ advanced analytics techniques such as data mining and text analysis.
iii. If the unstructured big data is transformed to structured, and easily understandable information.
iv. If the volume of data input is controlled.
A: i, ii

Q: Which of the following is the first phase in problem-solving for business analytics?
A: Recognizing the problem

Q: According to IBM, which of the following are characteristics of big data?
A: Volume, variety, velocity, and veracity

Q: Which of the following characterizes business analytics?
i. The use of data, information technology, statistical analysis, quantitative methods, and mathematical or computer-based models to help managers gain improved insight about their business operations and make better, fact-based decisions.
ii. A term for simulated intelligence in machines.
iii. A process of transforming data into actions through analysis and insights in the context of organizational decision making and problem solving.
iv. An application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
A: i, iii

Q: A manager at Gampco Inc. wishes to know the impact a marketing program will have on sales. Which of the following business analytics will help the manager?
A: Predictive analytics

Q: Which of the following is an example of a measure of continuous metrics?
A: Weight and volume of a sheet of steel

Q: Typical questions that descriptive analytics helps answer are:
A: How many and what types of complaints did we resolve? How much did we sell in each region? What was our revenue and profit last quarter? Which factory has the lowest productivity?

Q: Assume that you are a business analyst for a bank. Your manager has asked you to compute the optimal staffing to achieve a given profitability constrained by a fixed cycle time. Which of the following would you apply?
A: Prescriptive analytics

Q: Net profit, return on investment, market share, percentage of orders filled accurately, the proportion of defective parts produced, the number of inventory turns each month, and customer satisfaction are examples of:
A: Metrics

Q: Which of the following are part of the structuring the problem phase?
i. Stating goals and objectives
ii. Characterizing the possible decisions
iii. Identifying any constraints or restrictions
iv. Developing a formal model
v. Communication of the problem to management
A: i, ii, iii, iv

Q: Which of the following are challenges in the application of business analytics?
i. Lack of understanding of how to use analytics
ii. Insufficient analytical skills
iii. Difficulty in getting good data and sharing information
iv. Data privacy, security, and compliance
v. Building the right governance and organizational structure
A: All of the above

Q: Which of the following is the most appropriate as an example of interval data?
A: Calendar month (e.g. 1, 2, 3, 4, ..., 12)

Q: Which of the following is an example of a discrete metric?
A: Number of watches sold

Q: Which of the following is an example of structured data?
A: Postal code

Q: Which one of the following is most aligned with the value-generation approach for BA?
A: Consider how analytics can bring value to the organization as the first step in an organization's analytics strategy

Data Visualization, Tabulations & Frequencies

Question Answer

Q: type = " " in line charts can take the following values except:
A: The valid values are:
type    description
p       points
l       lines
o       overplotted points and lines
b, c    points (empty if "c") joined by lines
s, S    stair steps
h       histogram-like vertical lines
n       does not produce any points or lines

Q: Which of the following is useful for displaying data over time?
A: Line charts

Q: We may express the frequencies as a fraction, or proportion, of the total; this is called the:
A: Relative frequency

Q: What does the parameter 'ylim' mean when using the plot function in R?
A: ylim sets the limits of the y values used for plotting

Q: Bar charts are useful for comparing:
A: Categorical or ordinal data

Q: A tabular summary of cumulative relative frequencies is called a:
A: Cumulative relative frequency distribution
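
The frequency, relative frequency, and cumulative relative frequency ideas above can be sketched in base R roughly as follows. The complaint data are hypothetical and chosen only to keep the arithmetic easy to follow.

```r
# Hypothetical complaint categories, used only for illustration.
complaints <- c("billing", "delivery", "billing", "quality",
                "delivery", "billing", "quality", "billing")

freq     <- table(complaints)    # frequency distribution
rel_freq <- prop.table(freq)     # relative frequencies (they sum to 1)
cum_rel  <- cumsum(rel_freq)     # cumulative relative frequencies

print(freq)
print(rel_freq)
print(cum_rel)
```

Cumulative relative frequencies are most meaningful when the groups have a natural order (for example, classes of a numeric variable); the alphabetical order used here is only for demonstration.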

Q: Histogram represents the frequency distribution of ________ variables. Conversely, a bar graph is a diagrammatic comparison of ________ variables. Histogram presents numerical data whereas bar graph shows categorical data.
A: continuous; discrete

Q: Which of the following is true about contingency tables?
i. They are one of the most basic statistical tools for summarizing categorical data.
ii. They are a tabular method that displays the number of observations in a data set for different subcategories of two or more categorical variables.
iii. Contingency tables can accept numerical variables, but the grouping variable must be categorical.
iv. Subcategories of variables must be mutually exclusive and exhaustive (i.e. each observation can be classified into only one subcategory, and, taken together over all subcategories, they must constitute the complete data set).
A: All

Q: Which of the following parameters allows you to create a clustered bar chart?
A: beside = TRUE

Q: What does the output of quantile(cars$mpg) mean?
A: It breaks the data into four parts. The 25th percentile is called the first quartile, Q1; the 50th percentile is called the second quartile, Q2; the 75th percentile is called the third quartile, Q3; and the 100th percentile is the fourth quartile.
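
A small sketch of what `quantile()` returns. It assumes the built-in `mtcars` data set as a stand-in, because the `cars` data frame with an `mpg` column used in the question is presumably course-specific (the `cars` data set that ships with R has only `speed` and `dist`).

```r
# mtcars ships with R; it stands in for the course's `cars` data frame here.
q <- quantile(mtcars$mpg)
print(q)    # named vector with entries 0%, 25%, 50%, 75%, 100%

# The second quartile is the median.
all.equal(q[["50%"]], median(mtcars$mpg))

# The interquartile range is Q3 minus Q1.
all.equal(IQR(mtcars$mpg), q[["75%"]] - q[["25%"]])
```

Note that the default output also includes the 0% entry, which is simply the minimum of the data.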
Q: Which of the following codes adds a legend at the top right of a clustered bar chart?
A: legend("topright", MS, cex=0.8, fill=colors)

Q: Which of the following is true about a stacked bar chart?
A: To create a stacked chart, the 'beside' parameter does not need to be included because the 'beside' parameter is FALSE by default
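
The clustered and stacked bar chart answers above can be tried with a sketch like the one below. The count matrix is hypothetical; `MS` and `colors` are simply the names used in the `legend()` answer, filled here with assumed values.

```r
# Hypothetical counts: rows are marital status, columns are regions.
counts <- matrix(c(12, 8, 15, 5, 9, 11), nrow = 2,
                 dimnames = list(c("Married", "Single"),
                                 c("North", "South", "East")))
MS     <- rownames(counts)
colors <- c("steelblue", "orange")

pdf(NULL)  # draw to a null device so the sketch runs without a display

# beside = TRUE draws one bar per cell, grouped by column (clustered chart).
clustered <- barplot(counts, beside = TRUE, col = colors, ylim = c(0, 20))
legend("topright", MS, cex = 0.8, fill = colors)

# beside defaults to FALSE, which stacks the rows within each column.
stacked <- barplot(counts, col = colors)

dev.off()
```

Remove the `pdf(NULL)` and `dev.off()` lines to see the charts in an interactive session. `barplot()` returns the bar midpoints, which is why the clustered call yields a 2 x 3 matrix and the stacked call a vector of length 3.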

Q: A graphical depiction of a frequency distribution for numerical data in the form of a column chart is called:
A: Histogram

Q: In a ________ the range of values of a numeric variable of interest is usually laid out on the horizontal scale (x-axis). The scale is divided into sections called classes. The vertical scale (y-axis) shows how many observations fall into each class.
A: Histogram

Q: Horizontal and vertical bar plots are useful for the following except?
A: Displaying data over time

Q: In creating histograms in R using the 'hist' function, the ____ parameter is used to specify the width of each bar.
A: breaks

Q: A ___ represents the proportion of the total number of observations that fall at or below the upper limit of each group.
A: Cumulative relative frequency

Q: When using the barplot function, what does the parameter 'cex' mean?
A: It is a number indicating the amount by which plotting text and symbols should be scaled relative to the default. 1 = default, 1.5 is 50% larger, 0.5 is 50% smaller, etc.

Q: Which of the following is true about the 'names.arg' parameter when using the barplot function?
A: names.arg takes a character vector used to label the bars

Q: What does the 'table' function accomplish?
A: Uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels.
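
As a minimal sketch of `table()` building a contingency table from two categorical variables, the example below uses a hypothetical six-row survey.

```r
# Hypothetical survey responses.
survey <- data.frame(
  gender  = c("F", "M", "F", "F", "M", "M"),
  product = c("A", "A", "B", "A", "B", "B")
)

ct <- table(survey$gender, survey$product)   # counts per combination
print(ct)
print(addmargins(ct))                        # add row and column totals
print(prop.table(ct, margin = 1))            # proportions within each row
```

Each observation lands in exactly one cell, so the cell counts add up to the number of rows in the data.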

Q: legend(x, y=NULL, legend, fill, col, bg). What do x and y represent?
A: x and y are coordinates to be used to position the legend

Descriptive Statistics - Statistical Measures

Question Answer

Q: ________ refers to the degree of variation in the data, that is, the numerical spread (or compactness) of the data.
A: Dispersion

Q: The mean can be affected by outliers. What are outliers?
A: Observations that are radically different from the rest, which pull the value of the mean toward these values.

Q: The ________ measures the degree of asymmetry of observations around the mean.
A: Coefficient of skewness

Q: Which of the following formulas computes the variance of a sample?
A: s^2 = ∑(x_i − x̄)^2 / (n − 1)

Q: The z-score for the i-th observation in a sample is calculated as:
A: z_i = (x_i − x̄) / s
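
The sample variance and z-score can be computed by hand and checked against R's built-ins. This is a minimal sketch on a hypothetical five-value sample.

```r
x <- c(4, 8, 6, 5, 7)   # hypothetical sample

n     <- length(x)
x_bar <- mean(x)
s2    <- sum((x - x_bar)^2) / (n - 1)   # sample variance
s     <- sqrt(s2)                        # sample standard deviation
z     <- (x - x_bar) / s                 # z-score of each observation

all.equal(s2, var(x))   # var() uses the same n - 1 denominator
all.equal(s, sd(x))
round(z, 2)
```

A z-score of 1 in this output would mean the observation sits one standard deviation above the sample mean.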

Q: According to the Empirical rule, the proportion of normally distributed data which falls within 2 standard deviations from its mean is about ______.
A: 95%

Q: ________ states that for any set of data, the proportion of values that lie within k standard deviations (k > 1) of the mean is at least 1 − 1/k^2.
A: Chebyshev's theorem
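
To see how the Chebyshev bound compares with the empirical rule, the sketch below evaluates both for k = 2 and k = 3. The normal proportions come from `pnorm()`, so they apply only to normally distributed data.

```r
k <- c(2, 3)

chebyshev <- 1 - 1 / k^2              # lower bound for any data set
normal    <- pnorm(k) - pnorm(-k)     # exact proportion for normal data

round(data.frame(k, chebyshev, normal), 4)
```

For k = 2 the bound is 0.75 while normal data give about 0.9545, which is where the "about 95%" figure in the empirical rule comes from.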

Q: The ________ is the difference between the maximum value and the minimum value in the data set.
A: Range

Q: The ________ provides a relative measure of the dispersion in data relative to the mean.
A: Coefficient of variation

Q: Which of the following values of the coefficients of variation of stocks represents the least risky stock?
A: 0.005
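
The coefficient of variation is the standard deviation divided by the mean. A rough sketch with two hypothetical price series shows why a smaller value reads as less risk relative to the mean; `cv` is a helper defined here, not a base R function.

```r
# Hypothetical daily prices for two stocks.
stock_a <- c(100, 102, 98, 101, 99)
stock_b <- c(20, 26, 14, 23, 17)

cv <- function(x) sd(x) / mean(x)   # coefficient of variation

cv(stock_a)
cv(stock_b)
```

Stock A varies far less relative to its average price, so it has the smaller coefficient of variation.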

Q: An "outlier" in data is strictly defined by:
A: Whether it lies more than 1.5 × IQR to the left of Q1 or to the right of Q3

Q: A z-score of 1 means that ______.
A: The observation is 1.0 standard deviation to the right of the mean

Q: ________ is a measure of the linear relationship between two variables, X and Y, which does not depend on the units of measurement.
A: Correlation
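
The claim that correlation does not depend on units, while covariance does, can be checked with a short sketch on hypothetical height and weight data.

```r
# Hypothetical heights (cm) and weights (kg).
height_cm <- c(150, 160, 165, 172, 180)
weight_kg <- c(52, 58, 63, 70, 79)

cov(height_cm, weight_kg)          # depends on the units
cov(height_cm / 100, weight_kg)    # in metres: covariance shrinks 100-fold
cor(height_cm, weight_kg)          # Pearson correlation, unit-free
cor(height_cm / 100, weight_kg)    # unchanged after rescaling
```

`cor()` computes Pearson's correlation coefficient by default.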

Q: The measure of location that specifies the middle value when the data are arranged from least to greatest is the:
A: Median

Q: The ________ is the square root of the variance.
A: Standard deviation

Which of the following is TRUE of covariance, between two variables, when one of the
deviations from the mean is positive and the other is negative?

Which of the following statements about correlation is false?

Q: The ________ is the observation that occurs most frequently.
A: Mode

Q: The linear association between two variables, X and Y, can be measured by ________.
A: Pearson's correlation coefficient

Q: ________ refers to the peakedness (i.e., high, narrow) or flatness (i.e., short, flat-topped) of a histogram.
A: Kurtosis

Q: The empirical rule is applicable to data that is ___________.
A: Normally distributed

Q: Process A and B fill up milk cartons with a standard deviation of 19.28 ml while Process C fills up milk cartons with a standard deviation of 7.58 ml. Which process(es) should a milk packaging company use?
A: Process C. A lower standard deviation means more consistent, more reliable filling.

Q: What is the variance of the following dataset: 10, 10, 10, 10, 10, 10, 10, 10, 10?
A: 0. All values are equal, so the standard deviation and the variance are both 0.

Probability & Data Modeling

Question Answer

Q: Which of the following is a difference between interval estimates and point estimates?
A: Point estimates provide only a single value for a sample, while interval estimates provide a range of values.

Q: A ________ is one that provides a range for predicting the value of a new observation from the same population.
A: Prediction interval

Q: Which of the following is true about probability?
i) Probability is the likelihood that an outcome (such as whether a new product will be profitable or not, or whether a project will be completed within 15 weeks) occurs.
ii) Probabilities are expressed as values between 0 and 1, although many people convert them to percentages.
iii) The statement that there is a 10% chance that oil prices will rise next quarter is another way of stating that the probability of a rise in oil prices is 0.1.
iv) The closer the probability is to 1, the more likely it is that the outcome will occur.
v) A probability is a process that results in an outcome.
A: i, ii, iii, iv

Q: A(n) ________ random variable is one for which the number of possible outcomes can be counted.
A: Discrete

Q: The distribution of students' examination scores follows a normal distribution with a mean of 78 and variance of 100. What is the probability that a student's examination score will be at least 80?
A: 0.4207

Q: Which of the following is true of normal distributions?
A: The mean, median, and mode are equal

Q: What is the confidence coefficient when the level of significance is 0.05?
A: 0.95

Q: While rolling two dice, what is the probability of rolling a sum of 7 or more?
A: 7/12
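
The 7/12 answer for the two-dice question can be checked by enumerating all 36 equally likely outcomes, which is the classical definition of probability in action. A minimal sketch:

```r
sums <- outer(1:6, 1:6, "+")     # all 36 equally likely outcomes
favourable <- sum(sums >= 7)     # outcomes with a sum of 7 or more

favourable                       # 21
favourable / length(sums)        # 21/36 = 7/12
```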

Q: A ________ is a range of values between which the value of the population parameter is believed to be, along with a probability that the interval correctly estimates the true (unknown) population parameter.
A: Confidence interval

Q: X is a random variable that is normally distributed with a mean of 60 and standard deviation of 15. Which of the following is the R code that computes P(X>75)?
A: pnorm(75,60,15,lower.tail=FALSE)

Q: A ________ is the characterization of the possible values that a random variable may assume along with the probability of assuming these values.
A: Probability distribution

Q: Which of the following is true about the relative frequency definition of probability?
A: It is based on empirical data

Q: The collection of all possible outcomes of an experiment is called the ________.
A: Sample space

Q: The distribution of students' examination scores follows a normal distribution with a mean of 78 and variance of 100. Find x such that the probability of obtaining a score greater than x is 0.1587.
A: 88

Q: B is a random variable that follows the normal distribution with a mean of 300 and standard deviation of 100. What is the R code that computes P(250<=B<=400)?
A: pnorm(400,300,100) - pnorm(250,300,100)
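
The normal-distribution answers in this section can be reproduced with `pnorm()` and `qnorm()`. The sketch below uses the parameters stated in the questions; the variable names are illustrative only.

```r
# Exam scores: mean 78, variance 100, so the standard deviation is 10.
mu    <- 78
sigma <- sqrt(100)

# P(X >= 80): upper-tail probability.
p_at_least_80 <- pnorm(80, mu, sigma, lower.tail = FALSE)

# x such that P(X > x) = 0.1587: upper-tail quantile.
x_cutoff <- qnorm(0.1587, mu, sigma, lower.tail = FALSE)

# B ~ Normal(300, 100): P(250 <= B <= 400) as a difference of two CDF values.
p_between <- pnorm(400, 300, 100) - pnorm(250, 300, 100)

round(c(p_at_least_80, x_cutoff, p_between), 4)
```

Note that `pnorm()` and `qnorm()` take the standard deviation, not the variance, which is why the variance of 100 is converted with `sqrt()` first.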

Q: ________ states that if the sample size is large enough, the sampling distribution of the mean is approximately normally distributed, regardless of the distribution of the population, and that the mean of the sampling distribution will be the same as that of the population.
A: Central Limit Theorem
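
A small simulation can make the Central Limit Theorem concrete. The sketch below assumes an exponential population (strongly skewed), a sample size of 50, and 5000 replications; all three choices, and the seed, are arbitrary.

```r
set.seed(1)   # arbitrary seed so the simulation is repeatable

# Population: exponential with mean 1 and standard deviation 1 (right-skewed).
n            <- 50
sample_means <- replicate(5000, mean(rexp(n, rate = 1)))

mean(sample_means)   # close to the population mean, 1
sd(sample_means)     # close to sigma / sqrt(n) = 1 / sqrt(50)
```

Plotting `hist(sample_means)` should show a roughly bell-shaped histogram even though the population itself is far from normal.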

A probability density function characterizes outcomes of a continuous random variable.

Q: For a discrete random variable X, the probability distribution of the discrete outcomes is called a(n) ________ and is denoted by a mathematical function, f(x).
A: Probability mass function

Q: Probability may be defined from one of three perspectives. If the process that generates the outcomes is known, probabilities can be deduced from theoretical arguments; this is the ________ definition of probability.
A: Classical

Q: ________ is the probability of occurrence of one event A, given that another event B is known to be true or has already occurred.
A: Conditional probability

Sampling & Estimation

Question Answer

Q: ________ occurs when the sample does not adequately represent the target population.
A: Nonsampling error

Q: Sampling distribution of the mean will be approximately normally distributed ____.
A: If the population is normally distributed

Q: Which of the following is an example of a point estimate?
A: Mean

Q: Which of the following is a difference between interval estimates and point estimates?
A: Point estimates provide only a single value for a sample, while interval estimates provide a range of values.

Q: The formula ∑(x − x̄)^2 / (n − 1) is the estimator:
A: s^2 (the sample variance)

Q: As sample size ________, sampling error ________.
A: Increases; decreases
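
One way to see why sampling error shrinks as the sample grows is through the standard error of the mean, sigma / sqrt(n). The sketch below assumes a hypothetical population standard deviation of 15.

```r
sigma <- 15                         # hypothetical population standard deviation
n     <- c(25, 100, 400)

standard_error <- sigma / sqrt(n)   # standard error of the mean
data.frame(n, standard_error)       # 3.00, 1.50, 0.75
```

Quadrupling the sample size halves the standard error, so precision improves more slowly than the sample grows.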

Q: A ________ consists of all items of interest for a particular decision or investigation, for example, all individuals in the United States who do not own cell phones, all subscribers to Netflix, or all stockholders of Google.
A: Population

Q: A ________ is a range of values between which the value of the population parameter is believed to be, along with a probability that the interval correctly estimates the true (unknown) population parameter.
A: Confidence interval

Q: ____ is associated with the sampling distribution of a statistic while ___ is associated with the distribution of the random variable itself.
A: Confidence interval; prediction interval

Q: ______ refers to the ___ of the sampling distribution of the mean.
A: Standard error of the mean; standard deviation

Q: If the expected value of an estimator equals the population parameter it is intended to estimate, the estimator is said to be ________.
A: Unbiased

Q: The means of all possible samples of a fixed size n from some population will form a distribution which is known as the ________.
A: Sampling distribution of the mean

Q: The ________ is a family of probability distributions with a shape similar to the standard normal distribution.
A: t-distribution

Q: ________ states that if the sample size is large enough, the sampling distribution of the mean is approximately normally distributed, regardless of the distribution of the population, and that the mean of the sampling distribution will be the same as that of the population.
A: Central limit theorem

p̂ (the sample proportion) is an unbiased estimator of the population proportion.

Q: Which of the following is one of the purposes of sampling?
A: To obtain sufficient information to draw a valid inference about a population

Hypothesis Testing

Question Answer

Q: Which of the following is true about the power of the test?
i. It is the probability of not committing a type II error
ii. It should be high to allow us to make a valid conclusion
iii. Power of test is sensitive to sample size, where small sample sizes generally result in a lower value of 1 − β
iv. Power can be increased by taking larger samples
v. Large samples allow detection of small differences between sample statistics and population parameters with more accuracy
A: i, ii, iii, iv, v
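
The link between sample size and power can be illustrated with `power.t.test()` from the stats package. The effect size (half a standard deviation), the significance level, and the sample sizes below are assumptions chosen for the sketch.

```r
# Hypothetical setting: detect a true mean shift of 0.5 standard deviations.
sizes <- c(10, 30, 100)

power <- sapply(sizes, function(n) {
  power.t.test(n = n, delta = 0.5, sd = 1, sig.level = 0.05,
               type = "one.sample", alternative = "two.sided")$power
})

# Power is 1 - beta, so beta (the Type II error probability) falls as n grows.
data.frame(n = sizes, power = round(power, 3), beta = round(1 - power, 3))
```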

Q: Which of the following are steps in the hypothesis testing procedure?
i. Identifying the population parameter of interest and formulating the hypotheses to test
ii. Selecting a level of significance, which defines the risk of drawing an incorrect conclusion when the assumed hypothesis is actually true
iii. Determining a decision rule on which to base a conclusion
iv. Collecting data and calculating a test statistic
v. Applying the decision rule to the test statistic and drawing a conclusion
A: All

Q: The further out in the tail of a distribution our critical value falls, the greater the risk of making a:
A: Type II error

Q: Which of the following is a valid ANOVA hypothesis test?
A: H0: μ1 = μ2 = μ3; H1: At least one of the means is different from the others

Q: A manufacturer wishes to determine if the average profit from the sale of his product exceeds $6,710. Which of the following is the appropriate hypothesis test?
A: H0: population mean profit from sale ≤ $6,710 vs. H1: population mean profit from sale > $6,710
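
The upper-tailed test in the manufacturer question can be run with `t.test()`. In the sketch below only the $6,710 benchmark comes from the question; the profit figures are hypothetical.

```r
# Hypothetical profit figures; only the 6710 benchmark comes from the question.
profit <- c(6900, 7100, 6650, 7300, 6800, 7050, 6950, 7200)

# H0: mu <= 6710  vs.  H1: mu > 6710  (upper-tailed test)
result <- t.test(profit, mu = 6710, alternative = "greater")

result$statistic   # t = (x_bar - mu0) / (s / sqrt(n))
result$p.value     # reject H0 when this is below the chosen alpha
```

Because the population standard deviation is unknown here, the test uses the sample standard deviation and the t-distribution with n − 1 degrees of freedom.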

Q: The probability of making a Type I error, that is, P(rejecting H0 | H0 is true), is denoted by α and is called the:
A: Level of significance

Q: Which of the following is true about one-tailed and two-tailed tests?
A: For standard normal and t-distributions, which have a mean of zero, lower-tail critical values are negative and upper-tail critical values are positive

Q: Which of the following is the test statistic for a one-sample test for the mean when the population standard deviation is unknown?
A: t = (x̄ − μ0) / (s / √n), with n − 1 degrees of freedom

Q: Which of the following is a valid two-sample hypothesis test?
A: H0: μ1 = μ2 vs. H1: μ1 ≠ μ2

Q: Rejecting the null hypothesis when the null hypothesis is true would be incorrect. This type of error is called a ________.
A: Type I error

Q: Failure to reject the null hypothesis when the alternative hypothesis is true is known as ________.
A: Type II error

Q: Statistical inference focuses on drawing conclusions about populations from samples.
A: True

Q: Which of the following is a valid one-sample hypothesis test?
A: H0: population parameter = constant vs. H1: population parameter ≠ constant

Sampling
Confidence interval for the mean: x̄ ± z_{α/2} · (sd / √n), where sd is the standard deviation and n is the sample size
z_{α/2}: value of the standard normal random variable with an upper-tail area of α/2 (or a lower-tail area of 1 − α/2).
Example: if α = 0.05 (for a 95% confidence interval), then z_{α/2} = 1.96 (the 0.975 quantile);
Example: if α = 0.10 (for a 90% confidence interval), then z_{α/2} = 1.645 (the 0.95 quantile).
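
The interval formula above translates into a few lines of R. The sample mean, standard deviation, and sample size below are hypothetical; `qnorm()` supplies the critical value.

```r
# Hypothetical sample summary with a known population standard deviation.
x_bar <- 50
sigma <- 8
n     <- 64
alpha <- 0.05

z_crit <- qnorm(1 - alpha / 2)             # critical value for a 95% interval
margin <- z_crit * sigma / sqrt(n)         # margin of error
ci     <- c(lower = x_bar - margin, upper = x_bar + margin)

round(z_crit, 3)          # 1.96
round(qnorm(0.95), 3)     # 1.645, the value for a 90% interval
round(ci, 2)              # 48.04 51.96
```

When the population standard deviation is unknown, `qt()` with n − 1 degrees of freedom would take the place of `qnorm()`.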

(a random walk process {Xt} … Wt is a white-noise term


[0.559,0.629]
II AND IV are statistics
ii, iii, v. Correlation does not imply causation.

The paired t-test is for paired values, for example values that come from the same row of data, such as one person giving feedback on two different products.
The Welch two-sample t-test is used when the two samples are statistically independent, while the paired t-test is used when the data are in the form of matched pairs.
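
The difference between the two tests shows up directly in how `t.test()` is called. The ratings below are hypothetical: six people each rate product A and product B.

```r
# Hypothetical ratings: each of 6 people rates product A and product B.
rating_a <- c(7, 6, 8, 5, 9, 7)
rating_b <- c(8, 8, 9, 6, 9, 9)

paired <- t.test(rating_a, rating_b, paired = TRUE)   # matched pairs
welch  <- t.test(rating_a, rating_b)                  # var.equal = FALSE by default

paired$method   # "Paired t-test"
welch$method    # "Welch Two Sample t-test"

# The paired test is a one-sample t-test on the within-person differences.
all.equal(paired$statistic, t.test(rating_a - rating_b)$statistic)
```

Treating these ratings as independent samples would ignore the person-to-person pairing, which is why the two calls generally give different p-values on the same numbers.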
