Cheat Sheet - BT1101
Business analytics → Use of data, information technology, statistical analysis, quantitative methods, and mathematical or computer-based models to help managers gain improved insight about their business operations and make better, fact-based decisions
Descriptive analytics → Use of data to understand past and current business performance and make informed decisions
Predictive analytics → Predict the future by examining historical data, detecting patterns or relationships in these data, and then extrapolating these relationships forward in time
Prescriptive analytics → Identify the best alternatives to minimize or maximize some objective
Recognizing a problem → Defining the problem → Structuring the problem → Analyzing the problem → Interpreting results and making a decision → Implementing the solution
Big Data → Massive amount of business data from a wide variety of sources, much of which is available in real time and much of which is uncertain or unpredictable.
4 Vs → Volume, Variety, Velocity, and Veracity
Discrete metric → One that is derived from counting something (whole numbers, yes/no)
Continuous metric → Based on a continuous scale of measurement (dollars, length, weight, time, volume)
Question → Answer
What is the difference between validity and reliability? → Validity means that data correctly measure what they are supposed to measure, while reliability means that data are accurate and consistent.
Assume that you have been given a dataset containing all items that an aircraft component manufacturing company has purchased over the past 3 months. The data provide the supplier; order number; item number, description, and cost; quantity ordered; cost per order; and the order and arrival dates. How would you classify the following types of data? i) Supplier data; ii) order number; iii) item cost; iv) cost per order; and v) order date → i) Categorical; ii) Ordinal; iii) Ratio; iv) Ratio; and v) Interval
In what way does big data provide an opportunity for organizations to gain a competitive advantage?
i. If the data can be understood and analyzed effectively to make better business decisions.
ii. If organizations employ advanced analytics techniques such as data mining and text analysis.
iii. If the unstructured big data is transformed to structured and easily understandable information.
iv. If the volume of data input is controlled.
→ i, ii
Which of the following is the first phase in problem-solving for business analytics? → Recognizing the problem
According to IBM, which of the following are characteristics of big data? → Volume, variety, velocity, and veracity
A manager at Gampco Inc. wishes to know the impact a marketing program will have on sales. Which of the following business analytics will help the manager? → Predictive analytics
Which of the following is an example of a measure of continuous metrics? → Weight and volume of a sheet of steel
Typical questions that descriptive analytics help answer are: → How many and what types of complaints did we resolve? How much did we sell in each region? What was our revenue and profit last quarter? Which factory has the lowest productivity?
Assume that you are a business analyst for a bank. Your manager has asked you to compute the optimal staffing to achieve a given profitability constrained by a fixed cycle time. Which of the following would you apply? → Prescriptive analytics
Net profit, return on investment, market share, percentage of orders filled accurately, the proportion of defective parts produced, the number of inventory turns each month, and customer satisfaction are examples of: → Metrics
Which of the following are part of the structuring the problem phase?
i. Stating goals and objectives
ii. Characterizing the possible decisions
iii. Identifying any constraints or restrictions
iv. Developing a formal model
v. Communication of the problem to management
→ i, ii, iii, iv
Which of the following are challenges in the application of business analytics?
i. Lack of understanding of how to use analytics
ii. Insufficient analytical skills
iii. Difficulty in getting good data and sharing information
iv. Data privacy, security, and compliance
v. Building the right governance and organizational structure
→ All of the above
Which of the following is the most appropriate as an example of interval data? → Calendar month (e.g., 1, 2, 3, 4, ..., 12)
Which one of the following is most aligned with a value-generation approach for BA? → Consider how analytics can bring value to the organization as the first step in an organization’s analytics strategy
Data Visualization, Tabulations & Frequencies
Question → Answer
type = " " in line charts can take the following values except: → (valid values of type are listed below)
  type    description
  p       points
  l       lines
  o       overplotted points and lines
  b, c    points (empty if "c") joined by lines
  s, S    stair steps
  h       histogram-like vertical lines
  n       does not produce any points or lines
Which of the following is useful for displaying data over time? → Line charts
We may express the frequencies as a fraction, or proportion, of the total; this is called the: → Relative frequency
What does the parameter 'ylim' mean when using the plot function in R? → ylim gives the limits of the values of y used for plotting
A tabular summary of cumulative relative frequencies is called a: → Cumulative relative frequency distribution
Which of the following parameters allows you to create a clustered bar chart? → beside = TRUE
What does the output of this code: quantile(cars$mpg) mean? → It breaks the data into four parts. The 25th percentile is called the first quartile, Q1; the 50th percentile is called the second quartile, Q2; the 75th percentile is called the third quartile, Q3; and the 100th percentile is the fourth quartile.
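A minimal sketch of what quantile() returns, using a made-up vector rather than the course's cars data (the vector x is an assumption for illustration). With no probs argument the output also includes the 0% point, which is the minimum.

```r
# Hypothetical data: nine evenly spaced values, so the quartiles are easy to check by eye.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# With no 'probs' argument, quantile() returns the 0%, 25%, 50%, 75% and 100% points.
q <- quantile(x)
print(q)

# Individual quartiles can be pulled out by name.
q1 <- q[["25%"]]  # first quartile
q2 <- q[["50%"]]  # second quartile (the median)
q3 <- q[["75%"]]  # third quartile
```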
Which of the following codes adds a legend at the top right of a clustered bar chart? → legend("topright", MS, cex=0.8, fill=colors)
Which of the following is true about a stacked bar chart? → To create a stacked chart, the 'beside' parameter does not need to be included because 'beside' is FALSE by default
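The stacked and clustered variants, plus the legend call, might look like the sketch below. The counts matrix and the colour names are invented for illustration; only barplot() and legend() from base R are used.

```r
# Hypothetical counts: rows are two groups, columns are three categories.
counts <- matrix(c(5, 3, 4, 6, 2, 7), nrow = 2,
                 dimnames = list(c("Group A", "Group B"), c("X", "Y", "Z")))
fill_colors <- c("steelblue", "orange")

# Stacked chart: 'beside' is FALSE by default, so it can be left out.
stacked_mid <- barplot(counts, col = fill_colors, main = "Stacked")

# Clustered chart: beside = TRUE draws the groups side by side.
clustered_mid <- barplot(counts, beside = TRUE, col = fill_colors, main = "Clustered")

# Legend at the top right; cex = 0.8 scales the legend text to 80% of the default size.
legend("topright", legend = rownames(counts), cex = 0.8, fill = fill_colors)
```

barplot() returns the bar midpoints: one value per column for the stacked chart, and one value per bar (a matrix) for the clustered chart.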
A graphical depiction of a frequency distribution for numerical data in the form of a column chart is called: → Histogram
In a _____ the range of values of a numeric variable of interest is usually laid out on the horizontal scale (x-axis). The scale is divided into sections called classes. The vertical scale (y-axis) shows how many observations fall into each class. → Histogram
Horizontal and vertical bar plots are useful for the following except? → Displaying data over time
In creating histograms in R using the 'hist' function, the ____ parameter is used to specify the width of each bar. → breaks
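As a small illustration of the breaks parameter, the sketch below passes explicit class boundaries to hist(). The scores vector is made up for the example.

```r
# Hypothetical scores between 0 and 100.
scores <- c(12, 25, 37, 41, 48, 53, 55, 62, 68, 71, 77, 84, 95)

# 'breaks' given as a vector sets the class boundaries, and hence the width of each bar.
h <- hist(scores, breaks = seq(0, 100, by = 20), main = "Scores", xlab = "Score")

# The returned object stores the boundaries and the number of observations in each class.
print(h$breaks)
print(h$counts)
```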
A ___ represents the proportion of the total number of observations that fall at or below the upper limit of each group. → Cumulative relative frequency
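One way to compute frequencies, relative frequencies, and cumulative relative frequencies in base R is sketched below. The ratings data are invented; the factor levels are set explicitly so the cumulative sum runs in a meaningful order.

```r
# Hypothetical ordered responses.
ratings <- c("low", "high", "medium", "low", "medium",
             "medium", "high", "low", "medium", "low")
ratings <- factor(ratings, levels = c("low", "medium", "high"))

freq <- table(ratings)            # counts per group
rel_freq <- freq / sum(freq)      # proportion of the total in each group
cum_rel_freq <- cumsum(rel_freq)  # proportion at or below each group

print(rbind(freq, rel_freq, cum_rel_freq))
```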
When using the barplot function, what does the parameter 'cex' mean? → It is a number indicating the amount by which plotting text and symbols should be scaled relative to the default: 1 = default, 1.5 is 50% larger, 0.5 is 50% smaller, etc.
Which of the following is true about the 'names.arg' parameter when using the barplot function? → names.arg = (character vector) is used to label the bars
What does the 'table' function accomplish? → Uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels.
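A short sketch of table() with two cross-classifying factors follows. The orders data frame and its column names are assumptions made up for this example.

```r
# Hypothetical order records with two categorical columns.
orders <- data.frame(
  region  = c("North", "South", "North", "South", "North"),
  product = c("A", "A", "B", "B", "A")
)

# Rows are regions, columns are products, cells are counts of each combination.
tab <- table(orders$region, orders$product)
print(tab)
```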
legend(x, y=NULL, legend, fill, col, bg). What do x and y represent? → x and y are coordinates used to position the legend
Descriptive Statistics - Statistical Measures
Question → Answer
________ refers to the degree of variation in the data, that is, the numerical spread (or compactness) of the data. → Dispersion
The mean can be affected by outliers. What are outliers? → Observations that are radically different from the rest, which pull the value of the mean toward these values.
The ________ measures the degree of asymmetry of observations around the mean. → Coefficient of skewness
According to the empirical rule, the proportion of normally distributed data that falls within 2 standard deviations of its mean is about ______. → 95%
________ states that for any set of data, the proportion of values that lie within k standard deviations (k > 1) of the mean is at least 1 − 1/k². → Chebyshev’s theorem
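To see how the two rules compare, the sketch below evaluates the Chebyshev lower bound 1 − 1/k² next to the exact share of a normal distribution within k standard deviations. The choice of k = 2 and k = 3 is only for illustration.

```r
k <- c(2, 3)

# Chebyshev: at least this proportion lies within k standard deviations, for any data.
chebyshev_bound <- 1 - 1 / k^2

# Normal distribution: the exact proportion within k standard deviations of the mean.
normal_share <- pnorm(k) - pnorm(-k)

print(data.frame(k, chebyshev_bound, normal_share))
```

For k = 2 the Chebyshev bound is 0.75, while the normal share is about 0.95, which is the figure quoted by the empirical rule.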
The ________ is the difference between the maximum value and the minimum value in the data set. → Range
The ________ provides a relative measure of the dispersion in data relative to the mean. → Coefficient of variation
Which of the following values of the coefficients of variation of stocks represents the least risky stock? → 0.005
An "outlier" in a data set is strictly defined by whether: → it lies more than 1.5 × IQR to the left of the first quartile or to the right of the third quartile
A z-score of 1 means that ______. → the observation is 1.0 standard deviation to the right of the mean
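A rough sketch of both ideas, z-scores and the 1.5 × IQR rule, on a made-up vector. The fences are taken from the first and third quartiles, which is the usual boxplot convention and an assumption here rather than something the sheet spells out.

```r
# Hypothetical data with one value far from the rest.
x <- c(10, 12, 13, 14, 15, 16, 18, 40)

# z-score: how many standard deviations each observation lies from the mean.
z <- (x - mean(x)) / sd(x)

# 1.5 * IQR rule: flag values beyond the fences built from Q1 and Q3.
q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
iqr <- IQR(x)
is_outlier <- x < (q1 - 1.5 * iqr) | x > (q3 + 1.5 * iqr)

print(round(z, 2))
print(x[is_outlier])
```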
________ is a measure of the linear relationship between two variables, X and Y, which does not depend on the units of measurement. → Correlation
The measure of location that specifies the middle value when the data are arranged from least to greatest is the: → Median
The ________ is the square root of the variance. → Standard deviation
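Which of the following is TRUE of covariance, between two variables, when one of the deviations from the mean is positive and the other is negative? → The product of the two deviations is negative, so that observation contributes a negative term to the covariance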
The linear association between two variables, X & Y, can be measured by ________ → Pearson’s correlation coefficient
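The sketch below computes covariance and Pearson's correlation on invented height and weight values, and shows that rescaling the units leaves the correlation unchanged. The variable names and numbers are hypothetical.

```r
# Hypothetical paired measurements.
height_cm <- c(150, 160, 165, 172, 180)
weight_kg <- c(52, 58, 63, 70, 79)

# Covariance depends on the units of measurement.
cov_xy <- cov(height_cm, weight_kg)

# cor() computes Pearson's correlation coefficient by default.
r <- cor(height_cm, weight_kg)

# Changing units (metres, grams) changes the covariance but not the correlation.
r_rescaled <- cor(height_cm / 100, weight_kg * 1000)

print(c(covariance = cov_xy, correlation = r, rescaled = r_rescaled))
```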
________ refers to the peakedness (i.e., high, narrow) or flatness (i.e., short, flat-topped) of a histogram. → Kurtosis
Processes A and B fill up milk cartons with a standard deviation of 19.28 ml, while Process C fills up milk cartons with a standard deviation of 7.58 ml. Which process(es) should a milk packaging company use? → C; the lower the SD the better, since the process is more reliable
What is the variance of the following dataset: 10, 10, 10, 10, 10, 10, 10, 10, 10? → 0 (SD = 0, variance = 0)
Probability & Data Modeling
Question → Answer
Which of the following is a difference between interval estimates and point estimates? → Point estimates provide only a single value for a sample, while interval estimates provide a range of values.
A ________ is one that provides a range for predicting the value of a new observation from the same population. → Prediction interval
A(n) ________ random variable is one for which the number of possible outcomes can be counted. → Discrete
The distribution of students’ examination scores follows a normal distribution with a mean of 78 and a variance of 100. What is the probability that a student’s examination score will be at least 80? → 0.4207
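The 0.4207 figure can be checked with pnorm(). Note that pnorm() takes the standard deviation, so the variance of 100 has to be converted to sd = 10 first.

```r
mu <- 78
sigma <- sqrt(100)  # pnorm() expects the standard deviation, not the variance

# Upper-tail probability P(X >= 80).
p_at_least_80 <- pnorm(80, mean = mu, sd = sigma, lower.tail = FALSE)
print(round(p_at_least_80, 4))
```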
What is the confidence coefficient when the level of significance is 0.05? → 0.95
While rolling two dice, what is the probability of rolling a sum of 7 or more? → 7/12
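The 7/12 answer follows from counting outcomes: 21 of the 36 equally likely pairs have a sum of at least 7. A brief enumeration, as a sketch:

```r
# All 36 equally likely outcomes of two dice, as a 6 x 6 matrix of sums.
sums <- outer(1:6, 1:6, "+")

# Classical probability: favourable outcomes divided by total outcomes.
favourable <- sum(sums >= 7)
p_seven_or_more <- favourable / length(sums)

print(c(favourable = favourable, total = length(sums), probability = p_seven_or_more))
```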
A ________ is a range of values between which the value of the population parameter is believed to be, along with a probability that the interval correctly estimates the true (unknown) population parameter. → Confidence interval
X is a random variable that is normally distributed with a mean of 60 and a standard deviation of 15. Which of the following is the R code that computes P(X>75)? → pnorm(75,60,15,lower.tail=FALSE)
A ________ is the characterization of the possible values that a random variable may assume along with the probability of assuming these values. → Probability distribution
Which of the following is true about the relative frequency definition of probability? → It is based on empirical data
The collection of all possible outcomes of an experiment is called the ________. → Sample space
The distribution of students’ examination scores follows a normal distribution with a mean of 78 and a variance of 100. Find x such that the probability of obtaining a score greater than x is 0.1587. → 88
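Going from a tail probability back to a score is the inverse problem, which qnorm() handles. A minimal sketch using the numbers from the question above:

```r
# Find x with P(X > x) = 0.1587 for X ~ Normal(mean = 78, sd = 10).
x_cut <- qnorm(0.1587, mean = 78, sd = 10, lower.tail = FALSE)
print(round(x_cut, 1))

# Check: plugging x_cut back into pnorm() recovers the upper-tail probability.
p_check <- pnorm(x_cut, mean = 78, sd = 10, lower.tail = FALSE)
```

An upper-tail area of 0.1587 corresponds to one standard deviation above the mean, which is why the answer is 78 + 10 = 88.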
B is a random variable that follows the normal distribution with a mean of 300 and a standard deviation of 100. What is the R code that computes P(250<=B<=400)? → pnorm(400,300,100) - pnorm(250,300,100)
________ states that if the sample size is large enough, the sampling distribution of the mean is approximately normally distributed, regardless of the distribution of the population, and that the mean of the sampling distribution will be the same as that of the population. → Central limit theorem
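A small simulation can make the theorem concrete. The sketch below draws repeated samples from an exponential population (clearly not normal) and looks at the distribution of the sample means. The seed, sample size, and number of replications are arbitrary choices for illustration.

```r
set.seed(1101)  # arbitrary seed so the run is reproducible

n <- 50         # size of each sample
reps <- 5000    # number of samples drawn

# Population: exponential with rate 1, so population mean = 1 and population sd = 1.
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

# The mean of the sample means should sit close to the population mean,
# and their spread close to the standard error 1 / sqrt(n).
print(c(mean_of_means = mean(sample_means),
        sd_of_means = sd(sample_means),
        theoretical_se = 1 / sqrt(n)))

# The histogram of sample means looks roughly bell-shaped despite the skewed population.
hist(sample_means, breaks = 30, main = "Sampling distribution of the mean")
```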
For a discrete random variable X, the probability distribution of the discrete outcomes is called a(n) ________ and is denoted by a mathematical function, f(x). → Probability mass function
Probability may be defined from one of three perspectives. If the process that generates the outcomes is known, probabilities can be deduced from theoretical arguments; this is the ________ definition of probability. → Classical
________ is the probability of occurrence of one event A, given that another event B is known to be true or has already occurred. → Conditional probability
Sampling & Estimation
Question → Answer
________ occurs when the sample does not adequately represent the target population. → Nonsampling error
The sampling distribution of the mean will be approximately normally distributed ____. → If the population is normally distributed
Which of the following is a difference between interval estimates and point estimates? → Point estimates provide only a single value for a sample, while interval estimates provide a range of values.
A ________ is a range of values between which the value of the population parameter is believed to be, along with a probability that the interval correctly estimates the true (unknown) population parameter. → Confidence interval
____ is associated with the sampling distribution of a statistic, while ___ is associated with the distribution of the random variable itself. → Confidence interval; prediction interval
______ refers to the ___ of the sampling distribution of the mean. → Standard error of the mean; standard deviation
If the expected value of an estimator equals the population parameter it is intended to estimate, the estimator is said to be ________. → Unbiased
The means of all possible samples of a fixed size n from some population will form a distribution which is known as the ________. → Sampling distribution of the mean
The ________ is a family of probability distributions with a shape similar to the standard normal distribution. → t-distribution
________ states that if the sample size is large enough, the sampling distribution of the mean is approximately normally distributed, regardless of the distribution of the population, and that the mean of the sampling distribution will be the same as that of the population. → Central limit theorem
Hypothesis Testing
Question → Answer
Which of the following is true about the power of the test?
i. It is the probability of not committing a Type II error
ii. It should be high to allow us to make a valid conclusion
iii. Power of the test is sensitive to sample size; small sample sizes generally result in a lower value of 1 − β
iv. Power can be increased by taking larger samples
v. Large samples allow detection of small differences between sample statistics and population parameters with more accuracy
→ i, ii, iii, iv, v
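The link between sample size and power can be illustrated with power.t.test() from base R's stats package. The effect size (delta = 0.5), standard deviation, and the two sample sizes below are assumed values chosen only to show the direction of the effect.

```r
# Power (1 - beta) of a one-sample, two-sided t-test to detect a true difference of 0.5
# when sd = 1 and alpha = 0.05, at two hypothetical sample sizes.
power_small <- power.t.test(n = 20, delta = 0.5, sd = 1,
                            sig.level = 0.05, type = "one.sample")$power
power_large <- power.t.test(n = 100, delta = 0.5, sd = 1,
                            sig.level = 0.05, type = "one.sample")$power

# The larger sample gives the higher power.
print(c(n20 = power_small, n100 = power_large))
```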
Which of the following are steps in the hypothesis testing procedure?
i. Identifying the population parameter of interest and formulating the hypotheses to test
ii. Selecting a level of significance, which defines the risk of drawing an incorrect conclusion when the assumed hypothesis is actually true
iii. Determining a decision rule on which to base a conclusion
iv. Collecting data and calculating a test statistic
v. Applying the decision rule to the test statistic and drawing a conclusion
→ All
The further out in the tail of a distribution our critical value falls, the greater the risk of making a: → Type II error
Which of the following is a valid ANOVA hypothesis test? → H0: μ1 = μ2 = μ3; H1: At least one of the means is different from the others
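A one-way ANOVA of this form can be run with aov(). The three groups and their values below are invented so that the group means clearly differ; nothing here comes from the course data.

```r
# Hypothetical measurements for three groups of four observations each.
yield <- c(20, 22, 21, 23,   # group g1
           30, 31, 29, 32,   # group g2
           25, 26, 24, 27)   # group g3
group <- factor(rep(c("g1", "g2", "g3"), each = 4))

# H0: mu1 = mu2 = mu3; H1: at least one mean differs from the others.
fit <- aov(yield ~ group)
print(summary(fit))

# Extract the p-value of the F test from the ANOVA table.
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
```

A small p-value leads to rejecting H0, but it does not say which group means differ.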
A manufacturer wishes to determine if the average profit from the sale of his product exceeds $6,710. Which of the following is the appropriate hypothesis test? → H0: population mean profit from sale ≤ $6,710 vs. H1: population mean profit from sale > $6,710
The probability of making a Type I error, that is, P(rejecting H0 | H0 is true), is denoted by α and is called the: → Level of significance
Which of the following is true about one-tailed and two-tailed tests? → For standard normal and t-distributions, which have a mean of zero, lower-tail critical values are negative and upper-tail critical values are positive
Which of the following is the test statistic for a one-sample test for mean when the population standard deviation is unknown? → t = (x̄ − μ0) / (s / √n), with n − 1 degrees of freedom
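When the population standard deviation is unknown, the sample standard deviation s takes its place and the statistic follows a t-distribution with n − 1 degrees of freedom. The sketch below computes it by hand and compares it with t.test(); the profit figures are made up, and 6710 is borrowed from the manufacturer question only as an example value of μ0.

```r
# Hypothetical sample of profits.
profit <- c(6800, 6650, 6900, 6720, 6780, 6850, 6690, 6810)
mu0 <- 6710
n <- length(profit)

# Test statistic by hand: (sample mean - hypothesised mean) / (s / sqrt(n)).
t_manual <- (mean(profit) - mu0) / (sd(profit) / sqrt(n))

# The same upper-tail test via t.test(): H0: mu <= 6710 vs. H1: mu > 6710.
res <- t.test(profit, mu = mu0, alternative = "greater")

print(c(t_manual = t_manual, t_from_t.test = unname(res$statistic),
        df = unname(res$parameter), p_value = res$p.value))
```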
Which of the following is a valid two-sample hypothesis test? → H0: μ1 = μ2 vs. H1: μ1 ≠ μ2
Rejecting the null hypothesis when the null hypothesis is true would be incorrect. This type of error is called a ________. → Type I error
Failure to reject the null hypothesis when the alternative hypothesis is true is known as ________. → Type II error
Statistical inference focuses on drawing conclusions about populations from samples. → True
Which of the following is a valid one-sample hypothesis test? → H0: population parameter = constant vs. H1: population parameter ≠ constant
Sampling
Confidence interval for the mean: x̄ ± zα/2 · (σ / √n), where x̄ is the sample mean, σ is the standard deviation, and n is the sample size.
zα/2: value of a standard normal random variable for an upper-tail area of α/2 (or a lower-tail area of 1 − α/2).
Example: if α = 0.05 (for a 95% confidence interval), then zα/2 = 1.96 (the 0.975 quantile).
Example: if α = 0.10 (for a 90% confidence interval), then zα/2 = 1.645 (the 0.95 quantile).
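The critical values and the interval itself can be reproduced with qnorm(). The helper name ci_mean and the example inputs (x̄ = 100, σ = 15, n = 36) are hypothetical, and the sketch assumes the standard deviation is known so that a z-based interval applies.

```r
# Critical values: the lower-tail quantile at 1 - alpha/2.
z_95 <- qnorm(1 - 0.05 / 2)  # about 1.96
z_90 <- qnorm(1 - 0.10 / 2)  # about 1.645

# Hypothetical helper: z-based confidence interval for a mean with known sigma.
ci_mean <- function(xbar, sigma, n, alpha = 0.05) {
  half_width <- qnorm(1 - alpha / 2) * sigma / sqrt(n)
  c(lower = xbar - half_width, upper = xbar + half_width)
}

ci <- ci_mean(xbar = 100, sigma = 15, n = 36, alpha = 0.05)
print(round(c(z_95 = z_95, z_90 = z_90), 3))
print(round(ci, 2))
```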
The paired t-test is for paired values, for example values that come from the same row of data, such as one person giving feedback on two different products.
The Welch two-sample t-test is used when the data of the two samples are statistically independent, while the paired t-test is used when the data are in the form of matched pairs.
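In R, both tests go through t.test(); the paired argument selects between them, and Welch's test is what t.test() runs by default for two samples. The ratings below are invented: six people each scoring two products.

```r
# Hypothetical ratings: each position is the same person rating both products.
product_a <- c(7, 8, 6, 9, 7, 8)
product_b <- c(6, 7, 6, 7, 5, 7)

# Paired t-test: works on the within-person differences (matched pairs).
paired_res <- t.test(product_a, product_b, paired = TRUE)

# Welch two-sample t-test: treats the two samples as independent
# and does not assume equal variances (the default, var.equal = FALSE).
welch_res <- t.test(product_a, product_b)

print(paired_res$method)
print(welch_res$method)
```

The paired test has n − 1 degrees of freedom for n pairs, while Welch's test uses an approximate, usually non-integer, number of degrees of freedom.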