Machine Learning Interview Questions
• Where pi is the ith prediction, ai the ith actual
response, and log(b) the natural logarithm of b.
• Weighted Mean Absolute Error
The weighted average of absolute errors. MAE and RMSE assume
that each prediction provides equally precise information
about the error variation, i.e. that the standard deviation of
the error term is constant over all the predictions. Example:
recommender systems, where errors on recent products can be
weighted differently from errors on past products (see the sketch below).
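A minimal R sketch of these regression metrics, assuming vectors of actual values, predictions and non-negative weights (all names and numbers below are illustrative):
actual <- c(3.2, 4.1, 5.0, 6.3)
pred   <- c(3.0, 4.5, 4.8, 6.0)
w      <- c(1, 1, 2, 2)                          # e.g. heavier weights on recent items
mae    <- mean(abs(pred - actual))               # Mean Absolute Error
rmse   <- sqrt(mean((pred - actual)^2))          # Root Mean Squared Error
wmae   <- sum(w * abs(pred - actual)) / sum(w)   # Weighted Mean Absolute Error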
Common metrics in classification:
• Recall / Sensitivity / True positive rate:
High when FN is low. Sensitive to unbalanced classes.
Sensitivity = TP / (TP + FN)
• Accuracy
High when FP and FN are low. Sensitive to unbalanced classes
(see “Accuracy paradox”)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
• ROC / AUC
ROC is a graphical plot that illustrates the performance of a
binary classifier (Sensitivity vs 1 − Specificity, or
Sensitivity vs Specificity). It is not sensitive to
unbalanced classes.
AUC is the area under the ROC curve. A perfect classifier has
AUC = 1 and its ROC curve passes through the point (0,1): 100%
sensitivity (no FN) and 100% specificity (no FP)
• Logarithmic loss
LogLoss = −(1/N) Σ [yi·log(pi) + (1 − yi)·log(1 − pi)]
It punishes confident but wrong predictions without bound (the
log term goes to −∞ as the predicted probability of the true
class goes to 0): it’s better to be somewhat wrong than
emphatically wrong!
• Misclassification Rate:
Misclassification = (FP + FN) / (TP + TN + FP + FN)
• F1 Score:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
(see the sketch below for these metrics computed from a confusion matrix)
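A minimal R sketch computing these classification metrics from a 2x2 confusion matrix (the counts are made up for illustration):
TP <- 80; FP <- 20; FN <- 10; TN <- 90           # illustrative counts
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)                      # sensitivity / true positive rate
accuracy  <- (TP + TN) / (TP + TN + FP + FN)
misclass  <- (FP + FN) / (TP + TN + FP + FN)
f1        <- 2 * precision * recall / (precision + recall)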
4. Explain what regularization is and why it is useful. What are the benefits and
drawbacks of specific methods, such as ridge regression and lasso?
Regularization adds a penalty on the size of the coefficients to the
loss function in order to reduce overfitting.
Ridge regression adds an L2 penalty (λ Σ βj²) to the least-squares
objective: it shrinks coefficients towards zero and handles correlated
predictors well, but keeps all variables in the model. The lasso uses
an L1 penalty (λ Σ |βj|), which can set some coefficients exactly to
zero and therefore also performs variable selection, at the cost of
being less stable when predictors are highly correlated. A sketch with
glmnet follows the validation notes below.
Validation using R²:
- % of variance retained by the model
- Issue: R² always increases when adding variables
Analysis of residuals:
- Heteroskedasticity (relation between the variance of the model
errors and the size of an independent variable’s observations)
- Scatter plots residuals Vs predictors
- Normality of errors
- Etc. : diagnostic plots
Out-of-sample evaluation: with cross-validation
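A minimal sketch of ridge and lasso fits, assuming the glmnet package is available (the predictor matrix and response below are illustrative):
library(glmnet)
x <- as.matrix(mtcars[, c("disp", "hp", "wt", "qsec")])
y <- mtcars$mpg
ridge <- cv.glmnet(x, y, alpha = 0)     # alpha = 0 -> ridge (L2 penalty)
lasso <- cv.glmnet(x, y, alpha = 1)     # alpha = 1 -> lasso (L1 penalty)
coef(lasso, s = "lambda.min")           # some lasso coefficients may be exactly zero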
7. Explain what precision and recall are. How do they relate to the ROC curve?
See question 3, “How to define/select metrics? Do you know compound
metrics?”.
Precision/Recall curves are typically preferred over ROC curves when
the classes are highly unbalanced.
8. What is latent semantic indexing? What is it used for? What are the specific
limitations of the method?
LSI applies a singular value decomposition (SVD) to the term-document
matrix to uncover latent “concepts” relating terms and documents.
Used for: document retrieval and similarity, handling synonymy,
grouping documents by topic.
Limitations: the latent dimensions are hard to interpret, the
bag-of-words representation ignores word order, and polysemy is
handled poorly.
9. Explain what resampling methods are and why they are useful
• Repeatedly drawing samples from a training set and refitting a
model of interest on each sample in order to obtain additional
information about the fitted model
• example: repeatedly draw different samples from the training
data, fit a linear regression to each new sample, and then
examine the extent to which the resulting fits differ
• most common are: cross-validation and the bootstrap
• cross-validation: random sampling with no replacement
• bootstrap: random sampling with replacement
• cross-validation: evaluating model performance, model
selection (select the appropriate level of flexibility)
• bootstrap: mostly used to quantify the uncertainty associated
with a given estimator or statistical learning method
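A minimal R sketch of both ideas on a built-in dataset (model and dataset are purely illustrative):
# Bootstrap: quantify the uncertainty of a coefficient estimate
set.seed(1)
boot_coefs <- replicate(1000, {
  idx <- sample(nrow(mtcars), replace = TRUE)
  coef(lm(mpg ~ wt, data = mtcars[idx, ]))["wt"]
})
sd(boot_coefs)                           # bootstrap standard error of the slope
# 5-fold cross-validation: estimate out-of-sample error
folds <- sample(rep(1:5, length.out = nrow(mtcars)))
cv_mse <- sapply(1:5, function(k) {
  fit <- lm(mpg ~ wt, data = mtcars[folds != k, ])
  mean((mtcars$mpg[folds == k] - predict(fit, mtcars[folds == k, ]))^2)
})
mean(cv_mse)                             # cross-validated mean squared error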
10. What is principal component analysis? Explain the sort of problems you would
use PCA for. Also explain its limitations as a method
Statistical method that uses an orthogonal transformation to convert
a set of observations of correlated variables into a set of values
of linearly uncorrelated variables called principal components.
Reduce the data from n to k dimensions: find the k vectors onto
which to project the data so as to minimize the projection error.
Algorithm:
1) Preprocessing (standardization): PCA is sensitive to the relative
scaling of the original variables
2) Compute covariance matrix Σ
3) Compute eigenvectors of Σ
4) Choose k principal components so as to retain x% of the variance
(typically x=99)
Applications:
1) Compression
- Reduce disk/memory needed to store data
- Speed up a learning algorithm. Warning: the mapping should be defined
on the training set only and then applied to the test set
Limitations:
- PCA is not scale invariant
- The directions with largest variance are assumed to be of most
interest
- Only considers orthogonal transformations (rotations) of the
original variables
- PCA is only based on the mean vector and covariance matrix. Some
distributions (multivariate normal) are characterized by this but
some are not
- If the variables are correlated, PCA can achieve dimension
reduction. If not, PCA just orders them according to their variances
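A minimal R sketch with the built-in prcomp function (the dataset is illustrative):
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)   # standardize first: PCA is not scale invariant
summary(pca)              # proportion of variance explained by each component
scores <- pca$x[, 1:2]    # projection of the data onto the first k = 2 components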
11. Explain what a false positive and a false negative are. Why is it important
to distinguish these from each other? Provide examples of when false positives
are more important than false negatives, when false negatives are more important
than false positives, and when these two types of errors are equally important
• False positive
Improperly reporting the presence of a condition when it’s not
in reality. Example: HIV positive test when the patient is
actually HIV negative
• False negative
Improperly reporting the absence of a condition when in
reality it is present. Example: not detecting a disease when
the patient actually has the disease.
When false positives are more important than false negatives:
- In a non-contagious disease, where treatment delay doesn’t have
any long-term consequences but the treatment itself is grueling
- HIV test: psychological impact
When false negatives are more important than false positives:
- If early treatment is important for good outcomes
- In quality control: a defective item slips through the cracks!
- Software testing: a test designed to catch a virus fails to detect it
Major tasks:
- Machine translation
- Question answering: “what’s the capital of Canada?”
- Sentiment analysis: extract subjective information from a set of
documents, identify trends or public opinions in the social media
- Information retrieval
15. When would you use random forests Vs SVM and why?
16. How do you take millions of users with 100’s transactions each, amongst
10k’s of products and group the users together in meaningful segments?
• Transactions by date
• Count of customers Vs number of items bought
• Total items Vs total basket per customer
• Total items Vs total basket per area
Counts:
Distributions:
• PCA
18. How do you test whether a new credit risk scoring model works?
Kolmogorov-Smirnov test:
- Non-parametric test
- Compare a sample with a reference probability distribution or
compare two samples
- Quantifies a distance between the empirical distribution function
of the sample and the cumulative distribution function of the
reference distribution
- Or between the empirical distribution functions of two samples
- Null hypothesis (two-samples test): samples are drawn from the
same distribution
- Can be modified as a goodness of fit test
- In our case: compare the cumulative percentage of goods with the
cumulative percentage of bads across the score distribution
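A minimal R sketch of the two-sample KS test on scores of good and bad accounts (the scores are simulated, purely for illustration):
set.seed(1)
scores_good <- rnorm(500, mean = 650, sd = 50)   # simulated scores of good payers
scores_bad  <- rnorm(200, mean = 600, sd = 50)   # simulated scores of bad payers
ks.test(scores_good, scores_bad)                 # D = max distance between the two empirical CDFs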
20. What is better: good data or good models? And how do you define “good”?
Is there a universal good model? Are there any models that are definitely not so
good?
21. Why is naive Bayes so bad? How would you improve a spam detection
algorithm that uses naive Bayes?
22. What are the drawbacks of linear model? Are you familiar with alternatives
(Lasso, ridge regression)?
23. Do you think 50 small decision trees are better than a large one? Why?
• Yes!
• More robust model (an ensemble of weak learners combines into
a strong learner)
• Better to improve a model by taking many small steps than by
taking fewer large steps
• If one tree is erroneous, it can be corrected by the
following ones
• Less prone to overfitting
24. Why is mean square error a bad measure of model performance? What
would you suggest instead?
25. How can you prove that one improvement you’ve brought to an algorithm is
really an improvement over not doing anything? Are you familiar with A/B
testing?
Example with linear regression:
- F-statistic (ANOVA)
26. What do you think about the idea of injecting noise in your data set to test
the sensitivity of your models?
27. Do you know / used data reduction techniques other than PCA? What do you
think of step-wise regression? What kind of step-wise techniques are you familiar
with?
data reduction techniques other than PCA?:
Partial least squares: like PCR (principal component regression) but
chooses the principal components in a supervised way. Gives higher
weights to variables that are most strongly related to the response
step-wise regression?
- the choice of predictive variables is carried out using a
systematic procedure
- Usually, it takes the form of a sequence of F-tests, t-tests,
adjusted R-squared, AIC, BIC
- at any given step, the model is fit using unconstrained least
squares
- can get stuck in local optima
- Better: Lasso
step-wise techniques:
- Forward selection: begin with no variables, adding them when they
improve a chosen model comparison criterion
- Backward selection: begin with all the variables, removing them
when doing so improves a chosen model comparison criterion
(a sketch with R’s step() follows below)
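A minimal sketch of forward and backward stepwise selection with base R's step() (dataset and model are illustrative):
full <- lm(mpg ~ ., data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)
backward <- step(full, direction = "backward", trace = 0)                        # drop variables by AIC
forward  <- step(null, scope = formula(full), direction = "forward", trace = 0)  # add variables by AIC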
Full data better than reduced data:
Example 1: If all the components have a high variance, which
components can be discarded with a guarantee that there will be no
significant loss of information?
Example 2 (classification):
- One has 2 classes; the within class variance is very high as
compared to between class variance
- PCA might discard the very information that separates the two
classes
Reduced data better than a sample:
- When the number of variables is high relative to the number of
observations
28. How would you define and measure the predictive power of a metric?
30. What are the assumptions required for linear regression? What if some of
these assumptions are violated?
Predict y from x: 1) + 2)
Estimate the standard error of predictors: 1) + 2) + 3)
Get an unbiased estimation of y from x: 1) + 2) + 3) + 4)
Make probability statements, hypothesis testing involving slope and
correlation, confidence intervals: 1) + 2) + 3) + 4) + 5)
Note:
- Contrary to a common misconception, linear regression does not
assume anything about the distributions of x and y
- It only makes assumptions about the distribution of the residuals
- And this is only needed for statistical tests to be valid
- Regression can be applied for many purposes, even if the errors are
not normally distributed
31. What is collinearity and what to do with it? How to remove multicollinearity?
Collinearity/Multicollinearity:
- In multiple regression: when two or more variables are highly
correlated
- They provide redundant information
- In case of perfect multicollinearity the OLS estimator does not
exist: the design matrix isn’t invertible
- It doesn’t affect the model as a whole, doesn’t bias results
- The standard errors of the regression coefficients of the affected
variables tend to be large
- The test of the hypothesis that a coefficient is equal to zero may
lead to a failure to reject a false null hypothesis of no effect of
the explanatory variable (Type II error)
- Leads to overfitting
Remove multicollinearity:
- Drop some of affected variables
- Principal component regression: gives uncorrelated predictors
- Combine the affected variables
- Ridge regression
- Partial least square regression
Detection of multicollinearity:
- Large changes in the individual coefficients when a predictor
variable is added or deleted
- Insignificant regression coefficients for the affected predictors,
but a rejection of the joint hypothesis that those coefficients are
all zero (F-test)
- VIF: the ratio of the variance of a coefficient when fitting the
full model to its variance when the predictor is fitted on its own
- Rule of thumb: VIF > 5 indicates multicollinearity (see the sketch below)
- Correlation matrix, but correlation is a bivariate relationship
whereas multicollinearity is multivariate
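A minimal sketch of VIF-based detection, assuming the car package is available (model and data are illustrative):
library(car)
fit <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
vif(fit)                                          # values above ~5 flag collinear predictors
cor(mtcars[, c("disp", "hp", "wt", "qsec")])      # correlation matrix: bivariate view only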
32. How to check if the regression model fits the data well?
R squared/Adjusted R squared:
F test:
- Evaluates the hypothesis Ho: all regression coefficients are equal
to zero vs H1: at least one is not
- A significant F test indicates that R2 is reliable
RMSE:
- Absolute measure of fit (whereas R2 is a relative measure of fit)
Information Gain/Deviance
InformationGain = entropy(parent) − weighted average entropy(children),
with entropy = −Σk p̂k·log(p̂k)
Better than Gini when the class proportions p̂k are very small:
multiplying very small numbers leads to rounding errors, so we can
instead take logs.
37. What is the maximal margin classifier? How can this margin be achieved?
• Gaussian kernel
• Linear kernel
• Polynomial kernel
• Laplace kernel
• Esoteric kernels: string kernels, chi-square kernels
• If the number of features is large (relative to the number of
observations): SVM with a linear kernel; e.g. text
classification with lots of words and a small training set
• If the number of features is small and the number of
observations is intermediate: Gaussian kernel
• If the number of features is small and the number of
observations is small: linear kernel
(see the sketch after this list)
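A minimal sketch of fitting SVMs with different kernels, assuming the e1071 package is available (dataset is illustrative):
library(e1071)
fit_linear <- svm(Species ~ ., data = iris, kernel = "linear")
fit_rbf    <- svm(Species ~ ., data = iris, kernel = "radial")     # Gaussian (RBF) kernel
table(predicted = predict(fit_rbf, iris), actual = iris$Species)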
- In high dimensions, nearly all of the space is far away from the center
- It consists almost entirely of the corners of the hypercube, with
almost no middle!
43. What is Ax=b? How to solve it?
• A ∈ R^(m×n), B ∈ R^(n×p)
• Each entry of the product: (AB)ij = Σk Aik·Bkj
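For the “how to solve it” part, a minimal R sketch (the small system is made up for illustration):
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # coefficient matrix
b <- c(3, 5)
solve(A, b)        # exact solution when A is square and invertible
qr.solve(A, b)     # least-squares solution for over-determined systems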
Pick an end of rope. Of the remaining 2n−1 ends of rope, only one
end creates a loop (the other end of the same piece of rope), leaving
n−1 untied pieces of rope. The rest of the time, two
separate pieces of rope are tied together and there are
effectively n−1 untied pieces of rope. The recurrence is therefore:
• Ln = L(n−1) + 1/(2n−1)
• Ln = Σ (k=1..n) 1/(2k−1) = H(2n) − (1/2)·H(n)
• Where Hk is the kth harmonic number
Since Hk ≈ ln(k) + γ for large-ish k, where γ = 0.57722… is the
Euler-Mascheroni constant, we have:
• Ln ≈ (1/2)·ln(n) + ln(2) + γ/2
Statistics
1. How do you assess the statistical significance of an insight?
Examples:
1) Natural language (Zipf’s law)
- Given some corpus of natural language, the frequency of any word
is inversely proportional to its rank in the frequency table
- The most frequent word will occur about twice as often as the second
most frequent, three times as often as the third most frequent…
- “The” accounts for 7% of all word occurrences (70,000 out of 1
million)
- “of” accounts for 3.5%, followed by “and”…
- “of” accounts for 3.5%, followed by “and”…
- Only 135 vocabulary items are needed to account for half the
English corpus!
2. Allocation of wealth among individuals: the larger portion of
the wealth of any society is controlled by a smaller
percentage of the people
3. File size distribution of Internet Traffic
Additional: Hard disk error rates, values of oil reserves in a field
(a few large fields, many small ones), sizes of sand particles,
sizes of meteorites
Importance in classification and regression problems:
- Skewed distribution
- Which metrics to use? Accuracy paradox (classification), F-score,
AUC
- Issue when using models that make linearity assumptions
(linear regression): a monotone transformation of the data may be
needed (logarithm, square root, sigmoid function…)
- Issue when sampling: your data can become even more unbalanced! Use
stratified sampling instead of simple random sampling, SMOTE
(“Synthetic Minority Over-sampling Technique”, N.V. Chawla) or an
anomaly-detection approach
5. Explain selection bias (with regard to a dataset, not variable selection). Why is
it important? How can data management procedures such as missing data
handling make it worse?
Types:
- Sampling bias: systematic error due to a non-random sample of a
population causing some members to be less likely to be included
than others
- Time interval: a trial may be terminated early at an extreme value
(often for ethical reasons), but the extreme value is likely to be
reached by the variable with the largest variance, even if all the
variables have similar means
- Data: “cherry picking”, when specific subsets of the data are
chosen to support a conclusion (e.g. citing examples of plane crashes
as evidence that air travel is unsafe, while ignoring the far more
common flights that complete safely)
- Studies: performing experiments and reporting only the most
favorable results
- Can lead to inaccurate or even erroneous conclusions
- Statistical methods can generally not overcome it
Why can data handling make it worse?
- Example: individuals who know or suspect that they are HIV
positive are less likely to participate in HIV surveys
- Missing-data handling (e.g. imputation) will amplify this effect,
since it is based on the respondents, who are mostly HIV negative
- Prevalence estimates will be inaccurate
6. Provide a simple example of how an experimental design can help answer a
question about behavior. How does experimental data contrast with
observational data?
8. What is an outlier? Explain how you might screen for outliers and what would
you do if you found them in your dataset. Also, explain what an inlier is and how
you might screen for them and what would you do if you found them in your
dataset
Outliers:
- An observation point that is distant from other observations
- Can occur by chance in any distribution
- Often, they indicate measurement error or a heavy-tailed
distribution
- Measurement error: discard them or use robust statistics
- Heavy-tailed distribution: high skewness, can’t use tools assuming
a normal distribution
- Three-sigma rule (normally distributed data): about 1 in 22
observations will differ from the mean by more than twice the
standard deviation
- Three-sigma rule: about 1 in 370 observations will differ from the
mean by more than three times the standard deviation
Three-sigma rules example: in a sample of 1000 observations, the
presence of up to 5 observations deviating from the mean by more
than three times the standard deviation is within the range of what
can be expected, being less than twice the expected number and hence
within 1 standard deviation of the expected number (Poisson
distribution).
If the nature of the distribution is known a priori, it is possible
to see whether the number of outliers deviates significantly from what
can be expected. For a given cutoff (samples fall beyond the cutoff
with probability p), the number of outliers can be approximated by a
Poisson distribution with lambda=pn. Example: if one takes a normal
distribution with a cutoff 3 standard deviations from the mean,
p=0.3% and thus we can approximate the number of samples whose
deviations exceed 3 sigmas by a Poisson with lambda=3
Identifying outliers:
- No rigid mathematical method
- Subjective exercise: be careful
- Boxplots
- QQ plots (sample quantiles Vs theoretical quantiles)
Handling outliers:
- Depends on the cause
- Retention: when the underlying model is confidently known
- Regression problems: only exclude points which exhibit a large
degree of influence on the estimated coefficients (Cook’s distance)
Inlier:
- Observation lying within the general distribution of other
observed values
- Doesn’t perturb the results but is non-conforming and unusual
- Simple example: an observation recorded in the wrong unit (°F
instead of °C)
Identifying inliers:
- Mahalanobis distance
- Used to calculate the distance between two random vectors
- Difference with Euclidean distance: it accounts for correlations
- Discard them (see the sketch below)
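A minimal R sketch of screening with the Mahalanobis distance (the columns are chosen only for illustration):
x  <- mtcars[, c("mpg", "hp", "wt")]
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))
# Under multivariate normality d2 is roughly chi-squared with 3 df;
# inspect points beyond the 97.5% quantile
which(d2 > qchisq(0.975, df = 3))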
10. You have data on the durations of calls to a call center. Generate a plan for
how you would code and analyze these data. Explain a plausible scenario for
what the distribution of these durations might look like. How could you test, even
graphically, whether your expectations are borne out?
• Histogram of durations
• Histograms of durations per service type, per day of week, per
hour of day (durations can be systematically longer from 10am
to 1pm for instance), per employee…
2. Distribution: lognormal?
3. Test graphically with a QQ plot: sample quantiles
of log(durations) vs normal quantiles (see the sketch below)
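A minimal R sketch of that graphical check, using simulated stand-in data:
set.seed(1)
durations <- rlnorm(500, meanlog = 1, sdlog = 0.5)   # simulated call durations
hist(durations, breaks = 30)
qqnorm(log(durations)); qqline(log(durations))       # a near-straight line suggests lognormal is plausible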
12. You are compiling a report for user content uploaded every month and notice
a spike in uploads in October. In particular, a spike in picture uploads. What
might you think is the cause of this, and how would you test it?
• Halloween pictures?
• Look at uploads in countries that don’t observe Halloween as a
sort of counter-factual analysis
• Compare the mean number of uploads in October with that of
September: hypothesis testing
13. You’re about to get on a plane to Seattle. You want to know if you should
bring an umbrella. You call 3 random friends of yours who live there and ask
each independently if it’s raining. Each of your friends has a 2/3 chance of telling
you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you
that “Yes” it is raining. What is the probability that it’s actually raining in Seattle?
• All say yes: either all three lie or all three tell the truth
• P("all say the truth") = (2/3)^3 = 8/27
• P("all lie") = (1/3)^3 = 1/27
• P("all yes") = P(rain)·(8/27) + P(no rain)·(1/27); with a 50/50
prior on rain this gives 9/54
• Hence P("raining" | "all yes") = (1/2 · 8/27) / (9/54) = 8/9
14. There’s one box - has 12 black and 12 red cards, 2nd box has 24 black and
24 red; if you want to draw 2 cards at random from one of the 2 boxes, which
box has the higher probability of getting the same color? Can you tell intuitively
why the 2nd box has a higher probability
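A quick check of the stated intuition (simple counting, for completeness):
P(same color | box 1) = 11/23 ≈ 0.478 (after the first card is drawn, 11 of the remaining 23 cards match its color)
P(same color | box 2) = 23/47 ≈ 0.489
The bigger box is less depleted by removing one card, so its proportion of matching cards stays closer to 1/2.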
15. What is: lift, KPI, robustness, model fitting, design of experiments, 80/20
rule?
Lift:
It’s a measure of the performance of a targeting model (or a rule) at
predicting or classifying cases as having an enhanced response (with
respect to the population as a whole), measured against a random-choice
targeting model. Lift is simply: target response rate / average
response rate.
Suppose a population has an average response rate of 5% (mailing for
instance). A certain model (or rule) has identified a segment with a
response rate of 20%, then lift=20/5=4
Typically, the modeler seeks to divide the population into
quantiles, and rank the quantiles by lift. He can then consider each
quantile, and by weighing the predicted response rate against the
cost, he can decide to market that quantile or not.
Typical phrasing: “if we use the probability scores on customers, we
can get 60% of the total responders we’d get mailing randomly by only
mailing the top 30% of the scored customers”.
KPI:
- Key performance indicator
- A type of performance measurement
- Examples: 0 defects, 10/10 customer satisfaction
- Relies upon a good understanding of what is important to the
organization
More examples:
Marketing & Sales:
- New customers acquisition
- Customer attrition
- Revenue (turnover) generated by segments of the customer
population
- Often done with a data management platform
IT operations:
- Mean time between failure
- Mean time to repair
Robustness:
- Statistics with good performance even if the underlying
distribution is not normal
- Statistics that are not affected by outliers
- A learning algorithm that can reduce the chance of fitting noise
is called robust
- Median is a robust measure of central tendency, while mean is not
- Median absolute deviation is also more robust than the standard
deviation
Model fitting:
- How well a statistical model fits a set of observations
- Examples: AIC, R2, Kolmogorov-Smirnov test, Chi-squared, deviance (glm)
Design of experiments:
The design of any task that aims to describe or explain the
variation of information under conditions that are hypothesized to
reflect the variation.
In its simplest form, an experiment aims at predicting the outcome
by changing the preconditions, the predictors.
- Selection of the suitable predictors and outcomes
- Delivery of the experiment under statistically optimal conditions
- Randomization
- Blocking: an experiment may be conducted with the same equipment
to avoid any unwanted variations in the input
- Replication: performing the same combination run more than once,
in order to get an estimate for the amount of random error that
could be part of the process
- Interaction: when an experiment has 3 or more variables, the
situation in which the effect of two variables on a third is
not additive
80/20 rule:
- Pareto principle
- 80% of the effects come from 20% of the causes
- 80% of your sales come from 20% of your clients
- 80% of a company’s complaints come from 20% of its customers
17. Give examples of data that does not have a Gaussian distribution, nor log-
normal.
18. What is root cause analysis? How to identify a cause vs. a correlation? Give
examples
Root cause analysis:
- Method of problem solving used for identifying the root causes or
faults of a problem
- A factor is considered a root cause if removal of it prevents the
final undesirable event from recurring
Identify a cause vs. a correlation:
- Correlation: statistical measure that describes the size and
direction of a relationship between two or more variables. A
correlation between two variables doesn’t imply that the change in
one variable is the cause of the change in the values of the other
variable
- Causation: indicates that one event is the result of the
occurrence of the other event; there is a causal relationship
between the two events
- The difference between the two types of relationships is easy to
state, but establishing cause and effect is difficult
Example: sleeping with one’s shoes on is strongly correlated with
waking up with a headache. Correlation-implies-causation fallacy:
therefore, sleeping with one’s shoes causes headache.
More plausible explanation: both are caused by a third factor: going
to bed drunk.
Identify a cause Vs a correlation: use of a controlled study
- In medical research, one group may receive a placebo (control)
while the other receives a treatment. If the two groups have
noticeably different outcomes, the different experiences may have
caused the different outcomes
19. Give an example where the median is a better measure than the mean
When data is skewed
20. Given two fair dice, what is the probability of getting scores that sum to 4?
To 8?
• Total: 36 combinations
• Of these, 3 involve a score of 4: (1,3), (3,1), (2,2)
• So: 3/36=1/12
• Considering a score of 8: (2,6), (3,5), (4,4), (6,2), (5,3)
• So: 5/36
- Similar: ME=
Notes:
- Randomization: in randomized control trials, research participants
are assigned by chance, rather than by choice to either the
experimental group or the control group.
- Random sampling: obtaining data that is representative of the
population of interest
27. An HIV test has a sensitivity of 99.7% and a specificity of 98.5%. A subject
from a population of prevalence 0.1% receives a positive test result. What is the
precision of the test (i.e the probability he is HIV positive)?
Bayes rule: P(Actu+ | Pred+) = P(Pred+ | Actu+) × P(Actu+) / P(Pred+)
We have: P(Actu+ | Pred+) = (0.997 × 0.001) / (0.997 × 0.001 + (1 − 0.985) × 0.999) ≈ 0.062,
i.e. a precision of only about 6%, despite the excellent sensitivity
and specificity, because the prevalence is so low.
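A one-line R check of that calculation (values taken from the question):
(0.997 * 0.001) / (0.997 * 0.001 + (1 - 0.985) * 0.999)   # ~0.062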
28. Infection rates at a hospital above a 1 infection per 100 person days at risk
are considered high. An hospital had 10 infections over the last 1787 person days
at risk. Give the p-value of the correct one-sided test of whether the hospital is
below the standard
One-sided test, assume a Poisson distribution
Ho: lambda=0.01 ; H1: lambda<0.01 (the hospital is below the standard)
R code:
ppois(10,1787*0.01)
## [1] 0.03237153
29. You roll a biased coin (p(head)=0.8) five times. What’s the probability of
getting three or more heads?
• 5 trials, p=0.8
P("3 or more heads") = C(5,3)·0.8^3·0.2^2 + C(5,4)·0.8^4·0.2 + C(5,5)·0.8^5
= 0.2048 + 0.4096 + 0.32768 ≈ 0.94
30. A random variable X is normal with mean 1020 and standard deviation 50.
Calculate P(X>1200)
X ~ N(1020, 50). Standardizing: z = (1200 − 1020) / 50 = 3.6
R Code:
pnorm(3.6,lower.tail=F)
## [1] 0.0001591086
31. Consider the number of people that show up at a bus station is Poisson with
mean 2.5/h. What is the probability that at most three people show up in a four
hour period?
X ~ Poisson(λ = 2.5 × t), so over t = 4 hours λ = 10
R code:
ppois(3,lambda=2.5*4)
## [1] 0.01033605
32. You are running for office and your pollster polled hundred people. 56 of
them claimed they will vote for you. Can you relax?
Quick check:
- Intervals take the form p̂ ± z(1−α/2)·sqrt(p̂(1−p̂)/n); a
back-of-the-envelope 95% interval is p̂ ± 1/sqrt(n) = 0.56 ± 0.10 = [0.46, 0.66]
- The interval contains 0.5, so you cannot relax yet
34. The homicide rate in Scotland fell last year to 99 from 115 the year before. Is
this reported change really noteworthy?
35. Consider influenza epidemics for two parent heterosexual families. Suppose
that the probability is 17% that at least one of the parents has contracted the
disease. The probability that the father has contracted influenza is 12% while the
probability that both the mother and father have contracted the disease is 6%.
What is the probability that the mother has contracted influenza?
• P("MotherorFather")=P("Mother")+P("Father")−P("MotherandFather
") Hence: P("Mother")=0.17+0.06−0.12=0.11
36. Suppose that diastolic blood pressures (DBPs) for men aged 35-44 are
normally distributed with a mean of 80 (mm Hg) and a standard deviation of 10.
About what is the probability that a random 35-44 year old has a DBP less than
70?
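A sketch of the calculation, in the same style as the other answers: z = (70 − 80)/10 = −1, so the probability is about 16%.
R code:
pnorm(70, mean = 80, sd = 10)   # ~0.159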
38. A diet pill is given to 9 subjects over six weeks. The average difference in
weight (follow up - baseline) is -2 pounds. What would the standard deviation of
the difference in weight have to be for the upper endpoint of the 95% T
confidence interval to touch 0?
• Find σ such that the upper endpoint of the 95% t interval is 0:
x̄ + t(0.975, 8) × σ/√9 = 0, hence σ = 2 × 3 / t(0.975, 8)
• R code:
2*3/qt(0.975,df=8)
## [1] 2.601903
• We get [−2.75,−1.25]
40. To further test the hospital triage system, administrators selected 200 nights
and randomly assigned a new triage system to be used on 100 nights and a
standard system on the remaining 100 nights. They calculated the nightly median
waiting time (MWT) to see a physician. The average MWT for the new system
was 4 hours with a standard deviation of 0.5 hours while the average MWT for
the old system was 6 hours with a standard deviation of 2 hours. Consider the
hypothesis of a decrease in the mean MWT associated with the new treatment.
What does the 95% independent group confidence interval with unequal
variances suggest vis a vis this hypothesis? (Because there’s so many
observations per group, just use the Z quantile instead of the T.)
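A sketch of the interval, assuming the standard large-sample Z interval for a difference of independent means:
R code:
(4 - 6) + c(-1, 1) * qnorm(0.975) * sqrt(0.5^2/100 + 2^2/100)   # roughly [-2.40, -1.60]
The whole interval lies below zero, which is consistent with a genuine decrease in mean MWT under the new system.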
5. Is it better to have 100 small hash tables or one big hash table, in memory, in
terms of access speed (assuming both fit within RAM)? What do you think about
in-database analytics?
Hash tables:
- Average case O(1) lookup time
- Lookup time doesn’t depend on size
Even in terms of memory:
- O(n) memory
- Space scales linearly with the number of elements
- Lots of small hash tables won’t take up significantly less space
than one large one
In-database analytics:
- Integration of data analytics in data warehousing functionality
- Much faster and corporate information is more secure, it doesn’t
leave the enterprise data warehouse
Good for real-time analytics: fraud detection, credit scoring,
transaction processing, pricing and margin analysis, behavioral ad
targeting and recommendation engines
• Python example
• Requesting and fetching the webpage into the code: httplib2
module
• Parsing the content and getting the necessary info:
BeautifulSoup from bs4 package
• Twitter API: the Python wrapper for performing API requests.
It handles all the OAuth and API queries in a single Python
interface
• MongoDB as the database
• PyMongo: the Python wrapper for interacting with the MongoDB
database
• Cron jobs: a time-based scheduler used to run scripts at
specific intervals; helps avoid the “rate limit exceeded”
error
Other answer:
- Depends on the context
- Is error acceptable? Fraud detection, quality assurance
17. Explain the difference between “long” and “wide” format data. Why would
you use one or the other?
• Long: one column containing the values and another column
listing the context of the value, e.g. columns Fam_id, year, fam_inc
• Wide: each different variable in a separate column, e.g. columns
Fam_id, fam_inc96, fam_inc97, fam_inc98
Long vs Wide:
- Data manipulations (summarize, filter) are much easier when data is
in the wide format
- Program requirements
(see the reshaping sketch below)
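A minimal R sketch of moving between the two layouts with base R's reshape (the toy data frame is illustrative):
long <- data.frame(fam_id  = rep(1:2, each = 3),
                   year    = rep(96:98, times = 2),
                   fam_inc = c(40, 42, 45, 30, 31, 33))
wide <- reshape(long, idvar = "fam_id", timevar = "year", direction = "wide")
# wide now has columns fam_id, fam_inc.96, fam_inc.97, fam_inc.98
back <- reshape(wide, direction = "long")   # and back again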
18. Do you know a few “rules of thumb” used in statistical or computer science?
Or in business analytics?
Pareto rule:
- 80% of the effects come from 20% of the causes
- 80% of the sales come from 20% of the customers
Computer science: “simple and inexpensive beats complicated and
expensive” - Rod Elder
Finance, rule of 72:
- Estimate the time needed for an investment to double
- At a rate of 9%, money doubles in about 72/9 = 8 years
Rule of three (Economics):
- There are always three major competitors in a free market within
one industry
visit: https://rpubs.com/JDAHAN/172473