Statistics and Data Analytics Cheat Sheets

This document provides a cheat sheet on analytics concepts and tools. It defines analytics, describes the analytics lifecycle process (CRISP-DM), and lists common job titles in analytics such as business analyst, data analyst, and data scientist. It also defines big data using the three V's: volume, velocity, and variety. Key software tools mentioned include Microsoft Excel, Tableau, Python, R, SQL, and databases for exploring, visualizing, and building models with data.

Uploaded by Giova Rossi
© All Rights Reserved

(Cheat Sheet)

Data Types:
- Attribute Data – Qualitative:
* Text Data – e.g. yes/no, pass/fail, approve/reject…
- Variable Data – Quantitative:
* Discrete – counted numbers – e.g. # of defects (74), # of customer returns (13)
* Continuous – decimal numbers – e.g. time (12:24:59), money ($17.4354), pressure (25.44534 lbs.)

Types of Statistics:
- Descriptive Stats – Used to describe and summarize data.
- Inferential Stats – Used to draw conclusions about a population from sample data.
* As we gather data, we work with samples.
* We need confidence that our sample represents the population.

Hypothesis testing helps answer: “Is the sample a fair representation of the population?”

Hypotheses:
- Null Hypothesis (Ho) – assumes NO difference (the same); not rejected when p-value > 0.05
- Alternative Hypothesis (Ha) – states there is a difference; supported when p-value < 0.05

Tests for Normal Data (“t-tests”):
- 1-Sample t-Test – study one sample’s mean against a target.
- 2-Sample t-Test – study means from two different samples.
- Paired t-Test – study paired data (e.g. same part before/after improvement).
- ANOVA Test – study means from more than two samples.
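As a rough illustration of the t-tests listed above, the sketch below computes the 1-sample, 2-sample, and paired t statistics by hand with Python's standard library. The measurements, the target of 10.0, and the function names are invented for illustration. The 2-sample version uses Welch's unequal-variance form, which is one common choice and not necessarily the variant the cheat sheet has in mind. Converting a t statistic into a p-value requires the t distribution (usually taken from a statistics package) and is left out here.

```python
import math
import statistics

def one_sample_t(sample, target):
    """t statistic for comparing one sample's mean against a target."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1)
    return (mean - target) / (sd / math.sqrt(n))

def two_sample_t(a, b):
    """Welch t statistic for the means of two independent samples."""
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error

def paired_t(before, after):
    """Paired t statistic: a 1-sample test of the differences against 0."""
    diffs = [y - x for x, y in zip(before, after)]
    return one_sample_t(diffs, 0.0)

# Hypothetical measurements, for illustration only.
before = [10.2, 9.8, 10.5, 10.1, 9.9]
after = [9.6, 9.5, 10.0, 9.7, 9.4]

print("1-sample t vs target 10.0:", round(one_sample_t(before, 10.0), 3))
print("2-sample t:", round(two_sample_t(before, after), 3))
print("paired t:", round(paired_t(before, after), 3))
```

The paired version reuses the 1-sample function on the before/after differences, which is why paired data is tested against a target of zero.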
Normal vs Non-Normal Data:
- Hypothesis tests with NORMAL data use the mean for central tendency.
- Hypothesis tests with NON-NORMAL data use the median for central tendency.

Measures of Central Tendency:
- Mean – The average.
- Median – The middle value.
- Mode – The most frequently occurring value.
- Trimmed Mean – A compromise between the mean and median; removes some extreme values from each end, then averages.
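The four measures of central tendency can be compared on a small data set. This is a minimal sketch using Python's `statistics` module; the data values are invented, and trimming 10% from each end is an assumed setting, since the cheat sheet does not say how much to trim.

```python
import statistics

def trimmed_mean(values, proportion=0.1):
    """Mean after dropping `proportion` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values to drop per side
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.mean(kept)

# Hypothetical data with one large outlier (95).
data = [12, 14, 14, 15, 16, 17, 18, 19, 21, 95]

print("mean:", statistics.mean(data))
print("median:", statistics.median(data))
print("mode:", statistics.mode(data))
print("10% trimmed mean:", trimmed_mean(data, 0.1))
```

With this data the single outlier pulls the mean (24.1) well above the median (16.5), while the trimmed mean (16.75) stays close to the median.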
Measures of Variation:
- Range – Difference between the largest and smallest value.
- Interquartile Range – Difference between the 75th and 25th percentile.
- Standard Deviation – Typical deviation of values from the mean (the square root of the variance).
- Variance – Average squared deviation of values from the mean.

Basic Graphs:
- Histogram – shows central tendency and variation within a single distribution.
- Dotplot – similar to a histogram, but shows each value as an individual point.
- Boxplot – shows central tendency and variation within several distributions, not just one.
- Time-Series Plot – shows critical quality measurements over time.
- Scatterplot – shows the relationship between two variables.

Data Measurement Scales:
- Nominal – Cannot be ordered; no arithmetic can be performed. e.g. city (Detroit, Cleveland, Seattle).
- Ordinal – Can be ordered; differences between values are not meaningful. e.g. taste (bad, okay, good).
- Interval – Can be ordered; differences between values are meaningful (but not ratios). e.g. temp (0°, 10°, 20°).
- Ratio – Can be ordered; ratios are meaningful; zero indicates an absence. e.g. weight (0kg, 25kg, 50kg).

Experimental Design:
- Shows the cause and effect relationship between X and Y.
- Helps determine the proper settings (levels) for our inputs (X) in order to optimize our output (Y).

Key Terminology:
- Factors (x) – The independent variables being used (e.g. temperature).
- Levels – The various settings for the factors (e.g. 300°, 500°).
- Run – A set of experimental conditions. (Experiments have multiple.)
- Response (y) – The result from an experimental run (e.g. material strength).
- Replication – The repetition of experimental runs. (Challenges the result.)

Common Types of Experiments:
- Full Factorials – use 2-5 input variables with all combinations of levels (or settings).
- Fractional Factorials – use 4-15 input variables and a fraction of combinations.

General Notation for Full Factorial Design (2^k):
- k = # of input variables
- 2 = # of levels used for each factor
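The 2^k notation can be made concrete by listing every run of a small full factorial design. The sketch below assumes three factors at two levels each; the temperature levels come from the example above, while the pressure and time factors and the helper name `full_factorial` are hypothetical.

```python
from itertools import product

# Hypothetical factors, each with two levels (low, high).
factors = {
    "temperature": (300, 500),
    "pressure": (20, 40),
    "time": (5, 10),
}

def full_factorial(factors):
    """Return every combination of factor levels as a list of run dicts."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]

runs = full_factorial(factors)
print(len(runs), "runs")  # 2**3 = 8 runs for k = 3 factors at 2 levels
for number, run in enumerate(runs, start=1):
    print(number, run)
```

Each added factor doubles the number of runs, which is one reason fractional factorials are used when there are many input variables.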
Types of Sampling & Measurement Errors:
- Sampling Error – Differences among samples drawn at random (“luck of the draw”).
- Sampling Bias – A lack of random samples (e.g. height of basketball players only).
- Measurement Error – Issues with our measurement systems.
- Measurement Invalidity – Not measuring what is intended (e.g. temperature near a furnace).

Principles of Good Experimental Design:
- Randomization of runs to remove bias and spread noise.
- Replication of the experiment to challenge or strengthen the validity of results.
- Monitoring of noise.
- Holding other factors constant. (Those that are not a focus of the experiment.)
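Randomization and replication of runs can be sketched in a few lines. The example below builds a replicated full factorial plan and shuffles the run order; the two factors, their levels, the replicate count, and the function name `randomized_plan` are assumptions made for illustration, and the fixed seed is there only so the printed order is repeatable.

```python
import random
from itertools import product

def randomized_plan(levels_per_factor, replicates=2, seed=None):
    """Full factorial runs, each repeated `replicates` times, in random order."""
    base_runs = list(product(*levels_per_factor))
    plan = [run for run in base_runs for _ in range(replicates)]
    random.Random(seed).shuffle(plan)  # randomize the run order
    return plan

# Hypothetical 2-factor experiment: temperature and pressure, two levels each.
plan = randomized_plan([(300, 500), (20, 40)], replicates=2, seed=1)
for order, (temperature, pressure) in enumerate(plan, start=1):
    print(order, temperature, pressure)
```

In a real experiment the seed would normally be left unset so the order is not chosen by the experimenter.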
(Cheat Sheet)

What is Analytics?
[Diagram: data, insight, business]

Types of Analytics
o Descriptive Analytics – What happened?
o Predictive Analytics – What might happen?
o Prescriptive Analytics – What should we do?

Lifecycle of Analytics (“CRISP-DM”)
o Business Understanding – Define the business problem.
o Data Understanding – Identify available data and gaps in data.
o Data Preparation – Clean and prepare the data.
o Modeling – Build predictive models.
o Evaluation – Evaluate how the models perform.
o Deployment – Start using the chosen model.

Microsoft Excel
Allows you to explore/analyze smaller data sets.

Tableau Desktop (or Power BI)
Allows you to visualize your data with dashboards.

Python Language (or R)
Allows you to build models to make predictions.

Structured Query Language (SQL)
Allows you to communicate and interact with databases.
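To make the SQL entry concrete, here is a minimal sketch that runs a query from Python using the built-in `sqlite3` module. The `customer_returns` table and its rows are hypothetical; the query answers a descriptive ("What happened?") question by totalling returns per product.

```python
import sqlite3

# In-memory database with a hypothetical customer_returns table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_returns (product TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO customer_returns (product, quantity) VALUES (?, ?)",
    [("widget", 13), ("gadget", 4), ("widget", 7)],
)

# Descriptive question: how many returns did each product have?
rows = conn.execute(
    "SELECT product, SUM(quantity) AS total "
    "FROM customer_returns GROUP BY product ORDER BY total DESC"
).fetchall()
conn.close()

for product_name, total in rows:
    print(product_name, total)
```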
Common Job Titles
o Business Analyst
o Business Intel. Analyst
o Analytics Manager
o Data Analyst
o Data Scientist *
o …
* Most people feel this job is more technical.

Big data is so large that “it requires the use of new technical architectures … to enable insights that unlock new sources of business value.” (McKinsey)

3 V’s of Big Data (Defining Characteristics)
o Volume
o Velocity
o Variety

Most job postings ask about software, so:
o Select a tool from above.
o Download a free trial.
o Get a pizza!
o Spend a weekend to learn.
o State you have “Experience with…” on your resume.
