Statistics vs Machine Learning
Nerdy stuff
Definition of Data
data
noun
1. a plural of datum
datum
noun
1. a single piece of information, as a fact, statistic, or code; an item of data
2. Philosophy
a. any fact assumed to be a matter of direct observation
b. any proposition assumed or given, from which conclusions may be drawn
c. Also called sense datum. Epistemology. The object of knowledge as
presented to the mind
What do you think data really *is* though?
Methinks:
● Data are inert fragments, or shards, of information
● Logical building blocks capable of leading to stories
● “True” data (information) requires consciousness to exist (b/c it is contextual)
● Even when data is auto-generated or never directly seen by a human,
consciousness is needed for “design”, “interpretation” (aka meaning), etc.
● In everyday use, we think of data as representing quantities, characters,
or symbols on which operations are performed by a computer¹
● Data can be organized into many different data structures (e.g. lists,
tuples, arrays, data frames) and data types (e.g. integers, dates, strings)
¹ https://en.wikipedia.org/wiki/Data_(computing)
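To make the last bullet concrete, here is a minimal Python sketch using only the standard library. The variable names are hypothetical, the date is arbitrary, and the dictionary of columns is only a stand-in for a real data frame.

```python
from array import array
from datetime import date

# Data structures: different containers for the same three numbers
hr_list = [762, 755, 714]          # list: ordered and mutable
hr_tuple = (762, 755, 714)         # tuple: ordered and immutable
hr_array = array("i", hr_list)     # array: compact storage, one type only
hr_frame = {"hr": hr_list, "rbi": [1996, 2297, 2214]}  # dict of columns, a stand-in for a data frame

# Data types: the kind of value a single item holds
sample = [762, date(2018, 1, 1), "HR"]  # the date is arbitrary, for illustration
type_names = [type(value).__name__ for value in sample]
print(type_names)
```

The same values can sit in any of these containers; which one fits depends on whether the data must change, must be compact, or must be organized by column.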
History of Data & Data Analysis
What is Statistics? It’s Applied Math
● Two kinds - descriptive vs inferential
○ Descriptive statistics: motivation is to accurately reflect the past
○ Inferential statistics: motivation is to accurately predict the future
● Want to draw (infer) valid conclusions from samples and subsets
○ To save time, energy, $$$
○ Census counts, Agriculture crop yields, genetics, drug efficacy, baseball, …
○ Requires many assumptions be made about underlying data for results to be valid
○ Goal: “simple”, human-understandable model, or formula, that explains most variability
● Deeply rooted in rigorous mathematical theory, especially probability
theory and matrix algebra; “bell-shaped curves” matter
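A rough sketch of the descriptive vs inferential split, in Python with the standard `statistics` module. It reuses the nine career RBI totals from the regression slide and pretends they are a random sample (they are not), and it uses a normal-approximation interval as a simplifying assumption; with only nine values a t-based interval would be the more careful choice.

```python
from statistics import NormalDist, mean, stdev

rbi = [1996, 2297, 2214, 2086, 1903, 1836, 1918, 1699, 1667]

# Descriptive: summarize the data we actually have
sample_mean = mean(rbi)
sample_sd = stdev(rbi)  # sample standard deviation (n - 1 denominator)

# Inferential: estimate a plausible range for the mean of a wider population,
# assuming a roughly bell-shaped sampling distribution (an assumption, not a given)
z = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval
half_width = z * sample_sd / len(rbi) ** 0.5
ci_low, ci_high = sample_mean - half_width, sample_mean + half_width
print(round(sample_mean, 1), round(ci_low, 1), round(ci_high, 1))
```

The first half only describes the past; the second half is valid only to the extent that the stated assumptions hold, which is exactly the trade-off in the bullets above.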
And Machine Learning? Computer Science
● Two kinds - supervised vs unsupervised
○ Supervised: “learns” rules for mapping inputs (aka, features) to an output (aka, label)
○ Unsupervised: “learns” patterns without any outputs involved
● Want to derive rules that provide the maximum possible accuracy
○ To save time, energy, $$$
○ Census counts, Agriculture crop yields, genetics, drug efficacy, baseball, …
○ Requires virtually no prior assumptions be made about input data (i.e. the proof is in the pudding)
○ “Learned” rules are often difficult to interpret; you might not even look at them directly
● Motivated by artificial intelligence (AI); minimizing “cost functions” while
not “over-fitting” the model to the training data is what matters
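The last bullet can be sketched in a few lines of plain Python. Everything here is hypothetical: the toy data, the learning rate, and the choice of mean squared error as the cost function. The idea is just to show a rule being "learned" by minimizing a cost on training data, then checked on held-out data as one common guard against over-fitting.

```python
# Hypothetical toy data: y is roughly 2*x + 1 with a small alternating wiggle
xs = [i / 10 for i in range(20)]
ys = [2 * x + 1 + (0.05 if i % 2 else -0.05) for i, x in enumerate(xs)]

# Hold out every fourth point; the model never sees these while learning
pairs = list(zip(xs, ys))
train = [p for i, p in enumerate(pairs) if i % 4 != 0]
test = [p for i, p in enumerate(pairs) if i % 4 == 0]

def cost(w, b, data):
    """Mean squared error of the rule y = w*x + b on the given data."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

# Gradient descent: repeatedly nudge w and b in the direction that lowers the cost
w, b, lr = 0.0, 0.0, 0.1
for _ in range(5000):
    grad_w = sum(2 * (w * x + b - y) * x for x, y in train) / len(train)
    grad_b = sum(2 * (w * x + b - y) for x, y in train) / len(train)
    w, b = w - lr * grad_w, b - lr * grad_b

train_cost, test_cost = cost(w, b, train), cost(w, b, test)
print(round(w, 2), round(b, 2))
```

If the cost on the held-out points were much worse than on the training points, that would be a hint that the learned rule fits the training data too closely.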
Similarities and Differences
What about “Deep Learning”? Fact or Fiction?
- Deep learning and machine learning are at what it calls the "peak of inflated expectation", but are just two to five years away from mainstream adoption. Cognitive computing is also at peak hype, but up to 10 years away, while general artificial intelligence remains more than a decade away and is still at the stage of early innovation.
- Effective machine learning is difficult because finding patterns is hard and often not enough training data is available; as a result, machine-learning programs often fail to deliver
Upper limit. When does it all “end”? 2045?
● Nobody (and nothing) can predict how life will be in 30 years - it’s not possible
● Technology itself is neutral; it becomes good or bad depending on intent and application
● Forecasts for the future generally reflect our own human hopes and fears more
than anything else
● Human intelligence is of a different kind than machine “intelligence” (right?)
Back to the Future ... All Roads Lead to ⇒
Statistics vs machine learning
Statistics vs machine learning
Or is it really more like this? :)
Techie stuff
Tools of the trade (today)
Some popular tools (ever-changing)
Statistics: linear regression example (via RStudio)
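# Career home runs (hr) and runs batted in (rbi) for the nine players in the sample
df <- data.frame(
  hr  = c(762, 755, 714, 696, 660, 630, 614, 612, 609),
  rbi = c(1996, 2297, 2214, 2086, 1903, 1836, 1918, 1699, 1667))
# Ordinary least-squares fit of RBI on HR
fit <- lm(rbi ~ hr, data = df)
summary(fit)
# Plot title: coefficient 2 is the slope, coefficient 1 is the intercept
eq <- paste("RBI = ", round(fit$coefficients[2], 1), "HR + ",
            round(fit$coefficients[1], 1), sep = "")
plot(df$rbi ~ df$hr, ylab = "RBI", xlab = "HR", pch = 19, main = eq)
abline(fit, col = "red", lty = 2)  # overlay the fitted line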
Career RBI vs HR, for MLB players with 600+ home runs
1) Matrix form
2) Best-fit coefficients are “deterministic”
3) Thus, we have a formula for estimating RBI from HR
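The three points above can be sketched in plain Python, as a rough illustration rather than a copy of what `lm` does internally. It sets up the normal equations (X'X) b = X'y for a design matrix with a column of ones and a column of HR, and solves the resulting 2x2 system directly. The function name `estimate_rbi` is hypothetical.

```python
hr = [762, 755, 714, 696, 660, 630, 614, 612, 609]
rbi = [1996, 2297, 2214, 2086, 1903, 1836, 1918, 1699, 1667]

n = len(hr)
sx, sy = sum(hr), sum(rbi)
sxx = sum(x * x for x in hr)
sxy = sum(x * y for x, y in zip(hr, rbi))

# 1) Matrix form: the normal equations (X'X) b = X'y written out for one predictor
#    [ n   sx  ] [intercept]   [ sy  ]
#    [ sx  sxx ] [slope    ] = [ sxy ]
# 2) Deterministic: the same data always gives the same coefficients
det = n * sxx - sx * sx
intercept = (sxx * sy - sx * sxy) / det
slope = (n * sxy - sx * sy) / det

# 3) The fitted formula for estimating RBI from HR
def estimate_rbi(home_runs):
    """Estimated career RBI for a given career HR total, per the fitted line."""
    return intercept + slope * home_runs

print(f"RBI = {slope:.1f} * HR + {intercept:.1f}")
```

There is no iteration or random start here, which is the sense in which the best-fit coefficients are “deterministic”.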
ML: logistic regression (Python via Jupyter on AWS)
Career HR and RBI vs HOF=Y/N, for MLB players 1950-2010
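The demo itself is not reproduced in this transcript, so what follows is only a minimal sketch of the general technique in plain Python. The player rows are made up for illustration (they are not real career lines), and the scaling constants, learning rate, and iteration count are arbitrary assumptions.

```python
import math

# Hypothetical (made-up) career lines: (home runs, RBI, inducted? 1/0)
players = [(50, 400, 0), (120, 650, 0), (180, 800, 0), (250, 1000, 0),
           (380, 1300, 1), (450, 1500, 1), (520, 1700, 1), (600, 1900, 1)]

def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))

# Scale both features to roughly 0..1 so one learning rate suits both
X = [(hr / 600, rbi / 1900) for hr, rbi, _ in players]
y = [label for _, _, label in players]

w1 = w2 = b = 0.0
lr = 0.5
for _ in range(5000):
    # Gradient of the average log-loss (the cost function) for each weight
    errs = [sigmoid(w1 * x1 + w2 * x2 + b) - t for (x1, x2), t in zip(X, y)]
    w1 -= lr * sum(e * x1 for e, (x1, _) in zip(errs, X)) / len(X)
    w2 -= lr * sum(e * x2 for e, (_, x2) in zip(errs, X)) / len(X)
    b -= lr * sum(errs) / len(X)

def predict(hr, rbi):
    """Estimated probability of induction under this toy model."""
    return sigmoid(w1 * hr / 600 + w2 * rbi / 1900 + b)

print(predict(550, 1800) > 0.5, predict(80, 500) > 0.5)
```

Unlike the linear regression slide, the weights here are found by iterating on a cost function rather than by solving a formula once.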
ML: neural network (Python via Visual Studio)
Career HR and RBI vs HOF=Y/N, for MLB players 1950-2010
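The network from the demo is not shown in this transcript. As a hedged illustration of what a neural network adds over logistic regression, here is a tiny forward pass with one hidden layer. The weights are hand-picked rather than trained, and the function names are hypothetical. It computes XOR, a pattern that no single logistic unit can represent.

```python
def relu(z):
    """Rectified linear unit: pass positive values, clip negatives to zero."""
    return max(0.0, z)

def tiny_net(x1, x2):
    """Forward pass of a 2-2-1 network with hand-picked (untrained) weights."""
    h1 = relu(x1 + x2)        # hidden unit 1: counts how many inputs are on
    h2 = relu(x1 + x2 - 1.0)  # hidden unit 2: active only when both inputs are on
    return h1 - 2.0 * h2      # output unit combines the two hidden units

outputs = [tiny_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)  # [0.0, 1.0, 1.0, 0.0]
```

In a real model the weights would be learned by minimizing a cost function, and the resulting rules are much harder to read off than a regression formula.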
“Big Data” ML with Spark and Scala (via Docker Zeppelin)
This demo covers the following (plus fetching data stored in an AWS S3 bucket)
● Hadoop=distributed I/O; Spark=distributed I/O and RAM
● Scala (is more Java than JavaScript) = default language for Spark
● Zeppelin (Scala) ~ Jupyter (Python)
(p.s. there are others out there too; e.g. Beaker, Sage)
● Docker = pre-configured software & services bundled into “containers”
(note: mostly Linux-based and non-GUI based programs)
“Old School” BI + “New Age” Data Science
Career HR and RBI vs HOF=Y/N, for MLB players 1950-2010
Data from AWS RDS joined to local text file data
w/ slice & dice interactivity + R statistical graphing
GUI-based Machine Learning with Orange (Python)
● Orange (used to be called “Orange Canvas”) is an open-source Python library
● I first came across it circa 2013 and really like its potential, but it is a bit buggy on Windows
● Works best with “small data” but it keeps improving
Julia: invented at MIT in 2012* and built for speed
● R is based on the S language from Bell Labs in the mid-1970s; built for single workstations
● Python has had a rebirth of sorts in recent years thanks to the Anaconda “data science” distro
● Julia was designed from scratch to combine the best of modern numerical computing languages and constructs
Pros
- New and modern (designed for parallelism, etc.)
- Fast; 5x faster than Python and 10x faster than R
- Supports Unicode and math symbols; 1-based arrays :-)
- Can directly invoke existing Python and R modules
- Attracting lots of attention (Apple, Amazon, Facebook, IBM, Intel hiring Julia programmers)
Cons
- Still very early and immature (not even at 1.0 yet)
- Packages/modules buggy; not as stable or proven as Python and R
- Can be very hard to find help and working examples
- No acceptable native graphics library; must call Python or R
- Yet another language to learn?! Besides, there are a lot of dependencies on Python, so why not just learn Python?
Resources
Links
Historical Timeline of Computable Knowledge
http://www.wolframalpha.com/docs/timeline
Data (Computing)
https://en.wikipedia.org/wiki/Data_(computing)
Electronic Statistics Textbook
http://www.statsoft.com/Textbook
Wikipedia: Statistics
https://en.wikipedia.org/wiki/Statistics
Wikipedia: Machine Learning
https://en.wikipedia.org/wiki/Machine_learning
Dr. Andrew Ng’s World-Famous Machine Learning Course
https://www.youtube.com/playlist?list=PLA89DCFA6ADACE599
Data Science Concepts
http://www.saedsayad.com/data_mining_map.htm
26 Hilariously Inaccurate Predictions About the Future (from 2014)
http://www.cracked.com/pictofacts-101-26-hilariously-inaccurate-predictions-about-future/
Python: where to even begin?
Step 1/3: Download the free Windows 64-bit Anaconda “Data Science” distro:
https://www.anaconda.com/download/#windows
Step 2/3: Open “Anaconda Navigator” and launch “jupyter notebook”
Note: it may be slow to launch (especially the first time), but it will open a new browser window at
http://localhost:8888/tree (or something close; the port number 8888 can vary)
Step 3/3: Create a new Python 3 notebook and start learning at your own pace,
e.g. with “Python for Data Analysis” open in a second tab: https://github.com/wesm/pydata-book
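As a possible first cell for that new notebook, here is a small sketch that checks the kernel really is Python 3 and then does one simple calculation. The variable names are hypothetical; the three numbers are the top career HR totals from the regression slide.

```python
import sys

# Confirm the notebook kernel is Python 3, then try a first calculation
major_version = sys.version_info.major
career_hr = [762, 755, 714]
average_hr = sum(career_hr) / len(career_hr)
print(major_version, round(average_hr, 1))
```

Paste it into the first cell and press Shift+Enter to run it.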


Editor's Notes

  1. http://www.wolframalpha.com/docs/timeline/