Statistics vs machine learning

Definition of Data
data
noun
1. a plural of datum
datum
noun
1. a single piece of information, as a fact, statistic, or code; an item of data
2. Philosophy
a. any fact assumed to be a matter of direct observation
b. any proposition assumed or given, from which conclusions may be drawn
c. Also called sense datum. Epistemology. The object of knowledge as presented to the mind

What do you think data really *is* though?
Methinks:
● Data are inert fragments, or shards, of information
● Logical building blocks capable of leading to stories
● “True” data (information) requires consciousness to exist (because it is contextual)
● Even when data are auto-generated or never directly seen by a human, consciousness is needed for “design”, “interpretation” (aka, meaning), etc.
● In everyday use, we think of data as representing quantities, characters, or symbols on which operations are performed by a computer [1]
● Data can be organized into many different data structures (e.g., lists, tuples, arrays, data frames) and data types (e.g., integers, dates, strings)
[1] https://en.wikipedia.org/wiki/Data_(computing)
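To make the last bullet concrete, here is a minimal Python sketch of a few data structures and data types. All names and values (player_a, the numbers, the date) are made up for illustration, and the dict of columns is only a standard-library stand-in for a real data frame.

```python
from array import array
from datetime import date

# Data structures (all values are made up for illustration)
home_runs = [31, 28, 35]                      # list: ordered, mutable
record = ("player_a", 31, date(2017, 4, 2))   # tuple: fixed-length, immutable
averages = array("d", [0.281, 0.302, 0.265])  # array: compact, one numeric type
frame = {                                     # dict of equal-length columns,
    "player": ["player_a", "player_b", "player_c"],  # a stand-in for a data frame
    "hr": home_runs,
}

# Data types: the type supplies part of the context that gives a value meaning
kinds = [type(v).__name__ for v in (31, 0.281, "player_a", date(2017, 4, 2))]
print(kinds)  # ['int', 'float', 'str', 'date']
```

In practice a library such as pandas would provide the data frame; the dict is used here only so the sketch runs without extra installs.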

History of Data & Data Analysis

What is Statistics? It’s Applied Math
● Two kinds - descriptive vs inferential
○ Descriptive statistics: motivation is to accurately reflect the past
○ Inferential statistics: motivation is to draw conclusions about a wider population, or to predict the future, from a sample
● Want to draw (infer) valid conclusions from samples and subsets
○ To save time, energy, $$$
○ Census counts, Agriculture crop yields, genetics, drug efficacy, baseball, …
○ Requires many assumptions be made about underlying data for results to be valid
○ Goal: “simple”, human-understandable model, or formula, that explains most variability
● Deeply rooted in rigorous mathematical theory, especially probability and matrix algebra; “bell-shaped curves” matter
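The descriptive vs inferential split can be sketched in a few lines of Python. The sample below is hypothetical (made-up crop yields), and the interval rests on a normal approximation, which is exactly the kind of assumption the bullets above warn about; for a sample this small a t-based interval would be somewhat wider.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical sample: crop yields (tons per plot) from 8 surveyed plots
sample = [52.1, 48.3, 50.7, 49.9, 51.4, 47.8, 50.2, 49.6]

# Descriptive: summarize the data we actually observed
sample_mean = mean(sample)
sample_sd = stdev(sample)  # sample standard deviation (n - 1 denominator)

# Inferential: estimate the mean of the wider population from the sample.
# Assumption: the sample mean is roughly normally distributed ("bell-shaped").
z = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value, about 1.96
margin = z * sample_sd / sqrt(len(sample))
ci_low, ci_high = sample_mean - margin, sample_mean + margin

print(f"mean={sample_mean:.2f} sd={sample_sd:.2f}")
print(f"approx. 95% interval for the population mean: {ci_low:.2f} to {ci_high:.2f}")
```

The first two numbers only describe the eight plots; the interval is a claim about plots that were never measured, and it is only as good as its assumptions.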

And Machine Learning? Computer Science
● Two kinds - supervised vs unsupervised
○ Supervised: “learns” rules for mapping inputs (aka, features) to an output (aka, label)
○ Unsupervised: “learns” patterns without any outputs involved
● Want to derive rules that provide the maximum possible accuracy
○ To save time, energy, $$$
○ Census counts, Agriculture crop yields, genetics, drug efficacy, baseball, …
○ Requires virtually no prior assumptions be made about input data (i.e., the proof is in the pudding)
○ “Learned” rules are often difficult to interpret; you might not even look at them directly
● Motivated by artificial intelligence (AI); what matters is minimizing “cost functions” without “over-fitting” the model to the training data
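The supervised vs unsupervised distinction can be illustrated with a toy Python sketch. The data, the labels "low" and "high", and both helper functions are hypothetical; 1-nearest-neighbour and a two-cluster k-means loop were picked only because they are short, not because the slides name them.

```python
# Supervised: labelled examples (feature -> label); all values are made up
labelled = [(1.0, "low"), (1.2, "low"), (0.8, "low"),
            (5.0, "high"), (5.3, "high"), (4.9, "high")]

def predict_1nn(x):
    """Return the label of the nearest labelled example (1-nearest-neighbour)."""
    nearest = min(labelled, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

# Unsupervised: the same features with the labels thrown away
points = [x for x, _ in labelled]

def two_means(values, rounds=10):
    """Split 1-D values into two clusters with a basic k-means loop (k = 2)."""
    centers = [min(values), max(values)]  # simple deterministic starting guess
    for _ in range(rounds):
        clusters = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            clusters[nearest].append(v)
        centers = [sum(c) / len(c) for c in clusters]
    return clusters

print(predict_1nn(1.1))   # uses the labels: prints "low"
print(two_means(points))  # finds two groups without ever seeing a label
```

Neither function yields a human-readable formula, which is the interpretability trade-off noted in the bullets above.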

What about “Deep Learning”? Fact or Fiction?
- Deep learning and machine learning are at what it calls the "peak of inflated expectation", but are just two to five years away from mainstream adoption. Cognitive computing is also at peak hype, but up to 10 years away, while general artificial intelligence remains more than a decade away and is still at the stage of early innovation.
- Effective machine learning is difficult because finding patterns is hard and often not enough training data is available; as a result, machine-learning programs often fail to deliver

Upper limit. When does it all “end”? 2045?
● Nobody (and nothing) can predict how life will be in 30 years - it’s not possible
● Technology itself is always neutral; whether it is good or bad depends on intent and application
● Forecasts for the future generally reflect our own human hopes and fears more than anything else
● Human intelligence is of a different kind than machine “intelligence” (right?)

Back to the Future ... All Roads Lead to ⇒

Statistics vs machine learning

Or is it really more like this? :)

Some popular tools (ever-changing)

Statistics: linear regression example (via RStudio)
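# Career HR and RBI totals, one entry per player
df = data.frame(
  hr  = c(762, 755, 714, 696, 660, 630, 614, 612, 609),
  rbi = c(1996, 2297, 2214, 2086, 1903, 1836, 1918, 1699, 1667))
# Ordinary least-squares fit of RBI on HR
fit = lm(rbi ~ hr, data = df)
summary(fit)
# Plot title: coefficient 2 is the slope, coefficient 1 is the intercept
eq = paste("RBI = ", round(fit$coefficients[2], 1), " * HR + ",
           round(fit$coefficients[1], 1), sep = "")
plot(rbi ~ hr, data = df, ylab = "RBI", xlab = "HR", pch = 19, main = eq)
abline(fit, col = "red", lty = 2)  # dashed red best-fit line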
Career RBI vs HR, for MLB players with 600+ home runs
1) Matrix form
2) Best-fit coefficients are “deterministic”
3) Thus, we have a formula for estimating RBI from HR
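The three points above can be sketched in Python using the same nine HR/RBI pairs as the R example. The function name ols_fit is hypothetical. It solves the 2x2 normal equations (X'X)b = X'y by Cramer's rule, which is one plain way to write the "matrix form", not necessarily how R's lm computes it internally.

```python
# Same data as the R example: career HR and RBI for players with 600+ home runs
hr = [762, 755, 714, 696, 660, 630, 614, 612, 609]
rbi = [1996, 2297, 2214, 2086, 1903, 1836, 1918, 1699, 1667]

def ols_fit(x, y):
    """Solve the normal equations (X'X) b = X'y for the line y = b0 + b1 * x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(a * a for a in x)
    sxy = sum(a * b for a, b in zip(x, y))
    det = n * sxx - sx * sx                  # determinant of the 2x2 matrix X'X
    intercept = (sxx * sy - sx * sxy) / det  # b0 by Cramer's rule
    slope = (n * sxy - sx * sy) / det        # b1 by Cramer's rule
    return intercept, slope

intercept, slope = ols_fit(hr, rbi)
print(f"RBI = {slope:.1f} * HR + {intercept:.1f}")

# "Deterministic": no random starting point, so refitting gives identical numbers
assert ols_fit(hr, rbi) == (intercept, slope)
```

Because the solution is closed-form, the same data always yields the same coefficients, in contrast to iteratively trained models whose result can depend on initialization and step size.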

ML: logistic regression (Python via Jupyter on AWS)
Career HR and RBI vs HOF=Y/N, for MLB players 1950-2010
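The notebook code for this slide is not in the text, so what follows is only a rough, standard-library sketch of the idea. The data are made-up toy values, not real MLB records, and only one feature (career HR, in hundreds) is used rather than both HR and RBI. The function names and the learning-rate settings are assumptions.

```python
from math import exp

# Made-up toy data, NOT real MLB records:
# career HR (in hundreds) and Hall of Fame label (1 = yes, 0 = no)
hr_hundreds = [0.5, 1.0, 1.5, 2.0, 3.5, 4.0, 4.5, 5.0]
hof = [0, 0, 0, 1, 0, 1, 1, 1]

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Minimize the log-loss cost function by plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n  # d(cost)/dw
        b -= lr * sum(errors) / n                             # d(cost)/db
    return w, b

w, b = fit_logistic(hr_hundreds, hof)
p_low = sigmoid(w * 0.5 + b)   # predicted HOF probability at 50 career HR
p_high = sigmoid(w * 5.0 + b)  # predicted HOF probability at 500 career HR
print(f"P(HOF | 50 HR) = {p_low:.2f}, P(HOF | 500 HR) = {p_high:.2f}")
```

Unlike the closed-form regression fit, the weights here are found iteratively by lowering a cost function step by step. A library such as scikit-learn would normally handle this, along with regularization to limit over-fitting.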

ML: neural network (Python via Visual Studio)

“Big Data” ML with Spark and Scala (via Docker Zeppelin)
This demo covers the following (plus fetching data stored in AWS S3)
● Hadoop = distributed I/O; Spark = distributed I/O and RAM
● Scala (closer to Java than to JavaScript) = default language for Spark
● Zeppelin (Scala) ~ Jupyter (Python)
(p.s. there are others out there too, e.g., Beaker, Sage)
● Docker = pre-configured software & services bundled into “containers”
(note: mostly Linux-based and non-GUI programs)

“Old School” BI + “New Age” Data Science
Data from AWS RDS joined to local text file data
w/ slice & dice interactivity + R statistical graphing

GUI-based Machine Learning with Orange (Python)
● Orange (used to be called “Orange Canvas”) is an open-source Python library
● I first came across it circa 2013 and really like its potential, but it is a bit buggy on Windows
● Works best with “small data”, but it keeps improving

Julia: invented at MIT in 2012* and built for speed
● R is based on the S language from Bell Labs in the mid-1970s; built for single workstations
● Python has had a rebirth of sorts in recent years thanks to the Anaconda “data science” distro
● Julia was designed from scratch to combine the best of modern numerical computing languages and constructs
Pros
- New and modern (designed for parallelism, etc.)
- Fast; 5x faster than Python and 10x faster than R
- Supports Unicode and math symbols; 1-based arrays :-)
- Can directly invoke existing Python and R modules
- Attracting lots of attention (Apple, Amazon, Facebook, IBM, Intel, hiring Julia programmers)
Cons
- Still very early and immature (not even to 1.0 yet)
- Packages/modules buggy; not as stable or proven as Python and R
- Can be very hard to find help and working examples
- No acceptable native graphics library; must call Python or R
- Yet another language to learn?! Besides, there’s a lot of dependencies on Python, why not just learn Python?

Links
Historical Timeline of Computable Knowledge
http://www.wolframalpha.com/docs/timeline
Data (Computing)
https://en.wikipedia.org/wiki/Data_(computing)
Electronic Statistics Textbook
http://www.statsoft.com/Textbook
Wikipedia: Statistics
https://en.wikipedia.org/wiki/Statistics
Wikipedia: Machine Learning
https://en.wikipedia.org/wiki/Machine_learning
Dr. Andrew Ng’s World-Famous Machine Learning Course
https://www.youtube.com/playlist?list=PLA89DCFA6ADACE599
Data Science Concepts
http://www.saedsayad.com/data_mining_map.htm
26 Hilariously Inaccurate Predictions About the Future (from 2014)
http://www.cracked.com/pictofacts-101-26-hilariously-inaccurate-predictions-about-future/

Python: where to even begin?
Step 1/3: Download the free Windows 64-bit Anaconda “Data Science” distro:
https://www.anaconda.com/download/#windows

Step 2/3: Open “Anaconda Navigator” and launch “jupyter notebook”
Note: it may be slow to launch (especially the first time), but it will open a new browser window at
http://localhost:8888/tree (or something close; the port number 8888 can vary)

Step 3/3: Create a new Python 3 notebook and start learning at your own pace
e.g., open “Python for Data Analysis” in a second tab: https://github.com/wesm/pydata-book
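As a possible first notebook cell, the sketch below checks which Python the kernel is running and summarizes the home-run totals from the R example earlier in the deck. The variable names are arbitrary, and only the standard library is used.

```python
import sys
from statistics import mean, median

print(sys.version)  # confirms which Python the notebook kernel is running

# Career home-run totals from the linear regression example
career_hr = [762, 755, 714, 696, 660, 630, 614, 612, 609]
summary = {
    "mean": round(mean(career_hr), 1),
    "median": median(career_hr),
    "max": max(career_hr),
}
print(summary)  # {'mean': 672.4, 'median': 660, 'max': 762}
```

Press Shift+Enter to run the cell; the output appears directly beneath it.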
