Introduction To Data Science
Outline
What? Why? Who? How?
Data Science
To gain insights into data through computation, statistics, and visualization
Nate Silver
#natesilverfacts
http://techcrunch.com/2012/11/07/nate-silver-as-software/
Human Genome
Microarrays
Affymetrix Chip
[wikipedia]
Sequencing
Sequencing Cost
Genome Data
Genome Visualization
[Krzywinski et al. 2009]
[Thorvaldsdóttir et al. 2013]
[Meyer et al. 2009]
Personalized Therapy
"...10 years from now, each cancer patient is going to want to get a genomic analysis of their cancer and will expect customized therapy based on that information."
Director, The Cancer Genome Atlas (TCGA), Time Magazine, 6/13/11
Netflix Prize
Some Challenges
Massive data (500k users, 20k movies, 100m ratings)
Curse of dimensionality (very high-dimensional problem)
Missing data (99% of data missing; not missing at random)
Extremely complicated set of factors that affect people's ratings of movies (actors, directors, genre, ...)
Need to avoid overfitting (test data vs. training data)
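The "99% of data missing" figure follows from the sizes quoted above. A minimal sketch of the arithmetic, assuming the rounded counts on the slide and one matrix cell per (user, movie) pair:

```python
# Rounded figures quoted on the slide.
n_users = 500_000
n_movies = 20_000
n_ratings = 100_000_000

# Every (user, movie) pair is one cell of the full ratings matrix.
n_cells = n_users * n_movies

observed_fraction = n_ratings / n_cells
missing_fraction = 1 - observed_fraction

print(f"cells in the full matrix: {n_cells:,}")
print(f"observed: {observed_fraction:.1%}, missing: {missing_fraction:.1%}")
```

With these counts only about one cell in a hundred holds a rating, which is why the problem is both sparse and very high-dimensional.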
http://blogs.hbr.org/cs/2012/10/big_data_hype_and_reality.html
Connectome
What is the connectivity of large brain circuits?
Connectome Workflow
Ultra-Thin Section EM
Automatic Reconstruction
2D Segmentation
2012
Data Science
Computer Science
Statistics
Domain Science
Drew Conway
Data Science
Outline
What? Why? Who? How?
BBC, 2013
Crime Prevention
Big Data
2.5 exabytes of data daily (2012)
[IBMbigdata]
[Domo]
"Between the dawn of civilization and 2003, we only created five exabytes of information; now we're creating that amount every two days."
Eric Schmidt, Google (and others)
http://onesecond.designly.com/
Smarter Devices
Commodity Computing
Ubiquitous Connectivity
travers808, Visual.ly
"By 2018, the US could face a shortage of up to 190,000 workers with analytical skills."
McKinsey Global Institute
"The sexy job in the next 10 years will be statisticians."
Data Scientists?
Hal Varian, Prof. Emeritus UC Berkeley, Chief Economist, Google
What is the scientific goal? What would you do if you had all the data? What do you want to predict or estimate?
How were the data sampled? Which data are relevant? Are there privacy issues?
What did we learn? Do the results make sense? Can we tell a story?
Outline
What? Why? Who? How?
Hanspeter Pfister
An Wang
My Background
Grew up in Switzerland
M.Sc. in EE from ETH Zurich
Ph.D. in CS from SUNY Stony Brook
11 years in industry (MERL)
At Harvard since 2007, Visual Computing Group (4 Ph.D., 7 PD)
Teach CS109 / CS171, taught CS175 / CS264 / CS205
Director of the Institute for Applied Computational Science (IACS)
Two daughters, Lilly (10) and Audrey (7)
Professor of the Practice in Statistics, Co-Director of Undergraduate Studies in Statistics
blitz@fas.harvard.edu, twitter @stat110, SC 714
Joe Blitzstein
CS109 Staff
Chris Beaumont, Head TF
Johanna Beyer
Nicolas Bonneel
Alex D'Amour
Rahul Dave
Brandon Haynes
Ray Jones
Steffen Kirchhoff
Seymour Knowles-Barley
Alexander Lex
Deqing Sun
Tim Brenner, A/V
About You
Outline
What? Why? Who? How?
Act I: Predictions
Data Science Process
Data Types and Data Munging
Probability Review
Classification & Regression
Cross Validation, Clustering
Visualization & Story Telling
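As a preview of the cross-validation topic listed above, here is a minimal sketch of k-fold cross validation in plain Python. The data are synthetic and the helper names (`k_fold_indices`, `fit_mean`, `fit_line`, `cross_val_mse`) are hypothetical, chosen only for illustration; this is not the course's own code. It compares a baseline that always predicts the training mean with a least-squares line, scoring each on data held out from fitting.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle 0..n_samples-1 and deal the indices into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Ordinary least-squares line y = a + b * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return lambda x: a + b * x

def cross_val_mse(fit, xs, ys, k=5, seed=0):
    """Average held-out mean squared error over k folds."""
    folds = k_fold_indices(len(xs), k, seed)
    fold_errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        # Fit on everything outside the current fold, score on the fold.
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        squared = [(model(xs[i]) - ys[i]) ** 2 for i in test_idx]
        fold_errors.append(sum(squared) / len(squared))
    return sum(fold_errors) / len(fold_errors)

# Synthetic data (an assumption for illustration): a noisy straight line.
rng = random.Random(42)
xs = [i / 39 for i in range(40)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.1) for x in xs]

mse_mean = cross_val_mse(fit_mean, xs, ys)
mse_line = cross_val_mse(fit_line, xs, ys)
print(f"baseline (mean) CV MSE:    {mse_mean:.4f}")
print(f"least-squares line CV MSE: {mse_line:.4f}")
```

Because every point is scored by a model that never saw it during fitting, the averaged error estimates performance on new data rather than on the training set.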
Abstractions...
...and Tools
xkcd
Homework
Real-world focus
Scrape and wrangle messy data
Apply sophisticated statistical analysis
Visualize and communicate results
Election data, movie reviews, Yelp! data, etc.
Final Project
Pick a project of your choosing
Teams of up to 2 students
Process books, web sites, screencasts
IPython (exceptions possible)
Best project prizes!
cs109.org
Prerequisites
Programming experience
This can be time-consuming
You will need to read online documentation
http://davidzinger.wordpress.com/2007/05/page/2/
Next Steps
HW 0
Good test of your basic skills
Installation of several Python frameworks
Not graded, do it as soon as possible
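One way to confirm an installation is to ask Python whether it can locate each package. This is only a sketch: the slide does not name the frameworks, so the package list below is a hypothetical example and should be replaced with whatever HW 0 actually asks for.

```python
import importlib.util
import sys

# Hypothetical package list: the slide does not name the frameworks.
PACKAGES = ["numpy", "scipy", "pandas", "matplotlib"]

def installed(names):
    """Map each top-level package name to whether Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

print("Python", sys.version.split()[0])
for name, ok in installed(PACKAGES).items():
    print(f"{name:12s} {'found' if ok else 'MISSING'}")
```

The check only locates each package without importing it, so it runs quickly and does not fail partway through when something is missing.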