1b Datascience
MSc: https://lms.uzh.ch/url/RepositoryEntry/17589469505
PhD: https://lms.uzh.ch/url/RepositoryEntry/17589469506
Data Science: Origins
John Tukey: The Future of Data Analysis. The Annals of Mathematical Statistics, 1962.
data analysis,
Reactions from the Statistics and Computer Science Communities
• Data science is concerned with really big data, which traditional computing
resources could not accommodate
• Data science trainees have the skills needed to cope with such big datasets.
The Two Cultures
Leo Breiman: Statistical Modeling: The Two Cultures. Statistical Science, 2001.
1. Generative Modeling
• Develop stochastic models which fit the data
• Make inferences about the data-generating mechanism based on model
structure
• Implicit assumption: There is a true model generating the data, and often a ‘best’
way to analyse the data.
2. Predictive Modeling
• Silent about the underlying mechanism generating the data
• Allows for many different predictive algorithms
• Interest: accuracy of predictions made by different algorithms on various datasets
• Epicenter: Machine Learning; sitting within CS departments
The Predictive Culture’s Secret Sauce: The Common Task Framework
(a) Publicly available training datasets with feature measurements and class
label for each observation.
(b) Competitors whose common task is to infer a class prediction rule from the
training data.
(c) A scoring referee, to which competitors can submit their prediction rule.
The referee runs the prediction rule against a testing dataset which is not
available to the competitors.
The referee objectively and automatically reports the score (prediction
accuracy) achieved by the submitted rule.
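The three ingredients (a) to (c) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the synthetic one-feature dataset, the `Referee` class, and the two competitor rules are hypothetical names invented for illustration, not part of any real competition platform.

```python
import random


def make_dataset(n, rng):
    """Synthetic two-class data: one feature, class means 0 and 2 (assumed)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        feature = rng.gauss(2.0 * label, 1.0)
        data.append((feature, label))
    return data


class Referee:
    """(c) Holds the hidden test set and reports only the accuracy."""

    def __init__(self, test_set):
        self._test_set = test_set  # never handed to competitors

    def score(self, rule):
        correct = sum(1 for x, y in self._test_set if rule(x) == y)
        return correct / len(self._test_set)


def fit_threshold_rule(train):
    """(b) Competitor 1: threshold at the midpoint of the two class means."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: 1 if x > threshold else 0


def fit_majority_rule(train):
    """(b) Competitor 2: always predict the most common training label."""
    ones = sum(y for _, y in train)
    majority = 1 if 2 * ones >= len(train) else 0
    return lambda x: majority


rng = random.Random(0)
train = make_dataset(200, rng)             # (a) public training data
referee = Referee(make_dataset(200, rng))  # hidden test data

leaderboard = {
    "threshold": referee.score(fit_threshold_rule(train)),
    "majority": referee.score(fit_majority_rule(train)),
}
for name, accuracy in sorted(leaderboard.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {accuracy:.3f}")
```

Competitors only ever see `train` and the score returned by the referee, so rules are compared on data they could not have been tuned against.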
General Experience with CTF
Those fields where machine learning has scored successes are essentially
those fields where CTF has been applied systematically.
The Common Task Framework is the single idea from machine learning and
data science that is most lacking attention in today’s statistical training.
Driving Forces behind this new Science
3. The challenge, in many fields, of more and ever larger bodies of data
As science itself becomes a body of data that we can analyze and study, there are
opportunities for improving the accuracy and validity of science, through the scientific
study of data analysis.
We Currently Witness an Industrial Revolution of Data!
How much Data is Generated each Day? (World Economic Forum, 2019)
https://www.weforum.org/agenda/2019/04/how-much-data-is-generated-each-day-cf4bddf29f/
The “Big Data” Buzz
Sciences Become Increasingly More Computational
http://www.economist.com/node/15579717
The Economist: “Our ability to capture, warehouse, and understand massive amounts of
data is changing science, medicine, business, and technology. As our collection of facts and
figures grows, so will the opportunity to find answers to fundamental questions. Because in
the era of big data, more isn’t just more. More is different.”
The Data Deluge Makes the Scientific Method Obsolete
“All models are wrong, and increasingly you can succeed without them.”
Full Scope of Data Science
4. Data Modeling
Each of the above facets of data science requires special skills beyond those
taught in, e.g., Statistics or Computer Science when taken alone.
Data Exploration and Preparation
80% of the effort devoted to data science is spent diving into messy data
• Value grouping
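As a minimal illustration of value grouping, raw values can be recoded into a few coarse categories, with missing or implausible entries flagged rather than silently kept. The ages, the cut-offs, and the `age_group` helper below are hypothetical and chosen only for the example.

```python
from collections import Counter

# Hypothetical raw ages, including a missing value and an implausible entry.
raw_ages = [23, 37, None, 41, 58, 19, 230, 64, 35]


def age_group(age):
    """Map a raw age to a coarse group; flag missing or implausible values."""
    if age is None or not 0 <= age <= 120:
        return "invalid"
    if age < 30:
        return "<30"
    if age < 50:
        return "30-49"
    return "50+"


groups = [age_group(a) for a in raw_ages]
counts = Counter(groups)
print(counts)
```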
Data Representation and Transformation
• Mathematical representations
• Fourier transform for acoustic data
• wavelet transform for image and sensor data
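To make the Fourier example concrete, the sketch below transforms a synthetic "acoustic" signal into the frequency domain and reads off its dominant frequency. It uses a naive discrete Fourier transform written with the standard library only. The sample rate and tone frequency are assumptions chosen so that each frequency bin corresponds to exactly 1 Hz, and in practice a fast Fourier transform from a numerical library would be used instead.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


sample_rate = 64  # samples per second (assumed)
tone_hz = 5       # frequency of the synthetic tone (assumed)

# One second of a pure sine tone: the time-domain representation.
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(sample_rate)]

# Frequency-domain representation; keep only bins below the Nyquist frequency.
spectrum = dft(signal)
magnitudes = [abs(c) for c in spectrum[: sample_rate // 2]]

# With one second of data, bin k corresponds to k Hz.
dominant_hz = max(range(len(magnitudes)), key=magnitudes.__getitem__)
print(dominant_hz)
```

In the time domain the tone is 64 numbers with no obvious structure, while in the frequency domain it is essentially a single large coefficient, which is why such transforms are useful representations.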
Computing with Data
Data Modeling
Data scientists use tools and viewpoints from both of Breiman’s modeling cultures
• Generative modeling
• Propose stochastic models that could have generated the data
• Derive methods to infer properties of the underlying generative mechanism
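The generative workflow can be sketched as follows: propose a stochastic model, here a straight line plus Gaussian noise, and then infer a property of the mechanism, here the slope with an approximate 95% interval. The data are simulated, and the true parameter values, sample size, and variable names are assumptions made only for this illustration.

```python
import math
import random

# Simulate data from the proposed model y = intercept + slope * x + noise.
rng = random.Random(1)
true_intercept, true_slope, noise_sd = 1.0, 2.0, 0.5  # assumed values
xs = [i / 10 for i in range(100)]
ys = [true_intercept + true_slope * x + rng.gauss(0, noise_sd) for x in xs]

# Least-squares estimates (the maximum-likelihood fit under Gaussian noise).
n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n
sxx = sum((x - x_mean) ** 2 for x in xs)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = y_mean - slope * x_mean

# Inference about the mechanism: standard error and interval for the slope.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
sigma2 = sum(r * r for r in residuals) / (n - 2)
slope_se = math.sqrt(sigma2 / sxx)
ci = (slope - 1.96 * slope_se, slope + 1.96 * slope_se)  # normal approximation

print(f"slope = {slope:.3f}, approx. 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The interval is only meaningful if the proposed model is a reasonable description of how the data arose, which is exactly the implicit assumption of the generative culture.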
Data Visualisation and Presentation
Science about Data Science
Scope of this Course:
Basics of Data Modeling
~ Machine Learning ~