
Foundations of Data Science, Fall 2024

Introduction to Data Science for Doctoral Students, Fall 2024

1b. Introduction: Data Science

Dr. Haozhe Zhang

Sept 16, 2024

MSc: https://lms.uzh.ch/url/RepositoryEntry/17589469505
PhD: https://lms.uzh.ch/url/RepositoryEntry/17589469506
Data Science: Origins

John Tukey: The Future of Data Analysis. The Annals of Mathematical Statistics, 1962.

“All in all I have come to feel that my central interest is in data analysis, which I take to include, among other things:

• procedures for analyzing data,
• techniques for interpreting the results of such procedures,
• ways of planning the gathering of data to make its analysis easier, more precise or more accurate,
• and all the machinery and results of (mathematical) statistics which apply to analyzing data.”
Data Science: These Days

Coupling of scientific discovery and practice that involves the collection, management, processing, analysis, visualisation, and interpretation of vast amounts of heterogeneous data associated with a diverse array of scientific, translational, and interdisciplinary applications.
Reactions from the Statistics and Computer Science Communities

Statistics = Science of collecting and analysing numerical data in large quantities

• Aren’t WE Data Science?
• A grand debate: Is Data Science just a ‘rebranding’ of statistics?
• Why do we need Data Science when we’ve had Statistics for centuries?

Computer Science pragmatic view:

• Data science is concerned with really big data, which traditional computing resources could not accommodate.
• Data science trainees have the skills needed to cope with such big datasets.

An account of these reactions and their legitimacy:

David Donoho: 50 Years of Data Science. 2015.
The Two Cultures

Leo Breiman: Statistical Modeling: The Two Cultures. Statistical Science, 2001.

1. Generative Modeling
• Develop stochastic models which fit the data.
• Make inferences about the data-generating mechanism based on model structure.
• Implicit assumption: there is a true model generating the data, and often a ‘best’ way to analyse the data.

Proponents: academic statisticians.

2. Predictive Modeling
• Silent about the underlying mechanism generating the data.
• Allows for many different predictive algorithms.
• Interest: the accuracy of predictions made by different algorithms on various datasets.
• Epicenter: machine learning, sitting within CS departments.

Proponents: computer scientists and industrial statisticians.
The Two Cultures

“The statistical community has been committed to the almost exclusive use of [generative] models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting problems.

[Predictive] modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets.

If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on [generative] models.”
The Predictive Culture’s Secret Sauce: The Common Task Framework

The common task framework (CTF)

• provides a standardized way to evaluate different algorithms
• encourages collaboration and competition among researchers and practitioners
The CTF comprises:

(a) Publicly available training datasets with feature measurements and a class label for each observation.

(b) Competitors whose common task is to infer a class prediction rule from the training data.

(c) A scoring referee, to which competitors can submit their prediction rule. The referee runs the prediction rule against a testing dataset which is not available to the competitors. The referee objectively and automatically reports the score (prediction accuracy) achieved by the submitted rule.
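A minimal sketch of such a scoring referee in Python, assuming accuracy as the score; the function name and toy data below are hypothetical:

  # A submitted prediction rule is any function mapping a feature vector to a class label.
  def referee_score(prediction_rule, test_features, test_labels):
      """Run the rule on the hidden test set and report prediction accuracy."""
      predictions = [prediction_rule(x) for x in test_features]
      correct = sum(p == y for p, y in zip(predictions, test_labels))
      return correct / len(test_labels)

  # Example submission: a trivial rule that always predicts class 1.
  always_one = lambda x: 1
  print(referee_score(always_one, [[0.2], [0.9], [0.5]], [1, 0, 1]))  # 0.67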

The CTF was applied successfully by DARPA to many problems, e.g.:

• machine translation, speaker identification, fingerprint recognition,
• information retrieval, OCR, automatic target recognition.
General Experience with CTF

1. Error rates decline by some percentage each year, to an asymptote depending on task and data quality.

2. Progress usually comes from many small improvements: a change of 1% can be a reason to break out the champagne.

3. Shared data plays a crucial role, and is re-used in unexpected ways.

Those fields where machine learning has scored successes are essentially those fields where the CTF has been applied systematically.

The Common Task Framework is the single idea from machine learning and data science that is most lacking attention in today’s statistical training.
Driving Forces behind this new Science

1. The formal theories of statistics

Statistics thus represents a fraction of data science

2. Accelerating developments in computers

Faster hardware, better algorithms

3. The challenge, in many fields, of more and ever larger bodies of data

Sciences and society become increasingly more digitalised

4. The emphasis on quantification in an ever wider variety of disciplines

Extract compact knowledge out of a sea of data

As science itself becomes a body of data that we can analyze and study, there are
opportunities for improving the accuracy and validity of science, through the scientific
study of data analysis.

We Currently Witness an Industrial Revolution of Data!

• Much cheaper to generate data: inexpensive sensors, smart devices, social software (Web 2.0), multiplayer games, the Internet of Things connecting homes, cars, and appliances, RFIDs, GPS, software logs, audio & video.

• Much cheaper to process data: advances in multicore CPUs, inexpensive cloud computing, open source software, unlimited fibre broadband.

• Society has become increasingly more computational: many categories of people are involved in generating, processing, and consuming data.
How much Data is Generated each Day? (World Economic Forum, 2019)

https://www.weforum.org/agenda/2019/04/how-much-data-is-generated-each-day-cf4bddf29f/
The “Big Data” Buzz

“Between the dawn of civilization and 2003, we only created five exabytes of information; now we’re creating that amount every two days.”

Eric Schmidt, Google (and others)
Sciences Become Increasingly More Computational

http://www.economist.com/node/15579717
The Economist: “Our ability to capture, warehouse, and understand massive amounts of data is changing science, medicine, business, and technology. As our collection of facts and figures grows, so will the opportunity to find answers to fundamental questions. Because in the era of big data, more isn’t just more. More is different.”
The Data Deluge Makes the Scientific Method Obsolete

• George Box, statistician (1970s):

“All models are wrong, but some are useful.”

• Peter Norvig, Google’s research director (2008):

“All models are wrong, and increasingly you can succeed without them.”

• Anand Rajaraman, academic/VC, and others (2012):

“The new oil/oxygen of Google/Facebook/Twitter/. . . = simple models + big data.” No need for a priori sophisticated and inherently wrong models.
Full Scope of Data Science

1. Data Exploration and Preparation

2. Data Representation and Transformation

3. Computing with Data

4. Data Modeling

5. Data Visualisation and Presentation

6. Science about Data Science

Each of the above facets of data science requires special skills beyond those taught in, e.g., Statistics or Computer Science alone.
Data Exploration and Preparation

80% of the effort devoted to data science is diving into messy data

• Exploratory Data Analysis requires serious time and effort, to learn about the data and to prepare it for further exploitation.

• Data cleaning to address anomalies

• Value recoding and reformatting

• Value grouping
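A minimal sketch of these three steps with pandas; the table, thresholds, and bins below are hypothetical:

  import pandas as pd

  # Toy table with typical anomalies.
  df = pd.DataFrame({"age": [25.0, -1.0, 40.0, 200.0],
                     "country": ["CH", "ch", "DE", None]})

  # Data cleaning: impossible ages become missing values.
  df.loc[(df["age"] < 0) | (df["age"] > 120), "age"] = float("nan")

  # Value recoding and reformatting: normalise the country codes.
  df["country"] = df["country"].str.upper()

  # Value grouping: bin ages into coarse categories.
  df["age_group"] = pd.cut(df["age"], bins=[0, 30, 65, 120],
                           labels=["young", "adult", "senior"])
  print(df)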

Data Representation and Transformation

Central step: implement an appropriate transformation, restructuring the originally given data into a new and more revealing form.

• Modern data management and database skills
• managing unstructured data (text)
• spreadsheets
• (No)SQL databases
• distributed databases
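A minimal sketch of the SQL side using Python’s built-in sqlite3; the table and values are hypothetical:

  import sqlite3

  conn = sqlite3.connect(":memory:")  # throwaway in-memory database
  conn.execute("CREATE TABLE measurements (sensor TEXT, value REAL)")
  conn.executemany("INSERT INTO measurements VALUES (?, ?)",
                   [("a", 1.5), ("a", 2.5), ("b", 9.0)])

  # Restructure with SQL: per-sensor aggregates instead of raw rows.
  for row in conn.execute("SELECT sensor, AVG(value) FROM measurements GROUP BY sensor"):
      print(row)  # ('a', 2.0) then ('b', 9.0)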

• Mathematical representations
• Fourier transform for acoustic data
• wavelet transform for image and sensor data
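A minimal sketch of such a revealing transformation with numpy: the Fourier transform exposes frequencies that are invisible in the raw samples (the signal and sampling rate below are made up):

  import numpy as np

  fs = 1000                                  # sampling rate in Hz
  t = np.arange(0, 1, 1 / fs)                # one second of samples
  signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

  spectrum = np.abs(np.fft.rfft(signal))     # magnitude spectrum
  freqs = np.fft.rfftfreq(len(signal), 1 / fs)

  # The two dominant peaks sit at 50 Hz and 120 Hz.
  print(freqs[np.argsort(spectrum)[-2:]])    # [120.  50.]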

Computing with Data

Data scientists need to keep current on new computing idioms

Programming languages for

• Data analysis and processing

• Text transformation and managing complex computational pipelines

Efficient centralised and distributed computing paradigms

• Distributed computation, algorithms, computational complexity
• Cloud computing to run massive numbers of jobs
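A minimal sketch of the underlying idiom, partition the data, map a computation across workers, reduce the partial results, using Python’s multiprocessing (the per-chunk computation is a stand-in):

  from multiprocessing import Pool

  def process_chunk(chunk):
      # Stand-in for a real per-partition computation.
      return sum(chunk)

  if __name__ == "__main__":
      data = list(range(1_000_000))
      chunks = [data[i::4] for i in range(4)]         # partition the data
      with Pool(processes=4) as pool:
          partials = pool.map(process_chunk, chunks)  # map across workers
      print(sum(partials))                            # reduce locally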

Data Modeling

Data scientists use tools and viewpoints from both of Breiman’s modelling cultures.

• Generative modeling
• Propose stochastic models that could have generated the data
• Derive methods to infer properties of the underlying generative mechanism

• Predictive (algorithmic) modeling
• Construct methods that predict well over some concrete dataset
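A minimal sketch contrasting the two cultures on the same synthetic data, assuming numpy and scikit-learn are available:

  import numpy as np
  from sklearn.linear_model import LinearRegression
  from sklearn.ensemble import RandomForestRegressor

  rng = np.random.default_rng(0)
  x = rng.uniform(0, 10, size=(200, 1))
  y = 2.0 * x[:, 0] + rng.normal(0, 1, size=200)   # true mechanism: y = 2x + noise

  # Generative culture: posit y = a*x + noise, then infer the mechanism (a).
  gen = LinearRegression().fit(x, y)
  print("inferred slope a:", gen.coef_[0])         # close to the true value 2.0

  # Predictive culture: assume no mechanism; judge by prediction accuracy alone.
  pred = RandomForestRegressor(n_estimators=50, random_state=0).fit(x, y)
  print("R^2 score:", pred.score(x, y))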

Data Visualisation and Presentation

Crystallise understanding of a dataset by developing a new plot which codifies it

• Histograms, scatterplots, time series plots

• Dashboards for monitoring data processing pipelines that access streaming or widely distributed data

• Visualisations for presenting conclusions from a modelling exercise or CTF challenge
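A minimal sketch of the first two plot types with matplotlib, on synthetic data:

  import numpy as np
  import matplotlib.pyplot as plt

  rng = np.random.default_rng(0)
  values = rng.normal(loc=0, scale=1, size=1000)

  fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
  ax1.hist(values, bins=30)                   # distribution of one variable
  ax1.set_title("Histogram")
  ax2.scatter(values[:-1], values[1:], s=5)   # successive values against each other
  ax2.set_title("Scatterplot")
  plt.tight_layout()
  plt.show()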

Science about Data Science

The true effectiveness of a tool: (the probability of deployment) × (the probability of effective results once deployed).
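For illustration (numbers hypothetical): a sophisticated tool deployed with probability 0.2 that is effective 90% of the time scores 0.2 × 0.9 = 0.18, while a simpler tool deployed with probability 0.5 that is effective 60% of the time scores 0.5 × 0.6 = 0.30, making the simpler tool the more effective one.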

Identify commonly occurring analysis/processing workflows

• Use data about their frequency of occurrence in scholarly/business domains

• Measure the effectiveness of standard workflows in terms of performance metrics: human time, computing resources, analysis validity

• Uncover emergent phenomena in data analysis, e.g.,
• new patterns arising in data analysis workflows
• disturbing artefacts in published analysis results
Scope of this Course:
Basics of Data Modelling ≈ Machine Learning