
Data Strategy Feb 9 Part 2


To do for Today

Discuss “plans” for project 1 that will result in your product 1.


Discuss “data strategy”



Where can you get your data?
2

Search for “top 10 data sources”


You will get hundreds of data sources/topics of interest: for example,
education, election data, etc.
Don’t get confused: choose the one domain you are passionate about.
Develop and sketch a question or questions around that topic.
Form a problem statement.
Let’s now go through the project description.
Pew Research is a good source.



To do (for each of you)
3

Sign up:
 All: Sign up on Piazza. Can you do it now? https://piazza.com/class/kk6xmjqekrl1e9
 TAs: Please set up a Google sheet with slots (30, 30, 10). Each slot will take 2 students.
 Students: Please form a team of at most 2 and sign up on the Google sheet ASAP.
Task 1 of the project was explained. Work on it.
 “Research” and identify an application domain.
 Form data-driven questions to answer.
 Form the problem statement around this question, the data, and the application domain.
 Record it for submission.



Let’s review Project 1 for the last time
4

1. Frame the problem: understand the use case
2. Understand the data: exploratory data analysis
3. Extract features: what are the dependent and independent variables?
   (columns and rows in table data, for example)
4. Model the data and analyze: big data, small data, historical, streaming,
   real-time, etc.
5. Design, code, and experiment: use tools to clean, extract, plot, view
6. Present and test results: two types of clients, humans and systems
7. Go back to any of the steps based on the insights!



Frame the Problem
5

Have a standard use case format (what, why, how, stakeholders, data
in, info out, challenges, limitations, scope, etc.)
Refer to your software engineering course.
Statement of work (SOW): clearly state what you will accomplish.



Understand Data
6

Data represents the traces of real-world processes.
 Which traces we collect depends on the sampling methods.
 You build models to understand the data and extract meaning and information from
the data: statistical inference.
Two sources of randomness and uncertainty:
 The process that generates the data is random.
 The sampling process itself is random.
Your mind-set should be “statistical thinking in the age of big data”:
 Combine the statistical approach with big data.



Here are some questions to ask:
7

How big is the data?
Any outliers?
Missing data?
Sparse or dense?
Collisions of identifiers in different sets of data?
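A minimal sketch of how one might answer some of these questions in Python with NumPy (the array here is invented purely for illustration):

```python
import numpy as np

# Made-up measurements: one suspicious value and one missing entry.
data = np.array([3.1, 2.9, 3.0, 3.2, 2.8, 15.0, 3.1, np.nan])

print("size:", data.size)                     # how big is the data?
print("missing:", int(np.isnan(data).sum()))  # missing data?

# Flag outliers with the common 1.5*IQR rule.
clean = data[~np.isnan(data)]
q1, q3 = np.percentile(clean, [25, 75])
iqr = q3 - q1
outliers = clean[(clean < q1 - 1.5 * iqr) | (clean > q3 + 1.5 * iqr)]
print("outliers:", outliers)                  # [15.]
```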



New Kinds of Data
8

Traditional: numerical, categorical, or binary
Text: emails, tweets, NY Times articles
Records: user-level data, time-stamped event data, JSON-formatted log files
Geo-based location data
Network data (how do you sample and preserve network structure?)
Sensor data
Images



Uncertainty and Randomness
9

A mathematical model for uncertainty and randomness is offered by probability theory.
A world/process is defined by one or more variables. The model of the world is
defined by a function:
Model == f(w) or f(x,y,z) (a multivariate function)
The function is unknown and the model is unclear, at least initially. Typically our task
is to come up with the model, given the data.
Uncertainty is due to lack of knowledge: this week’s weather prediction (e.g.,
90% confident).
Randomness is due to lack of predictability: which of the faces 1-6 comes up when rolling a die.
Both can be expressed by probability theory.
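A tiny simulation makes the die example concrete: each individual roll is unpredictable, but the long-run frequencies obey probability theory. This Python sketch uses an arbitrary seed and sample size:

```python
import numpy as np

rng = np.random.default_rng(1)

# Randomness: any single roll of a fair die is unpredictable, but the
# long-run frequency of each face is governed by probability (1/6 each).
rolls = rng.integers(1, 7, size=60_000)   # faces 1..6 (upper bound exclusive)
for face in range(1, 7):
    print(face, round(float((rolls == face).mean()), 3))   # each ~ 0.167
```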



Statistical Inference
10

World → collect data → capture the understanding/meaning of the data
through models or functions → statistical estimators for predicting
things about the world.
Statistical inference is the development of procedures, methods, and theorems
that allow us to extract meaning and information from data that has been
generated by stochastic (random) processes.



Population and Sample
11

The population is the complete set of traces/data points.
 The US population is 314 million and the world population is 7 billion, for example.
 All voters, all things.
A sample is a subset of the complete set (or population): how we select the sample
introduces biases into the data.
See an example at http://www.sca.isr.umich.edu/
There, out of the 314 million US population, 250,000 households form the sample
(monthly).
Population → mathematical model → sample.
(My) big-data approach for the world population: a k-nary tree (MR) of 1 billion (on the
order of the 7 billion). I basically forced the big-data solution and did not sample; this is
possible in the age of big-data infrastructures.



Population and Sample (contd.)
12

Example: emails sent by people in the CSE department in a year.
Method 1: 1/10 of all emails over the year, randomly chosen.
Method 2: 1/10 of the people, randomly chosen; all their email over the year.
Both are reasonable sample selection methods for analysis.
However, the estimated pdfs (probability distribution functions) of the
emails sent per person will be different for the two samples.
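A small simulation sketch shows why (the email rates here are hypothetical, not real CSE data): under Method 1 every per-person count shrinks by roughly a factor of ten, while Method 2 preserves the per-person scale and spread.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical department: 1,000 people, each with their own average
# email rate (a few heavy emailers, many light ones).
rates = rng.lognormal(mean=3.0, sigma=1.0, size=1000)  # avg emails/year
emails = rng.poisson(rates)                            # actual yearly counts

# Method 1: keep each individual email with probability 1/10.
# A person's sampled count is Binomial(count, 0.1), so per-person
# counts are scaled down by roughly a factor of 10.
method1 = rng.binomial(emails, 0.1)

# Method 2: keep 1/10 of the people and all of their email.
people = rng.choice(1000, size=100, replace=False)
method2 = emails[people]

print("method 1 mean emails/person:", method1.mean())  # ~ true mean / 10
print("method 2 mean emails/person:", method2.mean())  # ~ true mean
```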



Big Data vs statistical inference
13

Sample size N:
For statistical inference, N < All.
For big data, N == All.
For some atypical big-data analyses, N == 1:
 A world model through the eyes of a prolific Twitter user.
 Followers of Ashton Kutcher: if you analyze the Twitter data you may get a world view
from his point of view.



Big-data context
14

For inference purposes you don’t need all the data.
At Google (the originator of big-data algorithms) people sample all the time.
However, if you want to render, you cannot sample.
For some DNA-based searches you cannot sample.
Say we draw some conclusions from samples of Twitter data: we
cannot extend them beyond the population that uses Twitter. And this is
what is happening now… be aware of biases.
Another example is the tweets pre- and post-Hurricane Sandy.
The Yelp example…



Exploratory Data Analysis (EDA)
15

 You achieve two things to get you started:
 Get an intuitive feel for the data.
 You can get a list of hypotheses.
 Traditionally: histograms.
 EDA is the prototype phase of ML and other sophisticated approaches.
 The basic tools of EDA are plots, graphs, and summary stats.
 It is a method for “systematically” going through the data: plotting distributions, plotting time series,
looking at pairwise relationships using scatter plots, generating summary stats (e.g., mean, min,
max, upper and lower quartiles), and identifying outliers.
 Gain intuition and understand the data.
 EDA is done to understand big data before using expensive big-data methodology. (A minimal
sketch follows this list.)
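A minimal EDA sketch in Python with pandas, assuming a hypothetical file mydata.csv (substitute your own data set); it covers size, missing values, summary stats, distributions, and pairwise relationships:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file name; substitute your own data set.
df = pd.read_csv("mydata.csv")

print(df.shape)          # how big is the data?
print(df.isna().sum())   # missing data, column by column
print(df.describe())     # mean, min, max, quartiles per numeric column

df.hist(bins=30)                                        # one histogram per numeric column
pd.plotting.scatter_matrix(df.select_dtypes("number"))  # pairwise relationships
plt.show()
```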



Extract Features
16

Data is cleaned up: data wrangling.
Ex: remove tags from HTML data.
Filter out only the important fields or features, say from a JSON file.
The features are often defined by the problem analysis and the use case.
Example: location and temperature are the only important data in a
tweet for a particular analysis (see the sketch below).
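A minimal sketch of this kind of filtering, using a hypothetical JSON record (real tweet JSON has different field names):

```python
import json

# Hypothetical record; real tweet fields differ.
raw = '{"id": 1, "text": "<b>hot day</b>", "location": "Buffalo, NY", "temperature": 31.5}'
record = json.loads(raw)

# Keep only the fields the use case needs: this is the feature extraction.
features = {k: record[k] for k in ("location", "temperature") if k in record}
print(features)   # {'location': 'Buffalo, NY', 'temperature': 31.5}
```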



Modeling
17

 A model is an abstraction of a real-world process.
Let’s say we have a data set with two columns x and y, and y is
dependent on x; we could write this as:
y = β1 + β2∗x (a linear relationship)
How do you build a model?
Probability distribution functions (pdfs) are the building blocks of statistical
models.
There are many distributions possible.



Probability Distributions
18

Normal, uniform, Cauchy, t-, F-, chi-square, exponential, Weibull,
lognormal, …
They are known as continuous density functions.
A random variable x or y can be assumed to have a probability
distribution p(x) if it maps x to a non-negative real number.
For a probability density function, if we integrate the function to find
the area under the curve, it is 1, allowing it to be interpreted as
probability.
Further: joint distributions, conditional distributions, …
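A quick numerical check of both properties for the standard normal density, using SciPy:

```python
from scipy import stats
from scipy.integrate import quad

# p(x) maps each x to a non-negative real number:
print(stats.norm.pdf(0.0))        # 0.3989..., the peak of the bell curve

# The area under the density curve is 1, so it can be read as probability.
area, _ = quad(stats.norm.pdf, -10, 10)
print(round(area, 6))             # 1.0
```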



Fitting a Model
19

Fitting a model means estimating the parameters of the model: what
distribution, and what are the values of min, max, mean, stddev, etc.
Don’t worry: R has built-in optimization algorithms that readily offer all
these functionalities.
It involves algorithms such as maximum likelihood estimation (MLE)
and optimization methods…
Example: y = β1 + β2∗x → y = 7.2 + 4.5∗x
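A minimal fitting sketch in Python/NumPy, with synthetic data generated from the example above and least squares standing in for MLE (the two coincide under Gaussian noise):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data drawn from y = 7.2 + 4.5*x plus Gaussian noise,
# mirroring the example above.
x = rng.uniform(0, 10, size=200)
y = 7.2 + 4.5 * x + rng.normal(0, 1.0, size=200)

# Least squares (the MLE under Gaussian noise) estimates the parameters.
beta2, beta1 = np.polyfit(x, y, deg=1)       # slope first, then intercept
print(f"y = {beta1:.1f} + {beta2:.1f}*x")    # ~ y = 7.2 + 4.5*x
```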



Design, code, deploy
20

Design first before you code: an important principle.
Code using best practices and “software engineering” principles.
Choose the right language and development environment.
Document within the code and outside.
Clearly state the steps in deploying the code.
Provide troubleshooting tips.



Present the Results
21

Good annotated graphs and visuals are important for explaining the results.
 Annotate using text, markup, and markdown.
Extras: provide the ability to interact with plots and assess what-if
conditions.
Explore d3.js (https://d3js.org/) and Tableau (https://www.tableau.com/academic),
but keep to Python viz libraries.
And a lot of creativity. Do not underestimate this: how do you present your
results effectively?
The results should need no explanation! (A minimal annotated-plot sketch follows.)
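A minimal annotated-plot sketch with matplotlib; the numbers are invented purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up yearly values, purely for illustration.
years = np.arange(2015, 2022)
volume = np.array([1.2, 1.5, 2.1, 2.0, 2.8, 3.9, 5.2])

fig, ax = plt.subplots()
ax.plot(years, volume, marker="o")
ax.set_title("Data volume per year (illustrative)")
ax.set_xlabel("Year")
ax.set_ylabel("Petabytes")
# Annotate the interesting point so the plot explains itself.
ax.annotate("growth accelerates", xy=(2020, 3.9), xytext=(2015.5, 4.5),
            arrowprops=dict(arrowstyle="->"))
plt.show()
```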



Iterate
22

Iterate through any of the steps as warranted by the feedback and the results.
The data science process is an iterative process.
Before you develop a tool or automation based on the results, test the
code thoroughly.
Read Chapter 2.



What is your data strategy?
23



Intelligence and Scale of Data
24

Intelligence is a set of discoveries made by federating/processing information collected
from diverse sources.
Information is a cleansed form of raw data.
For statistically significant information we need a reasonable amount of data.
For gathering good intelligence we need a large amount of information.
As pointed out by Jim Gray in the Fourth Paradigm book [6], enormous amounts of data are
generated by millions of experiments and applications.
Thus intelligence applications are invariably data-heavy, data-driven, and data-intensive.
Data is gathered from the web (public or private, covert or overt) and generated by a large
number of domain applications.
Data-intensive computing is the fourth paradigm. Empirical evidence, scientific theory,
and scientific computing are the other three.



Intelligence (or origins of Big-data computing?)
25

 Search for Extra-Terrestrial Intelligence (the seti@home project)
 The Wow! signal: http://www.bigear.org/wow.htm

Characteristics of intelligent applications
26

 Google search: how is it different from the regular search that existed before it?
It took advantage of the fact that hyperlinks within web pages form an underlying structure
that can be mined to determine the importance of various pages.
 Restaurant and menu suggestions: instead of “Where would you like to go?”, “Would you like to
go to CityGrille?”
 Learning capacity from previous data: habits, profiles, and other information gathered over
time.
 Collaborative, interconnected-world inference capability: Facebook friend suggestions.
 Large-scale data requiring indexing.
 … Did you know Amazon is going to ship things before you order?



Data-intensive application characteristics
27

[Diagram: models are built from algorithms (thinking) and data structures (infrastructure),
which operate on aggregated content (raw data) and reference structures (knowledge).]


Basic Elements
28

Aggregated content: a large amount of data pertinent to the specific application;
each piece of information is typically connected to many other pieces. Ex: DBs.
Reference structures: structures that provide one or more structural and
semantic interpretations of the content. Reference structures about a specific
domain of knowledge come in three flavors: dictionaries, knowledge bases, and
ontologies.
Algorithms: modules that allow the application to harness the information
hidden in the data. They are applied on the aggregated content and sometimes
require the reference structures. Ex: MapReduce.
Data structures: newer data structures to leverage the scale and the WORM
characteristics; ex: MS Azure, Apache Hadoop, Google BigTable.



Different Types of Storage
29

• The Internet introduced a new challenge in the form of web logs and web crawler data: large-scale, “peta-scale” data.
• But observe that this type of data has a uniquely different characteristic from your transactional or “customer
order” data, or “bank account” data:
• The data type is “write once read many” (WORM):
• Privacy-protected healthcare and patient information;
• Historical financial data;
• Other historical data.
 Relational file systems and tables are insufficient.
• Large <key, value> stores (files) and storage management systems.
• Built-in features for fault tolerance, load balancing, data transfer and aggregation, …
• Clusters of distributed nodes for storage and computing.
• Computing is inherently parallel.
• Streaming systems. (A toy WORM-store sketch follows.)
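A toy sketch of the WORM idea as a <key, value> store, illustrative only (this is not how HDFS or Azure implements it):

```python
class WormStore:
    """Toy write-once-read-many (WORM) <key, value> store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # WORM: a key can be written exactly once, never updated.
        if key in self._data:
            raise ValueError(f"{key!r} already written: WORM data is never updated")
        self._data[key] = value

    def get(self, key):
        return self._data[key]

store = WormStore()
store.put("log:2021-10-27:0001", {"event": "login", "user": "alice"})
print(store.get("log:2021-10-27:0001"))
# A second put() to the same key would raise: logs are append-only.
```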



Big-data Concepts
30

The special <key, value> store originated from the Google File System (GFS).
The Hadoop Distributed File System (HDFS) is the open-source version of this (currently
an Apache project).
Parallel processing of the data uses the MapReduce (MR) programming model.
Challenges:
 Formulation of MR algorithms
 Proper use of the features of the infrastructure (ex: sort)
 Best practices in using MR and HDFS
An extensive ecosystem consists of other components such as column-based stores
(HBase, BigTable), big-data warehousing (Hive), workflow languages, etc.
And now Kafka, Confluent, … (A word-count sketch of the MR model follows.)
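A minimal sketch of the MR programming model as plain Python functions: it mimics map, shuffle/sort, and reduce in a single process, whereas real Hadoop distributes each stage across a cluster.

```python
from collections import defaultdict

def mapper(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    """Reduce: sum all the counts emitted for one word."""
    return word, sum(counts)

def mapreduce(lines):
    groups = defaultdict(list)            # stands in for the shuffle/sort stage
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

print(mapreduce(["big data big ideas", "data data everywhere"]))
# {'big': 2, 'data': 3, 'ideas': 1, 'everywhere': 1}
```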



Cloud Computing
31

The cloud is a facilitator for big-data computing and is indispensable in
this context.
The cloud provides processors, software, operating systems, storage,
monitoring, load balancing, clusters, and other requirements as a service.
The cloud offers accessibility to big-data computing.
Cloud computing models:
 platform (PaaS): Microsoft Azure
 software (SaaS): Google App Engine (GAE)
 infrastructure (IaaS): Amazon Web Services (AWS)
 services-based application programming interfaces (APIs)



Data Strategy
32

 In this era of big data, what is your data strategy?
 Strategy as in simply “planning for the data challenge”.
 It is not only about big data: all sizes and forms of data.
 Data collection from customers used to be an elaborate task: surveys and other
such instruments.
 Nowadays data is available in abundance, thanks to technological advances
as well as social networks.
 Data is also generated by many of your own business processes and applications.
 Data strategy means many different things: we will discuss this next.



Components of a Data Strategy [2]
33

Data integration
Metadata
Data modeling
Organizational roles and responsibilities
Performance and metrics
Security and privacy
Structured data management
Unstructured data management
Business intelligence
Data analysis and visualization
Tapping into social data

This course will provide skills in the big-data technologies, tools, environments, and APIs
available for developing and implementing one or more of these components.



Data Strategy for newer kinds of data
34

How will you collect data? Aggregate data? What are your sources?
How will you store the data? And where?
How will you use the data? Analyze it? Analytics? Data mining?
Pattern recognition?
How will you present or report the data to the stakeholders and
decision makers? Visualization?
How will you archive the data for provenance and accountability?



References
35

[1] Bloomberg global COVID tracker:
https://www.bloomberg.com/graphics/covid-vaccine-tracker-global-distribution/, last viewed
Feb 3, 2021.
[2] S. Adelman, L. Moss, M. Abai. Data Strategy. Addison-Wesley, 2005.
[3] T. Davenport. A Predictive Analytics Primer. Harvard Business Review, Sept 2, 2014.
http://blogs.hbr.org/2014/09/a-predictive-analytics-primer/
[4] M. NemSchoff. A quick guide to structured and unstructured data. Smart Data Collective,
June 28, 2014.
[5] Lin and Dyer. Data-Intensive Text Processing with MapReduce.
https://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf, last viewed 2021.
[6] J. Gray. The Fourth Paradigm: Data-Intensive Scientific Discovery.
https://www.microsoft.com/en-us/research/publication/fourth-paradigm-data-intensive-scientific-discovery/,
last viewed 2021.



Summary
36

We are entering a watershed moment in the internet era.
At its core and center, this moment involves big-data analytics and tools that
provide intelligence in a timely manner to support decision making.
Newer storage models, processing models, and approaches have
emerged.
We will learn about these and develop software using these newer
approaches (streaming systems) to data.
Next lecture I will begin the discussion of big-data analytics and the
Hadoop ecosystem.

