Data Strategy Feb 9 Part 2
Sign up:
All: Sign up on Piazza: Can you do it now? https://piazza.com/class/kk6xmjqekrl1e9
TAs: Please set up a Google sheet with slots (30, 30, 10). Each slot will take 2 students.
Students: Please form a team of at most 2 and sign up on the Google sheet ASAP.
Task 1 of the project was explained. Work on it.
“Research” and identify an application domain
Form data-driven questions to answer.
Form the problem statement around this question, data and the application domain.
Record it for submission.
Have a standard use case format (What, why, how, stakeholders, data
in, info out, challenges, limitations, scope etc.)
Refer to your software engineering course
Statement of work (SOW): clearly state what you will accomplish
Sample size N
For statistical inference N < All
For big data N == All
For some atypical big data analysis N == 1
World model through the eyes of a prolific Twitter user
Followers of Ashton Kutcher: if you analyze the Twitter data, you may get a world view
from his point of view
For inference purposes, you don’t need all the data.
At Google (the originator of big data algorithms), people sample all the time.
However, if you want to render, you cannot sample.
For some DNA-based searches, you cannot sample.
Say we draw some conclusions from samples of Twitter data: we
cannot extend them beyond the population that uses Twitter. And this is
what is happening now… be aware of biases.
Another example is the tweets pre- and post-Hurricane Sandy.
Yelp example..
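The point that inference works with N < All can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the population is synthetic (made-up numbers standing in for "All"), and the function name `estimate_mean_from_sample` is hypothetical.

```python
import random
import statistics


def estimate_mean_from_sample(population, n, seed=0):
    """Estimate the population mean from a simple random sample of size n."""
    rng = random.Random(seed)
    sample = rng.sample(population, n)  # sampling without replacement
    mean = statistics.fmean(sample)
    # Standard error of the mean, based on the sample standard deviation.
    std_err = statistics.stdev(sample) / n ** 0.5
    return mean, std_err


# Synthetic stand-in for "All": 100,000 made-up measurements (assumption).
rng = random.Random(42)
population = [rng.gauss(50.0, 10.0) for _ in range(100_000)]

true_mean = statistics.fmean(population)
sample_mean, std_err = estimate_mean_from_sample(population, n=1_000)

print(f"N == All  : mean = {true_mean:.2f}")
print(f"N == 1000 : mean = {sample_mean:.2f} +/- {1.96 * std_err:.2f}")
```

The estimate is only valid for the population the sample was drawn from, which is the bias warning above: a sample of Twitter users says nothing about people who do not use Twitter.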
Good annotated graphs and visuals are important for explaining the results
Annotate using text, markup and markdown
Extras: provide ability to interact with plots and assess what-if
conditions
Explore
(d3.js : https://d3js.org/
Tableau: https://www.tableau.com/academic)
But keep to Python viz libraries.
And a lot of creativity. Do not underestimate this: how to present your
results effectively?
Should need no explanation!
Iterate through any of the steps as warranted by the feedback and the results
The data science process is iterative
Before you develop a tool or automation based on the results, test the
code thoroughly.
Read Chapter 2
B.RAMAMURTHY
Google search: How is it different from the regular search that existed before it?
It took advantage of the fact that the hyperlinks within web pages form an underlying structure
that can be mined to determine the importance of various pages.
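As a rough illustration of mining link structure for page importance, here is a minimal sketch in the style of PageRank (the notes do not name the algorithm; PageRank is the commonly cited one). The four-page graph, the damping value 0.85, and the function name `pagerank` are assumptions for illustration only.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration sketch: links maps each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal importance
    for _ in range(iterations):
        # Every page gets a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page passes its importance equally to the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: three pages link to C, and C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
best = max(ranks, key=ranks.get)

for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, C comes out on top because most pages link to it, and A ranks second because the most important page links to it.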
Restaurant and menu suggestions: instead of “Where would you like to go?”, “Would you like to
go to CityGrille?”
Capacity to learn from previous data on habits, profiles, and other information gathered over
time.
Inference in a collaborative and interconnected world: Facebook friend suggestions
Large scale data requiring indexing
…Did you know Amazon is going to ship things before you order? Here
Models
Algorithms (thinking)
Data structures (infrastructure)
Aggregated content (raw data)
Reference structures (knowledge)
Data integration
Meta data
Data modeling
Organizational roles and responsibilities
Performance and metrics
Security and privacy
Structured data management
Unstructured data management
Business intelligence
Data analysis and visualization
Tapping into social data
This course will provide skills in big data technologies, tools, environments and APIs available for
developing and implementing one or more of these components.
How will you collect data? Aggregate data? What are your sources?
How will you store the data? And where?
How will you use the data? Analyze it? Analytics? Data mining?
Pattern recognition?
How will you present or report the data to the stakeholders and
decision makers? Visualization?
Archive the data for provenance and accountability?
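The questions above (collect, store, analyze, present, archive) can be walked through end to end in a small sketch. Everything here is an assumption for illustration: the records are made up, the table name `reviews` is hypothetical, and an in-memory SQLite database stands in for whatever store a real project would choose.

```python
import hashlib
import json
import sqlite3

# Collect: made-up records standing in for a real data source (assumption).
records = [
    {"region": "north", "rating": 4.0},
    {"region": "north", "rating": 5.0},
    {"region": "south", "rating": 3.0},
]

# Store: an in-memory SQLite table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (region TEXT, rating REAL)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [(r["region"], r["rating"]) for r in records],
)

# Analyze: average rating per region.
summary = conn.execute(
    "SELECT region, AVG(rating) FROM reviews GROUP BY region ORDER BY region"
).fetchall()

# Present: a plain-text report for stakeholders.
for region, avg_rating in summary:
    print(f"{region}: average rating {avg_rating:.1f}")

# Archive: a checksum of the raw input, so the report can be traced
# back to exactly the data it was computed from (provenance).
raw_bytes = json.dumps(records, sort_keys=True).encode("utf-8")
checksum = hashlib.sha256(raw_bytes).hexdigest()
print("input sha256:", checksum)

conn.close()
```

Recording the checksum alongside the report is one simple way to support accountability: if the input changes, the hash no longer matches.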