Data
Outline
Why Data Science?
Users
Interface
Data Sources
How Big is Your Data?
Yottabyte (1 000 000 000 000 000 000 000 000 bytes)
Why Data Science?
Google, Yahoo today
– Web Search and Computational advertising
– Google: 35,000 searches/sec
– Yahoo! scale: 600 million users per month, 4 billion clicks per day, 25
terabytes of data collected every day
Netflix 2012
– Movie recommendations, netflix prize
– 100 million ratings, 500,000 users, 18,000 movies
Amazon 2013
– Product recommendations, reviews
– 29 million customers, millions of products
The key to answering these questions is to understand the data you have
and what the data inductively tells you.
History
The term Data Science appeared in the computer science
literature throughout the 1960s-1980s.
It was not until the late 1990s, however, that the field as we
describe it here began to emerge from the statistics and data
mining communities.
Data Types: Variety
Data Quality: Veracity
With big data technology we can now store and use these data sets with
the help of distributed systems, where parts of the data are stored in
different locations, connected by networks and brought together by
software.
Velocity refers to the speed at which new data is
generated and the speed at which data moves around.
Eliminating the need for silos gives us access to all the data at
once – including data from outside sources.
Acquire
This first activity is heavily dependent upon the situation and
circumstances.
Figure out what’s missing: Ask yourself what data would make a big
difference to your processes if you had access to it. Then go find it!
Strike a balance between the data that you need and the data
that you have.
Organizations often make decisions based on inexact data. They are not
able to see the whole picture and fail to look at their data and challenges
holistically. The end result is that valuable information is withheld from
decision makers.
Research has shown that almost 33% of decisions are made without good
data or information.
When Data Scientists are able to explore and analyze all the data, new
opportunities arise for analysis and data-driven decision making.
The insights gained from these new opportunities will significantly change
the course of action and decisions within an organization.
Now you can dive into the lake, bringing your analytics to the
data.
Data Scientists work across the spectrum of analytic goals
– Describe, Discover, Predict and Advise.
The finding must make sense with relatively little up-front training
or preparation on the part of the decision maker.
The finding must make the most meaningful patterns, trends and
exceptions easy to see and interpret.
The logic used to arrive at the finding must be clear and compelling
as well as traceable back through the data.
Organizations will repeat these activities with each new Data Science endeavor.
Over time, however, the level of effort necessary for each activity will change.
As more data is Acquired and Prepared, significantly less effort will need to be
expended on these activities. This is indicative of a maturing Data Science
capability.
The Data Science Maturity Model serves as a common framework for describing
the maturity progression and the components that make up a Data Science capability.
This means simply that each step forward in maturity drives you
to the right in the model diagram.
Fractal Analytic Model
Fractals are mathematical sets that display self-similar patterns.
As you zoom in on a fractal, the same patterns reappear.
Imagine a stalk of broccoli.
Rip off a piece of broccoli and the piece looks much like the
original stalk. Progressively smaller pieces of broccoli still look
like the original stalk.
Set up the infrastructure, aggregate and prepare the data, and incorporate domain expert
knowledge.
In order to achieve a greater analytic goal, you need to first decompose the
problem into sub-components to divide and conquer.
Decomposing the Problem
Decomposing the problem into manageable pieces is the
first step in the analytic selection process.
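The divide-and-conquer idea behind the Fractal Analytic Model can be sketched in code. The example below is purely illustrative (the `solve` function and the summation task are hypothetical, not from the source): a large analytic problem is recursively decomposed into self-similar sub-problems, each small piece is solved directly, and the partial results are aggregated, mirroring how each piece of broccoli resembles the whole stalk.

```python
# Illustrative sketch of divide-and-conquer problem decomposition.
# Hypothetical task: compute an aggregate (here, a sum) over a dataset
# by recursively splitting it into self-similar sub-problems.

def solve(data):
    # Base case: a piece small enough to analyze directly.
    if len(data) <= 2:
        return sum(data)
    # Decompose: split the problem into two sub-problems of the same shape.
    mid = len(data) // 2
    left, right = data[:mid], data[mid:]
    # Conquer each sub-problem, then aggregate the partial answers.
    return solve(left) + solve(right)

print(solve([3, 1, 4, 1, 5, 9, 2, 6]))  # 31
```

Each recursive call repeats the same Acquire/Prepare/Analyze pattern on a smaller piece, which is exactly the self-similarity the fractal analogy describes.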