Week 1-Introduction-V3-class
501
Unless otherwise stated, this presentation refers to study material provided by AWS Academy.
• Introductions
• Quick Survey
• Course Overview
• Syllabus Walkthrough
• Introduction to Data and More…
Syllabus Overview
What is Possible
Course Composition
Emphasis
What is Data?
…for decades…
“Without big data, you are blind and deaf and in the middle of a freeway.” - Geoffrey Moore
“Data is just like crude. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value.” - Michael Palmer
“With data collection, ‘the sooner the better’ is always the best answer.” - Marissa Mayer
“If we have data, let’s look at data. If all we have are opinions, let’s go with mine.” - Jim Barksdale
© 2022, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example: Identify pictures of dogs
Types of analytics (from less valuable and easier to derive, to more valuable and more difficult):
• Descriptive: What happened?
• Diagnostic: Why did something happen?
• Predictive: What will happen?
• Prescriptive: How can we make something happen or prevent something from happening?
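The four kinds of analytics (descriptive, diagnostic, predictive, prescriptive) can be contrasted on one tiny dataset. The sketch below is only an illustration: the order counts, the `promo` flag, and the helper names `forecast` and `recommend` are all hypothetical, and the "prediction" is a deliberately naive group average rather than a real model.

```python
from statistics import mean

# Hypothetical daily order counts and whether a promotion ran that day.
orders = [
    {"day": 1, "orders": 100, "promo": False},
    {"day": 2, "orders": 104, "promo": False},
    {"day": 3, "orders": 150, "promo": True},
    {"day": 4, "orders": 98, "promo": False},
    {"day": 5, "orders": 156, "promo": True},
    {"day": 6, "orders": 102, "promo": False},
]

# Descriptive: what happened? Summarize the history.
average_orders = mean(r["orders"] for r in orders)

# Diagnostic: why did it happen? Compare promotion and regular days.
promo_avg = mean(r["orders"] for r in orders if r["promo"])
regular_avg = mean(r["orders"] for r in orders if not r["promo"])
promo_lift = promo_avg - regular_avg


def forecast(promo: bool) -> float:
    """Predictive: what will happen? Naive guess from the group average."""
    return promo_avg if promo else regular_avg


def recommend(target_orders: float) -> str:
    """Prescriptive: what should we do to reach the target?"""
    if forecast(promo=False) >= target_orders:
        return "no promotion needed"
    if forecast(promo=True) >= target_orders:
        return "run promotion"
    return "promotion alone is not enough"


print(f"Average orders per day: {average_orders:.1f}")
print(f"Lift on promotion days: {promo_lift}")
print(f"Advice for a target of 140 orders: {recommend(140)}")
```

Each step builds on the one before it, which matches the idea that the more valuable kinds of analytics are also the harder ones to derive.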
Data Literacy and Gen D Worker
• Data Indifference: Most decisions are made from gut feeling, without even being curious about data.
• Data Aware: Data is collected but not necessarily used to make decisions. It is used at discretion or only as needed.
• Data Informed: Business users often use data to make business decisions.
• Data Driven: Data is the DNA of all decision making. Data collection, cleaning, analytics, and insights are mature.
https://www.smartsheet.com/data-driven-decision-making-management
https://www.bcg.com/publications/2020/increasing-odds-of-success-in-digital-transformation
Hindrances to Data Progression
Source: https://www.newvantage.com/_files/ugd/e5361a_ad5a8b3da8254a71807d2dccdb0844be.pdf
A Few More Insights
High Level Challenges…
Data becomes less valuable for decision-making over time
How quickly can you analyze incoming data? From most valuable to less valuable:
• Preventive/Predictive: near real time
• Actionable: within seconds
• Reactive: within minutes to hours
• Historical: within days to months
The trade-offs of data-driven decisions
Five Key Challenges
• volume
• velocity
• variety
• veracity
• value
Volume
Amount of Data
• It is the base of big data
Velocity
Variety
• Structured
• Unstructured
• Semi-structured
Types of Data
Structured
• Data that has been predefined and formatted to a set structure
• Easily used by business users
• Predefined structure makes it easy on machine learning algorithms
• Cons: Limited storage formats and choices, e.g. RDBMS or DW
• Cons: Predefined structure also limits use and manipulation
• Examples: bank account data, HRMS, company financial systems, CRM
Unstructured
• Data stored in its native format and not processed until it is used
• Needs analytics to derive patterns and behaviors
• Native format helps to use data as-is, and it is easy to collect because there are few rules on the structure
• Cons: Requires more storage than structured data, e.g. data lakes or cloud DWs
• Cons: Limited skill set availability due to the technical nature of the toolsets and frameworks
• Examples: IoT sensor data, log files, social media networks, collaboration tools/websites
Sources of Data
Business or Enterprise Application | HRMS, ERP, CRM, PPM, EMR | Structured | Low | Mid | Mid | Low
Documents | PDF, XLS, JSON and so on | Unstructured | Mid | Low | Low | Mid
Data Storage | File streams, NoSQL, R-ORDBMS | Structured/Hybrid | Low | Mid | High | Mid
• Structured
  • Organized into a formatted repository
  • Data is made more addressable for effective data processing and analysis
• Unstructured
  • Emails
  • Text files
• Semi-structured
  • Hasn’t been organized
  • Has metadata, for example pictures grouped by tags
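The difference between the three types shows up in how much work it takes to read them. The following is a minimal sketch using only the Python standard library; the sample CSV, the picture records with tags, and the email text are all made up for illustration.

```python
import csv
import io
import json
from collections import defaultdict

# Structured: fixed columns known up front, such as rows exported from a CRM.
structured_csv = "customer_id,name,balance\n1,Ana,250.00\n2,Raj,90.50\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))
total_balance = sum(float(r["balance"]) for r in rows)

# Semi-structured: no fixed schema, but metadata (tags) describes each item.
semi_structured_json = """[
  {"file": "img_001.jpg", "tags": ["dog", "park"]},
  {"file": "img_002.jpg", "tags": ["cat"]},
  {"file": "img_003.jpg", "tags": ["dog"], "camera": "phone"}
]"""
pictures = json.loads(semi_structured_json)
by_tag = defaultdict(list)
for pic in pictures:
    for tag in pic["tags"]:
        by_tag[tag].append(pic["file"])

# Unstructured: free text with no fields at all; even a simple question
# requires analysis (here, a crude keyword check).
email_body = "Hi team, the delivery was late again and the customer is upset."
mentions_delay = any(word in email_body.lower() for word in ("late", "delay"))

print(total_balance)
print(dict(by_tag))
print(mentions_delay)
```

The structured rows can be summed directly by column name, the tagged pictures can be grouped only because each record carries metadata, and the email needs text analysis before it yields anything.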
Veracity
https://datascience.aero/big-data-veracity-value/
Value
https://mattturck.com/data2021/
Data Life Cycle
https://online.hbs.edu/blog/post/data-life-cycle
Example: Data Science Lifecycle
Reis, Joe; Housley, Matt. Fundamentals of Data Engineering (p. 34). O'Reilly Media. Kindle Edition.
• Collect
• Process
• Clean
• Analyze
Collect
Clean
• Data is cleansed to improve its quality.
Analyze
[Figure: data pipeline from data sources through ingestion, storage, processing, and analysis & visualization to predictions and decisions]
Data wrangling (transformation):
• Discover
• Clean
• Normalize
• Enrich
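The wrangling steps named above (discover, clean, normalize, enrich) can be sketched as four small functions applied in sequence. This is an illustration under stated assumptions, not a prescribed implementation: the weather-style records, the field names `city`, `temp_f`, and `temp_c`, and the choice to drop rows with missing readings are all hypothetical.

```python
# Hypothetical raw readings with inconsistent casing, stray spaces,
# string-typed numbers, and one missing value.
raw_records = [
    {"city": " Boston ", "temp_f": "68"},
    {"city": "boston", "temp_f": "70"},
    {"city": "Austin", "temp_f": None},
    {"city": "AUSTIN", "temp_f": "95"},
]


def discover(records):
    """Profile the data: count rows and missing temperature values."""
    missing = sum(1 for r in records if r["temp_f"] is None)
    return {"rows": len(records), "missing_temp": missing}


def clean(records):
    """Drop records whose reading is missing."""
    return [r for r in records if r["temp_f"] is not None]


def normalize(records):
    """Put values in one consistent form: tidy city names, numeric readings."""
    return [
        {"city": r["city"].strip().title(), "temp_f": float(r["temp_f"])}
        for r in records
    ]


def enrich(records):
    """Add a derived field: the temperature in Celsius."""
    return [
        {**r, "temp_c": round((r["temp_f"] - 32) * 5 / 9, 1)} for r in records
    ]


profile = discover(raw_records)
wrangled = enrich(normalize(clean(raw_records)))

print(profile)
for row in wrangled:
    print(row)
```

Keeping each step as its own function makes it easy to rerun or swap a single stage when the data changes, which suits the iterative nature of a pipeline.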
Iterative processing through the pipeline
[Figure: the pipeline from data sources to predictions and decisions, drawn as a numbered loop (1, 2, 3) with the labels "Evaluate results and iterate" and "Additional data"]
Critical Elements of Data Lifecycle
Reis, Joe; Housley, Matt. Fundamentals of Data Engineering (p. 24). O'Reilly Media. Kindle Edition.
Modern Data Professional
Considerations
• Depicts the interdependencies of data elements and the respective steps involved in the data life cycle.
• Any missing piece burdens the other elements in the chain and, worse, can break the chain and derail analytics goals.
• Refactor, Architect
• Operationalize
• Build skills
• Govern with agility
• Organize
• Imbibe data-driven culture
needs to follow.
with all staff handling sensitive issues, data scientists should regularly
Don’t collect the data for the sake of its availability and potential
purpose.
only ensure that the most appropriate data for the analysis has
been collected.
Misconceptions of BIG Data
can use big data to form patterns and predict many things, big data
are at higher risk of heart ailments so that precautions can be taken but
many things in more complex domains such as law and politics cannot
be predicted.
can use big data to form patterns and predict many things, big
examples
delays to deliveries.
• Data analysis is the process of compiling, processing, and analyzing data so that
References:
• https://www.newvantage.com/_files/ugd/e5361a_ad5a8b3da8254a71807d2dccdb0844be.pdf
• https://online.hbs.edu/blog/post/data-life-cycle
• https://mattturck.com/data2021/