
DSUP Chapter 1 PDF


Introduction to Big Data and Data Science
What is Data?

• Numeric data
• Streams of images
• Data generated by social networks


Problems of Large Volume Of Data

1. Expenses in storing and handling huge amounts of data


2. Data is heterogeneous
3. Accessing and processing speed:
If we have a 100 MB/s I/O channel and need to process 2 TB of
data, reading the data alone takes about 5.6 hours, or roughly 6 hours
(1 terabyte (TB) equals 1,000 gigabytes (GB) or 1,000,000 megabytes (MB)).
2 TB = 2,000,000 MB
At 100 MB per sec: 2,000,000 / 100 = 20,000 secs
20,000 secs / 3,600 ≈ 5.6 hrs
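The transfer-time estimate can be checked with a few lines of Python. This is a minimal sketch that assumes decimal units (1 TB = 1,000,000 MB) and a channel that sustains 100 MB/s the whole time; the function name is illustrative and not part of the slides.

```python
MB_PER_TB = 1_000_000  # decimal units, as stated in the slide


def transfer_time_seconds(data_mb: float, rate_mb_per_s: float) -> float:
    """Seconds needed to move data_mb megabytes at rate_mb_per_s."""
    return data_mb / rate_mb_per_s


seconds = transfer_time_seconds(2 * MB_PER_TB, 100)
hours = seconds / 3600
print(f"{seconds:.0f} s = {hours:.2f} h")  # 20000 s = 5.56 h
```

The exact figure is a little under the rounded 6 hours; either way, a single I/O channel is the bottleneck at this scale.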
Big Data
1. The basic idea behind Big Data is that everything we do
leaves a digital trace, or data, which can be analyzed to
obtain actionable insights.
2. Big Data is the extraction, analysis, and management of a
large volume of data. The term also names the data itself:
a very large collection of data.
3. Almost every industry in the world today makes use of Big
Data. Industries like finance, healthcare, banking, and
manufacturing have to deal with very large amounts of data.
4. Such amounts of data, which could not be processed earlier
due to limitations in computational techniques, can now
be processed with highly advanced tools and
methodologies.
5. Tools for Big Data include Apache Hadoop, Spark, and
Flink.
Characteristics Of Big Data
1. The characteristics of big data are often referred to as the three Vs,
to which a fourth V, veracity, is commonly added:
• Volume: How much data is there?
• Variety: How diverse are the different types of data?
• Velocity: At what speed is new data generated?
• Veracity: How accurate is the data?
Data Scientist
Data scientists tackle questions about the future.
Tools used
Data Science
1. Data Science is the study of data. It is about finding patterns in data through
an in-depth analysis.
2. Data Science is a field or domain which includes and involves working with a
huge amount of data and using it for building predictive models.
3. It’s about digging into and capturing the data (building the model), analyzing
it (validating the model), and utilizing it (deploying the best model).
4. It is an intersection of data and computing. It is a blend of the fields of
Computer Science, Business, and Statistics.
5. This technology field uses various modeling techniques such as ML algorithms,
statistical methods, and mathematical analysis.
6. With Data Science, employees can assist in the decision-making process which
will help the business to grow and enhance the quality of the product.
7. This is a field of applied mathematics and statistics. It brings into play a
scientific approach to extract meaningful information and insights and predict
future patterns and behaviors from data.
How Data Science Finds Relationships Between Data
Big Data vs. Data Science
1. Big data is a collection of data sets so large or
complex that it becomes difficult to process them
using traditional data management techniques
such as relational database management systems
(RDBMS).
2. Big Data deals with handling and managing huge
amounts of data. Prior to Big Data, industries did
not possess the required tools and resources to
manage such a large volume of data.
3. Data science involves using methods to
scientifically analyze massive amounts of data
using statistical techniques and extract the
knowledge it contains. It is more quantitative in
nature and uses various statistical approaches to
identify the patterns within the data.
4. The process of Data Science involves data
extraction, transformation, analysis, and
prediction to gain insights about the data.
5. The relationship between big data and data science
is like the relationship between crude oil and an oil
refinery.
1. While Big Data is about storing data, Data Science is
about analyzing it. However, keep in mind that Data
Science covers a wide range of data operations, including work on Big
Data. A Data Scientist often analyzes data that is large enough to
require a big data platform. Therefore, an ideal data scientist
must also possess knowledge of big data tools.
2. The roles of Data Scientists and Big Data specialists also differ. A
Data Scientist is required to analyze, draw insights from the data,
visualize the data and communicate the results through robust
storytelling. A Big Data Specialist, on the other hand, develops,
maintains, and administers Big Data clusters that hold a
voluminous amount of data.
Benefits of Data Science and Big Data
1. This field is applicable in more than one industry, including finance,
professional services, and information technology. For example,
businesses rely on this field to unveil deeper insights that can help
them make smarter business decisions, better understand customers,
increase security, analyze company finances, and predict future market
trends.
2. Commercial companies in almost every industry use data science and
big data to gain insights into their customers, processes, staff and
products.
3. Many companies use data science to offer customers a better user
experience, as well as to cross-sell, up-sell, and personalize their
offerings.
4. A good example of this is Google AdSense, which collects data from
internet users, so relevant commercial messages can be matched to the
person browsing the internet.
Applications
Categories of Data
The main categories of data are these:
1. Structured
2. Unstructured
3. Natural language
4. Machine-generated
5. Graph-based
6. Audio, video, and images
Structured Data
1. Structured data is data that depends on a data
model and resides in a fixed field within a record.
2. It’s often easy to store structured data in
tables within databases or Excel files (figure
1.1).
3. SQL, or Structured Query Language, is the
preferred way to manage and query data
that resides in databases.
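As an illustration of querying structured data with SQL, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The table name, columns, and rows are made up for the example and do not come from the slides.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Every record has the same fixed fields, which is what makes the data structured.
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Asha", 34, "Pune"), ("Ben", 27, "Leeds"), ("Chen", 41, "Pune")],
)

# A SQL query filters and orders rows by their fields.
rows = conn.execute(
    "SELECT name, age FROM people WHERE city = ? ORDER BY age", ("Pune",)
).fetchall()
print(rows)  # [('Asha', 34), ('Chen', 41)]

conn.close()
```

Because each value sits in a known field, the query can select by city and sort by age without inspecting free text.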
Example of Structured Data
Unstructured data
1. Unstructured data is data that isn’t easy to fit
into a data model because the content is
context-specific or varying. One example of
unstructured data is your regular email
(figure 1.2).
Unstructured data
Natural language
1. Natural language is a special type of unstructured
data; it’s challenging to process because it requires
knowledge of specific data science techniques and
linguistics. It can take different forms, namely either
a spoken language or a sign language.
2. Examples of natural language processing (NLP) in use:
– Spam filters uncover certain words or phrases that
signal a spam message.
– Gmail classifies email into categories (primary, social, updates, spam).
– Amazon’s Alexa recognizes patterns in speech.
– Google predicts which popular searches may apply
to your query as you start typing.
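The spam-filter idea (looking for words or phrases that signal spam) can be sketched in a few lines. This is a deliberately naive illustration, not how production filters work: the phrase list, the threshold, and the function names are all assumptions made for the example.

```python
import re

# Hypothetical phrases that might signal spam; a real filter learns these from data.
SPAM_PHRASES = {"free money", "winner", "click here"}


def spam_score(message: str) -> int:
    """Count how many known spam phrases occur in the message."""
    text = re.sub(r"\s+", " ", message.lower())  # normalize case and whitespace
    return sum(1 for phrase in SPAM_PHRASES if phrase in text)


def is_spam(message: str, threshold: int = 2) -> bool:
    """Flag the message when enough spam phrases are present."""
    return spam_score(message) >= threshold


print(is_spam("You are a WINNER! Click here for free money"))  # True
print(is_spam("Lunch at noon?"))  # False
```

Plain substring matching ignores context, which is exactly why natural language is hard: the same word can be harmless in one message and a spam signal in another.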
Machine-generated Data
1. Machine-generated data is information
that’s automatically created by a
computer, process, application, or other
machine without human intervention.
2. Examples of machine data are web
server logs, call detail records, and network
event logs.
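Machine-generated data usually follows a rigid layout, so it can be parsed mechanically. The sketch below assumes a web server log line in the Common Log Format; the sample line, the regular expression, and the field names are illustrative assumptions, and real logs may use a different layout.

```python
import re

# Assumed layout: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line: str):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A "-" size means no body was sent; treat it as zero bytes.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
entry = parse_line(sample)
print(entry["host"], entry["path"], entry["status"], entry["size"])
```

Once each line is turned into named fields, the log behaves like structured data and can be counted, filtered, or loaded into a table.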
Example
Graph-based or Network Data
1. Graph or network data is, in short, data that focuses on the
relationship or adjacency of objects.
2. Graph structures use nodes, edges, and properties to
represent and store graph data.
3. Graph-based data is a natural way to represent social networks,
and its structure allows you to calculate specific metrics such as
the influence of a person and the shortest path between two
people.
4. Graph databases are used to store graph-based data and are
queried with specialized query languages such as SPARQL.
– Netflix uses a graph database for its digital asset
management because it is a natural way to track which movies
(assets) each viewer has already watched and which movies they
are allowed to watch (access management).
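The two metrics mentioned above, the influence of a person and the shortest path between two people, can be sketched on a tiny friendship graph. The names and edges are invented, the number of direct connections (degree) is used as a simple stand-in for influence, and breadth-first search is one standard way to find a shortest path in an unweighted graph.

```python
from collections import deque

# Hypothetical undirected friendship graph: person -> set of friends.
friends = {
    "Ana": {"Bo", "Cy"},
    "Bo": {"Ana", "Dee"},
    "Cy": {"Ana", "Dee"},
    "Dee": {"Bo", "Cy", "Eve"},
    "Eve": {"Dee"},
}


def shortest_path(graph, start, goal):
    """Breadth-first search: return one shortest path as a list, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in sorted(graph[node]):  # sorted for a repeatable result
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


def degree(graph, person):
    """Number of direct connections, a crude proxy for influence."""
    return len(graph[person])


print(shortest_path(friends, "Ana", "Eve"))  # ['Ana', 'Bo', 'Dee', 'Eve']
print(max(friends, key=lambda p: degree(friends, p)))  # Dee
```

A graph database answers the same kind of question at scale; the dictionary of sets here only shows the shape of the data.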
Examples
• Examples of graph-based data can be found on many
social media websites (figure 1.4). For instance, on
LinkedIn you can see who you know at which company.
Your follower list on Twitter is another example of graph-
based data.
• The power and sophistication come from multiple,
overlapping graphs of the same nodes. For example,
imagine the connecting edges here to show “friends” on
Facebook. Imagine another graph with the same people
which connects business colleagues via LinkedIn.
• Imagine a third graph based on movie interests on Netflix.
Overlapping the three different-looking graphs makes
more interesting questions possible.
Social Network
Audio, image, and video
• Audio, image, and video are data types that pose specific
challenges to a data scientist.
• Tasks that are trivial for humans, such as recognizing
objects in pictures, turn out to be challenging for
computers.
• MLBAM (Major League Baseball Advanced Media)
announced in 2014 that they’ll increase video capture to
approximately 7 TB per game for the purpose of live, in-
game analytics.
• High-speed cameras at stadiums will capture ball and
athlete movements to calculate in real time, for example,
the path taken by a defender relative to two baselines.
Streaming data
• Streaming data can take almost any of the
previous forms, but it has an extra property.
• The data flows into the system when an event
happens instead of being loaded into a data store in a
batch.
• Although this isn’t really a different type of data, we
treat it here as such because you need to adapt your
process to deal with this type of information.
• Examples are “What’s trending” on
Twitter (What’s Trending delivers the latest video news
for all things), live sporting or music events, and the
stock market.
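To show how a process adapts to data that arrives event by event, here is a minimal sketch that keeps a running "trending" count instead of waiting for a full batch. The hashtag events are invented and the generator stands in for a real feed, which would deliver events over time.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def event_stream() -> Iterator[str]:
    """Stand-in for a live feed: yields one hashtag per event."""
    for tag in ["#cup", "#music", "#cup", "#stocks", "#cup", "#music"]:
        yield tag


def running_top(events: Iterable[str], k: int = 1) -> Iterator[List[Tuple[str, int]]]:
    """After each event, yield the k most frequent tags seen so far."""
    counts = Counter()
    for tag in events:
        counts[tag] += 1  # update state as soon as the event arrives
        yield counts.most_common(k)


snapshots = list(running_top(event_stream()))
print(snapshots[-1])  # [('#cup', 3)]
```

The result is available after every event rather than once at the end, which is the practical difference from batch loading.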
