Bda CHP1
⚫Volume:
⚫The amount of data matters. With big data,
you’ll have to process high volumes of low-
density, unstructured data. This can be
data of unknown value, such as Twitter
data feeds, clickstreams on a web page or
a mobile app, or sensor-enabled
equipment. For some organizations, this
might be tens of terabytes of data. For
others, it may be hundreds of petabytes.
⚫Velocity
⚫Velocity is the fast rate at which data is
received and (perhaps) acted on. Normally,
the highest velocity of data streams
directly into memory versus being written
to disk. Some internet-enabled smart
products operate in real time or near real
time and will require real-time evaluation
and action.
⚫Variety:
Variety refers to the many types of data
that are available. Traditional data types
were structured and fit neatly in a relational
database. With the rise of big data, data
comes in new unstructured data types.
Unstructured and semistructured data
types, such as text, audio, and video,
require additional preprocessing to derive
meaning and support metadata.
The history of big data
Use cases shown: Product development, Predictive maintenance, Drive innovation
⚫ Product development:
⚫ Companies like Netflix and Procter & Gamble
use big data to anticipate customer demand.
⚫ They build predictive models for new products and
services by classifying key attributes of past and
current products or services and modeling the
relationship between those attributes and the
commercial success of the offerings.
⚫ In addition, P&G uses data and analytics from
focus groups, social media, test markets, and early
store rollouts to plan, produce, and launch new
products.
⚫ Predictive maintenance
⚫ Factors that can predict mechanical failures
may be deeply buried in structured data, such
as the year, make, and model of equipment, as
well as in unstructured data that covers
millions of log entries, sensor data, error
messages, and engine temperature.
⚫ By analyzing these indications of potential
issues before the problems happen,
organizations can deploy maintenance more
cost effectively and maximize parts and
equipment uptime.
⚫ Customer experience:
⚫ Big data enables you to gather data from social
media, web visits, call logs, and other sources to
improve the interaction experience and maximize the
value delivered. Start delivering personalized offers,
reduce customer churn, and handle issues proactively.
⚫ Operational efficiency:
⚫ With big data, you can analyze and assess
production, customer feedback and returns, and
other factors to reduce outages and anticipate
future demands. Big data can also be used to
improve decision-making in line with current
market demand.
⚫Drive innovation
⚫Big data can help innovate by studying
interdependencies among humans,
institutions, entities, and processes and then
determining new ways to use those
insights. Use data insights to improve
decisions about financial and planning
considerations.
Data Categories
Structured data: Structured data has a defined schema, along with all
the required columns. It is in tabular form. Structured
data is stored in a relational database management
system.
Semi-structured data: The schema is not
strictly defined, e.g., JSON, XML, CSV, TSV,
and email. OLTP (Online Transaction Processing)
systems, by contrast, are built to work with structured data, which is
stored in relations, i.e., tables.
Unstructured data: Files with no predefined structure, such as log
files, audio files, and image files, are
unstructured data. Some organizations have a lot of data
available but do not know how to derive value
from it, since the data is raw.
Quasi-structured data: Textual
data with inconsistent formats that can be formatted with
effort, time, and some tools.
Example: Web server logs, i.e., log files created
and maintained by a server that contain a list
of activities.
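As a rough illustration of turning quasi-structured data into structured data, the sketch below parses one web server log line into named fields. It assumes the line follows the Common Log Format; the sample line, the `LOG_PATTERN` name, and the `parse_log_line` helper are made up for illustration and are not from the slides.

```python
import re

# Pattern for one line in the Common Log Format (an assumed format;
# a real server may be configured to log differently).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Return a dict of fields for a matching line, or None otherwise."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no body was sent; treat it as 0 bytes.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

# Made-up sample line, for illustration only.
sample = '192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
record = parse_log_line(sample)
print(record)
```

Once each line is a record with fixed fields, the records can be loaded into a table, which is the step that moves the data from quasi-structured to structured.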
Types of Big Data
⚫Cons of Structured Data
⚫Structured data has limited flexibility and is
suitable for certain specific use cases only.
⚫ 1) Starbucks:
⚫ With 90 million transactions a week in 25,000 stores worldwide, the coffee
giant is in many ways on the cutting edge of using big data and artificial
intelligence to help direct marketing, sales, and business decisions.
⚫ Through its popular loyalty card program and mobile application,
Starbucks owns individual purchase data from millions of customers.
Using this information and BI tools, the company predicts purchases and
sends individual offers of what customers will likely prefer via their app
and email. This system draws existing customers into its stores more
frequently and increases sales volumes.
⚫ The same intel that helps Starbucks suggest new products to try also
helps the company send personalized offers and discounts that go far
beyond a special birthday discount. Additionally, a customized email goes
out to any customer who hasn’t visited a Starbucks recently with enticing
offers—built from that individual’s purchase history—to re-engage them.
2) Netflix:
The online entertainment company’s roughly 150 million
subscribers give it a massive BI advantage.
Netflix has digitized its interactions with these
subscribers. It collects data from each of its users and with
the help of data analytics understands the behavior of
subscribers and their watching patterns. It then leverages
that information to recommend movies and TV shows
customized as per the subscriber’s choice and preferences.
As per Netflix, around 80% of the viewer’s activity is
triggered by personalized algorithmic recommendations.
Where Netflix gains an edge over its peers is that by
collecting different data points, it creates detailed profiles
of its subscribers, which helps it engage with them
better.
The recommendation system of Netflix contributes to more
than 80% of the content streamed by its subscribers, which
has helped Netflix earn a reported one billion dollars via
customer retention. For this reason, Netflix doesn’t
have to invest too much in advertising and marketing.
Concept of Hadoop
⚫ Hadoop is an open-source software framework
Features of Hadoop: Distributed Storage, Fault Tolerance, High Availability,
Scalability, Data Locality, Flexible Data Processing, YARN,
Data Replication, Data Integrity, Data Compression
⚫ Distributed Storage: Hadoop stores large data sets across
multiple machines, allowing for the storage and processing of
extremely large amounts of data.
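The idea of storing one large data set across multiple machines can be sketched as follows. This is a simplified toy model, not Hadoop's actual placement logic: the function names, node names, and round-robin placement are assumptions for illustration. The 128 MB block size and replication factor of 3 mirror common HDFS defaults.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size units is cut into."""
    full, remainder = divmod(file_size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])

def place_blocks(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin.

    Real clusters use smarter policies (for example rack awareness);
    round-robin is only a stand-in here.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + i) % len(nodes)] for i in range(replication)
        ]
    return placement

# Hypothetical 300 MB file on a four-node cluster.
blocks = split_into_blocks(300, 128)          # sizes in MB
nodes = ["node1", "node2", "node3", "node4"]  # made-up node names
placement = place_blocks(len(blocks), nodes, replication=3)
print(blocks)
print(placement)
```

Because every block lives on several machines, losing one node still leaves copies of each block elsewhere, which is where the fault tolerance in the feature list comes from.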
Pig consists of:
Pig Latin - This is the language for scripting
Pig Latin Compiler - This converts Pig Latin code
into executable code
Pig also provides Extract, Transform, and Load (ETL)
and a platform for building data flows. Did you know
that ten lines of Pig Latin script equal approximately
200 lines of MapReduce code? Pig uses simple, time-
efficient steps to analyze datasets.
Programmers write scripts in Pig Latin to analyze data
using Pig. Grunt Shell is Pig’s interactive shell, used to
execute all Pig scripts.
If the Pig script is written in a script file, the Pig Server
executes it. The parser checks the syntax of the Pig
script, after which the output will be a DAG (Directed
Acyclic Graph).
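To make the DAG idea concrete, here is a minimal sketch of a data flow represented as a directed acyclic graph and ordered for execution. It is not Pig's internal representation. The step names and the `execution_order` and `is_dag` helpers are hypothetical, and the code relies on Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical data flow: each step maps to the set of steps it depends on.
dataflow = {
    "load_users": set(),
    "load_orders": set(),
    "join": {"load_users", "load_orders"},
    "group": {"join"},
    "store": {"group"},
}

def execution_order(graph):
    """Return one valid order in which the steps can run."""
    return list(TopologicalSorter(graph).static_order())

def is_dag(graph):
    """Return True if the graph has no cycles."""
    try:
        execution_order(graph)
        return True
    except CycleError:
        return False

order = execution_order(dataflow)
print(order)             # every step appears after the steps it depends on
print(is_dag(dataflow))
```

The acyclic property is what guarantees that such an order exists: a step that depended on its own output, directly or indirectly, could never be scheduled.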