CS8091 Big Data Analytics Unit5
This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document
contains proprietary information and is intended only to the respective group /
learning community as intended. If you are not the addressee you should not
disseminate, distribute or copy through e-mail. Please notify the sender
immediately by e-mail if you have received this document by mistake and delete
this document from your system. If you are not the intended recipient you are
notified that disclosing, copying, distributing or taking any action in reliance on
the contents of this information is strictly prohibited.
CS8091
Big Data Analytics
Department: IT
Date: 17.04.2021
Table of Contents
1. Contents
2. Course Objectives
3. Pre Requisites (Course Names with Code)
4. Syllabus (With Subject Code, Name, LTPC details)
5. Course Outcomes
6. CO-PO/PSO Mapping
7. Lecture Plan
8. Activity Based Learning
9. Unit 5: NoSQL Data Management for Big Data and Visualization
   5.1 NoSQL Databases: Schema-less Models: Increasing Flexibility for Data Manipulation
   5.2 Key Value Stores - Document Stores - Tabular Stores - Object Data Stores - Graph Databases
   5.3 Hive
   5.4 HBase
   5.5 Sharding
   5.6 Analyzing Big Data with Twitter
10. Assignments
11. Part A (Questions & Answers)
12. Part B Questions
Course Objectives
To know the fundamental concepts of big data and analytics.
To explore tools and practices for working with big data
To learn about stream computing.
To know about the research that requires the integration of large amounts of
data.
Pre Requisites
Evolution of Big Data - Best Practices for Big Data Analytics - Big Data
Characteristics - Validating the Promotion of the Value of Big Data - Big
Data Use Cases - Characteristics of Big Data Applications - Perception and
Quantification of Value - Understanding Big Data Storage - A General
Overview of High-Performance Architecture - HDFS - MapReduce and
YARN - MapReduce Programming Model.
CO-PO/PSO Mapping
CO   PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2 PSO3
CO1   2   3   3   3   3   1   1   -   1   2    1    1    2    2    2
CO2   2   3   2   3   3   1   1   -   1   2    1    1    2    2    2
CO3   2   3   2   3   3   1   1   -   1   2    1    1    2    2    2
CO4   2   3   2   3   3   1   1   -   1   2    1    1    2    2    2
CO5   2   3   2   3   3   1   1   -   1   2    1    1    1    1    1
CO6   2   3   2   3   3   1   1   -   1   2    1    1    1    1    1
Lecture Plan
UNIT V
S.No | Topics | No. of periods | Proposed date | Actual date | Pertaining CO | Taxonomy level | Mode of delivery
1 | NoSQL Databases: Schema-less Models: Increasing Flexibility for Data Manipulation | 1 | 28.04.2021 | 28.04.2021 | CO5 | K4 | PPT
2 | Key Value Stores - Document Stores - Tabular Stores - Object Data Stores - Graph Databases | 1 | 01.05.2021 | 01.05.2021 | CO5 | K4 | PPT
3 | Hive - HBase | 1 | 04.05.2021 | 04.05.2021 | CO5 | K4 | PPT/Video
Activity Based Learning
Crossword Puzzle
https://crosswordlabs.com/view/cs8091-bda-unit-5
Flash cards
https://quizlet.com/in/597267333/big-data-analytics-unit-5-flash-cards/?x=1qqt
Lecture Notes
5.1 NoSQL Databases
A simple type of NoSQL data store is a key-value store, a schema-less model in which
values (or sets of values, or complex entity objects) are associated with distinct
character strings called keys. Programmers may see similarity with the data structure
known as a hash table. Other alternative NoSQL data stores are variations on the key-
value theme, which lends a degree of credibility to the model.
Consider the data subset represented in the below Table.
The key is the name of the automobile make, while the value is a list of names of
models associated with that automobile make. From the example, the key-value store
does not impose any constraints about data typing or data structure—the value
associated with the key is the value, and it is up to the business applications to assert
expectations about the data values and their semantics and interpretation. This
demonstrates the schema-less property of the model.
The core operations performed on a key-value store include:
❖Get(key), which returns the value associated with the provided key.
❖Put(key, value), which associates the value with the key.
❖Multi-get(key1, key2,.., keyN), which returns the list of values associated with the list
of keys.
❖Delete(key), which removes the entry for the key from the data store.
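The four operations above can be sketched with an in-memory Python dictionary standing in for the store; the class and method names are illustrative, not taken from any particular NoSQL product:

```python
# Minimal in-memory sketch of a key-value store; names are illustrative.
class KeyValueStore:
    def __init__(self):
        self._table = {}                        # backing hash table

    def get(self, key):
        return self._table.get(key)             # value for the key, or None

    def put(self, key, value):
        self._table[key] = value                # associate value with key

    def multi_get(self, *keys):
        return [self._table.get(k) for k in keys]

    def delete(self, key):
        self._table.pop(key, None)              # remove entry if present

# Using the automobile make/model example from the text:
store = KeyValueStore()
store.put("BMW", ["1-Series", "3-Series", "X5"])
store.put("Honda", ["Civic", "Accord"])
print(store.get("BMW"))                 # ['1-Series', '3-Series', 'X5']
print(store.multi_get("BMW", "Honda"))
store.delete("Honda")
print(store.get("Honda"))               # None
```

Note how the store itself imposes no structure on the values: the value for "BMW" happens to be a list, but nothing prevents another key from holding a string or a nested object.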
The critical characteristic of a key-value store is the uniqueness of the key: to find the
values, the exact key must be used. In this data management approach, if you want to
associate multiple values with a single key, you must consider how the objects are
represented and how they are associated with the key. For example, if you want to
associate a list of attributes with a single key, the value stored with the key may itself
be another key-value store object.
Key-value stores are very long, and presumably thin tables (in that there are not many
columns associated with each row). The table’s rows can be sorted by the key value to
simplify finding the key during a query. Alternatively, the keys can be hashed using a
hash function that maps the key to a particular location (sometimes called a “bucket”)
in the table.
Additional supporting data structures and algorithms (such as bit vectors and Bloom
filters) can even be used to determine whether the key exists in the data set at all.
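As a rough illustration of that idea, a Bloom filter answers "is this key possibly present?" using a bit vector and several hash functions. The sketch below is a toy version for teaching purposes, not a production implementation; the sizes chosen are arbitrary:

```python
import hashlib

# Toy Bloom filter: each key sets/checks num_hashes bits in a bit vector.
class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size

    def _positions(self, key):
        # Derive num_hashes independent positions from the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("BMW")
print(bf.might_contain("BMW"))      # True
print(bf.might_contain("Edsel"))    # almost certainly False
```

The payoff is that a lookup for an absent key can be rejected without touching the (much larger) key-value table at all.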
The representation can grow indefinitely, which makes it good for storing large
amounts of data that can be accessed relatively quickly, as well as environments
requiring incremental appends of data.
Examples include capturing system transaction logs, managing profile data about
individuals, or maintaining access counts for millions of unique web page URLs. The
simplicity of the representation allows massive amounts of indexed data values to be
appended to the same key-value table, which can then be sharded, or distributed
across the storage nodes.
Under the right conditions, the table is distributed in a way that is aligned with the way
the keys are organized, so that the hashing function used to determine where any
specific key exists in the table can also be used to determine which node holds that
key's bucket (i.e., the portion of the table holding that key).
Key-value pairs are very useful for both storing the results of analytical algorithms (such
as phrase counts among massive numbers of documents) and for producing those results
for reports.
Drawbacks
The key-value model does not inherently provide traditional database capabilities such as
atomicity of transactions, or consistency when multiple transactions are executed
simultaneously; those capabilities must be provided by the application itself. In addition, as
the store grows, maintaining unique keys can become difficult and may require the
introduction of a key-naming scheme.
Document Stores
A document store is similar to a key-value store in that stored objects are associated with
(and therefore accessed via) character string keys. The difference is that the values being
stored, referred to as "documents," provide some structure and encoding of the
managed data.
There are different common encodings, including XML (Extensible Markup Language),
JSON (JavaScript Object Notation), BSON (a binary encoding of JSON objects), and other
means of serializing data (i.e., packaging up, and potentially linearizing, the data values
associated with a data record or object).
The example below shows documents stored in association with the names of specific
retail locations. Note that while the three examples all represent locations, the
representative models differ. The document representation embeds the model so that the
meanings of the document values can be inferred by the application.
The difference between a key-value store and a document store is that while the key-
value store requires the use of a key to retrieve data, the document store provides a
means (either through a programming API or a query language) for querying the
data based on its contents. Because the approaches used for encoding the documents
embed the object metadata, one can use methods for querying by example.
For instance, using the example, one could execute a FIND (MallLocation: “Westfield
Wheaton”) that would pull out all documents associated with the Retail Stores in that
particular shopping mall.
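That kind of content-based lookup can be mimicked in a few lines of Python over JSON-like documents. The field names follow the mall example above, but the store itself and the document contents are invented for illustration:

```python
# Hypothetical document store: each value is a JSON-like dict keyed by store name.
documents = {
    "Retail Store #1": {"City": "Wheaton", "State": "MD",
                        "MallLocation": "Westfield Wheaton"},
    "Retail Store #2": {"City": "Bethesda", "State": "MD"},
    "Retail Store #3": {"MallLocation": "Westfield Wheaton", "SquareFeet": 1200},
}

def find(**criteria):
    """Query by example: return names of documents matching all given fields."""
    return [name for name, doc in documents.items()
            if all(doc.get(field) == value for field, value in criteria.items())]

# Pull out all documents associated with retail stores in that shopping mall.
print(find(MallLocation="Westfield Wheaton"))
# ['Retail Store #1', 'Retail Store #3']
```

Because the criteria match on document contents rather than on the key, Store #2, which has no MallLocation field at all, is simply skipped; this is the schema-less flexibility the text describes.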
Tabular Stores
Tabular, or table-based, stores are largely descended from Google's original Bigtable
design for managing structured data. The HBase model is an example of a Hadoop-
related NoSQL data management system that evolved from Bigtable.
The Bigtable NoSQL model allows sparse data to be stored in a three-dimensional table
that is indexed by a row key (similar to the key-value and document stores), a column
key that indicates the specific attribute for which a data value is stored, and a
timestamp that may refer to the time at which the row’s column value was stored.
Example: Various attributes of a web page can be associated with the web page’s URL:
the HTML content of the page, URLs of other web pages that link to this web page,
and the author of the content.
Columns in a Bigtable model are grouped together as “families,” and the timestamps
enable management of multiple versions of an object. The timestamp can be used to
maintain history— each time the content changes, new column affiliations can be
created with the timestamp of when the content was downloaded.
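A sketch of that three-part index (row key, column key, timestamp) using nested Python dictionaries; the web-page fields follow the example above, while the class itself is illustrative:

```python
# Toy Bigtable-style store: value = table[row_key][column_key][timestamp]
class TabularStore:
    def __init__(self):
        self.table = {}

    def put(self, row_key, column_key, timestamp, value):
        self.table.setdefault(row_key, {}) \
                  .setdefault(column_key, {})[timestamp] = value

    def get_latest(self, row_key, column_key):
        versions = self.table[row_key][column_key]
        return versions[max(versions)]      # newest timestamp wins

    def get_history(self, row_key, column_key):
        versions = self.table[row_key][column_key]
        return sorted(versions.items())     # full version history

store = TabularStore()
url = "http://example.com/page"
store.put(url, "contents:html", 1000, "<html>v1</html>")
store.put(url, "contents:html", 2000, "<html>v2</html>")
store.put(url, "anchor:author", 1000, "J. Smith")
print(store.get_latest(url, "contents:html"))   # <html>v2</html>
```

The "contents:" and "anchor:" prefixes mimic the column-family grouping, and keeping both timestamps under one column key is exactly the history-maintenance mechanism the paragraph describes.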
Object Data Stores
Object data stores and object databases seem to bridge the worlds of schema-less
data management and the traditional relational models. Approaches to object
databases can be similar to document stores, except that document stores explicitly
serialize the objects so the data values are stored as strings, while object databases
maintain the object structures as they are bound to object-oriented programming
languages such as C++, Objective-C, Java, and Smalltalk.
Object database management systems are more likely to provide traditional ACID
(atomicity, consistency, isolation, and durability) compliance—characteristics that are
bound to database reliability. Object databases are not relational databases and are not
queried using SQL.
Graph Databases
Graph databases provide a model for representing individual entities and the numerous
kinds of relationships that connect those entities. They employ the graph abstraction for
representing connectivity: a collection of vertices (also referred to as nodes or points)
that represent the modeled entities, connected by edges (also referred to as links,
connections, or relationships) that capture the way two entities are related. Graph
analytics performed on graph data stores differ from the more frequently used querying
and reporting.
5.3 HIVE
Hive Architecture
Hive User Interface: Hive enables interaction between the user and HDFS through the
Hive Web UI and the Hive command line.
Metadata store: Hive stores the database schema and its HDFS mapping in the
database server.
HDFS/HBase: HDFS or HBase are the data storage techniques to store data into the
file system.
Hive Query Processing Engine (HiveQL): HiveQL is used to query the schema
information in the metadata store. Instead of writing a MapReduce program directly, a
query can be written in HiveQL and processed as a MapReduce job.
Execution Engine: The execution engine is used to process the query and generate
results.
Hive – Working Principles
The Hive User Interface (WebUI, command line) sends a query to database driver
(JDBC/ODBC) to execute.
The driver with the help of the query compiler parses the query, checks the syntax and
requirement of the query.
The compiler then sends the meta data request to database where the metadata is
stored.
The database sends the response to the compiler. The compiler sends the response to
the driver which is passed to the execution engine.
The execution engine (the MapReduce process) sends the job to the JobTracker, which
is in the NameNode, and the JobTracker assigns the job to the TaskTracker, which is in
the DataNode.
The execution engine will receive the results from datanodes and then sends the
results to the driver.
The driver sends it to UI.
HiveQL Basics
From the command prompt, a user enters the interactive Hive environment by simply
entering hive:
$ hive
hive>
A user can define new tables, query them, or summarize their contents.
Example:
This defines a new Hive table to hold customer data, load existing HDFS data into the
Hive table, and query the table.
The first step is to create a table called customer to store customer details.
Because the table will be populated from an existing tab (‘\t’)-delimited HDFS file, this
format is specified in the table creation query.
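The slide showing the table-creation query itself is not reproduced here; a plausible HiveQL reconstruction follows. Only the table name and the tab delimiter are stated in the text, so the column names, types, and file path below are assumptions:

```sql
-- Hypothetical reconstruction; column names/types are assumptions.
CREATE TABLE customer (
    cust_id    BIGINT,
    first_name STRING,
    last_name  STRING,
    email      STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';   -- matches the tab-delimited HDFS file

-- Load the existing HDFS data into the table (path is a placeholder).
LOAD DATA INPATH '/user/data/customer.txt' OVERWRITE INTO TABLE customer;
```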
hive> select count(*) from customer;
Result: 0
The HiveQL query is executed to count the number of records in the newly created
table, customer. Because the table is currently empty, the query returns a result of
zero. The query is converted and run as a MapReduce job, which results in one map
task and one reduce task being executed.
Hive Use Cases
❖ Exploratory or ad hoc analysis of HDFS data: Data can be queried, transformed, and
exported to analytical tools, such as R.
❖ Extracts or data feeds to reporting systems, dashboards, or data repositories such
as HBase: Hive queries can be scheduled to provide such periodic feeds.
❖ Combining external structured data with data already residing in HDFS: Hadoop is
excellent for processing unstructured data, but often there is structured data residing
in an RDBMS, such as Oracle or SQL Server, that needs to be joined with the data
residing in HDFS. The data from an RDBMS can be periodically added to Hive tables for
querying with existing data in HDFS.
Reference Video
https://www.youtube.com/watch?v=cMziv1iYt28
5.4 HBASE
Hbase Architecture
1. Hmaster:
The implementation of Master Server in HBase is HMaster.
It is the process in which regions are assigned to region servers, with the help
of Apache ZooKeeper.
It handles load balancing of the regions across region servers.
It unloads the busy servers and shifts the regions to less occupied servers and
maintains the state of the cluster by negotiating the load balancing.
It is responsible for schema changes and other metadata operations such as
creation of tables and column families.
2. Region:
Regions are nothing but tables that are split up and spread across the region
servers. The default size of a region is 256 MB.
3. RegionServer:
The region server has regions that communicate with the client and handle data
related operations.
Handle read and write requests for all the regions under it.
Decide the size of the region by following the region size thresholds.
4. Zookeeper:
Zookeeper is an open-source project that provides services like maintaining
configuration information, naming, providing distributed synchronization, etc.
Zookeeper has ephemeral nodes representing different region servers. Master
servers use these nodes to discover available servers.
In addition to availability, the nodes are also used to track server failures or
network partitions.
Clients communicate with region servers via zookeeper. In pseudo and standalone
modes, HBase itself will take care of zookeeper.
Advantages of Hbase
1. Can store large data sets
2. Database can be sharded
3. Cost-effective from gigabytes to petabytes
4. High availability through failover and replication
Disadvantages of Hbase
1. No support of SQL structure
2. No transaction supports
3. Sorted only on key
4. Memory issues on the cluster
Reference Video
https://www.youtube.com/watch?v=VRD775iqAko
5.5 Sharding
Sharding (also known as Data Partitioning) is the process of splitting a large dataset into
many small partitions which are placed on different machines. Each partition is known as
a "shard". Each shard has the same database schema as the original database. Most data
is distributed such that each row appears in exactly one shard. The combined data from
all shards is the same as the data from the original database.
Sharding is the process of storing data records across multiple machines and it is
MongoDB's approach to meeting the demands of data growth. As the size of the data
increases, a single machine may not be sufficient to store the data nor provide an
acceptable read and write throughput. Sharding solves the problem with horizontal
scaling. With sharding, you add more machines to support data growth and the
demands of read and write operations. MongoDB supports horizontal scaling through
sharding.
Sharding Strategies
1. Range Based Sharding
The data is split based on value ranges inherent in each entity. For example, if you
store the contact information for online customers, you might choose to store the
information for customers whose last name starts with A-H on one shard, while storing
the rest on another shard.
The disadvantage of this scheme is that the last names of the customers may not be
evenly distributed. You might have a lot more customers whose names fall in the
range of A-H than customers whose last name falls in the range I-Z. In that case, your
first shard will be experiencing a much heavier load than the second shard and can
become a system bottleneck.
The benefit of this approach is that it is the simplest sharding scheme available. Each
shard also has the same schema as the original database. It works well for relatively
non-static data -- for example, storing the contact info for students in a college,
because the data is unlikely to see huge churn.
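The last-name range scheme above can be sketched in a few lines of Python; the shard count and the A-H boundary follow the example, while everything else is illustrative:

```python
# Toy range-based sharding on customer last name: two shards, A-H and I-Z.
shards = {"shard0": {}, "shard1": {}}

def shard_for(last_name):
    # Names starting A-H go to shard0; everything else goes to shard1.
    return "shard0" if last_name[0].upper() <= "H" else "shard1"

def save_contact(last_name, info):
    shards[shard_for(last_name)][last_name] = info

save_contact("Garcia", {"phone": "555-0101"})
save_contact("Smith", {"phone": "555-0102"})
print(shard_for("Garcia"), shard_for("Smith"))   # shard0 shard1
```

Notice that nothing here balances the shards: if most customers' names fall in A-H, shard0 absorbs most of the load, which is exactly the hotspot problem described above.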
2. Vertical Sharding
In vertical sharding, the data is partitioned by feature rather than by row: for example,
a customer's profile information, order history, and product reviews might each be
placed on a different database server.
3. Hash Based Sharding
An entity has a value (e.g., the IP address of a client application) that can be used as
input to a hash function to generate a hash value. This hash value determines which
database server (shard) to use.
Example: Imagine you have 4 database servers and each request contained an
application id which was incremented by 1 every time a new application is registered.
Perform a modulo operation on the application id with the number 4 and take the
remainder to determine which server the application data should be placed on.
The main drawback of this method is that elastic load balancing (dynamically
adding/removing database servers) becomes very difficult and expensive.
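The modulo scheme just described, sketched in Python with the 4 servers from the example:

```python
NUM_SERVERS = 4    # number of database servers in the example

def server_for(application_id):
    # Modulo operation on the application id; the remainder picks the server.
    return application_id % NUM_SERVERS

for app_id in (0, 1, 4, 5, 103):
    print(f"application {app_id} -> server {server_for(app_id)}")
```

Changing NUM_SERVERS to 5 would remap almost every existing application id to a different server, which is precisely why elastic load balancing is difficult and expensive under this scheme.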
4. Directory based Sharding
Directory based shard partitioning involves placing a lookup service in front of the
sharded databases. The lookup service knows the current partitioning scheme and
keeps a map of each entity and which database shard it is stored on. The lookup
service is usually implemented as a webservice.
The client application first queries the lookup service to figure out the shard (database
partition) on which the entity resides/should be placed. Then it queries / updates the
shard returned by the lookup service.
In the previous example, we had 4 database servers and a hash function that
performed a modulo 4 operation on the application ids. Now, if we wanted to add 6
more database servers without incurring any downtime, we'll need to do the following
steps:
1. Keep the modulo 4 hash function in the lookup service.
2. Determine the data placement based on the new hash function, modulo 10.
3. Write a script to copy all the data based on step 2 into the six new shards (and
possibly onto the 4 existing shards). Note that it does not delete any existing data on
the 4 existing shards.
4. Once the copy is complete, change the hash function to modulo 10 in the lookup
service.
5. Run a cleanup script to purge unnecessary data from the 4 existing shards based on
step 2, since the purged data now exists on other shards.
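A minimal sketch of the lookup-service idea, with an in-process dictionary standing in for the web service; the entity ids and shard names are illustrative:

```python
# Directory-based sharding: a lookup map sits in front of the shards.
class LookupService:
    def __init__(self, num_shards):
        self.num_shards = num_shards
        self.directory = {}              # entity id -> shard name

    def shard_for(self, entity_id):
        # Consult the directory first; fall back to the current hash scheme.
        if entity_id not in self.directory:
            self.directory[entity_id] = f"shard{entity_id % self.num_shards}"
        return self.directory[entity_id]

    def move(self, entity_id, new_shard):
        # Re-sharding only updates the map; client code never changes.
        self.directory[entity_id] = new_shard

lookup = LookupService(num_shards=4)
print(lookup.shard_for(42))      # shard2
lookup.move(42, "shard7")        # e.g. after adding the six new servers
print(lookup.shard_for(42))      # shard7
```

The indirection is the point: because every client asks the lookup service first, data can be migrated shard by shard without downtime, as in the five-step procedure above.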
5.6 Twitter Data Analysis
Carefully listening to voice of the customer on Twitter using sentiment analysis allows
companies to understand their audience, keep on top of what’s being said about their
brand – and their competitors – and discover new trends in the industry.
What is Sentiment Analysis?
Sentiment analysis is the automated process of identifying and classifying subjective
information in text data. This might be an opinion, a judgment, or a feeling about a
particular topic or product feature.
The most common type of sentiment analysis is ‘polarity detection’ and involves
classifying statements as positive, negative or neutral.
Performing sentiment analysis on Twitter data involves five steps:
i. Gather relevant Twitter data
ii. Clean data using pre-processing techniques
iii. Create a sentiment analysis machine learning model
iv. Analyze Twitter data using sentiment analysis model
v. Visualize the results of Twitter sentiment analysis
These probabilities follow the Bayes-rule formulation used by classifiers such as Naive
Bayes: P(label | features) = P(label) x P(features | label) / P(features). P(label) is the
prior probability of a label, that is, the likelihood that a random feature set carries the
label. P(features | label) is the probability that a given feature set occurs for that label,
and P(features) is the probability that the feature set occurs at all. Pre-processing
techniques such as lemmatization, stop word removal, and TF-IDF weighting can be
applied to improve the quality of these features before classification.
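A toy version of this probability calculation in pure Python; the training tweets are invented for illustration, and a real model would need far more data plus the pre-processing steps above:

```python
import math
from collections import Counter

# Toy Naive Bayes polarity classifier; training tweets are invented.
train = [("great product love it", "positive"),
         ("love this brand", "positive"),
         ("terrible service never again", "negative"),
         ("awful product waste", "negative")]

label_counts = Counter(label for _, label in train)
word_counts = {label: Counter() for label in label_counts}
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for _, c in word_counts.items() for w in c}

def classify(text):
    scores = {}
    for label in label_counts:
        # log P(label) + sum of log P(word | label), with add-one smoothing
        score = math.log(label_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for word in text.split():
            score += math.log((word_counts[label][word] + 1)
                              / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("love the product"))     # positive
print(classify("terrible waste"))       # negative
```

Working in log space avoids multiplying many tiny probabilities together, and the add-one smoothing keeps unseen words (like "the" here) from zeroing out a label's score.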
Big Data in E-commerce
Incorporating big data in the e-commerce industry allows businesses to gain access to
significantly large amounts of data in order to convert growth into revenue,
streamline operational processes, and gain more customers.
Big data solutions can help ecommerce industry to flourish.
Eight ways big data can foster positive change in any E-commerce business:
1. Elevated shopping experience
2. More secure online payment
3. Increased personalization
4. Increased focus on "Micro Moments"
5. Optimized pricing and increased sales
6. Dynamic customer service
7. Generate increased sales
8. Predict trends, forecast demand
E-commerce companies have an endless supply of data to fuel predictive analytics that
anticipate how customers will behave in the future. Retail websites track the number of
clicks per page, the average number of products people add to their shopping carts
before checking out, and the average length of time between a homepage visit and a
purchase. If the customers are signed up for a rewards or subscription program,
companies can analyze demographic, age, style, size, and socioeconomic information.
Predictive analytics can help companies develop new strategies to prevent shopping cart
abandonment, lessen time to purchase, and cater to budding trends. Likewise, E-
commerce companies use this data to accurately predict inventory needs with changes in
seasonality or the economy.
2. More secure online payment
To provide a peak shopping experience, customers need to know that their payments are
secure. Big data analysis can recognize atypical spending behavior and notify customers
as it happens. Companies can set up alerts for various fraudulent activities, like a series
of different purchases on the same credit card within a short time frame, or multiple
payment methods coming from the same IP address.
Many E-commerce sites now offer several payment methods on one centralized platform.
Big data analysis can determine which payment methods work best for which customers,
and can measure the effectiveness of new payment options like “bill me later”. Some e-
commerce sites have implemented an easy checkout experience to decrease the chances
of an abandoned shopping cart. The checkout page gives customers the ability to put an
item on a wish list, choose a “bill me later” option, or pay with multiple various credit
cards.
3. Increased personalization
Besides enabling customers to make secure, simple payments, big data can cultivate a
more personalized shopping experience. 86% of consumers say that personalization plays
an important role in their buying decisions. Millennials are especially interested in
purchasing online, and assume they will receive personalized suggestions.
Using big data analytics, e-commerce companies can establish a 360-degree view of the
customer. This view allows e-commerce companies to segment customers based on their
gender, location, and social media presence. With this information, companies can create
and send emails with customized discounts, use different marketing strategies for different
target audiences, and launch products that speak directly to specific groups of consumers.
Many retailers cash in on this strategy, giving members loyalty points that can be used on
future purchases. Sometimes, e-commerce companies will pick several dates throughout
the year to give loyalty members extra bonus points on all purchases. This is done during
a slow season, and increases customer engagement, interest, and spending. Not only do
loyalty members feel like VIPs, they give information companies can use to deliver
personalized shopping recommendations.
4. Increased focus on "Micro Moments"
"Micro Moments" is the latest e-commerce trend. Customers generally seek quick actions
(I want to go, I want to know, I want to buy) and they look at accessing what they
want on their smartphones. E-commerce retailers use these micro-moments to foresee
customer tendencies and action patterns. Smartphone technologies help big data
analytics to a large extent.
6. Dynamic customer service
Customer satisfaction is key to customer retention. Even companies with the most
competitive prices and products suffer without exceptional customer service.
Business.com states that acquiring new customers costs 5 to 10 times more than
selling to an existing customer, and loyal customers spend up to 67% more than new
customers. Companies focused on providing the best customer service increase their
chances of good referrals and sustain recurring revenue. Keeping customers happy
and satisfied should be a priority for every e-commerce company.
How does big data improve customer service?
1. Big data can reveal problems in product delivery, customer satisfaction
levels, and even brand perception in social media.
2. Big data analytics can identify the exact points in time when customer
perception or satisfaction changed.
3. It is easier to make sustainable change to customer service when companies
have defined areas for improvement.
Big data helps e-retailers customize their recommendations and coupons to fit
customer desires. This personalized customer experience drives higher traffic and
yields higher profit. Big data about consumers can also help e-commerce businesses
run precise marketing campaigns, give appropriate coupons, and remind people
that they still have something sitting in their cart.
8. Predict trends and forecast demand
5.7 Blogs
Definition:
A blog (shortening of “weblog”) is an online journal or informational website displaying
information in the reverse chronological order, with the latest posts appearing first. It is a
platform where a writer or even a group of writers share their views on an individual
subject.
Purpose of a blog:
The main purpose of a blog is to connect you to the relevant audience. The more
frequent and better your blog posts are, the higher the chances for your website to get
discovered and visited by your target audience.
❖ Blogs need frequent updates. Good examples include a food blog sharing meal recipes
or a company writing about its industry news. Blogs promote reader engagement:
readers get a chance to comment and voice their concerns to the writer.
❖ Static websites, on the other hand, consist of content presented on static pages.
Static website owners rarely update their pages, whereas blog owners update their site
with new blog posts on a regular basis.
5.8 Review of Basic Data Analytic Methods using R
Introduction to R
R is a programming language and software framework for statistical analysis and
graphics, available for use under the GNU General Public License.
The annual sales in U.S. dollars for 10,000 retail customers have been provided in the
form of a comma separated- value (CSV) file. The read.csv() function is used to import
the CSV file. This dataset is stored to the R variable sales using the assignment operator
<-.
In this example, the data file is imported using the read.csv() function. Once the file has
been imported, it is useful to examine the contents to ensure that the data was loaded
properly as well as to become familiar with the data. In the example, the head()
function, by default, displays the first six records of sales.
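The import-and-inspect code described above might look as follows in R; the file path is a placeholder, not taken from the original slide:

```r
# Import the CSV of annual sales; the path is a placeholder.
sales <- read.csv("c:/data/yearly_sales.csv")
# Examine the first six records to verify the data loaded properly.
head(sales)
```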
Reference Video
https://www.youtube.com/watch?v=_V8eKsto3Ug
The summary() function provides some descriptive statistics, such as the mean and
median, for each data column. Additionally, the minimum and maximum values as
well as the 1st and 3rd quartiles are provided. Because the gender column contains
two possible characters, an “F” (female) or “M” (male), the summary() function
provides the count of each character’s occurrence.
The resulting intercept and slope values are -154.1 and 166.2, respectively, for the
fitted linear equation. However, the results object stores considerably more
information that can be examined with the summary() function. Details on the
contents of results can be examined by applying the attributes() function.
The summary() function is an example of a generic function. A generic function is
a group of functions sharing the same name but behaving differently depending on
the number and the type of arguments they receive. Utilized previously, plot() is
another example of a generic function; the plot is determined by the passed
variables. the following R code uses the generic function hist() to generate a
histogram (Figure) of the residuals stored in results. The function call illustrates that
optional parameter values can be passed. In this case, the number of breaks is
specified to observe the large residuals.
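A hedged reconstruction of the regression-and-residuals code this passage describes; the column names sales_total and num_of_orders are assumptions consistent with a retail sales dataset:

```r
# Fit a simple linear model of total sales against number of orders.
results <- lm(sales$sales_total ~ sales$num_of_orders)
summary(results)        # intercept ~ -154.1, slope ~ 166.2, per the text
attributes(results)     # list everything stored in the fitted-model object
# Histogram of residuals; breaks is the optional parameter mentioned above.
hist(results$residuals, breaks = 800)
```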
R software uses a command-line interface (CLI) that is similar to the BASH shell in
Linux or the interactive versions of scripting languages such as Python. UNIX and
Linux users can enter command R at the terminal prompt to use the CLI. For
Windows installations, R comes with RGui.exe, which provides a basic graphical user
interface (GUI). However, to improve the ease of writing, executing, and debugging
R code, several additional GUIs have been written for R. Popular GUIs include the R
commander, Rattle, and R Studio.
Plots: Displays the plots generated by the R code and provides a straightforward
mechanism to export the plots
Vectors are a basic building block for data in R. As seen previously, simple R
variables are actually vectors. A vector can only consist of values in the same class.
The tests for vectors can be conducted using the is.vector() function.
R provides functionality that enables the easy creation and manipulation of vectors.
The following R code illustrates how a vector can be created using the combine
function, c() or the colon operator, :, to build a vector from the sequence of integers
from 1 to 5.
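The two constructions just described look like this:

```r
# Two equivalent ways to build the vector (1, 2, 3, 4, 5)
v <- c(1, 2, 3, 4, 5)     # combine function
v <- 1:5                  # colon (sequence) operator
is.vector(v)              # returns TRUE
```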
Data Frames
Similar to the concept of matrices, data frames provide a structure for storing and
accessing several variables of possibly different data types. As the is.data.frame()
function can confirm, a data frame is what the read.csv() function creates.
Lists
Lists can contain any type of objects, including other lists. Using the vector v and
the matrix M created in earlier examples, the following R code creates assortment, a
list of different object types.
In displaying the contents of assortment, the use of the double brackets, [[]], is of
particular importance. As the following R code illustrates, the use of the single set
of brackets only accesses an item in the list, not its content.
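A small illustration of the single- versus double-bracket distinction; the objects v and M below stand in for the vector and matrix from the earlier (unreproduced) examples:

```r
v <- 1:5                        # a vector, standing in for the earlier v
M <- matrix(1:6, nrow = 2)      # a matrix, standing in for the earlier M
assortment <- list(v, M, "a character string")
assortment[[2]]     # double brackets: the matrix itself
assortment[2]       # single brackets: a sub-list containing the matrix
```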
The summary() function provides several descriptive statistics, such as the mean
and median, about a variable such as the sales data frame.
The following code provides some common R functions that include descriptive
statistics.
The IQR() function provides the difference between the third and the first quartiles.
The other functions are fairly self-explanatory by their names.
Functions such as summary() can help analysts easily get an idea of the magnitude
and range of the data, but other aspects such as linear relationships and
distributions are more difficult to see from descriptive statistics. For example, the
following code shows a summary view of a data frame data with two columns x and
y. The output shows the range of x and y, but it’s not clear what the relationship
may be between these two variables.
A useful way to detect patterns and anomalies in the data is through the exploratory
data analysis with visualization. Visualization gives a succinct, holistic view of the
data that may be difficult to grasp from the numbers and summaries alone.
Variables x and y of the data frame data can instead be visualized in a scatterplot
(Figure 3-5), which easily depicts the relationship between two variables. As an
important facet of the initial data exploration, visualization assesses data cleanliness
and suggests potentially important relationships in the data prior to the model
planning and building phases.
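For instance, a base R scatterplot of two illustrative variables (synthetic data, not the data frame from the text) can expose a linear trend that summary() alone would hide:

```r
set.seed(1)
x <- runif(75, 0, 10)               # 75 random values between 0 and 10
y <- 3 + 0.5 * x + rnorm(75)        # roughly linear relationship plus noise
summary(x); summary(y)              # shows only the ranges; the relationship stays hidden
plot(x, y, main = "Scatterplot of y versus x")  # the linear trend becomes visible
```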
The four datasets in Anscombe’s quartet have nearly identical statistical properties,
as shown in Table 3-3.
Based on the nearly identical statistical properties across each dataset, one might
conclude that these four datasets are quite similar. However, the scatterplots in
Figure 3-7 tell a different story. Each dataset is plotted as a scatterplot, and the
fitted lines are the result of applying linear regression models. The estimated
regression line fits Dataset 1 reasonably well. Dataset 2 is definitely nonlinear.
Dataset 3 exhibits a linear trend, with one apparent outlier at x =13. For Dataset 4,
the regression line fits the dataset quite well. However, with only points at two x
values, it is not possible to determine that the linearity assumption is proper.
The R code requires the R package ggplot2, which can be installed simply by
running the command install.packages("ggplot2"). The anscombe dataset for the
plot is included in the standard R distribution. Enter data() for a list of datasets
included in the R base distribution. Enter data(DatasetName) to make a dataset
available in the current workspace. In the code that follows, variable levels is
created using the gl() function, which generates factors of four levels (1, 2, 3, and
4), each repeating 11 times. Variable mydata is created using the with(data,
expression) function, which evaluates an expression in an environment
constructed from data. In this example, the data is the anscombe dataset, which
includes eight attributes: x1, x2, x3, x4, y1, y2, y3, and y4. The expression
part in the code creates a data frame from the anscombe dataset, and it only
includes three attributes: x, y, and the group each data point belongs to (mygroup).
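Putting these pieces together, a sketch of the plotting code (the geom and facet choices are assumptions; the book's exact styling may differ):

```r
library(ggplot2)                  # install.packages("ggplot2") if needed
data(anscombe)                    # built-in dataset: x1..x4, y1..y4
levels <- gl(4, nrow(anscombe))   # factor levels 1-4, each repeated 11 times
mydata <- with(anscombe, data.frame(x = c(x1, x2, x3, x4),
                                    y = c(y1, y2, y3, y4),
                                    mygroup = levels))
ggplot(mydata, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # fitted regression line per dataset
  facet_wrap(~ mygroup)                     # one panel per dataset
```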
Dirty Data
In general, analysts should look for anomalies, verify the data with domain
knowledge, and decide the most appropriate approach to clean the data. In R, the
is.na() function provides tests for missing values. The following example creates a
vector x where the fourth value is not available (NA). The is.na() function returns
TRUE at each NA value and FALSE otherwise.
x <- c(1, 2, 3, NA, 4)
is.na(x)   # returns FALSE FALSE FALSE TRUE FALSE
The scatterplot in Figure 3-13 portrays the relationship of two variables: x and y.
The red line shown on the graph is the fitted line from the linear regression. Figure
3-13 shows that the regression line does not fit the data well. This is a case in which
linear regression cannot model the relationship between the variables. Alternative
methods such as the loess() function can be used to fit a nonlinear line to the data.
The blue curve shown on the graph represents the LOESS curve, which fits the data
better than linear regression.
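A self-contained sketch with synthetic nonlinear data (the variables plotted in Figure 3-13 are not reproduced here):

```r
set.seed(2)
x <- seq(-5, 5, length.out = 100)
y <- x^2 + rnorm(100)                 # clearly nonlinear relationship
plot(x, y)
abline(lm(y ~ x), col = "red")        # linear fit: a poor match
lo <- loess(y ~ x)
lines(x, predict(lo), col = "blue")   # LOESS curve follows the curvature
```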
Statistical Methods for Evaluation
Visualization is useful for data exploration and presentation, but statistics is crucial
because it may exist throughout the entire Data Analytics Lifecycle. Statistical
techniques are used during the initial data exploration and data preparation, model
building, evaluation of the final models, and assessment of how the new models
improve the situation when deployed in the field.
● Model Evaluation: Is the model accurate?
● Model Deployment: Does the model have the desired effect (such as reducing the cost)?
This section discusses some useful statistical tools that may answer these questions.
Hypothesis Testing
When comparing populations, such as testing or evaluating the difference of the
means from two samples of data (Figure 3-22), a common technique to assess the
difference or the significance of the difference is hypothesis testing.
The basic concept of hypothesis testing is to form an assertion and test it with data.
When performing hypothesis tests, the common assumption is that there is no
difference between two samples. This assumption is used as the default position for
building the test or conducting a scientific experiment. Statisticians refer to this as
the null hypothesis (H0). The alternative hypothesis (HA) is that there is a
difference between two samples. For example, if the task is to determine whether a
new campaign (Campaign C) reduces customer churn better than the current
campaign, the null hypothesis and alternative hypothesis would be this.
● H0: Campaign C does not reduce customer churn better than the current
campaign method.
● HA: Campaign C does reduce customer churn better than the current campaign.
Difference of Means
Specifically, the two hypothesis tests in this section consider the following null and
alternative hypotheses.
● H0: μ1 = μ2
● HA: μ1 ≠ μ2
The μ1 and μ2 denote the population means of pop1 and pop2, respectively.
The basic testing approach is to compare the observed sample means, X1 and X2,
corresponding to each population. If the values of X1 and X2 are approximately
equal to each other, the distributions of X1 and X2 overlap substantially (Figure 3-
23), and the null hypothesis is supported. A large observed difference between the
sample means indicates that the null hypothesis should be rejected. Formally, the
difference in means can be tested using Student’s t-test or Welch’s t-test.
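As a hedged sketch with simulated samples (the book's dataset is not reproduced here):

```r
set.seed(3)
x <- rnorm(10, mean = 100, sd = 5)   # sample from pop1
y <- rnorm(20, mean = 105, sd = 5)   # sample from pop2
t.test(x, y, var.equal = TRUE)       # Student's t-test (assumes equal variances)
t.test(x, y)                         # Welch's t-test (the default in R)
```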
Wilcoxon Rank-Sum Test
The Wilcoxon rank-sum test [15] is a nonparametric hypothesis test that checks
whether two populations are identically distributed.
Let the two populations again be pop1 and pop2, with independently random
samples of size n1 and n2 respectively. The total number of observations is then N
=n1 +n2. The first step of the Wilcoxon test is to rank the set of observations from
the two groups as if they came from one large group. The smallest observation
receives a rank of 1, the second smallest observation receives a rank of 2, and so on
with the largest observation being assigned the rank of N. Ties among the
observations receive a rank equal to the average of the ranks they span. The test
uses ranks instead of numerical outcomes to avoid specific assumptions about the
shape of the distribution. After ranking all the observations, the assigned ranks are
summed for at least one population’s sample.
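The ranking step, including the handling of ties, can be seen with R's rank() function on a small made-up pooled sample:

```r
obs <- c(0.8, 1.1, 1.1, 2.3, 0.5)  # pooled observations from both groups
rank(obs)                          # returns 2.0 3.5 3.5 5.0 1.0
# The two tied values span ranks 3 and 4, so each receives (3 + 4) / 2 = 3.5.
```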
If the distribution of pop1 is shifted to the right of the other distribution, the rank-
sum corresponding to pop1’s sample should be larger than the rank-sum of pop2.
The Wilcoxon rank-sum test determines the significance of the observed rank-sums.
The following R code performs the test on the same dataset used for the previous t-
test.
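Since that dataset is not reproduced here, a sketch with simulated samples:

```r
set.seed(3)
x <- rnorm(10, mean = 100, sd = 5)   # sample from pop1
y <- rnorm(20, mean = 105, sd = 5)   # sample from pop2
wilcox.test(x, y, conf.int = TRUE)   # reports the rank-based p-value
```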
The wilcox.test() function ranks the observations, determines the respective rank-
sums corresponding to each population’s sample, and then determines the
probability of observing rank-sums of such magnitude, assuming that the
population distributions are identical.
In this example, the probability is given by the p-value of 0.04903. Thus, the null
hypothesis would be rejected at a 0.05 significance level.
A hypothesis test may result in two types of errors, depending on whether the test
accepts or rejects the null hypothesis. These two errors are known as type I and
type II errors.
● A type I error is the rejection of the null hypothesis when the null hypothesis is
TRUE. The probability of the type I error is denoted by the Greek letter α.
● A type II error is the acceptance of a null hypothesis when the null hypothesis is
FALSE. The probability of the type II error is denoted by the Greek letter β.
Assignments
Q. No. 1 (CO6, K4): Write an R program to create a 5 × 4 matrix, a 3 × 3 matrix
with labels filled by rows, and a 2 × 2 matrix with labels filled by columns.
NoSQL
● Stands for "Not Only SQL"
● No declarative query language
● No predefined schema
● Key-value pair storage, column store, document store, graph databases
● Eventual consistency rather than ACID properties
● Handles unstructured and unpredictable data
● CAP theorem
● Prioritizes high performance, high availability, and scalability
3. What is Hive NOT? (CO5, K2)
Hive is not designed for online transaction processing. It is best used for traditional
data warehousing tasks. Hive is layered on top of the file system and execution
framework for Hadoop and enables applications and users to organize data in a
structured data warehouse and therefore query the data using a query language
called HiveQL that is similar to SQL (the standard Structured Query Language used
for most modern relational database management systems). The Hive system
provides tools for extracting/transforming/loading data (ETL) into a variety of
different data formats.
4. What is meant by HBase? Mention some basic operations on it. (CO5, K2)
HBase is a distributed column-oriented database built on top of the Hadoop file
system. It is an open-source project and is horizontally scalable. Apache HBase is
capable of providing real-time read and write access to datasets with billions of rows
and millions of columns. HBase is derived from Google’s BigTable; it is a column-
oriented data layout layered on top of Hadoop that provides a fault-tolerant method
for storing and manipulating large data tables.
There are some basic operations for HBase:
Get (which accesses a specific row in the table),
Put (which stores or updates a row in the table),
Scan (which iterates over a collection of rows in the table), and
Delete (which removes a row from the table).
5. Compare Hbase and HDFS. (CO5,K4)
▪ HBase provides low-latency access while HDFS provides high-latency operations.
▪ HBase supports random read and write while HDFS supports Write once Read
Many times.
▪ HBase is accessed through shell commands, Java API, REST, Avro or Thrift API
while HDFS is accessed through MapReduce jobs.
6. Mention the key terms representing the table schema in Hbase. (CO5,K2)
Table: Collection of rows present.
Row: Collection of column families.
Column Family: Collection of columns.
Column: Collection of key-value pairs.
Namespace: Logical grouping of tables.
Cell: A {row, column, version} tuple exactly specifies a cell definition in HBase.
7. Define Sharding. (CO5, K2)
Sharding (also known as Data Partitioning) is the process of splitting a large dataset
into many small partitions which are placed on different machines. Each partition is
known as a "shard".
Each shard has the same database schema as the original database. Most data is
distributed such that each row appears in exactly one shard. The combined data from
all shards is the same as the data from the original database.
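As a toy illustration of hash-based shard assignment (the hash function here is a simplistic stand-in for a real one such as MD5 or Murmur):

```r
shard_for <- function(key, n_shards) {
  # sum of the key's character codes, reduced modulo the shard count
  (sum(utf8ToInt(key)) %% n_shards) + 1
}
shard_for("user:1001", 4)  # the same key always maps to the same shard
```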
8. Write the benefit of the Vertical sharding scheme. (CO5,K2)
The main benefit of this scheme is that you can handle the critical part of your data
(for example, user profiles) differently from the not-so-critical part of your data (for
example, blog posts) and build different replication and consistency models around it.
9. What are the disadvantages of the vertical sharding scheme? (CO5, K2)
The two main disadvantages of the vertical sharding scheme are as follows:
Depending on your system, the application layer might need to combine data from
multiple shards to answer a query. For example, a profile view request will need to
combine data from the User Profile, Connections, and Articles shards. This increases
the development and operational complexity of the system.
If the Site/System experiences additional growth then it may be necessary to further
shard a feature specific database across multiple servers.
10. Define Sentiment Analysis. (CO5,K2)
Sentiment analysis is the automated process of identifying and classifying subjective
information in text data. This might be an opinion, a judgment, or a feeling about a
particular topic or product feature. The most common type of sentiment analysis is
‘polarity detection’ and involves classifying statements as positive, negative or neutral.
11. Define CAP Theorem. (CO5, K2)
CAP theorem states that there are three basic requirements which exist in a special
relation when designing applications for a distributed architecture.
Consistency - The data in the database remains consistent after the execution of an
operation.
Availability - The system is always on (service guarantee availability), no downtime.
Partition Tolerance - The system continues to function even if the communication
among the servers is unreliable.
12. What is ZooKeeper? (CO5, K2)
❖ Zookeeper is an open-source project that provides services like maintaining
configuration information, naming, providing distributed synchronization, etc.
❖ Zookeeper has ephemeral nodes representing different region servers.
❖ Master servers use these nodes to discover available servers. In addition to availability,
the nodes are also used to track server failures or network partitions.
❖ Clients communicate with region servers via zookeeper.
❖ In pseudo and standalone modes, HBase itself will take care of zookeeper.
13. How do Region Servers work? (CO5, K2)
The region servers have regions that:
❖ Communicate with the client and handle data-related operations.
❖ Handle read and write requests for all the regions under it.
❖ Decide the size of the region by following the region size thresholds.
14. What is the work of Master Server? (CO5,K2)
❖ Assigns regions to the region servers and takes the help of Apache ZooKeeper for this
task.
❖ Handles load balancing of the regions across region servers. It unloads the busy
servers and shifts the regions to less-occupied servers.
❖ Maintains the state of the cluster by negotiating the load balancing.
❖ Is responsible for schema changes and other metadata operations such as creation of
tables and column families.
15. What is the use of a Vector? (CO6, K2)
Vectors are a basic building block for data in R; simple R variables are actually vectors.
❖ A vector can only consist of values of the same class.
❖ The tests for vectors can be conducted using the is.vector() function:
is.vector(i)      # returns TRUE
is.vector(flag)   # returns TRUE
is.vector(sport)  # returns TRUE
16. Write few points about R Programming Language? (CO6,K2)
R is a programming language and software framework for statistical analysis and
graphics. Available for use under the GNU General Public License. R software uses a
command-line interface (CLI) that is similar to the BASH shell in Linux or the
interactive versions of scripting languages such as Python. UNIX and Linux users can
enter command R at the terminal prompt to use the CLI. For Windows installations, R
comes with RGui.exe, which provides a basic graphical user interface (GUI). However,
to improve the ease of writing, executing, and debugging R code, several additional
GUIs have been written for R. Popular GUIs include R Commander, Rattle, and
RStudio.
17. What is meant by data frames in R Programming? (CO6,K2)
Data frames provide a structure for storing and accessing several variables of possibly
different data types. The is.data.frame() function can be used to confirm that an
object, such as the one returned by the read.csv() function, is a data frame.
i. Mozilla
Mozilla uses HBase to store all of its crash data.
ii. Facebook
Facebook uses HBase storage to store real-time messages.
iii. Infolinks
Infolinks, an in-text ad provider, uses HBase to process advertisement selection and
user events for its In-Text ad network. Moreover, to optimize ad selection, they use the
reports that HBase generates as feedback for their production system.
iv. Twitter
A company like Twitter also runs HBase across its entire Hadoop cluster. For them, HBase
offers a distributed, read/write backup of all MySQL tables in their production backend.
That helps engineers run MapReduce jobs over the data while maintaining the ability to
apply periodic row updates.
v. Yahoo!
One of the most famous companies, Yahoo!, also uses HBase. There, HBase helps to store
document fingerprints in order to detect near-duplicates.
Content Beyond the Syllabus
APACHE PIG
Apache Pig is an abstraction over Map Reduce. It is a tool/platform which is used to
analyze larger sets of data representing them as data flows. Pig is generally used
with Hadoop; we can perform all the data manipulation operations in Hadoop using
Apache Pig.
To write data analysis programs, Pig provides a high-level language known as Pig
Latin. This language provides various operators using which programmers can
develop their own functions for reading, writing, and processing data.
To analyze data using Apache Pig, programmers need to write scripts using Pig
Latin language. All these scripts are internally converted to Map and Reduce tasks.
Apache Pig has a component known as Pig Engine that accepts the Pig Latin scripts
as input and converts those scripts into Map Reduce jobs.
Why Do We Need Apache Pig?
Programmers who are not proficient in Java often struggled while working with
Hadoop, especially while performing MapReduce tasks. Apache Pig is a boon for all
such programmers.
Using Pig Latin, programmers can perform MapReduce tasks easily without having
to type complex codes in Java.
Apache Pig uses a multi-query approach, thereby reducing the length of code. For
example, an operation that would require you to type 200 lines of code (LoC) in
Java can be done by typing as few as 10 LoC in Apache Pig. Ultimately, Apache Pig
reduces the development time by almost 16 times.
Pig Latin is an SQL-like language, and it is easy to learn Apache Pig when you are
familiar with SQL.
Apache Pig provides many built-in operators to support data operations like joins,
filters, ordering, etc. In addition, it also provides nested data types like tuples,
bags, and maps that are missing from MapReduce.
Features of Pig
Apache Pig comes with the following features
Rich set of operators − It provides many operators to perform operations like join, sort,
filter, etc.
Ease of programming − Pig Latin is similar to SQL and it is easy to write a Pig script if you
are good at SQL.
Optimization opportunities − The tasks in Apache Pig optimize their execution
automatically, so the programmers need to focus only on semantics of the language.
Extensibility − Using the existing operators, users can develop their own functions to read,
process, and write data.
UDF’s − Pig provides the facility to create User-defined Functions in other programming
languages such as Java and invoke or embed them in Pig Scripts.
Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured as well
as unstructured. It stores the results in HDFS.
Apache Pig uses a language called Pig Latin; it was originally created at Yahoo.
Hive uses a language called HiveQL; it was originally created at Facebook.
Apache Pig can handle structured, unstructured, and semi-structured data.
Hive is mostly for structured data.
Reference Video
https://www.youtube.com/watch?v=Hve24pRW_Ps
Assessment Schedule
Proposed Date
28.05.2021
Actual Date
28.05.2021
Text & Reference Books
Sl. Book Name & Author Book
No.
1 Anand Rajaraman and Jeffrey David Ullman, "Mining of Massive Text Book
Datasets", Cambridge University Press, 2012.
http://infolab.stanford.edu/~ullman/mmds/bookL.pdf
2 David Loshin, "Big Data Analytics: From Strategic Planning to Text Book
Enterprise Integration with Tools, Techniques, NoSQL, and
Graph", Morgan Kaufmann/Elsevier Publishers, 2013.
http://digilib.stmik-
banjarbaru.ac.id/data.bc/5.%20Computer%20Graphic/2013%2
0Big%20Data%20Analytics%20From%20Strategic%20Planning
%20to%20Enterprise%20Integration%20with%20Tools%2C%
20Techniques%2C%20NoSQL%2C%20and%20Graph.pdf
3 EMC Education Services, "Data Science and Big Data Analytics: Text Book
Discovering, Analyzing, Visualizing and Presenting Data", Wiley
publishers, 2015.
https://bhavanakhivsara.files.wordpress.com/2018/06/data-
science-and-big-data-analy-nieizv_book.pdf
4 Bart Baesens, "Analytics in a Big Data World: The Essential Text Book
Guide to Data Science and its Applications", Wiley Publishers,
2015.
5 Dietmar Jannach and Markus Zanker, "Recommender Systems: Text Book
An Introduction", Cambridge University Press, 2010.
https://drive.google.com/file/d/1Wr4fllOj03X72rL8CHgVJ1dGxG
58N63S/view?usp=sharing
6 Kim H. Pries and Robert Dunnigan, "Big Data Analytics: A Reference
Practical Guide for Managers " CRC Press, 2015. Book
7 Jimmy Lin and Chris Dyer, "Data-Intensive Text Processing with Reference
MapReduce", Synthesis Lectures on Human Language Book
Technologies, Vol. 3, No. 1, Pages 1-177, Morgan Claypool
publishers, 2010.
Mini Project Suggestions
Disclaimer:
This document is confidential and intended solely for the educational purpose of RMK Group of
Educational Institutions. If you have received this document through email in error, please notify the
system manager. This document contains proprietary information and is intended only to the
respective group / learning community as intended. If you are not the addressee you should not
disseminate, distribute or copy through e-mail. Please notify the sender immediately by e-mail if you
have received this document by mistake and delete this document from your system. If you are not
the intended recipient you are notified that disclosing, copying, distributing or taking any action in
reliance on the contents of this information is strictly prohibited.