Data Analytics Course File 2021-22 Odd Semester
Geethanjali College of Engineering and Technology
(Autonomous)
Cheeryal (V), Keesara (M), Medchal District, Telangana State– 501 301
DATA ANALYTICS
(18CS4101)
COURSE FILE
IV Year B.Tech. CSE – I Semester
A.Y:2021 - 2022
DEPARTMENT OF
COMPUTER SCIENCE & ENGINEERING
(2021-2022)
Geethanjali College of Engineering and Technology
(Autonomous)
Cheeryal (V), Keesara (M), Medchal District, Telangana State– 501 301
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
Name of the Course: DATA ANALYTICS
Subject code: 18CS4101 Programme: UG
Branch: CSE Version No: 1
Year: IV Document Number: GCET/CSE/DA/01
Semester: I No. of Pages:130
Section: IV- CSE - A,B,C,D,E
Classification status (Unrestricted/Restricted ): Restricted
Distribution List: Department, Library
Contents
S.No Topic Page. No.
1 Cover Page 1
2 Syllabus copy 4
3 Vision of the Department 6
4 Mission of the Department 6
5 PEOs and POs 6
6 Course objectives and outcomes 8
7 Course mapping with POs 9
8 Brief notes on the importance of the course and how it fits into the curriculum 10
9 Prerequisites if any 13
10 Instructional Learning Outcomes 13
11 Class Time Table 14
12 Individual Time Table 19
13 Lecture schedule with methodology being used/adopted 20
14 Detailed notes 22
15 Additional topics 89
16 University Question papers of previous years 91
17 Question Bank 92
18 Assignment Questions 93
19 Unit wise Quiz Questions and long answer questions 94
20 Tutorial problems 118
21 Known gaps, if any, and inclusion of the same in lecture schedule 118
22 Discussion topics, if any 118
23 References, Journals, websites and E-links if any 119
24 Quality Measurement Sheets 120
A Course End Survey
B Teaching Evaluation
25 Student List 121
26 Group-Wise students list for discussion topic 126
2. Syllabus
18CS4101-DATA ANALYTICS
L  T  P/D  C
3  -  -/-  3
IV Year B.Tech. (CSE) – I Semester
Prerequisites:
● 18CS2102 - Object Oriented Programming using Java
● 18MA2102 - Probability and Statistics
● 18CS2203 - Database Management Systems
Course Objectives:
Develop ability to
1. Know the basic elements of Big Data and Data science to handle huge amount of data.
2. Gain knowledge of basic mathematics behind the Big data.
3. Understand the different Big data processing technologies.
4. Apply the Analytical concepts of Big data using R and Python.
5. Visualize the Big Data using different tools.
UNIT I
Introduction: Data Science and Big Data:
Introduction to Data science and Big Data, Defining Data science and Big Data, Big Data
examples, Data explosion, Data volume, Data Velocity, Big data infrastructure and challenges, Big
Data Processing Architectures, Data Warehouse.
UNIT II
Summarizing Data & Revisiting Probability:
Summary Statistics - Summarizing data with R, Probability, Expected values, Random variables,
Bivariate Random variables, Probability distribution, Central Limit Theorem, Regression Analysis, Regression
Modeling.
UNIT III
Big Data processing:
Big Data technologies, Introduction to Google file system, Hadoop Architecture, Hadoop Storage:
HDFS, Common Hadoop Shell commands, NameNode, Secondary NameNode, and DataNode,
Hadoop MapReduce paradigm, Map Reduce tasks, Job, Task trackers, Introduction to NOSQL,
Textual ETL processing.
UNIT IV
Big Data analytics:
Data analytics life cycle, Data cleaning , Data transformation, Comparing reporting and analysis,
Types of analysis, Analytical approaches, Data analytics using R, Exploring basic features of R,
Exploring R GUI, Reading data sets, Manipulating and processing data in R, Functions and
packages in R, Performing graphical analysis.
UNIT – V
Big Data Visualization:
Introduction to Data visualization, Challenges to Big data visualization, Types of data visualization,
Visualizing Big Data, Tools used in data visualization, Proprietary Data Visualization tools, Open
source data visualization tools, Data visualization with Tableau.
TEXT BOOK(S)
1. Data Warehousing in the Age of Big Data, Krish Krishnan, Elsevier, ISBN: 9780124058910,
1st Edition. (Units I, III) – 40%
2. Probability and Computing: Randomized Algorithms and Probabilistic Analysis, Mitzenmacher
and Upfal, Cambridge University Press, ISBN: 521835402, hardback. (Unit II) – 20%
3. Big Data, Black Book, DT Editorial Services, ISBN: 9789351197577, 2016 Edition.
(Units III, IV, V) – 40%
REFERENCE BOOK(S)
1. Algorithmic and Analysis Techniques in Property Testing, Dana Ron, School of EE.
2. Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches, Foundation and
trends in databases, Graham Cormode, Minos Garofalakis, Peter J. Haas and Chris
Jermaine.
3. R for Business Analytics, A.Ohri, Springer, ISBN:978-1-4614-4343-8.
4. Hadoop in practice, Alex Holmes, Dreamtech press, ISBN:9781617292224.
3. Vision of the Department
To produce globally competent and socially responsible computer science engineers
contributing to the advancement of engineering and technology which involves creativity and
innovation by providing excellent learning environment with world class facilities.
4. Mission of the Department
1. To be a centre of excellence in instruction, innovation in research and scholarship, and service
to the stake holders, the profession, and the public.
2. To prepare graduates to enter a rapidly changing field as a competent computer science
engineer.
3. To prepare graduates capable in all phases of software development, possess a firm
understanding of hardware technologies, have the strong mathematical background necessary
for scientific computing, and be sufficiently well versed in general theory to allow growth
within the discipline as it advances.
4. To prepare graduates to assume leadership roles by possessing good communication skills, the
ability to work effectively as team members, and an appreciation for their social and ethical
responsibility in a global setting.
Program Outcomes (CSE)
1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering
problems.
2. Problem analysis: Identify, formulate, review research literature, and analyze
complex engineering problems reaching substantiated conclusions using first principles
of mathematics, natural sciences, and engineering sciences.
3. Design/development of solutions : Design solutions for complex engineering
problems and design system components or processes that meet the specified needs
with appropriate consideration for the public health and safety, and the cultural,
societal, and environmental considerations.
4. Conduct investigations of complex problems: Use research-based knowledge and
research methods including design of experiments, analysis and interpretation of data, and
synthesis of the information to provide valid conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and
modern engineering and IT tools including prediction and modelling to complex
engineering activities with an understanding of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge
to assess societal, health, safety, legal and cultural issues and the consequent
responsibilities relevant to the professional engineering practice.
7. Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of,
and need for sustainable development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
9. Individual and team work: Function effectively as an individual, and as a member or
leader in diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and
write effective reports and design documentation, make effective presentations, and give
and receive clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a
member and leader in a team, to manage projects and in multidisciplinary environments.
12. Life-long learning : Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological
change.
PSO 1: To identify and define the computing requirements for its solution under given
constraints.
PSO 2: To follow best practices, namely SEI-CMM levels and Six Sigma, which vary from time to
time, for software development projects using open-ended programming environments to produce
software deliverables as per customer needs.
7. Course Mapping with POs
Course: DATA ANALYTICS
PEOs: PEO1, PEO2, PEO3
POs: PO1, PO2, PO3, PO4, PO5, PO11, PO12, PSO1, PSO2

Course Name: DATA ANALYTICS – mapping of Course Outcomes to Program Outcomes
                                                  PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2
CO1: Observe Big Data elements and Architectures.  2   1   1   1   1   -   -   -   -   -    2    2    1    2
8. Brief notes on the importance of the course and how it fits into the curriculum
a. What role does this course play within the Program?
The data analyst serves as a gatekeeper for an organization's data so stakeholders can
understand data and use it to make strategic business decisions. It is a technical role that
requires an undergraduate degree or master's degree in analytics, computer modelling, science,
or mathematics.
b. How is the course unique or different from other courses of the Program?
Data analytics is more specific and concentrated: it focuses on viewing historical data in
context, and it draws on a few different branches of broader statistics and analysis.
c. What essential knowledge or skills should they gain from this experience?
SQL. SQL, or Structured Query Language, is the ubiquitous industry-standard database
language and is possibly the most important skill for data analysts to know.
Microsoft Excel.
Critical Thinking.
R or Python – Statistical Programming.
Data Visualization.
Presentation Skills.
Machine Learning.
d. What knowledge or skills from this course will students need to have mastered to perform
well in future classes or later (Higher Education / Jobs)?
Data Visualization.
Data Cleaning.
MATLAB.
R.
Python.
SQL and NoSQL.
Machine Learning.
Linear Algebra and Calculus
g. When students complete this course, what do they need to know or be able to do?
This is the obvious benefit to learning data analytics, and the one most often focused on by students in
higher education. In practical terms, this means graduates with analytics skills will be able to command
higher salaries and enjoy their pick of the available jobs. Data analysis consolidates information to
provide the big picture of trends and patterns for higher education leadership teams that can be used to
evaluate and streamline processes, create efficiencies, and improve the overall student experience.
h. Is there specific knowledge that the students will need to know in the future?
Analytical Thinking
SQL Database
Decision Analysis
Mathematical and Statistical Skills
Software Analytics
Programming Skills
Functions and Formulas
Data Cleaning and Preparation
Quantitative Skills
Data Visualization Skills
Query Languages
Problem Solving
Domain Knowledge
i. Are there certain practical or professional skills that students will need to apply in the
future?
For beginners, the Business Analytics Nanodegree program is a great place to start
learning Excel, SQL, and Tableau. If you have some experience, you could start with the Data
Analyst Nanodegree program, where you'll use Python, R, and SQL to tackle data projects.
j. Five years from now, what do you hope students will remember from this course?
Business analytics graduates can pursue a career in many different industries and companies as
a business analyst or data scientist.
k. What is it about this course that makes it unique or special?
Data analytics is important because it helps businesses optimize their performances. A company
can also use data analytics to make better business decisions and help analyze customer trends
and satisfaction, which can lead to new—and better—products and services.
n. What unique contributions to students’ learning experience does this course make?
Technology provides students with easy-to-access information, accelerated learning, and fun
opportunities to practice what they learn. It enables students to explore new subjects and
deepen their understanding of difficult concepts
o. What is the value of taking this course? How exactly does it enrich the program?
Data analytics is important because it helps businesses optimize their performances.
Implementing it into the business model means companies can help reduce costs by
identifying more efficient ways of doing business and by storing large amounts of data.
p. What are the major career options that require this course?
Machine learning engineer.
Data architect.
Statistician.
Data analyst.
Chief technology officer (CTO).
Chief data officer (CDO).
Application architect.
Project manager.
9. Prerequisites if any
SQL.
Microsoft Excel.
Critical Thinking.
R or Python–Statistical Programming.
Data Visualization.
Presentation Skills.
Machine Learning.
10. Instructional learning outcomes
S.No | Unit | Contents                                   | Outcomes
1    | I    | Introduction: Data Science and Big Data    | Observe Big Data elements and Architectures.
2    | II   | Summarizing Data & Revisiting Probability  | Apply different mathematical models for Big Data.
3    | III  | Big Data processing                        | Demonstrate their Big Data skills by developing different applications.
4    | IV   | Big Data analytics                         | Apply each learning model for different datasets.
5    | V    | Big Data Visualization                     | Analyze needs, challenges and techniques for big data visualization.
11. Class Time Table
12. Individual Time Table

Lesson Schedule
S.No | Date | No. of periods | Topics to be covered | Regular / Additional | Teaching aids used (LCD/OHP/BB)

Unit I
1 | Day 1 | 2 | Course objectives and Course Outcomes, Introduction to Data science and Big Data | Regular | LCD
2 | Day 2 | 2 | Defining Data science and Big Data, Big Data examples | Regular | LCD
3 | Day 3 | 2 | Data explosion, Data volume, Data Velocity | Regular | LCD
4 | Day 4 | 2 | Big data infrastructure and challenges | Regular | LCD
5 | Day 5 | 2 | Big Data Processing Architectures | Regular | LCD
6 | Day 6 | 2 | Data Warehouse | Regular | LCD

Unit II
1 | Day 7 | 2 | Summary Statistics - Summarizing data with R, Probability, Expected, Random | Regular | LCD
2 | Day 8 | 2 | Bivariate Random variables, Probability distribution | Regular | LCD
3 | Day 9 | 2 | Central Limit Theorem, Regression Analysis, Regression Modeling | Regular | LCD

Unit III
1 | Day 10 | 2 | Big Data technologies, Introduction to Google file system | Regular | LCD
2 | Day 11 | 2 | Hadoop Architecture, Hadoop Storage: HDFS | Regular | LCD
3 | Day 12 | 2 | Common Hadoop Shell commands, NameNode, Secondary NameNode and DataNode | Regular | LCD
4 | Day 13 | 2 | Hadoop MapReduce paradigm, Map Reduce tasks, Job, Task trackers | Regular | LCD
5 | Day 14 | 2 | Introduction to NOSQL, Textual ETL processing | Regular | LCD

Unit IV
1 | Day 15 | 2 | Data analytics life cycle, Data cleaning, Data transformation | Regular | LCD

Unit V
1 | Day 20 | 2 | Introduction to Data visualization, Challenges to Big data visualization | Regular | LCD
2 | Day 21 | 2 | Types of data visualization, Visualizing Big Data, Tools used in data visualization | Regular | LCD
3 | Day 22 | 2 | Proprietary Data Visualization tools, Open source data visualization tools | Regular | LCD
4 | Day 23 | 2 | Data visualization with Tableau | Regular | LCD
UNIT I
Introduction: Data Science and Big Data:
Introduction to Data science and Big Data, Defining Data science and Big Data, Big Data
examples, Data explosion, Data volume, Data Velocity, Big data infrastructure and challenges, Big
Data Processing Architectures, Data Warehouse.
If we consider all the data, the associated processes, and the metrics used in any
decision-making situation within any organization, we realize that we have used information
(volumes of data) in a variety of formats and varying degrees of complexity, and derived decisions
from that data in non-traditional software processes. Before we get to Big Data, let us look at a few
important events in computing history. In the late 1980s, we were introduced to the concept of
decision support and data warehousing. This wave of being able to create trends, perform historical
analysis, and provide predictive analytics and highly scalable metrics created a series of solutions,
companies, and an industry in itself.
All these entities have contributed to the consumerization of data, from data creation,
acquisition, and consumption perspectives. The business models and opportunities that came with
the large-scale growth of data drove the need to create powerful metrics to tap from the knowledge
of the crowd that was driving them, and in return offer personalized services to address the need of
the moment.
Here are some examples:
● Weather data—there is a lot of weather data reported by governmental agencies around the
world, scientific organizations, and consumers like farmers. What we hear on television or radio is
an analytic key performance indicator (KPI) of temperature and forecasted conditions based on
several factors.
● Contract data—there are many types of contracts that an organization executes every year, and
there are multiple liabilities associated with each of them.
● Labor data—elastic labor brings a set of problems that organizations need to solve.
● Maintenance data—records from maintenance of facilities, machines, non-computer-related
systems, and more.
● Financial reporting data—corporate performance reports and annual filing to Wall Street.
● Compliance data—financial, healthcare, life sciences, hospitals, and many other agencies that file
compliance data for their corporations.
● Clinical trials data—pharmaceutical companies have wanted to minimize the life cycle of
processing for clinical trials data and manage the same with rules-based processing; this is an
opportunity for Big Data.
● Processing doctors’ notes on diagnosis and treatments—another key area of hidden insights and
value for disease state management and proactive diagnosis; a key machine learning opportunity.
● Contracts—every organization writes many types of contracts every year, and must process and
mine the content in the contracts along with metrics to measure the risks and penalties.
1. Fraud detection
For businesses whose operations involve any type of claims or transaction processing, fraud
detection is one of the most compelling Big Data application examples. Historically, fraud detection
on the fly has proven an elusive goal. In most cases, fraud is discovered long after the fact, at which
point the damage has been done and all that's left is to minimize the harm and adjust policies to
prevent it from happening again. Big Data platforms that can analyze claims and transactions in real
time, identifying large-scale patterns across many transactions or detecting anomalous behavior
from an individual user, can change the fraud detection game.
2. IT log analytics
IT solutions and IT departments generate an enormous quantity of logs and trace data. In the
absence of a Big Data solution, much of this data must go unexamined: organizations simply don't
have the manpower or resources to churn through all that information by hand, let alone in real time.
With a Big Data solution in place, however, those logs and trace data can be put to good use.
Within this list of Big Data application examples, IT log analytics is the most broadly applicable.
Any organization with a large IT department will benefit from the ability to quickly identify large-
scale patterns to help in diagnosing and preventing problems. Similarly, any organization with a
large IT department will appreciate the ability to identify incremental performance optimization
opportunities.
3. Call center analytics
Now we turn to the customer-facing Big Data application examples, of which call center
analytics are particularly powerful. What's going on in a customer's call center is often a great
barometer and influencer of market sentiment, but without a Big Data solution, much of the insight
that a call center can provide will be overlooked or discovered too late. Big Data solutions can help
identify recurring problems or customer and staff behavior patterns on the fly not only by making
sense of time/quality resolution metrics, but also by capturing and processing call content itself.
4. Social media analysis
A Big Data solution built to harvest and analyze social media activity, like IBM's Cognos
Consumer Insights, a point solution running on IBM's BigInsights Big Data platform, can make
sense of the chatter. Social media can provide real-time insights into how the market is responding
to products and campaigns. With those insights, companies can adjust their pricing, promotion, and
campaign placement on the fly for optimal results.
DATA EXPLOSION
What has led to this explosive growth of data? One answer is innovation. Innovation has
transformed the way we engage in business, provide services, and the associated measurement of
value and profitability. Three fundamental trends that shaped up the data world in the last few years
are business model transformation, globalization, and personalization of services.
DATA VOLUME
Data volume is characterized by the amount of data that is generated
continuously. Different data types come in different sizes. For example, a blog text is a few
kilobytes; voice calls or video files are a few megabytes; sensor data, machine logs, and clickstream
data can be in gigabytes. Machine-generated data is often characterized by a steady pattern of
numbers and text, which occurs in a rapid-fire fashion.
There are several examples of machine-generated data; for instance, a robotic arm that is on
an assembly line at an automotive plant will be sending signals for every movement it makes, and
that pattern will be steady, as opposed to a dredging machine that is doing roadwork and sending
signals back on how much movement it had during the day, payload moved, and system and
machine status. Sensors on top of buildings that regulate heating and cooling send different signals
all through the day; though the structure may seem the same, the values differ depending on many
factors. Sensors on automobiles send different signals depending on road type, driving speed,
weight, and more to support centers.
DATA VELOCITY
Velocity can be defined as the speed and direction of motion of an object. Constant velocity
of an object is the motion of an object at constant speed and direction. With the advent of Big Data,
understanding the velocity of data is extremely important. The basic reason for this arises from the
fact that in the early days of data processing, we used to analyze data in batches, acquired over
time. Typically, data is broken into fixed-size chunks and processed through different layers from
source to targets, and the end result is stored in a data warehouse for further use in reporting and
analysis.
For example, a flight from London to New York generates 650TB of data from the airplane
engine sensors. There is a lot of value in reading this information during the stream processing and
postgathering for statistical modeling purposes.
Mobile networks
The most popular way to share
pictures, music, and data today is via mobile devices. The sheer volume of data that is transmitted
by mobile networks provides insights to the providers on the performance of their network, the
amount of data processed at each tower, the time of day, the associated geographies, user
demographics, location, latencies, and much more.
Among the different options for processing Big Data, the list of features for handling data velocity
included the following:
● System must be elastic for handling data velocity along with volume.
● System must scale up and scale down as needed without increasing costs.
● System must be able to process data across the infrastructure in the least processing time.
● System throughput should remain stable independent of data velocity.
● System should be able to process data on a distributed platform.
After selecting the most appropriate architecture for data processing, the data velocity problem was
handled by the IT team. The key takeaways from the exercise were:
● Data architecture needs to be based on workload.
● Metadata requirements need to be clearly documented.
● The data platform supports a wide range of real-time analytics.
● Result sets can be integrated easily with high-volume analytic data stores.
● Read and write latencies below 50 milliseconds.
● Scale out on commodity hardware.
● Database must automatically implement the defined partitioning strategy.
Data processing explained
The following data processing requirements exist within multiple layers of the architecture stack:
● Data model-less architecture. Due to the size and complexity of data, there is no fixed data model
when you process Big Data. In other words, the data processing is on a schemaless or nonstructured
data set.
● Near-real-time data collection. Big Data can be either collected as a batch or in real time. To
process the landing of data very effectively, a highly scalable and performing storage platform is
needed.
● Microbatch processing. Big Data can be processed as real time, batch, or microbatch data sets.
The underlying platform must be capable of processing data in any of these formats, concurrently
and in a scalable fashion.
● Minimal data transformation. Big Data processing with programming languages provides for
some transformation. Most transformations are executed as a multistep derivation and complexity is
kept to a minimum within each step. This feature is necessary to improve speed and design fault
tolerance.
● Efficient data reads. The data processing design should be read-oriented, as there is no
schema-based processing of data.
● Multipartition capability. The data processing architecture should be able to support vertical and
horizontal partitions of the data. Due to the size and the volume of data, both the partitioning
techniques will be implemented in the processing workflow.
● Store result in file system or DBMS (not relational). Big Data processing is file-based and
therefore the result sets need to be stored in a file system architecture. Alternately, a DBMS
platform can be integrated into the processing cycle, but not an RDBMS.
● Share data across multiple processing points. Replication and sharing of data is a key processing
requirement. The reasons for this include the design of fault tolerance, multistep processing, and
multipartitioning. In a nutshell, Big Data processing can be summarized as a file system–based
scalable and distributed data processing architecture, designed and deployed on a high-performance
and scalable infrastructure.
Disk I/O bottlenecks are one common source of delays in data processing. Fortunately, there are
some tricks that you can use to minimize their impact.
One solution is to upgrade your data infrastructure to solid-state disks (SSDs), which typically run
faster. Alternatively, you could use in-memory data processing, which is much faster than relying
on conventional storage.
SSDs and in-memory storage are more costly, of course, especially when you use them at scale. But
that does not mean you can’t take advantage of them strategically in a cost-effective way: Consider
deploying SSDs or in-memory data processing for workloads that require the highest speed, but
sticking with conventional storage where the benefits of faster I/O won’t outweigh the costs.
Lack of scalability
If your data infrastructure can’t increase in size as your data needs grow, it will undercut your
ability to turn data into value.
At the same time, of course, you don’t want to maintain substantially more big data infrastructure
than you need today just so that it’s there for the future. Otherwise, you will be paying for
infrastructure you’re not currently using, which is not a good use of money.
One way to help address this challenge is to deploy big data workloads in the cloud, where you can
increase the size of your infrastructure virtually instantaneously when you need it, without paying
for it when you don’t. If you prefer not to shift all of your big data workloads to the cloud, you
might also consider keeping most workloads on-premise, but having a cloud infrastructure set up
and ready to handle “spillover” workloads when they arise—at least until you can create a new on-
premise infrastructure to handle them permanently.
If your data is large in size, transferring it across the network can take time—especially if network
transfers require using the public internet, where bandwidth tends to be much more limited than it is
on internal company networks.
Paying for more bandwidth is one way to mitigate this problem, but that will only get you so far
(and it will cost you). A better approach is to architect your big data infrastructure in a way that
minimizes the amount of data transfer that needs to occur over the network. You could do this by,
for example, using cloud-based analytics tools to analyze data that is collected in the cloud, rather
than downloading that data to an on-premise location first. (The same logic applies in reverse: If
your data is born or collected on-premise, analyze it there.)
Getting data from the format in which it is born into the format that you need to analyze it or share
it with others can be very tricky. Most applications structure data in ways that work best for them,
with little consideration of how well those structures work for other applications or contexts.
This is why data transformation is so important. Data transformation allows you to convert data
from the format in which it is born into the format you need for analysis or sharing.
When done incorrectly—which means manually and in ways that do not control for data quality—
data transformation can quickly cause more trouble than it is worth. But when you automate data
transformation and ensure the quality of the resulting data, you maximize your data infrastructure’s
ability to meet your big data needs, no matter how your infrastructure is constructed.
● Lack of proper understanding of Big Data. Companies fail in their Big Data initiatives due to
insufficient understanding.
● Data growth issues.
● Confusion while Big Data tool selection.
● Lack of data professionals.
● Securing data.
● Integrating data from a variety of sources.
Data processing has been a complex subject to deal with since the primitive days of
computing. The underlying reason for this stems from the fact that complexity is induced from the
instrumentation of data rather than the movement of data.
In a peer-to-peer architecture there are no dedicated servers and clients; instead, all the processing
responsibilities are allocated among all machines, known as peers. Each machine can perform the
role of a client or server, or just process data.
Distributed processing has a lot of advantages and disadvantages.
● Advantages:
– Scalability of systems and resources can be achieved based on isolated needs.
– Processing and management of information can be architected based on the desired unit of operation.
– Parallel processing of data, reducing time latencies.
● Disadvantages:
– Data redundancy
– Process redundancy
– Resource overhead
– Volumes
The most popular distributed processing implementations in the data world are:
● Peer to peer
● Hub and spoke
● Federated
Processing data in either the centralized processing or distributed processing style has a lot of
infrastructure challenges that continue to dominate the space.
Shared-everything architecture.
The data processing life cycle for Big Data differs from transactional data (Figure 3.5). In a
traditional environment you first analyze the data and create a set of requirements, which leads to
data discovery and data model creation, and then a database structure is created to process the data.
The resulting architecture is very efficient from the perspective of write performance, as data’s
finite shape, structure, and state are loaded in the end state. Big Data widely differs in its processing
cycle. The data is first collected and loaded to a target platform, then a metadata layer is applied to
the data, and a data structure for the content is created. Once the data structure is applied, the data is
then transformed and analyzed. The end result from the process is what provides insights into the
data and any associated context (based on the business rules processed). To process the data in this
flexible manner, a database-driven architecture will be unsuitable and will underperform. To
process this volume and complexity, a file-driven architecture with a programming language
interface is best suited.
The key requirements for the infrastructure and processing architecture are as follows:
● Data model-less architecture
● Near-real-time data collection
● Microbatch processing
● Minimal data transformation
● Efficient data reads
● Multipartition capability
● Store result in file system or DBMS (not relational)
● Share data across multiple processing points
● Infrastructure requirements
● Linear scalability
● High throughput
Shared-nothing architecture
UNIT II
Summarizing Data & Revisiting Probability:
Summary Statistics - Summarizing data with R, Probability, Expected values, Random variables,
Bivariate Random variables, Probability distribution, Central Limit Theorem, Regression Analysis, Regression
Modeling.
Summary Statistics
Summary statistics summarize and provide information about your sample data. It tells you
something about the values in your data set. This includes where the mean lies and whether your
data is skewed. Summary statistics fall into three main categories:
● Measures of location (also called central tendency).
● Measures of spread.
● Graphs/charts.
Summary Statistics: Measures of location
Measures of location tell you where your data is centered, or where a trend lies. Common measures
of location include:
● Mean (also called the arithmetic mean or average).
● Geometric mean (used for interest rates and other types of growth).
● Trimmed Mean (the mean with outliers excluded).
● Median (the middle of a data set).
Summary Statistics: Measures of Spread
Measures of spread tell you (perhaps not surprisingly!) how spread out or varied your data set is.
This can be important information. For example, test scores that are in the 60-90 range might be
expected, while scores in the 20-70 range might indicate a problem. Range isn't the only measure of
spread, though; others include the interquartile range, variance and standard deviation.
Summary Statistics: Graphs and Charts
Common graphical summaries include:
● Histogram.
● Frequency Distribution Table.
● Box plot.
● Bar chart.
● Scatter plot.
● Pie chart.
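As a quick illustration, here is a minimal R sketch (not drawn from the prescribed texts) that computes the common measures of location and spread, and draws two of the graphical summaries, for one column of the built-in iris data set:
x <- iris$Sepal.Length          # one numeric column of the built-in iris data
mean(x)                         # arithmetic mean
exp(mean(log(x)))               # geometric mean (sensible here only because all values are positive)
mean(x, trim = 0.1)             # trimmed mean: 10% of values dropped from each end
median(x)                       # middle of the data set
range(x)                        # minimum and maximum
var(x); sd(x)                   # variance and standard deviation (measures of spread)
quantile(x)                     # quartiles / five-number summary
hist(x, main = "Sepal length", xlab = "cm")    # histogram
boxplot(x, main = "Sepal length")              # box plot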
The following R functions are commonly used to summarize data:
● sapply
● tapply
● by
● sqldf
● ddply
Introduction
● People often get confused when it comes to summarizing data quickly in R. There are various
options, but which one is the best? This question is answered below. You should choose one
option first and become expert at it; that is how you should move on to the next.
● People who transition from SAS or SQL are used to writing simple queries in those languages
to summarize data sets. For such an audience, the biggest concern is how to do the same thing
in R.
● Generally, summarizing data means finding statistical figures such as the mean, median, box
plot, etc. If you also want to work with scatter plots and histograms, you can refer to a guide on
data visualization in R.
3. sapply
“sapply” does the same thing as apply but returns a vector or matrix. Let’s consider the last
example again.
l <- list(a = 1:10, b = 11:20)
l.mean <- sapply(l, mean)
class(l.mean)
[1] "numeric"
4. tapply
Till now, none of the functions we discussed can do what SQL can achieve. Here is a function which
completes the palette for R. The usage is “tapply(X, INDEX, FUN = NULL, …, simplify = TRUE)”,
where X is “an atomic object, typically a vector” and INDEX is “a list of factors, each of the same
length as X”. Here is an example which will make the usage clear.
attach(iris)
# mean petal length by species
tapply(iris$Petal.Length, Species, mean)
setosa versicolor virginica
1.462 4.260 5.552
5. by
Now comes a slightly more complicated algorithm. Function ‘by’ is an object-oriented wrapper for
‘tapply’ applied to data frames. Hopefully the example will make it more clear.
attach(iris)
by(iris[, 1:4], Species, colMeans)
Species: setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.006 3.428 1.462 0.246
------------------------------------------------------------
Species: versicolor
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.936 2.770 4.260 1.326
------------------------------------------------------------
Species: virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width
6.588 2.974 5.552 2.026
What did the function do? It simply splits the data by a class variable, which in this case is the
species, and then it creates a summary at this level. So it applies a function on the split data frames.
The returned object is of class “by”.
6. sqldf
If you found any of the above statements difficult, don’t panic. I bring you a life line which you can
use anytime. Let’s fit in the SQL queries in R. Here is a way you can do the same.
attach(iris)
library(sqldf)
iris2 <- iris
names(iris2) <- gsub(".", "_", names(iris2), fixed = TRUE)   # SQL-friendly column names
summarization <- sqldf("select Species, avg(Petal_Length) as Petal_Length_mean
                        from iris2 where Species is not null group by Species")
And it’s done. Wasn’t it simple enough? One setback of this approach is the amount of time it takes
to execute. In case you are interested in getting speed and same results read the next section.
7. ddply
This is the fastest of all the approaches we discussed. You will need an additional package. Let's do
exactly what we did in the tapply section.
library(plyr)
attach(iris)
# mean petal length by species
ddply(iris, "Species", summarise, Petal.Length_mean = mean(Petal.Length))
Independent events
If two events, A and B, are independent, then the joint probability is P(A and B) = P(A) × P(B).
For example, if two fair coins are flipped, then the chance of both being heads is 1/2 × 1/2 = 1/4.
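A small simulation can confirm the 1/4 figure; this is a minimal R sketch, where the number of repetitions is an arbitrary choice made only for the illustration:
set.seed(1)                                   # reproducible simulation
n <- 100000                                   # number of simulated double coin flips (arbitrary)
coin1 <- sample(c("H", "T"), n, replace = TRUE)
coin2 <- sample(c("H", "T"), n, replace = TRUE)
# Independence: P(both heads) = P(H) * P(H) = 0.5 * 0.5 = 0.25
mean(coin1 == "H" & coin2 == "H")             # empirical estimate, close to 0.25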
Inverse probability
In statistical notations, a random variable is generally represented by a capital letter, and its
realizations/observed values are represented by small letters.
Consider the experiment of tossing two coins. We can define X to be a random variable that
measures the number of heads observed in the experiment. For the experiment, the sample space is
shown below:
S = {HH, HT, TH, TT}
There are 4 possible outcomes for the experiment, and this is the domain of X. The random variable
X takes these 4 outcomes/events and processes them to give different real values. For each
outcome, the associated value is shown as:
X(HH) = 2, X(HT) = 1, X(TH) = 1, X(TT) = 0
Thus, we can represent X as a function that maps the sample space onto the set {0, 1, 2}.
There are three types of random variables- discrete random variables, continuous random variables,
and mixed random variables.
1) Discrete Random Variables: Discrete random variables are random variables, whose range is a
countable set. A countable set can be either a finite set or a countably infinite set. For instance, in
the above example, X is a discrete variable as its range is a finite set ({0, 1, 2}).
2) Continuous Random Variables: Continuous random variables are random variables whose range
is an uncountable set, for example an interval of real numbers such as the exact height of a person.
3) Mixed Random Variables: Lastly, mixed random variables are ones that are a mixture of both
continuous and discrete variables. These variables are more complicated than the other two. Hence,
they are explained at the end of this article.
When we describe the values in the range of a random variable in terms of the probability of
their occurrence, we are essentially talking about the probability distribution of the random
variable. In other words, the probability distribution of a random variable can be determined by
calculating the probability of occurrence of every value in the range of the random variable. A
probability distribution is described for discrete and continuous random variables in subtly different
ways.
For discrete variables, the term ‘Probability mass function (PMF)’ is used to describe their
distributions. Using the example of coin tosses, as discussed above, we calculate the probability of
X taking the values 0, 1 and 2 as follows:
P(X = 0) = P({TT}) = 1/4, P(X = 1) = P({HT, TH}) = 1/2, P(X = 2) = P({HH}) = 1/4.
We use the notation PX(x) to refer to the PMF of the random variable X. This leads us to the general
description of the distribution in tabular format:
x        0     1     2
PX(x)   1/4   1/2   1/4
Properties of probability mass function:
1) Every PMF value lies between 0 and 1, i.e., 0 ≤ PX(x) ≤ 1 for each x in the range.
2) PMF must sum to one over the entire range set of a random variable.
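Both properties can be checked directly; this is a minimal R sketch that enumerates the two-coin sample space used above:
# Enumerate the sample space of two fair coin tosses
outcomes <- expand.grid(coin1 = c("H", "T"), coin2 = c("H", "T"),
                        stringsAsFactors = FALSE)
X <- rowSums(outcomes == "H")        # random variable X = number of heads in each outcome
pmf <- table(X) / nrow(outcomes)     # each of the 4 outcomes has probability 1/4
pmf                                  # P(X=0)=0.25, P(X=1)=0.50, P(X=2)=0.25, all between 0 and 1
sum(pmf)                             # property 2: the PMF sums to one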
For continuous variables, the term ‘Probability density function (PDF)’ is used to describe
their distributions. We’ll consider the example of the distribution of heights. Suppose, we survey a
group of 1000 people and measure the height of each person very precisely. The distribution of the
heights can be shown by a density histogram as follows:
We have grouped the different heights in certain intervals. But let’s see what happens when we try
to reduce the size of the histogram bins. In other words, we make the grouping intervals smaller and
smaller.
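The effect of shrinking the bins can be reproduced with simulated heights; in the minimal R sketch below, the normal distribution, its parameters and the sample size are assumptions made purely for the illustration:
set.seed(42)
heights <- rnorm(1000, mean = 170, sd = 8)     # 1000 simulated heights in cm (assumed distribution)
par(mfrow = c(1, 3))                           # three density histograms side by side
hist(heights, breaks = 10,  freq = FALSE, main = "10 bins",  xlab = "height (cm)")
hist(heights, breaks = 30,  freq = FALSE, main = "30 bins",  xlab = "height (cm)")
hist(heights, breaks = 100, freq = FALSE, main = "100 bins", xlab = "height (cm)")
# As the bins shrink, the density histogram approaches the smooth probability density function
curve(dnorm(x, mean = 170, sd = 8), add = TRUE, col = "red")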
Central Limit Theorem
“In probability theory, the central limit theorem (CLT) states that, given certain conditions,
the arithmetic mean of a sufficiently large number of iterates of independent random variables, each
with a well-defined expected value and well-defined variance, will be approximately normally
distributed, regardless of the underlying distribution. That is, suppose that a sample is obtained
containing a large number of observations, each observation being randomly generated in a way
that does not depend on the values of the other observations, and that the arithmetic average of the
observed values is computed.
If this procedure is performed many times, the central limit theorem says that the computed
values of the average will be distributed according to the normal distribution (commonly known as
a ‘bell curve’).”
In the domain of statistics, the most commonly used statistical technique is Regression Analysis,
which is used to estimate particular relationships among variables. Under this
technique, the main focus is upon the relationship between dependent variable and any one or more
independent variables.
There are several techniques within this analysis that are used for modeling and analyzing
several variables. This technique helps you see how the particular value of a dependent variable
changes when any one of the independent variable varies with all others fixed. In simple terms,
through this approach you get to estimate the conditional expectation or the average value of the
dependent variable. In all cases, the target for estimation is a function of one or more independent
variables, which is termed the regression function. The main goal of regression
analysis is to ascertain the values of all the parameters to derive a function that will fit the data
observations in the best way possible.
There are a variety of techniques within data analytics that are employed to carry out
regression analysis. Some of the well-known ones are Linear Regression, Logistic Regression and
Ordinary Least Squares. Linear regression and least squares techniques are parametric; in both of
these methodologies the regression function is defined in terms of a finite number of unknown
parameters.
How regression analysis methods perform depends on the type of the data-generating process.
Given below is a list of related techniques and measures that various regression models incorporate:
● Cluster Analysis
● Experimental Simulation
● MTMM
● Conjoint Analysis
● Multi-normal distribution
● Correspondence analysis
● Factor analysis
● SEM (Structural Equation Modeling)
● Linear probability models
● R-squared or R2: Coefficient of determination
● LISREL
● Multiple discriminant analysis
● PCA (Principal Components Analysis)
Regression Analysis
Regression analysis is a statistical technique for analysing and comprehending the
connection between two or more variables of interest. The methodology used to do regression
analysis aids in understanding which elements are significant, which may be ignored, and how they
interact with one another.
● Regression is a statistical approach used in finance, investment, and other fields to identify
the strength and type of a connection between one dependent variable (typically represented
by Y) and a sequence of other variables (known as independent variables).
● Regression is essentially the "best guess" at utilising a collection of data to generate some
form of forecast. It is the process of fitting a set of points to a graph.
● Regression analysis is a mathematical method for determining which of those factors has an
effect. It provides answers to questions such as: Which factors matter most? Which can be
safely ignored? How do those factors interact with one another?
Types of Regression Analysis
1. Forecasting:
The most common use of regression analysis in business is for forecasting future
opportunities and threats. Demand analysis, for example, forecasts the amount of things a customer
is likely to buy.
When it comes to business, though, demand is not the only dependent variable. Regression
analysis can anticipate significantly more than just direct income.
2. CAPM:
The Capital Asset Pricing Model (CAPM), which establishes the link between an asset's
projected return and the related market risk premium, relies on the linear regression model.
3. Comparing with competition:
It may be used to compare a company's financial performance to that of a certain
counterpart. It may also be used to determine the relationship between two firms' stock prices (this
can be extended to find correlation between 2 competing companies, 2 companies operating in an
unrelated industry etc).
4. Identifying problems:
Regression is useful not just for providing factual evidence for management choices
but also for detecting judgement mistakes.
A retail store manager, for example, may assume that extending shopping hours will significantly
boost sales.
5. Reliable source
Many businesses and their top executives are now adopting regression analysis (and
other types of statistical analysis) to make better business decisions and reduce guesswork and gut
instinct.
Regression enables firms to take a scientific approach to management. Both small and large
enterprises are frequently bombarded with an excessive amount of data.
Regression Modelling
Regression is a method to mathematically formulate relationship between variables that in
due course can be used to estimate, interpolate and extrapolate. Suppose we want to estimate the
weight of individuals, which is influenced by height, diet, workout, etc. Here, Weight is
the predicted variable. Height, Diet, Workout are predictor variables.
Collect a sufficient number of data points. Use a suitable estimation technique to arrive at the
mathematical formula between the predicted and predictor variables. No model is perfect; hence,
give error bounds.
Finally, assess the model's stability by applying it to different samples of the same population.
When predictor variables are given for a new data point, estimate the predicted variable. If stable,
the model's accuracy should not decrease. This process is called model cross-validation.
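A minimal R sketch of this workflow, using the built-in women data set (height and weight of 15 women) in place of the full height/diet/workout example above; the train/test split is an arbitrary choice made only for the illustration:
model <- lm(weight ~ height, data = women)   # predicted variable: weight; predictor: height
summary(model)                               # parameters, error bounds and R-squared
# Estimate the predicted variable for a new data point (height = 66 inches)
predict(model, newdata = data.frame(height = 66), interval = "prediction")
# A very simple cross-validation: refit on a random subset and check accuracy on the rest
set.seed(1)
train <- sample(nrow(women), 10)
model_cv <- lm(weight ~ height, data = women[train, ])
mean((predict(model_cv, women[-train, ]) - women$weight[-train])^2)   # out-of-sample squared error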
I've heard of Least Squares. What's this and how is it related to regression?
The least squares regression line. Source: Sultana 2014, slide 6.
Least Squares is a term that signifies that the sum of the squares of the errors is at a minimum. The
error is defined as the difference between the observed value and the predicted value. The objective
of regression estimation is to produce the least squared error. When a model fits the training data so
closely that the error approaches zero yet fails to generalize, we term it overfitting.
The Least Squares Method provides linear equations with unknowns that can be solved for any
given data. The unknowns are the regression parameters. The linear equations are called the Normal
Equations. The normal equations are derived using calculus to minimize the squared errors.
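The normal equations can be written and solved directly; this is a minimal R sketch that reproduces the lm() coefficients for the women data used in the previous sketch (chosen only as convenient example data):
X <- cbind(1, women$height)                 # design matrix with an intercept column
y <- women$weight
# Normal equations: (X'X) b = X'y, so b = (X'X)^(-1) X'y
b <- solve(t(X) %*% X, t(X) %*% y)
b                                           # intercept and slope that minimise the squared errors
coef(lm(weight ~ height, data = women))     # lm() solves the same equations internally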
● Could you explain the difference between interpolation and extrapolation w.r.t. regression?
We can interpolate missing data points using regression. For instance, if we want to estimate
height given weight and the data collection process missed out certain weights, we can use
regression to interpolate. This missing data can be estimated by other means too; missing data
estimation is called imputation.
The height and weight data is bound by nature and can be sourced. Say, we want to estimate future
weight of an individual given historical weight variations of the individual. This is extrapolation. In
regression, we call it forecasting. This is solved using a distinct set of techniques called as Time
Series Regression.
48
What's the right interpretation of correlation coefficient?
Different samples with same correlation coefficient although their regression lines may differ.
Source: Stanton 2001, fig. 2.
Correlation coefficient r is a measure of linear association strength. It doesn't quantify non-linearity.
The square of the correlation coefficient, r², gives the proportion of variation explained: a correlation
coefficient of 0.8 means r² = 0.64, i.e., 64% of the variation in one variable is explained by variation
in the other variable. For example, 64% of the variation in rainfall is explained by the number of
trees; the rest is due to factors other than the number of trees.
It will be apparent from the formula that r factors in the sample variance. On an X-Y scatterplot, the
regression line may have different slopes due to different sample variance even when all of the
samples share the same correlation coefficient. In other words, r is not simply the slope of the
regression line.
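A minimal R sketch showing that r is related to, but not the same as, the regression slope (the women data is again used purely as example data):
x <- women$height; y <- women$weight
r <- cor(x, y)                        # correlation coefficient
slope <- coef(lm(y ~ x))[2]           # slope of the least squares regression line
c(r = r, slope = unname(slope))       # two different numbers
r * sd(y) / sd(x)                     # slope = r * sd(y)/sd(x): the spread of the sample also matters
r^2                                   # proportion of variation explained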
Could you give examples of non-linear correlation?
Illustrating linear, non-linear and no correlation types. Source: Johnivan 2011.
A non-linear correlation is where the relationship between the variables cannot be expressed by a
straight line. We call this relationship curvilinear.
A non-linear relationship can exhibit monotonic positive, monotonic negative, or both patterns
together.
How can we do data analysis when relationships are non-linear?
If the relationship is non-linear but not curvilinear, we can split the data into distinct
segments. Data within some segments may be linear. In other segments, if it's curvilinear,
transformations can be applied to make them linear. Analysis is thus segment-wise, sometimes
called segmented regression. As an example, yield of mustard is not affected by soil salinity for
low values. For salinity above a threshold, there's a negative linear relation. This dataset can be
segmented at the threshold.
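A minimal R sketch of segmented regression on simulated yield/salinity data; the threshold of 6 and every other number below are made up purely for the illustration:
set.seed(7)
salinity <- runif(200, min = 0, max = 12)
# Flat response below the threshold, negative linear response above it, plus noise
yield <- 50 - 4 * pmax(salinity - 6, 0) + rnorm(200, sd = 2)
threshold <- 6                                      # assumed change point
seg <- lm(yield ~ pmax(salinity - threshold, 0))    # piecewise (segmented) linear fit
coef(seg)                                           # intercept near 50, slope beyond the threshold near -4
plot(salinity, yield)
lines(sort(salinity), fitted(seg)[order(salinity)], col = "red")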
One study about college education showed a positive correlation between SAT scores of
incoming students and their earnings when they graduate. Moreover, we can state that graduating
from elite colleges (high SAT scores) had a role in higher salaries.
Here are some possible examples of causality. High scores lead to higher earnings. Regular exercise
results in better health. Current season influences power consumption. All pairs of variables that
have causal relationship will exhibit significant correlation.
An example where correlation does not imply causation. Source: Stark 2017.
If two variables are correlated, does that imply causation?
No. Sometimes correlations are purely coincidental. For example, non-commercial space
launches and sociology doctorates awarded are completely unrelated but the image shows them to
be strongly correlated. This is called a Spurious Correlation. This is a clear case where correlation
does not imply causation.
Another example is when ice cream sales are positively correlated with violent crime.
However, violent crime is not caused by ice cream sales. It so happens that there's a confounding
variable, which in this case is weather. Hot weather influences both ice cream sales and violent
crimes. It's therefore obvious that correlation shouldn't be mistaken for causation. Look at the
physical mechanism causing
such a relationship. For example, is rain driving the sale of your product? Data may show a
correlation. It need not be causal unless your product is an umbrella. However, proving causality is
hard. At best, we can do randomized trials to establish causality.
Regression is a useful tool in either predictive or causal analysis. With the growth of Big Data, it's
being used more often for predictive analysis.
● For the base model, the sum of squared deviations of the actual value Y from the mean value E(Y)
is referred to as the Total Variance or SST (Total Sum of Squares):
SST = Σ_{i=1}^{n} (y_i − ȳ)²
● For the regression model, the sum of squared deviations of the estimated value Ŷ from the mean
value E(Y) is referred to as the Explained Variance or SSR (Regression Sum of Squares):
SSR = Σ_{i=1}^{n} (ŷ_i − ȳ)²
● The accuracy of the model is called R-Squared:
R² = Explained Variance / Total Variance = SSR / SST
The higher the R², the larger the explained variance and the lower the unexplained variance. Hence,
a higher R² value is desired. For example, if R² = 0.8, 80% of the variation in the data is explained
by the model.
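These quantities can be computed explicitly; a minimal R sketch using the simple model on the women data from the earlier examples:
model <- lm(weight ~ height, data = women)
y <- women$weight
sst <- sum((y - mean(y))^2)               # total sum of squares (total variance)
ssr <- sum((fitted(model) - mean(y))^2)   # regression sum of squares (explained variance)
ssr / sst                                 # R-squared = explained / total
summary(model)$r.squared                  # matches the value reported by lm()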
What are some challenges with regression and how to overcome them?
High multicollinearity is a challenge. It basically means one or more independent variables are
highly linearly dependent on another independent variable. This makes it difficult to estimate the
coefficients. One possible solution is to increase the sample size.
Regression assumes that errors from one observation are not related to other observations. This is
often not true with time series data. Autocorrelated errors are therefore a challenge. One
approach is to estimate the pattern in the errors and refine the regression model.
Another problem is overfitting that occurs when the model is "too well-trained". Such a model will
not fit any other data. Regularization is the technique used to avoid overfitting. For parametric
models, there are regression routines that address overfitting concerns. Lasso regression and ridge
regression are a couple of such routines.
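As a hedged sketch of how ridge and lasso regression are commonly run in R: this assumes the third-party glmnet package is installed (it is not part of the course texts), and uses the built-in mtcars data purely as example data:
library(glmnet)                          # assumed to be installed separately
x <- as.matrix(mtcars[, c("hp", "wt", "disp", "drat")])   # example predictor matrix
y <- mtcars$mpg
ridge <- cv.glmnet(x, y, alpha = 0)      # alpha = 0 gives ridge regression
lasso <- cv.glmnet(x, y, alpha = 1)      # alpha = 1 gives lasso regression
coef(ridge, s = "lambda.min")            # shrunken coefficients at the cross-validated penalty
coef(lasso, s = "lambda.min")            # lasso may set some coefficients exactly to zero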
UNIT III
Big Data processing:
Big Data technologies, Introduction to Google file system, Hadoop Architecture, Hadoop Storage:
HDFS, Common Hadoop Shell commands, NameNode, Secondary NameNode, and DataNode,
Hadoop MapReduce paradigm, Map Reduce tasks, Job, Task trackers, Introduction to NOSQL,
Textual ETL processing.
Big data is a field that treats ways to analyze, systematically extract information from, or otherwise
deal with data sets that are too large or complex to be dealt with by traditional data-
processing application software. Data with many fields (columns) offer greater statistical power,
while data with higher complexity (more attributes or columns) may lead to a higher false discovery
rate.[2] Big data analysis challenges include capturing data, data storage, data analysis,
search, sharing, transfer, visualization, querying, updating, information privacy, and data source.
Big data was originally associated with three key concepts: volume, variety, and velocity.[3] The
analysis of big data presents challenges in sampling, which previously allowed only for
observations and samples. Therefore, big data often includes data with sizes that exceed the
capacity of traditional software to process within an acceptable time and value.
Current usage of the term big data tends to refer to the use of predictive analytics, user
behavior analytics, or certain other advanced data analytics methods that extract value from big
data, and seldom to a particular size of data set. "There is little doubt that the quantities of data now
available are indeed large, but that's not the most relevant characteristic of this new data
ecosystem."[4] Analysis of data sets can find new correlations to "spot business trends, prevent
diseases, combat crime and so on".[5] Scientists, business executives, medical practitioners,
advertising and governments alike regularly meet difficulties with large data-sets in areas
including Internet searches, fintech, healthcare analytics, geographic information systems, urban
informatics, and business informatics. Scientists encounter limitations in e-Science work,
including meteorology, genomics, connectomics, complex physics simulations, biology, and
environmental research.[7]
The size and number of available data sets has grown rapidly as data is collected by devices
such as mobile devices, cheap and numerous information-sensing Internet of things devices, aerial
(remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID)
readers and wireless sensor networks.[8][9] The world's technological per-capita capacity to store
information has roughly doubled every 40 months since the 1980s;[10] as of 2012, every day
2.5 exabytes (2.5×2^60 bytes) of data are generated.[11] Based on an IDC report prediction, the global
data volume was predicted to grow exponentially from 4.4 zettabytes to 44 zettabytes between 2013
and 2020. By 2025, IDC predicts there will be 163 zettabytes of data. [12] One question for large
enterprises is determining who should own big-data initiatives that affect the entire organization.[13]
The term big data has been in use since the 1990s, with some giving credit to John
Mashey for popularizing the term. Big data usually includes data sets with sizes beyond the ability
of commonly used software tools to capture, curate, manage, and process data within a tolerable
elapsed time.[18] Big data philosophy encompasses unstructured, semi-structured and structured
data, however the main focus is on unstructured data.[19] Big data "size" is a constantly moving
target; as of 2012 ranging from a few dozen terabytes to many zettabytes of data. Big data requires
a set of techniques and technologies with new forms of integration to reveal insights from data-
sets that are diverse, complex, and of a massive scale.
"Variety", "veracity", and various other "Vs" are added by some organizations to describe it, a
revision challenged by some industry authorities. The Vs of big data were often referred to as the
"three Vs", "four Vs", and "five Vs". They represented the qualities of big data in volume, variety,
velocity, veracity, and value.[3] Variability is often included as an additional quality of big data.
A 2018 definition states "Big data is where parallel computing tools are needed to handle data", and
notes, "This represents a distinct and clearly defined change in the computer science used, via
parallel programming theories, and losses of some of the guarantees and capabilities made
by Codd's relational model."
In a comparative study of big datasets, Kitchin and McArdle found that none of the commonly
considered characteristics of big data appear consistently across all of the analyzed cases. [24] For this
reason, other studies identified the redefinition of power dynamics in knowledge discovery as the
defining trait.[25] Instead of focusing on intrinsic characteristics of big data, this alternative
perspective pushes forward a relational understanding of the object claiming that what matters is
the way in which data is collected, stored, made available and analyzed.
Big data vs. business intelligence
The growing maturity of the concept more starkly delineates the difference between "big data" and
"business intelligence":
Business intelligence uses applied mathematics tools and descriptive statistics with data with high
information density to measure things, detect trends, etc.
Big data uses mathematical analysis, optimization, inductive statistics, and concepts from nonlinear
system identification[27] to infer laws (regressions, nonlinear relationships, and causal effects) from
large sets of data with low information density[28] to reveal relationships and dependencies, or to
perform predictions of outcomes and behaviors.
Characteristics
Figure – Growth of big data's primary characteristics: volume, velocity, and variety
Big data can be described by the following characteristics:
Volume
The quantity of generated and stored data. The size of the data determines the value and potential
insight, and whether it can be considered big data or not. The size of big data is usually larger than
terabytes and petabytes.
Variety
The type and nature of the data. Earlier technologies like RDBMSs were capable of handling
structured data efficiently and effectively. However, the change in type and nature from structured
to semi-structured or unstructured challenged the existing tools and technologies. Big data
technologies evolved with the prime intention of capturing, storing, and processing semi-structured and
unstructured data (variety) generated at high speed (velocity) and huge in size (volume). Later,
these tools and technologies were also explored and used for handling structured data, mainly for
storage, while the processing of structured data remained optional, using either big data
technologies or traditional RDBMSs. This helps in analyzing data to make effective use of the hidden
insights exposed from the data collected via social media, log files, sensors, etc. Big data draws
from text, images, audio, video; plus it completes missing pieces through data fusion.
Velocity
The speed at which the data is generated and processed to meet the demands and challenges that lie
in the path of growth and development. Big data is often available in real-time. Compared to small
data, big data is produced more continually. Two kinds of velocity related to big data are the
frequency of generation and the frequency of handling, recording, and publishing.[31]
Veracity
The truthfulness or reliability of the data, which refers to the data quality and the data value. [32] Big
data must not only be large in size, but also must be reliable in order to achieve value in the analysis
of it. The data quality of captured data can vary greatly, affecting an accurate analysis.[33]
Value
The worth in information that can be achieved by the processing and analysis of large datasets.
Value also can be measured by an assessment of the other qualities of big data. [34] Value may also
represent the profitability of information that is retrieved from the analysis of big data.
Variability
The characteristic of the changing formats, structure, or sources of big data. Big data can include
structured, unstructured, or combinations of structured and unstructured data. Big data analysis may
integrate raw data from multiple sources. The processing of raw data may also involve
transformations of unstructured data to structured data.
Other possible characteristics of big data are:[35]
Exhaustive
Whether the entire system (i.e., n=all) is captured or recorded or not. Big data may or may not
include all the available data from sources.
Fine-grained and uniquely lexical
Respectively, the proportion of specific data of each element per element collected and if the
element and its characteristics are properly indexed or identified.
Relational
If the data collected contains common fields that would enable a conjoining, or meta-analysis, of
different data sets.
Extensional
If new fields in each element of the data collected can be added or changed easily.
Scalability
If the size of the big data storage system can expand rapidly.
Architecture
Big data repositories have existed in many forms, often built by corporations with a special need.
Commercial vendors historically offered parallel database management systems for big data
beginning in the 1990s. For many years, WinterCorp published the largest database report.
Teradata Corporation in 1984 marketed the parallel processing DBC 1012 system. Teradata systems
were the first to store and analyze 1 terabyte of data in 1992. Hard disk drives were 2.5 GB in 1991,
so the definition of big data continuously evolves according to Kryder's law. Teradata installed the
first petabyte-class RDBMS-based system in 2007. As of 2017, there are a few dozen petabyte-class
Teradata relational databases installed, the largest of which exceeds 50 PB. Systems up until 2008
were 100% structured relational data. Since then, Teradata has added unstructured data types
including XML, JSON, and Avro.
In 2000, Seisint Inc. (now LexisNexis Risk Solutions) developed a C++-based distributed platform
for data processing and querying known as the HPCC Systems platform. This system automatically
partitions, distributes, stores and delivers structured, semi-structured, and unstructured data across
multiple commodity servers. Users can write data processing pipelines and queries in a declarative
dataflow programming language called ECL. Data analysts working in ECL are not required to
define data schemas upfront and can rather focus on the particular problem at hand, reshaping data
in the best possible manner as they develop the solution.
In 2004, LexisNexis acquired Seisint Inc.[37] and their high-speed parallel processing platform and
successfully used this platform to integrate the data systems of Choicepoint Inc. when they acquired
that company in 2008.[38] In 2011, the HPCC systems platform was open-sourced under the Apache
v2.0 License.
CERN and other physics experiments have collected big data sets for many decades, usually
analyzed via high-throughput computing rather than the map-reduce architectures usually meant by
the current "big data" movement.
In 2004, Google published a paper on a process called MapReduce that uses a similar architecture.
The MapReduce concept provides a parallel processing model, and an associated implementation
was released to process huge amounts of data. With MapReduce, queries are split and distributed
across parallel nodes and processed in parallel (the "map" step). The results are then gathered and
delivered (the "reduce" step). The framework was very successful,[39] so others wanted to replicate
the algorithm. Therefore, an implementation of the MapReduce framework was adopted by an
Apache open-source project named "Hadoop".
Apache Spark was developed in 2012 in response to limitations in the MapReduce paradigm, as it
adds the ability to set up many operations (not just map followed by reduce).[40]
MIKE2.0 is an open approach to information management that acknowledges the need for revisions
due to big data implications identified in an article titled "Big Data Solution Offering".[41] The
methodology addresses handling big data in terms of useful permutations of data sources,
complexity in interrelationships, and difficulty in deleting (or modifying) individual records.[42]
Studies in 2012 showed that a multiple-layer architecture was one option to address the issues that
big data presents. A distributed parallel architecture distributes data across multiple servers; these
parallel execution environments can dramatically improve data processing speeds. This type of
architecture inserts data into a parallel DBMS, which implements the use of MapReduce and
Hadoop frameworks. This type of framework looks to make the processing power transparent to the
end-user by using a front-end application server.[43]
The data lake allows an organization to shift its focus from centralized control to a shared model to
respond to the changing dynamics of information management. This enables quick segregation of
data into the data lake, thereby reducing the overhead time.
Big data has increased the demand for information management specialists so much so that Software
AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, HP, and Dell have spent more than
$15 billion on software firms specializing in data management and analytics. In 2010, this industry
was worth more than $100 billion and was growing at almost 10 percent a year: about twice as fast
as the software business as a whole.[5]
Developed economies increasingly use data-intensive technologies. There are 4.6 billion mobile-phone
subscriptions worldwide, and between 1 billion and 2 billion people accessing the internet.[5]
Between 1990 and 2005, more than 1 billion people worldwide entered the middle class, which
means more people became more literate, which in turn led to information growth. The world's
effective capacity to exchange information through telecommunication networks was
281 petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes in 2000, 65 exabytes in 2007[10] and
predictions put the amount of internet traffic at 667 exabytes annually by 2014.[5] According to one
estimate, one-third of the globally stored information is in the form of alphanumeric text and still
image data,[55] which is the format most useful for most big data applications. This also shows the
potential of yet unused data (i.e. in the form of video and audio content).
While many vendors offer off-the-shelf products for big data, experts promote the development of
in-house custom-tailored systems if the company has sufficient technical capabilities.
The use and adoption of big data within governmental processes allows efficiencies in terms of
cost, productivity, and innovation,[57] but does not come without its flaws. Data analysis often
requires multiple parts of government (central and local) to work in collaboration and create new
and innovative processes to deliver the desired outcome. A common government organization that
makes use of big data is the National Security Agency (NSA), which constantly monitors the
activities of the Internet in search of potential patterns of suspicious or illegal activity its systems
may pick up.
Civil registration and vital statistics (CRVS) collects all certificate statuses from birth to death.
CRVS is a source of big data for governments.
International development
Research on the effective usage of information and communication technologies for development
(also known as "ICT4D") suggests that big data technology can make important contributions but
also present unique challenges to international development.[58][59] Advancements in big data
analysis offer cost-effective opportunities to improve decision-making in critical development areas
such as health care, employment, economic productivity, crime, security, and natural disaster and
resource management.[60][61][62] Additionally, user-generated data offers new opportunities to give the
unheard a voice.[63] However, longstanding challenges for developing regions such as inadequate
technological infrastructure and economic and human resource scarcity exacerbate existing
concerns with big data such as privacy, imperfect methodology, and interoperability issues. [60] The
challenge of "big data for development" [60] is currently evolving toward the application of this data
through machine learning, known as "artificial intelligence for development" (AI4D).[64]
Benefits
A major practical application of big data for development has been "fighting poverty with data".[65]
In 2015, Blumenstock and colleagues estimated poverty and wealth from mobile phone
metadata,[66] and in 2016 Jean and colleagues combined satellite imagery and machine
learning to predict poverty. Using digital trace data to study the labor market and the digital
economy in Latin America, Hilbert and colleagues argue that digital trace data has several benefits
such as:
Thematic coverage: including areas that were previously difficult or impossible to measure
Geographical coverage: our international sources provided sizable and comparable data for almost
all countries, including many small countries that usually are not included in international
inventories
Level of detail: providing fine-grained data with many interrelated variables, and new aspects, like
network connections
Timeliness and timeseries: graphs can be produced within days of being collected
Challenges
At the same time, working with digital trace data instead of traditional survey data does not
eliminate the traditional challenges involved when working in the field of international quantitative
analysis. Priorities change, but the basic discussions remain the same. Among the main challenges
are:
Representativeness.
While traditional development statistics is mainly concerned with the representativeness of random
survey samples, digital trace data is never a random sample.
Generalizability.
While observational data always represents this source very well, it only represents what it
represents, and nothing more. While it is tempting to generalize from specific observations of one
platform to broader settings, this is often very deceptive.
Harmonization.
Digital trace data still requires international harmonization of indicators. It adds the challenge of so-
called "data-fusion", the harmonization of different sources.
Data overload.
Analysts and institutions are not used to dealing effectively with a large number of variables, which
is done efficiently with interactive dashboards. Practitioners still lack a standard workflow that
would allow researchers, users and policymakers to work with the data efficiently and effectively.[68]
Healthcare
Big data analytics has been used in healthcare to provide personalized medicine and
prescriptive analytics, clinical risk intervention and predictive analytics, reduction of waste and care
variability, automated external and internal reporting of patient data, standardized medical terms, and
patient registries. Some areas of improvement are more aspirational than actually implemented.
The level of data generated within healthcare systems is not trivial. With the added adoption
of mHealth, eHealth and wearable technologies the volume of data will continue to increase. This
includes electronic health record data, imaging data, patient generated data, sensor data, and other
forms of difficult to process data. There is now an even greater need for such environments to pay
greater attention to data and information quality.
Big data in health research is particularly promising in terms of exploratory biomedical
research, as data-driven analysis can move forward more quickly than hypothesis-driven research.
Then, trends seen in data analysis can be tested in traditional, hypothesis-driven follow up
biological research and eventually clinical research.
Big Data Technologies
Big data technology, and Hadoop in particular, may sound like a buzzword, but it addresses a real
need. As the volume of data generated by every industry and domain has increased enormously, it
has become very important to establish efficient techniques that take care of the needs and
requirements of the clients and large industries responsible for generating that data. Earlier, data
was handled with conventional programming languages and simple structured query language, but
these systems and tools are no longer adequate for big data.
Big data technology is defined as the technology and software utilities designed for the analysis,
processing, and extraction of information from extremely large and complex data sets that
traditional systems find very difficult to deal with. Big data technology is used to handle both
real-time and batch data. Machine learning has become a very critical component of everyday life
and of every industry, and therefore managing data through big data technologies has become very
important.
Types of Big Data Technologies
Before starting with the list of technologies let us first see the broad classification of all these
technologies.
They can mainly be classified into four domains:
Data storage
Analytics
Data mining
Visualization
Hadoop:
When it comes to big data, Hadoop is the first technology that comes into play. It is based on the
MapReduce architecture and helps in processing batch-related jobs and batch information. It was
designed to store and process data in a distributed data processing environment using commodity
hardware and a simple programming execution model. It can be used to store and analyze data
present on various different machines with high storage, speed, and low cost. Hadoop forms one of
the main core components of big data technology; it was developed under the Apache Software
Foundation (version 1.0 was released in 2011) and is written in Java.
MongoDB:
Another very essential and core component of big data technology in terms of storage is the
MongoDB NoSQL database. Being a NoSQL database, the relational properties and other
RDBMS-related properties do not apply to it, and it differs from traditional RDBMS databases that
use structured query language. It stores data as flexible, schema-free documents, so the structure of
data storage is different, which makes it helpful for holding large amounts of data. It is a
cross-platform, document-oriented database program that makes use of JSON-like documents with
optional schemas. This makes it a very useful operational data store in the majority of financial
institutions, where it is working to replace traditional mainframes. MongoDB handles flexibility and
a wide variety of data types at high volumes and across distributed architectures.
Hunk:
Hunk is useful for accessing data in remote Hadoop clusters through virtual indexes, and it uses the
Splunk Search Processing Language for the analysis of data. Hunk can be used to report on and
visualize huge amounts of data from Hadoop and NoSQL databases and sources. It was developed
by Splunk in the year 2013 and is written in Java.
Cassandra:
Cassandra is a top choice among popular NoSQL databases. It is a free and open-source distributed
database with wide-column storage that can efficiently handle data on large commodity clusters;
it is used to provide high availability with no single point of failure. Its main features include a
distributed nature, scalability, a fault-tolerant mechanism, MapReduce support, tunable consistency,
its own query language (CQL), support for multi-data-center replication, and eventual consistency.
Google File System (GFS)
GFS is enhanced for Google's core data storage and usage needs (primarily the search engine),
which can generate enormous amounts of data that must be retained; Google File System grew out
of an earlier Google effort, "BigFiles", developed by Larry Page and Sergey Brin in the early days
of Google, while it was still located in Stanford. Files are divided into fixed-size chunks of
64 megabytes, similar to clusters or sectors in regular file systems, which are only extremely rarely
overwritten, or shrunk; files are usually appended to or read. It is also designed and optimized to
run on Google's computing clusters, dense nodes which consist of cheap "commodity" computers,
which means precautions must be taken against the high failure rate of individual nodes and the
subsequent data loss. Other design decisions select for high data throughputs, even when it comes
at the cost of latency.
A GFS cluster consists of multiple nodes. These nodes are divided into two types: one Master node
and multiple Chunkservers. Each file is divided into fixed-size chunks. Chunkservers store these
chunks. Each chunk is assigned a globally unique 64-bit label by the master node at the time of
creation, and logical mappings of files to constituent chunks are maintained. Each chunk is
replicated several times throughout the network. By default, it is replicated three times, but this is
configurable.[3] Files which are in high demand may have a higher replication factor, while files for
which the application client uses strict storage optimizations may be replicated less than three times
- in order to cope with quick garbage cleaning policies.[3]
The Master server does not usually store the actual chunks, but rather all
the metadata associated with the chunks, such as the tables mapping the 64-bit labels to chunk
locations and the files they make up (mapping from files to chunks), the locations of the copies of
the chunks, what processes are reading or writing to a particular chunk, or taking a "snapshot" of
the chunk in order to replicate it (usually at the instigation of the Master server when, due to node
failures, the number of copies of a chunk has fallen beneath the set number). All this metadata is
kept current by the Master server periodically receiving updates from each chunk server ("Heart-
beat messages").
Programs access the chunks by first querying the Master server for the locations of the
desired chunks; if the chunks are not being operated on (i.e. no outstanding leases exist), the Master
replies with the locations, and the program then contacts and receives the data from the chunkserver
directly (similar to Kazaa and its supernodes).
Unlike most other file systems, GFS is not implemented in the kernel of an operating system, but is
instead provided as a userspace library.[4]
Interface
The Google File System does not provide a POSIX interface.[5] Files are organized hierarchically in
directories and identified by pathnames. The file operations such as create, delete, open, close, read,
write are supported. It supports Record Append which allows multiple clients to append data to the
same file concurrently and atomicity is guaranteed.
Performance
Judging from benchmarking results,[3] when used with a relatively small number of servers (15), the
file system achieves reading performance comparable to that of a single disk (80–100 MB/s), but
has a reduced write performance (30 MB/s), and is relatively slow (5 MB/s) in appending data to
existing files. The authors present no results on random seek time. As the master node is not
directly involved in data reading (the data are passed from the chunk server directly to the reading
client), the read rate increases significantly with the number of chunk servers, achieving 583 MB/s
for 342 nodes. Aggregating multiple servers also allows big capacity, while it is somewhat reduced
by storing data in three independent locations (to provide redundancy).
HDFS architecture
HDFS can be presented as a master/slave architecture. The HDFS master is named the
NameNode, whereas the slave is named the DataNode. The NameNode is a server that manages the
filesystem namespace and regulates access (open, close, rename, and more) to files by the client.
It divides the input data into blocks and announces which data block will be stored on which
DataNode. The DataNode is a slave machine that stores the replicas of the partitioned dataset and
serves the data as requests come in. It also performs block creation and deletion.
The internal mechanism of HDFS divides a file into one or more blocks, and these blocks are
stored on a set of DataNodes. Under normal circumstances, with a replication factor of three, the
HDFS strategy is to place the first copy on the local node, the second copy on a different node in
the local rack, and the third copy on a node in a different rack. As HDFS is designed to support
large files, the HDFS block size is defined as 64 MB. If required, this can be increased.
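To make these numbers concrete, the small sketch below computes how many 64 MB blocks a file occupies and how many block replicas HDFS stores for it under a replication factor of three. The 200 MB file size is hypothetical.

// Sketch: number of HDFS blocks and stored replicas for one file,
// assuming the 64 MB block size and replication factor of three described above.
public class HdfsBlockCount {
    public static void main(String[] args) {
        long fileSizeMb = 200;      // hypothetical file size
        long blockSizeMb = 64;      // default HDFS block size mentioned above
        int replicationFactor = 3;  // first copy local, second on the local rack, third on another rack

        long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb; // ceiling division -> 4 blocks
        long replicas = blocks * replicationFactor;                 // 4 x 3 = 12 stored block replicas

        System.out.println("Blocks: " + blocks + ", total block replicas stored: " + replicas);
    }
}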
Understanding HDFS components
HDFS is managed with the master-slave architecture included with the
following components:
• NameNode: This is the master of the HDFS system. It maintains the
directories, files, and manages the blocks that are present on the DataNodes.
• DataNode: These are slaves that are deployed on each machine and provide
actual storage. They are responsible for serving read-and-write data requests
for the clients.
• Secondary NameNode: This is responsible for performing periodic
checkpoints. So, if the NameNode fails at any time, it can be replaced with
a snapshot image stored by the secondary NameNode checkpoints.
MapReduce architecture
MapReduce is also implemented over a master-slave architecture. Classic MapReduce contains job
submission, job initialization, task assignment, task execution, progress and status updates, and job
completion-related activities, which are mainly managed by the JobTracker node and executed by
TaskTrackers. A client application submits a job to the JobTracker. The input is then divided across
the cluster. The JobTracker calculates the number of map and reduce tasks to be processed and
commands the TaskTrackers to start executing the job. The TaskTracker then copies the resources
to a local machine and launches a JVM to run the Map and Reduce programs over the data.
Along with this, the TaskTracker periodically sends updates to the JobTracker, which can be
considered a heartbeat that helps to update the JobID, job status, and usage of resources.
Hadoop is a top-level Apache project and is a very complicated Java framework.
To avoid technical complications, the Hadoop community has developed a number
of Java frameworks that have added extra value to Hadoop features. They are
considered Hadoop subprojects. Here, we are going to discuss several Hadoop
components that can be considered as abstractions of HDFS or MapReduce.
The following are the components of Hadoop that are responsible for performing
analytics over Big Data:
• Client: This initializes the job
• JobTracker: This monitors the job
• TaskTracker: This executes the job
• HDFS: This stores the input and output data
The four main stages of Hadoop MapReduce data processing are as follows:
• The loading of data into HDFS
• The execution of the Map phase
• Shuffling and sorting
• The execution of the Reduce phase
Loading data into HDFS
The input dataset needs to be uploaded to the Hadoop directory so it can be used by
MapReduce nodes. Then, Hadoop Distributed File System (HDFS) will divide the
input dataset into data splits and store them to DataNodes in a cluster by taking care
of the replication factor for fault tolerance. All the data splits will be processed by
TaskTracker for the Map and Reduce tasks in a parallel manner.
Also, there are some alternative ways to get the dataset in HDFS with
Hadoop components:
• Sqoop: This is an open source tool designed for efficiently transferring bulk
data between Apache Hadoop and structured, relational databases. Suppose
your application has already been configured with the MySQL database
and you want to use the same data for performing data analytics, Sqoop is
recommended for importing datasets to HDFS. Also, after the completion
of the data analytics process, the output can be exported to the MySQL
database.
• Flume: This is a distributed, reliable, and available service for efficiently
collecting, aggregating, and moving large amounts of log data to HDFS.
Flume is able to read data from most sources, such as logfiles, sys logs, and
the standard output of the Unix process.
Using the preceding data collection and movement frameworks can make the data transfer process
very easy for MapReduce applications used for data analytics.
Map phase execution
The list of (key, value) pairs is generated such that the key attribute will be repeated many times,
so that the key attribute can be re-used in the Reducer for aggregating values in MapReduce. As far
as format is concerned, the Mapper output value format and the Reducer input value format must be
the same.
After the completion of this Map operation, the TaskTracker will keep the result in its buffer
storage and local disk space (if the output data size is more than the threshold).
For example, suppose we have a Map function that converts the input text into
lowercase. This will convert the list of input strings into a list of lowercase strings.
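A minimal sketch of such a lowercase Mapper, written against the same old org.apache.hadoop.mapred API that the example program later in this unit uses (the class name LowerCaseMapper is illustrative, not part of any Hadoop distribution):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits each input line converted to lowercase, keyed by the line's byte offset.
public class LowerCaseMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, Text> {
    public void map(LongWritable key, Text value,
            OutputCollector<LongWritable, Text> output, Reporter reporter) throws IOException {
        output.collect(key, new Text(value.toString().toLowerCase()));
    }
}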
Reducing phase execution
As soon as the Mapper output is available, the TaskTracker on the Reducer node will retrieve the
available partitioned Map output data, which will be grouped together and merged into one large
file that is then assigned to a process with a Reducer method. Finally, this data is sorted before it is
provided to the Reducer method.
The Reducer method receives a list of input values from an input (key, list
(value)) and aggregates them based on custom logic, and produces the output
(key, value) pairs.
The output of the Reducer method of the Reduce phase will directly be written into
HDFS as per the format specified by the MapReduce job configuration class.
MapReduce objects
MapReduce operations in Hadoop are carried out mainly by three objects: Mapper, Reducer, and
Driver.
• Mapper:
This is designed for the Map phase of MapReduce, which starts MapReduce operations by
carrying input files and splitting them into several pieces. For each piece, it will emit a key-value
data pair as the output value.
• Reducer:
This is designed for the Reduce phase of a MapReduce job; it accepts key-based grouped
data from the Mapper output, reduces it by aggregation logic, and emits the (key, value) pair for the
group of values.
• Driver:
This is the main file that drives the MapReduce process. It starts the execution of
MapReduce tasks after getting a request from the client application with parameters. The Driver file
is responsible for building the configuration of a job and submitting it to the Hadoop cluster. The
Driver code will contain the main() method that accepts arguments from the command line. The
input and output directory of the Hadoop MapReduce job will be accepted by this program. Driver
is the main file for defining job configuration details, such as the job name, job input format, job
output format, and the Mapper, Combiner, Partitioner, and Reducer classes. MapReduce is
initialized by calling this main() function of the Driver class.
Map reduce data flow
The two APIs available for Hadoop MapReduce are: New (Hadoop 1.x and 2.x) and Old
Hadoop (0.20). YARN is the next generation of Hadoop MapReduce and the new Apache Hadoop
subproject that has been released for Hadoop resource management.
Hadoop data processing includes several tasks that help achieve the final output from an input
dataset. These tasks are as follows:
1. Preloading data in HDFS.
2. Running MapReduce by calling Driver.
3. Reading of the input data by the Mappers, which results in the splitting of the data, the execution
of the Mapper custom logic, and the generation of intermediate key-value pairs.
4. Executing the Combiner and the shuffle phase to optimize the overall Hadoop MapReduce process.
5. Sorting and providing the intermediate key-value pairs to the Reduce phase. The Reduce phase is
then executed: Reducers take these partitioned key-value pairs and aggregate them based on the
Reducer logic.
6. The final output data is stored in HDFS.
Here, Map and Reduce tasks can be defined for several data operations as follows:
• Data extraction
• Data loading
• Data segmentation
• Data cleaning
• Data transformation
• Data integration
Common Hadoop Shell commands
ls: This command is used to list all the files. Use lsr for recursive approach. It is useful when we
want a hierarchy of a folder.
bin/hdfs dfs -ls <path>
mkdir:
To create a directory. In Hadoop dfs there is no home directory by default. So let’s first create it.
Syntax:
bin/hdfs dfs -mkdir <folder name>
Creating a home directory:
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/username    (replace username with the username of your computer)
copyFromLocal (or) put: To copy files/folders from local file system to hdfs store. This is the
most important command. Local filesystem means the files present on the OS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
copyToLocal (or) get: To copy files/folders from hdfs store to local file system.
Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
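The same operations can also be performed programmatically. Below is a minimal sketch using the Hadoop FileSystem Java API that mirrors the mkdir, copyFromLocal, and copyToLocal commands above; the paths are hypothetical and the cluster configuration is assumed to be available on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalent {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml if present
        FileSystem fs = FileSystem.get(conf);

        fs.mkdirs(new Path("/user/username"));                          // like: bin/hdfs dfs -mkdir
        fs.copyFromLocalFile(new Path("/home/hadoop/sample.txt"),
                new Path("/user/username/sample.txt"));                 // like: -copyFromLocal (put)
        fs.copyToLocalFile(new Path("/user/username/sample.txt"),
                new Path("/home/hadoop/sample_copy.txt"));              // like: -copyToLocal (get)

        fs.close();
    }
}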
• NameNode: This is the master of the HDFS system. It maintains the directories, files, and manages
the blocks that are present on the DataNodes.
• DataNode: These are slaves that are deployed on each machine and provide actual storage. They
are responsible for serving read-and-write data requests for the clients.
• Secondary NameNode: This is responsible for performing periodic checkpoints. So, if the
NameNode fails at any time, it can be replaced with a snapshot image stored by the secondary
NameNode checkpoints.
HADOOP MAPREDUCE PARADIGM
MapReduce is a processing technique and a programming model for distributed computing based
on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map
takes a set of data and converts it into another set of data, where individual elements are broken
down into tuples (key/value pairs). Second, the reduce task takes the output from a map as an
input and combines those data tuples into a smaller set of tuples. As the name MapReduce implies,
the reduce task is always performed after the map job.
The Algorithm
Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
Map stage − The map or mapper’s job is to process the input data. Generally the input data is in the
form of file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to
the mapper function line by line. The mapper processes the data and creates several small chunks of
data.
Reduce stage − This stage is the combination of the Shuffle stage and the Reduce stage. The
Reducer’s job is to process the data that comes from the mapper. After processing, it produces a
new set of output, which will be stored in the HDFS.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the
cluster.
The framework manages all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with data on local disks that reduces the network
traffic.
After completion of the given tasks, the cluster collects and reduces the data to form an appropriate
result, and sends it back to the Hadoop server.
Terminology
PayLoad − Applications implement the Map and the Reduce functions, and form the core of the
job.
Mapper − Mapper maps the input key/value pairs to a set of intermediate key/value pair.
NamedNode − Node that manages the Hadoop Distributed File System (HDFS).
DataNode − Node where data is presented in advance before any processing takes place.
MasterNode − Node where JobTracker runs and which accepts job requests from clients.
SlaveNode − Node where Map and Reduce program runs.
JobTracker − Schedules jobs and tracks the assigned jobs to the TaskTracker.
Task Tracker − Tracks the task and reports status to JobTracker.
Job − A program is an execution of a Mapper and Reducer across a dataset.
Task − An execution of a Mapper or a Reducer on a slice of data.
Task Attempt − A particular instance of an attempt to execute a task on a SlaveNode.
Example Scenario
Given below is the data regarding the electrical consumption of an organization. It contains the
monthly electrical consumption and the annual average for various years.
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Avg
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45
If the above data is given as input, we have to write applications to process it and produce
results such as finding the year of maximum usage, year of minimum usage, and so on. This is a
walkover for the programmers with finite number of records. They will simply write the logic to
produce the required output, and pass the data to the application written.
But, think of the data representing the electrical consumption of all the large scale industries of a
particular state, since its formation.
When we write applications to process such bulk data, they will take a lot of time to execute.
There will be heavy network traffic when we move data from the source to the network server, and so on.
To solve these problems, we have the MapReduce framework.
Input Data
The above data is saved as sample.txt and given as input. The input file looks as shown below.
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45
Example Program
Given below is the program to process the sample data using the MapReduce framework.
package hadoop;
import java.util.*;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class ProcessUnits {
//Mapper class
public static class E_EMapper extends MapReduceBase implements
Mapper<LongWritable ,/*Input key Type */
Text, /*Input value Type*/
Text, /*Output key Type*/
IntWritable> /*Output value Type*/
{
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
String lasttoken = null;
StringTokenizer s = new StringTokenizer(line,"\t");
String year = s.nextToken();
while(s.hasMoreTokens()) {
lasttoken = s.nextToken(); }
int avgprice = Integer.parseInt(lasttoken);
output.collect(new Text(year), new IntWritable(avgprice));
} }
//Reducer class
public static class E_EReduce extends MapReduceBase implements Reducer< Text, IntWritable,
Text, IntWritable > {
//Reduce function
public void reduce( Text key, Iterator <IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
int maxavg = 30; // threshold: only yearly averages above 30 units are emitted
int val = Integer.MIN_VALUE;
while (values.hasNext()) {
if((val = values.next().get())>maxavg) {
output.collect(key, new IntWritable(val));
} }} }
//Main function
public static void main(String args[])throws Exception {
JobConf conf = new JobConf(ProcessUnits.class);
conf.setJobName("max_eletricityunits");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(E_EMapper.class);
conf.setCombinerClass(E_EReduce.class);
conf.setReducerClass(E_EReduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}}
Save the above program as ProcessUnits.java. The compilation and execution of the program is
explained below.
Compilation and Execution of Process Units Program
Let us assume we are in the home directory of a Hadoop user (e.g. /home/hadoop).
Follow the steps given below to compile and execute the above program.
Step 1
The following command is to create a directory to store the compiled java classes.
$ mkdir units
Step 2
Download Hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce program.
Visit the following link mvnrepository.com to download the jar. Let us assume the downloaded
folder is /home/hadoop/.
Step 3
The following commands are used for compiling the ProcessUnits.java program and creating a jar
for the program.
$ javac -classpath hadoop-core-1.2.1.jar -d units ProcessUnits.java
$ jar -cvf units.jar -C units/ .
Step 4
The following command is used to create an input directory in HDFS.
$HADOOP_HOME/bin/hadoop fs -mkdir input_dir
Step 5
The following command is used to copy the input file named sample.txt into the input directory of
HDFS.
$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt input_dir
Step 6
The following command is used to verify the files in the input directory.
$HADOOP_HOME/bin/hadoop fs -ls input_dir/
Step 7
The following command is used to run the Eleunit_max application by taking the input files from
the input directory.
$HADOOP_HOME/bin/hadoop jar units.jar hadoop.ProcessUnits input_dir output_dir
Wait for a while until the file is executed. After execution, as shown below, the output will contain
the number of input splits, the number of Map tasks, the number of reducer tasks, etc.
INFO mapreduce.Job: Job job_1414748220717_0002
completed successfully
14/10/31 06:02:52
INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read = 61
FILE: Number of bytes written = 279400
FILE: Number of read operations = 0
FILE: Number of large read operations = 0
FILE: Number of write operations = 0
HDFS: Number of bytes read = 546
HDFS: Number of bytes written = 40
HDFS: Number of read operations = 9
HDFS: Number of large read operations = 0
HDFS: Number of write operations = 2
Job Counters
Launched map tasks = 2
Launched reduce tasks = 1
Data-local map tasks = 2
Total time spent by all maps in occupied slots (ms) = 146137
Total time spent by all reduces in occupied slots (ms) = 441
Total time spent by all map tasks (ms) = 14613
Total time spent by all reduce tasks (ms) = 44120
Total vcore-seconds taken by all map tasks = 146137
Total vcore-seconds taken by all reduce tasks = 44120
Total megabyte-seconds taken by all map tasks = 149644288
Total megabyte-seconds taken by all reduce tasks = 45178880
Map-Reduce Framework
Map input records = 5
Map output records = 5
Map output bytes = 45
Map output materialized bytes = 67
Input split bytes = 208
Combine input records = 5
Combine output records = 5
Reduce input groups = 5
Reduce shuffle bytes = 6
Reduce input records = 5
Reduce output records = 5
Spilled Records = 10
Shuffled Maps = 2
Failed Shuffles = 0
Merged Map outputs = 2
GC time elapsed (ms) = 948
CPU time spent (ms) = 5160
Physical memory (bytes) snapshot = 47749120
Virtual memory (bytes) snapshot = 2899349504
Total committed heap usage (bytes) = 277684224
$HADOOP_HOME/bin/hadoop fs -cat output_dir/part-00000
$HADOOP_HOME/bin/hadoop fs -get output_dir /home/hadoop
Important Commands
All Hadoop commands are invoked by the $HADOOP_HOME/bin/hadoop command. Running
the Hadoop script without any arguments prints the description for all commands.
Usage − hadoop [--config confdir] COMMAND
The following table lists the options available and their description.
Sr.No. Option & Description
1 namenode -format
Formats the DFS filesystem.
2 secondarynamenode
Runs the DFS secondary namenode.
3 namenode
Runs the DFS namenode.
4 datanode
Runs a DFS datanode.
5 dfsadmin
Runs a DFS admin client.
6 mradmin
Runs a Map-Reduce admin client.
7 fsck
Runs a DFS filesystem checking utility.
8 fs
Runs a generic filesystem user client.
9 balancer
Runs a cluster balancing utility.
10 oiv
Applies the offline fsimage viewer to an fsimage.
11 fetchdt
Fetches a delegation token from the NameNode.
12 jobtracker
Runs the MapReduce job Tracker node.
13 pipes
Runs a Pipes job.
14 tasktracker
Runs a MapReduce task Tracker node.
15 historyserver
Runs job history servers as a standalone daemon.
16 job
Manipulates the MapReduce jobs.
17 queue
Gets information regarding JobQueues.
18 version
Prints the version.
19 jar <jar>
Runs a jar file.
23 classpath
Prints the class path needed to get the Hadoop jar and the required libraries.
24 daemonlog
Get/Set the log level for each daemon
The following options are used with the hadoop job command.
1 -submit <job-file>
Submits the job.
2 -status <job-id>
Prints the map and reduce completion percentage and all job counters.
3 -counter <job-id> <group-name> <counter-name>
Prints the counter value.
4 -kill <job-id>
Kills the job.
7 -list[all]
Displays all jobs. -list displays only jobs which are yet to complete.
8 -kill-task <task-id>
Kills the task. Killed tasks are NOT counted against failed attempts.
9 -fail-task <task-id>
Fails the task. Failed tasks are counted against failed attempts.
The number of Maps is usually determined by the size of the input data and the size of the data split
block, that is, by the total HDFS file size divided by the data split (block) size. Therefore, if we have
an HDFS data file of 5 TB and a block size of 128 MB, there will be 40,960 map tasks for the file.
Sometimes, however, the number of Mappers created will be more than this count because of
speculative execution. This is true when the input is a file, though it entirely depends on the
InputFormat class.
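As a quick check of that arithmetic: 5 TB = 5 × 1024 × 1024 MB = 5,242,880 MB, and 5,242,880 MB ÷ 128 MB per split = 40,960 input splits, hence roughly 40,960 map tasks.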
In Hadoop MapReduce processing, there will be a delay in the result of the job when the
assigned Mapper or Reducer is taking a long time to finish.
MapReduce dataflow
Now that we have seen the components that make a basic MapReduce job possible, we will look at
how everything works together at a higher level: the MapReduce dataflow with multiple nodes in a
Hadoop cluster.
Introduction to NoSQL
NoSQL, originally referring to "non-SQL" or "non-relational", is a database that provides a
mechanism for the storage and retrieval of data. This data is modeled in means other than the tabular
relations used in relational databases. Such databases came into existence in the late 1960s, but did
not obtain the NoSQL moniker until a surge of popularity in the early twenty-first century.
The suitability of a given NoSQL database depends on the problem it should solve. The data
structures used by NoSQL databases are sometimes also viewed as more flexible than relational
database tables. Most NoSQL databases offer a concept of eventual consistency, in which database
changes are propagated to all nodes eventually, so queries for data might not return updated data
immediately or might read data that is not accurate, a problem known as stale reads.
Advantages of NoSQL:
There are many advantages of working with NoSQL databases such as MongoDB and Cassandra.
The main advantages are high scalability and high availability.
High scalability –
NoSQL databases use sharding for horizontal scaling. Sharding is the partitioning of data and its
placement on multiple machines in such a way that the order of the data is preserved. Vertical
scaling means adding more resources to the existing machine, whereas horizontal scaling means
adding more machines to handle the data. Vertical scaling is not that easy to implement, but
horizontal scaling is easy to implement. Examples of horizontally scaling databases are MongoDB,
Cassandra, etc. NoSQL can handle huge amounts of data because of this scalability; as the data
grows, NoSQL scales itself to handle that data in an efficient manner.
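A common way to shard is to hash each key to one of N partitions. The toy sketch below shows the basic idea; the shard count and keys are hypothetical, and real systems such as MongoDB and Cassandra use far more sophisticated placement (shard keys, consistent hashing, virtual nodes).

// Toy hash-based sharding: each key is routed to one of N shards (machines).
public class ShardingSketch {
    static int shardFor(String key, int numShards) {
        // Math.floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int numShards = 4; // hypothetical cluster of 4 machines
        for (String key : new String[] {"user:101", "user:102", "order:9001", "order:9002"}) {
            System.out.println(key + " -> shard " + shardFor(key, numShards));
        }
    }
}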
High availability –
The auto-replication feature in NoSQL databases makes them highly available, because in case of
any failure the data replicates itself back to a previous consistent state.
Disadvantages of NoSQL:
NoSQL has the following disadvantages.
Narrow focus –
NoSQL databases have a very narrow focus: they are mainly designed for storage and provide very
little other functionality. Relational databases are a better choice in the field of transaction
management than NoSQL.
Open-source –
NoSQL is open-source database. There is no reliable standard for NoSQL yet. In other words two
database systems are likely to be unequal.
Management challenge –
The purpose of big data tools is to make management of a large amount of data as simple as
possible. But it is not so easy. Data management in NoSQL is much more complex than a relational
database. NoSQL, in particular, has a reputation for being challenging to install and even more
hectic to manage on a daily basis.
GUI is not available –
GUI tools to access the database are not widely available in the market.
Backup –
Backup is a great weak point for some NoSQL databases like MongoDB. MongoDB has no
approach for the backup of data in a consistent manner.
Large document size –
Some database systems, like MongoDB and CouchDB, store data in JSON format, which means that
documents are quite large (costing storage, network bandwidth, and speed), and having descriptive
key names actually hurts, since they increase the document size.
Types of NoSQL database:
Types of NoSQL databases, and the database systems that fall in each category, are:
Key value store: Memcached, Redis, Coherence
Tabular: HBase, Big Table, Accumulo
Document based: MongoDB, CouchDB, Cloudant
For example, MongoDB falls in the category of NoSQL document-based databases.
When should NoSQL be used:
When a huge amount of data needs to be stored and retrieved; the relationships between the data
you store are not that important; the data changes over time and is not structured; support for
constraints and joins is not required at the database level; and the data is growing continuously and
you need to scale the database regularly to handle it.
NoSQL Data Architecture Patterns
An architecture pattern is a logical way of categorizing data that will be stored in a database.
NoSQL is a type of database which helps to perform operations on big data and store it in a valid
format. It is widely used because of its flexibility and wide variety of services.
Architecture Patterns of NoSQL:
The data is stored in NoSQL in any of the following four data architecture patterns.
1. Key-Value Store Database
2. Column Store Database
3. Document Database
4. Graph Database
These are explained as following below.
1. Key-Value Store Database:
This model is one of the most basic models of NoSQL databases. As the name suggests, the data is
stored in the form of key-value pairs. The key is usually a sequence of strings, integers or characters
but can also be a more advanced data type. The value is typically linked or co-related to the key.
Key-value pair storage databases generally store data as a hash table where each key is unique. The
value can be of any type (JSON, BLOB (Binary Large Object), strings, etc.). This type of pattern is
usually used in shopping websites or e-commerce applications (see the sketch at the end of this
subsection).
Advantages:
Can handle large amounts of data and heavy load.
Easy retrieval of data by keys.
Limitations:
Complex queries may attempt to involve multiple key-value pairs which may delay performance.
Data can be involving many-to-many relationships which may collide.
Examples:
DynamoDB
Berkeley DB
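The access pattern of a key-value store can be illustrated with an in-memory map; a real store such as Redis or DynamoDB exposes essentially the same put/get/delete interface, only over the network and with persistence. The keys and values below are hypothetical.

import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for a key-value store: unique keys, opaque values.
public class KeyValueSketch {
    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();

        // put: associate a value (here a JSON-like string) with a unique key
        store.put("cart:alice", "{\"items\": [\"pen\", \"notebook\"]}");
        store.put("cart:bob", "{\"items\": [\"mouse\"]}");

        // get: retrieval is always by key, which is why lookups are fast
        System.out.println(store.get("cart:alice"));

        // delete: remove the entry for a key
        store.remove("cart:bob");
    }
}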
3. Document Database:
The document database fetches and accumulates data in the form of key-value pairs, but here the
values are called documents. A document can be stated as a complex data structure; it can be in the
form of text, arrays, strings, JSON, XML or any such format. The use of nested documents is also
very common. It is very effective, as most of the data created is usually in the form of JSON and is
unstructured.
Advantages:
This type of format is very useful and apt for semi-structured data.
Storage, retrieval, and management of documents are easy.
Limitations:
Handling multiple documents is challenging
Aggregation operations may not work accurately.
Examples:
MongoDB
CouchDB
Figure – Document Store Model in form of JSON documents
4. Graph Databases:
Clearly, this architecture pattern deals with the storage and management of data in graphs. Graphs
are basically structures that depict connections between two or more objects in some data. The
objects or entities are called nodes and are joined together by relationships called edges. Each edge
has a unique identifier, and each node serves as a point of contact for the graph (see the sketch at the
end of this subsection).
Advantages:
Fastest traversal because of connections.
Spatial data can be easily handled.
Limitations:
Wrong connections may lead to infinite loops.
Examples:
Neo4J
FlockDB( Used by Twitter)
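The node-and-edge model can be sketched with a simple adjacency list; dedicated graph databases such as Neo4j add persistence, indexing, and a query language on top of this idea. The nodes and relationships below are hypothetical.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy adjacency-list graph: nodes joined by directed edges ("follows" relationships).
public class GraphSketch {
    public static void main(String[] args) {
        Map<String, List<String>> follows = new HashMap<>();
        follows.put("alice", Arrays.asList("bob", "carol"));
        follows.put("bob", Arrays.asList("carol"));
        follows.put("carol", Arrays.asList());

        // Traversal: which nodes can "alice" reach in one hop?
        for (String friend : follows.get("alice")) {
            System.out.println("alice -> " + friend);
        }
    }
}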
ETL (Extract, Transform, Load) processing
Let us understand each step of the ETL process in-depth:
Extraction:
The first step of the ETL process is extraction. In this step, data from various source systems, which
can be in various formats like relational databases, NoSQL, XML, and flat files, is extracted into the
staging area. It is important to extract the data from the various source systems and store it in the
staging area first, and not directly in the data warehouse, because the extracted data is in various
formats and can also be corrupted. Loading it directly into the data warehouse may damage it, and
rollback will be much more difficult. Therefore, this is one of the most important steps of the ETL
process.
Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or functions are
applied on the extracted data to convert it into a single standard format. It may involve following
processes/tasks:
Filtering – loading only certain attributes into the data warehouse.
Cleaning – filling up the NULL values with some default values, mapping U.S.A, United States,
and America into USA, etc. (a small sketch of this kind of cleaning step follows this list).
Joining – joining multiple attributes into one.
Splitting – splitting a single attribute into multiple attributes.
Sorting – sorting tuples on the basis of some attribute (generally key-attribute).
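A minimal sketch of the Cleaning task mentioned above, normalizing country spellings and filling NULLs with a default; the lookup table, default value, and sample records are hypothetical.

import java.util.HashMap;
import java.util.Map;

// Sketch of a cleaning transformation: default for NULLs, canonical spelling for countries.
public class CleaningTransform {
    private static final Map<String, String> COUNTRY = new HashMap<>();
    static {
        COUNTRY.put("U.S.A", "USA");
        COUNTRY.put("United States", "USA");
        COUNTRY.put("America", "USA");
    }

    static String cleanCountry(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return "UNKNOWN";                            // default value for NULL / empty fields
        }
        return COUNTRY.getOrDefault(raw.trim(), raw.trim());
    }

    public static void main(String[] args) {
        for (String s : new String[] {"U.S.A", "United States", null, "India"}) {
            System.out.println(s + " -> " + cleanCountry(s));
        }
    }
}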
Loading:
The third and final step of the ETL process is loading. In this step, the transformed data is finally
loaded into the data warehouse. Sometimes the data is updated by loading into the data warehouse
very frequently and sometimes it is done after longer but regular intervals. The rate and period of
loading solely depends on the requirements and varies from system to system.
The ETL process can also use the pipelining concept, i.e., as soon as some data is extracted, it can
be transformed, and during that period some new data can be extracted. While the transformed data
is being loaded into the data warehouse, the already extracted data can be transformed. The block
diagram of the pipelining of the ETL process is shown below:
Figure – Pipelining of the ETL process
ETL Tools: Most commonly used ETL tools are Sybase, Oracle Warehouse builder, CloverETL,
and MarkLogic.
Textual ETL is the process of reading text and producing a relational database suitable for
analytical processing. The text can come from any electronic source and the results can go into
any relational database. ... Email typically contains spam and personal email which does not belong
in a corporate database.
UNIT IV
Big Data analytics:
Data analytics life cycle, Data cleaning , Data transformation, Comparing reporting and analysis,
Types of analysis, Analytical approaches, Data analytics using R, Exploring basic features of R,
Exploring R GUI, Reading data sets, Manipulating and processing data in R, Functions and
packages in R, Performing graphical analysis.
UNIT 4
Big Data Analytics
Types of analysis
Analysis of data is a vital part of running a successful business. There are four types of data
analysis that are in use across all industries. While we separate these into categories, they are all
linked together and build upon each other. As you move from the simplest type of analytics to more complex ones, the degree of difficulty and the resources required increase. At the same time, the level of added insight and value also increases.
Descriptive Analysis
The first type of data analysis is descriptive analysis. It is at the foundation of all data insight. It is
the simplest and most common use of data in business today. Descriptive analysis answers the
“what happened” by summarizing past data, usually in the form of dashboards.
The biggest use of descriptive analysis in business is to track Key Performance Indicators (KPIs).
KPIs describe how a business is performing based on chosen benchmarks.
● KPI dashboards
● Monthly revenue reports
● Sales leads overview
Diagnostic Analysis
Diagnostic analysis takes the insights found from descriptive analytics and drills down to
find the causes of those outcomes. Organizations make use of this type of analytics as it creates
more connections between data and identifies patterns of behavior.
A critical aspect of diagnostic analysis is creating detailed information. When new problems
arise, it is possible you have already collected certain data pertaining to the issue. By already
86
having the data at your disposal, you avoid repeating work, and all the problems become
interconnected.
Predictive Analysis
This type of analysis is another step up from the descriptive and diagnostic analyses. Predictive
analysis uses the data we have summarized to make logical predictions of the outcomes of events.
This analysis relies on statistical modeling, which requires added technology and manpower to
forecast. It is also important to understand that forecasting is only an estimate; the accuracy of
predictions relies on quality and detailed data.
● Risk Assessment
● Sales Forecasting
● Using customer segmentation to determine which leads have the best chance of converting
● Predictive analytics in customer success teams
Analytical Approach
An analytical approach is the use of analysis to break a problem down into the elements necessary
to solve it. It's the same as formal analysis.
Difficult social problems require an analytical approach because they do not yield to any other
approach.
Researchers use different analytical techniques to examine complex relationships between
variables. There are three basic types of analytical techniques:
● Regression Analysis
● Grouping Methods
● Multiple Equation Models
Regression Analysis
Regression analysis assumes that the dependent, or outcome, variable is directly affected by one or
more independent variables. There are four important types of regression analyses:
1. Ordinary least squares (OLS) regression
o OLS regression (also known as linear regression) is used to determine the
relationship between a dependent variable and one or more independent variables.
o OLS regression is used when the dependent variable is continuous. Continuous
variables, in theory, can take on any value within a range. For example, family child
care expense, measured in dollars, is a continuous variable.
87
o When used to estimate the associations between two or more independent variables
and a single dependent variable, it is called multiple linear regression.
o In multiple regression, the coefficient (i.e., standardized or unstandardized
regression coefficient for each independent variable) tells you how much the
dependent variable is expected to change when that independent variable increases
by one, holding all the other independent variables constant.
2. Logistic regression
o Logistic regression (or logit regression) is a special form of regression analysis that
is used to examine the associations between a set of independent or predictor
variables and a dichotomous outcome variable. A dichotomous variable is a variable
with only two possible values, e.g. child receives child care before or after the Head
Start program day (yes, no).
o Like linear regression, the independent variables may be either interval, ordinal, or
nominal. A researcher might use logistic regression to study the relationships
between parental education, household income, and parental employment and
whether children receive child care from someone other than their parents (receives
nonparent care/does not receive nonparent care).
3. Hierarchical linear modeling (HLM)
o Used when data are nested. Nested data occur when several individuals belong to the
same group under study. For example, in child care research, children enrolled in a
center-based child care program are grouped into classrooms with several
classrooms in a center. Thus, the children are nested within classrooms and
classrooms are nested within centers.
o Allows researchers to determine the effects of characteristics for each level of nested
data, classrooms and centers, on the outcome variables. HLM is also used to study
growth (e.g., growth in children’s reading and math knowledge and skills over time).
4. Duration models
o Used to estimate the length of time before a given event occurs or the length of time
spent in a state. For example, in child care policy research, duration models have
been used to estimate the length of time that families receive child care subsidies.
o Sometimes referred to as survival analysis or event history analysis.
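The base R functions lm() and glm() fit the first two of these models. The sketch below is illustrative only; the data frame families and its variables are hypothetical names, not taken from the course material.
# OLS / multiple linear regression: continuous outcome (hypothetical data)
ols_fit <- lm(child_care_expense ~ income + parent_education, data = families)
summary(ols_fit)

# Logistic regression: dichotomous (yes/no) outcome, fitted with glm()
logit_fit <- glm(receives_nonparent_care ~ income + parent_education,
                 data = families, family = binomial)
summary(logit_fit)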
Grouping Methods
Grouping methods are techniques for classifying observations into meaningful categories. Two of
the most common grouping methods are discriminant analysis and cluster analysis.
a)Discriminant analysis
● Identifies characteristics that distinguish between groups. For example, a researcher could
use discriminant analysis to determine which characteristics identify families that seek child
care subsidies and which identify families that do not.
● It is used when the dependent variable is a categorical variable (e.g., family receives child
care subsidies [yes, no], child enrolled in family care [yes, no], type of child care child
receives [relative care, non-relative care, center-based care]). The independent variables are
interval variables (e.g., years of schooling, family income).
b)Cluster analysis
88
● Used to classify similar individuals together. It uses a set of measured variables to classify a
sample of individuals (or organizations) into a number of groups such that individuals with
similar values on the variables are placed in the same group. For example, cluster analysis
would be used to group together parents who hold similar views of child care or children
who are suspended from school.
● Its goal is to sort individuals into groups in such a way that individuals in the same group
(cluster) are more similar to each other than to individuals in other groups.
● The variables used in cluster analysis may be nominal, ordinal or interval.
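As a minimal sketch of cluster analysis, the built-in iris data set can be grouped with k-means and hierarchical clustering; the choice of three clusters here is illustrative.
set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3)      # k-means on the four measurements
table(km$cluster, iris$Species)             # compare clusters with the known species

hc <- hclust(dist(iris[, 1:4]))             # hierarchical clustering
plot(hc)                                    # dendrogram of the result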
89
Multiple Equation Models
Multiple equation models (for example, path analysis and structural equation modeling) extend regression to examine how several related outcome variables depend on one another and on a set of explanatory variables. In child care research, for instance, such models have been used to study children's development across multiple domains (e.g., children's scores on standardized assessments of early math and literacy, language, and scores based on teacher reports of children's social skills and problem behaviors).
What is RStudio?
RStudio is an interface that provides you with a greater ability to conduct your analyses in R. You
can think of RStudio as an overlay on the software R that lets you visually group together, in one interface, the input window, the output window, the objects in your workspace, and plots.
Business analytics in R allows users to analyze business data more efficiently. The following are
some of the main benefits realized by companies employing R in their analytics programs:
Democratizing Analytics across the Organization: R can help democratize analytics by enabling
business users with interactive data visualization and reporting tools. R can be used for data science
by non-data scientists so that business users and citizen data scientists can make better business
decisions. R analytics can also reduce time spent on data preparation and data wrangling,
allowing data scientists to focus on more complex data science initiatives.
Providing Deeper, More Accurate Insights: R can help create powerful models to analyze large
amounts of data. With more precise data collection and storage through R analytics, companies can
deliver more valuable insights to users. Analytics and statistical engines using R provide deeper,
more accurate insights for the business. R can be used to develop very specific, in-depth analyses.
Leveraging Big Data: R can handle big datasets and is arguably as easy if not easier for most
analysts to use as any of the other analytics tools available today.
Creating Interactive Data Visualizations: R is also helpful for data visualization and data
exploration because it supports the creation of graphs and diagrams. It includes the ability to create
interactive visualizations and 3D charts and graphs that are helpful for communicating with
business users.
While R programming was originally designed for statisticians, it can be implemented for a variety
of uses including predictive analytics, data modeling, and data mining. Businesses can implement R
to create custom models for data collection, clustering, and analytics. R analytics can provide a
90
valuable way to quickly develop models targeted at understanding specific areas of the business and
delivering tailored insights on day-to-day needs.
● Statistical testing
● Prescriptive analytics
● Predictive analytics
● Time-series analysis
● What-if analysis
● Regression models
● Data exploration
● Forecasting
● Text mining
● Data mining
● Visual analytics
● Web analytics
● Social media analytics
● Sentiment analysis
The R programming language is versatile and can be used for a software development environment
for statistical analysis or graphics representation and reporting purposes.
The following are the significant features of the R language:
● R is a simple and effective programming language that is well developed, and it also serves as a data analysis environment.
● R has a large, consistent, and integrated set of tools used for data analysis.
● R contains a suite of operators for different types of calculations on arrays, lists, and vectors.
● R provides highly extensible graphical techniques.
● R graphical output for data analysis can be displayed directly on screen or printed on paper.
● R has an effective data handling and storage facility.
● R has a vibrant online community.
● R is free, open-source, robust, and highly extensible.
● R supports matrix arithmetic.
● The R language can also be used together with several other languages such as Python, Perl, Ruby, F#, and Julia.
Exploring R GUI
R GUI is the standard GUI platform for working in R. The R Console Window forms an essential
part of the R GUI. In this window, we input various instructions, scripts and several other important
91
operations. This console window has several tools embedded in it to facilitate ease of operations.
This console appears whenever we access the R GUI.
In the main panel of R GUI, go to the ‘File‘ menu and select the ‘New Script‘ option. This will
create a new script in R.
In order to quit the active R session, you can type the following code after the R prompt ‘>’ as
follows:
>q()
Data structures provide the way to represent data in data analytics. We can manipulate data
in R for analysis and visualization.One of the most important aspects of computing with data in R is
its ability to manipulate data and enable its subsequent analysis and visualization. Let us see few
basic data structures in R:
a. Vectors in R
These are ordered container of primitive elements and are used for 1-dimensional data.
Types – integer, numeric, logical, character, complex
b. Matrices in R
These are Rectangular collections of elements and are useful when all data is of a single class that is
numeric or characters.
Dimensions – two, three, etc.
c. Lists in R
These are ordered container for arbitrary elements and are used for higher dimension data, like
customer data information of an organization. When data cannot be represented as an array or a
data frame, list is the best choice. This is so because lists can contain all kinds of other objects,
including other lists or data frames, and in that sense, they are very flexible.
d. Data frames
These are two-dimensional containers for records and variables and are used for representing data
from spreadsheets etc. It is similar to a single table in the database.
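A few one-line examples of these structures (illustrative values only):
v  <- c(10, 20, 30)                           # vector: 1-dimensional, one type
m  <- matrix(1:6, nrow = 2, ncol = 3)         # matrix: rectangular, single class
l  <- list(name = "Asha", marks = c(78, 91))  # list: arbitrary elements
df <- data.frame(id = 1:3,                    # data frame: records and variables
                 city = c("Hyderabad", "Pune", "Delhi"))
str(df)                                       # inspect the structure of an object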
92
Similar to $ in R, the double square brackets operator in R also returns a single element, but it
offers the flexibility of referring to the elements by position rather than by name. It can be used for
data frames and lists.
c. [
The single square bracket operator in R returns multiple elements of data. The index within the
square brackets can be a numeric vector, a logical vector, or a character vector.
For example, to retrieve the first 5 rows and all columns of the built-in data set iris, the following command is used:
iris[1:5, ]
# A sample vector
v <- c(1, 5, 6, 3, 2, 4, 2)
# The single square bracket with a logical condition returns the matching elements
v[v != 2]
# A small data set entered from inline text (the values are from the original
# example; the column names here are only illustrative)
scores <- read.table(header = TRUE, text = '
group result
1 99
2 84
1 89
2 79
')
93
Merging datasets in R
Sometimes, similar datasets obtained from different sources need to be merged together for further
processing. R provides the following functions to combine different sets of data :
The merge() function – It is used to combine data frames on the basis of common columns or rows.
The cbind() function – It is used to add the columns of datasets having an equal number and identical order of rows.
The rbind() function – It is used to add the rows of datasets having an equal number of columns.
94
The merge() function allows four ways of combining data:
a. Natural join in R
To keep only rows that match from the data frames, specify the argument all=FALSE
b. Full outer join in R
To keep all rows from both data frames, specify all=TRUE
c. Left outer join in R
To include all the rows of your data frame x and only those from y that match, specify all.x=TRUE
d. Right outer join in R
To include all the rows of your data frame y and only those from x that match, specify all.y=TRUE
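A small sketch of the four joins, with two hypothetical data frames sharing an id column:
students <- data.frame(id = c(1, 2, 3), name = c("A", "B", "C"))
marks    <- data.frame(id = c(2, 3, 4), score = c(88, 92, 79))

merge(students, marks, all = FALSE)   # natural (inner) join: only ids 2 and 3
merge(students, marks, all = TRUE)    # full outer join: ids 1 to 4
merge(students, marks, all.x = TRUE)  # left outer join: all rows of students
merge(students, marks, all.y = TRUE)  # right outer join: all rows of marks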
Using the cbind( ) Function:
cbind() function in R Language is used to combine specified Vector, Matrix or Data Frame by
columns.
95
Syntax: cbind(x1, x2, …, deparse.level = 1)
Parameters:
x1, x2: vector, matrix, data frames
deparse.level: This value determines how the column names generated. The default value of
deparse.level is 1.
Example 1:
x <- 2:7
y <- c(2, 5)
cbind(x, y)
Output:
x y
[1, ] 2 2
[2, ] 3 5
[3, ] 4 2
[4, ] 5 5
[5, ] 6 2
[6, ] 7 5
rbind(): The rbind or the row bind function is used to bind or combine the multiple group of rows
together.
rbind(my_data, new_row)
96
Example 1: rbind a Vector to a Data Frame
The easiest way of using rbind in R is the combination of a vector and a data frame. First, let's create an example data frame and bind a new row to it (the code below is reconstructed to match the output shown):
my_data <- data.frame(x1 = c(7, 4, 4, 9), x2 = c(5, 2, 8, 9), x3 = 1:4)
new_row <- c(9, 8, 7)
rbind(my_data, new_row)
Output:
## x1 x2 x3
## 1 7 5 1
## 2 4 2 2
## 3 4 8 3
## 4 9 9 4
## 5 9 8 7
Sorting data:
Example:
vec1 <- c(23, 45, 10, 34, 89, 20, 67, 99)
# sorting of a vector
sort(vec1)
97
# reverse sorting
sort(vec1, decreasing = TRUE)
Ordering data:
Example:
sampleDataFrame[ order(sampleDataFrame$weight), ]
Transposing Data
t( ) function is used to transpose a matrix or a data frame. This function transposes rows into
columns and columns into rows.
Example:
sampleDataFrame
t(sampleDataFrame)
R provides the following functions of the reshape2 package to convert data between wide and long
formats:
98
o use the melt ( ) function to convert wide data into the long format
o use the dcast ( ) function to convert long data into the wide format
melt( )
Syntax:
melt(data, na.rm = FALSE, value.name = “value”)
dcast( )
Syntax:
dcast(data, formula, fun.aggregate)
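A minimal sketch with the reshape2 package (assumed here) and an illustrative marks table:
library(reshape2)

wide <- data.frame(id = 1:3, math = c(90, 75, 88), science = c(85, 92, 79))

# Wide to long: one row per (id, subject) pair
long <- melt(wide, id.vars = "id", variable.name = "subject", value.name = "score")

# Long back to wide: one column per subject
dcast(long, id ~ subject, value.var = "score")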
Creating Matrices
Matrices are constructed column-wise, so entries can be thought of starting in the “upper left”
corner and running down the columns. We can create a matrix using the matrix() function and
specifying the values to fill in the matrix and the number of rows and columns to make the matrix.
# numeric matrix
m1 <- matrix(1:6, nrow = 2, ncol = 3)
m1
output :
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
The underlying structure of this matrix is simply an integer vector with an added 2x3 dimension
attribute.
Matrices can also contain character values. Whether a matrix contains data that are of numeric or
character type, all the elements must be of the same class.
# a character matrix
m2 <- matrix(letters[1:6], nrow = 2, ncol = 3)
m2
## [,1] [,2] [,3]
## [1,] "a" "c" "e"
## [2,] "b" "d" "f"
99
Matrices can also be created using the column-bind cbind() and row-bind rbind() functions.
However, keep in mind that the vectors that are being binded must be of equal length and mode.
v1 <- 1:4
v2 <- 5:8
cbind(v1, v2)
## v1 v2
## [1,] 1 5
## [2,] 2 6
## [3,] 3 7
## [4,] 4 8
rbind(v1, v2)
## [,1] [,2] [,3] [,4]
## v1 1 2 3 4
## v2 5 6 7 8
Adding on to Matrices
We can leverage the cbind() and rbind() functions for adding onto matrices as well. Again, its
important to keep in mind that the vectors that are being binded must be of equal length and mode
to the pre-existing matrix.
Subsetting Matrices
To subset matrices we use the [ operator; however, since matrices have 2 dimensions we need to
incorporate subsetting arguments for both row and column dimensions. A generic form of matrix
subsetting looks like: matrix[rows, columns]. We can illustrate with a matrix m2 that has named rows and columns (its definition is reconstructed here to match the output shown):
m2 <- matrix(1:12, nrow = 4, ncol = 3,
             dimnames = list(paste0("row_", 1:4), paste0("col_", 1:3)))
m2
## col_1 col_2 col_3
## row_1 1 5 9
## row_2 2 6 10
## row_3 3 7 11
## row_4 4 8 12
By using different values in the rows and columns argument of m2[rows, columns], we can
subset m2 in multiple ways.
Note that you simply use the data.frame function, provide names for your columns, and populate
the contents of the columns (using the c() vector functionality).
Output:
102
Note that a data frame can hold a number of data types (i.e., the columns can be characters,
integers, dates, factors, etc). Above you'll notice that the data types are displayed below the column
name; in this case, our two columns were coded as factors.
We not only see the values of each row in the second column printed but also the corresponding levels. The syntax is the same when selecting a row from a tibble, except that the levels aren't included, because columns containing characters aren't automatically coded as factors in a tibble, and only factors have levels (don't get hung up if you don't understand levels for now). Note that the tibble column also prints a little more cleanly (and is stored as chr, i.e. character).
df_tidy[, 2]
103
Note that in R, when locating a cell, [1,2] refers to the first row and second column, so
that [,2] grabs the entire second column.
To actually do something more interesting with this, and count the number of unique jobs, you use
the same syntax inside a function:
unique(df_tidy[, 2])
This is the same for a tibble (this is the last time we'll make the comparison). Again, notice the
levels are gone because the tidyverse defaults to characters instead of factors.
1df_tidy[,"occupation"]
r
104
r
And here's the syntax to grab multiple rows. Note that you span inclusively from the first to last row
of interest.
df_tidy[1:2, ]
Packages in R
search()
When the search() command is executed, you can overview the packages that are loaded and are
ready for use. You will see the graphics package that carries out routines to create graphs.
There are many packages that are installed but not loaded automatically.
For example, the splines package, which contains routines for smoothing curves, is installed, but it is not loaded by itself.
To see what packages are available, you need to type the following command:
installed.packages()
105
Installing R Packages for Windows
In Windows, you get the package menu and install option which is very easy.
After selecting a local mirror site, a list of available binary packages is shown. You can choose the ones you need. Once you have selected the packages you need, click the OK button to download and install them into R.
If you download the package files from the internet (as .zip files), you need to use the install package(s) option in the Packages menu. It allows you to select the files you need, and the packages are then unzipped and installed into R.
To install R packages on the Linux system, you need to perform the below steps:
● Download the required packages as compressed files from the link: Available packages by
name
● Run the following command to install packages:
R CMD INSTALL [options] [-l lib] pkgs
● Use the following command to load the installed package:
library(package)
Installing by the Name of Package
In Linux, you can install the package if you know the name of a package.
install.packages('ade4')
R Packages List
The below table specifies the best packages in R with their usage:
106
How to Use Packages in R
We need to load packages in R after installing them to make them usable.
● Functions can work with any input. You can provide diverse input data to the functions.
● The output of the function is an object that allows you to work with the result.
How to Create a Script in R?
R supports several editors, so a script can be created in any editor such as Notepad, MS Word, or WordPad and saved with the .R extension in the current working directory.
For example, if we want to run the sample.R script in R, we need to provide the command below:
source("sample.R")
In order to create a script, first, open a script file in the editor mode and type the required code.
We will create a script that takes in input in the form of fractions and converts it into a percentage
by further rounding it to one decimal digit.
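The script itself is not reproduced in this file; a minimal version, reconstructed to be consistent with the function shown later in this section, could look like this (saved as pastePercent.R):
# pastePercent.R - convert fractions to percentages rounded to one decimal digit
new.numbers <- c(0.8223, 0.02487, 1.62, 0.4)
percentage  <- round(new.numbers * 100, digits = 1)
paste(percentage, "%", sep = "")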
Save the above script as script file with any name for example pastePercent.R.
107
Now you can call this script on the console with the help of source command which we have
already seen.
source('pastePercent.R')
Output:
Define a function with a name so that it becomes easier to call an R function and pass arguments to
it as input.
The R function should be followed by parentheses that act as a front gate for your function and
between the parentheses, arguments for the function are provided.
Use the return() statement that acts as a back gate of your function.
The return() statement provides the final result of the function that is returned to your workspace.
Let us now see how we can convert the script that we had written earlier to convert values into
percentage and round off into an R function.
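A sketch of that function, consistent with the shortened versions shown further below:
Percent_add <- function(frac){
  percentage <- round(frac * 100, digits = 1)
  result <- paste(percentage, "%", sep = "")
  return(result)
}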
108
The keyword function defines the starting of function. The parentheses after the function
form the front gate, or argument list of the function. Between the parentheses are the arguments to
the function. In this case, there is only one argument.
The return statement defines the end of the function and returns the result. The object put
between the parentheses is returned from inside the function to the workspace. Only one object can
be placed between the parentheses.
The braces, {} are the walls of the function. Everything between the braces is part of the assembly
line or the body of the function. This is how functions are created in R.
Using R Function
After transforming the script into an R function, you need to save it and you can use the function in
R again if required.
R does not tell you by itself that it has loaded the function, but it is present in the workspace; if you want, you can check this with the ls() command.
Now that we know which functions are present in memory, we can use them whenever required. For example, if you want to create percentages from values again, you can use the Percent_add function as below:
#Author DataFlair
ls()
new.vector <- c(0.8223, 0.02487, 1.62, 0.4)
Percent_add(new.vector)
Output:
109
Using the Function Objects in R
In R, a function is also an object and you can manipulate it as you do for other objects.
You can assign a function to a new object using the command below:
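For example (a minimal sketch), assigning the Percent_add function created above to a new name produces a second, equivalent function object:
percent_copy <- Percent_add      # copies the function object, not its result
percent_copy(c(0.25, 0.5))       # "25%" "50%", the same output as Percent_add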
110
Reducing the Number of Lines in R
Till now, in all the above code, we have written return() function to return output. But in R, this can
be skipped as by default, R returns the value of the last line of code in the R function body.
Now, the above code will be:
#Author DataFlair
Percent_add <- function(fac){
percentage <- round(fac * 100, digits = 1)
paste(percentage, "%", sep = "")}
Output:
111
You need return if you want to exit the function before the end of the code in the body.
For example, you could add a line to the Percent_add function that checks whether the argument is numeric and, if not, returns NULL, as shown below:
#Author DataFlair
Percent_add <- function(frac){
if( !is.numeric(frac) ) return(NULL)
percentage <- round(frac * 100, digits = 1)
paste(percentage, "%", sep = "")}
Output:
112
2. Dropping the {}
You can drop braces in some cases though they form a proverbial wall around the function.
If a function consists of only one line of code, you can just add that line after the argument list
without enclosing it in braces. R will see the code after the argument list as the body of the
function.
Suppose, you want to calculate the odds from a proportion. You can write a function without using
braces, as shown below:
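The one-line function is not reproduced in this file; a minimal sketch, using the fact that the odds of an event with probability p are p / (1 - p):
odds <- function(p) p / (1 - p)   # the body follows the argument list, no braces
odds(0.75)                        # 3: probability 0.75 corresponds to odds of 3 to 1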
Scope of R Function
Every object you create ends up in the workspace, which is also called the global environment. The workspace or global environment is the universe of the R user where everything happens.
1. External R Function
If you use an R function, the function first creates a temporary local environment. This local
environment is nested within the global environment, which means that, from that local
environment, you also can access any object from the global environment. As soon as the function
ends, the local environment is destroyed along with all the objects in it.
If R sees any object name, it first searches the local environment. If it finds the object there, it uses that one; otherwise it searches the global environment for that object.
113
2. Internal R Function
Using global variables in an R function is not considered a good practice. Writing your functions in
such a way that they need objects in the global environment is not efficient because you use
functions to avoid dependency on objects in the global environment in the first place.
The whole concept behind R strongly opposes using global variables used in different functions. As
a functional programming language, one of the main ideas of R is that the outcome of a function
should not be dependent on anything but the values for the arguments of that function. If you give
the arguments for the same values, you will always get the same results.
#Author DataFlair
calculate_func <- function(data1, data2, data3){
base_min <- function(z) z - mean(data3)
base_min(data1) / base_min(data2)
}
Output:
A closer look at the definition of base_min() shows that it uses the object data3 but does not have an argument with that name; R finds data3 in the environment of the enclosing function.
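A hypothetical call illustrates this scoping:
calculate_func(10, 5, c(1, 2, 3))
# mean(data3) is 2, so the result is (10 - 2) / (5 - 2) = 2.666667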
114
Finding the Methods behind the Function
It is easy to look at the code behind a function in R. You can see the code of print() by typing its name at the command line.
To display the code of the print() function, we proceed as follows:
print
Output:
The UseMethod() function is the central function of R's generic function system. It looks for a method, that is, a function that can deal with the type of object given as the argument x.
Suppose you have a data frame that you want to print. R first looks for a function called print.data.frame() and, if it finds one, uses it to print the object. In general, the method is looked up by taking the name of the generic function, followed by a dot, followed by the class of the object.
115
Using Default Methods in R
R lets you call the default method of a generic function directly by using the default keyword in the name (for example, print.default()). In that case R ignores the class of the object and simply uses the default method.
116
Performing Graphical Analysis in R
Graphs are useful for non-numerical data, such as colours, flavours, brand names, and more. When
numerical measures are difficult or impossible to compute, graphs play an important role.
● Plots with Single Variable – You can plot a graph for a single variable.
● Plots with Two Variables – You can plot a graph with two variables.
● Plots with Multiple Variables – You can plot a graph with multiple variables.
● Special Plots – R has low and high-level graphics facilities.
You may need to plot for a single variable in graphical data analysis with R programming.
For example – A plot showing daily sales values of a particular product over a period of time. You
can also plot the time series for month by month sales.
The choice of plots is more restricted when you have just one variable to the plot. There are various
plotting functions for single variables in R:
● Histograms – Used to display the mode, spread, and symmetry of a set of data.
● Index Plots – Here, the plot takes a single argument. This kind of plot is especially useful for
error checking.
● Time Series Plots – When a period of time is complete, the time series plot can be used to join
the dots in an ordered set of y values.
● Pie Charts – Useful to illustrate the proportional makeup of a sample in presentations.
Histograms have the response variable on the x-axis, and the y-axis shows the frequency of
different values of the response. In contrast, a bar chart has the response variable on the y-axis and
a categorical explanatory variable on the x-axis.
1.Histograms
Histograms display the mode, the spread, and the symmetry of a set of data. The R function hist() is
used to plot histograms.
The x-axis is divided into intervals, called bins, into which the values of the response variable fall and are then counted. Histograms are tricky because the picture you get depends on subjective judgments about exactly where to put the bin margins: wide bins produce one picture, narrow bins produce a different picture, and unequal bins produce confusion.
Narrow bins tend to produce multimodality (several apparent peaks), whereas broad bins tend to produce unimodality (a single peak). When bin widths differ, the default in R is to convert the counts into densities.
The convention adopted in R for showing bin boundaries is to employ square and round brackets,
so that:
● [a, b) means 'greater than or equal to a but less than b' (square bracket, then round bracket).
● (a, b] means 'greater than a but less than or equal to b' (round bracket, then square bracket).
You need to take care that the bins can accommodate both your minimum and maximum values.
The cut() function takes a continuous vector and cuts it up into bins that can then be used for
counting.
The hist() function in R treats any suggestion about the number of bars or the width of bars only as a recommendation; this makes it easier to view multiple histograms with a similar range side by side. For small integer data, you can have one bin for each value.
In R, the parameter k of the negative binomial distribution is known as size and the mean is
known as mu.
Drawing histograms of continuous variables is more challenging than drawing them for discrete data. The problem relates to density estimation, which is an important issue for statisticians. To deal with it, you can approximately transform the continuous model into a discrete one, using a linear approximation to evaluate the density at the specified points.
The choice of bandwidth is a compromise between smoothing out insignificant bumps and retaining real peaks.
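A minimal histogram sketch with simulated data:
set.seed(1)
x <- rnorm(1000)                      # 1000 simulated values
hist(x, breaks = 20,                  # the number of breaks is only a suggestion
     main = "Histogram of simulated data", xlab = "x")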
118
2 Index Plots
For plotting single samples, index plots can be used. The plot function takes a single
argument. This is a continuous variable and plots values on the y-axis, with the x coordinate
determined by the position of the number in the vector. Index plots are especially useful for error
checking.
3 Time Series Plots
The time series plot can be used to join the dots in an ordered set of y values when a period of time
is complete. The issues arise when there are missing values in the time series (e.g., if sales values
for two months are missing during the last five years), particularly groups of missing values (e.g., if
sales values for two quarters are missing during the last five years) and during that period we
typically know nothing about the behaviour of the time series.
ts.plot and plot.ts are the two functions for plotting time series data in R.
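A small sketch with made-up monthly sales figures:
sales <- ts(c(12, 15, 14, 18, 20, 19, 23, 25, 24, 28, 30, 33),
            start = c(2020, 1), frequency = 12)   # monthly series from Jan 2020
ts.plot(sales, ylab = "Sales")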
4 Pie Chart
You can use pie charts to illustrate the proportional makeup of a sample in presentations. Here the
function pie takes a vector of numbers and turns them into proportions. It then divides the circle on
the basis of those proportions.
To indicate each segment of the pie, it is essential to use a label. The label is provided as a vector of
character strings, here called data$names.
If a names list contains blank spaces then you cannot use read.table with a tab-delimited text file to
enter the data. Instead, you can save the file called piedata as a comma-delimited file, with a “.csv”
extension, and input the data to R using read.csv in place of read.table.
#Author DataFlair
data <- read.csv("/home/dataflair/data/piedata.csv")
data
Output:
119
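The pie chart itself can then be drawn; a minimal sketch, assuming the file has a names column and a numeric column called amounts (the column names are illustrative):
pie(data$amounts, labels = data$names)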
2. Plots with Two Variables
The two types of variables used in the graphical data analysis with R:
● Response variable
● Explanatory variable
The response variable is represented on the y-axis and the explanatory variable is represented on
the x-axis.
When an explanatory variable is categorical, like genotype or colour or gender, the appropriate plot
is either a box-and-whisker plot or a barplot.
120
Scatterplots
Scatterplots show a graphical representation of the relationship between two numeric variables. The plot function draws the axes and adds a scatterplot of points. You can also add extra points or lines to an existing plot by using the points() and lines() functions.
The points() and lines() functions can be specified in the following two ways:
● Cartesian plot (x, y) – A Cartesian coordinate specifies the location of a point in a two-
dimensional plane with the help of two perpendicular vectors that are known as an axis. The
origin of the Cartesian coordinate system is the point where two axes cut each other and the
location of this point is the (0,0).
● Formula plot (y, x) – The formula based plot refers to representing the relationship between
variables in the graphical form. For example – The equation, y=mx+c, shows a straight line in
the Cartesian coordinate system.
The advantage of the formula-based plot is that the plot function and the model fit look and feel the
same. The Cartesian plots build plots using “x then y” while the model fit uses “y then x”.
The plot function uses the following arguments:
121
The best way to identify multiple individuals in scatterplots is to use a combination of colours and
symbols. A useful tip is to use as.numeric to convert a grouping factor into colour and/or a symbol.
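A minimal scatterplot sketch with simulated data, using the formula style and adding a smooth trend line:
set.seed(2)
x <- runif(50, 0, 10)
y <- 2 * x + rnorm(50)
plot(y ~ x, pch = 19, col = "blue")   # formula plot: response ~ explanatory
lines(lowess(x, y), col = "red")      # add a smoothed trend line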
Stepped Lines
Stepped lines can be plotted as graphical representation displays in R. These plots, plot data
distinctly and also provide a clear view of the differences in the figures.
While plotting square edges between two points, you need to decide whether to go across and then
up, or up and then across. Let’s assume that we have two vectors from 0 to 10. We plot these points
as follows:
x = 0:10
y = 0:10
plot(x,y)
Output:
122
Also, generate a line by using the upper case “S” as shown below:
> lines(x,y,col="green",type='S')
Output:
A box-and-whisker plot is a graphical means of representing sets of numeric data using quartiles. It
is based on the minimum and maximum values, and upper and lower quartiles.
Boxplots summarize the available information. The vertical dashed lines are called the 'whiskers'.
Boxplots are also excellent for spotting errors in data; extreme outliers often represent such errors.
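A small sketch comparing two simulated groups with a box-and-whisker plot:
group <- rep(c("A", "B"), each = 25)
value <- c(rnorm(25, mean = 10), rnorm(25, mean = 12))
boxplot(value ~ group, ylab = "Value")   # one box (with whiskers) per group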
123
Barplot
Barplot is an alternative to boxplot to show the heights of the mean values from the different
treatments. Function tapply computes the height of the bars. Thus it works out the mean values for
each level of the categorical explanatory variable.
Let us create a toy dataset of temperatures in a week. Then, we will plot a barplot that will have
labels.
temperature <- c(28, 35, 31, 40, 29, 41, 42)
days <- c("Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat")
barplot(temperature,
        main = "Maximum Temperatures in a Week",
        xlab = "Days",
        ylab = "Degree in Celsius",
        names.arg = days,
        col = "darkred")
Output:
Initial data inspection using plots is even more important when there are many variables, any one of
which might have mistakes or omissions. The principal plot functions that represent multiple
variables are:
● The Pairs Function – For a matrix of scatterplots of every variable against every other.
● The Coplot Function – For conditioning plots where y is plotted against x for different values
of z.
It is better to use more specialized commands when dealing with the rows and columns of data
frames.
124
The Pairs Function
For two or more continuous explanatory variables, it is valuable to check for subtle
dependencies between the explanatory variables. Rows represent the response variables and
columns represent the explanatory variables.
The pairs() function plots every variable in the data frame on the y-axis against every other variable on the x-axis. It needs only the name of the whole data frame as its first argument.
The relationship between the two variables may be obscured by the effects of other processes in
multivariate data. When you draw a two-dimensional plot of y against x, then all the effects of other
explanatory variables are shown onto the plane of the paper. In the simplest case, we have one
response variable and just two explanatory variables.
The coplot panels are ordered from lower left to upper right, associated with the values of the
conditioning variable in the upper panel from left to right.
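Both functions can be illustrated with the built-in iris data set:
pairs(iris[, 1:4], col = iris$Species)                      # scatterplot matrix
coplot(Sepal.Length ~ Sepal.Width | Species, data = iris)   # y against x for each level of z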
R has extensive facilities for producing graphs. It also has low and high-level graphics
facilities as per the requirement.
The low-level graphics are the basic building blocks that can build up graphs step by step, while a
high-level facility provides the variety of pre-assembled graphical display.
Apart from the various kinds of graphical plots discussed, R supports the following special plots:
● Design Plots – Effective sizes in designed experiments can be visualized using design
plots. One can plot the design plots using the plot.design function
– plot.design(Growth.rate~Water*Detergent*Daphnia)
● Bubble Plots – Useful for illustrating the variation in a third variable across different locations in the x-y plane.
● Plots with many Identical Values – Sometimes, two or more points with count data fall in
exactly the same location in a scatterplot. As a result, the repeated values of y are hidden, one
beneath the other.
Using the following functions, we can add the extra graphical objects in plots:
125
● arrows – For plotting arrows and headed bars. The arrows function draws a line from the point (x0, y0) to the point (x1, y1) with the arrowhead, by default, at the "second" end (x1, y1).
arrows(x0, y0, x1, y1)
Adding code = 3 produces a double-headed arrow, for example from (2, 1) to (9, 8):
plot(x, y)
arrows(2, 1, 9, 8, code = 3)
#Author DataFlair
Output:
You are likely to want to save each of your plots as a PDF or PostScript file for publication-quality
graphics. This is done by specifying the ‘device’ before plotting, then turning the device off once
finished.
The computer screen is the default device, where we can obtain a rough copy of the graph, using
the following command:
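The original command is not reproduced in this file; a minimal sketch of the usual save-to-PDF pattern (the file name is illustrative) is:
pdf("myplot.pdf")   # open a PDF graphics device
plot(x, y)          # any plotting commands are written to the file, not the screen
dev.off()           # close the device to finish writing the file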
● Line Graph – It displays data over a time period. It generally tracks records over both long and short time periods, according to requirements. In the case of small
126
changes, the line graph is more common than the bar graph. In some cases, line graphs also compare the changes among different groups over the same time period.
● Pie Chart – It displays comparison within a group.
For example – You can compare students in a college on the basis of their streams, such as
arts, science, and commerce using a pie chart. One cannot use a pie chart to show changes over
the time period.
● Bar Graph – Similar to a line graph, the bar graph generally compares different groups or
tracking changes over a defined period of time. Thus the difference between the two graphs is
that the line graph tracks small changes while a bar graph tracks large changes.
● Area Graph – The area graph tracks the changes over the specific time period for one or more
groups related to a similar category.
● X-Y Plot – The X-Y plot displays a certain relationship between two variables. In this type of
variable, the X-axis measures one variable and Y-axis measures another variable. On the one
hand, if the values of both variables increase at the same time, a positive relationship exists
between variables. On the other hand, if the value of one variable decreases at the time of the
increasing value of another variable, a negative relationship exists between variables. It could
be also possible that the two variables don’t have any relationship. In this case, plotting graph
has no meaning.
UNIT – V
127
Big Data Visualization:
Introduction to Data visualization, Challenges to Big data visualization, Types of data visualization,
Visualizing Big Data, Tools used in data visualization, Proprietary Data Visualization tools, Open
source data visualization tools, Data visualization with Tableau.
Data visualization
Data visualization is a graphical representation of any data or information. Visual elements
such as charts, graphs, and maps are the few data visualization tools that provide the viewers with
an easy and accessible way of understanding the represented information.
Data visualization tools and technologies are essential to analyze massive amounts of
information and make data-driven decisions.
128
Scalability and dynamics are two major challenges in visual analytics
The visualization-based methods take the challenges presented by the “four Vs” of big data and turn
them into following opportunities.
● Volume: The methods are developed to work with an immense number of datasets and enable users to derive meaning from large volumes of data.
● Variety: The methods are developed to combine as many data sources as needed.
● Velocity: With the methods, businesses can replace batch processing with real-time stream
processing.
● Value: The methods not only enable users to create attractive info graphics and heat maps,
but also create business value by gaining insights from big data.
Big data often has unstructured formats. Due to bandwidth limitations and power requirements,
visualization should move closer to the data to extract meaningful information efficiently.
Effective data visualization is a key part of the discovery process in the era of big data. For the
challenges of high complexity and high dimensionality in big data, there are different
dimensionality reduction methods.
There are also following problems for big data visualization:
● Visual noise: Most of the objects in the dataset are too close to one another; users cannot separate them as distinct objects on the screen.
● Information loss: Reduction of visible data sets can be used, but leads to information loss.
● Large image perception: Data visualization methods are not only limited by aspect ratio and
resolution of device, but also by physical perception limits.
● High rate of image change: Users observe data and cannot react to the number of data
change or its intensity on display.
● High performance requirements: this is hardly noticeable in static visualization, where speed requirements are lower, but it becomes a demanding requirement for dynamic, interactive visualization.
In Big Data applications, it is difficult to conduct data visualization because of the large size
and high dimensionality of big data. Most current Big Data visualization tools have poor performance in scalability, functionality, and response time. Uncertainty, which can arise at any stage of a visual analytics process, is also a great challenge for effective uncertainty-aware visualization.
Potential solutions to some challenges or problems about visualization and big data were presented:
● Meeting the need for speed: One possible solution is hardware. Increased memory and
powerful parallel processing can be used. Another method is putting data in-memory but
using a grid computing approach, where many machines are used.
● Understanding the data: One solution is to have the proper domain expertise in place.
● Addressing data quality: It is necessary to ensure the data is clean through the process of
data governance or information management.
● Displaying meaningful results: One way is to cluster data into a higher-level view where
smaller groups of data are visible and the data can be effectively visualized.
129
● Dealing with outliers: Possible solutions are to remove the outliers from the data or create a
separate chart for the outliers.
Among the common types of data visualization, the simplest is the 1D (linear) representation and the most complex is the network representation. The following is a brief description of each of these data visualizations:
▪ 1D (Linear) Data Visualization – In the linear data visualization, data is presented in the
form of lists. Hence, we cannot term it as visualization. It is rather a data organization
technique. Therefore, no tool is required to visualize data in a linear manner.
130
▪ 2D (planar) Data Visualization – This technique presents data in the form of images, diagrams, or charts on a plane surface.
▪ 3D (volumetric) Data Visualization – In this method, data presentation involves exactly three dimensions to show simulations, surface and volume rendering, etc. It is generally used in scientific studies. Today, many organizations use 3D computer modeling and volume rendering in advertisements to give users a better feel for their products.
▪ Network data visualization – It is used to represent data relations that are too complex to
be represented in the form of hierarchies.
The most common notation used for big data is 3 V’s- volume, velocity, and variety. But, the
most exciting feature is the way in which value is filtered from the haystack of data.
Nowadays, IT companies that use Big Data face the following challenges:
By considering the above factors, IT companies are focusing more on research and development of
robust algorithm, software, and tools to analyze the data that is scattered in the internet space.
131
Visualization of data can produce cluttered images, which are filtered with the help of clutter-reduction techniques. Uniform sampling and dimension reduction are two commonly used clutter-reduction techniques.
The visual data reduction process involves automated data analysis to measure density, outliers, and their differences. These measures are then used as quality metrics to evaluate the data-reduction activity.
⮚ Size metrics
Apart from representing data, a visualization tool must be able to establish links between different data values, restore missing data, and polish data for further analysis.
EXCEL – A widely used spreadsheet tool that can also be applied to data analytics. It helps you track and visualize data for deriving better insights. This tool provides various ways to share data and analytical conclusions within and across organizations.
132
LastForward – It is open-source software provided by Last.fm for analyzing and visualizing the social music network.
133
D3 – D3 allows you to bind arbitrary data to a document object model (DOM) and then applies
data- driven transformations to the document. For example, you can use D3 to generate an
HTML table from an array of numbers. Or, use the same data to create an interactive SVG bar
chart with smooth transitions and interactions.
Rootzmap Mapping the internet – It is a tool to generate a series of maps on the basis
of the datasets provided by the National Aeronautics and Space Administration (NASA).
● They deliver performance and are compliant with web as well as mobile web security requirements.
Analytical techniques are used to analyze complex relationships among variables. The following
are some commonly used analytical techniques for big data solutions:
Regression analysis – it is a statistical tool used for prediction. Regression analysis is used to
predict continuous dependent variables from independent variables.
o Path analysis
Tableau is a Data visualization software that allows developers to build interactive dashboards
that are easily updated with new data and can be shared with a wider audience. There are various
types of Tableau products available in the market. Some of the commonly known products include
Tableau Desktop, Tableau Server, Tableau Online, Tableau Reader, and Tableau Public.
The important features of Tableau Software include the following:
⮚ Single-click data analytics in visual form
⮚ In-depth statistical analysis
⮚ Management of metadata
⮚ In-built, top-class data analytic practices
⮚ In-built data engine
⮚ Big data analytics
⮚ Quick and accurate data discovery
⮚ Business dashboards creation
⮚ Various types of data visualization
⮚ Social media analytics, including Facebook and Twitter
⮚ Easy and quick integration of R
⮚ Business intelligence through mobile
⮚ Analysis of time series data
⮚ Analysis of data from surveys
135
⮚ Google analytics
⮚ Google bigquery
⮚ Hadoop
⮚ HP vertica
⮚ SAP
⮚ Splunk
Click the Continue button; the opening page of the Tableau tool appears.
Toolbar Icons:
Tableau is a GUI-oriented, drag-and-drop tool. The following figure shows the icons present on the Tableau toolbar.
o Undo/ Redo – scrolls backward or forward on the screen. You can retrieve any step by
clicking the undo/redo button
o File save – saves your work. You need to click this button frequently as tableau does not
have the automated save function.
136
o Auto/ manual update – generates visual. It is particularly helpful for large datasets where
dragging and dropping items consume time
o Group – allows you to group data by selecting more than one headers in a table or values in
a legend.
o Reset cards – provides a menu to turn on or off screen elements, such as caption or
summary.
o Fit Menu – allows different views of the tableau screen. You can fit the screen either
horizontally or vertically.
o Fit Axis - Fixes the axis of view. You can zoom in/out charts with this button
Main menu
File – contains general functions, such as open, save, and save as. Other functions are print to pdf
and Repository location function to review and change the default location of the saved file.
Data – helps to analyze the tabular data on the tableau website. The edit relationships option is used
to blend data when the field names in two data sources are not identical
Worksheet – provides option such as export option, excel crosstab, and duplicate as crosstab
Dashboard – Provides the actions menu, which is the most important option on the dashboard menu
because all the actions related to tableau worksheets and dashboards are defined within the actions
menu. The actions menu is present under the worksheet menu as well.
Story – provides the new story option that is used for explaining the relationship among facts,
providing context to certain events, showing the dependency between decisions and outcomes, etc.
Analysis – provides the aggregate measures and stack mark options. To create new measures or
dimensions, use create calculated field or edit calculated field.
Map – provides options to change the color scheme and replace the default maps
Window – provides the bookmark menu, which is used to create .tbm files that can be shared with
different users
Help – provides options to access tableau’s online manual, training videos, and sample workbooks
Tableau Server
In Tableau Server, users can interact with dashboards on the server without any installation on their machines. Tableau Online is Tableau Server hosted by Tableau on a cloud platform.
Tableau server also provides robust security to the dashboards. Tableau Server web-edit feature
allows authorized users to download and edit the dashboards. Tableau server allows users to
publish and share their data sources as live connections or extracts. Tableau Server is highly
secured for visualizing data. It leverages fast databases through live connections
● Tableau workbook – it is the default save type when you save your work on the desktop.
The extension of such files will be .twb. The files with extension .twbx can be shared with
people not having tableau desktop license or those who cannot access the data source.
● Tableau data source – if you frequently connect to a specific data source or if you have
manipulated the metadata of any data source, saving the file as tableau data source is of
great use. The extension of such a file will be .tds, and it includes server address, password,
and metadata.
● Tableau bookmark - if you want to share any specific file with others, use tableau
bookmark.
● Tableau data extract – it compresses your extracted data and improves performance by
incorporating more formulas and functions. The extension of a tableau data extract file
is .tde.
Tableau charts
Tableau can create different types of univariate, bivariate, and multivariate charts.
The following are some of the common chart types that tableau can create:
● Tables – tables are an excellent choice of presenting data as they preserve all the
information, which in turn minimize the chances of misinterpretation.
● Scatter plots – scatter plots are used to describe the relationship between two variables.
● Trend lines – trend lines are used to analyze the relationship between variables as well
as predict the future outcome
138
● Bullet graph – bullet graph is just like a bar graph and is generally used in qualitative
analysis.
● Box plot – box plot represents distribution of data and is used in the comparison of
multiple sets of data. It can effectively compute:
o Median
o Upper and lower quartiles
o Minimum and maximum values
o Outliers
o Treemap – treemap is one of the best compact techniques to visualize the part to
whole relationships as well as hierarchical models.
o Word cloud – similar to bubble charts, the words in a word cloud are sized
according to the frequency at which they appear in the content.
139
16. UNIVERSITY QUESTION PAPERS OF PREVIOUS YEAR:
Unit – 1
Unit – 2
Unit-3
Unit – 4
Unit – 5
140
18. Assignment Questions
Unit 1
1. Define Big Data and Data Science.
2. Give some real-world Big Data examples and explain each.
3. What has led to the explosive growth of data? Justify.
4. Define data velocity and discuss some examples of it.
5. Discuss in detail the Big Data processing architectures.
Unit 2
Unit 3
1. What is meant by the Google File System? Explain.
2. Give the importance of the Hadoop architecture.
3. Explain the Hadoop MapReduce paradigm.
4. What are the NameNode, Secondary NameNode, and DataNode? Explain.
5. What are the tasks performed by MapReduce in Hadoop?
Unit 4
1. Discuss the life cycle of Data Analytics.
2.
a. List the techniques used to draw graphs in R.
b. Write short note on time series plots.
3. Discuss the syntax of defining a function in R.
4.
a. What are functions? Explain the two main advantages of using functions over
scripts.
b. What are arguments? Why are they used in functions?
5. a. List some types of data structures available in R.
Unit 5
141
19. Unit wise Quiz questions
Unit – 1
1. Which of the following are parts of the 5 P's of data science and what is the additional P
introduced in the slides?
● People
● Purpose
● Product
● Perception
● Process
● Programmability
● Platforms
2. Which of the following are part of the four main categories to acquire, access, and retrieve
data?
● NoSQL Storage
● Remote Data
● Traditional Databases
● Web Services
● Text Files
4. Of the following, which is a technique mentioned in the videos for building a model?
● Investigation
● Validation
● Evaluation
● Analysis
5. What is the first step in finding a right problem to tackle in data science?
142
● Define the Problem
● Define Goals
● Business Objectives
● Collect Data
● Build In-House Expertise
● Organizational Buy-In
Data exploration...
● Remove outliers.
● Generate best estimates for invalid values.
● Remove data with missing values.
● Data Wrangling
● Merge duplicate records.
● Retrieve Data
● Select Analytical Techniques
● Build Models
● Identify Data Sets and Query Data
● Understanding Nature of Data and Preliminary Analysis
143
11. According to analysts, for what can traditional IT systems provide a foundation when they’re
integrated with big data technologies like Hadoop?
A. Open source
B. Real-time
C. Java-based
A. Apple
B. Datamatics
C. Facebook
A. Volume
B. Velocity
C. Variety
15. ________ hides the limitations of Java behind a powerful and concise Clojure API for
Cascading.
A. Scalding
B. Cascalog
C. Hcatalog
D. Hcalding
Unit – 2
Unit – 3
1. The MapReduce algorithm contains two important tasks, namely __________.
A. mapped, reduce
B. mapping, Reduction
C. Map, Reduction
D. Map, Reduce
2. __________ takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
A. Map
B. Reduce
C. Both A and B
D. Node
3. The __________ task takes the output from a map as an input and combines those data tuples into a smaller set of tuples.
A. Map
B. Reduce
C. Node
D. Both A and B
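For reference, the Map and Reduce tasks described in questions 2 and 3 can be imitated in plain R on a toy word-count example. The sketch below is only an illustration of the paradigm, not Hadoop MapReduce code.

# Illustrative sketch of the MapReduce idea in plain R (not Hadoop itself)
lines <- c("big data big analytics", "data analytics")
# Map task: break each line into (word, 1) key/value pairs
mapped <- unlist(lapply(lines, function(line) {
  words <- strsplit(line, " ")[[1]]
  setNames(rep(1, length(words)), words)
}))
# Shuffle/sort: group the values by key (word)
grouped <- split(unname(mapped), names(mapped))
# Reduce task: combine each group of values into a smaller set (one count per word)
counts <- sapply(grouped, sum)
print(counts)   # e.g. analytics 2, big 2, data 2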
A. 2
B. 3
C. 4
D. 5
5. Which of the following schedules jobs and tracks the jobs assigned to the Task Tracker?
A. SlaveNode
B. MasterNode
C. JobTracker
D. Task Tracker
6. Which of the following refers to the execution of a Mapper or a Reducer on a slice of data?
A. Task
B. Job
C. Mapper
D. PayLoad
A. secondaryadminnode
B. nameadmin
C. dfsadmin
D. adminsck
Although the Hadoop framework is implemented in Java, MapReduce applications need not be
written in ____________
A. C
B. C#
C. Java
D. None of the above
A. Inputs
B. Output
C. Task
D. None of the above
SQL
JSON
Document databases
None of the Above
Key-value
Document
Wide-column
All of the above
Scalability
Relational data
Faster data access than RDBMS.
Data easily held across multiple servers
No SQL
Only SQL
Not Only SQL
SQL Undefined
15. What does ETL stand for?
A. Data Inspection
B. Transformation
C. Extract, Transform, Load
D. Data Flow
16. Which of these steps is executed at the end of every stage of ETL – extract, clean,
conform?
Logging the activity to a flat file
Displaying the data to the user
Staging the data to the database
Sending a message about the tasks
17. ETL execution or operation approach falls into which of these two major categories:
Planning & Execution
Implementation & Testing
Scheduling & Support
Maintenance & Support
18. One of the requirements while designing an ETL system is how quickly source data can be delivered to end users. This is referred to as:
Data speed
Data lineage
Data latency
Data availability
19. Extracting the data from the source systems is involved in the _______ step of the ETL process.
extract
transform
load
planning
Point out the correct statement.
Debian
Unix-like
Bare metal
Cross-platform
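Questions 15–19 above revolve around the extract–transform–load flow. A minimal illustrative R sketch of that flow is given below; the file names students.csv and clean_students.csv and the name column are placeholders assumed only for the example.

# Minimal illustrative ETL flow in R (file and column names are placeholders)
# Extract: read the raw data from a source system (here a CSV file)
raw <- read.csv("students.csv", stringsAsFactors = FALSE)
# Transform: clean and reshape the data (drop incomplete rows, standardise a column)
clean <- raw[complete.cases(raw), ]
clean$name <- toupper(trimws(clean$name))     # assumes a 'name' column exists
# Load: write the conformed data to the target (here another file)
write.csv(clean, "clean_students.csv", row.names = FALSE)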
Unit – 4
1. Which of the following is the process of fixing or removing the incorrect data within the dataset?
a. Data transformation b. Data cleaning c. Data visualization d. All of the above
Ans: b
2. Which of the following is the correct syntax to sort the elements of dataset X in the
decreasing order?
a. > order(X$E1, decreasing=TRUE)
b. >order(X, decreasing)
c. >order(X)
d. >order(X, decreasing = FALSE)
Ans: a
3. Which of the following data objects should be used in R if the data consists of one or more
values of a single type?
a. Vector b. matrix c. list d. data frame
Ans: a
4. Which of the following data objects should be used in R if the data consists of one or more
values of any type?
a. Vector b. matrix c. list d. data frame
Ans: c
5. Which of the following subset operators should be used to get an output with multiple
elements of your data?
a. $ b. [[ c. [ d. c()
Ans: c
6. Which of the following is not a type of merge operation performed in R?
a. Full outer join b. Left outer join c. Common join d. Natural join
Ans: c
7. ____ are used to represent only one- dimensional data, like a set of digits.
a. Vectors b. Matrices c. Data frames d. Lists
Ans: a
8. The _____ function is used to sort a data frame.
a. Sort() b. order() c. melt() d. dcast()
Ans: b
9. Which of the following is the correct symbol to represent the dot argument?
a. ... ... ... ... b. ... ... ... c. ... ... d. ...
Ans: d
10. ____ function can be created without specifying its name.
a. Anonymous function
b. External function
c. Internal function
d. Built in function
Ans: a
11. The environment inside a function is known as the ____ environment.
a. Global b. nested c. local d. child
Ans: c
12. Which character function is used for extracting characters from a string?
a. Grep() b. substr() c. sub() d. paste()
Ans: b
13. Which of the following techniques is not used as a plotting mechanism?
a. Using plots for a default variable
b. Using plots for a single variable
c. Using plots for two variables
d. Using plots for multiple variables
Ans: a
14. Which of the following plots is circular?
a. Index b. histogram c. bar d. pie
Ans: d
15. Which of the following arguments is used in the hist() function for setting the color of a
histogram?
a. Color b. clr c. col d. crl
Ans: c
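Several of the answers above – order() for sorting, anonymous functions, substr(), and the col argument of hist() – can be verified directly at the R prompt. The short sketch below uses a small made-up data frame X purely for illustration.

# Quick checks for some of the R facts used in the answers above
X <- data.frame(E1 = c(3, 1, 2), E2 = c("a", "b", "c"))
# Q2/Q8: order() sorts a data frame by a column (decreasing order here)
X[order(X$E1, decreasing = TRUE), ]
# Q10: an anonymous function is created without giving it a name
sapply(X$E1, function(v) v * 10)
# Q12: substr() extracts characters from a string
substr("DataAnalytics", 1, 4)    # returns "Data"
# Q15: the col argument sets the colour of a histogram
hist(c(1, 2, 2, 3, 3, 3), col = "grey", main = "hist() with the col argument")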
Unit – 5
1. Which of the following machine learning techniques uses inputs to study their effects on the
outputs?
a. Supervised learning
b. Unsupervised learning
c. Semi- supervised learning
d. Transduction
Ans: d
2. Visual analytics is the science of ____ supported by the interactive visual interface.
a. Computers
b. Optics
c. Technical tools
d. Analytical reasoning
Ans: d
3. Which of the following tools provides state of the art technology to store and process big
data?
a. Yahoo
b. Google analytics
c. Twitter map
d. Hadoop
Ans: d
4. Visualization of data produces images that are filtered with the help of ____ techniques.
a. Clutter reduction
b. Predictive analytical
c. Logical reasoning
d. Open source
Ans: a
5. Which of the following big data visualization methods is used for categorizing observations
on the basis of meaningful differences?
a. Regression analysis
b. Grouping methods
c. Multiple equation models
d. Structural equation modeling
Ans: b
6. _____ provides a menu to turn on or off screen elements.
a. Reset cards b. label marks c. duplicate sheets d. fit axis
Ans: b
7. Which of the following main menu options will you use to publish your work on the Tableau
server?
a. Map
b. Format
c. Server
d. Window
Ans: c
8. In Tableau, which of the following charts will you use for comparing multiple sets of data
and representing their minimum and maximum values?
a. Scatter plot b. trend lines c. bullet graph d. box plot
Ans: d
9. Which of the following tableau software is suitable for data sharing?
a. Tableau desktop
b. Tableau server
c. Tableau reader
d. Tableau public
Ans: b
10. Which of the following toolbar icons allows you to clear data of the sheet?
a. Undo/redo
b. File save
c. Duplicate sheet
d. Clear sheet
Ans: d
11. Data can be visualized using?
A. graphs
B. charts
C. maps
D. All of the above
Ans: d
12. Which method shows hierarchical data in a nested format?
A. Treemaps
B. Scatter plots
C. Population pyramids
D. Area charts
Ans: a
13. Which of the following is false?
A. Data visualization includes the ability to absorb information quickly
B. Data visualization is another form of visual art
C. Data visualization decreases insight and leads to slower decisions
D. None of the above
Ans: c
14. Common use cases for data visualization include?
A. Politics
B. Sales and marketing
C. Healthcare
D. All of the above
Ans: d
Websites
REFERENCES
Text Book(s)
Reference Book(s)
25. Students List
Section-A
Section-B
Section-C
Section-D
Section-E
26. Group-Wise students list for discussion topic
Section-A
B.No AdmnNo StudentName B.No AdmnNo StudentName
18R11A0501 ADAVIKOLANU SWAPNA 18R11A0528 MOHAMMED ABDUL AMEEN SIDDIQUI
18R11A0502 ANDUGULA SHASHAANK 18R11A0529 MUDDULA ANUSHA
18R11A0503 AWARI DEEKSHITHA 18R11A0530 MUSALE AASHISH
18R11A0504 B DEEVENA ANGELINE 18R11A0531 MUTYALA SANTOSH
SUNAYANA
1 4
18R11A0505 BHAMIDIPATI SHIRIDI PRASAD 18R11A0532 PARITI DIVYA
REVANTH
18R11A0506 CH SIRI SOWMYA 18R11A0533 PARUCHURI HARSHA VARDHAN
18R11A0507 CHERIPALLI SREEJA 18R11A0534 PATRI SAI SINDHURA
18R11A0509 ERRABELLI RUSHYANTH 18R11A0535 PINNEM TARUN KUMAR
18R11A0510 G N HARSHITA 18R11A0536 PIRANGI NITHIN KALYAN
18R11A0511 GAJJI VARUN KUMAR 18R11A0537 POOJABOINA PREETHI
18R11A0512 SRI SAI PRANAVI GANTI 18R11A0538 PURANAM SATYA SAI RAMA TARUN
18R11A0513 H S SHREYA 18R11A0539 S GUNA SINDHUJA
18R11A0514 JANGAM NAGARJUNA GOUD 18R11A0540 SANGARAJU GREESHMA
2 5
18R11A0515 KANNE NITHESH SAI 18R11A0541 SYED ZAINUDDIN
18R11A0516 KODI AKHIL YADAV 18R11A0542 TELUKUNTLA RAJKUMAR
18R11A0517 KOLA SNEHITHA 18R11A0543 THORUPUNURI JANCY
18R11A0518 KOMURAVELLI KARTHIK 18R11A0544 THUMU RAM SAI TEJA REDDY
18R11A0519 KORADA SANTOSH KUMAR 18R11A0545 VADAKATTU HARISH
18R11A0520 KUNCHALA SAIRAM 18R11A0546 VAISHNAVI SABNA
18R11A0521 L A PRITHVIRAJ KUMAR 18R11A0547 VEMURI MADHU VENKATA SAI
18R11A0522 LAHARI BASAVARAJU 18R11A0548 YARRAM REDDY VENKATA SRIVANI REDDY
18R11A0523 LINGA JAYA KRISHNA 19R15A0501 BHULAXMI KALPANA
3 6
18R11A0524 M SREE CHARAN REDDY 19R15A0502 CHALLA DIVYA REDDY
18R11A0525 MAMBETI SAIRAM 19R15A0503 ADLA LIKITHA
18R11A0526 MAMILLA RAMYA 19R15A0504 GOPALADAS VINAYALATHA
18R11A0527 MOHAMMAD AFROZ KHAN 19R15A0505 GANJI CHARAN KUMAR
Section-B
B.No AdmnNo StudentName B.No AdmnNo StudentName
18R11A0549 ALETI KALYANI 18R11A0577 LANKE RAVI TEJA
18R11A0550 ARENDRA SAI KRUPA CHARY 18R11A0578 MANDA SAI JAHNAVI
18R11A0551 BALRAMGARI SREENIDHI 18R11A0579 MANJEET KAUR
18R11A0552 BANALA KEERTHANA 18R11A0580 MEDIPALLI KRISHNA SAKETH
1 18R11A0553 BATHINA SIRISHA 4 18R11A0581 MOHAMMED AFREEN
18R11A0554 BHEMISETTY ABHISHEK 18R11A0582 MOHAMMED NAZEER AHMED
18R11A0555 BODDU SURESH KUMAR 18R11A0583 MOTHE SAHITHI
18R11A0556 BRAMANDLAPALLY ADITYA 18R11A0584 MOVVA MANASWINI
18R11A0557 CHAKALI HARINATH 18R11A0585 PABBA JOGENDAR
18R11A0558 CHILAKAPATI DIVYA 18R11A0586 PONTHAGANI SAI TEJA
18R11A0559 CHILUMULA AMULYA 18R11A0587 PURALI LIPIKA
18R11A0560 CHINNAM ROSE MARY 18R11A0588 RAGHU VARDHAN P
18R11A0561 D SWEEKRUTHI 18R11A0589 RUPA REDDY VUDUMULA
2 18R11A0562 DADI GEETHIKA 5 18R11A0590 SHIVA DHANUSH T V S
18R11A0563 DESETTY SRI NIKITA 18R11A0591 SUNKARI DINESHWAR REDDY
18R11A0564 DUSETY DIVYA REDDY 18R11A0592 T VISHNU VARDHAN REDDY
18R11A0565 G RAGHAV 18R11A0593 VALLABHANENI MANIKANTA VINEETH
18R11A0566 G RUCHITHA 18R11A0594 VENKATA APARNA K L
18R11A0567 G SAI KEERTHANA 18R11A0595 VENNAKANDLA KARTHIK
18R11A0569 GORLA SAI CHARAN 18R11A0596 YEDLAPATI JEEVANI
18R11A0570 GORREMUTCHU WINNY REBBECA 19R15A0506 MANCHALA RAJINIBHARATHI
18R11A0571 K G N BALAJI RAO 19R15A0507 CHINNA SWAMY ANANTHIKA
3 18R11A0572 KARRI MAHESH VARMA 6 19R15A0508 SABBAN AJAY
18R11A0573 KODAKANDLA NAVYA 19R15A0509 GANDHALA GANESH
18R11A0574 KODISANA SANJAY RAM 19R15A0510 KONKA UMESH
18R11A0575 KONAKANCHI GOPTRI SULEKHA *14R11A05F1 P SAI PRASAD
18R11A0576 KUMBA SRI CHAKRA *15R11A05N4 PORIKA RAMA KRISHNA
Section-C
B.No AdmnNo StudentName B.No AdmnNo StudentName
17R11A0556 CHERUKUPALLY MANISH 18R11A05C3 MOOD GANESH
18R11A0597 ANISHA DHANANJAY SHENDKAR 18R11A05C4 MUDUMBAI ACHARYA
RAGHUTHAMADITHYA
18R11A0598 AVIDI DIVYA 18R11A05C5 NANDIKANTI SRI RAM
18R11A0599 BATTULA SAI ABHISHEK YADAV 18R11A05C6 NIKHILA POLKAMPALLY
1 18R11A05A0 BODDUPALLI RADHA KRISHNA REVANTH 4 18R11A05C7 P BHUMIKA PRIYA
18R11A05A1 BUNGA STEEN STRA DORA 18R11A05C8 P HEMANTH REDDY
18R11A05A2 BURUGADDA SHREYA SREE 18R11A05C9 PENDYALA NIKHIL RAO
18R11A05A3 C SURYA VENKAT 18R11A05D0 PODDUTOORI ANUSHA
18R11A05A4 CHALLAKOLUSU NANDA KISHORE 18R11A05D1 POLA SUMANTH
18R11A05A5 CHARAN RAJU M 18R11A05D2 PRIYADARSHINI VIJAYA RANGAN
18R11A05A6 CHITTIBOMMA BALARAM 18R11A05D3 RACHAKONDA GOPIKRISHNA
18R11A05A7 CHOKKAPURAM SRAVANI 18R11A05D4 RAMAVATH LAVANYA
18R11A05A8 DHONAKANTI ROHITH 18R11A05D5 RAVALKOL PAVANI GOUD
2 18R11A05A9 KAKARLA SATYA SRILATHA 5 18R11A05D6 ARRAM RUSHWANTH
18R11A05B0 KALANCHA SATHVIK 18R11A05D7 S APOORVA
18R11A05B2 KODIPE SAIKIRAN 18R11A05D8 SALVERU SAI ROHAN
18R11A05B3 KOMMURI MANASA 18R11A05D9 SHAIKH AFROZ
18R11A05B4 KOTHA SOHAN 18R11A05E1 VAKITI VINESH REDDY
18R11A05B5 KOTLA ASWINI 18R11A05E2 VARUN N
18R11A05B6 KOTTURI NIKHIL PADMANABH 18R11A05E3 VUTKURI SATHVIK GOUD
18R11A05B7 KYATHAM ASHOK GAJAPATHI RAJU 18R11A05E4 YENIMI REDDY JHANSI
18R11A05B8 LAKAVATH DIVYA 19R15A0511 BURA PREM SAI
3
18R11A05B9 M HARI RAMAKRISHNA 6 19R15A0512 RATNALA ASHWINI
18R11A05C0 M RAHUL 19R15A0513 ETIKALA ANUSHA
18R11A05C1 MALLARAPPU SHIRISHA 19R15A0514 ADEPU KEERTHANA
18R11A05C2 METTU KARUNA SRI REDDY 19R15A0515 T BALAVARDHAN REDDY
*18R15A0532 GANDAMALLA SHESHANK
Section-D
B.No AdmnNo StudentName B.No AdmnNo StudentName
18R11A05E5 AKSHITA YERRAM 18R11A05H2 NEELA PAVAN
18R11A05E6 ARYASOMAYAJULA VISHAL BHASKAR 18R11A05H3 NEELAPALA TEJA SHREE
18R11A05E7 BALANNAGARI DEEPAK REDDY 18R11A05H4 NEELAYAVALASA MEGHNA PATNAIK
18R11A05E8 BATHRAJ HARINI 18R11A05H5 NEMANA PRANAMYA
1 18R11A05E9 BHALLAMUDI LAKSHMI PRIYANKA 4 18R11A05H6 PAPAIAHGARI SAI PRIYA
18R11A05F0 BODA AKHILA 18R11A05H7 PENUMARTHI KRISHNA BHARADWAJ
18R11A05F1 BODAGAM DEEKSHITHA REDDY 18R11A05H8 SAI NEHA MANDA
18R11A05F2 BOGURAMPETA SUNIL REDDY 18R11A05H9 SAI PRAVALIKA PERIKA
18R11A05F3 BORRA YASWANTH KUMAR 18R11A05J0 SALLA ANUSHA
18R11A05F4 CHINTAMANENI MEGHANA 18R11A05J1 SANDU JAI VENKATESH
18R11A05F5 DINDU SANDEEP 18R11A05J2 SANKU RAJSHREE RAO
18R11A05F6 DINTYALA NAVYA SREE 18R11A05J3 SEELAM SANJANA
18R11A05F7 DONDAPATI MITHUN 18R11A05J4 SOMI SETTY SAI NEELESH
2 18R11A05F8 DONKENA THARUN KUMAR 5 18R11A05J5 TADEPALLI SAI NANDINI
18R11A05F9 G BHUMIKA 18R11A05J6 THARA REKHA KAKARAPARTHI
18R11A05G0 GAJJALA TEJANARAYANA GOUD 18R11A05J7 TUMMALA VARSHITH
18R11A05G1 GARUGULA VIDYA SAGAR 18R11A05J8 V SATYA NAGA SAI SRILEKHA
18R11A05G2 GATTU BHARGAVI 18R11A05J9 VADDE NITHISH
18R11A05G3 GOWLUGARI ALEKHYA REDDY 18R11A05K0 VARIKUTI LAKSHMI TEJA
18R11A05G4 INJEY DIVYA 18R11A05K1 VIPRAGHNA VISHWANATH
SRIKAKULAPU
18R11A05G5 JYOTI GOUDA 18R11A05K2 YALALA SHALINI
3 18R11A05G7 KOMMERA VAMSHI KRISHNA REDDY 6 19R15A0516 KOLANUCHELIMI SAI CHARAN
18R11A05G8 KONAKANCHI MAHALAKSHMI 19R15A0517 CH NIKHIL
18R11A05G9 KORUKOPPULA SAI KRISHNA 19R15A0518 KANDI PAVAN
18R11A05H0 KOTTAM CHANDRA SHEKAR 19R15A0519 CHITYALA SIRISHA
18R11A05H1 MADHAVI YADAV 19R15A0520 VAGALDAS ARAVIND
Section-E
B.No AdmnNo StudentName B.No AdmnNo StudentName
17R11A05M5 PATEL VAGGMIC REDDY 18R11A05M7 MALLADI VENKATA VAISHNAVI
18H11A0507 MOHAMMED ARBAAZ SOHAIL 18R11A05M8 MORA AKHILA
18R11A05K3 ANGADI SOWMYA 18R11A05M9 MURUGAN RAJAM MEGHNA
18R11A05K4 B VAGDEVI SAHASRA 18R11A05N0 PALVAI VARSHINI
1 18R11A05K5 BASIREDDY SAI NITIN REDDY 4 18R11A05N1 PENTAM HARITHASRI
18R11A05K6 BATCHU OBED EMMANUEL 18R11A05N2 PINNAMARAJU NAGA SHIVANI
18R11A05K7 BHUKYA SAGAR 18R11A05N3 POSHAM KEERTHANA
18R11A05K8 BUNGA VINAY 18R11A05N4 PUNDRU MADHAVI
18R11A05K9 CHAKKA VYDEHI 18R11A05N5 RAVIKANTI VARSHA
18R11A05L0 CHALLA DHRUVITHA 18R11A05N7 SARAYU RANGU
18R11A05L1 CHINNALA AVINASH CHALAM 18R11A05N8 SRIRAM VENKATA SAI SHRAVAN
18R11A05L2 CHITTA JAYA VAMSI KRISHNA 18R11A05N9 SUDAGANI SANJANA
18R11A05L3 DANTULURI NAVYA MEGHANA 18R11A05P0 SUDHESHNA JINNA
18R11A05L4 DARISE GOPI CHANDU 18R11A05P1 TANDA AKSHAY KUMAR
2 5
18R11A05L5 GURRAM VISHNU SAHITA 18R11A05P2 TANMAYI KASTHURI
18R11A05L6 JATOTH YAKUB 18R11A05P3 THIGULLA PAVAN KUMAR
18R11A05L7 K ANIRUDH REDDY 18R11A05P4 V VAISHNAVI
18R11A05L8 K MADHUMITHA 18R11A05P5 VENGALA SETTI AVINASH
18R11A05L9 K MANGAIAH 18R11A05P6 Y RUSHIKA REDDY
18R11A05M0 JOSHNITHA RANGOLU 18R11A05P7 YELLELA V HARSHA VARDHAN
18R11A05M1 KAMMA DIVYA SREE 18R11A05P8 HANEESH REDDY POTHIREDDY
18R11A05M2 KANDIKONDA SHIVANI 18R11A05Q0 MALLARAPU PAVAN ABHINAV
3 18R11A05M3 KOLLAPU KARTHIK 6 19R15A0521 BOLLI UMARANI
18R11A05M4 KOYYADA ANUSHA GOUD 19R15A0522 MEDA NEERAJA
18R11A05M5 LELLA S V N CHANDANA LAVANYA 19R15A0523 GUJJULA VINAY REDDY
18R11A05M6 LILY VERONICA KOPPOLI 19R15A0524 CHEVALLA VENNELA