Data Analytics Course File 2021-22 Odd Semester
Geethanjali College of Engineering and Technology
(Autonomous)
Cheeryal (V), Keesara (M), Medchal District, Telangana State– 501 301
DATA ANALYTICS
(18CS4101)
COURSE FILE
IV Year B.Tech. CSE – I Semester
A.Y:2021 - 2022
DEPARTMENT OF
COMPUTER SCIENCE & ENGINEERING
(2021-2022)
Geethanjali College of Engineering and Technology
(Autonomous)
Cheeryal (V), Keesara (M), Medchal District, Telangana State– 501 301
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
Name of the Course: DATA ANALYTICS
Subject code: 18CS4101 Programme: UG
Branch: CSE Version No: 1
Year: IV Document Number: GCET/CSE/DA/01
Semester: I No. of Pages:130
Section: IV- CSE - A,B,C,D,E
Classification status (Unrestricted/Restricted ): Restricted
Distribution List: Department, Library
Contents
S.No Topic Page. No.
1 Cover Page 1
2 Syllabus copy 4
3 Vision of the Department 6
4 Mission of the Department 6
5 PEOs and POs 6
6 Course objectives and outcomes 8
7 Course mapping with POs 9
8 Brief notes on the importance of the course and how it fits into the curriculum 10
9 Prerequisites if any 13
10 Instructional Learning Outcomes 13
11 Class Time Table 14
12 Individual Time Table 19
13 Lecture schedule with methodology being used/adopted 20
14 Detailed notes 22
15 Additional topics 89
16 University Question papers of previous years 91
17 Question Bank 92
18 Assignment Questions 93
19 Unit wise Quiz Questions and long answer questions 94
20 Tutorial problems 118
21 Known gaps, if any, and inclusion of the same in lecture schedule 118
22 Discussion topics, if any 118
23 References, Journals, websites and E-links if any 119
24 Quality Measurement Sheets 120
A Course End Survey
B Teaching Evaluation
25 Student List 121
26 Group-Wise students list for discussion topic 126
2. Syllabus
18CS4101-DATA ANALYTICS
L  T  P/D  C
3  -  -/-  3
IV Year B.Tech. (CSE) – I Semester
Prerequisites:
● 18CS2102 - Object Oriented Programming using Java
● 18MA2102 - Probability and Statistics
● 18CS2203 - Database Management Systems
Course Objectives:
Develop ability to
1. Know the basic elements of Big Data and Data science to handle huge amount of data.
2. Gain knowledge of basic mathematics behind the Big data.
3. Understand the different Big data processing technologies.
4. Apply the Analytical concepts of Big data using R and Python.
5. Visualize the Big Data using different tools.
UNIT I
Introduction: Data Science and Big Data:
Introduction to Data science and Big Data, Defining Data science and Big Data, Big Data
examples, Data explosion, Data volume, Data Velocity, Big data infrastructure and challenges, Big
Data Processing Architectures, Data Warehouse.
UNIT II
Summarizing Data & Revisiting Probability:
Summary Statistics - Summarizing data with R, Probability, Expected values, Random variables,
Bivariate Random variables, Probability distribution, Central Limit Theorem, Regression Analysis, Regression
Modeling.
UNIT III
Big Data processing:
Big Data technologies, Introduction to Google file system, Hadoop Architecture, Hadoop Storage:
HDFS, Common Hadoop Shell commands, NameNode, Secondary NameNode, and DataNode,
Hadoop MapReduce paradigm, Map Reduce tasks, Job, Task trackers, Introduction to NOSQL,
Textual ETL processing.
UNIT IV
Big Data analytics:
Data analytics life cycle, Data cleaning , Data transformation, Comparing reporting and analysis,
Types of analysis, Analytical approaches, Data analytics using R, Exploring basic features of R,
Exploring R GUI, Reading data sets, Manipulating and processing data in R, Functions and
packages in R, Performing graphical analysis.
UNIT – V
Big Data Visualization:
Introduction to Data visualization, Challenges to Big data visualization, Types of data visualization,
Visualizing Big Data, Tools used in data visualization, Proprietary Data Visualization tools, Open
source data visualization tools, Data visualization with Tableau.
TEXT BOOK(S)
1. Data Warehousing in the Age of Big Data, Krish Krishnan, Elsevier, ISBN: 9780124058910,
1st Edition. (Units I, III) – 40%
2. Probability and Computing: Randomized Algorithms and Probabilistic Analysis, Mitzenmacher
and Upfal, Cambridge University Press, ISBN: 521835402, hardback. (Unit II) – 20%
3. Big Data, Black Book, DT Editorial Services, ISBN: 9789351197577, 2016 Edition.
(Units III, IV, V) – 40%
REFERENCE BOOK(S)
1. Algorithmic and Analysis Techniques in Property Testing, Dana Ron, School of EE.
2. Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches, Foundation and
trends in databases, Graham Cormode, Minos Garofalakis, Peter J. Haas and Chris
Jermaine.
3. R for Business Analytics, A.Ohri, Springer, ISBN:978-1-4614-4343-8.
4. Hadoop in practice, Alex Holmes, Dreamtech press, ISBN:9781617292224.
3. Vision of the Department
To produce globally competent and socially responsible computer science engineers
contributing to the advancement of engineering and technology which involves creativity and
innovation by providing excellent learning environment with world class facilities.
4. Mission of the Department
1. To be a centre of excellence in instruction, innovation in research and scholarship, and service
to the stake holders, the profession, and the public.
2. To prepare graduates to enter a rapidly changing field as a competent computer science
engineer.
3. To prepare graduates capable in all phases of software development, possess a firm
understanding of hardware technologies, have the strong mathematical background necessary
for scientific computing, and be sufficiently well versed in general theory to allow growth
within the discipline as it advances.
4. To prepare graduates to assume leadership roles by possessing good communication skills, the
ability to work effectively as team members, and an appreciation for their social and ethical
responsibility in a global setting.
Program Outcomes (CSE)
1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering
problems.
2. Problem analysis: Identify, formulate, review research literature, and analyze
complex engineering problems reaching substantiated conclusions using first principles
of mathematics, natural sciences, and engineering sciences.
3. Design/development of solutions : Design solutions for complex engineering
problems and design system components or processes that meet the specified needs
with appropriate consideration for the public health and safety, and the cultural,
societal, and environmental considerations.
4. Conduct investigations of complex problems: Use research-based knowledge and
research methods including design of experiments, analysis and interpretation of data, and
synthesis of the information to provide valid conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and
modern engineering and IT tools including prediction and modelling to complex
engineering activities with an understanding of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge
to assess societal, health, safety, legal and cultural issues and the consequent
responsibilities relevant to the professional engineering practice.
7. Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of,
and need for sustainable development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
9. Individual and team work: Function effectively as an individual, and as a member or
leader in diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and
write effective reports and design documentation, make effective presentations, and give
and receive clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a
member and leader in a team, to manage projects and in multidisciplinary environments.
12. Life-long learning : Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological
change.
PSO 1: To identify and define the computing requirements for its solution under given
constraints.
PSO 2: To follow best practices, namely SEI-CMM levels and Six Sigma, which vary from time to
time, for software development projects using open-ended programming environments to produce
software deliverables as per customer needs.
7. Course Mapping with POs
Course: DATA ANALYTICS
PEOs: PEO1, PEO2, PEO3
POs: PO1, PO2, PO3, PO4, PO5, PO11, PO12, PSO1, PSO2

Course Name: DATA ANALYTICS – mapping of Course Outcomes to Program Outcomes
                                                  PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2
CO1: Observe Big Data elements and Architectures.  2   1   1   1   1   -   -   -   -   -    2    2    1    2
8. Brief notes on the importance of the course and how it fits into the curriculum
a. What role does this course play within the Program?
The data analyst serves as a gatekeeper for an organization's data so stakeholders can
understand data and use it to make strategic business decisions. It is a technical role that
requires an undergraduate degree or master's degree in analytics, computer modelling, science,
or mathematics.
b. How is the course unique or different from other courses of the Program?
Data analytics is more specific and concentrated: it focuses on viewing historical data in
context, and it draws on a few different branches of broader statistics and analysis.
c. What essential knowledge or skills should they gain from this experience?
SQL. SQL, or Structured Query Language, is the ubiquitous industry-standard database
language and is possibly the most important skill for data analysts to know.
Microsoft Excel.
Critical Thinking.
R or Python – Statistical Programming.
Data Visualization.
Presentation Skills.
Machine Learning.
d. What knowledge or skills from this course will students need to have mastered to perform
well in future classes or later (Higher Education / Jobs)?
Data Visualization.
Data Cleaning.
MATLAB.
R.
Python.
SQL and NoSQL.
Machine Learning.
Linear Algebra and Calculus
g. When students complete this course, what do they need to know or be able to do?
This is the obvious benefit to learning data analytics, and the one most often focused on by students in
higher education. In practical terms, this means graduates with analytics skills will be able to command
higher salaries and enjoy their pick of the available jobs. Data analysis consolidates information to
provide the big picture of trends and patterns for higher education leadership teams that can be used to
evaluate and streamline processes, create efficiencies, and improve the overall student experience.
h. Is there specific knowledge that the students will need to know in the future?
Analytical Thinking
SQL Database
Decision Analysis
Mathematical and Statistical Skills
Software Analytics
Programming Skills
Functions and Formulas
Data Cleaning and Preparation
Quantitative Skills
Data Visualization Skills
Query Languages
Problem Solving
Domain Knowledge
i. Are there certain practical or professional skills that students will need to apply in the
future?
For beginners, the Business Analytics Nanodegree program is a great place to start
learning Excel, SQL, and Tableau. If you have some experience, you could start with the Data
Analyst Nanodegree program, where you'll use Python, R, and SQL to tackle data projects.
j. Five years from now, what do you hope students will remember from this course?
Business analytics graduates can pursue a career in many different industries and companies as
a business analyst or data scientist.
k. What is it about this course that makes it unique or special?
Data analytics is important because it helps businesses optimize their performances. A company
can also use data analytics to make better business decisions and help analyze customer trends
and satisfaction, which can lead to new—and better—products and services.
n. What unique contributions to students’ learning experience does this course make?
Technology provides students with easy-to-access information, accelerated learning, and fun
opportunities to practice what they learn. It enables students to explore new subjects and
deepen their understanding of difficult concepts
o. What is the value of taking this course? How exactly does it enrich the program?
Data analytics is important because it helps businesses optimize their performances.
Implementing it into the business model means companies can help reduce costs by
identifying more efficient ways of doing business and by storing large amounts of data.
p. What are the major career options that require this course?
Machine learning engineer.
Data architect.
Statistician.
Data analyst.
Chief technology officer (CTO).
Chief data officer (CDO).
Application architect.
Project manager.
9. Prerequisites if any
SQL.
Microsoft Excel.
Critical Thinking.
R or Python–Statistical Programming.
Data Visualization.
Presentation Skills.
Machine Learning.
10. Instructional learning outcomes
S.No | Unit | Contents                                   | Outcomes
1    | I    | Introduction: Data Science and Big Data    | Observe Big Data elements and Architectures.
2    | II   | Summarizing Data & Revisiting Probability  | Apply different mathematical models for Big Data.
3    | III  | Big Data processing                        | Demonstrate their Big Data skills by developing different applications.
4    | IV   | Big Data analytics                         | Apply each learning model for different datasets.
5    | V    | Big Data Visualization                     | Analyze needs, challenges and techniques for big data visualization.
11. Class Time Table
12. Individual Time Table

Lesson Schedule
S.No | Date | No. of periods | Topics to be covered | Regular / Additional | Teaching aids used (LCD/OHP/BB)

Unit I
1 | Day 1 | 2 | Course objectives and Course Outcomes, Introduction to Data science and Big Data | Regular | LCD
2 | Day 2 | 2 | Defining Data science and Big Data, Big Data examples | Regular | LCD
3 | Day 3 | 2 | Data explosion, Data volume, Data Velocity | Regular | LCD
4 | Day 4 | 2 | Big data infrastructure and challenges | Regular | LCD
5 | Day 5 | 2 | Big Data Processing Architectures | Regular | LCD
6 | Day 6 | 2 | Data Warehouse | Regular | LCD

Unit II
1 | Day 7 | 2 | Summary Statistics - Summarizing data with R, Probability, Expected, Random | Regular | LCD
2 | Day 8 | 2 | Bivariate Random variables, Probability distribution | Regular | LCD
3 | Day 9 | 2 | Central Limit Theorem, Regression Analysis, Regression Modeling | Regular | LCD

Unit III
1 | Day 10 | 2 | Big Data technologies, Introduction to Google file system | Regular | LCD
2 | Day 11 | 2 | Hadoop Architecture, Hadoop Storage: HDFS | Regular | LCD
3 | Day 12 | 2 | Common Hadoop Shell commands, NameNode, Secondary NameNode and DataNode | Regular | LCD
4 | Day 13 | 2 | Hadoop MapReduce paradigm, Map Reduce tasks, Job, Task trackers | Regular | LCD
5 | Day 14 | 2 | Introduction to NOSQL, Textual ETL processing | Regular | LCD

Unit IV
1 | Day 15 | 2 | Data analytics life cycle, Data cleaning, Data transformation | Regular | LCD

Unit V
1 | Day 20 | 2 | Introduction to Data visualization, Challenges to Big data visualization | Regular | LCD
2 | Day 21 | 2 | Types of data visualization, Visualizing Big Data, Tools used in data visualization | Regular | LCD
3 | Day 22 | 2 | Proprietary Data Visualization tools, Open source data visualization tools | Regular | LCD
4 | Day 23 | 2 | Data visualization with Tableau | Regular | LCD
UNIT I
Introduction: Data Science and Big Data:
Introduction to Data science and Big Data, Defining Data science and Big Data, Big Data
examples, Data explosion, Data volume, Data Velocity, Big data infrastructure and challenges, Big
Data Processing Architectures, Data Warehouse.
If we consider all the data, the associated processes, and the metrics used in any
decision-making situation within any organization, we realize that we have used information
(volumes of data) in a variety of formats and varying degrees of complexity, and derived decisions
from that data in non-traditional software processes. Before we get to Big Data, let us look at a few
important events in computing history. In the late 1980s, we were introduced to the concept of
decision support and data warehousing. This wave of being able to create trends, perform historical
analysis, and provide predictive analytics and highly scalable metrics created a series of solutions,
companies, and an industry in itself.
All these entities have contributed to the consumerization of data, from data creation,
acquisition, and consumption perspectives. The business models and opportunities that came with
the large-scale growth of data drove the need to create powerful metrics to tap from the knowledge
of the crowd that was driving them, and in return offer personalized services to address the need of
the moment.
Here are some examples:
● Weather data—there is a lot of weather data reported by governmental agencies around the
world, scientific organizations, and consumers like farmers. What we hear on television or radio is
an analytic key performance indicator (KPI) of temperature and forecasted conditions based on
several factors.
● Contract data—there are many types of contracts that an organization executes every year, and
there are multiple liabilities associated with each of them.
● Labor data—elastic labor brings a set of problems that organizations need to solve.
● Maintenance data—records from maintenance of facilities, machines, non-computer-related
systems, and more.
● Financial reporting data—corporate performance reports and annual filing to Wall Street.
● Compliance data—financial, healthcare, life sciences, hospitals, and many other agencies that file
compliance data for their corporations.
● Clinical trials data—pharmaceutical companies have wanted to minimize the life cycle of
processing for clinical trials data and manage the same with rules-based processing; this is an
opportunity for Big Data.
● Processing doctors’ notes on diagnosis and treatments—another key area of hidden insights and
value for disease state management and proactive diagnosis; a key machine learning opportunity.
● Contracts—every organization writes many types of contracts every year, and must process and
mine the content in the contracts along with metrics to measure the risks and penalties.
1. Fraud detection
For businesses whose operations involve any type of claims or transaction processing, fraud
detection is one of the most compelling Big Data application examples. Historically, fraud detection
on the fly has proven an elusive goal. In most cases, fraud is discovered long after the fact, at which
point the damage has been done and all that's left is to minimize the harm and adjust policies to
prevent it from happening again. Big Data platforms that can analyze claims and transactions in real
time, identifying large-scale patterns across many transactions or detecting anomalous behavior
from an individual user, can change the fraud detection game.
2. IT log analytics
IT solutions and IT departments generate an enormous quantity of logs and trace data. In the
absence of a Big Data solution, much of this data must go unexamined: organizations simply don't
have the manpower or resources to churn through all that information by hand, let alone in real time.
With a Big Data solution in place, however, those logs and trace data can be put to good use.
Within this list of Big Data application examples, IT log analytics is the most broadly applicable.
Any organization with a large IT department will benefit from the ability to quickly identify large-
scale patterns to help in diagnosing and preventing problems. Similarly, any organization with a
large IT department will appreciate the ability to identify incremental performance optimization
opportunities.
3. Call center analytics
Now we turn to the customer-facing Big Data application examples, of which call center
analytics are particularly powerful. What's going on in a customer's call center is often a great
barometer and influencer of market sentiment, but without a Big Data solution, much of the insight
that a call center can provide will be overlooked or discovered too late. Big Data solutions can help
identify recurring problems or customer and staff behavior patterns on the fly not only by making
sense of time/quality resolution metrics, but also by capturing and processing call content itself.
4. Social media analysis
A Big Data solution built to harvest and analyze social media activity, like IBM's Cognos
Consumer Insights, a point solution running on IBM's BigInsights Big Data platform, can make
sense of the chatter. Social media can provide real-time insights into how the market is responding
to products and campaigns. With those insights, companies can adjust their pricing, promotion, and
campaign placement on the fly for optimal results.
DATA EXPLOSION
What has led to this explosive growth of data? One answer is innovation. Innovation has
transformed the way we engage in business, provide services, and the associated measurement of
value and profitability. Three fundamental trends that shaped up the data world in the last few years
are business model transformation, globalization, and personalization of services.
DATA VOLUME
Data volume is characterized by the amount of data that is generated
continuously. Different data types come in different sizes. For example, a blog text is a few
kilobytes; voice calls or video files are a few megabytes; sensor data, machine logs, and clickstream
data can be in gigabytes. Machine-generated data is often characterized by a steady pattern of
numbers and text, which occurs in a rapid-fire fashion.
There are several examples of machine-generated data; for instance, a robotic arm that is on
an assembly line at an automotive plant will be sending signals for every movement it makes, and
that pattern will be steady, as opposed to a dredging machine that is doing roadwork and sending
signals back on how much movement it had during the day, payload moved, and system and
machine status. Sensors on top of buildings that regulate heating and cooling send different signals
all through the day; though the structure may seem the same, the values differ depending on many
factors. Sensors on automobiles send different signals depending on road type, driving speed,
weight, and more to support centers.
DATA VELOCITY
Velocity can be defined as the speed and direction of motion of an object. Constant velocity
of an object is the motion of an object at constant speed and direction. With the advent of Big Data,
understanding the velocity of data is extremely important. The basic reason for this arises from the
fact that in the early days of data processing, we used to analyze data in batches, acquired over
time. Typically, data is broken into fixed-size chunks and processed through different layers from
source to targets, and the end result is stored in a data warehouse for further use in reporting and
analysis.
For example, a flight from London to New York generates 650TB of data from the airplane
engine sensors. There is a lot of value in reading this information during the stream processing and
postgathering for statistical modeling purposes.
Mobile networks
The most popular way to share
pictures, music, and data today is via mobile devices. The sheer volume of data that is transmitted
by mobile networks provides insights to the providers on the performance of their network, the
amount of data processed at each tower, the time of day, the associated geographies, user
demographics, location, latencies, and much more.
Among the different options for processing Big Data, the list of features for handling data velocity
included the following:
● System must be elastic for handling data velocity along with volume.
● System must scale up and scale down as needed without increasing costs.
● System must be able to process data across the infrastructure in the least processing time.
● System throughput should remain stable independent of data velocity.
● System should be able to process data on a distributed platform.
After selecting the most appropriate architecture for data processing, the data velocity problem was
handled by the IT team. The key takeaways from the exercise were:
● Data architecture needs to be based on workload.
● Metadata requirements need to be clearly documented.
● The data platform supports a wide range of real-time analytics.
● Result sets can be integrated easily with high-volume analytic data stores.
● Read and write latencies below 50 milliseconds.
● Scale out on commodity hardware.
● Database must automatically implement the defined partitioning strategy.
Data processing explained
The following data processing requirements exist within multiple layers of the architecture stack:
● Data model-less architecture. Due to the size and complexity of data, there is no fixed data model
when you process Big Data. In other words, the data processing is on a schemaless or nonstructured
data set.
● Near-real-time data collection. Big Data can be either collected as a batch or in real time. To
process the landing of data very effectively, a highly scalable and performing storage platform is
needed.
● Microbatch processing. Big Data can be processed as real time, batch, or microbatch data sets.
The underlying platform must be capable of processing data in any of these formats, concurrently
and in a scalable fashion.
● Minimal data transformation. Big Data processing with programming languages provides for
some transformation. Most transformations are executed as a multistep derivation and complexity is
kept to a minimum within each step. This feature is necessary to improve speed and design fault
tolerance.
● Efficient data reads. The data processing design should be read-oriented, as there is no
schema-based processing of data.
● Multipartition capability. The data processing architecture should be able to support vertical and
horizontal partitions of the data. Due to the size and the volume of data, both the partitioning
techniques will be implemented in the processing workflow.
● Store result in file system or DBMS (not relational). Big Data processing is file-based and
therefore the result sets need to be stored in a file system architecture. Alternately, a DBMS
platform can be integrated into the processing cycle, but not an RDBMS.
● Share data across multiple processing points. Replication and sharing of data is a key processing
requirement. The reasons for this include the design of fault tolerance, multistep processing, and
multipartitioning. In a nutshell, Big Data processing can be summarized as a file system–based
scalable and distributed data processing architecture, designed and deployed on a high-performance
and scalable infrastructure.
Disk I/O bottlenecks are one common source of delays in data processing. Fortunately, there are
some tricks that you can use to minimize their impact.
One solution is to upgrade your data infrastructure to solid-state disks (SSDs), which typically run
faster. Alternatively, you could use in-memory data processing, which is much faster than relying
on conventional storage.
SSDs and in-memory storage are more costly, of course, especially when you use them at scale. But
that does not mean you can’t take advantage of them strategically in a cost-effective way: Consider
deploying SSDs or in-memory data processing for workloads that require the highest speed, but
sticking with conventional storage where the benefits of faster I/O won’t outweigh the costs.
Lack of scalability
If your data infrastructure can’t increase in size as your data needs grow, it will undercut your
ability to turn data into value.
At the same time, of course, you don’t want to maintain substantially more big data infrastructure
than you need today just so that it’s there for the future. Otherwise, you will be paying for
infrastructure you’re not currently using, which is not a good use of money.
One way to help address this challenge is to deploy big data workloads in the cloud, where you can
increase the size of your infrastructure virtually instantaneously when you need it, without paying
for it when you don’t. If you prefer not to shift all of your big data workloads to the cloud, you
might also consider keeping most workloads on-premise, but having a cloud infrastructure set up
and ready to handle “spillover” workloads when they arise—at least until you can create a new on-
premise infrastructure to handle them permanently.
If your data is large in size, transferring it across the network can take time—especially if network
transfers require using the public internet, where bandwidth tends to be much more limited than it is
on internal company networks.
Paying for more bandwidth is one way to mitigate this problem, but that will only get you so far
(and it will cost you). A better approach is to architect your big data infrastructure in a way that
minimizes the amount of data transfer that needs to occur over the network. You could do this by,
for example, using cloud-based analytics tools to analyze data that is collected in the cloud, rather
than downloading that data to an on-premise location first. (The same logic applies in reverse: If
your data is born or collected on-premise, analyze it there.)
Getting data from the format in which it is born into the format that you need to analyze it or share
it with others can be very tricky. Most applications structure data in ways that work best for them,
with little consideration of how well those structures work for other applications or contexts.
This is why data transformation is so important. Data transformation allows you to convert data
from the format in which it is born into the format you need for analysis or sharing.
When done incorrectly—which means manually and in ways that do not control for data quality—
data transformation can quickly cause more trouble than it is worth. But when you automate data
transformation and ensure the quality of the resulting data, you maximize your data infrastructure’s
ability to meet your big data needs, no matter how your infrastructure is constructed.
● Lack of proper understanding of Big Data. Companies fail in their Big Data initiatives due to
insufficient understanding.
● Data growth issues.
● Confusion while Big Data tool selection.
● Lack of data professionals.
● Securing data.
● Integrating data from a variety of sources.
Data processing has been a complex subject to deal with since the primitive days of
computing. The underlying reason for this stems from the fact that complexity is induced from the
instrumentation of data rather than the movement of data.
In a peer-to-peer architecture there are no dedicated servers and clients; instead, all the processing
responsibilities are allocated among all machines, known as peers. Each machine can perform the
role of a client or server, or just process data.
Distributed processing has a lot of advantages and disadvantages.
● Advantages:
– Scalability of systems and resources can be achieved based on isolated needs.
– Processing and management of information can be architected based on the desired unit of operation.
– Parallel processing of data, reducing time latencies.
● Disadvantages:
– Data redundancy
– Process redundancy
– Resource overhead
– Volumes
The most popular distributed processing implementations in the data world are:
● Peer to peer
● Hub and spoke
● Federated
Processing data in either the centralized processing or distributed processing style has a lot of
infrastructure challenges that continue to dominate the space.
Shared-everything architecture.
The data processing life cycle for Big Data differs from transactional data (Figure 3.5). In a
traditional environment you first analyze the data and create a set of requirements, which leads to
data discovery and data model creation, and then a database structure is created to process the data.
The resulting architecture is very efficient from the perspective of write performance, as data’s
finite shape, structure, and state are loaded in the end state. Big Data widely differs in its processing
cycle. The data is first collected and loaded to a target platform, then a metadata layer is applied to
the data, and a data structure for the content is created. Once the data structure is applied, the data is
then transformed and analyzed. The end result from the process is what provides insights into the
data and any associated context (based on the business rules processed). To process the data in this
flexible manner, a database-driven architecture will be unsuitable and will underperform. To
process this volume and complexity, a file-driven architecture with a programming language
interface is best suited.
The key requirements for the infrastructure and processing architecture are as follows:
● Data model-less architecture
● Near-real-time data collection
● Microbatch processing
● Minimal data transformation
● Efficient data reads
● Multipartition capability
● Store result in file system or DBMS (not relational)
● Share data across multiple processing points
● Infrastructure requirements
● Linear scalability
● High throughput
Shared-nothing architecture
UNIT II
Summarizing Data & Revisiting Probability:
Summary Statistics - Summarizing data with R, Probability, Expected values, Random variables,
Bivariate Random variables, Probability distribution, Central Limit Theorem, Regression Analysis, Regression
Modeling.
Summary Statistics
Summary statistics summarize and provide information about your sample data. It tells you
something about the values in your data set. This includes where the mean lies and whether your
data is skewed. Summary statistics fall into three main categories:
● Measures of location (also called central tendency).
● Measures of spread.
● Graphs/charts.
Summary Statistics: Measures of location
Measures of location tell you where your data is centered, or where a trend lies. Common measures
of location include:
● Mean (also called the arithmetic mean or average).
● Geometric mean (used for interest rates and other types of growth).
● Trimmed Mean (the mean with outliers excluded).
● Median (the middle of a data set).
Summary Statistics: Measures of Spread
Measures of spread tell you (perhaps not surprisingly!) how spread out or varied your data set is.
This can be important information. For example, test scores that are in the 60-90 range might be
expected, while scores in the 20-70 range might indicate a problem. Range isn't the only measure of
spread, though; others include the interquartile range, variance and standard deviation.
Summary Statistics: Graphs and Charts
Common graphical summaries include:
● Histogram.
● Frequency Distribution Table.
● Box plot.
● Bar chart.
● Scatter plot.
● Pie chart.
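As a quick illustration, here is a minimal R sketch (not drawn from the prescribed texts) that computes the common measures of location and spread, and draws two of the graphical summaries, for one column of the built-in iris data set:
x <- iris$Sepal.Length          # one numeric column of the built-in iris data
mean(x)                         # arithmetic mean
exp(mean(log(x)))               # geometric mean (sensible here only because all values are positive)
mean(x, trim = 0.1)             # trimmed mean: 10% of values dropped from each end
median(x)                       # middle of the data set
range(x)                        # minimum and maximum
var(x); sd(x)                   # variance and standard deviation (measures of spread)
quantile(x)                     # quartiles / five-number summary
hist(x, main = "Sepal length", xlab = "cm")    # histogram
boxplot(x, main = "Sepal length")              # box plot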
The following R functions are commonly used to summarize data:
● sapply
● tapply
● by
● sqldf
● ddply
Introduction
● People often get confused when it comes to summarizing data quickly in R. There are various
options, but which one is the best? This question is answered below. You should choose one
option first and become expert at it; that is how you should move on to the next.
● People who transition from SAS or SQL are used to writing simple queries in those languages
to summarize data sets. For such an audience, the biggest concern is how to do the same thing
in R.
● Generally, summarizing data means finding statistical figures such as the mean, median, box
plot, etc. If you also want to work with scatter plots and histograms, you can refer to a guide on
data visualization in R.
3. sapply
“sapply” does the same thing as apply but returns a vector or matrix. Let’s consider the last
example again.
l <- list(a = 1:10, b = 11:20)
l.mean <- sapply(l, mean)
class(l.mean)
[1] "numeric"
4. tapply
Till now, none of the functions we discussed can do what SQL can achieve. Here is a function which
completes the palette for R. The usage is “tapply(X, INDEX, FUN = NULL, …, simplify = TRUE)”,
where X is “an atomic object, typically a vector” and INDEX is “a list of factors, each of the same
length as X”. Here is an example which will make the usage clear.
attach(iris)
# mean petal length by species
tapply(iris$Petal.Length, Species, mean)
setosa versicolor virginica
1.462 4.260 5.552
5. by
Now comes a slightly more complicated algorithm. Function ‘by’ is an object-oriented wrapper for
‘tapply’ applied to data frames. Hopefully the example will make it more clear.
attach(iris)
by(iris[, 1:4], Species, colMeans)
Species: setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.006 3.428 1.462 0.246
------------------------------------------------------------
Species: versicolor
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.936 2.770 4.260 1.326
------------------------------------------------------------
Species: virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width
6.588 2.974 5.552 2.026
What did the function do? It simply splits the data by a class variable, which in this case is the
species, and then it creates a summary at this level. So it applies a function on the split data frames.
The returned object is of class “by”.
6. sqldf
If you found any of the above statements difficult, don’t panic. I bring you a life line which you can
use anytime. Let’s fit in the SQL queries in R. Here is a way you can do the same.
attach(iris)
library(sqldf)
iris2 <- iris
names(iris2) <- gsub(".", "_", names(iris2), fixed = TRUE)   # SQL-friendly column names
summarization <- sqldf("select Species, avg(Petal_Length) as Petal_Length_mean
                        from iris2 where Species is not null group by Species")
And it’s done. Wasn’t it simple enough? One setback of this approach is the amount of time it takes
to execute. In case you are interested in getting speed and same results read the next section.
7. ddply
This is the fastest of all the approaches we discussed. You will need an additional package. Let's do
exactly what we did in the tapply section.
library(plyr)
attach(iris)
# mean petal length by species
ddply(iris, "Species", summarise, Petal.Length_mean = mean(Petal.Length))
Independent events
If two events, A and B, are independent, then the joint probability is P(A and B) = P(A) × P(B).
For example, if two fair coins are flipped, then the chance of both being heads is 1/2 × 1/2 = 1/4.
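A small simulation can confirm the 1/4 figure; this is a minimal R sketch, where the number of repetitions is an arbitrary choice made only for the illustration:
set.seed(1)                                   # reproducible simulation
n <- 100000                                   # number of simulated double coin flips (arbitrary)
coin1 <- sample(c("H", "T"), n, replace = TRUE)
coin2 <- sample(c("H", "T"), n, replace = TRUE)
# Independence: P(both heads) = P(H) * P(H) = 0.5 * 0.5 = 0.25
mean(coin1 == "H" & coin2 == "H")             # empirical estimate, close to 0.25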
Inverse probability
In statistical notations, a random variable is generally represented by a capital letter, and its
realizations/observed values are represented by small letters.
Consider the experiment of tossing two coins. We can define X to be a random variable that
measures the number of heads observed in the experiment. For the experiment, the sample space is
shown below:
S = {HH, HT, TH, TT}
There are 4 possible outcomes for the experiment, and this is the domain of X. The random variable
X takes these 4 outcomes/events and processes them to give different real values. For each
outcome, the associated value is shown as:
X(HH) = 2, X(HT) = 1, X(TH) = 1, X(TT) = 0
Thus, we can represent X as a function that maps the sample space onto the set {0, 1, 2}.
There are three types of random variables- discrete random variables, continuous random variables,
and mixed random variables.
1) Discrete Random Variables: Discrete random variables are random variables, whose range is a
countable set. A countable set can be either a finite set or a countably infinite set. For instance, in
the above example, X is a discrete variable as its range is a finite set ({0, 1, 2}).
2) Continuous Random Variables: Continuous random variables are random variables whose range
is an uncountable set, for example an interval of real numbers such as the exact height of a person.
3) Mixed Random Variables: Lastly, mixed random variables are ones that are a mixture of both
continuous and discrete variables. These variables are more complicated than the other two. Hence,
they are explained at the end of this article.
When we describe the values in the range of a random variable in terms of the probability of
their occurrence, we are essentially talking about the probability distribution of the random
variable. In other words, the probability distribution of a random variable can be determined by
calculating the probability of occurrence of every value in the range of the random variable. A
probability distribution is described for discrete and continuous random variables in subtly different
ways.
For discrete variables, the term ‘Probability mass function (PMF)’ is used to describe their
distributions. Using the example of coin tosses, as discussed above, we calculate the probability of
X taking the values 0, 1 and 2 as follows:
P(X = 0) = P({TT}) = 1/4, P(X = 1) = P({HT, TH}) = 1/2, P(X = 2) = P({HH}) = 1/4.
We use the notation PX(x) to refer to the PMF of the random variable X. This leads us to the general
description of the distribution in tabular format:
x        0     1     2
PX(x)   1/4   1/2   1/4
Properties of probability mass function:
1) Every PMF value lies between 0 and 1, i.e., 0 ≤ PX(x) ≤ 1 for each x in the range.
2) PMF must sum to one over the entire range set of a random variable.
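Both properties can be checked directly; this is a minimal R sketch that enumerates the two-coin sample space used above:
# Enumerate the sample space of two fair coin tosses
outcomes <- expand.grid(coin1 = c("H", "T"), coin2 = c("H", "T"),
                        stringsAsFactors = FALSE)
X <- rowSums(outcomes == "H")        # random variable X = number of heads in each outcome
pmf <- table(X) / nrow(outcomes)     # each of the 4 outcomes has probability 1/4
pmf                                  # P(X=0)=0.25, P(X=1)=0.50, P(X=2)=0.25, all between 0 and 1
sum(pmf)                             # property 2: the PMF sums to one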
For continuous variables, the term ‘Probability density function (PDF)’ is used to describe
their distributions. We’ll consider the example of the distribution of heights. Suppose, we survey a
group of 1000 people and measure the height of each person very precisely. The distribution of the
heights can be shown by a density histogram as follows:
We have grouped the different heights in certain intervals. But let’s see what happens when we try
to reduce the size of the histogram bins. In other words, we make the grouping intervals smaller and
smaller.
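The effect of shrinking the bins can be reproduced with simulated heights; in the minimal R sketch below, the normal distribution, its parameters and the sample size are assumptions made purely for the illustration:
set.seed(42)
heights <- rnorm(1000, mean = 170, sd = 8)     # 1000 simulated heights in cm (assumed distribution)
par(mfrow = c(1, 3))                           # three density histograms side by side
hist(heights, breaks = 10,  freq = FALSE, main = "10 bins",  xlab = "height (cm)")
hist(heights, breaks = 30,  freq = FALSE, main = "30 bins",  xlab = "height (cm)")
hist(heights, breaks = 100, freq = FALSE, main = "100 bins", xlab = "height (cm)")
# As the bins shrink, the density histogram approaches the smooth probability density function
curve(dnorm(x, mean = 170, sd = 8), add = TRUE, col = "red")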
Central Limit Theorem
“In probability theory, the central limit theorem (CLT) states that, given certain conditions,
the arithmetic mean of a sufficiently large number of iterates of independent random variables, each
with a well-defined expected value and well-defined variance, will be approximately normally
distributed, regardless of the underlying distribution. That is, suppose that a sample is obtained
containing a large number of observations, each observation being randomly generated in a way
that does not depend on the values of the other observations, and that the arithmetic average of the
observed values is computed.
If this procedure is performed many times, the central limit theorem says that the computed
values of the average will be distributed according to the normal distribution (commonly known as
a ‘bell curve’).”
In the domain of statistics, the most commonly used statistical technique is Regression Analysis,
which is used to estimate particular relationships among variables. Under this
technique, the main focus is upon the relationship between dependent variable and any one or more
independent variables.
There are several techniques within this analysis that are used for modeling and analyzing
several variables. This technique helps you see how the particular value of a dependent variable
changes when any one of the independent variable varies with all others fixed. In simple terms,
through this approach you get to estimate the conditional expectation or the average value of the
dependent variable. In all cases, the target for estimation is a function of one or more independent
variables, which is termed the regression function. The main goal of regression
analysis is to ascertain the values of all the parameters to derive a function that will fit the data
observations in the best way possible.
There are a variety of techniques within data analytics that are employed to carry out
regression analysis. Some of the well-known ones are Linear Regression, Logistic Regression and
Ordinary Least Squares. Linear regression and least squares techniques are parametric; in both of
these methodologies the regression function is defined in terms of a finite number of unknown
parameters.
How regression analysis methods perform depends on the type of the data-generating process.
Given below is a list of related techniques and measures that various regression models incorporate:
● Cluster Analysis
● Experimental Simulation
● MTMM
● Conjoint Analysis
● Multi-normal distribution
● Correspondence analysis
● Factor analysis
● SEM (Structural Equation Modeling)
● Linear probability models
● R-squared or R2: Coefficient of determination
● LISREL
● Multiple discriminant analysis
● PCA (Principal Components Analysis)
Regression Analysis
Regression analysis is a statistical technique for analysing and comprehending the
connection between two or more variables of interest. The methodology used to do regression
analysis aids in understanding which elements are significant, which may be ignored, and how they
interact with one another.
● Regression is a statistical approach used in finance, investment, and other fields to identify
the strength and type of a connection between one dependent variable (typically represented
by Y) and a sequence of other variables (known as independent variables).
● Regression is essentially the "best guess" at utilising a collection of data to generate some
form of forecast. It is the process of fitting a set of points to a graph.
● Regression analysis is a mathematical method for determining which of those factors has an
effect. It provides answers to questions such as: Which factors matter most? Which can be
safely ignored? How do those factors interact with one another?
Types of Regression Analysis
1. Forecasting:
The most common use of regression analysis in business is for forecasting future
opportunities and threats. Demand analysis, for example, forecasts the amount of things a customer
is likely to buy.
When it comes to business, though, demand is not the only dependent variable. Regression
analysis can anticipate significantly more than just direct income.
2. CAPM:
The Capital Asset Pricing Model (CAPM), which establishes the link between an asset's
projected return and the related market risk premium, relies on the linear regression model.
3. Comparing with competition:
It may be used to compare a company's financial performance to that of a certain
counterpart. It may also be used to determine the relationship between two firms' stock prices (this
can be extended to find correlation between 2 competing companies, 2 companies operating in an
unrelated industry etc).
4. Identifying problems:
Regression is useful not just for providing factual evidence for management choices
but also for detecting judgement mistakes.
A retail store manager, for example, may assume that extending shopping hours will significantly
boost sales.
5. Reliable source
Many businesses and their top executives are now adopting regression analysis (and
other types of statistical analysis) to make better business decisions and reduce guesswork and gut
instinct.
Regression enables firms to take a scientific approach to management. Both small and large
enterprises are frequently bombarded with an excessive amount of data.
Regression Modelling
Regression is a method to mathematically formulate relationship between variables that in
due course can be used to estimate, interpolate and extrapolate. Suppose we want to estimate the
weight of individuals, which is influenced by height, diet, workout, etc. Here, Weight is
the predicted variable. Height, Diet, Workout are predictor variables.
Collect a sufficient number of data points. Use a suitable estimation technique to arrive at the
mathematical formula between the predicted and predictor variables. No model is perfect; hence,
give error bounds.
Finally, assess the model's stability by applying it to different samples of the same population.
When predictor variables are given for a new data point, estimate the predicted variable. If stable,
the model's accuracy should not decrease. This process is called model cross-validation.
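A minimal R sketch of this workflow, using the built-in women data set (height and weight of 15 women) in place of the full height/diet/workout example above; the train/test split is an arbitrary choice made only for the illustration:
model <- lm(weight ~ height, data = women)   # predicted variable: weight; predictor: height
summary(model)                               # parameters, error bounds and R-squared
# Estimate the predicted variable for a new data point (height = 66 inches)
predict(model, newdata = data.frame(height = 66), interval = "prediction")
# A very simple cross-validation: refit on a random subset and check accuracy on the rest
set.seed(1)
train <- sample(nrow(women), 10)
model_cv <- lm(weight ~ height, data = women[train, ])
mean((predict(model_cv, women[-train, ]) - women$weight[-train])^2)   # out-of-sample squared error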
I've heard of Least Squares. What's this and how is it related to regression?
The least squares regression line. Source: Sultana 2014, slide 6.
Least Squares is a term that signifies that the sum of the squares of the errors is at a minimum. The
error is defined as the difference between the observed value and the predicted value. The objective
of regression estimation is to produce the least squared error. When a model fits the training data so
closely that the error approaches zero yet fails to generalize, we term it overfitting.
The Least Squares Method provides linear equations with unknowns that can be solved for any
given data. The unknowns are the regression parameters. The linear equations are called the Normal
Equations. The normal equations are derived using calculus to minimize the squared errors.
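The normal equations can be written and solved directly; this is a minimal R sketch that reproduces the lm() coefficients for the women data used in the previous sketch (chosen only as convenient example data):
X <- cbind(1, women$height)                 # design matrix with an intercept column
y <- women$weight
# Normal equations: (X'X) b = X'y, so b = (X'X)^(-1) X'y
b <- solve(t(X) %*% X, t(X) %*% y)
b                                           # intercept and slope that minimise the squared errors
coef(lm(weight ~ height, data = women))     # lm() solves the same equations internally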
● Could you explain the difference between interpolation and extrapolation w.r.t. regression?
We can interpolate missing data points using regression. For instance, if we want to estimate
height given weight and the data collection process missed out certain weights, we can use
regression to interpolate. This missing data can be estimated by other means too; missing data
estimation is called imputation.
The height and weight data is bound by nature and can be sourced. Say, we want to estimate future
weight of an individual given historical weight variations of the individual. This is extrapolation. In
regression, we call it forecasting. This is solved using a distinct set of techniques called as Time
Series Regression.
48
What's the right interpretation of correlation coefficient?
Different samples with same correlation coefficient although their regression lines may differ.
Source: Stanton 2001, fig. 2.
Correlation coefficient r is a measure of linear association strength. It doesn't quantify non-linearity.
The square of the correlation coefficient, r², gives the proportion of variation explained: a correlation
coefficient of 0.8 means r² = 0.64, i.e., 64% of the variation in one variable is explained by variation
in the other variable. For example, 64% of the variation in rainfall is explained by the number of
trees; the rest is due to factors other than the number of trees.
It will be apparent from the formula that r factors in the sample variance. On an X-Y scatterplot, the
regression line may have different slopes due to different sample variance even when all of the
samples share the same correlation coefficient. In other words, r is not simply the slope of the
regression line.
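A minimal R sketch showing that r is related to, but not the same as, the regression slope (the women data is again used purely as example data):
x <- women$height; y <- women$weight
r <- cor(x, y)                        # correlation coefficient
slope <- coef(lm(y ~ x))[2]           # slope of the least squares regression line
c(r = r, slope = unname(slope))       # two different numbers
r * sd(y) / sd(x)                     # slope = r * sd(y)/sd(x): the spread of the sample also matters
r^2                                   # proportion of variation explained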
Could you give examples of non-linear correlation?
Illustrating linear, non-linear and no correlation types. Source: Johnivan 2011.
A non-linear correlation is where the relationship between the variables cannot be expressed by a
straight line. We call this relationship curvilinear.
A non-linear relationship can exhibit monotonic positive, monotonic negative, or both patterns
together.
How can we do data analysis when relationships are non-linear?
If the relationship is non-linear but not curvilinear, we can split the data into distinct
segments. Data within some segments may be linear. In other segments, if it's curvilinear,
transformations can be applied to make them linear. Analysis is thus segment-wise, sometimes
called segmented regression. As an example, yield of mustard is not affected by soil salinity for
low values. For salinity above a threshold, there's a negative linear relation. This dataset can be
segmented at the threshold.
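A minimal R sketch of segmented regression on simulated yield/salinity data; the threshold of 6 and every other number below are made up purely for the illustration:
set.seed(7)
salinity <- runif(200, min = 0, max = 12)
# Flat response below the threshold, negative linear response above it, plus noise
yield <- 50 - 4 * pmax(salinity - 6, 0) + rnorm(200, sd = 2)
threshold <- 6                                      # assumed change point
seg <- lm(yield ~ pmax(salinity - threshold, 0))    # piecewise (segmented) linear fit
coef(seg)                                           # intercept near 50, slope beyond the threshold near -4
plot(salinity, yield)
lines(sort(salinity), fitted(seg)[order(salinity)], col = "red")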
One study about college education showed a positive correlation between SAT scores of
incoming students and their earnings when they graduate. Moreover, we can state that graduating
from elite colleges (high SAT scores) had a role in higher salaries.
Here are some possible examples of causality. High scores lead to higher earnings. Regular exercise
results in better health. Current season influences power consumption. All pairs of variables that
have causal relationship will exhibit significant correlation.
An example where correlation does not imply causation. Source: Stark 2017.
If two variables are correlated, does that imply causation?
No. Sometimes correlations are purely coincidental. For example, non-commercial space
launches and sociology doctorates awarded are completely unrelated but the image shows them to
be strongly correlated. This is called a Spurious Correlation. This is a clear case where correlation
does not imply causation.
Another example is when ice cream sales are positively correlated with violent crime.
However, violent crime is not caused by ice cream sales. It so happens that there's a confounding
variable, which in this case is weather. Hot weather influences both ice cream sales and violent
crimes. It's therefore obvious that correlation shouldn't be mistaken for causation. Look at the
physical mechanism causing
such a relationship. For example, is rain driving the sale of your product? Data may show a
correlation. It need not be causal unless your product is an umbrella. However, proving causality is
hard. At best, we can do randomized trials to establish causality.
Regression is a useful tool in either predictive or causal analysis. With the growth of Big Data, it's
being used more often for predictive analysis.
● For the base model, the sum of squared deviations of the actual value Y from the mean value E(Y)
is referred to as the Total Variance or SST (Total Sum of Squares):
SST = Σ_{i=1}^{n} (y_i − ȳ)²
● For the regression model, the sum of squared deviations of the estimated value Ŷ from the mean
value E(Y) is referred to as the Explained Variance or SSR (Regression Sum of Squares):
SSR = Σ_{i=1}^{n} (ŷ_i − ȳ)²
● The accuracy of the model is called R-Squared:
R² = Explained Variance / Total Variance = SSR / SST
The higher the R², the larger the explained variance and the lower the unexplained variance. Hence,
a higher R² value is desired. For example, if R² = 0.8, 80% of the variation in the data is explained
by the model.
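These quantities can be computed explicitly; a minimal R sketch using the simple model on the women data from the earlier examples:
model <- lm(weight ~ height, data = women)
y <- women$weight
sst <- sum((y - mean(y))^2)               # total sum of squares (total variance)
ssr <- sum((fitted(model) - mean(y))^2)   # regression sum of squares (explained variance)
ssr / sst                                 # R-squared = explained / total
summary(model)$r.squared                  # matches the value reported by lm()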
What are some challenges with regression and how to overcome them?
High multicollinearity is a challenge. It basically means one or more independent variables are
highly linearly dependent on another independent variable. This makes it difficult to estimate the
coefficients. One possible solution is to increase the sample size.
Regression assumes that errors from one observation are not related to other observations. This is
often not true with time series data. Autocorrelated errors are therefore a challenge. One
approach is to estimate the pattern in the errors and refine the regression model.
Another problem is overfitting that occurs when the model is "too well-trained". Such a model will
not fit any other data. Regularization is the technique used to avoid overfitting. For parametric
models, there are regression routines that address overfitting concerns. Lasso regression and ridge
regression are a couple of such routines.
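As a hedged sketch of how ridge and lasso regression are commonly run in R: this assumes the third-party glmnet package is installed (it is not part of the course texts), and uses the built-in mtcars data purely as example data:
library(glmnet)                          # assumed to be installed separately
x <- as.matrix(mtcars[, c("hp", "wt", "disp", "drat")])   # example predictor matrix
y <- mtcars$mpg
ridge <- cv.glmnet(x, y, alpha = 0)      # alpha = 0 gives ridge regression
lasso <- cv.glmnet(x, y, alpha = 1)      # alpha = 1 gives lasso regression
coef(ridge, s = "lambda.min")            # shrunken coefficients at the cross-validated penalty
coef(lasso, s = "lambda.min")            # lasso may set some coefficients exactly to zero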
UNIT III
Big Data processing:
Big Data technologies, Introduction to Google file system, Hadoop Architecture, Hadoop Storage:
HDFS, Common Hadoop Shell commands, NameNode, Secondary NameNode, and DataNode,
Hadoop MapReduce paradigm, Map Reduce tasks, Job, Task trackers, Introduction to NOSQL,
Textual ETL processing.
Big data is a field that treats ways to analyze, systematically extract information from, or otherwise
deal with data sets that are too large or complex to be dealt with by traditional data-
processing application software. Data with many fields (columns) offer greater statistical power,
while data with higher complexity (more attributes or columns) may lead to a higher false discovery
rate.[2] Big data analysis challenges include capturing data, data storage, data analysis,
search, sharing, transfer, visualization, querying, updating, information privacy, and data source.
Big data was originally associated with three key concepts: volume, variety, and velocity.[3] The
analysis of big data presents challenges in sampling, which previously allowed only for
observations and samples. Therefore, big data often includes data with sizes that exceed the
capacity of traditional software to process within an acceptable time and value.
Current usage of the term big data tends to refer to the use of predictive analytics, user
behavior analytics, or certain other advanced data analytics methods that extract value from big
data, and seldom to a particular size of data set. "There is little doubt that the quantities of data now
available are indeed large, but that's not the most relevant characteristic of this new data
ecosystem."[4] Analysis of data sets can find new correlations to "spot business trends, prevent
diseases, combat crime and so on".[5] Scientists, business executives, medical practitioners,
advertising and governments alike regularly meet difficulties with large data-sets in areas
including Internet searches, fintech, healthcare analytics, geographic information systems, urban
informatics, and business informatics. Scientists encounter limitations in e-Science work,
including meteorology, genomics, connectomics, complex physics simulations, biology, and
environmental research.[7]
The size and number of available data sets has grown rapidly as data is collected by devices
such as mobile devices, cheap and numerous information-sensing Internet of things devices, aerial
(remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID)
readers and wireless sensor networks.[8][9] The world's technological per-capita capacity to store
information has roughly doubled every 40 months since the 1980s;[10] as of 2012, every day
2.5 exabytes (2.5×2^60 bytes) of data are generated.[11] Based on an IDC report prediction, the global
data volume was predicted to grow exponentially from 4.4 zettabytes to 44 zettabytes between 2013
and 2020. By 2025, IDC predicts there will be 163 zettabytes of data. [12] One question for large
enterprises is determining who should own big-data initiatives that affect the entire organization.[13]
The term big data has been in use since the 1990s, with some giving credit to John
Mashey for popularizing the term. Big data usually includes data sets with sizes beyond the ability
of commonly used software tools to capture, curate, manage, and process data within a tolerable
elapsed time.[18] Big data philosophy encompasses unstructured, semi-structured and structured
data, however the main focus is on unstructured data.[19] Big data "size" is a constantly moving
target; as of 2012 ranging from a few dozen terabytes to many zettabytes of data. Big data requires
a set of techniques and technologies with new forms of integration to reveal insights from data-
sets that are diverse, complex, and of a massive scale.
"Variety", "veracity", and various other "Vs" are added by some organizations to describe it, a
revision challenged by some industry authorities. The Vs of big data were often referred to as the
"three Vs", "four Vs", and "five Vs". They represented the qualities of big data in volume, variety,
velocity, veracity, and value.[3] Variability is often included as an additional quality of big data.
A 2018 definition states "Big data is where parallel computing tools are needed to handle data", and
notes, "This represents a distinct and clearly defined change in the computer science used, via
parallel programming theories, and losses of some of the guarantees and capabilities made
by Codd's relational model."
In a comparative study of big datasets, Kitchin and McArdle found that none of the commonly
considered characteristics of big data appear consistently across all of the analyzed cases. [24] For this
reason, other studies identified the redefinition of power dynamics in knowledge discovery as the
defining trait.[25] Instead of focusing on intrinsic characteristics of big data, this alternative
perspective pushes forward a relational understanding of the object claiming that what matters is
the way in which data is collected, stored, made available and analyzed.
Big data vs. business intelligence
The growing maturity of the concept more starkly delineates the difference between "big data" and
"business intelligence":
Business intelligence uses applied mathematics tools and descriptive statistics with data with high
information density to measure things, detect trends, etc.
Big data uses mathematical analysis, optimization, inductive statistics, and concepts from nonlinear
system identification[27] to infer laws (regressions, nonlinear relationships, and causal effects) from
large sets of data with low information density[28] to reveal relationships and dependencies, or to
perform predictions of outcomes and behaviors.
Characteristics
Figure – Growth of big data's primary characteristics: volume, velocity, and variety
Big data can be described by the following characteristics:
Volume
The quantity of generated and stored data. The size of the data determines the value and potential
insight, and whether it can be considered big data or not. The size of big data is usually larger than
terabytes and petabytes.
Variety
The type and nature of the data. Earlier technologies like RDBMSs were capable of handling
structured data efficiently and effectively. However, the change in type and nature from structured
to semi-structured or unstructured challenged the existing tools and technologies. Big data
technologies evolved with the prime intention of capturing, storing, and processing semi-structured and
unstructured data (variety) generated at high speed (velocity) and huge in size (volume). Later,
these tools and technologies were also explored and used for handling structured data, mainly for
storage, while the processing of structured data remained optional, using either big data
technologies or traditional RDBMSs. This helps in analyzing data to make effective use of the hidden
insights exposed from the data collected via social media, log files, sensors, etc. Big data draws
from text, images, audio, video; plus it completes missing pieces through data fusion.
Velocity
The speed at which the data is generated and processed to meet the demands and challenges that lie
in the path of growth and development. Big data is often available in real-time. Compared to small
data, big data is produced more continually. Two kinds of velocity related to big data are the
frequency of generation and the frequency of handling, recording, and publishing.[31]
Veracity
The truthfulness or reliability of the data, which refers to the data quality and the data value. [32] Big
data must not only be large in size, but also must be reliable in order to achieve value in the analysis
of it. The data quality of captured data can vary greatly, affecting an accurate analysis.[33]
Value
The worth in information that can be achieved by the processing and analysis of large datasets.
Value also can be measured by an assessment of the other qualities of big data. [34] Value may also
represent the profitability of information that is retrieved from the analysis of big data.
Variability
The characteristic of the changing formats, structure, or sources of big data. Big data can include
structured, unstructured, or combinations of structured and unstructured data. Big data analysis may
integrate raw data from multiple sources. The processing of raw data may also involve
transformations of unstructured data to structured data.
Other possible characteristics of big data are:[35]
Exhaustive
Whether the entire system (i.e., n=all) is captured or recorded or not. Big data may or may not
include all the available data from sources.
Fine-grained and uniquely lexical
Respectively, the proportion of specific data of each element per element collected and if the
element and its characteristics are properly indexed or identified.
Relational
If the data collected contains common fields that would enable a conjoining, or meta-analysis, of
different data sets.
Extensional
If new fields in each element of the data collected can be added or changed easily.
Scalability
If the size of the big data storage system can expand rapidly.
Architecture
Big data repositories have existed in many forms, often built by corporations with a special need.
Commercial vendors historically offered parallel database management systems for big data
beginning in the 1990s. For many years, WinterCorp published the largest database report.
Teradata Corporation in 1984 marketed the parallel processing DBC 1012 system. Teradata systems
were the first to store and analyze 1 terabyte of data in 1992. Hard disk drives were 2.5 GB in 1991,
so the definition of big data continuously evolves according to Kryder's law. Teradata installed the
first petabyte-class RDBMS-based system in 2007. As of 2017, there are a few dozen petabyte-class
Teradata relational databases installed, the largest of which exceeds 50 PB. Systems up until 2008
were 100% structured relational data. Since then, Teradata has added unstructured data types
including XML, JSON, and Avro.
In 2000, Seisint Inc. (now LexisNexis Risk Solutions) developed a C++-based distributed platform
for data processing and querying known as the HPCC Systems platform. This system automatically
partitions, distributes, stores and delivers structured, semi-structured, and unstructured data across
multiple commodity servers. Users can write data processing pipelines and queries in a declarative
dataflow programming language called ECL. Data analysts working in ECL are not required to
define data schemas upfront and can rather focus on the particular problem at hand, reshaping data
in the best possible manner as they develop the solution.
In 2004, LexisNexis acquired Seisint Inc.[37] and their high-speed parallel processing platform and
successfully used this platform to integrate the data systems of Choicepoint Inc. when they acquired
that company in 2008.[38] In 2011, the HPCC systems platform was open-sourced under the Apache
v2.0 License.
CERN and other physics experiments have collected big data sets for many decades, usually
analyzed via high-throughput computing rather than the map-reduce architectures usually meant by
the current "big data" movement.
In 2004, Google published a paper on a process called MapReduce that uses a similar architecture.
The MapReduce concept provides a parallel processing model, and an associated implementation
was released to process huge amounts of data. With MapReduce, queries are split and distributed
across parallel nodes and processed in parallel (the "map" step). The results are then gathered and
delivered (the "reduce" step). The framework was very successful,[39] so others wanted to replicate
the algorithm. Therefore, an implementation of the MapReduce framework was adopted by an
Apache open-source project named "Hadoop".
Apache Spark was developed in 2012 in response to limitations in the MapReduce paradigm, as it
adds the ability to set up many operations (not just map followed by reduce).[40]
MIKE2.0 is an open approach to information management that acknowledges the need for revisions
due to big data implications identified in an article titled "Big Data Solution Offering".[41] The
methodology addresses handling big data in terms of useful permutations of data sources,
complexity in interrelationships, and difficulty in deleting (or modifying) individual records.[42]
Studies in 2012 showed that a multiple-layer architecture was one option to address the issues that
big data presents. A distributed parallel architecture distributes data across multiple servers; these
parallel execution environments can dramatically improve data processing speeds. This type of
architecture inserts data into a parallel DBMS, which implements the use of MapReduce and
Hadoop frameworks. This type of framework looks to make the processing power transparent to the
end-user by using a front-end application server.[43]
The data lake allows an organization to shift its focus from centralized control to a shared model to
respond to the changing dynamics of information management. This enables quick segregation of
data into the data lake, thereby reducing the overhead time.
Big data has increased the demand for information management specialists so much so that Software
AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, HP, and Dell have spent more than
$15 billion on software firms specializing in data management and analytics. In 2010, this industry
was worth more than $100 billion and was growing at almost 10 percent a year: about twice as fast
as the software business as a whole.[5]
Developed economies increasingly use data-intensive technologies. There are 4.6 billion mobile-phone
subscriptions worldwide, and between 1 billion and 2 billion people accessing the internet.[5]
Between 1990 and 2005, more than 1 billion people worldwide entered the middle class, which
means more people became more literate, which in turn led to information growth. The world's
effective capacity to exchange information through telecommunication networks was
281 petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes in 2000, 65 exabytes in 2007[10] and
predictions put the amount of internet traffic at 667 exabytes annually by 2014.[5] According to one
estimate, one-third of the globally stored information is in the form of alphanumeric text and still
image data,[55] which is the format most useful for most big data applications. This also shows the
potential of yet unused data (i.e. in the form of video and audio content).
While many vendors offer off-the-shelf products for big data, experts promote the development of
in-house custom-tailored systems if the company has sufficient technical capabilities.
The use and adoption of big data within governmental processes allows efficiencies in terms of
cost, productivity, and innovation,[57] but does not come without its flaws. Data analysis often
requires multiple parts of government (central and local) to work in collaboration and create new
and innovative processes to deliver the desired outcome. A common government organization that
makes use of big data is the National Security Agency (NSA), which constantly monitors the
activities of the Internet in search of potential patterns of suspicious or illegal activity its systems
may pick up.
Civil registration and vital statistics (CRVS) collects all certificate statuses from birth to death.
CRVS is a source of big data for governments.
International development
Research on the effective usage of information and communication technologies for development
(also known as "ICT4D") suggests that big data technology can make important contributions but
also present unique challenges to international development.[58][59] Advancements in big data
analysis offer cost-effective opportunities to improve decision-making in critical development areas
such as health care, employment, economic productivity, crime, security, and natural disaster and
resource management.[60][61][62] Additionally, user-generated data offers new opportunities to give the
unheard a voice.[63] However, longstanding challenges for developing regions such as inadequate
technological infrastructure and economic and human resource scarcity exacerbate existing
concerns with big data such as privacy, imperfect methodology, and interoperability issues. [60] The
challenge of "big data for development" [60] is currently evolving toward the application of this data
through machine learning, known as "artificial intelligence for development" (AI4D).[64]
Benefits
A major practical application of big data for development has been "fighting poverty with data".[65]
In 2015, Blumenstock and colleagues estimated poverty and wealth from mobile phone
metadata,[66] and in 2016 Jean and colleagues combined satellite imagery and machine
learning to predict poverty. Using digital trace data to study the labor market and the digital
economy in Latin America, Hilbert and colleagues argue that digital trace data has several benefits
such as:
Thematic coverage: including areas that were previously difficult or impossible to measure
Geographical coverage: our international sources provided sizable and comparable data for almost
all countries, including many small countries that usually are not included in international
inventories
Level of detail: providing fine-grained data with many interrelated variables, and new aspects, like
network connections
Timeliness and timeseries: graphs can be produced within days of being collected
Challenges
At the same time, working with digital trace data instead of traditional survey data does not
eliminate the traditional challenges involved when working in the field of international quantitative
analysis. Priorities change, but the basic discussions remain the same. Among the main challenges
are:
Representativeness.
While traditional development statistics is mainly concerned with the representativeness of random
survey samples, digital trace data is never a random sample.
Generalizability.
While observational data always represents this source very well, it only represents what it
represents, and nothing more. While it is tempting to generalize from specific observations of one
platform to broader settings, this is often very deceptive.
Harmonization.
Digital trace data still requires international harmonization of indicators. It adds the challenge of so-
called "data-fusion", the harmonization of different sources.
Data overload.
Analysts and institutions are not used to dealing effectively with a large number of variables, which
is done efficiently with interactive dashboards. Practitioners still lack a standard workflow that
would allow researchers, users and policymakers to work with the data efficiently and effectively.[68]
Healthcare
Big data analytics has been used in healthcare to provide personalized medicine and
prescriptive analytics, clinical risk intervention and predictive analytics, reduction of waste and care
variability, automated external and internal reporting of patient data, standardized medical terms, and
patient registries. Some areas of improvement are more aspirational than actually implemented.
The level of data generated within healthcare systems is not trivial. With the added adoption
of mHealth, eHealth and wearable technologies the volume of data will continue to increase. This
includes electronic health record data, imaging data, patient generated data, sensor data, and other
forms of difficult to process data. There is now an even greater need for such environments to pay
greater attention to data and information quality.
Big data in health research is particularly promising in terms of exploratory biomedical
research, as data-driven analysis can move forward more quickly than hypothesis-driven research.
Then, trends seen in data analysis can be tested in traditional, hypothesis-driven follow up
biological research and eventually clinical research.
Big Data Technologies
Big data technology, and Hadoop in particular, may sound like a buzzword, but it addresses a real
need. As the volume of data generated by every industry and domain has increased enormously, it
has become very important to establish efficient techniques that take care of the needs and
requirements of the clients and large industries responsible for generating that data. Earlier, data
was handled with conventional programming languages and simple structured query language, but
these systems and tools are no longer adequate for big data.
Big data technology is defined as the technology and software utilities designed for the analysis,
processing, and extraction of information from extremely large and complex data sets that
traditional systems find very difficult to deal with. Big data technology is used to handle both
real-time and batch data. Machine learning has become a very critical component of everyday life
and of every industry, and therefore managing data through big data technologies has become very
important.
Types of Big Data Technologies
Before starting with the list of technologies let us first see the broad classification of all these
technologies.
They can mainly be classified into four domains:
Data storage
Analytics
Data mining
Visualization
Hadoop:
When it comes to big data, Hadoop is the first technology that comes into play. It is based on the
MapReduce architecture and helps in processing batch-related jobs and batch information. It was
designed to store and process data in a distributed data processing environment using commodity
hardware and a simple programming execution model. It can be used to store and analyze data
present on various different machines with high storage, speed, and low cost. Hadoop forms one of
the main core components of big data technology; it was developed under the Apache Software
Foundation (version 1.0 was released in 2011) and is written in Java.
MongoDB:
Another very essential and core component of big data technology in terms of storage is the
MongoDB NoSQL database. Being a NoSQL database, the relational properties and other
RDBMS-related properties do not apply to it, and it differs from traditional RDBMS databases that
use structured query language. It stores data as flexible, schema-free documents, so the structure of
data storage is different, which makes it helpful for holding large amounts of data. It is a
cross-platform, document-oriented database program that makes use of JSON-like documents with
optional schemas. This makes it a very useful operational data store in the majority of financial
institutions, where it is working to replace traditional mainframes. MongoDB handles flexibility and
a wide variety of data types at high volumes and across distributed architectures.
Hunk:
Hunk is useful for accessing data in remote Hadoop clusters through virtual indexes, and it uses the
Splunk Search Processing Language for the analysis of data. Hunk can be used to report on and
visualize huge amounts of data from Hadoop and NoSQL databases and sources. It was developed
by Splunk in the year 2013 and is written in Java.
Cassandra:
Cassandra is a top choice among popular NoSQL databases. It is a free and open-source distributed
database with wide-column storage that can efficiently handle data on large commodity clusters;
it is used to provide high availability with no single point of failure. Its main features include a
distributed nature, scalability, a fault-tolerant mechanism, MapReduce support, tunable consistency,
its own query language (CQL), support for multi-data-center replication, and eventual consistency.
Google File System (GFS)
GFS is enhanced for Google's core data storage and usage needs (primarily the search engine),
which can generate enormous amounts of data that must be retained; Google File System grew out
of an earlier Google effort, "BigFiles", developed by Larry Page and Sergey Brin in the early days
of Google, while it was still located in Stanford. Files are divided into fixed-size chunks of
64 megabytes, similar to clusters or sectors in regular file systems, which are only extremely rarely
overwritten, or shrunk; files are usually appended to or read. It is also designed and optimized to
run on Google's computing clusters, dense nodes which consist of cheap "commodity" computers,
which means precautions must be taken against the high failure rate of individual nodes and the
subsequent data loss. Other design decisions select for high data throughputs, even when it comes
at the cost of latency.
A GFS cluster consists of multiple nodes. These nodes are divided into two types: one Master node
and multiple Chunkservers. Each file is divided into fixed-size chunks. Chunkservers store these
chunks. Each chunk is assigned a globally unique 64-bit label by the master node at the time of
creation, and logical mappings of files to constituent chunks are maintained. Each chunk is
replicated several times throughout the network. By default, it is replicated three times, but this is
configurable.[3] Files which are in high demand may have a higher replication factor, while files for
which the application client uses strict storage optimizations may be replicated less than three times
- in order to cope with quick garbage cleaning policies.[3]
The Master server does not usually store the actual chunks, but rather all
the metadata associated with the chunks, such as the tables mapping the 64-bit labels to chunk
locations and the files they make up (mapping from files to chunks), the locations of the copies of
the chunks, what processes are reading or writing to a particular chunk, or taking a "snapshot" of
the chunk in order to replicate it (usually at the instigation of the Master server when, due to node
failures, the number of copies of a chunk has fallen beneath the set number). All this metadata is
kept current by the Master server periodically receiving updates from each chunk server ("Heart-
beat messages").
Programs access the chunks by first querying the Master server for the locations of the
desired chunks; if the chunks are not being operated on (i.e. no outstanding leases exist), the Master
replies with the locations, and the program then contacts and receives the data from the chunkserver
directly (similar to Kazaa and its supernodes).
Unlike most other file systems, GFS is not implemented in the kernel of an operating system, but is
instead provided as a userspace library.[4]
Interface
The Google File System does not provide a POSIX interface.[5] Files are organized hierarchically in
directories and identified by pathnames. The file operations such as create, delete, open, close, read,
write are supported. It supports Record Append which allows multiple clients to append data to the
same file concurrently and atomicity is guaranteed.
Performance
Judging from benchmarking results,[3] when used with a relatively small number of servers (15), the
file system achieves reading performance comparable to that of a single disk (80–100 MB/s), but
has a reduced write performance (30 MB/s), and is relatively slow (5 MB/s) in appending data to
existing files. The authors present no results on random seek time. As the master node is not
directly involved in data reading (the data are passed from the chunk server directly to the reading
client), the read rate increases significantly with the number of chunk servers, achieving 583 MB/s
for 342 nodes. Aggregating multiple servers also allows big capacity, while it is somewhat reduced
by storing data in three independent locations (to provide redundancy).
HDFS architecture
HDFS can be presented as a master/slave architecture. The HDFS master is named the
NameNode, whereas the slave is named the DataNode. The NameNode is a server that manages the
filesystem namespace and regulates access (open, close, rename, and more) to files by the client.
It divides the input data into blocks and announces which data block will be stored on which
DataNode. The DataNode is a slave machine that stores the replicas of the partitioned dataset and
serves the data as requests come in. It also performs block creation and deletion.
The internal mechanism of HDFS divides a file into one or more blocks, and these blocks are
stored on a set of DataNodes. Under normal circumstances, with a replication factor of three, the
HDFS strategy is to place the first copy on the local node, the second copy on a different node in
the local rack, and the third copy on a node in a different rack. As HDFS is designed to support
large files, the HDFS block size is defined as 64 MB. If required, this can be increased.
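To make these numbers concrete, the small sketch below computes how many 64 MB blocks a file occupies and how many block replicas HDFS stores for it under a replication factor of three. The 200 MB file size is hypothetical.

// Sketch: number of HDFS blocks and stored replicas for one file,
// assuming the 64 MB block size and replication factor of three described above.
public class HdfsBlockCount {
    public static void main(String[] args) {
        long fileSizeMb = 200;      // hypothetical file size
        long blockSizeMb = 64;      // default HDFS block size mentioned above
        int replicationFactor = 3;  // first copy local, second on the local rack, third on another rack

        long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb; // ceiling division -> 4 blocks
        long replicas = blocks * replicationFactor;                 // 4 x 3 = 12 stored block replicas

        System.out.println("Blocks: " + blocks + ", total block replicas stored: " + replicas);
    }
}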
Understanding HDFS components
HDFS is managed with the master-slave architecture included with the
following components:
• NameNode: This is the master of the HDFS system. It maintains the
directories, files, and manages the blocks that are present on the DataNodes.
• DataNode: These are slaves that are deployed on each machine and provide
actual storage. They are responsible for serving read-and-write data requests
for the clients.
• Secondary NameNode: This is responsible for performing periodic
checkpoints. So, if the NameNode fails at any time, it can be replaced with
a snapshot image stored by the secondary NameNode checkpoints.
MapReduce architecture
MapReduce is also implemented over a master-slave architecture. Classic MapReduce contains job
submission, job initialization, task assignment, task execution, progress and status updates, and job
completion-related activities, which are mainly managed by the JobTracker node and executed by
TaskTrackers. A client application submits a job to the JobTracker. The input is then divided across
the cluster. The JobTracker calculates the number of map and reduce tasks to be processed and
commands the TaskTrackers to start executing the job. The TaskTracker then copies the resources
to a local machine and launches a JVM to run the Map and Reduce programs over the data.
Along with this, the TaskTracker periodically sends updates to the JobTracker, which can be
considered a heartbeat that helps to update the JobID, job status, and usage of resources.
Hadoop is a top-level Apache project and is a very complicated Java framework.
To avoid technical complications, the Hadoop community has developed a number
of Java frameworks that have added extra value to Hadoop features. They are
considered Hadoop subprojects. Here, we are going to discuss several Hadoop
components that can be considered as abstractions of HDFS or MapReduce.
The following are the components of Hadoop that are responsible for performing
analytics over Big Data:
• Client: This initializes the job
• JobTracker: This monitors the job
• TaskTracker: This executes the job
• HDFS: This stores the input and output data
The four main stages of Hadoop MapReduce data processing are as follows:
• The loading of data into HDFS
• The execution of the Map phase
• Shuffling and sorting
• The execution of the Reduce phase
Loading data into HDFS
The input dataset needs to be uploaded to the Hadoop directory so it can be used by
MapReduce nodes. Then, Hadoop Distributed File System (HDFS) will divide the
input dataset into data splits and store them to DataNodes in a cluster by taking care
of the replication factor for fault tolerance. All the data splits will be processed by
TaskTracker for the Map and Reduce tasks in a parallel manner.
Also, there are some alternative ways to get the dataset in HDFS with
Hadoop components:
• Sqoop: This is an open source tool designed for efficiently transferring bulk
data between Apache Hadoop and structured, relational databases. Suppose
your application has already been configured with the MySQL database
and you want to use the same data for performing data analytics, Sqoop is
recommended for importing datasets to HDFS. Also, after the completion
of the data analytics process, the output can be exported to the MySQL
database.
• Flume: This is a distributed, reliable, and available service for efficiently
collecting, aggregating, and moving large amounts of log data to HDFS.
Flume is able to read data from most sources, such as logfiles, sys logs, and
the standard output of the Unix process.
Using the preceding data collection and movement frameworks can make the data transfer process
very easy for MapReduce applications used for data analytics.
Map phase execution
The list of (key, value) pairs is generated such that the key attribute will be repeated many times,
so that the key attribute can be re-used in the Reducer for aggregating values in MapReduce. As far
as format is concerned, the Mapper output value format and the Reducer input value format must be
the same.
After the completion of this Map operation, the TaskTracker will keep the result in its buffer
storage and local disk space (if the output data size is more than the threshold).
For example, suppose we have a Map function that converts the input text into
lowercase. This will convert the list of input strings into a list of lowercase strings.
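A minimal sketch of such a lowercase Mapper, written against the same old org.apache.hadoop.mapred API that the example program later in this unit uses (the class name LowerCaseMapper is illustrative, not part of any Hadoop distribution):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits each input line converted to lowercase, keyed by the line's byte offset.
public class LowerCaseMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, Text> {
    public void map(LongWritable key, Text value,
            OutputCollector<LongWritable, Text> output, Reporter reporter) throws IOException {
        output.collect(key, new Text(value.toString().toLowerCase()));
    }
}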
Reducing phase execution
As soon as the Mapper output is available, the TaskTracker on the Reducer node will retrieve the
available partitioned Map output data, which will be grouped together and merged into one large
file that is then assigned to a process with a Reducer method. Finally, this data is sorted before it is
provided to the Reducer method.
The Reducer method receives a list of input values from an input (key, list
(value)) and aggregates them based on custom logic, and produces the output
(key, value) pairs.
The output of the Reducer method of the Reduce phase will directly be written into
HDFS as per the format specified by the MapReduce job configuration class.
MapReduce objects
MapReduce operations in Hadoop are carried out mainly by three objects: Mapper, Reducer, and
Driver.
• Mapper:
This is designed for the Map phase of MapReduce, which starts MapReduce operations by
carrying input files and splitting them into several pieces. For each piece, it will emit a key-value
data pair as the output value.
• Reducer:
This is designed for the Reduce phase of a MapReduce job; it accepts key-based grouped
data from the Mapper output, reduces it by aggregation logic, and emits the (key, value) pair for the
group of values.
• Driver:
This is the main file that drives the MapReduce process. It starts the execution of
MapReduce tasks after getting a request from the client application with parameters. The Driver file
is responsible for building the configuration of a job and submitting it to the Hadoop cluster. The
Driver code will contain the main() method that accepts arguments from the command line. The
input and output directory of the Hadoop MapReduce job will be accepted by this program. Driver
is the main file for defining job configuration details, such as the job name, job input format, job
output format, and the Mapper, Combiner, Partitioner, and Reducer classes. MapReduce is
initialized by calling this main() function of the Driver class.
Map reduce data flow
The two APIs available for Hadoop MapReduce are: New (Hadoop 1.x and 2.x) and Old
Hadoop (0.20). YARN is the next generation of Hadoop MapReduce and the new Apache Hadoop
subproject that has been released for Hadoop resource management.
Hadoop data processing includes several tasks that help achieve the final output from an input
dataset. These tasks are as follows:
1. Preloading data in HDFS.
2. Running MapReduce by calling Driver.
3. Reading of the input data by the Mappers, which results in the splitting of the data, the execution
of the Mapper custom logic, and the generation of intermediate key-value pairs.
4. Executing the Combiner and the shuffle phase to optimize the overall Hadoop MapReduce process.
5. Sorting and providing the intermediate key-value pairs to the Reduce phase. The Reduce phase is
then executed: Reducers take these partitioned key-value pairs and aggregate them based on the
Reducer logic.
6. The final output data is stored in HDFS.
Here, Map and Reduce tasks can be defined for several data operations as follows:
• Data extraction
• Data loading
• Data segmentation
• Data cleaning
• Data transformation
• Data integration
Common Hadoop Shell commands
ls: This command is used to list all the files. Use lsr for recursive approach. It is useful when we
want a hierarchy of a folder.
bin/hdfs dfs -ls <path>
mkdir:
To create a directory. In Hadoop dfs there is no home directory by default. So let’s first create it.
Syntax:
bin/hdfs dfs -mkdir <folder name>
Creating a home directory:
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/username    (replace username with the username of your computer)
copyFromLocal (or) put: To copy files/folders from local file system to hdfs store. This is the
most important command. Local filesystem means the files present on the OS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
copyToLocal (or) get: To copy files/folders from hdfs store to local file system.
Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
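The same operations can also be performed programmatically. Below is a minimal sketch using the Hadoop FileSystem Java API that mirrors the mkdir, copyFromLocal, and copyToLocal commands above; the paths are hypothetical and the cluster configuration is assumed to be available on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalent {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml if present
        FileSystem fs = FileSystem.get(conf);

        fs.mkdirs(new Path("/user/username"));                          // like: bin/hdfs dfs -mkdir
        fs.copyFromLocalFile(new Path("/home/hadoop/sample.txt"),
                new Path("/user/username/sample.txt"));                 // like: -copyFromLocal (put)
        fs.copyToLocalFile(new Path("/user/username/sample.txt"),
                new Path("/home/hadoop/sample_copy.txt"));              // like: -copyToLocal (get)

        fs.close();
    }
}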
• NameNode: This is the master of the HDFS system. It maintains the directories, files, and manages
the blocks that are present on the DataNodes.
• DataNode: These are slaves that are deployed on each machine and provide actual storage. They
are responsible for serving read-and-write data requests for the clients.
• Secondary NameNode: This is responsible for performing periodic checkpoints. So, if the
NameNode fails at any time, it can be replaced with a snapshot image stored by the secondary
NameNode checkpoints.
HADOOP MAPREDUCE PARADIGM
MapReduce is a processing technique and a programming model for distributed computing based
on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map
takes a set of data and converts it into another set of data, where individual elements are broken
down into tuples (key/value pairs). Second, the reduce task takes the output from a map as an
input and combines those data tuples into a smaller set of tuples. As the name MapReduce implies,
the reduce task is always performed after the map job.
The Algorithm
Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
Map stage − The map or mapper’s job is to process the input data. Generally the input data is in the
form of file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to
the mapper function line by line. The mapper processes the data and creates several small chunks of
data.
Reduce stage − This stage is the combination of the Shuffle stage and the Reduce stage. The
Reducer’s job is to process the data that comes from the mapper. After processing, it produces a
new set of output, which will be stored in the HDFS.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the
cluster.
The framework manages all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with data on local disks that reduces the network
traffic.
After completion of the given tasks, the cluster collects and reduces the data to form an appropriate
result, and sends it back to the Hadoop server.
Terminology
PayLoad − Applications implement the Map and the Reduce functions, and form the core of the
job.
Mapper − Mapper maps the input key/value pairs to a set of intermediate key/value pair.
NamedNode − Node that manages the Hadoop Distributed File System (HDFS).
DataNode − Node where data is presented in advance before any processing takes place.
MasterNode − Node where JobTracker runs and which accepts job requests from clients.
SlaveNode − Node where Map and Reduce program runs.
JobTracker − Schedules jobs and tracks the assigned jobs to the TaskTracker.
Task Tracker − Tracks the task and reports status to JobTracker.
Job − A program is an execution of a Mapper and Reducer across a dataset.
Task − An execution of a Mapper or a Reducer on a slice of data.
Task Attempt − A particular instance of an attempt to execute a task on a SlaveNode.
Example Scenario
Given below is the data regarding the electrical consumption of an organization. It contains the
monthly electrical consumption and the annual average for various years.
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Avg
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45
If the above data is given as input, we have to write applications to process it and produce
results such as finding the year of maximum usage, year of minimum usage, and so on. This is a
walkover for the programmers with finite number of records. They will simply write the logic to
produce the required output, and pass the data to the application written.
But, think of the data representing the electrical consumption of all the large scale industries of a
particular state, since its formation.
When we write applications to process such bulk data, they will take a lot of time to execute.
There will be heavy network traffic when we move data from the source to the network server, and so on.
To solve these problems, we have the MapReduce framework.
Input Data
The above data is saved as sample.txt and given as input. The input file looks as shown below.
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45
Example Program
Given below is the program to process the sample data using the MapReduce framework.
package hadoop;
import java.util.*;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class ProcessUnits {
//Mapper class
public static class E_EMapper extends MapReduceBase implements
Mapper<LongWritable ,/*Input key Type */
Text, /*Input value Type*/
Text, /*Output key Type*/
IntWritable> /*Output value Type*/
{
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
String lasttoken = null;
StringTokenizer s = new StringTokenizer(line,"\t");
String year = s.nextToken();
while(s.hasMoreTokens()) {
lasttoken = s.nextToken(); }
int avgprice = Integer.parseInt(lasttoken);
output.collect(new Text(year), new IntWritable(avgprice));
} }
//Reducer class
public static class E_EReduce extends MapReduceBase implements Reducer< Text, IntWritable,
Text, IntWritable > {
//Reduce function
public void reduce( Text key, Iterator <IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
int maxavg = 30; // threshold: only yearly averages above 30 units are emitted
int val = Integer.MIN_VALUE;
while (values.hasNext()) {
if((val = values.next().get())>maxavg) {
output.collect(key, new IntWritable(val));
} }} }
//Main function
public static void main(String args[])throws Exception {
JobConf conf = new JobConf(ProcessUnits.class);
conf.setJobName("max_eletricityunits");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(E_EMapper.class);
conf.setCombinerClass(E_EReduce.class);
conf.setReducerClass(E_EReduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}}
Save the above program as ProcessUnits.java. The compilation and execution of the program is
explained below.
Compilation and Execution of Process Units Program
Let us assume we are in the home directory of a Hadoop user (e.g. /home/hadoop).
Follow the steps given below to compile and execute the above program.
Step 1
The following command is to create a directory to store the compiled java classes.
$ mkdir units
Step 2
Download Hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce program.
Visit the following link mvnrepository.com to download the jar. Let us assume the downloaded
folder is /home/hadoop/.
Step 3
The following commands are used for compiling the ProcessUnits.java program and creating a jar
for the program.
$ javac -classpath hadoop-core-1.2.1.jar -d units ProcessUnits.java
$ jar -cvf units.jar -C units/ .
Step 4
The following command is used to create an input directory in HDFS.
$HADOOP_HOME/bin/hadoop fs -mkdir input_dir
Step 5
The following command is used to copy the input file named sample.txt into the input directory of
HDFS.
$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt input_dir
Step 6
The following command is used to verify the files in the input directory.
$HADOOP_HOME/bin/hadoop fs -ls input_dir/
Step 7
The following command is used to run the Eleunit_max application by taking the input files from
the input directory.
$HADOOP_HOME/bin/hadoop jar units.jar hadoop.ProcessUnits input_dir output_dir
Wait for a while until the file is executed. After execution, as shown below, the output will contain
the number of input splits, the number of Map tasks, the number of reducer tasks, etc.
INFO mapreduce.Job: Job job_1414748220717_0002
completed successfully
14/10/31 06:02:52
INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read = 61
FILE: Number of bytes written = 279400
FILE: Number of read operations = 0
FILE: Number of large read operations = 0
FILE: Number of write operations = 0
HDFS: Number of bytes read = 546
HDFS: Number of bytes written = 40
HDFS: Number of read operations = 9
HDFS: Number of large read operations = 0
HDFS: Number of write operations = 2
Job Counters
Launched map tasks = 2
Launched reduce tasks = 1
Data-local map tasks = 2
Total time spent by all maps in occupied slots (ms) = 146137
Total time spent by all reduces in occupied slots (ms) = 441
Total time spent by all map tasks (ms) = 14613
Total time spent by all reduce tasks (ms) = 44120
Total vcore-seconds taken by all map tasks = 146137
Total vcore-seconds taken by all reduce tasks = 44120
Total megabyte-seconds taken by all map tasks = 149644288
Total megabyte-seconds taken by all reduce tasks = 45178880
Map-Reduce Framework
Map input records = 5
Map output records = 5
Map output bytes = 45
Map output materialized bytes = 67
Input split bytes = 208
Combine input records = 5
Combine output records = 5
Reduce input groups = 5
Reduce shuffle bytes = 6
Reduce input records = 5
Reduce output records = 5
Spilled Records = 10
Shuffled Maps = 2
Failed Shuffles = 0
Merged Map outputs = 2
GC time elapsed (ms) = 948
CPU time spent (ms) = 5160
Physical memory (bytes) snapshot = 47749120
Virtual memory (bytes) snapshot = 2899349504
Total committed heap usage (bytes) = 277684224
$HADOOP_HOME/bin/hadoop fs -cat output_dir/part-00000
$HADOOP_HOME/bin/hadoop fs -get output_dir /home/hadoop
Important Commands
All Hadoop commands are invoked by the $HADOOP_HOME/bin/hadoop command. Running
the Hadoop script without any arguments prints the description for all commands.
Usage − hadoop [--config confdir] COMMAND
The following table lists the options available and their description.
Sr.No. Option & Description
1 namenode -format
Formats the DFS filesystem.
2 secondarynamenode
Runs the DFS secondary namenode.
3 namenode
Runs the DFS namenode.
4 datanode
Runs a DFS datanode.
5 dfsadmin
Runs a DFS admin client.
6 mradmin
Runs a Map-Reduce admin client.
7 fsck
Runs a DFS filesystem checking utility.
8 fs
Runs a generic filesystem user client.
9 balancer
Runs a cluster balancing utility.
10 oiv
Applies the offline fsimage viewer to an fsimage.
11 fetchdt
Fetches a delegation token from the NameNode.
12 jobtracker
Runs the MapReduce job Tracker node.
13 pipes
Runs a Pipes job.
14 tasktracker
Runs a MapReduce task Tracker node.
15 historyserver
Runs job history servers as a standalone daemon.
16 job
Manipulates the MapReduce jobs.
17 queue
Gets information regarding JobQueues.
18 version
Prints the version.
19 jar <jar>
Runs a jar file.
23 classpath
Prints the class path needed to get the Hadoop jar and the required libraries.
24 daemonlog
Get/Set the log level for each daemon
The following options are used with the hadoop job command.
1 -submit <job-file>
Submits the job.
2 -status <job-id>
Prints the map and reduce completion percentage and all job counters.
3 -counter <job-id> <group-name> <counter-name>
Prints the counter value.
4 -kill <job-id>
Kills the job.
7 -list[all]
Displays all jobs. -list displays only jobs which are yet to complete.
8 -kill-task <task-id>
Kills the task. Killed tasks are NOT counted against failed attempts.
9 -fail-task <task-id>
Fails the task. Failed tasks are counted against failed attempts.
The number of Maps is usually determined by the size of the input data and the size of the data split
block, that is, by the total HDFS file size divided by the data split (block) size. Therefore, if we have
an HDFS data file of 5 TB and a block size of 128 MB, there will be 40,960 map tasks for the file.
Sometimes, however, the number of Mappers created will be more than this count because of
speculative execution. This is true when the input is a file, though it entirely depends on the
InputFormat class.
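As a quick check of that arithmetic: 5 TB = 5 × 1024 × 1024 MB = 5,242,880 MB, and 5,242,880 MB ÷ 128 MB per split = 40,960 input splits, hence roughly 40,960 map tasks.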
In Hadoop MapReduce processing, there will be a delay in the result of the job when the
assigned Mapper or Reducer is taking a long time to finish.
MapReduce dataflow
Now that we have seen the components that make a basic MapReduce job possible, we will look at
how everything works together at a higher level: the MapReduce dataflow with multiple nodes in a
Hadoop cluster.
Introduction to NoSQL
NoSQL, originally referring to "non-SQL" or "non-relational", is a database that provides a
mechanism for the storage and retrieval of data. This data is modeled in means other than the tabular
relations used in relational databases. Such databases came into existence in the late 1960s, but did
not obtain the NoSQL moniker until a surge of popularity in the early twenty-first century.
The suitability of a given NoSQL database depends on the problem it should solve. The data
structures used by NoSQL databases are sometimes also viewed as more flexible than relational
database tables. Most NoSQL databases offer a concept of eventual consistency, in which database
changes are propagated to all nodes eventually, so queries for data might not return updated data
immediately or might read data that is not accurate, a problem known as stale reads.
Advantages of NoSQL:
There are many advantages of working with NoSQL databases such as MongoDB and Cassandra.
The main advantages are high scalability and high availability.
High scalability –
NoSQL databases use sharding for horizontal scaling. Sharding is the partitioning of data and its
placement on multiple machines in such a way that the order of the data is preserved. Vertical
scaling means adding more resources to the existing machine, whereas horizontal scaling means
adding more machines to handle the data. Vertical scaling is not that easy to implement, but
horizontal scaling is easy to implement. Examples of horizontally scaling databases are MongoDB,
Cassandra, etc. NoSQL can handle huge amounts of data because of this scalability; as the data
grows, NoSQL scales itself to handle that data in an efficient manner.
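A common way to shard is to hash each key to one of N partitions. The toy sketch below shows the basic idea; the shard count and keys are hypothetical, and real systems such as MongoDB and Cassandra use far more sophisticated placement (shard keys, consistent hashing, virtual nodes).

// Toy hash-based sharding: each key is routed to one of N shards (machines).
public class ShardingSketch {
    static int shardFor(String key, int numShards) {
        // Math.floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int numShards = 4; // hypothetical cluster of 4 machines
        for (String key : new String[] {"user:101", "user:102", "order:9001", "order:9002"}) {
            System.out.println(key + " -> shard " + shardFor(key, numShards));
        }
    }
}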
High availability –
The auto-replication feature in NoSQL databases makes them highly available, because in case of
any failure the data replicates itself back to a previous consistent state.
Disadvantages of NoSQL:
NoSQL has the following disadvantages.
Narrow focus –
NoSQL databases have a very narrow focus: they are mainly designed for storage and provide very
little other functionality. Relational databases are a better choice in the field of transaction
management than NoSQL.
Open-source –
NoSQL is open-source database. There is no reliable standard for NoSQL yet. In other words two
database systems are likely to be unequal.
Management challenge –
The purpose of big data tools is to make management of a large amount of data as simple as
possible. But it is not so easy. Data management in NoSQL is much more complex than a relational
database. NoSQL, in particular, has a reputation for being challenging to install and even more
hectic to manage on a daily basis.
GUI is not available –
GUI tools to access the database are not widely available in the market.
Backup –
Backup is a great weak point for some NoSQL databases like MongoDB. MongoDB has no
approach for the backup of data in a consistent manner.
Large document size –
Some database systems, like MongoDB and CouchDB, store data in JSON format, which means that
documents are quite large (costing storage, network bandwidth, and speed), and having descriptive
key names actually hurts, since they increase the document size.
Types of NoSQL database:
Types of NoSQL databases, and the database systems that fall in each category, are:
Key value store: Memcached, Redis, Coherence
Tabular: HBase, Big Table, Accumulo
Document based: MongoDB, CouchDB, Cloudant
For example, MongoDB falls in the category of NoSQL document-based databases.
When should NoSQL be used:
When a huge amount of data needs to be stored and retrieved; the relationships between the data
you store are not that important; the data changes over time and is not structured; support for
constraints and joins is not required at the database level; and the data is growing continuously and
you need to scale the database regularly to handle it.
NoSQL Data Architecture Patterns
An architecture pattern is a logical way of categorizing data that will be stored in a database.
NoSQL is a type of database which helps to perform operations on big data and store it in a valid
format. It is widely used because of its flexibility and wide variety of services.
Architecture Patterns of NoSQL:
The data is stored in NoSQL in any of the following four data architecture patterns.
1. Key-Value Store Database
2. Column Store Database
3. Document Database
4. Graph Database
These are explained as following below.
1. Key-Value Store Database:
This model is one of the most basic models of NoSQL databases. As the name suggests, the data is
stored in the form of key-value pairs. The key is usually a sequence of strings, integers or characters
but can also be a more advanced data type. The value is typically linked or co-related to the key.
Key-value pair storage databases generally store data as a hash table where each key is unique. The
value can be of any type (JSON, BLOB (Binary Large Object), strings, etc.). This type of pattern is
usually used in shopping websites or e-commerce applications (see the sketch at the end of this
subsection).
Advantages:
Can handle large amounts of data and heavy load.
Easy retrieval of data by keys.
Limitations:
Complex queries may attempt to involve multiple key-value pairs which may delay performance.
Data can be involving many-to-many relationships which may collide.
Examples:
DynamoDB
Berkeley DB
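The access pattern of a key-value store can be illustrated with an in-memory map; a real store such as Redis or DynamoDB exposes essentially the same put/get/delete interface, only over the network and with persistence. The keys and values below are hypothetical.

import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for a key-value store: unique keys, opaque values.
public class KeyValueSketch {
    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();

        // put: associate a value (here a JSON-like string) with a unique key
        store.put("cart:alice", "{\"items\": [\"pen\", \"notebook\"]}");
        store.put("cart:bob", "{\"items\": [\"mouse\"]}");

        // get: retrieval is always by key, which is why lookups are fast
        System.out.println(store.get("cart:alice"));

        // delete: remove the entry for a key
        store.remove("cart:bob");
    }
}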
3. Document Database:
The document database fetches and accumulates data in the form of key-value pairs, but here the
values are called documents. A document can be stated as a complex data structure; it can be in the
form of text, arrays, strings, JSON, XML or any such format. The use of nested documents is also
very common. It is very effective, as most of the data created is usually in the form of JSON and is
unstructured.
Advantages:
This type of format is very useful and apt for semi-structured data.
Storage, retrieval, and management of documents are easy.
Limitations:
Handling multiple documents is challenging
Aggregation operations may not work accurately.
Examples:
MongoDB
CouchDB
Figure – Document Store Model in form of JSON documents
4. Graph Databases:
Clearly, this architecture pattern deals with the storage and management of data in graphs. Graphs
are basically structures that depict connections between two or more objects in some data. The
objects or entities are called nodes and are joined together by relationships called edges. Each edge
has a unique identifier, and each node serves as a point of contact for the graph (see the sketch at the
end of this subsection).
Advantages:
Fastest traversal because of connections.
Spatial data can be easily handled.
Limitations:
Wrong connections may lead to infinite loops.
Examples:
Neo4J
FlockDB( Used by Twitter)
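The node-and-edge model can be sketched with a simple adjacency list; dedicated graph databases such as Neo4j add persistence, indexing, and a query language on top of this idea. The nodes and relationships below are hypothetical.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy adjacency-list graph: nodes joined by directed edges ("follows" relationships).
public class GraphSketch {
    public static void main(String[] args) {
        Map<String, List<String>> follows = new HashMap<>();
        follows.put("alice", Arrays.asList("bob", "carol"));
        follows.put("bob", Arrays.asList("carol"));
        follows.put("carol", Arrays.asList());

        // Traversal: which nodes can "alice" reach in one hop?
        for (String friend : follows.get("alice")) {
            System.out.println("alice -> " + friend);
        }
    }
}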
ETL (Extract, Transform, Load) processing
Let us understand each step of the ETL process in-depth:
Extraction:
The first step of the ETL process is extraction. In this step, data from various source systems, which
can be in various formats like relational databases, NoSQL, XML, and flat files, is extracted into the
staging area. It is important to extract the data from the various source systems and store it in the
staging area first, and not directly in the data warehouse, because the extracted data is in various
formats and can also be corrupted. Loading it directly into the data warehouse may damage it, and
rollback will be much more difficult. Therefore, this is one of the most important steps of the ETL
process.
Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or functions are
applied on the extracted data to convert it into a single standard format. It may involve following
processes/tasks:
Filtering – loading only certain attributes into the data warehouse.
Cleaning – filling up the NULL values with some default values, mapping U.S.A, United States,
and America into USA, etc. (a small sketch of this kind of cleaning step follows this list).
Joining – joining multiple attributes into one.
Splitting – splitting a single attribute into multiple attributes.
Sorting – sorting tuples on the basis of some attribute (generally key-attribute).
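A minimal sketch of the Cleaning task mentioned above, normalizing country spellings and filling NULLs with a default; the lookup table, default value, and sample records are hypothetical.

import java.util.HashMap;
import java.util.Map;

// Sketch of a cleaning transformation: default for NULLs, canonical spelling for countries.
public class CleaningTransform {
    private static final Map<String, String> COUNTRY = new HashMap<>();
    static {
        COUNTRY.put("U.S.A", "USA");
        COUNTRY.put("United States", "USA");
        COUNTRY.put("America", "USA");
    }

    static String cleanCountry(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return "UNKNOWN";                            // default value for NULL / empty fields
        }
        return COUNTRY.getOrDefault(raw.trim(), raw.trim());
    }

    public static void main(String[] args) {
        for (String s : new String[] {"U.S.A", "United States", null, "India"}) {
            System.out.println(s + " -> " + cleanCountry(s));
        }
    }
}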
Loading:
The third and final step of the ETL process is loading. In this step, the transformed data is finally
loaded into the data warehouse. Sometimes the data is updated by loading into the data warehouse
very frequently and sometimes it is done after longer but regular intervals. The rate and period of
loading solely depends on the requirements and varies from system to system.
The ETL process can also use the pipelining concept, i.e., as soon as some data is extracted, it can
be transformed, and during that period some new data can be extracted. While the transformed data
is being loaded into the data warehouse, the already extracted data can be transformed. The block
diagram of the pipelining of the ETL process is shown below:
Figure – Pipelining of the ETL process
ETL Tools: Most commonly used ETL tools are Sybase, Oracle Warehouse builder, CloverETL,
and MarkLogic.
Textual ETL is the process of reading text and producing a relational database suitable for
analytical processing. The text can come from any electronic source and the results can go into
any relational database. ... Email typically contains spam and personal email which does not belong
in a corporate database.
UNIT IV
Big Data analytics:
Data analytics life cycle, Data cleaning , Data transformation, Comparing reporting and analysis,
Types of analysis, Analytical approaches, Data analytics using R, Exploring basic features of R,
Exploring R GUI, Reading data sets, Manipulating and processing data in R, Functions and
packages in R, Performing graphical analysis.
UNIT 4
Big Data Analytics
Types of analysis
Analysis of data is a vital part of running a successful business. There are four types of data
analysis that are in use across all industries. While we separate these into categories, they are all
linked together and build upon each other. As you move from the simplest type of analytics to more complex ones, the degree of difficulty and the resources required increase. At the same time, the level of added insight and value also increases.
Descriptive Analysis
The first type of data analysis is descriptive analysis. It is at the foundation of all data insight. It is
the simplest and most common use of data in business today. Descriptive analysis answers the
“what happened” by summarizing past data, usually in the form of dashboards.
The biggest use of descriptive analysis in business is to track Key Performance Indicators (KPIs).
KPIs describe how a business is performing based on chosen benchmarks.
● KPI dashboards
● Monthly revenue reports
● Sales leads overview
Diagnostic Analysis
Diagnostic analysis takes the insights found from descriptive analytics and drills down to
find the causes of those outcomes. Organizations make use of this type of analytics as it creates
more connections between data and identifies patterns of behavior.
A critical aspect of diagnostic analysis is creating detailed information. When new problems
arise, it is possible you have already collected certain data pertaining to the issue. By already
86
having the data at your disposal, you avoid repeating work, and all the problems become
interconnected.
Predictive Analysis
This type of analysis is another step up from the descriptive and diagnostic analyses. Predictive
analysis uses the data we have summarized to make logical predictions of the outcomes of events.
This analysis relies on statistical modeling, which requires added technology and manpower to
forecast. It is also important to understand that forecasting is only an estimate; the accuracy of
predictions relies on quality and detailed data.
● Risk Assessment
● Sales Forecasting
● Using customer segmentation to determine which leads have the best chance of converting
● Predictive analytics in customer success teams
Analytical Approach
An analytical approach is the use of analysis to break a problem down into the elements necessary
to solve it. It's the same as formal analysis.
Difficult social problems require an analytical approach because they do not yield to any other
approach.
Researchers use different analytical techniques to examine complex relationships between
variables. There are three basic types of analytical techniques:
● Regression Analysis
● Grouping Methods
● Multiple Equation Models
Regression Analysis
Regression analysis assumes that the dependent, or outcome, variable is directly affected by one or
more independent variables. There are four important types of regression analyses:
1. Ordinary least squares (OLS) regression
o OLS regression (also known as linear regression) is used to determine the
relationship between a dependent variable and one or more independent variables.
o OLS regression is used when the dependent variable is continuous. Continuous
variables, in theory, can take on any value within a range. For example, family child
care expense, measured in dollars, is a continuous variable.
87
o When used to estimate the associations between two or more independent variables
and a single dependent variable, it is called multiple linear regression.
o In multiple regression, the coefficient (i.e., standardized or unstandardized
regression coefficient for each independent variable) tells you how much the
dependent variable is expected to change when that independent variable increases
by one, holding all the other independent variables constant.
2. Logistic regression
o Logistic regression (or logit regression) is a special form of regression analysis that
is used to examine the associations between a set of independent or predictor
variables and a dichotomous outcome variable. A dichotomous variable is a variable
with only two possible values, e.g. child receives child care before or after the Head
Start program day (yes, no).
o Like linear regression, the independent variables may be either interval, ordinal, or
nominal. A researcher might use logistic regression to study the relationships
between parental education, household income, and parental employment and
whether children receive child care from someone other than their parents (receives
nonparent care/does not receive nonparent care).
3. Hierarchical linear modeling (HLM)
o Used when data are nested. Nested data occur when several individuals belong to the
same group under study. For example, in child care research, children enrolled in a
center-based child care program are grouped into classrooms with several
classrooms in a center. Thus, the children are nested within classrooms and
classrooms are nested within centers.
o Allows researchers to determine the effects of characteristics for each level of nested
data, classrooms and centers, on the outcome variables. HLM is also used to study
growth (e.g., growth in children’s reading and math knowledge and skills over time).
4. Duration models
o Used to estimate the length of time before a given event occurs or the length of time
spent in a state. For example, in child care policy research, duration models have
been used to estimate the length of time that families receive child care subsidies.
o Sometimes referred to as survival analysis or event history analysis.
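The base R functions lm() and glm() fit the first two of these models. The sketch below is illustrative only; the data frame families and its variables are hypothetical names, not taken from the course material.
# OLS / multiple linear regression: continuous outcome (hypothetical data)
ols_fit <- lm(child_care_expense ~ income + parent_education, data = families)
summary(ols_fit)

# Logistic regression: dichotomous (yes/no) outcome, fitted with glm()
logit_fit <- glm(receives_nonparent_care ~ income + parent_education,
                 data = families, family = binomial)
summary(logit_fit)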
Grouping Methods
Grouping methods are techniques for classifying observations into meaningful categories. Two of
the most common grouping methods are discriminant analysis and cluster analysis.
a)Discriminant analysis
● Identifies characteristics that distinguish between groups. For example, a researcher could
use discriminant analysis to determine which characteristics identify families that seek child
care subsidies and which identify families that do not.
● It is used when the dependent variable is a categorical variable (e.g., family receives child
care subsidies [yes, no], child enrolled in family care [yes, no], type of child care child
receives [relative care, non-relative care, center-based care]). The independent variables are
interval variables (e.g., years of schooling, family income).
b)Cluster analysis
88
● Used to classify similar individuals together. It uses a set of measured variables to classify a
sample of individuals (or organizations) into a number of groups such that individuals with
similar values on the variables are placed in the same group. For example, cluster analysis
would be used to group together parents who hold similar views of child care or children
who are suspended from school.
● Its goal is to sort individuals into groups in such a way that individuals in the same group
(cluster) are more similar to each other than to individuals in other groups.
● The variables used in cluster analysis may be nominal, ordinal or interval.
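As a minimal sketch of cluster analysis, the built-in iris data set can be grouped with k-means and hierarchical clustering; the choice of three clusters here is illustrative.
set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3)      # k-means on the four measurements
table(km$cluster, iris$Species)             # compare clusters with the known species

hc <- hclust(dist(iris[, 1:4]))             # hierarchical clustering
plot(hc)                                    # dendrogram of the result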
89
Multiple Equation Models
Multiple equation models (for example, path analysis and structural equation modeling) extend regression to examine how several related outcome variables depend on one another and on a set of explanatory variables. In child care research, for instance, such models have been used to study children's development across multiple domains (e.g., children's scores on standardized assessments of early math and literacy, language, and scores based on teacher reports of children's social skills and problem behaviors).
What is RStudio?
RStudio is an interface that provides you with a greater ability to conduct your analyses in R. You
can think of RStudio as an overlay on the software R that lets you visually group together, in one interface, the input window, the output window, the objects in your workspace, and plots.
Business analytics in R allows users to analyze business data more efficiently. The following are
some of the main benefits realized by companies employing R in their analytics programs:
Democratizing Analytics across the Organization: R can help democratize analytics by enabling
business users with interactive data visualization and reporting tools. R can be used for data science
by non-data scientists so that business users and citizen data scientists can make better business
decisions. R analytics can also reduce time spent on data preparation and data wrangling,
allowing data scientists to focus on more complex data science initiatives.
Providing Deeper, More Accurate Insights: R can help create powerful models to analyze large
amounts of data. With more precise data collection and storage through R analytics, companies can
deliver more valuable insights to users. Analytics and statistical engines using R provide deeper,
more accurate insights for the business. R can be used to develop very specific, in-depth analyses.
Leveraging Big Data: R can handle big datasets and is arguably as easy if not easier for most
analysts to use as any of the other analytics tools available today.
Creating Interactive Data Visualizations: R is also helpful for data visualization and data
exploration because it supports the creation of graphs and diagrams. It includes the ability to create
interactive visualizations and 3D charts and graphs that are helpful for communicating with
business users.
While R programming was originally designed for statisticians, it can be implemented for a variety
of uses including predictive analytics, data modeling, and data mining. Businesses can implement R
to create custom models for data collection, clustering, and analytics. R analytics can provide a
90
valuable way to quickly develop models targeted at understanding specific areas of the business and
delivering tailored insights on day-to-day needs.
● Statistical testing
● Prescriptive analytics
● Predictive analytics
● Time-series analysis
● What-if analysis
● Regression models
● Data exploration
● Forecasting
● Text mining
● Data mining
● Visual analytics
● Web analytics
● Social media analytics
● Sentiment analysis
The R programming language is versatile and can be used for a software development environment
for statistical analysis or graphics representation and reporting purposes.
The following are the significant features of the R language:
● R is a simple and effective programming language that is well developed, and it also serves as a data analysis environment.
● R has a large, consistent, and integrated set of tools used for data analysis.
● R contains a suite of operators for different types of calculations on arrays, lists, and vectors.
● R provides highly extensible graphical techniques.
● R graphical output for data analysis can be displayed directly on screen or printed on paper.
● R has an effective data handling and storage facility.
● R has a vibrant online community.
● R is free, open-source, robust, and highly extensible.
● R supports matrix arithmetic.
● The R language can also be used together with several other languages such as Python, Perl, Ruby, F#, and Julia.
Exploring R GUI
R GUI is the standard GUI platform for working in R. The R Console Window forms an essential
part of the R GUI. In this window, we input various instructions, scripts and several other important
91
operations. This console window has several tools embedded in it to facilitate ease of operations.
This console appears whenever we access the R GUI.
In the main panel of R GUI, go to the ‘File‘ menu and select the ‘New Script‘ option. This will
create a new script in R.
In order to quit the active R session, you can type the following code after the R prompt ‘>’ as
follows:
>q()
Data structures provide the way to represent data in data analytics. We can manipulate data
in R for analysis and visualization.One of the most important aspects of computing with data in R is
its ability to manipulate data and enable its subsequent analysis and visualization. Let us see few
basic data structures in R:
a. Vectors in R
These are ordered container of primitive elements and are used for 1-dimensional data.
Types – integer, numeric, logical, character, complex
b. Matrices in R
These are Rectangular collections of elements and are useful when all data is of a single class that is
numeric or characters.
Dimensions – two, three, etc.
c. Lists in R
These are ordered container for arbitrary elements and are used for higher dimension data, like
customer data information of an organization. When data cannot be represented as an array or a
data frame, list is the best choice. This is so because lists can contain all kinds of other objects,
including other lists or data frames, and in that sense, they are very flexible.
d. Data frames
These are two-dimensional containers for records and variables and are used for representing data
from spreadsheets etc. It is similar to a single table in the database.
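A few one-line examples of these structures (illustrative values only):
v  <- c(10, 20, 30)                           # vector: 1-dimensional, one type
m  <- matrix(1:6, nrow = 2, ncol = 3)         # matrix: rectangular, single class
l  <- list(name = "Asha", marks = c(78, 91))  # list: arbitrary elements
df <- data.frame(id = 1:3,                    # data frame: records and variables
                 city = c("Hyderabad", "Pune", "Delhi"))
str(df)                                       # inspect the structure of an object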
92
Similar to $ in R, the double square brackets operator in R also returns a single element, but it
offers the flexibility of referring to the elements by position rather than by name. It can be used for
data frames and lists.
c. [
The single square bracket operator in R returns multiple elements of data. The index within the
square brackets can be a numeric vector, a logical vector, or a character vector.
For example, to retrieve the first 5 rows and all columns of the built-in data set iris, the following command is used:
iris[1:5, ]
# A sample vector
v <- c(1, 5, 6, 3, 2, 4, 2)
# The single square bracket with a logical condition returns the matching elements
v[v != 2]
# A small data set entered from inline text (the values are from the original
# example; the column names here are only illustrative)
scores <- read.table(header = TRUE, text = '
group result
1 99
2 84
1 89
2 79
')
93
Merging datasets in R
Sometimes, similar datasets obtained from different sources need to be merged together for further
processing. R provides the following functions to combine different sets of data :
The merge() function – It is used to combine data frames on the basis of common columns or rows.
The cbind() function – It is used to add the columns of datasets having an equal number and identical order of rows.
The rbind() function – It is used to add the rows of datasets having an equal number of columns.
94
The merge() function allows four ways of combining data:
a. Natural join in R
To keep only rows that match from the data frames, specify the argument all=FALSE
b. Full outer join in R
To keep all rows from both data frames, specify all=TRUE
c. Left outer join in R
To include all the rows of your data frame x and only those from y that match, specify all.x=TRUE
d. Right outer join in R
To include all the rows of your data frame y and only those from x that match, specify all.y=TRUE
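A small sketch of the four joins, with two hypothetical data frames sharing an id column:
students <- data.frame(id = c(1, 2, 3), name = c("A", "B", "C"))
marks    <- data.frame(id = c(2, 3, 4), score = c(88, 92, 79))

merge(students, marks, all = FALSE)   # natural (inner) join: only ids 2 and 3
merge(students, marks, all = TRUE)    # full outer join: ids 1 to 4
merge(students, marks, all.x = TRUE)  # left outer join: all rows of students
merge(students, marks, all.y = TRUE)  # right outer join: all rows of marks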
Using the cbind( ) Function:
cbind() function in R Language is used to combine specified Vector, Matrix or Data Frame by
columns.
95
Syntax: cbind(x1, x2, …, deparse.level = 1)
Parameters:
x1, x2: vector, matrix, data frames
deparse.level: This value determines how the column names generated. The default value of
deparse.level is 1.
Example 1:
x <- 2:7
y <- c(2, 5)
cbind(x, y)
Output:
x y
[1, ] 2 2
[2, ] 3 5
[3, ] 4 2
[4, ] 5 5
[5, ] 6 2
[6, ] 7 5
rbind(): The rbind or the row bind function is used to bind or combine the multiple group of rows
together.
rbind(my_data, new_row)
96
Example 1: rbind a Vector to a Data Frame
The easiest way of using rbind in R is the combination of a vector and a data frame. First, let's create an example data frame and bind a new row to it (the code below is reconstructed to match the output shown):
my_data <- data.frame(x1 = c(7, 4, 4, 9), x2 = c(5, 2, 8, 9), x3 = 1:4)
new_row <- c(9, 8, 7)
rbind(my_data, new_row)
Output:
## x1 x2 x3
## 1 7 5 1
## 2 4 2 2
## 3 4 8 3
## 4 9 9 4
## 5 9 8 7
Sorting data:
Example:
vec1 <- c(23, 45, 10, 34, 89, 20, 67, 99)
# sorting of a vector
sort(vec1)
97
# reverse sorting
sort(vec1, decreasing = TRUE)
Ordering data:
Example:
sampleDataFrame[ order(sampleDataFrame$weight), ]
Transposing Data
t( ) function is used to transpose a matrix or a data frame. This function transposes rows into
columns and columns into rows.
Example:
sampleDataFrame
t(sampleDataFrame)
R provides the following functions of the reshape2 package to convert data between wide and long
formats:
98
o use the melt ( ) function to convert wide data into the long format
o use the dcast ( ) function to convert long data into the wide format
melt( )
Syntax:
melt(data, na.rm = FALSE, value.name = “value”)
dcast( )
Syntax:
dcast(data, formula, fun.aggregate)
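A minimal sketch with the reshape2 package (assumed here) and an illustrative marks table:
library(reshape2)

wide <- data.frame(id = 1:3, math = c(90, 75, 88), science = c(85, 92, 79))

# Wide to long: one row per (id, subject) pair
long <- melt(wide, id.vars = "id", variable.name = "subject", value.name = "score")

# Long back to wide: one column per subject
dcast(long, id ~ subject, value.var = "score")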
Creating Matrices
Matrices are constructed column-wise, so entries can be thought of starting in the “upper left”
corner and running down the columns. We can create a matrix using the matrix() function and
specifying the values to fill in the matrix and the number of rows and columns to make the matrix.
# numeric matrix
m1 <- matrix(1:6, nrow = 2, ncol = 3)
m1
output :
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
The underlying structure of this matrix is simply an integer vector with an added 2x3 dimension
attribute.
Matrices can also contain character values. Whether a matrix contains data that are of numeric or
character type, all the elements must be of the same class.
# a character matrix
m2 <- matrix(letters[1:6], nrow = 2, ncol = 3)
m2
## [,1] [,2] [,3]
## [1,] "a" "c" "e"
## [2,] "b" "d" "f"
99
Matrices can also be created using the column-bind cbind() and row-bind rbind() functions.
However, keep in mind that the vectors that are being binded must be of equal length and mode.
v1 <- 1:4
v2 <- 5:8
cbind(v1, v2)
## v1 v2
## [1,] 1 5
## [2,] 2 6
## [3,] 3 7
## [4,] 4 8
rbind(v1, v2)
## [,1] [,2] [,3] [,4]
## v1 1 2 3 4
## v2 5 6 7 8
Adding on to Matrices
We can leverage the cbind() and rbind() functions for adding onto matrices as well. Again, its
important to keep in mind that the vectors that are being binded must be of equal length and mode
to the pre-existing matrix.
Subsetting Matrices
To subset matrices we use the [ operator; however, since matrices have 2 dimensions we need to
incorporate subsetting arguments for both row and column dimensions. A generic form of matrix
subsetting looks like: matrix[rows, columns]. We can illustrate with a matrix m2 that has named rows and columns (its definition is reconstructed here to match the output shown):
m2 <- matrix(1:12, nrow = 4, ncol = 3,
             dimnames = list(paste0("row_", 1:4), paste0("col_", 1:3)))
m2
## col_1 col_2 col_3
## row_1 1 5 9
## row_2 2 6 10
## row_3 3 7 11
## row_4 4 8 12
By using different values in the rows and columns argument of m2[rows, columns], we can
subset m2 in multiple ways.
Note that you simply use the data.frame function, provide names for your columns, and populate
the contents of the columns (using the c() vector functionality).
Output:
102
Note that a data frame can hold a number of data types (i.e., the columns can be characters,
integers, dates, factors, etc). Above you'll notice that the data types are displayed below the column
name; in this case, our two columns were coded as factors.
We not only see the values of each row in the second column printed but also the corresponding levels. The syntax is the same when selecting a row from a tibble, except that the levels aren't included, because columns containing characters aren't automatically coded as factors in a tibble, and only factors have levels (don't get hung up if you don't understand levels for now). Note that the tibble column also prints a little more cleanly (and is stored as chr, i.e. character).
df_tidy[, 2]
103
Note that in R, when locating a cell, [1,2] refers to the first row and second column, so
that [,2] grabs the entire second column.
To actually do something more interesting with this, and count the number of unique jobs, you use
the same syntax inside a function:
unique(df_tidy[, 2])
This is the same for a tibble (this is the last time we'll make the comparison). Again, notice the
levels are gone because the tidyverse defaults to characters instead of factors.
1df_tidy[,"occupation"]
r
104
r
And here's the syntax to grab multiple rows. Note that you span inclusively from the first to last row
of interest.
df_tidy[1:2, ]
Packages in R
search()
When the search() command is executed, you can overview the packages that are loaded and are
ready for use. You will see the graphics package that carries out routines to create graphs.
There are many packages that are installed but not loaded automatically.
For example, the splines package, which contains routines for smoothing curves, is installed, but it is not loaded by itself.
To see what packages are available, you need to type the following command:
installed.packages()
105
Installing R Packages for Windows
In Windows, you get the package menu and install option which is very easy.
After selecting a local mirror site, a list of available binary packages is shown. You can choose the ones you need. Once you have selected the packages you need, click the OK button to download and install them into R.
If you download the package files from the internet (as .zip files), you need to use the install package(s) option in the Packages menu. It allows you to select the files you need, and the packages are then unzipped and installed into R.
To install R packages on the Linux system, you need to perform the below steps:
● Download the required packages as compressed files from the link: Available packages by
name
● Run the following command to install packages:
R CMD INSTALL [options] [-l lib] pkgs
● Use the following command to load the installed package:
library(package)
Installing by the Name of Package
In Linux, you can install the package if you know the name of a package.
install.packages('ade4')
R Packages List
The below table specifies the best packages in R with their usage:
106
How to Use Packages in R
We need to load packages in R after installing them to make them usable.
● Functions can work with any input. You can provide diverse input data to the functions.
● The output of the function is an object that allows you to work with the result.
How to Create a Script in R?
R supports several editors, so a script can be created in any editor such as Notepad, MS Word, or WordPad and saved with the .R extension in the current working directory.
For example, if we want to run the sample.R script in R, we need to provide the command below:
source("sample.R")
In order to create a script, first, open a script file in the editor mode and type the required code.
We will create a script that takes in input in the form of fractions and converts it into a percentage
by further rounding it to one decimal digit.
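The script itself is not reproduced in this file; a minimal version, reconstructed to be consistent with the function shown later in this section, could look like this (saved as pastePercent.R):
# pastePercent.R - convert fractions to percentages rounded to one decimal digit
new.numbers <- c(0.8223, 0.02487, 1.62, 0.4)
percentage  <- round(new.numbers * 100, digits = 1)
paste(percentage, "%", sep = "")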
Save the above script as script file with any name for example pastePercent.R.
107
Now you can call this script on the console with the help of source command which we have
already seen.
source('pastePercent.R')
Output:
Define a function with a name so that it becomes easier to call an R function and pass arguments to
it as input.
The R function should be followed by parentheses that act as a front gate for your function and
between the parentheses, arguments for the function are provided.
Use the return() statement that acts as a back gate of your function.
The return() statement provides the final result of the function that is returned to your workspace.
Let us now see how we can convert the script that we had written earlier to convert values into
percentage and round off into an R function.
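A sketch of that function, consistent with the shortened versions shown further below:
Percent_add <- function(frac){
  percentage <- round(frac * 100, digits = 1)
  result <- paste(percentage, "%", sep = "")
  return(result)
}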
108
The keyword function defines the starting of function. The parentheses after the function
form the front gate, or argument list of the function. Between the parentheses are the arguments to
the function. In this case, there is only one argument.
The return statement defines the end of the function and returns the result. The object put
between the parentheses is returned from inside the function to the workspace. Only one object can
be placed between the parentheses.
The braces, {} are the walls of the function. Everything between the braces is part of the assembly
line or the body of the function. This is how functions are created in R.
Using R Function
After transforming the script into an R function, you need to save it and you can use the function in
R again if required.
R does not tell you by itself that it has loaded the function, but it is present in the workspace; if you want, you can check this with the ls() command.
Now that we know which functions are present in memory, we can use them whenever required. For example, if you want to create percentages from values again, you can use the Percent_add function as below:
#Author DataFlair
ls()
new.vector <- c(0.8223, 0.02487, 1.62, 0.4)
Percent_add(new.vector)
Output:
109
Using the Function Objects in R
In R, a function is also an object and you can manipulate it as you do for other objects.
You can assign a function to a new object using the command below:
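For example (a minimal sketch), assigning the Percent_add function created above to a new name produces a second, equivalent function object:
percent_copy <- Percent_add      # copies the function object, not its result
percent_copy(c(0.25, 0.5))       # "25%" "50%", the same output as Percent_add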
110
Reducing the Number of Lines in R
Till now, in all the above code, we have written return() function to return output. But in R, this can
be skipped as by default, R returns the value of the last line of code in the R function body.
Now, the above code will be:
#Author DataFlair
Percent_add <- function(fac){
percentage <- round(fac * 100, digits = 1)
paste(percentage, "%", sep = "")}
Output:
111
You need return if you want to exit the function before the end of the code in the body.
For example, you could add a line to the Percent_add function that checks whether the argument is numeric and, if not, returns NULL, as shown below:
#Author DataFlair
Percent_add <- function(frac){
if( !is.numeric(frac) ) return(NULL)
percentage <- round(frac * 100, digits = 1)
paste(percentage, "%", sep = "")}
Output:
112
2. Dropping the {}
You can drop braces in some cases though they form a proverbial wall around the function.
If a function consists of only one line of code, you can just add that line after the argument list
without enclosing it in braces. R will see the code after the argument list as the body of the
function.
Suppose, you want to calculate the odds from a proportion. You can write a function without using
braces, as shown below:
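The one-line function is not reproduced in this file; a minimal sketch, using the fact that the odds of an event with probability p are p / (1 - p):
odds <- function(p) p / (1 - p)   # the body follows the argument list, no braces
odds(0.75)                        # 3: probability 0.75 corresponds to odds of 3 to 1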
Scope of R Function
Every object you create ends up in the workspace, which is also called the global environment. The workspace or global environment is the universe of the R user where everything happens.
1. External R Function
If you use an R function, the function first creates a temporary local environment. This local
environment is nested within the global environment, which means that, from that local
environment, you also can access any object from the global environment. As soon as the function
ends, the local environment is destroyed along with all the objects in it.
If R sees any object name, it first searches the local environment. If it finds the object there, it uses that one; otherwise it searches the global environment for that object.
113
2. Internal R Function
Using global variables in an R function is not considered a good practice. Writing your functions in
such a way that they need objects in the global environment is not efficient because you use
functions to avoid dependency on objects in the global environment in the first place.
The whole concept behind R strongly opposes using global variables used in different functions. As
a functional programming language, one of the main ideas of R is that the outcome of a function
should not be dependent on anything but the values for the arguments of that function. If you give
the arguments for the same values, you will always get the same results.
#Author DataFlair
calculate_func <- function(data1, data2, data3){
base_min <- function(z) z - mean(data3)
base_min(data1) / base_min(data2)
}
Output:
A closer look at the definition of base_min() shows that it uses the object data3 but does not have an argument with that name; R finds data3 in the environment of the enclosing function.
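A hypothetical call illustrates this scoping:
calculate_func(10, 5, c(1, 2, 3))
# mean(data3) is 2, so the result is (10 - 2) / (5 - 2) = 2.666667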
114
Finding the Methods behind the Function
It is easy to look at the code behind a function in R. You can see the code of print() by typing its name at the command line.
To display the code of the print() function, we proceed as follows:
print
Output:
The UseMethod() function is the central function of R's generic function system. It looks for a method, that is, a function that can deal with the type of object given as the argument x.
Suppose you have a data frame that you want to print. R first looks for a function called print.data.frame() and, if it finds one, uses it to print the object. In general, the method is looked up by taking the name of the generic function, followed by a dot, followed by the class of the object.
115
Using Default Methods in R
R lets you call the default method of a generic function directly by using the default keyword in the name (for example, print.default()). In that case R ignores the class of the object and simply uses the default method.
116
Performing Graphical Analysis in R
Graphs are useful for non-numerical data, such as colours, flavours, brand names, and more. When
numerical measures are difficult or impossible to compute, graphs play an important role.
● Plots with Single Variable – You can plot a graph for a single variable.
● Plots with Two Variables – You can plot a graph with two variables.
● Plots with Multiple Variables – You can plot a graph with multiple variables.
● Special Plots – R has low and high-level graphics facilities.
You may need to plot for a single variable in graphical data analysis with R programming.
For example – A plot showing daily sales values of a particular product over a period of time. You
can also plot the time series for month by month sales.
The choice of plots is more restricted when you have just one variable to the plot. There are various
plotting functions for single variables in R:
● Histograms – Used to display the mode, spread, and symmetry of a set of data.
● Index Plots – Here, the plot takes a single argument. This kind of plot is especially useful for
error checking.
● Time Series Plots – When a period of time is complete, the time series plot can be used to join
the dots in an ordered set of y values.
● Pie Charts – Useful to illustrate the proportional makeup of a sample in presentations.
Histograms have the response variable on the x-axis, and the y-axis shows the frequency of
different values of the response. In contrast, a bar chart has the response variable on the y-axis and
a categorical explanatory variable on the x-axis.
1.Histograms
Histograms display the mode, the spread, and the symmetry of a set of data. The R function hist() is
used to plot histograms.
The x-axis is divided into intervals, called bins, into which the values of the response variable fall and are then counted. Histograms are tricky because the picture you get depends on subjective judgments about exactly where to put the bin margins: wide bins produce one picture, narrow bins produce a different picture, and unequal bins produce confusion.
Narrow bins tend to produce multimodality (several apparent peaks), whereas broad bins tend to produce unimodality (a single peak). When bin widths differ, the default in R is to convert the counts into densities.
The convention adopted in R for showing bin boundaries is to employ square and round brackets,
so that:
● [a, b) means 'greater than or equal to a but less than b' (square bracket, then round bracket).
● (a, b] means 'greater than a but less than or equal to b' (round bracket, then square bracket).
You need to take care that the bins can accommodate both your minimum and maximum values.
The cut() function takes a continuous vector and cuts it up into bins that can then be used for
counting.
The hist() function in R treats any suggestion about the number of bars or the width of bars only as a recommendation; this makes it easier to view multiple histograms with a similar range side by side. For small integer data, you can have one bin for each value.
In R, the parameter k of the negative binomial distribution is known as size and the mean is
known as mu.
Drawing histograms of continuous variables is more challenging than drawing them for discrete data. The problem relates to density estimation, which is an important issue for statisticians. To deal with it, you can approximately transform the continuous model into a discrete one, using a linear approximation to evaluate the density at the specified points.
The choice of bandwidth is a compromise between smoothing out insignificant bumps and retaining real peaks.
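A minimal histogram sketch with simulated data:
set.seed(1)
x <- rnorm(1000)                      # 1000 simulated values
hist(x, breaks = 20,                  # the number of breaks is only a suggestion
     main = "Histogram of simulated data", xlab = "x")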
118
2 Index Plots
For plotting single samples, index plots can be used. The plot function takes a single
argument. This is a continuous variable and plots values on the y-axis, with the x coordinate
determined by the position of the number in the vector. Index plots are especially useful for error
checking.
3 Time Series Plots
The time series plot can be used to join the dots in an ordered set of y values when a period of time
is complete. The issues arise when there are missing values in the time series (e.g., if sales values
for two months are missing during the last five years), particularly groups of missing values (e.g., if
sales values for two quarters are missing during the last five years) and during that period we
typically know nothing about the behaviour of the time series.
ts.plot and plot.ts are the two functions for plotting time series data in R.
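A small sketch with made-up monthly sales figures:
sales <- ts(c(12, 15, 14, 18, 20, 19, 23, 25, 24, 28, 30, 33),
            start = c(2020, 1), frequency = 12)   # monthly series from Jan 2020
ts.plot(sales, ylab = "Sales")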
4 Pie Chart
You can use pie charts to illustrate the proportional makeup of a sample in presentations. Here the
function pie takes a vector of numbers and turns them into proportions. It then divides the circle on
the basis of those proportions.
To indicate each segment of the pie, it is essential to use a label. The label is provided as a vector of
character strings, here called data$names.
If a names list contains blank spaces then you cannot use read.table with a tab-delimited text file to
enter the data. Instead, you can save the file called piedata as a comma-delimited file, with a “.csv”
extension, and input the data to R using read.csv in place of read.table.
#Author DataFlair
data <- read.csv("/home/dataflair/data/piedata.csv")
data
Output:
119
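The pie chart itself can then be drawn; a minimal sketch, assuming the file has a names column and a numeric column called amounts (the column names are illustrative):
pie(data$amounts, labels = data$names)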
2. Plots with Two Variables
The two types of variables used in the graphical data analysis with R:
● Response variable
● Explanatory variable
The response variable is represented on the y-axis and the explanatory variable is represented on
the x-axis.
When an explanatory variable is categorical, like genotype or colour or gender, the appropriate plot
is either a box-and-whisker plot or a barplot.
120
Scatterplots
Scatterplots show a graphical representation of the relationship between two numeric variables. The plot function draws the axes and adds a scatterplot of points. You can also add extra points or lines to an existing plot by using the points() and lines() functions.
The points() and lines() functions can be specified in the following two ways:
● Cartesian plot (x, y) – A Cartesian coordinate specifies the location of a point in a two-
dimensional plane with the help of two perpendicular vectors that are known as an axis. The
origin of the Cartesian coordinate system is the point where two axes cut each other and the
location of this point is the (0,0).
● Formula plot (y, x) – The formula based plot refers to representing the relationship between
variables in the graphical form. For example – The equation, y=mx+c, shows a straight line in
the Cartesian coordinate system.
The advantage of the formula-based plot is that the plot function and the model fit look and feel the
same. The Cartesian plots build plots using “x then y” while the model fit uses “y then x”.
The plot function uses the following arguments:
121
The best way to identify multiple individuals in scatterplots is to use a combination of colours and
symbols. A useful tip is to use as.numeric to convert a grouping factor into colour and/or a symbol.
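A minimal scatterplot sketch with simulated data, using the formula style and adding a smooth trend line:
set.seed(2)
x <- runif(50, 0, 10)
y <- 2 * x + rnorm(50)
plot(y ~ x, pch = 19, col = "blue")   # formula plot: response ~ explanatory
lines(lowess(x, y), col = "red")      # add a smoothed trend line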
Stepped Lines
Stepped lines can be plotted as graphical representation displays in R. These plots, plot data
distinctly and also provide a clear view of the differences in the figures.
While plotting square edges between two points, you need to decide whether to go across and then
up, or up and then across. Let’s assume that we have two vectors from 0 to 10. We plot these points
as follows:
x = 0:10
y = 0:10
plot(x,y)
Output:
122
Also, generate a line by using the upper case “S” as shown below:
> lines(x,y,col="green",type='S')
Output:
A box-and-whisker plot is a graphical means of representing sets of numeric data using quartiles. It
is based on the minimum and maximum values, and upper and lower quartiles.
Boxplots summarize the available information. The vertical dashed lines are called the 'whiskers'.
Boxplots are also excellent for spotting errors in data; extreme outliers often represent such errors.
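A small sketch comparing two simulated groups with a box-and-whisker plot:
group <- rep(c("A", "B"), each = 25)
value <- c(rnorm(25, mean = 10), rnorm(25, mean = 12))
boxplot(value ~ group, ylab = "Value")   # one box (with whiskers) per group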
123
Barplot
Barplot is an alternative to boxplot to show the heights of the mean values from the different
treatments. Function tapply computes the height of the bars. Thus it works out the mean values for
each level of the categorical explanatory variable.
Let us create a toy dataset of temperatures in a week. Then, we will plot a barplot that will have
labels.
temperature <- c(28, 35, 31, 40, 29, 41, 42)
days <- c("Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat")
barplot(temperature,
        main = "Maximum Temperatures in a Week",
        xlab = "Days",
        ylab = "Degree in Celsius",
        names.arg = days,
        col = "darkred")
Output:
Initial data inspection using plots is even more important when there are many variables, any one of
which might have mistakes or omissions. The principal plot functions that represent multiple
variables are:
● The Pairs Function – For a matrix of scatterplots of every variable against every other.
● The Coplot Function – For conditioning plots where y is plotted against x for different values
of z.
It is better to use more specialized commands when dealing with the rows and columns of data
frames.
124
The Pairs Function
For two or more continuous explanatory variables, it is valuable to check for subtle
dependencies between the explanatory variables. Rows represent the response variables and
columns represent the explanatory variables.
The pairs() function plots every variable in the data frame on the y-axis against every other variable on the x-axis. It needs only the name of the whole data frame as its first argument.
The relationship between the two variables may be obscured by the effects of other processes in
multivariate data. When you draw a two-dimensional plot of y against x, then all the effects of other
explanatory variables are shown onto the plane of the paper. In the simplest case, we have one
response variable and just two explanatory variables.
The coplot panels are ordered from lower left to upper right, associated with the values of the
conditioning variable in the upper panel from left to right.
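Both functions can be illustrated with the built-in iris data set:
pairs(iris[, 1:4], col = iris$Species)                      # scatterplot matrix
coplot(Sepal.Length ~ Sepal.Width | Species, data = iris)   # y against x for each level of z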
R has extensive facilities for producing graphs. It also has low and high-level graphics
facilities as per the requirement.
The low-level graphics are the basic building blocks that can build up graphs step by step, while a
high-level facility provides the variety of pre-assembled graphical display.
Apart from the various kinds of graphical plots discussed, R supports the following special plots:
● Design Plots – Effective sizes in designed experiments can be visualized using design
plots. One can plot the design plots using the plot.design function
– plot.design(Growth.rate~Water*Detergent*Daphnia)
● Bubble Plots – Useful for illustrating the variation in a third variable across different locations in the x-y plane.
● Plots with many Identical Values – Sometimes, two or more points with count data fall in
exactly the same location in a scatterplot. As a result, the repeated values of y are hidden, one
beneath the other.
Using the following functions, we can add the extra graphical objects in plots:
125
● arrows – For plotting arrows and headed bars. The arrows function draws a line from the point (x0, y0) to the point (x1, y1) with the arrowhead, by default, at the "second" end (x1, y1).
arrows(x0, y0, x1, y1)
Adding code = 3 produces a double-headed arrow, for example from (2, 1) to (9, 8):
plot(x, y)
arrows(2, 1, 9, 8, code = 3)
#Author DataFlair
Output:
You are likely to want to save each of your plots as a PDF or PostScript file for publication-quality
graphics. This is done by specifying the ‘device’ before plotting, then turning the device off once
finished.
The computer screen is the default device, where we can obtain a rough copy of the graph, using
the following command:
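The original command is not reproduced in this file; a minimal sketch of the usual save-to-PDF pattern (the file name is illustrative) is:
pdf("myplot.pdf")   # open a PDF graphics device
plot(x, y)          # any plotting commands are written to the file, not the screen
dev.off()           # close the device to finish writing the file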
● Line Graph – It displays data over a time period. It generally tracks records over both long and short time periods, according to requirements. In the case of small
126
changes, the line graph is more common than the bar graph. In some cases, line graphs also compare the changes among different groups over the same time period.
● Pie Chart – It displays comparison within a group.
For example – You can compare students in a college on the basis of their streams, such as
arts, science, and commerce using a pie chart. One cannot use a pie chart to show changes over
the time period.
● Bar Graph – Similar to a line graph, the bar graph generally compares different groups or
tracking changes over a defined period of time. Thus the difference between the two graphs is
that the line graph tracks small changes while a bar graph tracks large changes.
● Area Graph – The area graph tracks the changes over the specific time period for one or more
groups related to a similar category.
● X-Y Plot – The X-Y plot displays a certain relationship between two variables. In this type of
variable, the X-axis measures one variable and Y-axis measures another variable. On the one
hand, if the values of both variables increase at the same time, a positive relationship exists
between variables. On the other hand, if the value of one variable decreases at the time of the
increasing value of another variable, a negative relationship exists between variables. It could
be also possible that the two variables don’t have any relationship. In this case, plotting graph
has no meaning.
UNIT – V
127
Big Data Visualization:
Introduction to Data visualization, Challenges to Big data visualization, Types of data visualization,
Visualizing Big Data, Tools used in data visualization, Proprietary Data Visualization tools, Open
source data visualization tools, Data visualization with Tableau.
Data visualization
Data visualization is a graphical representation of any data or information. Visual elements
such as charts, graphs, and maps are the few data visualization tools that provide the viewers with
an easy and accessible way of understanding the represented information.
Data visualization tools and technologies are essential to analyze massive amounts of
information and make data-driven decisions.
128
Scalability and dynamics are two major challenges in visual analytics
The visualization-based methods take the challenges presented by the “four Vs” of big data and turn
them into following opportunities.
● Volume: The methods are developed to work with an immense number of datasets and enable users to derive meaning from large volumes of data.
● Variety: The methods are developed to combine as many data sources as needed.
● Velocity: With the methods, businesses can replace batch processing with real-time stream
processing.
● Value: The methods not only enable users to create attractive info graphics and heat maps,
but also create business value by gaining insights from big data.
Big data often has unstructured formats. Due to bandwidth limitations and power requirements,
visualization should move closer to the data to extract meaningful information efficiently.
Effective data visualization is a key part of the discovery process in the era of big data. For the
challenges of high complexity and high dimensionality in big data, there are different
dimensionality reduction methods.
There are also following problems for big data visualization:
● Visual noise: Most of the objects in the dataset are too close to one another; users cannot separate them as distinct objects on the screen.
● Information loss: Reduction of visible data sets can be used, but leads to information loss.
● Large image perception: Data visualization methods are not only limited by aspect ratio and
resolution of device, but also by physical perception limits.
● High rate of image change: Users observe data and cannot react to the number of data
change or its intensity on display.
● High performance requirements: this is hardly noticeable in static visualization, where speed requirements are lower, but it becomes a demanding requirement for dynamic, interactive visualization.
In Big Data applications, it is difficult to conduct data visualization because of the large size
and high dimensionality of big data. Most current Big Data visualization tools have poor performance in scalability, functionality, and response time. Uncertainty, which can arise at any stage of a visual analytics process, is also a great challenge for effective uncertainty-aware visualization.
Potential solutions to some challenges or problems about visualization and big data were presented:
● Meeting the need for speed: One possible solution is hardware. Increased memory and
powerful parallel processing can be used. Another method is putting data in-memory but
using a grid computing approach, where many machines are used.
● Understanding the data: One solution is to have the proper domain expertise in place.
● Addressing data quality: It is necessary to ensure the data is clean through the process of
data governance or information management.
● Displaying meaningful results: One way is to cluster data into a higher-level view where
smaller groups of data are visible and the data can be effectively visualized.
129
● Dealing with outliers: Possible solutions are to remove the outliers from the data or create a
separate chart for the outliers.
Among the common types of data visualization, the simplest is the 1D (linear) representation and the most complex is the network representation. The following is a brief description of each of these data visualizations:
▪ 1D (Linear) Data Visualization – In the linear data visualization, data is presented in the
form of lists. Hence, we cannot term it as visualization. It is rather a data organization
technique. Therefore, no tool is required to visualize data in a linear manner.
130
▪ 2D (planar) Data Visualization – This technique presents data in the form of images, diagrams, or charts on a plane surface.
▪ 3D (volumetric) Data Visualization – In this method, data presentation involves exactly three dimensions to show simulations, surface and volume rendering, etc. It is generally used in scientific studies. Today, many organizations use 3D computer modeling and volume rendering in advertisements to give users a better feel for their products.
▪ Network data visualization – It is used to represent data relations that are too complex to
be represented in the form of hierarchies.
The most common notation used for big data is 3 V’s- volume, velocity, and variety. But, the
most exciting feature is the way in which value is filtered from the haystack of data.
Nowadays, IT companies that use Big Data face the following challenges:
By considering the above factors, IT companies are focusing more on research and development of
robust algorithm, software, and tools to analyze the data that is scattered in the internet space.
131
Visualization of data can produce cluttered images, which are filtered with the help of clutter-reduction techniques. Uniform sampling and dimension reduction are two commonly used clutter-reduction techniques.
The visual data reduction process involves automated data analysis to measure density, outliers, and their differences. These measures are then used as quality metrics to evaluate the data-reduction activity.
⮚ Size metrics
Apart from representing data, a visualization tool must be able to establish links between different data values, restore missing data, and polish data for further analysis.
EXCEL – A widely used spreadsheet tool that can also be applied to data analytics. It helps you track and visualize data for deriving better insights. This tool provides various ways to share data and analytical conclusions within and across organizations.
132
LastForward – It is open-source software provided by Last.fm for analyzing and visualizing the social music network.
133
D3 – D3 allows you to bind arbitrary data to a document object model (DOM) and then applies
data- driven transformations to the document. For example, you can use D3 to generate an
HTML table from an array of numbers. Or, use the same data to create an interactive SVG bar
chart with smooth transitions and interactions.
Rootzmap Mapping the internet – It is a tool to generate a series of maps on the basis
of the datasets provided by the National Aeronautics and Space Administration (NASA).
● They deliver performance and are compliant with web as well as mobile web security requirements.
Analytical techniques are used to analyze complex relationships among variables. The following
are some commonly used analytical techniques for big data solutions:
Regression analysis – it is a statistical tool used for prediction. Regression analysis is used to
predict continuous dependent variables from independent variables.
o Path analysis
Tableau is a Data visualization software that allows developers to build interactive dashboards
that are easily updated with new data and can be shared with a wider audience. There are various
types of Tableau products available in the market. Some of the commonly known products include
Tableau Desktop, Tableau Server, Tableau Online, Tableau Reader, and Tableau Public.
The important features of Tableau Software include the following:
⮚ Single-click data analytics in visual form
⮚ In-depth statistical analysis
⮚ Management of metadata
⮚ In-built, top-class data analytic practices
⮚ In-built data engine
⮚ Big data analytics
⮚ Quick and accurate data discovery
⮚ Business dashboards creation
⮚ Various types of data visualization
⮚ Social media analytics, including Facebook and Twitter
⮚ Easy and quick integration of R
⮚ Business intelligence through mobile
⮚ Analysis of time series data
⮚ Analysis of data from surveys
135
⮚ Google analytics
⮚ Google bigquery
⮚ Hadoop
⮚ HP vertica
⮚ SAP
⮚ Splunk
Click the Continue button; the opening page of the Tableau tool appears.
Toolbar Icons:
Tableau is a GUI-oriented, drag-and-drop tool. The following figure shows the icons present on the Tableau toolbar.
o Undo/ Redo – scrolls backward or forward on the screen. You can retrieve any step by
clicking the undo/redo button
o File save – saves your work. You need to click this button frequently as tableau does not
have the automated save function.
136
o Auto/ manual update – generates visual. It is particularly helpful for large datasets where
dragging and dropping items consume time
o Group – allows you to group data by selecting more than one headers in a table or values in
a legend.
o Reset cards – provides a menu to turn on or off screen elements, such as caption or
summary.
o Fit Menu – allows different views of the tableau screen. You can fit the screen either
horizontally or vertically.
o Fit Axis - Fixes the axis of view. You can zoom in/out charts with this button
Main menu
File – contains general functions, such as open, save, and save as. Other functions are print to pdf
and Repository location function to review and change the default location of the saved file.
Data – helps to analyze the tabular data on the tableau website. The edit relationships option is used
to blend data when the field names in two data sources are not identical
Worksheet – provides option such as export option, excel crosstab, and duplicate as crosstab
Dashboard – Provides the actions menu, which is the most important option on the dashboard menu
because all the actions related to tableau worksheets and dashboards are defined within the actions
menu. The actions menu is present under the worksheet menu as well.
Story – provides the new story option that is used for explaining the relationship among facts,
providing context to certain events, showing the dependency between decisions and outcomes, etc.
Analysis – provides the aggregate measures and stack mark options. To create new measures or
dimensions, use create calculated field or edit calculated field.
Map – provides options to change the color scheme and replace the default maps
Window – provides the bookmark menu, which is used to create .tbm files that can be shared with
different users
Help – provides options to access tableau’s online manual, training videos, and sample workbooks
Tableau Server
In Tableau Server, users can interact with dashboards on the server without any installation on their machines. Tableau Online is Tableau Server hosted by Tableau on a cloud platform.
Tableau server also provides robust security to the dashboards. Tableau Server web-edit feature
allows authorized users to download and edit the dashboards. Tableau server allows users to
publish and share their data sources as live connections or extracts. Tableau Server is highly
secured for visualizing data. It leverages fast databases through live connections
● Tableau workbook – it is the default save type when you save your work on the desktop.
The extension of such files will be .twb. The files with extension .twbx can be shared with
people not having tableau desktop license or those who cannot access the data source.
● Tableau data source – if you frequently connect to a specific data source or if you have
manipulated the metadata of any data source, saving the file as tableau data source is of
great use. The extension of such a file will be .tds, and it includes server address, password,
and metadata.
● Tableau bookmark - if you want to share any specific file with others, use tableau
bookmark.
● Tableau data extract – it compresses your extracted data and improves performance by
incorporating more formulas and functions. The extension of a tableau data extract file
is .tde.
Tableau charts
Tableau can create different types of univariate, bivariate, and multivariate charts.
The following are some of the common chart types that tableau can create:
● Tables – tables are an excellent choice of presenting data as they preserve all the
information, which in turn minimize the chances of misinterpretation.
● Scatter plots – scatter plots are used to describe the relationship between two variables.
● Trend lines – trend lines are used to analyze the relationship between variables as well
as predict the future outcome
138
● Bullet graph – bullet graph is just like a bar graph and is generally used in qualitative
analysis.
● Box plot – box plot represents distribution of data and is used in the comparison of
multiple sets of data. It can effectively compute:
o Median
o Upper and lower quartiles
o Minimum and maximum values
o Outliers
o Treemap – treemap is one of the best compact techniques to visualize the part to
whole relationships as well as hierarchical models.
o Word cloud – similar to bubble charts, the words in a word cloud are sized
according to the frequency at which they appear in the content.
139
16. UNIVERSITY QUESTION PAPERS OF PREVIOUS YEAR:
Unit – 1
Unit – 2
Unit-3
Unit – 4
Unit – 5
140
18. Assignment Questions
Unit 1
1. Define Big Data and Data Science.
2. Give some real-world Big Data examples and explain each.
3. What has led to the explosive growth of data? Justify.
4. Define data velocity and discuss some examples of it.
5. Discuss in detail the Big Data processing architectures.
Unit 2
Unit 3
1. What is meant by the Google File System? Explain.
2. Give the importance of the Hadoop architecture.
3. Explain the Hadoop MapReduce paradigm.
4. What are the NameNode, Secondary NameNode, and DataNode? Explain.
5. What are the tasks performed by MapReduce in Hadoop?
Unit 4
1. Discuss the life cycle of Data Analytics.
2.
a. List the techniques used to draw graphs in R.
b. Write short note on time series plots.
3. Discuss the syntax of defining a function in R.
4.
a. What are functions? Explain the two main advantages of using functions over
scripts.
b. What are arguments? Why are they used in functions?
5. a. List some types of data structures available in R.
Unit 5
141
19. Unit wise Quiz questions
Unit – 1
1. Which of the following are parts of the 5 P's of data science and what is the additional P
introduced in the slides?
● People
● Purpose
● Product
● Perception
● Process
● Programmability
● Platforms
2. Which of the following are part of the four main categories to acquire, access, and retrieve
data?
● NoSQL Storage
● Remote Data
● Traditional Databases
● Web Services
● Text Files
4. Of the following, which is a technique mentioned in the videos for building a model?
● Investigation
● Validation
● Evaluation
● Analysis
5. What is the first step in finding a right problem to tackle in data science?
142
● Define the Problem
● Define Goals
● Business Objectives
● Collect Data
● Build In-House Expertise
● Organizational Buy-In
Data exploration...
● Remove outliers.
● Generate best estimates for invalid values.
● Remove data with missing values.
● Data Wrangling
● Merge duplicate records.
● Retrieve Data
● Select Analytical Techniques
● Build Models
● Identify Data Sets and Query Data
● Understanding Nature of Data and Preliminary Analysis
143
11. According to analysts, for what can traditional IT systems provide a foundation when they’re
integrated with big data technologies like Hadoop?
A. Open source
B. Real-time
C. Java-based
A. Apple
B. Datamatics
C. Facebook
A. Volume
B. Velocity
C. Variety
15. ________ hides the limitations of Java behind a powerful and concise Clojure API for
Cascading.
A. Scalding
B. Cascalog
C. Hcatalog
D. Hcalding
Unit – 2
Unit – 3
1. The MapReduce algorithm contains two important tasks, namely __________.
A. mapped, reduce
B. mapping, Reduction
C. Map, Reduction
D. Map, Reduce
2. __________ takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
A. Map
B. Reduce
C. Both A and B
D. Node
3. The __________ task takes the output from a map as an input and combines those data tuples into a smaller set of tuples.
A. Map
B. Reduce
C. Node
D. Both A and B
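For reference, the Map and Reduce tasks described in questions 2 and 3 can be imitated in plain R on a toy word-count example. The sketch below is only an illustration of the paradigm, not Hadoop MapReduce code.

# Illustrative sketch of the MapReduce idea in plain R (not Hadoop itself)
lines <- c("big data big analytics", "data analytics")
# Map task: break each line into (word, 1) key/value pairs
mapped <- unlist(lapply(lines, function(line) {
  words <- strsplit(line, " ")[[1]]
  setNames(rep(1, length(words)), words)
}))
# Shuffle/sort: group the values by key (word)
grouped <- split(unname(mapped), names(mapped))
# Reduce task: combine each group of values into a smaller set (one count per word)
counts <- sapply(grouped, sum)
print(counts)   # e.g. analytics 2, big 2, data 2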
A. 2
B. 3
C. 4
D. 5
5. Which of the following schedules jobs and tracks the jobs assigned to the Task Tracker?
A. SlaveNode
B. MasterNode
C. JobTracker
D. Task Tracker
6. Which of the following refers to the execution of a Mapper or a Reducer on a slice of data?
A. Task
B. Job
C. Mapper
D. PayLoad
A. secondaryadminnode
B. nameadmin
C. dfsadmin
D. adminsck
Although the Hadoop framework is implemented in Java, MapReduce applications need not be
written in ____________
A. C
B. C#
C. Java
D. None of the above
A. Inputs
B. Output
C. Task
D. None of the above
SQL
JSON
Document databases
None of the Above
Key-value
Document
Wide-column
All of the above
Scalability
Relational data
Faster data access than RDBMS.
Data easily held across multiple servers
No SQL
Only SQL
Not Only SQL
SQL Undefined
15. What does ETL stand for?
A. Data Inspection
B. Transformation
C. Extract, Transform, Load
D. Data Flow
16. Which of these steps is executed at the end of every stage of ETL – extract, clean,
conform?
Logging the activity to a flat file
Displaying the data to the user
Staging the data to the database
Sending a message about the tasks
17. ETL execution or operation approach falls into which of these two major categories:
Planning & Execution
Implementation & Testing
Scheduling & Support
Maintenance & Support
18. One of the requirements while designing an ETL system is how quickly source data can be delivered to end users. This is referred to as:
Data speed
Data lineage
Data latency
Data availability
19. Extracting the data from the source systems is involved in the _______ step of the ETL process.
extract
transform
load
planning
Point out the correct statement.
Debian
Unix-like
Bare metal
Cross-platform
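Questions 15–19 above revolve around the extract–transform–load flow. A minimal illustrative R sketch of that flow is given below; the file names students.csv and clean_students.csv and the name column are placeholders assumed only for the example.

# Minimal illustrative ETL flow in R (file and column names are placeholders)
# Extract: read the raw data from a source system (here a CSV file)
raw <- read.csv("students.csv", stringsAsFactors = FALSE)
# Transform: clean and reshape the data (drop incomplete rows, standardise a column)
clean <- raw[complete.cases(raw), ]
clean$name <- toupper(trimws(clean$name))     # assumes a 'name' column exists
# Load: write the conformed data to the target (here another file)
write.csv(clean, "clean_students.csv", row.names = FALSE)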
Unit – 4
1. Which of the following is the process of fixing or removing the incorrect data within the dataset?
a. Data transformation b. Data cleaning c. Data visualization d. All of the above
Ans: b
2. Which of the following is the correct syntax to sort the elements of dataset X in the
decreasing order?
a. > order(X$E1, decreasing=TRUE)
b. >order(X, decreasing)
c. >order(X)
d. >order(X, decreasing = FALSE)
Ans: a
3. Which of the following data objects should be used in R if the data consists of one or more
values of a single type?
a. Vector b. matrix c. list d. data frame
Ans: a
4. Which of the following data objects should be used in R if the data consists of one or more
values of any type?
a. Vector b. matrix c. list d. data frame
Ans: c
5. Which of the following subset operators should be used to get an output with multiple
elements of your data?
a. $ b. [[ c. [ d. c()
Ans: c
6. Which of the following is not a type of merge operation performed in R?
a. Full outer join b. Left outer join c. Common join d. Natural join
Ans: c
7. ____ are used to represent only one- dimensional data, like a set of digits.
a. Vectors b. Matrices c. Data frames d. Lists
Ans: a
8. The _____ function is used to sort a data frame.
a. Sort() b. order() c. melt() d. dcast()
Ans: b
9. Which of the following is the correct symbol to represent the dot argument?
a. ... ... ... ... b. ... ... ... c. ... ... d. ...
Ans: d
10. ____ function can be created without specifying its name.
a. Anonymous function
b. External function
c. Internal function
d. Built in function
Ans: a
11. The environment inside a function is known as the ____ environment.
a. Global b. nested c. local d. child
Ans: c
12. Which character function is used for extracting characters from a string?
a. Grep() b. substr() c. sub() d. paste()
Ans: b
13. Which of the following techniques is not used as a plotting mechanism?
a. Using plots for a default variable
b. Using plots for a single variable
c. Using plots for two variables
d. Using plots for multiple variables
Ans: a
14. Which of the following plots is circular?
a. Index b. histogram c. bar d. pie
Ans: d
15. Which of the following arguments is used in the hist() function for setting the color of a
histogram?
a. Color b. clr c. col d. crl
Ans: c
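Several of the answers above – order() for sorting, anonymous functions, substr(), and the col argument of hist() – can be verified directly at the R prompt. The short sketch below uses a small made-up data frame X purely for illustration.

# Quick checks for some of the R facts used in the answers above
X <- data.frame(E1 = c(3, 1, 2), E2 = c("a", "b", "c"))
# Q2/Q8: order() sorts a data frame by a column (decreasing order here)
X[order(X$E1, decreasing = TRUE), ]
# Q10: an anonymous function is created without giving it a name
sapply(X$E1, function(v) v * 10)
# Q12: substr() extracts characters from a string
substr("DataAnalytics", 1, 4)    # returns "Data"
# Q15: the col argument sets the colour of a histogram
hist(c(1, 2, 2, 3, 3, 3), col = "grey", main = "hist() with the col argument")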
Unit – 5
1. Which of the following machine learning techniques uses inputs to study their effects on the
outputs?
a. Supervised learning
b. Unsupervised learning
c. Semi- supervised learning
d. Transduction
Ans: d
2. Visual analytics is the science of ____ supported by the interactive visual interface.
a. Computers
b. Optics
c. Technical tools
d. Analytical reasoning
Ans: d
3. Which of the following tools provides state of the art technology to store and process big
data?
a. Yahoo
b. Google analytics
c. Twitter map
d. Hadoop
Ans: d
4. Visualization of data produces images that are filtered with the help of ____ techniques.
a. Clutter reduction
b. Predictive analytical
c. Logical reasoning
d. Open source
Ans: a
5. Which of the following big data visualization methods is used for categorizing observations
on the basis of meaningful differences?
a. Regression analysis
b. Grouping methods
c. Multiple equation models
d. Structural equation modeling
Ans: b
6. _____ provides a menu to turn on or off screen elements.
a. Reset cards b. label marks c. duplicate sheets d. fit axis
Ans: b
7. Which of the following main menu options will you use to publish your work on the Tableau
server?
a. Map
b. Format
c. Server
d. Window
Ans: c
8. In Tableau, which of the following charts will you use for comparing multiple sets of data
and representing their minimum and maximum values?
a. Scatter plot b. trend lines c. bullet graph d. box plot
Ans: d
9. Which of the following tableau software is suitable for data sharing?
a. Tableau desktop
b. Tableau server
c. Tableau reader
d. Tableau public
Ans: b
10. Which of the following toolbar icons allows you to clear data of the sheet?
a. Undo/redo
b. File save
c. Duplicate sheet
d. Clear sheet
Ans: d
11. Data can be visualized using?
A. graphs
B. charts
C. maps
D. All of the above
Ans: d
12. Which method shows hierarchical data in a nested format?
A. Treemaps
B. Scatter plots
C. Population pyramids
D. Area charts
Ans: a
13. Which of the following is false?
A. Data visualization includes the ability to absorb information quickly
B. Data visualization is another form of visual art
C. Data visualization decreases insight and leads to slower decisions
D. None of the above
Ans: c
14. Common use cases for data visualization include?
A. Politics
B. Sales and marketing
C. Healthcare
D. All of the above
Ans: d
Websites
REFERENCES
Text Book(s)
Reference Book(s)
25. Students List
Section-A
Section-B
Section-C
Section-D
Section-E
26. Group-Wise students list for discussion topic
Section-A
B.No AdmnNo StudentName B.No AdmnNo StudentName
18R11A0501 ADAVIKOLANU SWAPNA 18R11A0528 MOHAMMED ABDUL AMEEN SIDDIQUI
18R11A0502 ANDUGULA SHASHAANK 18R11A0529 MUDDULA ANUSHA
18R11A0503 AWARI DEEKSHITHA 18R11A0530 MUSALE AASHISH
18R11A0504 B DEEVENA ANGELINE 18R11A0531 MUTYALA SANTOSH
SUNAYANA
1 4
18R11A0505 BHAMIDIPATI SHIRIDI PRASAD 18R11A0532 PARITI DIVYA
REVANTH
18R11A0506 CH SIRI SOWMYA 18R11A0533 PARUCHURI HARSHA VARDHAN
18R11A0507 CHERIPALLI SREEJA 18R11A0534 PATRI SAI SINDHURA
18R11A0509 ERRABELLI RUSHYANTH 18R11A0535 PINNEM TARUN KUMAR
18R11A0510 G N HARSHITA 18R11A0536 PIRANGI NITHIN KALYAN
18R11A0511 GAJJI VARUN KUMAR 18R11A0537 POOJABOINA PREETHI
18R11A0512 SRI SAI PRANAVI GANTI 18R11A0538 PURANAM SATYA SAI RAMA TARUN
18R11A0513 H S SHREYA 18R11A0539 S GUNA SINDHUJA
18R11A0514 JANGAM NAGARJUNA GOUD 18R11A0540 SANGARAJU GREESHMA
2 5
18R11A0515 KANNE NITHESH SAI 18R11A0541 SYED ZAINUDDIN
18R11A0516 KODI AKHIL YADAV 18R11A0542 TELUKUNTLA RAJKUMAR
18R11A0517 KOLA SNEHITHA 18R11A0543 THORUPUNURI JANCY
18R11A0518 KOMURAVELLI KARTHIK 18R11A0544 THUMU RAM SAI TEJA REDDY
18R11A0519 KORADA SANTOSH KUMAR 18R11A0545 VADAKATTU HARISH
18R11A0520 KUNCHALA SAIRAM 18R11A0546 VAISHNAVI SABNA
18R11A0521 L A PRITHVIRAJ KUMAR 18R11A0547 VEMURI MADHU VENKATA SAI
18R11A0522 LAHARI BASAVARAJU 18R11A0548 YARRAM REDDY VENKATA SRIVANI REDDY
18R11A0523 LINGA JAYA KRISHNA 19R15A0501 BHULAXMI KALPANA
3 6
18R11A0524 M SREE CHARAN REDDY 19R15A0502 CHALLA DIVYA REDDY
18R11A0525 MAMBETI SAIRAM 19R15A0503 ADLA LIKITHA
18R11A0526 MAMILLA RAMYA 19R15A0504 GOPALADAS VINAYALATHA
18R11A0527 MOHAMMAD AFROZ KHAN 19R15A0505 GANJI CHARAN KUMAR
Section-B
B.No AdmnNo StudentName B.No AdmnNo StudentName
18R11A0549 ALETI KALYANI 18R11A0577 LANKE RAVI TEJA
18R11A0550 ARENDRA SAI KRUPA CHARY 18R11A0578 MANDA SAI JAHNAVI
18R11A0551 BALRAMGARI SREENIDHI 18R11A0579 MANJEET KAUR
18R11A0552 BANALA KEERTHANA 18R11A0580 MEDIPALLI KRISHNA SAKETH
1 18R11A0553 BATHINA SIRISHA 4 18R11A0581 MOHAMMED AFREEN
18R11A0554 BHEMISETTY ABHISHEK 18R11A0582 MOHAMMED NAZEER AHMED
18R11A0555 BODDU SURESH KUMAR 18R11A0583 MOTHE SAHITHI
18R11A0556 BRAMANDLAPALLY ADITYA 18R11A0584 MOVVA MANASWINI
18R11A0557 CHAKALI HARINATH 18R11A0585 PABBA JOGENDAR
18R11A0558 CHILAKAPATI DIVYA 18R11A0586 PONTHAGANI SAI TEJA
18R11A0559 CHILUMULA AMULYA 18R11A0587 PURALI LIPIKA
18R11A0560 CHINNAM ROSE MARY 18R11A0588 RAGHU VARDHAN P
18R11A0561 D SWEEKRUTHI 18R11A0589 RUPA REDDY VUDUMULA
2 18R11A0562 DADI GEETHIKA 5 18R11A0590 SHIVA DHANUSH T V S
18R11A0563 DESETTY SRI NIKITA 18R11A0591 SUNKARI DINESHWAR REDDY
18R11A0564 DUSETY DIVYA REDDY 18R11A0592 T VISHNU VARDHAN REDDY
18R11A0565 G RAGHAV 18R11A0593 VALLABHANENI MANIKANTA VINEETH
18R11A0566 G RUCHITHA 18R11A0594 VENKATA APARNA K L
18R11A0567 G SAI KEERTHANA 18R11A0595 VENNAKANDLA KARTHIK
18R11A0569 GORLA SAI CHARAN 18R11A0596 YEDLAPATI JEEVANI
18R11A0570 GORREMUTCHU WINNY REBBECA 19R15A0506 MANCHALA RAJINIBHARATHI
18R11A0571 K G N BALAJI RAO 19R15A0507 CHINNA SWAMY ANANTHIKA
3 18R11A0572 KARRI MAHESH VARMA 6 19R15A0508 SABBAN AJAY
18R11A0573 KODAKANDLA NAVYA 19R15A0509 GANDHALA GANESH
18R11A0574 KODISANA SANJAY RAM 19R15A0510 KONKA UMESH
18R11A0575 KONAKANCHI GOPTRI SULEKHA *14R11A05F1 P SAI PRASAD
18R11A0576 KUMBA SRI CHAKRA *15R11A05N4 PORIKA RAMA KRISHNA
Section-C
B.No AdmnNo StudentName B.No AdmnNo StudentName
17R11A0556 CHERUKUPALLY MANISH 18R11A05C3 MOOD GANESH
18R11A0597 ANISHA DHANANJAY SHENDKAR 18R11A05C4 MUDUMBAI ACHARYA
RAGHUTHAMADITHYA
18R11A0598 AVIDI DIVYA 18R11A05C5 NANDIKANTI SRI RAM
18R11A0599 BATTULA SAI ABHISHEK YADAV 18R11A05C6 NIKHILA POLKAMPALLY
1 18R11A05A0 BODDUPALLI RADHA KRISHNA REVANTH 4 18R11A05C7 P BHUMIKA PRIYA
18R11A05A1 BUNGA STEEN STRA DORA 18R11A05C8 P HEMANTH REDDY
18R11A05A2 BURUGADDA SHREYA SREE 18R11A05C9 PENDYALA NIKHIL RAO
18R11A05A3 C SURYA VENKAT 18R11A05D0 PODDUTOORI ANUSHA
18R11A05A4 CHALLAKOLUSU NANDA KISHORE 18R11A05D1 POLA SUMANTH
18R11A05A5 CHARAN RAJU M 18R11A05D2 PRIYADARSHINI VIJAYA RANGAN
18R11A05A6 CHITTIBOMMA BALARAM 18R11A05D3 RACHAKONDA GOPIKRISHNA
18R11A05A7 CHOKKAPURAM SRAVANI 18R11A05D4 RAMAVATH LAVANYA
18R11A05A8 DHONAKANTI ROHITH 18R11A05D5 RAVALKOL PAVANI GOUD
2 18R11A05A9 KAKARLA SATYA SRILATHA 5 18R11A05D6 ARRAM RUSHWANTH
18R11A05B0 KALANCHA SATHVIK 18R11A05D7 S APOORVA
18R11A05B2 KODIPE SAIKIRAN 18R11A05D8 SALVERU SAI ROHAN
18R11A05B3 KOMMURI MANASA 18R11A05D9 SHAIKH AFROZ
18R11A05B4 KOTHA SOHAN 18R11A05E1 VAKITI VINESH REDDY
18R11A05B5 KOTLA ASWINI 18R11A05E2 VARUN N
18R11A05B6 KOTTURI NIKHIL PADMANABH 18R11A05E3 VUTKURI SATHVIK GOUD
18R11A05B7 KYATHAM ASHOK GAJAPATHI RAJU 18R11A05E4 YENIMI REDDY JHANSI
18R11A05B8 LAKAVATH DIVYA 19R15A0511 BURA PREM SAI
3
18R11A05B9 M HARI RAMAKRISHNA 6 19R15A0512 RATNALA ASHWINI
18R11A05C0 M RAHUL 19R15A0513 ETIKALA ANUSHA
18R11A05C1 MALLARAPPU SHIRISHA 19R15A0514 ADEPU KEERTHANA
18R11A05C2 METTU KARUNA SRI REDDY 19R15A0515 T BALAVARDHAN REDDY
*18R15A0532 GANDAMALLA SHESHANK
Section-D
B.No AdmnNo StudentName B.No AdmnNo StudentName
18R11A05E5 AKSHITA YERRAM 18R11A05H2 NEELA PAVAN
18R11A05E6 ARYASOMAYAJULA VISHAL BHASKAR 18R11A05H3 NEELAPALA TEJA SHREE
18R11A05E7 BALANNAGARI DEEPAK REDDY 18R11A05H4 NEELAYAVALASA MEGHNA PATNAIK
18R11A05E8 BATHRAJ HARINI 18R11A05H5 NEMANA PRANAMYA
1 18R11A05E9 BHALLAMUDI LAKSHMI PRIYANKA 4 18R11A05H6 PAPAIAHGARI SAI PRIYA
18R11A05F0 BODA AKHILA 18R11A05H7 PENUMARTHI KRISHNA BHARADWAJ
18R11A05F1 BODAGAM DEEKSHITHA REDDY 18R11A05H8 SAI NEHA MANDA
18R11A05F2 BOGURAMPETA SUNIL REDDY 18R11A05H9 SAI PRAVALIKA PERIKA
18R11A05F3 BORRA YASWANTH KUMAR 18R11A05J0 SALLA ANUSHA
18R11A05F4 CHINTAMANENI MEGHANA 18R11A05J1 SANDU JAI VENKATESH
18R11A05F5 DINDU SANDEEP 18R11A05J2 SANKU RAJSHREE RAO
18R11A05F6 DINTYALA NAVYA SREE 18R11A05J3 SEELAM SANJANA
18R11A05F7 DONDAPATI MITHUN 18R11A05J4 SOMI SETTY SAI NEELESH
2 18R11A05F8 DONKENA THARUN KUMAR 5 18R11A05J5 TADEPALLI SAI NANDINI
18R11A05F9 G BHUMIKA 18R11A05J6 THARA REKHA KAKARAPARTHI
18R11A05G0 GAJJALA TEJANARAYANA GOUD 18R11A05J7 TUMMALA VARSHITH
18R11A05G1 GARUGULA VIDYA SAGAR 18R11A05J8 V SATYA NAGA SAI SRILEKHA
18R11A05G2 GATTU BHARGAVI 18R11A05J9 VADDE NITHISH
18R11A05G3 GOWLUGARI ALEKHYA REDDY 18R11A05K0 VARIKUTI LAKSHMI TEJA
18R11A05G4 INJEY DIVYA 18R11A05K1 VIPRAGHNA VISHWANATH
SRIKAKULAPU
18R11A05G5 JYOTI GOUDA 18R11A05K2 YALALA SHALINI
3 18R11A05G7 KOMMERA VAMSHI KRISHNA REDDY 6 19R15A0516 KOLANUCHELIMI SAI CHARAN
18R11A05G8 KONAKANCHI MAHALAKSHMI 19R15A0517 CH NIKHIL
18R11A05G9 KORUKOPPULA SAI KRISHNA 19R15A0518 KANDI PAVAN
18R11A05H0 KOTTAM CHANDRA SHEKAR 19R15A0519 CHITYALA SIRISHA
18R11A05H1 MADHAVI YADAV 19R15A0520 VAGALDAS ARAVIND
Section-E
B.No AdmnNo StudentName B.No AdmnNo StudentName
17R11A05M5 PATEL VAGGMIC REDDY 18R11A05M7 MALLADI VENKATA VAISHNAVI
18H11A0507 MOHAMMED ARBAAZ SOHAIL 18R11A05M8 MORA AKHILA
18R11A05K3 ANGADI SOWMYA 18R11A05M9 MURUGAN RAJAM MEGHNA
18R11A05K4 B VAGDEVI SAHASRA 18R11A05N0 PALVAI VARSHINI
1 18R11A05K5 BASIREDDY SAI NITIN REDDY 4 18R11A05N1 PENTAM HARITHASRI
18R11A05K6 BATCHU OBED EMMANUEL 18R11A05N2 PINNAMARAJU NAGA SHIVANI
18R11A05K7 BHUKYA SAGAR 18R11A05N3 POSHAM KEERTHANA
18R11A05K8 BUNGA VINAY 18R11A05N4 PUNDRU MADHAVI
18R11A05K9 CHAKKA VYDEHI 18R11A05N5 RAVIKANTI VARSHA
18R11A05L0 CHALLA DHRUVITHA 18R11A05N7 SARAYU RANGU
18R11A05L1 CHINNALA AVINASH CHALAM 18R11A05N8 SRIRAM VENKATA SAI SHRAVAN
18R11A05L2 CHITTA JAYA VAMSI KRISHNA 18R11A05N9 SUDAGANI SANJANA
18R11A05L3 DANTULURI NAVYA MEGHANA 18R11A05P0 SUDHESHNA JINNA
18R11A05L4 DARISE GOPI CHANDU 18R11A05P1 TANDA AKSHAY KUMAR
2 5
18R11A05L5 GURRAM VISHNU SAHITA 18R11A05P2 TANMAYI KASTHURI
18R11A05L6 JATOTH YAKUB 18R11A05P3 THIGULLA PAVAN KUMAR
18R11A05L7 K ANIRUDH REDDY 18R11A05P4 V VAISHNAVI
18R11A05L8 K MADHUMITHA 18R11A05P5 VENGALA SETTI AVINASH
18R11A05L9 K MANGAIAH 18R11A05P6 Y RUSHIKA REDDY
18R11A05M0 JOSHNITHA RANGOLU 18R11A05P7 YELLELA V HARSHA VARDHAN
18R11A05M1 KAMMA DIVYA SREE 18R11A05P8 HANEESH REDDY POTHIREDDY
18R11A05M2 KANDIKONDA SHIVANI 18R11A05Q0 MALLARAPU PAVAN ABHINAV
3 18R11A05M3 KOLLAPU KARTHIK 6 19R15A0521 BOLLI UMARANI
18R11A05M4 KOYYADA ANUSHA GOUD 19R15A0522 MEDA NEERAJA
18R11A05M5 LELLA S V N CHANDANA LAVANYA 19R15A0523 GUJJULA VINAY REDDY
18R11A05M6 LILY VERONICA KOPPOLI 19R15A0524 CHEVALLA VENNELA