Big Data Analytics
FMTH0301/Rev.5.3
Course Plan
1. Explain the concept and challenges of big data and why existing technology is
inadequate for analyzing it.
2. Analyze the impact of big data on business decisions and strategy.
3. Gain hands-on experience with large-scale analytics tools to solve open big data
problems.
4. Apply non-relational databases and techniques for storing and processing large volumes
of semi-structured and unstructured data.
Competency: 5.2 - Demonstrate an ability to select and apply domain specific tools, techniques and resources
Performance indicators:
5.2.1 - Identify the strengths and limitations of tools for (i) acquiring information, (ii) modeling and simulating, (iii) monitoring system performance, and (iv) creating designs.
5.2.2 - Demonstrate proficiency in using domain specific tools
Competency: 7.2 - Demonstrate an ability to identify changing trends in computing knowledge and practice
Performance indicators:
7.2.1 - Identify technological advances in computing that require practitioners to stay updated with current technologies.
7.2.2 - Recognize the necessity of being updated with new developments in the domain
E.g., 1.2.3 represents program outcome ‘1’, competency ‘2’ and performance indicator ‘3’.
Text Book
1. Seema Acharya, Subhashini Chellappan, Big Data and Analytics, First Edition,
Wiley Publications, 2015.
References
1. EMC Education Services, Data Science and Big Data Analytics: Discovering,
Analyzing, Visualizing and Presenting Data, Wiley Publications.
2. Frank J. Ohlhorst, Big Data Analytics: Turning Big Data into Big Money,
Wiley and SAS Business Series, 2012.
3. Colleen McCue, Data Mining and Predictive Analysis: Intelligence Gathering
and Crime Analysis, Elsevier, 2007.
4. Michael Berthold, David J. Hand, Intelligent Data Analysis, Springer, 2007.
5. Bill Franks, Taming the Big Data Tidal Wave: Finding Opportunities in Huge
Data Streams with Advanced Analytics, Wiley and SAS Business Series,
2012.
6. Paul Zikopoulos, Chris Eaton, Understanding Big Data:
Analytics for Enterprise Class Hadoop and Streaming Data, McGraw Hill,
2011.
7. Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques,
Second Edition, Elsevier, Reprinted 2008.
Evaluation Scheme
CIE Scheme
Assessment | Theory
ISA-1 | 15
ISA-2 | 15
Total | 50
List of Practices
Chapter-wise Plan
Learning Outcomes:
At the end of the topic the student should be able to:
TLOs | COs | BL | CA Code
1. Differentiate between structured, semi-structured and unstructured data. | CO1 | L3 | 1.4
2. Explain the need to integrate structured, semi-structured and unstructured data in the context of big data analytics. | CO1 | L3 | 1.4
3. Explain characteristics and significance of big data. | CO1 | L2 | 2.2
4. Address challenges of big data. | CO1 | L3 | 1.4
Lesson Schedule
Class No. - Portion covered per hour
1. Classification of digital data: Unstructured, Semi-structured, Structured.
2. Characteristics of data, Evolution of big data.
3. Definition of big data: 5 Vs, challenges with big data.
4. Typical data warehouse environment: Hadoop Environment.
Review Questions
Sr. No. | Questions | TLO | BL | PI Code
1 | Illustrate the types of digital data. Explain the sources of structured data. | TLO1 | L3 | 1.4.1
2 | Why is integration of structured, semi-structured and unstructured data needed in the context of data generated by Facebook? | TLO2 | L3 | 1.4.1
3 | Define big data. Explain the five V’s of big data. Illustrate sources of big data. | TLO3 | L3 | 2.2.2
4 | What are the challenges of big data? How is a traditional BI environment different from a big data environment? | TLO4 | L3 | 1.4.1
Learning Outcomes:
At the end of the topic the student should be able to:
TLOs | COs | BL | CA Code
1. Explain significance of big data analytics in various business domains. | CO2 | L2 | 1.4
2. Explain the role of data scientist. | CO2 | L2 | 2.2
3. Explain various terminologies used in the big data environment. | CO2 | L2 | 1.4
4. Select best tool/s for data analytics based on the business domain. | CO2 | L3 | 2.2
Lesson Schedule
Class No. - Portion covered per hour
1 What is big data analytics? What big data analytics is not?
2 Classification of analytics, Top challenges facing big data
3 Importance of big data analytics
4 Need of technology to meet big data challenges, Data science: business acumen skills
5 Technology expertise, mathematics expertise
6 Data scientist, terminologies used in big data environments
7 BASE, top analytics tools
Review Questions
Sr. No. | Questions | TLO | BL | PI Code
1 | Why is big data analytics needed? Explain the terminologies used in big data analytics. | TLO1 | L3 | 2.2.2
2 | Explain the knowledge required by a data scientist. Give the CAP theorem used for BDA. | TLO2 and TLO3 | L2 | 2.2.2
3 | What are predictive and prescriptive analytics? Illustrate any three analytics tools used for BDA. | TLO4 | L3 | 2.2.2
Lesson Schedule
Class No. - Portion covered per hour
1. Not Only SQL (NoSQL): Types of NoSQL, Advantages of NoSQL
2. Use of NoSQL in industry, NewSQL
3. Hadoop: features, key advantages, versions, overview of Hadoop ecosystem
4. Hadoop distributions, Hadoop versus SQL, Cloud-based Hadoop solutions.
Review Questions
Sr. No. | Questions | TLO | BL | PI Code
1 | What are the properties and benefits of NoSQL? Illustrate the classification of NoSQL. | TLO1 | L3 | 2.2.2
2 | Who are the vendors of NoSQL? Identify and explain the best parameters used to compare SQL, NoSQL and NewSQL. | TLO2 | L3 | 2.2.2
3 | Justify the need for Hadoop in the context of BDA. Cite the key advantages of Hadoop. How is Hadoop 1.0 different from Hadoop 2.0? | TLO3 | L3 | 2.2.2
Lesson Schedule
Class No. - Portion covered per hour
1. Introduction, Why Hadoop, RDBMS versus Hadoop.
2. Distributed computing challenges: hardware failure, how to process a gigantic store of
data.
3. History of Hadoop, Hadoop overview, use case of Hadoop
4. Hadoop distributors, Hadoop Distributed File System (HDFS), Name node, Data
node, secondary Name node.
5. Anatomy of file read, anatomy of file write, replica placement
6. Processing of data with Hadoop, Managing resources and applications with Hadoop
7. Interaction with Hadoop ecosystem.
Review Questions
Sr. No. | Questions | TLO | BL | PI Code
1 | What are the key considerations behind Hadoop's popularity? List and explain the parameters used to compare Hadoop and RDBMS. | TLO1 & TLO2 | L2 | 1.4.1
2 | Which features of HDFS make it suitable for distributed computing? Identify and explain the components used to build the HDFS architecture. | TLO3 | L3 | 2.2.2
3 | How is MapReduce programming used to process massive amounts of data in parallel? Explain the same in the context of the word count problem. | TLO4 | L3 | 2.2.2
Lesson Schedule
Class No. - Portion covered per hour
1 Introduction, Why MongoDB, Terms used in RDBMS and MongoDB
2 Data types in MongoDB
3 MongoDB query language: basic functions, Arrays, aggregate functions, MapReduce
function
4 JavaScript programming, Cursors in MongoDB, MongoImport and MongoExport.
Review Questions
Sr. No. | Questions | TLO | BL | PI Code
1 | What is MongoDB and why is it needed? How are replication and sharding performed in MongoDB? | TLO1 | L3 | 2.2.2
2 | Compare the terms used in RDBMS and MongoDB. Illustrate CRUD operations in MongoDB. | TLO2 | L3 | 2.2.2
3 | How is the MapReduce framework used in MongoDB? Illustrate the same. | TLO3 | L3 | 2.2.2
4 | How is data flow between MongoDB and a CSV file implemented? Illustrate the same. | TLO4 | L3 | 2.2.2
Lesson Schedule
Class No. - Portion covered per hour
1 Introduction, Apache Cassandra, features of Cassandra
2 Data types, CQLSH, Keyspaces, CRUD operations
3 Introduction to MapReduce, Mapper, Reducer, Combiner.
4 Partitioner, searching, sorting, compression.
Review Questions
Sr. No. | Questions | TLO | BL | PI Code
1 | Explain the notable points and technical features of Apache Cassandra. | TLO1 | L2 | 1.4.1
2 | What is the need for CQL? Illustrate the use of Collections, Import and Export in Apache Cassandra. | TLO2 | L3 | 1.4.1
3 | Illustrate how MapReduce programming is implemented using a mapper and a reducer on a Hadoop cluster. | TLO3 | L3 | 2.2.2
Lesson Schedule
Class No. - Portion covered per hour
1 Introduction, What is Hive, History of Hive and recent releases of Hive
2 Hive integration and work flow, Hive data units
3 Hive architecture, Hive data types, Hive file format, Hive Query Language (HQL): DDL
4 DML, Hive shell, database, tables, Partitions, Bucketing, Views
5 Sub-query: RCFile implementation, SERDE, User defined function
Review Questions
Sr. No. | Questions | TLO | BL | PI Code
1 | What is Hive? How is it used to query structured data on top of Hadoop? | TLO1 | L3 | 2.2.2
2 | What types of data are supported by Hive? Illustrate DDL statements in HQL. | TLO2 | L3 | 2.2.2
3 | Why are partitions required in Hive? Illustrate static and dynamic partitions in Hive. | TLO3 | L3 | 1.4.1
4 | Cite the features of Hive. Illustrate how managed and external tables are created in Hive. | TLO4 | L3 | 2.2.2
Learning Outcomes:
At the end of the topic the student should be able to:
Lesson Schedule
Class No. - Portion covered per hour
1 Introduction, What is PIG, Key features of PIG
2 The anatomy of PIG, PIG philosophy, use case for PIG: ETL processing
3 PIG Latin overview, Data types in PIG, Running PIG, execution modes of PIG
4 HDFS commands, relational operators, eval function
5 Complex data types, piggy bank, user defined function.
Review Questions
Sr. No. | Questions | TLO | BL | PI Code
1 | Why do we need Apache Pig? Which features make Apache Pig popular? How is Apache Pig different from MapReduce? | TLO1 | L3 | 1.4.1
2 | Write a user defined function for Pig to convert a name into uppercase. | TLO2 | L3 | 2.2.2
3 | Explain the different running modes and execution modes of Pig. | TLO3 | L2 | 1.4.1
4 | Write a Pig Latin script to illustrate the use of relational operators using the Filter operator. | TLO4 | L3 | 2.2.2
Course: Big Data Analytics
Code: 20ECAC801
Total Duration: 75 Minutes
Maximum Marks: 40
Note:
1. Answer any two FULL questions
2. Missing data may be suitably assumed with justification.
Q.No. | Questions | Marks | CO | BL | PO | PI Code
1a | Illustrate the types of digital data. Explain the sources of structured data. | 10 | CO1 | L3 | 1 | 1.4.1
1b | What are the properties and benefits of NoSQL? Illustrate the classification of NoSQL. | 10 | CO4 | L3 | 2 | 2.2.2
2a | Why is big data analytics needed? Explain the terminologies used in big data analytics. | 10 | CO2 | L3 | 2 | 2.2.2
2b | Why is integration of structured, semi-structured and unstructured data needed in the context of data generated by Facebook? | 5 | CO1 | L3 | 1 | 1.4.1
2c | Define big data. Explain the three V’s of big data. Illustrate sources of big data. | 5 | CO2 | L2 | 2 | 2.2.2
3a | Who are the vendors of NoSQL? Identify and explain the best parameters used to compare SQL, NoSQL, and NewSQL. | 10 | CO4 | L3 | 2 | 2.2.2
3b | What are predictive and prescriptive analytics? Illustrate any three analytics tools used for BDA. | 10 | CO2 | L3 | 2 | 2.2.2