Certificate in Big Data Analytics For Business and Management
AND
University of California, Riverside, Extension
Course Contents
Certificate in
Big Data Analytics
for Business and Management
(2017-2018)
Big Data Program
Table of Contents
About Program
Program Objectives
Subject wise details
Introductory Business Statistics
Data Mining and Data Analytics
Module 2.1: Machine Learning Algorithms
Module 2.2: Hadoop and Kafka Eco System; Processing streaming data and analysis
Module 2.3: NoSQL and Graph Databases
Virtual Machine for course participants
Business Analytics Capstone (Python Oriented)
Web Analytics
Students Exercises/Projects
About Program
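The Certificate Program in Big Data and Data Analytics covers 4+1 complementary subjects. The subjects are as
follows:
1. Introductory Business Statistics
2. Data Mining and Data Analytics
3. Business Analytics Capstone (Python Oriented)
4. Web Analytics 8
5. Students Exercises/Projects --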
The subject of Data Mining and Data Analytics is, in turn, sub-divided into three distinct modules with a
common theme of analytics. Detailed content under each subject and module follows. Throughout the
program we place special and continued emphasis on students performing exercises. Details
about these are listed subsequently.
** The number of hours is indicative. Because the focus is on students executing projects, actual hours generally exceed those specified.
Program Objectives
Applications of Big Data transcend disciplines. Predictive analytics pervades fields as diverse
as oil and gas, marketing and sales, sports, molecular biology, drug design, waste management and
finance, and the list goes on. Smart cities, for example, are a melting pot where a variety of big data
technologies mesh with one another to transform a city into a semi-intelligent being. In marketing and
sales, Big Data is fast emerging as a potent tool for gaining deeper insights into customer
behaviour, and thereby as a strong driver of innovation. In manufacturing, operations
managers are employing advanced analytics on historical process data to identify patterns and
relationships among discrete process steps and inputs, and then optimize the factors that prove to have
the greatest effect on yield.
Broadly, the course has two parts: an analytics part and a technology part. The analytics
part is about learning machine learning algorithms and implementing them; the technology part is
about learning to work with the layered Hadoop and Apache Kafka systems and developing skills in NoSQL
databases. At the end of this course, given a large dataset from any domain, a participant should:
a. Be able to clean, transform and visualize the dataset to gain deeper insights and make it ready for
analysis
b. Be able to select a subset of appropriate machine learning algorithms that could be applied to get the
desired predictive results
c. Gain sufficient proficiency in tools necessary to implement algorithms
d. Put to use relevant tools and techniques to get a reasonable predictive accuracy
Further:
e. Be able to install, set up, configure and experiment with a complete Hadoop and
Kafka ecosystem independently
f. Be able to install and configure a variety of NoSQL databases, be sufficiently familiar with them,
and decide which one to use, when and how
Objectives (e) and (f) are important because they give students confidence in handling and
experimenting with open-source technologies on their own.
This course is project-oriented. All tools, data and platforms necessary for learning data analytics,
including the Hadoop ecosystem and Kafka streaming technologies, are provided to the participants in advance.
There is a heavy emphasis on open-source technologies that are used almost universally in industry.
Each participant, at the beginning of the course, receives a Virtual Machine (VM) fully equipped with all
the software platforms, tools, packages and data to work on. Assembling such a VM independently
is also an important part of the training, so that students can work at ease with the open-source
technologies that are central to analytics. We make the whole process simple and stress-free.
The Virtual Machine is described more fully below.
We have experience with several industrial projects. An e-book illustrating the projects executed by us
can be downloaded from here. Students execute these and other projects as part of their weekly exercises,
implementing the techniques they have learnt.
Subject wise details
1. Introductory Business Statistics
Data mining is intimately intertwined with statistics, and knowledge of basic statistics is essential for a
successful analyst. Descriptive statistics are invariably used to explore data. Concepts of inferential
statistics are used in comparing machine learning models. In this subject, we refresh and learn
statistical fundamentals and essential inferential statistics. The concepts learnt here are reinforced when we
use them in subsequent subjects throughout the program. In addition, as and when needed,
we cover further statistical concepts as they arise under different subjects.
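As a small illustration of how inferential statistics enters model comparison, the sketch below summarizes two models' accuracy scores and computes a paired t-statistic over the same cross-validation folds. The scores are made-up numbers and the helper name `paired_t_statistic` is hypothetical; only Python's standard `statistics` module is assumed.

```python
import math
import statistics

# Hypothetical accuracy scores of two models on the same five folds (made-up data).
model_a = [0.81, 0.79, 0.84, 0.80, 0.82]
model_b = [0.78, 0.77, 0.80, 0.79, 0.78]


def paired_t_statistic(xs, ys):
    """Return (t, degrees of freedom) for a paired comparison of two samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 each")
    diffs = [x - y for x, y in zip(xs, ys)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Descriptive statistics: explore each set of scores first.
for name, scores in (("A", model_a), ("B", model_b)):
    print(name, "mean:", round(statistics.mean(scores), 4),
          "stdev:", round(statistics.stdev(scores), 4))

# Inferential statistics: is the per-fold difference larger than fold noise?
t_stat, dof = paired_t_statistic(model_a, model_b)
print("paired t:", round(t_stat, 3), "with", dof, "degrees of freedom")
```

The resulting statistic would then be compared against a critical value of the t distribution for the given degrees of freedom (about 2.78 for a two-sided 5% test with 4 degrees of freedom).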
Total 15
Reference Book
1. Essentials of Statistics for Business and Economics by David R Anderson, Dennis J Sweeney and
Thomas A Williams, Cengage Learning.
2. Statistics for Management by Richard Levin and Sanjay Rastogi; Pearson Publications
2. Data Mining and Data Analytics
Subject Objectives
• Generate familiarity with Big Data, Data Visualization and Data Mining methods: In
generating this familiarity there is special emphasis on conceptual understanding of
techniques rather than on mathematics. Analytics is a creative process and students are
encouraged to be creative.
• Develop skills to set up predictive models across various types of disparate datasets.
This is intended to bring home the point that predictive analytics offers a generic set of
tools that can be applied to different types of datasets across an intersecting set of
disciplines.
• Think differently: Expose students through projects as to how novel ways of applying
Big Data technologies are changing business models.
Brief about Subject Contents
This subject is divided into three distinct modules. Module 1 is about Machine Learning
Algorithms; in this module we use a variety of tools besides R. Module 2 is about the Hadoop and
Kafka ecosystem: we learn to work on Hadoop and its layers, perform data extraction and pipe
the data into an analytics engine. Analyzing streaming data is becoming a major subject in its own right,
and here we experiment with Apache Kafka and related technologies. Module 3 relates to
NoSQL and graph databases. The new millennium and the explosion of web content have
marked a new era for database management systems. A whole generation of new databases has
emerged, all categorized under the name of NoSQL databases, with a focus on "task-oriented"
database management: selecting the right tool for the job depending on the job's
characteristics, nature and requirements. We cover, in depth, some frequently used NoSQL databases.
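To make the "right tool for the job" idea concrete, the sketch below represents the same made-up customer data in three shapes that loosely resemble key-value, document and graph stores. It uses plain Python structures only; it does not call Redis, MongoDB or Neo4j, and every name and value in it is hypothetical.

```python
# Key-value shape (Redis-like): opaque values looked up by one exact key.
key_value = {
    "customer:42:name": "Asha",
    "customer:42:city": "Riverside",
}

# Document shape (MongoDB-like): one nested record per entity.
document = {
    "_id": 42,
    "name": "Asha",
    "city": "Riverside",
    "orders": [{"sku": "B-100", "qty": 2}, {"sku": "C-200", "qty": 1}],
}

# Graph shape (Neo4j-like): nodes plus typed relationships between them.
nodes = {
    42: {"label": "Customer", "name": "Asha"},
    "B-100": {"label": "Product"},
    "C-200": {"label": "Product"},
}
edges = [(42, "BOUGHT", "B-100"), (42, "BOUGHT", "C-200")]


def neighbours(node_id, relation):
    """Return the ids reachable from node_id over edges of the given type."""
    return [dst for src, rel, dst in edges if src == node_id and rel == relation]


# Each shape makes a different question cheap to ask.
city = key_value["customer:42:city"]                             # exact-key lookup
total_qty = sum(order["qty"] for order in document["orders"])    # whole-record read
bought = neighbours(42, "BOUGHT")                                # relationship traversal
print(city, total_qty, bought)
```

The point of the comparison is the access pattern: exact-key lookups, whole-record reads and relationship traversals each favour a different data model.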
Pedagogy
We strongly believe that a course in data analytics must be practice-based rather than theory-based.
We also believe that a practice-based course requires constant real-time interaction with the teacher
during lecture hours. As this is an online distance course, the pedagogy is as
follows. First the algorithm (or theory) is explained conceptually, without getting into
mathematics; then a project is undertaken to implement the technique. Datasets for
implementation are made available in advance, along with a copy of the code (or hints on it) that we
need to execute. The code is numbered and copiously commented so that, long after the lecture
has finished, students can go back through the code and comments and refresh their knowledge.
During the lecture, we execute this code line by line (or prompt students to fill in the gaps) and
explain the steps. Each student executes the same code on his or her own laptop, so
results are available immediately both to us and to the students. In short, the teacher
and the students work on their respective laptops simultaneously; students solve their
problems and ask questions as they arise. The whole experience is as if everyone were sitting
in a laboratory and working together.
Module details
S No    Subject***    Projects/Datasets for projects**    Session hours*
Total 54
* The number of session hours is indicative. Because the focus is on students executing projects, actual hours generally exceed those
specified.
** Datasets other than those mentioned here may also be introduced during classes for better clarity. Datasets needed for
Kaggle projects, although freely available, must be downloaded from the Kaggle site, as the site requires.
*** The teaching sequence may change somewhat depending on feedback from students.
Module 2.2: Hadoop and Kafka Eco System; Processing streaming data and analysis
5    SparkR: Data Extraction with SQL; Executing ML algorithms    Airline on-time data for all flights departing NYC in 2013 (R package nycflights13)    5
Total 29.5
* The number of session hours is indicative. Because the focus is on students executing projects, actual hours generally exceed those
specified.
** Datasets other than those mentioned here may also be introduced during classes for better clarity. Datasets needed for
Kaggle projects, although freely available, must be downloaded from the Kaggle site, as the site requires.
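The "extract with SQL, then analyze" step named in this module can be tried out without a cluster. The sketch below is not SparkR: it uses Python's built-in `sqlite3` module on a tiny in-memory table. The column names (`carrier`, `origin`, `dep_delay`) loosely follow the nycflights13 flights data, but the rows are invented for illustration.

```python
import sqlite3

# Made-up rows; a missing dep_delay stands for a flight that never departed.
rows = [
    ("UA", "EWR", 2.0),
    ("UA", "LGA", 4.0),
    ("AA", "JFK", -3.0),
    ("AA", "LGA", 9.0),
    ("DL", "JFK", None),
    ("DL", "LGA", 1.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (carrier TEXT, origin TEXT, dep_delay REAL)")
conn.executemany("INSERT INTO flights VALUES (?, ?, ?)", rows)

# Extraction: keep only flights with a recorded delay and aggregate per carrier.
query = """
    SELECT carrier, COUNT(dep_delay) AS n_flights, AVG(dep_delay) AS mean_delay
    FROM flights
    WHERE dep_delay IS NOT NULL
    GROUP BY carrier
    ORDER BY carrier
"""
summary = conn.execute(query).fetchall()
conn.close()

# The aggregated rows are what would be handed on to a modelling step.
for carrier, n_flights, mean_delay in summary:
    print(carrier, n_flights, mean_delay)
```

The same pattern (filter, group, aggregate, then pass a compact result to the analysis code) carries over to SQL engines that run on a cluster.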
Module 2.3: NoSQL and Graph Databases
Total 13.5
* The number of session hours is indicative. Because the focus is on students executing projects, actual hours generally exceed those
specified.
Virtual Machine for course participants
At the commencement of the course, each participant is given a virtual machine (VM) that can be
installed on a Windows, Mac or Linux laptop or desktop with 4GB of RAM.
The VM contains all the software tools that the participant will work with. It
also contains plenty of data to experiment with, along with reading materials. All software
installed on the VM is fully licensed. The VM makes it easy for participants to practice
weekly exercises at home or at the workplace. Applications installed on the VM are as follows:
• R and Python: R (with more than 200 packages pre-loaded); RStudio Server; Vowpal
Wabbit (both as an R package and as a binary).
• Hadoop ecosystem: Hadoop; YARN resource manager; Hive/HiveServer2; Pig; Apache
SparkR; Mahout; HBase; Hue; Apache Drill and Apache Phoenix
• Apache Kafka and Apache Samza
• Visual Frameworks: H2O; KNIME; Orange; Gephi (for social network analyses)
• NoSQL Databases: Redis, MongoDB, HBase and Neo4j
Besides this virtual machine, we maintain a separate Hadoop cluster of ten
machines with Cloudera server installed (with around 120GB RAM). This larger cluster lets
participants work remotely in groups.
Reference Materials:
Reading material for each module is placed on the e-learning site; study material is also sent by
mail.
3. Business Analytics Capstone (Python Oriented)
Introduction
Python is fast emerging as a preferred data science tool for many analysts. It is often praised for its easy-to-
understand syntax. Like R, Python is open source, with several highly developed Integrated Development
Environments (IDEs) that make learning Python fun. We use Anaconda, a distribution that comes with a set of powerful
packages as well as two well-known IDEs. Our approach here is to use Python for data cleaning and
transformation, visualization, and model development. Participants will find a great deal of similarity between data
manipulation in R and in Python, and this similarity makes learning Python easier. This subject builds upon
the modeling techniques learnt earlier.
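A minimal sketch of the cleaning and transformation steps mentioned above is given here. In class these steps would normally be done with pandas; this version uses only the Python standard library so that it runs anywhere. The records are made up in the spirit of the UCI Adult dataset, where "?" marks a missing value, and the helper name `to_int` is hypothetical.

```python
import csv
import io
import statistics

# Made-up records; "?" marks a missing value, as in the UCI Adult data.
raw = """age,workclass,hours_per_week
39,State-gov,40
?,Private,50
28,?,40
52,Private,45
"""


def to_int(value):
    """Parse an integer, treating the "?" placeholder as missing (None)."""
    return None if value == "?" else int(value)


records = list(csv.DictReader(io.StringIO(raw)))

# Cleaning: convert types and turn "?" into None.
cleaned = [
    {
        "age": to_int(r["age"]),
        "workclass": None if r["workclass"] == "?" else r["workclass"],
        "hours_per_week": to_int(r["hours_per_week"]),
    }
    for r in records
]

# Imputation: fill missing ages with the median of the observed ages,
# and give missing categories an explicit "Unknown" level.
median_age = statistics.median(r["age"] for r in cleaned if r["age"] is not None)
for r in cleaned:
    if r["age"] is None:
        r["age"] = median_age
    if r["workclass"] is None:
        r["workclass"] = "Unknown"

# Transformation: one-hot encode the categorical column for modelling.
categories = sorted({r["workclass"] for r in cleaned})
for r in cleaned:
    for cat in categories:
        r["workclass_" + cat] = int(r["workclass"] == cat)

for r in cleaned:
    print(r)
```

Median imputation and an explicit "Unknown" level are only one possible choice; dropping incomplete rows or imputing by group are equally common alternatives.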
Assignments
There will be hands-on homework and project assignments to give students an opportunity to apply
what they learn in class.
2    Exploring data with pandas: Quick Start    UCI Repo: Adult Dataset    2.5
8    Logistic Regression (along with Dimensionality Reduction, PCA)    MNIST Digits dataset    2.5
Total 20
* The number of session hours is indicative. Because the focus is on students executing projects, actual hours generally exceed those
specified.
** Datasets other than those mentioned here may also be introduced during classes for better clarity. Datasets needed for
Kaggle projects, although freely available, must be downloaded from the Kaggle site, as the site requires.
*** The teaching sequence may change somewhat depending on feedback from students.
Python Resources
1. Online book: Automate the Boring Stuff with Python. A great introduction to Python as a programming language, with links
to worked-out examples and videos: https://automatetheboringstuff.com
2. The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani and
Jerome Friedman; https://web.stanford.edu/~hastie/ElemStatLearn/
3. Python.org has a lot of useful information, as do the home pages for numpy and scipy, which are introduced in the
notebooks discussed in class (numpy: https://github.com/numpy/numpy and scipy: https://www.scipy.org).
4. Web Analytics
Introduction
Successful business brands today require a well-balanced blend of art and science. This course introduces
students to the science of web analytics, while casting a keen eye toward the artful use of numbers found in the
digital space. The goal is to provide marketers with the foundation needed to apply data analytics to real-world
challenges they confront daily in their professional lives. Students will learn to identify the web analytics tools
right for their specific needs, understand valid and reliable ways to collect, analyze, and visualize data from the
web, and utilize data in decision making for their agencies, organizations, or clients.
Objectives:
• Gain an understanding of the motivations behind data collection and analysis methods used by
business professionals.
• Learn to evaluate and choose appropriate web analytics tools and techniques
• Understand frameworks and approaches to measuring consumers’ digital actions.
• Gain an understanding of a step-by-step approach to planning, collecting, analyzing, and
reporting data
• Utilize tools to collect data using today’s most important online techniques: performing bulk
downloads, tapping APIs, and scraping webpages
• Understand business analytics practices in the digital world
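Of the collection techniques listed above, scraping webpages is the easiest to sketch in a few lines. The example below extracts link targets from a made-up HTML snippet using Python's standard `html.parser` module; the class name `LinkCollector` and the page content are hypothetical, and in practice the HTML would come from an HTTP response rather than a string.

```python
from html.parser import HTMLParser

# A made-up page standing in for a downloaded HTML document.
PAGE = """
<html><body>
  <a href="/products">Products</a>
  <a href="https://example.com/blog">Blog</a>
  <p>No link here</p>
  <a>Anchor without href</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


collector = LinkCollector()
collector.feed(PAGE)
collector.close()
links = collector.links
print(links)
```

Before scraping a live site, it is worth checking whether it offers a bulk download or an API, since those are usually more stable than parsing page markup.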
Pedagogy:
• Lectures
• Live project
• Hands-on sessions
Session Plan:
The course will consist of the following three broad modules
Reference Book
Eric Peterson, Web Analytics Demystified, 2004 (available for free download from Web Analytics
Demystified; a link is also available on the e-learning site).
5. Students Exercises/Projects
Introduction
The ultimate beneficiaries of this program are the students. Our experience shows that students learn faster if they
attempt exercises, make mistakes and learn from them. Students are, therefore, expected to undertake exercises
and projects.
Exercises serve another purpose. Through them, students learn some important topics that cannot be
covered in class. We give exercises with sufficient hints to attempt them.
There is also a third advantage. The more exercises students perform, the faster a teacher can move and
the more advanced concepts can be covered.
A list of projects, topic-wise, is given below. To assist students, for each project we provide sufficient steps and code
on our e-learning site. These steps and code are quite detailed, and students should be able to execute
the projects on their own by following the listed steps (or at times by stealing a glance at our code).
Students will be assessed based upon their performance in Exercises and Projects.
10    Feature plotting    Feature plotting of Credit Card Fraud dataset