
Syllabus - PGD - DS - Batch-7 PDF


MANIPAL ACADEMY OF HIGHER EDUCATION, MANIPAL (MAHE)

Centre for Executive Education (CEE)


(Applicable to the candidates admitted to the M Tech programs from July 2019)
Master of Technology in Data Science and Artificial Intelligence
Syllabus

SL No Code Term 1 (5 Months) L T P C


1 DDS 501 Programming for Data Science 2 0 3 3
2 DDS 503 Statistical Techniques for Data Science 2 1 0 3
3 DDS 505 Data Scraping and Wrangling 2 0 3 3
4 DDS 507 Data Analysis and Visualization 2 1 3 4
5 DDS 509 Big Data Technologies 2 0 3 3
6 DDS 511 Machine Learning 2 1 3 4
SL No Code Term 2 (6 Months) L T P C
7 DDS 502 Artificial Intelligence 2 0 3 3
8 DDS 504 Elective 1 2 1 3 4
9 DDS 506 Elective 2 2 0 3 3
10 DDS 510 Mini Project - - - 10
Total credits 40

L – Lecture; T – Tutorial; P – Practical; C – Credits


Faculty will decide on the case studies, assignments and learning activities for each subject.

Electives
SL No Code Elective 1 L T P C
1 DDS 504.1 Financial Services Analytics 2 1 3 4
2 DDS 504.2 Marketing Analytics 2 1 3 4

SL No Code Elective 2 L T P C
1 DDS 506.1 Unstructured Data Analysis 2 0 3 3
2 DDS 506.2 Robotic Process Automation 2 0 3 3

SL No Code Elective 3 L T P C
1 MDA 509.1 HR Analytics 2 1 3 4
2 MDA 509.1 Supply Chain Analytics 2 1 3 4

Term 1

DDS 501 Programming for Data Science

Unit – 1 Starting Python & Basics of Python Language (4 hours)


What is Python? Why Python for Data Science?
Programming Model of Python. Python Installation, Simple Input/Output, Work with Numbers, Basic
Data Types, Variables, Control Structures, if Condition, while Loop, for Loop,
break and continue, Arithmetic & Logical operators.

Unit – 2 Python Core Data Structures (4 hours)


Strings, Lists, Tuples, Dictionaries, Sets, List Comprehensions, Regular expressions. Lambda
Functions – Map, Filter, Reduce
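
A minimal sketch of how list comprehensions, lambda functions, map, filter and reduce relate to one another. The temperature readings and variable names are illustrative assumptions, not part of the syllabus.

```python
from functools import reduce

# Hypothetical sample data: daily temperatures in Celsius.
temps_c = [21.5, 19.0, 25.2, 30.1, 17.8]

# List comprehension: convert every reading to Fahrenheit.
temps_f = [c * 9 / 5 + 32 for c in temps_c]

# map with a lambda: the same conversion, expressed functionally.
temps_f_map = list(map(lambda c: c * 9 / 5 + 32, temps_c))

# filter: keep only the warm days (above 20 degrees Celsius).
warm_days = list(filter(lambda c: c > 20, temps_c))

# reduce: fold the list into a single value (here, the maximum).
hottest = reduce(lambda a, b: a if a > b else b, temps_c)

print(temps_f == temps_f_map)
print(warm_days)
print(hottest)
```

In everyday Python the comprehension form is usually preferred over map and filter with lambdas; reduce is kept for folds that have no built-in equivalent.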

Unit – 3 Functions, Modules and Object Oriented Programming (4 hours)


Introduction to Functions, Function Syntax, Introduction to Modules, Create Modules, Importing
Modules, Introduction to Object oriented concepts

Unit – 4 Exception Handling, FILE Input/output (2 hours)


Exception handling, File operations: The open Function, Reading and Writing to Files
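
A minimal sketch combining the open function with exception handling. The helper name `read_numbers` and the sample file are hypothetical and chosen only for illustration.

```python
import os
import tempfile


def read_numbers(path):
    """Read one number per line; skip lines that are not numeric."""
    numbers = []
    try:
        # The with-statement closes the file even if an error occurs.
        with open(path, "r", encoding="utf-8") as handle:
            for line in handle:
                try:
                    numbers.append(float(line))
                except ValueError:
                    # Malformed line: ignore it and keep reading.
                    continue
    except FileNotFoundError:
        # Missing file: return an empty result instead of crashing.
        return []
    return numbers


# Write a small sample file, then read it back.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "w", encoding="utf-8") as handle:
    handle.write("1.5\nabc\n2.5\n")

print(read_numbers(sample_path))
print(read_numbers(sample_path + ".missing"))
```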

Unit – 5 NumPy (8 hours)


NumPy array creation, NumPy datatypes, NumPy indexing, slicing, Basic Reduction, statistical &
Logical operations, Array shape manipulation, Array sorting, copies and views

Unit – 6 Introduction to Pandas, Operations in Pandas - Part I (4 hours)


Series creation, Operations on Series, DataFrames creation, Operations on Data frames. Basic
Indexing using .loc, .iloc, .ix, Multi Indexing, Boolean Indexing.

Unit – 7 Operations in Pandas - Part II (4 hours)


Grouping of data, Merging and joining data, pivots and reshaping data

References:
1) Python for Data Analysis by Wes McKinney, O’Reilly Publication
2) Python Data Science Handbook by Jake VanderPlas, O’Reilly Publication
3) Python Cookbook by Alex Martelli, Anna Martelli Ravenscroft, and David Ascher, O’Reilly
Publication

DDS 503 Statistical Techniques for Data Science

Unit 1: Introduction to Statistics and Linear Algebra (6 Hours)


• Collection, Categorization, and Presentation of Data
• Measures of Central Tendency such as Mean, median, mode
• Measures of Dispersion such as Range, variance, standard deviation
• Scalars, Vectors, Matrices and Operations on Matrices & Vectors - Addition,
Multiplication, Transpose and Inverse
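
A minimal sketch of the measures of central tendency and dispersion listed above, using Python's standard `statistics` module. The data set is a made-up example.

```python
import statistics

# Hypothetical sample of eight observations.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Measures of central tendency.
mean_value = statistics.mean(data)
median_value = statistics.median(data)
mode_value = statistics.mode(data)

# Measures of dispersion. pvariance and pstdev treat the data as a whole
# population (divide by n); variance and stdev divide by n - 1 instead.
value_range = max(data) - min(data)
population_variance = statistics.pvariance(data)
population_std = statistics.pstdev(data)

print(mean_value, median_value, mode_value)
print(value_range, population_variance, population_std)
```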

Unit 2: Probability (5 Hours)
• Introduction to Probability
• Probability Distributions such as Normal, and Binomial
• Conditional Probability and Bayes Theorem
• Central Limit Theorem
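
A worked example of Bayes' theorem, sketched under assumed numbers: a condition with 1% prevalence, a test with 95% sensitivity and a 5% false positive rate. The function name `posterior` and all figures are hypothetical.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem."""
    # Total probability of a positive test (true positives + false positives).
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence


p = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(round(p, 3))
```

With these assumed figures the posterior is only about 16%, because false positives from the large healthy group outnumber true positives.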

Unit 3: Sampling (3 Hours)


• Introduction to Sampling
• Common Sampling Techniques such as Random Sampling
• Estimation - Sample size and standard error
• Point Estimates, Interval Estimates, Confidence Intervals
• SMOTE – Under-sampling, Over-sampling
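
A minimal sketch of a point estimate, its standard error and a 95% confidence interval for a mean. It uses a normal approximation for simplicity; for a sample this small a t-distribution would normally be more appropriate. The sample values are invented.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical sample of ten measurements.
sample = [52, 48, 50, 55, 47, 53, 49, 51, 50, 54]
n = len(sample)

# Point estimate of the population mean.
sample_mean = mean(sample)

# Standard error of the mean, using the sample standard deviation.
standard_error = stdev(sample) / sqrt(n)

# Two-sided 95% interval: critical value from the standard normal.
z = NormalDist().inv_cdf(0.975)
lower = sample_mean - z * standard_error
upper = sample_mean + z * standard_error

print(sample_mean, standard_error)
print(lower, upper)
```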

Unit 4: Testing of Hypothesis (5 hours)


• Introduction to Hypothesis Testing
• Parametric tests such as t-test, z-test, and ANOVA
• Non-parametric tests such as chi-square, Wilcoxon tests, Kolmogorov-Smirnov (KS)
test, Kruskal-Wallis test

Unit 5: Data Relationship through Correlation (2 Hours)


• Correlation, Pearson, Kendall, Spearman
• Correlation coefficient
• Correlation cautions due to outliers and causation
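
A minimal sketch of the Pearson correlation coefficient and of the outlier caution noted above. The helper `pearson` and the data are illustrative assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# A perfectly linear relationship.
r_clean = pearson(x, y)

# The same data with a single extreme point appended.
r_outlier = pearson(x + [6], y + [-20])

print(r_clean, r_outlier)
```

One extreme point is enough here to flip the sign of the coefficient, which is why a scatter plot should accompany any reported correlation.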

Unit 6: Regression (9 Hours)


• Introduction to Regression – Simple & Multiple Regression
• Estimation, Goodness of fit measures, Diagnostics
• Binary Logistic Regression
• Model validation
• Applications
Hands-on assignments to be conducted using Excel / SciPy

References:
1. Richard I. Levin, David S. Rubin, Statistics for Management, Pearson Education
2. Statistics for Management and Economics by Gerald Keller, Cengage Learning
3. Sampling Techniques by William G Cochran, Wiley and Sons
4. Bayesian Methods for Management and Business by Eugene D. Hahn, Wiley
5. Guide to Programming and Algorithms Using R, Ozgur Ergul, Springer
6. Python Data Science Essentials – PACKT Publication
7. Learning Python, Mark Lutz and David Ascher
8. Introduction to Statistical Learning with Applications in R, Springer

DDS 505 Data Scraping and Wrangling

Unit 1: Data Scraping (16 Hours)


• Introduction to Data Scraping and Wrangling
• Types of Data
• Finding data across sources
• Manual Scraping
• Scraping tables such as Wikipedia
• API-based scraping: Querying Twitter API using tweepy
• Browser-based scraping: Automating browsers, Identifying selectors

• Server-side scraping: scraping tables such as IMDb
• Scraping across pages
• Tools: Python, Selenium

Unit 2: Data Wrangling (4 Hours)


• Data quality detection
• Types of data quality issues

Unit 3: Introduction to Database Management Systems (8 Hours)


• Introduction to database, RDBMS
• RDBMS terminologies
• Concept of keys in RDBMS
• Conceptual Database Design Entity – Relationship Model: Relationship: Degree of
relationship, Cardinality, Participation, Key features of E-R Model
• Relational databases and SQL (Structured Query Language):
• Basic SQL queries, Integrity constraints on tables
• SQL querying to do operations such as identifying nulls, special characters, blank
rows/columns, get unique counts
• SQL Joins, Aggregate functions and GROUP BY, and sub queries.
• GROUP BY CLAUSE along basic aggregations such as SUM, COUNT, AVG
• RANK (), ROWNUM () & DENSE_RANK.
• UNION and UNION ALL and CASE statement
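
A minimal sketch of several of the SQL operations listed above, run against an in-memory SQLite database through Python's standard `sqlite3` module. The `orders` table and its contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
INSERT INTO orders (customer, amount) VALUES
  ('asha', 100.0), ('asha', 50.0), ('ravi', 200.0), ('ravi', NULL);
""")

# Identify rows with missing values.
null_rows = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE amount IS NULL").fetchone()[0]

# GROUP BY with basic aggregates; the aggregates skip NULL amounts.
summary = conn.execute("""
    SELECT customer, COUNT(amount), SUM(amount), AVG(amount)
    FROM orders GROUP BY customer ORDER BY customer
""").fetchall()

# CASE expression to label each order.
labels = conn.execute("""
    SELECT id, CASE WHEN amount IS NULL THEN 'missing'
                    WHEN amount >= 100 THEN 'large'
                    ELSE 'small' END
    FROM orders ORDER BY id
""").fetchall()

# UNION removes duplicate rows; UNION ALL keeps them.
union_count = len(conn.execute(
    "SELECT customer FROM orders UNION SELECT customer FROM orders"
).fetchall())
union_all_count = len(conn.execute(
    "SELECT customer FROM orders UNION ALL SELECT customer FROM orders"
).fetchall())

print(null_rows, summary, labels, union_count, union_all_count)
```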

References:
1) Web Scraping with Python – Ryan Mitchell, O’Reilly Publishers
2) Data Wrangling with Python – Jacqueline Kazil and Katharine Jarmul, O’Reilly Publishers
3) Automated Data Collection with R, Simon Munzert et al, John Wiley & Sons
4) Database System Concepts, Abraham Silberschatz, Henry F. Korth, and S. Sudarshan
5) Fundamentals of Databases – Elmasri and Navathe.

DDS 507 Data Analysis and Visualization

Unit 1: Introduction to Data Science (2 hours):


• Key components in Data Science
• Use cases from different Domains such as Banking, Retail, Telecom or Healthcare
• Data Science life cycle
• The roles in a Data Science stream
• Challenges involved in Data Science work
• Ethics in Data Science

Unit 2: Data Analysis and Story Telling (6 Hours):


• Characteristics of Data – data at rest, data in motion, data of many types (structured,
unstructured, semi-structured)
• Types of Data Analysis – Descriptive, Exploratory, Predictive, Inferential
• Steps in Data Analysis
• Representing and visualizing multiple variables
• Types of plot - such as Scatter plot, Pie chart, Histogram, Boxplot
• Which graphs are most suitable?

Unit 3: Visualization and Communication using Descriptive Analysis (6 Hours):


• Data and Datasets

• Quantitative variables and Qualitative variables
• Data Analysis and using charts, basic and advanced formatting
• Connecting to data and Using Extracts
• Joining and Blending Data
• Building Views
• Visual Analytics; Formatting and dynamic data manipulations
• Data visualization of Numeric data versus Non-numeric data
• Design strategies for information visualization such as Tufte’s design principles

Unit 4: Data cleansing and transformation, Building Dashboards (8 Hours):


• Why is data cleansing important?
• Treating missing values
• Treating outliers and errors
• Data virtualization – a unified view of business entities from multiple sources of data
• Building dashboards and automation
• Adding advanced interactive visualization capabilities – Slicers, Hierarchies, Pivot charts

Unit 5: Dimension Reduction and Visualisation (8 Hours):


• Challenges of high dimensionality
• Principal Component Analysis, Factor Analysis
• Building views, dashboards
• Matplotlib – Plotting tool in Python, Seaborn
• A visualization tool like Tableau or Power BI will be used.

References
1. An Introduction to Data Science, Jeffrey Stanton, Syracuse University
2. Exploratory Data Analysis with R, Roger Peng
3. Analysing Multivariate Data, James M Lattin; J Douglas Carroll; Paul E Green
4. Hands-On Data Science and Python Machine Learning, Frank Kane, Packt books
5. Python Data Science Essentials – PACKT Publication

DDS 509 Big Data Technologies

Unit 1: Motivation for Big Data (8 Hours)


• What Is Big Data?
• Big Data Programming Models: Massively Parallel Processing (MPP) Database Systems -
In-Memory Database Systems - MapReduce Systems - Bulk Synchronous Parallel (BSP) Systems
• Big Data and Transactional Systems, How Much Can We Scale?
• Big Data Technology options – Hadoop, NoSQL, SAP HANA
• Use-Cases for Big Data, Hadoop Concepts

Unit 2: Hadoop Components (8 Hours)


• Hadoop Framework and Architecture
• Overview of all Hadoop Ecosystem components
• Hadoop Distributed File System (HDFS) Architecture
• HDFS Commands
• MapReduce Architecture
• Word Count Implementation using map reduce
• Hadoop 2.0 - YARN, HDFS High Availability
• Precedence of Hadoop Configuration Files

• Preparing the Development Environment
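
A minimal sketch of the word count idea behind MapReduce, simulated in a single Python process rather than on a Hadoop cluster. The function names and input lines are hypothetical; a real Hadoop job would distribute the map and reduce phases across nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


lines = ["big data big ideas", "data beats ideas"]

mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_phase(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())

print(counts)
```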

Unit 3: Understanding HBase (2 Hours)


• HBase, Architecture and role of HBase
• HBase schema design, Basic programming for HBase
• Combining the capabilities of HBase and HDFS
• HBase Case Study

Unit 4: Analyzing Data with Hive and Pig (10 Hours)


• Hive Architecture and Concepts
• Data Definition Language
• Data Manipulation Language
• External Interfaces, Hive Scripts
• Performance, MapReduce Integration
• Creating Partitions
• Hive Case Study
• Scripting with Apache Pig and its use cases

Unit 5: Sqoop, Flume and Kafka (8 Hours)


Sqoop:
• Objectives of Sqoop
• Preview of MySQL
• Sqoop eval command
• Sqoop simple import
• Incremental Imports
• Sqoop Export
Flume:
• Data Ingestion - From non-traditional data sources
• Flume Agent and configuration files
• Ingest data using Flume - exec source and HDFS sink

Kafka:
• Kafka Architecture
• Producers and consumers

Unit 6: Understanding Spark (4 Hours)


• Spark Architecture
• Spark capabilities such as distributed datasets (RDD)
• In-memory caching
• Interactive shell

Unit 7: Spark programming (10 hours)


• Programming for simple batch jobs in PySpark
• Stream processing and machine learning using built-in libraries
• Introduction to Spark SQL
• Spark streaming

References:
1. Pro Apache Hadoop, 2nd Edition, Jason Venner, Sameer Wadkar, and Madhu Siddalingaiah
2. Big Data Analytics with R and Hadoop, Vignesh Prajapati

DDS 511 Machine Learning

Unit 1: Introduction to Machine Learning and Data Science (2 Hours)


• Introduction to Machine Learning
• Broad classification – Supervised vs Un-supervised Learning
• Overview of Regression
• Use cases of Machine Learning

Unit 2: Classification (6 Hours)


• Classification using Decision Trees, Random Forests
• Classification using Nearest Neighbours
• Classification using Naïve Bayes

• Goodness measures such as confusion matrix


• Ensemble Techniques (Bagging, Boosting, Extreme Boosting)
• Applications discussed with case studies
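
A minimal sketch of a confusion matrix for a binary classifier and the goodness measures derived from it. The labels and predictions are invented, and the helper name `confusion_matrix` is hypothetical (it is not the scikit-learn function of the same name).

```python
def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


# Hypothetical ground truth and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_matrix(y_true, y_pred)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the actual positives, how many are found

print(tp, fp, fn, tn)
print(accuracy, precision, recall)
```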

Unit 3: Validation Measures (4 Hours)


• ROC curves- comparison of distribution function/business measures
• Divergence
• Kolmogorov-Smirnov - difference in distribution function
• Gini coefficient/D concordance statistic etc.

Unit 4: Clustering (4 Hours)


• Introduction to Clustering
• K-means, Hierarchical Clustering
• Practical Issues in clustering
• Validation
• Applications discussed with case studies

Unit 5: Recommendation Systems (4 Hours)


• Collaborative filtering such as User-based, Item-based, Matrix Factorization
• Evaluation of Recommenders such as Cumulative gain and discounted cumulative gain
• Applications of recommendation systems
• Applications discussed with case studies
• Association Rules such as Apriori Algorithm
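
A minimal sketch of cumulative gain and discounted cumulative gain (DCG) for evaluating a ranked list of recommendations. The relevance scores are invented, and the normalised variant (NDCG) is included as a common companion measure, which is an addition beyond the topics named above.

```python
from math import log2


def dcg(relevances):
    """Discounted cumulative gain: later ranks are discounted by log2(rank + 1)."""
    return sum(rel / log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical graded relevance of six recommended items, in ranked order.
ranked = [3, 2, 3, 0, 1, 2]

# Cumulative gain ignores position entirely.
cumulative_gain = sum(ranked)

print(cumulative_gain, dcg(ranked), ndcg(ranked))
```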

Unit 6: Customer Analytics (4 Hours)


• Customer Life Cycle
• Segmentation
• Scoring
• Use cases

Unit 7: Time Series (6 hours)


• Introduction to time series with examples
• Trends and Cyclic Variations, normalization of data
• Stationary processes and ARIMA models
• Exponential Smoothing, ARCH/GARCH
• Forecasting Time series data
• Use cases
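
A minimal sketch of simple exponential smoothing, where each smoothed value is a weighted average of the current observation and the previous smoothed value. The series, the smoothing factor `alpha` and the choice to initialise with the first observation are assumptions made for illustration.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_(t-1)."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [series[0]]  # initialise with the first observation
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical demand series.
demand = [10, 12, 11, 15]
smoothed = exponential_smoothing(demand, alpha=0.5)

# The last smoothed value serves as the one-step-ahead forecast.
forecast = smoothed[-1]

print(smoothed, forecast)
```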

References:

1. Machine Learning with R Edition 2, Brett Lantz
2. Data Mining and Business Analytics with R
4. Data Mining for Business Analytics: Concepts, Techniques, and Applications with XLMiner,
5. Python Data Science Essentials – PACKT Publication
6. Hands-On Data Science and Python Machine Learning - PACKT Publication
7. The Analysis of Time Series – an Introduction by Chris Chatfield, Chapman & Hall/CRC
8. Time Series Analysis: Forecast and Control by Box and Jenkins

Term 2

DDS 502 Artificial Intelligence

Unit 1: Introduction to AI (3 Hours)


• Recent Developments using AI: Sophia, AlphaGo and the rebirth of AI;
• Sneak peek into the future;
• Current trends in AI

Unit 2: Applications of AI (3 Hours)


• Enterprise Applications of AI-Industries,
• Consumer Applications - Gaming,
• Home Automation with AI
• Understanding Artificial intelligence,
• Machine learning and Deep learning.
• Use case driven comparison of AI, ML and DL

Unit 3: Programming with TensorFlow (12 Hours)


• Understanding what is TensorFlow
• Need for TensorFlow along with use cases
• Describe TensorFlow and its basic features
• Creating a computation graph
• Variables, Constants and Placeholders
• Evaluate how to create computation graphs in TensorFlow
• List various programming elements and relate their significance in TensorFlow
• Using TensorFlow for machine learning algorithms: Regression and Classification with TF

Unit 4: High-level TensorFlow APIs (3 Hours)


• Understanding the high-level TensorFlow APIs: Estimators, tf.layers and Keras - Overview
• Describe the role and types of high-level TensorFlow APIs - Overview
• Compare different types of high-level TensorFlow APIs - Overview

Unit 5: TensorBoard (3 Hours)


• Understanding TensorBoard
• Need for TensorBoard
• Elaborate what TensorBoard is and how it helps evaluate TensorFlow runtime data

Unit 6: Deep Learning (DL) and Reinforcement Learning (RL) (6 Hours)


• Introduction and Applications of Deep learning
• Image recognition Using Deep Learning
• Introduction to Reinforcement learning - Overview
• Types of Reinforcement learning: Value-based (Q-learning) and policy-based methods -
Overview

References:
1. Artificial Intelligence, A Modern Approach: Stuart J. Russell and Peter Norvig
2. Neural Networks – Satish Kumar
3. Neural Networks and Learning Machines – Simon Haykin
4. Specific papers for deep learning algorithms.

DDS 504.1 Financial Services Analytics

Unit 1: Understanding Financial Services (6 hours)


• Time Preference Rate and Required Rate of Return
• Present Value and Future Value of Money
• Annuity and Growing Annuity
• Perpetuity
• Applications of Time value of money
• Introduction to Financial Statements (balance sheet, Income Statement, Cash flow
Statements)
• Risk Analysis in Capital Budgeting
• Understanding banking assets and liability products
• Basel Norms
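
A minimal sketch of the time value of money formulas covered in this unit: present and future value, annuity, growing annuity and perpetuity. The function names are hypothetical, payments are assumed to fall at the end of each period, and the growing annuity formula assumes the discount rate differs from the growth rate.

```python
def future_value(pv, rate, periods):
    """FV = PV * (1 + r)^n"""
    return pv * (1 + rate) ** periods


def present_value(fv, rate, periods):
    """PV = FV / (1 + r)^n"""
    return fv / (1 + rate) ** periods


def annuity_pv(payment, rate, periods):
    """Present value of a level payment received at the end of each period."""
    return payment * (1 - (1 + rate) ** -periods) / rate


def growing_annuity_pv(payment, rate, growth, periods):
    """Present value when the payment grows at a constant rate (rate != growth)."""
    ratio = (1 + growth) / (1 + rate)
    return payment / (rate - growth) * (1 - ratio ** periods)


def perpetuity_pv(payment, rate):
    """Present value of a level payment received forever."""
    return payment / rate


print(future_value(1000, 0.10, 2))
print(present_value(1210, 0.10, 2))
print(annuity_pv(100, 0.10, 5))
print(growing_annuity_pv(100, 0.10, 0.03, 5))
print(perpetuity_pv(100, 0.05))
```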

Unit 2: Predictive Analytics (6 hours)


• Valuation of Bonds
• Valuation of Shares
• Understanding and analyzing High Frequency Data
• Time Series Analysis on Stock prediction
• Types and sources of risk

Unit 3: Credit Risk Analytics (6 hours)


• Understanding Credit and its functions
• Understanding Scoring models
• Value-at-Risk (VaR)
• Probability of Default Model
• Loss Given Default Model
• Exposure at Default models

Unit 4: Customer Relationship Management (CRM) Analytics (6 hours)


• Customer Acquisition Modelling
• Collection and recovery analytics
• Propensity Model
• Cross selling and Up selling analytics

Unit 5: Operational Risk Analytics (6 Hours)

• Internal and External Fraud Analysis


• Regulatory risk analytics
• Cash flow prediction
• Optimizing Cross channel effectiveness

Case-studies will be used to demonstrate the above areas.

References:
1. Essentials of Business Analytics by Jeffrey D Camm (Author)
2. Damodaran, A., Corporate Finance: Theory and Practice, John Wiley & Sons.
3. Chandra, P., Financial Management, Tata McGraw Hill.
4. Srivastava, Rajiv and Misra, Anil, Financial Management, Oxford University Press.
5. Ross S.A., R.W. Westerfield and J. Jaffe, Corporate Finance, McGraw Hill.
6. Python for Finance, Yves Hilpisch, O’Reilly Publication

DDS 504.2 Marketing Analytics

Unit 1 Introduction to Marketing (4 Hours)


• Marketing overview
• Different Marketing Items
• Different Types of Markets
• Marketing Mix
• 4 P’s of Marketing
• Developing Marketing Strategies
• Competitive Analysis: Porter’s Five Forces Model, PESTEL Analysis
• Product Strategy
• Pricing Strategy
• Place (Distribution Strategy)
• Promotion Strategy
Unit 2 Introduction to Marketing Analytics (4 Hours)
• An overview of Marketing Data and its characteristics
• Preview of different marketing channels data (brief about multiple sources, e.g.
online, social, etc.)
• Integration for modelling purposes.
• Propensity Models
Unit 3 Price & Promotion Analytics (4 Hours)
• Modelling Price Elasticities: Understanding price-demand curves; Identifying and Optimizing
Price Points
• Modelling Promotional Effectiveness: Measuring Effectiveness of Promotional Measures
such as: i. Discount, ii. Feature, iii. Display;
Unit 4 Forecasting (6 Hours)
• Models for forecasting sales/demand: Understanding and capturing trends, seasonalities and
cyclic variations. Smoothing Models, ARIMA Models
Unit 5 Conjoint and Scaling/Mapping Analytics (4 Hours)
• Conjoint analysis for market entry, and new product development.
• Positioning Brands/Companies using Multi-Dimensional Scaling, and Perceptual Mappings.
Unit 6 Campaign Analytics (4 Hours)
• Measuring Effectiveness of Advertising: TV, Online, Direct Marketing (Events, Sampling)
• Modelling, and ROI (decomposing and optimization).
Unit 7 Digital Marketing Analytics (4 Hours)
• Introduction to Digital Marketing: Channels (e.g. Social Media) Measurement Metrics.
• Attribution Modelling.

Case-studies will be used to demonstrate the above areas.

References:
1. Marketing Models (Kotler, Lilien, Moorthy)
2. Measuring Marketing: 101 key metrics every marketer needs (Davis)

3. Marketing Analytics, Wayne L Winston, Wiley

DDS 506.1 Unstructured Data Analysis

Unit 1: Introduction to Unstructured Data Analysis (8 Hours)


• Introduction to unstructured data,
• Differences in structured and unstructured data,
• Challenges posed due to lack of structure.
• Unstructured data encountered in various applications such as text, speech, multimedia (rich),
web and social media data
• Feature Extraction : text features, speech features, multimedia features, features in web and
social media
o Document Term Matrix, Term Frequency, Inverse Document Frequency
o Count Vectorizer, TF-IDF Vectorizer, Hashing Vectorizer
o Text Classification Techniques using Vectorizers

• Using Bayes’ algorithm for text classification


• Using ‘Parts of speech’ to provide context to a statement
• Word embedding

Unit 2: Sentiment Analysis and Topic Modelling (12 Hours)


• Using VADER algorithm for calculating Sentiment polarity
• LDA Techniques
• Classification of words using K-Means clustering
• Document Classification using Cosine Similarity
• Customer Segmentation and Profiling
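
A minimal sketch of document classification using cosine similarity over raw term counts. The reference texts, category labels and helper names are hypothetical; a practical system would typically use TF-IDF weights and far more training text.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = sqrt(sum(count * count for count in a.values()))
    norm_b = sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# One hypothetical reference document per category.
references = {
    "sports": "the team won the match with a late goal",
    "finance": "the bank raised interest rates on loans",
}


def classify(document):
    """Assign the category whose reference text is most similar."""
    return max(references,
               key=lambda label: cosine_similarity(document, references[label]))


print(classify("home team scored a goal to win the match"))
print(classify("central bank may cut interest rates"))
```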

Unit 3: Introduction to NoSQL Databases (8 Hours)


• Understanding storage architecture
• Column-oriented databases
• Understanding key-value stores
• Performing operations in a NoSQL database such as MongoDB
• Updating and deleting data
• Querying data
• Understanding Consistency, Partition tolerance, Availability
• Understanding indexing and aggregation

Unit 4: Audio and Video Classification (2 Hours)


• Introduction to Audio Data Classification
o Acoustic parameters from audio samples
o Classification & categorisation of audio samples (derive gender, age, singing
capability)
• Introduction to Video Data Classification
o Classification and Categorisation of YouTube videos & Analysing the feedback
comments (Political, Technical, Entertainment, etc.)

References:
1. Tan, Steinbach and Kumar, “Introduction to Data Mining”
2. Camastra and Vinciarelli “Machine Learning for Audio, Video and Image Analysis”
3. MongoDB, The Definitive Guide, O’Reilly
4. Python text processing, Jacob Perkins, PACKT publications

DDS 506.2 Robotic Process Automation

UNIT 1: Introduction to RPA (6 Hours)


• What is RPA
• Typical Benefits of RPA
• RPA Concepts and Implementation Approach
• Natural language processing and RPA
• How Robotic Process Automation works with Repetitive tasks
• RPA Solution Architecture Patterns
• Data Handling in RPA

UNIT 2: RPA Functionalities: (using open source / proprietary RPA tool) (12 Hours)
• Features and Benefits
• Using Task Editor
• Types of Variables
• Recording an Automation Task
• Recording, Editing and Running Tasks
• Creating an Automation Task
• Recording Web Actions with Web Recorder
• Extracting Data from Websites, Web Data, Pattern-Based Data, Table Data
• Standard Recorder, Object Recorder
• Viewing and Setting General Properties
• Setting up Hotkeys for a Task and security Features
• Scheduling Tasks to Run, Adding Triggers to a Task and Run Remotely

UNIT 3: Implementation of Functionalities (12 Hours)


• Automation - Email Automation, FTP Automation and PDF Integration,
• Web Recorder with Database Automation
• Using MetaBots, Web Recorder, Smart Recorder

References:
1. Learning Robotic Process Automation: Create Software robots and automate business processes
with the leading RPA tool – UiPath, by Alok Mani Tripathi

2. Robotic Process Automation Tools, Process Automation and their benefits: Understanding RPA
and Intelligent Automation, by Srikanth Merianda (Author), Kiwa K (Editor)

DDS 510 Project / Internship


(Mini Project)

Students can undertake a project in a company in the second term and submit a report based on the
project. Students will have project milestones, which they are required to submit to the
academy at regular intervals as notified by the Director/Head. The project will have a midterm review
and a final review (Viva Voce), during which students are expected to present their work to the
review panel and submit a report on which they are evaluated.
