Mtech-Syllabus-Data Science - Sem1
Curriculum
Handbook for
M.Tech – Data
Science
SEMESTER I
Semester: I Year: 2019-2020
Pre-requisites:
Database Management Systems
Good programming skills
Course Outcomes:
Students will be able to
CO’s Course Learning Outcomes BL
CO1 Describe the need for managing/storing data and identify the value and relative importance of data management. L2
CO2 Describe fundamentals of data management techniques suitable for enterprise applications. L2
CO3 Apply data management solutions for Internet applications. L3
CO4 Describe various data analysis techniques in the Internet context. L2
Teaching Methodology:
Blackboard teaching and PPT
Programming Assignment
Assessment Methods
Open Book Test for 10 Marks.
Assignment evaluation for 10 Marks on the basis of rubrics.
Three internal tests, 30 Marks each, will be conducted and the average of the best two will be taken.
A final examination of 100 Marks will be conducted and evaluated for 50 Marks.
Unit – I 10 Hrs
Introduction to Data Science and Class Logistics/Overview, Statistical Inference and Exploratory Data Analysis,
Principles of Data Management, SQL for Data Science: SQL Basics, SQL Joins and aggregates, Grouping and query
evaluation, SQL Sub-queries, Key Principles of RDBMS
Unit – II 10 Hrs
Data Models, Data Warehousing, OLAP, Data Storage and Indexing, Query Optimization and Cost Estimation,
Datalog, E/R Diagrams and Constraints, Design Theory, BCNF
Unit – III 8 Hrs
Data Management Solutions for Enterprise Applications: Introduction to Transactions, Transaction
Implementations, Transaction Model, Database Concurrency Control Protocols, Transaction Failures and Recovery,
Database Recovery Protocols.
Unit – IV 12 Hrs
Parallel Databases: Introduction to NoSQL databases, Apache Cassandra, MongoDB, Apache Hive
(Text Book 3: Chapters 1, 2, 5)
Unit – V 12 Hrs
Data Management Solution for Internet Applications: Google's Application Stack: Chubby Lock Service, BigTable
Data Store, and Google File System; Yahoo's key-value store: PNUTS; Amazon's key-value store: Dynamo;
Text Books:
1. Database Systems: The Complete Book, by Hector Garcia-Molina, Jennifer Widom, and Jeffrey Ullman. Second edition.
2. Fundamentals of Database Systems by Elmasri and Navathe
3. Seven NoSQL Databases in a Week: Get up and running with the fundamentals, By Xun (Brian) Wu,
Sudarshan Kadambi, Devram Kandhare, Aaron Ploetz, Packt Publishers
Reference Books/resources:
1. Database management systems by Raghu Ramakrishnan and Johannes Gehrke.
2. Foundations of database systems by Abiteboul, Hull and Vianu
3. “Transactional Information Systems” by Gerhard Weikum and Gottfried Vossen, Morgan Kaufmann.
4. Programming Hive: Data Warehouse and Query Language for Hadoop By Edward Capriolo, Dean
Wampler, Jason Rutherglen, O’Reilly
5. https://ai.google/research/pubs/pub27897
6. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber, “Bigtable: A Distributed Storage System for Structured Data”, Google, Inc., OSDI 2006.
7. Brian F. Cooper et al., “PNUTS: Yahoo!'s hosted data serving platform”, Proceedings of the VLDB Endowment, Volume 1, Issue 2, August 2008, Pages 1277-1288.
8. Giuseppe DeCandia et al., “Dynamo: Amazon’s Highly Available Key-value Store”, Proceedings of SOSP '07, the twenty-first ACM SIGOPS Symposium on Operating Systems Principles, Stevenson, Washington, USA, October 14-17, 2007, Pages 205-220.
Semester: I Year: 2019-2020
Teaching Methodology:
Text books:
1. Miller & Freund’s Probability and Statistics for Engineers, Ninth Edition, Richard A. Johnson, Pearson.
2. Devore. J.L., “Probability and Statistics for Engineering and the Sciences”, Cengage Learning, New
Delhi, 8th Edition, 2012.
Reference books:
1. Walpole. R.E., Myers. R.H., Myers. S.L. and Ye. K., “Probability and Statistics for Engineers and
Scientists”, Pearson Education, Asia, 8th Edition, 2007.
2. Ross, S.M., “Introduction to Probability and Statistics for Engineers and Scientists”, 3rd Edition,
Elsevier, 2004.
3. Spiegel. M.R., Schiller. J. and Srinivasan. R.A., “Schaum’s Outline of Theory and Problems of
Probability and Statistics”, Tata McGraw Hill Edition, 2004.
4. Griffiths, Dawn. Head First Statistics. O’Reilly Media, Inc., 2008.
Semester: I Year: 2019-2020
Prerequisite:
Linear Algebra, Probability & Statistics, Calculus, Data Mining
Any programming language C++, Python.
Course Outcomes:
Students will be able to
CO’s Course Learning Outcomes BL
CO1 Describe the basic underlying machine learning concepts. L2
CO2 Analyze a range of machine learning algorithms along with their strengths and weaknesses. L4
CO3 Apply appropriate machine learning techniques to solve problems of moderate complexity. L3
CO4 Implement ensemble methods to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. L3
Teaching Methodology:
Black board teaching / Power Point presentations
Executable Codes/ Live Demonstration
Programming Assignment
Assessment Methods:
Online certification from NPTEL/Coursera.
Programming Assignment (10 Marks), evaluated on the basis of rubrics.
Three internal tests, 30 Marks each, will be conducted and the average of the best two will be taken.
A final examination of 100 Marks will be conducted and evaluated for 50 Marks.
Unit – I 8 Hrs
Concept Learning: Learning problems, Designing a learning system, perspectives and issues in Machine Learning.
Concept Learning Task, Concept Learning as search, Find S, Version space and Candidate Elimination Algorithm.
(TextBook-1)
Decision Tree Learning: Introduction, Decision tree representation, Appropriate problems for Decision Tree Learning,
The Basic Decision Tree Learning Algorithm, Hypothesis Space Search in Decision Tree Learning, Inductive Bias in
Decision Tree Learning, Issues in Decision Tree Learning (TextBook-1)
Unit – II 9 Hrs
Feature Engineering for Machine Learning: Machine Learning Pipeline, Binarization, Quantization/Binning, Log
Transformation, Feature Scaling/Normalization, Interaction features, and feature selection
Text Data: Flattening, Filtering and chunking: Bag-of-X: Turning Natural Text into Flat Vectors, Filtering for
cleaner features, Atoms of Meaning: From words to n-Grams to Phrases. (TextBook3)
Unit – III 10 Hrs
Categorical variables: Encoding categorical variables, dealing with large categorical variables: feature hashing, Bin
counting
Dimensionality reduction: Intuition, Derivation, PCA in Action, Whitening and ZCA, Considerations and limitations
of PCA, Use cases (TextBook3)
Unit – IV 6 Hrs
Bayesian Learning: Bayes theorem – An Example; Bayes theorem and concept learning: Brute-Force Bayes Concept
Learning, MAP Hypotheses and Consistent Learners; maximum likelihood and least-squared error hypotheses; Bayes
optimal classifier; Gibbs algorithm, naive Bayes classifier; Bayesian belief networks – Conditional Independence,
Representation, Inference, Learning Bayesian Belief Networks.
Cluster Analysis: Basic concepts and algorithms: Overview, K-Means, Agglomerative Hierarchical clustering,
DBSCAN. (TextBook2)
Unit – V 6 Hrs
Ensemble Methods: Rationale for ensemble method, methods for constructing an Ensemble classifier, Bias-Variance
decomposition, Bagging, Boosting, Random forests, Empirical comparison among Ensemble methods. (TextBook2)
Text Books:
1. Tom M. Mitchell, “Machine Learning”, McGraw-Hill Education (Indian Edition), 2013.
2. Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Pearson Education, 2007.
3. Amanda Casari, Alice Zheng, “Feature Engineering for Machine Learning”, O’Reilly, 2018.
Reference Books/resources:
1. https://nptel.ac.in/courses/106106139/
2. Andrew Ng’s online course
Pre-requisites:
Graduate Mathematics.
Basic understanding of Probability and Statistics.
Ability to comprehend relational and unstructured datasets.
Course Outcomes:
Students will be able to:
UNIT – I
Introduction to Exploratory data analysis, and distributions 8hrs
Creating a Data Frame, Getting Information About a Data Structure, Adding a Column to a Data Frame, Deleting
a Column from a Data Frame, Renaming Columns in a Data Frame, Reordering Columns in a Data Frame,
Getting a Subset of a Data Frame, Changing the Order of Factor Levels, Changing the Order of Factor Levels
Based on Data Values, Changing the Names of Factor Levels, Removing Unused Levels from a Factor,
Changing the Names of Items in a Character Vector, Recoding a Categorical Variable to Another Categorical
Variable, Recoding a Continuous Variable to a Categorical Variable, Transforming Variables, Transforming
Variables by Group, Summarizing Data by Groups, Summarizing Data with Standard Errors and Confidence
Intervals, Converting Data from Wide to Long, Converting Data from Long to Wide, Converting a Time Series
Object to Times and Values.
UNIT – II
Probability mass function, Cumulative distributions, and modeling distributions 8hrs
Making a Basic Histogram, Making Multiple Histograms from Grouped Data, Making a Density Curve, Making
Multiple Density Curves from Grouped Data, Making a Frequency Polygon, Making a Basic Box Plot, Adding
Notches to a Box Plot, Adding Means to a Box Plot, Making a Violin Plot, Making a Dot Plot, Making Multiple
Dot Plots for Grouped Data, Making a Density Plot of Two-Dimensional Data
UNIT – III
Miscellaneous Graphs 8hrs
Making a Correlation Matrix, Plotting a Function, Shading a Subregion Under a Function Curve, Creating a
Network Graph, Using Text Labels in a Network Graph, Creating a Heat Map, Creating a Three-Dimensional
Scatter Plot, Adding a Prediction Surface to a Three-Dimensional Plot, Saving a Three-Dimensional Plot,
Animating a Three-Dimensional Plot, Creating a Dendrogram, Creating a Vector Field, Creating a QQ Plot,
Creating a Graph of an Empirical Cumulative Distribution Function, Creating a Mosaic Plot, Creating a Pie
Chart, Creating a Map, Creating a Choropleth Map, Making a Map with a Clean Background
UNIT – IV
Relationship between variables, and estimation 8hrs
Survival Curves, Hazard Function, Estimating Survival Curves, Kaplan-Meier Estimation, The Marriage Curve,
Estimating the Survival Function, Confidence Intervals, Normal Distributions, Sampling Distributions,
Representing Normal Distributions, Central Limit Theorem, Testing the CLT, Applying the CLT, Correlation
Test, Chi-Squared Test
Text books:
1. Think Stats, 2nd Edition: Exploratory Data Analysis, Allen B. Downey, 2014, 226 pages, ISBN-13: 978-1-49190-733-7.
Reference books:
1. Making sense of Data: A practical Guide to Exploratory Data Analysis and Data Mining, by Glenn J. Myatt.
2. Making Sense of Data II: A Practical Guide to Data Visualization, Advanced Data Mining Methods, and Applications, Glenn J. Myatt and Wayne P. Johnson. Print ISBN: 9780470222805; Online ISBN: 9780470417409; DOI: 10.1002/9780470417409.
Semester: I Year: 2019-2020
Pre-requisites:
Students should have knowledge of ‘C’ Programming.
Knowledge of data structures, discrete mathematics, probability, and basic mathematical concepts.
Students should have completed Analysis and Design of Algorithm course.
Course Outcomes:
Students will be able to
CO’s Course Learning Outcomes BL
CO1 Apply the most appropriate algorithms to solve a real-world problem through data science applications. L3
CO2 Evaluate and measure the performance of an algorithm. L4
CO3 Design an algorithm for a given problem to find an approximate solution. L5
CO4 Describe optimization techniques using algorithms and perform a feasibility study for solving an optimization problem. L2
CO5 Apply optimization techniques to the given problems. L3
Teaching Methodology:
Blackboard teaching and PPT
Assignment
Assessment Methods
Open Book Test for 10 Marks.
Assignment evaluation for 10 Marks.
Three internal tests, 30 Marks each, will be conducted and the average of the best two will be taken.
A final examination of 100 Marks will be conducted and evaluated for 50 Marks.
Unit – I 10 Hrs
Basics of Algorithm Analysis; Probabilistic Analysis & Randomized Algorithm: The hiring problem, Indicator
Random Variables, Randomized Algorithms
Dynamic Programming: Principles of Dynamic programming, Segmented Least Squares, Sequence Alignment in
Linear Space.
Unit – II 12 Hrs
Network Flow: Maximum Flow Networks, Preflow-Push Maximum Flow Algorithm.
Graph Algorithms: Basics - Searching and Traversing, Ideas Behind Map Searches: A* Algorithm.
Spectral Algorithms: The Best-Fit Subspace, Mixture Models. Streaming algorithms for computing statistics on the data: Models and Basic Techniques, Hash Functions, Counting Distinct Elements, Frequency Estimation, Other Streaming Problems.
Unit – III 10 Hrs
NP and Computational Intractability: Polynomial-Time Reduction, The Satisfiability Problem, Polynomial-Time Verification, NP-Completeness and Reducibility, NP-Complete Problems.
Approximation Algorithms: Greedy Algorithms and Bounds on the Optimum, the Center Selection Problem, The Pricing Method, Maximization via the Pricing Method, Linear Programming and Rounding.
Unit – IV 12 Hrs
Optimization Methods:
Need for unconstrained methods in solving constrained problems. Necessary conditions of unconstrained optimization,
Structure of methods, quadratic models. Methods of line search, Armijo-Goldstein and Wolfe conditions for partial line
search. Global convergence theorem, Steepest descent method. Quasi-Newton methods: DFP, BFGS, Broyden family.
Unit – V 8 Hrs
Conjugate-direction Methods: Fletcher-Reeves, Polak-Ribière. Derivative-free methods: finite differencing.
Restricted step methods. Methods for sums of squares and nonlinear equations. Linear and Quadratic Programming.
Duality in optimization.
Optimization algorithms for parameter tuning or design projects: Genetic algorithms, quantum-inspired evolutionary algorithms, simulated annealing, particle swarm optimization, ant colony optimization.
Text Books:
1. Jon Kleinberg, Éva Tardos, “Algorithm Design”, Pearson Addison-Wesley
2. Cormen T.H., Leiserson C.E., Rivest R.L., Stein C., Introduction to Algorithms, 3rd edition, PHI 2010, ISBN: 9780262033848
3. Fletcher R., Practical Methods of Optimization, John Wiley, 2000.
Reference Material
1. Spectral Algorithms, by Ravindran Kannan and Santosh Vempala, 2009,
https://www.cc.gatech.edu/~vempala/spectralbook.pdf
2. Streaming Algorithms, Great Ideas in Theoretical Computer Science, Saarland University, Summer
2014
3. S. Muthukrishnan. Data Streams: Algorithms and Applications. Foundations and Trends in Theoretical
Computer Science, 1(2), 2005.
4. http://theory.stanford.edu/~amitp/GameProgramming/AStarComparison.html
Semester: I Year: 2019-2020
Pre-requisites:
Probability and Statistics for Data Science.
Good programming skills
Course Outcomes:
Students will be able to
CO’s Course Learning Outcomes BL
CO1 Describe the fundamental advantage and necessity of forecasting in various situations. L2
CO2 Identify how to choose an appropriate forecasting method in a particular environment. L2
CO3 Apply various forecasting methods, which include obtaining the relevant data and carrying out the necessary computation using suitable statistical software. L3
CO4 Improve forecasts with better statistical models based on statistical analysis. L4
Teaching Methodology:
Blackboard teaching and PPT
Programming Assignment
Assessment Methods
Open Book Test for 10 Marks.
Assignment evaluation for 10 Marks on basis of Rubrics
Three internal tests, 30 Marks each, will be conducted and the average of the best two will be taken.
A final examination of 100 Marks will be conducted and evaluated for 50 Marks.
Unit – I 10 Hrs
An Introduction to Forecasting: Forecasting and Data. Forecasting Methods. Errors in Forecasting. Choosing a
Forecasting Technique. An Overview of Quantitative Forecasting Techniques.
REGRESSION ANALYSIS: The Simple Linear Regression Model. The Least Squares Point Estimates. Point
Estimates and Point Predictions. Model Assumptions and the Standard Error. Testing the Significance of the Slope and
y Intercept. Confidence and Prediction Intervals. Simple Coefficients of Determination and Correlation. An F Test for
the Model.
Unit – II 10 Hrs
Multiple Linear Regressions: The Linear Regression Model. The Least Squares Estimates, and Point Estimation and
Prediction. The Mean Square Error and the Standard Error. Model Utility: R2, Adjusted R2, and the Overall F Test.
Model Building and Residual Analysis: Model Building and the Effects of Multicollinearity. Residual Analysis in
Simple Regression. Residual Analysis in Multiple Regression. Diagnostics for Detecting Outlying and Influential
Observations
Unit – III 12 Hrs
Time Series Regression: Modelling Trend by Using Polynomial Functions. Detecting Autocorrelation. Types of
Seasonal Variation. Modelling Seasonal Variation by Using Dummy Variables and Trigonometric Functions. Growth
Curves. Handling First-Order Autocorrelation.
Decomposition Methods: Multiplicative Decomposition. Additive Decomposition. The X-12-ARIMA Seasonal
Adjustment Method. Exercises.
Exponential Smoothing: Simple Exponential Smoothing. Tracking Signals. Holt’s Trend Corrected Exponential Smoothing. Holt-Winters Methods. Damped Trends and Other Exponential Smoothing Methods.
Unit – IV 10 Hrs
Non-seasonal Box-Jenkins Modelling and Their Tentative Identification: Stationary and Nonstationary Time
Series. The Sample Autocorrelation and Partial Autocorrelation Functions: The SAC and SPAC. An Introduction to
Non-seasonal Modelling and Forecasting. Tentative Identification of Non-seasonal Box-Jenkins Models.
Estimation, Diagnostic Checking, and Forecasting for Non-seasonal Box-Jenkins Models: Estimation. Diagnostic
Checking. Forecasting. A Case Study. Box-Jenkins Implementation of Exponential Smoothing.
Unit – V 10 Hrs
Box-Jenkins Seasonal Modelling: Transforming a Seasonal Time Series into a Stationary Time Series. Examples of
Seasonal Modelling and Forecasting. Box-Jenkins Error Term Models in Time Series Regression.
Advanced Box-Jenkins Modelling: The General Seasonal Model and Guidelines for Tentative Identification.
Intervention Models. A Procedure for Building a Transfer Function Model
Causality in time series: Granger causality. Hypothesis testing on rational expectations. Hypothesis testing on market
efficiency.
Text Books:
1. Bruce L. Bowerman, Richard O'Connell, Anne Koehler, “Forecasting, Time Series, and Regression,
4th Edition”, Cengage Unlimited Publishers
2. Enders W. Applied Econometric Time Series. John Wiley & Sons, Inc., 1995
Reference Books:
1. Mills, T.C., The Econometric Modelling of Financial Time Series, Cambridge University Press, 1999.
2. Andrew C. Harvey, Time Series Models, Harvester Wheatsheaf, 1993.
3. P. J. Brockwell, R. A. Davis, Introduction to Time Series and Forecasting, Springer, 1996.
4. Cryer, Jonathan D. and Chan, Kung-Sik, “Time Series Analysis: With Applications in R”, New York: Springer, 2008.
Semester: I Year: 2019-2020
Pre-requisites:
Course Outcomes:
Students will be able to:
Teaching Methodology:
Three internal tests, 30 Marks each, will be conducted and the average of the best two will be taken.
Rubrics-based evaluation of the Programming Assignment for 20 Marks.
A final examination of 100 Marks will be conducted and evaluated for 50 Marks.
UNIT – I 10hrs
Introduction: Why is computer vision difficult? Image representation and analysis tasks. The image, its representations and properties: a few concepts, image digitization, digital image properties, color images, cameras. The image, its mathematical and physical background: linear integral transforms, images as stochastic processes, image formation physics.
UNIT – II 10 hrs
Data structures for image analysis: levels of image data representation, traditional image data structures, and hierarchical data structures. Image pre-processing: pixel brightness transformations, geometric transformations, local pre-processing, image restoration.
UNIT – IV 11 hrs
Shape representation and description: region identification; contour-based shape representation and description - chain codes, simple geometric border representation; region-based representation and description - simple scalar region descriptors, moments.
UNIT – V 11 hrs
Recognition: knowledge representation; statistical pattern recognition - classification principles, classifier settings, classifier learning; support vector machines; cluster analysis. Neural nets: feed-forward networks, unsupervised learning, Hopfield neural networks.
Text books:
Reference book:
1. Digital Image Processing and Analysis by Chanda and Dutta Majumder
2. Digital Image Processing by Gonzalez and Woods.
Semester: I Year: 2019-2020
Pre-requisites:
Basic Python Programming,
Machine learning
Fundamentals of Probability and Statistics
Course Outcomes:
Students will be able to
CO’s Course Learning Outcomes BL
CO1 Describe the commands and set up the programming environment of Python, R and MapReduce. L2
CO2 Apply machine learning concepts to analyze real-world problems using data analysis. L3
CO3 Apply probability and statistical techniques to solve problems of moderate complexity. L3
CO4 Analyze large data sets to derive interesting inferences. L4
Teaching Methodology:
Blackboard teaching and PPT
Executables
Programming Assignment
Assessment Methods
Program Evaluation on the basis of Rubrics.
Two internal tests, 20 Marks each, will be conducted and the average of the best two will be taken.
A final examination of 50 Marks will be conducted and evaluated for 50 Marks.
2. Basic Python: Dr. Granger is interested in studying the relationship between the length of house-elves’ ears and aspects of their DNA. She has obtained DNA samples and ear measurements from a small group of house-elves to conduct a preliminary analysis. You are supposed to conduct the analysis for her. She has placed the file on the web for you to download.
Write a Python script that:
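The specific requirements of the script are not reproduced in the hand-out. A minimal sketch of the kind of analysis intended, with a hypothetical file name and hypothetical column names (ear_length, dna_marker), might look like this:

    # Minimal sketch; file name and column names are hypothetical stand-ins.
    import pandas as pd

    data = pd.read_csv("houseelf_earlength_dna.csv")    # hypothetical data file
    print(data.describe())                              # summary statistics
    print(data["ear_length"].corr(data["dna_marker"]))  # ear length vs DNA measure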
https://github.com/mGalarnyk/datasciencecoursera/blob/master/4_Exploratory_Data_Analysis/project1/README.md
You can read each of the two files using the readRDS() function in R. For example,
reading in each file can be done with the following code:
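The code itself is not reproduced in the hand-out; based on the linked project, it is presumably along the following lines (file names as distributed with that assignment):

    ## Read the two RDS files distributed with the assignment (assumed names)
    NEI <- readRDS("summarySCC_PM25.rds")
    SCC <- readRDS("Source_Classification_Code.rds")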
You must address the following questions and tasks in your exploratory analysis. For
each question/task you will need to make a single plot. Unless specified, you can use
any plotting system in R to make your plot.
1. Have total emissions from PM2.5 decreased in the United States from 1999 to 2008? Using the base plotting system, make a plot showing the total PM2.5 emissions from all sources for each of the years 1999, 2002, 2005, and 2008.
2. Have total emissions from PM2.5 decreased in Baltimore City, Maryland (fips == "24510") from 1999 to 2008? Use the base plotting system to make a plot answering this question.
3. Of the four types of sources indicated by the type variable (point, nonpoint, onroad, nonroad), which have seen decreases in emissions from 1999 to 2008 for Baltimore City? Which have seen increases in emissions from 1999 to 2008? Use the ggplot2 plotting system to make a plot answering this question.
4. Across the United States, how have emissions from coal combustion-related sources changed from 1999 to 2008?
5. How have emissions from motor vehicle sources changed from 1999 to 2008 in Baltimore City?
6. Compare emissions from motor vehicle sources in Baltimore City with emissions from motor vehicle sources in Los Angeles County, California (fips == "06037"). Which city has seen greater changes over time in motor vehicle emissions?
Hint:
https://github.com/mGalarnyk/datasciencecoursera/blob/master/4_Exploratory_Data_Analysis/project2/project2.md
5. Decision Tree: Binary Decision Trees. One very interesting application area of machine learning is in making medical diagnoses.
Objective: To train and test a binary decision tree to detect breast cancer using real-world data, using Python/R. Predict whether the cancer is benign or malignant.
Data Set: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
The Dataset: We will use the Wisconsin Diagnostic Breast Cancer (WDBC) dataset. The dataset consists of 569 samples of biopsied tissue. The tissue for each sample is imaged, and 10 characteristics of the nuclei of cells present in each image are characterized. These characteristics are: 1. Radius 2. Texture 3. Perimeter 4. Area 5. Smoothness 6. Compactness 7. Concavity 8. Number of concave portions of contour 9. Symmetry 10. Fractal dimension.
Each of the 569 samples used in the dataset consists of a feature vector of length 30.
The first 10 entries in this feature vector are the mean of the characteristics listed above
for each image. The second 10 are the standard deviation and last 10 are the largest
value of each of these characteristics present in each image. Each sample is also
associated with a label. A label of value 1 indicates the sample was for malignant
(cancerous) tissue. A label of value 0 indicates the sample was for benign tissue. This
dataset has already been broken up into training, validation and test sets for you and is
available in the compressed archive for this problem on the class website. The names
of the files are “trainX.csv”, “trainY.csv”, “validationX.csv”, “validationY.csv”,
“testX.csv” and “testY.csv.” The file names ending in “X.csv” contain feature vectors
and those ending in “Y.csv” contain labels. Each file is in comma separated value
format where each row represents a sample.
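As a starting point, a minimal sketch of such a classifier is given below. It uses scikit-learn's bundled copy of the WDBC data rather than the course's pre-split CSV files; note that scikit-learn encodes the labels the other way around (0 = malignant, 1 = benign).

    # Sketch: binary decision tree on WDBC via scikit-learn's bundled copy.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)          # 569 samples, 30 features
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
    print("test accuracy:", clf.score(X_te, y_te))      # held-out accuracy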
6. Linear Regression with One Variable: Objective: Implement linear regression with one variable to predict profits for a food truck.
Data Set: https://searchcode.com/codesearch/view/5404318/#
Suppose you are the CEO of a restaurant franchise and are considering different cities for opening a new outlet. The chain already has trucks in various cities and you have data for profits and populations from the cities. You would like to use this data to help you select which city to expand to next. The file ex1data1.txt contains the dataset for our linear regression problem. The first column is the population of a city and the second column is the profit of a food truck in that city. A negative value for profit indicates a loss.
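A minimal sketch of the intended solution, assuming ex1data1.txt is available locally and using settings commonly paired with this exercise (learning rate 0.01, 1500 iterations):

    # Sketch: univariate linear regression (population -> profit) by gradient descent.
    import numpy as np

    data = np.loadtxt("ex1data1.txt", delimiter=",")    # columns: population, profit
    X = np.c_[np.ones(len(data)), data[:, 0]]           # prepend intercept column
    y = data[:, 1]

    theta, alpha = np.zeros(2), 0.01
    for _ in range(1500):                               # batch gradient descent
        theta -= alpha / len(y) * X.T @ (X @ theta - y)
    print("intercept, slope:", theta)
    print("profit for population 7.0:", np.array([1, 7.0]) @ theta)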
7. Linear Regression with Multiple Variables: Objective: Implement linear regression with multiple variables to predict the prices of houses.
Data Set: https://searchcode.com/codesearch/view/6577026/
Suppose you are selling your house and you want to know what a good market price would be. One way to do this is to first collect information on recent houses sold and make a model of housing prices. The file ex1data2.txt contains a training set of housing prices in Portland, Oregon. The first column is the size of the house (in square feet), the second column is the number of bedrooms, and the third column is the price of the house.
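A minimal sketch, solving the same problem in closed form with the normal equation (feature scaling would matter only for a gradient-descent variant):

    # Sketch: multivariate regression (size, bedrooms -> price) via the normal equation.
    import numpy as np

    data = np.loadtxt("ex1data2.txt", delimiter=",")
    X = np.c_[np.ones(len(data)), data[:, :2]]          # intercept, size, bedrooms
    y = data[:, 2]

    theta = np.linalg.solve(X.T @ X, X.T @ y)           # solves (X'X) theta = X'y
    print("price for a 1650 sq ft, 3-bedroom house:", np.array([1, 1650, 3]) @ theta)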
8. Logistic Regression: Objective: Build a logistic regression model to predict whether a student gets admitted into a university.
Dataset: http://en.pudn.com/Download/item/id/2546378.html
Suppose that you are the administrator of a university department and you want to
determine each applicant’s chance of admission based on their results on two exams.
You have historical data from previous applicants that you can use as a training set for
logistic regression. For each training example, you have the applicant’s scores on two
exams and the admissions decision. Your task is to build a classification model that estimates an applicant’s probability of admission based on the scores from those two exams.
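A minimal sketch using scikit-learn, assuming the downloaded file is the classic ex2data1.txt with exam 1 score, exam 2 score, and a 0/1 admission label per row:

    # Sketch: logistic regression for university admission.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    data = np.loadtxt("ex2data1.txt", delimiter=",")    # assumed file layout
    X, y = data[:, :2], data[:, 2]
    model = LogisticRegression(max_iter=1000).fit(X, y)
    print("P(admitted | exam scores 45, 85):", model.predict_proba([[45, 85]])[0, 1])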
9. Probability Distribution: Generate and plot some data from a Poisson distribution with an arrival rate of 1.
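A minimal sketch with NumPy and Matplotlib:

    # Sketch: sample from Poisson(1) and plot a histogram of the counts.
    import numpy as np
    import matplotlib.pyplot as plt

    samples = np.random.poisson(lam=1.0, size=1000)
    plt.hist(samples, bins=range(samples.max() + 2), align="left", rwidth=0.8)
    plt.xlabel("count"); plt.ylabel("frequency"); plt.title("Poisson(1) samples")
    plt.show()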
10. Uniform Probability Distribution: Calculate the area of A = {(x, y) ∈ ℝ² : 0 < x < 1; 0 < x < y²} using the Monte Carlo Integration Method.
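A minimal sketch, reading the region as the part of the unit square where x < y²; under that reading the exact area is 1/3:

    # Sketch: Monte Carlo integration over the unit square.
    import numpy as np

    n = 1_000_000
    x, y = np.random.rand(n), np.random.rand(n)         # uniform on (0,1) x (0,1)
    print("estimated area:", np.mean(x < y**2))         # exact value: 1/3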
11. Support Vector Machine: Objective: To model a classifier for predicting whether a patient is suffering from any heart disease or not.
Data Set: https://archive.ics.uci.edu/ml/datasets/heart+Disease
Hint: https://dataaspirant.com/2017/01/19/support-vector-machine-classifier-implementation-r-caret-package/
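The hint above uses R's caret package; an equivalent minimal sketch in Python with scikit-learn, assuming a local heart.csv export of the UCI data with a binary "target" column, might be:

    # Sketch: SVM classifier for heart disease (assumed heart.csv layout).
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    df = pd.read_csv("heart.csv")                       # hypothetical local export
    X, y = df.drop(columns="target"), df["target"]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

    scaler = StandardScaler().fit(X_tr)                 # SVMs need scaled features
    clf = SVC(kernel="rbf").fit(scaler.transform(X_tr), y_tr)
    print("test accuracy:", clf.score(scaler.transform(X_te), y_te))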
12. Bayes Theorem: Data Set: specdata.zip
The zip file containing the data can be downloaded here: specdata.zip. The zip file
contains 332 comma-separated-value (CSV) files containing pollution monitoring data
for fine particulate matter (PM) air pollution at 332 locations in the United States. Each
file contains data from a single monitor and the ID number for each monitor is
contained in the file name. For example, data for monitor 200 is contained in the file
“200.csv”. Each file contains three variables. Date: the date of the observation in (year-
month-day) format, sulfate: the level of sulfate PM in the air on that date (measured in
micrograms per cubic meter), and nitrate: the level of nitrate PM in the air on that date
(measured in micrograms per cubic meter)
1. Write a function named ‘pollutantmean’ that calculates the mean of a
pollutant (sulfate or nitrate) across a specified list of monitors. The function
‘pollutantmean’ takes three arguments: ‘directory’, ‘pollutant’, and ‘id’.
Given a vector of monitor ID numbers, ‘pollutantmean’ reads those monitors’
particulate matter data from the directory specified in the ‘directory’ argument
and returns the mean of the pollutant across all of the monitors, ignoring any
missing values coded as NA.
2. Write a function that reads a directory full of files and reports the number of
completely observed cases in each data file. The function should return a data
frame where the first column is the name of the file and the second column is
the number of complete cases.
3. Write a function that takes a directory of data files and a threshold for
complete cases and calculates the correlation between sulfate and nitrate for
monitor locations where the number of completely observed cases (on all
variables) is greater than the threshold. The function should return a vector of
correlations for the monitors that meet the threshold requirement. If no
monitors meet the threshold requirement, then the function should return a
numeric vector of length 0.
Hint:
https://rpubs.com/ahmedtadde/DS-Rprogramming1
https://github.com/mGalarnyk/datasciencecoursera/blob/master/2_R_Programming/pro
jects/project1.md
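The course phrases these tasks in R (see the hints above). A Python rendering of the first task, assuming the specdata file names are zero-padded to three digits as in the original dataset (e.g. 001.csv, 200.csv), would be roughly:

    # Sketch: mean of one pollutant across selected monitor files, ignoring NAs.
    import os
    import pandas as pd

    def pollutantmean(directory, pollutant, ids):
        frames = [pd.read_csv(os.path.join(directory, f"{i:03d}.csv")) for i in ids]
        return pd.concat(frames)[pollutant].mean()      # pandas skips NaN by default

    print(pollutantmean("specdata", "sulfate", range(1, 11)))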
13. MapReduce: Write map and reduce methods to count the number of occurrences of each word in a file. For the purposes of this assignment, a word will be defined as any string of alphabetic characters appearing between non-alphabetic characters; "nature's" is two words. The count should be case-insensitive. If a word occurs multiple times in a line, all should be counted. A StringTokenizer is a convenient way to parse the words from the input line. There is documentation of StringTokenizer online, and there is an example of its use in the reader functions.
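The hand-out has Hadoop's Java API and StringTokenizer in mind; the map/reduce logic itself can be sketched in plain Python as follows (note how "Nature's" tokenizes to the two words "nature" and "s"):

    # Sketch: word count as a map step emitting (word, 1) and a reduce step summing.
    import re
    from collections import defaultdict

    def map_line(line):
        # alphabetic runs only, lower-cased for case-insensitive counting
        return [(w.lower(), 1) for w in re.findall(r"[A-Za-z]+", line)]

    def reduce_counts(pairs):
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return dict(counts)

    lines = ["Nature's laws", "nature wins"]
    print(reduce_counts(p for line in lines for p in map_line(line)))
    # {'nature': 2, 's': 1, 'laws': 1, 'wins': 1}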
14. Objective: Write map and reduce methods to determine the average ratings of movies.
Data Set: http://academictorrents.com/details/9b13183dc4d60676b773c9e2cd6de5e5542cee9a
The input consists of a series of lines, each containing a movie number, user number, rating, and date: 3980,294028,5,2005-11-15
The map should emit the movie number and a list of ratings, and the reduce should return, for each movie number, the average rating as a Double and the number of ratings as an Integer. This data is similar to the Netflix Prize data.
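The same map/reduce pattern sketched in plain Python for the ratings task:

    # Sketch: map emits (movie, rating); reduce returns (average, count) per movie.
    from collections import defaultdict

    def map_line(line):
        movie, _user, rating, _date = line.split(",")
        return movie, float(rating)

    def reduce_ratings(pairs):
        totals = defaultdict(lambda: [0.0, 0])
        for movie, rating in pairs:
            totals[movie][0] += rating
            totals[movie][1] += 1
        return {m: (s / n, n) for m, (s, n) in totals.items()}

    lines = ["3980,294028,5,2005-11-15", "3980,101,3,2005-12-01"]  # 2nd line hypothetical
    print(reduce_ratings(map(map_line, lines)))         # {'3980': (4.0, 2)}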
15. K-means Clustering: Given the matrix X whose rows represent different data points, you are asked to perform a k-means clustering on this dataset using the Euclidean distance as the distance function. Here k is chosen as 3. The Euclidean distance d between a vector x and a vector y, both in R^p, is defined as $d = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}$. All data in X were plotted in Figure 1. The centres of the 3 clusters were initialized as µ1 = (6.2, 3.2) (red), µ2 = (6.6, 3.7) (green), µ3 = (6.5, 3.0) (blue).
1. What’s the centre of the first cluster (red) after one iteration? (Answer in the format [x1, x2]; round your results to three decimal places, same as problems 2 and 3.)
2. What’s the centre of the second cluster (green) after two iterations?
3. What’s the centre of the third cluster (blue) when the clustering converges?
4. How many iterations are required for the clusters to converge?
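A minimal sketch of the iteration these questions refer to, using the given initial centres; the data matrix X below is a placeholder, since the actual points appear only in the missing Figure 1:

    # Sketch: Lloyd's algorithm with k = 3 and the given initial centres.
    import numpy as np

    X = np.array([[6.0, 3.1], [6.3, 3.3], [6.7, 3.6],
                  [6.4, 2.9], [6.6, 3.1]])              # placeholder data points
    centres = np.array([[6.2, 3.2], [6.6, 3.7], [6.5, 3.0]])  # mu1 red, mu2 green, mu3 blue

    for _ in range(100):
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)                       # nearest-centre assignment
        new = np.array([X[labels == k].mean(axis=0) if (labels == k).any() else centres[k]
                        for k in range(3)])             # recompute each centre
        if np.allclose(new, centres):
            break
        centres = new
    print(np.round(centres, 3))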
16. Hierarchical Clustering: In the figure there are two clusters, A (red) and B (blue), each with four members; the coordinates of each member are labeled in the figure. Compute the distance between the two clusters using the Euclidean distance.
1. What is the distance between the two farthest members (complete link)? (Round to four decimal places here and in the next two problems.)
2. What is the distance between the two closest members? (single link)
3. What is the average distance between all pairs?
4. Among all three distances above, which one is robust to noise? Answer either
“complete”, “single”, or “average”.
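A minimal sketch of the three linkage distances; the coordinates of A and B below are placeholders for the values labelled in the missing figure:

    # Sketch: complete, single, and average link distances between two clusters.
    import numpy as np

    A = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])  # placeholder cluster A
    B = np.array([[5.0, 5.0], [5.5, 6.0], [6.0, 5.0], [6.5, 6.5]])  # placeholder cluster B

    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)       # all pairwise distances
    print("complete link:", round(d.max(), 4))          # two farthest members
    print("single link:  ", round(d.min(), 4))          # two closest members
    print("average link: ", round(float(d.mean()), 4))  # mean over all 16 pairs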
17. Multivariate Analysis: Multilinear Regression.
18. 1. Using the matrices X and Y found in FILE2, do the following:
2. Using the matrix XY (the first 3 columns of XY are the Y variables while the last 5 columns are the X variables) found in FILE5, do the following:
File1:
Mata
X = (3,9,17,24\7,8,11,25\6,5,13,29\4,7,15,32\7,9,13,24\8,8,1,23)
Stata
mat X = (3,9,17,24\7,8,11,25\6,5,13,29\4,7,15,32\7,9,13,24\8,8,1,23)
SAS IML
X={
3 9 17 24,
7 8 11 25,
6 5 13 29,
4 7 15 32,
7 9 13 24,
8 8 1 23};
File 2:
Mata
X = (3,9,17,24\7,8,11,25\6,5,13,29\4,7,15,32\7,9,13,24\8,8,1,23)
Stata
mat X = (3,9,17,24\7,8,11,25\6,5,13,29\4,7,15,32\7,9,13,24\8,8,1,23)
SAS IML
X={
3 9 17 24,
7 8 11 25,
6 5 13 29,
4 7 15 32,
7 9 13 24,
8 8 1 23};
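Only the 6x4 matrix is reproduced above (FILE5's XY matrix is not). A minimal sketch of a multilinear regression on it, under the explicit assumption that the first column is taken as the response and the remaining three (plus an intercept) as predictors:

    # Sketch: OLS on the 6x4 matrix from File 1/File 2 (column roles assumed).
    import numpy as np

    M = np.array([[3, 9, 17, 24], [7, 8, 11, 25], [6, 5, 13, 29],
                  [4, 7, 15, 32], [7, 9, 13, 24], [8, 8, 1, 23]], dtype=float)
    y, X = M[:, 0], np.c_[np.ones(6), M[:, 1:]]         # assumed response / predictors

    beta, *_ = np.linalg.lstsq(X, y, rcond=None)        # least-squares coefficients
    print("coefficients (intercept first):", np.round(beta, 3))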