Mtech-Syllabus-Data Science - Sem1

The document provides information about a curriculum handbook for the M.Tech Data Science program at Nitte Meenakshi Institute of Technology. It includes details about 5 core courses in the first semester including Introduction to Data Management, Statistics for Data Science, and courses on data warehousing, NoSQL databases, and data management solutions for internet applications. For each course it lists the course code, credits, outcomes, content, textbooks, and assessment methods. The courses aim to describe fundamental data management techniques and statistical concepts, and apply various data analysis methods.

NITTE MEENAKSHI INSTITUTE OF TECHNOLOGY
(A Unit of Nitte Education Trust (R), Mangalore)
An Autonomous Institution

Department of Information Science and Engineering

Curriculum Handbook for M.Tech – Data Science

SEMESTER I
Semester: I Year: 2019-2020

Department: Information Science and Engineering Course Type: Core


Course Title: Introduction to Data Management    Course Code: 19DS11
L-T-P: 3-0-2    Credits: 04
Total Contact Hours: 39 hrs    Duration of SEE: 3 hrs
SEE Marks: 50    CIE Marks: 50

Pre-requisites:
 Database Management Systems
 Good programming skills

Course Outcomes:
Students will be able to
CO’s Course Learning Outcomes BL
CO1 Describe the need for managing/storing data and identify the value and L2
relative importance of data management.
CO2 Describe fundamentals of Data Management techniques suitable for L2
Enterprise Applications.
CO3 Apply Data Management Solution for Internet Applications. L3
CO4 Describe various data analysis techniques in the internet Context. L2

Teaching Methodology:
 Blackboard teaching and PPT
 Programming Assignment

Assessment Methods
 Open Book Test for 10 Marks.
 Assignment evaluation for 10 Marks on basis of Rubrics
 Three internals, 30 Marks each will be conducted and the Average of best of two will be taken.
 Final examination, of 100 Marks will be conducted and will be evaluated for 50 Marks.

Course Outcome to Programme Outcome Mapping

PO1 PO2 PO3 PO4 PO5 PO6


CO1 1 3 1
CO2 1 2 3
CO3 2 2 1 2 2
CO4 1 2 2
19DS11 1 2 1 2 2
COURSE CONTENT

Unit – I 10 Hrs
Introduction to Data Science and Class Logistics/Overview, Statistical Inference and Exploratory Data Analysis,
Principles of Data Management, SQL for Data Science: SQL Basics, SQL Joins and aggregates, Grouping and query
evaluation, SQL Sub-queries, Key Principles of RDBMS
Unit – II 10 Hrs
Data Models, Data Warehousing, OLAP, Data Storage and Indexing , Query Optimization and Cost Estimation,
Datalog, E/R Diagrams and Constraints, Design Theory, BCNF
Unit – III 8 Hrs
Data Management Solutions for Enterprise Applications: Introduction to Transactions, Transaction
Implementations, Transaction Model, Database Concurrency Control Protocols, Transaction Failures and Recovery,
Database Recovery Protocols.
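The transaction concepts in this unit (atomic commit, rollback on failure) can be sketched with sqlite3, whose connection object acts as a transaction context manager; the account table and the no-overdraft rule are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO account VALUES ('A', 100.0), ('B', 50.0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` atomically; roll back both updates on any failure."""
    try:
        with conn:  # the with-block is one transaction: commit on success
            conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            # simulated integrity constraint: no overdrafts allowed
            (bal,) = conn.execute("SELECT balance FROM account WHERE name = ?",
                                  (src,)).fetchone()
            if bal < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False  # the with-block has already rolled back

ok = transfer(conn, 'A', 'B', 30.0)    # succeeds and commits
bad = transfer(conn, 'A', 'B', 500.0)  # violates the constraint, rolls back
balances = dict(conn.execute("SELECT name, balance FROM account"))
```

The failed transfer leaves both balances exactly as the committed transaction left them, which is the atomicity property the recovery protocols in this unit guarantee.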

Unit – IV 12 Hrs
Parallel Databases: Introduction to NoSQL databases, Apache Cassandra, MongoDB, Apache Hive.
(Text Book 3: Chapters 1, 2, 5)
Unit – V 12 Hrs
Data Management Solution for Internet Applications: Google's Application Stack: Chubby Lock Service, BigTable
Data Store, and Google File System; Yahoo's key-value store: PNUTS; Amazon's key-value store: Dynamo;

Text Books:
1. Database Systems: The Complete Book, by Hector Garcia-Molina, Jennifer Widom, and Jeffrey
Ullman. Second edition.
2. Fundamentals of database systems by Elmasri and Navathe
3. Seven NoSQL Databases in a Week: Get up and running with the fundamentals, By Xun (Brian) Wu,
Sudarshan Kadambi, Devram Kandhare, Aaron Ploetz, Packt Publishers
Reference Books/resources:
1. Database management systems by Raghu Ramakrishnan and Johannes Gehrke.
2. Foundations of database systems by Abiteboul, Hull and Vianu
3. “Transactional Information Systems” by Gerhard Weikum and Gottfried Vossen, publisher Morgan
Kaufmann.
4. Programming Hive: Data Warehouse and Query Language for Hadoop By Edward Capriolo, Dean
Wampler, Jason Rutherglen, O’Reilly
5. https://ai.google/research/pubs/pub27897
6. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach Mike Burrows,
Tushar Chandra, Andrew Fikes, Robert E. Gruber, Bigtable: A Distributed Storage System for Structured
Data, Google, Inc. OSDI 2006
7. Brian F. Cooper et al., “PNUTS: Yahoo!'s Hosted Data Serving Platform”, Proceedings of the VLDB
Endowment, Volume 1, Issue 2, August 2008, pages 1277-1288
8. Giuseppe DeCandia et al., “Dynamo: Amazon’s Highly Available Key-value Store”, Proceedings of SOSP
'07, the 21st ACM SIGOPS Symposium on Operating Systems Principles, pages 205-220, Stevenson,
Washington, USA, October 14-17, 2007
Semester: I Year: 2019-2020

Department: Information Science and Engineering Course Type: Core


Course Title: Statistics for Data Science    Course Code: 19DS12
L-T-P: 4-0-0    Credits: 04
Total Contact Hours: 52 hrs    Duration of SEE: 3 hrs
SEE Marks: 50 CIE Marks: 50
Pre-requisites:

 Good understanding of engineering mathematics (especially Algebra and Arithmetic).


 Inferring conclusions from two- and three-dimensional graphs.
Course Outcomes:
Students will be able to:

Cos Course Outcome Description Blooms Level


CO1 Describe the basic and intermediate concepts of probability, statistics, and L2
distributions.
CO2 Describe the applications of discrete probability distributions. L2
CO3 Analyze the inference about population statistic based on the parameters of L4
sample population.
CO4 Analyze hypothesis to accept/reject alternative hypothesis based on L4
statistical evidence available.
CO5 Apply regression, ANOVA, and goodness of fit test to construct model L3
and infer conclusions about population/sample.

Teaching Methodology:

 Black Board Teaching / Power Point Presentation.


 Seminar
Assessment Methods:

 Rubrics to evaluate Case Study (depends on the course)


 Rubrics to evaluate Course Project (depends on the course)
 Three internals, 30 Marks each will be conducted and the Average of best of two will be taken.
 Final examination of 100 Marks will be conducted and will be evaluated for 50 Marks.

Course Outcome to Programme Outcome Mapping

PO1 PO2 PO3 PO4 PO5 PO6


CO1 1 2 2
CO2 2 2 1
CO3 2 2 3
CO4 3 3 3 2
CO5 2 3 3 2
19DS12 2 2 3 2
COURSE CONTENT
UNIT – I
Probability and statistics 10 hours
Why Study Statistics?, Modern Statistics, Statistics and Engineering, two Basic Concepts—Population and
Sample, A Case Study: Visually Inspecting Data to Improve Product Quality, Pareto Diagrams and Dot
Diagrams, Frequency Distributions, Graphs of Frequency Distributions, Stem-and-Leaf Displays, Descriptive
Measures, Quartiles and Percentiles, calculation of X bar and S, Problems with aggregating data, Sample Spaces
and Events, Counting, Probability, The Axioms of Probability, Some Elementary Theorems, Conditional
Probability, Bayes’ Theorem.
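Bayes' theorem from this unit can be demonstrated with a small worked computation; the prior and error rates below are invented numbers for a hypothetical diagnostic test.

```python
def bayes(prior, sensitivity, false_pos):
    """P(D | +) via Bayes' theorem over the two-event partition {D, not D}."""
    # Denominator is the law of total probability: P(+) over both cases.
    p_pos = sensitivity * prior + false_pos * (1 - prior)
    return sensitivity * prior / p_pos

# Rare condition (1% prior), fairly accurate test: the posterior after a
# positive result is still only about 16%, the classic base-rate effect.
posterior = bayes(prior=0.01, sensitivity=0.95, false_pos=0.05)
```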
UNIT – II
Probability Distributions 10 hours
Random Variables, The Binomial Distribution, The Hypergeometric Distribution, The Mean and the Variance
of a Probability Distribution, Chebyshev’s Theorem, The Poisson Distribution and Rare Events, Poisson
Processes, The Geometric and Negative Binomial Distributions, The Multinomial Distribution, Simulation.
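A minimal sketch of the distribution topics above: the binomial and Poisson pmfs, the mean and variance of a binomial computed directly from its pmf, and the Poisson approximation for rare events. The parameter values are chosen only for illustration.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

# Mean and variance of Binomial(n, p), computed from the pmf; they should
# agree with the closed forms np and np(1 - p).
n, p = 10, 0.3
mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))
var = sum((k - mean) ** 2 * binomial_pmf(k, n, p) for k in range(n + 1))

# Poisson(np) approximates Binomial(n, p) for rare events (large n, small p).
approx = poisson_pmf(2, 100 * 0.02)   # lambda = np = 2
exact = binomial_pmf(2, 100, 0.02)
```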
UNIT – III
Probability Densities and Sampling Distributions 12 hours
Continuous Random Variables, The Normal Distribution, The Normal Approximation to the Binomial
Distribution, Other Probability Densities, The Uniform Distribution, The Log-Normal Distribution, The
Gamma Distribution, The Beta Distribution, The Weibull Distribution, Populations and Samples.
UNIT – IV
Inferences concerning mean and variance 10 hours
Statistical Approaches to Making Generalizations, Point Estimation, Interval Estimation, Maximum Likelihood
Estimation, Tests of Hypotheses, Null Hypotheses and Tests of Hypotheses, Hypotheses Concerning One Mean,
The Relation between Tests and Confidence Intervals, Power, Sample Size, and Operating Characteristic Curve,
The Estimation of Variances, Hypotheses Concerning One Variance, Hypotheses Concerning Two Variances.
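A sketch of a hypothesis concerning one mean from this unit: a two-sided z-test, which applies when the population standard deviation is treated as known. The sample numbers are invented.

```python
from math import sqrt, erf

def z_test_one_mean(xbar, mu0, sigma, n):
    """Two-sided z-test for H0: mu = mu0 with known sigma; returns (z, p)."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    # Standard normal CDF expressed via the error function.
    phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))
    p = 2 * (1 - phi(abs(z)))   # two-sided p-value
    return z, p

# Invented sample: xbar = 52 from n = 25 observations, sigma = 5, H0: mu = 50.
z, p = z_test_one_mean(xbar=52.0, mu0=50.0, sigma=5.0, n=25)
```

Here z = 2.0 and p is roughly 0.0455, so H0 would be rejected at the 5% level but not at the 1% level.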
UNIT – V
Analysis of Variance/ Regression/ Goodness-of-fit tests 10 hours
Single-Factor ANOVA, Multiple Comparisons in ANOVA, More on Single-Factor ANOVA, Introduction
Two-Factor ANOVA with Kij=1, Two-Factor ANOVA with Kij>1,Three-Factor ANOVA, Introduction, The
Simple Linear Regression Model, Estimating Model Parameters, Inferences About the Slope Parameter,
Inferences Concerning μY·x* and the Prediction of Future Y Values, Correlation, Introduction, Assessing Model
Adequacy, Polynomial Regression, Goodness-of-Fit Tests
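The least squares point estimates of the simple linear regression model in this unit have a closed form, sketched below on invented data.

```python
def least_squares(xs, ys):
    """Closed-form estimates b0, b1 for the model y = b0 + b1*x + error."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))  # Sxy
    sxx = sum((x - xbar) ** 2 for x in xs)                      # Sxx
    b1 = sxy / sxx          # slope estimate
    b0 = ybar - b1 * xbar   # intercept: the fitted line passes through means
    return b0, b1

# Invented data lying close to y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = least_squares(xs, ys)
```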

Text books:
1. Miller & Freund’s Probability and Statistics for Engineers, Ninth Edition, Richard A. Johnson, Pearson.
2. Devore. J.L., “Probability and Statistics for Engineering and the Sciences”, Cengage Learning, New
Delhi, 8th Edition, 2012.
Reference books:
1. Walpole. R.E., Myers. R.H., Myers. S.L. and Ye. K., “Probability and Statistics for Engineers and
Scientists”, Pearson Education, Asia, 8th Edition, 2007.
2. Ross, S.M., “Introduction to Probability and Statistics for Engineers and Scientists”, 3rd Edition,
Elsevier, 2004.
3. Spiegel. M.R., Schiller. J. and Srinivasan. R.A., “Schaum’s Outline of Theory and Problems of
Probability and Statistics”, Tata McGraw Hill Edition, 2004.
4. Griffiths, Dawn. Head first statistics. " O'Reilly Media, Inc.", 2008.
Semester: I Year: 2019-2020

Department: Information Science and Engineering Course Type: Core


Course Title: Machine Learning-I    Course Code: 19DS13
L-T-P: 3-0-2    Credits: 04
Total Contact Hours: 39 hrs    Duration of SEE: 3 hrs
SEE Marks: 50 CIE Marks: 50

Prerequisite:
 Linear Algebra, Probability & Statistics, Calculus, Data Mining
 Any programming language, e.g., C++ or Python.

Course Outcomes:
Students will be able to
CO’s Course Learning Outcomes BL
CO1 Describe the basic underlying machine learning concepts. L2
CO2 Analyze a range of machine learning algorithms along with their strength & L4
weaknesses.
CO3 Apply appropriate machine learning techniques to solve problems of L3
moderate complexity.
CO4 Implement Ensemble methods to obtain better predictive performance than L3
could be obtained from any of the constituent learning algorithms alone

Teaching Methodology:
 Black board teaching / Power Point presentations
 Executable Codes/ Live Demonstration
 Programming Assignment

Assessment Methods:
 Online certification from NPTEL/Coursera
 Programming Assignment (10M), evaluated on the basis of Rubrics.
 Three internals, 30 Marks each will be conducted and the Average of best of two will be taken.
 Final examination of 100 Marks will be conducted and will be evaluated for 50 Marks.

Course Outcome to Programme Outcome Mapping

PO1 PO2 PO3 PO4 PO5 PO6


CO1 1 1 2
CO2 1 2 1
CO3 2 1 2 3 1
CO4 1 2 2 1
19DS13 1 1 2 2 1
COURSE CONTENT

Unit – I 8 Hrs
Concept Learning: Learning problems, Designing a learning system, perspectives and issues in Machine Learning.
Concept Learning Task, Concept Learning as search, Find S, Version space and Candidate Elimination Algorithm.
(TextBook-1)
Decision Tree Learning: Introduction, Decision tree representation, Appropriate problems for Decision Tree Learning,
The Basic Decision Tree Learning Algorithm, Hypothesis Space Search in Decision Tree Learning, Inductive Bias in
Decision Tree Learning, Issues in Decision Tree Learning (TextBook-1)
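The FIND-S algorithm from this unit can be sketched in a few lines; the EnjoySport-style training data below is invented for illustration.

```python
def find_s(examples):
    """FIND-S: the most specific hypothesis consistent with the positives.
    Each example is (attribute_tuple, label); '?' generalizes a position."""
    h = None
    for attrs, label in examples:
        if label != 'yes':
            continue            # FIND-S ignores negative examples
        if h is None:
            h = list(attrs)     # initialize to the first positive example
        else:
            # Generalize minimally: keep matching values, wildcard the rest.
            h = [hi if hi == ai else '?' for hi, ai in zip(h, attrs)]
    return h

# Invented EnjoySport-style data: (sky, temp, humidity, wind) -> enjoy?
data = [
    (('sunny', 'warm', 'normal', 'strong'), 'yes'),
    (('sunny', 'warm', 'high',   'strong'), 'yes'),
    (('rainy', 'cold', 'high',   'strong'), 'no'),
]
h = find_s(data)
```

The result is the most specific boundary of the version space; the CANDIDATE-ELIMINATION algorithm additionally maintains the most general boundary.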
Unit – II 9 Hrs
Feature Engineering for Machine Learning: Machine Learning Pipeline, Binarization, Quantization/Binning, Log
Transformation, Feature Scaling/Normalization, Interaction features, and feature selection
Text Data: Flattening, Filtering and chunking: Bag-of-X: Turning Natural Text into Flat Vectors, Filtering for
cleaner features, Atoms of Meaning: From words to n-Grams to Phrases. (TextBook3)
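A rough sketch of several feature-engineering operations named above (binarization, quantization/binning, log transformation, and a bag-of-words flattening); the thresholds, bin edges, and sample text are invented.

```python
from collections import Counter
from math import log1p

def binarize(xs, threshold):
    """Map each value to 0/1 depending on a threshold."""
    return [1 if x > threshold else 0 for x in xs]

def quantize(xs, edges):
    """Binning: a value's bin index is the number of edges it reaches."""
    return [sum(x >= e for e in edges) for x in xs]

def log_transform(xs):
    """log(1 + x) compresses heavy-tailed counts toward a usable scale."""
    return [log1p(x) for x in xs]

def bag_of_words(text):
    """Flatten natural text into a sparse vector of word counts (Bag-of-X)."""
    return Counter(text.lower().split())

counts = [0, 3, 12, 250]
bins = quantize(counts, edges=[1, 10, 100])
bow = bag_of_words("the cat sat on the mat")
```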
Unit – III 10 Hrs
Categorical variables: Encoding categorical variables, dealing with large categorical variables: feature hashing, Bin
counting

Dimensionality reduction: Intuition, Derivation, PCA in Action, Whitening and ZCA, Considerations and limitations
of PCA, Use cases (TextBook3)
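Two of the techniques above for large categorical variables, feature hashing and bin counting, sketched with invented category names. The character-sum "hash" is a deterministic stand-in for the real hash families used in practice, kept trivial so the example is reproducible.

```python
from collections import defaultdict

def feature_hash(category, dim=8):
    """Hash a categorical value into a fixed-dimension indicator vector.
    Collisions are expected; dim trades memory for collision rate."""
    vec = [0] * dim
    idx = sum(ord(c) for c in category) % dim   # toy deterministic hash
    vec[idx] = 1
    return vec

def bin_counts(pairs):
    """Bin counting: replace each category by its observed event rate."""
    clicks, views = defaultdict(int), defaultdict(int)
    for cat, clicked in pairs:
        views[cat] += 1
        clicks[cat] += clicked
    return {c: clicks[c] / views[c] for c in views}

v1 = feature_hash("bangalore")
v2 = feature_hash("bangalore")           # hashing is stateless: same vector
rates = bin_counts([("ad1", 1), ("ad1", 0), ("ad2", 1)])
```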

Unit – IV 6 Hrs
Bayesian Learning: Bayes theorem – An Example; Bayes theorem and concept learning: Brute-Force Bayes Concept
Learning, MAP Hypotheses and Consistent Learners; maximum likelihood and least-squared error hypotheses; Bayes
optimal classifier; Gibbs algorithm, naive Bayes classifier; Bayesian belief networks – Conditional Independence,
Representation, Inference, Learning Bayesian Belief Networks.
Cluster Analysis: Basic concepts and algorithms: Overview, K-Means, Agglomerative Hierarchical clustering,
DBSCAN. (TextBook2)
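K-Means from the cluster-analysis section can be sketched as alternating assignment and centroid-update steps; the 1-D points below are invented and well separated so the algorithm converges quickly.

```python
import random

def kmeans(points, k, iters=20, seed=1):
    """Plain k-means on 1-D data: assign each point to its nearest centroid,
    then move each centroid to the mean of its cluster; repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # initialize from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[j].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two invented clusters around 1.0 and 8.0.
points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
cents = kmeans(points, k=2)
```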

Unit – V 06 Hrs
Ensemble Methods: Rationale for ensemble method, methods for constructing an Ensemble classifier, Bias-Variance
decomposition, Bagging, Boosting, Random forests, Empirical comparison among Ensemble methods. (TextBook2)
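Bagging, as described above, can be sketched as bootstrap resampling plus majority voting; the base learner here is a 1-nearest-neighbour classifier on invented 1-D data.

```python
import random
from collections import Counter

def nn_predict(sample, x):
    """1-nearest-neighbour base learner on one bootstrap sample."""
    _, label = min(sample, key=lambda p: abs(p[0] - x))
    return label

def bagging_predict(train, x, n_models=25, seed=0):
    """Bagging: fit each base learner on a bootstrap resample of the
    training set, then take a majority vote over their predictions."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]  # sample with replacement
        votes[nn_predict(sample, x)] += 1
    return votes.most_common(1)[0][0]

# Invented data: class 'a' near 1, class 'b' near 3.
train = [(1.0, 'a'), (1.2, 'a'), (2.8, 'b'), (3.1, 'b')]
pred = bagging_predict(train, 2.9)
```

Averaging over resamples reduces the variance term of the bias-variance decomposition discussed above, which is why bagging helps unstable base learners most.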

Text Books:
1. Tom M. Mitchell, “Machine Learning”, McGraw-Hill Education (INDIAN EDITION), 2013.
2. Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Pearson Education, 2007.
3. Amanda Casari, Alice Zheng, “Feature Engineering for Machine Learning”, O’Reilly, 2018.

Additional Reference Book:


1. Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, "An Introduction to Statistical
Learning: with Applications in R", Springer, 2016.
2. Trevor Hastie, Robert Tibshirani and Jerome Friedman, "The Elements of Statistical Learning: Data
Mining, Inference, and Prediction", Springer, 2016
3. Andreas Muller, "Introduction to Machine Learning with Python: A Guide for Data
Scientists", Shroff/O'Reilly; First edition (2016)
4. Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Pearson Education, 2007.
Online Materials:

1. https://nptel.ac.in/courses/106106139/
2. Andrew NG's online Course

Programming Assignments: (Sample)


1) Implement the CANDIDATE – ELIMINATION algorithm. Show how it is used to learn from training
examples and hypothesize new instances in Version Space.
2) Implement the FIND–S algorithm. Show how it can be used to classify new instances of target concepts.
Run the experiments to deduce instances and hypothesis consistently.
3) Implement the ID3 algorithm for learning Boolean–valued functions for classifying the training examples
by searching through the space of a Decision Tree.
4) Design and implement the Back-propagation algorithm by applying it to a learning task involving an
application like FACE RECOGNITION.
5) Design and implement Naïve Bayes Algorithm for learning and classifying TEXT DOCUMENTS.
Semester: I Year: 2019-2020

Department: Information Science and Engineering Course Type: Core


Course Title: Exploratory Data Analysis    Course Code: 19DS14
L-T-P: 3-0-2    Credits: 04
Total Contact Hours: 39 hrs    Duration of SEE: 3 hrs
SEE Marks: 50 CIE Marks: 50

Pre-requisites:

 Graduate Mathematics.
 Basic understanding of Probability and Statistics.
 Ability to comprehend and understand relational, and unstructured datasets.

Course Outcomes:
Students will be able to:

Cos Course Outcome Description BL


CO1 Describe the philosophy of exploratory data analysis. L2
CO2 Apply visualization to discrete and continuous probability distributions. L3
CO3 Describe visualizing and estimating the correlation between variables. L2
CO4 Apply linear and nonlinear models visually. L3
CO5 Describe the visualization and analysis of time series and survival calculations. L2
Teaching Methodology:
 Black Board Teaching
 Power Point Presentation.
 Seminar
Assessment Methods:
 Rubrics to evaluate Seminar
 Three internals, 30 Marks each will be conducted and the Average of best of two will be taken.
 Final examination of 100 Marks will be conducted and will be evaluated for 50 Marks.

Course Outcome to Programme Outcome Mapping

PO1 PO2 PO3 PO4 PO5 PO6


CO1 1 1 2
CO2 2 2 3 1
CO3 1 1 2
CO4 2 2 3 1
CO5 2 2 2 1 2
19DS14 2 2 2 2 1
COURSE CONTENT

UNIT – I
Introduction to exploratory data analysis, and distributions 8 hrs

Creating a Data Frame, Getting Information About a Data Structure, Adding a Column to a Data Frame, Deleting
a Column from a Data Frame, Renaming Columns in a Data Frame, Reordering Columns in a Data Frame,
Getting a Subset of a Data Frame, Changing the Order of Factor Levels, Changing the Order of Factor Levels
Based on Data Values, Changing the Names of Factor Levels, Removing Unused Levels from a Factor,
Changing the Names of Items in a Character Vector, Recoding a Categorical Variable to Another Categorical
Variable, Recoding a Continuous Variable to a Categorical Variable, Transforming Variables, Transforming
Variables by Group, Summarizing Data by Groups, Summarizing Data with Standard Errors and Confidence
Intervals, Converting Data from Wide to Long, Converting Data from Long to Wide, Converting a Time Series
Object to Times and Values.
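One of the reshaping operations above, converting data from wide to long form, can be sketched without any particular data-frame library; the site/year records below are invented.

```python
def melt(rows, id_col, value_cols):
    """Wide-to-long reshaping: emit one (id, variable, value) record per
    measurement column of each wide record."""
    long_rows = []
    for row in rows:
        for col in value_cols:
            long_rows.append({id_col: row[id_col],
                              "variable": col, "value": row[col]})
    return long_rows

# Invented wide-form data: yearly counts per sampling site.
wide = [{"site": "A", "y2018": 10, "y2019": 12},
        {"site": "B", "y2018": 7,  "y2019": 9}]
long_form = melt(wide, "site", ["y2018", "y2019"])
```

Long form makes group-wise summaries trivial, which is why most of the summarizing recipes above assume it.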
UNIT – II
Probability mass function, cumulative distributions, and modeling distributions 8 hrs

Making a Basic Histogram, Making Multiple Histograms from Grouped Data, Making a Density Curve, Making
Multiple Density Curves from Grouped Data, Making a Frequency Polygon, Making a Basic Box Plot, Adding
Notches to a Box Plot, Adding Means to a Box Plot, Making a Violin Plot, Making a Dot Plot, Making Multiple
Dot Plots for Grouped Data, Making a Density Plot of Two-Dimensional Data
UNIT – III
Miscellaneous Graphs 8 hrs

Making a Correlation Matrix, Plotting a Function, Shading a Subregion Under a Function Curve, Creating a
Network Graph, Using Text Labels in a Network Graph, Creating a Heat Map, Creating a Three-Dimensional
Scatter Plot, Adding a Prediction Surface to a Three-Dimensional Plot, Saving a Three-Dimensional Plot,
Animating a Three-Dimensional Plot, Creating a Dendrogram, Creating a Vector Field, Creating a QQ Plot,
Creating a Graph of an Empirical Cumulative Distribution Function, Creating a Mosaic Plot, Creating a Pie
Chart, Creating a Map, Creating a Choropleth Map, Making a Map with a Clean Background
UNIT – IV
Relationship between variables and estimation 8 hrs

Scatter Plots, Characterizing Relationships, Correlation, Covariance, Pearson’s Correlation, Nonlinear


Relationships, Spearman’s Rank Correlation, Correlation and Causation, The Estimation Game, Guess the
Variance, Sampling Distributions, Sampling Bias, Exponential Distributions, Classical Hypothesis Testing,
Hypothesis Test, Testing a Difference in Means, Other Test Statistics, Testing a Correlation, Testing
Proportions, Chi-Squared Tests, First Babies Again, Power, Replication.
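Pearson's and Spearman's correlation from this unit can be sketched directly from their definitions (Spearman's is Pearson's applied to ranks); the data below are invented, monotone but nonlinear, so the two coefficients differ.

```python
def pearson(xs, ys):
    """Pearson's r: covariance scaled by both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def ranks(xs):
    """Rank each value (1-based); assumes no ties for this sketch."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank + 1
    return r

def spearman(xs, ys):
    """Spearman's rank correlation = Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

xs = [1, 2, 3, 4]
ys = [1, 4, 9, 16]      # perfectly monotone, but not linear
r_p = pearson(xs, ys)
r_s = spearman(xs, ys)
```

Spearman's coefficient is exactly 1 for any monotone relationship, while Pearson's falls short of 1 whenever the relationship is nonlinear.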
UNIT – V
Time series and survival analysis 7 hrs

Survival Curves, Hazard Function, Estimating Survival Curves, Kaplan-Meier Estimation, The Marriage Curve,
Estimating the Survival Function, Confidence Intervals, Normal Distributions, Sampling Distributions,
Representing Normal Distributions, Central Limit Theorem, Testing the CLT, Applying the CLT, Correlation
Test, Chi-Squared Test
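The Kaplan-Meier product-limit estimator above can be sketched for distinct event times; the times and censoring flags are invented, and ties are deliberately avoided to keep the sketch short.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function S(t).
    events[i] is 1 for an observed failure at times[i], 0 for censoring."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    s = 1.0
    curve = []
    for t, e in data:
        if e:  # each observed failure multiplies survival by (1 - 1/n)
            s *= 1 - 1 / n_at_risk
            curve.append((t, s))
        # censored subjects leave the risk set without a survival step
        n_at_risk -= 1
    return curve

# Invented follow-up data: failures at t = 2, 3, 8; censoring at t = 5, 10.
curve = kaplan_meier(times=[2, 3, 5, 8, 10], events=[1, 1, 0, 1, 0])
```

Censored observations still shrink the risk set, which is why the estimator differs from the naive empirical survival fraction.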

Text books:

1. Think Stats, 2nd Edition: Exploratory Data Analysis, Allen B. Downey, 2014, 226 pages,
ISBN-13: 978-1-49190-733-7.
Reference books:

1. Making sense of Data: A practical Guide to Exploratory Data Analysis and Data Mining, by Glenn J. Myatt.
2. Making Sense of Data II: A Practical Guide to Data Visualization, Advanced Data Mining Methods, and
Applications, Glenn J. Myatt, and Wayne P. Johnson. Print ISBN:9780470222805 |Online
ISBN:9780470417409 |DOI:10.1002/9780470417409.
Semester: I Year: 2019-2020

Department: Information Science and Engineering Course Type: Elective


Course Title: Advanced Algorithms and Optimization    Course Code: 19DSE241
L-T-P: 4-0-0    Credits: 04
Total Contact Hours: 52 hrs    Duration of SEE: 3 hrs
SEE Marks: 50 CIE Marks: 50

Pre-requisites:
 Students should have knowledge of ‘C’ Programming.
 Knowledge of data structures; discrete mathematics, probability, basics of mathematical concepts.
 Students should have completed Analysis and Design of Algorithm course.

Course Outcomes:
Students will be able to
CO’s Course Learning Outcomes BL
CO1 Apply the most appropriate algorithms to solve a real world problem L3
through data science applications.
CO2 Evaluate and measure the performance of an algorithm L4
CO3 Design algorithms for a given problem to find approximate solutions. L5
CO4 Describe optimization techniques using algorithms and perform feasibility L2
study for solving an optimization problem.
CO5 Apply optimization techniques for the given problems L3

Teaching Methodology:
 Blackboard teaching and PPT
 Assignment

Assessment Methods
 Open Book Test for 10 Marks.
 Assignment evaluation for 10 Marks.
 Three internals, 30 Marks each will be conducted and the Average of best of two will be taken.
 Final examination, of 100 Marks will be conducted and will be evaluated for 50 Marks.

Course Outcome to Programme Outcome Mapping

PO1 PO2 PO3 PO4 PO5 PO6


CO1 2 2 3 2 1
CO2 2 2 2
CO3 3 2 3 1
CO4 2 2 1
CO5 3 1 2 3 1 1
19DSE241 2 1 2 3 1 1
COURSE CONTENT

Unit – I 10 Hrs
Basics of Algorithm Analysis; Probabilistic Analysis & Randomized Algorithm: The hiring problem, Indicator
Random Variables, Randomized Algorithms
Dynamic Programming: Principles of Dynamic programming, Segmented Least Squares, Sequence Alignment in
Linear Space.

Unit – II 12 Hrs
Network Flow: Maximum Flow Networks, Pre-flow Push Maximum Flow Algorithm.
Graph Algorithms: Basics - Searching and Traversing, Ideas Behind Map Searches: A* Algorithm.
Spectral Algorithms: The Best Fit Space, Mixture Model.
Streaming algorithms for computing statistics on the data: Models and Basic Techniques, Hash Functions,
Counting Distinct Elements, Frequency Estimation, Other Streaming Problems
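One of the streaming techniques above, frequency estimation with a Count-Min sketch, in minimal form. Width and depth are arbitrary small values, and md5 stands in for the pairwise-independent hash families used in the literature.

```python
import hashlib

class CountMinSketch:
    """Count-Min sketch for frequency estimation over a stream: estimates
    never undercount the true frequency; overcounts come from collisions,
    and taking the minimum over rows bounds them."""
    def __init__(self, width=64, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One independent-ish hash per row, salted by the row number.
        h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += 1

    def estimate(self, item):
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
stream = ["a"] * 5 + ["b"] * 3 + ["c"]
for item in stream:
    cms.add(item)
```

The sketch uses O(width x depth) memory regardless of how many distinct items the stream contains, which is the point of the streaming model described above.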
Unit – III 10 Hrs
NP-and computational Intractability - Polynomial Time Reduction, The satisfiability problem, Polynomial Time
Verification, NP-Completeness & reducibility, NP-Complete Problems
Approximation Algorithms: Greedy Algorithms and Bounds on Optimum, Center Selection Problem, The Pricing
Method, Maximization via the Pricing Method, Linear Programming & Rounding
Unit – IV 12 Hrs
Optimization Methods:
Need for unconstrained methods in solving constrained problems. Necessary conditions of unconstrained optimization,
Structure of methods, quadratic models. Methods of line search, Armijo-Goldstein and Wolfe conditions for partial line
search. Global convergence theorem, Steepest descent method. Quasi-Newton methods: DFP, BFGS, Broyden family.
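The steepest descent method from this unit, sketched with a fixed step length rather than the Armijo-Goldstein or Wolfe line searches discussed above; the quadratic objective is invented.

```python
def steepest_descent(grad, x0, lr=0.1, tol=1e-8, max_iter=10000):
    """Fixed-step steepest descent: move against the gradient until its
    norm falls below `tol` (a first-order necessary condition)."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        if sum(gi * gi for gi in g) ** 0.5 < tol:
            break
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

# Invented objective f(x, y) = (x - 1)^2 + 2*(y + 2)^2, minimum at (1, -2);
# its gradient is available in closed form.
grad_f = lambda v: [2 * (v[0] - 1), 4 * (v[1] + 2)]
xmin = steepest_descent(grad_f, [0.0, 0.0])
```

A proper line search would adapt the step per iteration; quasi-Newton methods such as DFP and BFGS go further by building curvature estimates from gradient differences.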
Unit – V 8 Hrs
Conjugate-direction Methods: Fletcher-Reeves, Polak-Ribierre. Derivative-free methods: finite differencing.
Restricted step methods. Methods for sums of squares and nonlinear equations. Linear and Quadratic Programming.
Duality in optimization.
Optimization algorithms for parameter tuning or design projects: Genetic algorithms, quantum-inspired
evolutionary algorithms, simulated annealing, particle-swarm optimization, Ant Colony Optimization

Text Books:
1. Jon Kleinberg, Éva Tardos, “Algorithm Design”, Pearson Addison Wesley
2. Cormen T.H., Leiserson C.E., Rivest R.L., Stein C., Introduction to Algorithms, 3rd edition, PHI 2010,
ISBN: 9780262033848
3. Fletcher R., Practical Methods of Optimization, John Wiley, 2000.

Reference Material
1. Spectral Algorithms, by Ravindran Kannan, Santosh Vempala, 2009,
https://www.cc.gatech.edu/~vempala/spectralbook.pdf
2. Streaming Algorithms, Great Ideas in Theoretical Computer Science, Saarland University, Summer
2014
3. S. Muthukrishnan. Data Streams: Algorithms and Applications. Foundations and Trends in Theoretical
Computer Science, 1(2), 2005.
4. http://theory.stanford.edu/~amitp/GameProgramming/AStarComparison.html
Semester: I Year: 2019-2020

Department: Information Science and Engineering Course Type: Elective


Course Title: Time Series Analysis and Forecasting    Course Code: 19DS152
L-T-P: 4-0-0    Credits: 04
Total Contact Hours: 52 hrs    Duration of SEE: 3 hrs
SEE Marks: 50 CIE Marks: 50

Pre-requisites:
 Probability and Statistics for data Science.
 Good programming skills

Course Outcomes:
Students will be able to
CO’s Course Learning Outcomes BL
CO1 Describe the fundamental advantage and necessity of forecasting in various L2
situations.
CO2 Identify how to choose an appropriate forecasting method in a particular L2
environment.
CO3 Apply various forecasting methods, which include obtaining the relevant L3
data and carrying out the necessary computation using suitable statistical
software.
CO4 Improve forecasts with better statistical models based on statistical analysis. L4

Teaching Methodology:
 Blackboard teaching and PPT
 Programming Assignment

Assessment Methods
 Open Book Test for 10 Marks.
 Assignment evaluation for 10 Marks on basis of Rubrics
 Three internals, 30 Marks each will be conducted and the Average of best of two will be taken.
 Final examination, of 100 Marks will be conducted and will be evaluated for 50 Marks.

Course Outcome to Programme Outcome Mapping

PO1 PO2 PO3 PO4 PO5 PO6


CO1 1 2 2
CO2 2 2 3 1 1
CO3 3 1 3 3 2 2
CO4 3 3 3
19DS152 2 1 2 3 1 1
COURSE CONTENT

Unit – I 10 Hrs
An Introduction to Forecasting: Forecasting and Data. Forecasting Methods. Errors in Forecasting. Choosing a
Forecasting Technique. An Overview of Quantitative Forecasting Techniques.
Regression Analysis: The Simple Linear Regression Model. The Least Squares Point Estimates. Point
Estimates and Point Predictions. Model Assumptions and the Standard Error. Testing the Significance of the Slope and
y Intercept. Confidence and Prediction Intervals. Simple Coefficients of Determination and Correlation. An F Test for
the Model.
Unit – II 10Hrs
Multiple Linear Regression: The Linear Regression Model. The Least Squares Estimates, and Point Estimation and
Prediction. The Mean Square Error and the Standard Error. Model Utility: R2, Adjusted R2, and the Overall F Test.
Model Building and Residual Analysis: Model Building and the Effects of Multicollinearity. Residual Analysis in
Simple Regression. Residual Analysis in Multiple Regression. Diagnostics for Detecting Outlying and Influential
Observations
Unit – III 12 Hrs
Time Series Regression: Modelling Trend by Using Polynomial Functions. Detecting Autocorrelation. Types of
Seasonal Variation. Modelling Seasonal Variation by Using Dummy Variables and Trigonometric Functions. Growth
Curves. Handling First-Order Autocorrelation.
Decomposition Methods: Multiplicative Decomposition. Additive Decomposition. The X-12-ARIMA Seasonal
Adjustment Method. Exercises.
Exponential Smoothing: Simple Exponential Smoothing. Tracking Signals. Holt’s Trend Corrected Exponential
Smoothing. Holt-Winters Methods. Damped Trends and Other Exponential
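Simple exponential smoothing and Holt's trend-corrected smoothing from this unit can be sketched directly from their update equations; the series and smoothing constants are invented.

```python
def simple_exp_smoothing(ys, alpha):
    """Level-only smoothing: each new level blends the latest observation
    with the previous level; the final level is the one-step forecast."""
    level = ys[0]
    for y in ys[1:]:
        level = alpha * y + (1 - alpha) * level
    return level

def holt(ys, alpha, beta):
    """Holt's trend-corrected smoothing: separate level and trend updates;
    the one-step forecast is level + trend."""
    level, trend = ys[0], ys[1] - ys[0]
    for y in ys[1:]:
        prev = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - prev) + (1 - beta) * trend
    return level + trend

ys = [10, 12, 14, 16, 18]          # an invented clean linear trend
f_ses = simple_exp_smoothing(ys, alpha=0.5)
f_holt = holt(ys, alpha=0.5, beta=0.5)
```

On a trending series, simple smoothing lags behind (forecast 16.125 here) while Holt's method tracks the trend and forecasts 20, which is why the trend correction exists.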
Unit – IV 10 Hrs
Non-seasonal Box-Jenkins Modelling and Their Tentative Identification: Stationary and Nonstationary Time
Series. The Sample Autocorrelation and Partial Autocorrelation Functions: The SAC and SPAC. An Introduction to
Non-seasonal Modelling and Forecasting. Tentative Identification of Non-seasonal Box-Jenkins Models.
Estimation, Diagnostic Checking, and Forecasting for Non-seasonal Box-Jenkins Models: Estimation. Diagnostic
Checking. Forecasting. A Case Study. Box-Jenkins Implementation of Exponential Smoothing.
Unit – V 10 Hrs
Box-Jenkins Seasonal Modelling: Transforming a Seasonal Time Series into a Stationary Time Series. Examples of
Seasonal Modelling and Forecasting. Box-Jenkins Error Term Models in Time Series Regression.
Advanced Box-Jenkins Modelling: The General Seasonal Model and Guidelines for Tentative Identification.
Intervention Models. A Procedure for Building a Transfer Function Model
Causality in time series: Granger causality. Hypothesis testing on rational expectations. Hypothesis testing on market
efficiency.

Text Books:

1. Bruce L. Bowerman, Richard O'Connell, Anne Koehler, “Forecasting, Time Series, and Regression,
4th Edition”, Cengage Unlimited Publishers
2. Enders W. Applied Econometric Time Series. John Wiley & Sons, Inc., 1995

Additional Reference Material

1. Mills, T.C. The Econometric Modelling of Financial Time Series. Cambridge University Press, 1999
2. Andrew C. Harvey. Time Series Models. Harvester wheatsheaf, 1993
3. P. J. Brockwell, R. A. Davis, Introduction to Time Series and Forecasting. Springer, 1996
4. Cryer, Jonathan D.; Chan, Kung-sik, “Time Series Analysis: With Applications in R”, New York:
Springer, 2008
Semester: I Year: 2019-2020

Department: Information Science and Engineering Course Type: Elective


Course Title: Computer Vision    Course Code: 19DS153
L-T-P: 4-0-0    Credits: 04
Total Contact Hours: 52 hrs    Duration of SEE: 3 hrs
SEE Marks: 50 CIE Marks: 50

Pre-requisites:

 Basic knowledge of Data Mining


 Programming knowledge in object-oriented methodology

Course Outcomes:
Students will be able to:

Cos Course Learning Outcomes BL


CO1 Identify image processing techniques to solve real world applications L2
CO2 Apply deep learning methods on images to solve high complexity problems L3
CO3 Develop a technique for image feature extraction L3
CO4 Design techniques for image analysis and classification L3

Teaching Methodology:

 Black Board Teaching / Power Point Presentation


 Programming Assignment
Assessment Methods:

 Three internals, 30 Marks each will be conducted and the Average of best of two will be taken.
 Rubrics for Programming Assignment for 20 marks.
 Final examination of 100 Marks will be conducted and will be evaluated for 50 Marks.

Course Outcome to Programme Outcome Mapping:

COs PO1 PO2 PO3 PO4 PO5 PO6


CO1 2 2 1 1
CO2 2 3 3 1
CO3 3 3 2
CO4 2 2 3 3
19DS153 2 2 3 2 1
COURSE CONTENT

UNIT – I 10hrs
Introduction: Why is computer vision difficult? Image representation and analysis tasks. The image, its
representations and properties: a few concepts, image digitization, digital image properties, color images, cameras.
The image, its mathematical and physical background: linear integral transforms, images as stochastic processes,
image formation physics.

UNIT – II 10 hrs
Data structures for image analysis: levels of image data representation, traditional image data structures, and
hierarchical data structures. Image pre-processing: pixel brightness transformations, geometric transformations,
local pre-processing, image restoration.

UNIT – III 10 hrs


Segmentation: thresholding, edge-based segmentation, region-based segmentation, matching, evaluation issues in
segmentation. Image data compression: image data properties, discrete image transforms in image data compression,
predictive compression methods, vector quantization, hierarchical and progressive compression methods, comparison of
compression methods.
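Threshold-based segmentation from this unit can be sketched with Otsu's method, a standard automatic thresholding rule (not named in the syllabus, named here as a concrete instance); the pixel values form an invented bimodal toy image.

```python
def otsu_threshold(pixels, levels=256):
    """Otsu's method: pick the grey level maximizing between-class variance,
    computed incrementally from the image histogram."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    total_sum = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(levels):
        w0 += hist[t]                   # background pixel count
        if w0 == 0 or w0 == total:
            continue                    # one class empty: split undefined
        sum0 += t * hist[t]
        m0 = sum0 / w0                              # background mean
        m1 = (total_sum - sum0) / (total - w0)      # foreground mean
        var = w0 * (total - w0) * (m0 - m1) ** 2    # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

# Invented bimodal image: a dark cluster near 20, a bright cluster near 200.
pixels = [18, 20, 22, 25, 198, 200, 202, 205]
t = otsu_threshold(pixels)
```

Any threshold between the two clusters separates them perfectly here; Otsu's criterion finds one automatically, without a manually chosen cut-off.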

UNIT – IV 11 hrs

Shape representation and description: region identification; contour-based shape representation and description:
chain codes, simple geometric border representation; region-based representation and description: simple scalar
region descriptors, moments.

UNIT – V 11 hrs

Recognition: knowledge representation, statistical pattern recognition – classification principles, classifier setting,
classifier learning; support vector machines; cluster analysis. Neural nets – feed-forward networks, unsupervised learning,
Hopfield neural networks.

Text books:

1. Digital Image Processing and Computer Vision by Milan Sonka

Reference books:
1. Digital Image Processing and Analysis by Chanda and Dutta Majumder
2. Digital Image Processing by Gonzalez and Woods
Semester: I Year: 2019-2020

Department: Information Science and Engineering Course Type: Core


Course Title: Data Engineering Lab Course Code:19DSL16
L-T-P:0-0-4 Credits: 04
Total Contact Hours:26hrs Duration of SEE: 3 hrs
SEE Marks: 50 CIE Marks: 50

Pre-requisites:
 Basic Python Programming,
 Machine learning
 Fundamentals of Probability and Statistics

Course Outcomes:
Students will be able to
CO’s Course Learning Outcomes BL
CO1 Describe the commands and set up the programming environment of L2
Python, R and MapReduce
CO2 Apply machine learning concepts to analyze real world problems using Data L3
Analysis.
CO3 Apply probability and statistical techniques to solve problems of moderate L3
complexity.
CO4 Analyze large data sets to derive interesting inferences. L4

Teaching Methodology:
 Blackboard teaching and PPT
 Executables
 Programming Assignment

Assessment Methods
 Program Evaluation on the basis of Rubrics.
 Two internal tests of 20 marks each will be conducted; the average of the best two will be taken.
 A final examination of 50 marks will be conducted and evaluated for 50 marks.

Course Outcome to Programme Outcome Mapping

PO1 PO2 PO3 PO4 PO5 PO6


CO1 2 2 1 1 2
CO2 3 2 2 3 1 2
CO3 3 2 2 3 1 2
CO4 3 1 3 2 3 2
19DSL16 3 2 2 3 1 2
COURSE CONTENT
Program No. Domain Assignment
1 Basic Python The number of birds banded at a series of sampling sites has been counted by your
field crew and entered into the following list. The first item in each sublist is an
alphanumeric code for the site and the second value is the number of birds banded. Cut
and paste the list into your assignment and then answer the following questions by
printing them to the screen.

data = [['A1', 28], ['A2', 32], ['A3', 1], ['A4', 0],


['A5', 10], ['A6', 22], ['A7', 30], ['A8', 19],
['B1', 145], ['B2', 27], ['B3', 36], ['B4', 25],
['B5', 9], ['B6', 38], ['B7', 21], ['B8', 12],
['C1', 122], ['C2', 87], ['C3', 36], ['C4', 3],
['D1', 0], ['D2', 5], ['D3', 55], ['D4', 62],
['D5', 98], ['D6', 32]]

1. How many sites are there?


2. How many birds were counted at the 7th site?
3. How many birds were counted at the last site?
4. What is the total number of birds counted across all sites?
5. What is the average number of birds seen on a site?
6. What is the total number of birds counted on sites with codes beginning with
C?
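A possible solution sketch in Python for the six questions above, using the list exactly as given in the problem statement:

```python
data = [['A1', 28], ['A2', 32], ['A3', 1], ['A4', 0],
        ['A5', 10], ['A6', 22], ['A7', 30], ['A8', 19],
        ['B1', 145], ['B2', 27], ['B3', 36], ['B4', 25],
        ['B5', 9], ['B6', 38], ['B7', 21], ['B8', 12],
        ['C1', 122], ['C2', 87], ['C3', 36], ['C4', 3],
        ['D1', 0], ['D2', 5], ['D3', 55], ['D4', 62],
        ['D5', 98], ['D6', 32]]

num_sites = len(data)                       # 1. number of sites
birds_7th = data[6][1]                      # 2. birds at the 7th site
birds_last = data[-1][1]                    # 3. birds at the last site
total = sum(count for _, count in data)     # 4. total across all sites
average = total / num_sites                 # 5. average per site
c_total = sum(count for site, count in data
              if site.startswith('C'))      # 6. sites whose code begins with C

print(num_sites, birds_7th, birds_last, total, round(average, 2), c_total)
```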

2. Basic Python Dr. Granger is interested in studying the relationship between the length of house-
elves’ ears and aspects of their DNA. She has obtained DNA samples and ear
measurements from a small group of house-elves to conduct a preliminary analysis.
You are supposed to conduct the analysis for her. She has placed the file on the web for
you to download.
Write a Python script that:

1. Imports the data into a data structure of your choice


2. Loops over the rows in the dataset
3. For each row in the dataset checks to see if the ear length is large (>10 cm) or
small (<=10 cm) and determines the GC-content of the DNA sequence (i.e.,
the percentage of bases that are either G or C)
4. Stores this information in a table where the first column has the ID for the
individual, the second column contains the string ‘large’ or the string ‘small’
depending on the size of the individuals ears, and the third column contains
the GC content of the DNA sequence.
5. Prints the average GC-content for both large-eared elves and small-eared
elves to the screen.
6. Exports the table of individual level GC values to a CSV (comma delimited
text) file titled grangers_analysis.csv.
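Since the handout does not reproduce the data URL, the analysis can be sketched on a small in-memory sample; the column names and layout below are assumptions about the real file:

```python
import csv
import io

# Invented sample standing in for Dr. Granger's file: id, ear length (cm), DNA sequence.
sample = """id,ear_length,sequence
1,12.5,GCGCAT
2,8.0,ATATGC
3,10.1,GGGCCC
"""

table = []
for row in csv.DictReader(io.StringIO(sample)):
    seq = row['sequence'].upper()
    gc = 100 * sum(base in 'GC' for base in seq) / len(seq)  # GC-content in percent
    size = 'large' if float(row['ear_length']) > 10 else 'small'
    table.append([row['id'], size, gc])

for label in ('large', 'small'):
    vals = [gc for _, s, gc in table if s == label]
    print(label, round(sum(vals) / len(vals), 2))            # average GC per group

with open('grangers_analysis.csv', 'w', newline='') as f:    # step 6: export
    csv.writer(f).writerows(table)
```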
3. Basic Measurements of electric power consumption in one household with a one-minute
Exploratory sampling rate over a period of almost 4 years. Different electrical quantities and some
Data Analysis sub-metering values are available.
Dataset: Individual Household Electric Power Consumption Data Set

https://github.com/mGalarnyk/datasciencecoursera/blob/master/4_Exploratory_Data_A
nalysis/project1/README.md

Perform the following:


1. Load the data
2. Subset the data from the dates 2007-02-01 and 2007-02-02.
3. Create a histogram
4. Create a time series plot
5. Create a plot for sub-metering
6. Create multiple plots
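Steps 1–2 can be sketched with only the standard library; the semicolon separator and d/m/yyyy Date format follow the UCI file description, the sample rows are invented, and the plotting steps (3–6) would pass the resulting lists to matplotlib:

```python
import csv
import io
from datetime import datetime

# Invented rows mimicking the UCI household power-consumption file layout.
sample = """Date;Time;Global_active_power;Sub_metering_1
31/1/2007;23:59:00;1.044;0.0
1/2/2007;00:00:00;2.580;0.0
2/2/2007;12:00:00;0.426;1.0
3/2/2007;00:01:00;3.120;0.0
"""

rows = list(csv.DictReader(io.StringIO(sample), delimiter=';'))
start = datetime(2007, 2, 1).date()
end = datetime(2007, 2, 2).date()
# Step 2: keep only observations dated 2007-02-01 and 2007-02-02.
subset = [r for r in rows
          if start <= datetime.strptime(r['Date'], '%d/%m/%Y').date() <= end]
power = [float(r['Global_active_power']) for r in subset]
print(len(subset), power)
```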
4. Exploratory The data for this assignment are available from the course web site as a single zip file:
Data Analysis
Using R Data Set

The zip file contains two files:

PM2.5 Emissions Data (summarySCC_PM25.rds): This file contains a data frame


with all of the PM2.5 emissions data for 1999, 2002, 2005, and 2008. For each year,
the table contains the number of tons of PM2.5 emitted from a specific type of source
for the entire year. Here are the first few rows.

## fips SCC Pollutant Emissions type year


## 4 09001 10100401 PM25-PRI 15.714 POINT 1999
## 8 09001 10100404 PM25-PRI 234.178 POINT 1999
## 12 09001 10100501 PM25-PRI 0.128 POINT 1999
## 16 09001 10200401 PM25-PRI 2.036 POINT 1999
## 20 09001 10200504 PM25-PRI 0.388 POINT 1999
## 24 09001 10200602 PM25-PRI 1.490 POINT 1999

fips: A five-digit number (represented as a string) indicating the U.S. county


SCC: The name of the source as indicated by a digit string (see source code
classification table)
Pollutant: A string indicating the pollutant
Emissions: Amount of PM2.5 emitted, in tons
type: The type of source (point, non-point, on-road, or non-road)
year: The year of emissions recorded

Source Classification Code Table (Source_Classification_Code.rds): This table


provides a mapping from the SCC digit strings in the Emissions table to the actual
name of the PM2.5 source. The sources are categorized in a few different ways
from more general to more specific and you may choose to explore whatever
categories you think are most useful. For example, source 10100101 is known as Ext
Comb /Electric Gen /Anthracite Coal /Pulverized Coal.

You can read each of the two files using the readRDS() function in R. For example:

NEI <- readRDS("summarySCC_PM25.rds")
SCC <- readRDS("Source_Classification_Code.rds")

You must address the following questions and tasks in your exploratory analysis. For
each question/task you will need to make a single plot. Unless specified, you can use
any plotting system in R to make your plot.

1. Have total emissions from PM2.5 decreased in the United States from
1999 to 2008? Using the base plotting system, make a plot showing the total
PM2.5 emissions from all sources for each of the years 1999, 2002,
2005, and 2008.
2. Have total emissions from PM2.5 decreased in Baltimore City,
Maryland (fips == "24510") from 1999 to 2008? Use the base plotting system
to make a plot answering this question.
3. Of the four types of sources indicated by the type (point, nonpoint, onroad,
nonroad) variable, which of these four sources have seen decreases in
emissions from 1999 to 2008 for Baltimore City? Which have seen increases in
emissions from 1999 to 2008? Use the ggplot2 plotting system to make a plot
answering this question.
4. Across the United States, how have emissions from coal combustion-related
sources changed from 1999 to 2008?
5. How have emissions from motor vehicle sources changed from 1999 to 2008 in
Baltimore City?
6. Compare emissions from motor vehicle sources in Baltimore City with
emissions from motor vehicle sources in Los Angeles County, California
(fips == "06037"). Which city has seen greater changes over time in motor
vehicle emissions?

Making and Submitting Plots


For each plot you should:
1. Construct the plot and save it to a PNG file.
2. Create a separate R code file (plot1.R, plot2.R, etc.) that constructs the
corresponding plot, i.e. code in plot1.R constructs the plot1.png plot. Your
code file should include code for reading the data so that the plot can be fully
reproduced. You should also include the code that creates the PNG file. Only
include the code for a single plot (i.e. plot1.R should only include code for
producing plot1.png)
3. Upload the PNG file on the Assignment submission page
4. Copy and paste the R code from the corresponding R file into the text box at
the appropriate point in the peer assessment.

Hint-
https://github.com/mGalarnyk/datasciencecoursera/blob/master/4_Exploratory_Data_A
nalysis/project2/project2.md
5. Decision Binary Decision Trees: One very interesting application area of machine learning is in
Tree making medical diagnoses.

Objective: To train and test a binary decision tree to detect breast cancer using
real world data using Python /R. Predict whether the cancer is benign or
malignant.

DataSet:
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

The Dataset: We will use the Wisconsin Diagnostic Breast Cancer (WDBC) dataset.
The dataset consists of 569 samples of biopsied tissue. The tissue for each sample is
imaged and 10 characteristics of the nuclei of cells present in each image are
characterized. These characteristics are: 1. Radius 2. Texture 3. Perimeter 4. Area 5.
Smoothness 6. Compactness 7. Concavity 8. Number of concave portions of contour 9. Symmetry 10. Fractal dimension.
Each of the 569 samples used in the dataset consists of a feature vector of length 30.
The first 10 entries in this feature vector are the mean of the characteristics listed above
for each image. The second 10 are the standard deviation and last 10 are the largest
value of each of these characteristics present in each image. Each sample is also
associated with a label. A label of value 1 indicates the sample was for malignant
(cancerous) tissue. A label of value 0 indicates the sample was for benign tissue. This
dataset has already been broken up into training, validation and test sets for you and is
available in the compressed archive for this problem on the class website. The names
of the files are “trainX.csv”, “trainY.csv”, “validationX.csv”, “validationY.csv”,
“testX.csv” and “testY.csv.” The file names ending in “X.csv” contain feature vectors
and those ending in “Y.csv” contain labels. Each file is in comma separated value
format where each row represents a sample.
6. Linear Objective: Implement linear regression with one variable to predict profits for a
Regression food truck.
with One
variable Data Set: https://searchcode.com/codesearch/view/5404318/#

Suppose you are the CEO of a restaurant franchise and are considering different cities
for opening a new outlet. The chain already has trucks in various cities and you have
data for profits and populations from those cities. You would like to use this data to
help you select which city to expand to next. The file ex1data1.txt contains the dataset
for our linear regression problem. The first column is the population of a city and the
second column is the profit of a food truck in that city. A negative value for profit
indicates a loss.
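A batch gradient-descent sketch for this model, h(x) = t0 + t1·x; the synthetic data below stands in for ex1data1.txt (it lies exactly on profit = -1 + 1.5·population), and the learning rate and iteration count are assumptions to be tuned on the real file:

```python
X = [1.0, 2.0, 3.0, 4.0, 5.0]   # population (in 10,000s), synthetic
y = [0.5, 2.0, 3.5, 5.0, 6.5]   # profit, generated as -1 + 1.5 * population

t0, t1 = 0.0, 0.0               # parameters of h(x) = t0 + t1 * x
alpha, m = 0.05, len(X)         # learning rate (an assumption) and sample size
for _ in range(5000):
    err = [t0 + t1 * xi - yi for xi, yi in zip(X, y)]
    g0 = sum(err) / m                                  # gradient w.r.t. t0
    g1 = sum(e * xi for e, xi in zip(err, X)) / m      # gradient w.r.t. t1
    t0, t1 = t0 - alpha * g0, t1 - alpha * g1

print(round(t0, 3), round(t1, 3))   # should approach -1.0 and 1.5
```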
7. Linear Objective: Implement linear regression with multiple variables to predict the
regression prices of houses.
with multiple
variables Data Set: https://searchcode.com/codesearch/view/6577026/

Suppose you are selling your house and you want to know what a good market price
would be. One way to do this is to first collect information on recently sold houses and
make a model of housing prices. The file ex1data2.txt contains a training set of housing
prices in Portland, Oregon. The first column is the size of the house (in square feet),
the second column is the number of bedrooms, and the third column is the price of the
house.
8. Logistic Objective: Build a logistic regression model to predict whether a student gets
Regression admitted into a university.

Dataset: http://en.pudn.com/Download/item/id/2546378.html

Suppose that you are the administrator of a university department and you want to
determine each applicant’s chance of admission based on their results on two exams.
You have historical data from previous applicants that you can use as a training set for
logistic regression. For each training example, you have the applicant’s scores on two
exams and the admissions decision. Your task is to build a classification model that
estimates an applicant’s probability of admission based on the scores from those two
exams.

Implement the following:


1. Visualize the data.
2. Implement Sigmoid function
3. Implement the cost function and gradient for logistic regression
4. Evaluate Logistic Regression
5. Predict the results
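Steps 2–3 can be sketched as follows; the (unregularized) cross-entropy cost and its gradient are standard, while the exam scores and labels below are invented stand-ins for the handout's dataset:

```python
import math

def sigmoid(z):
    # Step 2: the logistic function.
    return 1.0 / (1.0 + math.exp(-z))

def cost_and_gradient(theta, X, y):
    # Step 3: cross-entropy cost J(theta) and its gradient.
    # X: rows of [1, exam1, exam2]; y: 0/1 admission labels.
    m = len(y)
    h = [sigmoid(sum(t * x for t, x in zip(theta, row))) for row in X]
    J = -sum(yi * math.log(hi) + (1 - yi) * math.log(1 - hi)
             for yi, hi in zip(y, h)) / m
    grad = [sum((hi - yi) * row[j] for hi, yi, row in zip(h, y, X)) / m
            for j in range(len(theta))]
    return J, grad

X = [[1, 0.3, 0.8], [1, 0.9, 0.2], [1, 0.5, 0.5]]   # invented, scaled scores
y = [1, 0, 1]
J, grad = cost_and_gradient([0.0, 0.0, 0.0], X, y)
print(J)   # at theta = 0 the cost is log(2) regardless of the data
```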

9. Probability Generate and plot some data from a Poisson distribution with an arrival rate of 1.
Distribution
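A sketch using Knuth's multiplication method to draw Poisson(λ = 1) samples; in practice numpy.random.poisson and matplotlib would handle generation and plotting, so the text histogram below is only a stand-in:

```python
import math
import random

def poisson_sample(lam, rng):
    # Knuth's method: multiply uniforms until the product drops below e^(-lam).
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

rng = random.Random(42)                  # seeded for reproducibility
draws = [poisson_sample(1.0, rng) for _ in range(10000)]
mean = sum(draws) / len(draws)
for k in range(5):                       # crude text "plot" of the sample pmf
    print(k, '#' * (draws.count(k) // 200))
print('sample mean ~', round(mean, 2))   # should be close to lambda = 1
```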
10. Uniform Calculate the area of A = {(x, y) ∈ ℝ²: 0 < x < 1, 0 < x < y²} using the Monte Carlo
Probability Integration Method.
Distribution
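Reading the region as A = {(x, y) ∈ (0,1)²: x < y²} (an assumption about the garbled set notation; the true area is then ∫₀¹ y² dy = 1/3), a Monte Carlo sketch is:

```python
import random

rng = random.Random(0)     # seeded for reproducibility
n = 100000
hits = 0
for _ in range(n):
    x, y = rng.random(), rng.random()   # uniform point in the unit square
    if x < y * y:                       # point falls inside A
        hits += 1
area = hits / n            # fraction of the square inside A estimates the area
print(round(area, 3))      # should be near 1/3
```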
11. Support Objective: To model a classifier for predicting whether a patient is suffering from any
Vector heart disease or not.
Machine
Data Set:https://archive.ics.uci.edu/ml/datasets/heart+Disease

Hint: https://dataaspirant.com/2017/01/19/support-vector-machine-classifier-
implementation-r-caret-package/
12. Bayes Data Set: specdata.zip
Theorem The zip file containing the data can be downloaded here: specdata.zip. The zip file
contains 332 comma-separated-value (CSV) files containing pollution monitoring data
for fine particulate matter (PM) air pollution at 332 locations in the United States. Each
file contains data from a single monitor and the ID number for each monitor is
contained in the file name. For example, data for monitor 200 is contained in the file
“200.csv”. Each file contains three variables. Date: the date of the observation in (year-
month-day) format, sulfate: the level of sulfate PM in the air on that date (measured in
micrograms per cubic meter), and nitrate: the level of nitrate PM in the air on that date
(measured in micrograms per cubic meter)
1. Write a function named ‘pollutantmean’ that calculates the mean of a
pollutant (sulfate or nitrate) across a specified list of monitors. The function
‘pollutantmean’ takes three arguments: ‘directory’, ‘pollutant’, and ‘id’.
Given a vector of monitor ID numbers, ‘pollutantmean’ reads those monitors’
particulate matter data from the directory specified in the ‘directory’ argument
and returns the mean of the pollutant across all of the monitors, ignoring any
missing values coded as NA.
2. Write a function that reads a directory full of files and reports the number of
completely observed cases in each data file. The function should return a data
frame where the first column is the name of the file and the second column is
the number of complete cases.
3. Write a function that takes a directory of data files and a threshold for
complete cases and calculates the correlation between sulfate and nitrate for
monitor locations where the number of completely observed cases (on all
variables) is greater than the threshold. The function should return a vector of
correlations for the monitors that meet the threshold requirement. If no
monitors meet the threshold requirement, then the function should return a
numeric vector of length 0.
Hint:
https://rpubs.com/ahmedtadde/DS-Rprogramming1
https://github.com/mGalarnyk/datasciencecoursera/blob/master/2_R_Programming/pro
jects/project1.md
13. MapReduce Write map and reduce methods to count the number of occurrences of each word in a
file. For the purposes of this assignment a word will be defined as any string of
alphabetic characters appearing between non-alphabetic characters; e.g., "nature's" is
two words. The count should be case-insensitive. If a word occurs multiple times in a line,
all should be counted. A StringTokenizer is a convenient way to parse the words from
the input line. There is documentation of StringTokenizer online, and there is an
example of its use in the reader functions.
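A plain-Python simulation of the map, shuffle/sort, and reduce phases (Hadoop's Java StringTokenizer is replaced here by a regular expression, and the input lines are invented):

```python
import re
from itertools import groupby

def mapper(line):
    # Emit (word, 1) for every maximal run of alphabetic characters,
    # lowercased so counting is case-insensitive ("nature's" -> nature, s).
    for word in re.findall(r'[A-Za-z]+', line):
        yield word.lower(), 1

def reducer(word, counts):
    return word, sum(counts)

lines = ["Nature's way", "the way of nature"]
pairs = sorted(kv for line in lines for kv in mapper(line))   # shuffle/sort
result = dict(reducer(word, [c for _, c in group])
              for word, group in groupby(pairs, key=lambda kv: kv[0]))
print(result)
```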
14. MapReduce Objective: Write map and reduce methods to determine the average ratings of
movies.

Data Set:
http://academictorrents.com/details/9b13183dc4d60676b773c9e2cd6de5e5542cee9
a

The input consists of a series of lines, each containing a movie number, user number,
rating, and date: 3980,294028,5,2005-11-15

map should emit movie number and list of rating, and reduce should return for each
movie number a list of average rating as Double, and number of ratings as Integer. This
data is similar to the Netflix Prize data.

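A plain-Python sketch of the same map/reduce pipeline; apart from the sample line given above, the input lines are invented:

```python
from collections import defaultdict

def mapper(line):
    # Parse "movie,user,rating,date" and emit (movie, rating).
    movie, _user, rating, _date = line.split(',')
    return movie, float(rating)

def reducer(movie, ratings):
    # Return (average rating as a float, number of ratings as an int).
    return movie, (sum(ratings) / len(ratings), len(ratings))

lines = ["3980,294028,5,2005-11-15",      # from the problem statement
         "3980,100001,3,2005-11-16",      # invented
         "17,294028,4,2005-11-15"]        # invented

grouped = defaultdict(list)
for movie, rating in map(mapper, lines):  # map + shuffle
    grouped[movie].append(rating)
averages = dict(reducer(m, rs) for m, rs in grouped.items())
print(averages)   # {'3980': (4.0, 2), '17': (4.0, 1)}
```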
15. K-means Given the matrix X whose rows represent different data points, you are asked to
Clustering perform a k-means clustering on this dataset using the Euclidean distance as the
distance function. Here k is chosen as 3. The Euclidean distance d between vectors x
and y, both in R^p, is defined as d = sqrt( Σ_{i=1}^{p} (x_i − y_i)^2 ). All data in X were
plotted in Figure 1. The centres of 3 clusters were initialized as µ1 = (6.2, 3.2) (red), µ2
= (6.6, 3.7) (green), µ3 = (6.5, 3.0) (blue).

1. What’s the centre of the first cluster (red) after one iteration? (Answer in the format
of [x1, x2], round your results to three decimal places, same as problems 2 and 3)
2. What’s the centre of the second cluster (green) after two iterations?
3. What’s the centre of the third cluster (blue) when the clustering converges?
4. How many iterations are required for the clusters to converge?
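Since the matrix X and Figure 1 are not reproduced in the handout, one assignment-and-update iteration can only be sketched on hypothetical points; the initial centres are the ones given above:

```python
def assign(points, centres):
    # Assign each point to the index of its nearest centre (squared Euclidean).
    def d2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return [min(range(len(centres)), key=lambda j: d2(p, centres[j]))
            for p in points]

def update(points, labels, k):
    # Recompute each centre as the mean of its assigned members.
    centres = []
    for j in range(k):
        members = [p for p, l in zip(points, labels) if l == j]
        centres.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return centres

X = [(6.0, 3.1), (6.3, 3.3), (6.7, 3.8),     # hypothetical data points
     (6.5, 3.6), (6.6, 2.9), (6.4, 3.0)]
centres = [(6.2, 3.2), (6.6, 3.7), (6.5, 3.0)]   # the given initial centres
labels = assign(X, centres)
centres = update(X, labels, 3)
print(labels, [tuple(round(c, 3) for c in ctr) for ctr in centres])
```

Iterating assign/update until the labels stop changing answers questions 3 and 4.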

16. Hierarchical In the figure, there are two clusters, A (red) and B (blue), each with four members.
Clustering The coordinates of each member are labeled in the figure. Compute
the distance between two clusters using Euclidean distance.
1. What is the distance between the two farthest members? (complete link) (round to
four decimal places here and in the next two problems);
2. What is the distance between the two closest members? (single link)
3. What is the average distance between all pairs?
4. Among all three distances above, which one is robust to noise? Answer either
“complete”, “single”, or “average”.
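A sketch computing the three linkage distances; the figure's coordinates are not reproduced in the handout, so the two 4-member clusters below are hypothetical:

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

A = [(1.0, 1.0), (1.5, 1.2), (1.2, 1.8), (0.8, 1.5)]   # red, hypothetical
B = [(4.0, 4.0), (4.5, 4.2), (4.2, 3.6), (3.8, 4.4)]   # blue, hypothetical

pairs = [dist(a, b) for a in A for b in B]   # all 16 between-cluster pairs
complete = max(pairs)                # complete link: two farthest members
single = min(pairs)                  # single link: two closest members
average = sum(pairs) / len(pairs)    # average link: mean over all pairs
print(round(complete, 4), round(single, 4), round(average, 4))
# Average link is the most robust to noise, since no single outlying pair
# determines it.
```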
17. Multivariate Multiple Linear Regression:
Analysis
18. 1. Using the matrices X and Y found in FILE2 do the following:

1* Compute the vector of raw regression weights


2 Compute the standard error of the regression weights
3 Compute t-tests for each regression weight
4* Compute the vector of predicted scores
5* Compute the vector of residual scores
6* Compute the squared multiple correlation
7* Compute the F-ratio for the model
8 Compute the standard error of estimate
9 Compute the vector of standardized regression weights
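Item 1 above (raw regression weights b = (X'X)⁻¹ X'y via the normal equations) can be sketched as follows; since FILE2's Y is not reproduced in the handout, a small hypothetical design matrix with an intercept column is used:

```python
def matmul(A, B):
    # Textbook matrix product, written out to keep the sketch dependency-free.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def solve(A, b):
    # Gaussian elimination with partial pivoting for A x = b.
    n = len(A)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))   # pivot row
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            M[r] = [mr - f * mi for mr, mi in zip(M[r], M[i])]
    x = [0.0] * n
    for i in reversed(range(n)):                           # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

X = [[1, 3], [1, 7], [1, 6], [1, 4], [1, 8]]   # intercept column + one predictor
y = [9, 17, 15, 11, 19]                        # lies exactly on y = 3 + 2x
XtX = matmul(list(zip(*X)), X)                 # X'X
Xty = [sum(xi * yi for xi, yi in zip(col, y)) for col in zip(*X)]   # X'y
b = solve(XtX, Xty)                            # raw regression weights
print([round(v, 3) for v in b])                # recovers [3.0, 2.0] here
```

Predicted scores, residuals, and R² (items 4–6) follow directly from X, y, and b.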

2. Using the matrix X found in FILE1 do the following:

1. Compute the deviation SSCP matrix, S


2. Compute the covariance matrix, C
3. Compute the correlation matrix, R
4. Compute the determinants of S, C and R
5. Compute the eigenvalues of S, C and R
6. Compute the eigenvectors of S, C and R

Canonical Correlation Analysis

Using the matrix XY (the first 3 columns of XY are the Y variables while the last 5
columns are the X variables) found in FILE5 do the following:

1. Compute the squared canonical correlation.


2. Compute the canonical correlation.
3. Compute the eigenvalues of A and B.
4. Compute the eigenvectors of A and B.
5. Compute the F statistic approximations.
6. Compute the degrees of freedom.
7. Determine which canonical dimensions are significant.

Note: Be sure to label your output and include comments.

File1:
Mata
X = (3,9,17,24\7,8,11,25\6,5,13,29\4,7,15,32\7,9,13,24\8,8,1,23)
Stata
mat X = (3,9,17,24\7,8,11,25\6,5,13,29\4,7,15,32\7,9,13,24\8,8,1,23)
SAS IML
X={
3 9 17 24,
7 8 11 25,
6 5 13 29,
4 7 15 32,
7 9 13 24,
8 8 1 23};

File 2:
Mata
X = (3,9,17,24\7,8,11,25\6,5,13,29\4,7,15,32\7,9,13,24\8,8,1,23)
Stata
mat X = (3,9,17,24\7,8,11,25\6,5,13,29\4,7,15,32\7,9,13,24\8,8,1,23)
SAS IML
X={
3 9 17 24,
7 8 11 25,
6 5 13 29,
4 7 15 32,
7 9 13 24,
8 8 1 23};
