Empirical Software Engineering (SE-404)
LAB A1-G1
Laboratory Manual
3. (27-01-2022) Defect detection activities like reviews and testing help in identifying the defects in the artifacts (deliverables). These defects must be classified into various buckets before carrying out the root cause analysis. Following are some of the defect categories: Logical, User interface, Maintainability, and Standards. In the context of the above defect categories, classify the following statements under the defect categories.
4. (03-02-2022) Why is version control important? How many types of version control systems are there? Demonstrate how version control is used in a proper sequence (stepwise).
7. (25-02-2022) Validate the results obtained in experiment 6 using 10-fold cross-validation.
8. (08-03-2022) Online loan system has two modules for the two basic services, namely Car loan service and House loan service. The two modules have been named as Car_Loan_Module and House_Loan_Module. Car_Loan_Module has 2000 lines of uncommented source code. House_Loan_Module has 3000 lines of uncommented source code. Car_Loan_Module was completely implemented by Mike. House_Loan_Module was completely implemented by John. Mike took 100 person hours to implement the Car_Loan_Module. John took 200 person hours to implement the House_Loan_Module. Mike's module had 5 defects. John's module had 6 defects. With respect to the context given, which among the following is an INCORRECT statement? Identify the null and alternate hypothesis for the following options.
Justify and choose one:
a. John's Quality is better than Mike's Quality
b. John's Productivity is more than Mike's Productivity
c. John introduced more defects than Mike
d. John's Effort is more than Mike's Effort
Introduction:-
1. WEKA: WEKA (Waikato Environment for Knowledge Analysis) is a machine learning software developed at the University of Waikato in New Zealand. WEKA is a free and open-source software package that assembles a wide range of data mining and model-building algorithms.
It is best suited for data analysis and predictive modeling. It contains algorithms and visualization tools that support machine learning. WEKA has a GUI that facilitates easy access to all its features. It is written in the Java programming language and runs on almost any platform. WEKA supports the major data mining tasks, including preprocessing, classification, regression, clustering, association rule mining, and visualization. It works on the assumption that data is available in the form of a flat file. WEKA can provide access to SQL databases through database connectivity and can further process the data/results returned by a query.
2. KEEL: KEEL is a free Java software tool which empowers the user to assess the behavior of evolutionary learning and soft-computing-based techniques for different kinds of data mining problems: regression, classification, clustering, pattern mining, and so on. KEEL is a data mining tool used by many EDM researchers. KEEL has extremely extensive support for discretization algorithms, but limited support for other methods of engineering new features out of existing features. It has excellent support for feature selection, with a wider range of algorithms than any other package. It also has extensive support for imputation of missing data, and considerable support for data re-sampling. KEEL is open source and free for use under a GNU license.
For modeling, KEEL has an extensive set of classification and regression algorithms, with a strong focus on evolutionary algorithms. Its support for other types of data mining algorithms, such as clustering and factor analysis, is more limited than in other packages. Support for association rule mining is decent, though not as extensive as in some other packages.
• It contains a large collection of evolutionary algorithms for predicting models, preprocessing methods
(evolutionary feature and instance selection among others) and postprocessing procedures (evolutionary tuning
of fuzzy rules). It also presents many state-of-the-art methods for different areas of data mining such as decision
trees, fuzzy rule based systems or crisp rule learning.
• It includes around 100 data preprocessing algorithms proposed in the specialized literature: data
transformation, discretization, instance and feature selection, noise filtering and so forth.
• It incorporates a statistical library to analyze the results of the algorithms.
• It comprises a set of statistical tests for analyzing the suitability of the results and for performing parametric and nonparametric comparisons among the algorithms.
• It provides a user-friendly interface, oriented to the analysis of algorithms.
3. SPSS: SPSS stands for “Statistical Package for the Social Sciences”. It is an IBM tool, first launched in 1968. It is a software package mainly used for statistical analysis of data.
SPSS is mainly used in areas like healthcare, marketing, and educational research, and by market researchers, health researchers, survey companies, education researchers, government, marketing organizations, data miners, and many others. It provides data analysis for descriptive statistics, numeral outcome predictions, and identifying groups. The software also provides data transformation, graphing, and direct marketing features to manage data smoothly.
Advantages of SPSS are:
• The data from any survey collected via Survey Gizmo gets easily exported to SPSS for detailed and good analysis.
• In SPSS, data is stored in .SAV format. This data mostly comes from surveys. This makes the process of manipulating, analyzing, and pulling data very simple.
• SPSS has easy access to data with different variable types. These variable data are easy to understand. SPSS helps researchers to set up models easily because most of the process is automated.
• SPSS allows opening data files, either in SPSS' own file format or many others.
• SPSS allows editing data, such as computing sums and means over columns or rows of data. SPSS has outstanding options for more complex operations as well.
• SPSS has options for creating tables and charts containing frequency counts or summary statistics over (groups of) cases and variables.
• SPSS has a unique way to get insight even from critical data. Trend analysis, assumptions, and predictive models are some of the characteristics of SPSS.
• SPSS is easy to learn, use, and apply.
• It provides a handy data management system and editing tools.
• SPSS offers in-depth statistical capabilities for analyzing the exact outcome.
• SPSS provides design, plotting, reporting, and presentation features for more clarity.
Limitations are:
4. MATLAB: MATLAB stands for Matrix Laboratory. It was developed by MathWorks, and it is a multipurpose (multi-paradigm) programming language. It allows matrix manipulations and helps us to plot different types of functions and data. It can also be used for the analysis and design of systems such as control systems.
Advantages of MATLAB:
● Easy-to-use interface: A user-friendly interface with the features you want to use one click away.
● A large inbuilt database of algorithms: MATLAB has numerous important algorithms already built in, and you just have to call them in your code.
● Extensive data visualization and processing: We can process a large amount of data in MATLAB and visualize it using plots and figures.
● Debugging of code is easy: There are many inbuilt tools, like the analyzer and debugger, for analysis and debugging of code written in MATLAB.
● Easy symbolic manipulation: We can perform symbolic math operations in MATLAB using the symbolic manipulation algorithms and tools in MATLAB.
Disadvantages of MATLAB:
● MATLAB can be slow since it is an interpreted language: MATLAB programs are not compiled into machine language but are executed by an interpreter, so they can sometimes be slow.
● We cannot create standalone output (application) files directly in MATLAB.
● One cannot use graphics in MATLAB with the -nojvm option; doing so results in a runtime error.
5. R: R is a popular and powerful open-source programming language and software environment for statistical computing and graphical representation. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team. R implements various statistical techniques like linear and non-linear modeling, machine learning algorithms, time series analysis, classical statistical tests, and so on. R consists of a language and a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files.
Advantages of R:
1) Open Source: R is an open-source language. We can contribute to the development of R by optimizing our
packages, developing new ones, and resolving issues.
2) Platform Independent: R is a platform-independent language or cross-platform programming language which
means its code can run on all operating systems. R can run quite easily on Windows, Linux, and Mac.
3) Machine Learning Operations: R allows us to do various machine learning operations such as classification
and regression. R is used by the best data scientists in the world.
4) Highly Compatible: R is highly compatible and can be paired with many other programming languages like
C, C++, Java, and Python. It can also be integrated with technologies like Hadoop and various other database
management systems as well.
Disadvantages of R:
1) Data Handling: In R, objects are stored in physical memory, in contrast with other programming languages like Python. R utilizes more memory as compared to Python and requires the entire dataset to be held in memory at once, so it is not an ideal option when dealing with Big Data.
2) Basic Security: R lacks basic security. Because of this, there are many restrictions with R as it cannot be
embedded in a web-application.
3) Complicated Language: R is a very complicated language, and it has a steep learning curve. The people who
don't have prior knowledge or programming experience may find it difficult to learn R.
4) Lower Speed: The R programming language is much slower than other programming languages such as MATLAB and Python, and many R packages are slower than their counterparts in other languages.
Learning from experiment:- We have successfully discussed and learnt about various data analysis tools. We have compared five different analysis tools (WEKA, KEEL, SPSS, MATLAB, and R) and noted their advantages and disadvantages as well.
EXPERIMENT-2
OBJECTIVE: Consider any empirical study of your choice (Experiments, Survey Research, Systematic
Review, Postmortem analysis and case study). Identify the following components for an empirical study.
INTRODUCTION:
Parametric Test
Parametric tests are used for data samples having a normal distribution (bell-shaped curve).
Non-Parametric Test
Nonparametric tests are used when the distribution of the data samples is highly skewed. If the assumptions of parametric tests are met, parametric tests are more powerful because they use more of the information in the data during computation.
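As an illustration, a minimal sketch in Python (using the SciPy library and small hypothetical samples, not data from any experiment in this manual) of running a parametric test and its nonparametric counterpart on the same two groups:

# Minimal sketch: parametric vs. nonparametric two-sample tests on hypothetical data.
from scipy import stats

group_a = [3.1, 3.4, 2.9, 3.8, 3.5, 3.2]  # hypothetical sample 1
group_b = [2.7, 3.0, 2.8, 3.1, 2.6, 2.9]  # hypothetical sample 2

# Parametric: independent-samples t-test, assumes approximately normal samples.
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# Nonparametric: Mann-Whitney U test, works on ranks, no normality assumption.
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print("t-test:       t = %.3f, p = %.3f" % (t_stat, t_p))
print("Mann-Whitney: U = %.3f, p = %.3f" % (u_stat, u_p))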
Dependent Variables
The dependent variable (or response variable) is the output produced by analyzing the effect of the independent
variables. The dependent variables are presumed to be influenced by the independent variables. For example- effort,
cost, faults, and productivity.
Independent Variables
Independent variables (or predictor variables) are input variables that are manipulated or controlled by the
researcher to measure the response of the dependent variable. For example- lines of source code, number of
methods, and number of attributes.
Confounding Variables
Apart from the independent variables, unknown variables or confounding variables (extraneous variables) may
affect the outcome (dependent) variable.
Within-Company Analysis
In within-company analysis, the empirical study collects the data from the old versions/releases of the same software, builds prediction models, and applies the predicted models to the future versions of the same project.
Cross-Company Analysis
In practice, the old data may not be available. In such cases, the data obtained from similar earlier projects developed by different companies is used for prediction in new projects. The process of validating the predicted model using data collected from projects different from those from which the model was derived is known as cross-company analysis.
Proprietary Dataset
A proprietary dataset is a licensed dataset owned by a company, for example, the datasets behind proprietary products such as Microsoft Office, Adobe Acrobat, and IBM SPSS. In practice, obtaining data from proprietary datasets for research validation is difficult, as software companies are usually not willing to share information about their software systems.
Open-Source Dataset
An open-source dataset is usually a freely available dataset, developed by many developers from different places in a collaborative manner, for example, Kaggle, Google Cloud, and NASA datasets.
For this experiment we take a survey paper titled "Effect of Music on the Human Body and Mind"
(https://digitalcommons.liberty.edu/cgi/viewcontent.cgi?article=1162&context=honors) by Dawn Kent.
The research project consisted of a 100-count survey given to students at Liberty University. The 10-question
surveys included some demographic data, and questions about average time spent studying in high school and in
college, average high school and college GPAs, and whether or not students listened to music while studying, and if
so, what types of music. The data was then compiled and analyzed. All surveys were completed voluntarily and all
participants had the understanding that their anonymous information would be used for research. Two extra surveys
were taken to allow for the exclusion of two surveys that were missing answers or failed to follow directions.
The survey was distributed on the campus of Liberty University. All surveys were taken voluntarily and without
recompense. Surveys were distributed within several classes, including the concert band, a large chemistry class
containing mostly biology and health sciences majors, and a statistics in psychology class. This contributed to a
higher concentration of music, biology, health, and psychology majors. Other surveys were distributed in common
areas such as the computer lab and dining hall and dormitories, where a variety of students could be found. Students
of every academic standing (i.e. freshman, sophomore, etc.) and a wide variety of ages were represented, although
most were concentrated in the average age range of college students at Liberty University. Gender ratios were
similar. There were 43 males and 57 females completing the survey. Ninety percent of the study participants were
between the ages of 17 and 22, while 7% were between 23 and 25 years of age, and 3% were 26 years or older.
Ratios between classes were fairly similar. Participants included 26 freshmen, 21 sophomores, 23 juniors and 30
seniors. Fifteen different majors were represented, although the survey contained greater concentrations in the
aforementioned areas. Also affecting this factor is the size of each major department at Liberty University. For
instance, after the top three majors, communications was next, which is the largest major at Liberty University. The
ratios of majors were as follows: music: 15; biology: 12; psychology: 11; communications: 10; business: 8;
education: 8; health: 7; computer science: 6; religion: 6; nursing: 5; government: 4; undecided: 3; math: 2; family
and consumer science: 2; and history: 1.
RESULT:-
Parametric Test
T-test
The initial independent-samples t-test run on the hypothesis that listening to music while studying would positively affect GPA showed no significance (t(98) = -1.182; p = .240). However, the mean GPA of those who did not listen to
music while studying was slightly lower than those who did. Closer examination of the mean GPAs of each type of
music revealed that listeners of each individual type of music had lower, but not significantly lower GPAs than
those who did not, except in the case of two types of music. Those that listened to easy listening music had a much
higher, though not significantly higher, GPA mean than any other group. Those who listened to rock music had a
slightly higher GPA than those who did not listen to music. Also, two types of music showed notably lower GPA
means than the no-music group. Hip-hop/R&B had the lowest GPA mean, while rap followed closely behind. Jazz
and country also had moderately lower GPA means. Alternative, classical, and gospel music stayed very close to the
no-music control group.
ANOVA
A one-way ANOVA test was run on this set of information. No significant effect on GPA was found between these
three groups (F(2, 97) = 2.202; p = .116). Next, the data was divided into a different three-group comparison. This
time, the focus turned to the negative end of the spectrum. The first group included all students who listened to any
type of music while studying excluding country, rap, hip-hop/R&B, and jazz. These four types of music showed
considerably lower GPA means than the no music group. The second group included all students who listened to
any type of music while studying that included at least one of country, rap, hip-hop/R&B, and jazz. The third group
remained those students that listen to no music while studying. An ANOVA test was run. The comparison between
the three groups was not significant (F(2, 97) = 3.003; p = .054).
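The one-way ANOVA reported above could be reproduced in Python with SciPy; the sketch below uses small hypothetical GPA values as placeholders, since the original survey data (analyzed in SPSS) is not available here.

# Sketch: one-way ANOVA across three hypothetical groups (placeholder GPA values).
from scipy import stats

group1 = [3.2, 3.4, 3.1, 3.5]  # e.g. music listeners excluding the low-GPA genres
group2 = [2.8, 2.9, 3.0, 2.7]  # e.g. listeners of country/rap/hip-hop/jazz
group3 = [3.1, 3.3, 3.2, 3.0]  # e.g. no music while studying

f_stat, p_value = stats.f_oneway(group1, group2, group3)
print("F = %.3f, p = %.3f" % (f_stat, p_value))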
No nonparametric test was used in the survey, but the Wilcoxon-Mann-Whitney test, which is a nonparametric alternative to the t-test, and the Kruskal-Wallis test could be used to find the significance (p-value) of differences between the various music-listening groups.
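A minimal sketch of the Kruskal-Wallis test suggested above (SciPy, hypothetical placeholder data; the Mann-Whitney variant is sketched earlier in the parametric/nonparametric introduction):

# Sketch: Kruskal-Wallis test, the nonparametric alternative to one-way ANOVA.
from scipy import stats

music          = [3.1, 2.9, 3.4, 3.0, 3.3]  # hypothetical GPAs, music group
no_music       = [3.2, 3.0, 3.5, 3.1, 3.4]  # hypothetical GPAs, no-music group
easy_listening = [3.6, 3.5, 3.7, 3.4]       # hypothetical GPAs, easy-listening group

h_stat, p_value = stats.kruskal(music, no_music, easy_listening)
print("H = %.3f, p = %.3f" % (h_stat, p_value))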
Independent variables
Independent variables used in the survey are demographic data such as nationality, major being pursued, gender, and age, along with questions about average time spent studying in high school and in college, average high school GPA, and whether or not students listened to music while studying, and if so, what types of music.
Dependent Variables
The dependent variable is the outcome of the survey, i.e., the GPA the students scored in college.
Confounding Variables
Confounding variables are variables that affect the outcome indirectly. In this survey, these could be the subject being studied while listening to music, the volume of the music, and disturbances in the background.
It is a within-company analysis, as the survey is conducted on data reported by the participants about their past experiences, and this data is then used to analyze the GPA they have scored.
The dataset used in the survey is an open-source dataset, as it is not licensed; the results were calculated using the SPSS Student Version 12.0 for Windows software program.
Learning from experiment:- We have successfully discussed and learnt about parametric and nonparametric tests; independent, dependent, and confounding variables; within-company and cross-company analysis; and proprietary and open-source datasets.
EXPERIMENT-3
Objective:- Defect detection activities like reviews and testing help in identifying the defects in the
artifacts (deliverables). These defects must be classified into various buckets before carrying out the root
cause analysis. Following are some of the defect categories: Logical, User interface, Maintainability, and
Standards. In the context of the above defect categories, classify the following statements under the defect
categories.
Result:-
d. A pointer is declared but not initialized. It is used in the program for storing a value.
Logical Defect and Standards:
A pointer is a variable that holds the address of another variable. So the solution would be to initialize the pointer
properly and assign a value to it.
e. A program designed to handle 1000 simultaneous users crashed when the 1001st user logged in.
Logical Defect, User Interface:
Solution would be to do proper load and stress testing before hosting to the internet for user usage.
f. A “while” loop never exits
Logical Defect:
Solution would be to write an appropriate condition to end the loop; otherwise, the loop would run infinitely.
g. User interface displays “MALFUNCTION 54” when something goes wrong in the back-end
User Interface:
Solution would be to do proper error handling and make an appropriate display message so that the user would
get to know what the exact problem is. One of the main aspects of GUI testing is checking whether correct error
messages are being displayed. UI Testing is done to uncover such User Interface Bugs.
i. Hungarian Notation not followed while coding, even though the coding guidelines mandate to use
Hungarian Notation
Standards:
Solution: It is always good to follow standards i.e. Hungarian Notation while coding. All coding guidelines
should be followed for successful software development.
j. Pressing the “Tab” key moves the cursor in different fields of a web form randomly.
User Interface:
Solution would be to design forms properly and to perform form-based testing before hosting to the internet.
Learning from experiment:- We have successfully learned about root cause analysis (RCA), and in the process learned about different types of testing and identified different defects in different scenarios. We also discussed solutions to prevent these defects.
EXPERIMENT-4
Objective:- Why is version control important? How many types of version control systems are there?
Demonstrate how version control is used in a proper sequence (stepwise).
Introduction:- In software engineering, version control (also known as revision control, source control, or
source code management) is a class of systems responsible for managing changes to computer programs,
documents, large web sites, or other collections of information.
Version control allows you to keep track of your work and helps you to easily explore the changes you have
made, be it data, coding scripts, notes, etc. With version control software such
as Git, version control is much smoother and easier to implement. Using an online platform like Github to store
your files means that you have an online backup of your work, which is beneficial for both you and your
collaborators.
1. Local Version Control Systems: This is one of the simplest forms and uses a database that keeps all the changes to files under revision control. RCS is one of the most common local VCS tools. It keeps patch sets (differences between files) in a special format on disk. By adding up all the patches, it can then re-create what any file looked like at any point in time.
2. Centralized version control System: With centralized version control systems, you have a single
“central” copy of your project on a server and commit your changes to this central copy. You pull the files that
you need, but you never have a full copy of your project locally. Some of the most common version control
systems are centralized, including Subversion (SVN) and Perforce.
3. Distributed version control systems: With distributed version control systems (DVCS), you don't rely on
a central server to store all the versions of a project’s files. Instead, you clone a copy of a repository locally so that
you have the full history of the project. Two common distributed version control systems are Git and Mercurial.
Learning from experiment:- We have successfully learned about version control systems (VCS) and their benefits. We have also learned about the types of VCS and the stepwise sequence of using a VCS.
EXPERIMENT-5
Objective:- Demonstrate how Git can be used to perform version control.
Introduction:- Version control allows you to keep track of your work and helps you to easily explore the
changes you have made, be it data, coding scripts, notes, etc. With version control software such as Git, version
control is much smoother and easier to implement. Using an online platform like Github to store your files means
that you have an online backup of your work, which is beneficial for both you and your collaborators.
GITHUB WORKFLOW:
The GitHub workflow can be summarized by the "commit-pull-push" mantra; a short sketch of this sequence follows the list below.
● Commit: Once you’ve saved your files, you need to commit them - this means the changes you have made to
files in your repo will be saved as a version of the repo, and your changes are now ready to go up on GitHub (the
online copy of the repository).
● Pull: Now, before you send your changes to Github, you need to pull, i.e. make sure you are completely up to
date with the latest version of the online version of the files - other people could have been working on them even
if you haven’t. You should always pull before you start editing and before you push.
● Push: Once you are up to date, you can push your changes – at this point in time your local copy and the online
copy of the files will be the same.
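A minimal sketch of this commit-pull-push sequence, driven from Python with the standard subprocess module (it assumes git is installed, the script runs inside an existing local clone with a remote named origin and a main branch, and the file name and commit message are hypothetical):

# Sketch: the commit-pull-push workflow run through git from Python.
import subprocess

def git(*args):
    # Run a git command and raise an error if it fails.
    subprocess.run(["git", *args], check=True)

git("add", "analysis_notes.txt")              # stage a (hypothetical) changed file
git("commit", "-m", "Update analysis notes")  # commit: save a version locally
git("pull", "origin", "main")                 # pull: fetch and merge the latest online version first
git("push", "origin", "main")                 # push: upload the local commits to GitHub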
Learning from experiment:- In this Experiment we learned about how to work with git and understand its
workflow.
EXPERIMENT-6
OBJECTIVE: Consider any prediction model of your choice.
1. Analyze the dataset that is given as an input to the prediction model
2. Find out the quartiles for the used dataset
3. Analyze the performance of a model using various performance metrics.
INTRODUCTION:
For the experiment we have used Iris dataset which can be downloaded from
https://archive.ics.uci.edu/ml/datasets/iris
The dataset is evaluated using pre-defined models and techniques in the sklearn library of python.
We first generated a summary of the Iris dataset, then reduced the dimensionality for the prediction model and created the input data and output labels; a sketch of these steps follows.
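A minimal sketch of this loading and summarizing step, assuming scikit-learn and pandas are installed (the exact code used in the lab session is not reproduced here):

# Sketch: load the Iris dataset, summarize it, and split it into input data (X) and labels (y).
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)   # the UCI Iris data bundled with scikit-learn
df = iris.frame                   # features plus target as a pandas DataFrame

print(df.describe())              # summary statistics: count, mean, std, quartiles, ...

X = df[iris.feature_names]        # input data: sepal/petal measurements
y = df["target"]                  # output label: iris species (0, 1, 2)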
2. Find out the quartiles for the used dataset
The quantile() function of NumPy is used to find the quartiles of the dataset for each of the attributes used. Quantiles are the set of values/points that divide the dataset into groups of equal size.
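A minimal sketch of the quartile computation with numpy.quantile (it reuses X and iris from the loading sketch above):

# Sketch: quartiles (Q1, Q2/median, Q3) of each Iris attribute using numpy.quantile.
import numpy as np

quartiles = np.quantile(X, [0.25, 0.5, 0.75], axis=0)  # rows: Q1, Q2, Q3; columns: attributes
for name, (q1, q2, q3) in zip(iris.feature_names, quartiles.T):
    print("%-20s Q1=%.2f  Q2=%.2f  Q3=%.2f" % (name, q1, q2, q3))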
Precision: Precision is defined as the number of true positives divided by the number of true positives plus the
number of false positives. False positives are cases the model incorrectly labels as positive that are actually
negative.
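A minimal sketch of training a decision tree (the model used in this experiment) on a held-out split and reporting precision alongside other common metrics; it reuses X and y from the loading sketch above, and the split ratio and random_state are arbitrary choices, not values prescribed by the lab:

# Sketch: evaluate a decision tree with accuracy, precision, recall, and a confusion matrix.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="macro"))  # TP / (TP + FP), averaged over classes
print("Recall   :", recall_score(y_test, y_pred, average="macro"))     # TP / (TP + FN), averaged over classes
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))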
Learning from experiment:- Using the Decision Tree Predictive Model for iris dataset, we analyzed the
dataset, found out quartiles and analyzed the performance of the model using various performance metrics using
Python language.
EXPERIMENT-7
OBJECTIVE: Validate the results obtained in experiment 6 using 10-fold cross-validation, hold-out validation, or leave-one-out cross-validation.
INTRODUCTION:-For the experiment we have used Iris dataset which can be downloaded from
https://archive.ics.uci.edu/ml/datasets/iris
In the case of hold-out validation, the dataset is randomly split into training and validation data. Generally, the share of training data is larger than that of the test data. The training data is used to induce the model, and the validation data evaluates the performance of the model.
The more data used to train the model, the better the model is; with the hold-out method, however, a good amount of data is withheld from training.
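A minimal sketch of hold-out validation on the Iris dataset with a decision tree (the 70/30 split and random_state are arbitrary assumptions):

# Sketch: hold-out validation of a decision tree on the Iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Hold-out accuracy:", accuracy_score(y_val, model.predict(X_val)))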
10-fold cross-validation
In k-fold cross-validation, the original dataset is equally partitioned into k subparts or folds. Out of the k folds or groups, for each iteration, one group is selected as validation data, and the remaining (k-1) groups are selected as training data.
The process is repeated k times until each group has been treated as validation data and the remaining groups as training data.
The final accuracy of the model is computed by taking the mean accuracy over the k validation folds.
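A minimal sketch of 10-fold cross-validation for the same kind of model (scikit-learn; the choice of classifier and random_state are assumptions):

# Sketch: 10-fold cross-validation; the final score is the mean over the 10 validation folds.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("Fold accuracies:", scores)
print("Mean accuracy  :", scores.mean())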
Leave-one-out cross-validation
For a dataset having n rows, the 1st row is selected for validation, and the rest (n-1) rows are used to train the model. For the next iteration, the 2nd row is selected for validation and the rest to train the model. The process is repeated in this way for n iterations.
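A minimal sketch of leave-one-out cross-validation (scikit-learn; same assumed classifier as above):

# Sketch: leave-one-out cross-validation; n iterations, one row held out each time.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=LeaveOneOut())
print("LOOCV mean accuracy:", scores.mean())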
OUTPUT:-
The dataset is evaluated using pre-defined models and techniques in the sklearn library of python.
RESULT:- Hold-out validation gives the best accuracy score (1.0), higher than 10-fold cross-validation and leave-one-out cross-validation.
Learning from experiment:- We have learned about hold-out validation, k-fold cross-validation, and leave-one-out cross-validation.
EXPERIMENT-8
OBJECTIVE:-Online loan system has two modules for the two basic services, namely Car loan service and
House loan service. The two modules have been named as Car_Loan_Module and House_Loan_Module.
Car_Loan_Module has 2000 lines of uncommented source code.
House_Loan_Module has 3000 lines of uncommented source code. Car_Loan_Module was completely
implemented by Mike. House_Loan_Module was completely implemented by John. Mike took 100 person hours
to implement the Car_Loan_Module. John took 200 person hours to implement the House_Loan_Module. Mike’s
module had 5 defects. John’s module had 6 defects. With respect to the context given, which among the following
is an INCORRECT statement?
Justify and choose one:
a) John’s Quality is better than Mike’s Quality
b) John’s Productivity is more than Mike’s Productivity
c) John introduced more defects than Mike
d) John’s Effort is more than Mike’s Effort.
NULL hypothesis: The null hypothesis states that a population parameter (such as the mean, the standard
deviation, and so on) is equal to a hypothesized value. The null hypothesis is often an initial claim that is based on
previous analyses or specialized knowledge.
Alternate hypothesis: The alternative hypothesis states that a population parameter is smaller, greater, or
different than the hypothesized value in the null hypothesis. The alternative hypothesis is what you might believe
to be true or hope to prove true.
CALCULATIONS:-
For John,
● Size = 3000 LOC
● Effort = 200 person-hours; Defects = 6
● Productivity = size/effort = 3000/200 = 15 LOC/person-hour
● Quality (defect density) = defects/size = 6 defects / 3000 LOC = 0.002 defects/LOC
For Mike,
● Size = 2000 LOC
● Effort = 100 person-hours; Defects = 5
● Productivity = size/effort = 2000/100 = 20 LOC/person-hour
● Quality (defect density) = defects/size = 5 defects / 2000 LOC = 0.0025 defects/LOC
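The same calculations can be expressed as a short Python sketch (a cross-check of the arithmetic above, not part of the original solution):

# Sketch: productivity and defect density (quality) for the two modules.
modules = {
    "Mike (Car_Loan_Module)":   {"loc": 2000, "effort": 100, "defects": 5},
    "John (House_Loan_Module)": {"loc": 3000, "effort": 200, "defects": 6},
}

for name, m in modules.items():
    productivity   = m["loc"] / m["effort"]    # LOC per person-hour
    defect_density = m["defects"] / m["loc"]   # defects per LOC (lower means better quality)
    print("%s: productivity = %.0f LOC/person-hour, defect density = %.4f defects/LOC"
          % (name, productivity, defect_density))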
(b) is the incorrect statement. Below is the reasoning.
(a) The quality of the code can be expressed in terms of defect density, i.e., the number of defects per line of code.
● Mike's defect density = 5/2000 = 0.0025 defects/SLOC
● John's defect density = 6/3000 = 0.0020 defects/SLOC
The higher the defect density, the lower the quality of the code. So, John's quality is better than Mike's. Hence, (a) is correct.
(b) Mike's productivity (20 LOC/person-hour) is higher than John's (15 LOC/person-hour), so the statement that John's productivity is more than Mike's is incorrect.
(c) John introduced 6 defects while Mike introduced 5. Clearly, John introduced more defects than Mike. Hence, (c) is correct.
(d) John's effort is 200 person-hours, while Mike's effort is 100 person-hours. Clearly, John's effort is more than Mike's. Hence, (d) is correct.
RESULT:- Mike's productivity (20 LOC/person-hour) is more than that of John (15 LOC/person-hour). Hence, option (b) is the INCORRECT statement.
EXPERIMENT-9
THEORY:-
1. NULL HYPOTHESIS: A null hypothesis is a type of hypothesis used in statistics that proposes that no
statistical significance exists in a set of given observations.
2. HYPOTHESIS TESTING: All hypothesis tests are conducted the same way. The researcher states a hypothesis
to be tested, formulates an analysis plan, analyzes sample data according to the plan, and accepts or rejects the
null hypothesis, based on results of the analysis.
PROCEDURE:-
State the hypotheses. Every hypothesis test requires the analyst to state a null hypothesis and an alternative
hypothesis. The hypotheses are stated in such a way that they are mutually exclusive. That is, if one is true, the
other must be false; and vice versa.
Formulate an analysis plan. The analysis plan describes how to use sample data to accept or reject the null
hypothesis. It should specify the following elements.
Significance level. Often, researchers choose significance levels equal to 0.01, 0.05, or 0.10; but any value
between 0 and 1 can be used.
Test method. Typically, the test method involves a test statistic and a sampling distribution. Computed from
sample data, the test statistic might be a mean score, proportion, difference between means, difference between
proportions, z-score, t statistic, chi-square, etc. Given a test statistic and its sampling distribution, a researcher can
assess probabilities associated with the test statistic. If the test statistic probability is less than the significance
level, the null hypothesis is rejected.
Analyze sample data. Using sample data, perform computations called for in the analysis plan.
Test statistics. When the null hypothesis involves a mean or proportion, use either of the following equations to
compute the test statistic.
Test statistic = (Statistic - Parameter) / (Standard deviation of statistic)
Test statistic = (Statistic - Parameter) / (Standard error of statistic)
where Parameter is the value appearing in the null hypothesis, and Statistic is the point estimate of Parameter. As
part of the analysis, you may need to compute the standard deviation or standard error of the statistic. Previously,
we presented common formulas for the standard deviation and standard error. When the parameter in the null
hypothesis involves categorical data, you may use a chi-square statistic as the test statistic. Instructions for
computing a chi-square test statistic are presented in the lesson on the chi-square goodness of fit test.
P-value. The P-value is the probability of observing a sample statistic as extreme as the test statistic, assuming the
null hypothesis is true.
Interpret the results. If the sample findings are unlikely, given the null hypothesis, the researcher rejects the null
hypothesis. Typically, this involves comparing the P-value to the significance level, and rejecting the null
hypothesis when the P-value is less than the significance level.
Type I and Type II errors can be defined in terms of hypothesis testing. A Type I error is the rejection of a true null hypothesis; a Type II error is the failure to reject a false null hypothesis.
Problem Taken:- Within a school district, students were randomly assigned to one of two Math teachers - Mrs.
Smith and Mrs. Jones. After the assignment, Mrs. Smith had 30 students, and Mrs. Jones had 25 students.
At the end of the year, each class took the same standardized test. Mrs. Smith's students had an average test score
of 78, with a standard deviation of 10; and Mrs. Jones' students had an average test score of 85, with a standard
deviation of 15.
Test the hypothesis that Mrs. Smith and Mrs. Jones are equally effective teachers. Use a 0.10 level of significance.
(Assume that student performance is approximately normal.)
OUTPUT / ANSWERS:-
State the hypotheses. The first step is to state the null hypothesis and an alternative hypothesis.
Null hypothesis: μ1 - μ2 = 0
Alternative hypothesis: μ1 - μ2 ≠ 0
Note that these hypotheses constitute a two-tailed test. The null hypothesis will be rejected if the difference
between sample means is too big or if it is too small.
Formulate an analysis plan. For this analysis, the significance level is 0.10. Using sample data, we will conduct
a two-sample t-test of the null hypothesis.
Analyze sample data. Using sample data, we compute the standard error (SE), the degrees of freedom (DF), and the t statistic (t).
SE = sqrt[ (s1^2/n1) + (s2^2/n2) ]
SE = sqrt[ (10^2/30) + (15^2/25) ] = sqrt(3.33 + 9)
SE = sqrt(12.33) = 3.51
DF = (s1^2/n1 + s2^2/n2)^2 / { [ (s1^2/n1)^2 / (n1 - 1) ] + [ (s2^2/n2)^2 / (n2 - 1) ] }
DF = (10^2/30 + 15^2/25)^2 / { [ (10^2/30)^2 / 29 ] + [ (15^2/25)^2 / 24 ] }
DF = (3.33 + 9)^2 / { [ (3.33)^2 / 29 ] + [ (9)^2 / 24 ] } = 152.03 / (0.382 + 3.375) = 152.03 / 3.757 = 40.47
t = [ (x1 - x2) - d ] / SE = [ (78 - 85) - 0 ] / 3.51 = -7/3.51 = -1.99
where s1 is the standard deviation of sample 1, s2 is the standard deviation of sample 2, n1 is the size of sample 1,
n2 is the size of sample 2, x1 is the mean of sample 1, x2 is the mean of sample 2, d is the hypothesized difference
between the population means, and SE is the standard error.
Since we have a two-tailed test, the P-value is the probability that a t statistic having 40 degrees of freedom is
more extreme than -1.99; that is, less than -1.99 or greater than 1.99.
We use the t Distribution Calculator to find P(t < -1.99) = 0.027, and P(t > 1.99) = 0.027. Thus, the P-value =
0.027 + 0.027 = 0.054.
Interpret results. Since the P-value (0.054) is less than the significance level (0.10), we reject the null hypothesis.
Type I error. The probability of a Type I error equals the significance level, which is 0.10 in this case.
Specifically, the approach is appropriate because the sampling method was simple random sampling, the samples
were independent, the sample size was much smaller than the population size, and the samples were drawn from a
normal population.
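The calculation above can be cross-checked in Python (a sketch assuming SciPy is available) with scipy.stats.ttest_ind_from_stats, which runs the two-sample t-test directly from the summary statistics; equal_var=False selects Welch's version, matching the DF formula used above:

# Sketch: Welch two-sample t-test from the summary statistics in the problem.
from scipy.stats import ttest_ind_from_stats

t_stat, p_value = ttest_ind_from_stats(
    mean1=78, std1=10, nobs1=30,   # Mrs. Smith's class
    mean2=85, std2=15, nobs2=25,   # Mrs. Jones' class
    equal_var=False)               # Welch's t-test, no equal-variance assumption

print("t = %.2f, p = %.3f" % (t_stat, p_value))  # expected: t of about -1.99, p of about 0.054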
Learning from experiment:- This experiment gives insights into the use of hypothesis testing on real-life examples.