Part 1: Data Investigation and Cleaning: Classification of Data Errors
Some entries in the column "year" are incorrectly formatted, for example "19-99" instead of the
correct "1999". Replacing the character '-' with an empty string throughout the column fixes
this issue.
For erroneous entries that contain the character '-' or appear as negative values, removing the
'-' yields the correct figure; this was double-checked against the population figures in the World
Bank data.
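The two fixes above can be sketched in pandas as follows; the column names and values here are illustrative assumptions, not the assignment's actual data frame:

```python
import pandas as pd

# Toy frame reproducing the two error patterns described above
# (column names and values are illustrative).
df = pd.DataFrame({"year": ["19-99", "2000"],
                   "population": ["-1234", "5678"]})

# Strip the stray '-' characters, then restore numeric types.
for col in ["year", "population"]:
    df[col] = df[col].str.replace("-", "", regex=False).astype(int)
```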
Inspecting the column data types shows that some entries are not integers. Moreover, the dataset
contains missing values and negative values. The data errors therefore fall into three
categories:
- Non-integer values: the population must be an integer.
- Negative values: a country's population cannot be a negative number.
- Missing values: both Missing Completely At Random (MCAR) and Missing At Random (MAR)
values occur, which critically affects the later classification steps on this dataset.
It is necessary to locate the erroneous entries and investigate their patterns. For this, the
CustomElementValidation class of the pandas_schema library (used together with pandas) was
employed [1]. The number of erroneous entries per category:
- Non-integer values: 171
- Missing values: 178
- Negative values: 11
- Values equal to 0: 24
- Total data errors: 386
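The four category checks can be sketched in plain pandas (without pandas_schema) on a toy series; the values below are illustrative, one per category:

```python
import numpy as np
import pandas as pd

# Illustrative population column containing one error of each kind.
pop = pd.Series([1000, -5, np.nan, 0, 12.5])

is_missing = pop.isna()
is_negative = pop < 0
is_zero = pop == 0
# Non-integer: present, but not a whole number.
is_not_integer = pop.notna() & (pop % 1 != 0)

counts = {
    "missing": int(is_missing.sum()),
    "negative": int(is_negative.sum()),
    "zero": int(is_zero.sum()),
    "not integer": int(is_not_integer.sum()),
}
```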
Logic/Rules to clean up data
After counting and filtering the data errors into categories, the errors were further filtered by
country. This helps in investigating the error patterns, so that the correct rules and methods
can be chosen to handle them.
Most countries have only one data error, occurring completely at random, while some countries
have more than five errors following a pattern over a consecutive period of years.
From the figure above, the countries with many errors were printed out with their exact error
counts. The errors were assigned to a separate data frame for closer investigation. Then every
error was examined country by country to identify the patterns.
- Country Name "Not classified", with 110 data errors: all population data for this entry is
missing. In fact, the name does not correspond to a real country and could not be found in any
internet source. Therefore, the "Not classified" rows are removed from the dataset.
- The data errors of the other countries in this list are missing or incorrectly formatted
values spanning a period of years. For example, "Guatemala" has missing data from 1960 to 1980
and from 2008 to 2017.
- In particular, the Country Name "West Bank and Gaza" has many missing values from 1960 to
1989; official records and data collection only exist from 1990 onward. The territory is also
known as the Palestinian territories, and its history is complicated by immigration and
occupation policy. These rows are dropped, since the outlier data would act as noise relative to
the world population in the classification phase.
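Grouping the flagged errors by country, as described above, can be sketched like this; the error records below are made up for illustration:

```python
import pandas as pd

# Illustrative (country, year) records flagged during the error checks.
errors = pd.DataFrame({
    "Country Name": ["Guatemala", "Guatemala", "Aruba", "Not classified"],
    "year": [1960, 1961, 1994, 2000],
})

# Error count per country, largest first.
per_country = errors.groupby("Country Name").size().sort_values(ascending=False)
# Countries with more than one error are candidates for a pattern.
many_errors = per_country[per_country > 1]
```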
Rules and methods to clean up the data
Since the error patterns have been identified for the critical cases (countries with many data
errors), the cleanup proceeds country by country. Two rules are used to clean this dataset:
- For a country that does not have many errors and whose errors are completely random, the
MICE library is applied to impute the data [4]. Overall, each country's population increases by
roughly 3.6% to 5% per year, so imputation is suitable and reasonable in this case.
- For a country with many errors following a pattern, substitution is applied instead, because
imputation in this case could generate data very different from the real population. The
substitute data is collected from the World Bank Data portal and from Worldometers (whose source
is the United Nations, Department of Economic and Social Affairs, Population Division, World
Population Prospects: The 2019 Revision).
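The report performs the imputation with the MICE package in R [4]; as a much simpler stand-in (not MICE itself), linear interpolation within one country's yearly series illustrates the idea that isolated gaps can be filled from the surrounding trend. The numbers are made up:

```python
import numpy as np
import pandas as pd

# One country's yearly population with isolated random gaps (illustrative).
series = pd.Series([100.0, np.nan, 108.0, np.nan, 117.0],
                   index=range(1960, 1965))

# Fill each gap from the neighbouring years' trend.
filled = series.interpolate(method="linear")
```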
Number of removed rows and columns
Two columns were removed: 'Indicator Name' and 'Indicator Code'. These two attributes contain a
single string value for all rows, which appears to be the worldwide population indicator code.
As for removed rows, all rows that do not represent a country are dropped, since the dataset
contains both countries and world geographical/development aggregates such as Central Europe and
the Baltics, East Asia & Pacific, etc.
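The column and row removal can be sketched as follows; the aggregate-name list here is a hypothetical stand-in for the full list of non-country aggregates:

```python
import pandas as pd

# Miniature version of the dataset (values are illustrative).
df = pd.DataFrame({
    "Country Name": ["Aruba", "East Asia & Pacific", "World"],
    "Indicator Name": ["Population, total"] * 3,
    "Indicator Code": ["SP.POP.TOTL"] * 3,
    "1999": [90000, 2_000_000_000, 6_000_000_000],
})

# Constant columns carry no information for the analysis.
df = df.drop(columns=["Indicator Name", "Indicator Code"])

# Hypothetical subset of the aggregate (non-country) names to exclude.
aggregates = {"East Asia & Pacific", "World"}
df = df[~df["Country Name"].isin(aggregates)].reset_index(drop=True)
```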
Part 2: Country Histogram for start and end digit
First-digit histogram: the smallest digits occur most frequently, and the greater the digit, the
less often it appears as the first digit. This matches the distribution of first digits observed
in many real-world datasets, known as Benford's Law [2]. The same pattern occurs in the data
from 1980 to 2018.
Last-digit histogram: the digit 0 occurs most frequently, while the other digits occur roughly
equally and less often than 0. The likely reason is that population figures are often rounded to
a number divisible by 10. This relates to the notion of significant figures [3].
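The two histograms can be computed with a short helper; `populations` is assumed to be a list of cleaned integer population figures:

```python
from collections import Counter
from math import log10

def digit_histograms(populations):
    """Count first and last decimal digits of each population figure."""
    first = Counter(str(p)[0] for p in populations if p > 0)
    last = Counter(str(p)[-1] for p in populations if p > 0)
    return first, last

def benford_expected(d):
    """Benford's Law: expected share of d (1-9) as a first digit."""
    return log10(1 + 1 / d)
```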
Question 2:
Statistical Analysis:
The statistics of the iris dataset are calculated as follows:
- Statistics of Setosa Iris
- Statistics of Versicolor Iris
The scatter plot presents more detail on the data distribution for each attribute of each iris
species. The petal length vs. petal width plot shows the clearest separation among the three
species, except for some overlap between the Versicolor and Virginica irises. This finding
reinforces the classification rules defined in the statistical analysis (sub-question of
Question 2). Moreover, it is also clear that an iris with a large petal length and petal width
should be classified as the Virginica species. The plot also confirms that no Setosa iris has a
petal width greater than 0.6 cm.
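Distilled into code, a minimal rule-based classifier along these lines might look as follows; the Setosa threshold comes from the observation above, while the Virginica thresholds are assumptions read off a typical iris scatter plot, not values stated in the report:

```python
def classify_iris(petal_length, petal_width):
    """Hypothetical rule-based classifier distilled from the scatter plot."""
    if petal_width <= 0.6:  # no Setosa exceeds 0.6 cm petal width
        return "setosa"
    if petal_length >= 5.0 and petal_width >= 1.8:  # assumed thresholds
        return "virginica"
    return "versicolor"
```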
Question 3:
How long would it take to calculate the first 10,000 and 25,000 Fibonacci numbers?
- Functions to generate the first n Fibonacci numbers:
- Calculation and estimation of the execution time for the first 10,000 and 25,000 numbers:
- Implementation:
As a result, the 1,000,000th number is calculated correctly, and the execution time is less than
1 millisecond.
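A sub-millisecond time for the 1,000,000th number is consistent with the fast-doubling method from [5], which needs only O(log n) multiplications. A sketch:

```python
def fib(n):
    """nth Fibonacci number (F(0)=0, F(1)=1) via fast doubling, O(log n)."""
    def fd(k):
        # Returns the pair (F(k), F(k+1)).
        if k == 0:
            return (0, 1)
        a, b = fd(k >> 1)
        c = a * (2 * b - a)   # F(2m)   = F(m) * (2*F(m+1) - F(m))
        d = a * a + b * b     # F(2m+1) = F(m)^2 + F(m+1)^2
        return (d, c + d) if k & 1 else (c, d)
    return fd(n)[0]
```

The recursion halves k at each step, so computing fib(1_000_000) needs only about 20 levels of doubling.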
References
[1] B. Cojocar, "How to do column validation with pandas", Medium, 2020. [Online]. Available: https://medium.com/@bogdan.cojocar/how-to-do-column-validation-with-pandas-bbeb38f88990. [Accessed: 10-Aug-2020].
[2] P. Corn, H. Vee and C. Williams, "Benford's Law | Brilliant Math & Science Wiki", Brilliant.org, 2020. [Online]. Available: https://brilliant.org/wiki/benfords-law/. [Accessed: 11-Aug-2020].
[3] "Significant Figures", Staff.vu.edu.au, 2020. [Online]. Available: http://www.staff.vu.edu.au/mcaonline/units/numbers/numsig.html. [Accessed: 12-Aug-2020].
[4] S. van Buuren and C. Groothuis-Oudshoorn, "MICE: Multivariate Imputation by Chained Equations in R", ResearchGate, 2020. [Online]. Available: https://www.researchgate.net/publication/44203418_MICE_Multivariate_Imputation_by_Chained_Equations_in_R. [Accessed: 09-Aug-2020].
[5] V. Kumar, "Fast Doubling method to find nth Fibonacci number - Vinay Kumar", HackerEarth, 2020. [Online]. Available: https://www.hackerearth.com/de/practice/notes/fast-doubling-method-to-find-nth-fibonacci-number/. [Accessed: 03-Aug-2020].