Part 1: Data Investigation and Cleaning: Classification of Data Errors
Some entries in the column "year" are incorrectly formatted, for example "19-99" instead of the
correct "1999". Replacing the character '-' with an empty string throughout the column fixes
this issue.
For erroneous entries that contain the character '-' or appear as negative values, removing the
'-' yields the correct figure; this was double-checked against the population figures in the World
Bank data.
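The two fixes above can be sketched in pandas as follows; the column names and values here are illustrative assumptions, not the assignment's actual data frame:

```python
import pandas as pd

# Toy frame reproducing the two error patterns described above
# (column names and values are illustrative).
df = pd.DataFrame({"year": ["19-99", "2000"],
                   "population": ["-1234", "5678"]})

# Strip the stray '-' characters, then restore numeric types.
for col in ["year", "population"]:
    df[col] = df[col].str.replace("-", "", regex=False).astype(int)
```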
Inspecting the column data types shows that some entries are not integers. Moreover, the dataset
contains missing values and negative values. The data errors therefore fall into three
categories:
- Non-integer values: the population must be an integer.
- Negative values: a country's population cannot be a negative number.
- Missing values: both Missing Completely At Random (MCAR) and Missing At Random (MAR)
values occur, which critically affects the later classification steps on this dataset.
It is necessary to locate the erroneous entries and investigate their patterns. For this, the
CustomElementValidation class of the pandas_schema library (used together with pandas) was
employed [1]. The number of erroneous entries per category:
- Non-integer values: 171
- Missing values: 178
- Negative values: 11
- Values equal to 0: 24
- Total data errors: 386
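The four category checks can be sketched in plain pandas (without pandas_schema) on a toy series; the values below are illustrative, one per category:

```python
import numpy as np
import pandas as pd

# Illustrative population column containing one error of each kind.
pop = pd.Series([1000, -5, np.nan, 0, 12.5])

is_missing = pop.isna()
is_negative = pop < 0
is_zero = pop == 0
# Non-integer: present, but not a whole number.
is_not_integer = pop.notna() & (pop % 1 != 0)

counts = {
    "missing": int(is_missing.sum()),
    "negative": int(is_negative.sum()),
    "zero": int(is_zero.sum()),
    "not integer": int(is_not_integer.sum()),
}
```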
Logic/Rules to clean up data
After counting and filtering the data errors into categories, the errors were further filtered by
country. This helps in investigating the error patterns, so that the correct rules and methods
can be chosen to handle them.
Most countries have only one data error, occurring completely at random, while some countries
have more than five errors following a pattern over a consecutive period of years.
From the figure above, the countries with many errors were printed out with their exact error
counts. The errors were assigned to a separate data frame for closer investigation. Then every
error was examined country by country to identify the patterns.
- Country Name "Not classified", with 110 data errors: all population data for this entry is
missing. In fact, the name does not correspond to a real country and could not be found in any
internet source. Therefore, the "Not classified" rows are removed from the dataset.
- The data errors of the other countries in this list are missing or incorrectly formatted
values spanning a period of years. For example, "Guatemala" has missing data from 1960 to 1980
and from 2008 to 2017.
- In particular, the Country Name "West Bank and Gaza" has many missing values from 1960 to
1989; official records and data collection only exist from 1990 onward. The territory is also
known as the Palestinian territories, and its history is complicated by immigration and
occupation policy. These rows are dropped, since the outlier data would act as noise relative to
the world population in the classification phase.
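Grouping the flagged errors by country, as described above, can be sketched like this; the error records below are made up for illustration:

```python
import pandas as pd

# Illustrative (country, year) records flagged during the error checks.
errors = pd.DataFrame({
    "Country Name": ["Guatemala", "Guatemala", "Aruba", "Not classified"],
    "year": [1960, 1961, 1994, 2000],
})

# Error count per country, largest first.
per_country = errors.groupby("Country Name").size().sort_values(ascending=False)
# Countries with more than one error are candidates for a pattern.
many_errors = per_country[per_country > 1]
```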
Rules and methods to clean up the data
Since the error patterns have been identified for the critical cases (countries with many data
errors), the cleanup proceeds country by country. Two rules are used to clean this dataset:
- For a country that does not have many errors and whose errors are completely random, the
MICE library is applied to impute the data [4]. Overall, each country's population increases by
roughly 3.6% to 5% per year, so imputation is suitable and reasonable in this case.
- For a country with many errors following a pattern, substitution is applied instead, because
imputation in this case could generate data very different from the real population. The
substitute data is collected from the World Bank Data portal and from Worldometers (whose source
is the United Nations, Department of Economic and Social Affairs, Population Division, World
Population Prospects: The 2019 Revision).
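The report performs the imputation with the MICE package in R [4]; as a much simpler stand-in (not MICE itself), linear interpolation within one country's yearly series illustrates the idea that isolated gaps can be filled from the surrounding trend. The numbers are made up:

```python
import numpy as np
import pandas as pd

# One country's yearly population with isolated random gaps (illustrative).
series = pd.Series([100.0, np.nan, 108.0, np.nan, 117.0],
                   index=range(1960, 1965))

# Fill each gap from the neighbouring years' trend.
filled = series.interpolate(method="linear")
```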
Number of removed rows and columns
Two columns were removed: 'Indicator Name' and 'Indicator Code'. These two attributes contain a
single string value for all rows, which appears to be the worldwide population indicator code.
As for removed rows, all rows that do not represent a country are dropped, since the dataset
contains both countries and world geographical/development aggregates such as Central Europe and
the Baltics, East Asia & Pacific, etc.
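The column and row removal can be sketched as follows; the aggregate-name list here is a hypothetical stand-in for the full list of non-country aggregates:

```python
import pandas as pd

# Miniature version of the dataset (values are illustrative).
df = pd.DataFrame({
    "Country Name": ["Aruba", "East Asia & Pacific", "World"],
    "Indicator Name": ["Population, total"] * 3,
    "Indicator Code": ["SP.POP.TOTL"] * 3,
    "1999": [90000, 2_000_000_000, 6_000_000_000],
})

# Constant columns carry no information for the analysis.
df = df.drop(columns=["Indicator Name", "Indicator Code"])

# Hypothetical subset of the aggregate (non-country) names to exclude.
aggregates = {"East Asia & Pacific", "World"}
df = df[~df["Country Name"].isin(aggregates)].reset_index(drop=True)
```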
Part 2: Country Histogram for start and end digit
First-digit histogram: the smallest digits occur most frequently, and the greater the digit, the
less often it appears as the first digit. This matches the distribution of first digits observed
in many real-world datasets, known as Benford's Law [2]. The same pattern occurs in the data
from 1980 to 2018.
Last-digit histogram: the digit 0 occurs most frequently, while the other digits occur roughly
equally and less often than 0. The likely reason is that population figures are often rounded to
a number divisible by 10. This relates to the notion of significant figures [3].
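The two histograms can be computed with a short helper; `populations` is assumed to be a list of cleaned integer population figures:

```python
from collections import Counter
from math import log10

def digit_histograms(populations):
    """Count first and last decimal digits of each population figure."""
    first = Counter(str(p)[0] for p in populations if p > 0)
    last = Counter(str(p)[-1] for p in populations if p > 0)
    return first, last

def benford_expected(d):
    """Benford's Law: expected share of d (1-9) as a first digit."""
    return log10(1 + 1 / d)
```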
Question 2:
Statistical Analysis:
The statistics of the iris dataset are calculated as follows:
- Statistics of Setosa Iris
- Statistics of Versicolor Iris
The scatter plot presents more detail on the data distribution for each attribute of each iris
species. The petal length vs. petal width plot shows the clearest separation among the three
species, except for some overlap between the Versicolor and Virginica irises. This finding
reinforces the classification rules defined in the statistical analysis (sub-question of
Question 2). Moreover, it is also clear that an iris with a large petal length and petal width
should be classified as the Virginica species. The plot also confirms that no Setosa iris has a
petal width greater than 0.6 cm.
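Distilled into code, a minimal rule-based classifier along these lines might look as follows; the Setosa threshold comes from the observation above, while the Virginica thresholds are assumptions read off a typical iris scatter plot, not values stated in the report:

```python
def classify_iris(petal_length, petal_width):
    """Hypothetical rule-based classifier distilled from the scatter plot."""
    if petal_width <= 0.6:  # no Setosa exceeds 0.6 cm petal width
        return "setosa"
    if petal_length >= 5.0 and petal_width >= 1.8:  # assumed thresholds
        return "virginica"
    return "versicolor"
```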
Question 3:
How long would it take to calculate the first 10,000 and 25,000 Fibonacci numbers?
- Functions to generate the first n Fibonacci numbers:
- Calculation and estimation of the execution time for the first 10,000 and 25,000 numbers:
- Implementation:
As a result, the 1,000,000th number is calculated correctly, and the execution time is less than
1 millisecond.
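A sub-millisecond time for the 1,000,000th number is consistent with the fast-doubling method from [5], which needs only O(log n) multiplications. A sketch:

```python
def fib(n):
    """nth Fibonacci number (F(0)=0, F(1)=1) via fast doubling, O(log n)."""
    def fd(k):
        # Returns the pair (F(k), F(k+1)).
        if k == 0:
            return (0, 1)
        a, b = fd(k >> 1)
        c = a * (2 * b - a)   # F(2m)   = F(m) * (2*F(m+1) - F(m))
        d = a * a + b * b     # F(2m+1) = F(m)^2 + F(m+1)^2
        return (d, c + d) if k & 1 else (c, d)
    return fd(n)[0]
```

The recursion halves k at each step, so computing fib(1_000_000) needs only about 20 levels of doubling.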
References
[1] B. Cojocar, "How to do column validation with pandas", Medium, 2020. [Online]. Available: https://medium.com/@bogdan.cojocar/how-to-do-column-validation-with-pandas-bbeb38f88990. [Accessed: 10-Aug-2020].
[2] P. Corn, H. Vee and C. Williams, "Benford's Law | Brilliant Math & Science Wiki", Brilliant.org, 2020. [Online]. Available: https://brilliant.org/wiki/benfords-law/. [Accessed: 11-Aug-2020].
[3] "Significant Figures", Staff.vu.edu.au, 2020. [Online]. Available: http://www.staff.vu.edu.au/mcaonline/units/numbers/numsig.html. [Accessed: 12-Aug-2020].
[4] S. van Buuren and C. Groothuis-Oudshoorn, "MICE: Multivariate Imputation by Chained Equations in R", ResearchGate, 2020. [Online]. Available: https://www.researchgate.net/publication/44203418_MICE_Multivariate_Imputation_by_Chained_Equations_in_R. [Accessed: 09-Aug-2020].
[5] V. Kumar, "Fast Doubling method to find nth Fibonacci number - Vinay Kumar", HackerEarth, 2020. [Online]. Available: https://www.hackerearth.com/de/practice/notes/fast-doubling-method-to-find-nth-fibonacci-number/. [Accessed: 03-Aug-2020].