International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 07 Issue: 03 | Mar 2020 www.irjet.net p-ISSN: 2395-0072
© 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 948
An User Friendly Interface for Data Preprocessing and Visualization
using Machine Learning Models
Mr. S. Yoganand1, Bharathi Kannan R2, Daya Meenakshi B3
1Assistant Professor, Department of Computer Science and Engineering, Agni College of Technology Chennai-130,
Tamil Nadu, India.
2,3UG Student, Department of Computer Science and Engineering, Agni College of Technology Chennai-130,
Tamil Nadu, India.
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract – Machine learning is one of the most efficient
techniques for prediction and classification problems.
Today, most industries all over the world depend on
machine learning models, which has led into the century of
data analytics. However, there is no proper and efficient
tool for handling datasets that use machine learning models
for data prediction and visualization. This paper therefore
proposes a novel, user-friendly approach to handling
machine learning models for data prediction and
visualization. A tool is developed that performs data
cleaning, a prerequisite for data analysis, and then provides
a visual representation of the cleansed data. The tool takes
as input a structured dataset containing both textual and
numerical data, which is processed using machine learning
algorithms to obtain a pre-processed dataset. The process
goes through a series of steps to produce visualized and
predicted data according to the algorithm chosen.
Key Words: Machine learning, visualization,
pre-processing, tool, user interface.
1. INTRODUCTION
Organisations use datasets for predictive
analysis, and an important concern in such cases is data
quality. Noisy data can hamper the correctness of the
analysis. Common errors include missing values and
duplicates. These errors need to be corrected for
reliable decisions and analytics. Users must understand
the effects of using noisy data before proceeding with the
cleaning process. Noise removal improves model
performance, because noise can obscure the
discovery of important information.
Machine learning is a widely used application of
Artificial Intelligence. It allows a system to learn
automatically, without human assistance, from huge datasets
with a large number of data fields. With the results
produced by machine learning algorithms, organizations are
able to work more effectively and gain an advantage over
their competitors. A system that uses machine learning
techniques can predict what the structure of the data looks
like and adjust the data according to that structure. The
main challenge for a machine learning model is dealing with
large data sources in the data cleaning process. Data
cleaning is carried out by taking in huge datasets, which
are checked for possible errors using data pre-processing
techniques. Other challenges include avoiding learning from
noisy data, avoiding building a biased model, and not making
excuses for compromising on the quality of the data. The
best practices for data cleaning using machine learning
techniques are filling in missing values, removing
unnecessary rows, reducing the size of the data and
implementing a good quality plan.
The success of machine learning applications
depends on the amount of good-quality data given to
them. Yet cleaning is often not treated as a
main area of data pre-processing. Even a system that uses
powerful algorithms can yield bad results if it is given an
irrelevant or wrong training set. The proposed model uses
ML algorithms to find the different patterns in the data and
to group it automatically into clean and noisy data, which
helps to reduce execution time.
2. Related Work:
Data pre-processing converts raw data
into a pre-processed data set [1]. In machine learning,
data pre-processing transforms or encodes the data so that
it can be handled easily by the algorithm. It consists of
the following interactive steps. Data cleaning detects
inaccurate records in a record set or table and then
corrects them by replacing, modifying or deleting the noisy
data. Data integration combines data residing in different
sources and provides the user with a unified view of these
data [2]. Data selection, the process of selecting suitable
data for a research project, has an impact on data
integrity, while data transformation converts data from a
source format into the resultant format [3].
The tools available for data processing and
visualization include Knime, Shogun, Oryx 2,
TensorFlow, Weka, RapidMiner, Trifacta Wrangler and Python
[12] [13]. In this paper, we focus on removing noisy
data: identifying the numerical values, predicting and
filling in missing values, and detecting outliers that
hamper data analysis [11]. We propose a system that
simplifies the process for the user and allows for better
processing. In summary, machine learning for data cleaning
might be the
only way to provide complete and trustworthy data sets for
effective analytics, so we provide a user-friendly interface
for pre-processing and model analysis with visualization, for
the ease of the user.
3. System Design:
Data pre-processing is done with three methods:
Data Cleaning, Data Transformation and Data
Reduction. The data cleaning application processes a
raw dataset containing both textual and numerical data and
converts it into a cleaned dataset that can be used for data
analysis. Initially, users must upload the dataset on which
they want to perform the analysis. They can then choose the
operations to perform on the dataset from the modules
provided. The application performs a series of operations:
removing columns with little or no information, removing
unnecessary rows, identifying the numerical values, filling
in the missing fields and identifying the outliers. Some
columns contain little or no information, which makes it
hard to rely on them for analysis; such columns can be
removed without causing significant damage to the data.
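The column-removal step can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the paper does not say how the tool decides that a column carries "little or no information", so the list-of-dictionaries table layout, the empty-value markers, the 50% missing-value threshold and the function name `drop_sparse_columns` are all hypothetical.

```python
# Minimal sketch: drop columns that carry little or no information.
# The table layout (list of dicts), the empty-value markers and the
# 50% threshold are illustrative assumptions, not taken from the paper.

EMPTY = (None, "", "NA")


def drop_sparse_columns(rows, max_missing_ratio=0.5):
    """Return a copy of rows without columns that are mostly empty or constant."""
    if not rows:
        return []
    keep = []
    for col in rows[0]:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v not in EMPTY]
        missing_ratio = 1 - len(present) / len(values)
        # A column with at most one distinct value cannot help a model.
        is_constant = len(set(present)) <= 1
        if missing_ratio <= max_missing_ratio and not is_constant:
            keep.append(col)
    return [{col: row.get(col) for col in keep} for row in rows]


raw = [
    {"age": 25, "city": "Chennai", "fax": None, "country": "India"},
    {"age": 31, "city": "Madurai", "fax": None, "country": "India"},
    {"age": 47, "city": "", "fax": "044-1", "country": "India"},
]
cleaned = drop_sparse_columns(raw)
print(list(cleaned[0]))  # "fax" is mostly empty and "country" is constant
```

Here the mostly empty `fax` column and the constant `country` column are dropped, while `age` and `city` are kept.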
Some rows may contain empty fields, which
again interfere with proper pre-processing of the dataset.
Hence such rows are identified and removed. The dataset
will contain features ranging from numerical to
non-numerical (categorical) values. The application
requires only numerical data for analysis and prediction,
so the fields containing numeric values are identified.
If you remove all records with missing fields, you might
reduce the amount of data that is available. So, these
fields need to be filled in with appropriate values.
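Identifying the numeric fields and filling in the missing ones might look like the sketch below. The paper does not state which fill strategy the tool uses, so replacing a missing value with the column mean is an assumption made here for illustration, as are the helper names `numeric_columns` and `fill_missing_with_mean`.

```python
from statistics import mean

MISSING = (None, "")


def is_number(value):
    """True if the value can be read as a float (e.g. 65 or "170")."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def numeric_columns(rows):
    """Columns whose non-missing values all parse as numbers."""
    cols = []
    for col in rows[0]:
        present = [r[col] for r in rows if r[col] not in MISSING]
        if present and all(is_number(v) for v in present):
            cols.append(col)
    return cols


def fill_missing_with_mean(rows, cols):
    """Copy rows, converting numeric columns to float and filling gaps.

    Mean imputation is an assumed strategy, not the paper's stated method.
    """
    filled = [dict(r) for r in rows]
    for col in cols:
        present = [float(r[col]) for r in rows if r[col] not in MISSING]
        col_mean = mean(present)
        for r in filled:
            r[col] = col_mean if r[col] in MISSING else float(r[col])
    return filled


rows = [
    {"height": "170", "weight": 65, "name": "A"},
    {"height": "", "weight": 75, "name": "B"},
    {"height": "180", "weight": None, "name": "C"},
]
num_cols = numeric_columns(rows)
filled = fill_missing_with_mean(rows, num_cols)
print(num_cols, filled[1]["height"], filled[2]["weight"])
```

Filling keeps all three records, whereas dropping every row with a gap would leave only the first one.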
4. Implementation:
Outliers are data points that lie really far from the
rest of your data points. Mathematically, an outlier is usually
defined as an observation more than three standard deviations
from the mean. Outliers can show up due to errors in data entry
or measurement, or just because there is variation in the
population. Identifying and handling outliers is an important
part of data cleaning.
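The three-standard-deviation rule described above can be written as a short check. This is a sketch rather than the tool's actual code; the function name `find_outliers` and the sample readings are made up for illustration, and the population standard deviation is used as one reasonable choice.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations
    from the mean (population standard deviation is an assumed choice)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mu) > threshold * sigma]


# Nineteen ordinary readings and one suspicious one (illustrative data).
readings = [10.0] * 19 + [100.0]
outliers = find_outliers(readings)
print(outliers)
```

One caveat of this rule: with the population standard deviation, no value in a sample of n points can lie more than sqrt(n - 1) deviations from the mean, so the check can never flag anything in a sample of ten or fewer values.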
In data analysis we use the following
algorithms to analyse the cleansed data: Linear Regression,
SVM (Support Vector Machine), KNN (K-Nearest
Neighbours), Logistic Regression, Decision Tree, K-Means,
Random Forest, Naive Bayes, Dimensionality Reduction
Algorithms and Gradient Boosting Algorithms.
The Linear Regression algorithm uses the
data points to find the best-fit line to model the
data. A line can be represented by the equation y = m*x +
c, where y is the dependent variable and x is the
independent variable. Basic calculus is
applied to find the values of m and c from the given
data set. The SVM separates the data points using a line
(a hyperplane in higher dimensions). KNN predicts an
unknown data point from its k nearest
neighbours. The value of k is a critical factor for the
accuracy of prediction. The nearest neighbours are
determined using basic distance functions such as the
Euclidean distance. This algorithm requires high
computation power, and the data must first be normalized to
bring every value into the same range. The Decision Tree
algorithm is used to solve classification problems.
Criteria such as Gini, Chi-square and entropy are used to
split the data into categories. K-Means is an unsupervised
algorithm that provides a solution to the clustering
problem. The algorithm follows a procedure to form
clusters, each of which contains homogeneous data.
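Two of the algorithms described above are small enough to sketch directly. The code below is an illustration, not the paper's implementation: `fit_line` uses the standard least-squares formulas for m and c (the result of setting the derivatives of the squared error to zero), and `knn_predict` applies min-max normalisation before measuring Euclidean distance. All function names and sample data are hypothetical.

```python
import math
from collections import Counter


def fit_line(xs, ys):
    """Least-squares estimates of m and c in y = m*x + c."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    c = y_bar - m * x_bar
    return m, c


def min_max_scaler(points):
    """Build a function that maps each feature into the range [0, 1]."""
    cols = list(zip(*points))
    lows = [min(col) for col in cols]
    # Guard against constant features, whose range would be zero.
    spans = [(max(col) - min(col)) or 1.0 for col in cols]

    def scale(point):
        return tuple((v - lo) / s for v, lo, s in zip(point, lows, spans))

    return scale


def knn_predict(train_points, train_labels, query, k=3):
    """Majority label among the k nearest neighbours of the query."""
    scale = min_max_scaler(train_points)
    scaled_query = scale(query)
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(scale(pair[0]), scaled_query),
    )
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


m, c = fit_line([0, 1, 2, 3], [1, 3, 5, 7])

points = [(1.0, 100.0), (2.0, 120.0), (1.5, 110.0),
          (8.0, 900.0), (9.0, 950.0), (8.5, 1000.0)]
labels = ["low", "low", "low", "high", "high", "high"]
predicted = knn_predict(points, labels, (2.5, 150.0), k=3)
print(m, c, predicted)
```

Without the normalisation step the second feature, whose values are about a hundred times larger, would dominate the distance, which is why the data is brought into a common range first.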
Random Forest is a collection of
decision trees. Every tree estimates a classification,
and this is called a vote. We consider the vote from every
tree and choose the classification with the most votes.
Naive Bayes assumes that the features are independent of
each other. The Gradient Boosting algorithm combines
multiple weak algorithms to form an accurate one: instead
of a single estimator, it uses several, which creates a
more stable and robust algorithm. The algorithm is chosen
based on the data set and provides an efficient result for
the data analysis process.
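The voting step of a random forest can be illustrated as follows. The three "trees" are hypothetical hand-written rules standing in for trained decision trees, and the feature names are invented; only the majority vote itself reflects the description above.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent classification among the votes.

    On a tie, Counter.most_common keeps the label that was seen first.
    """
    return Counter(votes).most_common(1)[0][0]


# Hypothetical stand-ins for trained decision trees.
def tree_a(sample):
    return "spam" if sample["links"] > 3 else "ham"


def tree_b(sample):
    return "spam" if sample["caps_ratio"] > 0.5 else "ham"


def tree_c(sample):
    return "spam" if sample["length"] < 20 else "ham"


forest = [tree_a, tree_b, tree_c]
sample = {"links": 5, "caps_ratio": 0.2, "length": 10}

votes = [tree(sample) for tree in forest]
prediction = majority_vote(votes)
print(votes, prediction)
```

Two of the three trees vote "spam", so the forest's answer is "spam" even though one tree disagrees.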
5. Results:
The user can click on the Submit button
provided and then select the operations they wish to
perform on their dataset from the list of operations
provided. The user can then upload the dataset into the
application by clicking on the Upload button to start the
pre-processing. Initially the original dataset is displayed,
and then the dataset after operation 1 is displayed as the
cleansed dataset. The selected operations are performed with the
cleansed dataset; finally, the user performs the data
analysis with the required algorithm to obtain the result as a
visualization, which can be downloaded by the user.
[Screenshots omitted: Upload Noisy Dataset; Displaying the Noisy Dataset; Preprocessing the Data; Applying Machine Learning Model; Output]
6. Conclusion:
Our system performs Data Cleaning, Data
Transformation and Data Reduction in data pre-processing.
It takes raw datasets into the application,
pre-processes them to clean up all the noisy data
using pre-processing techniques, and visualizes the
cleansed data for the users once all the pre-processing is
done. This system saves a lot of time since manual cleaning
can be avoided. After cleansing, the user can select the
machine learning model, which provides efficient results
as plots. The system serves users who want to clean huge
datasets and visualize the analysis of the pre-processed
data. In future, the accuracy and comparison of the machine
learning algorithms can be provided within the
user-friendly interface.
REFERENCES
[1] Cristian Felix, Anshul Vikram Pandey, and Enrico Bertini,
“TextTile: An Interactive Visualization Tool for Seamless
Exploratory Analysis of Structured Data and Unstructured
Text”, IEEE-2018.
[2] Huawen Liu, Xuelong Li, Jiuyong Li, and Shichao
Zhang, “Efficient Outlier Detection for High-Dimensional
Data”, IEEE-2019.
[3] M. Bostock, V. Ogievetsky, and J. Heer, “Data-driven
documents”, IEEE-2011.
[4] F. Beck, S. Koch, and D. Weiskopf, “Visual Analysis and
Dissemination of Scientific Literature Collections with
SurVis”, IEEE-2016.
[5] Parke Godfrey, Jarek Gryz, and Pioter Lasek, “Interactive
visualisation of large datasets”, IEEE-2016.
[6] Dileep Kumar Koshley and Raju Hadler, “Data Cleaning: An
Abstraction-based approach”, IEEE-2015.
[7] Mehmet Adil Yalçın, Niklas Elmqvist, and Benjamin B.
Bederson, “Keshif: Rapid and Expressive Tabular Data
Exploration for Novices”, IEEE-2018.
[8] S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C.
Faloutsos, “LOCI: Fast outlier detection using the local
correlation integral”, IEEE 19th Int. Conf. Data Eng. (ICDE),
Bengaluru, India, 2003, pp. 315–326.
[9] Y. Pang, J. Cao, and X. Li, “Learning sampling distributions
for efficient object detection”, IEEE Trans. Cybern., vol. 47,
no. 1, pp. 117–129, Jan. 2017.
[10] M. Gupta, J. Gao, C. C. Aggarwal, and J. Han, “Outlier
detection for temporal data: A survey”, IEEE Trans. Knowl.
Data Eng., vol. 26, no. 9, pp. 2250–2267, Sep. 2014.
[11] S. F. Roth and J. Mattis, “Automating the presentation of
information”, in Artificial Intelligence Applications, 1991.
Proceedings, Seventh IEEE Conference on, vol. 1. IEEE,
1991, pp. 90–97.
[12] M. Bostock and J. Heer, “Protovis: A graphical toolkit
for visualization”, Visualization and Computer Graphics,
IEEE Transactions on, vol. 15, no. 6, pp. 1121–1128, 2009.
[13] A. Dziedzic, J. Duggan, A. J. Elmore, V. Gadepally, and M.
Stonebraker, “Bigdawg: a polystore for diverse interactive
applications”, in IEEE Viz Data Systems for Interactive
Analysis, 2015.
[14] P. Bohannon, W. Fan, F. Geerts, X. Jia, and A.
Kementsietsidis, “Conditional functional dependencies for
data cleaning”, in Data Engineering, IEEE 23rd International
Conference on, pp. 746–755. IEEE, 2007.