The performance of data mining algorithms, including classification, clustering, association, prediction, and others, is highly related to the approach used in data warehouse design and to the way the data is stored (lightly summarized, highly summarized, or detailed). Detailed data is important for producing detailed reports, but its sheer volume poses a major challenge to mining algorithms; summarized data, on the other hand, leads to better algorithm performance, but the loss of required knowledge may affect the overall mining process. Knowledge extraction and the performance and complexity of mining algorithms represent a major challenge in the data analysis field, so this paper proposes an approach to improve algorithm performance through a well-designed warehouse and a data reduction technique. The paper presents a hybrid galaxy warehouse model that stores data in three different formats, including detailed, summar...
Missing values in data sets represent one of the greatest challenges in analyzing data to extract knowledge. This paper presents a new approach to the missing-values problem that merges two different techniques: clustering (K-means and Expectation Maximization) and curve fitting. More than twenty thousand records of real health data collected from different Iraqi hospitals were used to build and test the proposed approach, which showed better results than the most popular missing-value estimation techniques such as most common value, overall average, class average, and class most common value. Several software tools were used in the work, including WEKA (Waikato Environment for Knowledge Analysis), Matlab, Excel, and C++.
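The cluster-then-fit idea above can be sketched in a few lines. The data, the cluster count, and the use of a straight-line fit are illustrative assumptions standing in for the paper's K-means/EM plus curve-fitting procedure:

```python
# Sketch: cluster records on a known attribute, then estimate a missing
# attribute by fitting a curve within the record's own cluster.
# Data, k=2, and the linear fit are illustrative assumptions.

def kmeans_1d(xs, k=2, iters=20):
    # Minimal 1-D k-means over the known attribute.
    centroids = [min(xs), max(xs)][:k]
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for x in xs:
            i = min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
            groups[i].append(x)
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

def fit_line(points):
    # Least-squares line y = a*x + b over one cluster's complete records.
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    var = sum((x - mx) ** 2 for x, _ in points)
    cov = sum((x - mx) * (y - my) for x, y in points)
    a = cov / var if var else 0.0
    return a, my - a * mx

def impute(records, missing_x, k=2):
    # Assign the incomplete record to the nearest centroid, then evaluate
    # the curve fitted from that cluster's complete records.
    centroids = kmeans_1d([x for x, _ in records], k)
    ci = min(range(k), key=lambda j: abs(missing_x - centroids[j]))
    cluster = [(x, y) for x, y in records
               if min(range(k), key=lambda j: abs(x - centroids[j])) == ci]
    a, b = fit_line(cluster)
    return a * missing_x + b

records = [(20, 100), (22, 104), (60, 180), (62, 184)]  # hypothetical (age, value) pairs
estimate = impute(records, 21)  # missing value estimated inside the "young" cluster
```

Unlike a global mean or most-common-value fill, the estimate here only uses records that behave like the incomplete one, which is the intuition behind combining clustering with curve fitting.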
This paper presents a proposed solution for preprocessing, analyzing, mining, and warehousing personal medical data collected from different hospitals and clinics. The solution comprises several phases and steps: Extraction, Transformation and Loading (ETL) and data preprocessing convert the logged data into categories suitable for analysis and mining; a star warehouse model was implemented to support the required processing techniques; data are represented by multi-dimensional cubes for efficient and better representation; and finally, link analysis was applied to the data. The proposed framework is simple and straightforward to implement. Personal medical data from different sources, mostly in Excel files, were converted by various preprocessing techniques into clean, complete, and consistent data. Logged data were made high quality, reliable, and suitable for analysis and mining. Star wa...
Clustering is one of the most popular knowledge extraction techniques in data mining. Hierarchical and partitioning approaches are widely used in this field, each with its own advantages, drawbacks, and goals. K-means is the most popular partitioning clustering technique; however, it suffers from two major drawbacks: time complexity and sensitivity to the initial centroid values. This paper presents an approach for estimating the starting centroids through three processes based on density, normalization, and smoothing ideas. The proposed algorithm has a strong mathematical foundation. The approach was tested using a free standard data set (20,000 records). The results showed that it has better complexity and ensures clustering convergence.
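A density-guided seeding step of this kind can be sketched as follows. The radius and the greedy separation rule are illustrative assumptions, not the authors' exact normalization, density, and smoothing steps:

```python
# Hedged sketch of density-guided centroid seeding for k-means:
# normalize, rank points by local density, keep the densest points
# that are mutually far apart, and use them as initial centroids.

def seed_centroids(data, k, radius=0.15):
    # 1) Normalize each 1-D value into [0, 1].
    lo, hi = min(data), max(data)
    span = (hi - lo) or 1.0
    norm = [(x - lo) / span for x in data]
    # 2) Density: count points within `radius` of each point.
    density = [sum(1 for y in norm if abs(x - y) <= radius) for x in norm]
    # 3) Greedily keep the densest points that are at least `radius` apart
    #    (a simple smoothing/separation step), then map back to raw scale.
    order = sorted(range(len(norm)), key=lambda i: -density[i])
    seeds = []
    for i in order:
        if all(abs(norm[i] - s) > radius for s in seeds):
            seeds.append(norm[i])
        if len(seeds) == k:
            break
    return [lo + s * span for s in seeds]

# Two dense regions around 2 and 11 -> seeds land near those modes.
seeds = seed_centroids([1, 2, 2, 3, 10, 11, 11, 12], k=2)
```

Seeding K-means from dense, well-separated points rather than random picks is what makes the subsequent iterations converge more predictably, which matches the drawback the abstract targets.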
This paper presents a proposed framework for crime and criminal data analysis and detection using decision tree algorithms for data classification and the Simple K-Means algorithm for data clustering. The paper aims to help specialists discover patterns and trends, make forecasts, find relationships and possible explanations, map criminal networks, and identify possible suspects. Classification is based mainly on grouping crimes according to type, location, time, and other attributes; clustering is based on finding relationships between crime and criminal attributes that share previously unknown common characteristics. The results of both classification and clustering are used to predict trends and behavior of the given objects (crimes and criminals). Data for both crimes and criminals were collected from free police department datasets available on the Internet to build and test the proposed framework, and these data were then prepr...
Journal of Physics: Conference Series, May 1, 2018
Text classification (TC) is an essential field in both text mining (TM) and natural language processing (NLP). Humans tend to organize and categorize everything to make things easier to understand, and text classification is an important step toward this goal. Arabic text classification (ATC) is a difficult process because the Arabic language has complications and limitations arising from the nature of its morphology. In this paper, a proposed approach called the Master-Slaves technique (MST) is used to improve Arabic text classification. It consists of two main phases: in the first phase, a new Arabic corpus of 16,757 text files was collected, and these files were manually classified into five categories. In the second phase, four different classifiers were applied to the collected corpus: Naïve Bayes (NB), K-Nearest Neighbour (KNN), Multinomial Logistic Regression (MLR), and Maximum Weight (MW). The Naïve Bayes classifier was implemented as the master and the others as slaves; the results of the slave classifiers were used to adjust the probability of the Naïve Bayes master. The four classifiers were also run individually, and a simple voting technique among them was applied to the corpus to check the effectiveness and efficiency of the proposed technique. All tests were applied after preprocessing the Arabic text documents (tokenization, stemming, and stop-word removal), with each document represented as a vector of weights. For reliability of the results, 10-fold cross-validation was used. The results showed that the Master-Slaves technique gives a good improvement in text classification accuracy with acceptable algorithm complexity compared to other techniques.
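The master-slave adjustment can be illustrated with a small sketch. The additive boost and its renormalization are assumptions for illustration; the paper's exact rule for changing the master's probabilities is not reproduced here:

```python
# Hedged sketch of the Master-Slaves decision step: the master (Naive Bayes)
# produces class probabilities, and the slave classifiers' votes nudge those
# probabilities before the final decision. The boost factor is an assumption.

def master_slaves_decide(master_probs, slave_votes, boost=0.1):
    # master_probs: {class: probability} from the Naive Bayes master.
    # slave_votes:  class labels predicted by the slave classifiers.
    adjusted = dict(master_probs)
    for label in slave_votes:
        adjusted[label] = adjusted.get(label, 0.0) + boost
    # Renormalize so the adjusted scores still sum to 1.
    total = sum(adjusted.values())
    adjusted = {c: p / total for c, p in adjusted.items()}
    return max(adjusted, key=adjusted.get), adjusted

# Hypothetical categories: the slaves overturn the master's initial choice.
label, probs = master_slaves_decide(
    {"sport": 0.40, "politics": 0.35, "economy": 0.25},
    ["politics", "politics", "sport"])
```

In this toy run the master initially prefers "sport", but two slave votes for "politics" tip the adjusted probability in its favor, which is the kind of correction the technique relies on.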
Indonesian Journal of Electrical Engineering and Computer Science
Clustering is one of the most popular and widely used data mining techniques due to its usefulness and the wide variety of real-world applications. Defining the required number of clusters is application dependent, meaning the number of clusters k is an input to the whole clustering process. The proposed approach offers a solution for estimating the optimum number of clusters. It is based on iterative K-means clustering under three criteria: centroid convergence, the total distance between the objects and their cluster centroid, and the number of migrated objects, which together can be used effectively to ensure better clustering accuracy and performance. A total of 20,000 records available on the Internet were used to test the approach. The results showed good improvement in clustering accuracy and algorithm performance over other techniques, where centroid convergence represents a maj...
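The iterative idea of re-running K-means with growing k and stopping when the criteria stop improving can be sketched as below. A single elbow-style threshold on total distance stands in for the paper's three criteria, and the data and threshold are illustrative assumptions:

```python
# Hedged sketch: estimate k by re-running 1-D k-means with increasing k and
# watching the total object-to-centroid distance; stop growing k once the
# relative improvement falls below a threshold (elbow-style rule).

def total_distance(xs, k, iters=30):
    # Plain 1-D k-means seeded with evenly spaced sorted values;
    # returns the summed distance of every object to its nearest centroid.
    xs = sorted(xs)
    centroids = [xs[i * (len(xs) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda j: abs(x - centroids[j]))].append(x)
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return sum(min(abs(x - c) for c in centroids) for x in xs)

def estimate_k(xs, kmax=6, drop=0.5):
    # Keep increasing k while total distance still falls sharply; return the
    # last k before the relative improvement drops below `drop`.
    prev = total_distance(xs, 1)
    for k in range(2, kmax + 1):
        cur = total_distance(xs, k)
        if prev == 0 or (prev - cur) / prev < drop:
            return k - 1
        prev = cur
    return kmax

# Two well-separated groups -> the estimate settles on k = 2.
best_k = estimate_k([1, 2, 3, 20, 21, 22])
```

Splitting either group further barely reduces the total distance, so the loop stops, mirroring how the paper's convergence and migrated-object criteria detect that extra clusters add nothing.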
Solving transportation problems, where products are to be supplied from one side (sources) to another (demands) with the goal of minimizing the overall transportation cost, is an activity of great importance. Most work in the field treats the problem as a two-sided model (sources such as factories and demands such as warehouses) with no connections among sources or among demands. However, real-world transportation problems may follow another model in which sources are connected in a network-like graph, where each source may supply other sources at a specific cost. This paper suggests an algorithm and a graph model with a mathematical solution for finding the minimum feasible solution for such widely occurring transportation problems. The graph representing the problem, in which all sources are connected in a network with a specific cost on each edge, is converted into a new graph: additional virtual sources representing supplies between sources are added, new costs between the added sources and the demands are calculated, and a modified Kruskal's algorithm is then applied to obtain the minimum feasible solution. The proposed solution is a straightforward model with strong mathematical and graph foundations. It can be widely used for solving real-world transportation problems with feasible time and space complexity, requiring O(E^2 + V^2) time, where E is the number of edges and V the number of vertices. Different numerical examples were used to study the effectiveness and correctness of the proposed algorithm.
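The final step builds on classic Kruskal. As a baseline sketch, here is standard Kruskal's algorithm with union-find over a small cost graph; the virtual-source construction and the paper's specific modification are not reproduced:

```python
# Baseline Kruskal's minimum spanning tree with union-find (path halving).
# The paper applies a *modified* Kruskal to its augmented source/demand
# graph; this is only the unmodified classic step it starts from.

def kruskal(n, edges):
    # edges: list of (cost, u, v) over vertices 0..n-1.
    # Returns (total_cost, picked_edges) of a minimum spanning forest.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees flat
            x = parent[x]
        return x

    total, picked = 0, []
    for cost, u, v in sorted(edges):      # consider edges cheapest-first
        ru, rv = find(u), find(v)
        if ru != rv:                      # accept only cycle-free edges
            parent[ru] = rv
            total += cost
            picked.append((u, v))
    return total, picked

# Hypothetical 4-vertex cost graph: the (3, 0, 2) edge closes a cycle
# and is rejected, leaving total cost 1 + 2 + 4 = 7.
total, picked = kruskal(4, [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3)])
```

In the paper's setting, the vertices would be the original and virtual sources plus the demands, with edge costs taken from the recalculated source-to-demand costs.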
Papers by Kadhim AlJanabi