Abstract: This paper proposes tackling the difficult course timetabling problem using a multi-agent approach. The proposed design seeks to deal with the problem using a distributed solution environment in which a mediator agent coordinates various timetabling agents that cooperate to improve a common global solution. Initial timetables provided to the multi-agent system are generated using several hybrid heuristics that combine graph colouring heuristics and local search in different ways.
Abstract: Multilingual corpora, containing the same documents in a variety of languages, are becoming an essential resource for natural language processing. Clustering multilingual corpora provides us with an insight into the differences between languages when term frequency-based Information Retrieval (IR) tools are used. It also allows one to use the Natural Language Processing (NLP) and IR tools in one language to implement IR for another language.
Abstract: The aim of this paper is to extend our non-linear great deluge algorithm into an evolutionary approach by incorporating a population and a mutation operator to solve the university course timetabling problem. This approach might be seen as a variation of memetic algorithms. Evolutionary computation approaches have grown in popularity and have become an important technique for solving complex combinatorial optimisation problems.
Problem statement: Due to the ever-growing amount of biomedical datasets stored in multiple tables, Information Extraction (IE) from these datasets is increasingly recognized as one of the crucial technologies in bioinformatics. However, for IE to be practically applicable, adaptability of a system is crucial, considering the extremely diverse demands of biomedical IE applications. One should be able to extract a set of hidden patterns from these biomedical datasets at low cost. Approach: In this study, a new method is proposed, called Bio-medical Data Aggregation for Relational Attributes (BioDARA), for automatic structuring information extraction for biomedical datasets. BioDARA summarizes biomedical data stored in multiple tables in order to facilitate data modeling efforts in a multi-relational setting. BioDARA is able to transform biomedical data stored in multiple tables or databases into a Vector Space model, summarize biomedical data using Information Retrieval theory and finally extract frequent patterns that describe the characteristics of these biomedical datasets. Results: The results show that data summarization performed by DARA can be beneficial in summarizing biomedical datasets in a complex multi-relational environment, in which biomedical datasets are stored in multiple levels of one-to-many relationships, and also in the case of datasets stored in more than one one-to-many relationship with non-target tables. Conclusion: This study concludes that data summarization performed by BioDARA can be beneficial in summarizing biomedical datasets in a complex multi-relational environment, in which biomedical datasets are stored in multiple levels of one-to-many relationships.
Knowledge discovery in both structured and unstructured datasets stored in large repository database systems has always motivated methods for data summarisation. Summarisation is closely related to compression, machine learning, and data mining. The closest connection is to data mining. Data summarisation methods for the unstructured domain usually involve text categorisation, which groups together documents that share similar characteristics. With the ever-growing number of text documents in large database systems, algorithms for text summarisation in the unstructured domain, such as document clustering, are often limited by the dimensionality of the data features. On the other hand, the application of data summarisation methods in mining data stored across multiple tables with one-to-many relations is often limited due to the complexity of the database schema. Most of the data summarisation methods that exist in relational database systems are very limited in terms of functionality and flexibility. Such algorithms summarise structured data stored in multiple tables with one-to-many relations through the use of aggregation operators, such as the mean, sum, count, min and max. These aggregation operators are interesting not only because they are able to summarise structured data stored in multiple tables with one-to-many relations, but also because they scale up well. Unfortunately, existing aggregation operators, such as min or count, provide little information about the data stored in a non-target table with high cardinality attributes. This thesis helps the understanding and development of such algorithms summarising structured data stored in a non-target table that has many-to-one relations with the target table, as well as summarising unstructured data such as text documents.
In this thesis, the feasibility of data summarisation techniques, borrowed from the Information Retrieval Theory, to summarise patterns obtained from data stored across multiple tables with one-to-many relations is demonstrated. The thesis describes the Dynamic Aggregation of Relational Attributes framework (DARA), which summarises data stored in non-target tables in order to facilitate data modelling efforts in a multi-relational setting. This thesis also studies methods to improve the descriptive accuracy of the proposed data summarisation approach to learning data stored in relational databases. These methods include the discretisation of continuous attributes and feature construction, in the context of summarising data stored in multiple tables with one-to-many relations. The application of the DARA algorithm in two application areas involving structured and unstructured data (text documents) is also presented in order to show the adaptability of this algorithm to real world problems.
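The limitation of standard aggregation operators mentioned in this abstract can be illustrated with a minimal sketch. The table contents and the `aggregate` helper below are hypothetical and are not taken from the thesis; they only show how count, sum and mean can coincide for two target records whose related rows differ.

```python
from collections import defaultdict

# Hypothetical non-target table: (target_id, amount) rows in a one-to-many relation.
rows = [(1, 10.0), (1, 30.0), (2, 20.0), (2, 20.0)]

def aggregate(rows):
    """Summarise the rows related to each target record with standard operators."""
    groups = defaultdict(list)
    for target_id, value in rows:
        groups[target_id].append(value)
    return {
        tid: {
            "count": len(vals),
            "sum": sum(vals),
            "mean": sum(vals) / len(vals),
            "min": min(vals),
            "max": max(vals),
        }
        for tid, vals in groups.items()
    }

summary = aggregate(rows)
for tid, stats in sorted(summary.items()):
    print(tid, stats)
```

In this toy data, both target records receive the same count, sum and mean even though their underlying values differ, which is the kind of information loss that motivates richer summarisation.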
Multilingual corpora, containing the same documents in a variety of languages, are becoming an essential resource for natural language processing. Clustering multilingual corpora provides us with an insight into the differences between languages when term frequency-based information retrieval (IR) tools are used. It also allows one to use the natural language processing (NLP) and IR tools in one language to implement IR for another language. For instance, in this way, the most relevant articles to be translated from language X to language Y can be selected after studying the clusters of abstracts in language Y. In this article, we report on our work on applying hierarchical agglomerative clustering (HAC) to a large corpus of documents where each appears both in Bulgarian and English. We cluster these documents for each language and compare the results both with respect to the shape of the tree and content of clusters produced. On the data available, the results of clustering one language resemble the other, provided the number of clusters required is relatively small. Further, we study the effects of reducing the set of terms used for clustering. This step appears a viable strategy for English, but is not acceptable for Bulgarian. Finally, we describe an experiment employing a genetic algorithm to fine-tune the individual term weights in order to reproduce more closely a predefined set of clusters. In this way, clustering becomes a supervised learning technique that is trained to better reproduce known clusters in language X when applied to the corresponding documents in language Y. Other possible applications include training the algorithm on a hand-clustered set of documents, and subsequently applying it to a superset, including unseen documents, incorporating in this way expert knowledge about the domain in the clustering algorithm.
Problem statement: The orang-utan is classified as a totally protected species and is listed as an endangered species in Borneo. The survival of this species is highly dependent on the existence and quality of the lowland forest of Sabah. However, most of the pristine habitats in the lowland area have been converted to other land uses such as large-scale plantations. This is due to the fact that most of the lowland forests are facing a continuous degradation process that will decrease their commercial value when it comes to generating revenue for the state government. Thus, efforts to restore the forest are vital. The main objectives of this study include establishing the relative spatial distribution of orang-utans in order to assess and determine the effects of forest conversion in four main wild orang-utan population landscapes, demonstrating the orang-utan population movement pattern as a response to heavy logging activities, and quantifying the effect of logging activities on the status of food trees or plant species for orang-utans in their current forest habitat. Approach: In this research, relevant features are constructed in order to study the impacts of logging and forest conversion on the orang-utan population in Borneo. These features include aerial surveys and feeding behaviors. An aerial survey of orang-utan nests in four out of six main forest habitats for orang-utans in Sabah was conducted between May 2005 and June 2009 in order to map the relative distribution and spatial density of the orang-utans. This was done in order to determine the impacts of forest conversion over the last 20 years upon the orang-utan spatial distribution.
In this project, three series of aerial surveys, covering Malua Forest Reserve, have been conducted to demonstrate the dynamic movement and habitat utilization of the orang-utan population in response to logging activities within the forest habitat. A long-term observation of orang-utan feeding behaviour in the degraded forest in North Ulu Segama (NUS) has also been conducted to determine their feeding ecology in a logged-over forest. Results: This study suggested that (i) forest conversion and logging activities have effects on orang-utan habitat utilization and movement patterns; (ii) due to the influx of forest conversion, some of the orang-utans in the forest reserve are concentrated in certain areas adjacent to the boundary of the forest reserve, because their movement is limited by the river network; (iii) the orang-utan population in the degraded forest of NUS consumed almost entirely different species of food plants in 2009 compared to 1974. Conclusion: These results demonstrated that forest conversion and unsustainable logging activities are the main threats to orang-utan conservation in Sabah. Therefore, the conservation of orang-utan populations is not guaranteed by the establishment of more protected areas, but by ensuring that more forest habitats can be managed in a sustainable way in order to avoid further forest degradation and to ensure their habitats remain connected to the large forest landscape.
Advanced Data Mining and Applications, Jan 1, 2006
A new approach is needed to handle huge datasets stored in multiple tables in a very large database. Data mining and Knowledge Discovery in Databases (KDD) promise to play a crucial role in the way people interact with databases, especially decision support databases where analysis and exploration operations are essential. In this paper, we present related work in Relational Data Mining, and define the basic notions of data mining for decision support and the types of data aggregation as a means of categorizing or summarizing data. We then present a novel approach to relational domain learning to support the development of decision-making models by introducing automated construction of a hierarchical multi-attribute model for decision making. We describe how a relational dataset can naturally be handled to support the construction of a hierarchical multi-attribute model by using relational aggregation based on pattern distance. We present the prototype of Dynamic Aggregation of Relational Attributes (hence called DARA), which is capable of supporting the construction of a hierarchical multi-attribute model for decision making. Our experiments in a multi-relational domain show a higher percentage of correctly classified instances, and we illustrate a set of rules extracted from the relational domains to support decision-making.
Advances in Databases and Information Systems, Jan 1, 2007
Handling numerical data stored in a relational database is different from handling numerical data stored in a single table due to the multiple occurrences of an individual record in the non-target table and non-determinate relations between tables. Most traditional data mining methods only deal with a single table and discretize columns that contain continuous numbers into nominal values. In a relational database, multiple records with numerical attributes are stored separately from the target table, and these records are usually associated with a single structured individual stored in the target table. Numbers in multi-relational data mining (MRDM) are often discretized, after considering the schema of the relational database, in order to reduce the continuous domains to more manageable symbolic domains of low cardinality, and the loss of precision is assumed to be acceptable. In this paper, we consider different alternatives for dealing with continuous attributes in MRDM. The discretization procedures considered in this paper include algorithms that do not depend on the multi-relational structure of the data and also algorithms that are sensitive to this structure. In our experiments, we study the effects of taking the one-to-many association issue into consideration in the process of discretizing continuous numbers. We implement a new method of discretization, called the entropy-instance-based discretization method, and we evaluate this discretization method with respect to C4.5 on three varieties of a well-known multi-relational database (Mutagenesis), where numeric attributes play an important role. The empirical results demonstrate that entropy-based discretization can be improved by taking into consideration the multiple-instance problem.
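For readers unfamiliar with entropy-based discretization, the following is a minimal sketch of the standard single-table variant: it chooses one cut point that minimises the weighted class entropy of the two resulting intervals. It does not reproduce the entropy-instance-based method of the paper, which additionally accounts for one-to-many associations; the function names and the sample data are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_cut(values, labels):
    """Return (cut_point, weighted_entropy) for the best binary split of a numeric attribute."""
    pairs = sorted(zip(values, labels))
    best_point, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary can be placed between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            # Place the cut midway between the two neighbouring values.
            best_point = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_point, best_score

# Hypothetical numeric attribute with two well-separated classes.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["neg", "neg", "neg", "pos", "pos", "pos"]
cut, score = best_cut(values, labels)
print(cut, score)
```

Applying `best_cut` recursively to each resulting interval, with a stopping criterion, would yield a multi-interval discretization; that extension is omitted here.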
Proceeding of the 11th East-European Conference …, Jan 1, 2007
Clustering is an essential data mining task with various types of applications. Traditional clustering algorithms are based on a vector space model representation. A relational database system often contains multi-relational information spread across multiple relations (tables). In order to cluster such data, one would need to restrict the analysis to a single representation, or to construct a feature space comprising all possible representations from the data stored in multiple tables. In this paper, we present a data summarization approach, borrowed from Information Retrieval theory, to clustering in a multi-relational environment. We find that the data summarization technique can be used here to capture the typical high volume of multiple instances and numerous forms of patterns. Our experiments demonstrate a technique to cluster data in a multi-relational environment and show the evaluation results on the mutagenesis dataset. In addition, the effect of varying the number of features considered in clustering on the classification performance is also evaluated.
Second Asia International Conference on Modelling & …, Jan 1, 2008
This paper addresses the question of whether the descriptive accuracy of the DARA (Dynamic Aggregation of Relational Attributes) algorithm benefits from the feature construction process. This involves solving the problem of constructing a set of relevant features used to generate patterns representing records in the TF-IDF weighted frequency matrix in order to cluster these records. In this paper, feature construction is applied to enhance the results of the data summarisation approach in learning data stored in multiple tables with high cardinality of one-to-many relations. It is expected that the predictive accuracy of a classification problem can be improved by improving the descriptive accuracy of the data summarisation approach, provided that the summarised data is fed into the target table as one of the features considered in the classification task.
Advanced Data Mining and Applications, Jan 1, 2008
The importance of input representation has already been recognised in machine learning. This paper discusses the application of genetic-based feature construction methods to generate input data for the data summarisation method called Dynamic Aggregation of Relational Attributes (DARA). Here, feature construction methods are applied in order to improve the descriptive accuracy of the DARA algorithm. The DARA algorithm is designed to summarise data stored in the non-target tables by clustering them into groups, where multiple records stored in non-target tables correspond to a single record stored in a target table. This paper addresses the question of whether the descriptive accuracy of the DARA algorithm benefits from the feature construction process. This involves solving the problem of constructing a relevant set of features for the DARA algorithm by using a genetic-based algorithm. This work also evaluates several scoring measures used as fitness functions to find the best set of constructed features.
Due to the widespread use of relational databases (MySQL, Oracle, DB2, MsSQL), most data are stored as multiple tables in what can be a very large database. As a result, more efficient algorithms for mining data from multi-relational domains need to be implemented. Inductive Logic Programming (ILP) techniques are useful for analyzing data in multi-relational databases. Unfortunately, even though not complex in structure, such business data are often large and contain highly non-determinate components, making them difficult for ILP learners geared towards structurally complex tasks. In this paper, we build a novel transformation-based approach to relational domain learning and describe the transformation process implemented through relational aggregation based on pattern distance. We present the prototype of “Dynamic Aggregation of Relational Attributes” (hence called DARA), which is capable of mapping a one-to-many relationship into a one-to-one relationship, while preventing loss of information, in handling classification tasks in relational domains. Our experiments in a multi-relational domain show a higher percentage of correctly classified instances, and we illustrate a set of rules extracted using our approach.
In solving the classification problem in relational data mining, traditional methods, for example C4.5 and its variants, usually require data transformations from datasets stored in multiple tables into a single table. Unfortunately, we may lose some information when we join tables with a high degree of one-to-many association. Therefore, data transformation becomes tedious trial-and-error work and the classification result is often not very promising, especially when the number of tables and the degree of one-to-many association are large. In this paper, we propose a genetic semi-supervised clustering technique as a means of aggregating data in multiple tables for the classification problem in relational databases. This algorithm is suitable for classification of datasets with a high degree of one-to-many associations. It can be used in two ways. One is user-controlled clustering, where the user may control the result of clustering by varying the compactness of the spherical cluster. The other is automatic clustering, where a non-overlap clustering strategy is applied. In this paper, we use the latter method to dynamically cluster multiple instances, as a means of aggregating them, and illustrate the effectiveness of this method using the semi-supervised genetic algorithm-based clustering technique.
Advanced Data Mining and Applications, Jan 1, 2009
Multilingual corpora are becoming an essential resource for work in multilingual natural language processing. The aim of this paper is to investigate the effects of applying a clustering technique to parallel multilingual texts. It is interesting to look at the differences between the cluster mappings and the tree structures of the clusters. The effect of reducing the set of terms considered in clustering parallel corpora is also studied. After that, a genetic-based algorithm is applied to optimize the weights of terms considered in clustering the texts to classify unseen examples of documents. Specifically, the aim of this work is to introduce the tools necessary for this task and display a set of experimental results and issues which have become apparent.
Abstract: Problem statement: The importance of input representation has already been recognized in machine learning. Feature construction is one of the methods used to generate relevant features for learning data. This study addressed the question of whether the descriptive accuracy of the DARA algorithm benefits from the feature construction process. In other words, this paper discusses the application of a genetic algorithm to optimize the feature construction process to generate input data for the data summarization method called Dynamic Aggregation of Relational Attributes (DARA). Approach: The DARA algorithm was designed to summarize data stored in the non-target tables by clustering them into groups, where multiple records stored in non-target tables correspond to a single record stored in a target table. Here, feature construction methods are applied in order to improve the descriptive accuracy of the DARA algorithm. The task involved solving the problem of constructing a relevant set of features for the DARA algorithm by using a genetic-based algorithm. Results: The experimental results show that the quality of summarized data is directly influenced by the methods used to create patterns that represent records in the (n×p) TF-IDF weighted frequency matrix. The evaluation of the genetic-based feature construction algorithm showed that the data summarization results can be improved by constructing features using the Cluster Entropy (CE) genetic-based feature construction algorithm. Conclusion: This study showed that the data summarization results can be improved by constructing features using the cluster entropy genetic-based feature construction algorithm.
Advanced Data Mining and Applications, Jan 1, 2007
Most aggregation functions are limited to either categorical or numerical values but not both. In this paper, we define three concepts of aggregation function and introduce a novel method to aggregate multiple instances that consist of both categorical and numerical values. We show how these concepts can be implemented using clustering techniques. In our experiment, we discretize continuous values before applying the aggregation function on relational datasets. With the empirical results obtained, we demonstrate that our transformation approach using clustering techniques, as a means of aggregating multiple instances of attribute values, can compete with existing multi-relational techniques, such as Progol and Tilde. In addition, the effect of the number of intervals for discretization on the classification performance is also evaluated.
Advanced Data Mining and Applications, Jan 1, 2010
Distance-based classification is one of the popular methods for classifying instances using a point-to-point distance based on the nearest neighbour or k-nearest neighbour (k-NN). The distance measure can be one of the various measures available (e.g. Euclidean distance, Manhattan distance, Mahalanobis distance or other specific distance measures). In this paper, we propose a modified nearest neighbour method called Nearest Neighbour Distance Matrix (NNDM) for classification based on unsupervised and supervised distance matrices. In the proposed NNDM method, a Euclidean distance method coupled with a distance loss function is used to create a distance matrix. In our approach, the distances of each instance to the rest of the training instances are used to create the training distance matrix (TADM). Then, the TADM is used to classify a new instance. In supervised NNDM, two instances that belong to different classes are pushed apart from each other. This is to ensure that the instances that are located next to each other belong to the same class. Based on the experimental results, we found that the trained distance matrix yields reasonable performance in classification.
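As background for the abstract above, the sketch below shows a plain Euclidean distance matrix over training instances and a k-NN vote for a new instance. It is only a baseline illustration: the distance loss function and the supervised step that pushes apart instances of different classes are not specified in the abstract and are not reproduced. All function names and data points are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(points):
    """Pairwise distances between all training instances."""
    return [[euclidean(p, q) for q in points] for p in points]

def classify(new_point, train_points, train_labels, k=1):
    """Majority vote among the k training instances nearest to new_point."""
    ranked = sorted(
        range(len(train_points)),
        key=lambda i: euclidean(new_point, train_points[i]),
    )
    votes = [train_labels[i] for i in ranked[:k]]
    return max(set(votes), key=votes.count)

# Hypothetical two-class training set.
train_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
train_labels = ["a", "a", "b", "b"]

matrix = distance_matrix(train_points)
prediction = classify((0.2, 0.4), train_points, train_labels, k=3)
print(matrix[0][1], prediction)
```

An odd `k` is used in the example so that the two-class vote cannot tie.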
Advanced Data Mining and Applications, Jan 1, 2009
Although the TF-IDF weighted frequency matrix (vector space model) has been widely studied and used in document clustering or document categorisation, there has been no attempt to extend this application to relational data that contain one-to-many associations between records. This paper explains the rationale for using TF-IDF (term frequency inverse document frequency), a technique for weighting data attributes borrowed from Information Retrieval theory, to summarise datasets stored in a multi-relational setting with one-to-many relationships. A novel data summarisation algorithm based on TF-IDF is introduced, which is referred to as Dynamic Aggregation of Relational Attributes (DARA). The DARA algorithm applies clustering techniques in order to summarise these datasets. The experimental results show that the DARA algorithm finds solutions with much greater accuracy.
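The idea of treating each target record as a "document" whose "terms" are the attribute values of its related non-target rows can be sketched as follows. This is an illustration under stated assumptions, not the DARA implementation: the weighting uses raw term counts and idf = log(n / df), which is one common TF-IDF variant, and the `bags` data and `tf_idf` helper are hypothetical.

```python
import math
from collections import Counter

# Hypothetical data: each target record id maps to the attribute values
# found in its related rows of a non-target table (one-to-many relation).
bags = {
    "r1": ["a", "a", "b"],
    "r2": ["a", "c"],
    "r3": ["c", "c", "c"],
}

def tf_idf(bags):
    """Weight each value by its count in the record times log(n / document frequency)."""
    n = len(bags)
    # Document frequency: in how many target records each value occurs at least once.
    df = Counter(term for terms in bags.values() for term in set(terms))
    weights = {}
    for record_id, terms in bags.items():
        tf = Counter(terms)
        weights[record_id] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return weights

weights = tf_idf(bags)
for record_id, vector in sorted(weights.items()):
    print(record_id, vector)
```

The resulting sparse vectors could then be passed to any clustering routine so that each target record is summarised by a cluster label; that step is left out of this sketch.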
Abstract This paper proposes tackling the difficult course timetabling problem using a multi-agen... more Abstract This paper proposes tackling the difficult course timetabling problem using a multi-agent approach. The proposed design seeks to deal with the problem using a distributed solution environment in which a mediator agent coordinates various timetabling agents that cooperate to improve a common global solution. Initial timetables provided to the multi-agent system are generated using several hybrid heuristics that combine graph colouring heuristics and local search in different ways.
Abstract—Multi Multilingual corpora, containing the same documents in a variety of languages, are... more Abstract—Multi Multilingual corpora, containing the same documents in a variety of languages, are becoming an essential resource for natural language processing. Clustering multilingual corpora provides us with an insight into the differences between languages when term frequencybased Information Retrieval (IR) tools are used. It also allows one to use the Natural Language Processing (NLP) and IR tools in one language to implement IR for another language.
Abstract The aim of this paper is to extend our non-linear great deluge algorithm into an evoluti... more Abstract The aim of this paper is to extend our non-linear great deluge algorithm into an evolutionary approach by incorporating a population and a mutation operator to solve the university course timetabling problems. This approach might be seen as a variation of memetic algorithms. The popularity of evolutionary computation approaches has increased and become an important technique in solving complex combinatorial optimisation problems.
Problem statement: Due to the ever growing amount of biomedical datasets stored in multiple table... more Problem statement: Due to the ever growing amount of biomedical datasets stored in multiple tables, Information Extraction (IE) from these datasets is increasingly recognized as one of the crucial technologies in bioinformatics. However, for IE to be practically applicable, adaptability of a system is crucial, considering extremely diverse demands in biomedical IE application. One should be able to extract a set of hidden patterns from these biomedical datasets at low cost. Approach: In this study, a new method is proposed, called Bio-medical Data Aggregation for Relational Attributes (BioDARA), for automatic structuring information extraction for biomedical datasets. BioDARA summarizes biomedical data stored in multiple tables in order to facilitate data modeling efforts in a multi-relational setting. BioDARA has the advantages or capabilities to transform biomedical data stored in multiple tables or databases into a Vector Space model, summarize biomedical data using the Information Retrieval theory and finally extract frequent patterns that describe the characteristics of these biomedical datasets. Results: the results show that data summarization performed by DARA, can be beneficial in summarizing biomedical datasets in a complex multi-relational environment, in which biomedical datasets are stored in a multi-level of one-to-many relationships and also in the case of datasets stored in more than one one-to-many relationships with non-target tables. Conclusion: This study concludes that data summarization performed by BioDARA, can be beneficial in summarizing biomedical datasets in a complex multi-relational environment, in which biomedical datasets are stored in a multi-level of one-to-many relationships.
Knowledge discovery in both structured and unstructured datasets stored in large repository datab... more Knowledge discovery in both structured and unstructured datasets stored in large repository database systems has always motivated methods for data summarisation. Summarisation is closely related to compression, machine learning, and data mining. The closest connection is to data mining. Data summarisation methods for the unstructured domain usually involve text categorisation which groups together documents that share similar characteristics. With the ever growing number of text documents in large database systems, algorithms for text summarisation in the unstructured domain, such as document clustering, are often limited by the dimensionality of the data features. On the other hand, the application of data summarisation methods in mining data, stored across multiple tables with one-to-many relations, is often limited due to the complexity of the database schema. Most of the data summarisation methods that exist in relational database systems are very limited in term of functionality and flexibility. Such algorithms summarise structured data stored in multiple tables with one-to-many relations through the use of aggregation operators, such as the mean, sum, count, min and max. These aggregation operators are interesting not only because they are able to summarise structured data stored in multiple tables with one-to-many relations, but also because they scale up well. Unfortunately, existing aggregation operators, such as min or count, provide little information about the data stored in a non-target table with high cardinality attributes. This thesis helps the understanding and development of such algorithms summarising structured data stored in a non-target table that has many-to-one relations with the target table, as well as summarising unstructured data such as text documents. 
In this thesis, the feasibility of data summarisation techniques, borrowed from the Information Retrieval Theory, to summarise patterns obtained from data stored across multiple tables with one-to-many relations is demonstrated. The thesis describes the Dynamic Aggregation of Relational Attributes framework (DARA), which summarises data stored in non-target tables in order to facilitate data modelling efforts in a multi-relational setting. This thesis also studies methods to improve the descriptive accuracy of the proposed data summarisation approach to learning data stored in relational databases. These methods include the discretisation of continuous attributes and feature construction, in the context of summarising data stored in multiple tables with one-to-many relations. The application of the DARA algorithm in two application areas involving structured and unstructured data (text documents) is also presented in order to show the adaptability of this algorithm to real world problems.
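The aggregation operators named in the abstract above (mean, sum, count, min, max) can be illustrated with a minimal sketch. This is not code from the thesis: the `aggregate` helper, the `purchases` table and its column names are hypothetical, chosen only to show how several non-target rows collapse into one summary per target record.

```python
from collections import defaultdict
from statistics import mean


def aggregate(rows, key, value):
    """Group non-target rows by foreign key and summarise one numeric column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {
        k: {
            "count": len(v),
            "sum": sum(v),
            "mean": mean(v),
            "min": min(v),
            "max": max(v),
        }
        for k, v in groups.items()
    }


# Hypothetical non-target table: several purchase rows per customer.
purchases = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 2, "amount": 5.0},
]

# One summary row per target record (customer).
summary = aggregate(purchases, "customer_id", "amount")
```

Each summary keeps only five numbers per record, which is why such operators scale well but say little about high-cardinality attributes.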
Multilingual corpora, containing the same documents in a variety of languages, are becoming an essential resource for natural language processing. Clustering multilingual corpora provides us with an insight into the differences between languages when term-frequency-based information retrieval (IR) tools are used. It also allows one to use the natural language processing (NLP) and IR tools in one language to implement IR for another language. For instance, in this way, the most relevant articles to be translated from language X to language Y can be selected after studying the clusters of abstracts in language Y. In this article, we report on our work on applying hierarchical agglomerative clustering (HAC) to a large corpus of documents where each appears both in Bulgarian and English. We cluster these documents for each language and compare the results both with respect to the shape of the tree and content of clusters produced. On the data available, the results of clustering one language resemble the other, provided the number of clusters required is relatively small. Further, we study the effects of reducing the set of terms used for clustering. This step appears a viable strategy for English, but is not acceptable for Bulgarian. Finally, we describe an experiment employing a genetic algorithm to fine-tune the individual term weights in order to reproduce more closely a predefined set of clusters. In this way, clustering becomes a supervised learning technique that is trained to better reproduce known clusters in language X when applied to the corresponding documents in language Y. Other possible applications include training the algorithm on a hand-clustered set of documents, and subsequently applying it to a superset, including unseen documents, incorporating in this way expert knowledge about the domain in the clustering algorithm.
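As a rough illustration of hierarchical agglomerative clustering on term-frequency vectors, the sketch below repeatedly merges the two closest clusters until `k` remain. The abstract does not state the linkage or distance used, so average linkage and cosine distance are assumptions here, and the four toy documents are invented for the example.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def hac(docs, k):
    """Average-linkage agglomerative clustering; returns lists of doc indices."""
    vectors = [Counter(d.lower().split()) for d in docs]
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                total = sum(
                    cosine_distance(vectors[a], vectors[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Invented toy corpus: two documents per topic.
docs = [
    "timetabling agents schedule courses",
    "course timetabling with agents",
    "orang-utan forest habitat survey",
    "forest habitat of the orang-utan",
]
clusters = hac(docs, 2)
```

Recording the order of merges instead of stopping at `k` would give the tree whose shape the article compares across languages.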
Problem statement: The orang-utan is classified as a totally protected species and is listed as an endangered species in Borneo. The survival of this species is highly dependent on the existence and quality of the lowland forest of Sabah. However, most of the pristine habitats in the lowland area have been converted to other land uses such as large-scale plantations. This is due to the fact that most of the lowland forests are facing a continuous degradation process that will decrease their commercial value when it comes to generating revenue for the state government. Thus, efforts to restore the forest are vital. The main objectives of this study are to establish the relative spatial distribution of orang-utans in order to assess the effects of forest conversion in four main wild orang-utan population landscapes, to demonstrate the orang-utan population movement pattern as a response to heavy logging activities, and to quantify the effect of logging activities on the status of food trees or plant species for orang-utans in their current forest habitat. Approach: In this research, relevant features are constructed in order to study the impacts of logging and forest conversion on the orang-utan population in Borneo. These features include aerial surveys and feeding behaviours. An aerial survey of orang-utan nests in four of the six main forest habitats for orang-utans in Sabah was conducted between May 2005 and June 2009 in order to map the relative distribution and spatial density of the orang-utans. This was done in order to determine the impacts of forest conversion over the last 20 years on the orang-utan spatial distribution.
In this project, three series of aerial surveys covering Malua Forest Reserve were conducted to demonstrate the dynamic movement and habitat utilization of the orang-utan population in response to logging activities within the forest habitat. A long-term observation of orang-utan feeding behaviour in the degraded forest in North Ulu Segama (NUS) was also conducted to determine their feeding ecology in a logged-over forest. Results: This study suggested that (i) forest conversion and logging activities have effects on orang-utan habitat utilization and movement patterns; (ii) due to the influx of forest conversion, some of the orang-utans in the forest reserve are concentrated in certain areas adjacent to the boundary of the forest reserve, because their movement is limited by the river network; (iii) the orang-utan population in the degraded forest of NUS consumed almost entirely different species of food plants in 2009 compared to 1974. Conclusion: These results demonstrated that forest conversion and unsustainable logging activities are the main threat to orang-utan conservation in Sabah. Therefore, the conservation of orang-utan populations is not guaranteed by the establishment of more protected areas, but by ensuring that more forest habitats can be managed in a sustainable way in order to avoid further forest degradation and to ensure their habitats remain connected to the large forest landscape.
Advanced Data Mining and Applications, Jan 1, 2006
A new approach is needed to handle huge datasets stored in multiple tables in a very large database. Data mining and Knowledge Discovery in Databases (KDD) promise to play a crucial role in the way people interact with databases, especially decision support databases where analysis and exploration operations are essential. In this paper, we present related work in Relational Data Mining, and define the basic notions of data mining for decision support and the types of data aggregation as a means of categorizing or summarizing data. We then present a novel approach to relational domain learning to support the development of decision-making models by introducing automated construction of a hierarchical multi-attribute model for decision making. We describe how a relational dataset can naturally be handled to support the construction of a hierarchical multi-attribute model by using relational aggregation based on pattern distance. We present the prototype of Dynamic Aggregation of Relational Attributes (hereafter called DARA), which is capable of supporting the construction of a hierarchical multi-attribute model for decision making. We show experimentally that this approach yields a higher percentage of correctly classified instances in a multi-relational domain, and illustrate a set of rules extracted from the relational domains to support decision making.
Advances in Databases and Information Systems, Jan 1, 2007
Handling numerical data stored in a relational database is different from handling numerical data stored in a single table, due to the multiple occurrences of an individual record in the non-target table and non-determinate relations between tables. Most traditional data mining methods only deal with a single table and discretize columns that contain continuous numbers into nominal values. In a relational database, multiple records with numerical attributes are stored separately from the target table, and these records are usually associated with a single structured individual stored in the target table. Numbers in multi-relational data mining (MRDM) are often discretized, after considering the schema of the relational database, in order to reduce the continuous domains to more manageable symbolic domains of low cardinality, and the loss of precision is assumed to be acceptable. In this paper, we consider different alternatives for dealing with continuous attributes in MRDM. The discretization procedures considered in this paper include algorithms that do not depend on the multi-relational structure of the data and also algorithms that are sensitive to this structure. In this experiment, we study the effects of taking the one-to-many association issue into consideration in the process of discretizing continuous numbers. We implement a new method of discretization, called the entropy-instance-based discretization method, and we evaluate this discretization method with respect to C4.5 on three varieties of a well-known multi-relational database (Mutagenesis), where numeric attributes play an important role. The empirical results demonstrate that entropy-based discretization can be improved by taking into consideration the multiple-instance problem.
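For readers unfamiliar with entropy-based discretization, the sketch below finds the single cut point on a continuous attribute that minimises the weighted class entropy of the two resulting intervals. It is the plain single-table baseline, not the entropy-instance-based method proposed in the paper (whose details the abstract does not give); the function names and sample values are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut(values, labels):
    """Return (cut_point, weighted_entropy) for the best binary split."""
    pairs = sorted(zip(values, labels))
    best_point, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            # Place the cut midway between the two neighbouring values.
            best_point = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_point, best_score


# Illustrative attribute values with their class labels.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
cut, score = best_cut(values, labels)
```

Applying `best_cut` recursively to each interval, with a stopping rule, gives a multi-interval discretizer. A multi-relational variant would additionally have to account for several rows belonging to one target record.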
Proceeding of the 11th East-European Conference …, Jan 1, 2007
Clustering is an essential data mining task with various types of applications. Traditional clustering algorithms are based on a vector space model representation. A relational database system often contains multi-relational information spread across multiple relations (tables). In order to cluster such data, one would need to restrict the analysis to a single representation, or to construct a feature space comprising all possible representations from the data stored in multiple tables. In this paper, we present a data summarization approach, borrowed from Information Retrieval theory, to clustering in a multi-relational environment. We find that the data summarization technique can be used here to capture the typical high volume of multiple instances and numerous forms of patterns. Our experiments demonstrate a technique to cluster data in a multi-relational environment and show the evaluation results on the mutagenesis dataset. In addition, the effect of varying the number of features considered in clustering on the classification performance is also evaluated.
Second Asia International Conference on Modelling & …, Jan 1, 2008
This paper addresses the question of whether the descriptive accuracy of the DARA (Dynamic Aggregation of Relational Attributes) algorithm benefits from the feature construction process. This involves solving the problem of constructing a set of relevant features used to generate patterns representing records in the TF-IDF weighted frequency matrix in order to cluster these records. In this paper, feature construction is applied to enhance the results of the data summarisation approach in learning data stored in multiple tables with high cardinality of one-to-many relations. It is expected that the predictive accuracy of a classification problem can be improved by improving the descriptive accuracy of the data summarisation approach, provided that the summarised data is fed into the target table as one of the features considered in the classification task.
Advanced Data Mining and Applications, Jan 1, 2008
The importance of input representation has long been recognised in machine learning. This paper discusses the application of genetic-based feature construction methods to generate input data for the data summarisation method called Dynamic Aggregation of Relational Attributes (DARA). Here, feature construction methods are applied in order to improve the descriptive accuracy of the DARA algorithm. The DARA algorithm is designed to summarise data stored in the non-target tables by clustering them into groups, where multiple records stored in non-target tables correspond to a single record stored in a target table. This paper addresses the question of whether the descriptive accuracy of the DARA algorithm benefits from the feature construction process. This involves solving the problem of constructing a relevant set of features for the DARA algorithm by using a genetic-based algorithm. This work also evaluates several scoring measures used as fitness functions to find the best set of constructed features.
Due to the widespread use of relational databases (MySQL, Oracle, DB2, MsSQL), most data are stored as multiple tables in what can be a very large database. As a result, more efficient algorithms for mining data from multi-relational domains need to be implemented. Inductive Logic Programming (ILP) techniques are useful for analyzing data in multi-relational databases. Unfortunately, even though not complex in structure, such business data are often large and contain highly non-determinate components, making them difficult for ILP learners geared towards structurally complex tasks. In this paper, we build a novel transformation-based approach to relational domain learning and describe the transformation process implemented through relational aggregation based on pattern distance. We present the prototype of "Dynamic Aggregation of Relational Attributes" (hereafter called DARA), which is capable of mapping a one-to-many relationship into a one-to-one relationship, while preventing loss of information, in handling classification tasks in relational domains. We show experimentally that this approach yields a higher percentage of correctly classified instances in a multi-relational domain, and illustrate a set of rules extracted using our approach.
In solving the classification problem in relational data mining, traditional methods, for example C4.5 and its variants, usually require data transformations from datasets stored in multiple tables into a single table. Unfortunately, we may lose some information when we join tables with a high degree of one-to-many association. Therefore, data transformation becomes tedious trial-and-error work, and the classification result is often not very promising, especially when the number of tables and the degree of one-to-many association are large. In this paper, we propose a genetic semi-supervised clustering technique as a means of aggregating data in multiple tables for the classification problem in relational databases. This algorithm is suitable for classification of datasets with a high degree of one-to-many associations. It can be used in two ways. One is user-controlled clustering, where the user may control the result of clustering by varying the compactness of the spherical cluster. The other is automatic clustering, where a non-overlap clustering strategy is applied. In this paper, we use the latter method to dynamically cluster multiple instances, as a means of aggregating them, and illustrate the effectiveness of this method using the semi-supervised genetic algorithm-based clustering technique.
Advanced Data Mining and Applications, Jan 1, 2009
Multilingual corpora are becoming an essential resource for work in multilingual natural language processing. The aim of this paper is to investigate the effects of applying a clustering technique to parallel multilingual texts. It is interesting to look at the differences between the cluster mappings and the tree structures of the clusters. The effect of reducing the set of terms considered in clustering parallel corpora is also studied. After that, a genetic-based algorithm is applied to optimize the weights of terms considered in clustering the texts to classify unseen examples of documents. Specifically, the aim of this work is to introduce the tools necessary for this task and display a set of experimental results and issues which have become apparent.
Abstract: Problem statement: The importance of input representation has long been recognized in machine learning. Feature construction is one of the methods used to generate relevant features for learning data. This study addressed the question of whether the descriptive accuracy of the DARA algorithm benefits from the feature construction process. In other words, this paper discusses the application of a genetic algorithm to optimize the feature construction process to generate input data for the data summarization method called Dynamic Aggregation of Relational Attributes (DARA). Approach: The DARA algorithm was designed to summarize data stored in the non-target tables by clustering them into groups, where multiple records stored in non-target tables correspond to a single record stored in a target table. Here, feature construction methods are applied in order to improve the descriptive accuracy of the DARA algorithm. The task involved solving the problem of constructing a relevant set of features for the DARA algorithm by using a genetic-based algorithm. Results: The experimental results show that the quality of summarized data is directly influenced by the methods used to create patterns that represent records in the (n×p) TF-IDF weighted frequency matrix. The results of the evaluation of the genetic-based feature construction algorithm showed that the data summarization results can be improved by constructing features using the Cluster Entropy (CE) genetic-based feature construction algorithm. Conclusion: This study showed that the data summarization results can be improved by constructing features using the cluster entropy genetic-based feature construction algorithm.
Advanced Data Mining and Applications, Jan 1, 2007
Most aggregation functions are limited to either categorical or numerical values but not both. In this paper, we define three concepts of aggregation function and introduce a novel method to aggregate multiple instances that consist of both categorical and numerical values. We show how these concepts can be implemented using clustering techniques. In our experiment, we discretize continuous values before applying the aggregation function on relational datasets. With the empirical results obtained, we demonstrate that our transformation approach using clustering techniques, as a means of aggregating multiple instances of attribute values, can compete with existing multi-relational techniques, such as Progol and Tilde. In addition, the effect of the number of intervals for discretization on the classification performance is also evaluated.
Advanced Data Mining and Applications, Jan 1, 2010
Distance-based classification is one of the popular methods for classifying instances using a point-to-point distance based on the nearest neighbour or k-nearest neighbour (k-NN). The distance measure can be one of the various measures available (e.g. Euclidean distance, Manhattan distance, Mahalanobis distance or other specific distance measures). In this paper, we propose a modified nearest neighbour method called Nearest Neighbour Distance Matrix (NNDM) for classification based on unsupervised and supervised distance matrices. In the proposed NNDM method, a Euclidean distance method coupled with a distance loss function is used to create a distance matrix. In our approach, the distances of each instance to the rest of the training instances are used to create the training distance matrix (TADM). Then, the TADM is used to classify a new instance. In supervised NNDM, two instances that belong to different classes are pushed apart from each other. This is to ensure that the instances that are located next to each other belong to the same class. Based on the experimental results, we found that the trained distance matrix yields reasonable performance in classification.
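The two building blocks the abstract starts from, a pairwise Euclidean distance matrix over the training instances and nearest-neighbour classification of a new instance, can be sketched as follows. This is ordinary 1-NN, not the NNDM method itself: the distance loss function and the supervised "pushing apart" step are not specified in the abstract and are deliberately left out. All names and data points are illustrative assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(points):
    """Pairwise distances between all training instances."""
    return [[euclidean(p, q) for q in points] for p in points]


def classify_1nn(train_x, train_y, query):
    """Label of the training instance closest to the query."""
    distances = [euclidean(x, query) for x in train_x]
    nearest = min(range(len(train_x)), key=distances.__getitem__)
    return train_y[nearest]


# Illustrative training set with two well-separated classes.
train_x = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
train_y = ["low", "low", "high", "high"]

matrix = distance_matrix(train_x)
label = classify_1nn(train_x, train_y, (5.5, 4.0))
```

A supervised variant would adjust the entries of `matrix` for pairs with different labels before classification; how NNDM does so is described in the paper, not here.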
Advanced Data Mining and Applications, Jan 1, 2009
Although the TF-IDF weighted frequency matrix (vector space model) has been widely studied and used in document clustering or document categorisation, there has been no attempt to extend this application to relational data that contain one-to-many associations between records. This paper explains the rationale for using TF-IDF (term frequency inverse document frequency), a technique for weighting data attributes, borrowed from Information Retrieval theory, to summarise datasets stored in a multi-relational setting with one-to-many relationships. A novel data summarisation algorithm based on TF-IDF is introduced, which is referred to as Dynamic Aggregation of Relational Attributes (DARA). The DARA algorithm applies clustering techniques in order to summarise these datasets. The experimental results show that using the DARA algorithm finds solutions with much greater accuracy.
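To make the TF-IDF idea concrete in a relational setting, the sketch below treats each target record as a "document" and the patterns found in its non-target rows as "terms". It uses the textbook weighting tf × log(n / df). Whether DARA uses exactly this variant is not stated in the abstract, so treat the formula, the `tfidf` helper and the toy pattern names as assumptions.

```python
import math
from collections import Counter


def tfidf(bags):
    """Map each record to {pattern: tf * log(n / df)} weights."""
    n = len(bags)
    df = Counter()
    for patterns in bags.values():
        df.update(set(patterns))  # count each pattern once per record
    weights = {}
    for record, patterns in bags.items():
        tf = Counter(patterns)
        weights[record] = {p: tf[p] * math.log(n / df[p]) for p in tf}
    return weights


# Hypothetical bags of patterns, one bag per target record.
bags = {
    "mol1": ["C-single", "C-single", "N-double"],
    "mol2": ["C-single", "O-double"],
    "mol3": ["C-single"],
}
weights = tfidf(bags)
```

A pattern that occurs in every record, such as `C-single` here, receives weight zero, while rarer patterns dominate the vector. The resulting vectors can then be clustered to produce one summary label per target record.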