Risk modelling and multi-objective optimization problems have been at the center of attention for supply chain managers. In this paper, we introduce a dataset for risk modelling in sophisticated supply chain networks based on formal mathematical models. We discuss the methodology and simulation tools used to synthesize the dataset. Additionally, the underlying mathematical models are discussed in granular detail, along with directions for conducting statistical analyses or building neural machine learning models on the data. The simulation is performed using MATLAB™ Simulink, and the models are illustrated as well.
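The dataset and Simulink models themselves are not reproduced here; as a purely illustrative sketch of the kind of multi-objective cost-versus-risk analysis such a dataset is intended to support, the following Python snippet generates synthetic scenario data (all column names, distributions, and weights are hypothetical) and ranks scenarios with a simple weighted-sum scalarization.

```python
# Hypothetical, synthetic illustration only; not the paper's dataset or Simulink models.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n = 1000  # number of simulated supplier-route scenarios

df = pd.DataFrame({
    "lead_time_days":  rng.gamma(shape=4.0, scale=2.5, size=n),   # delivery delay
    "unit_cost":       rng.normal(loc=10.0, scale=1.5, size=n),   # procurement cost
    "disruption_prob": rng.beta(a=2.0, b=20.0, size=n),           # chance of supply failure
})

# Weighted-sum scalarization of a two-objective (cost vs. risk) trade-off;
# tracing the full Pareto front would vary the weights instead of fixing them.
w_cost, w_risk = 0.6, 0.4
cost_norm = (df["unit_cost"] - df["unit_cost"].min()) / (df["unit_cost"].max() - df["unit_cost"].min())
risk = df["disruption_prob"] * df["lead_time_days"]
df["score"] = w_cost * cost_norm + w_risk * risk / risk.max()

print(df.nsmallest(5, "score"))  # most attractive scenarios under these weights
```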
The Z-value is an attempt to estimate the statistical significance of a Smith-Waterman dynamic alignment score (SW-score) through the use of a Monte-Carlo process. It partly reduces the bias induced by the composition and length of the sequences. This paper is not a theoretical study of the distribution of SW-scores and Z-values. Rather, it presents a statistical analysis of Z-values on large datasets of protein sequences, leading to a law of probability that the experimental Z-values follow. First, we determine the relationships between the computed Z-value, an estimate of its variance, and the number of randomizations in the Monte-Carlo process. Next, we illustrate that Z-values are less correlated with sequence lengths than SW-scores. We then show that pairwise alignments, performed on 'quasi-real' sequences (i.e., randomly shuffled sequences of the same length and amino acid composition as the real ones), lead to Z-value distributions that statistically fit the extreme val...
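As a minimal sketch of the Monte-Carlo idea behind the Z-value (not the paper's exact protocol, and with a simplified match/mismatch/gap scoring scheme standing in for a substitution-matrix-based SW-score), one can shuffle one sequence repeatedly and measure how far the true score lies above the shuffled-score distribution:

```python
# Simplified sketch: linear-gap Smith-Waterman score plus Monte-Carlo Z-value.
import random
import statistics

def sw_score(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score with linear gap penalties."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best

def z_value(a, b, n_shuffles=100, seed=0):
    """How many standard deviations the true score lies above scores against
    shuffled versions of one sequence (same length and composition)."""
    rng = random.Random(seed)
    true_score = sw_score(a, b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        perm = list(b)
        rng.shuffle(perm)                      # preserves length and composition
        shuffled_scores.append(sw_score(a, "".join(perm)))
    mu = statistics.mean(shuffled_scores)
    sigma = statistics.stdev(shuffled_scores)
    return (true_score - mu) / sigma

print(z_value("HEAGAWGHEE", "PAWHEAE"))
```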
Classification and regression trees are becoming increasingly popular for partitioning data and identifying local structure in small and large datasets. Classification trees include those models in which the dependent variable (the predicted variable) is categorical. Regression trees include those in which it is continuous. This paper discusses pitfalls in the use of these methods and highlights where they are
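A quick way to see the categorical-versus-continuous distinction is to fit both tree types with scikit-learn; the iris data and depth limit below are illustrative choices, not the paper's, and capping the depth guards against one of the common pitfalls (overfitting):

```python
# Illustrative only: iris data and model settings are not taken from the paper.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X, y = load_iris(return_X_y=True)

# Classification tree: the dependent variable is a categorical class label.
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print("predicted class:", clf.predict(X[:1]))

# Regression tree: the dependent variable is continuous (here, petal width
# predicted from the other three measurements).
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X[:, :3], X[:, 3])
print("predicted petal width:", reg.predict(X[:1, :3]))
```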
Outlier (or anomaly) detection is an important problem in many domains, including fraud detection, risk analysis, network intrusion and medical diagnosis, and the discovery of significant outliers is becoming an integral aspect of data mining. This paper presents CURIO, a novel algorithm that uses quantisation and implied distance metrics to achieve running time linear in the number of objects while requiring only two sequential scans of disk-resident datasets. CURIO includes a novel direct quantisation technique and the ...
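The CURIO algorithm itself is not reproduced here, but the following toy sketch conveys the general quantisation idea behind grid-based outlier detection: count points per grid cell in one pass, then flag points in sparsely populated cells in a second pass (cell width and threshold are arbitrary illustrative values):

```python
# Sketch of the general quantisation idea behind grid-based outlier detection,
# not CURIO itself; cell width and density threshold are illustrative choices.
from collections import Counter
import numpy as np

def grid_outliers(X, cell_width=1.0, min_count=3):
    """Mark points whose grid cell holds fewer than min_count points."""
    cells = [tuple(c) for c in np.floor(X / cell_width).astype(int)]
    counts = Counter(cells)                                   # pass 1: cell occupancy
    return np.array([counts[c] < min_count for c in cells])   # pass 2: flag sparse cells

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),   # dense cluster
               [[8.0, 8.0]]])                      # isolated point
print(np.where(grid_outliers(X))[0])               # the isolated point (index 200) is flagged
```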
In two studies, this thesis depicts the relationship between minority group status in the United States, perceived discrimination, and coping with stress. Past literature on coping and its types – problem-focused versus emotion-focused – is inconsistent in terms of differences between minority status groups and majority groups. It remains unknown whether or why Black Americans and lesbian or gay Americans may demonstrate coping patterns that differ from White Americans and heterosexual Americans, respectively. What is altogether absent from the literature is the possible mediating factor of perceived discrimination experienced by these minority groups. That is, differences in internal, stable coping processes that manage stress may have been molded by one’s experience with discrimination. Study 1 examines the relationship between race (Black versus White) and coping, mediated by perceived discrimination. Study 2 examines the relationship between sexual orientation (lesbian or gay versus heterosexual) and coping, mediated by perceived discrimination. Both studies confirm the thesis that minority group members exhibit maladaptive, emotion-focused coping more than majority group members – but that this difference is explained by the minority group members’ perceived discrimination. Historical and political relevance, social implications, and possible limitations in design and interpretation are discussed.
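As a rough illustration of the mediation structure the two studies test (group membership, perceived discrimination as mediator, emotion-focused coping as outcome), the following sketch estimates indirect and direct effects on synthetic data; the variable names and effect sizes are hypothetical and are not the thesis's data:

```python
# Illustrative mediation sketch on synthetic data; not the thesis's measures or results.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
group = rng.integers(0, 2, size=n)                          # 1 = minority group member
discrim = 0.8 * group + rng.normal(size=n)                  # perceived discrimination (mediator)
coping = 0.5 * discrim + 0.1 * group + rng.normal(size=n)   # emotion-focused coping score

# Path a: group -> mediator.
a = sm.OLS(discrim, sm.add_constant(group)).fit().params[1]
# Path b and direct effect c': mediator and group -> outcome.
model = sm.OLS(coping, sm.add_constant(np.column_stack([discrim, group]))).fit()
b, c_prime = model.params[1], model.params[2]

print(f"indirect (mediated) effect a*b = {a*b:.3f}, direct effect c' = {c_prime:.3f}")
```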
Clustering is a division of data into groups of similar objects. K-means has been used in much clustering work because of the simplicity of the algorithm. Our main effort is to parallelize the k-means clustering algorithm. The parallel version is implemented based on the inherent parallelism of the Distance Calculation and Centroid Update phases. The parallel K-means algorithm is designed so that each of the P participating nodes is responsible for handling n/P data points. We run the program on a Linux cluster with a maximum of eight nodes using a message-passing programming model. We evaluated performance based on the percentage of correct answers and the speed-up obtained. The outcome shows that our parallel K-means program performs relatively well on large datasets.
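A minimal sketch of the data-parallel scheme described above, using mpi4py as an assumed stand-in for the message-passing implementation (the paper's code is not reproduced): each of the P ranks owns n/P points, computes assignments locally in the Distance Calculation phase, and the Centroid Update phase is a global reduction.

```python
# Sketch only; initialisation and convergence testing are simplified.
# Run with e.g. `mpiexec -n 4 python parallel_kmeans.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, P = comm.Get_rank(), comm.Get_size()

k, dim, n_local, iters = 3, 2, 1000, 20
rng = np.random.default_rng(rank)              # each rank holds its own n/P points
X_local = rng.normal(size=(n_local, dim))

centroids = np.empty((k, dim))
if rank == 0:
    centroids = rng.choice(X_local, size=k, replace=False)
comm.Bcast(centroids, root=0)                  # everyone starts from the same centroids

for _ in range(iters):
    # Distance Calculation phase (local): assign each point to its nearest centroid.
    d = np.linalg.norm(X_local[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)

    # Centroid Update phase (global): sum partial statistics across all ranks.
    sums = np.zeros((k, dim))
    counts = np.zeros(k)
    for j in range(k):
        sums[j] = X_local[labels == j].sum(axis=0)
        counts[j] = (labels == j).sum()
    comm.Allreduce(MPI.IN_PLACE, sums, op=MPI.SUM)
    comm.Allreduce(MPI.IN_PLACE, counts, op=MPI.SUM)
    centroids = sums / np.maximum(counts, 1)[:, None]

if rank == 0:
    print(centroids)
```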
The ability of Minkowski Functionals to characterize local structure in different biological tissue types has been demonstrated in a variety of medical image processing tasks. We introduce anisotropic Minkowski Functionals (AMFs) as a novel variant that captures the inherent anisotropy of the underlying gray-level structures. To quantify the anisotropy characterized by our approach, we further introduce a method to compute a quantitative measure motivated by a technique utilized in MR diffusion tensor imaging, namely fractional anisotropy. We showcase the applicability of our method in the research context of characterizing the local structure properties of trabecular bone micro-architecture in the proximal femur as visualized on multi-detector CT. To this end, AMFs were computed locally for each pixel of ROIs extracted from the head, neck and trochanter regions. Fractional anisotropy was then used to quantify the local anisotropy of the trabecular structures found in these ROIs and to compare its distribution in different anatomical regions. Our results suggest a significantly greater concentration of anisotropic trabecular structures in the head and neck regions when compared to the trochanter region (p < 10⁻⁴). We also evaluated the ability of such AMFs to predict bone strength in the femoral head of proximal femur specimens obtained from 50 donors. Our results suggest that such AMFs, when used in conjunction with multi-regression models, can outperform more conventional features such as BMD in predicting failure load. We conclude that such anisotropic Minkowski Functionals can capture valuable information regarding directional attributes of local structure, which may be useful in a wide scope of biomedical imaging applications.
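The fractional anisotropy measure borrowed from diffusion tensor imaging has a standard closed form in terms of the eigenvalues of a local tensor; the following sketch (with synthetic tensors, not data from the paper) shows the computation:

```python
# Standard fractional anisotropy formula applied to synthetic 3x3 tensors;
# the example tensors are not data from the paper.
import numpy as np

def fractional_anisotropy(tensor):
    """FA = sqrt(3/2) * ||lambda - mean(lambda)|| / ||lambda|| for eigenvalues lambda."""
    lam = np.linalg.eigvalsh(tensor)           # eigenvalues of a symmetric tensor
    return np.sqrt(1.5) * np.linalg.norm(lam - lam.mean()) / np.linalg.norm(lam)

isotropic = np.diag([1.0, 1.0, 1.0])           # no preferred direction -> FA = 0
elongated = np.diag([5.0, 1.0, 1.0])           # one dominant direction -> FA closer to 1
print(fractional_anisotropy(isotropic), fractional_anisotropy(elongated))
```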
In a world where massive amounts of data are recorded on a large scale, we need data mining technologies to gain knowledge from the data in a reasonable time. The Top Down Induction of Decision Trees (TDIDT) algorithm is a very widely used technique for predicting the classification of newly recorded data. However, alternative techniques have been developed that often produce better rules but do not scale well on large datasets. Such an alternative to TDIDT is the PrismTCS algorithm. PrismTCS performs particularly well on noisy data but ...
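As a minimal illustration of the greedy, entropy-based split selection at the heart of TDIDT-style tree induction (PrismTCS itself induces rules directly and is not reproduced here), the toy data and attribute names below are invented:

```python
# Toy data and attribute names are invented; this shows TDIDT's split criterion only.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Reduction in class entropy obtained by partitioning the rows on attribute `attr`."""
    splits = {}
    for row, y in zip(rows, labels):
        splits.setdefault(row[attr], []).append(y)
    remainder = sum(len(part) / len(labels) * entropy(part) for part in splits.values())
    return entropy(labels) - remainder

rows = [{"outlook": "sunny", "windy": False}, {"outlook": "sunny", "windy": True},
        {"outlook": "rain",  "windy": True},  {"outlook": "rain",  "windy": False}]
labels = ["no", "no", "yes", "yes"]

best = max(["outlook", "windy"], key=lambda a: information_gain(rows, labels, a))
print("split on:", best)   # "outlook" separates the classes perfectly here
```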
The efficiency of frequent itemset mining algorithms is determined mainly by three factors: the way candidates are generated, the data structure that is used, and the implementation details. Most papers focus on the first factor, some describe the underlying data structures, but implementation details are almost always neglected. In this paper we show that the effect of implementation can be more important than the selection of the algorithm. Ideas that seem to be quite promising may turn out to be ineffective if we ...
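To make concrete where the three factors above show up, here is a toy Apriori-style level-wise miner; real implementations differ precisely in the candidate-generation, data-structure, and counting details this sketch glosses over:

```python
# Toy Apriori-style level-wise miner, included only to make the three factors concrete;
# not an efficient or production implementation.
from collections import defaultdict

def frequent_itemsets(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    counts = defaultdict(int)
    for t in transactions:                     # L1: frequent single items
        for item in t:
            counts[frozenset([item])] += 1
    level = {s for s, c in counts.items() if c >= min_support}
    result = {s: counts[s] for s in level}

    k = 2
    while level:
        # Candidate generation: join frequent (k-1)-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        counts = defaultdict(int)
        for t in transactions:                 # counting pass (the data-structure-sensitive step)
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {s for s, c in counts.items() if c >= min_support}
        result.update((s, counts[s]) for s in level)
        k += 1
    return result

txns = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(frequent_itemsets(txns, min_support=3))
```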
Nowadays, Web sites tend to be more and more social: users can upload any kind of information to collaborative platforms and can express their opinions about the content they enjoyed through textual feedback or reviews. These platforms allow users to annotate resources they like with freely chosen keywords (called tags). The main advantage of these tools is that they perfectly fit user needs, since the use of tags allows organizing the information in a way that closely follows the user's mental model, making ...
When applying multivariate analysis techniques in information systems and social science disciplines, such as management information systems (MIS) and marketing, the assumption that the empirical data originate from a single homogeneous population is often unrealistic. When applying a causal modeling approach, such as partial least squares (PLS) path modeling, segmentation is a key issue in coping with the problem of heterogeneity in estimated cause-and-effect relationships. This chapter presents a new PLS path modeling approach which classifies units on the basis of the heterogeneity of the estimates in the inner model. If unobserved heterogeneity significantly affects the estimated path model relationships at the aggregate data level, the methodology allows homogeneous groups of observations to be created that exhibit distinctive path model estimates. The approach thus provides differentiated analytical outcomes that permit more precise interpretations of each segment formed. An application to a large dataset in an example of the American Customer Satisfaction Index (ACSI) substantiates the methodology's effectiveness in evaluating PLS path modeling results.
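The chapter's actual mixture-based PLS procedure is not reproduced here; as a much-simplified stand-in for the response-based segmentation idea, the following sketch fits a pooled regression, then alternates between re-estimating one regression per segment and reassigning observations to the segment whose model explains them best (all data are synthetic):

```python
# Much-simplified stand-in for response-based segmentation; not the chapter's PLS procedure.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 400
x = rng.normal(size=(n, 1))
true_seg = rng.integers(0, 2, size=n)                  # unobserved heterogeneity
y = np.where(true_seg == 0, 0.2, 1.5) * x[:, 0] + rng.normal(scale=0.3, size=n)

pooled = LinearRegression().fit(x, y)
print("pooled slope:", round(pooled.coef_[0], 2))      # blends the two true effects

# Initialise segments from the pooled residuals, then refine with hard EM.
resid = y - pooled.predict(x)
seg = (x[:, 0] * resid > 0).astype(int)
for _ in range(10):
    models = [LinearRegression().fit(x[seg == g], y[seg == g]) for g in (0, 1)]
    errs = np.column_stack([np.abs(y - m.predict(x)) for m in models])
    seg = errs.argmin(axis=1)

print("segment slopes:", [round(m.coef_[0], 2) for m in models])  # close to 0.2 and 1.5
```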
This paper presents a novel approach to the task of automatic music genre classification which is based on multiple feature vectors and an ensemble of classifiers. Multiple feature vectors are extracted from a single music piece. First, three 30-second music segments, one from the beginning, one from the middle and one from the end part of a music piece, are selected and feature vectors are extracted from each segment. Individual classifiers are trained to account for each feature vector extracted from each music segment. At classification time, the outputs provided by each individual classifier are combined through simple combination rules such as the majority vote, max, sum and product rules, with the aim of improving music genre classification accuracy. Experiments carried out on a large dataset containing more than 3,000 music samples from ten different Latin music genres have shown that, for the task of automatic music genre classification, the features extracted from the middle part of the music provide better results than using segments from the beginning or end part of the music. Furthermore, the proposed ensemble approach, which combines the multiple feature vectors, provides better accuracy than using single classifiers and any individual music segment.
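The fixed combination rules mentioned above are straightforward to express over the class-probability outputs of the three per-segment classifiers; the probability values below are made up purely to illustrate the rules:

```python
# Illustration of the majority-vote, max, sum and product combination rules;
# the probability values are invented, not outputs from the paper's classifiers.
import numpy as np

# Rows: one classifier per 30-second segment; columns: genre class probabilities.
probs = np.array([
    [0.2, 0.5, 0.3],   # classifier trained on the beginning segment
    [0.1, 0.7, 0.2],   # classifier trained on the middle segment
    [0.4, 0.3, 0.3],   # classifier trained on the end segment
])

majority = np.bincount(probs.argmax(axis=1), minlength=probs.shape[1]).argmax()
max_rule = probs.max(axis=0).argmax()
sum_rule = probs.sum(axis=0).argmax()
prod_rule = probs.prod(axis=0).argmax()

print(majority, max_rule, sum_rule, prod_rule)   # predicted class index under each rule
```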
Many scientific applications can benefit from efficient clustering algorithms for massively large high-dimensional datasets. However, most of the developed algorithms are impractical to use when the amount of data is very large. Given N objects, each defined by an M-dimensional feature vector, any clustering technique for handling very large datasets in high-dimensional space should run in time
This paper presents a novel approach to knowledge extraction from large-scale datasets using a neural network, applied to the real-world problem of payment card fraud detection. Fraud is a serious and long-term threat to a peaceful and democratic society. We present SOAR (Sparse Oracle-based Adaptive Rule) extraction, a practical approach to process large datasets and extract key generalizing
Significant payment flows now take place on-line, giving rise to a requirement for efficient and effective systems for the detection of credit card fraud. A particular aspect of this problem is that it is highly dynamic, as fraudsters continually adapt their strategies in response to the increasing sophistication of detection systems. Hence, system training by exposure to examples of previous fraudulent transactions can lead to fraud detection systems which are susceptible to new patterns of fraudulent transactions. The nature of the problem suggests that Artificial Immune Systems (AIS) may have particular utility for inclusion in fraud detection systems, as AIS can be constructed to flag 'non-standard' transactions without having seen examples of all possible such transactions during training of the algorithm. In this paper, we investigate the effectiveness of Artificial Immune Systems (AIS) for credit card fraud detection using a large dataset obtained from an on-line retailer. Three AIS algorithms were implemented and their performance was benchmarked against a logistic regression model. The results suggest that AIS algorithms have potential for inclusion in fraud detection systems, but that further work is required to realize their full potential in this domain.
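As an illustrative sketch of one AIS family (negative selection; not necessarily one of the three algorithms the paper implements, and with synthetic data in place of the retailer's dataset): random candidate detectors that lie too close to known-legitimate "self" transactions are discarded, and the survivors flag non-standard transactions.

```python
# Toy negative-selection sketch; synthetic data, not the paper's algorithms or dataset.
import numpy as np

rng = np.random.default_rng(0)

# "Self": normalised features of legitimate transactions (synthetic stand-in).
self_set = rng.normal(loc=0.5, scale=0.1, size=(500, 2)).clip(0, 1)

self_radius = 0.15
candidates = rng.uniform(0, 1, size=(2000, 2))
# Keep only detectors that do NOT match any self sample.
dists = np.linalg.norm(candidates[:, None, :] - self_set[None, :, :], axis=2)
detectors = candidates[dists.min(axis=1) > self_radius]

def is_suspicious(tx, detectors, radius=0.15):
    """Flag a transaction if it falls within any detector's radius."""
    return bool((np.linalg.norm(detectors - tx, axis=1) < radius).any())

print(is_suspicious(np.array([0.5, 0.5]), detectors))    # typical transaction -> False
print(is_suspicious(np.array([0.95, 0.05]), detectors))  # unusual transaction -> likely True
```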