This document provides a short review of clustering techniques for students. It defines clustering and different types of grouping methods such as hard vs soft clustering. It discusses popular clustering algorithms like hierarchical clustering, k-means clustering, and density-based clustering. It also covers cluster validity, usability, preprocessing techniques, meta methods, and visual clustering. Open problems in clustering mentioned include how to identify outlier objects and accelerate classification.
This document outlines topics to be covered in a presentation on K-means clustering. It will discuss the introduction of K-means clustering, how the algorithm works, provide an example, and applications. The key aspects are that K-means clustering partitions data into K clusters based on similarity, assigns data points to the closest centroid, and recalculates centroids until clusters are stable. It is commonly used for market segmentation, computer vision, astronomy, and agriculture.
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le... (Simplilearn)
This document discusses support vector machines (SVM) and provides an example of using SVM for classification. It begins with common applications of SVM such as face detection and image classification. It then provides an overview of SVM, explaining how it finds the optimal separating hyperplane between two classes by maximizing the margin between them. An example demonstrates SVM by classifying people as male or female based on height and weight data. It also discusses how kernels can be used to handle non-linearly separable data. The document concludes by showing an implementation of SVM on a zoo dataset to classify animals as crocodiles or alligators.
The document discusses various clustering approaches including partitioning, hierarchical, density-based, grid-based, model-based, frequent pattern-based, and constraint-based methods. It focuses on partitioning methods such as k-means and k-medoids clustering. K-means clustering aims to partition objects into k clusters by minimizing total intra-cluster variance, representing each cluster by its centroid. K-medoids clustering is a more robust variant that represents each cluster by its medoid or most centrally located object. The document also covers algorithms for implementing k-means and k-medoids clustering.
The document discusses the random forest algorithm. It introduces random forest as a supervised classification algorithm that builds multiple decision trees and merges them to provide a more accurate and stable prediction. It then provides an example pseudocode that randomly selects features to calculate the best split points to build decision trees, repeating the process to create a forest of trees. The document notes key advantages of random forest are that it avoids overfitting and can be used for both classification and regression tasks.
Deepak George provides a presentation on unsupervised learning techniques including K-Means clustering, hierarchical clustering, and DBSCAN. He has experience in data science roles at companies like GE and Mu Sigma. Deepak earned degrees from IIM Bangalore and College of Engineering Trivandrum and lists passions in deep learning, photography, and football. The presentation covers key concepts in clustering algorithms and includes visual explanations and recommendations for applying clustering.
This document discusses machine learning concepts including supervised vs. unsupervised learning, clustering algorithms, and specific clustering methods like k-means and k-nearest neighbors. It provides examples of how clustering can be used for applications such as market segmentation and astronomical data analysis. Key clustering algorithms covered are hierarchy methods, partitioning methods, k-means which groups data by assigning objects to the closest cluster center, and k-nearest neighbors which classifies new data based on its closest training examples.
K-Nearest neighbor is one of the most commonly used classifiers, based on lazy learning. It is widely used in recommendation systems and document-similarity measures, and mainly uses Euclidean distance to measure the similarity between two data points.
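As a minimal sketch of the idea (toy data and scikit-learn assumed; not taken from the summarized slides), a k-NN classifier with the default Euclidean metric looks like this:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two tiny, well-separated groups of 2-D points.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])

# k-NN with Euclidean distance (scikit-learn's default metric).
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# A new point near the second group gets the label of its 3 nearest neighbors.
pred = knn.predict([[5.1, 5.0]])[0]
print(pred)
```

Being a lazy learner, `fit` here only stores the training data; all distance computation happens at prediction time.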
Welcome to Supervised Machine Learning and Data Science.
Algorithms for building models: Support Vector Machines.
Classification algorithm explanation and code in Python (SVM).
Decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision.
This document discusses various classification algorithms including k-nearest neighbors, decision trees, naive Bayes classifier, and logistic regression. It provides examples of how each algorithm works. For k-nearest neighbors, it shows how an unknown data point would be classified based on its nearest neighbors. For decision trees, it illustrates how a tree is built by splitting the data into subsets at each node until pure subsets are reached. It also provides an example decision tree to predict whether Amit will play cricket. For naive Bayes, it gives an example of calculating the probability of cancer given a patient is a smoker.
Slides explaining the distinction between bagging and boosting in terms of the bias-variance trade-off, followed by some lesser-known aspects of supervised learning: the effect of the tree-split metric on feature importance, the effect of the decision threshold on classification accuracy, and how to adjust the model threshold for classification.
Note: the limitations of the accuracy metric (baseline accuracy), alternative metrics, their use cases, and their advantages and limitations are briefly discussed.
This presentation introduces clustering analysis and the k-means clustering technique. It defines clustering as an unsupervised method to segment data into groups with similar traits. The presentation outlines different clustering types (hard vs soft), techniques (partitioning, hierarchical, etc.), and describes the k-means algorithm in detail through multiple steps. It discusses requirements for clustering, provides examples of applications, and reviews advantages and disadvantages of k-means clustering.
This document discusses various unsupervised machine learning clustering algorithms. It begins with an introduction to unsupervised learning and clustering. It then explains k-means clustering, hierarchical clustering, and DBSCAN clustering. For k-means and hierarchical clustering, it covers how they work, their advantages and disadvantages, and compares the two. For DBSCAN, it defines what it is, how it identifies core points, border points, and outliers to form clusters based on density.
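The DBSCAN behaviour described above (core points, border points, and outliers defined by density) can be sketched with scikit-learn on toy data; the `eps` and `min_samples` values here are illustrative choices:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one isolated point that should be flagged as noise.
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.0, 0.2],
              [5.0, 5.0], [5.1, 4.9], [5.0, 5.2],
              [10.0, 10.0]])  # isolated outlier

# eps: neighbourhood radius; min_samples: points needed to form a core point.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # noise points are labelled -1
```

Unlike k-means, no cluster count is given up front: the two blobs become clusters because they are dense, and the isolated point is labelled -1.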
K-means clustering is an algorithm that groups data points into k number of clusters based on their similarity. It works by randomly selecting k data points as initial cluster centroids and then assigning each remaining point to the closest centroid. It then recalculates the centroids and reassigns points in an iterative process until centroids stabilize. While efficient, k-means clustering has weaknesses in that it requires specifying k, can get stuck in local optima, and is not suitable for non-convex shaped clusters or noisy data.
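A minimal scikit-learn sketch of the procedure just described (toy data assumed); multiple random restarts via `n_init` are the standard guard against the local-optimum weakness noted above:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups; k must be chosen up front (here k = 2).
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]])

# n_init restarts the random initialization several times and keeps the best
# result; random_state makes the run reproducible.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
print(km.cluster_centers_)
```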
This document summarizes support vector machines (SVMs), a machine learning technique for classification and regression. SVMs find the optimal separating hyperplane that maximizes the margin between positive and negative examples in the training data. This is achieved by solving a convex optimization problem that minimizes a quadratic function under linear constraints. SVMs can perform non-linear classification by implicitly mapping inputs into a higher-dimensional feature space using kernel functions. They have applications in areas like text categorization due to their ability to handle high-dimensional sparse data.
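A short scikit-learn sketch of the max-margin idea on linearly separable toy data (the data and parameters are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data.
X = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0],
              [4.0, 4.0], [4.5, 3.5], [3.5, 4.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A linear kernel finds the max-margin hyperplane; swapping in kernel="rbf"
# implicitly maps inputs to a higher-dimensional space, as described above.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

preds = clf.predict([[0.2, 0.3], [4.2, 4.1]])
print(preds)
```

Under the hood `SVC` solves the convex quadratic optimization problem mentioned in the summary; only the support vectors end up defining the hyperplane.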
Decision tree induction / Decision Tree Algorithm with Example | Data science (MaryamRehman6)
This Decision Tree Algorithm in Machine Learning presentation will help you understand all the basics of Decision Trees: what Machine Learning is, what a Decision Tree is, the advantages and disadvantages of Decision Trees, and how the Decision Tree algorithm works, with solved examples, ending with a Decision Tree use case/demo in Python for loan payment. This Decision Tree tutorial suits both beginners and experts who want to learn Machine Learning algorithms.
The document summarizes the counterpropagation neural network algorithm. It consists of an input layer, a Kohonen hidden layer that clusters inputs, and a Grossberg output layer. The algorithm identifies the winning hidden neuron that is most activated by the input. The output is then calculated as the weight between the winning hidden neuron and the output neurons, providing a coarse approximation of the input-output mapping.
This document summarizes Chapter 10 of the book "Data Mining: Concepts and Techniques (3rd ed.)" which covers cluster analysis. The chapter introduces different types of clustering methods including partitioning methods like k-means and k-medoids, hierarchical methods, density-based methods, and grid-based methods. It discusses how to evaluate the quality of clustering results and highlights considerations for cluster analysis such as similarity measures, clustering space, and challenges like scalability and high dimensionality.
The document describes different types of clustering algorithms, including partitioning, hierarchical, density-based, and grid-based methods. Partitioning methods like k-means and k-medoids aim to partition objects into k clusters by optimizing an objective function. Hierarchical clustering builds a hierarchy of clusters based on distance, either through an agglomerative (bottom-up) or divisive (top-down) approach. Density-based methods identify clusters based on density rather than distance. Grid-based methods quantize the space into a finite number of cells that form a grid structure.
The document discusses different clustering methods including partitioning, hierarchical, density-based, and grid-based approaches. It provides details on popular partitioning algorithms like k-means and k-medoids, describing how they work, their strengths and weaknesses. Hierarchical clustering methods like AGNES and DIANA are also covered, including how distances between clusters are calculated during the merging or splitting process.
This document summarizes clustering analysis techniques described in Chapter 10 of the book "Data Mining: Concepts and Techniques". It introduces the basic concepts of cluster analysis including partitioning, hierarchical, density-based, and grid-based methods. It then describes the k-means and k-medoids partitioning algorithms in more detail, noting that k-means can be sensitive to outliers while k-medoids uses actual data points as cluster representatives.
This chapter discusses different clustering methods including partitioning, hierarchical, density-based, and grid-based approaches. Partitioning methods like k-means and k-medoids aim to partition observations into k clusters by optimizing some objective function. Hierarchical clustering builds a hierarchy of clusters based on distance between observations. Density-based methods identify clusters based on density rather than distance. Grid-based methods quantize the space into finite number of cells that form clusters.
Chapter 10. Cluster Analysis: Basic Concepts and Methods.ppt (Subrata Kumer Paul)
Jiawei Han, Micheline Kamber and Jian Pei
Data Mining: Concepts and Techniques, 3rd ed.
The Morgan Kaufmann Series in Data Management Systems
Morgan Kaufmann Publishers, July 2011. ISBN 978-0123814791
This document summarizes chapter 10 of the book "Data Mining: Concepts and Techniques", which discusses cluster analysis. The chapter covers basic concepts of cluster analysis including partitioning, hierarchical, density-based and grid-based methods. It describes popular partitioning algorithms like k-means and k-medoids, and notes that k-means can be sensitive to outliers while k-medoids uses medoids, which are less sensitive to outliers. The chapter also discusses evaluating clustering quality and major considerations for cluster analysis.
This document provides an overview of several clustering algorithms. It begins by defining clustering and its importance in data mining. It then categorizes clustering algorithms into four main types: partitional, hierarchical, grid-based, and density-based. For each type, some representative algorithms are described briefly. The document also reviews several popular clustering algorithms like k-means, CLARA, PAM, CLARANS, and BIRCH in more detail. It discusses aspects like the algorithms' time complexity, types of data handled, ability to detect clusters of different shapes, required input parameters, and advantages/disadvantages. Overall, the document aims to guide selection of suitable clustering algorithms for specific applications by surveying their key characteristics.
This document provides an overview of unsupervised learning and clustering algorithms. It discusses the motivation for clustering as grouping similar data points without labels. It introduces common clustering algorithms like K-means, hierarchical clustering, and fuzzy C-means. It covers clustering criteria such as similarity functions, stopping criteria, and cluster quality. It also discusses techniques like data normalization and challenges in evaluating clusters without ground truths. The document aims to explain the concepts and applications of unsupervised learning for clustering unlabeled data.
Clustering is the step-by-step process of grouping objects so that the attributes of all
objects within a group are nearly similar. A cluster is therefore a collection of objects
with nearly the same attribute values: each object in a cluster is similar to the other
objects in the same cluster but different from the objects in other clusters.
Clustering is used in a wide range of applications such as pattern recognition, image processing,
data analysis, and machine learning. Nowadays, more attention is being paid to categorical
data than to numerical data; with categorical attributes, the range of a numerical attribute is
organized into classes such as small, medium, and high. A wide range of algorithms exists for
clustering categorical data. Our approach enhances the well-known k-modes clustering
algorithm to improve its accuracy. We propose a new approach named
"High Accuracy Clustering Algorithm for Categorical Datasets".
The document discusses the concept of clustering, which is an unsupervised machine learning technique used to group unlabeled data points that are similar. It describes how clustering algorithms aim to identify natural groups within data based on some measure of similarity, without any labels provided. The key types of clustering are partition-based (like k-means), hierarchical, density-based, and model-based. Applications include marketing, earth science, insurance, and more. Quality measures for clustering include intra-cluster similarity and inter-cluster dissimilarity.
Cluster analysis is an unsupervised learning technique used to group unlabeled data points into meaningful clusters. There are several approaches to cluster analysis including partitioning methods like k-means, hierarchical clustering methods like agglomerative nesting (AGNES), and density-based methods like DBSCAN. The quality of clusters is evaluated based on intra-cluster similarity and inter-cluster dissimilarity. Cluster analysis has applications in fields like pattern recognition, image processing, and market segmentation.
This document provides an overview of hierarchical clustering methods. It discusses agglomerative hierarchical clustering methods like AGNES that start by treating each object as a separate cluster and merge them into larger clusters. It also discusses divisive hierarchical clustering methods like DIANA that start with all objects in one cluster and split them into smaller clusters. A dendrogram is used to visualize the nested clustering formed at different levels in the hierarchical clustering tree. The document also discusses different measures for calculating the distance between clusters during the merging or splitting process.
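The agglomerative (AGNES-style) process and the dendrogram cut can be sketched with SciPy on toy one-dimensional data (data and cut threshold are illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Each point starts as its own cluster; the closest pair is merged each step.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])

# "average" linkage = mean pairwise distance between clusters.
Z = linkage(X, method="average")

# Cutting the tree at distance 1.0 recovers the two obvious groups;
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree itself.
labels = fcluster(Z, t=1.0, criterion="distance")
print(labels)
```

The linkage matrix `Z` records every merge and its distance, which is exactly the nested structure a dendrogram visualizes.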
Unsupervised learning Algorithms and Assumptions (refedey275)
Topics :
Introduction to unsupervised learning
Unsupervised learning Algorithms and Assumptions
K-Means algorithm – introduction
Implementation of K-means algorithm
Hierarchical Clustering – need and importance of hierarchical clustering
Agglomerative Hierarchical Clustering
Working of dendrogram
Steps for implementation of AHC using Python
Gaussian Mixture Models – Introduction, importance and need of the model
Normal (Gaussian) distribution
Implementation of Gaussian mixture model
Understand the different distance metrics used in clustering
Euclidean, Manhattan, Cosine, Mahalanobis
Features of a Cluster – Labels, Centroids, Inertia, Eigenvectors and Eigenvalues
Principal component analysis
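The distance metrics listed above can be compared directly with SciPy (the vectors are illustrative; note that Mahalanobis additionally needs the inverse covariance of the data):

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

print(distance.euclidean(a, b))   # straight-line (L2) distance
print(distance.cityblock(a, b))   # Manhattan (L1) distance
print(distance.cosine(a, b))      # 1 - cosine similarity (0 here: b = 2a)

# Mahalanobis takes the inverse covariance matrix VI of the data;
# with identity covariance it reduces to the Euclidean distance.
VI = np.eye(3)
print(distance.mahalanobis(a, b, VI))
```

The cosine distance is 0 because `b` is a scaled copy of `a`: cosine ignores magnitude and compares direction only, which is why it is popular for text vectors.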
Supervised learning (classification)
Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data
Types of Hierarchical Clustering
There are mainly two types of hierarchical clustering:
Agglomerative hierarchical clustering
Divisive Hierarchical clustering
A distribution in statistics is a function that shows the possible values for a variable and how often they occur.
In probability theory and statistics, the Normal Distribution, also called the Gaussian Distribution, is the most important continuous probability distribution.
It is sometimes also called a bell curve.
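The bell curve can be written down directly; a minimal sketch of the Gaussian density (standard library only):

```python
import math

# Gaussian (normal) density:
# f(x) = exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
def normal_pdf(x, mu=0.0, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# The bell curve peaks at the mean and is symmetric around it.
print(round(normal_pdf(0.0), 4))           # 0.3989
print(normal_pdf(-1.0) == normal_pdf(1.0)) # True
```

This symmetric, single-peaked shape is what Gaussian Mixture Models combine: each mixture component is one such bell, with its own mean and variance.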
This document discusses different types of clustering methods used in data analysis. It covers partitioning methods like k-means and k-medoids that group data into a predefined number of clusters. It also describes different data types that can be clustered, including interval-scaled, binary, categorical, ordinal and ratio-scaled variables. The document provides details on calculating distances and similarities between data objects for various variable types during clustering.
1. TECHNIQUES OF CLUSTERING (a short review for students). Mikhail Alexandrov (1, 2), Pavel Makagonov (3). (1) Autonomous University of Barcelona, Spain; (2) Social Network Research Center with UCE, Slovakia; (3) Mixtec Technological University, Mexico. dyner1950@mail.ru, mpp@mixteco.utm.mx. Petersburg, 2008
2. CONTENTS: Introduction, Definitions, Clustering, Discussion, Open Problems
3. Ideas, Materials, Collaboration: Prof. Dr. Benno Stein, Dr. Sven Meyer zu Eissen (Weimar University, Germany) - structuring and indexing, AI search in IR; Dr. Xiaojin Zhu (University of Wisconsin, Madison, USA) - semi-supervised learning
4. Subject of Grouping: TEXTUAL DATA vs NON-TEXTUAL DATA. Local terminology: it is not important whether the source of the data is textual or non-textual. Data: we work in the space of numerical parameters. Texts: we work in the space of words. Example: a typical dialog between a passenger (US) and railway directory inquiries (DI)
5. Presentation of Textual Data: TEXTS ('indexed') vs the vector model ('parameterized'). Local terminology: indexed texts are just parameterized texts in the space of words
6. Presentation of Textual Data: TEXTS ('indexed') vs TEXTS ('parameterized'). Local terminology: indexed texts are just parameterized texts in the space of themes => Category/Context Vector Models... Example: manually parameterized dialogs in the space of parameters (transport service and passenger needs)
7. CONTENTS: Introduction, Definitions, Clustering, Discussion, Open Problems
8. Types of Grouping: Unsupervised Learning (we know nothing about the data structure), Supervised Learning (we know the data structure well), Semi-Supervised Learning (we know something about the data structure)
9. Types of Grouping: Clustering vs Classification. Clustering - Characteristics: absence of patterns or descriptions of classes, so the results are defined by the nature of the data themselves (N > 1). Synonyms: classification without a teacher, unsupervised learning. Number of clusters: is known exactly [ ]; is known approximately [x]; is not known => searching [ ]. Classification - Characteristics: presence of patterns or descriptions of classes, so the results are defined by the user (N >= 1). Synonyms: classification with a teacher, supervised learning. Special terms: categorization (of documents), diagnostics (technics, medicine), recognition (technics, science)
10. Types of Grouping: "Semi-Clustering/Classification" vs Classification. Semi-Clustering/Classification - Characteristics: presence of a limited number of patterns, so the results are defined both by the user and by the data themselves. Synonyms: semi-classification, semi-supervised learning. Number of clusters/categories: is known exactly [ ]; is known approximately [x]; is not known => searching [ ]. Classification - Characteristics: presence of patterns or descriptions of classes, so the results are defined by the user (N >= 1). Synonyms: classification with a teacher, supervised learning. Special terms: categorization (of documents), diagnostics (technics, medicine), recognition (technics, science)
11. Objectives of Grouping. 1. Organization (structuring) of an object set; this process is named data structuring. 2. Searching for interesting patterns; this process is named navigation. 3. Grouping for other applications: knowledge discovery (clustering), summarization of documents. Note: do not mix up the type of grouping and its objective
12. Classification of Methods. Based on belonging to a cluster/category: Exclusive methods - every object belongs to exactly one cluster/category; these are named hard grouping methods. Non-exclusive methods - every object can belong to several clusters/categories; these are named soft grouping methods. Based on data presentation: methods oriented on free metric space - every object is presented as a point in a free space; methods oriented on graphs - every object is presented as an element of a graph
13. Fuzzy Grouping. Hard grouping: hard clustering, hard categorization. Soft grouping: soft clustering, soft categorization. Example: the distribution of letters from Muscovites to the Government is soft categorization (the numbers in the table reflect the relative weight of each theme)
14. General Scheme of the Clustering Process - I: Preprocessing => Processing. Principal idea: transform texts into numerical form in order to use mathematical tools. Remember: our problem is grouping textual documents, not understanding them. Here both crude and good matrices are Object/Attribute matrices
15. General Scheme of the Clustering Process - II: Preprocessing => Processing. Here the Attribute/Attribute matrix can be used instead of the Object/Object matrix
17. Clustering for Categorization. Colour matrix "words-words" before clustering. The matrix contains the values of word co-occurrences in texts: red if the value is above some threshold, white if below.
18. Clustering for Categorization. Colour matrix "words-words" after clustering. Words are grouped; cluster => subdictionary. The absence of blocks means the absence of subthemes
20. CONTENTS: Introduction, Definitions, Clustering, Discussion, Open Problems
21. Definitions. Def. 1: Let V be a set of objects. A clustering C = { C_i | C_i ⊆ V } of V is a division of V into subsets for which ⋃_i C_i = V and C_i ∩ C_j = ∅ for i ≠ j. Def. 2: Let V be a set of nodes, E a set of arcs, and φ a weight function that reflects the distance between objects, so that we have a weighted graph G = { V, E, φ }. In this case C is named a clustering of G. In the framework of the second definition, every C_i produces a subgraph G(C_i). Both the subsets C_i and the subgraphs G(C_i) are named clusters. (Illustrations: set, graph, clique.)
22. Definitions - Principal note. Both definitions SAY NOTHING about the quality of clusters or about the number of clusters. Reason for the difficulties: nowadays there is no general agreement on any universal definition of the term "cluster". What does it mean that a clustering is good? 1. The closeness between objects inside clusters is essentially greater than the closeness between the clusters themselves. 2. The constructed clusters correspond to the intuitive expectations of users (they are natural clusters)
23. Classification of Methods - based on the way of grouping. 1. Hierarchy-based methods (any neighbors; N is not given). 2. Exemplar-based methods (K-means; N is given). 3. Density-based methods (MajorClust; N is calculated automatically)
24. Hierarchy-based methods. Neighbors: every object is a cluster. General algorithm: initially every object is one cluster; a series of steps is performed, and on every step the pair of closest clusters is merged; at the end we have one cluster
27. Exemplar-based methods: K-means. General algorithm: initially K centers are selected in some random way. A series of steps is performed: on every step the objects are distributed between the centers according to the nearest-center criterion, and then all centers are recalculated. The process ends when the centers no longer change
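The K-means steps just described can be sketched in plain NumPy (toy data and the random seed are illustrative assumptions):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain NumPy version of the slide's steps: pick K random centers,
    assign each object to its nearest center, recalculate the centers,
    and stop when the centers no longer change."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distribute objects by the nearest-center criterion (Euclidean).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recalculate each center as the mean of its assigned objects.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # centers unchanged -> stop
            break
        centers = new_centers
    return labels, centers

X = np.array([[0.0, 0.0], [0.2, 0.1], [9.0, 9.0], [9.1, 8.9]])
labels, centers = kmeans(X, k=2)
print(labels)
```

On this tiny example the iteration converges in a couple of steps regardless of which two points are picked as initial centers.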
28. Exemplar-based methods: X-means (Dan Pelleg, Andrew Moore). Approach: uses an evaluation of the object distribution and selection of the most likely points. Advantages: faster, and the number of clusters is not fixed (in all cases it tends to be smaller)
29. Density-based methods: MajorClust. Principal idea: the total closeness of an object to the objects of its own cluster exceeds its closeness to any other cluster. Suboptimal solution: only a part of the neighbors is considered on every step (to save time and to avoid mergence).
30. Density-based methods: the MajorClust algorithm. Initially every object is one cluster and joins its nearest neighbor. Every object evaluates its total closeness to its own cluster and, separately, to all other clusters; after such evaluation, objects change their membership and move to the closest cluster. The search ends when the clusters no longer change. Preprocessing for MajorClust: many weak links can together be stronger than the few strongest ones, which distorts the results, so weak links should be eliminated before clustering
31. Cluster Validity. Definition: it reflects cluster separability and formally depends on the scatter inside clusters and the separation between clusters. Indexes (formal characteristics of the structure): the Dunn index (to be maximized), the Davies-Bouldin index, the hypervolume criterion (Andre Hardy), the density expected measure DEM (Benno Stein)
32. Cluster Validity - Number of clusters. Geometrical approach, two variants: the optimum (min, max) of a curve, or a jump of the curve. The Dunn index (to be maximized) is too sensitive to extreme cases
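One common form of the Dunn index (smallest inter-cluster distance divided by largest intra-cluster diameter) can be sketched as follows; the toy data are an illustrative assumption:

```python
import numpy as np
from itertools import combinations

def dunn_index(X, labels):
    """Dunn index = (smallest distance between points of different clusters)
    / (largest distance between points of the same cluster).
    Higher is better: compact clusters, well separated."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Largest intra-cluster diameter (scatter inside clusters).
    diam = max(
        np.linalg.norm(p - q)
        for c in clusters if len(c) > 1
        for p, q in combinations(c, 2)
    )
    # Smallest inter-cluster distance (separation between clusters).
    sep = min(
        np.linalg.norm(p - q)
        for a, b in combinations(clusters, 2)
        for p in a for q in b
    )
    return sep / diam

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
print(dunn_index(X, labels))  # 10.0
```

The sensitivity to extreme cases mentioned on the slide is visible in the formula: a single far-flung point inflates a cluster's diameter, and a single close pair across clusters collapses the separation.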
33. Cluster Usability. Definition: it reflects the user's opinion and formally expresses the difference between classes selected manually by a user and clusters constructed by a given method. Cluster F-measure (Benno Stein). (Scheme: Data -> Expert, Data -> Method.) Here i, j are the indexes of classes and clusters, C*_i and C_j are the classes and clusters, and prec(i,j), rec(i,j) are precision and recall
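A common form of the cluster F-measure, sketched under the assumption (consistent with the slide's prec/rec notation) that each class is matched to its best-scoring cluster and the results are weighted by class size:

```python
import numpy as np

def cluster_f_measure(classes, clusters):
    """For every manually labelled class i, take the best F score
    F(i,j) = 2 * prec(i,j) * rec(i,j) / (prec(i,j) + rec(i,j))
    over all clusters j, weighted by the relative class size."""
    n = len(classes)
    total = 0.0
    for ci in np.unique(classes):
        in_class = classes == ci
        best = 0.0
        for cj in np.unique(clusters):
            in_cluster = clusters == cj
            overlap = np.sum(in_class & in_cluster)
            if overlap == 0:
                continue
            prec = overlap / np.sum(in_cluster)   # prec(i,j)
            rec = overlap / np.sum(in_class)      # rec(i,j)
            best = max(best, 2 * prec * rec / (prec + rec))
        total += np.sum(in_class) / n * best
    return total

classes = np.array([0, 0, 1, 1])
clusters = np.array([0, 0, 1, 1])   # perfect agreement with the expert
print(cluster_f_measure(classes, clusters))  # 1.0
```

A value of 1.0 means the method's clusters coincide with the expert's classes; any misplaced object pulls the score below 1.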
34. Validity and Usability Conclusion The density expected measure corresponds to the F-measure reflecting the expert's opinion. So, DEM can serve as an indicator of expert opinion
35. Technologies of Clustering Meta methods They construct separated data sets using optimization criteria and limitations: Neither too many nor too few clusters Neither too large nor too small clusters etc. Visual methods They present visual images to a user, who selects the clusters manually Using different methods Comparing results
36. Meta Methods Algorithm (example) Notation: N is the number of objects in a given cluster D is the diagonal of a given cluster Initially N0 clusters and their centers Ci are given Steps 1. The K-medoid method (or any other one) is performed 2. If N > Nmax or D > Dmax (in any cluster), then this cluster is divided into 2 parts. Go to step 1 3. If N < Nmin or D < Dmin (in any cluster), then this cluster and the closest one are joined. Go to step 1 4. When the number of iterations I > Imax, stop. Otherwise go to step 1
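The split/merge loop of steps 2-4 can be sketched on 1-D points. This simplified version checks only the size constraints Nmin/Nmax (the diagonal constraint D is omitted) and splits at the median instead of re-running K-medoid, so it illustrates the control flow rather than the full algorithm:

```python
def enforce_size_limits(clusters, n_min, n_max, max_iter=20):
    """Meta-method loop sketch: oversized clusters are split in two,
    undersized ones are merged with the cluster whose center is closest."""
    def center(cl):
        return sum(cl) / len(cl)
    for _ in range(max_iter):
        # step 2: split any cluster with too many objects into 2 parts
        big = next((c for c in clusters if len(c) > n_max), None)
        if big is not None:
            s = sorted(big)
            m = len(s) // 2
            clusters.remove(big)
            clusters += [s[:m], s[m:]]
            continue  # go to step 1
        # step 3: join any too-small cluster with the closest cluster
        small = next((c for c in clusters if len(c) < n_min), None)
        if small is not None and len(clusters) > 1:
            others = [c for c in clusters if c is not small]
            nearest = min(others, key=lambda c: abs(center(c) - center(small)))
            clusters.remove(small)
            nearest += small  # extend in place; nearest stays in clusters
            continue  # go to step 1
        break  # step 4: all constraints satisfied
    return clusters
```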
38. Problem Authorship of Molière's dramatic works (comedies, dramas, ...). Corneille and/or Molière? Approach Style-based indexing (NooJ can be used) Clustering of all dramatic works Well-known dramatic works should be marked Style - Formal style estimations Informal style estimations Formal style indicators - Text Complexity - Text Harmonicity Authorship References: Labbé C., Labbé D. Inter-textual distance and authorship attribution: Corneille and Molière. Journal of Quantitative Linguistics. 2001. Vol. 8, No. 3, pp. 213-331
39. Clustering Authorship Results 1) 18 comedies of Molière should be attributed to Corneille 2) 15 comedies of Molière are weakly connected with all his other works, so they may have been written by two authors 3) 2 comedies of Corneille are now considered works of Molière, etc. Note: For a certain time Molière and Corneille were friends
40. Special and Universal Packages with Clustering Algorithms 1. ClustAn (Scotland) www.clustan.com Clustan Graphics-7 (2006) 2. MatLab Descriptions are on the Internet 3. Statistica Descriptions are on the Internet Journals and Congresses about Clustering 1. "Journal of Classification", Springer 2. IFCS - International Federation of Classification Societies, Conferences 3. CSNA - Classification Society of North America, Seminars, Workshops
41. Introduction Definitions Clustering Discussion Open Problems CONTENTS
42. Certain Observations The number of methods for grouping data is slightly greater than the number of researchers working in this area. The problem is not to find the best method for all cases; the problem is to find the method relevant for your data. Only you know which methods are best for your own data. The principal problems consist in choosing indexes (parameters) and a measure of closeness adequate to the given problem and given data. Frequently the results are bad because of bad indexes and a bad measure, not because of a bad method!
43. Certain Observations Antipodal methods To be sure that the results are really good and do not depend on the method used, one should test these results with antipodal methods Solomon G., 1977: "The most antipodal methods are the NN-method and K-means" Sensitivity To be sure that the results do not depend essentially on the method's parameters, one should perform a sensitivity analysis by changing the adjustment parameters.
44. Introduction Definitions Clustering Conclusions Open Problems CONTENTS
45. Some Problems Question 1 How to reveal alien (outlier) objects? Solution (idea) Reveal a stable structure on different sets of objects, which are subsets of the given set. The object distribution reflects: real structure (nature) + noise (alien objects)
46. Some Problems Question 2 How to accelerate classification? Solution (idea) Filter out objects that make a minimum contribution to the decision function Use representative objects of each cluster
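One way to realize the second idea is to keep only the most central (medoid-like) objects of every cluster and classify new objects against those instead of the full data set. A sketch on 1-D points; the number of representatives per cluster is an assumption:

```python
def representatives(clusters, n_rep=1):
    """Pick the n_rep most central objects of each cluster, i.e. those
    with minimum total distance to their own cluster, so that later
    classification compares new objects only against these."""
    reps = []
    for cl in clusters:
        def total_dist(p):
            return sum(abs(p - q) for q in cl)
        reps.append(sorted(cl, key=total_dist)[:n_rep])
    return reps
```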
47. CONTACT INFORMATION Mikhail Alexandrov 1,2 , Pavel Makagonov 3 1 Autonomous University of Barcelona, Spain 2 Social Network Research Center with UCE, Slovakia 3 Mixtec Technological University, Mexico dyner1950@mail.ru, mpp@mixteco.utm.mx Petersburg 2008