TECHNIQUES OF CLUSTERING (a short review for students) Mikhail Alexandrov 1,2 , Pavel Makagonov 3 1  Autonomous University of Barcelona, Spain 2  Social Network Research Center with UCE, Slovakia 3  Mixtec Technological University, Mexico dyner1950@mail.ru, mpp@mixteco.utm.mx Petersburg 2008
CONTENTS: Introduction, Definitions, Clustering, Discussion, Open Problems
Ideas, Materials, Collaboration. Prof. Dr. Benno Stein and Dr. Sven Meyer zu Eissen, Weimar University, Germany: Structuring and Indexing, AI search in IR. Dr. Xiaojin Zhu, University of Wisconsin, Madison, USA: Semi-Supervised Learning.
Subject of Grouping: TEXTUAL DATA and NON TEXTUAL DATA. Local Terminology: the source of the data, textual or non textual, is not important. Data: work in the space of numerical parameters. Texts: work in the space of words. Example: a typical dialog between a passenger (US) and a railway directory inquiries service (DI).
Presentation of Textual Data: TEXTS ('indexed') are presented by the vector model ('parameterized'). Local Terminology: indexed texts are simply texts parameterized in the space of words.
Presentation of Textual Data: TEXTS ('indexed') are presented as TEXTS ('parameterized'). Local Terminology: indexed texts are simply texts parameterized in the space of themes => category/context vector models. Example: manually parameterized dialogs in the space of parameters (transport service and passenger needs).
Types of Grouping. Unsupervised Learning: we know nothing about the data structure. Supervised Learning: we know the data structure well. Semi-Supervised Learning: we know something about the data structure.
Types of Grouping: Clustering and Classification
Clustering
- Characteristics: absence of patterns or descriptions of classes, so the results are defined by the nature of the data themselves (N > 1)
- Synonyms: classification without a teacher, unsupervised learning
- Number of clusters: [ ] is known exactly; [x] is known approximately; [ ] is not known => searching
Classification
- Characteristics: presence of patterns or descriptions of classes, so the results are defined by the user (N >= 1)
- Synonyms: classification with a teacher, supervised learning
- Special terms: categorization (of documents), diagnostics (engineering, medicine), recognition (engineering, science)
Types of Grouping: "Semi Clustering/Classification" and Classification
"Semi Clustering/Classification"
- Characteristics: presence of a limited number of patterns, so the results are defined both by the user and by the data themselves (N > 1)
- Synonyms: semi-classification, semi-supervised learning
- Number of clusters/categories: [ ] is known exactly; [x] is known approximately; [ ] is not known => searching
Classification
- Characteristics: presence of patterns or descriptions of classes, so the results are defined by the user (N >= 1)
- Synonyms: classification with a teacher, supervised learning
- Special terms: categorization (of documents), diagnostics (engineering, medicine), recognition (engineering, science)
Objectives of Grouping 1. Organization  (structuring) of an object set  Process is named   data structuring   2. Searching interesting patterns  Process is named   navigation 3. Grouping for other applications: -  Knowledge   discovery (clustering) -  Summarization   of documents   Note:  Do not mix  the   type   of grouping  and its  objective
Classification of methods Based on belonging to  cluster/category Exclusive methods Every object belongs only to one cluster/category. Methods are named  hard  grouping methods Non-exclusive methods Every object can belong to several clusters/categories. Methods are named  soft  grouping methods.  Based on data presentation Methods oriented on  free metric space Every object is presented as a point in a free space Methods oriented on graphs Every object is presented as an element on graph
Hard grouping: hard clustering, hard categorization. Soft grouping (fuzzy grouping): soft clustering, soft categorization. Example: the distribution of letters from Muscovites to the Government is a soft categorization (numbers in the table reflect the relative weight of each theme).
General Scheme of Clustering Process - I. Preprocessing <= Processing. Principal idea: to transform texts into numerical form in order to use mathematical tools. Remember: our problem is grouping textual documents, not understanding them. Here: both the raw and the refined matrices are Object/Attribute matrices.
General Scheme of Clustering Process - II Preprocessing  Processing  <= Here: matrix   Attribute/Attribute  can be used instead of matrix  Object/Object
Matrices to be Considered
Clustering for Categorization: colour matrix "words-words" before clustering. The matrix contains the values of word co-occurrences in texts. Red: the value is more than some threshold. White: the value is less.
Clustering for Categorization: colour matrix "words-words" after clustering. Words are grouped. Cluster => subdictionary. Absence of blocks means absence of subthemes.
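The "words-words" colour matrix described above can be sketched in a few lines. This is only an illustration: the toy texts, the vocabulary, the threshold value and the function names `cooccurrence_matrix` and `colour_matrix` are assumptions made here, not material from the slides. Co-occurrence is counted as the number of texts in which both words appear.

```python
from itertools import combinations


def cooccurrence_matrix(texts, vocabulary):
    """Count, for every pair of vocabulary words, the texts that contain both."""
    index = {word: k for k, word in enumerate(vocabulary)}
    size = len(vocabulary)
    counts = [[0] * size for _ in range(size)]
    for text in texts:
        # Each word is counted once per text, however often it occurs.
        present = sorted({index[w] for w in text.lower().split() if w in index})
        for a, b in combinations(present, 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def colour_matrix(counts, threshold):
    """Mark a cell 'R' (red) if its value exceeds the threshold, '.' (white) otherwise."""
    return [["R" if value > threshold else "." for value in row] for row in counts]


# Hypothetical toy data: two themes (railway, hotel) sharing the word "price".
texts = [
    "train ticket price",
    "train ticket time",
    "hotel room price",
    "hotel room booking",
]
vocabulary = ["train", "ticket", "hotel", "room", "price"]

counts = cooccurrence_matrix(texts, vocabulary)
colours = colour_matrix(counts, threshold=1)
for row in colours:
    print(" ".join(row))
```

With the vocabulary ordered by theme, the red cells form blocks along the diagonal; each block is a candidate subdictionary.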
Importance of Preprocessing  ( it takes   60%-90% of efforts)
Definitions
Def. 1: "Let V be a set of objects. A clustering $C = \{ C_i \mid C_i \subseteq V \}$ of V is a division of V into subsets for which $\bigcup_i C_i = V$ and $C_i \cap C_j = \emptyset$ for $i \neq j$."
Def. 2: "Let V be a set of nodes, E a set of arcs, and $\varphi$ a weight function that reflects the distance between objects, so that we have a weighted graph $G = \{ V, E, \varphi \}$. In this case C is called a clustering of G."
Within the framework of the second definition, every $C_i$ induces a subgraph $G(C_i)$. Both the subsets $C_i$ and the subgraphs $G(C_i)$ are called clusters.
[Figure labels: Graph, Set, Clique]
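Def. 1 translates directly into a check that a proposed grouping covers every object and that no object appears in two clusters. The following is a minimal sketch; the function name `is_clustering` and the sample data are chosen here for illustration only.

```python
def is_clustering(objects, clusters):
    """Return True if `clusters` divides `objects` as required by Def. 1."""
    objects = set(objects)
    clusters = [set(c) for c in clusters]
    union = set().union(*clusters)
    covers = union == objects  # the union of all C_i equals V
    # If any object were shared, the sizes would add up to more than the union.
    disjoint = sum(len(c) for c in clusters) == len(union)
    return covers and disjoint


V = {1, 2, 3, 4}
print(is_clustering(V, [{1, 2}, {3, 4}]))     # valid division
print(is_clustering(V, [{1, 2}, {2, 3, 4}]))  # object 2 belongs to two clusters
print(is_clustering(V, [{1, 2}, {3}]))        # object 4 is not covered
```

The check only tests the formal definition; a grouping can pass it and still be a poor clustering.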
Definitions. Principal note: both definitions say nothing about the quality of clusters or about the number of clusters. Reason for the difficulties: at present there is no general agreement on a universal definition of the term 'cluster'. What does it mean that a clustering is good? 1. Closeness between objects inside clusters is essentially greater than the closeness between the clusters themselves. 2. The constructed clusters correspond to the intuitive expectations of users (they are natural clusters).
Classification of methods, based on the way of grouping
1. Hierarchy based methods (neighbors): N is not given
2. Exemplar based methods (K-means): N is given
3. Density based methods (MajorClust): N is calculated automatically
Hierarchy based methods: neighbors. General algorithm: initially every object is one cluster. A series of steps is performed; on every step the pair of closest clusters is merged. At the end we have one cluster.
Hierarchy  based  methods Nearest neighbor  method   (NN)
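The general hierarchical algorithm with the nearest-neighbor rule can be sketched as follows. This is a deliberately naive version for illustration: the function names, the sample points, the Euclidean distance and the `n_clusters` stopping parameter are assumptions made here, and the cluster distance is taken as the smallest distance between members (single link).

```python
import math


def single_link_distance(a, b, points):
    """Nearest-neighbor distance: the smallest distance between members of two clusters."""
    return min(math.dist(points[i], points[j]) for i in a for j in b)


def agglomerate(points, n_clusters=1):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # every object starts as one cluster
    merges = []  # history of merges: (cluster, cluster, distance)
    while len(clusters) > n_clusters:
        pairs = [
            (single_link_distance(clusters[p], clusters[q], points), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        ]
        dist, p, q = min(pairs)
        merges.append((sorted(clusters[p]), sorted(clusters[q]), dist))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]  # q > p, so index p is unaffected
    return clusters, merges


# Hypothetical sample: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]
clusters, merges = agglomerate(points, n_clusters=3)
print(clusters)
```

Running with `n_clusters=1` reproduces the slide's statement that the process ends with one cluster; the `merges` history then holds the whole hierarchy.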
Exemplar based methods K - means,  centroid
Exemplar based methods: K-means method. General algorithm: initially K centers are selected at random. A series of steps is performed. On every step the objects are distributed among the centers according to the criterion of the nearest center; then all centers are recalculated. The algorithm ends when the centers no longer change.
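The K-means steps above can be sketched in plain Python. Several details are assumptions made for this illustration and are not stated on the slide: the initial centers are drawn from the objects themselves, closeness is Euclidean distance, a center with no objects keeps its old position, and `max_steps` guards against long runs.

```python
import math
import random


def k_means(points, k, seed=0, max_steps=100):
    """Return (centers, groups) after alternating assignment and recalculation."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers selected at random
    for _ in range(max_steps):
        # Distribute the objects among the centers by the nearest-center criterion.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        # Recalculate every center as the mean of its group.
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[c]
            for c, g in enumerate(groups)
        ]
        if new_centers == centers:  # the centers did not change: stop
            break
        centers = new_centers
    return centers, groups


# Hypothetical sample: two well separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, groups = k_means(points, k=2)
print(groups)
```

On less cleanly separated data the result depends on the random initial centers, so it is common to repeat the run with several seeds and compare.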
Exemplar based methods: X-means method (Dan Pelleg, Andrew Moore). Approach: using an evaluation of the object distribution; selection of the most likely points. Advantages: it is faster; the number of clusters is not fixed (in all cases it tends to be smaller).
Density based methods: MajorClust method. Principal idea: the total closeness of an object to the objects of its own cluster exceeds its closeness to any other cluster. Suboptimal solution: only part of the neighbors are considered on every step (to save time, to avoid merging).
Density based methods: MajorClust method. General algorithm: initially every object is one cluster and it joins its nearest neighbor. Every object evaluates its total closeness to its own cluster and, separately, to all other clusters. After this evaluation the object changes its membership and moves to the closest cluster. The search ends when the clusters no longer change. Preprocessing for MajorClust: many weak links can outweigh the few strongest ones, which distorts the results. So weak links should be eliminated before clustering.
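The relabeling loop described for MajorClust can be sketched as below. This is a simplified illustration and not the authors' implementation. The assumptions made here are: the input is a list of `(u, v, closeness)` links, objects are visited in sorted order, ties are broken deterministically by label, and the `min_weight` and `max_rounds` parameters are invented for the sketch.

```python
def major_clust(links, min_weight=0.0, max_rounds=100):
    """links: iterable of (u, v, closeness) triples; returns {object: cluster label}."""
    neighbours = {}
    for u, v, w in links:
        neighbours.setdefault(u, {})
        neighbours.setdefault(v, {})
        if w > min_weight:  # preprocessing: weak links are eliminated
            neighbours[u][v] = w
            neighbours[v][u] = w
    label = {node: node for node in neighbours}  # initially every object is one cluster
    for _ in range(max_rounds):
        changed = False
        for node in sorted(neighbours):
            # Total closeness of this object to every cluster among its neighbours.
            totals = {}
            for other, w in neighbours[node].items():
                totals[label[other]] = totals.get(label[other], 0.0) + w
            if not totals:
                continue  # isolated object keeps its own cluster
            best = max(totals.values())
            # Move only if some other cluster is strictly closer than the own one.
            if totals.get(label[node], 0.0) < best:
                label[node] = min(l for l, t in totals.items() if t == best)
                changed = True
        if not changed:  # clusters did not change: stop
            break
    return label


# Hypothetical graph: two triangles of strong links joined by one weak link.
links = [
    ("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
    ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
    ("c", "d", 0.1),
]
label = major_clust(links)
print(label)
```

Note that the number of clusters is never passed in; it results from the link structure, which is why the choice of `min_weight` in preprocessing matters.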
Cluster Validity. Definition: it reflects cluster separability and formally depends on the scatter inside clusters and the separation between clusters. Indexes (formal characteristics of structure): Dunn index (to be maximized), Davies-Bouldin index, hypervolume criterion (Andre Hardy), density expected measure DEM (Benno Stein).
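The slide does not spell out the formula of the Dunn index. A minimal sketch, assuming the common form (smallest distance between objects of different clusters divided by the largest cluster diameter) and Euclidean distance; the function name and the sample data are illustrative only.

```python
import math
from itertools import combinations


def dunn_index(clusters):
    """clusters: list of lists of points. Larger values mean better separated clusters."""
    # Separation between clusters: the closest pair of objects from different clusters.
    separation = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # Scatter inside clusters: the largest distance between two objects of one cluster.
    diameter = max(
        math.dist(p, q)
        for c in clusters
        for p, q in combinations(c, 2)
    )
    return separation / diameter


# The same four points grouped in a natural way and in an unnatural way.
good = [[(0, 0), (0, 1)], [(5, 0), (5, 1)]]
bad = [[(0, 0), (5, 0)], [(0, 1), (5, 1)]]
print(dunn_index(good), dunn_index(bad))
```

Because both the numerator and the denominator are determined by a single pair of objects, one outlying object can change the value sharply.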
Cluster Validity: number of clusters. Geometrical approach, two variants: the optimum (min, max) of the curve, or a jump of the curve. The Dunn index (to be maximized) is too sensitive to extreme cases.
Cluster Usability. Definition: it reflects the user's opinion and formally expresses the difference between the classes selected manually by a user (data, expert) and the clusters constructed by a given method (data, method). Cluster F-measure (Benno Stein). Here: i, j are the indexes of classes and clusters; $C^*_i$, $C_j$ are classes and clusters; prec(i,j), rec(i,j) are precision and recall.
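The formula itself is not preserved in this transcript. The sketch below therefore assumes one widely used form of the cluster F-measure, and that form is an assumption made here: prec(i,j) = |C_j ∩ C*_i| / |C_j|, rec(i,j) = |C_j ∩ C*_i| / |C*_i|, F(i,j) is their harmonic mean, and the total is the sum over classes of the best F(i,j), weighted by class size.

```python
def cluster_f_measure(classes, clusters):
    """classes: expert classes C*_i; clusters: computed clusters C_j (lists of sets)."""
    total = sum(len(c) for c in classes)
    score = 0.0
    for cls in classes:
        best = 0.0
        for clu in clusters:
            common = len(cls & clu)
            if common == 0:
                continue  # precision and recall are both zero
            prec = common / len(clu)
            rec = common / len(cls)
            best = max(best, 2 * prec * rec / (prec + rec))
        # Each class contributes its best-matching cluster, weighted by class size.
        score += len(cls) / total * best
    return score


# Hypothetical expert classes and two candidate clusterings.
expert = [{1, 2, 3}, {4, 5, 6}]
perfect = [{1, 2, 3}, {4, 5, 6}]
shifted = [{1, 2}, {3, 4, 5, 6}]
print(cluster_f_measure(expert, perfect))
print(cluster_f_measure(expert, shifted))
```

A value of 1 means the clusters reproduce the expert classes exactly; misplaced objects lower it.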
Validity and Usability. Conclusion: the density expected measure corresponds to the F-measure, which reflects the expert's opinion. So DEM can be an indicator of expert opinion.
Technologies of Clustering. Meta methods: they construct separated data sets using optimization criteria and constraints: neither too many nor too few clusters, neither too large nor too small clusters, etc. Visual methods: they present visual images to a user so that the clusters can be selected manually. Using different methods; comparing results.
Meta Methods: algorithm (example)
Notations: N is the number of objects in a given cluster; D is the diagonal of a given cluster. Initially N0 and their centers Ci are given.
Steps:
1. The K-medoid method (or any other one) is performed.
2. If N > Nmax or D > Dmax (in any cluster), then this cluster is divided into 2 parts. Go to step 1.
3. If N < Nmin or D < Dmin (in any cluster), then this cluster and the closest one are joined. Go to step 1.
4. When the number of iterations I > Imax, stop. Otherwise go to step 1.
Visual Clustering Clustering on dendrite  Clustering in space of factors
Authorship. Problem: authorship of Molière's dramatic works (comedies, dramas, ...): Corneille and/or Molière? Approach: style based indexing (NooJ can be used); clustering of all dramatic works; well-known dramatic works should be marked. Style: formal style estimations and informal style estimations. Formal style indicators: text complexity, text harmonicity. References: Labbé C., Labbé D. Inter-textual distance and authorship attribution: Corneille and Molière. Journal of Quantitative Linguistics, 2001, Vol. 8, No. 3, pp. 213-231.
Clustering: Authorship. Results: 1) 18 comedies of Molière should be attributed to Corneille. 2) 15 comedies of Molière are weakly connected with all his other works, so they may have been written by two authors. 3) 2 comedies of Corneille are now considered to be works of Molière. Etc. Note: for a certain time Molière and Corneille were friends.
Special and universal packages with clustering algorithms: 1. ClustAn (Scotland), www.clustan.com, Clustan Graphics-7 (2006). 2. MatLab: descriptions are available on the Internet. 3. Statistica: descriptions are available on the Internet. Learning: journals and congresses about clustering: 1. "Journal of Classification", Springer. 2. IFCS, International Federation of Classification Societies: conferences. 3. CSNA, Classification Society of North America: seminars, workshops.
Certain Observations. The number of methods for grouping data is a little larger than the number of researchers working in this area. The problem is not to find the best method for all cases; the problem is to find the method that is relevant for your data. Only you know which methods are best for your own data. The principal problems lie in choosing indexes (parameters) and a measure of closeness that are adequate to the given problem and the given data. Frequently the results are bad because of bad indexes and a bad measure, not because of a bad method!
Certain Observations. Antipodal methods: to be sure that the results are really good and do not depend on the method used, one should test these results using antipodal methods. Solomon G., 1977: "The most antipodal are the NN-method and K-means." Sensitivity: to be sure that the results do not depend essentially on the method's parameters, one should perform a sensitivity analysis by changing the adjustment parameters.
Some Problems. Question 1: how to reveal alien objects? Solution (idea): revealing a stable structure on different sets of objects, which are subsets of the given set. The object distribution reflects: real structure (nature) + noise (alien objects).
Some Problems. Question 2: how to accelerate classification? Solution (idea): filtering out the objects that give a minimum contribution to the decision function. Representative objects of each cluster.
CONTACT  INFORMATION Mikhail Alexandrov 1,2 , Pavel Makagonov 3 1  Autonomous University of Barcelona, Spain 2  Social Network Research Center with UCE, Slovakia 3  Mixtec Technological University, Mexico dyner1950@mail.ru, mpp@mixteco.utm.mx Petersburg 2008

  • 1. TECHNIQUES OF CLUSTERING (a short review for students) Mikhail Alexandrov 1,2 , Pavel Makagonov 3 1 Autonomous University of Barcelona, Spain 2 Social Network Research Center with UCE, Slovakia 3 Mixtec Technological University, Mexico dyner1950@mail.ru, mpp@mixteco.utm.mx Petersburg 2008
  • 2. Introduction Definitions Clustering Discussion Open Problems CONTENTS 
  • 3. Prof. Dr. Benno Stein Dr. Sven Meyer zu Eissen Weimar University, Germany Ideas, Materials, Collaboration - Structuring and Indexing - AI search in IR - Semi-Supervised Learning Dr. Xiaojin Zhu University of Wisconsin, Madison , USA
  • 4. Subject of Grouping: TEXTUAL DATA and NON-TEXTUAL DATA. Local Terminology: the source of the data, textual or non-textual, is not important. Data: work in the space of numerical parameters. Texts: work in the space of words. Example: a typical dialog between a passenger (US) and railway directory inquiries (DI)
  • 5. TEXTS (‘indexed’) Presentation of Textual Data Vector model (‘parameterized’) Local Terminology Indexed texts are only parameterized texts in the space of words
  • 6. TEXTS (‘indexed’) Presentation of Textual Data TEXTS (‘parameterized’) Local Terminology: Indexed texts are only parameterized texts in the space of themes => Category/Context Vector Models... Example: manually parameterized dialogs in the space of parameters (transport service and passenger needs)
  • 7. Introduction Definitions Clustering Discussion Open Problems CONTENTS 
  • 8. Types of Grouping: Unsupervised Learning (we know nothing about the data structure), Supervised Learning (we know the data structure well), Semi-Supervised Learning (we know something about the data structure)
  • 9. Types of Grouping. Clustering. Characteristics: absence of patterns or descriptions of classes, so the results are defined by the nature of the data themselves (N > 1). Synonyms: classification without teacher, unsupervised learning. Number of clusters: [ ] is known exactly; [x] is known approximately; [ ] is not known => searching. Classification. Characteristics: presence of patterns or descriptions of classes, so the results are defined by the user (N >= 1). Synonyms: classification with teacher, supervised learning. Special terms: categorization (of documents), diagnostics (technics, medicine), recognition (technics, science)
  • 10. Types of Grouping. “Semi Clustering/Classification”. Characteristics: presence of a limited number of patterns, so the results are defined both by the user and by the data themselves (N > 1). Synonyms: semi-classification, semi-supervised learning. Number of clusters/categories: [ ] is known exactly; [x] is known approximately; [ ] is not known => searching. Classification. Characteristics: presence of patterns or descriptions of classes, so the results are defined by the user (N >= 1). Synonyms: classification with teacher, supervised learning. Special terms: categorization (of documents), diagnostics (technics, medicine), recognition (technics, science)
  • 11. Objectives of Grouping 1. Organization (structuring) of an object set. The process is named data structuring. 2. Searching for interesting patterns. The process is named navigation. 3. Grouping for other applications: - Knowledge discovery (clustering) - Summarization of documents. Note: do not confuse the type of grouping with its objective
  • 12. Classification of methods. Based on belonging to cluster/category: Exclusive methods: every object belongs to only one cluster/category. These methods are named hard grouping methods. Non-exclusive methods: every object can belong to several clusters/categories. These methods are named soft grouping methods. Based on data presentation: Methods oriented to a free metric space: every object is presented as a point in a free space. Methods oriented to graphs: every object is presented as an element of a graph
  • 13. Fuzzy Grouping. Hard grouping: hard clustering, hard categorization. Soft grouping: soft clustering, soft categorization. Example: the distribution of letters from Muscovites to the Government is a soft categorization (the numbers in the table reflect the relative weight of each theme)
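A minimal Python sketch of the difference between soft and hard categorization. The theme names and scores below are hypothetical and are not taken from the table on the slide; the sketch only assumes that each letter comes with a non-negative score per theme.

```python
def soft_categorization(theme_scores):
    """Return the relative weight of every theme (the weights sum to 1)."""
    total = sum(theme_scores.values())
    return {theme: score / total for theme, score in theme_scores.items()}


def hard_categorization(theme_scores):
    """Assign the object to exactly one theme: the one with the largest score."""
    return max(theme_scores, key=theme_scores.get)


# Hypothetical scores of one letter for three hypothetical themes.
letter = {"housing": 6, "transport": 3, "health": 1}
weights = soft_categorization(letter)   # one relative weight per theme
single_theme = hard_categorization(letter)
```

Soft categorization keeps the whole weight vector, while hard categorization discards everything except the dominant theme.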
  • 14. General Scheme of Clustering Process - I. Preprocessing <= Processing. Principal idea: to transform texts into numerical form in order to use mathematical tools. Remember: our problem is grouping textual documents, not understanding them. Here: both the rough and the refined matrices are Object/Attribute matrices
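The transformation of texts into numerical form can be sketched in Python as follows. This is only an illustration under simple assumptions: words are obtained by lowercasing and splitting on whitespace, and the attributes are raw word counts. The example texts and the function name are hypothetical.

```python
from collections import Counter


def build_object_attribute_matrix(texts):
    """Return (vocabulary, matrix); matrix[i][j] counts word j in text i."""
    tokenized = [text.lower().split() for text in texts]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Hypothetical mini-corpus: every row of the matrix is one text (object),
# every column is one word (attribute).
texts = ["train to Barcelona", "train to Madrid", "ticket price to Madrid"]
vocabulary, matrix = build_object_attribute_matrix(texts)
```

In practice the rough matrix built this way is usually refined before clustering, for example by removing very frequent or very rare words.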
  • 15. General Scheme of Clustering Process - II. Preprocessing Processing <= Here: the Attribute/Attribute matrix can be used instead of the Object/Object matrix
  • 16. Matrices to be Considered
  • 17. Clustering for Categorization. Colour matrix “words-words” before clustering. The matrix contains the values of word co-occurrences in texts. Red: the value is above some threshold. White: the value is below it.
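A small Python sketch of such a “words-words” matrix. It assumes that two words co-occur when they appear in the same text and that the matrix entry counts the number of such texts; the slide does not specify the counting rule, so this is only one possible choice. The example texts and the threshold are hypothetical.

```python
def cooccurrence_matrix(texts):
    """Return (words, counts); counts[a][b] is the number of texts
    that contain both word a and word b (the diagonal stays zero)."""
    tokenized = [set(text.lower().split()) for text in texts]
    words = sorted(set().union(*tokenized))
    index = {word: k for k, word in enumerate(words)}
    counts = [[0] * len(words) for _ in words]
    for tokens in tokenized:
        for a in tokens:
            for b in tokens:
                if a != b:
                    counts[index[a]][index[b]] += 1
    return words, counts


def colour_matrix(counts, threshold):
    """Mark every cell red if its value exceeds the threshold, else white."""
    return [["red" if value > threshold else "white" for value in row]
            for row in counts]


texts = ["train ticket price", "train ticket", "hotel price"]
words, counts = cooccurrence_matrix(texts)
colours = colour_matrix(counts, threshold=1)
```

After clustering, the rows and columns would be reordered so that words of the same cluster are adjacent, which makes the red blocks visible.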
  • 18. Clustering for Categorization. Colour matrix “words-words” after clustering. Words are grouped. Cluster => Subdictionary. Absence of blocks means absence of subthemes
  • 19. Importance of Preprocessing ( it takes 60%-90% of efforts)
  • 20. Introduction Definitions Clustering Discussion Open Problems CONTENTS 
  • 21. Definitions. Def. 1: “Let V be the set of objects. A clustering C = { Ci | Ci ⊆ V } of V is a division of V into subsets for which ∪i Ci = V and Ci ∩ Cj = ∅ for i ≠ j.” Def. 2: “Let V be the set of nodes, E the set of arcs, and φ a weight function that reflects the distance between objects, so that we have a weighted graph G = { V, E, φ }. In this case C is named a clustering of G.” In the framework of the second definition, every Ci produces a subgraph G(Ci). Both the subsets Ci and the subgraphs G(Ci) are named clusters. Figure labels: Graph, Set, Clique
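The two conditions of Def. 1 (the clusters cover V and are pairwise disjoint) can be checked mechanically. The following Python sketch is only an illustration; the function name is hypothetical.

```python
def is_clustering(objects, clusters):
    """Check Def. 1: every cluster is a subset of the object set, the
    clusters are pairwise disjoint, and their union is the whole set."""
    objects = set(objects)
    covered = set()
    for cluster in clusters:
        cluster = set(cluster)
        if not cluster <= objects:   # Ci must be a subset of V
            return False
        if covered & cluster:        # Ci and Cj must not intersect
            return False
        covered |= cluster
    return covered == objects        # the union of all Ci must equal V
```

For example, `is_clustering({1, 2, 3, 4}, [{1, 2}, {3, 4}])` holds, while `[{1, 2}, {2, 3, 4}]` fails because the two subsets share object 2. As the next slide notes, the check says nothing about the quality or the number of clusters.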
  • 22. Definitions. Principal note: both definitions say nothing about the quality of clusters or about the number of clusters. Reason for the difficulties: at present there is no general agreement on a universal definition of the term ‘cluster’. What does it mean that a clustering is good? 1. The closeness between objects inside clusters is essentially greater than the closeness between the clusters themselves. 2. The constructed clusters correspond to the intuitive notions of users (they are natural clusters)
  • 23. Classification of methods (based on the way of grouping). 1. Hierarchy based methods (example: neighbors): N is not given. 2. Exemplar based methods (example: K-means): N is given. 3. Density based methods (example: MajorClust): N is calculated automatically
  • 24. Hierarchy based methods. Neighbors. General algorithm: initially every object is one cluster. A series of steps is performed. On every step the two closest clusters are merged. At the end we have one cluster.
  • 25. Hierarchy based methods Nearest neighbor method (NN)
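A minimal Python sketch of the hierarchical merging procedure with the nearest neighbor rule, under the assumption that the closeness of two clusters is the smallest Euclidean distance between their objects. The `target_count` parameter is an addition for illustration: it cuts the hierarchy early, whereas the full procedure runs until one cluster remains.

```python
import math


def nearest_neighbor_distance(a, b):
    """Distance between two clusters: the closest pair of their objects."""
    return min(math.dist(p, q) for p in a for q in b)


def nearest_neighbor_clustering(points, target_count=1):
    """Merge the two closest clusters until target_count clusters remain.
    Returns the clusters and the list of merge distances."""
    clusters = [[p] for p in points]   # initially every object is a cluster
    merge_distances = []
    while len(clusters) > target_count:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = nearest_neighbor_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merge_distances.append(d)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merge_distances


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merge_distances = nearest_neighbor_clustering(points, target_count=3)
```

A large jump in the sequence of merge distances is one informal hint about where to cut the hierarchy, since the number of clusters is not given in advance.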
  • 26. Exemplar based methods K - means, centroid
  • 27. Exemplar based methods. Method K-means. General algorithm: initially K centers are selected in some random way. A series of steps is performed. On every step the objects are distributed between the centers according to the criterion of the nearest center. Then all centers are recalculated. The process ends when the centers no longer change.
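The K-means steps can be sketched in Python as follows. This is a minimal illustration, assuming Euclidean distance, initial centers drawn at random from the objects themselves, and a `max_steps` safety limit; the `seed` parameter and the example points are hypothetical.

```python
import math
import random


def k_means(points, k, seed=0, max_steps=100):
    """Return (centers, groups) after running K-means on a list of tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # random initial centers
    groups = [[] for _ in range(k)]
    for _ in range(max_steps):
        # Distribute the objects according to the nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        # Recalculate every center as the mean of its group.
        new_centers = []
        for c, group in enumerate(groups):
            if group:
                new_centers.append(
                    tuple(sum(axis) / len(group) for axis in zip(*group)))
            else:
                new_centers.append(centers[c])   # keep an empty group's center
        if new_centers == centers:               # centers did not change: stop
            break
        centers = new_centers
    return centers, groups


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, groups = k_means(points, k=2)
```

The result generally depends on the initial centers, so it is common to repeat the run with several seeds and compare the groupings.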
  • 28. Exemplar based methods. Method X-means (Dan Pelleg, Andrew Moore). Approach: using an evaluation of the object distribution; selection of the most likely points. Advantages: - More rapid - The number of clusters is not fixed (in all cases it tends to be less)
  • 29. Density based methods. MajorClust method. Principal idea: the total closeness of an object to the objects of its own cluster exceeds its closeness to any other cluster. Suboptimal solution: only part of the neighbors are considered on every step (to save time, to avoid merging).
  • 30. Density based methods. MajorClust method. General algorithm: initially every object is one cluster and it joins its nearest neighbor. Every object evaluates the total closeness to its own cluster and, separately, to all other clusters. After this evaluation the object changes its membership and moves to the closest cluster. The search ends when the clusters no longer change. Preprocessing for MajorClust: many weak links can outweigh the several strongest ones, which distorts the results. So weak links should be eliminated before clustering
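A simplified Python sketch of the MajorClust idea, not the authors' implementation. It assumes a symmetric matrix of closeness values (0 means no link), visits the objects in a fixed order, and breaks ties by keeping the current cluster or, otherwise, taking the smallest label; the original method may resolve ties differently. The example graph is hypothetical.

```python
def drop_weak_links(similarity, threshold):
    """Preprocessing: set every link weaker than the threshold to zero."""
    return [[w if w >= threshold else 0.0 for w in row] for row in similarity]


def major_clust(similarity, max_rounds=100):
    """Return one cluster label per object."""
    n = len(similarity)
    labels = list(range(n))              # initially every object is a cluster
    for _ in range(max_rounds):
        changed = False
        for node in range(n):
            # Total closeness of this object to every cluster among its neighbors.
            closeness = {}
            for other in range(n):
                weight = similarity[node][other]
                if other != node and weight > 0:
                    label = labels[other]
                    closeness[label] = closeness.get(label, 0.0) + weight
            if not closeness:
                continue                 # isolated object keeps its own cluster
            best = max(closeness.values())
            if closeness.get(labels[node], 0.0) < best:
                labels[node] = min(l for l, v in closeness.items() if v == best)
                changed = True
        if not changed:                  # clusters do not change: stop
            break
    return labels


# Two triangles of strong links joined by one weak link between objects 2 and 3.
links = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0,
         (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0,
         (2, 3): 0.1}
similarity = [[0.0] * 6 for _ in range(6)]
for (a, b), w in links.items():
    similarity[a][b] = similarity[b][a] = w
labels = major_clust(similarity)
```

The number of clusters is not passed as a parameter; it results from the label propagation itself.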
  • 31. Cluster Validity. Definition: it reflects cluster separability and formally depends on: - Scatter inside clusters - Separation between clusters. Indexes (formal characteristics of the structure): Dunn index, Davies-Bouldin index, Hypervolume criterion (Andre Hardy), Density expected measure DEM (Benno Stein). Figure: Dunn index (to be maximized)
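The formula of the Dunn index is not present in this transcript. The Python sketch below assumes its commonly used form: the smallest distance between objects of different clusters divided by the largest cluster diameter, with Euclidean distance. It needs at least two clusters and at least one cluster with two distinct objects.

```python
import math


def dunn_index(clusters):
    """Separation between clusters divided by the largest scatter inside one."""
    separation = min(
        math.dist(p, q)
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
        for p in a
        for q in b
    )
    diameter = max(
        math.dist(p, q) for cluster in clusters for p in cluster for q in cluster
    )
    return separation / diameter


# The same four points grouped in a natural and in an unnatural way.
good = dunn_index([[(0, 0), (0, 1)], [(5, 0), (5, 1)]])
bad = dunn_index([[(0, 0), (5, 0)], [(0, 1), (5, 1)]])
```

The larger value belongs to the grouping with compact, well separated clusters, which is why the index is to be maximized.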
  • 32. Cluster Validity. Number of clusters. Geometrical approach, two variants: optimum (min, max) of the curve; jump of the curve. Figure: Dunn index (to be maximized); it is too sensitive to extreme cases
  • 33. Cluster Usability. Definition: it reflects the user’s opinion and formally expresses the difference between: - Classes selected manually by a user - Clusters constructed by a given method. Cluster F-measure (Benno Stein). Figure labels: Data, Expert, Method. Here: i, j are the indexes of classes and clusters; C*i, Cj are classes and clusters; prec(i,j), rec(i,j) are precision and recall
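The formula itself is missing from this transcript. The Python sketch below assumes the commonly used cluster F-measure: prec(i,j) = |Cj ∩ C*i| / |Cj|, rec(i,j) = |Cj ∩ C*i| / |C*i|, F(i,j) is their harmonic mean, and the total is the sum over classes of (|C*i| / |D|) · max over j of F(i,j), where D is the whole data set. The function name and the example data are hypothetical.

```python
def cluster_f_measure(classes, clusters):
    """Compare expert classes with constructed clusters (1.0 = identical)."""
    classes = [set(c) for c in classes]
    clusters = [set(c) for c in clusters]
    total = sum(len(c) for c in classes)
    score = 0.0
    for expert_class in classes:
        best = 0.0
        for cluster in clusters:
            common = len(expert_class & cluster)
            if common == 0:
                continue
            prec = common / len(cluster)        # share of the cluster that is right
            rec = common / len(expert_class)    # share of the class that was found
            best = max(best, 2 * prec * rec / (prec + rec))
        score += len(expert_class) / total * best
    return score


expert = [{1, 2, 3}, {4, 5, 6}]
perfect = cluster_f_measure(expert, [{1, 2, 3}, {4, 5, 6}])
shifted = cluster_f_measure(expert, [{1, 2}, {3, 4, 5, 6}])
```

Moving a single object to the wrong cluster lowers the score below 1, so the measure reacts to every disagreement with the expert.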
  • 34. Validity and Usability. Conclusion: the density expected measure corresponds to the F-measure, which reflects the expert’s opinion. So DEM can be an indicator of expert opinion
  • 35. Technologies of Clustering. Meta methods: they construct separated data sets using optimization criteria and limitations: neither too many nor too few clusters, neither too large nor too small clusters, etc. Visual methods: they present visual images to a user so that the clusters can be selected manually. Using different methods. Comparing results
  • 36. Meta Methods. Algorithm (example). Notations: N is the number of objects in a given cluster; D is the diagonal of a given cluster. Initially N0 and the centers Ci are given. Steps: 1. The K-medoid method (or any other one) is performed. 2. If N > Nmax or D > Dmax (in any cluster), then this cluster is divided into 2 parts. Go to step 1. 3. If N < Nmin or D < Dmin (in any cluster), then this cluster and the closest one are joined. Go to step 1. 4. When the number of iterations I > Imax, stop. Otherwise go to step 1
  • 37. Visual Clustering Clustering on dendrite Clustering in space of factors
  • 38. Authorship. Problem: authorship of Molière's dramatic works (comedies, dramas,...). Corneille and/or Molière? Approach: style-based indexing (NooJ can be used); clustering of all dramatic works; well-known dramatic works should be marked. Style: - Formal style estimations - Informal style estimations. Formal style indicators: - Text Complexity - Text Harmonicity. References: Labbé C., Labbé D. Inter-textual distance and authorship attribution: Corneille and Molière. Journal of Quantitative Linguistics. 2001. Vol. 8, No. 3, pp. 213-231
  • 39. Clustering. Authorship. Results: 1) 18 comedies of Molière should be attributed to Corneille. 2) 15 comedies of Molière are weakly connected with all his other works, so they may have been written by two authors. 3) 2 comedies of Corneille are now considered to be works of Molière. etc. Note: for a certain time Molière and Corneille were friends
  • 40. Special and Universal packages with algorithms of Clustering: 1. ClustAn (Scotland), www.clustan.com, Clustan Graphics-7 (2006). 2. MatLab: descriptions are on the Internet. 3. Statistica: descriptions are on the Internet. Learning. Journals and Congresses about Clustering: 1. “Journal of Classification”, Springer. 2. IFCS - International Federation of Classification Societies, Conferences. 3. CSNA - Classification Society of North America, Seminars, Workshops
  • 41. Introduction Definitions Clustering Discussion Open Problems CONTENTS 
  • 42. Certain Observations. The number of methods for grouping data is slightly larger than the number of researchers working in this area. The problem is not to find the best method for all cases. The problem is to find the method that is relevant for your data. Only you know which methods are best for your own data. The principal problems are the choice of indexes (parameters) and of a measure of closeness that are adequate to the given problem and the given data. Frequently the results are bad because of bad indexes and a bad measure, not because of a bad method!
  • 43. Certain Observations. Antipodal methods: to be sure that the results are really good and do not depend on the method used, one should test these results using antipodal methods. Solomon G., 1977: “The most antipodal are: NN-method and K-means”. Sensitivity: to be sure that the results do not depend essentially on the method’s parameters, one should perform a sensitivity analysis by changing the adjustment parameters.
  • 44. Introduction Definitions Clustering Conclusions Open Problems CONTENTS 
  • 45. Some Problems. Question 1: How to reveal alien objects? Solution (idea): revealing a stable structure on different sets of objects, which are subsets of the given set. Object distribution reflects: real structure (nature) + noise (alien objects)
  • 46. Some Problems. Question 2: How to accelerate classification? Solution (idea): filtering out objects that give a minimum contribution to the decision function. Representative objects of each cluster
  • 47. CONTACT INFORMATION Mikhail Alexandrov 1,2 , Pavel Makagonov 3 1 Autonomous University of Barcelona, Spain 2 Social Network Research Center with UCE, Slovakia 3 Mixtec Technological University, Mexico dyner1950@mail.ru, mpp@mixteco.utm.mx Petersburg 2008