Introduction To Machine Learning - Wikipedia
1 Machine learning
1.1 Overview
1.1.1 Types of problems and tasks
1.2 History and relationships to other fields
1.2.1 Relation to statistics
1.3 Theory
1.4 Approaches
1.4.1 Decision tree learning
1.4.2 Association rule learning
1.4.3 Artificial neural networks
1.4.4 Inductive logic programming
1.4.5 Support vector machines
1.4.6 Clustering
1.4.7 Bayesian networks
1.4.8 Reinforcement learning
1.4.9 Representation learning
1.4.10 Similarity and metric learning
1.4.11 Sparse dictionary learning
1.4.12 Genetic algorithms
1.5 Applications
1.6 Software
1.6.1 Open-source software
1.6.2 Commercial software with open-source editions
1.6.3 Commercial software
1.7 Journals
1.8 Conferences
1.9 See also
1.10 References
1.11 Further reading
1.12 External links
2 Artificial intelligence
2.1 History
2.2 Research
2.2.1 Goals
2.2.2 Approaches
2.2.3 Tools
2.2.4 Evaluating progress
2.3 Applications
2.3.1 Competitions and prizes
2.3.2 Platforms
2.3.3 Toys
2.4 Philosophy and ethics
2.4.1 The possibility/impossibility of artificial general intelligence
2.4.2 Intelligent behaviour and machine ethics
2.4.3 Machine consciousness, sentience and mind
2.4.4 Superintelligence
2.5 In fiction
2.6 See also
2.7 Notes
2.8 References
2.8.1 AI textbooks
2.8.2 History of AI
2.8.3 Other sources
2.9 Further reading
2.10 External links
3 Information theory
3.1 Overview
3.2 Historical background
3.3 Quantities of information
3.3.1 Entropy
3.3.2 Joint entropy
3.3.3 Conditional entropy (equivocation)
3.3.4 Mutual information (transinformation)
3.3.5 Kullback–Leibler divergence (information gain)
3.3.6 Kullback–Leibler divergence of a prior from the truth
3.3.7 Other quantities
3.4 Coding theory
3.4.1 Source theory
3.4.2 Channel capacity
3.5 Applications to other fields
3.5.1 Intelligence uses and secrecy applications
3.5.2 Pseudorandom number generation
3.5.3 Seismic exploration
3.5.4 Semiotics
3.5.5 Miscellaneous applications
3.6 See also
3.6.1 Applications
3.6.2 History
3.6.3 Theory
3.6.4 Concepts
3.7 References
3.7.1 The classic work
3.7.2 Other journal articles
3.7.3 Textbooks on information theory
3.7.4 Other books
3.8 External links
4 Computational science
4.1 Applications of computational science
4.1.1 Numerical simulations
4.1.2 Model fitting and data analysis
4.1.3 Computational optimization
4.2 Methods and algorithms
4.3 Reproducibility and open research computing
4.4 Journals
4.5 Education
4.6 Related fields
4.7 See also
4.8 References
4.9 Additional sources
4.10 External links
6 Predictive analytics
6.1 Definition
6.2 Types
6.2.1 Predictive models
6.2.2 Descriptive models
6.2.3 Decision models
6.3 Applications
6.3.1 Analytical customer relationship management (CRM)
6.3.2 Clinical decision support systems
6.3.3 Collection analytics
6.3.4 Cross-sell
6.3.5 Customer retention
6.3.6 Direct marketing
6.3.7 Fraud detection
6.3.8 Portfolio, product or economy-level prediction
6.3.9 Risk management
6.3.10 Underwriting
6.4 Technology and big data influences
6.5 Analytical Techniques
6.5.1 Regression techniques
6.5.2 Machine learning techniques
6.6 Tools
6.6.1 PMML
6.7 Criticism
6.8 See also
6.9 References
6.10 Further reading
7 Business intelligence
7.1 Components
7.2 History
7.3 Data warehousing
7.4 Comparison with competitive intelligence
7.5 Comparison with business analytics
7.6 Applications in an enterprise
7.7 Prioritization of projects
7.8 Success factors of implementation
7.8.1 Business sponsorship
7.8.2 Business needs
7.8.3 Amount and quality of available data
7.9 User aspect
7.10 BI Portals
7.11 Marketplace
7.11.1 Industry-specific
7.12 Semi-structured or unstructured data
7.12.1 Unstructured data vs. semi-structured data
7.12.2 Problems with semi-structured or unstructured data
7.12.3 The use of metadata
7.13 Future
7.14 See also
7.15 References
7.16 Bibliography
7.17 External links
8 Analytics
8.1 Analytics vs. analysis
8.2 Examples
8.2.1 Marketing optimization
8.2.2 Portfolio analysis
8.2.3 Risk analytics
8.2.4 Digital analytics
8.2.5 Security analytics
8.2.6 Software analytics
8.3 Challenges
8.4 Risks
8.5 See also
8.6 References
8.7 External links
9 Data mining
9.1 Etymology
9.2 Background
9.2.1 Research and evolution
9.3 Process
9.3.1 Pre-processing
9.3.2 Data mining
9.3.3 Results validation
9.4 Standards
9.5 Notable uses
9.5.1 Games
9.5.2 Business
9.5.3 Science and engineering
9.5.4 Human rights
9.5.5 Medical data mining
9.5.6 Spatial data mining
10 Big data
10.1 Definition
10.2 Characteristics
10.3 Architecture
10.4 Technologies
10.5 Applications
10.5.1 Government
10.5.2 International development
10.5.3 Manufacturing
10.5.4 Media
10.5.5 Private sector
10.5.6 Science
10.6 Research activities
10.7 Critique
10.7.1 Critiques of the big data paradigm
10.7.2 Critiques of big data execution
10.8 See also
10.9 References
10.10 Further reading
26.13 References
26.14 Further reading
26.15 External links
26.15.1 Online calculators
30 Goodness of fit
30.1 Fit of distributions
30.2 Regression analysis
30.2.1 Example
59 Perceptron
59.1 History
59.2 Definition
59.3 Learning algorithm
59.3.1 Definitions
59.3.2 Steps
59.3.3 Convergence
59.4 Variants
59.5 Example
59.6 Multiclass perceptron
59.7 References
59.8 External links
Machine learning

For the journal, see Machine Learning (journal).

Machine learning is a subfield of computer science[1] that evolved from the study of pattern recognition and computational learning theory in artificial intelligence.[1] Machine learning explores the construction and study of algorithms that can learn from and make predictions on data.[2] Such algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions,[3]:2 rather than following strictly static program instructions.

Machine learning is closely related to and often overlaps with computational statistics, a discipline that also specializes in prediction-making. It has strong ties to mathematical optimization, which delivers methods, theory and application domains to the field. Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms is infeasible. Example applications include spam filtering, optical character recognition (OCR),[4] search engines and computer vision. Machine learning is sometimes conflated with data mining,[5] although that focuses more on exploratory data analysis.[6] Machine learning and pattern recognition can be viewed as two facets of the same field.[3]:vii

When employed in industrial contexts, machine learning methods may be referred to as predictive analytics or predictive modelling.

1.1 Overview

This follows Alan Turing's proposal in his paper "Computing Machinery and Intelligence" that the question "Can machines think?" be replaced with the question "Can machines do what we (as thinking entities) can do?"[9]

1.1.1 Types of problems and tasks

Machine learning tasks are typically classified into three broad categories, depending on the nature of the learning signal or feedback available to a learning system. These are:[10]

• Supervised learning: The computer is presented with example inputs and their desired outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs (a minimal code sketch follows this list).

• Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end.

• Reinforcement learning: A computer program interacts with a dynamic environment in which it must perform a certain goal (such as driving a vehicle), without a teacher explicitly telling it whether it has come close to its goal or not. Another example is learning to play a game by playing against an opponent.[3]:3
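As a minimal sketch of the supervised category above, the snippet below assumes scikit-learn (one of the open-source packages listed under Software later in this chapter) and its bundled iris data set; the particular model is an arbitrary illustrative choice, not the article's own example:

    # Supervised learning: learn a general rule from example inputs and their
    # desired outputs, then check the rule on examples the learner has not seen.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)                  # example inputs and their labels
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)          # hold out unseen examples

    model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    print("accuracy on unseen examples:", model.score(X_test, y_test))

The same fit-then-evaluate pattern recurs, with different models, across many of the approaches listed later in this chapter.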
• Density estimation finds the distribution of inputs in some space.

• Dimensionality reduction simplifies inputs by mapping them into a lower-dimensional space. Topic modeling is a related problem, where a program is given a list of human-language documents and is tasked to find out which documents cover similar topics.

Data mining focuses on the discovery of (previously) unknown properties in the data. This is the analysis step of Knowledge Discovery in Databases.

The two areas overlap in many ways: data mining uses many machine learning methods, but often with a slightly different goal in mind. On the other hand, machine learning also employs data mining methods as "unsupervised learning" or as a preprocessing step to improve learner accuracy.
Much of the confusion between these two research communities (which do often have separate conferences and separate journals, ECML PKDD being a major exception) comes from the basic assumptions they work with: in machine learning, performance is usually evaluated with respect to the ability to reproduce known knowledge, while in Knowledge Discovery and Data Mining (KDD) the key task is the discovery of previously unknown knowledge. Evaluated with respect to known knowledge, an uninformed (unsupervised) method will easily be outperformed by supervised methods, while in a typical KDD task, supervised methods cannot be used due to the unavailability of training data.

Machine learning also has intimate ties to optimization: many learning problems are formulated as minimization of some loss function on a training set of examples. Loss functions express the discrepancy between the predictions of the model being trained and the actual problem instances (for example, in classification, one wants to assign a label to instances, and models are trained to correctly predict the pre-assigned labels of a set of examples). The difference between the two fields arises from the goal of generalization: while optimization algorithms can minimize the loss on a training set, machine learning is concerned with minimizing the loss on unseen samples.[12]

1.2.1 Relation to statistics

1.3 Theory

The training examples come from some generally unknown probability distribution (considered representative of the space of occurrences), and the learner has to build a general model about this space that enables it to produce sufficiently accurate predictions in new cases.

The computational analysis of machine learning algorithms and their performance is a branch of theoretical computer science known as computational learning theory. Because training sets are finite and the future is uncertain, learning theory usually does not yield guarantees of the performance of algorithms. Instead, probabilistic bounds on the performance are quite common. The bias–variance decomposition is one way to quantify generalization error.

In addition to performance bounds, computational learning theorists study the time complexity and feasibility of learning. In computational learning theory, a computation is considered feasible if it can be done in polynomial time. There are two kinds of time complexity results. Positive results show that a certain class of functions can be learned in polynomial time. Negative results show that certain classes cannot be learned in polynomial time.

There are many similarities between machine learning theory and statistical inference, although they use different terms.
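The contrast drawn above between minimizing loss on a training set and minimizing loss on unseen samples can be made concrete with a small numerical sketch; the noise level, sample sizes and polynomial degrees are arbitrary assumptions, and only numpy is assumed:

    # A very flexible model can fit the training set almost perfectly yet
    # generalize worse than a simpler model on fresh samples from the same source.
    import numpy as np

    rng = np.random.default_rng(0)
    x_train = rng.uniform(-1, 1, 30)
    y_train = np.sin(3 * x_train) + rng.normal(0, 0.2, 30)   # noisy training examples
    x_new = rng.uniform(-1, 1, 500)
    y_new = np.sin(3 * x_new) + rng.normal(0, 0.2, 500)      # unseen samples

    for degree in (3, 15):                                   # modest vs. highly flexible model
        coeffs = np.polyfit(x_train, y_train, degree)        # minimize squared loss on the training set
        train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        new_mse = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
        print(f"degree {degree}: training MSE {train_mse:.3f}, unseen MSE {new_mse:.3f}")

The flexible model typically reports the lower training error but the higher error on the unseen samples, which is exactly the gap that generalization bounds try to characterize.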
1.4 Approaches
Main article: List of machine learning algorithms
... higher-level, more abstract features defined in terms of (or generating) lower-level features. It has been argued that an intelligent machine is one that learns a representation that disentangles the underlying factors of variation that explain the observed data.[18]

1.4.10 Similarity and metric learning

In this problem, the learning machine is given pairs of examples that are considered similar and pairs of less similar objects. It then needs to learn a similarity function (or a distance metric function) that can predict if new objects are similar. It is sometimes used in recommendation systems.

1.4.12 Genetic algorithms

In turn, machine learning techniques have been used to improve the performance of genetic and evolutionary algorithms.[23]

1.5 Applications

Applications for machine learning include:

• Affective computing
• Bioinformatics
• Brain–machine interfaces
• Cheminformatics
In 2006, the online movie company Netflix held the first "Netflix Prize" competition to find a program to better predict user preferences and improve the accuracy on its existing Cinematch movie recommendation algorithm by at least 10%. A joint team made up of researchers from AT&T Labs-Research in collaboration with the teams Big Chaos and Pragmatic Theory built an ensemble model to win the Grand Prize in 2009 for $1 million.[26] Shortly after the prize was awarded, Netflix realized that viewers' ratings were not the best indicators of their viewing patterns ("everything is a recommendation") and they changed their recommendation engine accordingly.[27]

In 2010 The Wall Street Journal wrote about money management firm Rebellion Research's use of machine learning to predict economic movements. The article describes Rebellion Research's prediction of the financial crisis and economic recovery.[28]

In 2014 it was reported that a machine learning algorithm had been applied in art history to study fine art paintings, and that it may have revealed previously unrecognized influences between artists.[29]

1.6 Software

1.6.1 Open-source software

• dlib
• ELKI
• Encog
• H2O
• MLPACK
• R
• scikit-learn
• Shogun
• Spark
• Yooreeka
• Weka

1.6.2 Commercial software with open-source editions

• KNIME
• RapidMiner

1.6.3 Commercial software

• Amazon Machine Learning
• Angoss KnowledgeSTUDIO
• Databricks
• IBM SPSS Modeler
• KXEN Modeler
• LIONsolver
• Neural Designer
• NeuroSolutions
• Oracle Data Mining
• RCASE

1.8 Conferences

• Conference on Neural Information Processing Systems
1.9 See also

• Adaptive control
• Cache language model
• Cognitive model
• Cognitive science
• Computational intelligence
• Computational neuroscience
• Ethics of artificial intelligence
• Existential risk of artificial general intelligence
• Explanation-based learning
• Hidden Markov model
• Important publications in machine learning
• List of machine learning algorithms

1.10 References

[1] http://www.britannica.com/EBchecked/topic/1116194/machine-learning This is a tertiary source that clearly includes information from other sources but does not name them.

[2] Ron Kohavi; Foster Provost (1998). "Glossary of terms". Machine Learning 30: 271–274.

[3] C. M. Bishop (2006). Pattern Recognition and Machine Learning. Springer. ISBN 0-387-31073-8.

[4] Wernick, Yang, Brankov, Yourganov and Strother, "Machine Learning in Medical Imaging", IEEE Signal Processing Magazine, vol. 27, no. 4, July 2010, pp. 25–38.

[5] Mannila, Heikki (1996). "Data mining: machine learning, statistics, and databases". Int'l Conf. Scientific and Statistical Database Management. IEEE Computer Society.

[6] Friedman, Jerome H. (1998). "Data Mining and Statistics: What's the connection?". Computing Science and Statistics 29 (1): 3–9.

[7] Phil Simon (March 18, 2013). Too Big to Ignore: The Business Case for Big Data. Wiley. p. 89. ISBN 978-1-118-63817-0.

[8] Mitchell, T. (1997). Machine Learning. McGraw Hill. ISBN 0-07-042807-7, p. 2.

[9] Harnad, Stevan (2008), "The Annotation Game: On Turing (1950) on Computing, Machinery, and Intelligence", in Epstein, Robert; Peters, Grace, The Turing Test Sourcebook: Philosophical and Methodological Issues in the Quest for the Thinking Computer, Kluwer.

[10] Russell, Stuart; Norvig, Peter (2003) [1995]. Artificial Intelligence: A Modern Approach (2nd ed.). Prentice Hall. ISBN 978-0137903955.

[12] Le Roux, Nicolas; Bengio, Yoshua; Fitzgibbon, Andrew (2012). "Improving First and Second-Order Methods by Modeling Uncertainty". In Sra, Suvrit; Nowozin, Sebastian; Wright, Stephen J. Optimization for Machine Learning. MIT Press. p. 404.

[13] MI Jordan (2014-09-10). "statistics and machine learning". reddit. Retrieved 2014-10-01.

[14] http://projecteuclid.org/download/pdf_1/euclid.ss/1009213726

[15] Gareth James; Daniela Witten; Trevor Hastie; Robert Tibshirani (2013). An Introduction to Statistical Learning. Springer. p. vii.

[16] Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar (2012). Foundations of Machine Learning. MIT Press. ISBN 978-0-262-01825-8.

[17] Lu, Haiping; Plataniotis, K.N.; Venetsanopoulos, A.N. (2011). "A Survey of Multilinear Subspace Learning for Tensor Data" (PDF). Pattern Recognition 44 (7): 1540–1551. doi:10.1016/j.patcog.2011.01.004.

[18] Yoshua Bengio (2009). Learning Deep Architectures for AI. Now Publishers Inc. pp. 1–3. ISBN 978-1-60198-294-0.

[19] A. M. Tillmann (2015). "On the Computational Intractability of Exact and Approximate Dictionary Learning". IEEE Signal Processing Letters 22 (1): 45–49.

[20] Aharon, M., M. Elad and A. Bruckstein (2006). "K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation". IEEE Transactions on Signal Processing 54 (11): 4311–4322.

[21] Goldberg, David E.; Holland, John H. (1988). "Genetic algorithms and machine learning". Machine Learning 3 (2): 95–99.

[22] Michie, D.; Spiegelhalter, D. J.; Taylor, C. C. (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood.

[23] Zhang, Jun; Zhan, Zhi-hui; Lin, Ying; Chen, Ni; Gong, Yue-jiao; Zhong, Jing-hui; Chung, Henry S.H.; Li, Yun; Shi, Yu-hui (2011). "Evolutionary Computation Meets Machine Learning: A Survey" (PDF). IEEE Computational Intelligence Magazine 6 (4): 68–75.

[24] Tesauro, Gerald (March 1995). "Temporal Difference Learning and TD-Gammon". Communications of the ACM 38 (3).

[25] Daniel Jurafsky and James H. Martin (2009). Speech and Language Processing. Pearson Education. p. 207.

[26] "BelKor Home Page" research.att.com
Artificial intelligence

"AI" redirects here. For other uses, see Ai and Artificial intelligence (disambiguation).

Artificial intelligence (AI) is the intelligence exhibited by machines or software. It is also the name of the academic field of study which studies how to create computers and computer software that are capable of intelligent behavior. Major AI researchers and textbooks define this field as "the study and design of intelligent agents",[1] in which an intelligent agent is a system that perceives its environment and takes actions that maximize its chances of success.[2] John McCarthy, who coined the term in 1955,[3] defines it as "the science and engineering of making intelligent machines".[4]

AI research is highly technical and specialized, and is deeply divided into subfields that often fail to communicate with each other.[5] Some of the division is due to social and cultural factors: subfields have grown up around particular institutions and the work of individual researchers. AI research is also divided by several technical issues. Some subfields focus on the solution of specific problems. Others focus on one of several possible approaches or on the use of a particular tool or towards the accomplishment of particular applications.

The central problems (or goals) of AI research include reasoning, knowledge, planning, learning, natural language processing (communication), perception and the ability to move and manipulate objects.[6] General intelligence is still among the field's long-term goals.[7] Currently popular approaches include statistical methods, computational intelligence and traditional symbolic AI. There are a large number of tools used in AI, including versions of search and mathematical optimization, logic, methods based on probability and economics, and many others. The AI field is interdisciplinary, in which a number of sciences and professions converge, including computer science, mathematics, psychology, linguistics, philosophy and neuroscience, as well as other specialized fields such as artificial psychology.

The field was founded on the claim that a central property of humans, intelligence (the sapience of Homo sapiens), "can be so precisely described that a machine can be made to simulate it".[8] This raises philosophical issues about the nature of the mind and the ethics of creating artificial beings endowed with human-like intelligence, issues which have been addressed by myth, fiction and philosophy since antiquity.[9] Artificial intelligence has been the subject of tremendous optimism[10] but has also suffered stunning setbacks.[11] Today it has become an essential part of the technology industry, providing the heavy lifting for many of the most challenging problems in computer science.[12]

2.1 History

Main articles: History of artificial intelligence and Timeline of artificial intelligence

Thinking machines and artificial beings appear in Greek myths, such as Talos of Crete, the bronze robot of Hephaestus, and Pygmalion's Galatea.[13] Human likenesses believed to have intelligence were built in every major civilization: animated cult images were worshiped in Egypt and Greece[14] and humanoid automatons were built by Yan Shi, Hero of Alexandria and Al-Jazari.[15] It was also widely believed that artificial beings had been created by Jābir ibn Hayyān, Judah Loew and Paracelsus.[16] By the 19th and 20th centuries, artificial beings had become a common feature in fiction, as in Mary Shelley's Frankenstein or Karel Čapek's R.U.R. (Rossum's Universal Robots).[17] Pamela McCorduck argues that all of these are some examples of an ancient urge, as she describes it, "to forge the gods".[9] Stories of these creatures and their fates discuss many of the same hopes, fears and ethical concerns that are presented by artificial intelligence.

Mechanical or "formal" reasoning has been developed by philosophers and mathematicians since antiquity. The study of logic led directly to the invention of the programmable digital electronic computer, based on the work of mathematician Alan Turing and others. Turing's theory of computation suggested that a machine, by shuffling symbols as simple as "0" and "1", could simulate any conceivable act of mathematical deduction.[18][19] This, along with concurrent discoveries in neurology,
information theory and cybernetics, inspired a small group of researchers to begin to seriously consider the possibility of building an electronic brain.[20]

The field of AI research was founded at a conference on the campus of Dartmouth College in the summer of 1956.[21] The attendees, including John McCarthy, Marvin Minsky, Allen Newell, Arthur Samuel, and Herbert Simon, became the leaders of AI research for many decades.[22] They and their students wrote programs that were, to most people, simply astonishing:[23] computers were winning at checkers, solving word problems in algebra, proving logical theorems and speaking English.[24] By the middle of the 1960s, research in the U.S. was heavily funded by the Department of Defense[25] and laboratories had been established around the world.[26] AI's founders were profoundly optimistic about the future of the new field: Herbert Simon predicted that "machines will be capable, within twenty years, of doing any work a man can do" and Marvin Minsky agreed, writing that "within a generation ... the problem of creating 'artificial intelligence' will substantially be solved".[27]

They had failed to recognize the difficulty of some of the problems they faced.[28] In 1974, in response to the criticism of Sir James Lighthill[29] and ongoing pressure from the US Congress to fund more productive projects, both the U.S. and British governments cut off all undirected exploratory research in AI. The next few years would later be called an "AI winter",[30] a period when funding for AI projects was hard to find.

In the early 1980s, AI research was revived by the commercial success of expert systems,[31] a form of AI program that simulated the knowledge and analytical skills of one or more human experts. By 1985 the market for AI had reached over a billion dollars. At the same time, Japan's fifth generation computer project inspired the U.S. and British governments to restore funding for academic research in the field.[32] However, beginning with the collapse of the Lisp Machine market in 1987, AI once again fell into disrepute, and a second, longer-lasting AI winter began.[33]

In the 1990s and early 21st century, AI achieved its greatest successes, albeit somewhat behind the scenes. Artificial intelligence is used for logistics, data mining, medical diagnosis and many other areas throughout the technology industry.[12] The success was due to several factors: the increasing computational power of computers (see Moore's law), a greater emphasis on solving specific sub-problems, the creation of new ties between AI and other fields working on similar problems, and a new commitment by researchers to solid mathematical methods and rigorous scientific standards.[34]

On 11 May 1997, Deep Blue became the first computer chess-playing system to beat a reigning world chess champion, Garry Kasparov.[35] In February 2011, in a Jeopardy! quiz show exhibition match, IBM's question answering system, Watson, defeated the two greatest Jeopardy champions, Brad Rutter and Ken Jennings, by a significant margin.[36] The Kinect, which provides a 3D body-motion interface for the Xbox 360 and the Xbox One,[37] uses algorithms that emerged from lengthy AI research, as do intelligent personal assistants in smartphones.[38]

2.2 Research

2.2.1 Goals

"You awake one morning to find your brain has another lobe functioning. Invisible, this auxiliary lobe answers your questions with information beyond the realm of your own memory, suggests plausible courses of action, and asks questions that help bring out relevant facts. You quickly come to rely on the new lobe so much that you stop wondering how it works. You just use it. This is the dream of artificial intelligence."
BYTE, April 1985[39]

The general problem of simulating (or creating) intelligence has been broken down into a number of specific sub-problems. These consist of particular traits or capabilities that researchers would like an intelligent system to display. The traits described below have received the most attention.[6]

Deduction, reasoning, problem solving

Early AI researchers developed algorithms that imitated the step-by-step reasoning that humans use when they solve puzzles or make logical deductions.[40] By the late 1980s and 1990s, AI research had also developed highly successful methods for dealing with uncertain or incomplete information, employing concepts from probability and economics.[41]

For difficult problems, most of these algorithms can require enormous computational resources; most experience a "combinatorial explosion": the amount of memory or computer time required becomes astronomical when the problem goes beyond a certain size. The search for more efficient problem-solving algorithms is a high priority for AI research.[42]

Human beings solve most of their problems using fast, intuitive judgements rather than the conscious, step-by-step deduction that early AI research was able to model.[43] AI has made some progress at imitating this kind of "sub-symbolic" problem solving: embodied agent approaches emphasize the importance of sensorimotor skills to higher reasoning; neural net research attempts to simulate the structures inside the brain that give rise to
this skill; statistical approaches to AI mimic the probabilistic nature of the human ability to guess.

... requires. AI research has explored a number of solutions to this problem.[52]

Learning

Main article: Machine learning

Machine learning is the study of computer algorithms that improve automatically through experience[62][63] and has been central to AI research since the field's inception.[64]

Unsupervised learning is the ability to find patterns in a stream of input. Supervised learning includes both classification and numerical regression. Classification is used to determine what category something belongs in, after seeing a number of examples of things from several categories. Regression is the attempt to produce a function that describes the relationship between inputs and outputs and predicts how the outputs should change as the inputs change. In reinforcement learning[65] the agent is rewarded for good responses and punished for bad ones. The agent uses this sequence of rewards and punishments to form a strategy for operating in its problem space. These three types of learning can be analyzed in terms of decision theory, using concepts like utility. The mathematical analysis of machine learning algorithms and their performance is a branch of theoretical computer science known as computational learning theory.[66]

Within developmental robotics, developmental learning approaches were elaborated for lifelong cumulative acquisition of repertoires of novel skills by a robot, through autonomous self-exploration and social interaction with human teachers, and using guidance mechanisms such as active learning, maturation, motor synergies, and imitation.[67][68][69][70]
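As a toy illustration of the reinforcement-learning loop described above (rewards and punishments shaping a strategy), the following self-contained sketch runs tabular Q-learning on an invented five-cell corridor; the environment, reward scheme and learning constants are assumptions made purely for the example:

    # An agent in a 5-cell corridor is rewarded only on reaching the last cell.
    # It is never told which moves are good; it infers a strategy (Q-values)
    # from the sequence of rewards it receives.
    import random

    n_states, actions = 5, (-1, +1)                 # actions: step left or right
    Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    alpha, gamma, epsilon = 0.5, 0.9, 0.3           # learning rate, discount, exploration

    for _ in range(300):                            # episodes
        s = 0
        while s != n_states - 1:
            a = random.choice(actions) if random.random() < epsilon \
                else max(actions, key=lambda act: Q[(s, act)])
            s_next = min(max(s + a, 0), n_states - 1)
            reward = 1.0 if s_next == n_states - 1 else 0.0
            best_next = max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
            s = s_next

    # After training, the greedy action in every non-terminal cell should be +1.
    print([max(actions, key=lambda act: Q[(s, act)]) for s in range(n_states - 1)])

The learned table of values plays the role of the "strategy for operating in its problem space" that the paragraph above describes.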
Natural language processing

A parse tree represents the syntactic structure of a sentence according to some formal grammar.

... speak. A sufficiently powerful natural language processing system would enable natural-language user interfaces and the acquisition of knowledge directly from human-written sources, such as newswire texts. Some straightforward applications of natural language processing include information retrieval (or text mining), question answering[72] and machine translation.[73]

A common method of processing and extracting meaning from natural language is through semantic indexing. Increases in processing speeds and the drop in the cost of data storage make indexing large volumes of abstractions of the user's input much more efficient.

Perception

Main articles: Machine perception, Computer vision and Speech recognition

Machine perception[74] is the ability to use input from sensors (such as cameras, microphones, tactile sensors, sonar and others more exotic) to deduce aspects of the world. Computer vision[75] is the ability to analyze visual input. A few selected subproblems are speech recognition,[76] facial recognition and object recognition.[77]

Motion and manipulation

... what is around you, building a map of the environment), and motion planning (figuring out how to get there) or path planning (going from one point in space to another point, which may involve compliant motion, where the robot moves while maintaining physical contact with an object).[80][81]

Creativity

Main article: Computational creativity

A sub-field of AI addresses creativity both theoretically (from a philosophical and psychological perspective) and practically (via specific implementations of systems that generate outputs that can be considered creative, or systems that identify and assess creativity). Related areas of computational research are artificial intuition and artificial thinking.

Long-term goals
... was revived by David Rumelhart and others in the middle 1980s.[110] Neural networks are an example of soft computing: they are solutions to problems which cannot be solved with complete logical certainty, and where an approximate solution is often enough. Other soft computing approaches to AI include fuzzy systems, evolutionary computation and many statistical tools. The application of soft computing to AI is studied collectively by the emerging discipline of computational intelligence.[111]

Statistical

In the 1990s, AI researchers developed sophisticated mathematical tools to solve specific subproblems. These tools are truly scientific, in the sense that their results are both measurable and verifiable, and they have been responsible for many of AI's recent successes. The shared mathematical language has also permitted a high level of collaboration with more established fields (like mathematics, economics or operations research). Stuart Russell and Peter Norvig describe this movement as nothing less than a "revolution" and "the victory of the neats".[34] Critics argue that these techniques (with few exceptions[112]) are too focused on particular problems and have failed to address the long-term goal of general intelligence.[113] There is an ongoing debate about the relevance and validity of statistical approaches in AI, exemplified in part by exchanges between Peter Norvig and Noam Chomsky.[114][115]

Integrating the approaches

Intelligent agent paradigm: An intelligent agent is a system that perceives its environment and takes actions which maximize its chances of success. The simplest intelligent agents are programs that solve specific problems. More complicated agents include human beings and organizations of human beings (such as firms). The paradigm gives researchers license to study isolated problems and find solutions that are both verifiable and useful, without agreeing on one single approach. An agent that solves a specific problem can use any approach that works; some agents are symbolic and logical, some are sub-symbolic neural networks and others may use new approaches. The paradigm also gives researchers a common language to communicate with other fields, such as decision theory and economics, that also use concepts of abstract agents. The intelligent agent paradigm became widely accepted during the 1990s.[2]

Agent architectures and cognitive architectures: Researchers have designed systems to build intelligent systems out of interacting intelligent agents in a multi-agent system.[116] A system with both symbolic and sub-symbolic components is a hybrid intelligent system, and the study of such systems is artificial intelligence systems integration. A hierarchical control system provides a bridge between sub-symbolic AI at its lowest, reactive levels and traditional symbolic AI at its highest levels, where relaxed time constraints permit planning and world modelling.[117] Rodney Brooks' subsumption architecture was an early proposal for such a hierarchical system.[118]

2.2.3 Tools

In the course of 50 years of research, AI has developed a large number of tools to solve the most difficult problems in computer science. A few of the most general of these methods are discussed below.

Search and optimization

Main articles: Search algorithm, Mathematical optimization and Evolutionary computation

Many problems in AI can be solved in theory by intelligently searching through many possible solutions:[119] reasoning can be reduced to performing a search. For example, logical proof can be viewed as searching for a path that leads from premises to conclusions, where each step is the application of an inference rule.[120] Planning algorithms search through trees of goals and subgoals, attempting to find a path to a target goal, a process called means-ends analysis.[121] Robotics algorithms for moving limbs and grasping objects use local searches in configuration space.[79] Many learning algorithms use search algorithms based on optimization.

Simple exhaustive searches[122] are rarely sufficient for most real-world problems: the search space (the number of places to search) quickly grows to astronomical numbers. The result is a search that is too slow or never completes. The solution, for many problems, is to use "heuristics" or "rules of thumb" that eliminate choices that are unlikely to lead to the goal (called "pruning the search tree"). Heuristics supply the program with a "best guess" for the path on which the solution lies.[123] Heuristics limit the search for solutions into a smaller sample size.[80]

A very different kind of search came to prominence in the 1990s, based on the mathematical theory of optimization. For many problems, it is possible to begin the search with some form of a guess and then refine the guess incrementally until no more refinements can be made. These algorithms can be visualized as blind hill climbing: we begin the search at a random point on the landscape, and then, by jumps or steps, we keep moving our guess uphill, until we reach the top. Other optimization algorithms are
simulated annealing, beam search and random optimization.[124]
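A minimal sketch of the blind hill climbing just described (start from a random guess and keep any small change that moves the objective uphill); the objective function, step size and iteration budget are arbitrary assumptions:

    # Blind hill climbing: repeatedly nudge the current guess and keep the
    # nudge only if it improves the objective.
    import random

    def objective(x):                       # the "landscape" being climbed
        return -(x - 3.0) ** 2              # single peak at x = 3

    guess = random.uniform(-10, 10)
    for _ in range(10_000):
        candidate = guess + random.uniform(-0.1, 0.1)
        if objective(candidate) > objective(guess):
            guess = candidate               # move uphill, otherwise stay put
    print(round(guess, 2))                  # should end near 3.0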
Evolutionary computation uses a form of optimization search. For example, they may begin with a population of organisms (the guesses) and then allow them to mutate and recombine, selecting only the fittest to survive each generation (refining the guesses). Forms of evolutionary computation include swarm intelligence algorithms (such as ant colony or particle swarm optimization)[125] and evolutionary algorithms (such as genetic algorithms, gene expression programming, and genetic programming).[126]
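The population/mutation/recombination/selection loop described in the preceding paragraph can be sketched in a few lines; the target string, population size, mutation rate and fitness function below are all invented for the illustration:

    # Toy genetic algorithm: evolve a population of candidate strings toward
    # a target by keeping the fittest, recombining them and mutating them.
    import random

    TARGET = "artificial intelligence"
    ALPHABET = "abcdefghijklmnopqrstuvwxyz "

    def fitness(candidate):                  # number of matching characters
        return sum(a == b for a, b in zip(candidate, TARGET))

    def mutate(candidate, rate=0.05):
        return "".join(random.choice(ALPHABET) if random.random() < rate else c
                       for c in candidate)

    def crossover(p1, p2):                   # recombine two parents
        cut = random.randrange(len(TARGET))
        return p1[:cut] + p2[cut:]

    population = ["".join(random.choice(ALPHABET) for _ in TARGET) for _ in range(200)]
    for generation in range(1000):
        population.sort(key=fitness, reverse=True)
        if population[0] == TARGET:
            break
        parents = population[:50]            # selection: keep only the fittest
        population = parents + [mutate(crossover(random.choice(parents),
                                                 random.choice(parents)))
                                for _ in range(150)]
    print(generation, population[0])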
Logic

Main articles: Logic programming and Automated reasoning

Logic[127] is used for knowledge representation and problem solving, but it can be applied to other problems as well. For example, the satplan algorithm uses logic for planning[128] and inductive logic programming is a method for learning.[129]

Several different forms of logic are used in AI research. Propositional or sentential logic[130] is the logic of statements which can be true or false. First-order logic[131] also allows the use of quantifiers and predicates, and can express facts about objects, their properties, and their relations with each other. Fuzzy logic[132] is a version of first-order logic which allows the truth of a statement to be represented as a value between 0 and 1, rather than simply True (1) or False (0). Fuzzy systems can be used for uncertain reasoning and have been widely used in modern industrial and consumer product control systems. Subjective logic[133] models uncertainty in a different and more explicit manner than fuzzy logic: a given binomial opinion satisfies belief + disbelief + uncertainty = 1 within a Beta distribution. By this method, ignorance can be distinguished from probabilistic statements that an agent makes with high confidence.

Default logics, non-monotonic logics and circumscription[52] are forms of logic designed to help with default reasoning and the qualification problem. Several extensions of logic have been designed to handle specific domains of knowledge, such as: description logics;[46] situation calculus, event calculus and fluent calculus (for representing events and time);[47] causal calculus;[48] belief calculus; and modal logics.[49]
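As a small illustration of the fuzzy-logic idea described above, the sketch below represents truth as a value between 0 and 1; the min/max/complement operators are one common textbook choice, and the membership values are invented for the example.

# Minimal sketch of fuzzy-logic truth values on the interval [0, 1].
# The membership values and the min/max operators are common textbook
# choices used here purely for illustration.
def f_and(a, b): return min(a, b)   # fuzzy conjunction
def f_or(a, b):  return max(a, b)   # fuzzy disjunction
def f_not(a):    return 1.0 - a     # fuzzy negation

room_is_warm = 0.7      # partially true
fan_is_noisy = 0.2      # mostly false

# "The room is warm AND the fan is not noisy"
truth = f_and(room_is_warm, f_not(fan_is_noisy))
print(truth)  # 0.7 -- a degree of truth, not simply True or False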
Probabilistic methods for uncertain reasoning

Main articles: Bayesian network, Hidden Markov model, Kalman filter, Decision theory and Utility theory

Many problems in AI (in reasoning, planning, learning, perception and robotics) require the agent to operate with incomplete or uncertain information. AI researchers have devised a number of powerful tools to solve these problems[134] using methods from probability theory and economics.

Bayesian networks[135] are a very general tool that can be used for a large number of problems: reasoning (using the Bayesian inference algorithm),[136] learning (using the expectation-maximization algorithm),[137] planning (using decision networks)[138] and perception (using dynamic Bayesian networks).[139] Probabilistic algorithms can also be used for filtering, prediction, smoothing and finding explanations for streams of data, helping perception systems to analyze processes that occur over time (e.g., hidden Markov models or Kalman filters).[139]

A key concept from the science of economics is "utility": a measure of how valuable something is to an intelligent agent. Precise mathematical tools have been developed that analyze how an agent can make choices and plan, using decision theory, decision analysis,[140] and information value theory.[58] These tools include models such as Markov decision processes,[141] dynamic decision networks,[139] game theory and mechanism design.[142]
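A minimal sketch of the kind of probabilistic reasoning a Bayesian network performs at a single node is Bayes' rule itself; the prior and likelihood numbers below are invented purely for illustration.

# Minimal sketch of Bayesian inference with Bayes' rule.
# The prior and likelihoods below are made-up numbers for illustration only.
def posterior(prior, likelihood_given_h, likelihood_given_not_h):
    # P(H | E) = P(E | H) P(H) / P(E)
    evidence = likelihood_given_h * prior + likelihood_given_not_h * (1 - prior)
    return likelihood_given_h * prior / evidence

# Hypothesis H: "the part is defective"; evidence E: "the sensor raised an alarm".
p_defective = 0.01           # prior P(H)
p_alarm_if_defective = 0.95  # P(E | H)
p_alarm_if_ok = 0.05         # P(E | not H), a false-alarm rate

print(posterior(p_defective, p_alarm_if_defective, p_alarm_if_ok))
# ~0.16: the alarm raises the belief in a defect, but it remains far from certain.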
Classifiers and statistical learning methods

Main articles: Classifier (mathematics), Statistical classification and Machine learning

The simplest AI applications can be divided into two types: classifiers ("if shiny then diamond") and controllers ("if shiny then pick up"). Controllers do, however, also classify conditions before inferring actions, and therefore classification forms a central part of many AI systems. Classifiers are functions that use pattern matching to determine a closest match. They can be tuned according to examples, making them very attractive for use in AI. These examples are known as observations or patterns. In supervised learning, each pattern belongs to a certain predefined class. A class can be seen as a decision that has to be made. All the observations combined with their class labels are known as a data set. When a new observation is received, that observation is classified based on previous experience.[143]

A classifier can be trained in various ways; there are many statistical and machine learning approaches. The most widely used classifiers are the neural network,[144] kernel methods such as the support vector machine,[145] k-nearest neighbor algorithm,[146] Gaussian mixture model,[147] naive Bayes classifier,[148] and decision tree.[149] The performance of these classifiers has been compared over a wide range of tasks. Classifier performance depends greatly on the characteristics of the data to be classified. There is no single classifier that works best on all given problems; this is also referred to as the "no free lunch" theorem. Determining a suitable classifier for a given problem is still more an art than a science.
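As a concrete example of one of the classifiers listed above, here is a minimal k-nearest-neighbor sketch; the toy data set and the choice of Euclidean distance and k = 3 are illustrative assumptions, not anything prescribed by the text.

# Minimal sketch of a k-nearest-neighbor classifier, one of the widely used
# classifiers mentioned above. The tiny data set is invented for illustration.
from collections import Counter
import math

def knn_classify(train, query, k=3):
    # train: list of (feature_vector, class_label) observations
    dist = lambda a, b: math.dist(a, b)
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

data_set = [((1.0, 1.1), "shiny"), ((0.9, 1.0), "shiny"),
            ((0.2, 0.1), "dull"),  ((0.1, 0.3), "dull")]

print(knn_classify(data_set, (0.8, 0.9)))  # -> "shiny"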
2.3.3 Toys

AIBO, the first robotic pet, grew out of Sony's Computer Science Laboratory (CSL). Famed engineer Toshitada Doi is credited as AIBO's original progenitor: in 1994 he had started work on robots with artificial intelligence expert Masahiro Fujita, at CSL. Doi's friend, the artist Hajime Sorayama, was enlisted to create the initial designs for the AIBO's body. Those designs are now part of the permanent collections of the Museum of Modern Art and the Smithsonian Institution, with later versions of AIBO being used in studies at Carnegie Mellon University. In 2006, AIBO was added into Carnegie Mellon University's Robot Hall of Fame.
An automated online assistant providing customer service on a web page, one of many very primitive applications of artificial intelligence.

Main article: Applications of artificial intelligence

Artificial intelligence techniques are pervasive and are too numerous to list. Frequently, when a technique reaches mainstream use, it is no longer considered artificial intelligence; this phenomenon is described as the AI effect.[166] An area that artificial intelligence has contributed greatly to is intrusion detection.

2.4 Philosophy and ethics

Main articles: Philosophy of artificial intelligence and Ethics of artificial intelligence

Alan Turing wrote in 1950 "I propose to consider the question 'can a machine think'?"[160] and began the discussion that has become the philosophy of artificial intelligence.
Because thinking is difficult to define, there are two versions of the question that philosophers have addressed. First, can a machine be intelligent? I.e., can it solve all the problems the humans solve by using intelligence? And second, can a machine be built with a mind and the experience of subjective consciousness?[170]

The existence of an artificial intelligence that rivals or exceeds human intelligence raises difficult ethical issues, both on behalf of humans and on behalf of any possible sentient AI. The potential power of the technology inspires both hopes and fears for society.

2.4.1 The possibility/impossibility of artificial general intelligence

Main articles: philosophy of AI, Turing test, Physical symbol systems hypothesis, Dreyfus' critique of AI, The Emperor's New Mind and AI effect

Can a machine be intelligent? Can it "think"?

Turing's "polite convention" We need not decide if a machine can "think"; we need only decide if a machine can act as intelligently as a human being. This approach to the philosophical problems associated with artificial intelligence forms the basis of the Turing test.[160]

The Dartmouth proposal "Every aspect of learning or any other feature of intelligence can be so precisely described that a machine can be made to simulate it." This conjecture was printed in the proposal for the Dartmouth Conference of 1956, and represents the position of most working AI researchers.[171]

Newell and Simon's physical symbol system hypothesis "A physical symbol system has the necessary and sufficient means of general intelligent action." Newell and Simon argue that intelligence consists of formal operations on symbols.[172] Hubert Dreyfus argued that, on the contrary, human expertise depends on unconscious instinct rather than conscious symbol manipulation and on having a "feel" for the situation rather than explicit symbolic knowledge. (See Dreyfus' critique of AI.)[173][174]

Gödelian arguments Gödel himself,[175] John Lucas (in 1961) and Roger Penrose (in a more detailed argument from 1989 onwards) argued that humans are not reducible to Turing machines.[176] The detailed arguments are complex, but in essence they derive from Kurt Gödel's 1931 proof in his first incompleteness theorem that it is always possible to create statements that a formal system could not prove. A human being, however, can (with some thought) see the truth of these Gödel statements. Any Turing program designed to search for these statements can have its methods reduced to a formal system, and so will always have a Gödel statement derivable from its program which it can never discover. However, if humans are indeed capable of understanding mathematical truth, it doesn't seem possible that we could be limited in the same way. This is quite a general result, if accepted, since it can be shown that hardware neural nets, and computers based on random processes (e.g. annealing approaches) and quantum computers based on entangled qubits (so long as they involve no new physics) can all be reduced to Turing machines. All they do is reduce the complexity of the tasks, not permit new types of problems to be solved. Roger Penrose speculates that there may be new physics involved in our brain, perhaps at the intersection of gravity and quantum mechanics at the Planck scale. This argument, if accepted, does not rule out the possibility of true artificial intelligence, but means it has to be biological in basis or based on new physical principles. The argument has been followed up by many counter arguments, and then Roger Penrose has replied to those with counter counter examples, and it is now an intricate complex debate.[177] For details see Philosophy of artificial intelligence: Lucas, Penrose and Gödel.

The artificial brain argument The brain can be simulated by machines and because brains are intelligent, simulated brains must also be intelligent; thus machines can be intelligent. Hans Moravec, Ray Kurzweil and others have argued that it is technologically feasible to copy the brain directly into hardware and software, and that such a simulation will be essentially identical to the original.[91]

The AI effect Machines are already intelligent, but observers have failed to recognize it. When Deep Blue beat Garry Kasparov in chess, the machine was acting intelligently. However, onlookers commonly discount the behavior of an artificial intelligence program by arguing that it is not "real" intelligence after all; thus "real" intelligence is whatever intelligent behavior people can do that machines still can not. This is known as the AI Effect: "AI is whatever hasn't been done yet."

2.4.2 Intelligent behaviour and machine ethics

As a minimum, an AI system must be able to reproduce aspects of human intelligence. This raises the issue of how ethically the machine should behave towards both humans and other AI agents. This issue was addressed by Wendell Wallach in his book titled Moral Machines in which he introduced the concept of artificial moral agents (AMA).[178] For Wallach, AMAs have become a
part of the research landscape of artificial intelligence as guided by its two central questions which he identifies as "Does Humanity Want Computers Making Moral Decisions"[179] and "Can (Ro)bots Really Be Moral".[180] For Wallach the question is not centered on the issue of whether machines can demonstrate the equivalent of moral behavior in contrast to the constraints which society may place on the development of AMAs.[181]

Machine ethics

Main article: Machine ethics

The field of machine ethics is concerned with giving machines ethical principles, or a procedure for discovering a way to resolve the ethical dilemmas they might encounter, enabling them to function in an ethically responsible manner through their own ethical decision making.[182] The field was delineated in the AAAI Fall 2005 Symposium on Machine Ethics: "Past research concerning the relationship between technology and ethics has largely focused on responsible and irresponsible use of technology by human beings, with a few people being interested in how human beings ought to treat machines. In all cases, only human beings have engaged in ethical reasoning. The time has come for adding an ethical dimension to at least some machines. Recognition of the ethical ramifications of behavior involving machines, as well as recent and potential developments in machine autonomy, necessitate this. In contrast to computer hacking, software property issues, privacy issues and other topics normally ascribed to computer ethics, machine ethics is concerned with the behavior of machines towards human users and other machines. Research in machine ethics is key to alleviating concerns with autonomous systems; it could be argued that the notion of autonomous machines without such a dimension is at the root of all fear concerning machine intelligence. Further, investigation of machine ethics could enable the discovery of problems with current ethical theories, advancing our thinking about Ethics."[183] Machine ethics is sometimes referred to as machine morality, computational ethics or computational morality. A variety of perspectives of this nascent field can be found in the collected edition Machine Ethics[182] that stems from the AAAI Fall 2005 Symposium on Machine Ethics.[183]

Malevolent and friendly AI

Main article: Friendly AI

Political scientist Charles T. Rubin believes that AI can be neither designed nor guaranteed to be benevolent.[184] He argues that any sufficiently advanced benevolence may be indistinguishable from malevolence. Humans should not assume machines or robots would treat us favorably, because there is no a priori reason to believe that they would be sympathetic to our system of morality, which has evolved along with our particular biology (which AIs would not share). Hyper-intelligent software may not necessarily decide to support the continued existence of mankind, and would be extremely difficult to stop. This topic has also recently begun to be discussed in academic publications as a real source of risks to civilization, humans, and planet Earth.

Physicist Stephen Hawking, Microsoft founder Bill Gates and SpaceX founder Elon Musk have expressed concerns about the possibility that AI could evolve to the point that humans could not control it, with Hawking theorizing that this could "spell the end of the human race".[185]

One proposal to deal with this is to ensure that the first generally intelligent AI is 'Friendly AI', and will then be able to control subsequently developed AIs. Some question whether this kind of check could really remain in place.

Leading AI researcher Rodney Brooks writes, "I think it is a mistake to be worrying about us developing malevolent AI anytime in the next few hundred years. I think the worry stems from a fundamental error in not distinguishing the difference between the very real recent advances in a particular aspect of AI, and the enormity and complexity of building sentient volitional intelligence."[186]

Devaluation of humanity

Main article: Computer Power and Human Reason

Joseph Weizenbaum wrote that AI applications can not, by definition, successfully simulate genuine human empathy and that the use of AI technology in fields such as customer service or psychotherapy[187] was deeply misguided. Weizenbaum was also bothered that AI researchers (and some philosophers) were willing to view the human mind as nothing more than a computer program (a position now known as computationalism). To Weizenbaum these points suggest that AI research devalues human life.[188]

Decrease in demand for human labor

Martin Ford, author of The Lights in the Tunnel: Automation, Accelerating Technology and the Economy of the Future,[189] and others argue that specialized artificial intelligence applications, robotics and other forms of automation will ultimately result in significant unemployment as machines begin to match and exceed the capability of workers to perform most routine and repetitive jobs. Ford predicts that many knowledge-based occupations, and in particular entry level jobs, will be increasingly susceptible to automation via expert systems, machine learning[190] and other AI-enhanced applications.
AI-based applications may also be used to amplify the capabilities of low-wage offshore workers, making it more feasible to outsource knowledge work.[191]

2.4.3 Machine consciousness, sentience and mind

Main article: Artificial consciousness

If an AI system replicates all key aspects of human intelligence, will that system also be sentient? Will it have a mind which has conscious experiences? This question is closely related to the philosophical problem as to the nature of human consciousness, generally referred to as the hard problem of consciousness.

Consciousness

Main articles: Hard problem of consciousness and Theory of mind

machine surpassing human abilities to perform the skills implanted in it, a scary thought to many, who fear losing control of such a powerful machine. Obstacles for researchers are mainly time constraints. That is, AI scientists cannot establish much of a database for commonsense knowledge because it must be ontologically crafted into the machine, which takes up a tremendous amount of time. To combat this, AI research looks to have the machine able to understand enough concepts in order to add to its own ontology, but how can it do this when machine ethics is primarily concerned with the behavior of machines towards humans or other machines, limiting the extent of developing AI? In order to function like a common human, AI must also display the ability to solve subsymbolic commonsense knowledge tasks, such as how artists can tell statues are fake or how chess masters don't move to certain spots to avoid exposure. But by developing machines that can do it all, AI research is faced with the difficulty of potentially putting a lot of people out of work, while on the economy side of things businesses would boom from efficiency, thus forcing AI into a bottleneck of trying to develop self-improving machines.
2.4.4 Superintelligence

Main article: Superintelligence

Are there limits to how intelligent machines or human-machine hybrids can be? A superintelligence, hyperintelligence, or superhuman intelligence is a hypothetical agent that would possess intelligence far surpassing that of the brightest and most gifted human mind. Superintelligence may also refer to the form or degree of intelligence possessed by such an agent.

Main articles: Technological singularity and Moore's law

If research into Strong AI produced sufficiently intelligent software, it might be able to reprogram and improve itself. The improved software would be even better at improving itself, leading to recursive self-improvement.[196] The new intelligence could thus increase exponentially and dramatically surpass humans. Science fiction writer Vernor Vinge named this scenario "singularity".[197] Technological singularity is when accelerating progress in technologies will cause a runaway effect wherein artificial intelligence will exceed human intellectual capacity and control, thus radically changing or even ending civilization. Because the capabilities of such an intelligence may be impossible to comprehend, the technological singularity is an occurrence beyond which events are unpredictable or even unfathomable.[197]

Ray Kurzweil has used Moore's law (which describes the relentless exponential improvement in digital technology) to calculate that desktop computers will have the same processing power as human brains by the year 2029, and predicts that the singularity will occur in 2045.[197]

Transhumanism

Main article: Transhumanism

In the 1980s artist Hajime Sorayama's Sexy Robots series were painted and published in Japan depicting the actual organic human form with lifelike muscular metallic skins, and later the Gynoids book followed, which was used by or influenced movie makers including George Lucas and other creatives. Sorayama never considered these organic robots to be a real part of nature but always an unnatural product of the human mind, a fantasy existing in the mind even when realized in actual form.

Edward Fredkin argues that artificial intelligence is the next stage in evolution, an idea first proposed by Samuel Butler's "Darwin among the Machines" (1863), and expanded upon by George Dyson in his book of the same name in 1998.[199]

2.5 In fiction

The implications of artificial intelligence have been a persistent theme in science fiction. Early stories typically revolved around intelligent robots. The word "robot" itself was coined by Karel Čapek in his 1921 play R.U.R., the title standing for "Rossum's Universal Robots". Later, the SF writer Isaac Asimov developed the three laws of robotics which he subsequently explored in a long series of robot stories. These laws have since gained some traction in genuine AI research.

Other influential fictional intelligences include HAL, the computer in charge of the spaceship in 2001: A Space Odyssey, released as both a film and a book in 1968 and written by Arthur C. Clarke.

Since then, AI has become firmly rooted in popular culture.

2.6 See also

Main article: Outline of artificial intelligence

AI takeover

Artificial Intelligence (journal)

Artificial intelligence (video games)

Existential risk of artificial general intelligence

Future of Humanity Institute

Human Cognome Project

List of artificial intelligence projects
List of emerging technologies [5] Pamela McCorduck (2004, pp. 424) writes of the rough
shattering of AI in subeldsvision, natural language, de-
List of important articial intelligence publications cision theory, genetic algorithms, robotics ... and these
with own sub-subeldthat would hardly have anything
List of machine learning algorithms to say to each other.
List of scientic journals [6] This list of intelligent traits is based on the topics covered
by the major AI textbooks, including:
Machine ethics
Russell & Norvig 2003
Machine learning
Luger & Stubbleeld 2004
Never-Ending Language Learning Poole, Mackworth & Goebel 1998
Nilsson 1998
Our Final Invention
[7] General intelligence (strong AI) is discussed in popular
Outline of articial intelligence introductions to AI:
Outline of human intelligence Kurzweil 1999 and Kurzweil 2005
Philosophy of mind [8] See the Dartmouth proposal, under Philosophy, below.
Simulated reality [9] This is a central idea of Pamela McCorduck's Machines
Who Think. She writes: I like to think of articial intel-
Superintelligence ligence as the scientic apotheosis of a venerable cultural
tradition. (McCorduck 2004, p. 34) Articial intelli-
gence in one form or another is an idea that has pervaded
2.7 Notes Western intellectual history, a dream in urgent need of
being realized. (McCorduck 2004, p. xviii) Our his-
tory is full of attemptsnutty, eerie, comical, earnest,
[1] Denition of AI as the study of intelligent agents: legendary and realto make articial intelligences, to re-
Poole, Mackworth & Goebel 1998, p. 1, which pro- produce what is the essential usbypassing the ordinary
vides the version that is used in this article. Note means. Back and forth between myth and reality, our
that they use the term computational intelligence imaginations supplying what our workshops couldn't, we
as a synonym for articial intelligence. have engaged for a long time in this odd form of self-
reproduction. (McCorduck 2004, p. 3) She traces the
Russell & Norvig (2003) (who prefer the term ra- desire back to its Hellenistic roots and calls it the urge to
tional agent) and write The whole-agent view is forge the Gods. (McCorduck 2004, pp. 340400)
now widely accepted in the eld (Russell & Norvig
2003, p. 55). [10] The optimism referred to includes the predictions of early
Nilsson 1998 AI researchers (see optimism in the history of AI) as
well as the ideas of modern transhumanists such as Ray
Legg & Hutter 2007. Kurzweil.
[2] The intelligent agent paradigm: [11] The setbacks referred to include the ALPAC report
of 1966, the abandonment of perceptrons in 1970, the
Russell & Norvig 2003, pp. 27, 3258, 968972
Lighthill Report of 1973 and the collapse of the Lisp ma-
Poole, Mackworth & Goebel 1998, pp. 721 chine market in 1987.
Luger & Stubbleeld 2004, pp. 235240
[12] AI applications widely used behind the scenes:
Hutter 2005, pp. 125126
Russell & Norvig 2003, p. 28
The denition used in this article, in terms of goals, ac-
Kurzweil 2005, p. 265
tions, perception and environment, is due to Russell &
Norvig (2003). Other denitions also include knowledge NRC 1999, pp. 216222
and learning as additional criteria.
[13] AI in myth:
[3] Although there is some controversy on this point (see
McCorduck 2004, pp. 45
Crevier (1993, p. 50)), McCarthy states unequivocally I
came up with the term in a c|net interview. (Skillings Russell & Norvig 2003, p. 939
2006) McCarthy rst used the term in the proposal
for the Dartmouth conference, which appeared in 1955. [14] Cult images as articial intelligence:
(McCarthy et al. 1955) Crevier (1993, p. 1) (statue of Amun)
[4] McCarthy's denition of AI: McCorduck (2004, pp. 69)
These were the rst machines to be believed to have true McCorduck 2004, pp. 111136
intelligence and consciousness. Hermes Trismegistus ex- Crevier 1993, pp. 4749, who writes the confer-
pressed the common belief that with these statues, crafts- ence is generally recognized as the ocial birthdate
man had reproduced the true nature of the gods, their of the new science.
sensus and spiritus. McCorduck makes the connection
between sacred automatons and Mosaic law (developed Russell & Norvig 2003, p. 17, who call the confer-
around the same time), which expressly forbids the wor- ence the birth of articial intelligence.
ship of robots (McCorduck 2004, pp. 69) NRC 1999, pp. 200201
[15] Humanoid automata: [22] Hegemony of the Dartmouth conference attendees:
Yan Shi:
Russell & Norvig 2003, p. 17, who write for the
Needham 1986, p. 53 next 20 years the eld would be dominated by these
people and their students.
Hero of Alexandria:
McCorduck 2004, pp. 129130
McCorduck 2004, p. 6
[23] Russell and Norvig write it was astonishing whenever
Al-Jazari: a computer did anything kind of smartish. Russell &
Norvig 2003, p. 18
A Thirteenth Century Programmable Robot.
Shef.ac.uk. Retrieved 25 April 2009. [24] "Golden years" of AI (successful symbolic reasoning pro-
grams 19561973):
Wolfgang von Kempelen:
McCorduck 2004, pp. 243252
McCorduck 2004, p. 17
Crevier 1993, pp. 52107
[16] Articial beings: Moravec 1988, p. 9
Jbir ibn Hayyn's Takwin:
Russell & Norvig 2003, pp. 1821
O'Connor 1994
The programs described are Arthur Samuel's checkers
Judah Loew's Golem: program for the IBM 701, Daniel Bobrow's STUDENT,
Newell and Simon's Logic Theorist and Terry Winograd's
McCorduck 2004, pp. 1516 SHRDLU.
Buchanan 2005, p. 50
[25] DARPA pours money into undirected pure research into
Paracelsus' Homunculus: AI during the 1960s:
Russell & Norvig 2003, pp. 2224 Wason & Shapiro (1966) showed that people do
poorly on completely abstract problems, but if the
Luger & Stubbleeld 2004, pp. 227331
problem is restated to allow the use of intuitive
Nilsson 1998, chpt. 17.4 social intelligence, performance dramatically im-
McCorduck 2004, pp. 327335, 434435 proves. (See Wason selection task)
Crevier 1993, pp. 14562, 197203 Kahneman, Slovic & Tversky (1982) have shown
that people are terrible at elementary problems that
[32] Boom of the 1980s: rise of expert systems, Fifth Genera- involve uncertain reasoning. (See list of cognitive
tion Project, Alvey, MCC, SCI: biases for several examples).
Lako & Nez (2000) have controversially ar-
McCorduck 2004, pp. 426441
gued that even our skills at mathematics depend on
Crevier 1993, pp. 161162,197203, 211, 240 knowledge and skills that come from the body,
i.e. sensorimotor and perceptual skills. (See Where
Russell & Norvig 2003, p. 24
Mathematics Comes From)
NRC 1999, pp. 210211
[44] Knowledge representation:
[33] Second AI winter:
ACM 1998, I.2.4,
McCorduck 2004, pp. 430435 Russell & Norvig 2003, pp. 320363,
Crevier 1993, pp. 209210 Poole, Mackworth & Goebel 1998, pp. 2346, 69
NRC 1999, pp. 214216 81, 169196, 235277, 281298, 319345,
Luger & Stubbleeld 2004, pp. 227243,
[34] Formal methods are now preferred (Victory of the
neats"): Nilsson 1998, chpt. 18
McCorduck 2004, pp. 486487 Russell & Norvig 2003, pp. 260266,
[35] McCorduck 2004, pp. 480483 Poole, Mackworth & Goebel 1998, pp. 199233,
Nilsson 1998, chpt. ~17.117.4
[36] Marko 2011.
[46] Representing categories and relations: Semantic net-
[37] Administrator. Kinects AI breakthrough explained. i- works, description logics, inheritance (including frames
programmer.info. and scripts):
Russell & Norvig 2003, pp. 9, 2122 Russell & Norvig 2003, pp. 320328
McCarthy & Hayes 1969 [60] Planning and acting in non-deterministic domains: con-
Russell & Norvig 2003 ditional planning, execution monitoring, replanning and
continuous planning:
While McCarthy was primarily concerned with issues in
the logical representation of actions, Russell & Norvig Russell & Norvig 2003, pp. 430449
2003 apply the term to the more general issue of default [61] Multi-agent planning and emergent behavior:
reasoning in the vast network of assumptions underlying
all our commonsense knowledge. Russell & Norvig 2003, pp. 449455
[52] Default reasoning and default logic, non-monotonic log- [62] This is a form of Tom Mitchell's widely quoted denition
ics, circumscription, closed world assumption, abduction of machine learning: A computer program is set to learn
(Poole et al. places abduction under default reasoning. from an experience E with respect to some task T and
Luger et al. places this under uncertain reasoning): some performance measure P if its performance on T as
measured by P improves with experience E.
Russell & Norvig 2003, pp. 354360,
Poole, Mackworth & Goebel 1998, pp. 248256, [63] Learning:
323335,
ACM 1998, I.2.6,
Luger & Stubbleeld 2004, pp. 335363,
Russell & Norvig 2003, pp. 649788,
Nilsson 1998, ~18.3.3
Poole, Mackworth & Goebel 1998, pp. 397438,
[53] Breadth of commonsense knowledge: Luger & Stubbleeld 2004, pp. 385542,
Russell & Norvig 2003, p. 21, Nilsson 1998, chpt. 3.3, 10.3, 17.5, 20
Crevier 1993, pp. 113114, [64] Alan Turing discussed the centrality of learning as early
Moravec 1988, p. 13, as 1950, in his classic paper "Computing Machinery and
Intelligence".(Turing 1950) In 1956, at the original Dart-
Lenat & Guha 1989 (Introduction)
mouth AI summer conference, Ray Solomono wrote
[54] Dreyfus & Dreyfus 1986 a report on unsupervised probabilistic machine learning:
An Inductive Inference Machine.(Solomono 1956)
[55] Gladwell 2005
[65] Reinforcement learning:
[56] Expert knowledge as embodied intuition:
Russell & Norvig 2003, pp. 763788
Dreyfus & Dreyfus 1986 (Hubert Dreyfus is a Luger & Stubbleeld 2004, pp. 442449
philosopher and critic of AI who was among the
rst to argue that most useful human knowledge was [66] Computational learning theory:
encoded sub-symbolically. See Dreyfus critique of
AI) CITATION IN PROGRESS.
Gladwell 2005 (Gladwells Blink is a popular intro- [67] Weng et al. 2001.
duction to sub-symbolic reasoning and knowledge.)
[68] Lungarella et al. 2003.
Hawkins & Blakeslee 2005 (Hawkins argues that
sub-symbolic knowledge should be the primary fo- [69] Asada et al. 2009.
cus of AI research.)
[70] Oudeyer 2010.
[57] Planning:
[71] Natural language processing:
ACM 1998, ~I.2.8,
Russell & Norvig 2003, pp. 375459, ACM 1998, I.2.7
Poole, Mackworth & Goebel 1998, pp. 281316, Russell & Norvig 2003, pp. 790831
Luger & Stubbleeld 2004, pp. 314329, Poole, Mackworth & Goebel 1998, pp. 91104
Nilsson 1998, chpt. 10.12, 22 Luger & Stubbleeld 2004, pp. 591632
[58] Information value theory: [72] "Versatile question answering systems: seeing in synthe-
sis", Mittal et al., IJIIDS, 5(2), 119-142, 2011
Russell & Norvig 2003, pp. 600604
[73] Applications of natural language processing, including
[59] Classical planning: information retrieval (i.e. text mining) and machine trans-
lation:
Russell & Norvig 2003, pp. 375430,
Poole, Mackworth & Goebel 1998, pp. 281315, Russell & Norvig 2003, pp. 840857,
Luger & Stubbleeld 2004, pp. 314329, Luger & Stubbleeld 2004, pp. 623630
ACM 1998, I.2.10 The most extreme form of this argument (the brain re-
placement scenario) was put forward by Clark Glymour
Russell & Norvig 2003, pp. 863898 in the mid-1970s and was touched on by Zenon Pylyshyn
Nilsson 1998, chpt. 6 and John Searle in 1980.
ACM 1998, ~I.2.7 [93] Nils Nilsson writes: Simply put, there is wide disagree-
Russell & Norvig 2003, pp. 568578 ment in the eld about what AI is all about (Nilsson 1983,
p. 10).
[77] Object recognition:
[94] Biological intelligence vs. intelligence in general:
Russell & Norvig 2003, pp. 885892
Russell & Norvig 2003, pp. 23, who make the
[78] Robotics: analogy with aeronautical engineering.
ACM 1998, I.2.9, McCorduck 2004, pp. 100101, who writes that
there are two major branches of articial intelli-
Russell & Norvig 2003, pp. 901942, gence: one aimed at producing intelligent behav-
Poole, Mackworth & Goebel 1998, pp. 443460 ior regardless of how it was accomplioshed, and the
other aimed at modeling intelligent processes found
[79] Moving and conguration space: in nature, particularly human ones.
Russell & Norvig 2003, pp. 916932 Kolata 1982, a paper in Science, which describes
McCarthys indierence to biological models. Ko-
[80] Tecuci 2012. lata quotes McCarthy as writing: This is AI, so we
don't care if its psychologically real. McCarthy
[81] Robotic mapping (localization, etc):
recently reiterated his position at the AI@50 con-
Russell & Norvig 2003, pp. 908915 ference where he said Articial intelligence is not,
by denition, simulation of human intelligence
[82] Thro 1993. (Maker 2006).
[83] Edelson 1991. [95] Neats vs. scrues:
[84] Tao & Tan 2005. McCorduck 2004, pp. 421424, 486489
[85] James 1884. Crevier 1993, pp. 168
Nilsson 1983, pp. 1011
[86] Picard 1995.
[96] Symbolic vs. sub-symbolic AI:
[87] Kleine-Cosack 2006: The introduction of emotion to
computer science was done by Pickard (sic) who created Nilsson (1998, p. 7), who uses the term sub-
the eld of aective computing. symbolic.
[88] Diamond 2003: Rosalind Picard, a genial MIT professor, [97] Haugeland 1985, p. 255.
is the elds godmother; her 1997 book, Aective Com-
puting, triggered an explosion of interest in the emotional [98] Law 1994.
side of computers and their users.
[99] Bach 2008.
[89] Emotion and aective computing:
[100] Haugeland 1985, pp. 112117
Minsky 2006
[101] The most dramatic case of sub-symbolic AI being pushed
[90] Gerald Edelman, Igor Aleksander and others have ar- into the background was the devastating critique of
gued that articial consciousness is required for strong AI. perceptrons by Marvin Minsky and Seymour Papert in
(Aleksander 1995; Edelman 2007) 1969. See History of AI, AI winter, or Frank Rosenblatt.
[91] Articial brain arguments: AI requires a simulation of the [102] Cognitive simulation, Newell and Simon, AI at CMU
operation of the human brain (then called Carnegie Tech):
Russell & Norvig 2003, p. 957 McCorduck 2004, pp. 139179, 245250, 322
Crevier 1993, pp. 271 and 279 323 (EPAM)
Crevier 1993, pp. 145149
A few of the people who make some form of the argu-
ment: [103] Soar (history):
McCorduck 2004, pp. 454462 Russell & Norvig 2003, pp. 94109,
Brooks 1990 Poole, Mackworth & Goebel 1998, pp. pp. 132
147,
Moravec 1988
Luger & Stubbleeld 2004, pp. 133150,
[110] Revival of connectionism: Nilsson 1998, chpt. 9
Crevier 1993, pp. 214215 [124] Optimization searches:
Russell & Norvig 2003, p. 25
Russell & Norvig 2003, pp. 110116,120129
[111] Computational intelligence Poole, Mackworth & Goebel 1998, pp. 56163
IEEE Computational Intelligence Society Luger & Stubbleeld 2004, pp. 127133
[112] Hutter 2012. [125] Articial life and society based learning:
Luger & Stubbleeld 2004, pp. 3577, [137] Bayesian learning and the expectation-maximization algo-
Nilsson 1998, chpt. 1316 rithm:
[129] Explanation based learning, relevance based learning, Russell & Norvig 2003, pp. 597600
inductive logic programming, case based reasoning: [139] Stochastic temporal models:
Russell & Norvig 2003, pp. 678710, Russell & Norvig 2003, pp. 537581
Poole, Mackworth & Goebel 1998, pp. 414416,
Dynamic Bayesian networks:
Luger & Stubbleeld 2004, pp. ~422442,
Nilsson 1998, chpt. 10.3, 17.5 Russell & Norvig 2003, pp. 551557
Russell & Norvig 2003, pp. 204233, (Russell & Norvig 2003, pp. 549551)
Russell & Norvig 2003, pp. 739748, 758 Kumar & Kumar 2012
Luger & Stubbleeld 2004, pp. 458467 [168] Brooks 1991.
[153] Recurrent neural networks, Hopeld nets: [169] Hacking Roomba. hackingroomba.com.
Russell & Norvig 2003, p. 758 [170] Philosophy of AI. All of these positions in this section are
Luger & Stubbleeld 2004, pp. 474505 mentioned in standard discussions of the subject, such as:
[154] Competitive learning, Hebbian coincidence learning, Russell & Norvig 2003, pp. 947960
Hopeld networks and attractor networks: Fearn 2007, pp. 3855
Luger & Stubbleeld 2004, pp. 474505 [171] Dartmouth proposal:
[155] Hierarchical temporal memory: McCarthy et al. 1955 (the original proposal)
Hawkins & Blakeslee 2005 Crevier 1993, p. 49 (historical signicance)
Russell & Norvig 2003, p. 949 [189] Ford, Martin R. (2009), The Lights in the Tunnel: Au-
McCorduck 2004, pp. 448449 tomation, Accelerating Technology and the Economy of the
Future, Acculant Publishing, ISBN 978-1448659814. (e-
Making the Mathematical Objection: book available free online.)
Turing 1950 under "(2) The Mathematical Objec- Russell & Norvig 2003, pp. 960961
tion Ford, Martin (2009). The Lights in the Tunnel: Au-
Hofstadter 1979 tomation, Accelerating Technology and the Economy
of the Future. Acculant Publishing. ISBN 978-1-
Background: 4486-5981-4.
Gdel 1931, Church 1936, Kleene 1935, Turing [192] This version is from Searle (1999), and is also quoted
1937 in Dennett 1991, p. 435. Searles original formulation
was The appropriately programmed computer really is
[177] Beyond the Doubting of a Shadow, A Reply to Commen- a mind, in the sense that computers given the right pro-
taries on Shadows of the Mind, Roger Penrose 1996 The grams can be literally said to understand and have other
links to the original articles he responds to there are eas- cognitive states. (Searle 1980, p. 1). Strong AI is de-
ily found in the Wayback machine: Can Physics Provide ned similarly by Russell & Norvig (2003, p. 947): The
a Theory of Consciousness? Barnard J. Bars, Penroses assertion that machines could possibly act intelligently (or,
Gdelian Argument etc. perhaps better, act as if they were intelligent) is called the
'weak AI' hypothesis by philosophers, and the assertion
[178] Wendell Wallach (2010). Moral Machines, Oxford Uni-
that machines that do so are actually thinking (as opposed
versity Press.
to simulating thinking) is called the 'strong AI' hypothe-
[179] Wallach, pp 3754. sis.
Buchanan, Bruce G. (2005). A (Very) Brief His- Haugeland, John (1985). Articial Intelligence: The
tory of Articial Intelligence (PDF). AI Magazine: Very Idea. Cambridge, Mass.: MIT Press. ISBN
5360. Archived (PDF) from the original on 26 0-262-08153-9.
September 2007.
Hawkins, Je; Blakeslee, Sandra (2005). On Intelli-
Butler, Samuel (13 June 1863). Darwin among gence. New York, NY: Owl Books. ISBN 0-8050-
the Machines. Letters to the Editor. The Press 7853-3.
(Christchurch, New Zealand). Retrieved 16 Octo-
Henderson, Mark (24 April 2007). Human rights
ber 2014 via Victoria University of Wellington.
for robots? We're getting carried away. The Times
AI set to exceed human brain power. CNN. 26 July Online (London).
2006. Archived from the original on 19 February Hernandez-Orallo, Jose (2000). Beyond the Turing
2008. Test. Journal of Logic, Language and Information
9 (4): 447466. doi:10.1023/A:1008367325700.
Dennett, Daniel (1991). Consciousness Explained.
The Penguin Press. ISBN 0-7139-9037-6. Hernandez-Orallo, J.; Dowe, D. L. (2010).
Measuring Universal Intelligence: Towards
Diamond, David (December 2003). The Love
an Anytime Intelligence Test. Articial
Machine; Building computers that care. Wired.
Intelligence Journal 174 (18): 15081539.
Archived from the original on 18 May 2008.
doi:10.1016/j.artint.2010.09.006.
Dowe, D. L.; Hajek, A. R. (1997). A computa- Hinton, G. E. (2007). Learning multiple layers
tional extension to the Turing Test. Proceedings of of representation. Trends in Cognitive Sciences 11:
the 4th Conference of the Australasian Cognitive Sci- 428434. doi:10.1016/j.tics.2007.09.004.
ence Society.
Hofstadter, Douglas (1979). Gdel, Escher, Bach:
Dreyfus, Hubert (1972). What Computers Can't Do. an Eternal Golden Braid. New York, NY: Vintage
New York: MIT Press. ISBN 0-06-011082-1. Books. ISBN 0-394-74502-7.
Dreyfus, Hubert; Dreyfus, Stuart (1986). Mind over Holland, John H. (1975). Adaptation in Natural and
Machine: The Power of Human Intuition and Exper- Articial Systems. University of Michigan Press.
tise in the Era of the Computer. Oxford, UK: Black- ISBN 0-262-58111-6.
well. ISBN 0-02-908060-6.
Howe, J. (November 1994). Articial Intelligence
Dreyfus, Hubert (1992). What Computers Still Can't at Edinburgh University: a Perspective. Retrieved
Do. New York: MIT Press. ISBN 0-262-54067-3. 30 August 2007.
Dyson, George (1998). Darwin among the Ma- Hutter, M. (2012). One Decade of Universal Ar-
chines. Allan Lane Science. ISBN 0-7382-0030-1. ticial Intelligence. Theoretical Foundations of
Articial General Intelligence. Atlantis Thinking
Edelman, Gerald (23 November 2007). Gerald Machines 4. doi:10.2991/978-94-91216-62-6_5.
Edelman Neural Darwinism and Brain-based De- ISBN 978-94-91216-61-9.
vices. Talking Robots.
James, William (1884). What is Emotion. Mind
Edelson, Edward (1991). The Nervous System. New 9: 188205. doi:10.1093/mind/os-IX.34.188.
York: Chelsea House. ISBN 978-0-7910-0464-7. Cited by Tao & Tan 2005.
Fearn, Nicholas (2007). The Latest Answers to the Kahneman, Daniel; Slovic, D.; Tversky, Amos
Oldest Questions: A Philosophical Adventure with the (1982). Judgment under uncertainty: Heuristics and
Worlds Greatest Thinkers. New York: Grove Press. biases. New York: Cambridge University Press.
ISBN 0-8021-1839-9. ISBN 0-521-28414-7.
Katz, Yarden (1 November 2012). Noam Chom-
Gladwell, Malcolm (2005). Blink. New York: Lit-
sky on Where Articial Intelligence Went Wrong.
tle, Brown and Co. ISBN 0-316-17232-4.
The Atlantic. Retrieved 26 October 2014.
Gdel, Kurt (1951). Some basic theorems on the Kismet. MIT Articial Intelligence Laboratory,
foundations of mathematics and their implications. Humanoid Robotics Group. Retrieved 25 October
Gibbs Lecture. In 2014.
Feferman, Solomon, ed. (1995). Kurt Gdel: Col-
lected Works, Vol. III: Unpublished Essays and Lec- Koza, John R. (1992). Genetic Programming (On
tures. Oxford University Press. pp. 30423. ISBN the Programming of Computers by Means of Natural
978-0-19-514722-3. Selection). MIT Press. ISBN 0-262-11170-5.
Kleine-Cosack, Christian (October 2006). Maker, Meg Houston (2006). AI@50: AI Past,
Recognition and Simulation of Emotions Present, Future. Dartmouth College. Archived
(PDF). Archived from the original (PDF) on 28 from the original on 8 October 2008. Retrieved 16
May 2008. October 2008.
Kolata, G. (1982). How can computers get Marko, John (16 February 2011). Computer
common sense?". Science 217 (4566): 1237 Wins on 'Jeopardy!': Trivial, Its Not. The New
1238. doi:10.1126/science.217.4566.1237. PMID York Times. Retrieved 25 October 2014.
17837639.
Kumar, Gulshan; Kumar, Krishan (2012). The McCarthy, John; Minsky, Marvin; Rochester,
Use of Articial-Intelligence-Based Ensembles for Nathan; Shannon, Claude (1955). A Proposal for
Intrusion Detection: A Review. Applied Computa- the Dartmouth Summer Research Project on Arti-
tional Intelligence and Soft Computing 2012: 120. cial Intelligence. Archived from the original on 26
doi:10.1155/2012/850160. August 2007. Retrieved 30 August 2007..
Kurzweil, Ray (1999). The Age of Spiritual Ma- McCarthy, John; Hayes, P. J. (1969). Some philo-
chines. Penguin Books. ISBN 0-670-88217-8. sophical problems from the standpoint of articial
intelligence. Machine Intelligence 4: 463502.
Kurzweil, Ray (2005). The Singularity is Near. Pen- Archived from the original on 10 August 2007. Re-
guin Books. ISBN 0-670-03384-7. trieved 30 August 2007.
Lako, George; Nez, Rafael E. (2000). Where
Mathematics Comes From: How the Embodied Mind McCarthy, John (12 November 2007). What Is Ar-
Brings Mathematics into Being. Basic Books. ISBN ticial Intelligence?".
0-465-03771-2.
Minsky, Marvin (1967). Computation: Finite and
Langley, Pat (2011). The changing science of ma- Innite Machines. Englewood Clis, N.J.: Prentice-
chine learning. Machine Learning 82 (3): 275 Hall. ISBN 0-13-165449-7.
279. doi:10.1007/s10994-011-5242-y.
Minsky, Marvin (2006). The Emotion Machine.
Law, Diane (June 1994). Searle, Subsymbolic Func- New York, NY: Simon & Schusterl. ISBN 0-7432-
tionalism and Synthetic Intelligence (Technical re- 7663-9.
port). University of Texas at Austin. p. AI94-222.
CiteSeerX: 10.1.1.38.8384. Moravec, Hans (1988). Mind Children. Harvard
University Press. ISBN 0-674-57616-0.
Legg, Shane; Hutter, Marcus (15 June 2007). A
Collection of Denitions of Intelligence (Technical
Norvig, Peter (25 June 2012). On Chomsky and
report). IDSIA. arXiv:0706.3639. 07-07.
the Two Cultures of Statistical Learning. Peter
Lenat, Douglas; Guha, R. V. (1989). Building Large Norvig. Archived from the original on 19 October
Knowledge-Based Systems. Addison-Wesley. ISBN 2014.
0-201-51752-3.
NRC (United States National Research Council)
Lighthill, James (1973). Articial Intelligence: A (1999). Developments in Articial Intelligence.
General Survey. Articial Intelligence: a paper Funding a Revolution: Government Support for
symposium. Science Research Council. Computing Research. National Academy Press.
Lucas, John (1961). Minds, Machines and Gdel.
Needham, Joseph (1986). Science and Civilization
In Anderson, A.R. Minds and Machines. Archived
in China: Volume 2. Caves Books Ltd.
from the original on 19 August 2007. Retrieved 30
August 2007. Newell, Allen; Simon, H. A. (1976). Computer
Lungarella, M.; Metta, G.; Pfeifer, R.; San- Science as Empirical Inquiry: Symbols and Search.
dini, G. (2003). Developmental robotics: a Communications of the ACM 19 (3)..
survey. Connection Science 15: 151190.
doi:10.1080/09540090310001655110. CiteSeerX: Nilsson, Nils (1983). Articial Intelligence Pre-
10.1.1.83.7615. pares for 2001 (PDF). AI Magazine 1 (1). Presi-
dential Address to the Association for the Advance-
Mahmud, Ashik (June 2015), Post/Human Be- ment of Articial Intelligence.
ings & Techno-salvation: Exploring Articial Intel-
ligence in Selected Science Fictions, Socrates Jour- O'Brien, James; Marakas, George (2011). Man-
nal, doi:10.7910/DVN/VAELLN, retrieved 2015- agement Information Systems (10th ed.). McGraw-
06-26 Hill/Irwin. ISBN 978-0-07-337681-3.
O'Connor, Kathleen Malone (1994). The alchem- Tecuci, Gheorghe (MarchApril 2012). Arti-
ical creation of life (takwin) and other concepts of cial Intelligence. Wiley Interdisciplinary Reviews:
Genesis in medieval Islam. University of Pennsyl- Computational Statistics (Wiley) 4 (2): 168180.
vania. doi:10.1002/wics.200.
Oudeyer, P-Y. (2010). On the impact of Thro, Ellen (1993). Robotics: The Marriage of Com-
robotics in behavioral and cognitive sciences: puters and Machines. New York: Facts on File.
from insect navigation to human cognitive de- ISBN 978-0-8160-2628-9.
velopment (PDF). IEEE Transactions on Au-
Turing, Alan (October 1950), Computing Machin-
tonomous Mental Development 2 (1): 216.
ery and Intelligence, Mind LIX (236): 433460,
doi:10.1109/tamd.2009.2039057.
doi:10.1093/mind/LIX.236.433, ISSN 0026-4423,
Penrose, Roger (1989). The Emperors New Mind: retrieved 2008-08-18.
Concerning Computer, Minds and The Laws of
van der Walt, Christiaan; Bernard, Etienne (2006).
Physics. Oxford University Press. ISBN 0-19-
Data characteristics that determine classier per-
851973-7.
formance (PDF). Retrieved 5 August 2009.
Picard, Rosalind (1995). Aective Computing (PDF)
Vinge, Vernor (1993). The Coming Technologi-
(Technical report). MIT. 321. Lay summary Ab-
cal Singularity: How to Survive in the Post-Human
stract.
Era.
Poli, R.; Langdon, W. B.; McPhee, N. F.
Wason, P. C.; Shapiro, D. (1966). Reasoning.
(2008). A Field Guide to Genetic Programming.
In Foss, B. M. New horizons in psychology. Har-
Lulu.com. ISBN 978-1-4092-0073-4 via gp-eld-
mondsworth: Penguin.
guide.org.uk.
Weizenbaum, Joseph (1976). Computer Power and
Rajani, Sandeep (2011). Articial Intelligence
Human Reason. San Francisco: W.H. Freeman &
Man or Machine (PDF). International Journal of
Company. ISBN 0-7167-0464-1.
Information Technology and Knowledge Manage-
ment 4 (1): 173176. Weng, J.; McClelland; Pentland, A.; Sporns,
O.; Stockman, I.; Sur, M.; Thelen, E. (2001).
Searle, John (1980). Minds, Brains and Programs.
Autonomous mental development by robots and
Behavioral and Brain Sciences 3 (3): 417457.
animals (PDF). Science 291: 599600 via
doi:10.1017/S0140525X00005756.
msu.edu.
Searle, John (1999). Mind, language and society.
New York, NY: Basic Books. ISBN 0-465-04521-
9. OCLC 231867665 43689264. 2.9 Further reading
Shapiro, Stuart C. (1992). Articial Intelligence.
TechCast Article Series, John Sagi, Framing Con-
In Shapiro, Stuart C. Encyclopedia of Articial In-
sciousness
telligence (PDF) (2nd ed.). New York: John Wiley.
pp. 5457. ISBN 0-471-50306-1. Boden, Margaret, Mind As Machine, Oxford Uni-
versity Press, 2006
Simon, H. A. (1965). The Shape of Automation for
Men and Management. New York: Harper & Row. Johnston, John (2008) The Allure of Machinic
Life: Cybernetics, Articial Life, and the New AI,
Skillings, Jonathan (3 July 2006). Getting Ma-
MIT Press
chines to Think Like Us. cnet. Retrieved 3 Febru-
ary 2011. Myers, Courtney Boyd ed. (2009). The AI Report.
Forbes June 2009
Solomono, Ray (1956). An Inductive Infer-
ence Machine (PDF). Dartmouth Summer Research Serenko, Alexander (2010). The development of
Conference on Articial Intelligence via std.com, an AI journal ranking based on the revealed pref-
pdf scanned copy of the original. Later published as erence approach (PDF). Journal of Informetrics 4
Solomono, Ray (1957). An Inductive Inference (4): 447459. doi:10.1016/j.joi.2010.04.001.
Machine. IRE Convention Record. Section on In-
formation Theory, part 2. pp. 5662. Serenko, Alexander; Michael Dohan (2011).
Comparing the expert survey and citation
Tao, Jianhua; Tan, Tieniu (2005). Aective Com- impact journal ranking methods: Exam-
puting and Intelligent Interaction. Aective Com- ple from the eld of Articial Intelligence
puting: A Review. LNCS 3784. Springer. pp. 981 (PDF). Journal of Informetrics 5 (4): 629649.
995. doi:10.1007/11573548. doi:10.1016/j.joi.2011.06.002.
Cybernetics and AI
Chapter 3
Information theory
intelligence gathering, gambling, statistics, and even in musical composition.

3.2 Historical background

Main article: History of information theory

The landmark event that established the discipline of information theory, and brought it to immediate worldwide attention, was the publication of Claude E. Shannon's classic paper "A Mathematical Theory of Communication" in the Bell System Technical Journal in July and October 1948.
Prior to this paper, limited information-theoretic ideas had been developed at Bell Labs, all implicitly assuming events of equal probability. Harry Nyquist's 1924 paper, Certain Factors Affecting Telegraph Speed, contains a theoretical section quantifying "intelligence" and the "line speed" at which it can be transmitted by a communication system, giving the relation W = K log m (recalling Boltzmann's constant), where W is the speed of transmission of intelligence, m is the number of different voltage levels to choose from at each time step, and K is a constant. Ralph Hartley's 1928 paper, Transmission of Information, uses the word information as a measurable quantity, reflecting the receiver's ability to distinguish one sequence of symbols from any other, thus quantifying information as H = log S^n = n log S, where S was the number of possible symbols, and n the number of symbols in a transmission. The unit of information was therefore the decimal digit, much later renamed the hartley in his honour as a unit or scale or measure of information. Alan Turing in 1940 used similar ideas as part of the statistical analysis of the breaking of the German second world war Enigma ciphers.
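A quick check of Hartley's measure H = n log S, assuming an example alphabet of 26 symbols and a 10-symbol message (values chosen only for illustration):

# A small check of Hartley's measure H = n * log(S) from the paragraph above.
# The message length and alphabet size are arbitrary example values.
import math

S = 26   # number of possible symbols (e.g., letters of an alphabet)
n = 10   # number of symbols in the transmission

hartleys = n * math.log10(S)   # base-10 logarithm: units of hartleys (decimal digits)
bits = n * math.log2(S)        # the same quantity expressed in bits

print(f"{hartleys:.2f} hartleys = {bits:.2f} bits")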
Much of the mathematics behind information theory with events of different probabilities were developed for the field of thermodynamics by Ludwig Boltzmann and J. Willard Gibbs. Connections between information-theoretic entropy and thermodynamic entropy, including the important contributions by Rolf Landauer in the 1960s, are explored in Entropy in thermodynamics and information theory.

In Shannon's revolutionary and groundbreaking paper, the work for which had been substantially completed at Bell Labs by the end of 1944, Shannon for the first time introduced the qualitative and quantitative model of communication as a statistical process underlying information theory, opening with the assertion that "The fundamental problem of communication is that of reproducing at one point, either exactly or approximately, a message selected at another point."

With it came the ideas of

the information entropy and redundancy of a source, and its relevance through the source coding theorem;

the mutual information, and the channel capacity of a noisy channel, including the promise of perfect loss-free communication given by the noisy-channel coding theorem;

the practical result of the Shannon–Hartley law for the channel capacity of a Gaussian channel; as well as

the bit, a new way of seeing the most fundamental unit of information.

3.3 Quantities of information

Main article: Quantities of information

Information theory is based on probability theory and statistics. Information theory often concerns itself with measures of information of the distributions associated with random variables. Important quantities of information are entropy, a measure of information in a single random variable, and mutual information, a measure of information in common between two random variables. The former quantity is a property of the probability distribution of a random variable and gives a limit on the rate at which data generated by independent samples with the given distribution can be reliably compressed. The latter is a property of the joint distribution of two random variables, and is the maximum rate of reliable communication across a noisy channel in the limit of long block lengths, when the channel statistics are determined by the joint distribution.

The choice of logarithmic base in the following formulae determines the unit of information entropy that is used. A common unit of information is the bit, based on the binary logarithm. Other units include the nat, which is based on the natural logarithm, and the hartley, which is based on the common logarithm.

In what follows, an expression of the form p log p is considered by convention to be equal to zero whenever p = 0. This is justified because lim_{p→0+} p log p = 0 for any logarithmic base.

3.3.1 Entropy
For example, if (X, Y) represents the position of a chess piece (X the row and Y the column), then the joint entropy of the row of the piece and the column of the piece will be the entropy of the position of the piece:

H(X, Y) = E_{X,Y}[−log p(x, y)] = −∑_{x,y} p(x, y) log p(x, y)
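A minimal sketch of the formula above, assuming (purely for illustration) a chess piece whose row and column are independent and uniform on an 8 x 8 board:

# Minimal sketch of entropy and joint entropy in bits, following the
# formula above. The example distribution (a chess-piece position uniform
# over rows and columns) is chosen purely for illustration.
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def joint_entropy(joint):
    return -sum(p * math.log2(p) for row in joint for p in row if p > 0)

rows = [1 / 8] * 8
cols = [1 / 8] * 8
position = [[1 / 64] * 8 for _ in range(8)]  # independent uniform row and column

print(entropy(rows), entropy(cols))  # 3.0 bits each
print(joint_entropy(position))       # 6.0 bits = H(row) + H(column)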
measured by a distortion function. This subset of information theory is called rate–distortion theory.

Error-correcting codes (channel coding): While data compression removes as much redundancy as possible, an error correcting code adds just the right kind of redundancy (i.e., error correction) needed to transmit the data efficiently and faithfully across a noisy channel.
This division of coding theory into compression and transmission is justified by the information transmission theorems, or source–channel separation theorems, that justify the use of bits as the universal currency for information in many contexts. However, these theorems only hold in the situation where one transmitting user wishes to communicate to one receiving user. In scenarios with more than one transmitter (the multiple-access channel), more than one receiver (the broadcast channel) or intermediary "helpers" (the relay channel), or more general networks, compression followed by transmission may no longer be optimal. Network information theory refers to these multi-agent communication models.
3.4.1 Source theory

Any process that generates successive messages can be considered a source of information. A memoryless source is one in which each message is an independent identically distributed random variable, whereas the properties of ergodicity and stationarity impose less restrictive constraints. All such sources are stochastic. These terms are well studied in their own right outside information theory.

Rate

Information rate is the average entropy per symbol. For memoryless sources, this is merely the entropy of each symbol, while, in the case of a stationary stochastic process, it is

r = lim_{n→∞} H(X_n | X_{n−1}, X_{n−2}, …, X_1),

the conditional entropy of a symbol given all the previous symbols generated. It is common in information theory to speak of the "rate" or "entropy" of a language. This is appropriate, for example, when the source of information is English prose. The rate of a source of information is related to its redundancy and how well it can be compressed, the subject of source coding.

3.4.2 Channel capacity

Main article: Channel capacity

Communications over a channel, such as an ethernet cable, is the primary motivation of information theory. As anyone who's ever used a telephone (mobile or landline) knows, however, such channels often fail to produce exact reconstruction of a signal; noise, periods of silence, and other forms of signal corruption often degrade quality. How much information can one hope to communicate over a noisy (or otherwise imperfect) channel?

Consider the communications process over a discrete channel. A simple model of the process is shown below:

[Diagram: a Transmitter sends message x through a (noisy) Channel, and a Receiver observes y.]

Here X represents the space of messages transmitted, and Y the space of messages received during a unit time over our channel. Let p(y|x) be the conditional probability distribution function of Y given X. We will consider p(y|x) to be an inherent fixed property of our communications channel (representing the nature of the noise of our channel). Then the joint distribution of X and Y is completely determined by our channel and by our choice of f(x), the marginal distribution of messages we choose to send over the channel. Under these constraints, we would like to maximize the rate of information, or the signal, we can communicate over the channel. The appropriate measure for this is the mutual information, and this maximum mutual information is called the channel capacity and is given by:

C = sup_{f} I(X; Y).
Channel coding is concerned with finding such nearly optimal codes that can be used to transmit data over a noisy channel with a small coding error at a rate near the channel capacity.
Capacity of particular channel models
Information theory leads us to believe it is much more
A continuous-time analog communications channel dicult to keep secrets than it might rst appear. A
subject to Gaussian noise see ShannonHartley brute force attack can break systems based on asymmetric
theorem. key algorithms or on most commonly used methods of
symmetric key algorithms (sometimes called secret key
A binary symmetric channel (BSC) with crossover algorithms), such as block ciphers. The security of all
probability p is a binary input, binary output channel such methods currently comes from the assumption that
that ips the input bit with probability p. The BSC no known attack can break them in a practical amount of
has a capacity of 1 Hb (p) bits per channel use, time.
where Hb is the binary entropy function to the base Information theoretic security refers to methods such as
2 logarithm: the one-time pad that are not vulnerable to such brute
force attacks. In such cases, the positive conditional
1p mutual information between the plaintext and ciphertext
0 0 (conditioned on the key) can ensure proper transmis-
sion, while the unconditional mutual information between
the plaintext and ciphertext remains zero, resulting in
p absolutely secure communications. In other words, an
eavesdropper would not be able to improve his or her
p guess of the plaintext by gaining knowledge of the ci-
phertext but not of the key. However, as in any other
1 1 cryptographic system, care must be used to correctly ap-
1p ply even information-theoretically secure methods; the
Venona project was able to crack the one-time pads of
the Soviet Union due to their improper reuse of key ma-
A binary erasure channel (BEC) with erasure prob- terial.
ability p is a binary input, ternary output channel.
The possible channel outputs are 0, 1, and a third
symbol 'e' called an erasure. The erasure represents 3.5.2 Pseudorandom number generation
complete loss of information about an input bit. The
capacity of the BEC is 1 - p bits per channel use. Pseudorandom number generators are widely available
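The two closed-form capacities quoted above are easy to evaluate; a minimal sketch (ours):

import math

def binary_entropy(p):
    # Hb(p) in bits; Hb(0) = Hb(1) = 0 by convention.
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    return 1 - binary_entropy(p)   # 1 - Hb(p) bits per channel use

def bec_capacity(p):
    return 1 - p                   # 1 - p bits per channel use

print(bsc_capacity(0.1), bec_capacity(0.1))   # about 0.531 and 0.9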
3.5 Applications to other fields

3.5.1 Intelligence uses and secrecy applications

Information-theoretic concepts apply to cryptography and cryptanalysis. Turing's information unit, the ban, was used in the Ultra project, breaking the German Enigma machine code and hastening the end of World War II in Europe. Shannon himself defined an important concept now called the unicity distance. Based on the redundancy of the plaintext, it attempts to give a minimum amount of ciphertext necessary to ensure unique decipherability.

Information theory leads us to believe it is much more difficult to keep secrets than it might first appear. A brute force attack can break systems based on asymmetric key algorithms or on most commonly used methods of symmetric key algorithms (sometimes called secret key algorithms), such as block ciphers. The security of all such methods currently comes from the assumption that no known attack can break them in a practical amount of time.

Information theoretic security refers to methods such as the one-time pad that are not vulnerable to such brute force attacks. In such cases, the positive conditional mutual information between the plaintext and ciphertext (conditioned on the key) can ensure proper transmission, while the unconditional mutual information between the plaintext and ciphertext remains zero, resulting in absolutely secure communications. In other words, an eavesdropper would not be able to improve his or her guess of the plaintext by gaining knowledge of the ciphertext but not of the key. However, as in any other cryptographic system, care must be used to correctly apply even information-theoretically secure methods; the Venona project was able to crack the one-time pads of the Soviet Union due to their improper reuse of key material.

3.5.2 Pseudorandom number generation

Pseudorandom number generators are widely available in computer language libraries and application programs. They are, almost universally, unsuited to cryptographic use as they do not evade the deterministic nature of modern computer equipment and software. A class of improved random number generators is termed cryptographically secure pseudorandom number generators, but even they require random seeds external to the software to work as intended. These can be obtained via extractors, if done carefully. The measure of sufficient randomness in extractors is min-entropy, a value related to Shannon entropy through Rényi entropy; Rényi entropy is also used in evaluating randomness in cryptographic systems. Although related, the distinctions among these measures mean that a random variable with high Shannon entropy is not necessarily satisfactory for use in an extractor and so for cryptography uses.
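A small illustration (ours) of the distinction drawn above: a distribution dominated by one likely outcome can still have several bits of Shannon entropy, yet its min-entropy, the quantity that matters for extraction, stays low.

import math

def shannon_entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def min_entropy(probs):
    return -math.log2(max(probs))

# One outcome is far more likely than the other 255 symbols of the alphabet.
probs = [0.5] + [0.5 / 255] * 255
print(round(shannon_entropy(probs), 2))  # about 5.0 bits
print(round(min_entropy(probs), 2))      # 1.0 bit, set by the most likely outcome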
3.5.3 Seismic exploration

Information theory and digital signal processing offer a major improvement of resolution and image clarity over previous analog methods.[11]

3.5.4 Semiotics

Concepts from information theory such as redundancy and code control have been used by semioticians such as Umberto Eco and Rossi-Landi to explain ideology as a form of message transmission whereby a dominant social class emits its message by using signs that exhibit a high degree of redundancy such that only one message is decoded among a selection of competing ones.[12]

3.5.5 Miscellaneous applications

Information theory also has applications in gambling and investing, black holes, bioinformatics, and music.
3.6 See also

Algorithmic probability
Algorithmic information theory
Bayesian inference
Communication theory
Constructor theory - a generalization of information theory that includes quantum information
Inductive probability
Minimum message length
Minimum description length
List of important publications
Philosophy of information

3.6.1 Applications

Active networking
Cryptanalysis
Cryptography
Cybernetics
Entropy in thermodynamics and information theory
Gambling
Intelligence (information gathering)
Seismic exploration

3.6.2 History

Hartley, R.V.L.
History of information theory
Shannon, C.E.
Timeline of information theory
Yockey, H.P.

3.6.3 Theory

Coding theory
Detection theory
Estimation theory
Fisher information
Information algebra
Information asymmetry
Information field theory
Information geometry
Information theory and measure theory
Kolmogorov complexity
Logic of information
Network coding
Philosophy of Information
Quantum information science
Semiotic information theory
Source coding
Unsolved Problems

3.6.4 Concepts

Ban (unit)
Channel capacity
Channel (communications)
Communication source
Conditional entropy
Covert channel
Decoder
Differential entropy
Encoder
Mutual information
Pointwise mutual information (PMI)
Receiver (information theory)
Redundancy
Rényi entropy
3.7 References

[1] F. Rieke, D. Warland, R. Ruyter van Steveninck, W. Bialek (1997). Spikes: Exploring the Neural Code. The MIT Press. ISBN 978-0262681087.
[2] cf. Huelsenbeck, J. P., F. Ronquist, R. Nielsen and J. P. Bollback (2001) Bayesian inference of phylogeny and its impact on evolutionary biology, Science 294:2310-2314
[3] Rando Allikmets, Wyeth W. Wasserman, Amy Hutchinson, Philip Smallwood, Jeremy Nathans, Peter K. Rogan, Thomas D. Schneider, Michael Dean (1998) Organization of the ABCR gene: analysis of promoter and splice junction sequences, Gene 215:1, 111-122
[4] Burnham, K. P. and Anderson D. R. (2002) Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, Second Edition (Springer Science, New York) ISBN 978-0-387-95364-9.
[5] Jaynes, E. T. (1957) Information Theory and Statistical Mechanics, Phys. Rev. 106:620
[6] Charles H. Bennett, Ming Li, and Bin Ma (2003) Chain Letters and Evolutionary Histories, Scientific American 288:6, 76-81
[7] David R. Anderson (November 1, 2003). Some background on why people in the empirical sciences may want to better understand the information-theoretic methods (PDF). Retrieved 2010-06-23.
[8] Fazlollah M. Reza (1994) [1961]. An Introduction to Information Theory. Dover Publications, Inc., New York. ISBN 0-486-68210-2.
[9] Robert B. Ash (1990) [1965]. Information Theory. Dover Publications, Inc. ISBN 0-486-66521-6.
[10] Jerry D. Gibson (1998). Digital Compression for Multimedia: Principles and Standards. Morgan Kaufmann. ISBN 1-55860-369-7.

3.7.1 The classic work

Shannon, C.E. (1948), "A Mathematical Theory of Communication", Bell System Technical Journal, 27, pp. 379-423 & 623-656, July & October, 1948. PDF. Notes and other formats.

3.7.2 Other journal articles

J. L. Kelly, Jr., Saratoga.ny.us, "A New Interpretation of Information Rate", Bell System Technical Journal, Vol. 35, July 1956, pp. 917-26.
R. Landauer, IEEE.org, "Information is Physical", Proc. Workshop on Physics and Computation PhysComp'92 (IEEE Comp. Sci. Press, Los Alamitos, 1993) pp. 1-4.
R. Landauer, IBM.com, "Irreversibility and Heat Generation in the Computing Process", IBM J. Res. Develop. Vol. 5, No. 3, 1961
Timme, Nicholas; Alford, Wesley; Flecker, Benjamin; Beggs, John M. (2012). "Multivariate information measures: an experimentalist's perspective". arXiv:1111.6857v5 (Cornell University). Retrieved 7 June 2015.

3.7.3 Textbooks on information theory

Arndt, C. Information Measures, Information and its Description in Science and Engineering (Springer Series: Signals and Communication Technology), 2004, ISBN 978-3-540-40855-0
Ash, RB. Information Theory. New York: Interscience, 1965. ISBN 0-470-03445-9. New York: Dover 1990. ISBN 0-486-66521-6
Gallager, R. Information Theory and Reliable Communication. New York: John Wiley and Sons, 1968. ISBN 0-471-29048-3
Goldman, S. Information Theory. New York: Prentice Hall, 1953. New York: Dover 1968 ISBN 0-486-62209-6, 2005 ISBN 0-486-44271-3
Cover, TM, Thomas, JA. Elements of Information Theory, 1st Edition. New York: Wiley-Interscience, 1991. ISBN 0-471-06259-6. 2nd Edition. New York: Wiley-Interscience, 2006. ISBN 0-471-24195-4.
Csiszar, I, Korner, J. Information Theory: Coding Theorems for Discrete Memoryless Systems, Akademiai Kiado: 2nd edition, 1997. ISBN 963-05-7440-3
MacKay, DJC. Information Theory, Inference, and Learning Algorithms. Cambridge: Cambridge University Press, 2003. ISBN 0-521-64298-1
Pierce, JR. An Introduction to Information Theory: Symbols, Signals and Noise. Dover (2nd Edition). 1961 (reprinted by Dover 1980).
Shannon, CE; Warren Weaver. The Mathematical Theory of Communication. Univ of Illinois Press, 1949. ISBN 0-252-72548-4
Stone, JV. Chapter 1 of book Information Theory: A Tutorial Introduction, University of Sheffield, England, 2014. ISBN 978-0956372857.
Yeung, RW. A First Course in Information Theory. Kluwer Academic/Plenum Publishers, 2002. ISBN 0-306-46791-7.
Yeung, RW. Information Theory and Network Coding. Springer 2008, 2002. ISBN 978-0-387-79233-0
Escolano, Suau, Bonev. Information Theory in Computer Vision and Pattern Recognition. Springer, 2009. ISBN 978-1-84882-296-2

3.7.4 Other books

Tom Siegfried, The Bit and the Pendulum, Wiley, 2000. ISBN 0-471-32174-5
Charles Seife, Decoding The Universe, Viking, 2006. ISBN 0-670-03441-X
Jeremy Campbell, Grammatical Man, Touchstone/Simon & Schuster, 1982, ISBN 0-671-44062-4
Henri Theil, Economics and Information Theory, Rand McNally & Company - Chicago, 1967.

3.8 External links

Erill I. (2012), "A gentle introduction to information content in transcription factor binding sites" (University of Maryland, Baltimore County)
Lambert F. L. (1999), "Shuffled Cards, Messy Desks, and Disorderly Dorm Rooms - Examples of Entropy Increase? Nonsense!", Journal of Chemical Education
Schneider T. D. (2014), "Information Theory Primer"
Srinivasa, S., "A Review on Multivariate Mutual Information"
IEEE Information Theory Society and ITSoc review articles
Computational science
Predict future or unobserved situations (e.g., weather, sub-atomic particle behaviour, ...)

High order difference approximations via Taylor series and Richardson extrapolation
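As a hedged illustration of the list item above (our example, not from the chapter), Richardson extrapolation combines two central-difference estimates so that the leading error term cancels:

import math

def central_diff(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)          # error is O(h^2)

def richardson(f, x, h):
    d_h = central_diff(f, x, h)
    d_h2 = central_diff(f, x, h / 2)
    return (4 * d_h2 - d_h) / 3                     # leading O(h^2) term cancels, leaving O(h^4)

x, h = 1.0, 0.1
exact = math.cos(1.0)
print(abs(central_diff(math.sin, x, h) - exact))    # roughly 9e-04
print(abs(richardson(math.sin, x, h) - exact))      # roughly 1e-07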
There are also programs in areas such as computational physics, computational chemistry, etc.

4.6 Related fields

Bioinformatics
Cheminformatics
Chemometrics
Computational archaeology
Computational biology
Computational chemistry
Computational economics
Computational electromagnetics
Computational neuroscience
Computational particle physics
Computational physics
Computational sociology
Computational statistics
Computer algebra
Environmental simulation
Financial modeling
Geographic information system (GIS)
High performance computing
Machine learning
Numerical linear algebra
Numerical weather prediction
Pattern recognition
Scientific visualization

4.7 See also

Computational science and engineering
Comparison of computer algebra systems
List of molecular modeling software
List of numerical analysis software
List of statistical packages
Exploratory data analysis

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from initial data analysis (IDA),[1] which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing, and handling missing values and making transformations of variables as needed. EDA encompasses IDA.

... tailed distributions than traditional summaries (the mean and standard deviation). The packages S, S-PLUS, and R included routines using resampling statistics, such as Quenouille and Tukey's jackknife and Efron's bootstrap, which are nonparametric and robust (for many problems). Exploratory data analysis, robust statistics, nonparametric statistics, and the development of statistical programming languages facilitated statisticians' work on scientific and engineering problems. Such problems included the fabrication of semiconductors and the understanding of communications networks, which concerned Bell Labs. These statistical developments, all championed by Tukey, were designed to complement the analytic theory of testing statistical hypotheses, particularly the Laplacian tradition's emphasis on exponential families.[3]
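To make the resampling idea concrete, here is a short sketch (ours, not from the article) of a bootstrap estimate of the uncertainty of a median, the kind of routine that S, S-PLUS, and R bundle:

import numpy as np

rng = np.random.default_rng(1)
data = rng.lognormal(size=100)            # a skewed, long-tailed sample

# Bootstrap: resample with replacement and recompute the statistic each time.
medians = [np.median(rng.choice(data, size=data.size, replace=True))
           for _ in range(2000)]
print(np.median(data), np.percentile(medians, [2.5, 97.5]))  # point estimate and a 95% interval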
Principal component analysis
Multilinear PCA
Projection methods such as grand tour, guided tour and manual tour
Interactive versions of these plots

Typical quantitative techniques are:

Median polish

Histogram of tips given by customers with bins equal to $1 increments. The distribution of values is skewed right and unimodal, which says that there are few high tips, but lots of low tips.

Histogram of tips given by customers with bins equal to 10c increments. An interesting phenomenon is visible: peaks in the counts at the full and half-dollar amounts. This corresponds to customers rounding tips. This is a behaviour that is common to other types of purchases too, like gasoline.
... in the lower right than upper left. Points in the lower right correspond to tips that are lower than expected, and it is clear that more customers are cheap rather than generous.

Scatterplot of tips vs. bill separately by gender and smoking party. Smoking parties have a lot more variability in the tips that they give. Males tend to pay the (few) higher bills, and female non-smokers tend to be very consistent tippers (with the exception of three women).

What is learned from the graphics is different from what could be learned by the modeling. You can say that these pictures help the data tell us a story, that we have discovered some features of tipping that perhaps we didn't anticipate in advance.

5.8 References

[1] Chatfield, C. (1995). Problem Solving: A Statistician's Guide (2nd ed.). Chapman and Hall. ISBN 0412606305.
[2] John Tukey, The Future of Data Analysis, July 1961
[3] "Conversation with John W. Tukey and Elizabeth Tukey", Luisa T. Fernholz and Stephan Morgenthaler. Statistical Science 15 (1): 79-94. 2000. doi:10.1214/ss/1009212675.
[4] Tukey, John W. (1977). Exploratory Data Analysis. Pearson. ISBN 978-0201076165.
[5] Behrens, Principles and Procedures of Exploratory Data Analysis, American Psychological Association, 1997
[6] Konold, C. (1999). Statistics goes to school. Contemporary Psychology 44 (1): 81-82. doi:10.1037/001949.
Predictive analytics
Predictive analytics encompasses a variety of statistical techniques from modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future, or otherwise unknown, events.[1][2]

In business, predictive models exploit patterns found in historical and transactional data to identify risks and opportunities. Models capture relationships among many factors to allow assessment of risk or potential associated with a particular set of conditions, guiding decision making for candidate transactions.[3]

The defining functional effect of these technical approaches is that predictive analytics provides a predictive score (probability) for each individual (customer, employee, healthcare patient, product SKU, vehicle, component, machine, or other organizational unit) in order to determine, inform, or influence organizational processes that pertain across large numbers of individuals, such as in marketing, credit risk assessment, fraud detection, manufacturing, healthcare, and government operations including law enforcement.

Predictive analytics is used in actuarial science,[4] marketing,[5] financial services,[6] insurance,[7] telecommunications,[8] retail, travel,[9] healthcare,[10] pharmaceuticals[11] and other fields.

One of the most well known applications is credit scoring,[1] which is used throughout financial services. Scoring models process a customer's credit history, loan application, customer data, etc., in order to rank-order individuals by their likelihood of making future credit payments on time.

6.1 Definition

Predictive analytics is an area of data mining that deals with extracting information from data and using it to predict trends and behavior patterns. Often the unknown event of interest is in the future, but predictive analytics can be applied to any type of unknown, whether it be in the past, present or future. For example, identifying suspects after a crime has been committed, or credit card fraud as it occurs.[12] The core of predictive analytics relies on capturing relationships between explanatory variables and the predicted variables from past occurrences, and exploiting them to predict the unknown outcome. It is important to note, however, that the accuracy and usability of results will depend greatly on the level of data analysis and the quality of assumptions.

Predictive analytics is often defined as predicting at a more detailed level of granularity, i.e., generating predictive scores (probabilities) for each individual organizational element. This distinguishes it from forecasting. For example, "Predictive analytics: Technology that learns from experience (data) to predict the future behavior of individuals in order to drive better decisions."[13]

6.2 Types

Generally, the term predictive analytics is used to mean predictive modeling, scoring data with predictive models, and forecasting. However, people are increasingly using the term to refer to related analytical disciplines, such as descriptive modeling and decision modeling or optimization. These disciplines also involve rigorous data analysis, and are widely used in business for segmentation and decision making, but have different purposes and the statistical techniques underlying them vary.

6.2.1 Predictive models

Predictive models are models of the relation between the specific performance of a unit in a sample and one or more known attributes or features of the unit. The objective of the model is to assess the likelihood that a similar unit in a different sample will exhibit the specific performance. This category encompasses models in many areas, such as marketing, where they seek out subtle data patterns to answer questions about customer performance, or fraud detection models. Predictive models often perform calculations during live transactions, for example, to evaluate the risk or opportunity of a given customer or transaction, in order to guide a decision. With advancements in computing speed, individual agent modeling systems have become capable of simulating human behaviour or reactions to given stimuli or scenarios.
... can extend from project to market, and from near to long term. Underwriting (see below) and other business approaches identify risk management as a predictive method.

6.3.10 Underwriting

Many businesses have to account for risk exposure due to their different services and determine the cost needed to cover the risk. For example, auto insurance providers need to accurately determine the amount of premium to charge to cover each automobile and driver. A financial company needs to assess a borrower's potential and ability to pay before granting a loan. For a health insurance provider, predictive analytics can analyze a few years of past medical claims data, as well as lab, pharmacy and other records where available, to predict how expensive an enrollee is likely to be in the future. Predictive analytics can help underwrite these quantities by predicting the chances of illness, default, bankruptcy, etc. Predictive analytics can streamline the process of customer acquisition by predicting the future risk behavior of a customer using application level data.[4] Predictive analytics in the form of credit scores have reduced the amount of time it takes for loan approvals, especially in the mortgage market where lending decisions are now made in a matter of hours rather than days or even weeks. Proper predictive analytics can lead to proper pricing decisions, which can help mitigate future risk of default.

6.4 Technology and big data influences
Big data is a collection of data sets that are so large and complex that they become awkward to work with using traditional database management tools. The volume, variety and velocity of big data have introduced challenges across the board for capture, storage, search, sharing, analysis, and visualization. Examples of big data sources include web logs, RFID, sensor data, social networks, Internet search indexing, call detail records, military surveillance, and complex data in astronomic, biogeochemical, genomics, and atmospheric sciences. Big Data is the core of most predictive analytic services offered by IT organizations.[20] Thanks to technological advances in computer hardware (faster CPUs, cheaper memory, and MPP architectures) and new technologies such as Hadoop, MapReduce, and in-database and text analytics for processing big data, it is now feasible to collect, analyze, and mine massive amounts of structured and unstructured data for new insights.[15] Today, exploring big data and using predictive analytics is within reach of more organizations than ever before, and new methods that are capable of handling such datasets are proposed.[21][22]

6.5 Analytical Techniques

The approaches and techniques used to conduct predictive analytics can broadly be grouped into regression techniques and machine learning techniques.

6.5.1 Regression techniques

Regression models are the mainstay of predictive analytics. The focus lies on establishing a mathematical equation as a model to represent the interactions between the different variables in consideration. Depending on the situation, there are a wide variety of models that can be applied while performing predictive analytics. Some of them are briefly discussed below.

Linear regression model

The linear regression model analyzes the relationship between the response or dependent variable and a set of independent or predictor variables. This relationship is expressed as an equation that predicts the response variable as a linear function of the parameters. These parameters are adjusted so that a measure of fit is optimized. Much of the effort in model fitting is focused on minimizing the size of the residual, as well as ensuring that it is randomly distributed with respect to the model predictions.

The goal of regression is to select the parameters of the model so as to minimize the sum of the squared residuals. This is referred to as ordinary least squares (OLS) estimation and results in best linear unbiased estimates (BLUE) of the parameters if and only if the Gauss-Markov assumptions are satisfied.

Once the model has been estimated, we would be interested to know if the predictor variables belong in the model, i.e. is the estimate of each variable's contribution reliable? To do this we can check the statistical significance of the model's coefficients, which can be measured using the t-statistic. This amounts to testing whether the coefficient is significantly different from zero. How well the model predicts the dependent variable based on the value of the independent variables can be assessed by using the R² statistic. It measures the predictive power of the model, i.e. the proportion of the total variation in the dependent variable that is "explained" (accounted for) by variation in the independent variables.
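A minimal ordinary least squares sketch (ours, not from the article), matching the description above: choose the parameters that minimize the sum of squared residuals, then report the share of variation explained (R²).

import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(scale=2.0, size=200)   # true line plus noise

X = np.column_stack([np.ones_like(x), x])             # design matrix with an intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)          # OLS parameter estimates

residuals = y - X @ beta
r_squared = 1 - residuals.var() / y.var()             # share of variation explained
print(beta, r_squared)                                # roughly [3.0, 2.0] and R² around 0.9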
Discrete choice models

Multivariate regression (above) is generally used when the response variable is continuous and has an unbounded range. Often the response variable may not be continuous but rather discrete. While mathematically it is feasible to apply multivariate regression to discrete ordered dependent variables, some of the assumptions behind the theory of multivariate linear regression no longer hold, and there are other techniques such as discrete choice models which are better suited for this type of analysis. If the dependent variable is discrete, some of those superior methods are logistic regression, multinomial logit and probit models. Logistic regression and probit models are used when the dependent variable is binary.

Logistic regression

For more details on this topic, see logistic regression.

In a classification setting, assigning outcome probabilities to observations can be achieved through the use of a logistic model, which is basically a method which transforms information about the binary dependent variable into an unbounded continuous variable and estimates a regular multivariate model (see Allison's Logistic Regression for more information on the theory of logistic regression).

The Wald and likelihood-ratio tests are used to test the statistical significance of each coefficient b in the model (analogous to the t tests used in OLS regression; see above). A test assessing the goodness-of-fit of a classification model is the "percentage correctly predicted".

... situations where the observed variable y is continuous but takes values between 0 and 1.

Logit versus probit

The probit model has been around longer than the logit model. They behave similarly, except that the logistic distribution tends to be slightly flatter tailed. One of the reasons the logit model was formulated was that the probit model was computationally difficult due to the requirement of numerically calculating integrals. Modern computing, however, has made this computation fairly simple. The coefficients obtained from the logit and probit models are fairly close. However, the odds ratio is easier to interpret in the logit model.

Practical reasons for choosing the probit model over the logistic model would be:

There is a strong belief that the underlying distribution is normal

The actual event is not a binary outcome (e.g., bankruptcy status) but a proportion (e.g., proportion of population at different debt levels).
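A hedged sketch (ours) of the comparison just described: fit the same simulated binary outcome with a logit and a probit model using statsmodels and note that the coefficients are close once the probit estimates are rescaled (the factor of roughly 1.6 reflects the different spreads of the logistic and normal distributions).

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
X = sm.add_constant(x)                        # intercept plus one predictor
p = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))        # true logistic relationship
y = rng.binomial(1, p)

logit = sm.Logit(y, X).fit(disp=0)
probit = sm.Probit(y, X).fit(disp=0)
print(logit.params)                           # roughly [0.5, 1.2]
print(probit.params * 1.6)                    # broadly comparable after rescaling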
... the series is stationary or not, and the presence of seasonality, by examining plots of the series and the autocorrelation and partial autocorrelation functions. In the estimation stage, models are estimated using non-linear time series or maximum likelihood estimation procedures. Finally, the validation stage involves diagnostic checking, such as plotting the residuals to detect outliers and evidence of model fit.

In recent years time series models have become more sophisticated and attempt to model conditional heteroskedasticity with models such as ARCH (autoregressive conditional heteroskedasticity) and GARCH (generalized autoregressive conditional heteroskedasticity) models frequently used for financial time series. In addition, time series models are also used to understand inter-relationships among economic variables represented by systems of equations using VAR (vector autoregression) and structural VAR models.
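The identification, estimation, and validation workflow sketched above can be illustrated with statsmodels (our example; the simulated AR(1) series and the chosen model order are assumptions for the demo):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
e = rng.normal(size=500)
y = np.zeros(500)
for t in range(1, 500):                  # simulate an AR(1) series
    y[t] = 0.7 * y[t - 1] + e[t]

print(adfuller(y)[1])                    # identification: small p-value suggests stationarity
model = ARIMA(y, order=(1, 0, 0)).fit()  # estimation by maximum likelihood
print(model.params)                      # fitted AR coefficient should be close to 0.7
resid = model.resid                      # validation: residuals should look like noise
print(abs(resid.mean()) < 0.1)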
Survival or duration analysis

Survival analysis is another name for time-to-event analysis. These techniques were primarily developed in the medical and biological sciences, but they are also widely used in the social sciences like economics, as well as in engineering (reliability and failure time analysis).

Censoring and non-normality, which are characteristic of survival data, generate difficulty when trying to analyze the data using conventional statistical models such as multiple linear regression. The normal distribution, being a symmetric distribution, takes positive as well as negative values, but duration by its very nature cannot be negative, and therefore normality cannot be assumed when dealing with duration/survival data. Hence the normality assumption of regression models is violated.

The assumption is that if the data were not censored they would be representative of the population of interest. In survival analysis, censored observations arise whenever the dependent variable of interest represents the time to a terminal event and the duration of the study is limited in time.

An important concept in survival analysis is the hazard rate, defined as the probability that the event will occur at time t conditional on surviving until time t. Another concept related to the hazard rate is the survival function, which can be defined as the probability of surviving to time t.

Most models try to model the hazard rate by choosing the underlying distribution depending on the shape of the hazard function. A distribution whose hazard function slopes upward is said to have positive duration dependence, a decreasing hazard shows negative duration dependence, whereas constant hazard is a process with no memory, usually characterized by the exponential distribution. Some of the distributional choices in survival models are: F, gamma, Weibull, log normal, inverse normal, exponential, etc. All these distributions are for a non-negative random variable.

Duration models can be parametric, non-parametric or semi-parametric. Some of the models commonly used are Kaplan-Meier and the Cox proportional hazard model (non-parametric).

Classification and regression trees

Main article: decision tree learning

Globally-optimal classification tree analysis (GO-CTA), also called hierarchical optimal discriminant analysis, is a generalization of optimal discriminant analysis that may be used to identify the statistical model that has maximum accuracy for predicting the value of a categorical dependent variable for a dataset consisting of categorical and continuous variables. The output of HODA is a non-orthogonal tree that combines categorical variables and cut points for continuous variables that yields maximum predictive accuracy, an assessment of the exact Type I error rate, and an evaluation of potential cross-generalizability of the statistical model. Hierarchical optimal discriminant analysis may be thought of as a generalization of Fisher's linear discriminant analysis. Optimal discriminant analysis is an alternative to ANOVA (analysis of variance) and regression analysis, which attempt to express one dependent variable as a linear combination of other features or measurements. However, ANOVA and regression analysis give a dependent variable that is a numerical variable, while hierarchical optimal discriminant analysis gives a dependent variable that is a class variable.

Classification and regression trees (CART) are a non-parametric decision tree learning technique that produces either classification or regression trees, depending on whether the dependent variable is categorical or numeric, respectively.

Decision trees are formed by a collection of rules based on variables in the modeling data set:

Rules based on variables' values are selected to get the best split to differentiate observations based on the dependent variable

Once a rule is selected and splits a node into two, the same process is applied to each "child" node (i.e. it is a recursive procedure)

Splitting stops when CART detects no further gain can be made, or some pre-set stopping rules are met. (Alternatively, the data are split as much as possible and then the tree is later pruned.)
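The recursive partitioning just listed is what standard CART implementations automate; a brief sketch (ours) using scikit-learn's DecisionTreeClassifier on the bundled iris data:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)   # pre-set stopping rule
tree.fit(X, y)
print(export_text(tree))          # the learned splitting rules, one per node
print(tree.predict(X[:5]))        # every observation falls into exactly one leaf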
Each branch of the tree ends in a terminal node. Each observation falls into one and exactly one terminal node, and each terminal node is uniquely defined by a set of rules.

A very popular method for predictive analytics is Leo Breiman's random forests, or derived versions of this technique like random multinomial logit.

Multivariate adaptive regression splines

Multivariate adaptive regression splines (MARS) is a non-parametric technique that builds flexible models by fitting piecewise linear regressions.

An important concept associated with regression splines is that of a knot. A knot is where one local regression model gives way to another and thus is the point of intersection between two splines.

In multivariate and adaptive regression splines, basis functions are the tool used for generalizing the search for knots. Basis functions are a set of functions used to represent the information contained in one or more variables. The MARS model almost always creates the basis functions in pairs.

The multivariate and adaptive regression spline approach deliberately overfits the model and then prunes to get to the optimal model. The algorithm is computationally very intensive and in practice we are required to specify an upper limit on the number of basis functions.

6.5.2 Machine learning techniques

Machine learning, a branch of artificial intelligence, was originally employed to develop techniques to enable computers to learn. Today, since it includes a number of advanced statistical methods for regression and classification, it finds application in a wide variety of fields including medical diagnostics, credit card fraud detection, face and speech recognition and analysis of the stock market. In certain applications it is sufficient to directly predict the dependent variable without focusing on the underlying relationships between variables. In other cases, the underlying relationships can be very complex and the mathematical form of the dependencies unknown. For such cases, machine learning techniques emulate human cognition and learn from training examples to predict future events.

A brief discussion of some of these methods used commonly for predictive analytics is provided below. A detailed study of machine learning can be found in Mitchell (1997).

Neural networks

... classification or control in a wide spectrum of fields such as finance, cognitive psychology/neuroscience, medicine, engineering, and physics.

Neural networks are used when the exact nature of the relationship between inputs and output is not known. A key feature of neural networks is that they learn the relationship between inputs and output through training. There are three types of training used by different networks: supervised training, unsupervised training, and reinforcement learning, with supervised being the most common one.

Some examples of neural network training techniques are backpropagation, quick propagation, conjugate gradient descent, projection operator, Delta-Bar-Delta, etc. Some network architectures are multilayer perceptrons, Kohonen networks, Hopfield networks, etc.

Multilayer Perceptron (MLP)

The Multilayer Perceptron (MLP) consists of an input and an output layer with one or more hidden layers of nonlinearly-activating nodes or sigmoid nodes. Its behaviour is determined by the weight vector, and it is necessary to adjust the weights of the network. Backpropagation employs gradient descent to minimize the squared error between the network output values and the desired values for those outputs. The weights are adjusted by an iterative process of repeatedly presenting attributes; making small changes in the weights to reach the desired values is called training the net and is driven by the training set (learning rule).

Radial basis functions

A radial basis function (RBF) is a function which has built into it a distance criterion with respect to a center. Such functions can be used very efficiently for interpolation and for smoothing of data. Radial basis functions have been applied in the area of neural networks where they are used as a replacement for the sigmoidal transfer function. Such networks have three layers: the input layer, the hidden layer with the RBF non-linearity, and a linear output layer. The most popular choice for the non-linearity is the Gaussian. RBF networks have the advantage of not being locked into local minima as the feed-forward networks such as the multilayer perceptron are.

Support vector machines
Geospatial predictive modeling

Conceptually, geospatial predictive modeling is rooted in the principle that the occurrences of events being modeled are limited in distribution. Occurrences of events are neither uniform nor random in distribution; there are spatial environment factors (infrastructure, sociocultural, topographic, etc.) that constrain and influence where the locations of events occur. Geospatial predictive modeling attempts to describe those constraints and influences by spatially correlating occurrences of historical geospatial locations with environmental factors that represent those constraints and influences. Geospatial predictive modeling is a process for analyzing events through a geographic filter in order to make statements of likelihood for event occurrence or emergence.

GNU Octave
Apache Mahout

Notable commercial predictive analytic tools include:

Alpine Data Labs
BIRT Analytics
Angoss KnowledgeSTUDIO
IBM SPSS Statistics and IBM SPSS Modeler
KXEN Modeler
Mathematica
MATLAB
STATISTICA
TIBCO

The most popular commercial predictive analytics software packages according to the Rexer Analytics Survey for 2013 are IBM SPSS Modeler, SAS Enterprise Miner, and Dell Statistica. <http://www.rexeranalytics.com/Data-Miner-Survey-2013-Intro.html>

6.6.1 PMML

In an attempt to provide a standard language for expressing predictive models, the Predictive Model Markup Language (PMML) has been proposed. Such an XML-based language provides a way for the different tools to define predictive models and to share these between PMML-compliant applications. PMML 4.0 was released in June 2009.

6.7 Criticism

There are plenty of skeptics when it comes to computers' and algorithms' abilities to predict the future, including Gary King, a professor from Harvard University and the director of the Institute for Quantitative Social Science.[25] People are influenced by their environment in innumerable ways. Trying to understand what people will do next assumes that all the influential variables can be known and measured accurately. "People's environments change even more quickly than they themselves do. Everything from the weather to their relationship with their mother can change the way people think and act. All of those variables are unpredictable. How they will impact a person is even less predictable. If put in the exact same situation tomorrow, they may make a completely different decision. This means that a statistical prediction is only valid in sterile laboratory conditions, which suddenly isn't as useful as it seemed before."[26]
6.9 References

[1] Nyce, Charles (2007), Predictive Analytics White Paper (PDF), American Institute for Chartered Property Casualty Underwriters/Insurance Institute of America, p. 1
[2] Eckerson, Wayne (May 10, 2007), Extending the Value of Your Data Warehousing Investment, The Data Warehouse Institute
[3] Coker, Frank (2014). Pulse: Understanding the Vital Signs of Your Business (1st ed.). Bellevue, WA: Ambient Light Publishing. pp. 30, 39, 42, more. ISBN 978-0-9893086-0-1.
[4] Conz, Nathan (September 2, 2008), "Insurers Shift to Customer-focused Predictive Analytics Technologies", Insurance & Technology
[5] Fletcher, Heather (March 2, 2011), "The 7 Best Uses for Predictive Analytics in Multichannel Marketing", Target Marketing
[6] Korn, Sue (April 21, 2011), "The Opportunity for Predictive Analytics in Finance", HPC Wire
[7] Barkin, Eric (May 2011), "CRM + Predictive Analytics: Why It All Adds Up", Destination CRM
[8] Das, Krantik; Vidyashankar, G.S. (July 1, 2006), "Competitive Advantage in Retail Through Analytics: Developing Insights, Creating Value", Information Management
[9] McDonald, Michèle (September 2, 2010), "New Technology Taps 'Predictive Analytics' to Target Travel Recommendations", Travel Market Report
[10] Stevenson, Erin (December 16, 2011), "Tech Beat: Can you pronounce health care predictive analytics?", Times-Standard
[11] McKay, Lauren (August 2009), "The New Prescription for Pharma", Destination CRM
[12] Finlay, Steven (2014). Predictive Analytics, Data Mining and Big Data. Myths, Misconceptions and Methods (1st ed.). Basingstoke: Palgrave Macmillan. p. 237. ISBN 1137379278.
[13] Siegel, Eric (2013). Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die (1st ed.). Wiley. ISBN 978-1-1183-5685-2.
[14] Reichheld, Frederick; Schefter, Phil. "The Economics of E-Loyalty". http://hbswk.hbs.edu/''. Havard Business School. Retrieved 10 November 2014.
[15] Schiff, Mike (March 6, 2012), BI Experts: Why Predictive Analytics Will Continue to Grow, The Data Warehouse Institute
[16] Nigrini, Mark (June 2011). Forensic Analytics: Methods and Techniques for Forensic Accounting Investigations. Hoboken, NJ: John Wiley & Sons Inc. ISBN 978-0-470-89046-2.
[17] Dhar, Vasant (April 2011). "Prediction in Financial Markets: The Case for Small Disjuncts". ACM Transactions on Intelligent Systems and Technologies 2 (3).
[18] Dhar, Vasant; Chou, Dashin and Provost, Foster (October 2000). "Discovering Interesting Patterns in Investment Decision Making with GLOWER - A Genetic Learning Algorithm Overlaid With Entropy Reduction". Data Mining and Knowledge Discovery 4 (4).
[19] https://acc.dau.mil/CommunityBrowser.aspx?id=126070
[20] http://www.hcltech.com/sites/default/files/key_to_monetizing_big_data_via_predictive_analytics.pdf
[21] Ben-Gal I., Dana A., Shkolnik N. and Singer (2014). "Efficient Construction of Decision Trees by the Dual Information Distance Method" (PDF). Quality Technology & Quantitative Management (QTQM), 11(1), 133-147.
[22] Ben-Gal I., Shavitt Y., Weinsberg E., Weinsberg U. (2014). "Peer-to-peer information retrieval using shared-content clustering" (PDF). Knowl Inf Syst. doi:10.1007/s10115-013-0619-9.

6.10 Further reading

Coggeshall, Stephen; Davies, John; Jones, Roger; and Schutzer, Daniel, "Intelligent Security Systems", in Freedman, Roy S.; Flein, Robert A.; and Lederman, Jess (eds.) (1995). Artificial Intelligence in the Capital Markets. Chicago: Irwin. ISBN 1-55738-811-3.
L. Devroye, L. Györfi, G. Lugosi (1996). A Probabilistic Theory of Pattern Recognition. New York: Springer-Verlag.
Enders, Walter (2004). Applied Time Series Econometrics. Hoboken: John Wiley and Sons. ISBN 0-521-83919-X.
Greene, William (2012). Econometric Analysis, 7th Ed. London: Prentice Hall. ISBN 978-0-13-139538-1.
Guidère, Mathieu; Howard, N.; Argamon, Sh. (2009). Rich Language Analysis for Counterterrorism. Berlin, London, New York: Springer-Verlag. ISBN 978-3-642-01140-5.
Mitchell, Tom (1997). Machine Learning. New York: McGraw-Hill. ISBN 0-07-042807-7.
Siegel, Eric (2013). Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die. John Wiley. ISBN 978-1-1183-5685-2.
Tukey, John (1977). Exploratory Data Analysis. New York: Addison-Wesley. ISBN 0-201-07616-0.
Finlay, Steven (2014). Predictive Analytics, Data Mining and Big Data. Myths, Misconceptions and Methods. Basingstoke: Palgrave Macmillan. ISBN 978-1-137-37927-6.
Coker, Frank (2014). Pulse: Understanding the Vital Signs of Your Business. Bellevue, WA: Ambient Light Publishing. ISBN 978-0-9893086-0-1.
Business intelligence
Business intelligence (BI) is the set of techniques and tools for the transformation of raw data into meaningful and useful information for business analysis purposes. BI technologies are capable of handling large amounts of unstructured data to help identify, develop and otherwise create new strategic business opportunities. The goal of BI is to allow for the easy interpretation of these large volumes of data. Identifying new opportunities and implementing an effective strategy based on insights can provide businesses with a competitive market advantage and long-term stability.[1]

BI technologies provide historical, current and predictive views of business operations. Common functions of business intelligence technologies are reporting, online analytical processing, analytics, data mining, process mining, complex event processing, business performance management, benchmarking, text mining, predictive analytics and prescriptive analytics.

BI can be used to support a wide range of business decisions ranging from operational to strategic. Basic operating decisions include product positioning or pricing. Strategic business decisions include priorities, goals and directions at the broadest level. In all cases, BI is most effective when it combines data derived from the market in which a company operates (external data) with data from company sources internal to the business such as financial and operations data (internal data). When combined, external and internal data can provide a more complete picture which, in effect, creates an "intelligence" that cannot be derived by any singular set of data.[2]

7.1 Components

Business intelligence is made up of an increasing number of components including:

Multidimensional aggregation and allocation
Denormalization, tagging and standardization
Group consolidation, budgeting and rolling forecasts
Statistical inference and probabilistic simulation
Key performance indicators optimization
Version control and process management
Open item management

7.2 History

The term "Business Intelligence" was originally coined by Richard Millar Devens in the Cyclopaedia of Commercial and Business Anecdotes from 1865. Devens used the term to describe how the banker, Sir Henry Furnese, gained profit by receiving and acting upon information about his environment, prior to his competitors. "Throughout Holland, Flanders, France, and Germany, he maintained a complete and perfect train of business intelligence. The news of the many battles fought was thus received first by him, and the fall of Namur added to his profits, owing to his early receipt of the news." (Devens, (1865), p. 210). The ability to collect and react accordingly based on the information retrieved, an ability that Furnese excelled in, is today still at the very heart of BI.[3]

In a 1958 article, IBM researcher Hans Peter Luhn used the term business intelligence. He employed the Webster's dictionary definition of intelligence: "the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal."[4]

Business intelligence as it is understood today is said to have evolved from the decision support systems (DSS) that began in the 1960s and developed throughout the mid-1980s. DSS originated in the computer-aided models created to assist with decision making and planning. From DSS, data warehouses, Executive Information Systems, OLAP and business intelligence came into focus beginning in the late 80s.
the patterns within the data) that can then be presented with a topical focus on company competitors. If under-
to human decision-makers. stood broadly, business intelligence can include the subset
[11]
In 1989, Howard Dresner (later a Gartner Group an- of competitive intelligence.
alyst) proposed business intelligence as an umbrella
term to describe concepts and methods to improve
business decision making by using fact-based support 7.5 Comparison with business an-
systems.[6] It was not until the late 1990s that this us-
age was widespread.[7]
alytics
Business intelligence and business analytics are some-
times used interchangeably, but there are alternate
7.3 Data warehousing denitions.[12] One denition contrasts the two, stat-
ing that the term business intelligence refers to collect-
Often BI applications use data gathered from a data ware- ing business data to nd information primarily through
house (DW) or from a data mart, and the concepts of asking questions, reporting, and online analytical pro-
BI and DW sometimes combine as "BI/DW"[8] or as cesses. Business analytics, on the other hand, uses statis-
"BIDW". A data warehouse contains a copy of analyt- tical and quantitative tools for explanatory and predictive
ical data that facilitates decision support. However, not modeling.[13]
all data warehouses serve for business intelligence, nor do In an alternate denition, Thomas Davenport, professor
all business intelligence applications require a data ware- of information technology and management at Babson
house. College argues that business intelligence should be di-
To distinguish between the concepts of business intelli- vided into querying, reporting, Online analytical process-
gence and data warehouses, Forrester Research denes ing (OLAP), an alerts tool, and business analytics. In
business intelligence in one of two ways: this denition, business analytics is the subset of BI fo-
cusing on statistics, prediction, and optimization, rather
than the reporting functionality.[14]
1. Using a broad denition: Business Intelligence
is a set of methodologies, processes, architec-
tures, and technologies that transform raw data into
meaningful and useful information used to enable 7.6 Applications in an enterprise
more eective strategic, tactical, and operational in-
sights and decision-making.[9] Under this deni- Business intelligence can be applied to the following busi-
tion, business intelligence also includes technologies ness purposes, in order to drive business value.
such as data integration, data quality, data warehous-
ing, master-data management, text- and content-
analytics, and many others that the market some- 1. Measurement program that creates a hierarchy
times lumps into the "Information Management" of performance metrics (see also Metrics Refer-
segment. Therefore, Forrester refers to data prepa- ence Model) and benchmarking that informs busi-
ration and data usage as two separate but closely ness leaders about progress towards business goals
linked segments of the business-intelligence archi- (business process management).
tectural stack.
2. Analytics program that builds quantitative pro-
2. Forrester denes the narrower business-intelligence cesses for a business to arrive at optimal deci-
market as, "...referring to just the top layers of the BI sions and to perform business knowledge discovery.
architectural stack such as reporting, analytics and Frequently involves: data mining, process mining,
dashboards. [10] statistical analysis, predictive analytics, predictive
modeling, business process modeling, data lineage,
complex event processing and prescriptive analytics.
5. Knowledge management program to make the 1. The level of commitment and sponsorship of the
company data-driven through strategies and prac- project from senior management
tices to identify, create, represent, distribute, and
enable adoption of insights and experiences that are 2. The level of business need for creating a BI imple-
true business knowledge. Knowledge management mentation
leads to learning management and regulatory com- 3. The amount and quality of business data available.
pliance.
In addition to the above, business intelligence can provide 7.8.1 Business sponsorship
a pro-active approach, such as alert functionality that im-
mediately noties the end-user if certain conditions are The commitment and sponsorship of senior management
met. For example, if some business metric exceeds a is according to Kimball et al., the most important criteria
pre-dened threshold, the metric will be highlighted in for assessment.[20] This is because having strong manage-
standard reports, and the business analyst may be alerted ment backing helps overcome shortcomings elsewhere in
via e-mail or another monitoring service. This end-to- the project. However, as Kimball et al. state: even the
end process requires data governance, which should be most elegantly designed DW/BI system cannot overcome
handled by the expert. a lack of business [management] sponsorship.[21]
It is important that personnel who participate in the
project have a vision and an idea of the benets and draw-
7.7 Prioritization of projects backs of implementing a BI system. The best business
sponsor should have organizational clout and should be
well connected within the organization. It is ideal that the
It can be dicult to provide a positive business case for
business sponsor is demanding but also able to be realis-
business intelligence initiatives, and often the projects
tic and supportive if the implementation runs into delays
must be prioritized through strategic initiatives. BI
or drawbacks. The management sponsor also needs to
projects can attain higher prioritization within the orga-
be able to assume accountability and to take responsibil-
nization if managers consider the following:
ity for failures and setbacks on the project. Support from
multiple members of the management ensures the project
As described by Kimball[15] the BI manager must does not fail if one person leaves the steering group. How-
determine the tangible benets such as eliminated ever, having many managers work together on the project
cost of producing legacy reports. can also mean that there are several dierent interests that
attempt to pull the project in dierent directions, such as
Data access for the entire organization must be
if dierent departments want to put more emphasis on
enforced.[16] In this way even a small benet, such
their usage. This issue can be countered by an early and
as a few minutes saved, makes a dierence when
specic analysis of the business areas that benet the most
multiplied by the number of employees in the entire
from the implementation. All stakeholders in the project
organization.
should participate in this analysis in order for them to feel
As described by Ross, Weil & Roberson for En- invested in the project and to nd common ground.
terprise Architecture,[17] managers should also con- Another management problem that may be encountered
sider letting the BI project be driven by other busi- before the start of an implementation is an overly aggres-
ness initiatives with excellent business cases. To sive business sponsor. Problems of scope creep occur
support this approach, the organization must have when the sponsor requests data sets that were not spec-
enterprise architects who can identify suitable busi- ied in the original planning phase.
ness projects.
Companies that implement BI are often large, multina- Data Proling: check inappropriate value,
tional organizations with diverse subsidiaries.[23] A well- null/empty
designed BI solution provides a consolidated view of key
business data not available anywhere else in the organiza- 3. Data warehouse:
tion, giving management visibility and control over mea-
sures that otherwise would not exist. Completeness: check that all expected data are
loaded
Referential integrity: unique and existing ref-
7.8.3 Amount and quality of available data erential over all sources
Without proper data, or with too little quality data, any Consistency between sources: check consoli-
BI implementation fails; it does not matter how good the dated data vs sources
management sponsorship or business-driven motivation
is. Before implementation it is a good idea to do data pro- 4. Reporting:
ling. This analysis identies the content, consistency
and structure [..][22] of the data. This should be done as Uniqueness of indicators: only one share dic-
early as possible in the process and if the analysis shows tionary of indicators
that data is lacking, put the project on hold temporarily Formula accuracy: local reporting formula
while the IT department gures out how to properly col- should be avoided or checked
lect data.
When planning for business data and business intelligence requirements, it is always advisable to consider specific scenarios that apply to a particular organization, and then select the business intelligence features best suited for those scenarios.

Often, scenarios revolve around distinct business processes, each built on one or more data sources. These sources are used by features that present that data as information to knowledge workers, who subsequently act on that information. The business needs of the organization for each business process correspond to the essential steps of business intelligence. These essential steps include, but are not limited to:

1. Go through business data sources in order to collect needed data
2. Convert business data to information and present it appropriately
3. Query and analyze the data
4. Act on the collected data

The quality aspect in business intelligence should cover the entire process, from the source data to the final reporting. At each step, the quality gates are different:

1. Source data
   Data standardization: make data comparable (same unit, same pattern, ...)
   Master data management: unique referential
2. Operational Data Store (ODS)
   Data cleansing: detect and correct inaccurate data
   Data profiling: check for inappropriate, null or empty values
3. Data warehouse
   Completeness: check that all expected data are loaded
   Referential integrity: unique and existing referentials across all sources
   Consistency between sources: check consolidated data against the sources
4. Reporting
   Uniqueness of indicators: only one shared dictionary of indicators
   Formula accuracy: local reporting formulas should be avoided or checked
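Several of these gates lend themselves to simple automated checks. Below is a hedged sketch of what such checks could look like in Python with pandas; the file, table and column names (source_orders.csv, customer_id, amount) are hypothetical assumptions made only for this example.

```python
# Illustrative quality gates between load stages; names and thresholds are assumptions.
import pandas as pd

source   = pd.read_csv("source_orders.csv")     # operational extract
fact     = pd.read_csv("dw_fact_orders.csv")    # loaded data warehouse fact table
customer = pd.read_csv("dw_dim_customer.csv")   # customer dimension

# Completeness: every extracted row should have been loaded.
assert len(fact) == len(source), "completeness gate failed: row counts differ"

# Referential integrity: every fact row must point at an existing dimension key.
orphans = ~fact["customer_id"].isin(customer["customer_id"])
assert not orphans.any(), f"referential integrity gate failed: {orphans.sum()} orphan rows"

# Consistency between sources: consolidated totals should match the source totals.
assert abs(fact["amount"].sum() - source["amount"].sum()) < 1e-6, \
    "consistency gate failed: totals diverge"
```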
7.9 User aspect

Some considerations must be made in order to successfully integrate the usage of business intelligence systems in a company. Ultimately the BI system must be accepted and utilized by the users in order for it to add value to the organization.[24][25] If the usability of the system is poor, the users may become frustrated and spend a considerable amount of time figuring out how to use the system, or may not be able to really use it. If the system does not add value to the users' mission, they simply don't use it.[25]

To increase user acceptance of a BI system, it can be advisable to consult business users at an early stage of the DW/BI lifecycle, for example at the requirements gathering phase.[24] This can provide insight into the business process and into what the users need from the BI system. There are several methods for gathering this information, such as questionnaires and interview sessions.

When gathering requirements from the business users, the local IT department should also be consulted in order to determine to which degree it is possible to fulfill the business's needs based on the available data.[24]

Taking a user-centered approach throughout the design and development stage may further increase the chance of rapid user adoption of the BI system.[25]

Besides focusing on the user experience offered by the BI applications, it may also be possible to motivate the users to utilize the system by adding an element of competition. Kimball[24] suggests implementing a function on the Business Intelligence portal website where reports on system usage can be found. By doing so, managers can see how well their departments are doing and compare themselves to others, and this may spur them to encourage their staff to utilize the BI system even more.
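As a rough illustration of the kind of usage reporting Kimball describes, the sketch below aggregates a hypothetical portal access log into a monthly per-department report. The log file and its columns are assumptions made only for this example.

```python
# Illustrative only: per-department usage report from an assumed portal access log.
import pandas as pd

log = pd.read_csv("bi_portal_access_log.csv", parse_dates=["timestamp"])

usage = (log
         .groupby(["department", log["timestamp"].dt.to_period("M")])
         .agg(active_users=("user_id", "nunique"),
              report_views=("report_id", "count"))
         .reset_index()
         .sort_values(["timestamp", "report_views"], ascending=[True, False]))
print(usage)
```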
In a 2007 article, H. J. Watson gives an example of how the competitive element can act as an incentive.[26] Watson describes how a large call centre implemented performance dashboards for all call agents, with monthly incentive bonuses tied to performance metrics. In addition, agents could compare their performance to that of other team members. The implementation of this type of performance measurement and competition significantly improved agent performance.

BI's chances of success can be improved by involving senior management to help make BI a part of the organizational culture, and by providing the users with the necessary tools, training, and support.[26] Training encourages more people to use the BI application.[24]

Providing user support is necessary to maintain the BI system and resolve user problems.[25] User support can be incorporated in many ways, for example by creating a website. The website should contain great content and tools for finding the necessary information. Furthermore, helpdesk support can be used. The help desk can be manned by power users or the DW/BI project team.[24]

7.10 BI Portals

A Business Intelligence portal (BI portal) is the primary access interface for Data Warehouse (DW) and Business Intelligence (BI) applications. The BI portal is the user's first impression of the DW/BI system. It is typically a browser application, from which the user has access to all the individual services of the DW/BI system, reports and other analytical functionality. The BI portal must be implemented in such a way that it is easy for the users of the DW/BI application to call on the functionality of the application.[27]

The BI portal's main functionality is to provide a navigation system for the DW/BI application. This means that the portal has to be implemented in a way that gives the user access to all the functions of the DW/BI application.

The most common way to design the portal is to custom-fit it to the business processes of the organization for which the DW/BI application is designed; in that way the portal can best fit the needs and requirements of its users.[28]

The BI portal needs to be easy to use and understand, and if possible have a look and feel similar to other applications or web content of the organization the DW/BI application is designed for (consistency).

The following is a list of desirable features for web portals in general and BI portals in particular:

Usable: Users should easily find what they need in the BI tool.
Content rich: The portal is not just a report printing tool; it should contain more functionality such as advice, help, support information and documentation.
Clean: The portal should be designed so it is easily understandable and not overly complex, so as not to confuse the users.
Current: The portal should be updated regularly.
Interactive: The portal should be implemented in a way that makes it easy for the user to use its functionality and encourages them to use the portal. Scalability and customization give the user the means to fit the portal to each user.
Value oriented: It is important that the user has the feeling that the DW/BI application is a valuable resource that is worth working on.

7.11 Marketplace

There are a number of business intelligence vendors, often categorized into the remaining independent "pure-play" vendors and consolidated "megavendors" that have entered the market through a recent trend[29] of acquisitions in the BI industry.[30] The business intelligence market is gradually growing. In 2012, business intelligence services brought in $13.1 billion in revenue.[31]

Some companies adopting BI software decide to pick and choose from different product offerings (best of breed) rather than purchase one comprehensive integrated solution (full service).[32]

7.11.1 Industry-specific

Specific considerations for business intelligence systems have to be taken into account in some sectors, such as governmental banking regulation. The information collected by banking institutions and analyzed with BI software must be protected from some groups or individuals, while being fully available to other groups or individuals. Therefore, BI solutions must be sensitive to those needs and be flexible enough to adapt to new regulations and changes to existing law.

7.12 Semi-structured or unstructured data

Businesses create a huge amount of valuable information in the form of e-mails, memos, notes from call centers, news, user groups, chats, reports, web pages, presentations, image files, video files, and marketing material. According to Merrill Lynch, more than 85% of all business information exists in these forms. These information types are called either semi-structured or unstructured data. However, organizations often use these documents only once.[33]
The management of semi-structured data is recognized as a major unsolved problem in the information technology industry.[34] According to projections from Gartner (2003), white-collar workers spend anywhere from 30 to 40 percent of their time searching, finding and assessing unstructured data. BI uses both structured and unstructured data; the former is easy to search, while the latter contains a large quantity of the information needed for analysis and decision making.[34][35] Because of the difficulty of properly searching, finding and assessing unstructured or semi-structured data, organizations may not draw upon these vast reservoirs of information, which could influence a particular decision, task or project. This can ultimately lead to poorly informed decision making.[33]

Therefore, when designing a business intelligence/DW solution, the specific problems associated with semi-structured and unstructured data must be accommodated, as well as those for the structured data.[35]

7.12.1 Unstructured data vs. semi-structured data

Unstructured and semi-structured data have different meanings depending on their context. In the context of relational database systems, unstructured data cannot be stored in predictably ordered columns and rows. One type of unstructured data is typically stored in a BLOB (binary large object), a catch-all data type available in most relational database management systems. Unstructured data may also refer to irregularly or randomly repeated column patterns that vary from row to row within each file or document.

Many of these data types, however, like e-mails, word processing text files, PPTs, image files, and video files, conform to a standard that offers the possibility of metadata. Metadata can include information such as author and time of creation, and this can be stored in a relational database. Therefore, it may be more accurate to talk about this as semi-structured documents or data,[34] but no specific consensus seems to have been reached.

Unstructured data can also simply be the knowledge that business users have about future business trends. Business forecasting naturally aligns with the BI system because business users think of their business in aggregate terms. Capturing the business knowledge that may only exist in the minds of business users provides some of the most important data points for a complete BI solution.

7.12.2 Problems with semi-structured or unstructured data

There are several challenges to developing BI with semi-structured data. According to Inmon & Nesavich,[36] some of those are:

1. Physically accessing unstructured textual data: unstructured data is stored in a huge variety of formats.
2. Terminology: among researchers and analysts, there is a need to develop a standardized terminology.
3. Volume of data: as stated earlier, up to 85% of all data exists as semi-structured data. Couple that with the need for word-to-word and semantic analysis.
4. Searchability of unstructured textual data: a simple search on some data, e.g. "apple", results in links wherever there is a reference to that precise search term. Inmon & Nesavich (2008)[36] give an example: a search is made on the term "felony". In a simple search, the term "felony" is used, and everywhere there is a reference to "felony", a hit to an unstructured document is made. But a simple search is crude. It does not find references to crime, arson, murder, embezzlement, vehicular homicide, and such, even though these crimes are types of felonies.
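The limitation described in the last challenge can be illustrated in a few lines of Python; the documents and the hand-made expansion list below are invented solely for this toy example.

```python
# Toy illustration of why literal term matching is "crude": a search for
# "felony" misses documents that only mention specific felonies.
documents = [
    "The suspect was charged with felony theft.",
    "Investigators linked the fire to arson.",
    "An embezzlement scheme was uncovered at the firm.",
]

def simple_search(term, docs):
    return [d for d in docs if term.lower() in d.lower()]

print(simple_search("felony", documents))        # finds only the first document

# A slightly smarter search expands the query with related terms;
# the expansion list here is hand-made purely for illustration.
felony_terms = {"felony", "arson", "murder", "embezzlement", "vehicular homicide"}

def expanded_search(terms, docs):
    return [d for d in docs if any(t in d.lower() for t in terms)]

print(expanded_search(felony_terms, documents))  # finds all three documents
```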
7.12.3 The use of metadata

To solve problems with searchability and assessment of data, it is necessary to know something about the content. This can be done by adding context through the use of metadata.[33] Many systems already capture some metadata (e.g. filename, author, size, etc.), but more useful would be metadata about the actual content, e.g. summaries, topics, and people or companies mentioned. Two technologies designed for generating metadata about content are automatic categorization and information extraction.
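A deliberately naive sketch of content-level metadata generation is shown below. Real automatic categorization and information extraction systems are far more sophisticated; the sample text, stop-word list and regular expression here are illustrative assumptions only.

```python
# Naive content-metadata sketch: frequent keywords as "topics" and
# consecutive capitalised tokens as rough "mentions".
import re
from collections import Counter

text = ("Acme Corp signed a supply agreement with Jane Smith of Globex. "
        "The agreement covers logistics, pricing and quarterly reporting.")

words = re.findall(r"[A-Za-z]+", text.lower())
stop = {"the", "a", "of", "and", "with", "covers"}
topics = [w for w, _ in Counter(w for w in words if w not in stop).most_common(3)]

mentions = re.findall(r"(?:[A-Z][a-z]+(?:\s[A-Z][a-z]+)+)", text)

print({"topics": topics, "mentions": mentions})
```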
7.13 Future

A 2009 paper predicted[37] these developments in the business intelligence market:

Because of lack of information, processes, and tools, through 2012, more than 35 percent of the top 5,000 global companies regularly fail to make insightful decisions about significant changes in their business and markets.

By 2012, business units will control at least 40 percent of the total budget for business intelligence.

A 2009 Information Management special report predicted the top BI trends: "green computing, social networking services, data visualization, mobile BI,
[6] D. J. Power (10 March 2007). "A Brief History of Decision Support Systems, version 4.0". DSSResources.COM. Retrieved 10 July 2008.
[7] Power, D. J. "A Brief History of Decision Support Systems". Retrieved 1 November 2010.
[8] Golden, Bernard (2013). Amazon Web Services For Dummies. For Dummies. John Wiley & Sons. p. 234. ISBN 9781118652268. Retrieved 2014-07-06. "[...] traditional business intelligence or data warehousing tools (the terms are used so interchangeably that they're often referred to as BI/DW) are extremely expensive [...]"
[9] Evelson, Boris (21 November 2008). "Topic Overview: Business Intelligence".
[10] Evelson, Boris (29 April 2010). "Want to know what Forrester's lead data analysts are thinking about BI and the data domain?".
[11] Kobielus, James (30 April 2010). "What's Not BI? Oh, Don't Get Me Started....Oops Too Late...Here Goes.....". "Business intelligence is a non-domain-specific catchall for all the types of analytic data that can be delivered to users in reports, dashboards, and the like. When you specify the subject domain for this intelligence, then you can refer to competitive intelligence, market intelligence, social intelligence, financial intelligence, HR intelligence, supply chain intelligence, and the like."
[12] "Business Analytics vs Business Intelligence?". timoelliott.com. 2011-03-09. Retrieved 2014-06-15.
[13] "Difference between Business Analytics and Business Intelligence". businessanalytics.com. 2013-03-15. Retrieved 2014-06-15.
[14] Henschen, Doug (4 January 2010). "Analytics at Work: Q&A with Tom Davenport" (Interview).
[15] Kimball et al., 2008: 29.
[16] "Are You Ready for the New Business Intelligence?". Dell.com. Retrieved 19 June 2012.
[17] Jeanne W. Ross, Peter Weill, David C. Robertson (2006). Enterprise Architecture As Strategy, p. 117. ISBN 1-59139-839-8.
[18] Krapohl, Donald. "A Structured Methodology for Group Decision Making". AugmentedIntel. Retrieved 22 April 2013.
[19] Kimball et al., 2008: 298.
[20] Kimball et al., 2008: 16.
[21] Kimball et al., 2008: 18.
[22] Kimball et al., 2008: 17.
[23] "How Companies Are Implementing Business Intelligence Competency Centers" (PDF). Computer World. Retrieved 1 April 2014.
[25] Swain Scheps, Business Intelligence for Dummies, 2008. ISBN 978-0-470-12723-0.
[26] Watson, Hugh J.; Wixom, Barbara H. (2007). "The Current State of Business Intelligence". Computer 40 (9): 96. doi:10.1109/MC.2007.331.
[27] Ralph Kimball (2008). The Data Warehouse Lifecycle Toolkit (2nd ed.).
[28] The Microsoft Data Warehouse Toolkit. Wiley Publishing (2006).
[29] Andrew Brust (2013-02-14). "Gartner releases 2013 BI Magic Quadrant". ZDNet. Retrieved 21 August 2013.
[30] Pendse, Nigel (7 March 2008). "Consolidations in the BI industry". The OLAP Report.
[31] "Why Business Intelligence Is Key For Competitive Advantage". Boston University. Retrieved 23 October 2014.
[32] Imhoff, Claudia (4 April 2006). "Three Trends in Business Intelligence Technology".
[33] Rao, R. (2003). "From unstructured data to actionable intelligence" (PDF). IT Professional 5 (6): 29. doi:10.1109/MITP.2003.1254966.
[34] Blumberg, R. & S. Atre (2003). "The Problem with Unstructured Data" (PDF). DM Review: 42–46.
[35] Negash, S. (2004). "Business Intelligence" (PDF). Communications of the Association of Information Systems 13: 177–195.
[36] Inmon, B. & A. Nesavich, "Unstructured Textual Data in the Organization", from Managing Unstructured Data in the Organization, Prentice Hall 2008, pp. 1–13.
[37] "Gartner Reveals Five Business Intelligence Predictions for 2009 and Beyond". gartner.com. 15 January 2009.
[38] Campbell, Don (23 June 2009). "10 Red Hot BI Trends". Information Management.
[39] Lock, Michael (27 March 2014). "Cloud Analytics in 2014: Infusing the Workforce with Insight".
[40] Rodriguez, Carlos; Daniel, Florian; Casati, Fabio; Cappiello, Cinzia (2010). "Toward Uncertain Business Intelligence: The Case of Key Indicators". IEEE Internet Computing 14 (4): 32. doi:10.1109/MIC.2010.59.
[41] Rodriguez, C.; Daniel, F.; Casati, F. & Cappiello, C. (2009). "Computing Uncertain Key Indicators from Uncertain Data" (PDF), pp. 106–120.
[42] Lock, Michael. "http://baroi.aberdeen.com/pdfs/5874-RA-BIDashboards-MDL-06-NSP.pdf" (PDF). Aberdeen Group. Retrieved 23 October 2014.
[43] "SaaS BI growth will soar in 2010 | Cloud Computing". InfoWorld (2010-02-01). Retrieved 17 January 2012.
7.16 Bibliography

Ralph Kimball et al. The Data Warehouse Lifecycle Toolkit (2nd ed.). Wiley. ISBN 0-470-47957-4.

Peter Rausch, Alaa Sheta, Aladdin Ayesh: Business Intelligence and Performance Management: Theory, Systems, and Industrial Applications, Springer Verlag U.K., 2013, ISBN 978-1-4471-4865-4.
Analytics

For the ice hockey term, see Analytics (ice hockey).

8.2 Examples
For example, in the banking industry, Basel III and future capital adequacy needs are likely to make even smaller banks adopt internal risk models. In such cases, cloud computing and open-source R (programming language) can help smaller banks adopt risk analytics and support branch-level monitoring by applying predictive analytics.
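The passage mentions open-source R; the sketch below shows the same idea, a simple predictive risk model, in Python with scikit-learn instead, trained on synthetic loan data. The data, features and model choice are assumptions made purely for illustration.

```python
# Illustrative predictive risk model on synthetic loan data (assumed setup).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
income = rng.normal(5.0, 1.5, n)        # annual income, in tens of thousands
debt_ratio = rng.uniform(0.0, 1.0, n)   # debt-to-income ratio
X = np.column_stack([income, debt_ratio])

# Synthetic rule: higher debt ratio and lower income raise default probability.
p_default = 1 / (1 + np.exp(-(3 * debt_ratio - (income - 5.0))))
y = rng.binomial(1, p_default)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("default probability for first held-out applicant:",
      round(model.predict_proba(X_test[:1])[0, 1], 3))
```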
Chapter 9
Data mining
Not to be confused with analytics, information extraction, or data analysis.

Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD),[1] an interdisciplinary subfield of computer science,[2][3][4] is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.[2] The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.[2] Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.[2]

The term is a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction of data itself.[5] It is also a buzzword[6] and is frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as any application of computer decision support systems, including artificial intelligence, machine learning, and business intelligence. The popular book Data mining: Practical machine learning tools and techniques with Java[7] (which covers mostly machine learning material) was originally to be named just Practical machine learning, and the term data mining was only added for marketing reasons.[8] Often the more general terms "(large scale) data analysis" or "analytics" (or, when referring to actual methods, artificial intelligence and machine learning) are more appropriate.

The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Neither the data collection, data preparation, nor result interpretation and reporting are part of the data mining step, but they do belong to the overall KDD process as additional steps.
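Two of the pattern types named above, cluster analysis and anomaly detection, can be illustrated on synthetic data. The use of scikit-learn below is an assumption made for the sake of a short example, not something the text prescribes.

```python
# Cluster analysis plus a crude distance-based anomaly check on synthetic records.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
records = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(100, 2)),   # one group of records
    rng.normal(loc=[3, 3], scale=0.3, size=(100, 2)),   # a second group
    [[10.0, -6.0]],                                      # one unusual record
])

# Cluster analysis: discover the groups without labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(records)

# Anomaly detection (crude): records far from every cluster centre are "unusual".
distances = np.min(np.linalg.norm(
    records[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2), axis=1)
print("most unusual record:", records[np.argmax(distances)])
```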
The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.

9.1 Etymology

In the 1960s, statisticians used terms like "Data Fishing" or "Data Dredging" to refer to what they considered the bad practice of analyzing data without an a-priori hypothesis. The term "Data Mining" appeared around 1990 in the database community. For a short time in the 1980s, the phrase "database mining" was used, but since it was trademarked by HNC, a San Diego-based company, to pitch their Database Mining Workstation,[9] researchers consequently turned to "data mining". Other terms used include Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, etc. Gregory Piatetsky-Shapiro coined the term "Knowledge Discovery in Databases" for the first workshop on the same topic (KDD-1989), and this term became more popular in the AI and machine learning community. However, the term data mining became more popular in the business and press communities.[10] Currently, Data Mining and Knowledge Discovery are used interchangeably. Since about 2007, the term "Predictive Analytics" and, since 2011, "Data Science" have also been used to describe this field.
9.2 Background

The manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity and increasing power of computer technology have dramatically increased data collection, storage, and manipulation ability. As data sets have grown in size and complexity, direct hands-on data analysis has increasingly been augmented with indirect, automated data processing, aided by other discoveries in computer science, such as neural networks, cluster analysis, genetic algorithms (1950s), decision trees and decision rules (1960s), and support vector machines (1990s). Data mining is the process of applying these methods with the intention of uncovering hidden patterns[11] in large data sets. It bridges the gap from applied statistics and artificial intelligence (which usually provide the mathematical background) to database management by exploiting the way data is stored and indexed in databases to execute the actual learning and discovery algorithms more efficiently, allowing such methods to be applied to ever larger data sets.

The premier professional body in the field is the Association for Computing Machinery's (ACM) Special Interest Group (SIG) on Knowledge Discovery and Data Mining (SIGKDD).[12][13] Since 1989 this ACM SIG has hosted an annual international conference and published its proceedings,[14] and since 1999 it has published a biannual academic journal titled SIGKDD Explorations.[15]

Data mining conferences include:

PAKDD Conference: the annual Pacific-Asia Conference on Knowledge Discovery and Data Mining
PAW Conference: Predictive Analytics World
SDM Conference: SIAM International Conference on Data Mining (SIAM)
SSTD Symposium: Symposium on Spatial and Temporal Databases
WSDM Conference: ACM Conference on Web Search and Data Mining

Data mining topics are also present at many data management/database conferences such as the ICDE Conference, the SIGMOD Conference and the International Conference on Very Large Data Bases.

The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages:

(1) Selection
(2) Pre-processing
(3) Transformation
Market basket analysis relates to the use of data mining in retail sales. If a clothing store records the purchases of customers, a data mining system could identify those customers who favor silk shirts over cotton ones. Although some explanations of relationships may be difficult, taking advantage of them is easier. The example deals with association rules within transaction-based data. Not all data are transaction-based, and logical or inexact rules may also be present within a database.

Market basket analysis has been used to identify the purchase patterns of the Alpha Consumer. Analyzing the data collected on this type of user has allowed companies to predict future buying trends and forecast supply demands.

Data mining is a highly effective tool in the catalog marketing industry. Catalogers have a rich database of the history of their customer transactions for millions of customers dating back a number of years. Data mining tools can identify patterns among customers and help identify the most likely customers to respond to upcoming mailing campaigns.

Data mining for business applications can be integrated into a complex modeling and decision-making process.[28] Reactive business intelligence (RBI) advocates a holistic approach that integrates data mining, modeling, and interactive visualization into an end-to-end discovery and continuous innovation process powered by human and automated learning.[29]

In the area of decision making, the RBI approach has been used to mine knowledge that is progressively acquired from the decision maker, and then self-tune the decision method accordingly.[30] The relation between the quality of a data mining system and the amount of investment that the decision maker is willing to make was formalized by providing an economic perspective on the value of extracted knowledge in terms of its payoff to the organization.[28] This decision-theoretic classification framework[28] was applied to a real-world semiconductor wafer manufacturing line, where decision rules for effectively monitoring and controlling the semiconductor wafer fabrication line were developed.[31]

An example of data mining related to an integrated-circuit (IC) production line is described in the paper "Mining IC Test Data to Optimize VLSI Testing".[32] In this paper, the application of data mining and decision analysis to the problem of die-level functional testing is described. Experiments mentioned demonstrate the ability to apply a system of mining historical die-test data to create a probabilistic model of patterns of die failure. These patterns are then utilized to decide, in real time, which die to test next and when to stop testing. This system has been shown, based on experiments with historical test data, to have the potential to improve profits on mature IC products. Other examples[33][34] of the application of data mining methodologies in semiconductor manufacturing environments suggest that data mining methodologies may be particularly useful when data is scarce and the various physical and chemical parameters that affect the process exhibit highly complex interactions. Another implication is that on-line monitoring of the semiconductor manufacturing process using data mining may be highly effective.

9.5.3 Science and engineering

In recent years, data mining has been used widely in the areas of science and engineering, such as bioinformatics, genetics, medicine, education and electrical power engineering.

In the study of human genetics, sequence mining helps address the important goal of understanding the mapping relationship between the inter-individual variations in human DNA sequence and the variability in disease susceptibility. In simple terms, it aims to find out how changes in an individual's DNA sequence affect the risks of developing common diseases such as cancer, which is of great importance to improving methods of diagnosing, preventing, and treating these diseases. One data mining method that is used to perform this task is known as multifactor dimensionality reduction.[35]

In the area of electrical power engineering, data mining methods have been widely used for condition monitoring of high-voltage electrical equipment. The purpose of condition monitoring is to obtain valuable information on, for example, the status of the insulation (or other important safety-related parameters). Data clustering techniques such as the self-organizing map (SOM) have been applied to vibration monitoring and analysis of transformer on-load tap changers (OLTCs). Using vibration monitoring, it can be observed that each tap change operation generates a signal that contains information about the condition of the tap changer contacts and the drive mechanisms. Obviously, different tap positions will generate different signals. However, there was considerable variability amongst normal-condition signals for exactly the same tap position. SOM has been applied to detect abnormal conditions and to hypothesize about the nature of the abnormalities.[36]
Data mining methods have been applied to dissolved gas analysis (DGA) in power transformers. DGA, as a diagnostic for power transformers, has been available for many years. Methods such as SOM have been applied to analyze the generated data and to determine trends which are not obvious to the standard DGA ratio methods (such as the Duval Triangle).[36]

In educational research, data mining has been used to study the factors leading students to choose to engage in behaviors which reduce their learning,[37] and to understand the factors influencing university student retention.[38] A similar example of a social application of data mining is its use in expertise-finding systems, whereby descriptors of human expertise are extracted, normalized, and classified so as to facilitate the finding of experts, particularly in scientific and technical fields. In this way, data mining can facilitate institutional memory.

Other examples include data mining of biomedical data facilitated by domain ontologies,[39] mining clinical trial data,[40] and traffic analysis using SOM.[41]

In adverse drug reaction surveillance, the Uppsala Monitoring Centre has, since 1998, used data mining methods to routinely screen for reporting patterns indicative of emerging drug safety issues in the WHO global database of 4.6 million suspected adverse drug reaction incidents.[42] Recently, similar methodology has been developed to mine large collections of electronic health records for temporal patterns associating drug prescriptions to medical diagnoses.[43]

Data mining has been applied to software artifacts within the realm of software engineering: Mining Software Repositories.

9.5.4 Human rights

Data mining of government records, particularly records of the justice system (i.e., courts, prisons), enables the discovery of systemic human rights violations in connection with the generation and publication of invalid or fraudulent legal records by various government agencies.[44][45]

9.5.5 Medical data mining

Some machine learning algorithms can be applied in the medical field as second-opinion diagnostic tools and as tools for the knowledge extraction phase in the process of knowledge discovery in databases. One of these classifiers, called the Prototype exemplar learning classifier (PEL-C),[46] is able to discover syndromes as well as atypical clinical cases.

In 2011, the case of Sorrell v. IMS Health, Inc., decided by the Supreme Court of the United States, ruled that pharmacies may share information with outside companies. This practice was authorized under the 1st Amendment of the Constitution, protecting the freedom of speech.[47] However, the passage of the Health Information Technology for Economic and Clinical Health Act (HITECH Act) helped to initiate the adoption of the electronic health record (EHR) and supporting technology in the United States.[48] The HITECH Act was signed into law on February 17, 2009 as part of the American Recovery and Reinvestment Act (ARRA) and helped to open the door to medical data mining.[49] Prior to the signing of this law, it was estimated that only 20% of United States-based physicians were utilizing electronic patient records.[48] Søren Brunak notes that "the patient record becomes as information-rich as possible" and thereby "maximizes the data mining opportunities".[48] Hence, electronic patient records further expand the possibilities regarding medical data mining, thereby opening the door to a vast source of medical data analysis.

9.5.6 Spatial data mining

Spatial data mining is the application of data mining methods to spatial data. The end objective of spatial data mining is to find patterns in data with respect to geography. So far, data mining and Geographic Information Systems (GIS) have existed as two separate technologies, each with its own methods, traditions, and approaches to visualization and data analysis. In particular, most contemporary GIS have only very basic spatial analysis functionality. The immense explosion in geographically referenced data occasioned by developments in IT, digital mapping, remote sensing, and the global diffusion of GIS emphasizes the importance of developing data-driven inductive approaches to geographical analysis and modeling.

Data mining offers great potential benefits for GIS-based applied decision-making. Recently, the task of integrating these two technologies has become of critical importance, especially as various public- and private-sector organizations possessing huge databases with thematic and geographically referenced data begin to realize the huge potential of the information contained therein. Among those organizations are:

public health services searching for explanations of disease clustering
environmental agencies assessing the impact of changing land-use patterns on climate change
geo-marketing companies doing customer segmentation based on spatial location.
Challenges in spatial mining: Geospatial data repositories tend to be very large. Moreover, existing GIS datasets are often splintered into feature and attribute components that are conventionally archived in hybrid data management systems. Algorithmic requirements differ substantially for relational (attribute) data management and for topological (feature) data management.[50] Related to this is the range and diversity of geographic data formats, which present unique challenges. The digital geographic data revolution is creating new types of data formats beyond the traditional "vector" and "raster" formats. Geographic data repositories increasingly include ill-structured data, such as imagery and geo-referenced multimedia.[51]

There are several critical research challenges in geographic knowledge discovery and data mining. Miller and Han[52] offer the following list of emerging research topics in the field:

Developing and supporting geographic data warehouses (GDWs): Spatial properties are often reduced to simple aspatial attributes in mainstream data warehouses. Creating an integrated GDW requires solving issues of spatial and temporal data interoperability, including differences in semantics, referencing systems, geometry, accuracy, and position.

Better spatio-temporal representations in geographic knowledge discovery: Current geographic knowledge discovery (GKD) methods generally use very simple representations of geographic objects and spatial relationships. Geographic data mining methods should recognize more complex geographic objects (i.e., lines and polygons) and relationships (i.e., non-Euclidean distances, direction, connectivity, and interaction through attributed geographic space such as terrain). Furthermore, the time dimension needs to be more fully integrated into these geographic representations and relationships.

Geographic knowledge discovery using diverse data types: GKD methods should be developed that can handle diverse data types beyond the traditional raster and vector models, including imagery and geo-referenced multimedia, as well as dynamic data types (video streams, animation).

9.5.7 Temporal data mining

9.5.8 Sensor data mining

Wireless sensor networks can be used for facilitating the collection of data for spatial data mining for a variety of applications, such as air pollution monitoring.[53] A characteristic of such networks is that nearby sensor nodes monitoring an environmental feature typically register similar values. This kind of data redundancy due to the spatial correlation between sensor observations inspires techniques for in-network data aggregation and mining. By measuring the spatial correlation between data sampled by different sensors, a wide class of specialized algorithms can be developed to devise more efficient spatial data mining algorithms.[54]
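The spatial-correlation idea can be sketched in a few lines of Python; the synthetic readings and the 0.9 threshold below are illustrative assumptions only.

```python
# Nearby sensors observe nearly the same signal, so their readings correlate
# strongly and one summary stream can stand in for both. Synthetic data.
import numpy as np

rng = np.random.default_rng(1)
pollution = rng.normal(40, 5, size=200)             # true local air pollution signal
sensor_a = pollution + rng.normal(0, 1, size=200)   # two nearby sensor nodes
sensor_b = pollution + rng.normal(0, 1, size=200)

corr = np.corrcoef(sensor_a, sensor_b)[0, 1]
print(f"spatial correlation between neighbouring sensors: {corr:.2f}")

# In-network aggregation: when correlation is high, transmit one summary stream
# instead of both raw streams, reducing redundant communication.
if corr > 0.9:
    summary = (sensor_a + sensor_b) / 2
    print("aggregated stream length:", summary.size)
```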
9.5.9 Visual data mining

In the process of turning from analog into digital, large data sets have been generated, collected, and stored, from which statistical patterns, trends and information hidden in the data can be discovered in order to build predictive models. Studies suggest visual data mining is faster and much more intuitive than traditional data mining.[55][56][57] See also Computer vision.

9.5.10 Music data mining

Data mining techniques, and in particular co-occurrence analysis, have been used to discover relevant similarities among music corpora (radio lists, CD databases) for purposes including classifying music into genres in a more objective manner.[58]

9.5.11 Surveillance

Data mining has been used by the U.S. government. Programs include the Total Information Awareness (TIA) program, Secure Flight (formerly known as the Computer-Assisted Passenger Prescreening System (CAPPS II)), Analysis, Dissemination, Visualization, Insight, Semantic Enhancement (ADVISE),[59] and the Multi-state Anti-Terrorism Information Exchange (MATRIX).[60] These programs have been discontinued due to controversy over whether they violate the 4th Amendment to the United States Constitution, although many programs that were formed under them continue to be funded by different organizations or under different names.[61]

In the context of combating terrorism, two particularly plausible methods of data mining are pattern mining and subject-based data mining.
9.5.12 Pattern mining

In this context, "patterns" often means association rules. The original motivation for searching association rules came from the desire to analyze supermarket transaction data, that is, to examine customer behavior in terms of the purchased products. For example, an association rule "beer → potato chips (80%)" states that four out of five customers that bought beer also bought potato chips.
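The stated 80% is the confidence of the rule, which can be checked directly on a toy set of transactions invented for this example.

```python
# Worked check of the rule "beer -> potato chips (80%)": in these made-up
# baskets, 4 of the 5 baskets containing beer also contain potato chips.
transactions = [
    {"beer", "potato chips", "bread"},
    {"beer", "potato chips"},
    {"beer", "diapers", "potato chips"},
    {"beer", "potato chips", "milk"},
    {"beer", "bread"},
    {"milk", "bread"},
]

with_beer = [t for t in transactions if "beer" in t]
with_both = [t for t in with_beer if "potato chips" in t]

support = len(with_both) / len(transactions)    # rule applies in 4 of 6 baskets
confidence = len(with_both) / len(with_beer)    # 4/5 = 0.8, i.e. the stated 80%
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```

Support measures how often the rule applies across all baskets, while confidence measures how often the consequent appears given the antecedent.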
In the context of pattern mining as a tool to identify terrorist activity, the National Research Council provides the following definition: "Pattern-based data mining looks for patterns (including anomalous data patterns) that might be associated with terrorist activity; these patterns might be regarded as small signals in a large ocean of noise."[62][63][64] Pattern mining includes new areas such as Music Information Retrieval (MIR), where patterns seen both in the temporal and non-temporal domains are imported to classical knowledge discovery search methods.

9.5.13 Subject-based data mining

Subject-based data mining is a data mining method involving the search for associations between individuals in data. In the context of combating terrorism, the National Research Council provides the following definition: "Subject-based data mining uses an initiating individual or other datum that is considered, based on other information, to be of high interest, and the goal is to determine what other persons or financial transactions or movements, etc., are related to that initiating datum."[63]

9.5.14 Knowledge grid

Knowledge discovery "on the grid" generally refers to conducting knowledge discovery in an open environment using grid computing concepts, allowing users to integrate data from various online data sources, as well as make use of remote resources, for executing their data mining tasks. The earliest example was the Discovery Net,[65][66] developed at Imperial College London, which won the "Most Innovative Data-Intensive Application Award" at the ACM SC02 (Supercomputing 2002) conference and exhibition, based on a demonstration of a fully interactive distributed knowledge discovery application for a bioinformatics application. Other examples include work conducted by researchers at the University of Calabria, who developed a Knowledge Grid architecture for distributed knowledge discovery, based on grid computing.[67][68]

9.6 Privacy concerns and ethics

While the term "data mining" itself has no ethical implications, it is often associated with the mining of information in relation to peoples' behavior (ethical and otherwise).[69]

The ways in which data mining can be used can in some cases and contexts raise questions regarding privacy, legality, and ethics.[70] In particular, data mining government or commercial data sets for national security or law enforcement purposes, such as in the Total Information Awareness Program or in ADVISE, has raised privacy concerns.[71][72]

Data mining requires data preparation which can uncover information or patterns which may compromise confidentiality and privacy obligations. A common way for this to occur is through data aggregation. Data aggregation involves combining data together (possibly from various sources) in a way that facilitates analysis (but that also might make identification of private, individual-level data deducible or otherwise apparent).[73] This is not data mining per se, but a result of the preparation of data before, and for the purposes of, the analysis. The threat to an individual's privacy comes into play when the data, once compiled, cause the data miner, or anyone who has access to the newly compiled data set, to be able to identify specific individuals, especially when the data were originally anonymous.[74][75][76]

It is recommended that an individual is made aware of the following before data are collected:[73]

the purpose of the data collection and any (known) data mining projects;
how the data will be used;
who will be able to mine the data and use the data and their derivatives;
the status of security surrounding access to the data;
how collected data can be updated.

Data may also be modified so as to become anonymous, so that individuals may not readily be identified.[73] However, even "de-identified"/"anonymized" data sets can potentially contain enough information to allow identification of individuals, as occurred when journalists were able to find several individuals based on a set of search histories that were inadvertently released by AOL.[77]

9.6.1 Situation in Europe

Europe has rather strong privacy laws, and efforts are underway to further strengthen the rights of consumers. However, the U.S.–E.U. Safe Harbor Principles currently effectively expose European users to privacy exploitation by U.S. companies. As a consequence of Edward Snowden's global surveillance disclosure, there has been increased discussion to revoke this agreement, as in particular the data will be fully exposed to the National Security Agency, and attempts to reach an agreement have failed.
9.6.2 Situation in the United States

In the United States, privacy concerns have been addressed by the US Congress via the passage of regulatory controls such as the Health Insurance Portability and Accountability Act (HIPAA). The HIPAA requires individuals to give their "informed consent" regarding information they provide and its intended present and future uses. According to an article in Biotech Business Week, "[i]n practice, HIPAA may not offer any greater protection than the longstanding regulations in the research arena, says the AAHC. More importantly, the rule's goal of protection through informed consent is undermined by the complexity of consent forms that are required of patients and participants, which approach a level of incomprehensibility to average individuals."[78] This underscores the necessity for data anonymity in data aggregation and mining practices.

U.S. information privacy legislation such as HIPAA and the Family Educational Rights and Privacy Act (FERPA) applies only to the specific areas that each such law addresses. The use of data mining by the majority of businesses in the U.S. is not controlled by any legislation.

Content mining in America, as well as in other fair-use countries such as Israel, Taiwan and South Korea, is viewed as being legal. As content mining is transformative, that is, it does not supplant the original work, it is viewed as being lawful under fair use. For example, as part of the Google Book settlement the presiding judge on the case ruled that Google's digitisation project of in-copyright books was lawful, in part because of the transformative uses that the digitisation project displayed, one being text and data mining.[82]

9.8 Software

See also: Category:Data mining and machine learning software.

9.8.1 Free open-source data mining software and applications

Carrot2: Text and search results clustering framework.
Orange: A component-based data mining and machine learning software suite written in the Python language.
R: A programming language and software environment for statistical computing, data mining, and graphics. It is part of the GNU Project.
SCaViS: Java cross-platform data analysis framework developed at Argonne National Laboratory.
SenticNet API: A semantic and affective resource for opinion mining and sentiment analysis.
Tanagra: Visualisation-oriented data mining software, also for teaching.
Torch: An open-source deep learning library for the Lua programming language and scientific computing framework with wide support for machine learning algorithms.
UIMA: The UIMA (Unstructured Information Management Architecture) is a component framework for analyzing unstructured content such as text, audio and video, originally developed by IBM.
Weka: A suite of machine learning software applications written in the Java programming language.

9.8.2 Commercial data-mining software and applications

RapidMiner: An environment for machine learning and data mining experiments.
SAS Enterprise Miner: data mining software provided by the SAS Institute.
STATISTICA Data Miner: data mining software provided by StatSoft.
Qlucore Omics Explorer: data mining software provided by Qlucore.

9.8.3 Marketplace surveys

Several researchers and organizations have conducted reviews of data mining tools and surveys of data miners. These identify some of the strengths and weaknesses of the software packages. They also provide an overview of the behaviors, preferences and views of data miners. Some of these reports include:

2011 Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery[83]
Rexer Analytics Data Miner Surveys (2007–2013)[84]
Forrester Research 2010 Predictive Analytics and Data Mining Solutions report[85]
Gartner 2008 "Magic Quadrant" report[86]
Predictive analytics
Web mining

Application examples

See also: Category:Applied data mining.

Customer analytics
Data mining in agriculture
Data mining in meteorology
Educational data mining
National Security Agency
Police-enforced ANPR in the UK
Quantitative structure–activity relationship
Surveillance / Mass surveillance (e.g., Stellar Wind)

Related topics

Data mining is about analyzing data; for information about extracting information out of data, see:

9.10 References

[5] Han, Jiawei; Kamber, Micheline (2001). Data Mining: Concepts and Techniques. Morgan Kaufmann. p. 5. ISBN 9781558604896. "Thus, data mining should have been more appropriately named 'knowledge mining from data,' which is unfortunately somewhat long."
[6] See e.g. OKAIRP 2005 Fall Conference, Arizona State University; About.com: Datamining.
[7] Witten, Ian H.; Frank, Eibe; Hall, Mark A. (30 January 2011). Data Mining: Practical Machine Learning Tools and Techniques (3rd ed.). Elsevier. ISBN 978-0-12-374856-0.
[8] Bouckaert, Remco R.; Frank, Eibe; Hall, Mark A.; Holmes, Geoffrey; Pfahringer, Bernhard; Reutemann, Peter; Witten, Ian H. (2010). "WEKA Experiences with a Java open-source project". Journal of Machine Learning Research 11: 2533–2541. "the original title, 'Practical machine learning', was changed ... The term 'data mining' was [added] primarily for marketing reasons."
[9] Mena, Jesús (2011). Machine Learning Forensics for Law Enforcement, Security, and Intelligence. Boca Raton, FL: CRC Press (Taylor & Francis Group). ISBN 978-1-4398-6069-4.
[10] Piatetsky-Shapiro, Gregory; Parker, Gary (2011). "Lesson: Data Mining, and Knowledge Discovery: An Introduction". Introduction to Data Mining. KD Nuggets. Retrieved 30 August 2012.
[11] Kantardzic, Mehmed (2003). Data Mining: Concepts, Models, Methods, and Algorithms. John Wiley & Sons. ISBN 0-471-22852-4. OCLC 50055336.
[12] "Microsoft Academic Search: Top conferences in data mining". Microsoft Academic Search.
[13] "Google Scholar: Top publications - Data Mining & Analysis". Google Scholar.
[14] Proceedings, International Conferences on Knowledge Discovery and Data Mining, ACM, New York.
[15] SIGKDD Explorations, ACM, New York.
[16] Gregory Piatetsky-Shapiro (2002) KDnuggets Methodology Poll.
[17] Gregory Piatetsky-Shapiro (2004) KDnuggets Methodology Poll.
[18] Gregory Piatetsky-Shapiro (2007) KDnuggets Methodology Poll.
[19] Óscar Marbán, Gonzalo Mariscal and Javier Segovia (2009); A Data Mining & Knowledge Discovery Process Model. In Data Mining and Knowledge Discovery in Real Life Applications, book edited by Julio Ponce and Adem Karahoca, ISBN 978-3-902613-53-0, pp. 438–453, February 2009, I-Tech, Vienna, Austria.
[20] Lukasz Kurgan and Petr Musilek (2006); A survey of Knowledge Discovery and Data Mining process models. The Knowledge Engineering Review 21 (1): 1–24, March 2006, Cambridge University Press, New York, NY, USA. doi:10.1017/S0269888906000737
[21] Azevedo, A. and Santos, M. F. KDD, SEMMA and CRISP-DM: a parallel overview. In Proceedings of the IADIS European Conference on Data Mining 2008, pp. 182–185.
[22] Günnemann, Stephan; Kremer, Hardy; Seidl, Thomas (2011). "An extension of the PMML standard to subspace clustering models". Proceedings of the 2011 workshop on Predictive markup language modeling - PMML '11. p. 48. doi:10.1145/2023598.2023605. ISBN 9781450308373.
[23] O'Brien, J. A., & Marakas, G. M. (2011). Management Information Systems. New York, NY: McGraw-Hill/Irwin.
[24] Alexander, D. (n.d.). Data Mining. Retrieved from The University of Texas at Austin: College of Liberal Arts: http://www.laits.utexas.edu/~anorman/BUS.FOR/course.mat/Alex/
[25] Daniele Medri: Big Data & Business: An on-going revolution. Statistics Views. 21 Oct 2013.
[26] Goss, S. (2013, April 10). Data-mining and our personal privacy. Retrieved from The Telegraph: http://www.macon.com/2013/04/10/2429775/data-mining-and-our-personal-privacy.html
[27] Monk, Ellen; Wagner, Bret (2006). Concepts in Enterprise Resource Planning, Second Edition. Boston, MA: Thomson Course Technology. ISBN 0-619-21663-8. OCLC 224465825.
[28] Elovici, Yuval; Braha, Dan (2003). "A Decision-Theoretic Approach to Data Mining" (PDF). IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans 33 (1).
[29] Battiti, Roberto; and Brunato, Mauro; Reactive Business Intelligence. From Data to Models to Insight, Reactive Search Srl, Italy, February 2011. ISBN 978-88-905795-0-9.
[30] Battiti, Roberto; Passerini, Andrea (2010). "Brain-Computer Evolutionary Multi-Objective Optimization (BC-EMO): a genetic algorithm adapting to the decision maker" (PDF). IEEE Transactions on Evolutionary Computation 14 (15): 671–687. doi:10.1109/TEVC.2010.2058118.
[31] Braha, Dan; Elovici, Yuval; Last, Mark (2007). "Theory of actionable data mining with application to semiconductor manufacturing control" (PDF). International Journal of Production Research 45 (13).
[32] Fountain, Tony; Dietterich, Thomas; and Sudyka, Bill (2000); Mining IC Test Data to Optimize VLSI Testing, in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ACM Press, pp. 18–25.
[33] Braha, Dan; Shmilovici, Armin (2002). "Data Mining for Improving a Cleaning Process in the Semiconductor Industry" (PDF). IEEE Transactions on Semiconductor Manufacturing 15 (1).
[34] Braha, Dan; Shmilovici, Armin (2003). "On the Use of Decision Tree Induction for Discovery of Interactions in a Photolithographic Process" (PDF). IEEE Transactions on Semiconductor Manufacturing 16 (4).
[35] Zhu, Xingquan; Davidson, Ian (2007). Knowledge Discovery and Data Mining: Challenges and Realities. New York, NY: Hershey. p. 18. ISBN 978-1-59904-252-7.
[36] McGrail, Anthony J.; Gulski, Edward; Allan, David; Birtwhistle, David; Blackburn, Trevor R.; Groot, Edwin R. S. "Data Mining Techniques to Assess the Condition of High Voltage Electrical Plant". CIGRÉ WG 15.11 of Study Committee 15.
[37] Baker, Ryan S. J. d. "Is Gaming the System State-or-Trait? Educational Data Mining Through the Multi-Contextual Application of a Validated Behavioral Model". Workshop on Data Mining for User Modeling 2007.
[38] Superby Aguirre, Juan Francisco; Vandamme, Jean-Philippe; Meskens, Nadine. "Determination of factors influencing the achievement of the first-year university students using data mining methods". Workshop on Educational Data Mining 2006.
[39] Zhu, Xingquan; Davidson, Ian (2007). Knowledge Discovery and Data Mining: Challenges and Realities. New York, NY: Hershey. pp. 163–189. ISBN 978-1-59904-252-7.
[40] Zhu, Xingquan; Davidson, Ian (2007). Knowledge Discovery and Data Mining: Challenges and Realities. New York, NY: Hershey. pp. 31–48. ISBN 978-1-59904-252-7.
[41] Chen, Yudong; Zhang, Yi; Hu, Jianming; Li, Xiang (2006). "Traffic Data Analysis Using Kernel PCA and Self-Organizing Map". IEEE Intelligent Vehicles Symposium.
[42] Bate, Andrew; Lindquist, Marie; Edwards, I. Ralph; Olsson, Sten; Orre, Roland; Lansner, Anders; de Freitas, Rogelio Melhado (Jun 1998). "A Bayesian neural network method for adverse drug reaction signal generation" (PDF). European Journal of Clinical Pharmacology 54 (4): 315–21. doi:10.1007/s002280050466. PMID 9696956.
[43] Norén, G. Niklas; Bate, Andrew; Hopstadius, Johan; Star, Kristina; and Edwards, I. Ralph (2008); Temporal Pattern Discovery for Trends and Transient Effects: Its Application to Patient Records. Proceedings of the Fourteenth International Conference on Knowledge Discovery and Data Mining (SIGKDD 2008), Las Vegas, NV, pp. 963–971.
[44] Zernik, Joseph; Data Mining as a Civic Duty: Online Public Prisoners' Registration Systems, International Journal on Social Media: Monitoring, Measurement, Mining 1: 84–96 (2010).
[45] Zernik, Joseph; Data Mining of Online Judicial Records of the Networked US Federal Courts, International Journal on Social Media: Monitoring, Measurement, Mining 1: 69–83 (2010).
[46] Gagliardi, F. (2011). "Instance-based classifiers applied to medical databases: Diagnosis and knowledge extraction". Artificial Intelligence in Medicine 52 (3): 123–139. doi:10.1016/j.artmed.2011.04.002.
[47] David G. Savage (2011-06-24). "Pharmaceutical industry: Supreme Court sides with pharmaceutical industry in two decisions". Los Angeles Times. Retrieved 2012-11-07.
[48] "Analyzing Medical Data" (2012). Communications of the ACM 55 (6): 13–15. doi:10.1145/2184319.2184324
[49] http://searchhealthit.techtarget.com/definition/HITECH-Act
[50] Healey, Richard G. (1991); Database Management Systems, in Maguire, David J.; Goodchild, Michael F.; and Rhind, David W. (eds.), Geographic Information Systems: Principles and Applications, London, GB: Longman.
[51] Camara, Antonio S.; and Raper, Jonathan (eds.) (1999); Spatial Multimedia and Virtual Reality, London, GB: Taylor and Francis.
[52] Miller, Harvey J.; and Han, Jiawei (eds.) (2001); Geographic Data Mining and Knowledge Discovery, London, GB: Taylor & Francis.
[53] Ma, Y.; Richards, M.; Ghanem, M.; Guo, Y.; Hassard, J. (2008). "Air Pollution Monitoring and Mining Based on Sensor Grid in London". Sensors 8 (6): 3601. doi:10.3390/s8063601.
[54] Ma, Y.; Guo, Y.; Tian, X.; Ghanem, M. (2011). "Distributed Clustering-Based Aggregation Algorithm for Spatial Correlated Sensor Networks". IEEE Sensors Journal 11 (3): 641. doi:10.1109/JSEN.2010.2056916.
[55] Zhao, Kaidi; Liu, Bing; Tirpak, Thomas M.; and Xiao, Weimin; A Visual Data Mining Framework for Convenient Identification of Useful Knowledge.
[56] Keim, Daniel A.; Information Visualization and Visual Data Mining.
[57] Burch, Michael; Diehl, Stephan; Weißgerber, Peter; Visual Data Mining in Software Archives.
[58] Pachet, François; Westermann, Gert; and Laigre, Damien; Musical Data Mining for Electronic Music Distribution, Proceedings of the 1st WedelMusic Conference, Firenze, Italy, 2001, pp. 101–106.
[59] Government Accountability Office, Data Mining: Early Attention to Privacy in Developing a Key DHS Program Could Reduce Risks, GAO-07-293 (February 2007), Washington, DC.
[60] "Secure Flight Program report". MSNBC.
[61] "Total/Terrorism Information Awareness (TIA): Is It Truly Dead?". Electronic Frontier Foundation (official website). 2003. Retrieved 2009-03-15.
[62] Agrawal, Rakesh; Mannila, Heikki; Srikant, Ramakrishnan; Toivonen, Hannu; and Verkamo, A. Inkeri; Fast discovery of association rules, in Advances in Knowledge Discovery and Data Mining, MIT Press, 1996, pp. 307–328.
[63] National Research Council, Protecting Individual Privacy in the Struggle Against Terrorists: A Framework for Program Assessment, Washington, DC: National Academies Press, 2008.
[64] Haag, Stephen; Cummings, Maeve; Phillips, Amy (2006). Management Information Systems for the Information Age. Toronto: McGraw-Hill Ryerson. p. 28. ISBN 0-07-095569-7. OCLC 63194770.
[65] Ghanem, Moustafa; Guo, Yike; Rowe, Anthony; Wendel, Patrick (2002). "Grid-based knowledge discovery services for high throughput informatics". Proceedings 11th IEEE International Symposium on High Performance Distributed Computing. p. 416. doi:10.1109/HPDC.2002.1029946. ISBN 0-7695-1686-6.
[66] Ghanem, Moustafa; Curcin, Vasa; Wendel, Patrick; Guo, Yike (2009). "Building and Using Analytical Workflows in Discovery Net". Data Mining Techniques in Grid Computing Environments. p. 119. doi:10.1002/9780470699904.ch8. ISBN 9780470699904.
[67] Cannataro, Mario; Talia, Domenico (January 2003). "The Knowledge Grid: An Architecture for Distributed Knowledge Discovery" (PDF). Communications of the ACM 46 (1): 89–93. doi:10.1145/602421.602425. Retrieved 17 October 2011.
[68] Talia, Domenico; Trunfio, Paolo (July 2010). "How distributed data mining tasks can thrive as knowledge services" (PDF). Communications of the ACM 53 (7): 132–137. doi:10.1145/1785414.1785451. Retrieved 17 October 2011.
9.11. FURTHER READING 89
[69] Seltzer, William. The Promise and Pitfalls of Data Min- [87] Nisbet, Robert A. (2006); Data Mining Tools: Which One
ing: Ethical Issues (PDF). is Best for CRM? Part 1, Information Management Special
Reports, January 2006
[70] Pitts, Chip (15 March 2007). The End of Illegal Domes-
tic Spying? Don't Count on It. Washington Spectator. [88] Haughton, Dominique; Deichmann, Joel; Eshghi, Abdol-
reza; Sayek, Selin; Teebagy, Nicholas; and Topi, Heikki
[71] Taipale, Kim A. (15 December 2003). Data Mining and (2003); A Review of Software Packages for Data Mining,
Domestic Security: Connecting the Dots to Make Sense The American Statistician, Vol. 57, No. 4, pp. 290309
of Data. Columbia Science and Technology Law Review
5 (2). OCLC 45263753. SSRN 546782. [89] Goebel, Michael; Gruenwald, Le (1999); A Survey of
Data Mining and Knowledge Discovery Software Tools,
[72] Resig, John; and Teredesai, Ankur (2004). A Frame-
SIGKDD Explorations, Vol. 1, Issue 1, pp. 2033
work for Mining Instant Messaging Services. Proceed-
ings of the 2004 SIAM DM Conference.
[81] Text and Data Mining:Its importance and the need for Hastie, Trevor, Tibshirani, Robert and Friedman,
change in Europe. Association of European Research Li- Jerome (2001); The Elements of Statistical Learning:
braries. Retrieved 14 November 2014. Data Mining, Inference, and Prediction, Springer,
ISBN 0-387-95284-5
[82] Judge grants summary judgment in favor of Google
Books a fair use victory. Lexology.com. Antonelli Liu, Bing (2007); Web Data Mining: Exploring Hy-
Law Ltd. Retrieved 14 November 2014. perlinks, Contents and Usage Data, Springer, ISBN
[83] Mikut, Ralf; Reischl, Markus (SeptemberOctober 3-540-37881-2
2011). Data Mining Tools. Wiley Interdisciplinary Re-
views: Data Mining and Knowledge Discovery 1 (5): 431 Murphy, Chris (16 May 2011). Is Data Mining
445. doi:10.1002/widm.24. Retrieved October 21, 2011. Free Speech?". InformationWeek (UMB): 12.
[84] Karl Rexer, Heather Allen, & Paul Gearan (2011); Nisbet, Robert; Elder, John; Miner, Gary (2009);
Understanding Data Miners, Analytics Magazine, Handbook of Statistical Analysis & Data Mining Ap-
May/June 2011 (INFORMS: Institute for Operations plications, Academic Press/Elsevier, ISBN 978-0-
Research and the Management Sciences). 12-374765-5
[85] Kobielus, James; The Forrester Wave: Predictive Analytics Poncelet, Pascal; Masseglia, Florent; and Teisseire,
and Data Mining Solutions, Q1 2010, Forrester Research,
Maguelonne (editors) (October 2007); Data Min-
1 July 2008
ing Patterns: New Methods and Applications, In-
[86] Herschel, Gareth; Magic Quadrant for Customer Data- formation Science Reference, ISBN 978-1-59904-
Mining Applications, Gartner Inc., 1 July 2008 162-9
90 CHAPTER 9. DATA MINING
Big data
This article is about large collections of data. For the graph database, see Graph database. For the band, see Big Data (band).

Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy. The term often refers simply to the use of predictive analytics or other certain advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making, and better decisions can mean greater operational efficiency, cost reductions and reduced risk.

[Figure: Visualization of daily Wikipedia edits created by IBM. At multiple terabytes in size, the text and images of Wikipedia are an example of big data.]

Analysis of data sets can find new correlations to spot business trends, prevent diseases, combat crime and so on.[1] Scientists, business executives, practitioners of media and advertising and governments alike regularly meet difficulties with large data sets in areas including Internet search, finance and business informatics. Scientists encounter limitations in e-Science work, including meteorology, genomics,[2] connectomics, complex physics simulations,[3] and biological and environmental research.[4]

Data sets grow in size in part because they are increasingly being gathered by cheap and numerous information-sensing mobile devices, aerial (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks.[5][6][7] The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s;[8] as of 2012, every day 2.5 exabytes (2.5×10^18 bytes) of data were created.[9] The challenge for large enterprises is determining who should own big data initiatives that straddle the entire organization.[10]

[Figure: Growth of and Digitization of Global Information Storage Capacity.]

Work with big data is necessarily uncommon; most analysis is of "PC size" data, on a desktop PC or notebook[11] that can handle the available data set.

Relational database management systems and desktop statistics and visualization packages often have difficulty handling big data. The work instead requires massively parallel software running on tens, hundreds, or even thousands of servers.[12] What is considered "big data" varies depending on the capabilities of the users and their tools, and expanding capabilities make big data a moving target. Thus, what is considered "big" in one year will become ordinary in later years. For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration.[13]
10.1 Definition

Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time.[14] Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data. Big data is a set of techniques and technologies that require new forms of integration to uncover large hidden values from large datasets that are diverse, complex, and of a massive scale.[15]

In a 2001 research report[16] and related lectures, META Group (now Gartner) analyst Doug Laney defined data growth challenges and opportunities as being three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). Gartner, and now much of the industry, continue to use this "3Vs" model for describing big data.[17] In 2012, Gartner updated its definition as follows: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."[18] Additionally, a new V, "Veracity", is added by some organizations to describe it.[19]

While Gartner's definition (the 3Vs) is still widely used, the growing maturity of the concept fosters a sounder distinction between big data and Business Intelligence, regarding data and their use:[20]

Business Intelligence uses descriptive statistics with data of high information density to measure things, detect trends, etc.

Big data uses inductive statistics and concepts from nonlinear system identification[21] to infer laws (regressions, nonlinear relationships, and causal effects) from large sets of data with low information density,[22] in order to reveal relationships and dependencies and to perform predictions of outcomes and behaviors.[21][23]

A more recent, consensual definition states that "Big Data represents the Information assets characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value".[24]

Volume – …Big Data or not. The name "Big Data" itself contains a term which is related to size, and hence the characteristic.

Variety – The next aspect of Big Data is its variety: the category to which a data set belongs is an essential fact that the data analysts need to know. This helps the people who closely analyze the data, and who are associated with it, to use the data effectively to their advantage, thus upholding the importance of Big Data.

Velocity – The term velocity in this context refers to the speed of generation of data, or how fast the data is generated and processed to meet the demands and challenges which lie ahead in the path of growth and development.

Variability – This is a factor which can be a problem for those who analyse the data. It refers to the inconsistency which the data can show at times, hampering the process of handling and managing the data effectively.

Veracity – The quality of the data being captured can vary greatly. Accuracy of analysis depends on the veracity of the source data.

Complexity – Data management can become a very complex process, especially when large volumes of data come from multiple sources. These data need to be linked, connected and correlated in order to grasp the information they are supposed to convey. This situation is therefore termed the "complexity" of Big Data.

Factory work and Cyber-physical systems may have a 6C system:

1. Connection (sensor and networks),
2. Cloud (computing and data on demand),
3. Cyber (model and memory),
4. Content/context (meaning and correlation),
5. Community (sharing and collaboration), and
6. Customization (personalization and value).
In 2000, Seisint Inc. developed a C++-based distributed file-sharing framework for data storage and querying. Structured, semi-structured and/or unstructured data is stored and distributed across multiple servers. Querying of data is done by a modified C++ dialect called ECL, which uses an apply-scheme-on-read method to create the structure of stored data at query time. In 2004 LexisNexis acquired Seisint Inc.[27] and in 2008 acquired ChoicePoint, Inc.[28] and their high-speed parallel processing platform. The two platforms were merged into HPCC Systems, which was open-sourced in 2011 under the Apache v2.0 License. Currently HPCC and the Quantcast File System[29] are the only publicly available platforms capable of analyzing multiple exabytes of data.

In 2004, Google published a paper on a process called MapReduce that used such an architecture. The MapReduce framework provides a parallel processing model and associated implementation to process huge amounts of data. With MapReduce, queries are split and distributed across parallel nodes and processed in parallel (the Map step). The results are then gathered and delivered (the Reduce step). The framework was very successful,[30] so others wanted to replicate the algorithm. Therefore, an implementation of the MapReduce framework was adopted by an Apache open-source project named Hadoop.[31]
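The split-and-gather flow just described can be sketched in a few lines of plain Python (a toy word count; no Hadoop involved, and the function names are illustrative rather than taken from the text):

    from collections import defaultdict
    from itertools import chain

    def map_phase(document):
        # Map step: emit a (key, value) pair for every word in one input split.
        return [(word, 1) for word in document.split()]

    def shuffle(mapped_pairs):
        # Group all intermediate values by key before reduction.
        groups = defaultdict(list)
        for key, value in chain.from_iterable(mapped_pairs):
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        # Reduce step: combine all values emitted for one key.
        return key, sum(values)

    documents = ["big data needs parallel processing",
                 "mapreduce splits queries and processes them in parallel"]
    mapped = [map_phase(doc) for doc in documents]          # these calls would run on parallel nodes
    reduced = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
    print(reduced["parallel"])   # -> 2

In a real MapReduce deployment the map calls run on many machines over file splits, the shuffle is performed by the framework over the network, and the reducers run in parallel per key; the sketch only mirrors the data flow.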
generally hostile to slower shared storage,[42] preferring
MIKE2.0 is an open approach to information manage- direct-attached storage (DAS) in its various forms from
ment that acknowledges the need for revisions due to solid state drive (SSD) to high capacity SATA disk
big data implications in an article titled Big Data Solu- buried inside parallel processing nodes. The perception
tion Oering.[32] The methodology addresses handling of shared storage architecturesStorage area network
big data in terms of useful permutations of data sources, (SAN) and Network-attached storage (NAS) is that
complexity in interrelationships, and diculty in deleting they are relatively slow, complex, and expensive. These
(or modifying) individual records.[33] qualities are not consistent with big data analytics sys-
Recent studies show that the use of a multiple layer ar- tems that thrive on system performance, commodity in-
chitecture is an option for dealing with big data. The Dis- frastructure, and low cost.
tributed Parallel architecture distributes data across mul- Real or near-real time information delivery is one of the
tiple processing units and parallel processing units pro- dening characteristics of big data analytics. Latency is
vide data much faster, by improving processing speeds. therefore avoided whenever and wherever possible. Data
This type of architecture inserts data into a parallel in memory is gooddata on spinning disk at the other
DBMS, which implements the use of MapReduce and end of a FC SAN connection is not. The cost of a SAN
Hadoop frameworks. This type of framework looks to at the scale needed for analytics applications is very much
make the processing power transparent to the end user by higher than other storage techniques.
using a front end application server.[34]
There are advantages as well as disadvantages to shared
Big Data Analytics for Manufacturing Applications can storage in big data analytics, but big data analytics prac-
be based on a 5C architecture (connection, conversion, titioners as of 2011 did not favour it.[43]
cyber, cognition, and conguration).[35]
Big Data Lake - With the changing face of business and
IT sector, capturing and storage of data has emerged into
a sophisticated system. The big data lake allows an or- 10.5 Applications
ganization to shift its focus from centralized control to a
shared model to respond to the changing dynamics of in- Big data has increased the demand of information man-
formation management. This enables quick segregation agement specialists in that Software AG, Oracle Corpo-
of data into the data lake thereby reducing the overhead ration, IBM, Microsoft, SAP, EMC, HP and Dell have
time.[36] spent more than $15 billion on software rms specializing
…ICT4D) suggests that big data technology can make important contributions but also present unique challenges to International development.[54][55] Advancements in big data analysis offer cost-effective opportunities to improve decision-making in critical development areas such as health care, employment, economic productivity, crime, security, and natural disaster and resource management.[56][57][58] However, longstanding challenges for developing regions such as inadequate technological infrastructure and economic and human resource scarcity exacerbate existing concerns with big data such as privacy, imperfect methodology, and interoperability issues.[56]

10.5.3 Manufacturing

…Synthesis and Service. The coupled model first constructs a digital image from the early design stage. System information and physical knowledge are logged during product design, based on which a simulation model is built as a reference for future analysis. Initial parameters may be statistically generalized, and they can be tuned using data from testing or the manufacturing process using parameter estimation. After this, the simulation model can be considered as a mirrored image of the real machine, able to continuously record and track machine condition during the later utilization stage. Finally, with the ubiquitous connectivity offered by cloud computing technology, the coupled model also provides better accessibility of machine condition for factory managers in cases where physical access to actual equipment or machine data is limited.[26][62]
Technology

• eBay.com uses two data warehouses at 7.5 petabytes and 40PB as well as a 40PB Hadoop cluster for search, consumer recommendations, and merchandising. Inside eBay's 90PB data warehouse.

• Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based, and as of 2005 they had the world's three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.[64]

• Facebook handles 50 billion photos from its user base.[65]

• As of August 2012, Google was handling roughly 100 billion searches per month.[66]

• Oracle NoSQL Database has been tested past the 1M ops/sec mark with 8 shards and proceeded to hit 1.2M ops/sec with 10 shards.[67]

10.5.5 Private sector

Retail

• Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data – the equivalent of 167 times the information contained in all the books in the US Library of Congress.[1]

Retail Banking

• FICO Card Detection System protects accounts world-wide.[68]

• The volume of business data worldwide, across all companies, doubles every 1.2 years, according to estimates.[69][70]

Real Estate

• Windermere Real Estate uses anonymous GPS signals from nearly 100 million drivers to help new home buyers determine their typical drive times to and from work throughout various times of the day.[71]

10.5.6 Science

The Large Hadron Collider experiments represent about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second. After filtering and refraining from recording more than 99.99995%[72] of these streams, there are 100 collisions of interest per second.[73][74][75]

• As a result, only working with less than 0.001% of the sensor stream data, the data flow from all four LHC experiments represents a 25 petabyte annual rate before replication (as of 2012). This becomes nearly 200 petabytes after replication.

• If all sensor data were to be recorded in the LHC, the data flow would be extremely hard to work with. It would exceed a 150 million petabyte annual rate, or nearly 500 exabytes per day, before replication. To put the number in perspective, this is equivalent to 500 quintillion (5×10^20) bytes per day, almost 200 times more than all the other sources combined in the world.

The Square Kilometre Array is a telescope which consists of millions of antennas and is expected to be operational by 2024. Collectively, these antennas are expected to gather 14 exabytes and store one petabyte per day.[76][77] It is considered to be one of the most ambitious scientific projects ever undertaken.

Science and Research

• When the Sloan Digital Sky Survey (SDSS) began collecting astronomical data in 2000, it amassed more in its first few weeks than all data collected in the history of astronomy. Continuing at a rate of about 200 GB per night, SDSS has amassed more than 140 terabytes of information. When the Large Synoptic Survey Telescope, successor to SDSS, comes online in 2016 it is anticipated to acquire that amount of data every five days.[1]

• Decoding the human genome originally took 10 years to process; now it can be achieved in less than a day: the DNA sequencers have divided the sequencing cost by 10,000 in the last ten years, which is 100 times cheaper than the reduction in cost predicted by Moore's Law.[78]

• The NASA Center for Climate Simulation (NCCS) stores 32 petabytes of climate observations and simulations on the Discover supercomputing cluster.[79]

10.6 Research activities
Encrypted search and cluster formation in big data were demonstrated in March 2014 at the American Society of Engineering Education. Gautam Siwach, engaged at "Tackling the challenges of Big Data" by the MIT Computer Science and Artificial Intelligence Laboratory, and Dr. Amir Esmailpour at the UNH Research Group investigated the key features of big data as the formation of clusters and their interconnections. They focused on the security of big data and the orientation of the term towards the presence of different types of data in an encrypted form at the cloud interface, providing raw definitions and real-time examples within the technology. Moreover, they proposed an approach for identifying the encoding technique to advance towards an expedited search over encrypted text, leading to security enhancements in big data.[80]

In March 2012, The White House announced a national "Big Data Initiative" that consisted of six Federal departments and agencies committing more than $200 million to big data research projects.[81]

The initiative included a National Science Foundation "Expeditions in Computing" grant of $10 million over 5 years to the AMPLab[82] at the University of California, Berkeley.[83] The AMPLab also received funds from DARPA and over a dozen industrial sponsors, and uses big data to attack a wide range of problems from predicting traffic congestion[84] to fighting cancer.[85]

The White House Big Data Initiative also included a commitment by the Department of Energy to provide $25 million in funding over 5 years to establish the Scalable Data Management, Analysis and Visualization (SDAV) Institute,[86] led by the Energy Department's Lawrence Berkeley National Laboratory. The SDAV Institute aims to bring together the expertise of six national laboratories and seven universities to develop new tools to help scientists manage and visualize data on the Department's supercomputers.

The U.S. state of Massachusetts announced the Massachusetts Big Data Initiative in May 2012, which provides funding from the state government and private companies to a variety of research institutions.[87] The Massachusetts Institute of Technology hosts the Intel Science and Technology Center for Big Data in the MIT Computer Science and Artificial Intelligence Laboratory, combining government, corporate, and institutional funding and research efforts.[88]

The European Commission is funding the 2-year-long Big Data Public Private Forum through their Seventh Framework Program to engage companies, academics and other stakeholders in discussing big data issues. The project aims to define a strategy in terms of research and innovation to guide supporting actions from the European Commission in the successful implementation of the big data economy. Outcomes of this project will be used as input for Horizon 2020, their next framework program.[89]

The British government announced in March 2014 the founding of the Alan Turing Institute, named after the computer pioneer and code-breaker, which will focus on new ways of collecting and analysing large sets of data.[90]

At the University of Waterloo Stratford Campus Canadian Open Data Experience (CODE) Inspiration Day, it was demonstrated how using data visualization techniques can increase the understanding and appeal of big data sets in order to communicate a story to the world.[91]

In order to make manufacturing more competitive in the United States (and globally), there is a need to integrate more American ingenuity and innovation into manufacturing; therefore, the National Science Foundation has granted the Industry-University Cooperative Research Center for Intelligent Maintenance Systems (IMS) at the University of Cincinnati funding to focus on developing advanced predictive tools and techniques applicable in a big data environment.[61][92] In May 2013, the IMS Center held an industry advisory board meeting focusing on big data, where presenters from various industrial companies discussed their concerns, issues and future goals in the big data environment.

Computational social sciences – Anyone can use Application Programming Interfaces (APIs) provided by big data holders, such as Google and Twitter, to do research in the social and behavioral sciences.[93] Often these APIs are provided for free.[93] Tobias Preis et al. used Google Trends data to demonstrate that Internet users from countries with a higher per capita gross domestic product (GDP) are more likely to search for information about the future than information about the past. The findings suggest there may be a link between online behaviour and real-world economic indicators.[94][95][96] The authors of the study examined Google query logs by the ratio of the volume of searches for the coming year (2011) to the volume of searches for the previous year (2009), which they call the "future orientation index".[97] They compared the future orientation index to the per capita GDP of each country, and found a strong tendency for countries where Google users inquire more about the future to have a higher GDP. The results hint that there may potentially be a relationship between the economic success of a country and the information-seeking behavior of its citizens captured in big data.
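The "future orientation index" described above is simply a ratio of aggregated search volumes; a minimal sketch (the function name and the numbers below are made up for illustration, not taken from the study) looks like this:

    def future_orientation_index(volume_next_year, volume_previous_year):
        # Ratio of search volume for the coming year to search volume for the
        # previous year, computed per country from Google Trends-style counts.
        return volume_next_year / volume_previous_year

    # Hypothetical 2010 search volumes for the terms "2011" and "2009" in two countries.
    print(future_orientation_index(120_000, 80_000))  # 1.5   -> more future-oriented searching
    print(future_orientation_index(60_000, 90_000))   # ~0.67 -> more past-oriented searching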
Tobias Preis and his colleagues Helen Susannah Moat and H. Eugene Stanley introduced a method to identify online precursors for stock market moves, using trading strategies based on search volume data provided by Google Trends.[98] Their analysis of Google search volume for 98 terms of varying financial relevance, published in Scientific Reports,[99] suggests that increases in search volume for financially relevant search terms tend to precede large losses in financial markets.[100][101][102][103][104][105][106][107]

10.7 Critique

Critiques of the big data paradigm come in two flavors: those that question the implications of the approach itself, and those that question the way it is currently done.
…no large data analysis happening, but the challenge is the extract, transform, load part of data preprocessing.[122]

Big data is a buzzword and a "vague term",[123] but at the same time an "obsession"[123] with entrepreneurs, consultants, scientists and the media. Big data showcases such as Google Flu Trends failed to deliver good predictions in recent years, overstating the flu outbreaks by a factor of two. Similarly, Academy Awards and election predictions based solely on Twitter were more often off than on target. Big data often poses the same challenges as small data; adding more data does not solve problems of bias, but may emphasize other problems. In particular, data sources such as Twitter are not representative of the overall population, and results drawn from such sources may then lead to wrong conclusions. Google Translate, which is based on big data statistical analysis of text, does a remarkably good job at translating web pages; however, results from specialized domains may be dramatically skewed. On the other hand, big data may also introduce new problems, such as the multiple comparisons problem: simultaneously testing a large set of hypotheses is likely to produce many false results that mistakenly appear to be significant. Ioannidis argued that "most published research findings are false"[124] due to essentially the same effect: when many scientific teams and researchers each perform many experiments (i.e. process a big amount of scientific data, although not with big data technology), the likelihood of a "significant" result actually being false grows fast, even more so when only positive results are published.

10.8 See also

Data mining
Cask (company)
Cloudera
HPCC Systems
Intelligent Maintenance Systems
Internet of Things
Oracle NoSQL Database
Nonlinear system identification
Operations research
Programming with Big Data in R (a series of R packages)
Sqrrl
Supercomputer
Talend
Transreality gaming
Tuple space
Unstructured data

10.9 References

[1] "Data, data everywhere". The Economist. 25 February 2010. Retrieved 9 December 2012.
[2] "Community cleverness required". Nature 455 (7209): 1. 4 September 2008. doi:10.1038/455001a.
[3] "Sandia sees data management challenges spiral". HPC Projects. 4 August 2009.
[4] Reichman, O.J.; Jones, M.B.; Schildhauer, M.P. (2011). "Challenges and Opportunities of Open Data in Ecology". Science 331 (6018): 703–5. doi:10.1126/science.1197962. PMID 21311007.
[9] "IBM What is big data? – Bringing big data to the enterprise". www.ibm.com. Retrieved 2013-08-26.
[10] Oracle and FSN, "Mastering Big Data: CFO Strategies to Transform Insight into Opportunity", December 2012.
[11] "Computing Platforms for Analytics, Data Mining, Data Science". kdnuggets.com. Retrieved 15 April 2015.
[12] Jacobs, A. (6 July 2009). "The Pathologies of Big Data". ACMQueue.
[14] Snijders, C.; Matzat, U.; Reips, U.-D. (2012). "'Big Data': Big gaps of knowledge in the field of Internet". International Journal of Internet Science 7: 1–5.
[15] Ibrahim; Targio Hashem, Abaker; Yaqoob, Ibrar; Badrul Anuar, Nor; Mokhtar, Salimah; Gani, Abdullah; Ullah Khan, Samee (2015). "Big data on cloud computing: Review and open research issues". Information Systems 47: 98–115. doi:10.1016/j.is.2014.07.006.
[16] Laney, Douglas. "3D Data Management: Controlling Data Volume, Velocity and Variety" (PDF). Gartner. Retrieved 6 February 2001.
[17] Beyer, Mark. "Gartner Says Solving 'Big Data' Challenge Involves More Than Just Managing Volumes of Data". Gartner. Archived from the original on 10 July 2011. Retrieved 13 July 2011.
[18] Laney, Douglas. "The Importance of 'Big Data': A Definition". Gartner. Retrieved 21 June 2012.
[19] "What is Big Data?". Villanova University.
[20] http://www.bigdataparis.com/presentation/mercredi/PDelort.pdf?PHPSESSID=tv7k70pcr3egpi2r6fi3qbjtj6#page=4
[21] Billings, S.A. Nonlinear System Identification: NARMAX Methods in the Time, Frequency, and Spatio-Temporal Domains. Wiley, 2013.
[22] Delort P., Big Data Paris 2013. http://www.andsi.fr/tag/dsi-big-data/
[23] Delort P., Big Data car Low-Density Data? La faible densité en information comme facteur discriminant. http://lecercle.lesechos.fr/entrepreneur/tendances-innovation/221169222/big-data-low-density-data-faible-densite-information-com
[24] De Mauro, Andrea; Greco, Marco; Grimaldi, Michele (2015). "What is big data? A consensual definition and a review of key research topics". AIP Conference Proceedings 1644: 97–104. doi:10.1063/1.4907823.
[25] Lee, Jay; Bagheri, Behrad; Kao, Hung-An (2014). "Recent Advances and Trends of Cyber-Physical Systems and Big Data Analytics in Industrial Informatics". IEEE Int. Conference on Industrial Informatics (INDIN) 2014.
[26] Lee, Jay; Lapira, Edzel; Bagheri, Behrad; Kao, Hung-an. "Recent advances and trends in predictive manufacturing systems in big data environment". Manufacturing Letters 1 (1): 38–41. doi:10.1016/j.mfglet.2013.09.005.
[27] "LexisNexis To Buy Seisint For $775 Million". Washington Post. Retrieved 15 July 2004.
[28] "LexisNexis Parent Set to Buy ChoicePoint". Washington Post. Retrieved 22 February 2008.
[29] "Quantcast Opens Exabyte-Ready File System". www.datanami.com. Retrieved 1 October 2012.
[30] Bertolucci, Jeff. "Hadoop: From Experiment To Leading Big Data Platform". Information Week, 2013. Retrieved on 14 November 2013.
[31] Webster, John. "MapReduce: Simplified Data Processing on Large Clusters". Search Storage, 2004. Retrieved on 25 March 2013.
[32] "Big Data Solution Offering". MIKE2.0. Retrieved 8 Dec 2013.
[33] "Big Data Definition". MIKE2.0. Retrieved 9 March 2013.
[34] Boja, C; Pocovnicu, A; Bătăgan, L. (2012). "Distributed Parallel Architecture for Big Data". Informatica Economica 16 (2): 116–127.
[35] "Intelligent Maintenance System".
[36] http://www.hcltech.com/sites/default/files/solving_key_businesschallenges_with_big_data_lake_0.pdf
[37] Manyika, James; Chui, Michael; Bughin, Jaques; Brown, Brad; Dobbs, Richard; Roxburgh, Charles; Byers, Angela Hung (May 2011). "Big Data: The next frontier for innovation, competition, and productivity". McKinsey Global Institute.
[38] "Future Directions in Tensor-Based Computation and Modeling" (PDF). May 2009.
[39] Lu, Haiping; Plataniotis, K.N.; Venetsanopoulos, A.N. (2011). "A Survey of Multilinear Subspace Learning for Tensor Data" (PDF). Pattern Recognition 44 (7): 1540–1551. doi:10.1016/j.patcog.2011.01.004.
[40] Monash, Curt (30 April 2009). "eBay's two enormous data warehouses"; Monash, Curt (6 October 2010). "eBay followup – Greenplum out, Teradata > 10 petabytes, Hadoop has some value, and more".
[41] "Resources on how Topological Data Analysis is used to analyze big data". Ayasdi.
[42] CNET News (1 April 2011). "Storage area networks need not apply".
[43] "How New Analytic Systems will Impact Storage". September 2011.
[44] "What Is the Content of the World's Technologically Mediated Information and Communication Capacity: How Much Text, Image, Audio, and Video?", Martin Hilbert (2014), The Information Society; free access to the article through this link: martinhilbert.net/WhatsTheContent_Hilbert.pdf
[45] Rajpurohit, Anmol (2014-07-11). "Interview: Amy Gershkoff, Director of Customer Analytics & Insights, eBay on How to Design Custom In-House BI Tools". KDnuggets. Retrieved 2014-07-14. "Generally, I find that off-the-shelf business intelligence tools do not meet the needs of clients who want to derive custom insights from their data. Therefore, for medium-to-large organizations with access to strong technical talent, I usually recommend building custom, in-house solutions."
[46] Kalil, Tom. "Big Data is a Big Deal". White House. Retrieved 26 September 2012.
[47] Executive Office of the President (March 2012). "Big Data Across the Federal Government" (PDF). White House. Retrieved 26 September 2012.
[48] Lampitt, Andrew. "The real story of how big data analytics helped Obama win". Infoworld. Retrieved 31 May 2014.
[49] Hoover, J. Nicholas. "Government's 10 Most Powerful Supercomputers". Information Week. UBM. Retrieved 26 September 2012.
[50] Bamford, James (15 March 2012). "The NSA Is Building the Country's Biggest Spy Center (Watch What You Say)". Wired Magazine. Retrieved 2013-03-18.
[51] "Groundbreaking Ceremony Held for $1.2 Billion Utah Data Center". National Security Agency Central Security Service. Retrieved 2013-03-18.
[52] Hill, Kashmir. "Blueprints Of NSA's Ridiculously Expensive Data Center In Utah Suggest It Holds Less Info Than Thought". Forbes. Retrieved 2013-10-31.
[53] "Are Indian companies making enough sense of Big Data?". Live Mint – http://www.livemint.com/. 2014-06-23. Retrieved 2014-11-22.
[54] UN Global Pulse (2012). Big Data for Development: Opportunities and Challenges (White paper by Letouzé, E.). New York: United Nations. Retrieved from http://www.unglobalpulse.org/projects/BigDataforDevelopment
[55] WEF (World Economic Forum) & Vital Wave Consulting (2012). Big Data, Big Impact: New Possibilities for International Development. World Economic Forum. Retrieved 24 August 2012, from http://www.weforum.org/reports/big-data-big-impact-new-possibilities-international-development
[56] "Big Data for Development: From Information- to Knowledge Societies", Martin Hilbert (2013), SSRN Scholarly Paper No. ID 2205145. Rochester, NY: Social Science Research Network; http://papers.ssrn.com/abstract=2205145
[57] Elena Kvochko, "Four Ways To talk About Big Data (Information Communication Technologies for Development Series)". worldbank.org. Retrieved 2012-05-30.
[58] Daniele Medri: "Big Data & Business: An on-going revolution". Statistics Views. 21 Oct 2013.
[59] "Manufacturing: Big Data Benefits and Challenges". TCS Big Data Study. Mumbai, India: Tata Consultancy Services Limited. Retrieved 2014-06-03.
[60] Lee, Jay; Wu, F.; Zhao, W.; Ghaffari, M.; Liao, L. (Jan 2013). "Prognostics and health management design for rotary machinery systems – Reviews, methodology and applications". Mechanical Systems and Signal Processing 42 (1).
[61] "Center for Intelligent Maintenance Systems (IMS Center)".
[62] Predictive manufacturing system.
[63] Couldry, Nick; Turow, Joseph (2014). "Advertising, Big Data, and the Clearance of the Public Realm: Marketers' New Approaches to the Content Subsidy". International Journal of Communication 8: 1710–1726.
[64] Layton, Julia. "Amazon Technology". Money.howstuffworks.com. Retrieved 2013-03-05.
[65] "Scaling Facebook to 500 Million Users and Beyond". Facebook.com. Retrieved 2013-07-21.
[66] "Google Still Doing At Least 1 Trillion Searches Per Year". Search Engine Land. 16 January 2015. Retrieved 15 April 2015.
[67] Lamb, Charles. "Oracle NoSQL Database Exceeds 1 Million Mixed YCSB Ops/Sec".
[68] "FICO Falcon Fraud Manager". Fico.com. Retrieved 2013-07-21.
[69] "eBay Study: How to Build Trust and Improve the Shopping Experience". Knowwpcarey.com. 2012-05-08. Retrieved 2013-03-05.
[70] "Leading Priorities for Big Data for Business and IT". eMarketer. October 2013. Retrieved January 2014.
[71] Wingfield, Nick (2013-03-12). "Predicting Commutes More Accurately for Would-Be Home Buyers – NYTimes.com". Bits.blogs.nytimes.com. Retrieved 2013-07-21.
[72] Alexandru, Dan. "Prof" (PDF). cds.cern.ch. CERN. Retrieved 24 March 2015.
[73] "LHC Brochure, English version. A presentation of the largest and the most powerful particle accelerator in the world, the Large Hadron Collider (LHC), which started up in 2008. Its role, characteristics, technologies, etc. are explained for the general public". CERN-Brochure-2010-006-Eng. LHC Brochure, English version. CERN. Retrieved 20 January 2013.
[74] "LHC Guide, English version. A collection of facts and figures about the Large Hadron Collider (LHC) in the form of questions and answers". CERN-Brochure-2008-001-Eng. LHC Guide, English version. CERN. Retrieved 20 January 2013.
[75] Brumfiel, Geoff (19 January 2011). "High-energy physics: Down the petabyte highway". Nature 469. pp. 282–83. doi:10.1038/469282a.
[76] http://www.zurich.ibm.com/pdf/astron/CeBIT%202013%20Background%20DOME.pdf
[77] "Future telescope array drives development of exabyte processing". Ars Technica. Retrieved 15 April 2015.
[78] Delort P., OECD ICCP Technology Foresight Forum, 2012. http://www.oecd.org/sti/ieconomy/Session_3_Delort.pdf#page=6
[79] Webster, Phil. "Supercomputing the Climate: NASA's Big Data Mission". CSC World. Computer Sciences Corporation. Retrieved 2013-01-18.
[80] Siwach, Gautam; Esmailpour, Amir (March 2014). Encrypted Search & Cluster Formation in Big Data (PDF). ASEE 2014 Zone I Conference. University of Bridgeport, Bridgeport, Connecticut, USA.
[81] "Obama Administration Unveils 'Big Data' Initiative: Announces $200 Million In New R&D Investments" (PDF). The White House.
[82] "AMPLab at the University of California, Berkeley". Amplab.cs.berkeley.edu. Retrieved 2013-03-05.
[83] "NSF Leads Federal Efforts In Big Data". National Science Foundation (NSF). 29 March 2012.
[84] Timothy Hunter; Teodor Moldovan; Matei Zaharia; Justin Ma; Michael Franklin; Pieter Abbeel; Alexandre Bayen (October 2011). Scaling the Mobile Millennium System in the Cloud.
[85] David Patterson (5 December 2011). "Computer Scientists May Have What It Takes to Help Cure Cancer". The New York Times.
[86] "Secretary Chu Announces New Institute to Help Scientists Improve Massive Data Set Research on DOE Supercomputers". energy.gov.
[87] "Governor Patrick announces new initiative to strengthen Massachusetts' position as a World leader in Big Data". Commonwealth of Massachusetts.
[88] "Big Data @ CSAIL". Bigdata.csail.mit.edu. 2013-02-22. Retrieved 2013-03-05.
[89] "Big Data Public Private Forum". Cordis.europa.eu. 2012-09-01. Retrieved 2013-03-05.
[90] "Alan Turing Institute to be set up to research big data". BBC News. 19 March 2014. Retrieved 2014-03-19.
[91] "Inspiration day at University of Waterloo, Stratford Campus". http://www.betakit.com/. Retrieved 2014-02-28.
[92] Lee, Jay; Lapira, Edzel; Bagheri, Behrad; Kao, Hung-An (2013). "Recent Advances and Trends in Predictive Manufacturing Systems in Big Data Environment". Manufacturing Letters 1 (1): 38–41. doi:10.1016/j.mfglet.2013.09.005.
[93] Reips, Ulf-Dietrich; Matzat, Uwe (2014). "Mining 'Big Data' using Big Data Services". International Journal of Internet Science 1 (1): 1–8.
[94] Preis, Tobias; Moat, Helen Susannah; Stanley, H. Eugene; Bishop, Steven R. (2012). "Quantifying the Advantage of Looking Forward". Scientific Reports 2: 350. doi:10.1038/srep00350. PMC 3320057. PMID 22482034.
[95] Marks, Paul (5 April 2012). "Online searches for future linked to economic success". New Scientist. Retrieved 9 April 2012.
[96] Johnston, Casey (6 April 2012). "Google Trends reveals clues about the mentality of richer nations". Ars Technica. Retrieved 9 April 2012.
[97] Tobias Preis (2012-05-24). "Supplementary Information: The Future Orientation Index is available for download" (PDF). Retrieved 2012-05-24.
[98] Philip Ball (26 April 2013). "Counting Google searches predicts market movements". Nature. Retrieved 9 August 2013.
[99] Tobias Preis, Helen Susannah Moat and H. Eugene Stanley (2013). "Quantifying Trading Behavior in Financial Markets Using Google Trends". Scientific Reports 3: 1684. doi:10.1038/srep01684.
[100] Nick Bilton (26 April 2013). "Google Search Terms Can Predict Stock Market, Study Finds". New York Times. Retrieved 9 August 2013.
[101] Christopher Matthews (26 April 2013). "Trouble With Your Investment Portfolio? Google It!". TIME Magazine. Retrieved 9 August 2013.
[102] Philip Ball (26 April 2013). "Counting Google searches predicts market movements". Nature. Retrieved 9 August 2013.
[103] Bernhard Warner (25 April 2013). "'Big Data' Researchers Turn to Google to Beat the Markets". Bloomberg Businessweek. Retrieved 9 August 2013.
[104] Hamish McRae (28 April 2013). "Hamish McRae: Need a valuable handle on investor sentiment? Google it". The Independent (London). Retrieved 9 August 2013.
[105] Richard Waters (25 April 2013). "Google search proves to be new word in stock market prediction". Financial Times. Retrieved 9 August 2013.
[106] David Leinweber (26 April 2013). "Big Data Gets Bigger: Now Google Trends Can Predict The Market". Forbes. Retrieved 9 August 2013.
[107] Jason Palmer (25 April 2013). "Google searches predict market moves". BBC. Retrieved 9 August 2013.
[108] Graham M. (9 March 2012). "Big data and the end of theory?". The Guardian (London).
[109] "Good Data Won't Guarantee Good Decisions". Harvard Business Review. Shah, Shvetank; Horne, Andrew; Capell, Jaime. HBR.org. Retrieved 8 September 2012.
[110] Anderson, C. (2008, 23 June). "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete". Wired Magazine (Science: Discoveries). http://www.wired.com/science/discoveries/magazine/16-07/pb_theory
[111] Braha, D.; Stacey, B.; Bar-Yam, Y. (2011). "Corporate Competition: A Self-organized Network". Social Networks 33: 219–230.
[112] Rauch, J. (2002). "Seeing Around Corners". The Atlantic (April), 35–48. http://www.theatlantic.com/magazine/archive/2002/04/seeing-around-corners/302471/
[113] Epstein, J. M., & Axtell, R. L. (1996). Growing Artificial Societies: Social Science from the Bottom Up. A Bradford Book.
[114] Delort P., Big data in Biosciences, Big Data Paris, 2012. http://www.bigdataparis.com/documents/Pierre-Delort-INSERM.pdf#page=5
[115] Ohm, Paul. "Don't Build a Database of Ruin". Harvard Business Review.
[116] Darwin Bond-Graham, "Iron Cagebook – The Logical End of Facebook's Patents", Counterpunch.org, 2013.12.03.
[117] Darwin Bond-Graham, "Inside the Tech industry's Startup Conference", Counterpunch.org, 2013.09.11.
[118] danah boyd (2010-04-29). "Privacy and Publicity in the Context of Big Data". WWW 2010 conference. Retrieved 2011-04-18.

10.10 Further reading

Hilbert, Martin; López, Priscila (2011). "The World's Technological Capacity to Store, Communicate, and Compute Information". Science 332 (6025): 60–65. doi:10.1126/science.1200970. PMID 21310967.
"The Rise of Industrial Big Data". GE Intelligent Platforms. Retrieved 2013-11-12.
"History of Big Data Timeline". A visual history of Big Data with links to supporting articles.
Euclidean distance
11.1.3 Three dimensions

In three-dimensional Euclidean space, the distance is

d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + (p_3 - q_3)^2}.

11.1.4 n dimensions

In general, for an n-dimensional space, the distance is

d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_i - q_i)^2 + \cdots + (p_n - q_n)^2}.
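The n-dimensional formula translates directly into a short routine; a minimal Python sketch (the function name is illustrative, not part of the article):

    import math

    def euclidean_distance(p, q):
        # d(p, q) = sqrt of the sum of squared coordinate differences.
        if len(p) != len(q):
            raise ValueError("points must have the same dimension")
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

    print(euclidean_distance((1.0, 2.0, 3.0), (4.0, 6.0, 3.0)))  # 5.0 = sqrt(3^2 + 4^2 + 0^2)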
Hamming distance
In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. Put another way, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other.

A major application is in coding theory, more specifically to block codes, in which the equal-length strings are vectors over a finite field.

12.1 Examples

The Hamming distance between:

"karolin" and "kathrin" is 3.
"karolin" and "kerstin" is 3.
1011101 and 1001001 is 2.
2173896 and 2233796 is 3.

On a two-dimensional grid such as a chessboard, the Hamming distance is the minimum number of moves it would take a rook to move from one cell to the other.
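A direct implementation of the definition is a one-line comparison count; the sketch below (plain Python, illustrative function name) reproduces the four distances listed above:

    def hamming_distance(s1, s2):
        # Number of positions at which the corresponding symbols differ.
        if len(s1) != len(s2):
            raise ValueError("Hamming distance is only defined for equal-length strings")
        return sum(c1 != c2 for c1, c2 in zip(s1, s2))

    assert hamming_distance("karolin", "kathrin") == 3
    assert hamming_distance("karolin", "kerstin") == 3
    assert hamming_distance("1011101", "1001001") == 2
    assert hamming_distance("2173896", "2233796") == 3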
at least 2k+1. This is more easily understood geometri-
cally as any closed balls of radius k centered on distinct
12.2 Properties codewords being disjoint.[1] These balls are also called
Hamming spheres in this context.[2]
For a xed length n, the Hamming distance is a metric Thus a code with minimum Hamming distance d between
on the vector space of the words of length n (also known its codewords can detect at most d1 errors and can cor-
as a Hamming space), as it fullls the conditions of non- rect (d1)/2 errors.[1] The latter number is also called
negativity, identity of indiscernibles and symmetry, and the packing radius or the error-correcting capability of the
it can be shown by complete induction that it satises the code.[2]
triangle inequality as well.[1] The Hamming distance be-
tween two words a and b can also be seen as the Hamming
weight of ab for an appropriate choice of the operator. 12.4 History and applications
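The last statement can be illustrated with a small sketch (plain Python; the helper names and the toy codes are illustrative, not from the text): it computes the minimum Hamming distance d of a block code and derives how many errors the code can detect (d−1) and correct (⌊(d−1)/2⌋):

    from itertools import combinations

    def hamming_distance(s1, s2):
        return sum(c1 != c2 for c1, c2 in zip(s1, s2))

    def code_capabilities(codewords):
        # Minimum Hamming distance d over all pairs of distinct codewords.
        d = min(hamming_distance(a, b) for a, b in combinations(codewords, 2))
        return d, d - 1, (d - 1) // 2   # (d, detectable errors, correctable errors)

    print(code_capabilities(["000", "011", "101", "110"]))  # (2, 1, 0): detects single errors, corrects none
    print(code_capabilities(["000", "111"]))                # (3, 2, 1): the 3-bit repetition code corrects 1 error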
12.4 History and applications

The Hamming distance is named after Richard Hamming, who introduced it in his fundamental paper on Hamming codes, "Error detecting and error correcting …"
Norm (mathematics)
This article is about linear algebra and analysis. For field theory, see Field norm. For ideals, see Ideal norm. For group theory, see Norm (group). For norms in descriptive set theory, see prewellordering.

In linear algebra, functional analysis and related areas of mathematics, a norm is a function that assigns a strictly positive length or size to each vector in a vector space, save for the zero vector, which is assigned a length of zero. A seminorm, on the other hand, is allowed to assign zero length to some non-zero vectors (in addition to the zero vector).

A norm must also satisfy certain properties pertaining to scalability and additivity, which are given in the formal definition below.

A simple example is the 2-dimensional Euclidean space R^2 equipped with the Euclidean norm. Elements in this vector space (e.g., (3, 7)) are usually drawn as arrows in a 2-dimensional Cartesian coordinate system starting at the origin (0, 0). The Euclidean norm assigns to each vector the length of its arrow. Because of this, the Euclidean norm is often known as the magnitude.

A vector space on which a norm is defined is called a normed vector space. Similarly, a vector space with a seminorm is called a seminormed vector space. It is often possible to supply a norm for a given vector space in more than one way.

13.1 Definition

Given a vector space V over a subfield F of the complex numbers, a norm on V is a function p: V → R with the following properties:[1]

For all a ∈ F and all u, v ∈ V,

1. p(av) = |a| p(v) (absolute homogeneity or absolute scalability),
2. p(u + v) ≤ p(u) + p(v) (triangle inequality or subadditivity),
3. if p(v) = 0 then v is the zero vector (separates points).

By the first axiom, absolute homogeneity, we have p(0) = 0 and p(−v) = p(v), so that by the triangle inequality

p(v) ≥ 0 (positivity).

A seminorm on V is a function p : V → R with properties 1 and 2 above.

Every vector space V with seminorm p induces a normed space V/W, called the quotient space, where W is the subspace of V consisting of all vectors v in V with p(v) = 0. The induced norm on V/W is clearly well-defined and is given by:

p(W + v) = p(v).

Two norms (or seminorms) p and q on a vector space V are equivalent if there exist two real constants c and C, with c > 0, such that for every vector v in V one has:

c q(v) ≤ p(v) ≤ C q(v).

A topological vector space is called normable (seminormable) if the topology of the space can be induced by a norm (seminorm).
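As a quick, hedged illustration of the three defining properties, the following sketch (plain Python; the helper name p and the random sampling are illustrative, not part of the article) numerically spot-checks them for the familiar Euclidean norm; it samples vectors rather than proving anything:

    import math, random

    def p(v):
        # Euclidean norm on R^n, used here as a concrete instance of the definition.
        return math.sqrt(sum(x * x for x in v))

    random.seed(0)
    for _ in range(1000):
        u = [random.uniform(-10, 10) for _ in range(3)]
        v = [random.uniform(-10, 10) for _ in range(3)]
        a = random.uniform(-5, 5)
        # 1. absolute homogeneity: p(a v) = |a| p(v)
        assert math.isclose(p([a * x for x in v]), abs(a) * p(v), rel_tol=1e-9)
        # 2. triangle inequality: p(u + v) <= p(u) + p(v)
        assert p([x + y for x, y in zip(u, v)]) <= p(u) + p(v) + 1e-9
    # 3. separation of points: p(v) = 0 for the zero vector
    assert p([0.0, 0.0, 0.0]) == 0.0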
13.2 Notation

If a norm p : V → R is given on a vector space V, then the norm of a vector v ∈ V is usually denoted by enclosing it within double vertical lines: ‖v‖ = p(v). Such notation is also sometimes used if p is only a seminorm.

For the length of a vector in Euclidean space (which is an example of a norm, as explained below), the notation |v| with single vertical lines is also widespread.

In Unicode, the codepoint of the "double vertical line" character ‖ is U+2016. The double vertical line should not be confused with the "parallel to" symbol, Unicode U+2225 (∥). This is usually not a problem because the former is used in parenthesis-like fashion, whereas the latter is used as an infix operator. The double vertical line used here should also not be confused with the symbol used to denote lateral clicks, Unicode U+01C1 (ǁ). The single vertical line | is called "vertical line" in Unicode and its codepoint is U+007C.

13.3 Examples

All norms are seminorms.

The trivial seminorm has p(x) = 0 for all x in V.

Every linear form f on a vector space defines a seminorm by x ↦ |f(x)|.

On an n-dimensional complex space, the Euclidean norm is

\|z\| := \sqrt{|z_1|^2 + \cdots + |z_n|^2} = \sqrt{z_1 \bar{z}_1 + \cdots + z_n \bar{z}_n}.

In both cases we can also express the norm as the square root of the inner product of the vector and itself:

\|x\| := \sqrt{x^* x},

where x is represented as a column vector ([x_1; x_2; ...; x_n]) and x^* denotes its conjugate transpose.

This formula is valid for any inner product space, including Euclidean and complex spaces. For Euclidean spaces, the inner product is equivalent to the dot product. Hence, in this specific case the formula can also be written with the following notation:

\|x\| := \sqrt{x \cdot x}.

The Euclidean norm is also called the Euclidean length, L^2 distance, ℓ^2 distance, L^2 norm, or ℓ^2 norm; see L^p space.

The set of vectors in R^{n+1} whose Euclidean norm is a given positive constant forms an n-sphere.

In contrast,

\sum_{i=1}^{n} x_i

is not a norm because it may yield negative results.

13.3.4 p-norm

Main article: Lp space
(
n )1/p
x
p
xp := |xi | .
i=1
x = 1
p
|f (x) g(x)| d
X
(without pth root) denes a distance that makes Lp (X) into 13.3.6 Zero norm
a complete metric topological vector space. These spaces
are of great interest in functional analysis, probability In probability and functional analysis, the zero norm in-
theory, and harmonic analysis. However, outside trivial duces a complete metric topology for the space of mea-
cases, this topological vector space is not locally convex sureable functions and
fornthe F-space of sequences with
and has no continuous nonzero linear forms. Thus the Fnorm (xn ) 7 n2 xn /(1 + xn ) , which is dis-
topological dual space contains only the zero functional. cussed by Stefan Rolewicz in Metric Linear Spaces.[3]
The derivative of the p-norm is given by
Hamming distance of a vector from zero
p2
xk |xk | See also: Hamming distance and discrete metric
xp = p1 .
xk xp
In metric geometry, the discrete metric takes the value
For the special case of p = 2, this becomes
one for distinct points and zero otherwise. When applied
coordinate-wise to the elements of a vector space, the
discrete distance denes the Hamming distance, which
xk
x2 = , is important in coding and information theory. In the
xk x2 eld of real or complex numbers, the distance of the dis-
or crete metric from zero is not homogeneous in the non-
zero point; indeed, the distance from zero remains one
as its non-zero argument approaches zero. However, the
x discrete distance of a number from zero does satisfy the
x2 = . other properties of a norm, namely the triangle inequality
x x2
and positive deniteness. When applied component-wise
to vectors, the discrete distance from zero behaves like a
13.3.5 Maximum norm (special case of: non-homogeneous norm, which counts the number of
innity norm, uniform norm, or non-zero components in its vector argument; again, this
non-homogeneous norm is discontinuous.
supremum norm)
In signal processing and statistics, David Donoho referred
Main article: Maximum norm to the zero "norm" with quotation marks. Following
Donohos notation, the zero norm of x is simply the
x := max (|x1 | , . . . , |xn |) . number of non-zero coordinates of x, or the Hamming
distance of the vector from zero. When this norm is lo-
The set of vectors whose innity norm is a given constant, calized to a bounded set, it is the limit of p-norms as p ap-
c, forms the surface of a hypercube with edge length 2c. proaches 0. Of course, the zero norm is not a B-norm,
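To make the preceding definitions concrete, here is a small illustrative sketch (using Python with NumPy, which is an assumption of this example rather than anything prescribed by the text) that evaluates the p-norm, the infinity norm, and Donoho's zero "norm" (the count of non-zero coordinates) for a sample vector.

import numpy as np

def p_norm(x, p):
    # ||x||_p = (sum_i |x_i|^p)^(1/p), defined here for p >= 1
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

def infinity_norm(x):
    # ||x||_inf = max_i |x_i|
    return np.max(np.abs(x))

def zero_norm(x):
    # Donoho's zero "norm": number of non-zero coordinates (not a true norm)
    return int(np.count_nonzero(x))

x = np.array([3.0, -4.0, 0.0])
print(p_norm(x, 1))      # 7.0  (taxicab norm)
print(p_norm(x, 2))      # 5.0  (Euclidean norm)
print(infinity_norm(x))  # 4.0
print(zero_norm(x))      # 2

As the text notes, the zero "norm" is not homogeneous: scaling x by 0.5 leaves the count unchanged, whereas every true norm would be halved.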
Conversely:

p(x) := \sum_{i=0}^{n} p_i(x)

is again a seminorm.

For any norm p on a vector space V, we have that for all u and v in V:

p(u ± v) ≥ |p(u) − p(v)|.

Any locally convex topological vector space has a local basis consisting of absolutely convex sets. A common method to construct such a basis is to use a family (p) of seminorms p that separates points: the collection of all finite intersections of sets {p < 1/n} turns the space into a locally convex topological vector space so that every p is continuous.

Such a method is used to design weak and weak* topologies.
Gowers norm
Mahalanobis distance
Manhattan distance
13.8 Notes

[1] Prugovečki 1981, page 20.

13.9 References

Bourbaki, Nicolas (1987). Topological Vector Spaces, Chapters 1–5. Springer. ISBN 3-540-13627-4.
Chapter 14

Regularization (mathematics)
For other uses in related fields, see Regularization (disambiguation).

Regularization, in mathematics and statistics and particularly in the fields of machine learning and inverse problems, refers to a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting. This information is usually of the form of a penalty for complexity, such as restrictions for smoothness or bounds on the vector space norm.

A theoretical justification for regularization is that it attempts to impose Occam's razor on the solution. From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters.

The same idea arose in many fields of science. For example, the least-squares method can be viewed as a very simple form of regularization. A simple form of regularization applied to integral equations, generally termed Tikhonov regularization after Andrey Nikolayevich Tikhonov, is essentially a trade-off between fitting the data and reducing a norm of the solution. More recently, non-linear regularization methods, including total variation regularization, have become popular.

14.1 Regularization in statistics and machine learning

In statistics and machine learning, regularization methods are used for model selection, in particular to prevent overfitting by penalizing models with extreme parameter values. The most common variants in machine learning are L1 and L2 regularization, which can be added to learning algorithms that minimize a loss function E(X, Y) by instead minimizing E(X, Y) + λ‖w‖, where w is the model's weight vector, ‖·‖ is either the L1 norm or the squared L2 norm, and λ is a free parameter that needs to be tuned empirically (typically by cross-validation; see hyperparameter optimization). This method applies to many models. When applied in linear regression, the resulting models are termed lasso or ridge regression, but regularization is also employed in (binary and multiclass) logistic regression, neural nets, support vector machines, conditional random fields and some matrix decomposition methods. L2 regularization may also be called weight decay, in particular in the setting of neural nets. L1 regularization is often preferred because it produces sparse models and thus performs feature selection within the learning algorithm, but since the L1 norm is not differentiable, it may require changes to learning algorithms, in particular gradient-based learners.[1][2]

Bayesian learning methods make use of a prior probability that (usually) gives lower probability to more complex models. Well-known model selection techniques include the Akaike information criterion (AIC), minimum description length (MDL), and the Bayesian information criterion (BIC). Alternative methods of controlling overfitting not involving regularization include cross-validation.

Regularization can be used to fine-tune model complexity using an augmented error function with cross-validation. The data sets used in complex models can produce a levelling-off of validation error as the complexity of the models increases: training data set errors decrease while the validation data set error remains constant. Regularization introduces a second factor which weights the penalty against more complex models with an increasing variance in the data errors. This gives an increasing penalty as model complexity increases.[3]

Regularization has been applied to the linear model in several ways; for example, a linear combination of the LASSO and ridge regression methods is elastic net regularization.
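As a concrete illustration of the penalized objective E(X, Y) + λ‖w‖ described above, the following sketch (in Python with NumPy; the data and the choice of squared-error loss are assumptions of the example, not part of the text) fits a linear model with the squared L2 penalty, i.e. ridge regression, using its closed-form solution.

import numpy as np

def ridge_fit(X, y, lam):
    # Minimize ||X w - y||^2 + lam * ||w||^2 over w.
    # Closed form: w = (X^T X + lam * I)^(-1) X^T y
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)

for lam in (0.0, 1.0, 100.0):
    w = ridge_fit(X, y, lam)
    print(lam, np.round(w, 3))

Increasing λ shrinks the fitted weights towards zero, trading a little bias for lower variance; λ itself would in practice be chosen by cross-validation, as the text notes. An L1 penalty (lasso) has no such closed form and is typically handled with coordinate descent or proximal methods.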
14.2 See also

Bayesian interpretation of regularization

Regularization by spectral filtering
14.3 Notes

[1] Andrew, Galen; Gao, Jianfeng (2007). "Scalable training of L1-regularized log-linear models". Proceedings of the 24th International Conference on Machine Learning. doi:10.1145/1273496.1273501. ISBN 9781595937933.
14.4 References

A. Neumaier, Solving ill-conditioned and singular linear systems: A tutorial on regularization, SIAM Review 40 (1998), 636–666. Available in PDF from the author's website.
Chapter 15
Loss function
In mathematical optimization, statistics, decision theory and machine learning, a loss function or cost function is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost" associated with the event. An optimization problem seeks to minimize a loss function. An objective function is either a loss function or its negative (sometimes called a reward function, a profit function, a utility function, etc.), in which case it is to be maximized.

In statistics, typically a loss function is used for parameter estimation, and the event in question is some function of the difference between estimated and true values for an instance of data. The concept, as old as Laplace, was reintroduced in statistics by Abraham Wald in the middle of the 20th century.[1] In the context of economics, for example, this is usually economic cost or regret. In classification, it is the penalty for an incorrect classification of an example. In actuarial science, it is used in an insurance context to model benefits paid over premiums, particularly since the works of Harald Cramér in the 1920s.[2] In optimal control the loss is the penalty for failing to achieve a desired value. In financial risk management the function is precisely mapped to a monetary loss.

15.1 Use in statistics

Parameter estimation for supervised learning tasks such as regression or classification can be formulated as the minimization of a loss function over a training set. The goal of estimation is to find a function that models its input well: if it were applied to the training set, it should predict the values (or class labels) associated with the samples in that set. The loss function quantifies the amount by which the prediction deviates from the actual values.

Formally, the data are X = (X_1, ..., X_n), where the X_i ~ F_θ are i.i.d. The X is the set of things the decision rule will be making decisions on. There exists some number of possible ways F_θ to model our data X, which our decision function can use to make decisions. For a finite number of models, we can thus think of θ as the index to this family of probability models. For an infinite family of models, it is a set of parameters to the family of distributions.

On a more practical note, it is important to understand that, while it is tempting to think of loss functions as necessarily parametric (since they seem to take θ as a parameter), the fact that θ is infinite-dimensional is completely incompatible with this notion; for example, if the family of probability functions is uncountably infinite, θ indexes an uncountably infinite space.

From here, given a set A of possible actions, a decision rule is a function δ : X → A. A loss function is a real lower-bounded function L on Θ × A for some θ ∈ Θ. The value L(θ, δ(X)) is the cost of action δ(X) under parameter θ.[3]

15.2 Expected loss

The value of the loss function itself is a random quantity because it depends on the outcome of a random variable X. Both frequentist and Bayesian statistical theory involve making a decision based on the expected value of the loss function; however, this quantity is defined differently under the two paradigms.

15.2.1 Frequentist expected loss
15.2.3 Economic choice under uncertainty

In economics, decision-making under uncertainty is often modelled using the von Neumann–Morgenstern utility function of the uncertain variable of interest, such as end-of-period wealth. Since the value of this variable is uncertain, so is the value of the utility function; it is the expected value of utility that is maximized.

15.2.4 Examples

But for risk-averse (or risk-loving) agents, loss is measured as the negative of a utility function, which represents satisfaction and is usually interpreted in ordinal terms rather than in cardinal (absolute) terms.

Other measures of cost are possible, for example mortality or morbidity in the field of public health or safety engineering.

For most optimization algorithms, it is desirable to have a loss function that is globally continuous and differentiable.

Two very commonly used loss functions are the squared loss, L(a) = a^2, and the absolute loss, L(a) = |a|. However, the absolute loss has the disadvantage that it is not differentiable at a = 0. The squared loss has the disadvantage that it tends to be dominated by outliers: when summing over a set of a's (as in \sum_{i=1}^{n} L(a_i)), the final sum tends to be the result of a few particularly large a-values, rather than an expression of the average a-value.

The choice of a loss function is not arbitrary. It is very restrictive and sometimes the loss function may be characterized by its desirable properties.[10] Among the choice principles are, for example, the requirement of completeness of the class of symmetric statistics in the case of i.i.d. observations, the principle of complete information, and some others.

15.3 Decision rules

A decision rule makes a choice using an optimality criterion. Some commonly used criteria are:

Invariance: Choose the optimal decision rule which satisfies an invariance requirement.

Choose the decision rule with the lowest average loss (i.e. minimize the expected value of the loss function):

\operatorname*{arg\,min}_{\delta} \; \mathbb{E}_{\theta}[R(\theta, \delta)] = \operatorname*{arg\,min}_{\delta} \int_{\Theta} R(\theta, \delta) \, p(\theta) \, d\theta.

15.5 Loss functions in Bayesian statistics

One of the consequences of Bayesian inference is that in addition to experimental data, the loss function does not in itself wholly determine a decision. What is important is the relationship between the loss function and the posterior probability. So it is possible to have two different loss functions which lead to the same decision when the prior probability distributions associated with each compensate for the details of each loss function.

Combining the three elements of the prior probability, the data, and the loss function then allows decisions to be based on maximizing the subjective expected utility, a concept introduced by Leonard J. Savage.

15.6 Regret

Main article: Regret (decision theory)

Savage also argued that using non-Bayesian methods such as minimax, the loss function should be based on the idea of regret, i.e., the loss associated with a decision should be the difference between the consequences of the best decision that could have been taken had the underlying circumstances been known and the decision that was in fact taken before they were known.

15.7 Quadratic loss function

The use of a quadratic loss function is common, for example when using least squares techniques. It is often more mathematically tractable than other loss functions because of the properties of variances, as well as being symmetric: an error above the target causes the same loss as the same magnitude of error below the target. If the target is t, then a quadratic loss function is

\lambda(x) = C(t - x)^2

for some constant C; the value of the constant makes no difference to a decision, and can be ignored by setting it equal to 1.

Many common statistics, including t-tests, regression models, design of experiments, and much else, use least squares methods applied using linear regression theory, which is based on the quadratic loss function.

The quadratic loss function is also used in linear-quadratic optimal control problems. In these problems, even in the absence of uncertainty, it may not be possible to achieve the desired values of all target variables. Often loss is expressed as a quadratic form in the deviations of the variables of interest from their desired values; this approach is tractable because it results in linear first-order conditions. In the context of stochastic control, the expected value of the quadratic form is used.
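To illustrate the behaviour of the squared and absolute losses discussed above, the following sketch (Python with NumPy; the sample residuals are invented for the example) compares the two on a set of errors containing one outlier, showing how the squared loss is dominated by the largest error.

import numpy as np

def squared_loss(a):
    # L(a) = a^2
    return a ** 2

def absolute_loss(a):
    # L(a) = |a|
    return np.abs(a)

# Residuals (prediction errors); the last one is an outlier.
a = np.array([0.5, -0.3, 0.2, -0.4, 10.0])

print(np.sum(squared_loss(a)))   # 100.54 -> almost entirely due to the outlier
print(np.sum(absolute_loss(a)))  # 11.4   -> the outlier contributes, but less dramatically

# Quadratic loss about a target t, lambda(x) = C (t - x)^2 with C = 1:
t, x = 3.0, 2.5
print((t - x) ** 2)              # 0.25

The same asymmetry explains why least-squares estimates are sensitive to outliers while least-absolute-deviation estimates are more robust.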
15.8 0-1 loss function

In statistics and decision theory, a frequently used loss function is the 0-1 loss function

L(\hat{y}, y) = I(\hat{y} \neq y),

where I is the indicator function.

15.9 See also

Discounted maximum loss

Hinge loss

Scoring rule
15.10 References
[1] Wald, A. (1950). Statistical Decision Functions. Wiley.
Chapter 16

Least squares
For the topic of approximating a function by a sum of others using an objective function based on squared distances, see least squares (function approximation).

The method of least squares is a standard approach in regression analysis to the approximate solution of overdetermined systems, i.e., sets of equations in which there are more equations than unknowns. "Least squares" means that the overall solution minimizes the sum of the squares of the errors made in the results of every single equation.

The most important application is in data fitting. The best fit in the least-squares sense minimizes the sum of squared residuals, a residual being the difference between an observed value and the fitted value provided by a model. When the problem has substantial uncertainties in the independent variable (the x variable), then simple regression and least squares methods have problems; in such cases, the methodology required for fitting errors-in-variables models may be considered instead of that for least squares.

Least squares problems fall into two categories: linear or ordinary least squares and non-linear least squares, depending on whether or not the residuals are linear in all unknowns. The linear least-squares problem occurs in statistical regression analysis; it has a closed-form solution. The non-linear problem is usually solved by iterative refinement; at each iteration the system is approximated by a linear one, and thus the core calculation is similar in both cases.

Figure: The result of fitting a set of data points with a quadratic function.

Polynomial least squares describes the variance in a prediction of the dependent variable as a function of the independent variable and the deviations from the fitted curve.

When the observations come from an exponential family and mild conditions are satisfied, least-squares estimates and maximum-likelihood estimates are identical.[1] The method of least squares can also be derived as a method of moments estimator.

The following discussion is mostly presented in terms of linear functions, but the use of least squares is valid and practical for more general families of functions. Also, by iteratively applying local quadratic approximation to the likelihood (through the Fisher information), the least-squares method may be used to fit a generalized linear model.

The least-squares method is usually credited to Carl Friedrich Gauss (1795),[2] but it was first published by Adrien-Marie Legendre.[3]

16.1 History

16.1.1 Context

The method of least squares grew out of the fields of astronomy and geodesy as scientists and mathematicians sought to provide solutions to the challenges of navigating the Earth's oceans during the Age of Exploration. The accurate description of the behavior of celestial bodies was the key to enabling ships to sail in open seas, where sailors could no longer rely on land sightings for navigation.
The combination of different observations taken under the same conditions, contrary to simply trying one's best to observe and record a single observation accurately. The approach was known as the method of averages. This approach was notably used by Tobias Mayer while studying the librations of the moon in 1750, and by Pierre-Simon Laplace in his work in explaining the differences in motion of Jupiter and Saturn in 1788.

The combination of different observations taken under different conditions. The method came to be known as the method of least absolute deviation. It was notably performed by Roger Joseph Boscovich in his work on the shape of the earth in 1757 and by Pierre-Simon Laplace for the same problem in 1799.

The development of a criterion that can be evaluated to determine when the solution with the minimum error has been achieved. Laplace tried to specify a mathematical form of the probability density for the errors and define a method of estimation that minimizes the error of estimation. For this purpose, Laplace used a symmetric two-sided exponential distribution we now call the Laplace distribution to model the error distribution, and used the sum of absolute deviation as error of estimation. He felt these to be the simplest assumptions he could make, and he had hoped to obtain the arithmetic mean as the best estimate. Instead, his estimator was the posterior median.

The first clear and concise exposition of the method of least squares was published by Legendre in 1805.[5] The technique is described as an algebraic procedure for fitting linear equations to data, and Legendre demonstrates the new method by analyzing the same data as Laplace for the shape of the earth. The value of Legendre's method of least squares was immediately recognized by leading astronomers and geodesists of the time.

In 1809 Carl Friedrich Gauss published his method of calculating the orbits of celestial bodies. In that work he claimed to have been in possession of the method of least squares since 1795. This naturally led to a priority dispute with Legendre. However, to Gauss's credit, he went beyond Legendre and succeeded in connecting the method of least squares with the principles of probability and to the normal distribution. He had managed to complete Laplace's program of specifying a mathematical form of the probability density for the observations, depending on a finite number of unknown parameters, and define a method of estimation that minimizes the error of estimation. Gauss showed that the arithmetic mean is indeed the best estimate of the location parameter by changing both the probability density and the method of estimation. He then turned the problem around by asking what form the density should have and what method of estimation should be used to get the arithmetic mean as estimate of the location parameter. In this attempt, he invented the normal distribution.

An early demonstration of the strength of Gauss's method came when it was used to predict the future location of the newly discovered asteroid Ceres. On 1 January 1801, the Italian astronomer Giuseppe Piazzi discovered
Ceres and was able to track its path for 40 days before it was lost in the glare of the sun. Based on this data, astronomers desired to determine the location of Ceres after it emerged from behind the sun without solving Kepler's complicated nonlinear equations of planetary motion. The only predictions that successfully allowed Hungarian astronomer Franz Xaver von Zach to relocate Ceres were those performed by the 24-year-old Gauss using least-squares analysis.

In 1810, after reading Gauss's work, Laplace, after proving the central limit theorem, used it to give a large-sample justification for the method of least squares and the normal distribution. In 1822, Gauss was able to state that the least-squares approach to regression analysis is optimal in the sense that in a linear model where the errors have a mean of zero, are uncorrelated, and have equal variances, the best linear unbiased estimator of the coefficients is the least-squares estimator. This result is known as the Gauss–Markov theorem.

The idea of least-squares analysis was also independently formulated by the American Robert Adrain in 1808. In the next two centuries workers in the theory of errors and in statistics found many different ways of implementing least squares.[6]

16.2 Problem statement

The objective consists of adjusting the parameters of a model function to best fit a data set. A simple data set consists of n points (data pairs) (x_i, y_i), i = 1, ..., n, where x_i is an independent variable and y_i is a dependent variable whose value is found by observation. The model function has the form f(x, β), where m adjustable parameters are held in the vector β. The goal is to find the parameter values for the model which best fit the data. The least squares method finds its optimum when the sum, S, of squared residuals

S = \sum_{i=1}^{n} r_i^2

is a minimum. A residual is defined as the difference between the actual value of the dependent variable and the value predicted by the model. A data point may consist of more than one independent variable; for example, when fitting a plane to a set of height measurements, the plane is a function of two independent variables, x and z, say. In the most general case there may be one or more independent variables and one or more dependent variables at each data point.

16.3 Limitations

This regression formulation considers only residuals in the dependent variable. There are two rather different contexts in which different implications apply:

Regression for prediction. Here a model is fitted to provide a prediction rule for application in a similar situation to which the data used for fitting apply. Here the dependent variables corresponding to such future application would be subject to the same types of observation error as those in the data used for fitting. It is therefore logically consistent to use the least-squares prediction rule for such data.

Regression for fitting a "true relationship". In standard regression analysis, that leads to fitting by least squares, there is an implicit assumption that errors in the independent variable are zero or strictly controlled so as to be negligible. When errors in the independent variable are non-negligible, models of measurement error can be used; such methods can lead to parameter estimates, hypothesis testing and confidence intervals that take into account the presence of observation errors in the independent variables.[7] An alternative approach is to fit a model by total least squares; this can be viewed as taking a pragmatic approach to balancing the effects of the different sources of error in formulating an objective function for use in model-fitting.

16.4 Solving the least squares problem

The minimum of the sum of squares is found by setting the gradient to zero. Since the model contains m parameters, there are m gradient equations:

\frac{\partial S}{\partial \beta_j} = 2 \sum_i r_i \frac{\partial r_i}{\partial \beta_j} = 0, \quad j = 1, \dots, m.

16.4.1 Linear least squares

A regression model is linear when the model comprises a linear combination of the parameters, i.e.,

f(x, \beta) = \sum_{j=1}^{m} \beta_j \varphi_j(x),

where the function \varphi_j is a function of x. Letting

X_{ij} = \frac{\partial f(x_i, \beta)}{\partial \beta_j} = \varphi_j(x_i),

we can then see that in that case the least squares estimate (or estimator, in the context of a random sample) is given by

\hat{\beta} = (X^T X)^{-1} X^T y.

For a derivation of this estimate see Linear least squares (mathematics).

16.4.2 Non-linear least squares

Main article: Non-linear least squares

There is no closed-form solution to a non-linear least squares problem. Instead, numerical algorithms are used to find the value of the parameters β that minimizes the objective. Most algorithms involve choosing initial values for the parameters. Then, the parameters are refined iteratively, that is, the values are obtained by successive approximation:

\beta_j^{k+1} = \beta_j^{k} + \Delta \beta_j,

where k is an iteration number, and the vector of increments Δβ is called the shift vector. In some commonly used algorithms, at each iteration the model may be linearized by approximation to a first-order Taylor series expansion about β^k:

f(x_i, \beta) \approx f^{k}(x_i, \beta) + \sum_j \frac{\partial f(x_i, \beta)}{\partial \beta_j} (\beta_j - \beta_j^{k}) = f^{k}(x_i, \beta) + \sum_j J_{ij} \, \Delta\beta_j.

The Jacobian J is a function of constants, the independent variable and the parameters, so it changes from one iteration to the next. The residuals are given by

r_i = \Delta y_i - \sum_{k=1}^{m} J_{ik} \, \Delta\beta_k, \qquad \Delta y_i = y_i - f^{k}(x_i, \beta),

and setting the gradient of the sum of their squares to zero gives

-2 \sum_{i=1}^{n} J_{ij} \left( \Delta y_i - \sum_{k=1}^{m} J_{ik} \, \Delta\beta_k \right) = 0,

which, on rearrangement, become m simultaneous linear equations, the normal equations:

\sum_{i=1}^{n} \sum_{k=1}^{m} J_{ij} J_{ik} \, \Delta\beta_k = \sum_{i=1}^{n} J_{ij} \, \Delta y_i \quad (j = 1, \dots, m).

The normal equations are written in matrix notation as

(J^T J) \, \Delta\beta = J^T \, \Delta y.

These are the defining equations of the Gauss–Newton algorithm.

16.4.3 Differences between linear and non-linear least squares

The model function, f, in LLSQ (linear least squares) is a linear combination of parameters of the form f = X_{i1} β_1 + X_{i2} β_2 + ⋯ The model may represent a straight line, a parabola or any other linear combination of functions. In NLLSQ (non-linear least squares) the parameters appear as functions, such as β^2, e^{βx} and so forth. If the derivatives ∂f/∂β_j are either constant or depend only on the values of the independent variable, the model is linear in the parameters. Otherwise the model is non-linear.

Algorithms for finding the solution to a NLLSQ problem require initial values for the parameters; LLSQ does not.

Like LLSQ, solution algorithms for NLLSQ often require that the Jacobian be calculated. Analytical expressions for the partial derivatives can be complicated. If analytical expressions are impossible to obtain, either the partial derivatives must be calculated by numerical approximation or an estimate must be made of the Jacobian.

In NLLSQ non-convergence (failure of the algorithm to find a minimum) is a common phenomenon, whereas the LLSQ objective is globally convex, so non-convergence is not an issue.
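As an illustration of the linear case above, the following sketch (Python with NumPy; the quadratic test data are an assumption of the example) builds the design matrix X from basis functions φ_j, forms the normal equations X^T X β = X^T y, and compares the result with NumPy's built-in least-squares solver.

import numpy as np

# Synthetic data from y = 1 + 2x - 0.5x^2 plus noise.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 4.0, 30)
y = 1.0 + 2.0 * x - 0.5 * x ** 2 + 0.05 * rng.normal(size=x.size)

# Basis functions phi_j(x): 1, x, x^2  -> columns of the design matrix X.
X = np.column_stack([np.ones_like(x), x, x ** 2])

# Solve the normal equations (X^T X) beta = X^T y.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent (and numerically preferable) library routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(beta_normal, 3))  # approximately [1, 2, -0.5]
print(np.round(beta_lstsq, 3))

In practice the normal equations are often avoided in favour of QR- or SVD-based solvers (as np.linalg.lstsq does), because forming X^T X squares the condition number of the problem; for non-linear models the same linear solve appears inside each Gauss–Newton iteration, applied to the Jacobian J and the current residuals.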
The denominator, n − m, is the number of statistical degrees of freedom; see effective degrees of freedom for generalizations.

Confidence limits can be found if the probability distribution of the parameters is known, or an asymptotic approximation is made, or assumed. Likewise statistical tests on the residuals can be made if the probability distribution of the residuals is known or assumed. The probability distribution of any linear combination of the dependent variables can be derived if the probability distribution of experimental errors is known or assumed. Inference is particularly straightforward if the errors are assumed to follow a normal distribution, which implies that the parameter estimates and residuals will also be normally distributed conditional on the values of the independent variables.

16.6 Weighted least squares

See also: Weighted mean and Linear least squares (mathematics), Weighted linear least squares

A special case of generalized least squares called weighted least squares occurs when all the off-diagonal entries of Ω (the correlation matrix of the residuals) are null; the variances of the observations (along the covariance matrix diagonal) may still be unequal (heteroskedasticity).

The expressions given above are based on the implicit assumption that the errors are uncorrelated with each other and with the independent variables and have equal variance. The Gauss–Markov theorem shows that, when this is so, \hat{\beta} is a best linear unbiased estimator (BLUE). If, however, the measurements are uncorrelated but have different uncertainties, a modified approach might be adopted. Aitken showed that when a weighted sum of squared residuals is minimized, \hat{\beta} is the BLUE if each weight is equal to the reciprocal of the variance of the measurement:

S = \sum_{i=1}^{n} W_{ii} \, r_i^2, \qquad W_{ii} = \frac{1}{\sigma_i^2}.

The gradient equations for this sum of squares are modified accordingly. When the observational errors are uncorrelated and the weight matrix, W, is diagonal, these may be written as

(X^T W X) \hat{\beta} = X^T W y.

If the errors are correlated, the resulting estimator is the BLUE if the weight matrix is equal to the inverse of the variance-covariance matrix of the observations.

When the errors are uncorrelated, it is convenient to simplify the calculations to factor the weight matrix as w_{ii} = \sqrt{W_{ii}}. The normal equations can then be written as

(X'^T X') \hat{\beta} = X'^T y',

where

X' = wX, \qquad y' = wy.

For non-linear least squares systems a similar argument shows that the normal equations should be modified as follows:

(J^T W J) \, \Delta\beta = J^T W \, \Delta y.

Note that for empirical tests, the appropriate W is not known for sure and must be estimated. For this, feasible generalized least squares (FGLS) techniques may be used.
16.7 Relationship to principal components

The first principal component about the mean of a set of points can be represented by that line which most closely approaches the data points (as measured by squared distance of closest approach, i.e. perpendicular to the line). In contrast, linear least squares tries to minimize the distance in the y direction only. Thus, although the two use a similar error metric, linear least squares is a method that treats one dimension of the data preferentially, while PCA treats all dimensions equally.
Chapter 17

Newton's method

In numerical analysis, Newton's method (also known as the Newton–Raphson method) is a method for finding successively better approximations to the roots (or zeroes) of a real-valued function:

x : f(x) = 0.

The Newton–Raphson method in one variable is implemented as follows: given a function f defined over the reals x, and its derivative f', we begin with a first guess x_0 for a root of the function f. Provided the function satisfies all the assumptions made in the derivation of the formula, a better approximation x_1 is

x_1 = x_0 - \frac{f(x_0)}{f'(x_0)}.

Geometrically, (x_1, 0) is the intersection with the x-axis of the tangent to the graph of f at (x_0, f(x_0)).

The process is repeated as

x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}

until a sufficiently accurate value is reached.

This algorithm is first in the class of Householder's methods, succeeded by Halley's method. The method can also be extended to complex functions and to systems of equations.

17.1 Description

Figure: The function f is shown in blue and the tangent line is in red. We see that x_{n+1} is a better approximation than x_n for the root x of the function f.

The idea of the method is as follows: one starts with an initial guess which is reasonably close to the true root, then the function is approximated by its tangent line (which can be computed using the tools of calculus), and one computes the x-intercept of this tangent line (which is easily done with elementary algebra). This x-intercept will typically be a better approximation to the function's root than the original guess, and the method can be iterated.

Suppose f : [a, b] → R is a differentiable function defined on the interval [a, b] with values in the real numbers R. The formula for converging on the root can be easily derived. Suppose we have some current approximation x_n. Then we can derive the formula for a better approximation, x_{n+1}, by referring to the diagram above. The equation of the tangent line to the curve y = f(x) at the point x = x_n is

y = f'(x_n)(x - x_n) + f(x_n),

where f' denotes the derivative of the function f. The x-intercept of this line (the value of x such that y = 0) is then used as the next approximation to the root, x_{n+1}. In other words, setting y to zero and x to x_{n+1} gives

0 = f'(x_n)(x_{n+1} - x_n) + f(x_n).

Solving for x_{n+1} gives

x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}.
The remainder is

R_1 = \frac{1}{2!} f''(\xi_n) (\alpha - x_n)^2,

where ξ_n is in between x_n and α. Since α is the root, (1) becomes:

\underbrace{\alpha - x_{n+1}}_{\varepsilon_{n+1}} = \frac{-f''(\xi_n)}{2 f'(x_n)} \, \underbrace{(\alpha - x_n)^2}_{\varepsilon_n^{\,2}}.

That is,

\varepsilon_{n+1} = \frac{-f''(\xi_n)}{2 f'(x_n)} \, \varepsilon_n^{2},

where M is the supremum of the variable coefficient of ε_n² on the interval I defined in condition 1, that is:

M = \sup_{x \in I} \frac{1}{2} \left| \frac{f''(x)}{f'(x)} \right|.

The initial point x_0 has to be chosen such that conditions 1 through 3 are satisfied, where the third condition requires that M|ε_0| < 1.

Figure: The tangent lines of x³ − 2x + 2 at 0 and 1 intersect the x-axis at 1 and 0 respectively, illustrating why Newton's method oscillates between these values for some starting points.

For some functions, some starting points may enter an infinite cycle, preventing convergence. Let

f(x) = x^3 - 2x + 2

and take 0 as the starting point. The first iteration produces 1 and the second iteration returns to 0, so the sequence will alternate between the two without converging to a root. In fact, this 2-cycle is stable: there are neighborhoods around 0 and around 1 from which all points iterate asymptotically to the 2-cycle (and hence not to the root of the function). In general, the behavior of the sequence can be very complex (see Newton fractal).

The method can also be sensitive to the choice of starting point, with nearby starting values converging to different roots; for instance:

2.35287527 converges to 4;
2.35284172 converges to −3;
2.35283735 converges to 4;
2.352836327 converges to −3;
2.352836323 converges to 1.

If the derivative does not exist at the root, convergence can fail altogether. For f(x) = x^{1/3}, the Newton iteration gives

x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} = x_n - \frac{x_n^{1/3}}{\tfrac{1}{3} x_n^{-2/3}} = x_n - 3x_n = -2x_n.

The algorithm overshoots the solution and lands on the other side of the y-axis, farther away than it initially was; applying Newton's method actually doubles the distance from the solution at each iteration. In fact, the iterations diverge to infinity for every f(x) = |x|^α, where 0 < α < 1/2. In the limiting case of α = 1/2 (square root), the iterations will alternate indefinitely between the points x_0 and −x_0, so they do not converge in this case either.

Discontinuous derivative

If the derivative is not continuous at the root, then convergence may fail to occur in any neighborhood of the root. Consider the function

f(x) = \begin{cases} 0 & \text{if } x = 0, \\ x + x^2 \sin\!\left(\tfrac{2}{x}\right) & \text{if } x \neq 0. \end{cases}

Its derivative is:

f'(x) = \begin{cases} 1 & \text{if } x = 0, \\ 1 + 2x \sin\!\left(\tfrac{2}{x}\right) - 2 \cos\!\left(\tfrac{2}{x}\right) & \text{if } x \neq 0. \end{cases}

Within any neighborhood of the root, this derivative keeps changing sign as x approaches 0 from the right (or from the left), while f(x) ≥ x − x² > 0 for 0 < x < 1. So f(x)/f'(x) is unbounded near the root, and Newton's method will diverge almost everywhere in any neighborhood of it, even though:
the function is differentiable (and thus continuous) everywhere;

the derivative at the root is nonzero;

f is infinitely differentiable except at the root; and

the derivative is bounded in a neighborhood of the root (unlike f(x)/f'(x)).

17.5.3 Non-quadratic convergence

In some cases the iterates converge but do not converge as quickly as promised. In these cases simpler methods converge just as quickly as Newton's method. Examples include f(x) = x², for which the derivative vanishes at the root, and f(x) = x²(x − 1000) + 1.

If there is no second derivative at the root, then convergence may fail to be quadratic. Indeed, let

f(x) = x + x^{4/3}.

Then

f'(x) = 1 + \tfrac{4}{3} x^{1/3},

and

f''(x) = \tfrac{4}{9} x^{-2/3},

except when x = 0, where it is undefined. Given x_n,

x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} = \frac{\tfrac{1}{3} x_n^{4/3}}{1 + \tfrac{4}{3} x_n^{1/3}},

which has approximately 4/3 times as many bits of precision as x_n has. This is less than the 2 times as many which would be required for quadratic convergence. So the convergence of Newton's method (in this case) is not quadratic, even though: the function is continuously differentiable everywhere; the derivative is not zero at the root; and f is infinitely differentiable except at the desired root.

17.6 Generalizations

17.6.1 Complex functions

Main article: Newton fractal

When dealing with complex functions, Newton's method can be directly applied to find their zeroes. Each zero has a basin of attraction in the complex plane, the set of all starting values that cause the method to converge to that particular zero. These sets can be mapped as in the image shown. For many complex functions, the boundaries of the basins of attraction are fractals.

In some cases there are regions in the complex plane which are not in any of these basins of attraction, meaning the iterates do not converge. For example,[4] if one uses a real initial condition to seek a root of x² + 1, all subsequent iterates will be real numbers and so the iterations cannot converge to either root, since both roots are
non-real. In this case almost all real initial conditions lead to chaotic behavior, while some initial conditions iterate either to infinity or to repeating cycles of any finite length.

17.6.2 Nonlinear systems of equations

k variables, k functions

17.6.4 Nonlinear equations over p-adic numbers

In p-adic analysis, the standard method to show a polynomial equation in one variable has a p-adic root is Hensel's lemma, which uses the recursion from Newton's method on the p-adic numbers. Because of the more stable behavior of addition and multiplication in the p-adic numbers compared to the real numbers (specifically, the unit ball in the p-adics is a ring), convergence in Hensel's lemma can be guaranteed under much simpler hypotheses than in the classical Newton's method on the real line.

Newton's method can be used to find a minimum or maximum of a function. The derivative is zero at a minimum or maximum, so minima and maxima can be found by applying Newton's method to the derivative. The iteration becomes:

x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}.
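A brief sketch of this use of the iteration (Python; the quartic objective is an assumption of the example): applying Newton's method to f' locates a stationary point of f, here a minimum.

def newton_minimize(fprime, fsecond, x0, tol=1e-10, max_iter=50):
    # Newton's method applied to f': x_{n+1} = x_n - f'(x_n) / f''(x_n)
    x = x0
    for _ in range(max_iter):
        step = fprime(x) / fsecond(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example objective f(x) = x^4 - 3x^2 + 2, with
# f'(x) = 4x^3 - 6x and f''(x) = 12x^2 - 6.
fprime = lambda x: 4 * x ** 3 - 6 * x
fsecond = lambda x: 12 * x ** 2 - 6

print(newton_minimize(fprime, fsecond, x0=1.0))  # ~1.2247 = sqrt(3/2), a local minimum

Note that the same iteration converges to maxima or saddle points as readily as to minima; in practice one checks the sign of f'' at the result or uses a safeguarded variant.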
17.7.3 Solving transcendental equations

Many transcendental equations can be solved using Newton's method. Given the equation

g(x) = h(x),

with g(x) and/or h(x) a transcendental function, one writes

f(x) = g(x) - h(x),

and applies Newton's method to f.

17.8 Examples

Consider the problem of finding the positive number x with cos(x) = x³. We can rephrase that as finding the zero of f(x) = cos(x) − x³. We have f'(x) = −sin(x) − 3x². Since cos(x) ≤ 1 for all x and x³ > 1 for x > 1, we know that our solution lies between 0 and 1. We try a starting value of x_0 = 0.5. (Note that a starting value of 0 will lead to an undefined result, showing the importance of using a starting point that is close to the solution.)
The following is an example of using Newton's method to help find a root of a function f which has derivative fprime.

The initial guess will be x_0 = 1 and the function will be f(x) = x² − 2, so that f'(x) = 2x.

Each new iterate of Newton's method will be denoted by x1. We will check during the computation whether the denominator (yprime) becomes too small (smaller than epsilon), which would be the case if f'(x_n) ≈ 0, since otherwise a large amount of error could be introduced.

%These choices depend on the problem being solved
x0 = 1                      %The initial value
f = @(x) x^2 - 2            %The function whose root we are trying to find
fprime = @(x) 2*x           %The derivative of f(x)
tolerance = 10^(-7)         %7 digit accuracy is desired
epsilon = 10^(-14)          %Don't want to divide by a number smaller than this
maxIterations = 20          %Don't allow the iterations to continue indefinitely
haveWeFoundSolution = false %Have not converged to a solution yet

for i = 1 : maxIterations
    y = f(x0)
    yprime = fprime(x0)

    if(abs(yprime) < epsilon)   %Don't want to divide by too small of a number
        break;                  %Denominator is too small, so leave the loop
    end

    x1 = x0 - y/yprime          %Do Newton's computation

    if(abs(x1 - x0)/abs(x1) < tolerance)  %If the result is within the desired tolerance
        haveWeFoundSolution = true
        break;                  %Done, so leave the loop
    end

    x0 = x1                     %Update x0 to start the process again
end

if (haveWeFoundSolution)
    ... % x1 is a solution within tolerance and maximum number of iterations
else
    ... % did not converge
end
practical aspects. Universitext (Second revised
ed. of translation of 1997 French ed.). Berlin:
Springer-Verlag. pp. xiv+490. doi:10.1007/978-
17.10 See also 3-540-35447-5. ISBN 3-540-35445-X. MR
2265882.
Aitkens delta-squared process
P. Deuhard, Newton Methods for Nonlinear Prob-
Bisection method lems. Ane Invariance and Adaptive Algorithms.
Springer Series in Computational Mathematics, Vol.
Euler method 35. Springer, Berlin, 2004. ISBN 3-540-21099-7.
Fast inverse square root C. T. Kelley, Solving Nonlinear Equations with New-
tons Method, no 1 in Fundamentals of Algorithms,
Fisher scoring SIAM, 2003. ISBN 0-89871-546-6.
Gradient descent J. M. Ortega, W. C. Rheinboldt, Iterative Solution
Integer square root of Nonlinear Equations in Several Variables. Clas-
sics in Applied Mathematics, SIAM, 2000. ISBN
Laguerres method 0-89871-461-3.
Leonid Kantorovich, who initiated the convergence Press, WH; Teukolsky, SA; Vetterling, WT; Flan-
analysis of Newtons method in Banach spaces. nery, BP (2007). Chapter 9. Root Finding and
Nonlinear Sets of Equations Importance Sampling.
Methods of computing square roots Numerical Recipes: The Art of Scientic Computing
Newtons method in optimization (3rd ed.). New York: Cambridge University Press.
ISBN 978-0-521-88068-8.. See especially Sections
Richardson extrapolation 9.4, 9.6, and 9.7.
Chapter 18

Supervised learning

See also: Unsupervised learning

Supervised learning is the machine learning task of inferring a function from labeled training data.[1] The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way (see inductive bias).

The parallel task in human and animal psychology is often referred to as concept learning.

Solving a given problem of supervised learning involves a sequence of steps, the later of which include the following (a small end-to-end sketch is given after this list):

3. Determine the input feature representation of the learned function. The number of features should not be too large, because of the curse of dimensionality, but should contain enough information to accurately predict the output.

4. Determine the structure of the learned function and corresponding learning algorithm. For example, the engineer may choose to use support vector machines or decision trees.

5. Complete the design. Run the learning algorithm on the gathered training set. Some supervised learning algorithms require the user to determine certain control parameters. These parameters may be adjusted by optimizing performance on a subset (called a validation set) of the training set, or via cross-validation.

6. Evaluate the accuracy of the learned function. After parameter adjustment and learning, the performance of the resulting function should be measured on a test set that is separate from the training set.
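The following sketch illustrates steps 4–6 above using Python with scikit-learn (the library choice, the synthetic data, and the decision-tree learner are assumptions of this example, not something prescribed by the text): a model structure is chosen, a control parameter is tuned on a validation split, and accuracy is finally measured on a held-out test set.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Labeled examples: input vectors X with desired outputs y.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Split off a test set, then carve a validation set out of the training data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

# Steps 4-5: choose a model structure and tune a control parameter on the validation set.
best_depth, best_acc = None, 0.0
for depth in (2, 4, 8, None):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_fit, y_fit)
    acc = accuracy_score(y_val, model.predict(X_val))
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# Step 6: retrain on the full training set and evaluate once on the separate test set.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print("chosen max_depth:", best_depth)
print("test accuracy:", accuracy_score(y_test, final_model.predict(X_test)))

Cross-validation (as mentioned in step 5) would replace the single validation split with several folds; the overall structure of the workflow is unchanged.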
be flexible so that it can fit the data well. But if the learning algorithm is too flexible, it will fit each training data set differently, and hence have high variance. A key aspect of many supervised learning methods is that they are able to adjust this tradeoff between bias and variance (either automatically or by providing a bias/variance parameter that the user can adjust).

18.1.2 Function complexity and amount of training data

The second issue is the amount of training data available relative to the complexity of the "true" function (classifier or regression function). If the true function is simple, then an "inflexible" learning algorithm with high bias and low variance will be able to learn it from a small amount of data. But if the true function is highly complex (e.g., because it involves complex interactions among many different input features and behaves differently in different parts of the input space), then the function will only be learnable from a very large amount of training data and using a "flexible" learning algorithm with low bias and high variance. Good learning algorithms therefore automatically adjust the bias/variance tradeoff based on the amount of data available and the apparent complexity of the function to be learned.

18.1.3 Dimensionality of the input space

A third issue is the dimensionality of the input space. If the input feature vectors have very high dimension, the learning problem can be difficult even if the true function only depends on a small number of those features. This is because the many "extra" dimensions can confuse the learning algorithm and cause it to have high variance. Hence, high input dimensionality typically requires tuning the classifier to have low variance and high bias. In practice, if the engineer can manually remove irrelevant features from the input data, this is likely to improve the accuracy of the learned function. In addition, there are many algorithms for feature selection that seek to identify the relevant features and discard the irrelevant ones. This is an instance of the more general strategy of dimensionality reduction, which seeks to map the input data into a lower-dimensional space prior to running the supervised learning algorithm.

18.1.4 Noise in the output values

A fourth issue is the degree of noise in the desired output values (the supervisory target variables). If the desired output values are often incorrect (because of human error or sensor errors), then the learning algorithm should not attempt to find a function that exactly matches the training examples. Attempting to fit the data too carefully leads to overfitting. You can overfit even when there are no measurement errors (stochastic noise) if the function you are trying to learn is too complex for your learning model. In such a situation, the part of the target function that cannot be modeled "corrupts" your training data; this phenomenon has been called deterministic noise. When either type of noise is present, it is better to go with a higher bias, lower variance estimator.

In practice, there are several approaches to alleviate noise in the output values, such as early stopping to prevent overfitting as well as detecting and removing the noisy training examples prior to training the supervised learning algorithm. There are several algorithms that identify noisy training examples, and removing the suspected noisy training examples prior to training has decreased generalization error with statistical significance.[4][5]

18.1.5 Other factors to consider

Other factors to consider when choosing and applying a learning algorithm include the following:

1. Heterogeneity of the data. If the feature vectors include features of many different kinds (discrete, discrete ordered, counts, continuous values), some algorithms are easier to apply than others. Many algorithms, including Support Vector Machines, linear regression, logistic regression, neural networks, and nearest neighbor methods, require that the input features be numerical and scaled to similar ranges (e.g., to the [−1, 1] interval; a scaling sketch follows this list). Methods that employ a distance function, such as nearest neighbor methods and support vector machines with Gaussian kernels, are particularly sensitive to this. An advantage of decision trees is that they easily handle heterogeneous data.

2. Redundancy in the data. If the input features contain redundant information (e.g., highly correlated features), some learning algorithms (e.g., linear regression, logistic regression, and distance based methods) will perform poorly because of numerical instabilities. These problems can often be solved by imposing some form of regularization.

3. Presence of interactions and non-linearities. If each of the features makes an independent contribution to the output, then algorithms based on linear functions (e.g., linear regression, logistic regression, Support Vector Machines, naive Bayes) and distance functions (e.g., nearest neighbor methods, support vector machines with Gaussian kernels) generally perform well. However, if there are complex interactions among features, then algorithms such as decision trees and neural networks work better, because they are specifically designed to discover these interactions. Linear methods can also be applied, but the engineer must manually specify the interactions when using them.
algorithm will have high bias and low variance. The value of λ can be chosen empirically via cross-validation.

The complexity penalty has a Bayesian interpretation as the negative log prior probability of g, −log P(g), in which case J(g) is the posterior probability of g.

18.3 Generative training

The training methods described above are discriminative training methods, because they seek to find a function g that discriminates well between the different output values (see discriminative model). For the special case where f(x, y) = P(x, y) is a joint probability distribution and the loss function is the negative log likelihood −Σ_i log P(x_i, y_i), a risk minimization algorithm is said to perform generative training, because f can be regarded as a generative model that explains how the data were generated. Generative training algorithms are often simpler and more computationally efficient than discriminative training algorithms. In some cases, the solution can be computed in closed form as in naive Bayes and linear discriminant analysis.

There are several ways in which the standard supervised learning problem can be generalized:

1. Semi-supervised learning: In this setting, the desired output values are provided only for a subset of the training data. The remaining data is unlabeled.

2. Active learning: Instead of assuming that all of the training examples are given at the start, active learning algorithms interactively collect new examples, typically by making queries to a human user. Often, the queries are based on unlabeled data, which is a scenario that combines semi-supervised learning with active learning.

Approaches and algorithms used for supervised learning include:

Artificial neural network

Backpropagation

Boosting (meta-algorithm)

Bayesian statistics

Case-based reasoning

Decision tree learning

Inductive logic programming

Gaussian process regression

Group method of data handling

Kernel estimators

Learning Automata

Minimum message length (decision trees, decision graphs, etc.)

Nearest Neighbor Algorithm

Probably approximately correct (PAC) learning

Ripple down rules, a knowledge acquisition methodology

Symbolic machine learning algorithms

Subsymbolic machine learning algorithms

Support vector machines
18.8 References
[1] Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar
(2012) Foundations of Machine Learning, The MIT Press
ISBN 9780262018258.
Chapter 19

Linear regression

In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression.[1] (This term should be distinguished from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.)[2]

In linear regression, data are modeled using linear predictor functions, and unknown model parameters are estimated from the data. Such models are called linear models.[3] Most commonly, linear regression refers to a model in which the conditional mean of y given the value of X is an affine function of X. Less commonly, linear regression could refer to a model in which the median, or some other quantile of the conditional distribution of y given X, is expressed as a linear function of X. Like all forms of regression analysis, linear regression focuses on the conditional probability distribution of y given X, rather than on the joint probability distribution of y and X, which is the domain of multivariate analysis.

Linear regression was the first type of regression analysis to be studied rigorously, and to be used extensively in practical applications.[4] This is because models which depend linearly on their unknown parameters are easier to fit than models which are non-linearly related to their parameters and because the statistical properties of the resulting estimators are easier to determine.

Linear regression has many practical uses. Most applications fall into one of the following two broad categories:

If the goal is prediction, or forecasting, or reduction, linear regression can be used to fit a predictive model to an observed data set of y and X values. After developing such a model, if an additional value of X is then given without its accompanying value of y, the fitted model can be used to make a prediction of the value of y.

Given a variable y and a number of variables X_1, ..., X_p that may be related to y, linear regression analysis can be applied to quantify the strength of the relationship between y and the X_j, to assess which X_j may have no relationship with y at all, and to identify which subsets of the X_j contain redundant information about y.

Linear regression models are often fitted using the least squares approach, but they may also be fitted in other ways, such as by minimizing the "lack of fit" in some other norm (as with least absolute deviations regression), or by minimizing a penalized version of the least squares loss function as in ridge regression (L2-norm penalty) and lasso (L1-norm penalty). Conversely, the least squares approach can be used to fit models that are not linear models. Thus, although the terms "least squares" and "linear model" are closely linked, they are not synonymous.

19.1 Introduction to linear regression

Figure: Example of simple linear regression, which has one independent variable.

Given a data set {y_i, x_{i1}, ..., x_{ip}}_{i=1}^{n} of n statistical units, a linear regression model assumes that the relationship between the dependent variable y_i and the p-vector of regressors x_i is linear. This relationship is modeled through a disturbance term or error variable ε_i, an unobserved random variable that adds "noise" to the linear relationship between the dependent variable and regressors. Thus the model takes the form

y_i = \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i = x_i^T \beta + \varepsilon_i, \qquad i = 1, \dots, n.
x_{i1}, x_{i2}, ..., x_{ip} are called regressors, exogenous variables, explanatory variables, covariates, input variables, predictor variables, or independent variables (see dependent and independent variables, but not to be confused with independent random variables).

For example, consider modeling the measured heights h_i of a ball thrown in the air at times t_i as h_i = β_1 t_i + β_2 t_i^2 + ε_i. This model is non-linear in the time variable, but it is linear in the parameters β_1 and β_2; if we take regressors x_i = (x_{i1}, x_{i2}) = (t_i, t_i^2), the model takes on the standard form

h_i = x_i^T \beta + \varepsilon_i,

where β_1 determines the initial velocity of the ball, β_2 is proportional to the standard gravity, and ε_i is due to measurement errors. Linear regression can be used to estimate the values of β_1 and β_2 from the measured data.
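The ballistic example above can be fitted directly with ordinary least squares once the regressors (t_i, t_i^2) are assembled into a design matrix. A minimal sketch in Python with NumPy (the simulated measurements and the true parameter values are assumptions of the example):

import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0.2, 3.0, 25)

# True model: h = beta1 * t + beta2 * t^2 with beta1 = 15 (initial velocity)
# and beta2 = -4.9 (proportional to standard gravity), plus measurement noise.
h = 15.0 * t - 4.9 * t ** 2 + 0.1 * rng.normal(size=t.size)

# Design matrix with columns (t, t^2): the model is linear in beta.
X = np.column_stack([t, t ** 2])

beta_hat, *_ = np.linalg.lstsq(X, h, rcond=None)
print(np.round(beta_hat, 2))   # approximately [15.0, -4.9]

Although the fitted curve is a parabola in time, the estimation problem is linear in the parameters, which is exactly the point the example is making.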
Constant variance (homoscedasticity). Heteroscedasticity can sometimes be addressed by applying a transformation to the response variable (for example, by fitting the logarithm of the response variable using a linear regression model, which implies that the response variable has a log-normal distribution rather than a normal distribution).

Independence of errors. This assumes that the errors of the response variables are uncorrelated with each other. (Actual statistical independence is a stronger condition than mere lack of correlation and is often not needed, although it can be exploited if it is known to hold.) Some methods (e.g. generalized least squares) are capable of handling correlated errors, although they typically require significantly more data unless some sort of regularization is used to bias the model towards assuming uncorrelated errors. Bayesian linear regression is a general way of handling this issue.

Overfitting can be controlled with penalized estimation methods such as ridge regression and lasso regression. Bayesian linear regression can also be used, which by its nature is more or less immune to the problem of overfitting. (In fact, ridge regression and lasso regression can both be viewed as special cases of Bayesian linear regression, with particular types of prior distributions placed on the regression coefficients.)

The arrangement, or probability distribution, of the predictor variables x has a major influence on the precision of estimates of β. Sampling and design of experiments are highly developed subfields of statistics that provide guidance for collecting data in such a way as to achieve a precise estimate of β.
19.1.2 Interpretation

The meaning of the expression "held fixed" may depend on how the values of the predictor variables arise. If the experimenter directly sets the values of the predictor variables according to a study design, the comparisons of interest may literally correspond to comparisons among units whose predictor variables have been "held fixed" by the experimenter. Alternatively, the expression "held fixed" can refer to a selection that takes place in the context of data analysis. In this case, we "hold a variable fixed" by restricting our attention to the subsets of the data that happen to have a common value for the given predictor variable. This is the only interpretation of "held fixed" that can be used in an observational study.

The notion of a "unique effect" is appealing when studying a complex system where multiple interrelated components influence the response variable. In some cases, it can literally be interpreted as the causal effect of an intervention that is linked to the value of a predictor variable. However, it has been argued that in many cases multiple regression analysis fails to clarify the relationships between the predictor variables and the response variable when the predictors are correlated with each other and are not assigned following a study design.[9] A commonality analysis may be helpful in disentangling the shared and unique impacts of correlated independent variables.[10]
Generalized linear models (GLMs) are a framework for
19.2 Extensions modeling a response variable y that is bounded or dis-
crete. This is used, for example:
Numerous extensions of linear regression have been de-
veloped, which allow some or all of the assumptions un- when modeling positive quantities (e.g. prices or
derlying the basic model to be relaxed. populations) that vary over a large scalewhich are
better described using a skewed distribution such as
the log-normal distribution or Poisson distribution
19.2.1 Simple and multiple regression (although GLMs are not used for log-normal data,
instead the response variable is simply transformed
The very simplest case of a single scalar predictor vari- using the logarithm function);
able x and a single scalar response variable y is known as
simple linear regression. The extension to multiple and/or when modeling categorical data, such as the choice
vector-valued predictor variables (denoted with a capital of a given candidate in an election (which is bet-
X) is known as multiple linear regression, also known as ter described using a Bernoulli distribution/binomial
multivariable linear regression. Nearly all real-world re- distribution for binary choices, or a categorical
gression models involve multiple predictors, and basic de- distribution/multinomial distribution for multi-way
scriptions of linear regression are often phrased in terms choices), where there are a xed number of choices
of the multiple regression model. Note, however, that in that cannot be meaningfully ordered;
these cases the response variable y is still a scalar. An-
when modeling ordinal data, e.g. ratings on a scale
other term multivariate linear regression refers to cases
from 0 to 5, where the dierent outcomes can be
where y is a vector, i.e., the same as general linear regres-
ordered but where the quantity itself may not have
sion. The dierence between multivariate linear regres-
any absolute meaning (e.g. a rating of 4 may not be
sion and multivariable linear regression should be empha-
twice as good in any objective sense as a rating of
sized as it causes much confusion and misunderstanding
2, but simply indicates that it is better than 2 or 3
in the literature.
but not as good as 5).
19.2.2 General linear models Generalized linear models allow for an arbitrary link
function g that relates the mean of the response variable
The general linear model considers the situation when the to the predictors, i.e. E(y) = g(x). The link function is
response variable Y is not a scalar but a vector. Con- often related to the distribution of the response, and in
particular it typically has the effect of transforming between the (−∞, ∞) range of the linear predictor and the range of the response variable. Some common examples of GLMs are Poisson regression for count data, and logistic regression and probit regression for binary data.

19.2.5 Hierarchical linear models

Hierarchical linear models (or multilevel regression) organize the data into a hierarchy of regressions, for example where A is regressed on B, and B is regressed on C. It is often used where the data have a natural hierarchical structure, such as in educational statistics, where students are nested in classrooms, classrooms are nested in schools, and schools are nested in some administrative grouping, such as a school district. The response variable might be a measure of student achievement such as a test score, and different covariates would be collected at the classroom, school, and school district levels.

19.2.6 Errors-in-variables

Errors-in-variables models (or "measurement error models") extend the traditional linear regression model to allow the predictor variables X to be observed with error. This error causes standard estimators of β to become biased. Generally, the form of bias is an attenuation, meaning that the effects are biased toward zero.

19.2.7 Others

In Dempster–Shafer theory, or a linear belief function in particular, a linear regression model may be represented as a partially swept matrix, which can be combined with similar matrices representing observations and other assumed normal distributions and state equations. The combination of swept or unswept matrices provides an alternative method for estimating linear regression models.

19.3 Estimation methods

A large number of procedures have been developed for parameter estimation and inference in linear regression. These methods differ in computational simplicity of algorithms, presence of a closed-form solution, robustness with respect to heavy-tailed distributions, and theoretical assumptions needed to validate desirable statistical properties such as consistency and asymptotic efficiency. Some of the more common estimation techniques for linear regression are summarized below.

19.3.1 Least-squares estimation and related techniques

Ordinary least squares (OLS) is the simplest and thus most common estimator. It is conceptually simple and computationally straightforward. OLS estimates are commonly used to analyze both experimental and observational data. The OLS method minimizes the sum of squared residuals, and leads to a closed-form expression for the estimated value of the unknown parameter β:

β̂ = (XᵀX)⁻¹Xᵀy = (Σᵢ xᵢxᵢᵀ)⁻¹ (Σᵢ xᵢyᵢ).

The estimator is unbiased and consistent if the errors have finite variance and are uncorrelated with the regressors:[12]

E[xᵢ εᵢ] = 0.
The OLS estimator is also efficient under the assumption that the errors have finite variance and are homoscedastic, meaning that E[εᵢ²|xᵢ] does not depend on i. The condition that the errors are uncorrelated with the regressors will generally be satisfied in an experiment, but in the case of observational data it is difficult to exclude the possibility of an omitted covariate z that is related to both the observed covariates and the response variable. The existence of such a covariate will generally lead to a correlation between the regressors and the response variable, and hence to an inconsistent estimator of β. The condition of homoscedasticity can fail with either experimental or observational data. If the goal is either inference or predictive modeling, the performance of OLS estimates can be poor if multicollinearity is present, unless the sample size is large.

In simple linear regression, where there is only one regressor (with a constant), the OLS coefficient estimates have a simple form that is closely related to the correlation coefficient between the covariate and the response.

Generalized least squares (GLS) is an extension of the OLS method that allows efficient estimation of β when either heteroscedasticity, or correlations, or both are present among the error terms of the model, as long as the form of heteroscedasticity and correlation is known independently of the data. To handle heteroscedasticity when the error terms are uncorrelated with each other, GLS minimizes a weighted analogue of the sum of squared residuals from OLS regression, where the weight for the i-th case is inversely proportional to var(εᵢ). This special case of GLS is called weighted least squares. The GLS solution to the estimation problem is

β̂ = (XᵀΩ⁻¹X)⁻¹XᵀΩ⁻¹y,

where Ω is the covariance matrix of the errors. GLS can be viewed as applying a linear transformation to the data so that the assumptions of OLS are met for the transformed data. For GLS to be applied, the covariance structure of the errors must be known up to a multiplicative constant.

Percentage least squares focuses on reducing percentage errors, which is useful in the field of forecasting or time series analysis. It is also useful in situations where the dependent variable has a wide range without constant variance, as here the larger residuals at the upper end of the range would dominate if OLS were used. When the percentage or relative error is normally distributed, least squares percentage regression provides maximum likelihood estimates. Percentage regression is linked to a multiplicative error model, whereas OLS is linked to models containing an additive error term.[13]

Iteratively reweighted least squares (IRLS) is used when heteroscedasticity, or correlations, or both are present among the error terms of the model, but where little is known about the covariance structure of the errors independently of the data.[14] In the first iteration, OLS, or GLS with a provisional covariance structure, is carried out, and the residuals are obtained from the fit. Based on the residuals, an improved estimate of the covariance structure of the errors can usually be obtained. A subsequent GLS iteration is then performed using this estimate of the error structure to define the weights. The process can be iterated to convergence, but in many cases only one iteration is sufficient to achieve an efficient estimate of β.[15][16]

Instrumental variables regression (IV) can be performed when the regressors are correlated with the errors. In this case, we need the existence of some auxiliary instrumental variables zᵢ such that E[zᵢεᵢ] = 0. If Z is the matrix of instruments, then the estimator can be given in closed form as

β̂ = (XᵀZ(ZᵀZ)⁻¹ZᵀX)⁻¹XᵀZ(ZᵀZ)⁻¹Zᵀy.

Optimal instruments regression is an extension of classical IV regression to the situation where E[εᵢ|zᵢ] = 0.

Total least squares (TLS)[17] is an approach to least squares estimation of the linear regression model that treats the covariates and response variable in a more geometrically symmetric manner than OLS. It is one approach to handling the "errors in variables" problem, and is also sometimes used even when the covariates are assumed to be error-free.

19.3.2 Maximum-likelihood estimation and related techniques

Maximum likelihood estimation can be performed when the distribution of the error terms is known to belong to a certain parametric family f of probability distributions.[18] When f is a normal distribution with zero mean and variance θ, the resulting estimate is identical to the OLS estimate. GLS estimates are maximum likelihood estimates when ε follows a multivariate normal distribution with a known covariance matrix.
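As an illustration of the point just made, the sketch below (Python/NumPy with SciPy's optimizer; data and parameter values are hypothetical, and the setup is an assumption for illustration only) maximizes the Gaussian log-likelihood numerically and confirms that the estimate of β coincides with the OLS solution.

```python
import numpy as np
from scipy.optimize import minimize

# With i.i.d. zero-mean normal errors, maximizing the likelihood over beta is
# equivalent to minimizing the sum of squared residuals, so the MLE equals OLS.
rng = np.random.default_rng(8)
n = 150
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=n)

def neg_log_likelihood(params):
    beta, log_sigma = params[:-1], params[-1]
    sigma2 = np.exp(2 * log_sigma)
    resid = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * np.sum(resid**2) / sigma2

beta_mle = minimize(neg_log_likelihood, x0=np.zeros(X.shape[1] + 1)).x[:-1]
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_mle, beta_ols)   # essentially identical
```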
Ridge regression,[19][20][21] and other forms of penalized estimation such as lasso regression,[5] deliberately introduce bias into the estimation of β in order to reduce the variability of the estimate. The resulting estimators generally have lower mean squared error than the OLS estimates, particularly when multicollinearity is present. They are generally used when the goal is to predict the value of the response variable y for values of the predictors x that have not yet been observed. These methods are not as commonly used when the goal is inference, since it is difficult to account for the bias.

Least absolute deviation (LAD) regression is a robust estimation technique in that it is less sensitive to the presence of outliers than OLS (but is less efficient than OLS when no outliers are present). It is equivalent to maximum likelihood estimation under a Laplace distribution model for ε.[22]

Adaptive estimation. If we assume that the error terms are independent of the regressors, εᵢ ⊥ xᵢ, the optimal estimator is the 2-step MLE, where the first step is used to non-parametrically estimate the distribution of the error term.[23]

19.3.3 Other estimation techniques

Bayesian linear regression applies the framework of Bayesian statistics to linear regression. (See also Bayesian multivariate linear regression.) In particular, the regression coefficients β are assumed to be random variables with a specified prior distribution. The prior distribution can bias the solutions for the regression coefficients, in a way similar to (but more general than) ridge regression or lasso regression. In addition, the Bayesian estimation process produces not a single point estimate for the "best" values of the regression coefficients but an entire posterior distribution, completely describing the uncertainty surrounding the quantity. This can be used to estimate the "best" coefficients using the mean, mode, median, any quantile (see quantile regression), or any other function of the posterior distribution.

Quantile regression focuses on the conditional quantiles of y given X rather than the conditional mean of y given X. Linear quantile regression models a particular conditional quantile, for example the conditional median, as a linear function βᵀx of the predictors.

Mixed models are widely used to analyze linear regression relationships involving dependent data when the dependencies have a known structure. Common applications of mixed models include analysis of data involving repeated measurements, such as longitudinal data, or data obtained from cluster sampling. They are generally fit as parametric models, using maximum likelihood or Bayesian estimation. In the case where the errors are modeled as normal random variables, there is a close connection between mixed models and generalized least squares.[24] Fixed effects estimation is an alternative approach to analyzing this type of data.

Principal component regression (PCR)[7][8] is used when the number of predictor variables is large, or when strong correlations exist among the predictor variables. This two-stage procedure first reduces the predictor variables using principal component analysis, then uses the reduced variables in an OLS regression fit. While it often works well in practice, there is no general theoretical reason that the most informative linear function of the predictor variables should lie among the dominant principal components of the multivariate distribution of the predictor variables. Partial least squares regression is an extension of the PCR method which does not suffer from the mentioned deficiency.

Least-angle regression[6] is an estimation procedure for linear regression models that was developed to handle high-dimensional covariate vectors, potentially with more covariates than observations.

The Theil–Sen estimator is a simple robust estimation technique that chooses the slope of the fit line to be the median of the slopes of the lines through pairs of sample points. It has similar statistical efficiency properties to simple linear regression but is much less sensitive to outliers.[25]

Other robust estimation techniques, including the α-trimmed mean approach, and L-, M-, S-, and R-estimators, have been introduced.

19.3.4 Further discussion

In statistics and numerical analysis, the problem of numerical methods for linear least squares is an important one because linear regression models are one of the most important types of model, both as formal statistical models and for exploration of data sets. The majority of statistical computer packages contain facilities for regression analysis that make use of linear least squares computations. Hence it is appropriate that considerable effort has been devoted to the task of ensuring that these computations are undertaken efficiently and with due regard to numerical precision.

Individual statistical analyses are seldom undertaken in isolation, but rather are part of a sequence of investigatory steps. Some of the topics involved in considering numerical methods for linear least squares relate to this point. Thus important topics can be:

Computations where a number of similar, and often nested, models are considered for the same data set. That is, where models with the same dependent variable but different sets of independent variables are to be considered, for essentially the same set of data points.

Computations for analyses that occur in a sequence, as the number of data points increases.

Special considerations for very extensive data sets.
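To illustrate the numerical-precision point, a small sketch (Python/NumPy, with a deliberately ill-conditioned synthetic design; all values are assumptions for illustration) compares solving the normal equations directly with a QR-based solve of the kind commonly used in statistical software.

```python
import numpy as np

# Solving the normal equations squares the condition number of X,
# while a QR factorization works on X directly.
rng = np.random.default_rng(3)
n = 500
x = rng.uniform(0, 1, n)
X = np.column_stack([np.ones(n), x, x + 1e-8 * rng.normal(size=n)])  # nearly collinear columns
y = X @ np.array([1.0, 2.0, 3.0]) + 1e-3 * rng.normal(size=n)

# Normal equations: fast but can lose accuracy when X is ill-conditioned.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# QR-based solve: a common choice in statistical packages.
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)
print(np.linalg.cond(X), beta_normal, beta_qr)
```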
Fitting of linear models by least squares often, but not always, arises in the context of statistical analysis. It can therefore be important that considerations of computational efficiency for such problems extend to all of the auxiliary quantities required for such analyses, and are not restricted to the formal solution of the linear least squares problem.

Matrix calculations, like any others, are affected by rounding errors. An early summary of these effects, regarding the choice of computational methods for matrix inversion, was provided by Wilkinson.[26]

19.3.5 Using Linear Algebra

It follows that one can find a "best" approximation of another function by minimizing the area between two functions, a continuous function f on [a, b] and a function g ∈ W, where W is a subspace of C[a, b]:

Area = ∫_a^b |f(x) − g(x)| dx,

within the subspace W. Due to the frequent difficulty of evaluating integrands involving absolute value, one can instead define

∫_a^b [f(x) − g(x)]² dx

as an adequate criterion for obtaining the least squares approximation, function g, of f with respect to the inner product space W.

As such, ‖f − g‖², or equivalently ‖f − g‖, can thus be written in vector form:

∫_a^b [f(x) − g(x)]² dx = ⟨f − g, f − g⟩ = ‖f − g‖².

In other words, the least squares approximation of f is the function g in the subspace W closest to f in terms of the inner product ⟨f, g⟩. Furthermore, this can be applied with a theorem:

Let f be continuous on [a, b], and let W be a finite-dimensional subspace of C[a, b]. The least squares approximating function of f with respect to W is given by

ĝ = ⟨f, w₁⟩w₁ + ⟨f, w₂⟩w₂ + ⋯ + ⟨f, wₙ⟩wₙ,

where B = {w₁, w₂, …, wₙ} is an orthonormal basis for W.

19.4 Applications of linear regression

Linear regression is widely used in biological, behavioral and social sciences to describe possible relationships between variables. It ranks as one of the most important tools used in these disciplines.

19.4.1 Trend line

Main article: Trend estimation

A trend line represents a trend, the long-term movement in time series data after other components have been accounted for. It tells whether a particular data set (say GDP, oil prices or stock prices) has increased or decreased over the period of time. A trend line could simply be drawn by eye through a set of data points, but more properly its position and slope are calculated using statistical techniques like linear regression. Trend lines typically are straight lines, although some variations use higher degree polynomials depending on the degree of curvature desired in the line.

Trend lines are sometimes used in business analytics to show changes in data over time. This has the advantage of being simple. Trend lines are often used to argue that a particular action or event (such as training, or an advertising campaign) caused observed changes at a point in time. This is a simple technique, and does not require a control group, experimental design, or a sophisticated analysis technique. However, it suffers from a lack of scientific validity in cases where other potential changes can affect the data.

19.4.2 Epidemiology

Early evidence relating tobacco smoking to mortality and morbidity came from observational studies employing regression analysis. In order to reduce spurious correlations when analyzing observational data, researchers usually include several variables in their regression models in addition to the variable of primary interest. For example, suppose we have a regression model in which cigarette smoking is the independent variable of interest, and the dependent variable is lifespan measured in years. Researchers might include socio-economic status as an additional independent variable, to ensure that any observed effect of smoking on lifespan is not due to some effect of education or income. However, it is never possible to include all possible confounding variables in an empirical analysis. For example, a hypothetical gene might increase mortality and also cause people to smoke more. For this reason, randomized controlled trials are often able to generate more compelling evidence of causal relationships than can be obtained using regression analyses of obser-
vational data. When controlled experiments are not feasible, variants of regression analysis such as instrumental variables regression may be used to attempt to estimate causal relationships from observational data.

19.4.3 Finance

The capital asset pricing model uses linear regression as well as the concept of beta for analyzing and quantifying the systematic risk of an investment. This comes directly from the beta coefficient of the linear regression model that relates the return on the investment to the return on all risky assets.

19.4.4 Economics

Main article: Econometrics

Linear regression is the predominant empirical tool in economics. For example, it is used to predict consumption spending,[27] fixed investment spending, inventory investment, purchases of a country's exports,[28] spending on imports,[28] the demand to hold liquid assets,[29] labor demand,[30] and labor supply.[30]

19.4.5 Environmental science

Linear regression finds application in a wide range of environmental science applications. In Canada, the Environmental Effects Monitoring Program uses statistical analyses on fish and benthic surveys to measure the effects of pulp mill or metal mine effluent on the aquatic ecosystem.[31]

19.5 See also

Analysis of variance
Censored regression model
Cross-sectional regression
Curve fitting
Empirical Bayes methods
Errors and residuals
Lack-of-fit sum of squares
Multivariate adaptive regression splines
Nonlinear regression
Nonparametric regression
Normal equations
Projection pursuit regression
Segmented linear regression
Stepwise regression
Support vector machine
Truncated regression model

19.6 Notes

[1] David A. Freedman (2009). Statistical Models: Theory and Practice. Cambridge University Press. p. 26. "A simple regression equation has on the right hand side an intercept and an explanatory variable with a slope coefficient. A multiple regression equation has several explanatory variables on the right hand side, each with its own slope coefficient."

[2] Rencher, Alvin C.; Christensen, William F. (2012), "Chapter 10, Multivariate regression – Section 10.1, Introduction", Methods of Multivariate Analysis, Wiley Series in Probability and Statistics 709 (3rd ed.), John Wiley & Sons, p. 19, ISBN 9781118391679.

[3] Hilary L. Seal (1967). "The historical development of the Gauss linear model". Biometrika 54 (1/2): 1–24. doi:10.1093/biomet/54.1-2.1.

[4] Yan, Xin (2009), Linear Regression Analysis: Theory and Computing, World Scientific, pp. 1–2, ISBN 9789812834119. "Regression analysis ... is probably one of the oldest topics in mathematical statistics dating back to about two hundred years ago. The earliest form of the linear regression was the least squares method, which was published by Legendre in 1805, and by Gauss in 1809 ... Legendre and Gauss both applied the method to the problem of determining, from astronomical observations, the orbits of bodies about the sun."

[5] Tibshirani, Robert (1996). "Regression Shrinkage and Selection via the Lasso". Journal of the Royal Statistical Society, Series B 58 (1): 267–288. JSTOR 2346178.

[6] Efron, Bradley; Hastie, Trevor; Johnstone, Iain; Tibshirani, Robert (2004). "Least Angle Regression". The Annals of Statistics 32 (2): 407–451. doi:10.1214/009053604000000067. JSTOR 3448465.
[9] Berk, Richard A. Regression Analysis: A Constructive Critique. Sage. doi:10.1177/0734016807304871.

[10] Warne, R. T. (2011). "Beyond multiple regression: Using commonality analysis to better understand R2 results". Gifted Child Quarterly 55: 313–318. doi:10.1177/0016986211422217.

[11] Brillinger, David R. (1977). "The Identification of a Particular Nonlinear Time Series System". Biometrika 64 (3): 509–515. doi:10.1093/biomet/64.3.509. JSTOR 2345326.

[12] Lai, T.L.; Robbins, H.; Wei, C.Z. (1978). "Strong consistency of least squares estimates in multiple regression". PNAS 75 (7): 3034–3036. Bibcode:1978PNAS...75.3034L. doi:10.1073/pnas.75.7.3034. JSTOR 68164.

[13] Tofallis, C. (2009). "Least Squares Percentage Regression". Journal of Modern Applied Statistical Methods 7: 526–534. doi:10.2139/ssrn.1406472.

[14] del Pino, Guido (1989). "The Unifying Role of Iterative Generalized Least Squares in Statistical Algorithms". Statistical Science 4 (4): 394–403. doi:10.1214/ss/1177012408. JSTOR 2245853.

[15] Carroll, Raymond J. (1982). "Adapting for Heteroscedasticity in Linear Models". The Annals of Statistics 10 (4): 1224–1233. doi:10.1214/aos/1176345987. JSTOR 2240725.

[16] Cohen, Michael; Dalal, Siddhartha R.; Tukey, John W. (1993). "Robust, Smoothly Heterogeneous Variance Regression". Journal of the Royal Statistical Society, Series C 42 (2): 339–353. JSTOR 2986237.

[17] Nievergelt, Yves (1994). "Total Least Squares: State-of-the-Art Regression in Numerical Analysis". SIAM Review 36 (2): 258–264. doi:10.1137/1036055. JSTOR 2132463.

[23] Stone, C. J. (1975). "Adaptive maximum likelihood estimators of a location parameter". The Annals of Statistics 3 (2): 267–284. doi:10.1214/aos/1176343056. JSTOR 2958945.

[24] Goldstein, H. (1986). "Multilevel Mixed Linear Model Analysis Using Iterative Generalized Least Squares". Biometrika 73 (1): 43–56. doi:10.1093/biomet/73.1.43. JSTOR 2336270.

[25] Theil, H. (1950). "A rank-invariant method of linear and polynomial regression analysis. I, II, III". Nederl. Akad. Wetensch., Proc. 53: 386–392, 521–525, 1397–1412. MR 0036489; Sen, Pranab Kumar (1968). "Estimates of the regression coefficient based on Kendall's tau". Journal of the American Statistical Association 63 (324): 1379–1389. doi:10.2307/2285891. JSTOR 2285891. MR 0258201.

[26] Wilkinson, J.H. (1963). "Chapter 3: Matrix Computations", Rounding Errors in Algebraic Processes. London: Her Majesty's Stationery Office (National Physical Laboratory, Notes in Applied Science, No. 32).

[27] Deaton, Angus (1992). Understanding Consumption. Oxford University Press. ISBN 0-19-828824-7.

[28] Krugman, Paul R.; Obstfeld, M.; Melitz, Marc J. (2012). International Economics: Theory and Policy (9th global ed.). Harlow: Pearson. ISBN 9780273754091.

[29] Laidler, David E. W. (1993). The Demand for Money: Theories, Evidence, and Problems (4th ed.). New York: Harper Collins. ISBN 0065010981.

[30] Ehrenberg; Smith (2008). Modern Labor Economics (10th international ed.). London: Addison-Wesley. ISBN 9780321538963.

[31] EEMP webpage
Tikhonov regularization
Tikhonov regularization, named for Andrey Tikhonov, is the most commonly used method of regularization of ill-posed problems. In statistics, the method is known as ridge regression, and with multiple independent discoveries, it is also variously known as the Tikhonov–Miller method, the Phillips–Twomey method, the constrained linear inversion method, and the method of linear regularization. It is related to the Levenberg–Marquardt algorithm for non-linear least-squares problems.

When the following problem is not well posed (either because of non-existence or non-uniqueness of x),

Ax = b,

then the standard approach (known as ordinary least squares) leads to an overdetermined, or more often an underdetermined, system of equations. Most real-world phenomena operate as low-pass filters in the forward direction where A maps x to b. Therefore, in solving the inverse problem, the inverse mapping operates as a high-pass filter that has the undesirable tendency of amplifying noise (eigenvalues/singular values are largest in the reverse mapping where they were smallest in the forward mapping). In addition, ordinary least squares implicitly nullifies every element of the reconstructed version of x that is in the null-space of A, rather than allowing for a model to be used as a prior for x. Ordinary least squares seeks to minimize the sum of squared residuals, which can be compactly written as

‖Ax − b‖²,

where ‖·‖ is the Euclidean norm. In order to give preference to a particular solution with desirable properties, a regularization term can be included in this minimization:

‖Ax − b‖² + ‖Γx‖²

for some suitably chosen Tikhonov matrix Γ. In many cases, this matrix is chosen as a multiple of the identity matrix (Γ = αI), giving preference to solutions with smaller norms; this is known as L2 regularization.[1] In other cases, lowpass operators (e.g., a difference operator or a weighted Fourier operator) may be used to enforce smoothness if the underlying vector is believed to be mostly continuous. This regularization improves the conditioning of the problem, thus enabling a direct numerical solution. An explicit solution, denoted by x̂, is given by

x̂ = (AᵀA + ΓᵀΓ)⁻¹Aᵀb.

The effect of regularization may be varied via the scale of matrix Γ. For Γ = 0 this reduces to the unregularized least squares solution, provided that (AᵀA)⁻¹ exists.

L2 regularization is used in many contexts aside from linear regression, such as classification with logistic regression or support vector machines,[2] and matrix factorization.[3]

20.1 History

Tikhonov regularization has been invented independently in many different contexts. It became widely known from its application to integral equations from the work of Andrey Tikhonov and David L. Phillips. Some authors use the term Tikhonov–Phillips regularization. The finite-dimensional case was expounded by Arthur E. Hoerl, who took a statistical approach, and by Manus Foster, who interpreted this method as a Wiener–Kolmogorov filter. Following Hoerl, it is known in the statistical literature as ridge regression.

20.2 Generalized Tikhonov regularization

For general multivariate normal distributions for x and the data error, one can apply a transformation of the variables to reduce to the case above. Equivalently, one can seek an x to minimize

‖Ax − b‖_P² + ‖x − x₀‖_Q²,

where ‖x‖_Q² stands for the weighted norm xᵀQx.
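A minimal sketch of the explicit solution above, in Python/NumPy, assuming the common choice Γ = αI and a synthetic, badly conditioned A (all values are hypothetical and only meant to illustrate the formula):

```python
import numpy as np

# Tikhonov regularization with Gamma = alpha * I (ridge form):
# x_hat = (A^T A + Gamma^T Gamma)^{-1} A^T b.
rng = np.random.default_rng(4)
m, n = 50, 20
A = rng.normal(size=(m, n)) @ np.diag(np.logspace(0, -6, n))   # rapidly decaying singular values
x_true = rng.normal(size=n)
b = A @ x_true + 1e-4 * rng.normal(size=m)

def tikhonov(A, b, alpha):
    """Regularized solution for Gamma = alpha * I (so Gamma^T Gamma = alpha^2 * I)."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + alpha**2 * np.eye(n), A.T @ b)

x_ls = np.linalg.lstsq(A, b, rcond=None)[0]   # unregularized least squares
x_reg = tikhonov(A, b, alpha=1e-3)
print(np.linalg.norm(x_ls), np.linalg.norm(x_reg))  # regularization shrinks the solution norm
```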
In terms of the singular value decomposition A = UΣVᵀ (with Γ = αI), the regularized solution can be written as

x̂ = V D Uᵀ b,

where D has diagonal values

D_ii = σ_i / (σ_i² + α²),

with σ_i the singular values of A.

20.6 Relation to probabilistic formulation

The probabilistic formulation of an inverse problem introduces (when all uncertainties are Gaussian) a covari-
ance matrix C_M representing the a priori uncertainties on the model parameters, and a covariance matrix C_D representing the uncertainties on the observed parameters (see, for instance, Tarantola, 2005). In the special case when these two matrices are diagonal and isotropic, C_M = σ_M² I and C_D = σ_D² I, and, in this case, the equations of inverse theory reduce to the equations above, with α = σ_D/σ_M.

20.7 Bayesian interpretation

Further information: Minimum mean square error – Linear MMSE estimator for linear observation process

Although at first the choice of the solution to this regularized problem may look artificial, and indeed the matrix Γ seems rather arbitrary, the process can be justified from a Bayesian point of view. Note that for an ill-posed problem one must necessarily introduce some additional assumptions in order to get a unique solution. Statistically, the prior probability distribution of x is sometimes taken to be a multivariate normal distribution. For simplicity here, the following assumptions are made: the means are zero; their components are independent; the components have the same standard deviation σ_x. The data are also subject to errors, and the errors in b are also assumed to be independent with zero mean and standard deviation σ_b. Under these assumptions the Tikhonov-regularized solution is the most probable solution given the data and the a priori distribution of x, according to Bayes' theorem.[4]

If the assumption of normality is replaced by assumptions of homoskedasticity and uncorrelatedness of errors, and if one still assumes zero mean, then the Gauss–Markov theorem entails that the solution is the minimal unbiased estimator.

20.8 See also

LASSO estimator is another regularization method in statistics.

20.9 References

[1] Ng, Andrew Y. (2004). Proc. ICML. http://www.machinelearning.org/proceedings/icml2004/papers/354.pdf

[2] R.-E. Fan; K.-W. Chang; C.-J. Hsieh; X.-R. Wang; C.-J. Lin (2008). "LIBLINEAR: A library for large linear classification". Journal of Machine Learning Research 9: 1871–1874.

[3] Guan, Naiyang; Tao, Dacheng; Luo, Zhigang; Yuan, Bo (2012). "Online nonnegative matrix factorization with robust stochastic approximation". IEEE Trans. Neural Networks and Learning Systems 23 (7): 1087–1099.

[4] Vogel, Curtis R. (2002). Computational methods for inverse problems. Philadelphia: Society for Industrial and Applied Mathematics. ISBN 0-89871-550-4.

Amemiya, Takeshi (1985). Advanced Econometrics. Harvard University Press. pp. 60–61. ISBN 0-674-00560-0.

Tikhonov, Andrey Nikolayevich (1943). [On the stability of inverse problems]. Doklady Akademii Nauk SSSR 39 (5): 195–198.

Tikhonov, A. N. (1963). [Solution of incorrectly formulated problems and the regularization method]. Doklady Akademii Nauk SSSR 151: 501–504. Translated in Soviet Mathematics 4: 1035–1038.

Tikhonov, A. N.; Arsenin, V. Y. (1977). Solution of Ill-posed Problems. Washington: Winston & Sons. ISBN 0-470-99124-0.

Tikhonov A.N., Goncharsky A.V., Stepanov V.V., Yagola A.G. (1995). Numerical Methods for the Solution of Ill-Posed Problems. Kluwer Academic Publishers.

Tikhonov A.N., Leonov A.S., Yagola A.G. (1998). Nonlinear Ill-Posed Problems, V. 1, V. 2. Chapman and Hall.

Hansen, P.C. (1998). Rank-deficient and Discrete Ill-posed Problems. SIAM.

Hoerl, A.E. (1962). "Application of ridge analysis to regression problems". Chemical Engineering Progress 58: 54–59.

Hoerl, A.E.; Kennard, R.W. (1970). "Ridge regression: Biased estimation for nonorthogonal problems". Technometrics 12 (1): 55–67. doi:10.2307/1267351. JSTOR 1271436.

Foster, M. (1961). "An Application of the Wiener-Kolmogorov Smoothing Theory to Matrix Inversion". Journal of the Society for Industrial and Applied Mathematics 9 (3): 387. doi:10.1137/0109031.

Phillips, D. L. (1962). "A Technique for the Numerical Solution of Certain Integral Equations of the First Kind". Journal of the ACM 9: 84. doi:10.1145/321105.321114.
Regression analysis
In statistics, regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or "predictors"). More specifically, regression analysis helps one understand how the typical value of the dependent variable (or "criterion variable") changes when any one of the independent variables is varied, while the other independent variables are held fixed. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables, that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the focus is on a quantile, or other location parameter of the conditional distribution of the dependent variable given the independent variables. In all cases, the estimation target is a function of the independent variables called the regression function. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the regression function, which can be described by a probability distribution.

Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning. Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. However, this can lead to illusions or false relationships, so caution is advisable;[1] for example, correlation does not imply causation.

Many techniques for carrying out regression analysis have been developed. Familiar methods such as linear regression and ordinary least squares regression are parametric, in that the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data. Nonparametric regression refers to techniques that allow the regression function to lie in a specified set of functions, which may be infinite-dimensional.

The performance of regression analysis methods in practice depends on the form of the data generating process, and how it relates to the regression approach being used. Since the true form of the data-generating process is generally not known, regression analysis often depends to some extent on making assumptions about this process. These assumptions are sometimes testable if a sufficient quantity of data is available. Regression models for prediction are often useful even when the assumptions are moderately violated, although they may not perform optimally. However, in many applications, especially with small effects or questions of causality based on observational data, regression methods can give misleading results.[2][3]

21.1 History

The earliest form of regression was the method of least squares, which was published by Legendre in 1805,[4] and by Gauss in 1809.[5] Legendre and Gauss both applied the method to the problem of determining, from astronomical observations, the orbits of bodies about the Sun (mostly comets, but also later the then newly discovered minor planets). Gauss published a further development of the theory of least squares in 1821,[6] including a version of the Gauss–Markov theorem.

The term "regression" was coined by Francis Galton in the nineteenth century to describe a biological phenomenon. The phenomenon was that the heights of descendants of tall ancestors tend to regress down towards a normal average (a phenomenon also known as regression toward the mean).[7][8] For Galton, regression had only this biological meaning,[9][10] but his work was later extended by Udny Yule and Karl Pearson to a more general statistical context.[11][12] In the work of Yule and Pearson, the joint distribution of the response and explanatory variables is assumed to be Gaussian. This assumption was weakened by R.A. Fisher in his works of 1922 and 1925.[13][14][15] Fisher assumed that the conditional distribution of the response variable is Gaussian, but the joint distribution need not be. In this respect, Fisher's assumption is closer to Gauss's formulation of 1821.

In the 1950s and 1960s, economists used electromechanical desk calculators to calculate regressions. Before 1970, it sometimes took up to 24 hours to receive the result from one regression.
21.3 Underlying assumptions

Classical assumptions for regression analysis include:

The sample is representative of the population for the inference prediction.

The errors are uncorrelated, that is, the variance–covariance matrix of the errors is diagonal and each non-zero element is the variance of the error.

The variance of the error is constant across observations (homoscedasticity). If not, weighted least squares or other methods might instead be used.

These are sufficient conditions for the least-squares estimator to possess desirable properties; in particular, these assumptions imply that the parameter estimates will be unbiased, consistent, and efficient in the class of linear unbiased estimators. It is important to note that actual data rarely satisfies the assumptions. That is, the method is used even though the assumptions are not true. Variation from the assumptions can sometimes be used as a measure of how far the model is from being useful. Many of these assumptions may be relaxed in more advanced treatments. Reports of statistical analyses usually include analyses of tests on the sample data and methodology for the fit and usefulness of the model.

Variables measured at point locations may exhibit spatial trends and spatial autocorrelation that violate statistical assumptions of regression. Geographically weighted regression is one technique to deal with such data.[18] Also, variables may include values aggregated by areas. With aggregated data, the modifiable areal unit problem can cause extreme variation in regression parameters.[19] When analyzing data aggregated by political boundaries, postal codes or census areas, results may be very distinct with a different choice of units.

21.4 Linear regression

Main article: Linear regression

In linear regression, the model specification is that the dependent variable, yᵢ, is a linear combination of the parameters (but need not be linear in the independent variables). For example, in simple linear regression for modeling n data points there is one independent variable, xᵢ, and two parameters, β₀ and β₁:

straight line: yᵢ = β₀ + β₁xᵢ + εᵢ,   i = 1, …, n.

In multiple linear regression, there are several independent variables or functions of independent variables. For example, adding a term in xᵢ² to the preceding regression gives:

parabola: yᵢ = β₀ + β₁xᵢ + β₂xᵢ² + εᵢ,   i = 1, …, n.

This is still linear regression; although the expression on the right hand side is quadratic in the independent variable xᵢ, it is linear in the parameters β₀, β₁ and β₂. In both cases, εᵢ is an error term and the subscript i indexes a particular observation.

Returning our attention to the straight line case: given a random sample from the population, we estimate the population parameters and obtain the sample linear regression model

ŷᵢ = β̂₀ + β̂₁xᵢ.

The residual, eᵢ = yᵢ − ŷᵢ, is the difference between the value of the dependent variable predicted by the model, ŷᵢ, and the true value of the dependent variable, yᵢ. One method of estimation is ordinary least squares. This method obtains parameter estimates that minimize the sum of squared residuals, SSE,[20][21] also sometimes denoted RSS:

SSE = Σ_{i=1}^n eᵢ².

Minimization of this function results in a set of normal equations, a set of simultaneous linear equations in the parameters, which are solved to yield the parameter estimators β̂₀, β̂₁.

In the case of simple regression, the formulas for the least squares estimates are

β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²   and   β̂₀ = ȳ − β̂₁x̄,

where x̄ is the mean (average) of the x values and ȳ is the mean of the y values. See simple linear regression for a derivation of these formulas and a numerical example. Under the assumption that the population error term has a constant variance, the estimate of that variance is given by

σ̂²_ε = SSE / (n − 2).
In the more general multiple regression model, there are p independent variables, and the residual can be written as

eᵢ = yᵢ − β̂₁x_{i1} − ⋯ − β̂ₚx_{ip}.

The least squares estimates are obtained from the p normal equations

Σ_{i=1}^n Σ_{k=1}^p X_{ij} X_{ik} β̂ₖ = Σ_{i=1}^n X_{ij} yᵢ,   j = 1, …, p.
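The simple-regression formulas above can be sketched directly in Python/NumPy (synthetic data with a hypothetical "true" line; this is only an illustration of the formulas, not part of the original article):

```python
import numpy as np

# beta1_hat = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2), beta0_hat = ybar - beta1_hat * xbar.
rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 50)
y = 3.0 + 0.7 * x + rng.normal(scale=1.0, size=x.shape)   # hypothetical true line plus noise

xbar, ybar = x.mean(), y.mean()
beta1_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
beta0_hat = ybar - beta1_hat * xbar

residuals = y - (beta0_hat + beta1_hat * x)
sse = np.sum(residuals ** 2)
sigma2_hat = sse / (len(x) - 2)          # estimate of the error variance
print(beta0_hat, beta1_hat, sigma2_hat)
```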
Nonlinear models for binary dependent variables include the probit and logit model. The multivariate probit model is a standard method of estimating a joint relationship between several binary dependent variables and some independent variables. For categorical variables with more than two values there is the multinomial logit. For ordinal variables with more than two values, there are the ordered logit and ordered probit models. Censored regression models may be used when the dependent variable is only sometimes observed, and Heckman correction type models may be used when the sample is not randomly selected from the population of interest. An alternative to such procedures is linear regression based on polychoric correlation (or polyserial correlations) between the categorical variables. Such procedures differ in the assumptions made about the distribution of the variables in the population. If the variable is positive with low values and represents the repetition of the occurrence of an event, then count models like the Poisson regression or the negative binomial model may be used instead.

21.5 Interpolation and extrapolation

Regression models predict a value of the Y variable given known values of the X variables. Prediction within the range of values in the dataset used for model-fitting is known informally as interpolation. Prediction outside this range of the data is known as extrapolation. Performing extrapolation relies strongly on the regression assumptions. The further the extrapolation goes outside the data, the more room there is for the model to fail due to differences between the assumptions and the sample data or the true values.

It is generally advised that when performing extrapolation, one should accompany the estimated value of the dependent variable with a prediction interval that represents the uncertainty. Such intervals tend to expand rapidly as the values of the independent variable(s) move outside the range covered by the observed data. For such reasons and others, some tend to say that it might be unwise to undertake extrapolation.[23]

However, this does not cover the full set of modelling errors that may be being made: in particular, the assumption of a particular form for the relation between Y and X. A properly conducted regression analysis will include an assessment of how well the assumed form is matched by the observed data, but it can only do so within the range of values of the independent variables actually available. This means that any extrapolation is particularly reliant on the assumptions being made about the structural form of the regression relationship. Best-practice advice here is that a linear-in-variables and linear-in-parameters relationship should not be chosen simply for computational convenience, but that all available knowledge should be deployed in constructing a regression model. If this knowledge includes the fact that the dependent variable cannot go outside a certain range of values, this can be made use of in selecting the model, even if the observed dataset has no values particularly near such bounds. The implications of this step of choosing an appropriate functional form for the regression can be great when extrapolation is considered. At a minimum, it can ensure that any extrapolation arising from a fitted model is realistic (or in accord with what is known).

21.6 Nonlinear regression

Main article: Nonlinear regression

When the model function is not linear in the parameters, the sum of squares must be minimized by an iterative procedure. This introduces many complications, which are summarized in Differences between linear and non-linear least squares.

21.7 Power and sample size calculations

There are no generally agreed methods for relating the number of observations versus the number of independent variables in the model. One rule of thumb suggested by Good and Hardin is N = m^n, where N is the sample size, n is the number of independent variables and m is the number of observations needed to reach the desired precision if the model had only one independent variable.[24] For example, a researcher is building a linear regression model using a dataset that contains 1000 patients (N). If the researcher decides that five observations are needed to precisely define a straight line (m), then the maximum number of independent variables the model can support is 4, because log 1000 / log 5 = 4.29.

21.8 Other methods

Although the parameters of a regression model are usually estimated using the method of least squares, other methods which have been used include:

Bayesian methods, e.g. Bayesian linear regression

Percentage regression, for situations where reducing percentage errors is deemed more appropriate.[25]

Least absolute deviations, which is more robust in the presence of outliers, leading to quantile regression
Kriging (a linear least squares estimation algorithm)

Local regression

Modifiable areal unit problem

Multivariate adaptive regression splines

Multivariate normal distribution

Pearson product-moment correlation coefficient

Prediction interval

Robust regression

[12] Pearson, Karl; Yule, G.U.; Blanchard, Norman; Lee, Alice (1903). "The Law of Ancestral Heredity". Biometrika (Biometrika Trust) 2 (2): 211–236. doi:10.1093/biomet/2.2.211. JSTOR 2331683.

[13] Fisher, R.A. (1922). "The goodness of fit of regression formulae, and the distribution of regression coefficients". Journal of the Royal Statistical Society (Blackwell Publishing) 85 (4): 597–612. doi:10.2307/2341124. JSTOR 2341124.

[14] Ronald A. Fisher (1954). Statistical Methods for Research Workers (Twelfth ed.). Edinburgh: Oliver and Boyd. ISBN 0-05-002170-2.

[17] N. Cressie (1996). "Change of Support and the Modifiable Areal Unit Problem". Geographical Systems 3: 159–180.

[18] Fotheringham, A. Stewart; Brunsdon, Chris; Charlton, Martin (2002). Geographically weighted regression: the analysis of spatially varying relationships (Reprint ed.). Chichester, England: John Wiley. ISBN 978-0-471-49616-8.

[19] Fotheringham, AS; Wong, DWS (1 January 1991). "The modifiable areal unit problem in multivariate statistical analysis". Environment and Planning A 23 (7): 1025–1044. doi:10.1068/a231025.

[20] M. H. Kutner, C. J. Nachtsheim, and J. Neter (2004). Applied Linear Regression Models, 4th ed., McGraw-Hill/Irwin, Boston (p. 25).

[21] N. Ravishankar and D. K. Dey (2002). A First Course in Linear Model Theory, Chapman and Hall/CRC, Boca Raton (p. 101).

[22] Steel, R.G.D., and Torrie, J. H. (1960). Principles and Procedures of Statistics with Special Reference to the Biological Sciences, McGraw Hill, page 288.

[23] Chiang, C.L. (2003). Statistical methods of analysis, World Scientific. ISBN 981-238-310-7 – page 274, section 9.7.4 "interpolation vs extrapolation".

[24] Good, P. I.; Hardin, J. W. (2009). Common Errors in Statistics (And How to Avoid Them) (3rd ed.). Hoboken, New Jersey: Wiley. p. 211. ISBN 978-0-470-45798-6.

[25] Tofallis, C. (2009). "Least Squares Percentage Regression". Journal of Modern Applied Statistical Methods 7: 526–534. doi:10.2139/ssrn.1406472.

[26] YangJing Long (2009). "Human age estimation by metric learning for regression problems" (PDF). Proc. International Conference on Computer Analysis of Images and Patterns: 74–82.

Draper, N.R.; Smith, H. (1998). Applied Regression Analysis (3rd ed.). John Wiley. ISBN 0-471-17082-8.

Fox, J. (1997). Applied Regression Analysis, Linear Models and Related Methods. Sage.

Hardle, W. (1990). Applied Nonparametric Regression. ISBN 0-521-42950-1.

Meade, N. and T. Islam (1995). "Prediction Intervals for Growth Curve Forecasts". Journal of Forecasting 14: 413–430.

A. Sen, M. Srivastava (2011). Regression Analysis – Theory, Methods, and Applications. Springer-Verlag, Berlin (4th printing).

T. Strutz: Data Fitting and Uncertainty (A practical introduction to weighted least squares and beyond). Vieweg+Teubner, ISBN 978-3-8348-1022-9.

Malakooti, B. (2013). Operations and Production Systems with Multiple Objectives. John Wiley & Sons.

21.13 External links

Hazewinkel, Michiel, ed. (2001), "Regression analysis", Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4

Earliest Uses: Regression – basic history and references

Regression of Weakly Correlated Data – how linear regression mistakes can appear when the Y-range is much smaller than the X-range
Statistical learning theory

This article is about statistical learning in machine learning. For its use in psychology, see Statistical learning in language acquisition. See also: Computational learning theory.

Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis.[1] Statistical learning theory deals with the problem of finding a predictive function based on data. Statistical learning theory has led to successful applications in fields such as computer vision, speech recognition, bioinformatics and baseball.[2] It is the theoretical framework underlying support vector machines.

22.1 Introduction

The goal of learning is prediction. Learning falls into many categories, including supervised learning, unsupervised learning, online learning, and reinforcement learning. From the perspective of statistical learning theory, supervised learning is best understood.[3] Supervised learning involves learning from a training set of data. Every point in the training set is an input–output pair, where the input maps to an output. The learning problem consists of inferring the function that maps between the input and the output in a predictive fashion, such that the learned function can be used to predict output from future input.

Depending on the type of output, supervised learning problems are either problems of regression or problems of classification. If the output takes a continuous range of values, it is a regression problem. Using Ohm's law as an example, a regression could be performed with voltage as input and current as output. The regression would find the functional relationship between voltage and current to be 1/R, such that I = V/R. If the output instead takes values from a discrete set of labels, it is a classification problem. In facial recognition, for instance, a picture of a person's face would be the input, and the output label would be that person's name. The input would be represented by a large multidimensional vector, in which each dimension represents the value of one of the pixels.

After learning a function based on the training set data, that function is validated on a test set of data, data that did not appear in the training set.

22.2 Formal Description

Take X to be the vector space of all possible inputs, and Y to be the vector space of all possible outputs. Statistical learning theory takes the perspective that there is some unknown probability distribution over the product space Z = X × Y, i.e. there exists some unknown p(z) = p(x, y). The training set is made up of n samples from this probability distribution, and is notated

S = {(x₁, y₁), …, (xₙ, yₙ)} = {z₁, …, zₙ}.

Every xᵢ is an input vector from the training data, and yᵢ is the output that corresponds to it.

In this formalism, the inference problem consists of finding a function f : X → Y such that f(x) ≈ y. Let H be a space of functions f : X → Y called the hypothesis space. The hypothesis space is the space of functions the algorithm will search through. Let V(f(x), y) be the loss functional, a metric for the difference between the predicted value f(x) and the actual value y. The expected risk is defined to be

I[f] = ∫_{X×Y} V(f(x), y) p(x, y) dx dy.
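To connect the expected risk to what is computed in practice, here is a small sketch (Python/NumPy; the hypothesis f, the loss V and the data distribution are toy assumptions, not from the original article) contrasting the empirical risk over n samples with an exactly computable expected risk for this toy setup.

```python
import numpy as np

# The expected risk I[f] integrates the loss over the unknown p(x, y); in practice
# it is approximated by the empirical risk over the n training samples.
rng = np.random.default_rng(7)
n = 10_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)     # hypothetical "true" relationship

f = lambda x: 1.8 * x                           # some candidate hypothesis
V = lambda pred, y: (pred - y) ** 2             # square loss

empirical_risk = np.mean(V(f(x), y))            # (1/n) * sum_i V(f(x_i), y_i)
# For this toy distribution the expected risk is exact:
# E[(1.8X - Y)^2] = E[(-0.2X - eps)^2] = 0.04 * Var(X) + Var(eps) = 0.04 + 0.25.
print(empirical_risk, 0.04 + 0.25)
```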
22.4 Regularization
In this approach, one minimizes the empirical risk plus a penalty on the norm of the hypothesis f in the space H:

(1/n) Σ_{i=1}^n V(f(xᵢ), yᵢ) + γ ‖f‖²_H,

for a regularization parameter γ > 0.
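As a concrete instance of this regularized objective, the sketch below (Python/NumPy, synthetic data; the data and the value of γ are illustrative assumptions) takes the square loss and a linear hypothesis class, for which the minimizer has a closed form (this is ridge regression).

```python
import numpy as np

# Minimize (1/n) * sum_i (w . x_i - y_i)^2 + gamma * ||w||^2 over linear hypotheses w.
rng = np.random.default_rng(6)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

gamma = 0.1
# Setting the gradient of the objective to zero gives (X^T X / n + gamma I) w = X^T y / n.
w_hat = np.linalg.solve(X.T @ X / n + gamma * np.eye(d), X.T @ y / n)
print(np.linalg.norm(w_hat - w_true))
```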
22.6 References

[1] Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar (2012). Foundations of Machine Learning. The MIT Press. ISBN 9780262018258.

[4] Rosasco, L., De Vito, E., Caponnetto, A., Piana, M., and Verri, A. (2004). Neural Computation 16: 1063–1076.
Vapnik–Chervonenkis theory

Vapnik–Chervonenkis theory (also known as VC theory) was developed during 1960–1990 by Vladimir Vapnik and Alexey Chervonenkis. The theory is a form of computational learning theory, which attempts to explain the learning process from a statistical point of view.

VC theory is related to statistical learning theory and to empirical processes. Richard M. Dudley and Vladimir Vapnik himself, among others, apply VC-theory to empirical processes.

In addition, VC theory and VC dimension are instrumental in the theory of empirical processes, in the case of processes indexed by VC classes. Arguably these are the most important applications of the VC theory, and are employed in proving generalization. Several techniques will be introduced that are widely used in the empirical process and VC theory. The discussion is mainly based on the book Weak Convergence and Empirical Processes: With Applications to Statistics.
One is interested in two kinds of uniform statements over a class F of measurable functions: the uniform law of large numbers

‖Pn − P‖_F = sup_{f∈F} |(Pn − P)f| → 0,   where (Pn − P)f = (1/n) Σ_{i=1}^n (f(Xᵢ) − Pf),

and the uniform central limit theorem

Gn = √n (Pn − P) ⇝ G   in ℓ∞(F).

In the former case F is called a Glivenko-Cantelli class, and in the latter case (under the assumption ∀x, sup_{f∈F} |f(x) − Pf| < ∞) the class F is called Donsker or P-Donsker. Obviously, a Donsker class is Glivenko-Cantelli in probability by an application of Slutsky's theorem.

These statements are true for a single f by standard LLN and CLT arguments under regularity conditions, and the difficulty in empirical processes comes in because joint statements are being made for all f ∈ F. Intuitively, then, the set F cannot be too large, and, as it turns out, the geometry of F plays a very important role.

One way of measuring how big the function set F is, is to use the so-called covering numbers. The covering number

N(ε, F, ‖·‖)

is the minimal number of balls {g : ‖g − f‖ < ε} needed to cover the set F (here it is obviously assumed that there is an underlying norm on F). The entropy is the logarithm of the covering number.

Two sufficient conditions are provided below, under which it can be proved that the set F is Glivenko-Cantelli or Donsker. A class F is P-Glivenko-Cantelli if it is P-measurable with envelope F such that P*F < ∞ and it satisfies

∀ε > 0:  sup_Q N(ε ‖F‖_Q, F, L₁(Q)) < ∞.

The next condition is a version of the celebrated Dudley's theorem. If F is a class of functions such that

∫_0^∞ sup_Q √(log N(ε ‖F‖_{Q,2}, F, L₂(Q))) dε < ∞,

then F is P-Donsker for every probability measure P such that P*F² < ∞. In the last integral, the notation means

‖f‖_{Q,2} = (∫ |f|² dQ)^{1/2}.

It turns out that there is a connection between the empirical process and the following symmetrized process:

f ↦ P⁰n f = (1/n) Σ_{i=1}^n εᵢ f(Xᵢ),

where the εᵢ are independent Rademacher signs, P(εᵢ = ±1) = 1/2. The symmetrized process is a Rademacher process, conditionally on the data Xᵢ. Therefore it is a sub-Gaussian process by Hoeffding's inequality.

Lemma (Symmetrization). For every nondecreasing, convex Φ : R → R and class of measurable functions F,

E Φ(‖Pn − P‖_F) ≤ E Φ(2 ‖P⁰n‖_F).

The proof of the symmetrization lemma relies on introducing independent copies of the original variables Xᵢ (sometimes referred to as a "ghost sample") and replacing the inner expectation of the LHS by these copies. After an application of Jensen's inequality different signs could be introduced (hence the name symmetrization) without changing the expectation. The proof can be found below because of its instructive nature.

[Proof] Introduce the "ghost sample" Y₁, …, Yₙ to be independent copies of X₁, …, Xₙ. For fixed values of X₁, …, Xₙ one has

‖Pn − P‖_F = sup_{f∈F} (1/n) |Σ_{i=1}^n (f(Xᵢ) − E f(Yᵢ))| ≤ E_Y sup_{f∈F} (1/n) |Σ_{i=1}^n (f(Xᵢ) − f(Yᵢ))|.

Therefore, by Jensen's inequality,

Φ(‖Pn − P‖_F) ≤ E_Y Φ(‖(1/n) Σ_{i=1}^n (f(Xᵢ) − f(Yᵢ))‖_F).

Taking expectation with respect to X gives
Note that adding a minus sign in front of a term f(X_i) - f(Y_i) doesn't change the RHS, because it is a symmetric function of X and Y. Therefore the RHS remains the same under "sign perturbation":

E\, \Phi\left( \left\| \frac{1}{n} \sum_{i=1}^n e_i \big( f(X_i) - f(Y_i) \big) \right\|_F \right)

for any (e_1, e_2, \dots, e_n) \in \{-1, 1\}^n. Therefore:

E\, \Phi(\|P_n - P\|_F) \le E_\varepsilon E\, \Phi\left( \left\| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \big( f(X_i) - f(Y_i) \big) \right\|_F \right)

Finally, using first the triangle inequality and then convexity of \Phi gives:

E\, \Phi(\|P_n - P\|_F) \le \frac{1}{2} E_\varepsilon E\, \Phi\left( 2 \left\| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \right\|_F \right) + \frac{1}{2} E_\varepsilon E\, \Phi\left( 2 \left\| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(Y_i) \right\|_F \right)

where the last two expressions on the RHS are the same, which concludes the proof.

A typical way of proving empirical CLTs first uses symmetrization to pass to the symmetrized process and then argues conditionally on the data, using the fact that Rademacher processes are simple processes with nice properties.

For a VC class of sets C with VC index V(C), Sauer's lemma bounds the shattering coefficient:

\max_{x_1, \dots, x_n} \Delta_n(C, x_1, \dots, x_n) \le \sum_{j=0}^{V(C)-1} \binom{n}{j} \le \left( \frac{ne}{V(C)-1} \right)^{V(C)-1}

A similar bound can be shown (with a different constant, same rate) for the so-called VC subgraph classes. For a function f : X \to R the subgraph is a subset of X \times R such that \{(x, t) : t < f(x)\}. A collection F is called a VC subgraph class if all subgraphs form a VC class.

Consider a set of indicator functions I_C = \{1_C : C \in C\} in L_1(Q) for a discrete empirical type of measure Q (or equivalently for any probability measure Q). It can then be shown that, quite remarkably, for r \ge 1:

N(\varepsilon, I_C, L_r(Q)) \le K V(C) (4e)^{V(C)} \varepsilon^{-r(V(C)-1)}

Further consider the symmetric convex hull of a set F, sconv F, being the collection of functions of the form \sum_{i=1}^m \alpha_i f_i with \sum_{i=1}^m |\alpha_i| \le 1. Then if

N(\varepsilon \|F\|_{Q,2}, F, L_2(Q)) \le C \varepsilon^{-V}

the following is valid for the convex hull of F:

\log N(\varepsilon \|F\|_{Q,2}, \operatorname{sconv} F, L_2(Q)) \le K \varepsilon^{-2V/(V+2)}
These quantities can certainly be evaluated. Then one has the following theorem.

23.3.1 Theorem (VC Inequality)

For binary classification and the 0/1 loss function we have the following generalization bounds.
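A standard formulation of such bounds (the constants differ between references), with \hat{R}_n(f) the empirical 0/1 risk, R(f) the expected 0/1 risk, and S(F, n) the shattering coefficient of the class F on n points, is:

P\left( \sup_{f \in F} \left| \hat{R}_n(f) - R(f) \right| > \varepsilon \right) \le 8\, S(F, n)\, e^{-n \varepsilon^2 / 32}

E\left[ \sup_{f \in F} \left| \hat{R}_n(f) - R(f) \right| \right] \le 2 \sqrt{ \frac{\log S(F, n) + \log 2}{n} }

In words: provided F has finite VC dimension, so that S(F, n) grows only polynomially in n, the empirical 0/1 risk becomes, uniformly over F, a good proxy for the expected 0/1 risk as the sample size grows.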
Probably approximately correct learning

In computational learning theory, probably approximately correct learning (PAC learning) is a framework for mathematical analysis of machine learning. It was proposed in 1984 by Leslie Valiant.[1]

In this framework, the learner receives samples and must select a generalization function (called the hypothesis) from a certain class of possible functions. The goal is that, with high probability (the "probably" part), the selected function will have low generalization error (the "approximately correct" part). The learner must be able to learn the concept given any arbitrary approximation ratio, probability of success, or distribution of the samples. The model was later extended to treat noise (misclassified samples).

An important innovation of the PAC framework is the introduction of computational complexity theory concepts to machine learning. In particular, the learner is expected to find efficient functions (time and space requirements bounded to a polynomial of the example size), and the learner itself must implement an efficient procedure (requiring an example count bounded to a polynomial of the concept size, modified by the approximation and likelihood bounds).

For the following definitions, two examples will be used. The first is the problem of character recognition given an array of n bits encoding a binary-valued image. The other example is the problem of finding an interval that will correctly classify points within the interval as positive and the points outside of the range as negative.

Let X be a set called the instance space or the encoding of all the samples, and let each instance have a length assigned. In the character recognition problem, the instance space is X = {0, 1}^n. In the interval problem the instance space is X = R, where R denotes the set of all real numbers.

A concept is a subset c of X. One concept is the set of all patterns of bits in X = {0, 1}^n that encode a picture of the letter "P". An example concept from the second example is the set of all of the numbers between π/2 and √10. A concept class C is a set of concepts over X. This could be the set of all subsets of the array of bits that are skeletonized 4-connected (width of the font is 1).

Let EX(c, D) be a procedure that draws an example, x, using a probability distribution D and gives the correct label c(x), that is 1 if x is in c and 0 otherwise.

Say that there is an algorithm A that, given access to EX(c, D) and inputs ε and δ, with probability of at least 1 − δ outputs a hypothesis h in C that has error less than or equal to ε on examples drawn from X with the distribution D. If there is such an algorithm for every concept c in C, for every distribution D over X, and for all 0 < ε < 1/2 and 0 < δ < 1/2, then C is PAC learnable (or distribution-free PAC learnable). We can also say that A is a PAC learning algorithm for C.

An algorithm runs in time t if it draws at most t examples and requires at most t time steps. A concept class is efficiently PAC learnable if it is PAC learnable by an algorithm that runs in time polynomial in 1/ε, 1/δ and instance length.

The following conditions on a concept class C are equivalent:

1. The concept class C is PAC learnable.

2. The VC dimension of C is finite.

3. C is a uniform Glivenko-Cantelli class.
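For a finite concept class and a learner that outputs any concept consistent with the examples seen, a standard counting argument (not stated above) gives an explicit sufficient sample size, which the sketch below simply evaluates; the bound m >= (1/epsilon) * (ln|C| + ln(1/delta)) and the example class are illustrative assumptions.

    import math

    def pac_sample_size(concept_count, epsilon, delta):
        """Examples sufficient for a consistent learner over a finite concept class.

        With m >= (1/epsilon) * (ln|C| + ln(1/delta)) i.i.d. examples, the probability
        that some concept with error greater than epsilon is still consistent with all
        m examples is at most |C| * (1 - epsilon)**m <= delta.
        """
        if not (0 < epsilon < 0.5 and 0 < delta < 0.5):
            raise ValueError("epsilon and delta must lie in (0, 1/2)")
        return math.ceil((math.log(concept_count) + math.log(1.0 / delta)) / epsilon)

    # Toy stand-in for the interval example: intervals with endpoints on a grid of 100 points.
    intervals = 100 * 101 // 2 + 1        # choices of two endpoints, plus the empty concept
    print(pac_sample_size(intervals, epsilon=0.05, delta=0.01))   # about 263 examples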
24.3 References

[1] L. Valiant. "A theory of the learnable". Communications of the ACM, 27, 1984.

[2] Kearns and Vazirani, pp. 1-12,
Algorithmic learning theory

Algorithmic learning theory is a mathematical framework for analyzing machine learning problems and algorithms. Synonyms include formal learning theory and algorithmic inductive inference. Algorithmic learning theory is different from statistical learning theory in that it does not make use of statistical assumptions and analysis. Both algorithmic and statistical learning theory are concerned with machine learning and can thus be viewed as branches of computational learning theory.

25.1 Distinguishing Characteristics

Unlike statistical learning theory and most statistical theory in general, algorithmic learning theory does not assume that data are random samples, that is, that data points are independent of each other. This makes the theory suitable for domains where observations are (relatively) noise-free but not random, such as language learning[1] and automated scientific discovery.[2][3]

The fundamental concept of algorithmic learning theory is learning in the limit: as the number of data points increases, a learning algorithm should converge to a correct hypothesis on every possible data sequence consistent with the problem space. This is a non-probabilistic version of statistical consistency, which also requires convergence to a correct model in the limit, but allows a learner to fail on data sequences with probability measure 0.

Algorithmic learning theory investigates the learning power of Turing machines. Other frameworks consider a much more restricted class of learning algorithms than Turing machines, for example learners that compute hypotheses more quickly, for instance in polynomial time. An example of such a framework is probably approximately correct learning.

25.2 Learning in the limit

The concept was introduced in E. Mark Gold's seminal paper "Language identification in the limit".[4] The objective of language identification is for a machine running one program to be capable of developing another program by which any given sentence can be tested to determine whether it is "grammatical" or "ungrammatical". The language being learned need not be English or any other natural language; in fact the definition of "grammatical" can be absolutely anything known to the tester.

In Gold's learning model, the tester gives the learner an example sentence at each step, and the learner responds with a hypothesis, which is a suggested program to determine grammatical correctness. It is required of the tester that every possible sentence (grammatical or not) appears in the list eventually, but no particular order is required. It is required of the learner that at each step the hypothesis must be correct for all the sentences so far.

A particular learner is said to be able to "learn a language in the limit" if there is a certain number of steps beyond which its hypothesis no longer changes. At this point it has indeed learned the language, because every possible sentence appears somewhere in the sequence of inputs (past or future), and the hypothesis is correct for all inputs (past or future), so the hypothesis is correct for every sentence. The learner is not required to be able to tell when it has reached a correct hypothesis; all that is required is that it be true.

Gold showed that any language which is defined by a Turing machine program can be learned in the limit by another Turing-complete machine using enumeration. This is done by the learner testing all possible Turing machine programs in turn until one is found which is correct so far; this forms the hypothesis for the current step. Eventually, the correct program will be reached, after which the hypothesis will never change again (but note that the learner does not know that it won't need to change).

Gold also showed that if the learner is given only positive examples (that is, only grammatical sentences appear in the input, not ungrammatical sentences), then the language can only be guaranteed to be learned in the limit if there are only a finite number of possible sentences in the language (this is possible if, for example, sentences are known to be of limited length).
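The enumeration strategy described above can be sketched in a toy setting in which each candidate "grammar" is a computable membership test drawn from a fixed enumeration. The particular hypotheses and example stream below are illustrative assumptions; a real instance of Gold's result would enumerate all Turing machine programs.

    from typing import Callable, Iterable, List, Tuple

    Hypothesis = Callable[[str], bool]

    def learn_in_the_limit(hypotheses: List[Hypothesis],
                           stream: Iterable[Tuple[str, bool]]) -> List[int]:
        """Identification by enumeration over a (finite prefix of an) enumerated hypothesis list.

        After each labelled sentence, keep the first hypothesis in the enumeration that is
        correct on every sentence seen so far; the returned list records the index guessed
        at each step.
        """
        seen: List[Tuple[str, bool]] = []
        current = 0
        guesses = []
        for sentence, grammatical in stream:
            seen.append((sentence, grammatical))
            # Advance until the current hypothesis explains all data seen so far.
            while any(hypotheses[current](s) != label for s, label in seen):
                current += 1          # assumes some listed hypothesis is correct
            guesses.append(current)
        return guesses

    # Toy enumeration: language k is "strings of a's whose length is divisible by k+1".
    hyps = [lambda s, k=k: len(s) % (k + 1) == 0 for k in range(10)]
    # The true language here is "length divisible by 3" (hypothesis index 2).
    data = [("aaa", True), ("a", False), ("aaaaaa", True), ("aaaa", False), ("", True)]
    print(learn_in_the_limit(hyps, data))   # guesses settle on index 2 and never change again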
Language identification in the limit is a highly abstract model. It does not allow for limits of runtime or computer memory which can occur in practice, and the enumeration method may fail if there are errors in the input. However the framework is very powerful, because if these strict conditions are maintained, it allows the learning of any program known to be computable. This is because a Turing machine program can be written to mimic any program in any conventional programming language. See Church-Turing thesis.

25.3 Other Identification Criteria

Learning theorists have investigated other learning criteria,[5] such as the following.

• Efficiency: minimizing the number of data points required before convergence to a correct hypothesis.

25.5 References

[1] Jain, S. et al. (1999). Systems That Learn, 2nd ed. Cambridge, MA: MIT Press.

[7] Jain, S. and Sharma, A. (1999). "On a generalized notion of mistake bounds", Proceedings of the Conference on Learning Theory (COLT), pp. 249-256.

[8] Kelly, Kevin T. (2007). "Ockham's Razor, Empirical Complexity, and Truth-finding Efficiency", Theoretical Computer Science, 383: 270-289.

25.6 External links

• Learning Theory in Computer Science.

• The Stanford Encyclopaedia of Philosophy provides a highly accessible introduction to key concepts in algorithmic learning theory, especially as they apply to the philosophical problems of inductive inference.
Statistical hypothesis testing

"Critical region" redirects here. For the computer science notion of a "critical section", sometimes called a "critical region", see critical section.

A statistical hypothesis is a scientific hypothesis that is testable on the basis of observing a process that is modeled via a set of random variables.[1] A statistical hypothesis test is a method of statistical inference used for testing a statistical hypothesis.

A test result is called statistically significant if it has been predicted as unlikely to have occurred by sampling error alone, according to a threshold probability: the significance level. Hypothesis tests are used in determining what outcomes of a study would lead to a rejection of the null hypothesis for a pre-specified level of significance. In the Neyman-Pearson framework (see below), the process of distinguishing between the null hypothesis and the alternative hypothesis is aided by identifying two conceptual types of errors (type 1 & type 2), and by specifying parametric limits on e.g. how much type 1 error will be permitted.

An alternative framework for statistical hypothesis testing is to specify a set of statistical models, one for each candidate hypothesis, and then use model selection techniques to choose the most appropriate model.[2] The most common selection techniques are based on either Akaike information criterion or Bayes factor.

Statistical hypothesis testing is sometimes called confirmatory data analysis. It can be contrasted with exploratory data analysis, which may not have pre-specified hypotheses.

26.1 Variations and sub-classes

Statistical hypothesis testing is a key technique of both Frequentist inference and Bayesian inference, although the two types of inference have notable differences. Statistical hypothesis tests define a procedure that controls (fixes) the probability of incorrectly deciding that a default position (null hypothesis) is incorrect. The procedure is based on how likely it would be for a set of observations to occur if the null hypothesis were true. Note that this probability of making an incorrect decision is not the probability that the null hypothesis is true, nor whether any specific alternative hypothesis is true. This contrasts with other possible techniques of decision theory in which the null and alternative hypothesis are treated on a more equal basis.

One naive Bayesian approach to hypothesis testing is to base decisions on the posterior probability,[3][4] but this fails when comparing point and continuous hypotheses. Other approaches to decision making, such as Bayesian decision theory, attempt to balance the consequences of incorrect decisions across all possibilities, rather than concentrating on a single null hypothesis. A number of other approaches to reaching a decision based on data are available via decision theory and optimal decisions, some of which have desirable properties. Hypothesis testing, though, is a dominant approach to data analysis in many fields of science. Extensions to the theory of hypothesis testing include the study of the power of tests, i.e. the probability of correctly rejecting the null hypothesis given that it is false. Such considerations can be used for the purpose of sample size determination prior to the collection of data.

26.2 The testing process

In the statistics literature, statistical hypothesis testing plays a fundamental role.[5] The usual line of reasoning is as follows:

1. There is an initial research hypothesis of which the truth is unknown.

2. The first step is to state the relevant null and alternative hypotheses. This is important as mis-stating the hypotheses will muddy the rest of the process.

3. The second step is to consider the statistical assumptions being made about the sample in doing the test; for example, assumptions about the statistical independence or about the form of the distributions of the observations. This is equally important as invalid assumptions will mean that the results of the test are invalid.
4. Decide which test is appropriate, and state the relevant test statistic T.

5. Derive the distribution of the test statistic under the null hypothesis from the assumptions. In standard cases this will be a well-known result. For example the test statistic might follow a Student's t distribution or a normal distribution.

6. Select a significance level (α), a probability threshold below which the null hypothesis will be rejected. Common values are 5% and 1%.

7. The distribution of the test statistic under the null hypothesis partitions the possible values of T into those for which the null hypothesis is rejected (the so-called critical region) and those for which it is not. The probability of the critical region is α.

8. Compute from the observations the observed value t of the test statistic T.

9. Decide to either reject the null hypothesis in favor of the alternative or not reject it. The decision rule is to reject the null hypothesis H0 if the observed value t is in the critical region, and to accept or "fail to reject" the hypothesis otherwise.

An alternative process is commonly used:

1. Compute from the observations the observed value t of the test statistic T.

2. Compute the p-value: the probability, under the null hypothesis, of observing a test statistic at least as extreme as t.

3. Reject the null hypothesis in favor of the alternative if and only if the p-value is less than the significance level.
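Both procedures can be run on a small hypothetical example (not one of this article's own): testing whether a coin is biased towards heads, with null hypothesis P(heads) = 1/2 and the number of heads in n = 20 tosses as the test statistic T. The sketch below derives the critical region for alpha = 0.05 and also computes the p-value; the two decisions agree.

    from math import comb

    def binom_tail(n, k, p):
        """P(X >= k) for X ~ Binomial(n, p)."""
        return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

    n, p0, alpha = 20, 0.5, 0.05
    observed_heads = 15

    # Process 1: pre-compute the critical region {T >= c}, then check the observed value.
    c = next(k for k in range(n + 1) if binom_tail(n, k, p0) <= alpha)
    print("critical value:", c, "reject:", observed_heads >= c)

    # Process 2: compute the p-value of the observation and compare it with alpha.
    p_value = binom_tail(n, observed_heads, p0)
    print("p-value:", round(p_value, 4), "reject:", p_value < alpha)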
The former process was advantageous in the past when only tables of test statistics at common probability thresholds were available. It allowed a decision to be made without the calculation of a probability. It was adequate for classwork and for operational use, but it was deficient for reporting results.

The latter process relied on extensive tables or on computational support not always available. The explicit calculation of a probability is useful for reporting. The calculations are now trivially performed with appropriate software.

The difference in the two processes applied to the Radioactive suitcase example (below):

• "The Geiger-counter reading is 10. The limit is 9. Check the suitcase."

• "The Geiger-counter reading is high; 97% of safe suitcases have lower readings. The limit is 95%. Check the suitcase."

The former report is adequate, the latter gives a more detailed explanation of the data and the reason why the suitcase is being checked.

It is important to note the difference between accepting the null hypothesis and simply failing to reject it. The "fail to reject" terminology highlights the fact that the null hypothesis is assumed to be true from the start of the test; if there is a lack of evidence against it, it simply continues to be assumed true. The phrase "accept the null hypothesis" may suggest it has been proved simply because it has not been disproved, a logical fallacy known as the argument from ignorance. Unless a test with particularly high power is used, the idea of "accepting" the null hypothesis may be dangerous. Nonetheless the terminology is prevalent throughout statistics, where its meaning is well understood.

The processes described here are perfectly adequate for computation. They seriously neglect the design of experiments considerations.[7][8] It is particularly critical that appropriate sample sizes be estimated before conducting the experiment.

The phrase "test of significance" was coined by statistician Ronald Fisher.[9]

If the p-value is not less than the required significance level (equivalently, if the observed test statistic is outside the critical region), then the test has no result. The evidence is insufficient to support a conclusion. (This is like a jury that fails to reach a verdict.) The researcher typically gives extra consideration to those cases where the p-value is close to the significance level.

In the Lady tasting tea example (below), Fisher required the Lady to properly categorize all of the cups of tea to justify the conclusion that the result was unlikely to result from chance. He defined the critical region as that case alone. The region was defined by a probability (that the null hypothesis was correct) of less than 5%.

Whether rejection of the null hypothesis truly justifies acceptance of the research hypothesis depends on the structure of the hypotheses. Rejecting the hypothesis that a
large paw print originated from a bear does not immediately prove the existence of Bigfoot. Hypothesis testing emphasizes the rejection, which is based on a probability, rather than the acceptance, which requires extra steps of logic.

"The probability of rejecting the null hypothesis is a function of five factors: whether the test is one- or two-tailed, the level of significance, the standard deviation, the amount of deviation from the null hypothesis, and the number of observations."[10] These factors are a source of criticism; factors under the control of the experimenter/analyst give the results an appearance of subjectivity.

26.2.2 Use and importance

Statistics are helpful in analyzing most collections of data. This is equally true of hypothesis testing which can justify conclusions even when no scientific theory exists. In the Lady tasting tea example, it was "obvious" that no difference existed between (milk poured into tea) and (tea poured into milk). The data contradicted the "obvious".

Real world applications of hypothesis testing include:[11]

• Determining the range at which a bat can detect an insect by echo
• Deciding whether hospital carpeting results in more infections
• Selecting the best means to stop smoking
• Checking whether bumper stickers reflect car owner behavior
• Testing the claims of handwriting analysts

Statistical hypothesis testing plays an important role in the whole of statistics and in statistical inference. For example, Lehmann (1992) in a review of the fundamental paper by Neyman and Pearson (1933) says: "Nevertheless, despite their shortcomings, the new paradigm formulated in the 1933 paper, and the many developments carried out within its framework continue to play a central role in both the theory and practice of statistics and can be expected to do so in the foreseeable future".

Significance testing has been the favored statistical tool in some experimental social sciences (over 90% of articles in the Journal of Applied Psychology during the early 1990s).[12] Other fields have favored the estimation of parameters (e.g., effect size). Significance testing is used as a substitute for the traditional comparison of predicted value and experimental result at the core of the scientific method. When theory is only capable of predicting the sign of a relationship, a directional (one-sided) hypothesis test can be configured so that only a statistically significant result supports theory. This form of theory appraisal is the most heavily criticized application of hypothesis testing.

26.2.3 Cautions

"If the government required statistical procedures to carry warning labels like those on drugs, most inference methods would have long labels indeed."[13] This caution applies to hypothesis tests and alternatives to them.

The successful hypothesis test is associated with a probability and a type-I error rate. The conclusion might be wrong.

The conclusion of the test is only as solid as the sample upon which it is based. The design of the experiment is critical. A number of unexpected effects have been observed including:

• The Placebo effect. Pills with no medically active ingredients were remarkably effective.

A statistical analysis of misleading data produces misleading conclusions. The issue of data quality can be more subtle. In forecasting for example, there is no agreement on a measure of forecast accuracy. In the absence of a consensus measurement, no decision based on measurements will be without controversy.

The book How to Lie with Statistics[14][15] is the most popular book on statistics ever published.[16] It does not much consider hypothesis testing, but its cautions are applicable, including: Many claims are made on the basis of samples too small to convince. If a report does not mention sample size, be doubtful.

Hypothesis testing acts as a filter of statistical conclusions; only those results meeting a probability threshold are publishable. Economics also acts as a publication filter; only those results favorable to the author and funding source may be submitted for publication. The impact of filtering on publication is termed publication bias. A related problem is that of multiple testing (sometimes linked to data mining), in which a variety of tests for a variety of possible effects are applied to a single data set and only those yielding a significant result are reported. These are
often dealt with by using multiplicity correction procedures that control the family wise error rate (FWER) or the false discovery rate (FDR).

Those making critical decisions based on the results of a hypothesis test are prudent to look at the details rather than the conclusion alone. In the physical sciences most results are fully accepted only when independently confirmed. The general advice concerning statistics is, "Figures never lie, but liars figure" (anonymous).

26.3 Example

26.3.1 Lady tasting tea

Main article: Lady tasting tea

In a famous example of hypothesis testing, known as the Lady tasting tea,[17] a female colleague of Fisher claimed to be able to tell whether the tea or the milk was added first to a cup. Fisher proposed to give her eight cups, four of each variety, in random order. One could then ask what the probability was for her getting the number she got correct, but just by chance. The null hypothesis was that the Lady had no such ability. The test statistic was a simple count of the number of successes in selecting the 4 cups. The critical region was the single case of 4 successes of 4 possible based on a conventional probability criterion (< 5%; 1 of 70 ≈ 1.4%). Fisher asserted that no alternative hypothesis was (ever) required. The lady correctly identified every cup,[18] which would be considered a statistically significant result.
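The 1-in-70 figure can be checked directly: there are C(8, 4) = 70 equally likely ways to pick which 4 of the 8 cups had the milk added first, and only one of them is exactly right. A short verification (the code itself is illustrative, not part of the original example):

    from math import comb

    ways = comb(8, 4)                 # ways to choose the 4 "milk first" cups out of 8
    p_all_correct = 1 / ways          # probability of a perfect selection by pure guessing
    print(ways, p_all_correct)        # 70, about 0.0143 (roughly 1.4%, below the 5% criterion)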
26.3.2 Analogy: Courtroom trial

A statistical test procedure is comparable to a criminal trial; a defendant is considered not guilty as long as his or her guilt is not proven. The prosecutor tries to prove the guilt of the defendant. Only when there is enough charging evidence is the defendant convicted.

In the start of the procedure, there are two hypotheses H0: "the defendant is not guilty", and H1: "the defendant is guilty". The first one is called the null hypothesis, and is for the time being accepted. The second one is called the alternative (hypothesis). It is the hypothesis one hopes to support.

The hypothesis of innocence is only rejected when an error is very unlikely, because one doesn't want to convict an innocent defendant. Such an error is called an error of the first kind (i.e., the conviction of an innocent person), and the occurrence of this error is controlled to be rare. As a consequence of this asymmetric behaviour, the error of the second kind (acquitting a person who committed the crime), is often rather large.

A criminal trial can be regarded as either or both of two decision processes: guilty vs not guilty, or evidence vs a threshold ("beyond a reasonable doubt"). In one view, the defendant is judged; in the other view the performance of the prosecution (which bears the burden of proof) is judged. A hypothesis test can be regarded as either a judgment of a hypothesis or as a judgment of evidence.

26.3.3 Example 1: Philosopher's beans

Few beans of this handful are white.
Most beans in this bag are white.
Therefore: Probably, these beans were taken from another bag.
This is an hypothetical inference.

The beans in the bag are the population. The handful are the sample. The null hypothesis is that the sample originated from the population. The criterion for rejecting the null hypothesis is the "obvious" difference in appearance (an informal difference in the mean). The interesting result is that consideration of a real population and a real sample produced an imaginary bag. The philosopher was considering logic rather than probability. To be a real statistical hypothesis test, this example requires the formalities of a probability calculation and a comparison of that probability to a standard.

A simple generalization of the example considers a mixed bag of beans and a handful that contain either very few or very many white beans. The generalization considers both extremes. It requires more calculations and more comparisons to arrive at a formal answer, but the core philosophy is unchanged; if the composition of the handful is greatly different from that of the bag, then the sample probably originated from another bag. The original example is termed a one-sided or a one-tailed test while the generalization is termed a two-sided or two-tailed test.

The statement also relies on the inference that the sampling was random. If someone had been picking through the bag to find white beans, then it would explain why the handful had so many white beans, and also explain why the number of white beans in the bag was depleted (although the bag is probably intended to be assumed much larger than one's hand).

26.3.4 Example 2: Clairvoyant card game[20]

A person (the subject) is tested for clairvoyance. He is shown the reverse of a randomly chosen playing card 25 times and asked which of the four suits it belongs to. The number of hits, or correct answers, is called X.
As we try to find evidence of his clairvoyance, for the time being the null hypothesis is that the person is not clairvoyant. The alternative is, of course: the person is (more or less) clairvoyant.

If the null hypothesis is valid, the only thing the test person can do is guess. For every card, the probability (relative frequency) of any single suit appearing is 1/4. If the alternative is valid, the test subject will predict the suit correctly with probability greater than 1/4. We will call the probability of guessing correctly p. The hypotheses, then, are:

null hypothesis: H0: p = 1/4 (just guessing)

and

alternative hypothesis: H1: p > 1/4 (true clairvoyant).

When the test subject correctly predicts all 25 cards, we will consider him clairvoyant, and reject the null hypothesis. Thus also with 24 or 23 hits. With only 5 or 6 hits, on the other hand, there is no cause to consider him so. But what about 12 hits, or 17 hits? What is the critical number, c, of hits, at which point we consider the subject to be clairvoyant? How do we determine the critical value c? It is obvious that with the choice c = 25 (i.e. we only accept clairvoyance when all cards are predicted correctly) we're more critical than with c = 10. In the first case almost no test subjects will be recognized to be clairvoyant, in the second case, a certain number will pass the test. In practice, one decides how critical one will be. That is, one decides how often one accepts an error of the first kind: a false positive, or Type I error. With c = 25 the probability of such an error is:

P(reject H0 | H0 is valid) = P(X = 25 | p = 1/4) = (1/4)^{25} ≈ 10^{-15},

and hence, very small. The probability of a false positive is the probability of randomly guessing correctly all 25 times.

Being less critical, with c = 10, gives:

P(reject H0 | H0 is valid) = P(X ≥ 10 | p = 1/4) = \sum_{k=10}^{25} P(X = k | p = 1/4) ≈ 0.07.

Thus, c = 10 yields a much greater probability of false positive.

Before the test is actually performed, the maximum acceptable probability of a Type I error (α) is determined. Typically, values in the range of 1% to 5% are selected. (If the maximum acceptable error rate is zero, an infinite number of correct guesses is required.) Depending on this Type I error rate, the critical value c is calculated. For example, if we select an error rate of 1%, c is calculated thus:

P(reject H0 | H0 is valid) = P(X ≥ c | p = 1/4) ≤ 0.01.

From all the numbers c with this property, we choose the smallest, in order to minimize the probability of a Type II error, a false negative. For the above example, we select: c = 13.
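The probabilities quoted in this example follow from the Binomial(25, 1/4) distribution of X under H0; the sketch below (the helper name is illustrative) reproduces the roughly 10^-15 and 0.07 error probabilities and the choice c = 13 for a 1% error rate.

    from math import comb

    def upper_tail(n, k, p):
        """P(X >= k) for X ~ Binomial(n, p): the Type I error of the rule 'reject H0 if X >= k'."""
        return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

    n, p = 25, 0.25
    print(upper_tail(n, 25, p))   # ~ 1e-15: probability of 25 lucky guesses in a row
    print(upper_tail(n, 10, p))   # ~ 0.07: Type I error of the rule with c = 10

    # Smallest c whose Type I error does not exceed 1%:
    c = next(k for k in range(n + 1) if upper_tail(n, k, p) <= 0.01)
    print(c)                      # 13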
26.3.5 Example 3: Radioactive suitcase

As an example, consider determining whether a suitcase contains some radioactive material. Placed under a Geiger counter, it produces 10 counts per minute. The null hypothesis is that no radioactive material is in the suitcase and that all measured counts are due to ambient radioactivity typical of the surrounding air and harmless objects. We can then calculate how likely it is that we would observe 10 counts per minute if the null hypothesis were true. If the null hypothesis predicts (say) on average 9 counts per minute, then according to the Poisson distribution typical for radioactive decay there is about a 41% chance of recording 10 or more counts. Thus we can say that the suitcase is compatible with the null hypothesis (this does not guarantee that there is no radioactive material, just that we don't have enough evidence to suggest there is). On the other hand, if the null hypothesis predicts 3 counts per minute (for which the Poisson distribution predicts only a 0.1% chance of recording 10 or more counts) then the suitcase is not compatible with the null hypothesis, and there are likely other factors responsible to produce the measurements.

The test does not directly assert the presence of radioactive material. A successful test asserts that the claim of no radioactive material present is unlikely given the reading (and therefore ...). The double negative (disproving the null hypothesis) of the method is confusing, but using a counter-example to disprove is standard mathematical practice. The attraction of the method is its practicality. We know (from experience) the expected range of counts with only ambient radioactivity present, so we can say that a measurement is unusually large. Statistics just formalizes the intuitive by using numbers instead of adjectives. We probably do not know the characteristics of the radioactive suitcases; we just assume that they produce larger readings.

To slightly formalize intuition: radioactivity is suspected if the Geiger-count with the suitcase is among or exceeds the greatest (5% or 1%) of the Geiger-counts made with ambient radiation alone. This makes no assumptions about the distribution of counts. Many ambient radiation observations are required to obtain good probability estimates for rare events.
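The 41% and 0.1% figures come from the upper tail of a Poisson distribution; a quick check (the rates 9 and 3 counts per minute are the ones assumed in the example, and the helper name is illustrative):

    from math import exp, factorial

    def poisson_tail(lam, k):
        """P(X >= k) for X ~ Poisson(lam)."""
        return 1.0 - sum(exp(-lam) * lam**j / factorial(j) for j in range(k))

    print(poisson_tail(9, 10))   # ~ 0.41: 10+ counts is unsurprising at 9 counts per minute
    print(poisson_tail(3, 10))   # ~ 0.001: 10+ counts is very surprising at 3 counts per minute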
The test described here is more fully the null-hypothesis statistical significance test. The null hypothesis represents what we would believe by default, before seeing any evidence. Statistical significance is a possible finding of the test, declared when the observed sample is unlikely to have occurred by chance if the null hypothesis were true. The name of the test describes its formulation and its possible outcome. One characteristic of the test is its crisp decision: to reject or not reject the null hypothesis. A calculated value is compared to a threshold, which is determined from the tolerable risk of error.

26.4 Definition of terms

The following definitions are mainly based on the exposition in the book by Lehmann and Romano:[5]

Statistical hypothesis: A statement about the parameters describing a population (not a sample).

Statistic: A value calculated from a sample, often to summarize the sample for comparison purposes.

Simple hypothesis: Any hypothesis which specifies the population distribution completely.

Composite hypothesis: Any hypothesis which does not specify the population distribution completely.

Null hypothesis (H0): A simple hypothesis associated with a contradiction to a theory one would like to prove.

Alternative hypothesis (H1): A hypothesis (often composite) associated with a theory one would like to prove.

Statistical test: A procedure whose inputs are samples and whose result is a hypothesis.

Region of acceptance: The set of values of the test statistic for which we fail to reject the null hypothesis.

Region of rejection / Critical region: The set of values of the test statistic for which the null hypothesis is rejected.

Critical value: The threshold value delimiting the regions of acceptance and rejection for the test statistic.

Power of a test (1 − β): The test's probability of correctly rejecting the null hypothesis. The complement of the false negative rate, β. Power is termed sensitivity in biostatistics. ("This is a sensitive test. Because the result is negative, we can confidently say that the patient does not have the condition.") See sensitivity and specificity and Type I and type II errors for exhaustive definitions.

Size: For simple hypotheses, this is the test's probability of incorrectly rejecting the null hypothesis: the false positive rate. For composite hypotheses this is the supremum of the probability of rejecting the null hypothesis over all cases covered by the null hypothesis. The complement of the false positive rate is termed specificity in biostatistics. ("This is a specific test. Because the result is positive, we can confidently say that the patient has the condition.") See sensitivity and specificity and Type I and type II errors for exhaustive definitions.

Significance level of a test (α): The upper bound imposed on the size of a test. Its value is chosen by the statistician prior to looking at the data or choosing any particular test to be used. It is the maximum exposure to erroneously rejecting H0 he/she is ready to accept. Testing H0 at significance level α means testing H0 with a test whose size does not exceed α. In most cases, one uses tests whose size is equal to the significance level.

p-value: The probability, assuming the null hypothesis is true, of observing a result at least as extreme as the test statistic.

Statistical significance test: A predecessor to the statistical hypothesis test (see the Origins section). An experimental result was said to be statistically significant if a sample was sufficiently inconsistent with the (null) hypothesis. This was variously considered common sense, a pragmatic heuristic for identifying meaningful experimental results, a convention establishing a threshold of statistical evidence or a method for drawing conclusions from data. The statistical hypothesis test added mathematical rigor and philosophical consistency to the concept by making the alternative hypothesis explicit. The term is loosely used to describe the modern version which is now part of statistical hypothesis testing.

Conservative test: A test is conservative if, when constructed for a given nominal significance level, the true probability of incorrectly rejecting the null hypothesis is never greater than the nominal level.

Exact test: A test in which the significance level or critical value can be computed exactly, i.e., without any approximation. In some contexts this term is restricted to tests applied to categorical data and to permutation tests, in which computations are carried out by complete enumeration of all possible outcomes and their probabilities.

A statistical hypothesis test compares a test statistic (z or t for examples) to a threshold. The test statistic (the formula found in the table below) is based on optimality. For a fixed level of Type I error rate, use of these statistics minimizes Type II error rates (equivalent to maximizing power). The following terms describe tests in terms of such optimality:
Most powerful test: For a given size or significance level, the test with the greatest power (probability of rejection) for a given value of the parameter(s) being tested, contained in the alternative hypothesis.

Uniformly most powerful test (UMP): A test with the greatest power for all values of the parameter(s) being tested, contained in the alternative hypothesis.

26.5 Common test statistics

Main article: Test statistic

One-sample tests are appropriate when a sample is being compared to the population from a hypothesis. The population characteristics are known from theory or are calculated from the population.

Two-sample tests are appropriate for comparing two samples, typically experimental and control samples from a scientifically controlled experiment.

Paired tests are appropriate for comparing two samples where it is impossible to control important variables. Rather than comparing two sets, members are paired between samples so the difference between the members becomes the sample. Typically the mean of the differences is then compared to zero. The common example scenario for when a paired difference test is appropriate is when a single set of test subjects has something applied to them and the test is intended to check for an effect.

Z-tests are appropriate for comparing means under stringent conditions regarding normality and a known standard deviation.

A t-test is appropriate for comparing means under relaxed conditions (less is assumed).

Tests of proportions are analogous to tests of means (the 50% proportion).

Chi-squared tests use the same calculations and the same probability distribution for different applications:

• Chi-squared tests for variance are used to determine whether a normal population has a specified variance. The null hypothesis is that it does.

• Chi-squared tests of independence are used for deciding whether two variables are associated or are independent. The variables are categorical rather than numeric. It can be used to decide whether left-handedness is correlated with libertarian politics (or not). The null hypothesis is that the variables are independent. The numbers used in the calculation are the observed and expected frequencies of occurrence (from contingency tables).

• Chi-squared goodness of fit tests are used to determine the adequacy of curves fit to data. The null hypothesis is that the curve fit is adequate. It is common to determine curve shapes to minimize the mean square error, so it is appropriate that the goodness-of-fit calculation sums the squared errors.

F-tests (analysis of variance, ANOVA) are commonly used when deciding whether groupings of data by category are meaningful. If the variance of test scores of the left-handed in a class is much smaller than the variance of the whole class, then it may be useful to study lefties as a group. The null hypothesis is that two variances are the same, so the proposed grouping is not meaningful.

In the table below, the symbols used are defined at the bottom of the table. Many other tests can be found in other articles. Proofs exist that the test statistics are appropriate.[21]
26.6 Origins and early controversy

Significance testing is largely the product of Karl Pearson (p-value, Pearson's chi-squared test), William Sealy Gosset (Student's t-distribution), and Ronald Fisher ("null hypothesis", analysis of variance, "significance test"), while hypothesis testing was developed by Jerzy Neyman and Egon Pearson (son of Karl). Ronald Fisher, mathematician and biologist described by Richard Dawkins as the greatest biologist since Darwin, began his life in statistics as a Bayesian (Zabell 1992), but Fisher soon grew disenchanted with the subjectivity involved (namely use of the principle of indifference when determining prior probabilities), and sought to provide a more objective approach to inductive inference.[27]

Fisher was an agricultural statistician who emphasized rigorous experimental design and methods to extract a result from few samples assuming Gaussian distributions. Neyman (who teamed with the younger Pearson) emphasized mathematical rigor and methods to obtain more results from many samples and a wider range of distributions. Modern hypothesis testing is an inconsistent hybrid of the Fisher vs Neyman/Pearson formulation, methods and terminology developed in the early 20th century. While hypothesis testing was popularized early in the 20th century, evidence of its use can be found much earlier. In the 1770s Laplace considered the statistics of almost half a million births. The statistics showed an excess of boys compared to girls. He concluded by calculation of a p-value that the excess was a real, but unexplained, effect.[28]

Fisher popularized the "significance test". He required a null-hypothesis (corresponding to a population frequency distribution) and a sample. His (now familiar) calculations determined whether to reject the null-hypothesis or not. Significance testing did not utilize an alternative hypothesis, so there was no concept of a Type II error.

The p-value was devised as an informal, but objective, index meant to help a researcher determine (based on other knowledge) whether to modify future experiments or strengthen one's faith in the null hypothesis.
(The null hypothesis is never accepted, but there is a region of acceptance.)

Sometime around 1940,[38] in an apparent effort to provide researchers with a "non-controversial"[40] way to have their cake and eat it too, the authors of statistical textbooks began anonymously combining these two strategies by using the p-value in place of the test statistic (or data) to test against the Neyman-Pearson "significance level".[38] Thus, researchers were encouraged to infer the strength of their data against some null hypothesis using p-values, while also thinking they are retaining the post-data collection objectivity provided by hypothesis testing. It then became customary for the null hypothesis, which was originally some realistic research hypothesis, to be used almost solely as a strawman "nil" hypothesis (one where a treatment has no effect, regardless of the context).[41]

[Table: A comparison between Fisherian and frequentist (Neyman-Pearson) testing]

26.6.1 Early choices of null hypothesis

Paul Meehl has argued that the epistemological importance of the choice of null hypothesis has gone largely unacknowledged. When the null hypothesis is predicted by theory, a more precise experiment will be a more severe test of the underlying theory. When the null hypothesis defaults to "no difference" or "no effect", a more precise experiment is a less severe test of the theory that motivated performing the experiment.[42] An examination of the origins of the latter practice may therefore be useful:

1778: Pierre Laplace compares the birthrates of boys and girls in multiple European cities. He states: "it is natural to conclude that these possibilities are very nearly in the same ratio". Thus Laplace's null hypothesis that the birthrates of boys and girls should be equal, given "conventional wisdom".[28]

1900: Karl Pearson develops the chi squared test to determine "whether a given form of frequency curve will effectively describe the samples drawn from a given population." Thus the null hypothesis is that a population is described by some distribution predicted by theory. He uses as an example the numbers of five and sixes in the Weldon dice throw data.[43]

1904: Karl Pearson develops the concept of "contingency" in order to determine whether outcomes are independent of a given categorical factor. Here the null hypothesis is by default that two things are unrelated (e.g. scar formation and death rates from smallpox).[44] The null hypothesis in this case is no longer predicted by theory or conventional wisdom, but is instead the principle of indifference that led Fisher and others to dismiss the use of "inverse probabilities".[45]

26.7 Null hypothesis statistical significance testing vs hypothesis testing

An example of Neyman-Pearson hypothesis testing can be made by a change to the radioactive suitcase example. If the "suitcase" is actually a shielded container for the transportation of radioactive material, then a test might be used to select among three hypotheses: no radioactive source present, one present, two (all) present. The test could be required for safety, with actions required in each case. The Neyman-Pearson lemma of hypothesis testing says that a good criterion for the selection of hypotheses is the ratio of their probabilities (a likelihood ratio). A simple method of solution is to select the hypothesis with the highest probability for the Geiger counts observed. The typical result matches intuition: few counts imply no source, many counts imply two sources and intermediate counts imply one source.
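A minimal sketch of "select the hypothesis with the highest probability for the Geiger counts observed", assuming Poisson counting and purely illustrative mean rates for the three hypotheses (background only, one source, two sources); the rates and function names are assumptions, not taken from the article:

    from math import exp, factorial

    def poisson_pmf(lam, k):
        return exp(-lam) * lam**k / factorial(k)

    # Hypothetical mean counts per minute under each hypothesis (assumed, not from the text).
    rates = {"no source": 2.0, "one source": 10.0, "two sources": 20.0}

    def most_likely_hypothesis(observed_counts):
        """Pick the hypothesis whose Poisson likelihood of the observed count is largest."""
        return max(rates, key=lambda h: poisson_pmf(rates[h], observed_counts))

    for counts in (1, 8, 25):
        print(counts, most_likely_hypothesis(counts))   # few -> none, intermediate -> one, many -> two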
Neyman-Pearson theory can accommodate both prior probabilities and the costs of actions resulting from decisions.[46] The former allows each test to consider the results of earlier tests (unlike Fisher's significance tests). The latter allows the consideration of economic issues (for example) as well as probabilities. A likelihood ratio remains a good criterion for selecting among hypotheses.

The two forms of hypothesis testing are based on different problem formulations. The original test is analogous to a true/false question; the Neyman-Pearson test is more like multiple choice. In the view of Tukey[47] the former produces a conclusion on the basis of only strong evidence while the latter produces a decision on the basis of available evidence. While the two tests seem quite different both mathematically and philosophically, later developments lead to the opposite claim. Consider many tiny radioactive sources. The hypotheses become 0, 1, 2, 3... grains of radioactive sand. There is little distinction between none or some radiation (Fisher) and 0 grains of radioactive sand versus all of the alternatives (Neyman-Pearson). The major Neyman-Pearson paper of 1933[31] also considered composite hypotheses (ones whose distribution includes an unknown parameter). An example proved the optimality of the (Student's) t-test, "there can be no better test for the hypothesis under consideration" (p 321). Neyman-Pearson theory was proving the optimality of Fisherian methods from its inception.

Fisher's significance testing has proven a popular flexible statistical tool in application with little mathematical growth potential. Neyman-Pearson hypothesis testing is claimed as a pillar of mathematical statistics,[48] creating a new paradigm for the field. It also stimulated new applications in statistical process control, detection theory, decision theory and game theory. Both formulations have been successful, but the successes have been of a different character.
The dispute over formulations is unresolved. Science primarily uses Fisher's (slightly modified) formulation as taught in introductory statistics. Statisticians study Neyman-Pearson theory in graduate school. Mathematicians are proud of uniting the formulations. Philosophers consider them separately. Learned opinions deem the formulations variously competitive (Fisher vs Neyman), incompatible[27] or complementary.[33] The dispute has become more complex since Bayesian inference has achieved respectability.

The terminology is inconsistent. Hypothesis testing can mean any mixture of two formulations that both changed with time. Any discussion of significance testing vs hypothesis testing is doubly vulnerable to confusion.

Fisher thought that hypothesis testing was a useful strategy for performing industrial quality control; however, he strongly disagreed that hypothesis testing could be useful for scientists.[30] Hypothesis testing provides a means of finding test statistics used in significance testing.[33] The concept of power is useful in explaining the consequences of adjusting the significance level and is heavily used in sample size determination. The two methods remain philosophically distinct.[35] They usually (but not always) produce the same mathematical answer. The preferred answer is context dependent.[33] While the existing merger of Fisher and Neyman-Pearson theories has been heavily criticized, modifying the merger to achieve Bayesian goals has been considered.[49]

26.8 Criticism

See also: p-value criticisms

Criticism of statistical hypothesis testing fills volumes[50][51][52][53][54][55] citing 300-400 primary references. Much of the criticism can be summarized by the following issues:

• The interpretation of a p-value is dependent upon stopping rule and definition of multiple comparison. The former often changes during the course of a study and the latter is unavoidably ambiguous. (i.e. "p values depend on both the (data) observed and on the other possible (data) that might have been observed but weren't").[56]

• Confusion resulting (in part) from combining the methods of Fisher and Neyman-Pearson, which are conceptually distinct.[47]

• Emphasis on statistical significance to the exclusion of estimation and confirmation by repeated experiments.[57]

• Rigidly requiring statistical significance as a criterion for publication, resulting in publication bias.[58]

• Most of the criticism is indirect. Rather than being wrong, statistical hypothesis testing is misunderstood, overused and misused.

• When used to detect whether a difference exists between groups, a paradox arises. As improvements are made to experimental design (e.g., increased precision of measurement and sample size), the test becomes more lenient. Unless one accepts the absurd assumption that all sources of noise in the data cancel out completely, the chance of finding statistical significance in either direction approaches 100%.[59]

• Layers of philosophical concerns. The probability of statistical significance is a function of decisions made by experimenters/analysts.[10] If the decisions are based on convention they are termed arbitrary or mindless,[40] while those not so based may be termed subjective. To minimize type II errors, large samples are recommended. In psychology practically all null hypotheses are claimed to be false for sufficiently large samples so "...it is usually nonsensical to perform an experiment with the sole aim of rejecting the null hypothesis".[60] "Statistically significant findings are often misleading" in psychology.[61] Statistical significance does not imply practical significance and correlation does not imply causation. Casting doubt on the null hypothesis is thus far from directly supporting the research hypothesis.

• "[I]t does not tell us what we want to know".[62] Lists of dozens of complaints are available.[54][63]

Critics and supporters are largely in factual agreement regarding the characteristics of null hypothesis significance testing (NHST): while it can provide critical information, it is inadequate as the sole tool for statistical analysis. Successfully rejecting the null hypothesis may offer no support for the research hypothesis. The continuing controversy concerns the selection of the best statistical practices for the near-term future given the (often poor) existing practices. Critics would prefer to ban NHST completely, forcing a complete departure from those practices, while supporters suggest a less absolute change.

Controversy over significance testing, and its effects on publication bias in particular, has produced several results. The American Psychological Association has strengthened its statistical reporting requirements after review,[64] medical journal publishers have recognized the obligation to publish some results that are not statistically significant to combat publication bias,[65] and a journal (Journal of Articles in Support of the Null Hypothesis) has been created to publish such results exclusively.[66] Textbooks have added some cautions[67] and increased coverage of the tools necessary to estimate the size of the sample required to produce significant results. Major organizations have not abandoned use of significance tests although some have discussed doing so.[64]
A typical introductory statistics class places much emphasis on hypothesis testing, perhaps half of the course. Such fields as literature and divinity now include findings based on statistical analysis (see the Bible Analyzer). An introductory statistics class teaches hypothesis testing as a cookbook process. Hypothesis testing is also taught at the postgraduate level. Statisticians learn how to create good statistical test procedures (like z, Student's t, F and chi-squared). Statistical hypothesis testing is considered a mature area within statistics,[69] but a limited amount of development continues.

The cookbook method of teaching introductory statistics leaves no time for history, philosophy or controversy. Hypothesis testing has been taught as a received unified method. Surveys showed that graduates of the class were filled with philosophical misconceptions (on all aspects of statistical inference) that persisted among instructors.[80] While the problem was addressed more than a decade ago,[81] and calls for educational reform continue,[82] students still graduate from statistics classes holding fundamental misconceptions about hypothesis testing.[83] Ideas for improving the teaching of hypothesis testing include encouraging students to search for statistical errors in published papers, teaching the history of statistics and emphasizing the controversy in a generally dry subject.[84]

References

[2] Burnham, K. P.; Anderson, D. R. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach (2nd ed.). Springer-Verlag. ISBN 0-387-95364-7.

[3] Schervish, M. (1996). Theory of Statistics, p. 218. Springer. ISBN 0-387-94546-6.

[4] Kaye, David H.; Freedman, David A. (2011). "Reference Guide on Statistics". Reference Manual on Scientific Evidence (3rd ed.). Eagan, MN; Washington, D.C.: West; National Academies Press. p. 259. ISBN 978-0-309-21421-6.

[5] Lehmann, E. L.; Romano, Joseph P. (2005). Testing Statistical Hypotheses (3rd ed.). New York: Springer. ISBN 0-387-98864-5.

[6] Triola, Mario (2001). Elementary Statistics (8th ed.). Boston: Addison-Wesley. p. 388. ISBN 0-201-61477-4.

[7] Hinkelmann, Klaus and Kempthorne, Oscar (2008). Design and Analysis of Experiments. Vols. I and II (2nd ed.). Wiley. ISBN 978-0-470-38551-7.

[8] Montgomery, Douglas (2009). Design and Analysis of Experiments. Hoboken, NJ: Wiley. ISBN 978-0-470-12866-4.

[18] Box, Joan Fisher (1978). R. A. Fisher, The Life of a Scientist. New York: Wiley. p. 134. ISBN 0-471-09300-9.

[19] C. S. Peirce (August 1878). "Illustrations of the Logic of Science VI: Deduction, Induction, and Hypothesis". Popular Science Monthly 13. Retrieved 30 March 2012.

[20] Jaynes, E. T. (2007). Probability Theory: The Logic of Science (5th printing). Cambridge: Cambridge University Press. ISBN 978-0-521-59271-0.

[21] Loveland, Jennifer L. (2011). Mathematical Justification of Introductory Hypothesis Tests and Development of Reference Materials (M.Sc. (Mathematics)). Utah State University. Retrieved April 2013. Abstract: "The focus was on the Neyman-Pearson approach to hypothesis testing. A brief historical development of the Neyman-Pearson

[32] Goodman, S. N. (June 15, 1999). "Toward evidence-based medical statistics. 1: The P value fallacy". Ann Intern Med 130 (12): 995-1004. doi:10.7326/0003-4819-130-12-199906150-00008. PMID 10383371.

[33] Lehmann, E. L. (December 1993). "The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?". Journal of the American Statistical Association 88 (424): 1242-1249. doi:10.1080/01621459.1993.10476404.

[34] Fisher, R. A. (1958). "The Nature of Probability" (PDF). Centennial Review 2: 261-274. "We are quite in danger of sending highly trained and highly intelligent young men out into the world with tables of erroneous numbers under their arms, and with a dense fog in the place where their brains ought to be. In this century, of course, they will
approach is followed by mathematical proofs of each of be working on guided missiles and advising the medical
the hypothesis tests covered in the reference material. profession on the control of disease, and there is no limit
The proofs do not reference the concepts introduced by to the extent to which they could impede every sort of
Neyman and Pearson, instead they show that traditional national eort.
test statistics have the probability distributions ascribed [35] Lenhard, Johannes (2006). Models and Statisti-
to them, so that signicance calculations assuming those cal Inference: The Controversy between Fisher and
distributions are correct. The thesis information is also NeymanPearson. Brit. J. Phil. Sci. 57: 6991.
posted at mathnstats.com as of April 2013. doi:10.1093/bjps/axi152.
[22] NIST handbook: Two-Sample t-test for Equal Means [36] Neyman, Jerzy (1967). RA Fisher (18901962):
An Appreciation.. Science. 156.3781: 14561460.
[23] Steel, R.G.D, and Torrie, J. H., Principles and Procedures doi:10.1126/science.156.3781.1456.
of Statistics with Special Reference to the Biological Sci-
ences., McGraw Hill, 1960, page 350. [37] Losavich, J. L.; Neyman, J.; Scott, E. L.; Wells, M. A.
(1971). Hypothetical explanations of the negative appar-
[24] Weiss, Neil A. (1999). Introductory Statistics (5th ed.). p. ent eects of cloud seeding in the Whitetop Experiment..
802. ISBN 0-201-59877-9. Proceedings of the U.S. National Academy of Sciences 68:
26432646. doi:10.1073/pnas.68.11.2643.
[25] NIST handbook: F-Test for Equality of Two Standard De-
viations (Testing standard deviations the same as testing [38] Halpin, P F; Stam, HJ (Winter 2006). Inductive Infer-
variances) ence or Inductive Behavior: Fisher and Neyman: Pear-
son Approaches to Statistical Testing in Psychological Re-
[26] Steel, R.G.D, and Torrie, J. H., Principles and Procedures search (19401960)". The American Journal of Psychol-
of Statistics with Special Reference to the Biological Sci- ogy 119 (4): 625653. doi:10.2307/20445367. JSTOR
ences., McGraw Hill, 1960, page 288.) 20445367. PMID 17286092.
[27] Raymond Hubbard, M.J. Bayarri, P Values are not Error [39] Gigerenzer, Gerd; Zeno Swijtink; Theodore Porter; Lor-
Probabilities. A working paper that explains the dier- raine Daston; John Beatty; Lorenz Kruger (1989). Part
ence between Fishers evidential p-value and the Neyman 3: The Inference Experts. The Empire of Chance: How
Pearson Type I error rate . Probability Changed Science and Everyday Life. Cam-
bridge University Press. pp. 70122. ISBN 978-0-521-
[28] Laplace, P (1778). Memoire Sur Les Probabilities 39838-1.
(PDF). Memoirs de lAcademie royale des Sciences de Paris
9: 227332. [40] Gigerenzer, G (November 2004). Mindless statis-
tics. The Journal of Socio-Economics 33 (5): 587606.
[29] Stigler, Stephen M. (1986). The History of Statistics: The doi:10.1016/j.socec.2004.09.033.
Measurement of Uncertainty before 1900. Cambridge,
[41] Loftus, G R (1991). On the Tyranny of Hypothesis Test-
Mass: Belknap Press of Harvard University Press. p. 134.
ing in the Social Sciences (PDF). Contemporary Psychol-
ISBN 0-674-40340-1.
ogy 36 (2): 102105. doi:10.1037/029395.
[30] Fisher, R (1955). Statistical Methods and Scientic In- [42] Meehl, P (1990). Appraising and Amending Theories:
duction (PDF). Journal of the Royal Statistical Society, The Strategy of Lakatosian Defense and Two Principles
Series B 17 (1): 6978. That Warrant It (PDF). Psychological Inquiry 1 (2): 108
141. doi:10.1207/s15327965pli0102_1.
[31] Neyman, J; Pearson, E. S. (January 1, 1933). On
the Problem of the most Ecient Tests of Sta- [43] Pearson, K (1900). On the criterion that a given sys-
tistical Hypotheses. Philosophical Transactions tem of deviations from the probable in the case of a cor-
of the Royal Society A 231 (694706): 289337. related system of variables is such that it can be rea-
doi:10.1098/rsta.1933.0009. sonably supposed to have arisen from random sampling
26.13. REFERENCES 191
(PDF). Philosophical Magazine Series 5 (50): 157175. [57] Yates, Frank (1951). The Inuence of Statis-
doi:10.1080/14786440009463897. tical Methods for Research Workers on the De-
velopment of the Science of Statistics. Journal
[44] Pearson, K (1904). On the Theory of Contingency of the American Statistical Association 46: 1934.
and Its Relation to Association and Normal Correlation doi:10.1080/01621459.1951.10500764. The emphasis
(PDF). Drapers Company Research Memoirs Biometric given to formal tests of signicance throughout [R.A.
Series 1: 135. Fishers] Statistical Methods ... has caused scientic re-
search workers to pay undue attention to the results of the
[45] Zabell, S (1989). R. A. Fisher on the History of In- tests of signicance they perform on their data, particu-
verse Probability. Statistical Science 4 (3): 247256. larly data derived from experiments, and too little to the
doi:10.1214/ss/1177012488. JSTOR 2245634. estimates of the magnitude of the eects they are investi-
gating. ... The emphasis on tests of signicance and the
[46] Ash, Robert (1970). Basic probability theory. New York: consideration of the results of each experiment in isola-
Wiley. ISBN 978-0471034506.Section 8.2 tion, have had the unfortunate consequence that scientic
workers have often regarded the execution of a test of sig-
[47] Tukey, John W. (1960). Conclusions vs de- nicance on an experiment as the ultimate objective.
cisions. Technometrics 26 (4): 423433.
doi:10.1080/00401706.1960.10489909. Until we [58] Begg, Colin B.; Berlin, Jesse A. (1988). Publication bias:
go through the accounts of testing hypotheses, separating a problem in interpreting medical data. Journal of the
[Neyman-Pearson] decision elements from [Fisher] Royal Statistical Society, Series A: 419463.
conclusion elements, the intimate mixture of disparate
elements will be a continual source of confusion. ... [59] Meehl, Paul E. (1967). Theory-Testing in Psychology
There is a place for both doing ones best and saying and Physics: A Methodological Paradox (PDF). Philos-
only what is certain, but it is important to know, in each ophy of Science 34 (2): 103115. doi:10.1086/288135.
instance, both which one is being done, and which one Thirty years later, Meehl acknowledged statistical signif-
ought to be done. icance theory to be mathematically sound while contin-
uing to question the default choice of null hypothesis,
[48] Stigler, Stephen M. (Aug 1996). The History of Statis- blaming instead the social scientists poor understanding
tics in 1933. Statistical Science 11 (3): 244252. of the logical relation between theory and fact in The
doi:10.1214/ss/1032280216. JSTOR 2246117. Problem Is Epistemology, Not Statistics: Replace Signif-
icance Tests by Condence Intervals and Quantify Accu-
[49] Berger, James O. (2003). Could Fisher, Jereys and racy of Risky Numerical Predictions (Chapter 14 in Har-
Neyman Have Agreed on Testing?". Statistical Science 18 low (1997)).
(1): 132. doi:10.1214/ss/1056397485.
[60] Nunnally, Jum (1960). The place of statistics in psychol-
[50] Morrison, Denton; Henkel, Ramon, ed. (2006) [1970].
ogy. Educational and Psychological Measurement 20 (4):
The Signicance Test Controversy. AldineTransaction.
641650. doi:10.1177/001316446002000401.
ISBN 0-202-30879-0.
[61] Lykken, David T. (1991). Whats wrong with psychol-
[51] Oakes, Michael (1986). Statistical Inference: A Commen-
ogy, anyway?". Thinking Clearly About Psychology 1: 3
tary for the Social and Behavioural Sciences. Chichester
39.
New York: Wiley. ISBN 0471104434.
[52] Chow, Siu L. (1997). Statistical Signicance: Rationale, [62] Jacob Cohen (December 1994). The Earth Is Round
Validity and Utility. ISBN 0-7619-5205-5. (p < .05)". American Psychologist 49 (12): 9971003.
doi:10.1037/0003-066X.49.12.997. This paper lead to
[53] Harlow, Lisa Lavoie; Stanley A. Mulaik; James H. Steiger, the review of statistical practices by the APA. Cohen was
ed. (1997). What If There Were No Signicance Tests?. a member of the Task Force that did the review.
Lawrence Erlbaum Associates. ISBN 978-0-8058-2634-
0. [63] Nickerson, Raymond S. (2000). Null Hypothesis Sig-
nicance Tests: A Review of an Old and Continuing
[54] Kline, Rex (2004). Beyond Signicance Testing: Reform- Controversy. Psychological Methods 5 (2): 241301.
ing Data Analysis Methods in Behavioral Research. Wash- doi:10.1037/1082-989X.5.2.241. PMID 10937333.
ington, DC: American Psychological Association. ISBN
9781591471189. [64] Wilkinson, Leland (1999). Statistical Methods in Psy-
chology Journals; Guidelines and Explanations. Amer-
[55] McCloskey, Deirdre N.; Stephen T. Ziliak (2008). The ican Psychologist 54 (8): 594604. doi:10.1037/0003-
Cult of Statistical Signicance: How the Standard Error 066X.54.8.594. Hypothesis tests. It is hard to imagine
Costs Us Jobs, Justice, and Lives. University of Michigan a situation in which a dichotomous accept-reject decision
Press. ISBN 0-472-05007-9. is better than reporting an actual p value or, better still,
a condence interval. (p 599). The committee used the
[56] Corneld, Jerome (1976). Recent Methodological Con- cautionary term forbearance in describing its decision
tributions to Clinical Trials (PDF). American Journal of against a ban of hypothesis testing in psychology report-
Epidemiology 104 (4): 408421. ing. (p 603)
192 CHAPTER 26. STATISTICAL HYPOTHESIS TESTING
[65] ICMJE: Obligation to Publish Negative Studies. Re- [77] College Board Tests > AP: Subjects > Statistics The Col-
trieved 3 September 2012. Editors should seriously con- lege Board (relates to USA students)
sider for publication any carefully done study of an impor-
tant question, relevant to their readers, whether the results [78] Hu, Darrell (1993). How to lie with statistics. New York:
for the primary or any additional outcome are statistically Norton. p. 8. ISBN 0-393-31072-8.'Statistical methods
signicant. Failure to submit or publish ndings because and statistical terms are necessary in reporting the mass
of lack of statistical signicance is an important cause of data of social and economic trends, business conditions,
publication bias. opinion polls, the census. But without writers who use
the words with honesty and readers who know what they
[66] Journal of Articles in Support of the Null Hypothesis web- mean, the result can only be semantic nonsense.'
site: JASNH homepage. Volume 1 number 1 was pub-
lished in 2002, and all articles are on psychology-related [79] Snedecor, George W.; Cochran, William G. (1967). Sta-
subjects. tistical Methods (6 ed.). Ames, Iowa: Iowa State Univer-
sity Press. p. 3. "...the basic ideas in statistics assist us
[67] Howell, David (2002). Statistical Methods for Psychology in thinking clearly about the problem, provide some guid-
(5 ed.). Duxbury. p. 94. ISBN 0-534-37770-X. ance about the conditions that must be satised if sound
inferences are to be made, and enable us to detect many
[68] Armstrong, J. Scott (2007). Signicance
inferences that have no good logical foundation.
tests harm progress in forecasting. Interna-
tional Journal of Forecasting 23 (2): 321327.
[80] Sotos, Ana Elisa Castro; Vanhoof, Stijn; Noortgate,
doi:10.1016/j.ijforecast.2007.03.004.
Wim Van den; Onghena, Patrick (2007). Students
[69] E. L. Lehmann (1997). Testing Statistical Hypotheses: Misconceptions of Statistical Inference: A Review of
The Story of a Book. Statistical Science 12 (1): 4852. the Empirical Evidence from Research on Statistics Ed-
doi:10.1214/ss/1029963261. ucation. Educational Research Review 2: 98113.
doi:10.1016/j.edurev.2007.04.001.
[70] Kruschke, J K (July 9, 2012). Bayesian Estimation Su-
persedes the T Test. Journal of Experimental Psychology: [81] Moore, David S. (1997). New Pedagogy and New Con-
General N/A (N/A): N/A. doi:10.1037/a0029146. tent: The Case of Statistics. International Statistical Re-
view 65: 123165. doi:10.2307/1403333.
[71] Kass, R E (1993). Bayes factors and model uncertainty
(PDF).Department of Statistics, University of Washing- [82] Hubbard, Raymond; Armstrong, J. Scott (2006).
ton Technical Paper Why We Don't Really Know What Statistical
Signicance Means: Implications for Educa-
[72] Rozeboom, William W (1960), The fallacy of the null- tors. Journal of Marketing Education 28 (2): 114.
hypothesis signicance test (PDF), Psychological Bul- doi:10.1177/0273475306288399. Preprint
letin 57 (5): 416428, doi:10.1037/h0042040 "...the
proper application of statistics to scientic inference is [83] Sotos, Ana Elisa Castro; Vanhoof, Stijn; Noortgate, Wim
irrevocably committed to extensive consideration of in- Van den; Onghena, Patrick (2009). How Condent
verse [AKA Bayesian] probabilities... It was acknowl- Are Students in Their Misconceptions about Hypothesis
edged, with regret, that a priori probability distributions Tests?". Journal of Statistics Education 17 (2).
were available only as a subjective feel, diering from
one person to the next in the more immediate future, at [84] Gigerenzer, G (2004). The Null Ritual What You Al-
least. ways Wanted to Know About Signicant Testing but Were
Afraid to Ask (PDF). The SAGE Handbook of Quan-
[73] Berger, James (2006), The Case for Objective titative Methodology for the Social Sciences: 391408.
Bayesian Analysis, Bayesian Analysis 1 (3): 385402, doi:10.4135/9781412986311.
doi:10.1214/06-ba115 In listing the competing deni-
tions of objective Bayesian analysis, A major goal of
statistics (indeed science) is to nd a completely coherent
objective Bayesian methodology for learning from data. 26.14 Further reading
The author expressed the view that this goal is not
attainable.
Lehmann E.L. (1992) Introduction to Neyman and
[74] Aldrich, J (2008). R. A. Fisher on Bayes and Bayes Pearson (1933) On the Problem of the Most E-
theorem (PDF). Bayesian Analysis 3 (1): 161170. cient Tests of Statistical Hypotheses. In: Break-
doi:10.1214/08-BA306. throughs in Statistics, Volume 1, (Eds Kotz, S., John-
son, N.L.), Springer-Verlag. ISBN 0-387-94037-5
[75] Mayo, D. G.; Spanos, A. (2006). Severe Testing as a Ba- (followed by reprinting of the paper)
sic Concept in a Neyman-Pearson Philosophy of Induc-
tion. The British Journal for the Philosophy of Science 57
Neyman, J.; Pearson, E.S. (1933). On the
(2): 323. doi:10.1093/bjps/axl003.
Problem of the Most Ecient Tests of Statisti-
[76] Mathematics > High School: Statistics & Probability > cal Hypotheses. Philosophical Transactions of
Introduction Common Core State Standards Initiative (re- the Royal Society A 231 (694706): 289337.
lates to USA students) doi:10.1098/rsta.1933.0009.
26.15. EXTERNAL LINKS 193
Bayesian inference
Bayesian inference is a method of statistical inference in for the observed data. Bayesian inference computes the
which Bayes theorem is used to update the probability for posterior probability according to Bayes theorem:
a hypothesis as evidence is acquired. Bayesian inference
is an important technique in statistics, and especially in
mathematical statistics. Bayesian updating is particularly P (E | H) P (H)
P (H | E) =
important in the dynamic analysis of a sequence of data. P (E)
Bayesian inference has found application in a wide range
where
of activities, including science, engineering, philosophy,
medicine, and law. In the philosophy of decision the-
| denotes a conditional probability; more speci-
ory, Bayesian inference is closely related to subjective
cally, it means given.
probability, often called "Bayesian probability". Bayesian
probability provides a rational method for updating be- H stands for any hypothesis whose probability may
liefs. be aected by data (called evidence below). Of-
ten there are competing hypotheses, from which one
chooses the most probable.
27.1 Introduction to Bayes rule the evidence E corresponds to new data that were
not used in computing the prior probability.
_ P (H) , the prior probability, is the probability of H
Relative size Case B Case B Total
before E is observed. This indicates ones previous
Condition A w x w+x estimate of the probability that a hypothesis is true,
Condition y z y+z before gaining the current evidence.
Total w+y x+z w+x+y+z P (H | E) , the posterior probability, is the proba-
bility of H given E , i.e., after E is observed. This
tells us what we want to know: the probability of a
w w+y w hypothesis given the observed evidence.
P (A|B) P (B) = ____ ________ = ________
w+y w+x+y+z w+x+y+z
P (E | H) is the probability of observing E given
H . As a function of H with E xed, this is the
likelihood. The likelihood function should not be
w w+x w
P (B|A) P (A) = ____ ________ = ________ confused with P (H | E) as a function of H rather
w+x w+x+y+z w+x+y+z
than of E . It indicates the compatibility of the evi-
) P()/P(B) etc. dence with the given hypothesis.
P (E) is sometimes termed the marginal likelihood
Main article: Bayes rule or model evidence. This factor is the same for all
See also: Bayesian probability possible hypotheses being considered. (This can be
seen by the fact that the hypothesis H does not ap-
pear anywhere in the symbol, unlike for all the other
factors.) This means that this factor does not enter
27.1.1 Formal into determining the relative probabilities of dier-
ent hypotheses.
Bayesian inference derives the posterior probability as a
consequence of two antecedents, a prior probability and Note that, for dierent values of H , only the factors
a "likelihood function" derived from a statistical model P (H) and P (E | H) aect the value of P (H | E)
194
27.2. FORMAL DESCRIPTION OF BAYESIAN INFERENCE 195
. As both of these factors appear in the numerator, the new evidence. This allows for Bayesian principles to be
posterior probability is proportional to both. In words: applied to various kinds of evidence, whether viewed all
at once or over time. This procedure is termed Bayesian
(more precisely) The posterior probability of a hy- updating.
pothesis is determined by a combination of the inher-
ent likeliness of a hypothesis (the prior) and the com-
patibility of the observed evidence with the hypothesis 27.1.3 Bayesian updating
(the likelihood).
Bayesian updating is widely used and computationally
(more concisely) Posterior is proportional to likeli- convenient. However, it is not the only updating rule that
hood times prior. might be considered rational.
Ian Hacking noted that traditional "Dutch book" argu-
Note that Bayes rule can also be written as follows: ments did not specify Bayesian updating: they left open
the possibility that non-Bayesian updating rules could
avoid Dutch books. Hacking wrote[1] And neither the
P (E | H) Dutch book argument, nor any other in the personalist
P (H | E) = P (H)
P (E) arsenal of proofs of the probability axioms, entails the
dynamic assumption. Not one entails Bayesianism. So
where the factor PP(E|H)
(E) represents the impact of E on the personalist requires the dynamic assumption to be
the probability of H . Bayesian. It is true that in consistency a personalist could
abandon the Bayesian model of learning from experience.
Salt could lose its savour.
27.1.2 Informal
Indeed, there are non-Bayesian updating rules that also
If the evidence does not match up with a hypothesis, one avoid Dutch books (as discussed in the literature on
should reject the hypothesis. But if a hypothesis is ex- "probability kinematics" following the publication of
tremely unlikely a priori, one should also reject it, even if Richard C. Jerey's rule, which applies Bayes rule to the
the evidence does appear to match up. case where the evidence itself is assigned a probability.[2]
The additional hypotheses needed to uniquely require
For example, imagine that I have various hypotheses
Bayesian updating have been deemed to be substantial,
about the nature of a newborn baby of a friend, including:
complicated, and unsatisfactory.[3]
H1 : the baby is a brown-haired boy.
H2 : the baby is a blond-haired girl. 27.2 Formal description of
H3 : the baby is a dog. Bayesian inference
Then consider two scenarios: 27.2.1 Denitions
1. I'm presented with evidence in the form of a pic- x , a data point in general. This may in fact be a
ture of a blond-haired baby girl. I nd this evidence vector of values.
supports H2 and opposes H1 and H3 . , the parameter of the data points distribution, i.e.,
2. I'm presented with evidence in the form of a picture x p(x | ) . This may in fact be a vector of
of a baby dog. Although this evidence, treated in parameters.
isolation, supports H3 , my prior belief in this hy- , the hyperparameter of the parameter, i.e.,
pothesis (that a human can give birth to a dog) is p( | ) . This may in fact be a vector of hyperpa-
extremely small, so the posterior probability is nev- rameters.
ertheless small.
X , a set of n observed data points, i.e., x1 , . . . , xn
The critical point about Bayesian inference, then, is that .
it provides a principled way of combining new evidence x
, a new data point whose distribution is to be pre-
with prior beliefs, through the application of Bayes rule. dicted.
(Contrast this with frequentist inference, which relies
only on the evidence as a whole, with no reference to
prior beliefs.) Furthermore, Bayes rule can be applied 27.2.2 Bayesian inference
iteratively: after observing some evidence, the resulting
posterior probability can then be treated as a prior prob- The prior distribution is the distribution of the pa-
ability, and a new posterior probability computed from rameter(s) before any data is observed, i.e. p( | )
196 CHAPTER 27. BAYESIAN INFERENCE
P( E1 | M3 )
...
p(E | , )
p( | E, ) = p( | )
P( E2 | M3 ) p(E | )
P( E3 | M3 ) p(E | , )
= p( | )
...
p(E|, )p( | ) d
Where
Where
If P (M ) = 0 then P (M | E) = 0 . If P (M ) = 1 ,
then P (M |E) = 1 . This can be interpreted to mean that
hard convictions are insensitive to counter-evidence.
P (E | M ) = P (ek | M ).
k
The former follows directly from Bayes theorem. The
latter can be derived by applying the rst rule to the event
This may be used to optimize practical calculations. not M " in place of " M ", yielding if 1 P (M ) =
0 , then 1 P (M | E) = 0 ", from which the result
immediately follows.
27.3.3 Parametric formulation
By parameterizing the space of models, the belief in all 27.4.3 Asymptotic behaviour of posterior
models may be updated in a single step. The distribution
of belief over the model space may then be thought of Consider the behaviour of a belief distribution as it is
as a distribution of belief over the parameter space. The updated a large number of times with independent and
distributions in this section are expressed as continuous, identically distributed trials. For suciently nice prior
represented by probability densities, as this is the usual probabilities, the Bernstein-von Mises theorem gives that
198 CHAPTER 27. BAYESIAN INFERENCE
in the limit of innite trials, the posterior converges to a There are examples where no maximum is attained, in
Gaussian distribution independent of the initial prior un- which case the set of MAP estimates is empty.
der some conditions rstly outlined and rigorously proven There are other methods of estimation that minimize the
by Joseph L. Doob in 1948, namely if the random vari- posterior risk (expected-posterior loss) with respect to a
able in consideration has a nite probability space. The loss function, and these are of interest to statistical deci-
more general results were obtained later by the statisti- sion theory using the sampling distribution (frequentist
cian David A. Freedman who published in two seminal statistics).
research papers in 1963 and 1965 when and under what
circumstances the asymptotic behaviour of posterior is The posterior predictive distribution of a new observa-
guaranteed. His 1963 paper treats, like Doob (1949), the tion x
(that is independent of previous observations) is
nite case and comes to a satisfactory conclusion. How- determined by
ever, if the random variable has an innite but countable
probability space (i.e., corresponding to a die with in-
nite many faces) the 1965 paper demonstrates that for a
dense subset of priors the Bernstein-von Mises theorem x|X, ) =
p( x, | X, ) d =
p( x | )p( | X, ) d.
p(
is not applicable. In this case there is almost surely no
asymptotic convergence. Later in the 1980s and 1990s
Freedman and Persi Diaconis continued to work on the
case of innite countable probability spaces.[5] To sum- 27.5 Examples
marise, there may be insucient trials to suppress the ef-
fects of the initial choice, and especially for large (but
nite) systems the convergence might be very slow. 27.5.1 Probability of a hypothesis
Taking a value with the greatest probability denes Before we observed the cookie, the probability we as-
maximum a posteriori (MAP) estimates: signed for Fred having chosen bowl #1 was the prior prob-
ability, P (H1 ) , which was 0.5. After observing the
cookie, we must revise the probability to P (H1 | E) ,
{MAP } arg max p( | X, ). which is 0.6.
27.6. IN FREQUENTIST STATISTICS AND DECISION THEORY 199
P (E = GD | C = c) = (0.01+0.16(c11))(0.50.09(c11))
In the rst chapters of this work, prior distributions
P (E = GD with nite support and the corresponding Bayes pro-
| C = c) = (0.01+0.16(c11))(0.5+0.09(c11))
cedures were used to establish some of the main the-
P (E = GD | C = c) = (0.990.16(c11))(0.50.09(c11))
orems relating to the comparison of experiments.
P (E = GD | C = c) = (0.990.16(c11))(0.5+0.09(c11))
Bayes procedures with respect to more general prior
distributions have played a very important role in
Assume a uniform prior of fC (c) = 0.2 , and that tri- the development of statistics, including its asymp-
als are independent and identically distributed. When a totic theory. There are many problems where a
new fragment of type e is discovered, Bayes theorem is glance at posterior distributions, for suitable priors,
applied to update the degree of belief for each c : yields immediately interesting information. Also,
fC (c | E = e) = P (E=e|C=c) this technique can hardly be avoided in sequential
P (E=e) fC (c) =
P (E=e|C=c)
analysis.[10]
16 f C (c)
11
P (E=e|C=c)f (c)dc
C
A computer simulation of the changing belief as 50 frag- A useful fact is that any Bayes decision rule ob-
ments are unearthed is shown on the graph. In the sim- tained by taking a proper prior over the whole pa-
ulation, the site was inhabited around 1420, or c = 15.2 rameter space must be admissible[11]
. By calculating the area under the relevant portion of
the graph for 50 trials, the archaeologist can say that An important area of investigation in the develop-
there is practically no chance the site was inhabited in the ment of admissibility ideas has been that of conven-
11th and 12th centuries, about 1% chance that it was in- tional sampling-theory procedures, and many inter-
habited during the 13th century, 63% chance during the esting results have been obtained.[12]
200 CHAPTER 27. BAYESIAN INFERENCE
27.6.1 Model selection posterior from one stage becoming the prior for the next.
The benet of a Bayesian approach is that it gives the ju-
See Bayesian model selection ror an unbiased, rational mechanism for combining evi-
dence. It may be appropriate to explain Bayes theorem to
jurors in odds form, as betting odds are more widely un-
derstood than probabilities. Alternatively, a logarithmic
27.7 Applications approach, replacing multiplication with addition, might
be easier for a jury to handle.
Bayesian search theory is used to search for lost ob- 27.10 See also
jects.
[9] Lehmann, Erich (1986). Testing Statistical Hypotheses [28] Bernardo, Jos M. (2006). A Bayesian mathematical
(Second ed.). (see p. 309 of Chapter 6.7 Admissibilty, statistics primer (PDF). ICOTS-7.
and pp. 1718 of Chapter 1.8 Complete Classes
[29] Bishop, C. M. (2007). Pattern Recognition and Machine
[10] Le Cam, Lucien (1986). Asymptotic Methods in Statistical Learning. New York: Springer. ISBN 0387310738.
Decision Theory. Springer-Verlag. ISBN 0-387-96307-
3. (From Chapter 12 Posterior Distributions and Bayes
Solutions, p. 324) 27.12 References
[11] Cox, D. R. and Hinkley, D.V (1974). Theoretical Statis-
tics. Chapman and Hall. ISBN 0-04-121537-0. page 432 Aster, Richard; Borchers, Brian, and Thurber,
Cliord (2012). Parameter Estimation and In-
[12] Cox, D. R. and Hinkley, D. V. (1974). Theoretical Statis-
tics. Chapman and Hall. ISBN 0-04-121537-0. p. 433)
verse Problems, Second Edition, Elsevier. ISBN
0123850487, ISBN 978-0123850485
[13] Jim Albert (2009). Bayesian Computation with R, Second
edition. New York, Dordrecht, etc.: Springer. ISBN 978- Bickel, Peter J. and Doksum, Kjell A. (2001). Math-
0-387-92297-3. ematical Statistics, Volume 1: Basic and Selected
Topics (Second (updated printing 2007) ed.). Pear-
[14] Samuel Rathmanner and Marcus Hutter. A Philo- son PrenticeHall. ISBN 0-13-850363-X.
sophical Treatise of Universal Induction. Entropy,
13(6):10761136, 2011. Box, G. E. P. and Tiao, G. C. (1973) Bayesian In-
[15] The Problem of Old Evidence, in 5 of On Universal
ference in Statistical Analysis, Wiley, ISBN 0-471-
Prediction and Bayesian Conrmation, M. Hutter - The- 57428-7
oretical Computer Science, 2007 - Elsevier
Edwards, Ward (1968). Conservatism in Human
[16] Raymond J. Solomono, Peter Gacs, Paul M. B. Vi- Information Processing. In Kleinmuntz, B. Formal
tanyi, 2011 cs.bu.edu Representation of Human Judgment. Wiley.
27.14. EXTERNAL LINKS 203
Edwards, Ward (1982). Conservatism in Human Lee, Peter M. Bayesian Statistics: An Introduction.
Information Processing (excerpted)". In Daniel Fourth Edition (2012), John Wiley ISBN 978-1-
Kahneman, Paul Slovic and Amos Tversky. Judg- 1183-3257-3
ment under uncertainty: Heuristics and biases. Cam-
bridge University Press. Carlin, Bradley P. and Louis, Thomas A. (2008).
Bayesian Methods for Data Analysis, Third Edition.
Jaynes E. T. (2003) Probability Theory: The Logic Boca Raton, FL: Chapman and Hall/CRC. ISBN 1-
of Science, CUP. ISBN 978-0-521-59271-0 (Link 58488-697-8.
to Fragmentary Edition of March 1996).
Gelman, Andrew; Carlin, John B.; Stern, Hal S.;
Howson, C. and Urbach, P. (2005). Scientic Rea- Dunson, David B.; Vehtari, Aki; Rubin, Donald
soning: the Bayesian Approach (3rd ed.). Open B. (2013). Bayesian Data Analysis, Third Edition.
Court Publishing Company. ISBN 978-0-8126- Chapman and Hall/CRC. ISBN 978-1-4398-4095-
9578-6. 5.
Chi-squared distribution
This article is about the mathematics of the chi-squared the covariance matrix).[8] The idea of a family of chi-
distribution. For its uses in statistics, see chi-squared squared distributions, however, is not due to Pearson
test. For the music group, see Chi2 (band). but arose as a further development due to Fisher in the
1920s.[6]
In probability theory and statistics, the chi-squared dis-
tribution (also chi-square or -distribution) with k
degrees of freedom is the distribution of a sum of the 28.2 Denition
squares of k independent standard normal random vari-
ables. It is a special case of the gamma distribution and If Z 1 , ..., Zk are independent, standard normal random
is one of the most widely used probability distributions in variables, then the sum of their squares,
inferential statistics, e.g., in hypothesis testing or in con-
struction of condence intervals.[2][3][4][5] When it is be-
ing distinguished from the more general noncentral chi- k
squared distribution, this distribution is sometimes called Q = Zi2 ,
the central chi-squared distribution. i=1
The chi-squared distribution is used in the common chi- is distributed according to the chi-squared distribution
squared tests for goodness of t of an observed distribu- with k degrees of freedom. This is usually denoted as
tion to a theoretical one, the independence of two criteria
of classication of qualitative data, and in condence in-
terval estimation for a population standard deviation of
Q 2 (k) or Q 2k .
a normal distribution from a sample standard deviation.
Many other statistical tests also use this distribution, like The chi-squared distribution has one parameter: k a
Friedmans analysis of variance by ranks. positive integer that species the number of degrees of
freedom (i.e. the number of Zis)
205
206 CHAPTER 28. CHI-SQUARED DISTRIBUTION
28.3.6 Entropy
x
F (x; 2) = 1 e 2
The dierential entropy is given by
and the form is not much more complicated for other
small even k. [ ( )] ( ) [ ]
k k k k
Tables of the chi-squared cumulative distribution func- h = f (x; k) ln f (x; k) dx = +ln 2 + 1 ,
2 2 2 2
tion are widely available and the function is included in
many spreadsheets and all statistical packages. where (x) is the Digamma function.
Letting z x/k , Cherno bounds on the lower and The chi-squared distribution is the maximum entropy
upper tails of the CDF may be obtained.[9] For the cases probability distribution for a random variate X for which
when 0 < z < 1 (which include all of the cases when E(X) = k and E(ln(X)) = (k/2) + log(2) are
this CDF is less than half): xed. Since the chi-squared is in the family of gamma
28.4. RELATION TO OTHER DISTRIBUTIONS 207
If X ~ (k) then 2X is approximately normally dis- If X Maxwell(1) (Maxwell distribution) then
tributed with mean 2k1 and unit variance (result X 2 2 (3)
credited to R. A. Fisher).
If X 2 () then X 1
Inv-2 () (Inverse-chi-
squared distribution)
If X ~ (k) then X/k is approximately normally
3
Noncentral t-distribution can be obtained from nor- Main article: Noncentral chi-squared distribution
mal distribution and chi-squared distribution
A chi-squared variable with k degrees of freedom is de- The noncentral chi-squared distribution is obtained from
ned as the sum of the squares of k independent standard the sum of the squares of independent Gaussian random
normal random variables. variables having unit variance and nonzero means.
28.8 See also [10] Chi-squared distribution, from MathWorld, retrieved Feb.
11, 2009
Fishers method for combining independent tests of [12] Box, Hunter and Hunter (1978). Statistics for experi-
signicance menters. Wiley. p. 118. ISBN 0471093157.
210 CHAPTER 28. CHI-SQUARED DISTRIBUTION
Chi-squared test
Chi-square distribution, showing X2 on the x-axis and P-value 29.1.1 Pearsons chi-square test
on the y-axis.
Main article: Pearsons chi-square test
211
212 CHAPTER 29. CHI-SQUARED TEST
Likelihood-ratio tests in general statistical mod- The sum of these quantities over all of the cells is the test
elling, for testing whether there is evidence of the statistic. Under the null hypothesis, it has approximately a
need to move from a simple model to a more compli- chi-square distribution whose number of degrees of free-
cated one (where the simple model is nested within dom is
the complicated one).
29.6 References
[1] Yates, F (1934). Contingency table involving small num-
bers and the 2 test. Supplement to the Journal of the
Royal Statistical Society 1(2): 217235. JSTOR 2983604
Goodness of t
The goodness of t of a statistical model describes how measurement error is known, is to construct a weighted
well it ts a set of observations. Measures of goodness sum of squared errors:
of t typically summarize the discrepancy between ob-
served values and the values expected under the model
in question. Such measures can be used in statistical hy-
pothesis testing, e.g. to test for normality of residuals, (O E)2
to test whether two samples are drawn from identical 2 =
2
distributions (see KolmogorovSmirnov test), or whether
outcome frequencies follow a specied distribution (see
Pearsons chi-squared test). In the analysis of variance, where 2 is the known variance of the observation, O
one of the components into which the variance is parti- is the observed data and E is the theoretical data.[1] This
tioned may be a lack-of-t sum of squares. denition is only useful when one has estimates for the er-
ror on the measurements, but it leads to a situation where
a chi-squared distribution can be used to test goodness
30.1 Fit of distributions of t, provided that the errors can be assumed to have a
normal distribution.
In assessing whether a given distribution is suited to a The reduced chi-squared statistic is simply the chi-
data-set, the following tests and their underlying measures squared divided by the number of degrees of free-
of t can be used: dom:[1][2][3][4]
KolmogorovSmirnov test;
Cramrvon Mises criterion;
2 1 (O E)2
AndersonDarling test; 2red = =
2
ShapiroWilk test;
Chi Square test;
where is the number of degrees of freedom, usually
Akaike information criterion; given by N n 1 , where N is the number of observa-
HosmerLemeshow test; tions, and n is the number of tted parameters, assuming
that the mean value is an additional tted parameter. The
advantage of the reduced chi-squared is that it already
30.2 Regression analysis normalizes for the number of data points and model com-
plexity. This is also known as the mean square weighted
In regression analysis, the following topics relate to good- deviation.
ness of t: As a rule of thumb (again valid only when the variance
of the measurement error is known a priori rather than
Coecient of determination (The R estimated from the data), a 2red 1 indicates a poor
squared measure of goodness of t); model t. A 2red > 1 indicates that the t has not fully
Lack-of-t sum of squares. captured the data (or that the error variance has been un-
derestimated). In principle, a value of 2red = 1 indicates
that the extent of the match between observations and es-
30.2.1 Example timates is in accord with the error variance. A 2red < 1
indicates that the model is 'over-tting' the data: either
One way in which a measure of goodness of t statistic the model is improperly tting noise, or the error vari-
can be constructed, in the case where the variance of the ance has been overestimated.[5]
214
30.4. OTHER MEASURES OF FIT 215
Overtting
30.6 References
[1] Laub, Charlie; Kuhl, Tonya L. (n.d.), How Bad is Good? A
Critical Look at the Fitting of Reectivity Models using the
Reduced Chi-Square Statistic (PDF), University California,
Davis, retrieved 30 May 2015
Likelihood-ratio test
Not to be confused with the use of likelihood ratios in .[2] Symbols df1 and df2 represent the number of free
diagnostic testing. parameters of models 1 and 2, the null model and the al-
ternative model, respectively.
In statistics, a likelihood ratio test is a statistical test Here is an example of use. If the null model has 1
used to compare the goodness of t of two models, one of parameter and a log-likelihood of 8024 and the alter-
which (the null model) is a special case of the other (the native model has 3 parameters and a log-likelihood of
alternative model). The test is based on the likelihood 8012, then the probability of this dierence is that of
ratio, which expresses how many times more likely the chi-squared value of +2(8024 8012) = 24 with 3 1
data are under one model than the other. This likelihood = 2 degrees of freedom. Certain assumptions[3] must be
ratio, or equivalently its logarithm, can then be used to met for the statistic to follow a chi-squared distribution,
compute a p-value, or compared to a critical value to de- and often empirical p-values are computed.
cide whether to reject the null model in favour of the al- The likelihood-ratio test requires nested models, i.e.
ternative model. When the logarithm of the likelihood models in which the more complex one can be trans-
ratio is used, the statistic is known as a log-likelihood ra- formed into the simpler model by imposing a set of con-
tio statistic, and the probability distribution of this test straints on the parameters. If the models are not nested,
statistic, assuming that the null model is true, can be ap- then a generalization of the likelihood-ratio test can usu-
proximated using Wilkss theorem. ally be used instead: the relative likelihood.
In the case of distinguishing between two models, each of
which has no unknown parameters, use of the likelihood
ratio test can be justied by the NeymanPearson lemma, 31.2 Simple-vs-simple hypotheses
which demonstrates that such a test has the highest power
among all competitors.[1]
Main article: NeymanPearson lemma
( ) H0 : = 0 ,
model null for likelihood
D = 2 ln H1 : = 1 .
model alternative for likelihood
= 2 ln(model null for likelihood) + 2 ln(model alternative
Notefor
thatlikelihood)
under either hypothesis, the distribution of the
data is fully specied; there are no unknown parameters
The model with more parameters will always t at least as to estimate. The likelihood ratio test is based on the like-
well (have an equal or greater log-likelihood). Whether it lihood ratio, which is often denoted by (the capital
ts signicantly better and should thus be preferred is de- Greek letter lambda). The likelihood ratio is dened as
termined by deriving the probability or p-value of the dif- follows:[4][5]
ference D. Where the null hypothesis represents a special
case of the alternative hypothesis, the probability distri-
bution of the test statistic is approximately a chi-squared L(0 |x) f (i xi |0 )
distribution with degrees of freedom equal to df2 df1 (x) = L(1 |x) = f (i xi |1 )
217
218 CHAPTER 31. LIKELIHOOD-RATIO TEST
or 31.3.1 Interpretation
the probability that coins 1 and 2 come up heads or tails. 31.6 External links
In what follows, i = 1, 2 and j = H, T . The hypothesis
space H is constrained by the usual constraints on a prob- Practical application of likelihood ratio test de-
ability distribution, 0 pij 1 , and piH + piT = 1 . scribed
The space of the null hypothesis H0 is the subspace where
p1j = p2j . Writing nij for the best values for pij under R Package: Walds Sequential Probability Ratio Test
the hypothesis H , the maximum likelihood estimate is
Richard Lowrys Predictive Values and Likelihood
given by
Ratios Online Clinical Calculator
kij
nij = kiH +kiT .
Similarly, the maximum likelihood estimates of pij under
the null hypothesis H0 are given by
k1j +k2j
mij = k1H +k2H +k1T +k2T ,
which does not depend on the coin i .
The hypothesis and null hypothesis can be rewritten
slightly so that they satisfy the constraints for the log-
arithm of the likelihood ratio to have the desired nice
distribution. Since the constraint causes the two-
dimensional H to be reduced to the one-dimensional H0
, the asymptotic distribution for the test will be 2 (1) ,
the 2 distribution with one degree of freedom.
For the general contingency table, we can write the log-
likelihood ratio statistic as
nij
2 log = 2 kij log .
i,j
mij
31.5 References
[1] Neyman, Jerzy; Pearson, Egon S. (1933). On the Prob-
lem of the Most Ecient Tests of Statistical Hypothe-
ses. Philosophical Transactions of the Royal Society
A: Mathematical, Physical and Engineering Sciences 231
(694706): 289337. Bibcode:1933RSPTA.231..289N.
doi:10.1098/rsta.1933.0009. JSTOR 91247.
Statistical classication
For the unsupervised learning approach, see Cluster stances, the explanatory variables are termed features
analysis. (grouped into a feature vector), and the possible cate-
gories to be predicted are classes. There is also some ar-
gument over whether classication methods that do not
In machine learning and statistics, classication is the
problem of identifying to which of a set of categories involve a statistical model can be considered statisti-
cal. Other elds may use dierent terminology: e.g.
(sub-populations) a new observation belongs, on the ba-
sis of a training set of data containing observations (or in community ecology, the term classication normally
refers to cluster analysis, i.e. a type of unsupervised
instances) whose category membership is known. An ex-
ample would be assigning a given email into spam or learning, rather than the supervised learning described in
this article.
non-spam classes or assigning a diagnosis to a given pa-
tient as described by observed characteristics of the pa-
tient (gender, blood pressure, presence or absence of cer-
tain symptoms, etc.). 32.1 Relation to other problems
In the terminology of machine learning,[1] classication is
considered an instance of supervised learning, i.e. learn- Classication and clustering are examples of the more
ing where a training set of correctly identied observa- general problem of pattern recognition, which is the as-
tions is available. The corresponding unsupervised pro- signment of some sort of output value to a given in-
cedure is known as clustering, and involves grouping data put value. Other examples are regression, which assigns
into categories based on some measure of inherent simi- a real-valued output to each input; sequence labeling,
larity or distance. which assigns a class to each member of a sequence of
Often, the individual observations are analyzed into a set values (for example, part of speech tagging, which as-
of quantiable properties, known variously explanatory signs a part of speech to each word in an input sentence);
variables, features, etc. These properties may vari- parsing, which assigns a parse tree to an input sentence,
ously be categorical (e.g. A, B, AB or O, for describing the syntactic structure of the sentence; etc.
blood type), ordinal (e.g. large, medium or small), A common subclass of classication is probabilistic clas-
integer-valued (e.g. the number of occurrences of a part sication. Algorithms of this nature use statistical in-
word in an email) or real-valued (e.g. a measurement ference to nd the best class for a given instance. Un-
of blood pressure). Other classiers work by compar- like other algorithms, which simply output a best class,
ing observations to previous observations by means of a probabilistic algorithms output a probability of the in-
similarity or distance function. stance being a member of each of the possible classes.
An algorithm that implements classication, especially in The best class is normally then selected as the one with
a concrete implementation, is known as a classier. The the highest probability. However, such an algorithm has
term classier sometimes also refers to the mathemat- numerous advantages over non-probabilistic classiers:
ical function, implemented by a classication algorithm,
that maps input data to a category. It can output a condence value associated with its
Terminology across elds is quite varied. In statistics, choice (in general, a classier that can do this is
where classication is often done with logistic regres- known as a condence-weighted classier).
sion or a similar procedure, the properties of observa-
tions are termed explanatory variables (or independent Correspondingly, it can abstain when its condence
variables, regressors, etc.), and the categories to be pre- of choosing any particular output is too low.
dicted are known as outcomes, which are considered to
Because of the probabilities which are generated,
be possible values of the dependent variable. In ma-
probabilistic classiers can be more eectively in-
chine learning, the observations are often known as in-
corporated into larger machine-learning tasks, in a
220
32.5. FEATURE VECTORS 221
way that partially or completely avoids the problem 32.5 Feature vectors
of error propagation.
Most algorithms describe an individual instance whose
category is to be predicted using a feature vector of indi-
vidual, measurable properties of the instance. Each prop-
32.2 Frequentist procedures erty is termed a feature, also known in statistics as an
explanatory variable (or independent variable, although in
Early work on statistical classication was undertaken by general dierent features may or may not be statistically
Fisher,[2][3] in the context of two-group problems, leading independent). Features may variously be binary (male
to Fishers linear discriminant function as the rule for as- or female); categorical (e.g. A, B, AB or O, for
signing a group to a new observation.[4] This early work blood type); ordinal (e.g. large, medium or small);
assumed that data-values within each of the two groups integer-valued (e.g. the number of occurrences of a par-
had a multivariate normal distribution. The extension ticular word in an email); or real-valued (e.g. a measure-
of this same context to more than two-groups has also ment of blood pressure). If the instance is an image, the
been considered with a restriction imposed that the clas- feature values might correspond to the pixels of an image;
sication rule should be linear.[4][5] Later work for the if the instance is a piece of text, the feature values might
multivariate normal distribution allowed the classier to be occurrence frequencies of dierent words. Some al-
be nonlinear:[6] several classication rules can be derived gorithms work only in terms of discrete data and require
based on slight dierent adjustments of the Mahalanobis that real-valued or integer-valued data be discretized into
distance, with a new observation being assigned to the groups (e.g. less than 5, between 5 and 10, or greater than
group whose centre has the lowest adjusted distance from 10).
the observation. The vector space associated with these vectors is often
called the feature space. In order to reduce the dimen-
sionality of the feature space, a number of dimensionality
reduction techniques can be employed.
32.3 Bayesian procedures
32.11 References
[1] Alpaydin, Ethem (2010). Introduction to Machine Learn-
ing. MIT Press. p. 9. ISBN 978-0-262-01243-0.
Binary classication
224
33.2. CONVERTING CONTINUOUS VALUES TO BINARY 225
tios that one can compute from this table, which come 33.2 Converting continuous values
in four complementary pairs (each pair summing to 1).
These are obtained by dividing each of the four numbers
to binary
by the sum of its row or column, yielding eight numbers,
which can be referred to generically in the form true pos- Tests whose results are of continuous values, such as most
itive row ratio or false negative column ratio, though blood values, can articially be made binary by dening a
there are conventional terms. There are thus two pairs cuto value, with test results being designated as positive
of column ratios and two pairs of row ratios, and one can or negative depending on whether the resultant value is
summarize these with four numbers by choosing one ratio higher or lower than the cuto.
from each pair the other four numbers are the comple- However, such conversion causes a loss of information, as
ments. the resultant binary classication does not tell how much
The column ratios are True Positive Rate (TPR, aka above or below the cuto a value is. As a result, when
Sensitivity or recall), with complement the False Neg- converting a continuous value that is close to the cuto to
ative Rate (FNR); and True Negative Rate (TNR, aka a binary one, the resultant positive or negative predictive
Specicity, SPC), with complement False Positive Rate value is generally higher than the predictive value given
(FPR). These are the proportion of the population with directly from the continuous value. In such cases, the
the condition (resp., without the condition) for which the designation of the test of being either positive or negative
test is correct (or, complementarily, for which the test is gives the appearance of an inappropriately high certainty,
incorrect); these are independent of prevalence. while the value is in fact in an interval of uncertainty. For
example, with the urine concentration of hCG as a con-
The row ratios are Positive Predictive Value (PPV, aka tinuous value, a urine pregnancy test that measured 52
precision), with complement the False Discovery Rate mIU/ml of hCG may show as positive with 50 mIU/ml
(FDR); and Negative Predictive Value (NPV), with com- as cuto, but is in fact in an interval of uncertainty, which
plement the False Omission Rate (FOR). These are the may be apparent only by knowing the original continuous
proportion of the population with a given test result for value. On the other hand, a test result very far from the
which the test is correct (or, complementarily, for which cuto generally has a resultant positive or negative pre-
the test is incorrect); these depend on prevalence. dictive value that is lower than the predictive value given
In diagnostic testing, the main ratios used are the true col- from the continuous value. For example, a urine hCG
umn ratios True Positive Rate and True Negative Rate value of 200,000 mIU/ml confers a very high probability
where they are known as sensitivity and specicity. In in- of pregnancy, but conversion to binary values results in
formational retrieval, the main ratios are the true positive that it shows just as positive as the one of 52 mIU/ml.
ratios (row and column) Positive Predictive Value and
True Positive Rate where they are known as precision
and recall. 33.3 See also
One can take ratios of a complementary pair of ratios,
yielding four likelihood ratios (two column ratio of ratios, Examples of Bayesian inference
two row ratio of ratios). This is primarily done for the col-
umn (condition) ratios, yielding likelihood ratios in diag- Classication rule
nostic testing. Taking the ratio of one of these groups of Detection theory
ratios yields a nal ratio, the diagnostic odds ratio (DOR).
This can also be dened directly as (TPTN)/(FPFN) = Kernel methods
(TP/FN)/(FP/TN); this has a useful interpretation as an
odds ratio and is prevalence-independent. Matthews correlation coecient
Tests whose results are of continuous values, such as most blood values, can artificially be made binary by defining a cutoff value, with test results being designated as positive or negative depending on whether the resultant value is higher or lower than the cutoff.

However, such conversion causes a loss of information, as the resultant binary classification does not tell how much above or below the cutoff a value is. As a result, when converting a continuous value that is close to the cutoff to a binary one, the resultant positive or negative predictive value is generally higher than the predictive value given directly from the continuous value. In such cases, the designation of the test as being either positive or negative gives the appearance of an inappropriately high certainty, while the value is in fact in an interval of uncertainty. For example, with the urine concentration of hCG as a continuous value, a urine pregnancy test that measured 52 mIU/ml of hCG may show as "positive" with 50 mIU/ml as cutoff, but the value is in fact in an interval of uncertainty, which may be apparent only by knowing the original continuous value. On the other hand, a test result very far from the cutoff generally has a resultant positive or negative predictive value that is lower than the predictive value given from the continuous value. For example, a urine hCG value of 200,000 mIU/ml confers a very high probability of pregnancy, but conversion to binary values means that it shows just as "positive" as the one of 52 mIU/ml.

33.3 See also

Examples of Bayesian inference
Classification rule
Detection theory
Kernel methods
Matthews correlation coefficient
Multiclass classification
Multi-label classification
One-class classification
Prosecutor's fallacy
Receiver operating characteristic
Thresholding (image processing)
Type I and type II errors
Uncertainty coefficient, aka Proficiency
Qualitative property
Chapter 34
Maximum likelihood
This article is about the statistical technique. For computer data storage, see Partial response maximum likelihood.

In statistics, maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model. When applied to a data set and given a statistical model, maximum-likelihood estimation provides estimates for the model's parameters.

The method of maximum likelihood corresponds to many well-known estimation methods in statistics. For example, one may be interested in the heights of adult female penguins, but be unable to measure the height of every single penguin in a population due to cost or time constraints. Assuming that the heights are normally distributed with some unknown mean and variance, the mean and variance can be estimated with MLE while only knowing the heights of some sample of the overall population. MLE would accomplish this by taking the mean and variance as parameters and finding particular parametric values that make the observed results the most probable given the model.

In general, for a fixed set of data and underlying statistical model, the method of maximum likelihood selects the set of values of the model parameters that maximizes the likelihood function.

34.1 Principles

Suppose there is a sample x_1, x_2, ..., x_n of n independent and identically distributed observations, coming from a distribution with an unknown probability density function f_0. It is however surmised that the function f_0 belongs to a certain family of distributions { f(.|theta), theta in Theta } (where theta is a vector of parameters for this family), called the parametric model, so that f_0 = f(.|theta_0). The value theta_0 is unknown and is referred to as the true value of the parameter vector. It is desirable to find an estimator which would be as close to the true value theta_0 as possible. Either or both the observed variables x_i and the parameter theta can be vectors.

To use the method of maximum likelihood, one first specifies the joint density function for all observations. For an independent and identically distributed sample, this joint density function is

f(x_1, x_2, \ldots, x_n \mid \theta) = f(x_1 \mid \theta)\, f(x_2 \mid \theta) \cdots f(x_n \mid \theta).

Now we look at this function from a different perspective by considering the observed values x_1, x_2, ..., x_n to be fixed "parameters" of this function, whereas theta will be the function's variable and allowed to vary freely; this function will be called the likelihood:

L(\theta \,; x_1, \ldots, x_n) = f(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^n f(x_i \mid \theta).

In practice it is often convenient to work with the logarithm of the likelihood, and in particular with the average log-likelihood

\hat{\ell} = \frac{1}{n} \ln L.

The hat over the letter ell indicates that it is akin to some estimator. Indeed, the average log-likelihood estimates the expected log-likelihood of a single observation in the model.
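As a small numerical illustration of these definitions (a sketch that assumes NumPy and SciPy are available; the exponential model and the sample are invented for the example), one can compute the average log-likelihood of an i.i.d. sample and maximize it numerically, recovering the familiar closed-form estimator.

# For an i.i.d. exponential sample the average log-likelihood is
# l(lam) = ln(lam) - lam * mean(x), and its maximizer is lam_hat = 1 / mean(x).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1000)     # true rate lambda = 0.5

def avg_log_likelihood(lam):
    # (1/n) * sum of ln f(x_i | lam), with f(x | lam) = lam * exp(-lam * x)
    return np.log(lam) - lam * x.mean()

res = minimize_scalar(lambda lam: -avg_log_likelihood(lam),
                      bounds=(1e-6, 10.0), method="bounded")
print("numerical MLE :", res.x)
print("closed form   :", 1.0 / x.mean())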
To establish consistency, the following conditions are sufficient:[2]

1. Identification of the model:

\theta \neq \theta_0 \;\Rightarrow\; f(\cdot \mid \theta) \neq f(\cdot \mid \theta_0).

In other words, different parameter values correspond to different distributions within the model. If this condition did not hold, there would be some value theta_1 such that theta_0 and theta_1 generate an identical distribution of the observable data. Then we wouldn't be able to distinguish between these two parameters even with an infinite amount of data; these parameters would have been observationally equivalent. The identification condition is absolutely necessary for the ML estimator to be consistent. When this condition holds, the limiting likelihood function has a unique global maximum at theta_0.

2. Compactness: the parameter space Theta of the model is compact.

3. Continuity: the function ln f(x|theta) is continuous in theta for almost all values of x.

4. Dominance: there exists D(x) integrable with respect to the distribution f(x|theta_0) such that

|\ln f(x \mid \theta)| < D(x) \quad \text{for all } \theta \in \Theta.

By the uniform law of large numbers, the dominance condition together with continuity establishes the uniform convergence in probability of the log-likelihood:

\sup_{\theta \in \Theta} \left| \hat{\ell}(\theta \mid x) - \ell(\theta) \right| \;\xrightarrow{p}\; 0.

The dominance condition can be employed in the case of i.i.d. observations. In the non-i.i.d. case the uniform convergence in probability can be checked by showing that the sequence of log-likelihoods is stochastically equicontinuous. If one wants to demonstrate that the ML estimator converges to theta_0 almost surely, then a stronger condition of uniform convergence almost surely has to be imposed.
The maximum likelihood estimator is functionally invariant: if theta_hat is the MLE for theta and alpha = g(theta) is a transformation of the parameter, then the MLE for alpha is alpha_hat = g(theta_hat), and it maximizes the so-called profile likelihood. A related refinement of the MLE removes its leading bias; the resulting estimator is unbiased up to terms of order 1/n, and is called the bias-corrected maximum likelihood estimator.
[Figure: the likelihood function for the proportion parameter p of a binomial process (n = 10).]

One way to maximize this function is by differentiating with respect to p and setting the derivative to zero.

For a continuous example, suppose the observations are drawn from a normal distribution. The joint density of the sample is

f(x_1, \ldots, x_n \mid \mu, \sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\!\left(-\frac{\sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2}{2\sigma^2}\right),

where x-bar is the sample mean. This family of distributions has two parameters, theta = (mu, sigma); so we maximize the likelihood, L(mu, sigma) = f(x_1, ..., x_n | mu, sigma), over both parameters simultaneously, or if possible, individually.

Since the logarithm is a continuous strictly increasing function over the range of the likelihood, the values which maximize the likelihood will also maximize its logarithm. The log-likelihood can be maximized over mu by setting its derivative to zero:
34.4. NON-INDEPENDENT VARIABLES 233
0 = \frac{\partial}{\partial \mu} \log L(\mu, \sigma) = \frac{2n(\bar{x} - \mu)}{2\sigma^2}.

This is solved by

\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i.

This is indeed the maximum of the function, since it is the only turning point in mu and the second derivative is strictly less than zero. Its expectation value is equal to the parameter mu of the given distribution, E[mu_hat] = mu.

Maximizing over sigma in the same way and inserting the estimate mu = mu_hat, we obtain the variance estimator sigma_hat^2 = (1/n) * sum_{i=1}^n (x_i - x-bar)^2, whose expectation is

E[\hat{\sigma}^2] = \frac{n-1}{n}\,\sigma^2.

This means that the estimator sigma_hat^2 is biased. However, sigma_hat^2 is consistent. Formally we say that the maximum likelihood estimator for theta = (mu, sigma^2) is

\hat{\theta} = \left(\hat{\mu}, \hat{\sigma}^2\right).

In this case the MLEs could be obtained individually. In general this may not be the case, and the MLEs would have to be obtained simultaneously. The normal log-likelihood at its maximum takes a particularly simple form.
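A minimal sketch of the closed-form normal MLEs just derived, together with a Monte Carlo check of the bias factor (n-1)/n (assumes NumPy; the true parameter values are arbitrary illustrative choices):

# Closed-form normal MLEs and an empirical check that E[sigma_hat^2] ~ ((n-1)/n) * sigma^2.
import numpy as np

rng = np.random.default_rng(1)
mu_true, sigma_true, n = 3.0, 2.0, 10

def mle_normal(x):
    mu_hat = x.mean()                        # mu_hat = sample mean
    var_hat = np.mean((x - mu_hat) ** 2)     # sigma_hat^2 = (1/n) * sum (x_i - x_bar)^2
    return mu_hat, var_hat

var_hats = [mle_normal(rng.normal(mu_true, sigma_true, n))[1] for _ in range(20000)]
print("average sigma_hat^2     :", np.mean(var_hats))
print("(n-1)/n * true variance :", (n - 1) / n * sigma_true ** 2)   # ~3.6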
34.4 Non-independent variables

It may be the case that variables are correlated, that is, not independent. Two random variables X and Y are independent only if their joint probability density function is the product of the individual density functions, that is,

f(x, y) = f(x)\,f(y).

For example, the multivariate normal distribution has joint density

f(x_1, \ldots, x_n) = \frac{1}{(2\pi)^{n/2} \sqrt{\det(\Sigma)}} \exp\!\left(-\tfrac{1}{2}\,[x_1 - \mu_1, \ldots, x_n - \mu_n]\, \Sigma^{-1}\, [x_1 - \mu_1, \ldots, x_n - \mu_n]^{\mathsf T}\right).

In the two variable case, the joint probability density function is given by

f(x, y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1 - \rho^2}} \exp\!\left[-\frac{1}{2(1 - \rho^2)} \left(\frac{(x - \mu_x)^2}{\sigma_x^2} - \frac{2\rho (x - \mu_x)(y - \mu_y)}{\sigma_x \sigma_y} + \frac{(y - \mu_y)^2}{\sigma_y^2}\right)\right].

In this and other cases where a joint density function exists, the likelihood function is defined as above, in the section Principles, using this density.

Maximum-likelihood estimation is used in many application areas, including:

communication systems;
psychometrics;
econometrics;
time-delay of arrival (TDOA) in acoustic or electromagnetic detection;
data modeling in nuclear and particle physics;
magnetic resonance imaging;[9][10]
computational phylogenetics;
origin/destination and path-choice modeling in transport networks.

The convergence of MLEs within filtering and smoothing EM algorithms is studied in.[6][7][8]
34.8 See also

Other estimation methods:

Quasi-maximum likelihood estimator, an MLE estimator that is misspecified, but still consistent.
Restricted maximum likelihood, a variation using a likelihood function calculated from a transformed set of data.

Related concepts:

The BHHH algorithm is a non-linear optimization algorithm that is popular for maximum likelihood estimations.
Extremum estimator, a more general class of estimators to which MLE belongs.
Fisher information, information matrix, its relationship to the covariance matrix of ML estimates.
Likelihood function, a description of what likelihood functions are.
Mean squared error, a measure of how 'good' an estimator of a distributional parameter is (be it the maximum likelihood estimator or some other estimator).
The Rao-Blackwell theorem, a result which yields a process for finding the best possible unbiased estimator (in the sense of having minimal mean squared error). The MLE is often a good starting place for the process.
Sufficient statistic, a function of the data through which the MLE (if it exists and is unique) will depend on the data.

34.9 References
[1] Pfanzagl (1994, p. 206)
[2] Newey & McFadden (1994, Theorem 2.5.)
[3] Lehmann & Casella (1998)
[4] Newey & McFadden (1994, Theorem 3.3.)
[5] Cox & Snell (1968, formula (20))
[6] Einicke, G.A.; Malos, J.T.; Reid, D.C.; Hainsworth, D.W. (January 2009). "Riccati Equation and EM Algorithm Convergence for Inertial Navigation Alignment". IEEE Trans. Signal Processing 57 (1): 370-375. doi:10.1109/TSP.2008.2007090.
[7] Einicke, G.A.; Falco, G.; Malos, J.T. (May 2010). "EM Algorithm State Matrix Estimation for Navigation". IEEE Signal Processing Letters 17 (5): 437-440. doi:10.1109/LSP.2010.2043151.
[8] Einicke, G.A.; Falco, G.; Dunn, M.T.; Reid, D.C. (May 2012). "Iterative Smoother-Based Variance Estimation". IEEE Signal Processing Letters 19 (5): 275-278. doi:10.1109/LSP.2012.2190278.
[9] Sijbers, Jan; den Dekker, A.J. (2004). "Maximum Likelihood estimation of signal amplitude and noise variance from MR data". Magnetic Resonance in Medicine 51 (3): 586-594. doi:10.1002/mrm.10728. PMID 15004801.
[10] Sijbers, Jan; den Dekker, A.J.; Scheunders, P.; Van Dyck, D. (1998). "Maximum Likelihood estimation of Rician distribution parameters". IEEE Transactions on Medical Imaging 17 (3): 357-361. doi:10.1109/42.712125. PMID 9735899.
[11] Pfanzagl (1994)
[12] Edgeworth (September 1908) and Edgeworth (December 1908)
[13] Savage (1976), Pratt (1976), Stigler (1978, 1986, 1999), Hald (1998, 1999), and Aldrich (1997)
34.10 Further reading

Aldrich, John (1997). "R. A. Fisher and the making of maximum likelihood 1912-1922". Statistical Science 12 (3): 162-176. doi:10.1214/ss/1030037906. MR 1617519.
Andersen, Erling B. (1970). "Asymptotic Properties of Conditional Maximum Likelihood Estimators". Journal of the Royal Statistical Society B 32: 283-301.
Andersen, Erling B. (1980). Discrete Statistical Models with Social Science Applications. North Holland.
Basu, Debabrata (1988). Statistical Information and Likelihood: A Collection of Critical Essays by Dr. D. Basu; in Ghosh, Jayanta K., editor; Lecture Notes in Statistics, Volume 45, Springer-Verlag.
Cox, David R.; Snell, E. Joyce (1968). "A general definition of residuals". Journal of the Royal Statistical Society, Series B: 248-275. JSTOR 2984505.
Edgeworth, Francis Y. (Sep 1908). "On the probable errors of frequency-constants". Journal of the Royal Statistical Society 71 (3): 499-512. doi:10.2307/2339293. JSTOR 2339293.
Edgeworth, Francis Y. (Dec 1908). "On the probable errors of frequency-constants". Journal of the Royal Statistical Society 71 (4): 651-678. doi:10.2307/2339378. JSTOR 2339378.
Einicke, G.A. (2012). Smoothing, Filtering and Prediction: Estimating the Past, Present and Future. Rijeka, Croatia: Intech. ISBN 978-953-307-752-9.
Ferguson, Thomas S. (1982). "An inconsistent maximum likelihood estimate". Journal of the American Statistical Association 77 (380): 831-834. doi:10.1080/01621459.1982.10477894. JSTOR 2287314.
Ferguson, Thomas S. (1996). A course in large sample theory. Chapman & Hall. ISBN 0-412-04371-8.
Hald, Anders (1998). A history of mathematical statistics from 1750 to 1930. New York, NY: Wiley. ISBN 0-471-17912-4.
Hald, Anders (1999). "On the history of maximum likelihood in relation to inverse probability and least squares". Statistical Science 14 (2): 214-222. doi:10.1214/ss/1009212248. JSTOR 2676741.
Stigler, Stephen M. (1986). The history of statistics: the measurement of uncertainty before 1900. Harvard University Press. ISBN 0-674-40340-1.
Stigler, Stephen M. (1999). Statistics on the table: the history of statistical concepts and methods. Harvard University Press. ISBN 0-674-83601-4.
van der Vaart, Aad W. (1998). Asymptotic Statistics. ISBN 0-521-78450-6.
Chapter 35

Linear classifier
The second set of methods includes discriminative models, which attempt to maximize the quality of the output on a training set. Additional terms in the training cost function can easily perform regularization of the final model. Examples of discriminative training of linear classifiers include:

Logistic regression: maximum likelihood estimation of w assuming that the observed training set was generated by a binomial model that depends on the output of the classifier.
Perceptron: an algorithm that attempts to fix all errors encountered in the training set.
Support vector machine: an algorithm that maximizes the margin between the decision hyperplane and the examples in the training set.

Note: Despite its name, LDA does not belong to the class of discriminative models in this taxonomy. However, its name makes sense when we compare LDA to the other main linear dimensionality reduction algorithm: principal components analysis (PCA). LDA is a supervised learning algorithm that utilizes the labels of the data, while PCA is an unsupervised learning algorithm that ignores the labels. To summarize, the name is a historical artifact.[4]:117

Discriminative training often yields higher accuracy than modeling the conditional density functions. However, handling missing data is often easier with conditional density models.

All of the linear classifier algorithms listed above can be converted into non-linear algorithms operating on a different input space phi(x), using the kernel trick.

Discriminative training of linear classifiers usually proceeds in a supervised way, by means of an optimization algorithm that is given a training set with desired outputs and a loss function that measures the discrepancy between the classifier's outputs and the desired outputs. Thus, the learning algorithm solves an optimization problem of the form[1]

\arg\min_{w} \; R(w) + C \sum_{i=1}^{N} L(y_i, w^{\mathsf T} x_i)

where

R(w) is a regularization term that prevents the parameters from getting too large (causing overfitting), and
C is some constant (set by the user of the learning algorithm) that weighs the regularization against the loss.

Popular loss functions include the hinge loss (for linear SVMs) and the log loss (for linear logistic regression). If the regularization function R is convex, then the above is a convex problem.[1] Many algorithms exist for solving such problems; popular ones for linear classification include (stochastic) gradient descent, L-BFGS, coordinate descent and Newton methods.
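The following Python sketch illustrates this generic recipe with the hinge loss, R(w) = ||w||^2, and plain stochastic subgradient descent. It is only an illustration under those assumptions, not a reference implementation of any particular library, and the toy data are invented.

# Minimize ||w||^2 + C * sum_i hinge(y_i, w.x_i) by stochastic subgradient descent.
import numpy as np

def sgd_linear_classifier(X, y, C=10.0, epochs=50, lr=0.01, seed=0):
    """X: (n, d) feature matrix; y: (n,) labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * X[i] @ w
            grad = 2 * w / n                # per-sample share of the regularizer
            if margin < 1:                  # hinge loss active: L = 1 - y_i * w.x_i
                grad -= C * y[i] * X[i]
            w -= lr * grad
    return w

# toy usage: two Gaussian blobs; the bias is handled by a constant feature
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
X = np.hstack([X, np.ones((100, 1))])
y = np.array([-1] * 50 + [+1] * 50)
w = sgd_linear_classifier(X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))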
35.3 See also

Linear regression
Winnow (algorithm)
Quadratic classifier
Support vector machines

35.4 Notes

[1] Guo-Xun Yuan; Chia-Hua Ho; Chih-Jen Lin (2012). "Recent Advances of Large-Scale Linear Classification". Proc. IEEE 100 (9).
[2] T. Mitchell, Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression. Draft Version, 2005.
[4] R.O. Duda, P.E. Hart, D.G. Stork, "Pattern Classification", Wiley, (2001). ISBN 0-471-05669-3

See also:

1. Y. Yang, X. Liu, "A re-examination of text categorization", Proc. ACM SIGIR Conference, pp. 42-49, (1999).
2. R. Herbrich, "Learning Kernel Classifiers: Theory and Algorithms", MIT Press, (2001). ISBN 0-262-08306-X
Chapter 36

Logistic regression
In statistics, logistic regression, or logit regression, or logit model[1] is a direct probability model that was developed by statistician D. R. Cox in 1958,[2][3] although much work was done in the single independent variable case almost two decades earlier. The binary logistic model is used to predict a binary response based on one or more predictor variables (features). That is, it is used in estimating the parameters of a qualitative response model. The probabilities describing the possible outcomes of a single trial are modeled, as a function of the explanatory (predictor) variables, using a logistic function. Frequently (and hereafter in this article) "logistic regression" is used to refer specifically to the problem in which the dependent variable is binary, that is, the number of available categories is two, while problems with more than two categories are referred to as multinomial logistic regression, or, if the multiple categories are ordered, as ordinal logistic regression.[3]

Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables, which are usually (but not necessarily) continuous, by estimating probabilities. Thus, it treats the same set of problems as does probit regression using similar techniques; the first assumes a logistic function and the second a standard normal distribution function.

Logistic regression can be seen as a special case of generalized linear model and thus analogous to linear regression. The model of logistic regression, however, is based on quite different assumptions (about the relationship between dependent and independent variables) from those of linear regression. In particular the key differences of these two models can be seen in the following two features of logistic regression. First, the conditional distribution p(y | x) is a Bernoulli distribution rather than a Gaussian distribution, because the dependent variable is binary. Second, the estimated probabilities are restricted to [0,1] through the logistic distribution function because logistic regression predicts the probability of the instance being positive.

Logistic regression is an alternative to Fisher's 1936 classification method, linear discriminant analysis.[4] If the assumptions of linear discriminant analysis hold, application of Bayes' rule to reverse the conditioning results in the logistic model, so if linear discriminant assumptions are true, logistic regression assumptions must hold. The converse is not true, so the logistic model has fewer assumptions than discriminant analysis and makes no assumption on the distribution of the independent variables.

36.1 Fields and example applications

Logistic regression is used widely in many fields, including the medical and social sciences. For example, the Trauma and Injury Severity Score (TRISS), which is widely used to predict mortality in injured patients, was originally developed by Boyd et al. using logistic regression.[5] Many other medical scales used to assess severity of a patient have been developed using logistic regression.[6][7][8][9] Logistic regression may be used to predict whether a patient has a given disease (e.g. diabetes; coronary heart disease), based on observed characteristics of the patient (age, sex, body mass index, results of various blood tests, etc.; age, blood cholesterol level, systolic blood pressure, relative weight, blood hemoglobin level, smoking (at 3 levels), and abnormal electrocardiogram).[1][10] Another example might be to predict whether an American voter will vote Democratic or Republican, based on age, income, sex, race, state of residence, votes in previous elections, etc.[11] The technique can also be used in engineering, especially for predicting the probability of failure of a given process, system or product.[12][13] It is also used in marketing applications such as prediction of a customer's propensity to purchase a product or halt a subscription, etc. In economics it can be used to predict the likelihood of a person's choosing to be in the labor force, and a business application would be to predict the likelihood of a homeowner defaulting on a mortgage. Conditional random fields, an extension of logistic regression to sequential data, are used in natural language processing.
36.2 Basics
36.3.2 Definition of the inverse of the logistic function

We can now define the inverse of the logistic function, g, the logit (log odds):

g(F(x)) = \ln\frac{F(x)}{1 - F(x)} = \beta_0 + \beta_1 x,

and equivalently:

\frac{F(x)}{1 - F(x)} = e^{\beta_0 + \beta_1 x}.

36.3.3 Interpretation of these terms

In the above equations, the terms are as follows:

g(.) refers to the logit function. The equation for g(F(x)) illustrates that the logit (i.e., log-odds or natural logarithm of the odds) is equivalent to the linear regression expression.
ln denotes the natural logarithm.
F(x) is the probability that the dependent variable equals a case, given some linear combination x of the predictors. The formula for F(x) illustrates that the probability of the dependent variable equaling a case is equal to the value of the logistic function of the linear regression expression. This is important in that it shows that the value of the linear regression expression can vary from negative to positive infinity and yet, after transformation, the resulting expression for the probability F(x) ranges between 0 and 1.
beta_0 is the intercept from the linear regression equation (the value of the criterion when the predictor is equal to zero).
beta_1 x is the regression coefficient multiplied by some value of the predictor.
base e denotes the exponential function.

36.3.4 Definition of the odds

The odds of the dependent variable equaling a case (given some linear combination x of the predictors) is equivalent to the exponential function of the linear regression expression. This illustrates how the logit serves as a link function between the probability and the linear regression expression. Given that the logit ranges between negative and positive infinity, it provides an adequate criterion upon which to conduct linear regression, and the logit is easily converted back into the odds.[14] So we define the odds of the dependent variable equaling a case (given some linear combination x of the predictors) as follows:

\text{odds} = e^{\beta_0 + \beta_1 x}.

36.3.5 Definition of the odds ratio

The odds ratio can be defined as:

\mathrm{OR} = \frac{\text{odds}(x+1)}{\text{odds}(x)} = \frac{F(x+1)/(1 - F(x+1))}{F(x)/(1 - F(x))} = \frac{e^{\beta_0 + \beta_1 (x+1)}}{e^{\beta_0 + \beta_1 x}} = e^{\beta_1},

or, for a binary variable, with F(0) in place of F(x) and F(1) in place of F(x+1). This exponential relationship provides an interpretation for beta_1: the odds multiply by e^{beta_1} for every 1-unit increase in x.[15]

36.3.6 Multiple explanatory variables

If there are multiple explanatory variables, the above expression beta_0 + beta_1 x can be revised to beta_0 + beta_1 x_1 + beta_2 x_2 + ... + beta_m x_m. Then when this is used in the equation relating the logged odds of a success to the values of the predictors, the linear regression will be a multiple regression with m explanators; the parameters beta_j for all j = 0, 1, 2, ..., m are all estimated.
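A tiny numerical check of these relations (the coefficient values are hypothetical, not from the article): the odds at x equal e^{beta_0 + beta_1 x}, and the ratio of the odds at x+1 to the odds at x equals e^{beta_1} regardless of x.

import math

b0, b1 = -3.0, 0.8           # hypothetical intercept and slope

def prob(x):                 # F(x) = 1 / (1 + exp(-(b0 + b1*x)))
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def odds(x):                 # F(x) / (1 - F(x)) = exp(b0 + b1*x)
    return prob(x) / (1.0 - prob(x))

for x in (0.0, 1.0, 5.0):
    print(f"x={x}: odds={odds(x):.4f}, odds(x+1)/odds(x)={odds(x+1)/odds(x):.4f}")
print("exp(b1) =", math.exp(b1))    # the common odds ratio, ~2.2255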
36.4 Model fitting

36.4.1 Estimation

Because the model can be expressed as a generalized linear model (see below), for 0 < p < 1, ordinary least squares can suffice, with R-squared as the measure of goodness of fit in the fitting space. When p = 0 or 1, more complex methods are required.

Maximum likelihood estimation

The regression coefficients are usually estimated using maximum likelihood estimation.[16] Unlike linear regression with normally distributed residuals, it is not possible to find a closed-form expression for the coefficient values that maximize the likelihood function, so an iterative process must be used instead; for example Newton's method. This process begins with a tentative solution, revises it slightly to see if it can be improved, and repeats this revision until improvement is minute, at which point the process is said to have converged.[17]
In some instances the model may not reach convergence. Nonconvergence of a model indicates that the coefficients are not meaningful because the iterative process was unable to find appropriate solutions. A failure to converge may occur for a number of reasons: having a large ratio of predictors to cases, multicollinearity, sparseness, or complete separation.

With grouped data, an alternative is to estimate a linear model in which the dependent variable is the logit of the proportion: that is, the log of the ratio of the fraction in one group to the fraction in the other group.[19]:pp.68-69
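The following is a minimal sketch of this iterative maximum-likelihood fitting using Newton's method (equivalently, iteratively reweighted least squares). It assumes NumPy, and the synthetic data and function names are illustrative rather than part of the article.

import numpy as np

def fit_logistic_newton(X, y, tol=1e-8, max_iter=25):
    """X: (n, m+1) design matrix incl. a column of ones; y: (n,) values in {0, 1}."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))     # current predicted probabilities
        W = p * (1.0 - p)                       # Bernoulli variances (IRLS weights)
        grad = X.T @ (y - p)                    # gradient of the log-likelihood
        hess = X.T @ (X * W[:, None])           # X' W X
        step = np.linalg.solve(hess, grad)      # Newton step
        beta += step
        if np.max(np.abs(step)) < tol:          # improvement is minute: converged
            break
    return beta

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([-0.5, 1.5])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)
print("estimated coefficients:", fit_logistic_newton(X, y))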
The null model provides a baseline upon which to compare predictor models. Given that deviance is a measure of the difference between a given model and the saturated model, smaller values indicate better fit. Thus, to assess the contribution of a predictor or set of predictors, one can subtract the model deviance from the null deviance and assess the difference on a chi-square distribution with degrees of freedom[14] equal to the difference in the number of parameters estimated.

Let

D_{\text{null}} = -2 \ln \frac{\text{likelihood of the null model}}{\text{likelihood of the saturated model}},

D_{\text{fitted}} = -2 \ln \frac{\text{likelihood of the fitted model}}{\text{likelihood of the saturated model}}.

Then

D_{\text{null}} - D_{\text{fitted}} = -2 \left( \ln \frac{\text{likelihood of the null model}}{\text{likelihood of the saturated model}} - \ln \frac{\text{likelihood of the fitted model}}{\text{likelihood of the saturated model}} \right) = -2 \ln \frac{\text{likelihood of the null model}}{\text{likelihood of the fitted model}}.

If the model deviance is significantly smaller than the null deviance then one can conclude that the predictor or set of predictors significantly improved model fit. This is analogous to the F-test used in linear regression analysis to assess the significance of prediction.[20]

Pseudo-R2s

In linear regression the squared multiple correlation, R2, is used to assess goodness of fit as it represents the proportion of variance in the criterion that is explained by the predictors.[20] In logistic regression analysis, there is no agreed upon analogous measure, but there are several competing measures, each with limitations.[20] Three of the most commonly used indices are examined on this page, beginning with the likelihood ratio R2, R2_L, the proportional reduction in deviance (D_null - D_fitted)/D_null.[20] This index does not necessarily increase as the odds ratio increases and does not necessarily decrease as the odds ratio decreases.

The Cox and Snell R2 is an alternative index of goodness of fit related to the R2 value from linear regression. The Cox and Snell index is problematic as its maximum value is .75, when the variance is at its maximum (.25). The Nagelkerke R2 provides a correction to the Cox and Snell R2 so that the maximum value is equal to one. Nevertheless, the Cox and Snell and likelihood ratio R2s show greater agreement with each other than either does with the Nagelkerke R2.[20] Of course, this might not be the case for values exceeding .75, as the Cox and Snell index is capped at this value. The likelihood ratio R2 is often preferred to the alternatives as it is most analogous to R2 in linear regression, is independent of the base rate (both the Cox and Snell and Nagelkerke R2s increase as the proportion of cases increases from 0 to .5), and varies between 0 and 1.

A word of caution is in order when interpreting pseudo-R2 statistics. The reason these indices of fit are referred to as pseudo R2 is that they do not represent the proportionate reduction in error as the R2 in linear regression does.[20] Linear regression assumes homoscedasticity, that the error variance is the same for all values of the criterion. Logistic regression will always be heteroscedastic: the error variances differ for each value of the predicted score. For each value of the predicted score there would be a different value of the proportionate reduction in error. Therefore, it is inappropriate to think of R2 as a proportionate reduction in error in a universal sense in logistic regression.[20]
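For ungrouped binary data the saturated model has likelihood 1, so the deviances reduce to -2 times the log-likelihoods. The sketch below (illustrative values, assumes NumPy) computes the deviance difference and the likelihood-ratio R2 described above.

import numpy as np

def deviance(y, p):
    # -2 * sum of Bernoulli log-likelihoods
    return -2.0 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 0], dtype=float)
p_fitted = np.array([0.1, 0.2, 0.3, 0.8, 0.2, 0.7, 0.9, 0.6, 0.8, 0.4])
p_null = np.full_like(y, y.mean())            # null model: intercept only

D_null, D_fitted = deviance(y, p_null), deviance(y, p_fitted)
print("D_null - D_fitted    =", D_null - D_fitted)    # compared against a chi-square
print("likelihood-ratio R^2 =", (D_null - D_fitted) / D_null)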
Hosmer-Lemeshow test

The Hosmer-Lemeshow test uses a test statistic that asymptotically follows a chi-square distribution to assess whether or not the observed event rates match expected event rates in subgroups of the model population.

Evaluating binary classification performance
36.5 Coefficients

The significance of an individual regression coefficient B_j is commonly assessed with the Wald statistic, the ratio of the squared coefficient to its squared standard error:

W_j = \frac{B_j^2}{SE^2_{B_j}}.
The outcome variable Y_i is binary, taking the value 0 (e.g. "failure") or 1 (e.g. "success"). The goal of logistic regression is to explain the relationship between the explanatory variables and the outcome, so that an outcome can be predicted for a new set of explanatory variables. Some examples:

The observed outcomes are the votes of a set of people in an election, and the explanatory variables are the demographic characteristics of each person (e.g. sex, race, age, income, etc.). In such a case, one of the two outcomes is arbitrarily coded as 1, and the other as 0.

As in linear regression, the outcome variables Y_i are assumed to depend on the explanatory variables x_{1,i} ... x_{m,i}.

Explanatory variables

As shown in the above examples, the explanatory variables may be of any type: real-valued, binary, categorical, etc. The main distinction is between continuous variables (such as income, age and blood pressure) and discrete variables (such as sex or race). Discrete variables referring to more than two possible choices are typically coded using dummy variables (or indicator variables), that is, separate explanatory variables taking the value 0 or 1 are created for each possible value of the discrete variable, with a 1 meaning "variable does have the given value" and a 0 meaning "variable does not have that value". For example, a four-way discrete variable of blood type with the possible values "A, B, AB, O" can be converted to four separate two-way dummy variables, "is-A, is-B, is-AB, is-O", where only one of them has the value 1 and all the rest have the value 0. This allows for separate regression coefficients to be matched for each possible value of the discrete variable. (In a case like this, only three of the four dummy variables are independent of each other, in the sense that once the values of three of the variables are known, the fourth is automatically determined. Thus, it is necessary to encode only three of the four possibilities as dummy variables. This also means that when all four possibilities are encoded, the overall model is not identifiable in the absence of additional constraints such as a regularization constraint. Theoretically, this could cause problems, but in reality almost all logistic regression models are fitted with regularization constraints.)

Outcome variables

Formally, the outcomes Y_i are described as being Bernoulli-distributed data, where each outcome is determined by an unobserved probability p_i that is specific to the outcome at hand, but related to the explanatory variables. This can be expressed in any of the following equivalent forms:

Y_i \mid x_{1,i}, \ldots, x_{m,i} \;\sim\; \operatorname{Bernoulli}(p_i)
\operatorname{E}[Y_i \mid x_{1,i}, \ldots, x_{m,i}] = p_i
\Pr(Y_i = y \mid x_{1,i}, \ldots, x_{m,i}) = \begin{cases} p_i & \text{if } y = 1 \\ 1 - p_i & \text{if } y = 0 \end{cases}
\Pr(Y_i = y \mid x_{1,i}, \ldots, x_{m,i}) = p_i^{\,y} (1 - p_i)^{1-y}

The meanings of these four lines are:

1. The first line expresses the probability distribution of each Y_i: conditioned on the explanatory variables, it follows a Bernoulli distribution with parameter p_i, the probability of the outcome of 1 for trial i. As noted above, each separate trial has its own probability of success, just as each trial has its own explanatory variables. The probability of success p_i is not observed, only the outcome of an individual Bernoulli trial using that probability.

2. The second line expresses the fact that the expected value of each Y_i is equal to the probability of success p_i, which is a general property of the Bernoulli distribution. In other words, if we run a large number of Bernoulli trials using the same probability of success p_i, then take the average of all the 1 and 0 outcomes, then the result would be close to p_i. This is because doing an average this way simply computes the proportion of successes seen, which we expect to converge to the underlying probability of success.

3. The third line writes out the probability mass function of the Bernoulli distribution, specifying the probability of seeing each of the two possible outcomes.

4. The fourth line is another way of writing the probability mass function, which avoids having to write separate cases and is more convenient for certain types of calculations. This relies on the fact that Y_i can take only the value 0 or 1. In each case, one of the exponents will be 1, "choosing" the value under it, while the other is 0, "canceling out" the value under it. Hence, the outcome is either p_i or 1 - p_i, as in the previous line.

Linear predictor function

The basic idea of logistic regression is to use the mechanism already developed for linear regression by modeling the probability p_i using a linear predictor function, i.e. a linear combination of the explanatory variables and a set of regression coefficients that are specific to the model at hand but the same for all trials.
The linear predictor function f(i) for a particular data point i is written as:

f(i) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_m x_{m,i},

where beta_0, ..., beta_m are regression coefficients indicating the relative effect of a particular explanatory variable on the outcome.

The model is usually put into a more compact form as follows:

The regression coefficients beta_0, beta_1, ..., beta_m are grouped into a single vector beta of size m + 1.
For each data point i, an additional explanatory pseudo-variable x_{0,i} is added, with a fixed value of 1, corresponding to the intercept coefficient beta_0.
The resulting explanatory variables x_{0,i}, x_{1,i}, ..., x_{m,i} are then grouped into a single vector X_i of size m + 1.

This makes it possible to write the linear predictor function as follows:

f(i) = \boldsymbol\beta \cdot X_i,

using the notation for a dot product between two vectors.

36.6.2 As a generalized linear model

The particular model used by logistic regression, which distinguishes it from standard linear regression and from other types of regression analysis used for binary-valued outcomes, is the way the probability of a particular outcome is linked to the linear predictor function:

\operatorname{logit}(\operatorname{E}[Y_i \mid X_i]) = \operatorname{logit}(p_i) = \ln\frac{p_i}{1 - p_i} = \boldsymbol\beta \cdot X_i.

This formulation expresses logistic regression as a type of generalized linear model, which predicts variables with various types of probability distributions by fitting a linear predictor function of the above form to some sort of arbitrary transformation of the expected value of the variable.

The intuition for transforming using the logit function (the natural log of the odds) was explained above. It also has the practical effect of converting the probability (which is bounded to be between 0 and 1) to a variable that ranges over the whole real line, thereby matching the potential range of the linear prediction function on the right side of the equation.

Note that both the probabilities p_i and the regression coefficients are unobserved, and the means of determining them is not part of the model itself. They are typically determined by some sort of optimization procedure, e.g. maximum likelihood estimation, that finds values that best fit the observed data (i.e. that give the most accurate predictions for the data already observed), usually subject to regularization conditions that seek to exclude unlikely values, e.g. extremely large values for any of the regression coefficients. The use of a regularization condition is equivalent to doing maximum a posteriori (MAP) estimation, an extension of maximum likelihood. (Regularization is most commonly done using a squared regularizing function, which is equivalent to placing a zero-mean Gaussian prior distribution on the coefficients, but other regularizers are also possible.) Whether or not regularization is used, it is usually not possible to find a closed-form solution; instead, an iterative numerical method must be used, such as iteratively reweighted least squares (IRLS) or, more commonly these days, a quasi-Newton method such as the L-BFGS method.

The interpretation of the beta_j parameter estimates is as the additive effect on the log of the odds for a unit change in the jth explanatory variable. In the case of a dichotomous explanatory variable, for instance gender, e^beta is the estimate of the odds of having the outcome for, say, males compared with females.

An equivalent formula uses the inverse of the logit function, which is the logistic function, i.e.:

\operatorname{E}[Y_i \mid X_i] = p_i = \operatorname{logit}^{-1}(\boldsymbol\beta \cdot X_i) = \frac{1}{1 + e^{-\boldsymbol\beta \cdot X_i}}.

The above model has an equivalent formulation as a latent-variable model. This formulation is common in the theory of discrete choice models, and makes it easier to extend to certain more complicated models with multiple, correlated choices, as well as to compare logistic regression to the closely related probit model.
(Note that this predicts that the irrelevancy of the scale parameter may not carry over into more complex models where more than two choices are available.)

It turns out that this formulation is exactly equivalent to the preceding one, phrased in terms of the generalized linear model and without any latent variables. This can be shown using the fact that the cumulative distribution function (CDF) of the standard logistic distribution is the logistic function, which is the inverse of the logit function, i.e.

\Pr(\varepsilon < x) = \operatorname{logit}^{-1}(x).

In a two-way latent-variable formulation, a separate latent variable is introduced for each of the two possible outcomes, each written as a linear predictor plus an error term following a standard type-1 extreme value distribution, and the observed outcome is the choice whose latent variable is larger:

Y_i = \begin{cases} 1 & \text{if } Y_i^{1\ast} > Y_i^{0\ast}, \\ 0 & \text{otherwise.} \end{cases}

This model has a separate latent variable and a separate set of regression coefficients for each possible outcome of the dependent variable. The reason for this separation is that it makes it easy to extend logistic regression to multi-outcome categorical variables, as in the multinomial logit model. In such a model, it is natural to model each possible outcome using a different set of regression coefficients. It is also possible to motivate each of the separate latent variables as the theoretical utility associated with making the associated choice, and thus motivate logistic regression in terms of utility theory.
(In terms of utility theory, a rational actor always chooses the choice with the greatest associated utility.) This is the approach taken by economists when formulating discrete choice models, because it both provides a theoretically strong foundation and facilitates intuitions about the model, which in turn makes it easy to consider various sorts of extensions. (See the example below.)

The choice of the type-1 extreme value distribution seems fairly arbitrary, but it makes the mathematics work out, and it may be possible to justify its use through rational choice theory.

It turns out that this model is equivalent to the previous model, although this seems non-obvious, since there are now two sets of regression coefficients and error variables, and the error variables have a different distribution. In fact, this model reduces directly to the previous one with the following substitutions:

\boldsymbol\beta = \boldsymbol\beta_1 - \boldsymbol\beta_0, \qquad \varepsilon = \varepsilon_1 - \varepsilon_0.

An intuition for this comes from the fact that, since we choose based on the maximum of two values, only their difference matters, not the exact values; this effectively removes one degree of freedom. Another critical fact is that the difference of two type-1 extreme-value-distributed variables is a logistic distribution, i.e. epsilon = epsilon_1 - epsilon_0 ~ Logistic(0, 1), so the two formulations lead to the same model.

In yet another formulation, instead of writing the logit of the probabilities p_i as a linear predictor, we separate the linear predictor into two, one for each of the two outcomes:

\ln \Pr(Y_i = 0) = \boldsymbol\beta_0 \cdot X_i - \ln Z,
\ln \Pr(Y_i = 1) = \boldsymbol\beta_1 \cdot X_i - \ln Z.

Note that two separate sets of regression coefficients have been introduced, just as in the two-way latent variable model, and the two equations are a form that writes the logarithm of the associated probability as a linear predictor, with an extra term -ln Z at the end. This term, as it turns out, serves as the normalizing factor ensuring that the result is a distribution. This can be seen by exponentiating both sides:

\Pr(Y_i = c) = \operatorname{softmax}(c, \boldsymbol\beta_0 \cdot X_i, \boldsymbol\beta_1 \cdot X_i, \ldots).

In order to prove that this is equivalent to the previous model, note that the above model is overspecified, in that Pr(Y_i = 0) and Pr(Y_i = 1) cannot be independently specified: rather Pr(Y_i = 0) + Pr(Y_i = 1) = 1, so knowing one automatically determines the other. As a result, the model is nonidentifiable, in that multiple combinations of beta_0 and beta_1 will produce the same probabilities for all possible explanatory variables. In fact, it can be seen that adding any constant vector C to both of them will produce the same probabilities:

\Pr(Y_i = 1) = \frac{e^{(\boldsymbol\beta_1 + C)\cdot X_i}}{e^{(\boldsymbol\beta_0 + C)\cdot X_i} + e^{(\boldsymbol\beta_1 + C)\cdot X_i}} = \frac{e^{\boldsymbol\beta_1 \cdot X_i}\, e^{C \cdot X_i}}{e^{\boldsymbol\beta_0 \cdot X_i}\, e^{C \cdot X_i} + e^{\boldsymbol\beta_1 \cdot X_i}\, e^{C \cdot X_i}} = \frac{e^{C \cdot X_i}\, e^{\boldsymbol\beta_1 \cdot X_i}}{e^{C \cdot X_i} \left(e^{\boldsymbol\beta_0 \cdot X_i} + e^{\boldsymbol\beta_1 \cdot X_i}\right)} = \frac{e^{\boldsymbol\beta_1 \cdot X_i}}{e^{\boldsymbol\beta_0 \cdot X_i} + e^{\boldsymbol\beta_1 \cdot X_i}}.

As an example of the utility-theory interpretation, suppose the choices in an election are a right-of-center party, a left-of-center party, and a secessionist party. We would then use three latent variables, one for each choice. Then, in accordance with utility theory, we can interpret the latent variables as expressing the utility that results from making each of the choices. We can also interpret the regression coefficients as indicating the strength that the associated factor (i.e. explanatory variable) has in contributing to the utility, or more correctly, the amount by which a unit change in an explanatory variable changes the utility of a given choice. A voter might expect that the right-of-center party would lower taxes, especially on rich people. This would give low-income people no benefit, i.e. no change in utility (since they usually don't pay taxes); would cause moderate benefit (i.e. somewhat more money, or moderate utility increase) for middle-income people; and would cause significant benefits for high-income people. On the other hand, the left-of-center party might be expected to raise taxes and offset it with increased welfare and other assistance for the lower and middle classes. This would cause significant positive benefit to low-income people, perhaps weak benefit to middle-income people, and significant negative benefit to high-income people. Finally, the secessionist party would take no direct actions on the economy, but simply secede. A low-income or middle-income voter might expect basically no clear utility gain or loss from this, but a high-income voter might expect negative utility, since he/she is likely to own companies, which will have a harder time doing business in such an environment and probably lose money.
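The shift-invariance used in the nonidentifiability argument above is easy to verify numerically; the following sketch (illustrative values) shows that adding the same constant to both linear predictors leaves the softmax probabilities unchanged.

import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))    # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([0.3, -1.2])             # beta_0 . X_i and beta_1 . X_i
print(softmax(scores))
print(softmax(scores + 5.0))               # identical probabilities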
The model can also be written directly in terms of the predicted probability:

p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}}.

This functional form is commonly called a single-layer perceptron or single-layer artificial neural network. A single-layer neural network computes a continuous output instead of a step function.
The derivative of p_i with respect to X is computed from the general form

y = \frac{1}{1 + e^{-f(X)}},

where f(X) is an analytic function in X. With this choice, the single-layer neural network is identical to the logistic regression model. This function has a continuous derivative, which allows it to be used in backpropagation. This function is also preferred because its derivative is easily calculated:

\frac{dy}{dX} = y(1 - y)\,\frac{df}{dX}.
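A quick finite-difference check of this derivative identity (here with f(x) = x, so df/dx = 1):

import math

def y(x):
    return 1.0 / (1.0 + math.exp(-x))

x, h = 0.7, 1e-6
numeric = (y(x + h) - y(x - h)) / (2 * h)
analytic = y(x) * (1 - y(x))
print(numeric, analytic)     # both approximately 0.2217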
A closely related model assumes that each i is associated not with a single Bernoulli trial but with n_i independent identically distributed trials, so that the observation Y_i is the number of successes. An example of this distribution is the fraction of seeds (p_i) that germinate after n_i are planted.

In terms of expected values, this model is expressed as follows:

p_i = \operatorname{E}\!\left[\left.\frac{Y_i}{n_i}\,\right|\, X_i\right],

so that

\operatorname{logit}\!\left(\operatorname{E}\!\left[\left.\frac{Y_i}{n_i}\,\right|\, X_i\right]\right) = \operatorname{logit}(p_i) = \ln\frac{p_i}{1 - p_i} = \boldsymbol\beta \cdot X_i,

or equivalently:

\Pr(Y_i = y_i \mid X_i) = \binom{n_i}{y_i} p_i^{\,y_i} (1 - p_i)^{\,n_i - y_i} = \binom{n_i}{y_i} \left(\frac{1}{1 + e^{-\boldsymbol\beta \cdot X_i}}\right)^{y_i} \left(1 - \frac{1}{1 + e^{-\boldsymbol\beta \cdot X_i}}\right)^{n_i - y_i}.

This model can be fit using the same sorts of methods as the above more basic model.
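A small sketch of this binomial formulation (assumes NumPy and SciPy; the coefficients, trial counts and outcomes are invented): the success probability is the inverse logit of beta . X_i, and the log-likelihood sums binomial log-probabilities.

import numpy as np
from scipy.stats import binom

beta = np.array([-0.2, 1.0])                            # illustrative coefficients
X = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 2.0]])     # first column is the intercept
n = np.array([10, 10, 10])                              # seeds planted per observation
y = np.array([2, 5, 9])                                 # seeds that germinated

p = 1.0 / (1.0 + np.exp(-X @ beta))                     # p_i = inverse logit of beta . X_i
log_lik = binom.logpmf(y, n, p).sum()                   # sum of ln Pr(Y_i = y_i | X_i)
print("p =", p)
print("log-likelihood =", log_lik)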
In a Bayesian treatment of logistic regression, a prior distribution is placed on the regression coefficients; because the logistic likelihood has no conjugate prior, the posterior cannot be computed exactly and must be approximated. There are various possibilities:

Don't do a proper Bayesian analysis, but simply compute a maximum a posteriori point estimate of the parameters. This is common, for example, in maximum entropy classifiers in machine learning.
Use a more general approximation method such as the Metropolis-Hastings algorithm.
Draw a Markov chain Monte Carlo sample from the exact posterior by using the Independent Metropolis-Hastings algorithm with a heavy-tailed multivariate candidate distribution found by matching the mode and curvature at the mode of the normal approximation to the posterior and then using the Student's t shape with low degrees of freedom.[22] This is shown to have excellent convergence properties.
Use a latent variable model and approximate the logistic distribution using a more tractable distribution, e.g. a Student's t-distribution or a mixture of normal distributions.
Use a probit model in place of the logistic model; probit regression is extremely common, and a ready-made Bayesian implementation may already be available.
Use the Laplace approximation of the posterior distribution.[23] This approximates the posterior with a Gaussian distribution. This is not a terribly good approximation, but it suffices if all that is desired is an estimate of the posterior mean and variance. In such a case, an approximation scheme such as variational Bayes can be used.[24]
36.7.1 Gibbs sampling with an approximating distribution

As shown above, logistic regression is equivalent to a latent-variable model with an error variable distributed according to a standard logistic distribution. The overall distribution of the latent variable is also a logistic distribution, with the mean equal to beta . X_i (i.e. the fixed quantity added to the error variable). This model considerably simplifies the application of techniques such as Gibbs sampling. However, sampling the regression coefficients is still difficult, because of the lack of conjugacy between the normal and logistic distributions. Changing the prior distribution over the regression coefficients is of no help, because the logistic distribution is not in the exponential family and thus has no conjugate prior.

One possibility is to use a more general Markov chain Monte Carlo technique, such as the Metropolis-Hastings algorithm, which can sample arbitrary distributions. Another possibility, however, is to replace the logistic distribution with a similar-shaped distribution that is easier to work with using Gibbs sampling. In fact, the logistic and normal distributions have a similar shape, and thus one possibility is simply to have normally distributed errors. Because the normal distribution is conjugate to itself, sampling the regression coefficients becomes easy. In fact, this model is exactly the model used in probit regression.

However, the normal and logistic distributions differ in that the logistic has heavier tails. As a result, it is more robust to inaccuracies in the underlying model (which are inevitable, in that the model is essentially always an approximation) or to errors in the data. Probit regression loses some of this robustness.

Another alternative is to use errors distributed as a Student's t-distribution. The Student's t-distribution has heavy tails, and is easy to sample from because it is the compound distribution of a normal distribution with variance distributed as an inverse gamma distribution. In other words, if a normal distribution is used for the error variable, and another latent variable, following an inverse gamma distribution, is added corresponding to the variance of this error variable, the marginal distribution of the error variable will follow a Student's t distribution. Because of the various conjugacy relationships, all variables in this model are easy to sample from.

The Student's t distribution that best approximates a standard logistic distribution can be determined by matching the moments of the two distributions. The Student's t distribution has three parameters, and since the skewness of both distributions is always 0, the first four moments can all be matched, using the following equations:

\mu = 0, \qquad s^2 \frac{\nu}{\nu - 2} = \frac{\pi^2}{3}, \qquad \frac{6}{\nu - 4} = \frac{6}{5}.

This yields the following values:

\mu = 0, \qquad s = \sqrt{\frac{7}{9} \cdot \frac{\pi^2}{3}}, \qquad \nu = 9.

The following graphs compare the standard logistic distribution with the Student's t distribution that matches the first four moments using the above-determined values, as well as the normal distribution that matches the first two moments. Note how much closer the Student's t distribution agrees, especially in the tails. Beyond about two standard deviations from the mean, the logistic and normal distributions diverge rapidly, but the logistic and Student's t distributions don't start diverging significantly until more than 5 standard deviations away.

(Another possibility, also amenable to Gibbs sampling, is to approximate the logistic distribution using a mixture density of normal distributions.)
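These moment-matching values are easy to verify numerically. The sketch below (assumes SciPy) checks the variance and excess-kurtosis equations and compares the tail probabilities of the three distributions at five logistic standard deviations.

import numpy as np
from scipy.stats import logistic, t, norm

nu = 9.0
s = np.sqrt(7.0 * np.pi**2 / 27.0)
print("t variance       :", s**2 * nu / (nu - 2), "vs pi^2/3 =", np.pi**2 / 3)
print("t excess kurtosis:", 6.0 / (nu - 4), "vs logistic:", 6.0 / 5)

# tail comparison at 5 "standard deviations" of the logistic (sd = pi/sqrt(3))
x = 5 * np.pi / np.sqrt(3)
print("logistic tail :", logistic.sf(x))
print("student-t tail:", t.sf(x / s, nu))                      # scaled t, much closer
print("normal tail   :", norm.sf(x / (np.pi / np.sqrt(3))))    # diverges far sooner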
36.8 Extensions

There are large numbers of extensions:

Multinomial logistic regression (or multinomial logit) handles the case of a multi-way categorical dependent variable (with unordered values, also called "classification"). Note that the general case of having dependent variables with more than two values is termed polytomous regression.
Ordered logistic regression (or ordered logit) handles ordinal dependent variables (ordered values).
Mixed logit is an extension of multinomial logit that allows for correlations among the choices of the dependent variable.
An extension of the logistic model to sets of interdependent variables is the conditional random field.
A way to measure a model's suitability is to assess the model against a set of data that was not used to create the model.[25] The class of techniques is called cross-validation. This holdout model assessment method is particularly valuable when data are collected in different settings (e.g., at different times or places) or when models are assumed to be generalizable.

To measure the suitability of a binary regression model, one can classify both the actual value and the predicted value of each observation as either 0 or 1.[26] The predicted value of an observation can be set equal to 1 if the estimated probability that the observation equals 1 is above 1/2, and set equal to 0 if the estimated probability is below 1/2. Here logistic regression is being used as a binary classification model. There are four possible combined classifications:

1. prediction of 0 when the holdout sample has a 0 (True Negatives, the number of which is TN)
2. prediction of 0 when the holdout sample has a 1 (False Negatives, the number of which is FN)
3. prediction of 1 when the holdout sample has a 0 (False Positives, the number of which is FP)
4. prediction of 1 when the holdout sample has a 1 (True Positives, the number of which is TP)

These classifications are used to calculate accuracy, precision (also called positive predictive value), recall (also called sensitivity), specificity and negative predictive value:

\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}

\text{Precision} = \text{Positive predictive value} = \frac{TP}{TP + FP}

\text{Negative predictive value} = \frac{TN}{TN + FN}

\text{Recall} = \text{Sensitivity} = \frac{TP}{TP + FN}

\text{Specificity} = \frac{TN}{TN + FP}
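A minimal sketch of this holdout evaluation (the labels and estimated probabilities are invented for the example): threshold the probabilities at 1/2, count the four combined classifications, and compute the listed measures.

import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
p_hat  = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3, 0.55, 0.45])
y_pred = (p_hat > 0.5).astype(int)

TP = int(np.sum((y_pred == 1) & (y_true == 1)))
TN = int(np.sum((y_pred == 0) & (y_true == 0)))
FP = int(np.sum((y_pred == 1) & (y_true == 0)))
FN = int(np.sum((y_pred == 0) & (y_true == 1)))

print("accuracy   :", (TP + TN) / (TP + FP + FN + TN))
print("precision  :", TP / (TP + FP))           # positive predictive value
print("NPV        :", TN / (TN + FN))
print("recall     :", TP / (TP + FN))           # sensitivity
print("specificity:", TN / (TN + FP))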
36.10 See also

Ordered logit
Hosmer-Lemeshow test
Brier score
MLPACK - contains a C++ implementation of logistic regression
Local case-control sampling

36.11 References

[1] David A. Freedman (2009). Statistical Models: Theory and Practice. Cambridge University Press. p. 128.
[2] Cox, DR (1958). "The regression analysis of binary sequences (with discussion)". J Roy Stat Soc B 20: 215-242.
[3] Walker, SH; Duncan, DB (1967). "Estimation of the probability of an event as a function of several independent variables". Biometrika 54: 167-178.
[4] Gareth James; Daniela Witten; Trevor Hastie; Robert Tibshirani (2013). An Introduction to Statistical Learning. Springer. p. 6.
[5] Boyd, C. R.; Tolson, M. A.; Copes, W. S. (1987). "Evaluating trauma care: The TRISS method. Trauma Score and the Injury Severity Score". The Journal of Trauma 27 (4): 370-378. doi:10.1097/00005373-198704000-00005. PMID 3106646.
[6] Kologlu M., Elker D., Altun H., Sayek I. Validation of MPI and OIA II in two different groups of patients with secondary peritonitis // Hepato-Gastroenterology. 2001. Vol. 48, No. 37. P. 147-151.
[7] Biondo S., Ramos E., Deiros M. et al. Prognostic factors for mortality in left colonic peritonitis: a new scoring system // J. Am. Coll. Surg. 2000. Vol. 191, No. 6. P. 635-642.
[8] Marshall J.C., Cook D.J., Christou N.V. et al. Multiple Organ Dysfunction Score: A reliable descriptor of a complex clinical outcome // Crit. Care Med. 1995. Vol. 23. P. 1638-1652.
[9] Le Gall J.-R., Lemeshow S., Saulnier F. A new Simplified Acute Physiology Score (SAPS II) based on a European/North American multicenter study // JAMA. 1993. Vol. 270. P. 2957-2963.
[12] M. Strano; B.M. Colosimo (2006). "Logistic regression analysis for experimental determination of forming limit diagrams". International Journal of Machine Tools and Manufacture 46 (6). doi:10.1016/j.ijmachtools.2005.07.005.
[13] Palei, S. K.; Das, S. K. (2009). "Logistic regression model for prediction of roof fall risks in bord and pillar workings in coal mines: An approach". Safety Science 47: 88. doi:10.1016/j.ssci.2008.01.002.
[14] Hosmer, David W.; Lemeshow, Stanley (2000). Applied Logistic Regression (2nd ed.). Wiley. ISBN 0-471-35632-8.
[15] http://www.planta.cn/forum/files_planta/introduction_to_categorical_data_analysis_805.pdf
[16] Menard, Scott W. (2002). Applied Logistic Regression (2nd ed.). SAGE. ISBN 978-0-7619-2208-7.
[17] Menard ch 1.3
[18] Peduzzi, P; Concato, J; Kemper, E; Holford, TR; Feinstein, AR (December 1996). "A simulation study of the number of events per variable in logistic regression analysis". Journal of Clinical Epidemiology 49 (12): 1373-9. doi:10.1016/s0895-4356(96)00236-3. PMID 8970487.
[19] Greene, William N. (2003). Econometric Analysis (Fifth ed.). Prentice-Hall. ISBN 0-13-066189-9.
[20] Cohen, Jacob; Cohen, Patricia; West, Steven G.; Aiken, Leona S. (2002). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences (3rd ed.). Routledge. ISBN 978-0-8058-2223-6.
[21] https://class.stanford.edu/c4x/HumanitiesScience/StatLearning/asset/classification.pdf slide 16
[22] Bolstad, William M. (2010). Understanding Computational Bayesian Statistics. Wiley. ISBN 978-0-470-04609-8.
[23] Bishop, Christopher M. "Chapter 4. Linear Models for Classification". Pattern Recognition and Machine Learning. Springer Science+Business Media, LLC. pp. 217-218. ISBN 978-0387-31073-2.

36.12 Further reading

Agresti, Alan (2002). Categorical Data Analysis. New York: Wiley-Interscience. ISBN 0-471-36093-7.
Amemiya, T. (1985). Advanced Econometrics. Harvard University Press. ISBN 0-674-00560-0.
Balakrishnan, N. (1991). Handbook of the Logistic Distribution. Marcel Dekker, Inc. ISBN 978-0-8247-8587-1.
Greene, William H. (2003). Econometric Analysis, fifth edition. Prentice Hall. ISBN 0-13-066189-9.
Hilbe, Joseph M. (2009). Logistic Regression Models. Chapman & Hall/CRC Press. ISBN 978-1-4200-7575-5.
Howell, David C. (2010). Statistical Methods for Psychology, 7th ed. Belmont, CA: Thomson Wadsworth. ISBN 978-0-495-59786-5.
Peduzzi, P.; J. Concato; E. Kemper; T.R. Holford; A.R. Feinstein (1996). "A simulation study of the number of events per variable in logistic regression analysis". Journal of Clinical Epidemiology 49 (12): 1373-1379. doi:10.1016/s0895-4356(96)00236-3. PMID 8970487.

36.13 External links

Econometrics Lecture (topic: Logit model) on YouTube by Mark Thoma
Logistic Regression Interpretation
Logistic Regression tutorial
Using open source software for building Logistic Regression models
Logistic regression. Biomedical statistics
Chapter 37

Linear discriminant analysis

Not to be confused with latent Dirichlet allocation.

Linear discriminant analysis (LDA) is a generalization of Fisher's linear discriminant, a method used in statistics, pattern recognition and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.

LDA is closely related to analysis of variance (ANOVA) and regression analysis, which also attempt to express one dependent variable as a linear combination of other features or measurements.[1][2] However, ANOVA uses categorical independent variables and a continuous dependent variable, whereas discriminant analysis has continuous independent variables and a categorical dependent variable (i.e. the class label).[3] Logistic regression and probit regression are more similar to LDA than ANOVA is, as they also explain a categorical variable by the values of continuous independent variables. These other methods are preferable in applications where it is not reasonable to assume that the independent variables are normally distributed, which is a fundamental assumption of the LDA method.

LDA is also closely related to principal component analysis (PCA) and factor analysis in that they both look for linear combinations of variables which best explain the data.[4] LDA explicitly attempts to model the difference between the classes of data. PCA on the other hand does not take into account any difference in class, and factor analysis builds the feature combinations based on differences rather than similarities. Discriminant analysis is also different from factor analysis in that it is not an interdependence technique: a distinction between independent variables and dependent variables (also called criterion variables) must be made.

LDA works when the measurements made on independent variables for each observation are continuous quantities. When dealing with categorical independent variables, the equivalent technique is discriminant correspondence analysis.[5][6]

37.1 LDA for two classes

Consider a set of observations x (also called features, attributes, variables or measurements) for each sample of an object or event with known class y. This set of samples is called the training set. The classification problem is then to find a good predictor for the class y of any sample of the same distribution (not necessarily from the training set) given only an observation x.[7]:338

LDA approaches the problem by assuming that the conditional probability density functions p(x|y=0) and p(x|y=1) are both normally distributed with mean and covariance parameters (mu_0, Sigma_0) and (mu_1, Sigma_1), respectively. Under this assumption, the Bayes optimal solution is to predict points as being from the second class if the log of the likelihood ratios is below some threshold T, so that

(x - \mu_0)^{\mathsf T} \Sigma_0^{-1} (x - \mu_0) + \ln|\Sigma_0| - (x - \mu_1)^{\mathsf T} \Sigma_1^{-1} (x - \mu_1) - \ln|\Sigma_1| < T.

Without any further assumptions, the resulting classifier is referred to as QDA (quadratic discriminant analysis). LDA instead makes the additional simplifying homoscedasticity assumption (i.e. that the class covariances are identical, so Sigma_0 = Sigma_1 = Sigma) and that the covariances have full rank. In this case, several terms cancel:

x^{\mathsf T} \Sigma_0^{-1} x = x^{\mathsf T} \Sigma_1^{-1} x,
x^{\mathsf T} \Sigma_i^{-1} \mu_i = \mu_i^{\mathsf T} \Sigma_i^{-1} x \quad \text{because } \Sigma_i \text{ is Hermitian,}

and the above decision criterion becomes a threshold on the dot product

w \cdot x > c

for some threshold constant c, where

w = \Sigma^{-1} (\mu_1 - \mu_0)
254
c = \frac{1}{2} (T - \mu_0^T \Sigma_0^{-1} \mu_0 + \mu_1^T \Sigma_1^{-1} \mu_1).

This means that the criterion of an input x being in a class y is purely a function of this linear combination of the known observations.

It is often useful to see this conclusion in geometrical terms: the criterion of an input x being in a class y is purely a function of the projection of the multidimensional-space point x onto the vector w (thus, we only consider its direction). In other words, the observation belongs to y if the corresponding x is located on a certain side of a hyperplane perpendicular to w. The location of the plane is defined by the threshold c.
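The two-class rule above is straightforward to compute directly. Below is a minimal NumPy sketch (not from the article): it estimates the class means, pools the covariances under the homoscedasticity assumption, forms w = \Sigma^{-1}(\mu_1 - \mu_0), and places the threshold midway between the projected class means, which corresponds to the explicit choice of c given later for Fisher's discriminant. The data and function names are illustrative.

```python
import numpy as np

def fit_lda_two_class(X0, X1):
    """Two-class LDA: pooled covariance, projection vector w, and threshold c."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled (shared) covariance estimate, reflecting the homoscedasticity assumption.
    n0, n1 = len(X0), len(X1)
    Sigma = ((n0 - 1) * np.cov(X0, rowvar=False) +
             (n1 - 1) * np.cov(X1, rowvar=False)) / (n0 + n1 - 2)
    w = np.linalg.solve(Sigma, mu1 - mu0)   # w = Sigma^{-1} (mu1 - mu0)
    c = w @ (mu0 + mu1) / 2                 # threshold midway between projected means
    return w, c

def predict(X, w, c):
    """Assign class 1 where w . x > c, else class 0."""
    return (X @ w > c).astype(int)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X0 = rng.normal([0, 0], 1.0, size=(200, 2))
    X1 = rng.normal([2, 1], 1.0, size=(200, 2))
    w, c = fit_lda_two_class(X0, X1)
    print("w =", w, "c =", c)
    print("training accuracy:",
          np.r_[predict(X0, w, c) == 0, predict(X1, w, c) == 1].mean())
```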
37.2 Canonical discriminant analysis for k classes

Canonical discriminant analysis (CDA) finds axes (k - 1 canonical coordinates, k being the number of classes) that best separate the categories. These linear functions are uncorrelated and define, in effect, an optimal k - 1 space through the n-dimensional cloud of data that best separates (the projections in that space of) the k groups. See Multiclass LDA for details below.

37.3 Fisher's linear discriminant

The terms Fisher's linear discriminant and LDA are often used interchangeably, although Fisher's original article[1] actually describes a slightly different discriminant, which does not make some of the assumptions of LDA such as normally distributed classes or equal class covariances.

Suppose two classes of observations have means \mu_0, \mu_1 and covariances \Sigma_0, \Sigma_1. Then the linear combination of features w \cdot x will have means w \cdot \mu_i and variances w^T \Sigma_i w for i = 0, 1. Fisher defined the separation between these two distributions to be the ratio of the variance between the classes to the variance within the classes:

S = \frac{\sigma^2_{between}}{\sigma^2_{within}} = \frac{(w \cdot \mu_1 - w \cdot \mu_0)^2}{w^T \Sigma_1 w + w^T \Sigma_0 w} = \frac{(w \cdot (\mu_1 - \mu_0))^2}{w^T (\Sigma_0 + \Sigma_1) w}

This measure is, in some sense, a measure of the signal-to-noise ratio for the class labelling. It can be shown that the maximum separation occurs when

w \propto (\Sigma_0 + \Sigma_1)^{-1} (\mu_1 - \mu_0).

When the assumptions of LDA are satisfied, the above equation is equivalent to LDA.

Be sure to note that the vector w is the normal to the discriminant hyperplane. As an example, in a two-dimensional problem, the line that best divides the two groups is perpendicular to w.

Generally, the data points to be discriminated are projected onto w; then the threshold that best separates the data is chosen from analysis of the one-dimensional distribution. There is no general rule for the threshold. However, if projections of points from both classes exhibit approximately the same distributions, a good choice would be the hyperplane between projections of the two means, w \cdot \mu_0 and w \cdot \mu_1. In this case the parameter c in the threshold condition w \cdot x > c can be found explicitly:

c = w \cdot \frac{1}{2} (\mu_0 + \mu_1) = \frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 - \frac{1}{2} \mu_0^T \Sigma^{-1} \mu_0.

Otsu's Method is related to Fisher's linear discriminant, and was created to binarize the histogram of pixels in a grayscale image by optimally picking the black/white threshold that minimizes intra-class variance and maximizes inter-class variance within/between grayscales assigned to black and white pixel classes.

37.4 Multiclass LDA

In the case where there are more than two classes, the analysis used in the derivation of the Fisher discriminant can be extended to find a subspace which appears to contain all of the class variability. This generalization is due to C. R. Rao.[8] Suppose that each of C classes has a mean \mu_i and the same covariance \Sigma. Then the scatter between class variability may be defined by the sample covariance of the class means

\Sigma_b = \frac{1}{C} \sum_{i=1}^{C} (\mu_i - \mu)(\mu_i - \mu)^T

where \mu is the mean of the class means. The class separation in a direction w in this case will be given by

S = \frac{w^T \Sigma_b w}{w^T \Sigma w}.

This means that when w is an eigenvector of \Sigma^{-1} \Sigma_b the separation will be equal to the corresponding eigenvalue.

If \Sigma^{-1} \Sigma_b is diagonalizable, the variability between features will be contained in the subspace spanned by the eigenvectors corresponding to the C - 1 largest eigenvalues (since \Sigma_b is of rank C - 1 at most). These eigenvectors are primarily used in feature reduction, as in PCA. The eigenvectors corresponding to the smaller eigenvalues will tend to be very sensitive to the exact choice of training data, and it is often necessary to use regularisation as described in the next section.
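To make the eigenvalue formulation above concrete, here is a small sketch (not from the article) that builds \Sigma and \Sigma_b from labelled data and projects onto the leading eigenvectors of \Sigma^{-1}\Sigma_b via a generalized symmetric eigenproblem. It assumes NumPy and SciPy are available; all names and the demo data are illustrative.

```python
import numpy as np
from scipy.linalg import eigh   # generalized symmetric eigensolver

def multiclass_lda(X, y, n_components=None):
    """Project X onto the leading eigenvectors of Sigma^{-1} Sigma_b."""
    classes = np.unique(y)
    C = len(classes)
    class_means = np.array([X[y == c].mean(axis=0) for c in classes])
    mu = class_means.mean(axis=0)               # mean of the class means, as in the text
    # Between-class scatter: sample covariance of the class means.
    Sigma_b = (class_means - mu).T @ (class_means - mu) / C
    # Shared within-class covariance, pooled over classes.
    Sigma = sum((X[y == c] - class_means[i]).T @ (X[y == c] - class_means[i])
                for i, c in enumerate(classes)) / (len(X) - C)
    # Solve Sigma_b v = lambda * Sigma v (eigenvectors of Sigma^{-1} Sigma_b);
    # eigh returns eigenvalues in ascending order, so take the largest ones.
    evals, evecs = eigh(Sigma_b, Sigma)
    k = n_components if n_components is not None else C - 1   # rank(Sigma_b) <= C - 1
    W = evecs[:, ::-1][:, :k]
    return X @ W, W

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(m, 1.0, size=(40, 4)) for m in (0.0, 2.0, 4.0)])
    y = np.repeat([0, 1, 2], 40)
    Z, W = multiclass_lda(X, y)
    print(Z.shape)   # (120, 2): at most C - 1 discriminant directions
```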
If classification is required, instead of dimension reduction, there are a number of alternative techniques available. For instance, the classes may be partitioned, and a standard Fisher discriminant or LDA used to classify each partition. A common example of this is "one against the rest", where the points from one class are put in one group, and everything else in the other, and then LDA applied. This will result in C classifiers, whose results are combined. Another common method is pairwise classification, where a new classifier is created for each pair of classes (giving C(C - 1)/2 classifiers in total), with the individual classifiers combined to produce a final classification.

37.5 Practical use

In practice, the class means and covariances are not known. They can, however, be estimated from the training set. Either the maximum likelihood estimate or the maximum a posteriori estimate may be used in place of the exact value in the above equations. Although the estimates of the covariance may be considered optimal in some sense, this does not mean that the resulting discriminant obtained by substituting these values is optimal in any sense, even if the assumption of normally distributed classes is correct.

Another complication in applying LDA and Fisher's discriminant to real data occurs when the number of measurements of each sample exceeds the number of samples in each class.[4] In this case, the covariance estimates do not have full rank, and so cannot be inverted. There are a number of ways to deal with this. One is to use a pseudo inverse instead of the usual matrix inverse in the above formulae. However, better numeric stability may be achieved by first projecting the problem onto the subspace spanned by \Sigma_b.[9] Another strategy to deal with small sample size is to use a shrinkage estimator of the covariance matrix, which can be expressed mathematically as

\Sigma = (1 - \lambda) \Sigma + \lambda I,

where I is the identity matrix and \lambda is the shrinkage intensity or regularisation parameter.

LDA can be generalized to multiple discriminant analysis, where c becomes a categorical variable with N possible states, instead of only two. Analogously, if the class-conditional densities p(x|c = i) are normal with shared covariances, the sufficient statistic for P(c|x) are the values of N projections, which are the subspace spanned by the N means, affine projected by the inverse covariance matrix. These projections can be found by solving a generalized eigenvalue problem, where the numerator is the covariance matrix formed by treating the means as the samples, and the denominator is the shared covariance matrix.

37.6 Applications

In addition to the examples given below, LDA is applied in positioning and product management.

37.6.1 Bankruptcy prediction

In bankruptcy prediction based on accounting ratios and other financial variables, linear discriminant analysis was the first statistical method applied to systematically explain which firms entered bankruptcy vs. survived. Despite limitations including known nonconformance of accounting ratios to the normal distribution assumptions of LDA, Edward Altman's 1968 model is still a leading model in practical applications.

37.6.2 Face recognition

In computerised face recognition, each face is represented by a large number of pixel values. Linear discriminant analysis is primarily used here to reduce the number of features to a more manageable number before classification. Each of the new dimensions is a linear combination of pixel values, which form a template. The linear combinations obtained using Fisher's linear discriminant are called Fisher faces, while those obtained using the related principal component analysis are called eigenfaces.
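As an illustration of the two uses just described (shrinkage regularisation of the covariance estimate, and feature reduction before classification), here is a small sketch using scikit-learn, assuming that library is available. The data, the shrinkage value and the number of components are illustrative choices, not values from the article.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(50, 10)) for m in (0.0, 1.0, 2.0)])
y = np.repeat([0, 1, 2], 50)

# Shrinkage blends the empirical covariance with a (scaled) identity matrix,
# in the spirit of Sigma = (1 - lambda) Sigma + lambda I described above;
# it is useful when there are few samples relative to the number of features.
clf = LinearDiscriminantAnalysis(solver="lsqr", shrinkage=0.2)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))

# Used as a dimensionality-reduction step (e.g. before face recognition),
# LDA projects onto at most C - 1 = 2 discriminant directions.
reducer = LinearDiscriminantAnalysis(n_components=2)
X_reduced = reducer.fit_transform(X, y)
print("reduced shape:", X_reduced.shape)
```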
37.6.3 Marketing

1. ... data from a sample of potential customers concerning their ratings of all the product attributes. The data collection stage is usually done by marketing research professionals. Survey questions ask the respondent to rate a product from one to five (or 1 to 7, or 1 to 10) on a range of attributes chosen by the researcher. Anywhere from five to twenty attributes are chosen. They could include things like: ease of use, weight, accuracy, durability, colourfulness, price, or size. The attributes chosen will vary depending on the product being studied. The same question is asked about all the products in the study. The data for multiple products is codified and input into a statistical program such as R, SPSS or SAS. (This step is the same as in Factor analysis.)

2. Estimate the Discriminant Function Coefficients and determine the statistical significance and validity: Choose the appropriate discriminant analysis method. The direct method involves estimating the discriminant function so that all the predictors are assessed simultaneously. The stepwise method enters the predictors sequentially. The two-group method should be used when the dependent variable has two categories or states. The multiple discriminant method is used when the dependent variable has three or more categorical states. Use Wilks's Lambda to test for significance in SPSS or the F statistic in SAS. The most common method used to test validity is to split the sample into an estimation or analysis sample, and a validation or holdout sample. The estimation sample is used in constructing the discriminant function. The validation sample is used to construct a classification matrix which contains the number of correctly classified and incorrectly classified cases. The percentage of correctly classified cases is called the hit ratio.

37.6.4 Biomedical studies

... discriminant functions are built which help to objectively classify disease in a future patient into mild, moderate or severe form.

In biology, similar principles are used in order to classify and define groups of different biological objects, for example, to define phage types of Salmonella enteritidis based on Fourier transform infrared spectra,[12] to detect the animal source of Escherichia coli by studying its virulence factors,[13] etc.

37.6.5 Earth Science

This method can be used to separate the alteration zones. For example, when different data from various zones are available, discriminant analysis can find the pattern within the data and classify them effectively.[14]

37.7 See also

Data mining
Decision tree learning
Factor analysis
Kernel Fisher discriminant analysis
Logit (for logistic regression)
Multidimensional scaling
Multilinear subspace learning
Pattern recognition
Perceptron
[7] Venables, W. N.; Ripley, B. D. (2002). Modern Applied Statistics with S (4th ed.). Springer Verlag. ISBN 0-387-95457-0.

LDA tutorial using MS Excel
Biomedical statistics. Discriminant analysis
Naive Bayes classifier

In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features.

Naive Bayes has been studied extensively since the 1950s. It was introduced under a different name into the text retrieval community in the early 1960s,[1]:488 and remains a popular (baseline) method for text categorization, the problem of judging documents as belonging to one category or the other (such as spam or legitimate, sports or politics, etc.) with word frequencies as the features. With appropriate preprocessing, it is competitive in this domain with more advanced methods including support vector machines.[2] It also finds application in automatic medical diagnosis.[3]

Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem. Maximum-likelihood training can be done by evaluating a closed-form expression,[1]:718 which takes linear time, rather than by expensive iterative approximation as used for many other types of classifiers.

In the statistics and computer science literature, naive Bayes models are known under a variety of names, including simple Bayes and independence Bayes.[4] All these names reference the use of Bayes' theorem in the classifier's decision rule, but naive Bayes is not (necessarily) a Bayesian method;[4] Russell and Norvig note that "[naive Bayes] is sometimes called a Bayesian classifier, a somewhat careless usage that has prompted true Bayesians to call it the idiot Bayes model."[1]:482

38.1 Introduction

Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. It is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 3 cm in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness and diameter features.

For some types of probability models, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without accepting Bayesian probability or using any Bayesian methods.

Despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are sound theoretical reasons for the apparently implausible efficacy of naive Bayes classifiers.[5] Still, a comprehensive comparison with other classification algorithms in 2006 showed that Bayes classification is outperformed by other approaches, such as boosted trees or random forests.[6]

An advantage of naive Bayes is that it only requires a small amount of training data to estimate the parameters necessary for classification.

38.2 Probabilistic model

Abstractly, naive Bayes is a conditional probability model: given a problem instance to be classified, represented by a vector x = (x_1, \ldots, x_n) representing some n features (independent variables), it assigns to this instance probabilities

p(C_k | x_1, \ldots, x_n)

for each of K possible outcomes or classes.[7]
The problem with the above formulation is that if the number of features n is large or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible. The model is therefore reformulated to make it more tractable. Using Bayes' theorem, the conditional probability can be decomposed as

p(C_k | x_1, \ldots, x_n) = \frac{p(C_k) \, p(x_1, \ldots, x_n | C_k)}{p(x_1, \ldots, x_n)}.

The denominator does not depend on C_k and, for a given instance, is effectively constant, so only the numerator matters for classification. The numerator is equivalent to the joint probability model

p(C_k, x_1, \ldots, x_n),

which can be rewritten as follows, using the chain rule for repeated applications of the definition of conditional probability:

p(C_k, x_1, \ldots, x_n) = p(C_k) \, p(x_1, \ldots, x_n | C_k)
= p(C_k) \, p(x_1 | C_k) \, p(x_2, \ldots, x_n | C_k, x_1)
= p(C_k) \, p(x_1 | C_k) \, p(x_2 | C_k, x_1) \, p(x_3, \ldots, x_n | C_k, x_1, x_2)
= p(C_k) \, p(x_1 | C_k) \, p(x_2 | C_k, x_1) \cdots p(x_n | C_k, x_1, x_2, x_3, \ldots, x_{n-1})

Now the "naive" conditional independence assumptions come into play: assume that each feature F_i is conditionally independent of every other feature F_j for j \neq i, given the category C. This means that

p(x_i | C_k, x_j) = p(x_i | C_k)
p(x_i | C_k, x_j, x_k) = p(x_i | C_k)
p(x_i | C_k, x_j, x_k, x_l) = p(x_i | C_k)

and so on, for i \neq j, k, l. Thus, the joint model can be expressed as

p(C_k | x_1, \ldots, x_n) \propto p(C_k, x_1, \ldots, x_n) \propto p(C_k) \, p(x_1 | C_k) \, p(x_2 | C_k) \, p(x_3 | C_k) \cdots \propto p(C_k) \prod_{i=1}^{n} p(x_i | C_k).

This means that under the above independence assumptions, the conditional distribution over the class variable C is

p(C_k | x_1, \ldots, x_n) = \frac{1}{Z} \, p(C_k) \prod_{i=1}^{n} p(x_i | C_k),

where the evidence Z = p(x) is a scaling factor dependent only on x_1, \ldots, x_n, that is, a constant if the values of the feature variables are known. Combining this model with the maximum a posteriori (MAP) decision rule (pick the most probable class) yields the naive Bayes classifier, which assigns the class label

\hat{y} = \operatorname{argmax}_{k \in \{1, \ldots, K\}} \; p(C_k) \prod_{i=1}^{n} p(x_i | C_k).

38.3 Parameter estimation and event models

A class prior may be calculated by assuming equiprobable classes (i.e., priors = 1 / (number of classes)), or by calculating an estimate for the class probability from the training set (i.e., (prior for a given class) = (number of samples in the class) / (total number of samples)). To estimate the parameters for a feature's distribution, one must assume a distribution or generate nonparametric models for the features from the training set.[8]

The assumptions on distributions of features are called the event model of the naive Bayes classifier. For discrete features like the ones encountered in document classification (including spam filtering), multinomial and Bernoulli distributions are popular. These assumptions lead to two distinct models, which are often confused.[9][10]
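As a concrete illustration of the decision rule and the frequency-based parameter estimates described above, here is a minimal sketch for discrete features (not from the article). The fruit data and all names are illustrative, and the tiny probability floor stands in for the smoothing discussed further below.

```python
from collections import Counter, defaultdict
import math

def train(samples, labels):
    """Priors from class frequencies; per-feature, per-class value frequencies."""
    n = len(samples)
    prior = {c: cnt / n for c, cnt in Counter(labels).items()}
    counts = defaultdict(Counter)          # counts[(class, feature_index)][value]
    totals = Counter(labels)
    for x, c in zip(samples, labels):
        for i, v in enumerate(x):
            counts[(c, i)][v] += 1
    def likelihood(v, c, i):
        return counts[(c, i)][v] / totals[c]
    return prior, likelihood

def classify(x, prior, likelihood):
    # y = argmax_k p(C_k) * prod_i p(x_i | C_k), computed in log space for stability;
    # the small floor avoids log(0) for unseen feature values.
    def score(c):
        return math.log(prior[c]) + sum(
            math.log(likelihood(v, c, i) or 1e-12) for i, v in enumerate(x))
    return max(prior, key=score)

samples = [("red", "round"), ("red", "round"), ("green", "long"), ("yellow", "long")]
labels = ["apple", "apple", "banana", "banana"]
prior, likelihood = train(samples, labels)
print(classify(("red", "round"), prior, likelihood))   # -> "apple"
```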
38.3.1 Gaussian naive Bayes

When dealing with continuous data, a typical assumption is that the continuous values associated with each class are distributed according to a Gaussian distribution. For example, suppose the training data contain a continuous attribute, x. We first segment the data by the class, and then compute the mean and variance of x in each class. Let \mu_c be the mean of the values in x associated with class c, and let \sigma^2_c be the variance of the values in x associated with class c. Then, the probability distribution of some value given a class, p(x = v | c), can be computed by plugging v into the equation for a normal distribution parameterized by \mu_c and \sigma^2_c. That is,

p(x = v | c) = \frac{1}{\sqrt{2 \pi \sigma^2_c}} \, e^{-\frac{(v - \mu_c)^2}{2 \sigma^2_c}}.

38.3.2 Multinomial naive Bayes

With a multinomial event model, feature vectors represent the frequencies with which certain events have been generated, and the likelihood of observing a feature vector x given class C_k is

p(x | C_k) = \frac{(\sum_i x_i)!}{\prod_i x_i!} \prod_i p_{ki}^{x_i}.

The multinomial naive Bayes classifier becomes a linear classifier when expressed in log-space:[2]

\log p(C_k | x) \propto \log \left( p(C_k) \prod_{i=1}^{n} p_{ki}^{x_i} \right) = \log p(C_k) + \sum_{i=1}^{n} x_i \log p_{ki} = b + w_k^T x,

where b = \log p(C_k) and w_{ki} = \log p_{ki}.

If a given class and feature value never occur together in the training data, then the frequency-based probability estimate will be zero. This is problematic because it will wipe out all information in the other probabilities when they are multiplied. Therefore, it is often desirable to incorporate a small-sample correction, called pseudocount, in all probability estimates such that no probability is ever set to be exactly zero. This way of regularizing naive Bayes is called Laplace smoothing when the pseudocount is one, and Lidstone smoothing in the general case.

Rennie et al. discuss problems with the multinomial assumption in the context of document classification and possible ways to alleviate those problems, including the use of tf-idf weights instead of raw term frequencies and document length normalization, to produce a naive Bayes classifier that is competitive with support vector machines.[2]

38.3.4 Semi-supervised parameter estimation

Given a way to train a naive Bayes classifier from labeled data, it's possible to construct a semi-supervised training algorithm that can learn from a combination of labeled and unlabeled data by running the supervised learning algorithm in a loop:[11]

Given a collection D = L \cup U of labeled samples L and unlabeled samples U, start by training a naive Bayes classifier on L.
Until convergence, do:
    Predict class probabilities P(C | x) for all examples x in D.
    Re-train the model based on the probabilities (not the labels) predicted in the previous step.

Convergence is determined based on improvement to the model likelihood P(D | \theta), where \theta denotes the parameters of the naive Bayes model.

This training algorithm is an instance of the more general expectation-maximization algorithm (EM): the prediction step inside the loop is the E-step of EM, while the re-training of naive Bayes is the M-step. The algorithm is formally justified by the assumption that the data are generated by a mixture model, and the components of this mixture model are exactly the classes of the classification problem.[11]
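Below is a sketch of this loop for a one-dimensional Gaussian naive Bayes model (not from the article): the E-step predicts class probabilities for all points, the M-step re-fits the parameters from those probabilities, and the loop stops when the model likelihood stops improving. Clamping the labeled points to their given labels is one common variant; all data and names are illustrative.

```python
import numpy as np

def m_step(X, resp):
    """Re-fit priors, means and variances from soft class probabilities."""
    weights = resp.sum(axis=0)                     # effective count per class
    priors = weights / weights.sum()
    means = (resp * X[:, None]).sum(axis=0) / weights
    variances = (resp * (X[:, None] - means) ** 2).sum(axis=0) / weights
    return priors, means, variances

def e_step(X, priors, means, variances):
    """Class probabilities P(C | x) under the current Gaussian model."""
    dens = np.exp(-(X[:, None] - means) ** 2 / (2 * variances)) / np.sqrt(2 * np.pi * variances)
    joint = priors * dens
    return joint / joint.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X_lab = np.r_[rng.normal(0, 1, 20), rng.normal(4, 1, 20)]
y_lab = np.r_[np.zeros(20, int), np.ones(20, int)]
X_unl = np.r_[rng.normal(0, 1, 200), rng.normal(4, 1, 200)]
X_all = np.r_[X_lab, X_unl]

# Start from a classifier trained on the labeled samples only.
resp_lab = np.eye(2)[y_lab]
priors, means, variances = m_step(X_lab, resp_lab)

prev = -np.inf
for _ in range(100):
    resp = e_step(X_all, priors, means, variances)       # E-step: predict P(C | x)
    resp[: len(X_lab)] = resp_lab                         # labeled points keep their labels
    priors, means, variances = m_step(X_all, resp)        # M-step: re-train on probabilities
    dens = np.exp(-(X_all[:, None] - means) ** 2 / (2 * variances)) / np.sqrt(2 * np.pi * variances)
    loglik = np.log((priors * dens).sum(axis=1)).sum()    # model likelihood P(D | theta)
    if loglik - prev < 1e-6:
        break
    prev = loglik

print(priors, means, variances)
```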
38.4 Discussion

Despite the fact that the far-reaching independence assumptions are often inaccurate, the naive Bayes classifier has several properties that make it surprisingly useful in practice. In particular, the decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a one-dimensional distribution. This helps alleviate problems stemming from the curse of dimensionality, such as the need for data sets that scale exponentially with the number of features. While naive Bayes often fails to produce a good estimate for the correct class probabilities,[12] this may not be a requirement for many applications. For example, the naive Bayes classifier will make the correct MAP decision rule classification so long as the correct class is more probable than any other class. This is true regardless of whether the probability estimate is slightly, or even grossly, inaccurate. In this manner, the overall classifier can be robust enough to ignore serious deficiencies in its underlying naive probability model.[3] Other reasons for the observed success of the naive Bayes classifier are discussed in the literature cited below.

38.4.1 Relation to logistic regression

In the case of discrete inputs (indicator or frequency features for discrete events), naive Bayes classifiers form a generative-discriminative pair with (multinomial) logistic regression classifiers: each naive Bayes classifier can be considered a way of fitting a probability model that optimizes the joint likelihood p(C, x), while logistic regression fits the same probability model to optimize the conditional p(C | x).[13]

The link between the two can be seen by observing that the decision function for naive Bayes (in the binary case) can be rewritten as "predict class C_1 if the odds of p(C_1 | x) exceed those of p(C_2 | x)". Expressing this in log-space gives:

\log \frac{p(C_1 | x)}{p(C_2 | x)} = \log p(C_1 | x) - \log p(C_2 | x) > 0

The left-hand side of this equation is the log-odds, or logit, the quantity predicted by the linear model that underlies logistic regression. Since naive Bayes is also a linear model for the two discrete event models, it can be reparametrised as a linear function b + w^T x > 0. Obtaining the probabilities is then a matter of applying the logistic function to b + w^T x, or in the multiclass case, the softmax function.

Discriminative classifiers have lower asymptotic error than generative ones; however, research by Ng and Jordan has shown that in some practical cases naive Bayes can outperform logistic regression because it reaches its asymptotic error faster.[13]

38.5 Examples

38.5.1 Sex classification

Problem: classify whether a given person is a male or a female based on the measured features. The features include height, weight, and foot size.

Training

Example training set below.

The classifier created from the training set using a Gaussian distribution assumption would be (given variances are unbiased sample variances):

Let's say we have equiprobable classes, so P(male) = P(female) = 0.5. This prior probability distribution might be based on our knowledge of frequencies in the larger population, or on frequency in the training set.

Testing

Below is a sample to be classified as male or female.

We wish to determine which posterior is greater, male or female. For the classification as male the posterior is given by

posterior(male) = \frac{P(male) \, p(height | male) \, p(weight | male) \, p(foot size | male)}{evidence}

For the classification as female the posterior is given by

posterior(female) = \frac{P(female) \, p(height | female) \, p(weight | female) \, p(foot size | female)}{evidence}

The evidence (also termed normalizing constant) may be calculated:

evidence = P(male) \, p(height | male) \, p(weight | male) \, p(foot size | male) + P(female) \, p(height | female) \, p(weight | female) \, p(foot size | female)

However, given the sample, the evidence is a constant and thus scales both posteriors equally. It therefore does not affect classification and can be ignored.
We now determine the probability distribution for the sex of the sample.

P(male) = 0.5

p(height | male) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( \frac{-(6 - \mu)^2}{2 \sigma^2} \right) \approx 1.5789,

where \mu = 5.855 and \sigma^2 = 3.5033 \cdot 10^{-2} are the parameters of the normal distribution which have been previously determined from the training set. Note that a value greater than 1 is OK here: it is a probability density rather than a probability, because height is a continuous variable.

p(weight | male) = 5.9881 \cdot 10^{-6}
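The density value quoted for height can be reproduced directly from the stated parameters; a quick check (illustrative code, not from the article):

```python
import math

def normal_pdf(v, mean, variance):
    """Gaussian density used by Gaussian naive Bayes."""
    return math.exp(-(v - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# Note that a value above 1 is fine: this is a density, not a probability.
print(normal_pdf(6.0, 5.855, 3.5033e-2))   # ~1.5789
```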
38.5.2 Document classification

Here is a worked example of naive Bayesian classification applied to the document classification problem. Consider the problem of classifying documents by their content, for example into spam and non-spam e-mails. Imagine that documents are drawn from a number of classes of documents which can be modelled as sets of words where the (independent) probability that the i-th word of a given document occurs in a document from class C can be written as

p(w_i | C).

(For this treatment, we simplify things further by assuming that words are randomly distributed in the document - that is, words are not dependent on the length of the document, position within the document with relation to other words, or other document-context.)

Then the probability that a given document D contains all of the words w_i, given a class C, is

p(D | C) = \prod_i p(w_i | C).

The question that we desire to answer is: "what is the probability that a given document D belongs to a given class C?" In other words, what is p(C | D)?

Now by definition

p(D | C) = \frac{p(D \cap C)}{p(C)}

and

p(C | D) = \frac{p(D \cap C)}{p(D)}.

Bayes' theorem manipulates these into a statement of probability in terms of likelihood.

Assume for the moment that there are only two mutually exclusive classes, S and \neg S (e.g. spam and not spam), such that every document is in either one or the other, with

p(D | S) = \prod_i p(w_i | S)

and

p(D | \neg S) = \prod_i p(w_i | \neg S).

Using the Bayesian result above, we can write:

p(S | D) = \frac{p(S)}{p(D)} \prod_i p(w_i | S)

p(\neg S | D) = \frac{p(\neg S)}{p(D)} \prod_i p(w_i | \neg S)

Dividing one by the other gives:

\frac{p(S | D)}{p(\neg S | D)} = \frac{p(S) \prod_i p(w_i | S)}{p(\neg S) \prod_i p(w_i | \neg S)}

which can be re-factored as:

\frac{p(S | D)}{p(\neg S | D)} = \frac{p(S)}{p(\neg S)} \prod_i \frac{p(w_i | S)}{p(w_i | \neg S)}.

Thus, the probability ratio p(S | D) / p(\neg S | D) can be expressed in terms of a series of likelihood ratios. The actual probability p(S | D) can be easily computed from \log(p(S | D) / p(\neg S | D)) based on the observation that p(S | D) + p(\neg S | D) = 1.

Taking the logarithm of all these ratios, we have:

\ln \frac{p(S | D)}{p(\neg S | D)} = \ln \frac{p(S)}{p(\neg S)} + \sum_i \ln \frac{p(w_i | S)}{p(w_i | \neg S)}

(This technique of "log-likelihood ratios" is a common technique in statistics. In the case of two mutually exclusive alternatives (such as this example), the conversion of a log-likelihood ratio to a probability takes the form of a sigmoid curve: see logit for details.)

Finally, the document can be classified as follows. It is spam if p(S | D) > p(\neg S | D), i.e. \ln \frac{p(S | D)}{p(\neg S | D)} > 0; otherwise it is not spam.
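A tiny sketch of this decision rule (not from the article): the prior and the per-word probabilities are invented for illustration; a real classifier would estimate them from training mail.

```python
import math

p_spam = 0.4
p_word_given_spam = {"viagra": 0.01, "offer": 0.02, "meeting": 0.001}
p_word_given_ham = {"viagra": 0.0001, "offer": 0.005, "meeting": 0.01}

def log_odds_spam(words):
    """ln(p(S|D)/p(notS|D)) = ln(p(S)/p(notS)) + sum_i ln(p(w_i|S)/p(w_i|notS))."""
    total = math.log(p_spam / (1 - p_spam))
    for w in words:
        if w in p_word_given_spam:
            total += math.log(p_word_given_spam[w] / p_word_given_ham[w])
    return total

doc = ["viagra", "offer"]
print("spam" if log_odds_spam(doc) > 0 else "not spam")
```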
38.6 See also

Linear classifier
Logistic regression
Perceptron
Take-the-best heuristic

38.7 References

[1] Russell, Stuart; Norvig, Peter (2003) [1995]. Artificial Intelligence: A Modern Approach (2nd ed.). Prentice Hall. ISBN 978-0137903955.

[2] Rennie, J.; Shih, L.; Teevan, J.; Karger, D. (2003). Tackling the poor assumptions of Naive Bayes classifiers (PDF). ICML.

[3] Rish, Irina (2001). An empirical study of the naive Bayes classifier (PDF). IJCAI Workshop on Empirical Methods in AI.

[4] Hand, D. J.; Yu, K. (2001). "Idiot's Bayes - not so stupid after all?". International Statistical Review 69 (3): 385-399. doi:10.2307/1403452. ISSN 0306-7734.

[5] Zhang, Harry. The Optimality of Naive Bayes (PDF). FLAIRS 2004 conference.

[6] Caruana, R.; Niculescu-Mizil, A. (2006). An empirical comparison of supervised learning algorithms. Proc. 23rd International Conference on Machine Learning. CiteSeerX: 10.1.1.122.5901.

[7] Narasimha Murty, M.; Susheela Devi, V. (2011). Pattern Recognition: An Algorithmic Approach. ISBN 0857294946.

[8] John, George H.; Langley, Pat (1995). Estimating Continuous Distributions in Bayesian Classifiers. Proc. Eleventh Conf. on Uncertainty in Artificial Intelligence. Morgan Kaufmann. pp. 338-345.

[9] McCallum, Andrew; Nigam, Kamal (1998). A comparison of event models for Naive Bayes text classification (PDF). AAAI-98 workshop on learning for text categorization 752.

[10] Metsis, Vangelis; Androutsopoulos, Ion; Paliouras, Georgios (2006). Spam filtering with Naive Bayes - which Naive Bayes?. Third conference on email and anti-spam (CEAS) 17.

[11] Nigam, Kamal; McCallum, Andrew; Thrun, Sebastian; Mitchell, Tom (2000). Learning to classify text from labeled and unlabeled documents using EM (PDF). Machine Learning.

38.7.1 Further reading

Domingos, Pedro; Pazzani, Michael (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning 29: 103-137.

Webb, G. I.; Boughton, J.; Wang, Z. (2005). Not So Naive Bayes: Aggregating One-Dependence Estimators. Machine Learning (Springer) 58 (1): 5-24. doi:10.1007/s10994-005-4258-6.

Mozina, M.; Demsar, J.; Kattan, M.; Zupan, B. (2004). Nomograms for Visualization of Naive Bayesian Classifier (PDF). Proc. PKDD-2004. pp. 337-348.

Maron, M. E. (1961). Automatic Indexing: An Experimental Inquiry. JACM 8 (3): 404-417. doi:10.1145/321075.321084.

Minsky, M. (1961). Steps toward Artificial Intelligence. Proc. IRE 49 (1). pp. 8-30.

38.8 External links

Book Chapter: Naive Bayes text classification, Introduction to Information Retrieval
Naive Bayes for Text Classification with Unbalanced Classes
Software
Cross-validation (statistics)
Cross-validation, sometimes called rotation estimation,[1][2][3] is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. In a prediction problem, a model is usually given a dataset of known data on which training is run (training dataset), and a dataset of unknown data (or first seen data) against which the model is tested (testing dataset).[4] The goal of cross-validation is to define a dataset to "test" the model in the training phase (i.e., the validation dataset), in order to limit problems like overfitting, give an insight on how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem), etc.

One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.

Cross-validation is important in guarding against testing hypotheses suggested by the data (called "Type III errors"[5]), especially where further samples are hazardous, costly or impossible to collect.

Furthermore, one of the main reasons for using cross-validation instead of using the conventional validation (e.g. partitioning the data set into two sets of 70% for training and 30% for test) is that the error (e.g. Root Mean Square Error) on the training set in the conventional validation is not a useful estimator of model performance, and thus the error on the test data set does not properly represent the assessment of model performance. This may be due to the fact that there is not enough data available or there is not a good distribution and spread of data to partition it into separate training and test sets in the conventional validation method. In these cases, a fair way to properly estimate model prediction performance is to use cross-validation as a powerful general technique.[6]

In summary, cross-validation combines (averages) measures of fit (prediction error) to correct for the optimistic nature of training error and derive a more accurate estimate of model prediction performance.[6]

39.1 Purpose of cross-validation

Suppose we have a model with one or more unknown parameters, and a data set to which the model can be fit (the training data set). The fitting process optimizes the model parameters to make the model fit the training data as well as possible. If we then take an independent sample of validation data from the same population as the training data, it will generally turn out that the model does not fit the validation data as well as it fits the training data. This is called overfitting, and is particularly likely to happen when the size of the training data set is small, or when the number of parameters in the model is large. Cross-validation is a way to predict the fit of a model to a hypothetical validation set when an explicit validation set is not available.

Linear regression provides a simple illustration of overfitting. In linear regression we have real response values y_1, \ldots, y_n, and n p-dimensional vector covariates x_1, \ldots, x_n. The components of the vector x_i are denoted x_{i1}, \ldots, x_{ip}. If we use least squares to fit a function in the form of a hyperplane \hat{y} = a + \beta^T x to the data (x_i, y_i)_{1 \le i \le n}, we could then assess the fit using the mean squared error (MSE). The MSE for a given value of the parameters a and \beta on the training set (x_i, y_i)_{1 \le i \le n} is

\frac{1}{n} \sum_{i=1}^{n} (y_i - a - \beta^T x_i)^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - a - \beta_1 x_{i1} - \cdots - \beta_p x_{ip})^2

It can be shown under mild assumptions that the expected value of the MSE for the training set is (n - p - 1)/(n + p + 1) < 1 times the expected value of the MSE for the validation set (the expected value is taken over the distribution of training sets). Thus if we fit the model and compute the MSE on the training set, we will get an optimistically biased assessment of how well the model will fit an independent data set. This biased estimate is called the in-sample estimate of the fit, whereas the cross-validation estimate is an out-of-sample estimate.
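A small numerical illustration of this optimism (not from the article): repeatedly partition a data set, fit the least-squares hyperplane on the training part, and compare the average training MSE with the average validation MSE. The data-generating process and the 80/20 splits are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 60, 10
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.normal(scale=2.0, size=n)

train_mse, val_mse = [], []
for _ in range(50):                        # several rounds with different partitions
    idx = rng.permutation(n)
    tr, va = idx[:48], idx[48:]
    A = np.c_[np.ones(len(tr)), X[tr]]     # least-squares fit of y = a + beta^T x
    coef, *_ = np.linalg.lstsq(A, y[tr], rcond=None)

    def mse(rows):
        pred = np.c_[np.ones(len(rows)), X[rows]] @ coef
        return np.mean((y[rows] - pred) ** 2)

    train_mse.append(mse(tr))
    val_mse.append(mse(va))

print("mean training MSE:  ", np.mean(train_mse))
print("mean validation MSE:", np.mean(val_mse))   # typically noticeably larger
```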
39.2 Common types of cross-validation

... to form a validation or testing set, and the remaining observations are retained as the training data. Normally, less than a third of the initial sample is used for validation data.[12]

39.8 Limitations and misuse

Cross-validation only yields meaningful results if the validation set and training set are drawn from the same population and only if human biases are controlled.

In many applications of predictive modeling, the structure of the system being studied evolves over time. Both of these can introduce systematic differences between the training and validation sets. For example, if a model for predicting stock values is trained on data for a certain five-year period, it is unrealistic to treat the subsequent five-year period as a draw from the same population. As another example, suppose a model is developed to predict an individual's risk for being diagnosed with a particular disease within the next year. If the model is trained using data from a study involving only a specific population group (e.g. young people or males), but is then applied to the general population, the cross-validation results from the training set could differ greatly from the actual predictive performance.

In many applications, models also may be incorrectly specified and vary as a function of modeler biases and/or arbitrary choices. When this occurs, there may be an illusion that the system changes in external samples, whereas the reason is that the model has missed a critical predictor and/or included a confounded predictor. New evidence is that cross-validation by itself is not very predictive of external validity, whereas a form of experimental validation known as swap sampling that does control for human bias can be much more predictive of external validity.[13] As defined by this large MAQC-II study across 30,000 models, swap sampling incorporates cross-validation in the sense that predictions are tested across independent training and validation samples. Yet, models are also developed across these independent samples and by modelers who are blinded to one another. When there is a mismatch in these models developed across these swapped training and validation samples, as happens quite frequently, MAQC-II shows that this will be much more predictive of poor external predictive validity than traditional cross-validation.

The reason for the success of the swapped sampling is a built-in control for human biases in model building. In addition to placing too much faith in predictions that may vary across modelers and lead to poor external validity due to these confounding modeler effects, these are some other ways that cross-validation can be misused:

... if feature selection or model tuning is required by the modeling procedure, this must be repeated on every training set. Otherwise, predictions will certainly be upwardly biased.[14] If cross-validation is used to decide which features to use, an inner cross-validation to carry out the feature selection on every training set must be performed.[15]

By allowing some of the training data to also be included in the test set: this can happen due to "twinning" in the data set, whereby some exactly identical or nearly identical samples are present in the data set. Note that to some extent twinning always takes place even in perfectly independent training and validation samples. This is because some of the training sample observations will have nearly identical values of predictors as validation sample observations. And some of these will correlate with a target at better than chance levels in the same direction in both training and validation when they are actually driven by confounded predictors with poor external validity. If such a cross-validated model is selected from a k-fold set, human confirmation bias will be at work and determine that such a model has been "validated". This is why traditional cross-validation needs to be supplemented with controls for human bias and confounded model specification like swap sampling and prospective studies.

It should be noted that some statisticians have questioned the usefulness of validation samples.[16]

39.9 See also

Boosting (machine learning)
Bootstrap aggregating (bagging)
Bootstrapping (statistics)
Resampling (statistics)

39.10 Notes and references

[1] Geisser, Seymour (1993). Predictive Inference. New York, NY: Chapman and Hall. ISBN 0-412-03471-9.

[2] Kohavi, Ron (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (San Mateo, CA: Morgan Kaufmann) 2 (12): 1137-1143. CiteSeerX: 10.1.1.48.529.
Unsupervised learning
Cluster analysis

For the supervised learning approach, see Statistical classification.

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters).

Figure: The result of a cluster analysis shown as the coloring of the squares into three clusters.

It will often be necessary to modify data preprocessing and model parameters until the result achieves the desired properties.

Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology (from Greek βότρυς "grape") and typological analysis. The subtle differences are often in the usage of the results: while in data mining, the resulting groups are the matter of interest, in automatic classification the resulting discriminative power is of interest. This often leads to misunderstandings between researchers coming from the fields of data mining and machine learning, since they use the same terms and often the same algorithms, but have different goals.

Cluster analysis was originated in anthropology by Driver and Kroeber in 1932 and introduced to psychology by Zubin in 1938 and Robert Tryon in 1939[1][2] and famously used by Cattell beginning in 1943[3] for trait theory classification in personality psychology.
Density models: for example, DBSCAN and OPTICS define clusters as connected dense regions in the data space.

Subspace models: in biclustering (also known as co-clustering or two-mode-clustering), clusters are modeled with both cluster members and relevant attributes.

Group models: some algorithms do not provide a refined model for their results and just provide the grouping information.

Graph-based models: a clique, i.e., a subset of nodes in a graph such that every two nodes in the subset are connected by an edge, can be considered as a prototypical form of cluster. Relaxations of the complete connectivity requirement (a fraction of the edges can be missing) are known as quasi-cliques, as in the HCS clustering algorithm.

A "clustering" is essentially a set of such clusters, usually containing all objects in the data set. Additionally, it may specify the relationship of the clusters to each other, for example a hierarchy of clusters embedded in each other. Clusterings can be roughly distinguished as:

hard clustering: each object belongs to a cluster or not

soft clustering (also: fuzzy clustering): each object belongs to each cluster to a certain degree (e.g. a likelihood of belonging to the cluster)

There are also finer distinctions possible, for example:

strict partitioning clustering: here each object belongs to exactly one cluster

strict partitioning clustering with outliers: objects can also belong to no cluster, and are considered outliers

overlapping clustering (also: alternative clustering, multi-view clustering): while usually a hard clustering, objects may belong to more than one cluster

hierarchical clustering: objects that belong to a child cluster also belong to the parent cluster

subspace clustering: while an overlapping clustering, within a uniquely defined subspace, clusters are not expected to overlap

41.2 Algorithms

Main category: Data clustering algorithms

Clustering algorithms can be categorized based on their cluster model, as listed above. The following overview will only list the most prominent examples of clustering algorithms, as there are possibly over 100 published clustering algorithms. Not all provide models for their clusters and can thus not easily be categorized. An overview of algorithms explained in Wikipedia can be found in the list of statistics algorithms.

There is no objectively "correct" clustering algorithm, but as it was noted, "clustering is in the eye of the beholder."[4] The most appropriate clustering algorithm for a particular problem often needs to be chosen experimentally, unless there is a mathematical reason to prefer one cluster model over another. It should be noted that an algorithm that is designed for one kind of model has no chance on a data set that contains a radically different kind of model.[4] For example, k-means cannot find non-convex clusters.[4]

41.2.1 Connectivity based clustering (hierarchical clustering)

Main article: Hierarchical clustering

Connectivity based clustering, also known as hierarchical clustering, is based on the core idea of objects being more related to nearby objects than to objects farther away. These algorithms connect "objects" to form "clusters" based on their distance. A cluster can be described largely by the maximum distance needed to connect parts of the cluster. At different distances, different clusters will form, which can be represented using a dendrogram, which explains where the common name "hierarchical clustering" comes from: these algorithms do not provide a single partitioning of the data set, but instead provide an extensive hierarchy of clusters that merge with each other at certain distances. In a dendrogram, the y-axis marks the distance at which the clusters merge, while the objects are placed along the x-axis such that the clusters don't mix.

Connectivity based clustering is a whole family of methods that differ by the way distances are computed. Apart from the usual choice of distance functions, the user also needs to decide on the linkage criterion (since a cluster consists of multiple objects, there are multiple candidates to compute the distance to) to use. Popular choices are known as single-linkage clustering (the minimum of object distances), complete linkage clustering (the maximum of object distances) or UPGMA (Unweighted Pair Group Method with Arithmetic Mean, also known as average linkage clustering). Furthermore, hierarchical clustering can be agglomerative (starting with single elements and aggregating them into clusters) or divisive (starting with the complete data set and dividing it into partitions).

These methods will not produce a unique partitioning of the data set, but a hierarchy from which the user still needs to choose appropriate clusters. They are not very robust towards outliers, which will either show up as additional clusters or even cause other clusters to merge (known as the "chaining phenomenon", in particular with single-linkage clustering). In the general case, the complexity is O(n^3), which makes them too slow for large data sets. For some special cases, optimal efficient methods (of complexity O(n^2)) are known: SLINK[5] for single-linkage and CLINK[6] for complete-linkage clustering. In the data mining community these methods are recognized as a theoretical foundation of cluster analysis, but often considered obsolete. They did however provide inspiration for many later methods such as density based clustering.
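A short sketch of agglomerative clustering with the linkage criteria just mentioned, using SciPy (assuming it is available); the data and the distance cut-off are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(20, 2)),
               rng.normal(3, 0.3, size=(20, 2))])

# 'single' = minimum of object distances, 'complete' = maximum, 'average' = UPGMA.
Z = linkage(X, method="single")

# The hierarchy encoded in Z must still be cut by the user: here, merge
# everything closer than a chosen distance threshold into one cluster.
labels = fcluster(Z, t=1.0, criterion="distance")
print(labels)
```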
41.2.2 Centroid-based clustering

... object to the nearest centroid. This often leads to incorrectly cut borders in between of clusters (which is not surprising, as the algorithm optimized cluster centers, not cluster borders).

K-means has a number of interesting theoretical properties. On the one hand, it partitions the data space into a structure known as a Voronoi diagram. On the other hand, it is conceptually close to nearest neighbor classification, and as such is popular in machine learning. Third, it can be seen as a variation of model based classification, and Lloyd's algorithm as a variation of the Expectation-maximization algorithm for this model discussed below.
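A compact sketch of Lloyd's algorithm mentioned above (not from the article): alternate between assigning each object to its nearest centroid (which induces the Voronoi partition) and moving each centroid to the mean of its assigned objects. The data, the value of k and all names are illustrative; empty clusters are not handled.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every object.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned objects.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.4, size=(50, 2)) for m in ((0, 0), (3, 0), (0, 3))])
centers, labels = lloyd_kmeans(X, k=3)
print(centers)
```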
On Gaussian-distributed data, EM works well, since it uses Gaussians for modelling clusters.

Density-based clusters cannot be modeled using Gaussian distributions.

41.2.4 Density-based clustering

In density-based clustering, clusters are defined as areas of higher density than the remainder of the data set. Objects in these sparse areas, that are required to separate clusters, are usually considered to be noise and border points.

The most popular[9] density based clustering method is DBSCAN.[10] In contrast to many newer methods, it features a well-defined cluster model called "density-reachability". Similar to linkage based clustering, it is based on connecting points within certain distance thresholds. However, it only connects points that satisfy a density criterion, in the original variant defined as a minimum number of other objects within this radius. A cluster consists of all density-connected objects (which can form a cluster of an arbitrary shape, in contrast to many other methods) plus all objects that are within these objects' range. Another interesting property of DBSCAN is that its complexity is fairly low (it requires a linear number of range queries on the database) and that it will discover essentially the same results (it is deterministic for core and noise points, but not for border points) in each run, therefore there is no need to run it multiple times. OPTICS[11] is a generalization of DBSCAN that removes the need to choose an appropriate value for the range parameter ε, and produces a hierarchical result related to that of linkage clustering. DeLi-Clu,[12] Density-Link-Clustering, combines ideas from single-linkage clustering and OPTICS, eliminating the ε parameter entirely and offering performance improvements over OPTICS by using an R-tree index.

The key drawback of DBSCAN and OPTICS is that they expect some kind of density drop to detect cluster borders. Moreover, they cannot detect intrinsic cluster structures which are prevalent in the majority of real life data. A variation of DBSCAN, EnDBSCAN,[13] efficiently detects such kinds of structures. On data sets with, for example, overlapping Gaussian distributions (a common use case in artificial data), the cluster borders produced by these algorithms will often look arbitrary, because the cluster density decreases continuously. On a data set consisting of mixtures of Gaussians, these algorithms are nearly always outperformed by methods such as EM clustering that are able to precisely model this kind of data.

Mean-shift is a clustering approach where each object is moved to the densest area in its vicinity, based on kernel density estimation. Eventually, objects converge to local maxima of density. Similar to k-means clustering, these "density attractors" can serve as representatives for the data set, but mean-shift can detect arbitrary-shaped clusters similar to DBSCAN. Due to the expensive iterative procedure and density estimation, mean-shift is usually slower than DBSCAN or k-Means.

Density-based clustering examples:
Density-based clustering with DBSCAN.
DBSCAN assumes clusters of similar density, and may have problems separating nearby clusters.[8]
OPTICS is a DBSCAN variant that handles different densities much better.

41.2.5 Recent developments

In recent years considerable effort has been put into improving the performance of existing algorithms.[14][15] Among them are CLARANS (Ng and Han, 1994)[16] and BIRCH (Zhang et al., 1996).[17] With the recent need to process larger and larger data sets (also known as big data), the willingness to trade semantic meaning of the generated clusters for performance has been increasing. This led to the development of pre-clustering methods such as canopy clustering, which can process huge data sets efficiently, but the resulting "clusters" are merely a rough pre-partitioning of the data set to then analyze the partitions with existing slower methods such as k-means clustering. Various other approaches to clustering have been tried such as seed based clustering.[18]

For high-dimensional data, many of the existing methods fail due to the curse of dimensionality, which renders particular distance functions problematic in high-dimensional spaces. This led to new clustering algorithms for high-dimensional data that focus on subspace clustering (where only some attributes are used, and cluster models include the relevant attributes for the cluster) and correlation clustering that also looks for arbitrary rotated ("correlated") subspace clusters that can be modeled by giving a correlation of their attributes. Examples for such clustering algorithms are CLIQUE[19] and SUBCLU.[20]

Ideas from density-based clustering methods (in particular the DBSCAN/OPTICS family of algorithms) have been adopted to subspace clustering (HiSC,[21] hierarchical subspace clustering, and DiSH[22]) and correlation clustering (HiCO,[23] hierarchical correlation clustering, 4C[24] using "correlation connectivity" and ERiC[25] exploring hierarchical density-based correlation clusters). Several different clustering systems based on mutual information have been proposed. One is Marina Meilă's variation of information metric;[26] another provides hierarchical clustering.[27] Using genetic algorithms, a wide range of different fit functions can be optimized, including mutual information.[28] Also message passing algorithms, a recent development in Computer Science and Statistical Physics, has led to the creation of new types of clustering algorithms.[29]
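A minimal sketch of the density-reachability idea behind DBSCAN described in the density-based clustering section above: points with at least min_pts neighbours within eps are core points, and clusters are grown by connecting density-reachable points. This is an illustrative simplification, not a reference implementation; parameters and data are made up.

```python
import numpy as np

def dbscan(X, eps=0.5, min_pts=5):
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbours = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)              # -1 = noise or not yet assigned
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbours[i]) < min_pts:
            continue                      # already assigned, or not a core point
        # Grow a new cluster from this core point by breadth-first expansion.
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbours[j]) >= min_pts:    # core point: keep expanding
                    queue.extend(neighbours[j])
        cluster += 1
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, size=(40, 2)),
               rng.normal(2, 0.2, size=(40, 2)),
               rng.uniform(-1, 3, size=(10, 2))])    # some scattered noise
print(dbscan(X, eps=0.3, min_pts=4))
```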
41.3 Evaluation and assessment

... how close the clustering is to the predetermined benchmark classes. However, it has recently been discussed whether this is adequate for real data, or only on synthetic data sets with a factual ground truth, since classes can contain internal structure, the attributes present may not allow separation of clusters or the classes may contain anomalies.[32] Additionally, from a knowledge discovery point of view, the reproduction of known knowledge may not necessarily be the intended result.[32]

A number of measures are adapted from variants used to evaluate classification tasks. In place of counting the number of times a class was correctly assigned to a single data point (known as true positives), such pair counting metrics assess whether each pair of data points that is truly in the same cluster is predicted to be in the same cluster.

Some of the measures of quality of a cluster algorithm using external criterion include:

F-measure

... when β = 0, and increasing β allocates an increasing amount of weight to recall in the final F-measure.

Jaccard index

The Jaccard index is used to quantify the similarity between two datasets. The Jaccard index takes on a value between 0 and 1. An index of 1 means that the two datasets are identical, and an index of 0 indicates that the datasets have no common elements. The Jaccard index is defined by the following formula:

J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{TP}{TP + FP + FN}

This is simply the number of unique elements common to both sets divided by the total number of unique elements in both sets.
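A sketch of the pair-counting form of the Jaccard index defined above (not from the article): TP, FP and FN count pairs of points that are placed together in the predicted clustering and/or in the benchmark classes. The two label vectors are illustrative.

```python
from itertools import combinations

def pair_counting_jaccard(labels_true, labels_pred):
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        tp += same_true and same_pred        # pair together in both
        fp += same_pred and not same_true    # together in prediction only
        fn += same_true and not same_pred    # together in benchmark only
    return tp / (tp + fp + fn)

print(pair_counting_jaccard([0, 0, 1, 1, 1], [0, 0, 0, 1, 1]))
```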
41.4 Applications

41.5 See also

Clustering high-dimensional data
Conceptual clustering
Consensus clustering
Constrained clustering
Data stream clustering
Neighbourhood components analysis
Latent class analysis
Curse of dimensionality
Determining the number of clusters in a data set
Parallel coordinates
Structured data analysis

41.6 References

[1] Bailey, Ken (1994). Numerical Taxonomy and Cluster Analysis. Typologies and Taxonomies. p. 34. ISBN 9780803952591.

[2] Tryon, Robert C. (1939). Cluster Analysis: Correlation Profile and Orthometric (factor) Analysis for the Isolation of Unities in Mind and Personality. Edwards Brothers.

[3] Cattell, R. B. (1943). The description of personality: Basic traits resolved into clusters. Journal of Abnormal and Social Psychology 38: 476-506. doi:10.1037/h0054116.

[4] Estivill-Castro, Vladimir (20 June 2002). Why so many clustering algorithms: A Position Paper. ACM SIGKDD Explorations Newsletter 4 (1): 65-75. doi:10.1145/568574.568575.

[5] Sibson, R. (1973). SLINK: an optimally efficient algorithm for the single-link cluster method (PDF). The Computer Journal (British Computer Society) 16 (1): 30-34. doi:10.1093/comjnl/16.1.30.

[9] Microsoft academic search: most cited data mining articles: DBSCAN is on rank 24, when accessed on: 4/18/2010.

[12] Achtert, E.; Böhm, C.; Kröger, P. (2006). DeLi-Clu: Boosting Robustness, Completeness, Usability, and Efficiency of Hierarchical Clustering by a Closest Pair Ranking. LNCS: Advances in Knowledge Discovery and Data Mining. Lecture Notes in Computer Science 3918: 119-128. doi:10.1007/11731139_16. ISBN 978-3-540-33206-0.

[14] Sculley, D. (2010). Web-scale k-means clustering. Proc. 19th WWW.

[15] Huang, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery 2: 283-304.

[16] R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. In: Proceedings of the 20th VLDB Conference, pages 144-155, Santiago, Chile, 1994.
41.7. EXTERNAL LINKS 281
[17] Zhang, Tian; Ramakrishnan, Raghu; Livny, Miron. "An Efficient Data Clustering Method for Very Large Databases". In: Proc. Int'l Conf. on Management of Data, ACM SIGMOD, pp. 103–114.
[18] Can, F.; Ozkarahan, E. A. (1990). "Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases". ACM Transactions on Database Systems 15 (4): 483. doi:10.1145/99935.99938.
[19] Agrawal, R.; Gehrke, J.; Gunopulos, D.; Raghavan, P. (2005). "Automatic Subspace Clustering of High Dimensional Data". Data Mining and Knowledge Discovery 11: 5. doi:10.1007/s10618-005-1396-1.
[20] Kailing, Karin; Kriegel, Hans-Peter; Kröger, Peer. "Density-Connected Subspace Clustering for High-Dimensional Data". In: Proc. SIAM Int. Conf. on Data Mining (SDM'04), pp. 246–257, 2004.
[21] Achtert, E.; Böhm, C.; Kriegel, H. P.; Kröger, P.; Müller-Gorman, I.; Zimek, A. (2006). "Finding Hierarchies of Subspace Clusters". LNCS: Knowledge Discovery in Databases: PKDD 2006. Lecture Notes in Computer Science 4213: 446–453. doi:10.1007/11871637_42. ISBN 978-3-540-45374-1.
[22] Achtert, E.; Böhm, C.; Kriegel, H. P.; Kröger, P.; Müller-Gorman, I.; Zimek, A. (2007). "Detection and Visualization of Subspace Cluster Hierarchies". LNCS: Advances in Databases: Concepts, Systems and Applications. Lecture Notes in Computer Science 4443: 152–163. doi:10.1007/978-3-540-71703-4_15. ISBN 978-3-540-71702-7.
[23] Achtert, E.; Böhm, C.; Kröger, P.; Zimek, A. (2006). "Mining Hierarchies of Correlation Clusters". Proc. 18th International Conference on Scientific and Statistical Database Management (SSDBM): 119–128. doi:10.1109/SSDBM.2006.35. ISBN 0-7695-2590-3.
[24] Böhm, C.; Kailing, K.; Kröger, P.; Zimek, A. (2004). "Computing Clusters of Correlation Connected objects". Proceedings of the 2004 ACM SIGMOD international conference on Management of data - SIGMOD '04. p. 455. doi:10.1145/1007568.1007620. ISBN 1581138598.
[25] Achtert, E.; Böhm, C.; Kriegel, H. P.; Kröger, P.; Zimek, A. (2007). "On Exploring Complex Relationships of Correlation Clusters". 19th International Conference on Scientific and Statistical Database Management (SSDBM 2007). p. 7. doi:10.1109/SSDBM.2007.21. ISBN 0-7695-2868-6.
[26] Meilă, Marina (2003). "Comparing Clusterings by the Variation of Information". Learning Theory and Kernel Machines. Lecture Notes in Computer Science 2777: 173–187. doi:10.1007/978-3-540-45167-9_14. ISBN 978-3-540-40720-1.
[27] Kraskov, Alexander; Stögbauer, Harald; Andrzejak, Ralph G.; Grassberger, Peter (1 December 2003) [28 November 2003]. "Hierarchical Clustering Based on Mutual Information". arXiv:q-bio/0311039.
[28] Auffarth, B. (July 18–23, 2010). "Clustering by a Genetic Algorithm with Biased Mutation Operator". WCCI CEC (IEEE). CiteSeerX: 10.1.1.170.869.
[29] Frey, B. J.; Dueck, D. (2007). "Clustering by Passing Messages Between Data Points". Science 315 (5814): 972–976. doi:10.1126/science.1136800. PMID 17218491.
[30] Manning, Christopher D.; Raghavan, Prabhakar; Schütze, Hinrich. Introduction to Information Retrieval. Cambridge University Press. ISBN 978-0-521-86571-5.
[31] Dunn, J. (1974). "Well separated clusters and optimal fuzzy partitions". Journal of Cybernetics 4: 95–104. doi:10.1080/01969727408546059.
[32] Färber, Ines; Günnemann, Stephan; Kriegel, Hans-Peter; Kröger, Peer; Müller, Emmanuel; Schubert, Erich; Seidl, Thomas; Zimek, Arthur (2010). "On Using Class-Labels in Evaluation of Clusterings" (PDF). In Fern, Xiaoli Z.; Davidson, Ian; Dy, Jennifer. MultiClust: Discovering, Summarizing, and Using Multiple Clusterings. ACM SIGKDD.
[33] Rand, W. M. (1971). "Objective criteria for the evaluation of clustering methods". Journal of the American Statistical Association (American Statistical Association) 66 (336): 846–850. doi:10.2307/2284239. JSTOR 2284239.
[34] Fowlkes, E. B.; Mallows, C. L. (1983). "A Method for Comparing Two Hierarchical Clusterings". Journal of the American Statistical Association 78: 553–569.
[35] Hubert, L.; Arabie, P. (1985). "Comparing partitions". Journal of Classification 2 (1).
[36] Wallace, D. L. (1983). "Comment". Journal of the American Statistical Association 78: 569–579.
[37] Bewley, A. et al. (2011). "Real-time volume estimation of a dragline payload". IEEE International Conference on Robotics and Automation: 1571–1576.
[38] Basak, S. C.; Magnuson, V. R.; Niemi, C. J.; Regal, R. R. (1988). "Determining Structural Similarity of Chemicals Using Graph Theoretic Indices". Discr. Appl. Math. 19: 17–44.
[39] Huth, R. et al. (2008). "Classifications of Atmospheric Circulation Patterns: Recent Advances and Applications". Ann. N.Y. Acad. Sci. 1146: 105–152.

41.7 External links

Data Mining at DMOZ
Chapter 42
Expectation–maximization algorithm
In statistics, an expectation–maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step.

42.1 History

A detailed treatment of the EM method for exponential families was published by Rolf Sundberg in his thesis and several papers[2][3][4] following his collaboration with Per Martin-Löf and Anders Martin-Löf.[5][6][7][8][9][10][11] The Dempster–Laird–Rubin paper in 1977 generalized the method and sketched a convergence analysis for a wider class of problems. Regardless of earlier inventions, the innovative Dempster–Laird–Rubin paper in the Journal of the Royal Statistical Society received an enthusiastic discussion at the Royal Statistical Society meeting, with Sundberg calling the paper "brilliant". The Dempster–Laird–Rubin paper established the EM method as an important tool of statistical analysis.
The convergence analysis of the Dempster–Laird–Rubin paper was flawed and a correct convergence analysis was published by C. F. Jeff Wu in 1983.[12] Wu's proof established the EM method's convergence outside of the exponential family, as claimed by Dempster–Laird–Rubin.[13]
42.2 Introduction
variables and vice versa, but substituting one set of equations into the other produces an unsolvable equation.
The EM algorithm proceeds from the observation that the following is a way to solve these two sets of equations numerically. One can simply pick arbitrary values for one of the two sets of unknowns, use them to estimate the second set, then use these new values to find a better estimate of the first set, and then keep alternating between the two until the resulting values both converge to fixed points. It is not obvious that this will work at all, but it can be proven that in this particular context it does, and that the derivative of the likelihood is (arbitrarily close to) zero at that point, which in turn means that the point is either a maximum or a saddle point. In general there may be multiple maxima, and there is no guarantee that the global maximum will be found. Some likelihoods also have singularities in them, i.e. nonsensical maxima. For example, one of the solutions that may be found by EM in a mixture model involves setting one of the components to have zero variance and the mean parameter for the same component to be equal to one of the data points.

42.3 Description

Given a statistical model which generates a set X of observed data, a set of unobserved latent data or missing values Z, and a vector of unknown parameters θ, along with a likelihood function L(θ; X, Z) = p(X, Z | θ), the maximum likelihood estimate (MLE) of the unknown parameters is determined by the marginal likelihood of the observed data

L(θ; X) = p(X | θ) = Σ_Z p(X, Z | θ)

However, this quantity is often intractable (e.g. if Z is a sequence of events, so that the number of values grows exponentially with the sequence length, making the exact calculation of the sum extremely difficult).
The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying the following two steps:

Expectation step (E step): Calculate the expected value of the log likelihood function, with respect to the conditional distribution of Z given X under the current estimate of the parameters θ^(t):

Q(θ | θ^(t)) = E_{Z|X,θ^(t)} [ log L(θ; X, Z) ]

Maximization step (M step): Find the parameter that maximizes this quantity:

θ^(t+1) = arg max_θ Q(θ | θ^(t))

Note that in typical models to which EM is applied:

1. The observed data points X may be discrete (taking values in a finite or countably infinite set) or continuous (taking values in an uncountably infinite set). There may in fact be a vector of observations associated with each data point.
2. The missing values (aka latent variables) Z are discrete, drawn from a fixed number of values, and there is one latent variable per observed data point.
3. The parameters are continuous, and are of two kinds: parameters that are associated with all data points, and parameters associated with a particular value of a latent variable (i.e. associated with all data points whose corresponding latent variable has that particular value).

However, it is possible to apply EM to other sorts of models.
The motivation is as follows. If we know the value of the parameters θ, we can usually find the value of the latent variables Z by maximizing the log-likelihood over all possible values of Z, either simply by iterating over Z or through an algorithm such as the Viterbi algorithm for hidden Markov models. Conversely, if we know the value of the latent variables Z, we can find an estimate of the parameters θ fairly easily, typically by simply grouping the observed data points according to the value of the associated latent variable and averaging the values, or some function of the values, of the points in each group. This suggests an iterative algorithm, in the case where both θ and Z are unknown:

1. First, initialize the parameters θ to some random values.
2. Compute the best value for Z given these parameter values.
3. Then, use the just-computed values of Z to compute a better estimate for the parameters θ. Parameters associated with a particular value of Z will use only those data points whose associated latent variable has that value.
4. Iterate steps 2 and 3 until convergence.

The algorithm as just described monotonically approaches a local minimum of the cost function, and is commonly called hard EM. The k-means algorithm is an example of this class of algorithms.
However, one can do somewhat better: rather than making a hard choice for Z given the current parameter values and averaging only over the set of data points associated with a particular value of Z, one can instead determine the probability of each possible value of Z for
each data point, and then use the probabilities associated with a particular value of Z to compute a weighted average over the entire set of data points. The resulting algorithm is commonly called soft EM, and is the type of algorithm normally associated with EM. The counts used to compute these weighted averages are called soft counts (as opposed to the hard counts used in a hard-EM-type algorithm such as k-means). The probabilities computed for Z are posterior probabilities and are what is computed in the E step. The soft counts used to compute new parameter values are what is computed in the M step.

42.4 Properties

Speaking of an expectation (E) step is a bit of a misnomer. What is calculated in the first step are the fixed, data-dependent parameters of the function Q. Once the parameters of Q are known, it is fully determined and is maximized in the second (M) step of an EM algorithm.
Although an EM iteration does increase the observed data (i.e. marginal) likelihood function, there is no guarantee that the sequence converges to a maximum likelihood estimator. For multimodal distributions, this means that an EM algorithm may converge to a local maximum of the observed data likelihood function, depending on starting values. There are a variety of heuristic or metaheuristic approaches for escaping a local maximum, such as random restart (starting with several different random initial estimates

42.5 Proof of correctness

Expectation-maximization works to improve Q(θ | θ^(t)) rather than directly improving log p(X | θ). Here we show that improvements to the former imply improvements to the latter.[14]
For any Z with non-zero probability p(Z | X, θ), we can write

log p(X | θ) = log p(X, Z | θ) − log p(Z | X, θ).

We take the expectation over values of Z by multiplying both sides by p(Z | X, θ^(t)) and summing (or integrating) over Z. The left-hand side is the expectation of a constant, so we get:

log p(X | θ) = Σ_Z p(Z | X, θ^(t)) log p(X, Z | θ) − Σ_Z p(Z | X, θ^(t)) log p(Z | X, θ)
             = Q(θ | θ^(t)) + H(θ | θ^(t)),

where H(θ | θ^(t)) is defined by the negated sum it is replacing. This last equation holds for any value of θ including θ = θ^(t),

log p(X | θ^(t)) = Q(θ^(t) | θ^(t)) + H(θ^(t) | θ^(t)),

and subtracting this last equation from the previous equation gives

log p(X | θ) − log p(X | θ^(t)) = Q(θ | θ^(t)) − Q(θ^(t) | θ^(t)) + H(θ | θ^(t)) − H(θ^(t) | θ^(t))

F(q, θ) = E_q[ log L(θ; x, Z) ] + H(q) = −D_KL( q || p_{Z|X}(· | x; θ) ) + log L(θ; x)

where q is an arbitrary probability distribution over the unobserved data z, p_{Z|X}(· | x; θ) is the conditional distribution of the unobserved data given the observed data x, H is the entropy and D_KL is the Kullback–Leibler divergence.
Then the steps in the EM algorithm may be viewed as:
Expectation step: Choose q to maximize F:

q^(t) = arg max_q F(q, θ^(t))

Maximization step: Choose θ to maximize F:

θ^(t+1) = arg max_θ F(q^(t), θ)

where x̂_k are scalar output estimates calculated by a filter or a smoother from N scalar measurements z_k. Similarly, for a first-order auto-regressive process, an updated process noise variance estimate can be calculated by

42.7 Applications

For more details on this topic, see Information geometry.
Kullback–Leibler divergence can also be understood in these terms.

42.12 Examples

42.12.1 Gaussian mixture

Let x = (x_1, x_2, …, x_n) be a sample of n independent observations from a mixture of two multivariate normal distributions of dimension d, and let z = (z_1, z_2, …, z_n)
where the incomplete-data likelihood function is
and the complete-data likelihood function is

L(θ; x, z) = P(x, z | θ) = Π_{i=1}^{n} Π_{j=1}^{2} I(z_i = j) f(x_i; μ_j, Σ_j) τ_j

or

L(θ; x, z) = exp{ Σ_{i=1}^{n} Σ_{j=1}^{2} I(z_i = j) [ log τ_j − (1/2) log |Σ_j| − (1/2) (x_i − μ_j)^T Σ_j^{−1} (x_i − μ_j) − (d/2) log(2π) ] }
where I is an indicator function and f is the probability density function of a multivariate normal.
To see the last equality, note that for each i all indicators I(z_i = j) are equal to zero, except for one which is equal to one. The inner sum thus reduces to a single term.

E step

Given our current estimate of the parameters θ^(t), the conditional distribution of the Z_i is determined by Bayes' theorem to be the proportional height of the normal density weighted by τ:

T_{j,i}^(t) := P(Z_i = j | X_i = x_i; θ^(t)) = τ_j^(t) f(x_i; μ_j^(t), Σ_j^(t)) / ( τ_1^(t) f(x_i; μ_1^(t), Σ_1^(t)) + τ_2^(t) f(x_i; μ_2^(t), Σ_2^(t)) )

These are called the "membership probabilities", which are normally considered the output of the E step (although this is not the Q function of below).
Note that this E step corresponds with the following function for Q:

M step

This has the same form as the MLE for the binomial distribution, so

τ_j^(t+1) = Σ_{i=1}^{n} T_{j,i}^(t) / Σ_{i=1}^{n} (T_{1,i}^(t) + T_{2,i}^(t)) = (1/n) Σ_{i=1}^{n} T_{j,i}^(t)

For the next estimates of (μ_1, Σ_1):

(μ_1^(t+1), Σ_1^(t+1)) = arg max_{μ_1,Σ_1} Q(θ | θ^(t)) = arg max_{μ_1,Σ_1} Σ_{i=1}^{n} T_{1,i}^(t) { −(1/2) log |Σ_1| − (1/2) (x_i − μ_1)^T Σ_1^{−1} (x_i − μ_1) }

This has the same form as a weighted MLE for a normal distribution, so

μ_1^(t+1) = Σ_{i=1}^{n} T_{1,i}^(t) x_i / Σ_{i=1}^{n} T_{1,i}^(t)   and   Σ_1^(t+1) = Σ_{i=1}^{n} T_{1,i}^(t) (x_i − μ_1^(t+1)) (x_i − μ_1^(t+1))^T / Σ_{i=1}^{n} T_{1,i}^(t)
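To make the E and M updates above concrete, here is a rough univariate sketch in Python, using NumPy and SciPy's normal density; the function name, the initialisation and the fixed iteration count are illustrative choices rather than part of the original example:

    import numpy as np
    from scipy.stats import norm

    def em_two_gaussians(x, n_iter=50):
        # EM for a mixture of two univariate Gaussians: a simplified,
        # one-dimensional version of the two-component example above.
        tau = np.array([0.5, 0.5])                # mixing weights
        mu = np.array([x.min(), x.max()])         # crude initial means
        sigma = np.array([x.std(), x.std()])      # initial standard deviations
        for _ in range(n_iter):
            # E step: membership probabilities T[j, i] = P(Z_i = j | x_i; theta).
            dens = np.vstack([tau[j] * norm.pdf(x, mu[j], sigma[j]) for j in range(2)])
            T = dens / dens.sum(axis=0)
            # M step: weighted maximum-likelihood updates for tau, mu and sigma.
            Nj = T.sum(axis=1)
            tau = Nj / len(x)
            mu = (T @ x) / Nj
            sigma = np.sqrt(np.array([(T[j] * (x - mu[j]) ** 2).sum() / Nj[j]
                                      for j in range(2)]))
        return tau, mu, sigma

    x = np.concatenate([np.random.normal(-2, 1, 300), np.random.normal(3, 0.5, 200)])
    print(em_two_gaussians(x))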
The on-line textbook: Information Theory, Inference, and Learning Algorithms, by David J. C. MacKay includes simple examples of the EM algorithm such as clustering using the soft k-means algorithm, and emphasizes the variational view of the EM algorithm, as described in Chapter 33.7 of version 7.2 (fourth edition).
Dellaert, Frank. "The Expectation Maximization Algorithm". CiteSeerX: 10.1.1.9.9735. Gives an easier explanation of the EM algorithm in terms of lower-bound maximization.
Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer. ISBN 0-387-31073-8.

[3] Sundberg, Rolf (1971). Maximum likelihood theory and applications for distributions generated when observing a function of an exponential family variable. Dissertation, Institute for Mathematical Statistics, Stockholm University.
[4] Sundberg, Rolf (1976). "An iterative method for solution of the likelihood equations for incomplete data from exponential families". Communications in Statistics – Simulation and Computation 5 (1): 55–64. doi:10.1080/03610917608812007. MR 443190.
[5] See the acknowledgement by Dempster, Laird and Rubin on pages 3, 5 and 11.
[6] Kulldorff, G. (1961). Contributions to the theory of estimation from grouped and partially grouped samples. Almqvist & Wiksell.
[9] Martin-Löf, Per (1970). Statistika Modeller (Statistical Models): Anteckningar från seminarier läsåret 1969–1970 (Notes from seminars in the academic year 1969–1970), with the assistance of Rolf Sundberg. Stockholm University. ("Sundberg formula")
[10] Martin-Löf, P. The notion of redundancy and its use as a quantitative measure of the deviation between a statistical hypothesis and a set of observational data. With a discussion by F. Abildgård, A. P. Dempster, D. Basu, D. R. Cox, A. W. F. Edwards, D. A. Sprott, G. A. Barnard, O. Barndorff-Nielsen, J. D. Kalbfleisch and G. Rasch and a reply by the author. Proceedings of Conference on Foundational Questions in Statistical Inference (Aarhus, 1973), pp. 1–42. Memoirs, No. 1, Dept. Theoret. Statist., Inst. Math., Univ. Aarhus, Aarhus, 1974.
[11] Martin-Löf, Per (1974). "The notion of redundancy and its use as a quantitative measure of the discrepancy between a statistical hypothesis and a set of observational data". Scand. J. Statist. 1 (1): 3–18.
[12] Wu, C. F. Jeff (1983). "On the Convergence Properties of the EM Algorithm". The Annals of Statistics (Institute of Mathematical Statistics) 11 (1): 95–103. doi:10.1214/aos/1176346060. Retrieved 11 December 2014.
[13] Wu, C. F. Jeff (Mar 1983). "On the Convergence Properties of the EM Algorithm". Annals of Statistics 11 (1): 95–103. doi:10.1214/aos/1176346060. JSTOR 2240463. MR 684867.
[14] Little, Roderick J. A.; Rubin, Donald B. (1987). Statistical Analysis with Missing Data. Wiley Series in Probability and Mathematical Statistics. New York: John Wiley & Sons. pp. 134–136. ISBN 0-471-80254-9.
[20] Jamshidian, Mortaza; Jennrich, Robert I. (1997). "Acceleration of the EM Algorithm by using Quasi-Newton Methods". Journal of the Royal Statistical Society, Series B 59 (2): 569–587. doi:10.1111/1467-9868.00083. MR 1452026.
[21] Meng, Xiao-Li; Rubin, Donald B. (1993). "Maximum likelihood estimation via the ECM algorithm: A general framework". Biometrika 80 (2): 267–278. doi:10.1093/biomet/80.2.267. MR 1243503.
[22] Yin, Jiangtao; Zhang, Yanfeng; Gao, Lixin (2012). "Accelerating Expectation-Maximization Algorithms with Frequent Updates" (PDF). Proceedings of the IEEE International Conference on Cluster Computing.
[23] Hunter, D. R.; Lange, K. (2004). "A Tutorial on MM Algorithms". The American Statistician 58: 30–37.
[24] Matsuyama, Yasuo (2003). "The α-EM algorithm: Surrogate likelihood maximization using α-logarithmic information measures". IEEE Transactions on Information Theory 49 (3): 692–706. doi:10.1109/TIT.2002.808105.
[25] Matsuyama, Yasuo (2011). "Hidden Markov model estimation based on alpha-EM algorithm: Discrete and continuous alpha-HMMs". International Joint Conference on Neural Networks: 808–816.
[26] Wolynetz, M. S. (1979). "Maximum likelihood estimation in a linear model from confined and censored normal data". Journal of the Royal Statistical Society, Series C 28 (2): 195–206. doi:10.2307/2346749.
[27] Lange, Kenneth. "The MM Algorithm" (PDF).
Chapter 43
k-means clustering
k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
The problem is computationally difficult (NP-hard); however, there are efficient heuristic algorithms that are commonly employed and converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions via an iterative refinement approach employed by both algorithms. Additionally, they both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the expectation-maximization mechanism allows clusters to have different shapes.
The algorithm has nothing to do with and should not be confused with k-nearest neighbor, another popular machine learning technique.

43.1 Description

Given a set of observations (x_1, x_2, …, x_n), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S_1, S_2, …, S_k} so as to minimize the within-cluster sum of squares (WCSS). In other words, its objective is to find:

arg min_S Σ_{i=1}^{k} Σ_{x ∈ S_i} ||x − μ_i||²

where μ_i is the mean of points in S_i.

43.2 History

Stuart Lloyd in 1957 as a technique for pulse-code modulation, though it wasn't published outside of Bell Labs until 1982.[3] In 1965, E. W. Forgy published essentially the same method, which is why it is sometimes referred to as Lloyd–Forgy.[4] A more efficient version was proposed and published in Fortran by Hartigan and Wong in 1975/1979.[5][6]

43.3 Algorithms

43.3.1 Standard algorithm

The most common algorithm uses an iterative refinement technique. Due to its ubiquity it is often called the k-means algorithm; it is also referred to as Lloyd's algorithm, particularly in the computer science community.
Given an initial set of k means m_1^(1), …, m_k^(1) (see below), the algorithm proceeds by alternating between two steps:[7]

Assignment step: Assign each observation to the cluster whose mean yields the least within-cluster sum of squares (WCSS). Since the sum of squares is the squared Euclidean distance, this is intuitively the "nearest" mean.[8] (Mathematically, this means partitioning the observations according to the Voronoi diagram generated by the means.)

S_i^(t) = { x_p : ||x_p − m_i^(t)||² ≤ ||x_p − m_j^(t)||² for all j, 1 ≤ j ≤ k }

Update step: Calculate the new means to be the centroids of the observations in the new clusters.

m_i^(t+1) = (1 / |S_i^(t)|) Σ_{x_j ∈ S_i^(t)} x_j
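A minimal NumPy sketch of this two-step iteration; the Forgy-style initialisation, the iteration cap and the function name are illustrative assumptions rather than part of the article:

    import numpy as np

    def lloyd_kmeans(X, k, n_iter=100, seed=0):
        # Minimal k-means (Lloyd-type iteration): alternate assignment and update.
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]  # pick k points as initial means
        labels = np.zeros(len(X), dtype=int)
        for _ in range(n_iter):
            # Assignment step: nearest center by squared Euclidean distance.
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Update step: each center becomes the centroid of its cluster.
            new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            if np.allclose(new_centers, centers):
                break  # the means (and hence the assignments) no longer change
            centers = new_centers
        return centers, labels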
The algorithm has converged when the assignments no longer change. Since both steps optimize the WCSS objective, and there only exists a finite number of such partitionings, the algorithm must converge to a (local) optimum. There is no guarantee that the global optimum is found using this algorithm.
The algorithm is often presented as assigning objects to the nearest cluster by distance. The standard algorithm aims at minimizing the WCSS objective, and thus assigns by "least sum of squares", which is exactly equivalent to assigning by the smallest Euclidean distance. Using a different distance function other than (squared) Euclidean distance may stop the algorithm from converging. Various modifications of k-means such as spherical k-means and k-medoids have been proposed to allow using other distance measures.
is 2^Ω(√n), to converge.[10] These point sets do not seem to arise in practice: this is corroborated by the fact that the smoothed running time of k-means is polynomial.[11]
The "assignment" step is also referred to as the expectation step, and the "update step" as the maximization step, making this algorithm a variant of the generalized expectation-maximization algorithm.

43.3.2 Complexity

Regarding computational complexity, finding the optimal solution to the k-means clustering problem for observations in d dimensions is:

NP-hard in general Euclidean space d even for 2 clusters[12][13]
[Figure: scatter-plot matrix comparing the k-means clusters (k = 3) with the true Iris species labels on the Iris flower data set.]

PAM) uses the medoid instead of the mean, and this way minimizes the sum of distances for arbitrary distance functions.
k-means++ chooses initial centers in a way that gives a provable upper bound on the WCSS objective.
The filtering algorithm uses kd-trees to speed up each k-means step.[17]
Some methods attempt to speed up each k-means step using coresets[18] or the triangle inequality.[19]
Escape local optima by swapping points between clusters.[6]
The Spherical k-means clustering algorithm is suitable for directional data.[20]
The Minkowski metric weighted k-means deals with irrelevant features by assigning cluster-specific weights to each feature.[21]

[Figure: k-means clustering and EM clustering on an artificial dataset ("mouse"). The tendency of k-means to produce equi-sized clusters leads to bad results, while EM benefits from the Gaussian distribution present in the data set.]

43.4 Discussion

[Figure: a typical example of the k-means convergence to a local minimum. In this example, the result of k-means clustering (the right figure) contradicts the obvious cluster structure of the data set. The small circles are the data points, the four ray stars are the centroids (means). The initial configuration is on the left figure. The algorithm converges after five iterations presented on the figures, from the left to the right. The illustration was prepared with the Mirkes Java applet.[22]]

The two key features of k-means which make it efficient are often regarded as its biggest drawbacks:

Euclidean distance is used as a metric and variance is used as a measure of cluster scatter.
The number of clusters k is an input parameter: an inappropriate choice of k may yield poor results. That is why, when performing k-means, it is important to run diagnostic checks for determining the number of clusters in the data set.
Convergence to a local minimum may produce counterintuitive ("wrong") results (see example in the figure).

A key limitation of k-means is its cluster model. The concept is based on spherical clusters that are separable in a way so that the mean value converges towards the cluster center. The clusters are expected to be of similar size, so that the assignment to the nearest cluster center is the correct assignment. When for example applying k-means with a value of k = 3 onto the well-known Iris flower data set, the result often fails to separate the three Iris species contained in the data set. With k = 2, the two visible clusters (one containing two species) will be discovered, whereas with k = 3 one of the two clusters will be split into two even parts. In fact, k = 2 is more appropriate for this data set, despite the data set containing 3 classes. As with any other clustering algorithm, the k-means result relies on the data set to satisfy the assumptions made by the clustering algorithms. It works well on some data sets, while failing on others.
The result of k-means can also be seen as the Voronoi cells of the cluster means. Since data is split halfway between cluster means, this can lead to suboptimal splits as can be seen in the "mouse" example. The Gaussian models used by the Expectation-maximization algorithm (which can
43.5 Applications
with simple, linear classifiers for semi-supervised learning in NLP (specifically for named entity recognition)[28] and in computer vision. On an object recognition task, it was found to exhibit comparable performance with more sophisticated feature learning approaches such as autoencoders and restricted Boltzmann machines.[26] However, it generally requires more data than the sophisticated methods, for equivalent performance, because each data point only contributes to one "feature" rather than multiple.[24]

43.6 Relation to other statistical machine learning algorithms

k-means clustering, and its associated expectation-maximization algorithm, is a special case of a Gaussian mixture model, specifically, the limit of taking all covariances as diagonal, equal, and small. It is often easy to generalize a k-means problem into a Gaussian mixture model.[29] Another generalization of the k-means algorithm is the K-SVD algorithm, which estimates data points as a sparse linear combination of "codebook vectors". K-means corresponds to the special case of using a single codebook vector, with a weight of 1.[30]

43.6.1 Mean shift clustering

Basic mean shift clustering algorithms maintain a set of data points the same size as the input data set. Initially, this set is copied from the input set. Then this set is iteratively replaced by the mean of those points in the set that are within a given distance of that point. By contrast, k-means restricts this updated set to k points usually much less than the number of points in the input data set, and replaces each point in this set by the mean of all points in the input set that are closer to that point than any other (e.g. within the Voronoi partition of each updating point). A mean shift algorithm that is similar then to k-means, called likelihood mean shift, replaces the set of points undergoing replacement by the mean of all points in the input set that are within a given distance of the changing set.[31] One of the advantages of mean shift over k-means is that there is no need to choose the number of clusters, because mean shift is likely to find only a few clusters if indeed only a small number exist. However, mean shift can be much slower than k-means, and still requires selection of a bandwidth parameter. Mean shift has soft variants much as k-means does.

43.6.2 Principal component analysis (PCA)

It was asserted in[32][33] that the relaxed solution of k-means clustering, specified by the cluster indicators, is given by the PCA (principal component analysis) principal components, and the PCA subspace spanned by the principal directions is identical to the cluster centroid subspace. However, that PCA is a useful relaxation of k-means clustering was not a new result (see, for example,[34]), and it is straightforward to uncover counterexamples to the statement that the cluster centroid subspace is spanned by the principal directions.[35]

43.6.3 Independent component analysis (ICA)

It has been shown in[36] that under sparsity assumptions and when input data is pre-processed with the whitening transformation, k-means produces the solution to the linear Independent component analysis task. This aids in explaining the successful application of k-means to feature learning.

43.6.4 Bilateral filtering

k-means implicitly assumes that the ordering of the input data set does not matter. The bilateral filter is similar to K-means and mean shift in that it maintains a set of data points that are iteratively replaced by means. However, the bilateral filter restricts the calculation of the (kernel weighted) mean to include only points that are close in the ordering of the input data.[31] This makes it applicable to problems such as image denoising, where the spatial arrangement of pixels in an image is of critical importance.

43.7 Similar problems

The set of squared error minimizing cluster functions also includes the k-medoids algorithm, an approach which forces the center point of each cluster to be one of the actual points, i.e., it uses medoids in place of centroids.

43.8 Software Implementations

43.8.1 Free

CrimeStat implements two spatial k-means algorithms, one of which allows the user to define the starting locations.
ELKI contains k-means (with Lloyd and MacQueen iteration, along with different initializations such as k-means++ initialization) and various more advanced clustering algorithms.
Julia contains a k-means implementation in the Clustering package.[37]
Mahout contains a MapReduce based k-means.
MLPACK contains a C++ implementation of k-means.
Octave contains k-means.
OpenCV contains a k-means implementation.
R contains three k-means variations.[1][3][6]
SciPy and scikit-learn contain multiple k-means implementations.
Spark MLlib implements a distributed k-means algorithm.
Torch contains an unsup package that provides k-means clustering.
Weka contains k-means and x-means.

43.8.2 Commercial

Grapheme
MATLAB
Mathematica
SAS
Stata

43.10 References

[3] Lloyd, S. P. (1957). "Least square quantization in PCM". Bell Telephone Laboratories Paper. Published in journal much later: Lloyd, S. P. (1982). "Least squares quantization in PCM" (PDF). IEEE Transactions on Information Theory 28 (2): 129–137. doi:10.1109/TIT.1982.1056489. Retrieved 2009-04-15.
[4] Forgy, E. W. (1965). "Cluster analysis of multivariate data: efficiency versus interpretability of classifications". Biometrics 21: 768–769.
[5] Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc.
[6] Hartigan, J. A.; Wong, M. A. (1979). "Algorithm AS 136: A K-Means Clustering Algorithm". Journal of the Royal Statistical Society, Series C 28 (1): 100–108. JSTOR 2346830.
[7] MacKay, David (2003). "Chapter 20. An Example Inference Task: Clustering" (PDF). Information Theory, Inference and Learning Algorithms. Cambridge University Press. pp. 284–292. ISBN 0-521-64298-1. MR 2012999.
[8] Since the square root is a monotone function, this also is the minimum Euclidean distance assignment.
[9] Hamerly, G.; Elkan, C. (2002). "Alternatives to the k-means algorithm that find better clusterings" (PDF). Proceedings of the eleventh international conference on Information and knowledge management (CIKM).
[17] Kanungo, T.; Mount, D. M.; Netanyahu, N. S.; Piatko, C. D.; Silverman, R.; Wu, A. Y. (2002). "An efficient k-means clustering algorithm: Analysis and implementation" (PDF). IEEE Trans. Pattern Analysis and Machine Intelligence 24: 881–892. doi:10.1109/TPAMI.2002.1017616. Retrieved 2009-04-24.
[18] Frahling, G.; Sohler, C. (2006). "A fast k-means implementation using coresets" (PDF). Proceedings of the twenty-second annual symposium on Computational geometry (SoCG).
[19] Elkan, C. (2003). "Using the triangle inequality to accelerate k-means" (PDF). Proceedings of the Twentieth International Conference on Machine Learning (ICML).
[20] Dhillon, I. S.; Modha, D. M. (2001). "Concept decompositions for large sparse text data using clustering". Machine Learning 42 (1): 143–175. doi:10.1023/a:1007612920971.
[21] Amorim, R. C.; Mirkin, B. (2012). "Minkowski metric, feature weighting and anomalous cluster initializing in K-Means clustering". Pattern Recognition 45 (3): 1061–1075. doi:10.1016/j.patcog.2011.08.012.
[22] Mirkes, E. M. "K-means and K-medoids applet". www.math.le.ac.uk. Retrieved 1 May 2015.
[23] Honarkhah, M.; Caers, J. (2010). "Stochastic Simulation of Patterns Using Distance-Based Pattern Modeling". Mathematical Geosciences 42: 487–517. doi:10.1007/s11004-010-9276-7.
[30] Aharon, Michal; Elad, Michael; Bruckstein, Alfred (2006). "K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation" (PDF).
[31] Little, M. A.; Jones, N. S. (2011). "Generalized Methods and Solvers for Piecewise Constant Signals: Part I" (PDF). Proceedings of the Royal Society A 467: 3088–3114. doi:10.1098/rspa.2010.0671.
[32] Zha, H.; Ding, C.; Gu, M.; He, X.; Simon, H. D. (Dec 2001). "Spectral Relaxation for K-means Clustering" (PDF). Neural Information Processing Systems vol. 14 (NIPS 2001) (Vancouver, Canada): 1057–1064.
[33] Ding, Chris; He, Xiaofeng (July 2004). "K-means Clustering via Principal Component Analysis" (PDF). Proc. of Int'l Conf. Machine Learning (ICML 2004): 225–232.
[34] Drineas, P.; Frieze, A.; Kannan, R.; Vempala, S.; Vinay, V. (2004). "Clustering large graphs via the singular value decomposition" (PDF). Machine Learning 56: 9–33. doi:10.1023/b:mach.0000033113.59016.96. Retrieved 2012-08-02.
[35] Cohen, M.; Elder, S.; Musco, C.; Musco, C.; Persu, M. (2014). "Dimensionality reduction for k-means clustering and low rank approximation (Appendix B)". ArXiv. Retrieved 2014-11-29.
[36] Vinnikov, Alon; Shalev-Shwartz, Shai (2014). "K-means Recovers ICA Filters when Independent Components are Sparse" (PDF). Proc. of Int'l Conf. Machine Learning (ICML 2014).
[37] "Clustering.jl". www.github.com.
Chapter 44
Hierarchical clustering
In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:[1]

Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.
In the general case, the complexity of agglomerative clustering is O(n³), which makes them too slow for large data sets. Divisive clustering with an exhaustive search is O(2ⁿ), which is even worse. However, for some special cases, optimal efficient agglomerative methods (of complexity O(n²)) are known: SLINK[2] for single-linkage and CLINK[3] for complete-linkage clustering.

44.1 Cluster dissimilarity

In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate metric (a measure of distance between pairs of observations), and a linkage criterion which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.

44.1.1 Metric

Further information: metric (mathematics)
The choice of an appropriate metric will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. For example, in a 2-dimensional space, the distance between the point (1,0) and the origin (0,0) is always 1 according to the usual norms, but the distance between the point (1,1) and the origin (0,0) can be 2 under Manhattan distance, √2 under Euclidean distance, or 1 under maximum distance.
Some commonly used metrics for hierarchical clustering are:[4]
For text or other non-numeric data, metrics such as the Hamming distance or Levenshtein distance are often used.
A review of cluster analysis in health psychology research found that the most common distance measure in published studies in that research area is the Euclidean distance or the squared Euclidean distance.

44.1.2 Linkage criteria

The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations.
Some commonly used linkage criteria between two sets of observations A and B are:[5][6]
where d is the chosen metric. Other linkage criteria include:

The sum of all intra-cluster variance.
The decrease in variance for the cluster being merged (Ward's criterion).[7]
The probability that candidate clusters spawn from the same distribution function (V-linkage).
The product of in-degree and out-degree on a k-nearest-neighbour graph (graph degree linkage).[8]
The increment of some cluster descriptor (i.e., a quantity defined for measuring the quality of a cluster) after merging two clusters.[9][10][11]
44.2 Discussion

[Figure: example data set {a, b, c, d, e, f} and its hierarchical clustering dendrogram.]

The hierarchical clustering dendrogram would be as such:
This method builds the hierarchy from the individual elements by progressively merging clusters. In our example, we have six elements {a}, {b}, {c}, {d}, {e} and {f}. The first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, according to the chosen distance.
Optionally, one can also construct a distance matrix at this stage, where the number in the i-th row j-th column is the distance between the i-th and j-th elements. Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances updated. This is a common way to implement this type of clustering, and has the benefit of caching distances between clusters. A simple agglomerative clustering algorithm is described in

max{ d(x, y) : x ∈ A, y ∈ B }
The mean distance between elements of each cluster (also called average linkage clustering, used e.g. in UPGMA):
(1 / (|A| |B|)) Σ_{x ∈ A} Σ_{y ∈ B} d(x, y)
The sum of all intra-cluster variance.
The increase in variance for the cluster being merged (Ward's method).[7]
The probability that candidate clusters spawn from the same distribution function (V-linkage).
Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion).

44.4 Software

44.4.1 Open Source Frameworks

R has several functions for hierarchical clustering: see CRAN Task View: Cluster Analysis & Finite Mixture Models for more information.
Cluster 3.0 provides a nice Graphical User Interface to access to different clustering routines and is available for Windows, Mac OS X, Linux, Unix.
ELKI includes multiple hierarchical clustering algorithms, various linkage strategies and also includes the efficient SLINK[2] algorithm, flexible cluster extraction from dendrograms and various other cluster analysis algorithms.
Octave, the GNU analog to MATLAB, implements hierarchical clustering in the linkage function.
Orange, a free data mining software suite, module orngClustering for scripting in Python, or cluster analysis through visual programming.
scikit-learn implements a hierarchical clustering.
fastCluster efficiently implements the seven most widely used clustering schemes.
SCaViS computing environment in Java that implements this algorithm.

44.4.2 Standalone implementations

CrimeStat implements two hierarchical clustering routines, a nearest neighbor (Nnh) and a risk-adjusted (Rnnh).
figue is a JavaScript package that implements some agglomerative clustering functions (single-linkage, complete-linkage, average-linkage) and functions to visualize clustering output (e.g. dendrograms).
hcluster is a Python implementation, based on NumPy, which supports hierarchical clustering and plotting.
Hierarchical Agglomerative Clustering implemented as a C# Visual Studio project that includes real text files processing, building of document-term matrix with stop words filtering and stemming.
MultiDendrograms An open source Java application for variable-group agglomerative hierarchical clustering, with graphical user interface.
Graph Agglomerative Clustering (GAC) toolbox implemented several graph-based agglomerative clustering algorithms.
Hierarchical Clustering Explorer provides tools for interactive exploration of multidimensional data.

44.4.3 Commercial

MATLAB includes hierarchical cluster analysis.
SAS includes hierarchical cluster analysis.
Mathematica includes a Hierarchical Clustering Package.
NCSS (statistical software) includes hierarchical cluster analysis.
SPSS includes hierarchical cluster analysis.
Qlucore Omics Explorer includes hierarchical cluster analysis.
Stata includes hierarchical cluster analysis.

44.5 See also

Statistical distance
Brown clustering
Cluster analysis
CURE data clustering algorithm
Dendrogram
Determining the number of clusters in a data set
Hierarchical clustering of networks
Nearest-neighbor chain algorithm
Numerical taxonomy
OPTICS algorithm
Nearest neighbor search
Locality-sensitive hashing
Instance-based learning
Analogical modeling
Chapter 46
k-nearest neighbors algorithm
algorithms such as Large Margin Nearest Neighbor or Neighbourhood components analysis.
A drawback of the basic "majority voting" classification occurs when the class distribution is skewed. That is, examples of a more frequent class tend to dominate the prediction of the new example, because they tend to be common among the k nearest neighbors due to their large number.[4] One way to overcome this problem is to weight the classification, taking into account the distance from the test point to each of its k nearest neighbors. The class (or value, in regression problems) of each of the k nearest points is multiplied by a weight proportional to the inverse of the distance from that point to the test point. Another way to overcome skew is by abstraction in data representation. For example, in a self-organizing map (SOM), each node is a representative (a "center") of a cluster of similar points, regardless of their density in the original training data. k-NN can then be applied to the SOM.

46.2 Parameter selection

algorithms have been proposed over the years; these generally seek to reduce the number of distance evaluations actually performed.
k-NN has some strong consistency results. As the amount of data approaches infinity, the algorithm is guaranteed to yield an error rate no worse than twice the Bayes error rate (the minimum achievable error rate given the distribution of the data).[10] k-NN is guaranteed to approach the Bayes error rate for some value of k (where k increases as a function of the number of data points). Various improvements to k-NN are possible by using proximity graphs.[11]

46.4 Metric Learning

The k-nearest neighbor classification performance can often be significantly improved through (supervised) metric learning. Popular algorithms are Neighbourhood components analysis and Large margin nearest neighbor. Supervised metric learning algorithms use the label information to learn a new metric or pseudo-metric.
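A sketch of the inverse-distance weighting just described, assuming NumPy; the function name and the small epsilon guard against zero distances are illustrative choices:

    import numpy as np
    from collections import defaultdict

    def weighted_knn_predict(X_train, y_train, x, k=5, eps=1e-12):
        # Classify x by k-NN voting, weighting each neighbor's vote by 1/distance.
        d = np.linalg.norm(X_train - x, axis=1)      # Euclidean distances to all training points
        nearest = np.argsort(d)[:k]                  # indices of the k nearest points
        votes = defaultdict(float)
        for i in nearest:
            votes[y_train[i]] += 1.0 / (d[i] + eps)  # inverse-distance weight for this neighbor's class
        return max(votes, key=votes.get)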
Feature extraction and dimension reduction can be combined in one step using principal component analysis (PCA), linear discriminant analysis (LDA), or canonical correlation analysis (CCA) techniques as a pre-processing step, followed by clustering by k-NN on feature vectors in reduced-dimension space. In machine learning this process is also called low-dimensional embedding.[13]
For very-high-dimensional datasets (e.g. when performing a similarity search on live video streams, DNA data or high-dimensional time series) running a fast approximate k-NN search using locality sensitive hashing, "random projections",[14] "sketches"[15] or other high-dimensional similarity search techniques from the VLDB toolbox might be the only feasible option.

46.7 Decision boundary

Nearest neighbor rules in effect implicitly compute the decision boundary. It is also possible to compute the decision boundary explicitly, and to do so efficiently, so that the computational complexity is a function of the boundary complexity.[16]

1. Select the class-outliers, that is, training data that are classified incorrectly by k-NN (for a given k).
2. Separate the rest of the data into two sets: (i) the prototypes that are used for the classification decisions and (ii) the absorbed points that can be correctly classified by k-NN using prototypes. The absorbed points can then be removed from the training set.

too many training examples of other classes (unbalanced classes) that create a hostile background for the given small class

Class outliers with k-NN produce noise. They can be detected and separated for future analysis. Given two natural numbers, k > r > 0, a training example is called a (k,r)NN class-outlier if its k nearest neighbors include more than r examples of other classes.

46.8.2 CNN for data reduction

Condensed nearest neighbor (CNN, the Hart algorithm) is an algorithm designed to reduce the data set for k-NN classification.[17] It selects the set of prototypes U from the training data, such that 1NN with U can classify the examples almost as accurately as 1NN does with the whole data set.

1. Scan all elements of X, looking for an element x whose nearest prototype from U has a different label than x.
2. Remove x from X and add it to U.
3. Repeat the scan until no more prototypes are added to U.

Use U instead of X for classification. The examples that are not prototypes are called "absorbed" points.
It is efficient to scan the training examples in order of decreasing border ratio.[18] The border ratio of a training example x is defined as

a(x) = ||x'−y|| / ||x−y||

where ||x−y|| is the distance to the closest example y having a different color than x, and ||x'−y|| is the distance from y to its closest example x' with the same label as x.
The border ratio is in the interval [0,1] because ||x'−y|| never exceeds ||x−y||. This ordering gives preference to the borders of the classes for inclusion in the set of prototypes U. A point of a different label than x is called external to x. The calculation of the border ratio is illustrated by the figure on the right. The data points are labeled by colors: the initial point is x and its label is red. External points are blue and green. The closest to x external point is y. The closest to y red point is x'. The border ratio a(x) = ||x'−y|| / ||x−y|| is the attribute of the initial point x.

[Figure: calculation of the border ratio.]

Below is an illustration of CNN in a series of figures. There are three classes (red, green and blue). Fig. 1: initially there are 60 points in each class. Fig. 2 shows the 1NN classification map: each pixel is classified by 1NN using all the data. Fig. 3 shows the 5NN classification map. White areas correspond to the unclassified regions, where 5NN voting is tied (for example, if there are two green, two red and one blue points among 5 nearest neighbors). Fig. 4 shows the reduced data set. The crosses are the class-outliers selected by the (3,2)NN rule (all the three nearest neighbors of these instances belong to other classes); the squares are the prototypes, and the empty circles are the absorbed points. The left bottom corner shows the numbers of the class-outliers, prototypes and absorbed points for all three classes. The number of prototypes varies from 15% to 20% for different classes in this example. Fig. 5 shows that the 1NN classification map with the prototypes is very similar to that with the initial data set. The figures were produced using the Mirkes applet.[18]

[Figure gallery: CNN model reduction for k-NN classifiers. Fig. 1. The dataset. Fig. 2. The 1NN classification map. Fig. 3. The 5NN classification map. Fig. 4. The CNN reduced dataset. Fig. 5. The 1NN classification map based on the CNN extracted prototypes.]

46.9 k-NN regression

In k-NN regression, the k-NN algorithm is used for estimating continuous variables. One such algorithm uses a weighted average of the k nearest neighbors, weighted by the inverse of their distance. This algorithm works as follows:

1. Compute the Euclidean or Mahalanobis distance from the query example to the labeled examples.
2. Order the labeled examples by increasing distance.
3. Find a heuristically optimal number k of nearest neighbors, based on RMSE. This is done using cross validation.
4. Calculate an inverse distance weighted average with the k-nearest multivariate neighbors.

46.10 Validation of results

A confusion matrix or "matching matrix" is often used as a tool to validate the accuracy of k-NN classification. More robust statistical methods such as the likelihood-ratio test can also be applied.

46.11 See also

Instance-based learning
Nearest neighbor search
Statistical classification
Cluster analysis
Data mining
Nearest centroid classifier
Pattern recognition
Curse of dimensionality
Dimension reduction
Principal Component Analysis
Locality Sensitive Hashing
MinHash
Cluster hypothesis
Closest pair of points problem
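Tying back to the k-NN regression procedure listed above, a minimal NumPy sketch of the inverse-distance-weighted average (step 4); the function name and the fixed k are illustrative, and in practice k would be tuned by cross-validation on RMSE as in step 3:

    import numpy as np

    def knn_regress(X_train, y_train, x, k=5, eps=1e-12):
        # Predict a continuous value for x as the inverse-distance-weighted
        # average of its k nearest training neighbours (Euclidean distance).
        d = np.linalg.norm(X_train - x, axis=1)   # step 1: distances to labeled examples
        nearest = np.argsort(d)[:k]               # step 2: order by increasing distance, keep k
        w = 1.0 / (d[nearest] + eps)              # inverse-distance weights
        return float(np.dot(w, y_train[nearest]) / w.sum())   # step 4: weighted average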
Chapter 47
Principal component analysis
PCA is also related to canonical correlation analysis (CCA). CCA defines coordinate systems that optimally describe the cross-covariance between two datasets while PCA defines a new orthogonal coordinate system that optimally describes variance in a single dataset.[6][7]
in such a way that the individual variables of t considered over the data set successively inherit the maximum possible variance from x, with each loading vector w constrained to be a unit vector.
space of the original variables, {x_i · w_(k)} w_(k), where w_(k) is the kth eigenvector of X^T X.
The full principal components decomposition of X can therefore be given as

T = XW

where W is a p-by-p matrix whose columns are the eigenvectors of X^T X.

47.2.3 Covariances

X^T X itself can be recognised as proportional to the empirical sample covariance matrix of the dataset X.[8]
The sample covariance Q between two of the different principal components over the dataset is given by:

Q(PC(j), PC(k)) ∝ (X w_(j))^T (X w_(k)) = w_(j)^T X^T X w_(k) = w_(j)^T λ_(k) w_(k) = λ_(k) w_(j)^T w_(k)

47.2.4 Dimensionality reduction

The faithful transformation T = XW maps a data vector x_i from an original space of p variables to a new space of p variables which are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first L principal components, produced by using only the first L loading vectors, gives the truncated transformation

T_L = X W_L

where the matrix T_L now has n rows but only L columns. In other words, PCA learns a linear transformation t = W^T x, x ∈ R^p, t ∈ R^L, where the columns of the p × L matrix W form an orthogonal basis for the L features (the components of representation t) that are decorrelated. By construction, of all the transformed data matrices with only L columns, this score matrix maximises the variance in the original data that has been preserved, while minimising the total squared reconstruction error ||T W^T − T_L W_L^T||² or ||X − X_L||².
may in fact be much more likely to substantially overlay each other, making them indistinguishable.
Similarly, in regression analysis, the larger the number of explanatory variables allowed, the greater is the chance of overfitting the model, producing conclusions that fail to generalise to other datasets. One approach, especially when there are strong correlations between different possible explanatory variables, is to reduce them to a few principal components and then run the regression against them, a method called principal component regression.
Dimensionality reduction may also be appropriate when the variables in a dataset are noisy. If each column of the dataset contains independent identically distributed Gaussian noise, then the columns of T will also contain similarly identically distributed Gaussian noise (such a distribution is invariant under the effects of the matrix W, which can be thought of as a high-dimensional rotation of the co-ordinate axes). However, with more of the total variance concentrated in the first few principal components compared to the same noise variance, the proportionate effect of the noise is less: the first few components achieve a higher signal-to-noise ratio. PCA thus can have the effect of concentrating much of the signal into the first few principal components, which can usefully be captured by dimensionality reduction; while the later principal components may be dominated by noise, and so disposed of without great loss.

47.2.5 Singular value decomposition

The principal components transformation can also be associated with another matrix factorisation, the singular value decomposition (SVD) of X.
Using the singular value decomposition the score matrix T can be written

T = XW = UΣW^T W = UΣ

so each column of T is given by one of the left singular vectors of X multiplied by the corresponding singular value. This form is also the polar decomposition of T.
Efficient algorithms exist to calculate the SVD of X without having to form the matrix X^T X, so computing the SVD is now the standard way to calculate a principal components analysis from a data matrix, unless only a handful of components are required.
As with the eigen-decomposition, a truncated n × L score matrix T_L can be obtained by considering only the first L largest singular values and their singular vectors:

T_L = U_L Σ_L = X W_L

The truncation of a matrix M or T using a truncated singular value decomposition in this way produces a truncated matrix that is the nearest possible matrix of rank L to the original matrix, in the sense of the difference between the two having the smallest possible Frobenius norm, a result known as the Eckart–Young theorem [1936].
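A short NumPy sketch of this SVD route to the truncated scores T_L; the function name and the synthetic example are illustrative only:

    import numpy as np

    def pca_svd(X, L):
        # Truncated PCA via the SVD: returns the first L scores T_L = X W_L
        # and the loading vectors W_L (columns are principal directions).
        Xc = X - X.mean(axis=0)                              # mean subtraction (centering)
        U, s, Wt = np.linalg.svd(Xc, full_matrices=False)    # Xc = U @ diag(s) @ Wt
        W_L = Wt[:L].T                                       # first L right singular vectors as columns
        T_L = Xc @ W_L                                       # equivalently U[:, :L] * s[:L]
        return T_L, W_L

    # Example: project 3-dimensional data onto its first two principal components.
    X = np.random.randn(200, 3) @ np.diag([2.0, 1.0, 0.1])
    T2, W2 = pca_svd(X, 2)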
47.3 Further considerations

PCA has the distinction of being the optimal orthogonal transformation for keeping the subspace that has the largest variance (as defined above). This advantage, however, comes at the price of greater computational requirements if compared, for example and when applicable, to the discrete cosine transform, and in particular to the DCT-II which is simply known as the DCT. Nonlinear dimensionality reduction techniques tend to be more computationally demanding than PCA.

PCA is sensitive to the scaling of the variables. If we have just two variables and they have the same sample variance and are positively correlated, then the PCA will entail a rotation by 45° and the loadings for the two variables with respect to the principal component will be equal. But if we multiply all values of the first variable by 100, then the first principal component will be almost the same as that variable, with a small contribution from the other variable, whereas the second component will be almost aligned with the second original variable. This means that whenever the different variables have different units (like temperature and mass), PCA is a somewhat arbitrary method of analysis. (Different results would be obtained if one used Fahrenheit rather than Celsius for example.) Note that Pearson's original paper was entitled "On Lines and Planes of Closest Fit to Systems of Points in Space"; "in space" implies physical Euclidean space, where such concerns do not arise. One way of making the PCA less arbitrary is to use variables scaled so as to have unit variance, by standardizing the data and hence using the autocorrelation matrix instead of the autocovariance matrix as a basis for PCA. However, this compresses (or expands) the fluctuations in all dimensions of the signal space to unit variance.

Mean subtraction (a.k.a. mean centering) is necessary for performing PCA to ensure that the first principal component describes the direction of maximum variance. If mean subtraction is not performed, the first principal component might instead correspond more or less to the mean of the data. A mean of zero is needed for finding a basis that minimizes the mean square error of the approximation of the data.[9]

PCA is equivalent to empirical orthogonal functions (EOF), a name which is used in meteorology.

An autoencoder neural network with a linear hidden layer is similar to PCA. Upon convergence, the weight vectors of the K neurons in the hidden layer will form a basis for the space spanned by the first K principal components. Unlike PCA, this technique will not necessarily produce orthogonal vectors.

PCA is a popular primary technique in pattern recognition. It is not, however, optimized for class separability.[10] An alternative is linear discriminant analysis, which does take this into account.

47.4 Table of symbols and abbreviations

47.5 Properties and limitations of PCA

47.5.1 Properties[11]

Property 1: For any integer q, 1 ≤ q ≤ p, consider the orthogonal linear transformation
$$ y = B'x $$
where y is a q-element vector and B' is a (q × p) matrix, and let $\Sigma_y = B'\Sigma B$ be the variance-covariance matrix for y. Then the trace of $\Sigma_y$, denoted $\operatorname{tr}(\Sigma_y)$, is maximized by taking $B = A_q$, where $A_q$ consists of the first q columns of A (B' is the transpose of B).

Property 2: Consider again the orthonormal transformation
$$ y = B'x $$
with x, B, A and $\Sigma_y$ defined as before. Then $\operatorname{tr}(\Sigma_y)$ is minimized by taking $B = A_q$, where $A_q$ consists of the last q columns of A.

The statistical implication of this property is that the last few PCs are not simply unstructured left-overs after removing the important PCs. Because these last PCs have variances as small as possible they are useful in their own right. They can help to detect unsuspected near-constant linear relationships between the elements of x, and they may also be useful in regression, in selecting a subset of variables from x, and in outlier detection.

Property 3: (Spectral decomposition of Σ)
$$ \Sigma = \lambda_1 \alpha_1 \alpha_1' + \cdots + \lambda_p \alpha_p \alpha_p' $$

Before we look at its usage, we first look at the diagonal elements,
$$ \operatorname{Var}(x_j) = \sum_{k=1}^{p} \lambda_k \alpha_{kj}^2 $$

Then, perhaps the main statistical implication of the result is that not only can we decompose the combined variances of all the elements of x into decreasing contributions due to each PC, but we can also decompose the whole covariance matrix into contributions $\lambda_k \alpha_k \alpha_k'$ from each PC. Although not strictly decreasing, the elements of $\lambda_k \alpha_k \alpha_k'$ will tend to become smaller as k increases, as $\lambda_k$ decreases for increasing k, whereas the elements of $\alpha_k$ tend to stay about the same size because of the normalization constraints: $\alpha_k'\alpha_k = 1$, k = 1, ..., p.
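A numerical sketch of Property 3 above, checking that the covariance matrix decomposes into the contributions $\lambda_k \alpha_k \alpha_k'$ and that $\operatorname{Var}(x_j) = \sum_k \lambda_k \alpha_{kj}^2$ (the random covariance matrix used here is only an illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(4, 4))
    Sigma = A @ A.T                             # an arbitrary non-negative definite covariance matrix

    lam, alpha = np.linalg.eigh(Sigma)          # eigenvalues lam[k], eigenvectors alpha[:, k]

    # Sigma = sum_k lam_k * alpha_k alpha_k'
    recon = sum(lam[k] * np.outer(alpha[:, k], alpha[:, k]) for k in range(len(lam)))
    print(np.allclose(Sigma, recon))            # True

    # Var(x_j) = sum_k lam_k * alpha_{kj}^2  (the diagonal elements)
    print(np.allclose(np.diag(Sigma), (alpha ** 2) @ lam))   # True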
47.5.3 PCA and information theory

The claim that the PCA used for dimensionality reduction preserves most of the information of the data is misleading. Indeed, without any assumption on the signal model, PCA cannot help to reduce the amount of information lost during dimensionality reduction, where information was measured using Shannon entropy.[14]

Under the assumption that

$$ x = s + n $$

i.e., that the data vector x is the sum of the desired information-bearing signal s and a noise signal n, one can show that PCA can be optimal for dimensionality reduction also from an information-theoretic point of view.

In particular, Linsker showed that if s is Gaussian and n is Gaussian noise with a covariance matrix proportional to the identity matrix, the PCA maximizes the mutual information I(y; s) between the desired information s and the dimensionality-reduced output $y = W_L^T x$.[15]

If the noise is still Gaussian and has a covariance matrix proportional to the identity matrix (i.e., the components of the vector n are iid), but the information-bearing signal s is non-Gaussian (which is a common scenario), PCA at least minimizes an upper bound on the information loss, which is defined as[16][17]

$$ I(x; s) - I(y; s). $$

The optimality of PCA is also preserved if the noise n is iid and at least more Gaussian (in terms of the Kullback–Leibler divergence) than the information-bearing signal s.[18] In general, even if the above signal model holds, PCA loses its information-theoretic optimality as soon as the noise n becomes dependent.

47.6 Computing PCA using the covariance method

The following is a detailed description of PCA using the covariance method (see also here) as opposed to the correlation method.[19] But note that it is better to use the singular value decomposition (using standard software).

47.6.1 Organize the data set

Suppose you have data comprising a set of observations of p variables, and you want to reduce the data so that each observation can be described with only L variables, L < p. Suppose further, that the data are arranged as a set of n data vectors x_1 ... x_n, with each x_i representing a single grouped observation of the p variables.

Write x_1 ... x_n as row vectors, each of which has p columns.
Place the row vectors into a single matrix X of dimensions n × p.

47.6.2 Calculate the empirical mean

Find the empirical mean along each dimension j = 1, ..., p.
Place the calculated mean values into an empirical mean vector u of dimensions p × 1:

$$ u[j] = \frac{1}{n} \sum_{i=1}^{n} X[i, j] $$

47.6.3 Calculate the deviations from the mean

Mean subtraction is an integral part of the solution towards finding a principal component basis that minimizes the mean square error of approximating the data.[20] Hence we proceed by centering the data as follows:

Subtract the empirical mean vector u from each row of the data matrix X.
Store the mean-subtracted data in the n × p matrix B:

$$ B = X - h u^T $$

where h is an n × 1 column vector of all 1s:

$$ h[i] = 1 \quad \text{for } i = 1, \ldots, n $$
47.6.4 Find the covariance matrix

Find the p × p empirical covariance matrix C from the outer product of matrix B with itself:

$$ C = \frac{1}{n-1} B^{*} \cdot B $$

where * is the conjugate transpose operator. Note that if B consists entirely of real numbers, which is the case in many applications, the conjugate transpose is the same as the regular transpose.

Note that outer products apply to vectors. For tensor cases we should apply tensor products, but the covariance matrix in PCA is a sum of outer products between its sample vectors; indeed, it can be represented as B*·B.

The reasoning behind using n − 1 instead of n to calculate the covariance is Bessel's correction.

47.6.5 Find the eigenvectors and eigenvalues of the covariance matrix

Compute the matrix V of eigenvectors which diagonalizes the covariance matrix C.

Matrix D will take the form of a p × p diagonal matrix, where

$$ D[k, l] = \lambda_k \quad \text{for } k = l $$

is the kth eigenvalue of the covariance matrix C, and

$$ D[k, l] = 0 \quad \text{for } k \neq l. $$

Matrix V, also of dimension p × p, contains p column vectors, each of length p, which represent the p eigenvectors of the covariance matrix C.

The eigenvalues and eigenvectors are ordered and paired. The jth eigenvalue corresponds to the jth eigenvector.

47.6.6 Rearrange the eigenvectors and eigenvalues

Sort the columns of the eigenvector matrix V and the eigenvalue matrix D in order of decreasing eigenvalue.
Make sure to maintain the correct pairings between the columns in each matrix.

47.6.7 Compute the cumulative energy content for each eigenvector

The eigenvalues represent the distribution of the source data's energy among each of the eigenvectors, where the eigenvectors form a basis for the data. The cumulative energy content g for the jth eigenvector is the sum of the energy content across all of the eigenvalues from 1 through j:

$$ g[j] = \sum_{k=1}^{j} D[k, k] \quad \text{for } j = 1, \ldots, p $$

Use the vector g as a guide in choosing an appropriate value for L. The goal is to choose a value of L as small as possible while achieving a reasonably high value of g on a percentage basis. For example, you may want to choose L so that the cumulative energy g is above a certain threshold, like 90 percent. In this case, choose the smallest value of L such that

$$ \frac{g[L]}{g[p]} \geq 0.9 $$
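As a rough illustration of steps 47.6.1 through 47.6.7, the following NumPy sketch centres a data matrix, forms the covariance matrix, eigendecomposes it and picks L by cumulative energy. Variable names (X, u, B, C, V, g, L) mirror the text; the 0.9 threshold and the toy data are arbitrary choices for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))              # n x p data matrix (toy data)

    u = X.mean(axis=0)                         # 47.6.2: empirical mean, length p
    B = X - u                                  # 47.6.3: deviations from the mean
    C = (B.conj().T @ B) / (len(X) - 1)        # 47.6.4: covariance with Bessel's correction

    eigvals, V = np.linalg.eigh(C)             # 47.6.5: eigenvalues and eigenvectors of C
    order = np.argsort(eigvals)[::-1]          # 47.6.6: sort by decreasing eigenvalue
    eigvals, V = eigvals[order], V[:, order]

    g = np.cumsum(eigvals)                     # 47.6.7: cumulative energy content
    L = int(np.searchsorted(g / g[-1], 0.9) + 1)   # smallest L with g[L]/g[p] >= 0.9

    W_L = V[:, :L]                             # basis of the first L principal directions
    T_L = B @ W_L                              # truncated score matrix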
47.6.9 Convert the source data to z-scores (optional)

Create a p × 1 empirical standard deviation vector s from the square root of each element along the main diagonal of the diagonalized covariance matrix C. (Note that scaling operations do not commute with the KLT, thus we must scale by the variances of the already-decorrelated vector, which is the diagonal of C):

$$ s = \{ s[j] \} = \{ \sqrt{C[j, j]} \} \quad \text{for } j = 1, \ldots, p $$

Calculate the n × p z-score matrix (element-by-element division):

$$ Z = \frac{B}{h \cdot s^T} $$

Note: While this step is useful for various applications as it normalizes the data set with respect to its variance, it is not an integral part of PCA/KLT.

47.6.10 Project the z-scores of the data onto the new basis

The projected vectors are the columns of the matrix

$$ T = Z \cdot W = \mathbb{KLT}\{X\}. $$

The rows of matrix T represent the Karhunen–Loeve transforms (KLT) of the data vectors in the rows of matrix X.

47.7 Derivation of PCA using the covariance method

Let X be a d-dimensional random vector expressed as a column vector. Without loss of generality, assume X has zero mean.

We want to find (∗) a d × d orthonormal transformation matrix P so that PX has a diagonal covariance matrix (i.e. PX is a random vector with all its distinct components pairwise uncorrelated).

A quick computation assuming P were unitary yields:

$$\begin{aligned} \operatorname{var}(PX) &= \mathbb{E}[PX\,(PX)^{*}] \\ &= \mathbb{E}[PX\,X^{*}P^{*}] \\ &= P\,\mathbb{E}[XX^{*}]\,P^{*} \\ &= P\,\operatorname{var}(X)\,P^{-1} \end{aligned}$$

Hence (∗) holds if and only if var(X) is diagonalisable by P.

This is very constructive, as var(X) is guaranteed to be a non-negative definite matrix and thus is guaranteed to be diagonalisable by some unitary matrix.

47.7.1 Iterative computation

In practical implementations, especially with high dimensional data (large p), the covariance method is rarely used because it is not efficient. One way to compute the first principal component efficiently[24] is shown in the following pseudo-code, for a data matrix X with zero mean, without ever computing its covariance matrix.

    r = a random vector of length p
    do c times:
          s = 0 (a vector of length p)
          for each row x in X
                s = s + (x · r) x
          r = s / |s|
    return r

This algorithm is simply an efficient way of calculating X^T X r, normalizing, and placing the result back in r (power iteration). It avoids the np² operations of calculating the covariance matrix. r will typically get close to the first principal component of X within a small number of iterations, c. (The magnitude of s will be larger after each iteration. Convergence can be detected when it increases by an amount too small for the precision of the machine.)

Subsequent principal components can be computed by subtracting component r from X (see Gram–Schmidt) and then repeating this algorithm to find the next principal component. However this simple approach is not numerically stable if more than a small number of principal components are required, because imprecisions in the calculations will additively affect the estimates of subsequent principal components. More advanced methods build on this basic idea, as with the closely related Lanczos algorithm.

One way to compute the eigenvalue that corresponds with each principal component is to measure the difference in mean-squared-distance between the rows and the centroid, before and after subtracting out the principal component. The eigenvalue that corresponds with the component that was removed is equal to this difference.

47.7.2 The NIPALS method

Main article: Non-linear iterative partial least squares

For very-high-dimensional datasets, such as those generated in the *omics sciences (e.g., genomics, metabolomics), it is usually only necessary to compute the first few PCs. The non-linear iterative partial least squares (NIPALS) algorithm calculates t_1 and w_1^T from X. The outer product t_1 w_1^T can then be subtracted from X, leaving the residual matrix E_1. This can then be used to calculate subsequent PCs.
multiple factor analysis, co-inertia analysis, STATIS, and DISTATIS.

An Open Source Code and Tutorial in MATLAB and C++.

FactoMineR – Probably the most complete library of functions for exploratory data analysis.

XLSTAT – Principal Component Analysis is a part of the XLSTAT core module.[45]

Mathematica – Implements principal component analysis with the PrincipalComponents command[46] using both covariance and correlation methods.

DataMelt – A free Java program that implements several classes to build PCA analysis and to calculate eccentricity of random distributions.

NAG Library – Principal components analysis is implemented via the g03aa routine (available in both the Fortran[47] and the C[48] versions of the Library).

SIMCA – Commercial software package available to perform PCA analysis.[49]

Stata – The pca command provides principal components analysis.[57]

Cornell Spectrum Imager – Open-source toolset built on ImageJ; enables PCA analysis for 3D datacubes.[58]

imDEV – Free Excel add-in to calculate principal components using an R package.[59][60]

ViSta: The Visual Statistics System – Free software that provides principal components analysis, simple and multiple correspondence analysis.[61]

Spectramap – Software to create a biplot using principal components analysis, correspondence analysis or spectral map analysis.[62]

FinMath – .NET numerical library containing an implementation of PCA.[63]
OpenCV[65]

Eigenface

[5] Shaw P.J.A. (2003). Multivariate statistics for the Environmental Sciences. Hodder-Arnold. ISBN 0-340-80763-6.

[6] Barnett, T. P., and R. Preisendorfer (1987). "Origins and levels of monthly and seasonal forecast skill for United States surface air temperatures determined by canonical correlation analysis". Monthly Weather Review 115.

[7] Hsu, Daniel, Sham M. Kakade, and Tong Zhang (2008). "A spectral algorithm for learning hidden markov models". arXiv preprint arXiv:0811.4413.

[8] Bengio, Y. et al. (2013). "Representation Learning: A Review and New Perspectives" (PDF). Pattern Analysis and Machine Intelligence 35 (8). doi:10.1109/TPAMI.2013.50.

[9] A. A. Miranda, Y. A. Le Borgne, and G. Bontempi. "New Routes from Minimal Approximation Error to Principal Components". Neural Processing Letters, Volume 27, Number 3, June 2008, Springer.

[10] Fukunaga, Keinosuke (1990). Introduction to Statistical Pattern Recognition. Elsevier. ISBN 0-12-269851-7.

[11] Jolliffe, I. T. (2002). Principal Component Analysis, second edition. Springer-Verlag. ISBN 978-0-387-95442-4.

[12] Leznik, M; Tofallis, C. (2005). Estimating Invariant Principal Components Using Diagonal Regression. uhra.herts.ac.uk/bitstream/handle/2299/715/S56.pdf

[13] Jonathon Shlens. A Tutorial on Principal Component Analysis.

[14] Geiger, Bernhard; Kubin, Gernot (Sep 2012). "Relative Information Loss in the PCA". Proc. IEEE Information Theory Workshop: 562–566.

[15] Linsker, Ralph (March 1988). "Self-organization in a perceptual network". IEEE Computer 21 (3): 105–117. doi:10.1109/2.36.

[16] Deco & Obradovic (1996). An Information-Theoretic Approach to Neural Computing. New York, NY: Springer.

[17] Plumbley, Mark (1991). "Information theory and unsupervised neural networks". Tech Note.

[18] Geiger, Bernhard; Kubin, Gernot (January 2013). "Signal Enhancement as Minimization of Relevant Information Loss". Proc. ITG Conf. on Systems, Communication and Coding.

[19] "Engineering Statistics Handbook Section 6.5.5.2". Retrieved 19 January 2015.

[20] A.A. Miranda, Y.-A. Le Borgne, and G. Bontempi. "New Routes from Minimal Approximation Error to Principal Components". Neural Processing Letters, Volume 27, Number 3, June 2008, Springer.

[21] eig function, Matlab documentation.

[22] MATLAB PCA-based Face recognition software.

[23] Eigenvalues function, Mathematica documentation.

[24] Roweis, Sam. "EM Algorithms for PCA and SPCA". Advances in Neural Information Processing Systems. Ed. Michael I. Jordan, Michael J. Kearns, and Sara A. Solla. The MIT Press, 1998.

[25] Geladi, Paul; Kowalski, Bruce (1986). "Partial Least Squares Regression: A Tutorial". Analytica Chimica Acta 185: 1–17. doi:10.1016/0003-2670(86)80028-9.

[26] Kramer, R. (1998). Chemometric Techniques for Quantitative Analysis. New York: CRC Press.

[27] Andrecut, M. (2009). "Parallel GPU Implementation of Iterative PCA Algorithms". Journal of Computational Biology 16 (11): 1593–1599. doi:10.1089/cmb.2008.0221.

[28] Warmuth, M. K.; Kuzmin, D. (2008). "Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension". Journal of Machine Learning Research 9: 2287–2320.

[29] Brenner, N., Bialek, W., & de Ruyter van Steveninck, R.R. (2000).

[30] H. Zha, C. Ding, M. Gu, X. He and H.D. Simon (Dec 2001). "Spectral Relaxation for K-means Clustering" (PDF). Neural Information Processing Systems vol. 14 (NIPS 2001) (Vancouver, Canada): 1057–1064.

[31] Chris Ding and Xiaofeng He (July 2004). "K-means Clustering via Principal Component Analysis" (PDF). Proc. of Int'l Conf. Machine Learning (ICML 2004): 225–232.

[32] Drineas, P.; A. Frieze; R. Kannan; S. Vempala; V. Vinay (2004). "Clustering large graphs via the singular value decomposition" (PDF). Machine Learning 56: 9–33. doi:10.1023/b:mach.0000033113.59016.96. Retrieved 2012-08-02.

[33] Cohen, M.; S. Elder; C. Musco; C. Musco; M. Persu (2014). "Dimensionality reduction for k-means clustering and low rank approximation (Appendix B)". ArXiv. Retrieved 2014-11-29.

[34] http://www.linkedin.com/groups/What-is-difference-between-factor-107833.S.162765950

[35] Timothy A. Brown. Confirmatory Factor Analysis for Applied Research. Methodology in the social sciences. Guilford Press, 2006.

[36] Benzécri, J.-P. (1973). L'Analyse des Données. Volume II. L'Analyse des Correspondances. Paris, France: Dunod.

[37] Greenacre, Michael (1983). Theory and Applications of Correspondence Analysis. London: Academic Press. ISBN 0-12-299050-1.

[38] Le Roux, Brigitte and Henry Rouanet (2004). Geometric Data Analysis, From Correspondence Analysis to Structured Data Analysis. Dordrecht: Kluwer.

[39] A. N. Gorban, A. Y. Zinovyev. "Principal Graphs and Manifolds". In: Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods and Techniques, Olivas E.S. et al. (Eds.). Information Science Reference, IGI Global: Hershey, PA, USA, 2009. 28–59.
[40] Wang, Y., Klijn, J.G., Zhang, Y., Sieuwerts, A.M., Look, M.P., Yang, F., Talantov, D., Timmermans, M., Meijer-van Gelder, M.E., Yu, J. et al. "Gene expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer". Lancet 365, 671–679 (2005); Data online.

[41] A. Zinovyev, ViDaExpert – Multidimensional Data Visualization Tool (free for non-commercial use). Institut Curie, Paris.

[42] A.N. Gorban, B. Kegl, D.C. Wunsch, A. Zinovyev (Eds.), Principal Manifolds for Data Visualisation and Dimension Reduction, LNCSE 58, Springer, Berlin, Heidelberg, New York, 2007. ISBN 978-3-540-73749-0.

[43] Lu, Haiping; Plataniotis, K.N.; Venetsanopoulos, A.N. (2011). "A Survey of Multilinear Subspace Learning for Tensor Data" (PDF). Pattern Recognition 44 (7): 1540–1551. doi:10.1016/j.patcog.2011.01.004.

[44] Kriegel, H. P.; Kröger, P.; Schubert, E.; Zimek, A. (2008). "A General Framework for Increasing the Robustness of PCA-Based Correlation Clustering Algorithms". Scientific and Statistical Database Management. Lecture Notes in Computer Science 5069: 418. doi:10.1007/978-3-540-69497-7_27. ISBN 978-3-540-69476-2.

[45] http://www.kovcomp.co.uk/support/XL-Tut/

[46] PrincipalComponents, Mathematica Documentation.

[47] The Numerical Algorithms Group. "NAG Library Routine Document: nagf_mv_prin_comp (g03aaf)" (PDF). NAG Library Manual, Mark 23. Retrieved 2012-02-16.

[48] The Numerical Algorithms Group. "NAG Library Routine Document: nag_mv_prin_comp (g03aac)" (PDF). NAG Library Manual, Mark 9. Retrieved 2012-02-16.

[49] PcaPress http://www.umetrics.com/products/simca

[50] PcaPress www.utdallas.edu

[51] Oracle documentation http://docs.oracle.com

[52] princomp octave.sourceforge.net

[53] princomp

[54] prcomp

[55] Multivariate cran.r-project.org

[62] Spectramap www.coloritto.com

[63] FinMath rtmath.net

[64] http://www.camo.com

[65] Computer Vision Library sourceforge.net

[66] "PCOMP (IDL Reference) | Exelis VIS Docs Center". IDL online documentation.

[67] javadoc weka.sourceforge.net

[68] Software for analyzing multivariate data with instant response using PCA, www.qlucore.com

[69] EIGENSOFT genepath.med.harvard.edu

[70] Partek Genomics Suite www.partek.com

[71] http://scikit-learn.org

[72]

[73] MultivariateStats.jl www.github.com

47.17 References

Jackson, J.E. (1991). A User's Guide to Principal Components (Wiley).

Jolliffe, I. T. (1986). Principal Component Analysis. Springer-Verlag. p. 487. doi:10.1007/b98835. ISBN 978-0-387-95442-4.

Jolliffe, I.T. (2002). Principal Component Analysis, second edition (Springer).

Husson François, Lê Sébastien & Pagès Jérôme (2009). Exploratory Multivariate Analysis by Example Using R. Chapman & Hall/CRC The R Series, London. 224 p. ISBN 978-2-7535-0938-2.

Pagès Jérôme (2014). Multiple Factor Analysis by Example Using R. Chapman & Hall/CRC The R Series, London. 272 p.

47.18 External links
Chapter 48

Dimensionality reduction

For dimensional reduction in physics, see Dimensional reduction.

In machine learning and statistics, dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration,[1] and can be divided into feature selection and feature extraction.[2]

48.1 Feature selection

Main article: Feature selection

Feature selection approaches try to find a subset of the original variables (also called features or attributes). Two strategies are filter (e.g. information gain) and wrapper (e.g. search guided by the accuracy) approaches. See also combinatorial optimization problems.

In some cases, data analysis such as regression or classification can be done in the reduced space more accurately than in the original space.

48.2 Feature extraction

Main article: Feature extraction

Feature extraction transforms the data in the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in principal component analysis (PCA), but many nonlinear dimensionality reduction techniques also exist.[3][4] For multidimensional data, tensor representation can be used in dimensionality reduction through multilinear subspace learning.[5]

The main linear technique for dimensionality reduction, principal component analysis, performs a linear mapping of the data to a lower-dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized. In practice, the correlation matrix of the data is constructed and the eigenvectors of this matrix are computed. The eigenvectors that correspond to the largest eigenvalues (the principal components) can now be used to reconstruct a large fraction of the variance of the original data. Moreover, the first few eigenvectors can often be interpreted in terms of the large-scale physical behavior of the system. The original space (with dimension of the number of points) has been reduced (with data loss, but hopefully retaining the most important variance) to the space spanned by a few eigenvectors.

Principal component analysis can be employed in a nonlinear way by means of the kernel trick. The resulting technique is capable of constructing nonlinear mappings that maximize the variance in the data, and is entitled kernel PCA. Other prominent nonlinear techniques include manifold learning techniques such as Isomap, locally linear embedding (LLE), Hessian LLE, Laplacian eigenmaps, and LTSA. These techniques construct a low-dimensional data representation using a cost function that retains local properties of the data, and can be viewed as defining a graph-based kernel for Kernel PCA. More recently, techniques have been proposed that, instead of defining a fixed kernel, try to learn the kernel using semidefinite programming. The most prominent example of such a technique is maximum variance unfolding (MVU). The central idea of MVU is to exactly preserve all pairwise distances between nearest neighbors (in the inner product space), while maximizing the distances between points that are not nearest neighbors. A dimensionality reduction technique that is sometimes used in neuroscience is maximally informative dimensions, which finds a lower-dimensional representation of a dataset such that as much information as possible about the original data is preserved.

An alternative approach to neighborhood preservation is through the minimization of a cost function that measures differences between distances in the input and output spaces. Important examples of such techniques include classical multidimensional scaling (which is identical to PCA), Isomap (which uses geodesic distances in the data space), diffusion maps (which uses diffusion distances in the data space), t-SNE (which minimizes the divergence between distributions over pairs of points), and curvilinear component analysis.

A different approach to nonlinear dimensionality reduction is through the use of autoencoders, a special kind of feed-forward neural network with a bottleneck hidden layer.
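A brief sketch of the linear-versus-kernel PCA contrast described above (assuming scikit-learn is available; the RBF kernel, its gamma value and the toy circular data are arbitrary illustrative choices):

    import numpy as np
    from sklearn.decomposition import PCA, KernelPCA

    # Toy data: points on a noisy circle, which no linear projection can unfold.
    rng = np.random.default_rng(0)
    theta = rng.uniform(0, 2 * np.pi, size=300)
    X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(300, 2))

    X_lin = PCA(n_components=1).fit_transform(X)                                  # linear mapping maximizing variance
    X_ker = KernelPCA(n_components=1, kernel="rbf", gamma=2.0).fit_transform(X)   # PCA via the kernel trick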
Multifactor dimensionality reduction

Multilinear subspace learning

Multilinear PCA

Singular value decomposition

Latent semantic analysis

Semantic mapping

Topological data analysis

Elastic maps

[10] Shasha, D. (2004). High Performance Discovery in Time Series. Berlin: Springer. ISBN 0-387-00857-8.

48.6 References

Fodor, I. (2002). "A survey of dimension reduction techniques". Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Technical Report UCRL-ID-148494.
Chapter 49

Greedy algorithm

49.1 Specifics

In general, greedy algorithms have five components:

1. A candidate set, from which a solution is created
2. A selection function, which chooses the best candidate to be added to the solution
3. A feasibility function, that is used to determine if a candidate can be used to contribute to a solution
4. An objective function, which assigns a value to a solution, or a partial solution, and
5. A solution function, which will indicate when we have discovered a complete solution
49.1.1 Cases of failure

Examples of how a greedy algorithm may fail to achieve the optimal solution:

(Figure: starting at A, a greedy algorithm will find the local maximum at m, oblivious of the global maximum at M.)

(Figure: with a goal of reaching the largest sum, at each step the greedy algorithm will choose what appears to be the optimal immediate choice, so it will choose 12 instead of 3 at the second step, and will not reach the best solution, which contains 99.)

For many other problems, greedy algorithms fail to produce the optimal solution, and may even produce the unique worst possible solution. One example is the traveling salesman problem mentioned above: for each number of cities, there is an assignment of distances between the cities for which the nearest neighbor heuristic produces the unique worst possible tour.[3]

49.2 Types

Greedy algorithms can be characterized as being 'short sighted', and also as 'non-recoverable'. They are ideal only for problems which have 'optimal substructure'. Despite this, for many simple problems (e.g. giving change), the best suited algorithms are greedy algorithms. It is important, however, to note that the greedy algorithm can be used as a selection algorithm to prioritize options within a search or branch and bound algorithm. There are a few variations to the greedy algorithm.

49.3 Applications

Greedy algorithms mostly (but not always) fail to find the globally optimal solution, because they usually do not operate exhaustively on all the data. They can make commitments to certain choices too early which prevent them from finding the best overall solution later. For example, all known greedy coloring algorithms for the graph coloring problem and all other NP-complete problems do not consistently find optimum solutions. Nevertheless, they are useful because they are quick to think up and often give good approximations to the optimum.

If a greedy algorithm can be proven to yield the global optimum for a given problem class, it typically becomes the method of choice because it is faster than other optimization methods like dynamic programming. Examples of such greedy algorithms are Kruskal's algorithm and Prim's algorithm for finding minimum spanning trees, and the algorithm for finding optimum Huffman trees. The theory of matroids, and the more general theory of greedoids, provide whole classes of such algorithms.

Greedy algorithms appear in network routing as well. Using greedy routing, a message is forwarded to the neighboring node which is closest to the destination. The notion of a node's location (and hence closeness) may be determined by its physical location, as in geographic routing used by ad hoc networks. Location may also be an entirely artificial construct as in small world routing and distributed hash tables.

49.4 Examples

The activity selection problem is characteristic of this class of problems, where the goal is to pick the maximum number of activities that do not clash with each other (a sketch of the greedy selection follows below).

In the Macintosh computer game Crystal Quest the objective is to collect crystals, in a fashion similar to the travelling salesman problem. The game has a demo mode, where the game uses a greedy algorithm to go to every crystal. The artificial intelligence does not account for obstacles, so the demo mode often ends quickly.
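A minimal sketch of the greedy rule commonly used for activity selection (sort by finish time, then repeatedly take the first activity that starts after the last one chosen); the example intervals are made up for illustration:

    def select_activities(activities):
        """Greedy activity selection: activities is a list of (start, finish) pairs."""
        chosen = []
        last_finish = float("-inf")
        for start, finish in sorted(activities, key=lambda a: a[1]):  # earliest finish first
            if start >= last_finish:        # compatible with everything chosen so far
                chosen.append((start, finish))
                last_finish = finish
        return chosen

    # usage
    print(select_activities([(1, 4), (3, 5), (0, 6), (5, 7), (3, 9), (5, 9), (6, 10), (8, 11)]))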
49.6 Notes
[1] Black, Paul E. (2 February 2005). greedy algorithm.
Dictionary of Algorithms and Data Structures. U.S. Na-
tional Institute of Standards and Technology (NIST). Re-
trieved 17 August 2012.
[2] Introduction to Algorithms (Cormen, Leiserson, Rivest,
and Stein) 2001, Chapter 16 Greedy Algorithms.
[3] (G. Gutin, A. Yeo and A. Zverovich, 2002)
49.7 References
Introduction to Algorithms (Cormen, Leiserson, and
Rivest) 1990, Chapter 17 Greedy Algorithms p.
329.
Introduction to Algorithms (Cormen, Leiserson,
Rivest, and Stein) 2001, Chapter 16 Greedy Algo-
rithms.
G. Gutin, A. Yeo and A. Zverovich, "Traveling salesman should not be greedy: domination analysis of greedy-type heuristics for the TSP". Discrete Applied Mathematics 117 (2002), 81–86.

J. Bang-Jensen, G. Gutin and A. Yeo, "When the greedy algorithm fails". Discrete Optimization 1 (2004), 121–127.

G. Bendall and F. Margot, "Greedy Type Resistance of Combinatorial Problems". Discrete Optimization 3 (2006), 288–298.
Chapter 50
Reinforcement learning
The brute force approach entails the following two steps: for each possible policy, sample returns while following it, and then choose the policy with the largest expected return. The action-value of a policy π, $Q^{\pi}(s, a) = E[R \mid s, a, \pi]$, can be estimated by averaging the sampled returns that
originated from (s, a) over time. Given enough time, this procedure can thus construct a precise estimate Q of the action-value function Q^π. This finishes the description of the policy evaluation step. In the policy improvement step, as it is done in the standard policy iteration algorithm, the next policy is obtained by computing a greedy policy with respect to Q: given a state s, this new policy returns an action that maximizes Q(s, ·). In practice one often avoids computing and storing the new policy, but uses lazy evaluation to defer the computation of the maximizing actions to when they are actually needed.

A few problems with this procedure are as follows:

The procedure may waste too much time on evaluating a suboptimal policy;
It uses samples inefficiently in that a long trajectory is used to improve the estimate only of the single state-action pair that started the trajectory;
When the returns along the trajectories have high variance, convergence will be slow;
It works in episodic problems only;
It works in small, finite MDPs only.

Temporal difference methods

The first issue is easily corrected by allowing the procedure to change the policy (at all, or at some states) before the values settle. However good this sounds, this may be dangerous as it might prevent convergence. Still, most current algorithms implement this idea, giving rise to the class of generalized policy iteration algorithms. We note in passing that actor-critic methods belong to this category.

The second issue can be corrected within the algorithm by allowing trajectories to contribute to any state-action pair in them. This may also help to some extent with the third problem, although a better solution when returns have high variance is to use Sutton's temporal difference (TD) methods, which are based on the recursive Bellman equation. Note that the computation in TD methods can be incremental (when after each transition the memory is changed and the transition is thrown away), or batch (when the transitions are collected and then the estimates are computed once based on a large number of transitions). Batch methods, a prime example of which is the least-squares temporal difference method due to Bradtke and Barto (1996), may use the information in the samples better, whereas incremental methods are the only choice when batch methods become infeasible due to their high computational or memory complexity. In addition, there exist methods that try to unify the advantages of the two approaches. Methods based on temporal differences also overcome the second-to-last issue.

In order to address the last issue mentioned in the previous section, function approximation methods are used. In linear function approximation one starts with a mapping φ that assigns a finite-dimensional vector to each state-action pair. Then, the action values of a state-action pair (s, a) are obtained by linearly combining the components of φ(s, a) with some weights θ:

$$ Q(s, a) = \sum_{i=1}^{d} \theta_i \varphi_i(s, a) $$

The algorithms then adjust the weights, instead of adjusting the values associated with the individual state-action pairs. However, linear function approximation is not the only choice. More recently, methods based on ideas from nonparametric statistics (which can be seen to construct their own features) have been explored.

So far, the discussion was restricted to how policy iteration can be used as a basis for designing reinforcement learning algorithms. Equally importantly, value iteration can also be used as a starting point, giving rise to the Q-learning algorithm (Watkins 1989) and its many variants.

The problem with methods that use action-values is that they may need highly precise estimates of the competing action values, which can be hard to obtain when the returns are noisy. Though this problem is mitigated to some extent by temporal difference methods and if one uses the so-called compatible function approximation method, more work remains to be done to increase generality and efficiency. Another problem specific to temporal difference methods comes from their reliance on the recursive Bellman equation. Most temporal difference methods have a so-called λ parameter (0 ≤ λ ≤ 1) that allows one to continuously interpolate between Monte Carlo methods (which do not rely on the Bellman equations) and the basic temporal difference methods (which rely entirely on the Bellman equations), which can thus be effective in palliating this issue.
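A toy sketch of the linear function approximation idea above combined with a Q-learning-style update (the environment interface, the one-hot feature map, the step size and the discount factor are all assumptions made for the example, not something prescribed by the text):

    import numpy as np

    def linear_q(phi, s, a, theta):
        """Q(s, a) = sum_i theta_i * phi_i(s, a)."""
        return float(np.dot(theta, phi(s, a)))

    def q_learning_step(phi, theta, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
        """One semi-gradient Q-learning update of the weight vector theta."""
        target = r + gamma * max(linear_q(phi, s_next, b, theta) for b in actions)
        td_error = target - linear_q(phi, s, a, theta)
        return theta + alpha * td_error * phi(s, a)     # adjust weights, not a table of values

    # assumed one-hot feature map for a tiny MDP with 3 states and 2 actions
    def phi(s, a, n_states=3, n_actions=2):
        v = np.zeros(n_states * n_actions)
        v[s * n_actions + a] = 1.0
        return v

    theta = np.zeros(6)
    theta = q_learning_step(phi, theta, s=0, a=1, r=1.0, s_next=2, actions=[0, 1])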
50.3.4 Direct policy search

An alternative method to find a good policy is to search directly in (some subset of) the policy space, in which case the problem becomes an instance of stochastic optimization. The two approaches available are gradient-based and gradient-free methods.

Gradient-based methods (giving rise to the so-called policy gradient methods) start with a mapping from a finite-dimensional (parameter) space to the space of policies: given the parameter vector θ, let π_θ denote the policy associated to θ. Define the performance function by

$$ \rho(\theta) = \rho^{\pi_\theta}. $$

Under mild conditions this function will be differentiable as a function of the parameter vector θ. If the gradient of ρ was known, one could use gradient ascent. Since an analytic expression for the gradient is not available, one must rely on a noisy estimate. Such an estimate can be constructed in many ways, giving rise to algorithms like Williams' REINFORCE method (which is also known as the likelihood ratio method in the simulation-based optimization literature). Policy gradient methods have received a lot of attention in the last couple of years (e.g., Peters et al. (2003)), but they remain an active field. An overview of policy search methods in the context of robotics has been given by Deisenroth, Neumann and Peters.[2] The issue with many of these methods is that they may get stuck in local optima (as they are based on local search).

A large class of methods avoids relying on gradient information. These include simulated annealing, cross-entropy search or methods of evolutionary computation. Many gradient-free methods can achieve (in theory and in the limit) a global optimum. In a number of cases they have indeed demonstrated remarkable performance.

The issue with policy search methods is that they may converge slowly if the information based on which they act is noisy. For example, this happens when in episodic problems the trajectories are long and the variance of the returns is large. As argued beforehand, value-function based methods that rely on temporal differences might help in this case. In recent years, several actor-critic algorithms have been proposed following this idea and were demonstrated to perform well in various problems.

50.5 Current research

Current research topics include: adaptive methods which work with fewer (or no) parameters under a large number of conditions, addressing the exploration problem in large MDPs, large-scale empirical evaluations, learning and acting under partial information (e.g., using Predictive State Representation), modular and hierarchical reinforcement learning, improving existing value-function and policy search methods, algorithms that work well with large (or continuous) action spaces, transfer learning, lifelong learning, and efficient sample-based planning (e.g., based on Monte Carlo tree search). Multiagent or distributed reinforcement learning is also a topic of interest in current research. There is also a growing interest in real-life applications of reinforcement learning.

Reinforcement learning algorithms such as TD learning are also being investigated as a model for dopamine-based learning in the brain. In this model, the dopaminergic projections from the substantia nigra to the basal ganglia function as the prediction error. Reinforcement learning has also been used as a part of the model for human skill learning, especially in relation to the interaction between implicit and explicit learning in skill acquisition (the first publication on this application was in 1995-1996, and there have been many follow-up studies). See http://webdocs.cs.ualberta.ca/~sutton/RL-FAQ.html#behaviorism for further details of these research areas.
50.6 Literature
Bertsekas, Dimitri P. (August 2010). Chapter The Reinforcement Learning Toolbox from the
6 (online): Approximate Dynamic Programming. (Graz University of Technology)
Dynamic Programming and Optimal Control (PDF)
II (3 ed.). Hybrid reinforcement learning
Chapter 51

Decision tree learning

51.1 General

Decision tree learning is a method commonly used in data mining.[1] The goal is to create a model that predicts the value of a target variable based on several input variables. An example is shown on the right. Each interior node corresponds to one of the input variables; there are edges to children for each of the possible values of that input variable. Each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to the leaf.

A decision tree is a simple representation for classifying examples. Decision tree learning is one of the most successful techniques for supervised classification learning. For this section, assume that all of the features have finite discrete domains, and there is a single target feature called the classification. Each element of the domain of the classification is called a class. A decision tree or a classification tree is a tree in which each internal (non-leaf) node is labeled with an input feature. The arcs coming from a node labeled with a feature are labeled with each of the possible values of the feature. Each leaf of the tree is labeled with a class or a probability distribution over the classes.

A tree can be learned by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when the subset at a node has all the same value of the target variable, or when splitting no longer adds value to the predictions. This process of top-down induction of decision trees (TDIDT)[2] is an example of a greedy algorithm, and it is by far the most common strategy for learning decision trees from data.

In data mining, decision trees can be described also as the combination of mathematical and computational techniques to aid the description, categorisation and generalisation of a given set of data.

Data comes in records of the form:

$$ (\mathbf{x}, Y) = (x_1, x_2, x_3, \ldots, x_k, Y) $$
Boosted trees can be used for regression-type and classification-type problems.[5][6]

Rotation forest – in which every decision tree is trained by first applying principal component analysis (PCA) on a random subset of the input features.[7]

Decision tree learning is the construction of a decision tree from class-labeled training tuples. A decision tree is a flow-chart-like structure, where each internal (non-leaf) node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf (or terminal) node holds a class label. The topmost node in a tree is the root node.

There are many specific decision-tree algorithms. Notable ones include ID3 (Iterative Dichotomiser 3), C4.5 and CART (classification and regression tree).

51.3 Metrics

51.3.1 Gini impurity

Not to be confused with Gini coefficient.

Used by the CART (classification and regression tree) algorithm, Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset. Gini impurity can be computed by summing the probability of each item being chosen times the probability of a mistake in categorizing that item. It reaches its minimum (zero) when all cases in the node fall into a single target category.

To compute Gini impurity for a set of items, suppose i ∈ {1, 2, ..., m}, and let f_i be the fraction of items labeled with value i in the set.

$$ I_G(f) = \sum_{i=1}^{m} f_i (1 - f_i) = \sum_{i=1}^{m} (f_i - f_i^2) = \sum_{i=1}^{m} f_i - \sum_{i=1}^{m} f_i^2 = 1 - \sum_{i=1}^{m} f_i^2 $$
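A small sketch of the Gini impurity formula above (the example label lists are made up):

    from collections import Counter

    def gini_impurity(labels):
        """I_G(f) = 1 - sum_i f_i^2 over the label fractions f_i."""
        n = len(labels)
        return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

    print(gini_impurity(["yes", "yes", "no", "no"]))   # 0.5, maximally mixed for two classes
    print(gini_impurity(["yes", "yes", "yes"]))        # 0.0, a pure node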
51.3.2 Information gain

Used by the ID3, C4.5 and C5.0 tree-generation algorithms. Information gain is based on the concept of entropy from information theory.

$$ I_E(f) = -\sum_{i=1}^{m} f_i \log_2 f_i $$

51.3.3 Variance reduction

51.4 Decision tree advantages

Simple to understand and interpret. People are able to understand decision tree models after a brief explanation.

Requires little data preparation. Other techniques often require data normalisation, dummy variables need to be created and blank values to be removed.

Able to handle both numerical and categorical data. Other techniques are usually specialised in analysing datasets that have only one type of variable. (For example, relation rules can be used only with nominal variables, while neural networks can be used only with numerical variables.)

Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily explained by boolean logic. (An example of a black box model is an artificial neural network, since the explanation for the results is difficult to understand.)

Robust. Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.

Performs well with large datasets. Large amounts of data can be analysed using standard computing resources in reasonable time.

51.5 Limitations

There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems. In such cases, the decision tree becomes prohibitively large. Approaches to solve the problem involve either changing the representation of the problem domain (known as propositionalisation)[16] or using learning algorithms based on more expressive representations (such as statistical relational learning or inductive logic programming).

For data including categorical variables with different numbers of levels, information gain in decision trees is biased in favor of those attributes with more levels.[17] However, the issue of biased predictor selection is avoided by the Conditional Inference approach.[9]

51.6 Extensions
51.6.1 Decision graphs

In a decision tree, all paths from the root node to the leaf node proceed by way of conjunction, or AND. In a decision graph, it is possible to use disjunctions (ORs) to join two or more paths together using Minimum Message Length (MML).[18] Decision graphs have been further extended to allow for previously unstated new attributes to be learnt dynamically and used at different places within the graph.[19] The more general coding scheme results in better predictive accuracy and log-loss probabilistic scoring. In general, decision graphs infer models with fewer leaves than decision trees.

51.6.2 Alternative search methods

Evolutionary algorithms have been used to avoid locally optimal decisions and search the decision tree space with little a priori bias.[20][21]

It is also possible for a tree to be sampled using MCMC.[22]

The tree can be searched for in a bottom-up fashion.[23]

51.7 See also

C4.5 algorithm

Decision stump

Incremental decision tree

Alternating decision tree

Structured data analysis (statistics)

51.8 Implementations

Many data mining software packages provide implementations of one or more decision tree algorithms. Several examples include Salford Systems CART (which licensed the proprietary code of the original CART authors[3]), IBM SPSS Modeler, RapidMiner, SAS Enterprise Miner, Matlab, R (an open-source software environment for statistical computing which includes several CART implementations such as the rpart, party and randomForest packages), Weka (a free and open-source data mining suite that contains many decision tree algorithms), Orange (a free data mining software suite which includes the tree module orngTree), KNIME, Microsoft SQL Server, and scikit-learn (a free and open-source machine learning library for the Python programming language).

51.9 References

[1] Rokach, Lior; Maimon, O. (2008). Data mining with decision trees: theory and applications. World Scientific Pub Co Inc. ISBN 978-9812771711.

[2] Quinlan, J. R. (1986). "Induction of Decision Trees". Machine Learning 1: 81–106, Kluwer Academic Publishers.

[3] Breiman, Leo; Friedman, J. H.; Olshen, R. A.; Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software. ISBN 978-0-412-04841-8.

[4] Breiman, L. (1996). "Bagging Predictors". Machine Learning 24: 123–140.

[5] Friedman, J. H. (1999). Stochastic gradient boosting. Stanford University.

[9] Hothorn, T.; Hornik, K.; Zeileis, A. (2006). "Unbiased Recursive Partitioning: A Conditional Inference Framework". Journal of Computational and Graphical Statistics 15 (3): 651–674. doi:10.1198/106186006X133933. JSTOR 27594202.

[10] Strobl, C.; Malley, J.; Tutz, G. (2009). "An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests". Psychological Methods 14 (4): 323–348. doi:10.1037/a0016973.

[11] Rokach, L.; Maimon, O. (2005). "Top-down induction of decision trees classifiers – a survey". IEEE Transactions on Systems, Man, and Cybernetics, Part C 35 (4): 476–487. doi:10.1109/TSMCC.2004.843247.

[12] Hyafil, Laurent; Rivest, RL (1976). "Constructing Optimal Binary Decision Trees is NP-complete". Information Processing Letters 5 (1): 15–17. doi:10.1016/0020-0190(76)90095-8.

[13] Murthy S. (1998). "Automatic construction of decision trees from data: A multidisciplinary survey". Data Mining and Knowledge Discovery.
Chapter 53
Ensemble learning
For an alternative meaning, see variational Bayesian methods.

In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms.[1][2][3] Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble refers only to a concrete finite set of alternative models, but typically allows for much more flexible structure to exist among those alternatives.

53.1 Overview

Supervised learning algorithms are commonly described as performing the task of searching through a hypothesis space to find a suitable hypothesis that will make good predictions for a particular problem. Even if the hypothesis space contains hypotheses that are very well-suited for a particular problem, it may be very difficult to find a good one. Ensembles combine multiple hypotheses to form a (hopefully) better hypothesis. The term ensemble is usually reserved for methods that generate multiple hypotheses using the same base learner. The broader term of multiple classifier systems also covers hybridization of hypotheses that are not induced by the same base learner.

Evaluating the prediction of an ensemble typically requires more computation than evaluating the prediction of a single model, so ensembles may be thought of as a way to compensate for poor learning algorithms by performing a lot of extra computation. Fast algorithms such as decision trees are commonly used with ensembles (for example Random Forest), although slower algorithms can benefit from ensemble techniques as well.

53.2 Ensemble theory

An ensemble is itself a supervised learning algorithm, because it can be trained and then used to make predictions. The trained ensemble, therefore, represents a single hypothesis. This hypothesis, however, is not necessarily contained within the hypothesis space of the models from which it is built. Thus, ensembles can be shown to have more flexibility in the functions they can represent. This flexibility can, in theory, enable them to over-fit the training data more than a single model would, but in practice, some ensemble techniques (especially bagging) tend to reduce problems related to over-fitting of the training data.

Empirically, ensembles tend to yield better results when there is a significant diversity among the models.[4][5] Many ensemble methods, therefore, seek to promote diversity among the models they combine.[6][7] Although perhaps non-intuitive, more random algorithms (like random decision trees) can be used to produce a stronger ensemble than very deliberate algorithms (like entropy-reducing decision trees).[8] Using a variety of strong learning algorithms, however, has been shown to be more effective than using techniques that attempt to dumb-down the models in order to promote diversity.[9]

53.3 Common types of ensembles

53.3.1 Bayes optimal classifier

The Bayes Optimal Classifier is a classification technique. It is an ensemble of all the hypotheses in the hypothesis space. On average, no other ensemble can outperform it.[10] Each hypothesis is given a vote proportional to the likelihood that the training dataset would be sampled from a system if that hypothesis were true. To facilitate training data of finite size, the vote of each hypothesis is also multiplied by the prior probability of that hypothesis. The Bayes Optimal Classifier can be expressed with the following equation:

$$ y = \underset{c_j \in C}{\operatorname{argmax}} \sum_{h_i \in H} P(c_j \mid h_i)\, P(T \mid h_i)\, P(h_i) $$

where y is the predicted class, C is the set of all possible classes, H is the hypothesis space, P refers to a probability, and T is the training data. As an ensemble, the Bayes
Optimal Classier represents a hypothesis that is not nec- and combining them using Bayes law.[14] Unlike the
essarily in H . The hypothesis represented by the Bayes Bayes optimal classier, Bayesian model averaging can be
Optimal Classier, however, is the optimal hypothesis in practically implemented. Hypotheses are typically sam-
ensemble space (the space of all possible ensembles con- pled using a Monte Carlo sampling technique such as
sisting only of hypotheses in H ). MCMC. For example, Gibbs sampling may be used to
Unfortunately, Bayes Optimal Classier cannot be prac- draw hypotheses that are representative of the distribu-
tically implemented for any but the most simple of prob- tion P (T |H) . It has been shown that under certain cir-
lems. There are several reasons why the Bayes Optimal cumstances, when hypotheses are drawn in this manner
and averaged according to Bayes law, this technique has
Classier cannot be practically implemented:
an expected error that is bounded to be at most twice the
expected error of the Bayes optimal classier.[15] Despite
1. Most interesting hypothesis spaces are too large to
the theoretical correctness of this technique, it has been
iterate over, as required by the argmax .
found to promote over-tting and to perform worse, em-
2. Many hypotheses yield only a predicted class, rather pirically, compared to simpler ensemble techniques such
than a probability for each class as required by the as bagging;[16] however, these conclusions appear to be
term P (cj |hi ) . based on a misunderstanding of the purpose of Bayesian
model averaging vs. model combination.[17]
3. Computing an unbiased estimate of the probability
of the training set given a hypothesis ( P (T |hi ) ) is
non-trivial.
53.3.5 Bayesian model combination
4. Estimating the prior probability for each hypothesis
( P (hi ) ) is rarely feasible. Bayesian model combination (BMC) is an algorithmic
correction to BMA. Instead of sampling each model in
the ensemble individually, it samples from the space of
53.3.2 Bootstrap aggregating (bagging) possible ensembles (with model weightings drawn ran-
domly from a Dirichlet distribution having uniform pa-
Main article: Bootstrap aggregating rameters). This modication overcomes the tendency of
BMA to converge toward giving all of the weight to a
Bootstrap aggregating, often abbreviated as bagging, in- single model. Although BMC is somewhat more compu-
volves having each model in the ensemble vote with tationally expensive than BMA, it tends to yield dramat-
equal weight. In order to promote model variance, bag- ically better results. The results from BMC have been
ging trains each model in the ensemble using a ran- shown to be better on average[18] (with statistical signi-
domly drawn subset of the training set. As an exam- cance) than BMA, and bagging.
ple, the random forest algorithm combines random de- The use of Bayes law to compute model weights neces-
cision trees with bagging to achieve very high classica- sitates computing the probability of the data given each
tion accuracy.[11] An interesting application of bagging in model. Typically, none of the models in the ensemble are
unsupervised learning is provided here.[12][13] exactly the distribution from which the training data were
generated, so all of them correctly receive a value close
to zero for this term. This would work well if the ensem-
53.3.3 Boosting

Main article: Boosting (meta-algorithm)

Boosting involves incrementally building an ensemble by training each new model instance to emphasize the training instances that previous models misclassified. In some cases, boosting has been shown to yield better accuracy than bagging, but it also tends to be more likely to overfit the training data. By far the most common implementation of boosting is AdaBoost, although some newer algorithms are reported to achieve better results.
53.3.4 Bayesian model averaging

Bayesian model averaging (BMA) is an ensemble technique that seeks to approximate the Bayes Optimal Classifier by sampling hypotheses from the hypothesis space and combining them using Bayes' law.[14] Unlike the Bayes Optimal Classifier, Bayesian model averaging can be practically implemented. Hypotheses are typically sampled using a Monte Carlo sampling technique such as MCMC. For example, Gibbs sampling may be used to draw hypotheses that are representative of the distribution P(T | H). It has been shown that under certain circumstances, when hypotheses are drawn in this manner and averaged according to Bayes' law, this technique has an expected error that is bounded to be at most twice the expected error of the Bayes optimal classifier.[15] Despite the theoretical correctness of this technique, it has been found to promote over-fitting and to perform worse, empirically, than simpler ensemble techniques such as bagging;[16] however, these conclusions appear to be based on a misunderstanding of the purpose of Bayesian model averaging vs. model combination.[17]

53.3.5 Bayesian model combination

Bayesian model combination (BMC) is an algorithmic correction to BMA. Instead of sampling each model in the ensemble individually, it samples from the space of possible ensembles (with model weightings drawn randomly from a Dirichlet distribution having uniform parameters). This modification overcomes the tendency of BMA to converge toward giving all of the weight to a single model. Although BMC is somewhat more computationally expensive than BMA, it tends to yield dramatically better results. The results from BMC have been shown to be better on average[18] (with statistical significance) than BMA and bagging.

The use of Bayes' law to compute model weights necessitates computing the probability of the data given each model. Typically, none of the models in the ensemble are exactly the distribution from which the training data were generated, so all of them correctly receive a value close to zero for this term. This would work well if the ensemble were big enough to sample the entire model space, but that is rarely possible. Consequently, each pattern in the training data will cause the ensemble weight to shift toward the model in the ensemble that is closest to the distribution of the training data. It essentially reduces to an unnecessarily complex method for doing model selection.

The possible weightings for an ensemble can be visualized as lying on a simplex. At each vertex of the simplex, all of the weight is given to a single model in the ensemble. BMA converges toward the vertex that is closest to the distribution of the training data. By contrast, BMC converges toward the point where this distribution projects onto the simplex. In other words, instead of selecting the one model that is closest to the generating distribution, it seeks the combination of models that is closest to the generating distribution.
The results from BMA can often be approximated by using cross-validation to select the best model from a bucket of models. Likewise, the results from BMC may be approximated by using cross-validation to select the best ensemble combination from a random sampling of possible weightings.
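The weight-sampling idea behind BMC can be sketched as follows. This is only an illustration under assumed inputs (per-model class-probability predictions on a held-out set with integer labels), not a full BMC implementation: ensemble weightings are drawn from a uniform Dirichlet distribution, each weighted combination is scored on the held-out data, and the weightings are then averaged according to their normalized likelihoods.

    import numpy as np

    def bmc_weights(proba_per_model, y_holdout, n_draws=1000, seed=0):
        # proba_per_model: array (n_models, n_samples, n_classes) of predicted
        # class probabilities from already-trained models (assumed input).
        # Returns a posterior-averaged weighting over the models.
        rng = np.random.default_rng(seed)
        n_models = proba_per_model.shape[0]
        draws = rng.dirichlet(np.ones(n_models), size=n_draws)   # points on the simplex
        log_liks = np.empty(n_draws)
        for k, w in enumerate(draws):
            mixed = np.tensordot(w, proba_per_model, axes=1)     # (n_samples, n_classes)
            p_true = mixed[np.arange(len(y_holdout)), y_holdout]
            log_liks[k] = np.log(p_true + 1e-12).sum()
        post = np.exp(log_liks - log_liks.max())
        post /= post.sum()
        return post @ draws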
53.3.6 Bucket of models

A bucket of models is an ensemble in which a model selection algorithm is used to choose the best model for each problem. When tested on only one problem, a bucket of models can produce no better results than the best model in the set, but when evaluated across many problems, it will typically produce much better results, on average, than any model in the set.

The most common approach used for model selection is cross-validation selection (sometimes called a "bake-off contest"). It is described by the following pseudo-code:

    For each model m in the bucket:
        Do c times (where 'c' is some constant):
            Randomly divide the training dataset into two datasets, A and B
            Train m with A
            Test m with B
    Select the model that obtains the highest average score
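A direct translation of that pseudo-code into Python might look like the following sketch; the candidate models are assumed to expose scikit-learn-style fit and score methods, and the split is a simple 50/50 random partition.

    import numpy as np

    def cross_validation_selection(models, X, y, c=5, seed=0):
        # Return the model with the best average held-out score ("bake-off").
        rng = np.random.default_rng(seed)
        n = len(X)
        avg_scores = []
        for m in models:
            scores = []
            for _ in range(c):
                idx = rng.permutation(n)
                a, b = idx[: n // 2], idx[n // 2 :]   # random split into A and B
                m.fit(X[a], y[a])                     # train m with A
                scores.append(m.score(X[b], y[b]))    # test m with B
            avg_scores.append(np.mean(scores))
        return models[int(np.argmax(avg_scores))]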
Cross-validation selection can be summed up as: try them all with the training set, and pick the one that works best.[19]

Gating is a generalization of cross-validation selection. It involves training another learning model to decide which of the models in the bucket is best suited to solve the problem. Often, a perceptron is used for the gating model. It can be used to pick the best model, or it can be used to give a linear weight to the predictions from each model in the bucket.

When a bucket of models is used with a large set of problems, it may be desirable to avoid training some of the models that take a long time to train. Landmark learning is a meta-learning approach that seeks to solve this problem. It involves training only the fast (but imprecise) algorithms in the bucket, and then using the performance of these algorithms to help determine which slow (but accurate) algorithm is most likely to do best.[20]
53.3.7 Stacking

Stacking (sometimes called stacked generalization) involves training a learning algorithm to combine the predictions of several other learning algorithms. First, all of the other algorithms are trained using the available data, then a combiner algorithm is trained to make a final prediction using all the predictions of the other algorithms as additional inputs. If an arbitrary combiner algorithm is used, then stacking can theoretically represent any of the ensemble techniques described in this article, although in practice a single-layer logistic regression model is often used as the combiner.

Stacking typically yields performance better than any single one of the trained models.[21] It has been successfully used on both supervised learning tasks (regression,[22] classification and distance learning[23]) and unsupervised learning (density estimation).[24] It has also been used to estimate bagging's error rate.[3][25] It has been reported to out-perform Bayesian model averaging.[26] The two top performers in the Netflix competition utilized blending, which may be considered to be a form of stacking.[27]
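The following is a minimal stacking sketch in the spirit of the description above; it assumes scikit-learn-style base learners and, as the article notes is common, uses a logistic-regression model as the combiner. For brevity it trains the combiner on in-sample predictions; in practice out-of-fold predictions would normally be used to avoid leakage.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def stack_fit(base_models, X, y):
        for m in base_models:                          # 1. train the level-0 learners
            m.fit(X, y)
        meta = np.column_stack([m.predict(X) for m in base_models])
        combiner = LogisticRegression().fit(meta, y)   # 2. train the combiner
        return base_models, combiner

    def stack_predict(base_models, combiner, X):
        meta = np.column_stack([m.predict(X) for m in base_models])
        return combiner.predict(meta)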
53.4 References

[1] Opitz, D.; Maclin, R. (1999). "Popular ensemble methods: An empirical study". Journal of Artificial Intelligence Research 11: 169–198. doi:10.1613/jair.614.

[2] Polikar, R. (2006). "Ensemble based systems in decision making". IEEE Circuits and Systems Magazine 6 (3): 21–45. doi:10.1109/MCAS.2006.1688199.

[3] Rokach, L. (2010). "Ensemble-based classifiers". Artificial Intelligence Review 33 (1–2): 1–39. doi:10.1007/s10462-009-9124-7.

[4] Kuncheva, L. and Whitaker, C., Measures of diversity in classifier ensembles, Machine Learning, 51, pp. 181–207, 2003.

[5] Sollich, P. and Krogh, A., Learning with ensembles: How overfitting can be useful, Advances in Neural Information Processing Systems, volume 8, pp. 190–196, 1996.

[6] Brown, G. and Wyatt, J. and Harris, R. and Yao, X., Diversity creation methods: a survey and categorisation, Information Fusion, 6(1), pp. 5–20, 2005.

[7] García Adeva, J. J., Cerviño, Ulises, and Calvo, R., Accuracy and Diversity in Ensembles of Text Categorisers, CLEI Journal, Vol. 8, No. 2, pp. 1–12, December 2005.

[8] Ho, T., Random Decision Forests, Proceedings of the Third International Conference on Document Analysis and Recognition, pp. 278–282, 1995.

[9] Gashler, M. and Giraud-Carrier, C. and Martinez, T., Decision Tree Ensemble: Small Heterogeneous Is Better Than Large Homogeneous, The Seventh International Conference on Machine Learning and Applications, 2008, pp. 900–905, doi:10.1109/ICMLA.2008.154.

[10] Tom M. Mitchell, Machine Learning, 1997, p. 175.

[11] Breiman, L., Bagging Predictors, Machine Learning, 24(2), pp. 123–140, 1996.

[12] Sahu, A., Runger, G., Apley, D., Image denoising with a multi-phase kernel principal component approach and an ensemble version, IEEE Applied Imagery Pattern Recognition Workshop, pp. 1–7, 2011.

[13] Shinde, Amit, Anshuman Sahu, Daniel Apley, and George Runger, Preimages for Variation Patterns from Kernel PCA and Bagging, IIE Transactions, Vol. 46, Iss. 5, 2014.

[14] Hoeting, J. A.; Madigan, D.; Raftery, A. E.; Volinsky, C. T. (1999). "Bayesian Model Averaging: A Tutorial". Statistical Science 14 (4): 382–401. doi:10.2307/2676803. JSTOR 2676803.

[27] Sill, J. and Takacs, G. and Mackey, L. and Lin, D., Feature-Weighted Linear Stacking, 2009, arXiv:0911.0460.

53.5 Further reading

Robert Schapire; Yoav Freund (2012). Boosting: Foundations and Algorithms. MIT Press. ISBN 978-0-262-01718-3.
Random forest

This article is about the machine learning technique. For other kinds of random tree, see Random tree (disambiguation).

Random forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random forests correct for decision trees' habit of overfitting to their training set.

54.1 History

The early development of random forests was influenced by the work of Amit and Geman,[5] who introduced the idea of searching over a random subset of the available decisions when splitting a node, in the context of growing a single tree. The idea of random subspace selection from Ho[4] was also influential in the design of random forests. In this method a forest of trees is grown, and variation among the trees is introduced by projecting the training data into a randomly chosen subspace before fitting each tree. Finally, the idea of randomized node optimization, where the decision at each node is selected by a randomized procedure rather than a deterministic optimization, was first introduced by Dietterich.[7]

The introduction of random forests proper was first made in a paper by Leo Breiman.[1] This paper describes a method of building a forest of uncorrelated trees using a CART-like procedure, combined with randomized node optimization and bagging. In addition, this paper combines several ingredients, some previously known and some novel, which form the basis of the modern practice of random forests, in particular:

1. Using out-of-bag error as an estimate of the generalization error.

2. Measuring variable importance through permutation.

54.2 Algorithm

54.2.1 Preliminaries: decision tree learning

Decision trees are a popular method for various machine learning tasks. Tree learning "come[s] closest to meeting the requirements for serving as an off-the-shelf procedure for data mining", say Hastie et al., because it is invariant under scaling and various other transformations of feature values, is robust to inclusion of irrelevant features, and produces inspectable models. However, they are seldom accurate.[8]:352

In particular, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets, because they have low bias but very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance.[8]:587–588 This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance of the final model.
54.2.2 Tree bagging

Main article: Bootstrap aggregating

The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X = x_1, ..., x_n with responses Y = y_1, ..., y_n, bagging repeatedly selects a random sample with replacement of the training set and fits trees to these samples; predictions for unseen samples are then made by averaging the predictions of the individual trees, or by taking the majority vote in the case of decision trees.

This bootstrapping procedure leads to better model performance because it decreases the variance of the model, without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets.

The number of samples/trees, B, is a free parameter. Typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set. An optimal number of trees B can be found using cross-validation, or by observing the out-of-bag error: the mean prediction error on each training sample x, using only the trees that did not have x in their bootstrap sample.[9] The training and test error tend to level off after some number of trees have been fit.
54.2.3 From bagging to random forests

Main article: Random subspace method

The above procedure describes the original bagging algorithm for trees. Random forests differ in only one way from this general scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging". The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the B trees, causing them to become correlated. Typically, for a dataset with p features, √p features are used in each split.
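For orientation, here is how the combination of tree bagging and feature bagging described above is typically invoked in practice, using scikit-learn's RandomForestClassifier (the parameter values and data names are illustrative assumptions, not prescriptions):

    from sklearn.ensemble import RandomForestClassifier

    # A few hundred trees; max_features="sqrt" selects about sqrt(p) candidate
    # features at each split, and oob_score=True records the out-of-bag
    # accuracy estimate mentioned above.
    forest = RandomForestClassifier(n_estimators=300,
                                    max_features="sqrt",
                                    oob_score=True,
                                    random_state=0)
    forest.fit(X_train, y_train)     # X_train, y_train assumed to exist
    print(forest.oob_score_)         # out-of-bag estimate of accuracy
    print(forest.predict(X_test))    # X_test assumed to exist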
54.3 Properties

54.3.1 Variable importance

Random forests can be used to rank the importance of variables in a regression or classification problem in a natural way. The following technique was described in Breiman's original paper[1] and is implemented in the R package randomForest.[2]

The first step in measuring the variable importance in a data set D_n = {(X_i, Y_i)}_{i=1}^n is to fit a random forest to the data. During the fitting process the out-of-bag error for each data point is recorded and averaged over the forest (errors on an independent test set can be substituted if bagging is not used during training).

To measure the importance of the j-th feature after training, the values of the j-th feature are permuted among the training data and the out-of-bag error is again computed on this perturbed data set. The importance score for the j-th feature is computed by averaging the difference in out-of-bag error before and after the permutation over all trees. The score is normalized by the standard deviation of these differences.

Features which produce large values for this score are ranked as more important than features which produce small values.
This method of determining variable importance has some drawbacks. For data including categorical variables with different numbers of levels, random forests are biased in favor of those attributes with more levels. Methods such as partial permutations[11][12] and growing unbiased trees[13] can be used to solve the problem. If the data contain groups of correlated features of similar relevance for the output, then smaller groups are favored over larger groups.[14]
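The permutation-importance idea itself is easy to sketch. The version below is a simplification of the procedure described above: instead of averaging out-of-bag error differences per tree, it measures, for a single fitted model, how much a chosen error metric degrades when one feature column is shuffled (the model and error function are assumed inputs).

    import numpy as np

    def permutation_importance(model, X, y, error_fn, n_repeats=10, seed=0):
        # Increase in error when feature j is permuted, for every feature j.
        rng = np.random.default_rng(seed)
        baseline = error_fn(y, model.predict(X))
        scores = np.zeros(X.shape[1])
        for j in range(X.shape[1]):
            diffs = []
            for _ in range(n_repeats):
                X_perm = X.copy()
                rng.shuffle(X_perm[:, j])             # permute the j-th feature only
                diffs.append(error_fn(y, model.predict(X_perm)) - baseline)
            scores[j] = np.mean(diffs) / (np.std(diffs) + 1e-12)   # normalized score
        return scores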
54.3.2 Relationship to nearest neighbors

A relationship between random forests and the k-nearest neighbor algorithm (k-NN) was pointed out by Lin and Jeon in 2002.[15] It turns out that both can be viewed as so-called weighted neighborhoods schemes. These are models built from a training set {(x_i, y_i)}_{i=1}^n that make predictions for new points x' by looking at the "neighborhood" of the point, formalized by a weight function W:

\hat{y} = \sum_{i=1}^{n} W(x_i, x')\, y_i .

Here, W(x_i, x') is the non-negative weight of the i-th training point relative to the new point x'. For any particular x', the weights must sum to one. Weight functions are given as follows: ...

54.4 Unsupervised learning with random forests

An RF dissimilarity measure can also be defined between unlabeled data: the idea is to construct an RF predictor that distinguishes the "observed" data from suitably generated synthetic data.[1][16] The observed data are the original unlabeled data and the synthetic data are drawn from a reference distribution. An RF dissimilarity can be attractive because it handles mixed variable types well, is invariant to monotonic transformations of the input variables, and is robust to outlying observations. The RF dissimilarity easily deals with a large number of semi-continuous variables due to its intrinsic variable selection; for example, the "Addcl 1" RF dissimilarity weighs the contribution of each variable according to how dependent it is on other variables. The RF dissimilarity has been used in a variety of applications, e.g. to find clusters of patients based on tissue marker data.[17]

54.5 Variants

Instead of decision trees, linear models have been proposed and evaluated as base estimators in random forests, in particular multinomial logistic regression and naive Bayes classifiers.[18][19]
[4] Ho, Tin Kam (1998). "The Random Subspace Method for Constructing Decision Forests" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (8): 832–844. doi:10.1109/34.709601.

[5] Amit, Yali; Geman, Donald (1997). "Shape quantization and recognition with randomized trees" (PDF). Neural Computation 9 (7): 1545–1588. doi:10.1162/neco.1997.9.7.1545.

[15] Lin, Yi; Jeon, Yongho (2002). Random forests and adaptive nearest neighbors (Technical report). Technical Report No. 1055. University of Wisconsin.

[18] Prinzie, A., Van den Poel, D. (2008). "Random Forests for multiclass classification: Random MultiNomial Logit". Expert Systems with Applications 34 (3): 1721–1732. doi:10.1016/j.eswa.2007.01.029.

[19] Prinzie, A., Van den Poel, D. (2007). Random Multiclass Classification: Generalizing Random Forests to Random MNL and Random NB, DEXA 2007, Lecture Notes in Computer Science, 4653, 349–358.
Boosting

Boosting is a machine learning ensemble meta-algorithm for primarily reducing bias, and also variance,[1] in supervised learning, and a family of machine learning algorithms which convert weak learners to strong ones.[2] Boosting is based on the question posed by Kearns and Valiant (1988, 1989):[3][4] Can a set of weak learners create a single strong learner? A weak learner is defined to be a classifier which is only slightly correlated with the true classification (it can label examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification.

Robert Schapire's affirmative answer in a 1990 paper[5] to the question of Kearns and Valiant has had significant ramifications in machine learning and statistics, most notably leading to the development of boosting.[6]

When first introduced, the hypothesis boosting problem simply referred to the process of turning a weak learner into a strong learner. "Informally, [the hypothesis boosting] problem asks whether an efficient learning algorithm [...] that outputs a hypothesis whose performance is only slightly better than random guessing [i.e. a weak learner] implies the existence of an efficient algorithm that outputs a hypothesis of arbitrary accuracy [i.e. a strong learner]."[3] Algorithms that achieve hypothesis boosting quickly became simply known as "boosting". Freund and Schapire's arcing (Adapt[at]ive Resampling and Combining),[7] as a general technique, is more or less synonymous with boosting.[8]

55.1 Boosting algorithms

While boosting is not algorithmically constrained, most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier. When they are added, they are typically weighted in some way that is usually related to the weak learners' accuracy. After a weak learner is added, the data is reweighted: examples that are misclassified gain weight and examples that are classified correctly lose weight (some boosting algorithms actually decrease the weight of repeatedly misclassified examples, e.g., boost by majority and BrownBoost). Thus, future weak learners focus more on the examples that previous weak learners misclassified.

There are many boosting algorithms. The original ones, proposed by Robert Schapire (a recursive majority gate formulation[5]) and Yoav Freund (boost by majority[9]), were not adaptive and could not take full advantage of the weak learners. However, Schapire and Freund then developed AdaBoost, an adaptive boosting algorithm that won the prestigious Gödel Prize. Only algorithms that are provable boosting algorithms in the probably approximately correct learning formulation are called boosting algorithms. Other algorithms that are similar in spirit to boosting algorithms are sometimes called "leveraging algorithms", although they are also sometimes incorrectly called boosting algorithms.[9]

55.2 Examples of boosting algorithms

The main variation between many boosting algorithms is their method of weighting training data points and hypotheses. AdaBoost is very popular and perhaps the most significant historically as it was the first algorithm that could adapt to the weak learners. However, there are many more recent algorithms such as LPBoost, TotalBoost, BrownBoost, MadaBoost, LogitBoost, and others. Many boosting algorithms fit into the AnyBoost framework,[9] which shows that boosting performs gradient descent in function space using a convex cost function.

Boosting algorithms are used in computer vision, where individual classifiers detecting contrast changes can be combined to identify facial features.[10]
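To make the reweighting scheme concrete, here is a compact sketch of discrete AdaBoost with decision stumps as the weak learners (labels are assumed to be -1/+1; the exhaustive stump search is written for clarity, not speed):

    import numpy as np

    def adaboost_fit(X, y, n_rounds=20):
        # Discrete AdaBoost: each round reweights the data so that examples
        # misclassified by earlier stumps receive more weight.
        n, d = X.shape
        w = np.full(n, 1.0 / n)                        # uniform example weights
        stumps = []                                    # (feature, threshold, polarity, alpha)
        for _ in range(n_rounds):
            best, best_err = None, np.inf
            for j in range(d):                         # search all candidate stumps
                for thr in np.unique(X[:, j]):
                    for pol in (1, -1):
                        pred = pol * np.where(X[:, j] <= thr, 1, -1)
                        err = w[pred != y].sum()
                        if err < best_err:
                            best_err, best = err, (j, thr, pol)
            j, thr, pol = best
            err = max(best_err, 1e-12)
            alpha = 0.5 * np.log((1 - err) / err)      # weight of this weak learner
            pred = pol * np.where(X[:, j] <= thr, 1, -1)
            w *= np.exp(-alpha * y * pred)             # emphasize misclassified examples
            w /= w.sum()
            stumps.append((j, thr, pol, alpha))
        return stumps

    def adaboost_predict(stumps, X):
        score = sum(alpha * pol * np.where(X[:, j] <= thr, 1, -1)
                    for j, thr, pol, alpha in stumps)
        return np.sign(score)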
55.3 Criticism

In 2008 Phillip Long (at Google) and Rocco A. Servedio (Columbia University) published a paper[11] at the 25th International Conference for Machine Learning suggesting that many of these algorithms are probably flawed.
They conclude that convex potential boosters cannot withstand random classification noise, thus making the applicability of such algorithms to real-world, noisy data sets questionable. The paper shows that if any non-zero fraction of the training data is mislabeled, the boosting algorithm tries extremely hard to correctly classify these training examples, and fails to produce a model with accuracy better than 1/2. This result does not apply to branching-program-based boosters but does apply to AdaBoost, LogitBoost, and others.[12][11]

55.4 See also

55.5 Implementations

Orange, a free data mining software suite, module Orange.ensemble

Weka is a machine learning set of tools that offers various implementations of boosting algorithms like AdaBoost and LogitBoost

R package GBM (Generalized Boosted Regression Models) implements extensions to Freund and Schapire's AdaBoost algorithm and Friedman's gradient boosting machine

jboost; AdaBoost, LogitBoost, RobustBoost, Boostexter and alternating decision trees

55.6 References

[6] Leo Breiman (1998). "Arcing classifier (with discussion and a rejoinder by the author)". Ann. Statist. 26 (3): 801–849. Retrieved 18 January 2015. "Schapire (1990) proved that boosting is possible. (Page 823)"

[7] Yoav Freund and Robert E. Schapire (1997); A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting, Journal of Computer and System Sciences, 55(1): 119–139.

[8] Leo Breiman (1998); Arcing Classifier (with Discussion and a Rejoinder by the Author), Annals of Statistics, vol. 26, no. 3, pp. 801–849: "The concept of weak learning was introduced by Kearns and Valiant (1988, 1989), who left open the question of whether weak and strong learnability are equivalent. The question was termed the boosting problem since [a solution must] boost the low accuracy of a weak learner to the high accuracy of a strong learner. Schapire (1990) proved that boosting is possible. A boosting algorithm is a method that takes a weak learner and converts it into a strong learner. Freund and Schapire (1997) proved that an algorithm similar to arc-fs is boosting."

[9] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean (2000); Boosting Algorithms as Gradient Descent, in S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pp. 512–518, MIT Press.

[10] OpenCV CascadeClassifier, http://docs.opencv.org/modules/objdetect/doc/cascade_classification.html

[11] Random Classification Noise Defeats All Convex Potential Boosters
Bootstrap aggregating
Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.

56.1 Description of the technique

Given a standard training set D of size n, bagging generates m new training sets D_i, each of size n', by sampling from D uniformly and with replacement. By sampling with replacement, some observations may be repeated in each D_i. If n' = n, then for large n the set D_i is expected to have the fraction (1 - 1/e) (about 63.2%) of the unique examples of D, the rest being duplicates.[1] This kind of sample is known as a bootstrap sample. The m models are fitted using the above m bootstrap samples and combined by averaging the output (for regression) or voting (for classification).
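The (1 - 1/e), roughly 63.2%, figure is easy to check numerically; the following small experiment (an illustration only) draws one bootstrap sample of size n from n items and measures the fraction of distinct items it contains:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    sample = rng.integers(0, n, size=n)       # bootstrap sample of size n
    unique_fraction = len(np.unique(sample)) / n
    print(unique_fraction, 1 - 1 / np.e)      # both are approximately 0.632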
Bagging leads to improvements for unstable procedures (Breiman, 1996), which include, for example, artificial neural networks, classification and regression trees, and subset selection in linear regression (Breiman, 1994). An interesting application of bagging showing improvement in preimage learning is provided in [2][3]. On the other hand, it can mildly degrade the performance of stable methods such as k-nearest neighbors (Breiman, 1996).

56.2 Example

As an illustration, rather than building a single smoother from the complete data set, 100 bootstrap samples of the data were drawn. Each sample is different from the original data set, yet resembles it in distribution and variability. For each bootstrap sample, a LOESS smoother was fit. Predictions from these 100 smoothers were then made across the range of the data. The first 10 predicted smooth fits appear as grey lines in the figure below. The lines are clearly very wiggly and they overfit the data, a result of the span being too low.

By taking the average of 100 smoothers, each fitted to a subset of the original data set, we arrive at one bagged predictor (red line). Clearly, the mean is more stable and there is less overfit.
56.3 Bagging for nearest neighbour classifiers

The risk of a 1 nearest neighbour (1NN) classifier is at most twice the risk of the Bayes classifier,[4] but there are no guarantees that this classifier will be consistent. By careful choice of the size of the resamples, bagging can lead to substantial improvements of the performance of the 1NN classifier. By taking a large number of resamples of the data of size n', the bagged nearest neighbour classifier will be consistent provided n' diverges to infinity but n'/n goes to 0 as the sample size n goes to infinity.

Under infinite simulation, the bagged nearest neighbour classifier can be viewed as a weighted nearest neighbour classifier. Suppose that the feature space is d-dimensional and denote by C^{bnn}_{n,n'} the bagged nearest neighbour classifier. Then

R_R(C^{bnn}_{n,n'}) - R_R(C^{Bayes}) = \left( B_1 \frac{n'}{n} + B_2 \frac{1}{(n')^{4/d}} \right) \{1 + o(1)\},

for some constants B_1 and B_2. The optimal choice of n', balancing the two terms in the asymptotic expansion, is given by n' = B n^{d/(d+4)} for some constant B.

56.4 History

Bagging (bootstrap aggregating) was proposed by Leo Breiman in 1994 to improve classification by combining classifications of randomly generated training sets. See Breiman, 1994, Technical Report No. 421.

56.6 References

[1] Aslam, Javed A.; Popa, Raluca A.; and Rivest, Ronald L. (2007); On Estimating the Size and Confidence of a Statistical Audit, Proceedings of the Electronic Voting Technology Workshop (EVT '07), Boston, MA, August 6, 2007. More generally, when drawing with replacement n' values out of a set of n (different and equally likely), the expected number of unique draws is n(1 - e^{-n'/n}).

[2] Sahu, A., Runger, G., Apley, D., Image denoising with a multi-phase kernel principal component approach and an ensemble version, IEEE Applied Imagery Pattern Recognition Workshop, pp. 1–7, 2011.

[3] Shinde, Amit, Anshuman Sahu, Daniel Apley, and George Runger, Preimages for Variation Patterns from Kernel PCA and Bagging, IIE Transactions, Vol. 46, Iss. 5, 2014.

[4] Castelli, Vittorio. Nearest Neighbor Classifiers, p. 5 (PDF). columbia.edu. Columbia University. Retrieved 25 April 2015.

[5] Samworth, R. J. (2012). "Optimal weighted nearest neighbour classifiers". Annals of Statistics 40 (5): 2733–2763. doi:10.1214/12-AOS1049.

Alfaro, E., Gómez, M. and García, N. (2012). adabag: An R package for classification with AdaBoost.M1, AdaBoost-SAMME and Bagging.
Chapter 57
Gradient boosting
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.

The idea of gradient boosting originated in the observation by Leo Breiman[1] that boosting can be interpreted as an optimization algorithm on a suitable cost function. Explicit regression gradient boosting algorithms were subsequently developed by Jerome H. Friedman,[2][3] simultaneously with the more general functional gradient boosting perspective of Llew Mason, Jonathan Baxter, Peter Bartlett and Marcus Frean.[4][5] The latter two papers introduced the abstract view of boosting algorithms as iterative functional gradient descent algorithms: that is, algorithms that optimize a cost functional over function space by iteratively choosing a function (weak hypothesis) that points in the negative gradient direction. This functional gradient view of boosting has led to the development of boosting algorithms in many areas of machine learning and statistics beyond regression and classification.

57.1 Informal introduction

(This section follows the exposition of gradient boosting by Li.[6])

Like other boosting methods, gradient boosting combines weak learners into a single strong learner, in an iterative fashion. It is easiest to explain in the least-squares regression setting, where the goal is to learn a model F that predicts values \hat{y} = F(x), minimizing the mean squared error (\hat{y} - y)^2 to the true values y (averaged over some training set).

At each stage 1 <= m <= M of gradient boosting, it may be assumed that there is some imperfect model F_m (at the outset, a very weak model that just predicts the mean y in the training set could be used). The gradient boosting algorithm does not change F_m in any way; instead, it improves on it by constructing a new model that adds an estimator h to provide a better model F_{m+1}(x) = F_m(x) + h(x). The question is now, how to find h? The gradient boosting solution starts with the observation that a perfect h would imply

F_{m+1}(x) = F_m(x) + h(x) = y

or, equivalently,

h(x) = y - F_m(x).

Therefore, gradient boosting will fit h to the residual y - F_m(x). Like in other boosting variants, each F_{m+1} learns to correct its predecessor F_m. A generalization of this idea to loss functions other than squared error (and to classification and ranking problems) follows from the observation that residuals y - F(x) are the negative gradients of the squared error loss function \frac{1}{2}(y - F(x))^2. So, gradient boosting is a gradient descent algorithm; and generalizing it entails plugging in a different loss and its gradient.

57.2 Algorithm

In many supervised learning problems one has an output variable y and a vector of input variables x connected together via a joint probability distribution P(x, y). Using a training set {(x_1, y_1), ..., (x_n, y_n)} of known values of x and corresponding values of y, the goal is to find an approximation \hat{F}(x) to a function F^*(x) that minimizes the expected value of some specified loss function L(y, F(x)):

F^* = \arg\min_F \, \mathbb{E}_{x,y}[L(y, F(x))].

The gradient boosting method assumes a real-valued y and seeks an approximation \hat{F}(x) in the form of a weighted sum of functions h_i(x) from some class H, called base (or weak) learners:

F(x) = \sum_{i=1}^{M} \gamma_i h_i(x) + \mathrm{const}.
In accordance with the empirical risk minimization principle, the method tries to find an approximation \hat{F}(x) that minimizes the average value of the loss function on the training set. It does so by starting with a model consisting of a constant function F_0(x), and incrementally expanding it in a greedy fashion:

F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma),

F_m(x) = F_{m-1}(x) + \arg\min_{f \in H} \sum_{i=1}^{n} L\big(y_i, F_{m-1}(x_i) + f(x_i)\big).

The generic algorithm is:

1. Initialize the model with a constant value:

   F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma).

2. For m = 1 to M:

   (a) Compute the so-called pseudo-residuals:

       r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}, \quad i = 1, \ldots, n.

   (b) Fit a base learner h_m(x) to the pseudo-residuals, i.e. train it using the training set {(x_i, r_{im})}_{i=1}^{n}.

   (c) Compute the multiplier \gamma_m by solving the following one-dimensional optimization problem:

       \gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\big(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)\big).

   (d) Update the model:

       F_m(x) = F_{m-1}(x) + \gamma_m h_m(x).

3. Output F_M(x).
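For the squared-error loss the pseudo-residuals are simply the ordinary residuals y_i - F_{m-1}(x_i), so the algorithm can be sketched very compactly. The snippet below is an illustration under that assumption; it uses small scikit-learn regression trees as base learners and adds a shrinkage factor, a common practical refinement not shown in the algorithm above:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boost_fit(X, y, n_stages=100, shrinkage=0.1):
        # Gradient boosting for squared error: each stage fits a small tree
        # to the current residuals y - F_m(x).
        f0 = y.mean()                      # constant model minimizing squared error
        F = np.full(len(y), f0)
        learners = []
        for _ in range(n_stages):
            residuals = y - F              # negative gradient of 1/2 (y - F)^2
            h = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
            F = F + shrinkage * h.predict(X)
            learners.append(h)
        return f0, learners

    def gradient_boost_predict(f0, learners, X, shrinkage=0.1):
        return f0 + shrinkage * sum(h.predict(X) for h in learners)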
57.3 Gradient tree boosting

Friedman proposes to modify this algorithm so that it chooses a separate optimal value \gamma_{jm} for each of the tree's regions, instead of a single \gamma_m for the whole tree. He calls the modified algorithm "TreeBoost". The coefficients b_{jm} from the tree-fitting procedure can then simply be discarded and the model update rule becomes:

F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} \mathbf{1}(x \in R_{jm}), \qquad \gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\big(y_i, F_{m-1}(x_i) + \gamma\big).
57.6 Names

The method goes by a variety of names. Friedman introduced his regression technique as a "Gradient Boosting Machine" (GBM).[2] Mason, Baxter et al. described the generalized abstract class of algorithms as "functional gradient boosting".[4][5]

A popular open-source implementation[10] for R calls it a "Generalized Boosting Model". Commercial implementations from Salford Systems use the names "Multiple Additive Regression Trees" (MART) and TreeNet, both trademarked.

57.7 See also

Random forest

57.8 References

[1] Breiman, L. "Arcing The Edge" (June 1997)

[5] Mason, L.; Baxter, J.; Bartlett, P. L.; Frean, Marcus (May 1999). Boosting Algorithms as Gradient Descent in Function Space (PDF).

[8] Note: in the case of usual CART trees, the trees are fitted using least-squares loss, and so the coefficient b_{jm} for the region R_{jm} is equal to just the value of the output variable, averaged over all training instances in R_{jm}.

[12] Cossock, David and Zhang, Tong (2008). Statistical Analysis of Bayes Optimal Subset Ranking, page 14.

[13] Yandex corporate blog entry about new ranking model "Snezhinsk" (in Russian)
Semi-supervised learning
58.1 Assumptions used in semi-supervised learning

In order to make any use of unlabeled data, we must assume some structure to the underlying distribution of data. Semi-supervised learning algorithms make use of at least one of the following assumptions.[1]

58.1.1 Smoothness assumption

Points which are close to each other are more likely to share a label. This is also generally assumed in supervised learning and yields a preference for geometrically simple decision boundaries. In the case of semi-supervised learning, the smoothness assumption additionally yields a preference for decision boundaries in low-density regions, so that there are fewer points close to each other but in different classes.

58.2 History

The transductive learning framework was formally introduced by Vladimir Vapnik in the 1970s.[4] Interest in inductive learning using generative models also began in the 1970s. A probably approximately correct learning bound for semi-supervised learning of a Gaussian mixture was demonstrated by Ratsaby and Venkatesh in 1995.[5]

Semi-supervised learning has recently become more popular and practically relevant due to the variety of problems for which vast quantities of unlabeled data are available, e.g. text on websites, protein sequences, or images. For a review of recent work see the survey article by Zhu (2008).[6]

58.3 Methods for semi-supervised learning

58.3.1 Generative models
58.3.2 Low-density separation

Another major class of methods attempts to place boundaries in regions where there are few data points (labeled or unlabeled). One of the most commonly used algorithms is the transductive support vector machine, or TSVM (which, despite its name, may be used for inductive learning as well). Whereas support vector machines for supervised learning seek a decision boundary with maximal margin over the labeled data, the goal of TSVM is a labeling of the unlabeled data such that the decision boundary has maximal margin over all of the data. In addition to the standard hinge loss (1 - y f(x))_+ for labeled data, a loss function (1 - |f(x)|)_+ is introduced over the unlabeled data by letting y = sign f(x). TSVM then selects f(x) = h(x) + b from a reproducing kernel Hilbert space H by minimizing the regularized empirical risk.

58.3.3 Graph-based methods

... and intrinsic spaces respectively. The graph is used to approximate the intrinsic regularization term. Defining the graph Laplacian L = D - W, where D_{ii} = \sum_{j=1}^{l+u} W_{ij} and f is the vector [f(x_1) \ldots f(x_{l+u})], we have

f^{\mathsf{T}} L f = \sum_{i,j=1}^{l+u} W_{ij} (f_i - f_j)^2 \approx \int_M \|\nabla_M f(x)\|^2 \, dp(x).

The Laplacian can also be used to extend the supervised learning algorithms regularized least squares and support vector machines (SVM) to semi-supervised versions, Laplacian regularized least squares and Laplacian SVM.
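As a small illustration of the construction used above, the sketch below builds a fully connected similarity graph over all (labeled and unlabeled) points with Gaussian edge weights (an assumed, commonly used choice, since the weighting scheme itself is not specified here) and forms the graph Laplacian L = D - W:

    import numpy as np

    def graph_laplacian(X, sigma=1.0):
        # W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)); returns L = D - W.
        sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.exp(-sq_dists / (2 * sigma ** 2))
        np.fill_diagonal(W, 0.0)          # no self-loops
        D = np.diag(W.sum(axis=1))
        return D - W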
58.3.4 Heuristic approaches
... examples available, but the sampling process from which labeled examples arise.[13][14]

58.5 See also

PU learning
58.6 References

[1] Chapelle, Olivier; Schölkopf, Bernhard; Zien, Alexander (2006). Semi-supervised learning. Cambridge, Mass.: MIT Press. ISBN 978-0-262-03358-9.

[2] Stevens, K. N. (2000), Acoustic Phonetics, MIT Press, ISBN 0-262-69250-3, 978-0-262-69250-2.

[3] Scudder, H. J. "Probability of Error of Some Adaptive Pattern-Recognition Machines". IEEE Transactions on Information Theory, 11: 363–371 (1965). Cited in Chapelle et al. 2006, page 3.

[4] Vapnik, V. and Chervonenkis, A. Theory of Pattern Recognition [in Russian]. Nauka, Moscow (1974). Cited in Chapelle et al. 2006, page 3.

[5] Ratsaby, J. and Venkatesh, S. Learning from a mixture of labeled and unlabeled examples with parametric side information. In Proceedings of the Eighth Annual Conference on Computational Learning Theory, pages 412–417 (1995). Cited in Chapelle et al. 2006, page 4.

[6] Zhu, Xiaojin. Semi-supervised learning literature survey. Computer Sciences, University of Wisconsin-Madison (2008).

[7] Cozman, F. and Cohen, I. Risks of semi-supervised learning: how unlabeled data can degrade performance of generative classifiers. In: Chapelle et al. (2006).

[8] Zhu, Xiaojin. Semi-Supervised Learning. University of Wisconsin-Madison.

[9] M. Belkin, P. Niyogi (2004). "Semi-supervised Learning on Riemannian Manifolds". Machine Learning 56 (Special Issue on Clustering): 209–239.

[10] M. Belkin, P. Niyogi, V. Sindhwani. On Manifold Regularization. AISTATS 2005.

[11] Zhu, Xiaojin; Goldberg, Andrew B. (2009). Introduction to semi-supervised learning. Morgan & Claypool. ISBN 9781598295481.

[12] Younger, B. A. and Fearing, D. D. (1999), Parsing Items into Separate Categories: Developmental Change in Infant Categorization. Child Development, 70: 291–303.

[13] Xu, F. and Tenenbaum, J. B. (2007). "Sensitivity to sampling in Bayesian word learning". Developmental Science 10: 288–297. doi:10.1111/j.1467-7687.2007.00590.x.

[14] Gweon, H., Tenenbaum, J. B., and Schulz, L. E. (2010). "Infants consider both the sample and the sampling process in inductive generalization". Proc Natl Acad Sci U S A 107 (20): 9066–9071.

58.7 External links

A freely available MATLAB implementation of the graph-based semi-supervised algorithms Laplacian support vector machines and Laplacian regularized least squares.
Chapter 59
Perceptron
"Perceptrons" redirects here. For the book of that title, see Perceptrons (book).

In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers: functions that can decide whether an input (represented by a vector of numbers) belongs to one class or another. It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector. The algorithm allows for online learning, in that it processes elements in the training set one at a time.

The perceptron algorithm dates back to the late 1950s; its first implementation, in custom hardware, was one of the first artificial neural networks to be produced.

59.1 History

See also: History of artificial intelligence, AI winter

The perceptron algorithm was invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt,[2] funded by the United States Office of Naval Research.[3] The perceptron was intended to be a machine, rather than a program, and while its first implementation was in software for the IBM 704, it was subsequently implemented in custom-built hardware as the "Mark 1 perceptron". This machine was designed for image recognition: it had an array of 400 photocells, randomly connected to the "neurons". Weights were encoded in potentiometers, and weight updates during learning were performed by electric motors.[4]:193

In a 1958 press conference organized by the US Navy, Rosenblatt made statements about the perceptron that caused a heated controversy among the fledgling AI community; based on Rosenblatt's statements, The New York Times reported the perceptron to be "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."[3]

Although the perceptron initially seemed promising, it was quickly proved that perceptrons could not be trained to recognise many classes of patterns. This led to the field of neural network research stagnating for many years, before it was recognised that a feedforward neural network with two or more layers (also called a multilayer perceptron) had far greater processing power than perceptrons with one layer (also called a single-layer perceptron). Single-layer perceptrons are only capable of learning linearly separable patterns; in 1969 a famous book entitled Perceptrons by Marvin Minsky and Seymour Papert showed that it was impossible for these classes of network to learn an XOR function. It is often believed that they also conjectured (incorrectly) that a similar result would hold for a multi-layer perceptron network. However, this is not true, as both Minsky and Papert already knew that multi-layer perceptrons were capable of producing an XOR function. (See the page on Perceptrons (book) for more information.) Three years later Stephen Grossberg published a series of papers introducing networks capable of modelling differential, contrast-enhancing and XOR functions. (The papers were published in 1972 and 1973; see e.g. Grossberg (1973), "Contour enhancement, short-term memory, and constancies in reverberating neural networks" (PDF), Studies in Applied Mathematics 52: 213–257.) Nevertheless, the often-miscited Minsky/Papert text caused a significant decline in interest and funding of neural network research. It took ten more years until neural network research experienced a resurgence in the 1980s. This text was reprinted in 1987 as "Perceptrons - Expanded Edition", where some errors in the original text are shown and corrected.

The kernel perceptron algorithm was already introduced in 1964 by Aizerman et al.[5] Margin bounds guarantees were given for the perceptron algorithm in the general non-separable case first by Freund and Schapire (1998),[1] and more recently by Mohri and Rostamizadeh (2013), who extend previous results and give new L1 bounds.[6]

59.2 Definition

In the modern sense, the perceptron is an algorithm for learning a binary classifier: a function that maps its input
x (a real-valued vector) to an output value f(x) (a single binary value): f(x) = 1 if w · x + b > 0, and f(x) = 0 otherwise.

[Figure: A diagram showing a perceptron updating its linear boundary as more training examples are added.]

The value of f(x) (0 or 1) is used to classify x as either a positive or a negative instance, in the case of a binary classification problem. If b is negative, then the weighted combination of inputs must produce a positive value greater than |b| in order to push the classifier neuron over the 0 threshold. Spatially, the bias alters the position (though not the orientation) of the decision boundary.

The perceptron learning algorithm does not terminate if the learning set is not linearly separable. If the vectors are not linearly separable, learning will never reach a point where all vectors are classified properly. The most famous example of the perceptron's inability to solve problems with linearly non-separable vectors is the Boolean exclusive-or problem. The solution spaces of decision boundaries for all binary functions and learning behaviors are studied in the reference.[7]

In the context of neural networks, a perceptron is an artificial neuron using the Heaviside step function as the activation function. The perceptron algorithm is also termed the single-layer perceptron, to distinguish it from a multilayer perceptron, which is a misnomer for a more complicated neural network. As a linear classifier, the single-layer perceptron is the simplest feedforward neural network.

59.3 Learning algorithm

59.3.1 Definitions

We first define some variables:

- y = f(z) denotes the output from the perceptron for an input vector z.
- b is the bias term, which in the example below we take to be 0.
- D = {(x_1, d_1), ..., (x_s, d_s)} is the training set of s samples, where:
  - x_j is the n-dimensional input vector.
  - d_j is the desired output value of the perceptron for that input.

We show the values of the features as follows:

- x_{j,i} is the value of the i-th feature of the j-th training input vector.

Too high a learning rate makes the perceptron periodically oscillate around the solution unless additional steps are taken.
59.4 Variants

The pocket algorithm with ratchet (Gallant, 1990) solves the stability problem of perceptron learning by keeping the best solution seen so far "in its pocket". The pocket algorithm then returns the solution in the pocket, rather than the last solution. It can also be used for non-separable data sets, where the aim is to find a perceptron with a small number of misclassifications. However, these solutions appear purely stochastically and hence the pocket algorithm neither approaches them gradually in the course of learning, nor are they guaranteed to show up within a given number of learning steps.

The Maxover algorithm (Wendemuth, 1995)[9] is "robust" in the sense that it will converge regardless of (prior) knowledge of linear separability of the data set. In the linearly separable case, it will solve the training problem, if desired even with optimal stability (maximum margin between the classes). For non-separable data sets, it will return a solution with a small number of misclassifications. In all cases, the algorithm gradually approaches the solution in the course of learning, without memorizing previous states and without stochastic jumps. Convergence is to global optimality for separable data sets and to local optimality for non-separable data sets.

In separable problems, perceptron training can also aim at finding the largest separating margin between the classes. The so-called perceptron of optimal stability can be determined by means of iterative training and optimization schemes, such as the Min-Over algorithm (Krauth and Mezard, 1987)[10] or the AdaTron (Anlauf and Biehl, 1989).[11] AdaTron uses the fact that the corresponding quadratic optimization problem is convex. The perceptron of optimal stability, together with the kernel trick, are the conceptual foundations of the support vector machine.

The α-perceptron further used a pre-processing layer of fixed random weights, with thresholded output units. This enabled the perceptron to classify analogue patterns, by projecting them into a binary space. In fact, for a projection space of sufficiently high dimension, patterns can become linearly separable.

For example, consider the case of having to classify data into two classes. Here is a small such data set, consisting of points coming from two Gaussian distributions.

[Figures: Two-class Gaussian data; a linear classifier operating on the original space; a linear classifier operating on a high-dimensional projection.]

A linear classifier can only separate points with a hyperplane, so no linear classifier can classify all the points here perfectly. On the other hand, the data can be projected into a large number of dimensions. In our example, a random matrix was used to project the data linearly to a 1000-dimensional space; then each resulting data point was transformed through the hyperbolic tangent function. A linear classifier can then separate the data, as shown in the third figure. However, the data may still not be completely separable in this space, in which case the perceptron algorithm would not converge. In the example shown, stochastic steepest gradient descent was used to adapt the parameters.

Another way to solve nonlinear problems without using multiple layers is to use higher order networks (sigma-pi unit). In this type of network, each element in the input vector is extended with each pairwise combination of multiplied inputs (second order). This can be extended to an n-order network.

It should be kept in mind, however, that the best classifier is not necessarily that which classifies all the training data perfectly. Indeed, if we had the prior constraint that the data come from equi-variant Gaussian distributions, the linear separation in the input space is optimal, and the nonlinear solution is overfitted.

Other linear classification algorithms include Winnow, support vector machine and logistic regression.

59.5 Example

A perceptron learns to perform a binary NAND function on inputs x_1 and x_2.

Inputs: x_0, x_1, x_2, with input x_0 held constant at 1.
Threshold (t): 0.5
Bias (b): 1
Learning rate (r): 0.1

Training set, consisting of four samples: {((1, 0, 0), 1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 0)}

In the following, the final weights of one iteration become the initial weights of the next. Each cycle over all the samples in the training set is demarcated with heavy lines. This example can be implemented in the following Python code.

    threshold = 0.5
    learning_rate = 0.1
    weights = [0, 0, 0]
    training_set = [((1, 0, 0), 1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 0)]

    def dot_product(values, weights):
        return sum(value * weight for value, weight in zip(values, weights))

    while True:
        print('-' * 60)
        error_count = 0
        for input_vector, desired_output in training_set:
            print(weights)
            result = dot_product(input_vector, weights) > threshold
            error = desired_output - result
            if error != 0:
                error_count += 1
                for index, value in enumerate(input_vector):
                    weights[index] += learning_rate * error * value
        if error_count == 0:
            break
59.6 Multiclass perceptron

Like most other techniques for training linear classifiers, the perceptron generalizes naturally to multiclass classification. Here, the input x and the output y are drawn from arbitrary sets. A feature representation function f(x, y) maps each possible input/output pair to a finite-dimensional real-valued feature vector. As before, the feature vector is multiplied by a weight vector w, but now the resulting score is used to choose among many possible outputs:

\hat{y} = \arg\max_y f(x, y) \cdot w.

Learning again iterates over the examples, predicting an output for each, leaving the weights unchanged when the predicted output matches the target, and changing them when it does not. The update becomes:

w_{t+1} = w_t + f(x, y) - f(x, \hat{y}).

This multiclass formulation reduces to the original perceptron when x is a real-valued vector, y is chosen from {0, 1}, and f(x, y) = y x.

For certain problems, input/output representations and features can be chosen so that \arg\max_y f(x, y) \cdot w can be found efficiently even though y is chosen from a very large or even infinite set.

In recent years, perceptron training has become popular in the field of natural language processing for such tasks as part-of-speech tagging and syntactic parsing (Collins, 2002).
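The update rule above translates directly into code. The sketch below assumes a user-supplied joint feature map feats(x, y) returning a fixed-length NumPy vector, and a small finite set of classes:

    import numpy as np

    def multiclass_perceptron_train(feats, X, Y, n_classes, n_epochs=10):
        # Multiclass perceptron: predict with argmax_y w . feats(x, y) and
        # apply w <- w + feats(x, y) - feats(x, y_hat) on mistakes.
        w = np.zeros(len(feats(X[0], 0)))
        for _ in range(n_epochs):
            for x, y in zip(X, Y):
                y_hat = max(range(n_classes), key=lambda c: w @ feats(x, c))
                if y_hat != y:
                    w += feats(x, y) - feats(x, y_hat)
        return w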
59.7 References

[1] Freund, Y.; Schapire, R. E. (1999). "Large margin classification using the perceptron algorithm" (PDF). Machine Learning 37 (3): 277–296. doi:10.1023/A:1007662407062.

[2] Rosenblatt, Frank (1957), The Perceptron - a perceiving and recognizing automaton. Report 85-460-1, Cornell Aeronautical Laboratory.

[3] Mikel Olazaran (1996). "A Sociological Study of the Official History of the Perceptrons Controversy". Social Studies of Science 26 (3). JSTOR 285702.

[4] Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer.

[5] Aizerman, M. A.; Braverman, E. M.; Rozonoer, L. I. (1964). "Theoretical foundations of the potential function method in pattern recognition learning". Automation and Remote Control 25: 821–837.

[6] Mohri, Mehryar and Rostamizadeh, Afshin (2013). Perceptron Mistake Bounds. arXiv:1305.0208, 2013.

[7] Liou, D.-R.; Liou, J.-W.; Liou, C.-Y. (2013). Learning Behaviors of Perceptron. ISBN 978-1-477554-73-9. iConcept Press.

[8] Bishop, Christopher M. "Chapter 4. Linear Models for Classification". Pattern Recognition and Machine Learning. Springer Science+Business Media, LLC. p. 194. ISBN 978-0387-31073-2.

[9] A. Wendemuth. Learning the Unlearnable. J. of Physics A: Math. Gen. 28: 5423–5436 (1995).

[10] W. Krauth and M. Mezard. Learning algorithms with optimal stability in neural networks. J. of Physics A: Math. Gen. 20: L745–L752 (1987).

[11] J. K. Anlauf and M. Biehl. The AdaTron: an Adaptive Perceptron algorithm. Europhysics Letters 10: 687–692 (1989).

Aizerman, M. A. and Braverman, E. M. and Lev I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25: 821–837, 1964.

Rosenblatt, Frank (1958), The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain, Cornell Aeronautical Laboratory, Psychological Review, v65, No. 6, pp. 386–408. doi:10.1037/h0042519.

Rosenblatt, Frank (1962), Principles of Neurodynamics. Washington, DC: Spartan Books.

Minsky, M. L. and Papert, S. A. 1969. Perceptrons. Cambridge, MA: MIT Press.

Gallant, S. I. (1990). Perceptron-based learning algorithms. IEEE Transactions on Neural Networks, vol. 1, no. 2, pp. 179–191.

Mohri, Mehryar and Rostamizadeh, Afshin (2013). Perceptron Mistake Bounds. arXiv:1305.0208, 2013.

Novikoff, A. B. (1962). On convergence proofs on perceptrons. Symposium on the Mathematical Theory of Automata, 12, 615–622. Polytechnic Institute of Brooklyn.

Widrow, B., Lehr, M. A., "30 years of Adaptive Neural Networks: Perceptron, Madaline, and Backpropagation", Proc. IEEE, vol 78, no 9, pp. 1415–1442 (1990).

Collins, M. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with the perceptron algorithm in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '02).

Yin, Hongfeng (1996), Perceptron-Based Algorithms and Analysis, Spectrum Library, Concordia University, Canada.
59.8 External links

Mathematics of perceptrons
Chapter 60
Support vector machine

Not to be confused with Secure Virtual Machine.

In machine learning, support vector machines (SVMs, also support vector networks[1]) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples into one category or the other, making it a non-probabilistic binary linear classifier. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.

In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.

60.1 Definition

More formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training-data point of any class (the so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier.

Whereas the original problem may be stated in a finite-dimensional space, it often happens that the sets to discriminate are not linearly separable in that space. For this reason, it was proposed that the original finite-dimensional space be mapped into a much higher-dimensional space, presumably making the separation easier in that space. To keep the computational load reasonable, the mappings used by SVM schemes are designed to ensure that dot products may be computed easily in terms of the variables in the original space, by defining them in terms of a kernel function k(x, y) selected to suit the problem.[2] The hyperplanes in the higher-dimensional space are defined as the set of points whose dot product with a vector in that space is constant. The vectors defining the hyperplanes can be chosen to be linear combinations with parameters \alpha_i of images of feature vectors x_i that occur in the data base. With this choice of a hyperplane, the points x in the feature space that are mapped into the hyperplane are defined by the relation \sum_i \alpha_i k(x_i, x) = \mathrm{constant}. Note that if k(x, y) becomes small as y grows further away from x, each term in the sum measures the degree of closeness of the test point x to the corresponding data base point x_i. In this way, the sum of kernels above can be used to measure the relative nearness of each test point to the data points originating in one or the other of the sets to be discriminated. Note the fact that the set of points x mapped into any hyperplane can be quite convoluted as a result, allowing much more complex discrimination between sets which are not convex at all in the original space.
366
60.4. LINEAR SVM 367
X2 H1 H2 H3
X1
H1 does not separate the classes. H2 does, but only with a small
margin. H3 separates them with the maximum margin.
that might classify the data. One reasonable choice as the Maximum-margin hyperplane and margins for an SVM trained
best hyperplane is the one that represents the largest sepa- with samples from two classes. Samples on the margin are called
ration, or margin, between the two classes. So we choose the support vectors.
the hyperplane so that the distance from it to the nearest
data point on each side is maximized. If such a hyper-
plane exists, it is known as the maximum-margin hyper-
plane and the linear classier it denes is known as a max- wxb=1
imum margin classier; or equivalently, the perceptron of
and
optimal stability.
w x b = 1.
60.4 Linear SVM
By using geometry, we nd the distance between these
2
two hyperplanes is w , so we want to minimize w .
Given some training data D , a set of n points of the form
As we also have to prevent data points from falling into
the margin, we add the following constraint: for each i
n
either
D = {(xi , yi ) | xi Rp , yi {1, 1}}i=1
w x b = 0,
yi (w xi b) 1, all for 1 i n. (1)
where denotes the dot product and w the (not neces-
sarily normalized) normal vector to the hyperplane. The We can put this together to get the optimization problem:
b
parameter w determines the oset of the hyperplane Minimize (in w, b )
from the origin along the normal vector w .
If the training data are linearly separable, we can select
two hyperplanes in a way that they separate the data and w
there are no points between them, and then try to maxi- subject to (for any i = 1, . . . , n )
mize their distance. The region bounded by them is called
the margin. These hyperplanes can be described by the
equations yi (w xi b) 1.
In standard form, ‖w‖ is replaced by (1/2)‖w‖², which does not change the solution but turns the problem into a quadratic program:

arg min_(w, b) (1/2)‖w‖²

subject to (for any i = 1, …, n)

y_i(w · x_i − b) ≥ 1.

Introducing non-negative slack variables ξ_i, which measure the degree of misclassification of the points x_i, relaxes the constraints to

y_i(w · x_i − b) ≥ 1 − ξ_i,  1 ≤ i ≤ n.  (2)

The resulting soft-margin problem can be written as the saddle-point (Lagrangian) problem

arg min_{w, ξ, b} max_{α, β} { (1/2)‖w‖² + C Σ_i ξ_i − Σ_i α_i [y_i(w · x_i − b) − 1 + ξ_i] − Σ_i β_i ξ_i }

with α_i, β_i ≥ 0.

In dual form, the problem is to maximize (in the α_i)

L̃(α) = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j)

subject to Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C.
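The soft-margin problem above is also often solved in an equivalent unconstrained form, minimizing (1/2)‖w‖² + C Σ_i max(0, 1 − y_i(w · x_i − b)). The following is a minimal illustrative sketch of that approach using stochastic sub-gradient descent, in the spirit of primal solvers such as Pegasos;[26] the toy data, learning rate and number of epochs are assumptions made for this example, not part of the formulation above.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Soft-margin linear SVM trained by stochastic sub-gradient descent on
    the hinge-loss form (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i - b))."""
    n, p = X.shape
    w = np.zeros(p)
    b = 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            margin = y[i] * (np.dot(w, X[i]) - b)
            if margin >= 1:
                # Point lies outside the margin: only the regularizer contributes.
                grad_w, grad_b = w, 0.0
            else:
                # Point violates the margin: the hinge term is active.
                grad_w = w - C * y[i] * X[i]
                grad_b = C * y[i]
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Toy usage: two linearly separable clusters with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
w, b = train_linear_svm(X, y)
pred = np.sign(X @ w - b)   # decision rule: sign(w.x - b)
print("training accuracy:", np.mean(pred == y))
```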
[Figure: Kernel machine]

The original optimal hyperplane algorithm proposed by Vapnik in 1963 was a linear classifier. However, in 1992, Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik suggested a way to create nonlinear classifiers by applying the kernel trick to maximum-margin hyperplanes.[3]
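For such a kernelized classifier, prediction reduces to evaluating a kernel expansion of the form Σ_i α_i y_i k(x_i, x) − b (cf. the expansion in Section 60.1). The sketch below is a hypothetical illustration rather than anything from the article: it evaluates such a decision function with a Gaussian RBF kernel, and the support vectors, dual coefficients α_i, offset b and value of γ are made-up placeholders standing in for the output of a real solver.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian radial basis function kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_decision(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    """Evaluate f(x) = sum_i alpha_i * y_i * k(x_i, x) - b and return its sign."""
    s = sum(a * y * kernel(sv, x) for a, y, sv in zip(alphas, labels, support_vectors))
    return np.sign(s - b)

# Hypothetical values standing in for a trained kernel SVM.
support_vectors = [np.array([0.0, 1.0]), np.array([1.0, 0.0]), np.array([-1.0, -1.0])]
alphas = [0.7, 0.7, 1.4]     # dual coefficients (0 <= alpha_i <= C)
labels = [+1, +1, -1]        # class labels y_i
b = 0.1                      # offset
print(kernel_decision(np.array([0.5, 0.5]), support_vectors, alphas, labels, b))
```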
Maximum margin classifiers are well regularized, and previously it was widely believed that the infinite dimensions do not spoil the results. However, it has been shown that higher dimensions do increase the generalization error, although the amount is bounded.[7]

Some common kernels include:

• Polynomial (homogeneous): k(x_i, x_j) = (x_i · x_j)^d
• Polynomial (inhomogeneous): k(x_i, x_j) = (x_i · x_j + 1)^d
• Gaussian radial basis function: k(x_i, x_j) = exp(−γ‖x_i − x_j‖²), for γ > 0. Sometimes parametrized using γ = 1/(2σ²)
• Hyperbolic tangent: k(x_i, x_j) = tanh(κ x_i · x_j + c), for some (not every) κ > 0 and c < 0

The kernel is related to the transform φ(x_i) by the equation k(x_i, x_j) = φ(x_i) · φ(x_j). The value w is also in the transformed space, with w = Σ_i α_i y_i φ(x_i). Dot products with w for classification can again be computed by the kernel trick, i.e. w · φ(x) = Σ_i α_i y_i k(x_i, x). However, there does not in general exist a value w′ such that w · φ(x) = k(w′, x).

60.7 Properties

SVMs belong to a family of generalized linear classifiers and can be interpreted as an extension of the perceptron. They can also be considered a special case of Tikhonov regularization. A special property is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers.

A comparison of the SVM to other classifiers has been made by Meyer, Leisch and Hornik.[8]

60.7.1 Parameter selection

The effectiveness of SVM depends on the selection of the kernel, the kernel's parameters, and the soft margin parameter C. A common choice is a Gaussian kernel, which has a single parameter γ. The best combination of C and γ is often selected by a grid search with exponentially growing sequences of C and γ, for example, C ∈ {2^−5, 2^−3, …, 2^13, 2^15}; γ ∈ {2^−15, 2^−13, …, 2^1, 2^3}. Typically, each combination of parameter choices is checked using cross validation, and the parameters with the best cross-validation accuracy are picked. Alternatively, recent work in Bayesian optimization can be used to select C and γ, often requiring the evaluation of far fewer parameter combinations than grid search. The final model, which is used for testing and for classifying new data, is then trained on the whole training set using the selected parameters.[9]

60.7.2 Issues

Potential drawbacks of the SVM are the following three aspects:

• Uncalibrated class membership probabilities
• The SVM is only directly applicable for two-class tasks. Therefore, algorithms that reduce the multi-class task to several binary problems have to be applied; see the multi-class SVM section.
• Parameters of a solved model are difficult to interpret.

60.8 Extensions

60.8.1 Multiclass SVM

Multiclass SVM aims to assign labels to instances by using support vector machines, where the labels are drawn from a finite set of several elements.

The dominant approach for doing so is to reduce the single multiclass problem into multiple binary classification problems.[10] Common methods for such reduction include:[10][11]

• Building binary classifiers which distinguish between (i) one of the labels and the rest (one-versus-all) or (ii) every pair of classes (one-versus-one). Classification of new instances for the one-versus-all case is done by a winner-takes-all strategy, in which the classifier with the highest output function assigns the class (it is important that the output functions be calibrated to produce comparable scores). For the one-versus-one approach, classification is done by a max-wins voting strategy, in which every classifier assigns the instance to one of the two classes, then the vote for the assigned class is increased by one vote, and finally the class with the most votes determines the instance classification.
• Directed acyclic graph SVM (DAGSVM)[12]
• Error-correcting output codes[13]

Crammer and Singer proposed a multiclass SVM method which casts the multiclass classification problem into a single optimization problem, rather than decomposing it into multiple binary classification problems.[14] See also Lee, Lin and Wahba.[15][16]

60.8.2 Transductive support vector machines

Transductive support vector machines extend SVMs in that they could also treat partially labeled data in semi-supervised learning by following the principles of transduction. Here, in addition to the training set D, the learner is also given a set

D★ = { x★_i | x★_i ∈ ℝ^p },  i = 1, …, k,

of test examples to be classified. Formally, a transductive support vector machine is defined by the following primal optimization problem:[17]

Minimize (in w, b, y★)

A version of SVM for regression (support vector regression) trains a model subject to constraints of the form

y_i − ⟨w, x_i⟩ − b ≤ ε
⟨w, x_i⟩ + b − y_i ≤ ε

where x_i is a training sample with target value y_i. The inner product plus intercept ⟨w, x_i⟩ + b is the prediction for that sample, and ε is a free parameter that serves as a threshold: all predictions have to be within an ε range of the true values. Slack variables are usually added into the above to allow for errors and to allow approximation in the case the above problem is infeasible.
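As a concrete illustration of the grid-search procedure described in Section 60.7.1, the following sketch uses scikit-learn (one of the toolkits mentioned below) to select C and γ for an RBF-kernel SVM by cross-validation. The toy data set is an assumption for the example, and the grids mirror the exponentially growing sequences suggested above.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Toy two-class data set standing in for a real problem.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# Exponentially growing grids for C and gamma, as in Section 60.7.1.
param_grid = {
    "C": [2.0 ** k for k in range(-5, 16, 2)],
    "gamma": [2.0 ** k for k in range(-15, 4, 2)],
}

# Each (C, gamma) pair is scored by 5-fold cross-validation; the best pair
# is then used to refit an RBF-kernel SVM on the whole training set.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best cross-validation accuracy:", search.best_score_)
```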
The general kernel SVMs can also be solved more efficiently using sub-gradient descent (e.g. P-packSVM[28]), especially when parallelization is allowed.

Kernel SVMs are available in many machine learning toolkits, including LIBSVM, MATLAB, SVMlight, kernlab, scikit-learn, Shogun, Weka, Shark, JKernelMachines and others.

60.11 Applications

SVMs can be used to solve various real world problems:

• SVMs are helpful in text and hypertext categorization as their application can significantly reduce the need for labeled training instances in both the standard inductive and transductive settings.
• Classification of images can also be performed using SVMs. Experimental results show that SVMs achieve significantly higher search accuracy than traditional query refinement schemes after just three to four rounds of relevance feedback.
• SVMs are also useful in medical science to classify proteins, with up to 90% of the compounds classified correctly.
• Hand-written characters can be recognized using SVM.

60.12 See also

• In situ adaptive tabulation
• Kernel machines
• Fisher kernel
• Platt scaling
• Polynomial kernel
• Predictive analytics
• Regularization perspectives on support vector machines
• Relevance vector machine, a probabilistic sparse kernel model identical in functional form to SVM
• Sequential minimal optimization
• Winnow (algorithm)

60.13 References

[1] Cortes, C.; Vapnik, V. (1995). "Support-vector networks". Machine Learning 20 (3): 273. doi:10.1007/BF00994018.
[2] Press, William H.; Teukolsky, Saul A.; Vetterling, William T.; Flannery, B. P. (2007). "Section 16.5. Support Vector Machines". Numerical Recipes: The Art of Scientific Computing (3rd ed.). New York: Cambridge University Press. ISBN 978-0-521-88068-8.
[3] Boser, B. E.; Guyon, I. M.; Vapnik, V. N. (1992). "A training algorithm for optimal margin classifiers". Proceedings of the fifth annual workshop on Computational learning theory (COLT '92). p. 144. doi:10.1145/130385.130401. ISBN 089791497X.
[4] ACM Website, press release of March 17th, 2009. http://www.acm.org/press-room/news-releases/awards-08-groupa
[5] Aizerman, Mark A.; Braverman, Emmanuel M.; Rozonoer, Lev I. (1964). "Theoretical foundations of the potential function method in pattern recognition learning". Automation and Remote Control 25: 821–837.
[6] Boser, B. E.; Guyon, I. M.; Vapnik, V. N. (1992). "A training algorithm for optimal margin classifiers". Proceedings of the fifth annual workshop on Computational learning theory (COLT '92). p. 144. doi:10.1145/130385.130401. ISBN 089791497X.
[7] Jin, Chi; Wang, Liwei (2012). "Dimensionality dependent PAC-Bayes margin bound". Advances in Neural Information Processing Systems.
[8] Meyer, D.; Leisch, F.; Hornik, K. (2003). "The support vector machine under test". Neurocomputing 55: 169. doi:10.1016/S0925-2312(03)00431-4.
[9] Hsu, Chih-Wei; Chang, Chih-Chung; Lin, Chih-Jen (2003). A Practical Guide to Support Vector Classification (PDF) (Technical report). Department of Computer Science and Information Engineering, National Taiwan University.
[10] Duan, K. B.; Keerthi, S. S. (2005). "Which Is the Best Multiclass SVM Method? An Empirical Study". Multiple Classifier Systems (PDF). LNCS 3541. pp. 278–285. doi:10.1007/11494683_28. ISBN 978-3-540-26306-7.
[11] Hsu, Chih-Wei; Lin, Chih-Jen (2002). "A Comparison of Methods for Multiclass Support Vector Machines". IEEE Transactions on Neural Networks.
[12] Platt, John; Cristianini, N.; Shawe-Taylor, J. (2000). "Large margin DAGs for multiclass classification". In Solla, Sara A.; Leen, Todd K.; Müller, Klaus-Robert (eds.). Advances in Neural Information Processing Systems (PDF). MIT Press. pp. 547–553.
[13] Dietterich, Thomas G.; Bakiri, Ghulum (1995). "Solving Multiclass Learning Problems via Error-Correcting Output Codes" (PDF). Journal of Artificial Intelligence Research 2: 263–286. arXiv:cs/9501101. Bibcode:1995cs........1101D.
[14] Crammer, Koby; Singer, Yoram (2001). "On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines" (PDF). Journal of Machine Learning Research 2: 265–292.
[15] Lee, Y.; Lin, Y.; Wahba, G. (2001). "Multicategory Support Vector Machines" (PDF). Computing Science and Statistics 33.
[16] Lee, Y.; Lin, Y.; Wahba, G. (2004). "Multicategory Support Vector Machines". Journal of the American Statistical Association 99 (465): 67. doi:10.1198/016214504000000098.
[17] Joachims, Thorsten (1999). "Transductive Inference for Text Classification using Support Vector Machines". Proceedings of the 1999 International Conference on Machine Learning (ICML 1999), pp. 200–209.
[18] Drucker, Harris; Burges, Christopher J. C.; Kaufman, Linda; Smola, Alexander J.; Vapnik, Vladimir N. (1997). "Support Vector Regression Machines". Advances in Neural Information Processing Systems 9, NIPS 1996, pp. 155–161, MIT Press.
[19] Suykens, Johan A. K.; Vandewalle, Joos P. L. (1999). "Least squares support vector machine classifiers". Neural Processing Letters 9 (3): 293–300.
[20] Smola, Alex J.; Schölkopf, Bernhard (2004). "A tutorial on support vector regression" (PDF). Statistics and Computing 14 (3): 199–222.
[21] Gaonkar, Bilwaj; Davatzikos, Christos. "Analytic estimation of statistical significance maps for support vector machine based multi-variate image analysis and classification".
[22] Cuingnet, R.; Rosso, C.; Chupin, M.; Lehéricy, S.; Dormont, D.; Benali, H.; Samson, Y.; Colliot, O. (2011). "Spatial regularization of SVM for the detection of diffusion alterations associated with stroke outcome". Medical Image Analysis 15 (5): 729–737.
[23] Statnikov, A.; Hardin, D.; Aliferis, C. (2006). "Using SVM weight-based methods to identify causally relevant and non-causally relevant variables". sign, 1, 4.
[24] Platt, John C. (1999). "Using Analytic QP and Sparseness to Speed Training of Support Vector Machines" (PDF). NIPS.
[25] Ferris, M. C.; Munson, T. S. (2002). "Interior-Point Methods for Massive Support Vector Machines". SIAM Journal on Optimization 13 (3): 783. doi:10.1137/S1052623400374379.
[26] Shalev-Shwartz, Shai; Singer, Yoram; Srebro, Nathan (2007). "Pegasos: Primal Estimated sub-GrAdient SOlver for SVM" (PDF). ICML.
[27] Fan, R.-E.; Chang, K.-W.; Hsieh, C.-J.; Wang, X.-R.; Lin, C.-J. (2008). "LIBLINEAR: A library for large linear classification". Journal of Machine Learning Research 9: 1871–1874.
[28] Zhu, Zeyuan Allen; et al. (2009). "P-packSVM: Parallel Primal grAdient desCent Kernel SVM" (PDF). ICDM.

60.14 External links

• www.support-vector.net: the key book about the method, An Introduction to Support Vector Machines, with online software
• Burges, Christopher J. C.; "A Tutorial on Support Vector Machines for Pattern Recognition". Data Mining and Knowledge Discovery 2: 121–167, 1998
• www.kernel-machines.org (general information and collection of research papers)
• www.support-vector-machines.org (literature, review, software, links related to Support Vector Machines; academic site)
• videolectures.net (SVM-related video lectures)
• Karatzoglou, Alexandros; et al.; "Support Vector Machines in R". Journal of Statistical Software, April 2006, Volume 15, Issue 9.
• libsvm: LIBSVM is a popular library of SVM learners
• liblinear: a library for large linear classification including some SVMs
• Shark: a C++ machine learning library implementing various types of SVMs
• dlib: a C++ library for working with kernel methods and SVMs
• SVM light: a collection of software tools for learning and classification using SVM
• SVMJS live demo: a GUI demo for a JavaScript implementation of SVMs
• Gesture Recognition Toolkit: contains an easy-to-use wrapper for libsvm

60.15 Bibliography

• Theodoridis, Sergios; Koutroumbas, Konstantinos; Pattern Recognition, 4th Edition, Academic Press, 2009, ISBN 978-1-59749-272-0
• Cristianini, Nello; Shawe-Taylor, John; An Introduction to Support Vector Machines and other kernel-based learning methods, Cambridge University Press, 2000. ISBN 0-521-78019-5 (SVM Book)
• Huang, Te-Ming; Kecman, Vojislav; Kopriva, Ivica (2006); Kernel Based Algorithms for Mining Huge Data Sets, in Supervised, Semi-supervised, and Unsupervised Learning, Springer-Verlag, Berlin, Heidelberg, 260 pp., 96 illus., hardcover, ISBN 3-540-31681-7
• Kecman, Vojislav; Learning and Soft Computing: Support Vector Machines, Neural Networks, Fuzzy Logic Systems, The MIT Press, Cambridge, MA, 2001.
• Catanzaro, Bryan; Sundaram, Narayanan; Keutzer, Kurt; Fast Support Vector Machine Training and Classification on Graphics Processors, in International Conference on Machine Learning, 2008
Chapter 61

Artificial neural network

"Neural network" redirects here. For networks of living neurons, see Biological neural network. For the journal, see Neural Networks (journal). For the evolutionary concept, see Neutral network (evolution).

[Figure: An artificial neural network is an interconnected group of nodes, akin to the vast network of neurons in a brain. Here, each circular node represents an artificial neuron and an arrow represents a connection from the output of one neuron to the input of another.]

In machine learning and cognitive science, artificial neural networks (ANNs) are a family of statistical learning models inspired by biological neural networks (the central nervous systems of animals, in particular the brain) and are used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown. Artificial neural networks are generally presented as systems of interconnected "neurons" which send messages to each other. The connections have numeric weights that can be tuned based on experience, making neural nets adaptive to inputs and capable of learning.

For example, a neural network for handwriting recognition is defined by a set of input neurons which may be activated by the pixels of an input image. After being weighted and transformed by a function (determined by the network's designer), the activations of these neurons are then passed on to other neurons. This process is repeated until finally, an output neuron is activated. This determines which character was read.

Like other machine learning methods (systems that learn from data), neural networks have been used to solve a wide variety of tasks that are hard to solve using ordinary rule-based programming, including computer vision and speech recognition.

61.1 Background

Examinations of humans' central nervous systems inspired the concept of artificial neural networks. In an artificial neural network, simple artificial nodes, known as "neurons", "neurodes", "processing elements" or "units", are connected together to form a network which mimics a biological neural network.

There is no single formal definition of what an artificial neural network is. However, a class of statistical models may commonly be called "neural" if it possesses the following characteristics:

1. it contains sets of adaptive weights, i.e. numerical parameters that are tuned by a learning algorithm, and
2. it is capable of approximating non-linear functions of its inputs.

The adaptive weights can be thought of as connection strengths between neurons, which are activated during training and prediction.

Neural networks are similar to biological neural networks in the performing of functions collectively and in parallel by the units, rather than there being a clear delineation of subtasks to which individual units are assigned. The term "neural network" usually refers to models employed in statistics, cognitive psychology and artificial intelligence.
Neural network models which emulate the central nervous system are part of theoretical neuroscience and computational neuroscience.

In modern software implementations of artificial neural networks, the approach inspired by biology has been largely abandoned for a more practical approach based on statistics and signal processing. In some of these systems, neural networks or parts of neural networks (like artificial neurons) form components in larger systems that combine both adaptive and non-adaptive elements. While the more general approach of such systems is more suitable for real-world problem solving, it has little to do with the traditional, artificial intelligence connectionist models. What they do have in common, however, is the principle of non-linear, distributed, parallel and local processing and adaptation. Historically, the use of neural network models marked a directional shift in the late eighties from high-level (symbolic) AI, characterized by expert systems with knowledge embodied in if-then rules, to low-level (sub-symbolic) machine learning, characterized by knowledge embodied in the parameters of a dynamical system.

61.2 History

Warren McCulloch and Walter Pitts[1] (1943) created a computational model for neural networks based on mathematics and algorithms called threshold logic. This model paved the way for neural network research to split into two distinct approaches. One approach focused on biological processes in the brain and the other focused on the application of neural networks to artificial intelligence.

In the late 1940s psychologist Donald Hebb[2] created a hypothesis of learning based on the mechanism of neural plasticity that is now known as Hebbian learning. Hebbian learning is considered to be a 'typical' unsupervised learning rule and its later variants were early models for long term potentiation. Researchers started applying these ideas to computational models in 1948 with Turing's B-type machines.

Farley and Wesley A. Clark[3] (1954) first used computational machines, then called "calculators", to simulate a Hebbian network at MIT. Other neural network computational machines were created by Rochester, Holland, Habit, and Duda[4] (1956).

Frank Rosenblatt[5] (1958) created the perceptron, an algorithm for pattern recognition based on a two-layer computer learning network using simple addition and subtraction. With mathematical notation, Rosenblatt also described circuitry not in the basic perceptron, such as the exclusive-or circuit, a circuit whose mathematical computation could not be processed until after the backpropagation algorithm was created by Paul Werbos[6] (1975).

Neural network research stagnated after the publication of machine learning research by Marvin Minsky and Seymour Papert[7] (1969), who discovered two key issues with the computational machines that processed neural networks. The first was that single-layer neural networks were incapable of processing the exclusive-or circuit. The second significant issue was that computers didn't have enough processing power to effectively handle the long run time required by large neural networks. Neural network research slowed until computers achieved greater processing power. Another key advance that came later was the backpropagation algorithm which effectively solved the exclusive-or problem (Werbos 1975).[6]

The parallel distributed processing of the mid-1980s became popular under the name connectionism. The textbook by David E. Rumelhart and James McClelland[8] (1986) provided a full exposition of the use of connectionism in computers to simulate neural processes.

Neural networks, as used in artificial intelligence, have traditionally been viewed as simplified models of neural processing in the brain, even though the relation between this model and the biological architecture of the brain is debated; it is not clear to what degree artificial neural networks mirror brain function.[9]

Support vector machines and other, much simpler methods such as linear classifiers gradually overtook neural networks in machine learning popularity. But the advent of deep learning in the late 2000s sparked renewed interest in neural nets.

61.2.1 Improvements since 2006

Computational devices have been created in CMOS, for both biophysical simulation and neuromorphic computing. More recent efforts show promise for creating nanodevices[10] for very large scale principal components analyses and convolution. If successful, these efforts could usher in a new era of neural computing[11] that is a step beyond digital computing, because it depends on learning rather than programming and because it is fundamentally analog rather than digital even though the first instantiations may in fact be with CMOS digital devices.

Between 2009 and 2012, the recurrent neural networks and deep feedforward neural networks developed in the research group of Jürgen Schmidhuber at the Swiss AI Lab IDSIA have won eight international competitions in pattern recognition and machine learning.[12][13] For example, the bi-directional and multi-dimensional long short term memory (LSTM)[14][15][16][17] of Alex Graves et al. won three competitions in connected handwriting recognition at the 2009 International Conference on Document Analysis and Recognition (ICDAR), without any prior knowledge about the three different languages to be learned.

Fast GPU-based implementations of this approach by
61.3 Models

Neural network models in artificial intelligence are usually referred to as artificial neural networks (ANNs); these are essentially simple mathematical models defining a function f : X → Y or a distribution over X or both X and Y, but sometimes models are also intimately associated with a particular learning algorithm or learning rule. A common use of the phrase "ANN model" is really the definition of a class of such functions (where members of the class are obtained by varying parameters, connection weights, or specifics of the architecture such as the number of neurons or their connectivity).

61.3.1 Network function

See also: Graphical models

The word network in the term 'artificial neural network' refers to the interconnections between the neurons in the different layers of each system. An example system has three layers. The first layer has input neurons which send data via synapses to the second layer of neurons, and then via more synapses to the third layer of output neurons. More complex systems will have more layers of neurons, some having increased layers of input neurons and output neurons. The synapses store parameters called "weights" that manipulate the data in the calculations.

An ANN is typically defined by three types of parameters:

1. the interconnection pattern between the different layers of neurons;
2. the learning process for updating the weights of the interconnections;
3. the activation function that converts a neuron's weighted input to its output activation.

[Figure: ANN dependency graph]

This figure depicts such a decomposition of f, with dependencies between variables indicated by arrows. These can be interpreted in two ways.

The first view is the functional view: the input x is transformed into a 3-dimensional vector h, which is then transformed into a 2-dimensional vector g, which is finally transformed into f. This view is most commonly encountered in the context of optimization.

The second view is the probabilistic view: the random variable F = f(G) depends upon the random variable G = g(H), which depends upon H = h(X), which depends upon the random variable X. This view is most commonly encountered in the context of graphical models.

The two views are largely equivalent. In either case, for this particular network architecture, the components of individual layers are independent of each other (e.g., the components of g are independent of each other given their input h). This naturally enables a degree of parallelism in the implementation.

Networks such as the previous one are commonly called feedforward, because their graph is a directed acyclic graph. Networks with cycles are commonly called recurrent. Such networks are commonly depicted in the manner shown at the top of the figure.
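The functional view above can be made concrete with a small sketch: an input x is transformed into a 3-dimensional vector h, then into a 2-dimensional vector g, and finally into the output f. The weights below are random placeholders rather than trained values, and the layer sizes are chosen only to match the example in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder weight matrices for a tiny feedforward network
# (4 inputs -> 3 hidden units -> 2 hidden units -> 1 output).
W1 = rng.normal(size=(3, 4)); b1 = np.zeros(3)
W2 = rng.normal(size=(2, 3)); b2 = np.zeros(2)
W3 = rng.normal(size=(1, 2)); b3 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def network(x):
    """Functional view: x -> h -> g -> f, a composition of weighted sums
    passed through a nonlinearity at each layer."""
    h = sigmoid(W1 @ x + b1)   # 3-dimensional vector h
    g = sigmoid(W2 @ h + b2)   # 2-dimensional vector g
    f = sigmoid(W3 @ g + b3)   # final output f
    return f

print(network(np.array([0.5, -1.0, 2.0, 0.0])))
```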
Consider, for example, the problem of finding a model f that minimizes the mean-squared error E[(f(x) − y)²] over data pairs (x, y) drawn from some distribution D. In practical situations we would only have N samples from D and thus, for the above example, we would only minimize

C = (1/N) Σ_{i=1}^{N} (f(x_i) − y_i)².

Thus, the cost is minimized over a sample of the data rather than the entire distribution generating the data.

Tasks that fall within the paradigm of supervised learning are pattern recognition (also known as classification) and regression (also known as function approximation). The supervised learning paradigm is also applicable to sequential data (e.g., for speech and gesture recognition). This can be thought of as learning with a "teacher", in the form of a function that provides continuous feedback on the quality of solutions obtained thus far.
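As a minimal illustration of the sampled cost above, the following sketch computes C = (1/N) Σ_i (f(x_i) − y_i)² for a candidate model f over the N available samples; the data and the candidate model are hypothetical and serve only to show the computation.

```python
import numpy as np

def sampled_cost(f, X, y):
    """Empirical cost C = (1/N) * sum_i (f(x_i) - y_i)^2 over the N available
    samples, used in place of the expectation over the full distribution D."""
    preds = np.array([f(x) for x in X])
    return np.mean((preds - y) ** 2)

# Hypothetical samples and a hypothetical candidate model, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=100)
f = lambda x: x.sum()          # a candidate model f
print(sampled_cost(f, X, y))
```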
Unsupervised learning

In unsupervised learning, some data x is given together with a cost function to be minimized, which can be any function of the data x and the network's output, f.

The cost function is dependent on the task (what we are trying to model) and our a priori assumptions (the implicit properties of our model, its parameters and the observed variables).

As a trivial example, consider the model f(x) = a where a is a constant and the cost C = E[(x − f(x))²]. Minimizing this cost will give us a value of a that is equal to the mean of the data. The cost function can be much more complicated. Its form depends on the application: for example, in compression it could be related to the mutual information between x and f(x), whereas in statistical modeling, it could be related to the posterior probability of the model given the data (note that in both of those examples those quantities would be maximized rather than minimized).

Tasks that fall within the paradigm of unsupervised learning are in general estimation problems; the applications include clustering, the estimation of statistical distributions, compression and filtering.

Reinforcement learning

In reinforcement learning, data x are usually not given, but generated by an agent's interactions with the environment. At each point in time t, the agent performs an action y_t and the environment generates an observation x_t and an instantaneous cost c_t, according to some (usually unknown) dynamics. The aim is to discover a policy for selecting actions that minimizes some measure of a long-term cost, e.g., the expected cumulative cost. The environment's dynamics and the long-term cost for each policy are usually unknown, but can be estimated.

More formally, the environment is modeled as a Markov decision process (MDP) with states s_1, …, s_n ∈ S and actions a_1, …, a_m ∈ A with the following probability distributions: the instantaneous cost distribution P(c_t | s_t), the observation distribution P(x_t | s_t) and the transition P(s_{t+1} | s_t, a_t), while a policy is defined as the conditional distribution over actions given the observations. Taken together, the two then define a Markov chain (MC). The aim is to discover the policy (i.e., the MC) that minimizes the cost.

ANNs are frequently used in reinforcement learning as part of the overall algorithm.[30][31] Dynamic programming has been coupled with ANNs (neuro-dynamic programming) by Bertsekas and Tsitsiklis[32] and applied to multi-dimensional nonlinear problems such as those involved in vehicle routing,[33] natural resources management[34][35] or medicine,[36] because of the ability of ANNs to mitigate losses of accuracy even when reducing the discretization grid density for numerically approximating the solution of the original control problems.

Tasks that fall within the paradigm of reinforcement learning are control problems, games and other sequential decision making tasks.

See also: dynamic programming and stochastic control

61.3.4 Learning algorithms

Training a neural network model essentially means selecting one model from the set of allowed models (or, in a Bayesian framework, determining a distribution over the set of allowed models) that minimizes the cost criterion. There are numerous algorithms available for training neural network models; most of them can be viewed as a straightforward application of optimization theory and statistical estimation.

Most of the algorithms used in training artificial neural networks employ some form of gradient descent, using backpropagation to compute the actual gradients. This is done by simply taking the derivative of the cost function with respect to the network parameters and then changing those parameters in a gradient-related direction. The backpropagation training algorithms are usually classified into three categories: steepest descent (with variable learning rate, with variable learning rate and momentum, resilient backpropagation), quasi-Newton (Broyden-Fletcher-Goldfarb-Shanno, one step secant, Levenberg-Marquardt) and conjugate gradient (Fletcher-Reeves update, Polak-Ribière update, Powell-Beale restart, scaled conjugate gradient).[37]

Evolutionary methods,[38] gene expression programming,[39] simulated annealing,[40] expectation-maximization, non-parametric methods and particle swarm optimization[41] are some commonly used methods for training neural networks.

See also: machine learning
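A minimal sketch of the gradient-descent-with-backpropagation idea described above: a network with one hidden layer is trained by repeatedly computing the cost, propagating its derivative back through the layers, and updating the parameters in the negative gradient direction. The toy regression data, layer size, learning rate and number of steps are assumptions for this example; practical implementations use one of the many algorithm variants listed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data set (an assumption for the example): y = sin(x1) + 0.5*x2.
X = rng.uniform(-2, 2, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]

# One hidden layer of 8 tanh units, linear output.
W1 = rng.normal(scale=0.5, size=(8, 2)); b1 = np.zeros(8)
w2 = rng.normal(scale=0.5, size=8);      b2 = 0.0
lr = 0.05

for step in range(3000):
    # Forward pass.
    H = np.tanh(X @ W1.T + b1)          # hidden activations, shape (N, 8)
    y_hat = H @ w2 + b2                 # network output, shape (N,)
    cost = np.mean((y_hat - y) ** 2)    # empirical mean-squared error

    # Backward pass (backpropagation): derivative of the cost w.r.t. each parameter.
    d_out = 2.0 * (y_hat - y) / len(y)                 # dC/dy_hat
    grad_w2 = H.T @ d_out
    grad_b2 = d_out.sum()
    d_hidden = np.outer(d_out, w2) * (1.0 - H ** 2)    # through the tanh nonlinearity
    grad_W1 = d_hidden.T @ X
    grad_b1 = d_hidden.sum(axis=0)

    # Gradient-descent parameter update.
    W1 -= lr * grad_W1; b1 -= lr * grad_b1
    w2 -= lr * grad_w2; b2 -= lr * grad_b2

print("final cost:", cost)
```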
61.4 Employing artificial neural networks

Perhaps the greatest advantage of ANNs is their ability to be used as an arbitrary function approximation mechanism that 'learns' from observed data. However, using them is not so straightforward, and a relatively good understanding of the underlying theory is essential.

• Choice of model: This will depend on the data representation and the application. Overly complex models tend to lead to problems with learning.
• Learning algorithm: There are numerous trade-offs between learning algorithms. Almost any algorithm will work well with the correct hyperparameters for training on a particular fixed data set. However, selecting and tuning an algorithm for training on unseen data requires a significant amount of experimentation.
• Robustness: If the model, cost function and learning algorithm are selected appropriately, the resulting ANN can be extremely robust.

With the correct implementation, ANNs can be used naturally in online learning and large data set applications. Their simple implementation and the existence of mostly local dependencies exhibited in the structure allow for fast, parallel implementations in hardware.

61.5 Applications

Application areas include medical diagnosis, financial applications (e.g. automated trading systems), data mining (or knowledge discovery in databases, "KDD"), visualization and e-mail spam filtering.

Artificial neural networks have also been used to diagnose several cancers. An ANN-based hybrid lung cancer detection system named HLND improves the accuracy of diagnosis and the speed of lung cancer radiology.[43] These networks have also been used to diagnose prostate cancer. The diagnoses can be used to make specific models taken from a large group of patients compared to information of one given patient. The models do not depend on assumptions about correlations of different variables. Colorectal cancer has also been predicted using the neural networks. Neural networks could predict the outcome for a patient with colorectal cancer with more accuracy than the current clinical methods. After training, the networks could predict multiple patient outcomes from unrelated institutions.[44]
early research in distributed representations[46] and self-organizing maps. E.g. in sparse distributed memory the patterns encoded by neural networks are used as memory addresses for content-addressable memory, with "neurons" essentially serving as address encoders and decoders.

More recently, deep learning was shown to be useful in semantic hashing,[47] where a deep graphical model of the word-count vectors[48] obtained from a large set of documents is used. Documents are mapped to memory addresses in such a way that semantically similar documents are located at nearby addresses. Documents similar to a query document can then be found by simply accessing all the addresses that differ by only a few bits from the address of the query document.

Neural Turing Machines,[49] developed by Google DeepMind, extend the capabilities of deep neural networks by coupling them to external memory resources, which they can interact with by attentional processes. The combined system is analogous to a Turing machine but is differentiable end-to-end, allowing it to be efficiently trained with gradient descent. Preliminary results demonstrate that Neural Turing Machines can infer simple algorithms such as copying, sorting, and associative recall from input and output examples.

Memory Networks[50] are another extension to neural networks incorporating long-term memory, developed by Facebook research. The long-term memory can be read and written to, with the goal of using it for prediction. These models have been applied in the context of question answering (QA), where the long-term memory effectively acts as a (dynamic) knowledge base, and the output is a textual response.

61.6 Neural network software

Main article: Neural network software

Neural network software is used to simulate, research, develop and apply artificial neural networks, biological neural networks and, in some cases, a wider array of adaptive systems.

61.7 Types of artificial neural networks

Main article: Types of artificial neural networks

Artificial neural network types vary from those with only one or two layers of single direction logic, to complicated multi-input many directional feedback loops and layers. On the whole, these systems use algorithms in their programming to determine control and organization of their functions. Most systems use "weights" to change the parameters of the throughput and the varying connections to the neurons. Artificial neural networks can be autonomous and learn by input from outside "teachers" or even self-teaching from written-in rules.

61.8 Theoretical properties

61.8.1 Computational power

The multi-layer perceptron (MLP) is a universal function approximator, as proven by the universal approximation theorem. However, the proof is not constructive regarding the number of neurons required or the settings of the weights.

Work by Hava Siegelmann and Eduardo D. Sontag has provided a proof that a specific recurrent architecture with rational valued weights (as opposed to full precision real number-valued weights) has the full power of a Universal Turing Machine[51] using a finite number of neurons and standard linear connections. Further, it has been shown that the use of irrational values for weights results in a machine with super-Turing power.[52]

61.8.2 Capacity

Artificial neural network models have a property called 'capacity', which roughly corresponds to their ability to model any given function. It is related to the amount of information that can be stored in the network and to the notion of complexity.

61.8.3 Convergence

Nothing can be said in general about convergence since it depends on a number of factors. Firstly, there may exist many local minima. This depends on the cost function and the model. Secondly, the optimization method used might not be guaranteed to converge when far away from a local minimum. Thirdly, for a very large amount of data or parameters, some methods become impractical. In general, it has been found that theoretical guarantees regarding convergence are an unreliable guide to practical application.

61.8.4 Generalization and statistics

In applications where the goal is to create a system that generalizes well to unseen examples, the problem of over-training has emerged. This arises in convoluted or over-specified systems when the capacity of the network significantly exceeds the needed free parameters. There are two schools of thought for avoiding this problem: the first is to use cross-validation and similar techniques
to check for the presence of overtraining and optimally select hyperparameters such as to minimize the generalization error. The second is to use some form of regularization. This is a concept that emerges naturally in a probabilistic (Bayesian) framework, where the regularization can be performed by selecting a larger prior probability over simpler models; but also in statistical learning theory, where the goal is to minimize over two quantities: the 'empirical risk' and the 'structural risk', which roughly correspond to the error over the training set and the predicted error in unseen data due to overfitting.

[Figure: Confidence analysis of a neural network]

61.9 Criticism

A common criticism of neural networks, particularly in robotics, is that they require a large diversity of training samples for real-world operation. This is not surprising, since any learning machine needs sufficient representative examples in order to capture the underlying structure that allows it to generalize to new cases. Dean Pomerleau, in his research presented in the paper "Knowledge-based Training of Artificial Neural Networks for Autonomous Robot Driving", uses a neural network to train a robotic vehicle to drive on multiple types of roads (single lane, multi-lane, dirt, etc.). A large amount of his research is devoted to (1) extrapolating multiple training scenarios from a single training experience, and (2) preserving past training diversity so that the system does not become overtrained (if, for example, it is presented with a series of right turns, it should not learn to always turn right). These issues are common in neural networks that must decide from amongst a wide variety of responses, but can be dealt with in several ways, for example by randomly shuffling the training examples, by using a numerical optimization algorithm that does not take too large steps when changing the network connections following an example, or by grouping examples in so-called mini-batches.

A. K. Dewdney, a former Scientific American columnist, wrote in 1997, "Although neural nets do solve a few toy problems, their powers of computation are so limited that I am surprised anyone takes them seriously as a general problem-solving tool." (Dewdney, p. 82)
61.9.4 Hybrid approaches

See also

• Deep learning
• Digital morphogenesis
• Neural coding
• Parallel distributed processing
• Radial basis function network
• Recurrent neural networks

References

[13] http://www.kurzweilai.net/how-bio-inspired-deep-learning-keeps-winning-competitions 2012 Kurzweil AI interview with Jürgen Schmidhuber on the eight competitions won by his Deep Learning team 2009–2012.
[31] Hoskins, J. C.; Himmelblau, D. M. (1992). "Process control via artificial neural networks and reinforcement learning". Computers & Chemical Engineering 16 (4): 241–251. doi:10.1016/0098-1354(92)80045-B.
[32] Bertsekas, D. P.; Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Athena Scientific. p. 512. ISBN 1-886529-10-8.
[33] Secomandi, Nicola (2000). "Comparing neuro-dynamic programming algorithms for the vehicle routing problem with stochastic demands". Computers & Operations Research 27 (11–12): 1201–1225. doi:10.1016/S0305-0548(99)00146-X.
[34] de Rigo, D.; Rizzoli, A. E.; Soncini-Sessa, R.; Weber, E.; Zenesi, P. (2001). "Neuro-dynamic programming for the efficient management of reservoir networks" (PDF). Proceedings of MODSIM 2001, International Congress on Modelling and Simulation. Canberra, Australia: Modelling and Simulation Society of Australia and New Zealand. doi:10.5281/zenodo.7481. ISBN 0-867405252. Retrieved 29 July 2012.
[35] Damas, M.; Salmeron, M.; Diaz, A.; Ortega, J.; Prieto, A.; Olivares, G. (2000). "Genetic algorithms and neuro-dynamic programming: application to water supply networks". Proceedings of 2000 Congress on Evolutionary Computation. La Jolla, California, USA: IEEE.
[41] Wu, J.; Chen, E. (May 2009). "A Novel Nonparametric Regression Ensemble for Rainfall Forecasting Using Particle Swarm Optimization Technique Coupled with Artificial Neural Network". In Wang, H.; Shen, Y.; Huang, T.; Zeng, Z. (eds.). 6th International Symposium on Neural Networks, ISNN 2009. Springer. doi:10.1007/978-3-642-01513-7_6. ISBN 978-3-642-01215-0.
[42] Balabin, Roman M.; Lomakina, Ekaterina I. (2009). "Neural network approach to quantum-chemistry data: Accurate prediction of density functional theory energies". J. Chem. Phys. 131 (7): 074104. doi:10.1063/1.3206326. PMID 19708729.
[43] Ganesan, N. "Application of Neural Networks in Diagnosing Cancer Disease Using Demographic Data" (PDF). International Journal of Computer Applications.
[44] Bottaci, Leonardo. "Artificial Neural Networks Applied to Outcome Prediction for Colorectal Cancer Patients in Separate Institutions" (PDF). The Lancet.
[45] Forrest, M. D. (April 2015). "Simulation of alcohol action upon a detailed Purkinje neuron model and a simpler surrogate model that runs >400 times faster". BMC Neuroscience 16 (27). doi:10.1186/s12868-015-0162-6.
[46] Hinton, Geoffrey E. "Distributed representations" (1984).
[47] Salakhutdinov, Ruslan; Hinton, Geoffrey (2009). "Semantic hashing". International Journal of Approximate Reasoning 50 (7): 969–978.
[48] Le, Quoc V.; Mikolov, Tomas. "Distributed representations of sentences and documents". arXiv preprint arXiv:1405.4053 (2014).
[49] Graves, Alex; Wayne, Greg; Danihelka, Ivo. "Neural Turing Machines". arXiv preprint arXiv:1410.5401 (2014).
[50] Weston, Jason; Chopra, Sumit; Bordes, Antoine. "Memory networks". arXiv preprint arXiv:1410.3916 (2014).
[51] Siegelmann, H. T.; Sontag, E. D. (1991). "Turing computability with neural nets" (PDF). Appl. Math. Lett. 4 (6): 77–80. doi:10.1016/0893-9659(91)90080-F.
[52] Balcázar, José (Jul 1997). "Computational Power of Neural Networks: A Kolmogorov Complexity Characterization". IEEE Transactions on Information Theory 43 (4): 1175–1183. doi:10.1109/18.605580. Retrieved 3 November 2014.
[53] "NASA - Dryden Flight Research Center - News Room: News Releases: NASA NEURAL NETWORK PROJECT PASSES MILESTONE". Nasa.gov. Retrieved on 2013-11-20.
[54] Roger Bridgman's defence of neural networks.
[55] http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/4
[56] Sun and Bookman (1990)
[57] Tahmasebi; Hezarkhani (2012). "A hybrid neural networks-fuzzy logic-genetic algorithm for grade estimation". Computers & Geosciences 42: 18–27. doi:10.1016/j.cageo.2012.02.004.

Bibliography

• Bishop, C. M. (1995) Neural Networks for Pattern Recognition, Oxford: Oxford University Press. ISBN 0-19-853849-9 (hardback) or ISBN 0-19-853864-2 (paperback)
• Cybenko, G. V. (1989). "Approximation by Superpositions of a Sigmoidal function", Mathematics of Control, Signals, and Systems, Vol. 2, pp. 303–314. Electronic version
• Duda, R. O.; Hart, P. E.; Stork, D. G. (2001) Pattern Classification (2nd edition), Wiley, ISBN 0-471-05669-3
• Egmont-Petersen, M.; de Ridder, D.; Handels, H. (2002). "Image processing with neural networks: a review". Pattern Recognition 35 (10): 2279–2301. doi:10.1016/S0031-3203(01)00178-9.
• Gurney, K. (1997) An Introduction to Neural Networks, London: Routledge. ISBN 1-85728-673-1 (hardback) or ISBN 1-85728-503-4 (paperback)
• Haykin, S. (1999) Neural Networks: A Comprehensive Foundation, Prentice Hall, ISBN 0-13-273350-1
• Fahlman, S.; Lebiere, C. (1991). "The Cascade-Correlation Learning Architecture", created for National Science Foundation, Contract Number EET-8716324, and Defense Advanced Research Projects Agency (DOD), ARPA Order No. 4976 under Contract F33615-87-C-1499. Electronic version
• Hertz, J.; Palmer, R. G.; Krogh, A. S. (1990) Introduction to the Theory of Neural Computation, Perseus Books. ISBN 0-201-51560-1
• Lawrence, Jeanette (1994) Introduction to Neural Networks, California Scientific Software Press. ISBN 1-883157-00-5
• Masters, Timothy (1994) Signal and Image Processing with Neural Networks, John Wiley & Sons, Inc. ISBN 0-471-04963-8
• Ripley, Brian D. (1996) Pattern Recognition and Neural Networks, Cambridge
• Siegelmann, H. T.; Sontag, E. D. (1994). "Analog computation via neural networks", Theoretical Computer Science, v. 131, no. 2, pp. 331–360. Electronic version
• Wasserman, Philip (1993) Advanced Methods in Neural Computing, Van Nostrand Reinhold, ISBN 0-442-00461-3
• Computational Intelligence: A Methodological Introduction by Kruse, Borgelt, Klawonn, Moewes, Steinbrecher, Held, 2013, Springer, ISBN 9781447150121
• Neuro-Fuzzy-Systeme (3rd edition) by Borgelt, Klawonn, Kruse, Nauck, 2003, Vieweg, ISBN 9783528252656

61.14 External links

• Neural Networks at DMOZ
Chapter 62

Deep learning
transformation is a processing unit that has trainable parameters, such as weights and thresholds.[4](p6) A chain of transformations from input to output is a credit assignment path (CAP). CAPs describe potentially causal connections between input and output and may vary in length. For a feedforward neural network, the depth of the CAPs, and thus the depth of the network, is the number of hidden layers plus one (the output layer is also parameterized). For recurrent neural networks, in which a signal may propagate through a layer more than once, the CAP is potentially unlimited in length. There is no universally agreed upon threshold of depth dividing shallow learning from deep learning, but most researchers in the field agree that deep learning has multiple nonlinear layers (CAP > 2), and Schmidhuber considers CAP > 10 to be very deep learning.[4](p7)

62.1.2 Fundamental concepts

Deep learning algorithms are based on distributed representations. The underlying assumption behind distributed representations is that observed data is generated by the interactions of many different factors on different levels. Deep learning adds the assumption that these factors are organized into multiple levels, corresponding to different levels of abstraction or composition. Varying numbers of layers and layer sizes can be used to provide different amounts of abstraction.[3]

Deep learning algorithms in particular exploit this idea of hierarchical explanatory factors. Different concepts are learned from other concepts, with the more abstract, higher level concepts being learned from the lower level ones. These architectures are often constructed with a greedy layer-by-layer method that models this idea. Deep learning helps to disentangle these abstractions and pick out which features are useful for learning.[3]

For supervised learning tasks where label information is readily available in training, deep learning promotes a principle which is very different from traditional methods of machine learning. That is, rather than focusing on feature engineering, which is often labor-intensive and varies from one task to another, deep learning methods are focused on end-to-end learning based on raw features. In other words, deep learning moves away from feature engineering to a maximal extent possible. To accomplish end-to-end optimization starting with raw features and ending in labels, layered structures are often necessary. From this perspective, we can regard the use of layered structures to derive intermediate representations in deep learning as a natural consequence of raw-feature-based end-to-end learning.[1] Understanding the connection between the above two aspects of deep learning is important to appreciate its use in several application areas, all involving supervised learning tasks (e.g., supervised speech and image recognition), as will be discussed in a later part of this article.

Many deep learning algorithms are framed as unsupervised learning problems. Because of this, these algorithms can make use of the unlabeled data that supervised algorithms cannot. Unlabeled data is usually more abundant than labeled data, making this an important benefit of these algorithms. The deep belief network is an example of a deep structure that can be trained in an unsupervised manner.[3]

62.2 History

Deep learning architectures, specifically those built from artificial neural networks (ANN), date back at least to the Neocognitron introduced by Kunihiko Fukushima in 1980.[10] The ANNs themselves date back even further. In 1989, Yann LeCun et al. were able to apply the standard backpropagation algorithm, which had been around since 1974,[11] to a deep neural network with the purpose of recognizing handwritten ZIP codes on mail. Despite the success of applying the algorithm, the time to train the network on this dataset was approximately 3 days, making it impractical for general use.[12] Many factors contribute to the slow speed, one being the so-called vanishing gradient problem analyzed in 1991 by Sepp Hochreiter.[13][14]

While such neural networks by 1991 were used for recognizing isolated 2-D hand-written digits, 3-D object recognition by 1991 used a 3-D model-based approach, matching 2-D images with a handcrafted 3-D object model. Juyang Weng et al. proposed that a human brain does not use a monolithic 3-D object model, and in 1992 they published Cresceptron,[15][16][17] a method for performing 3-D object recognition directly from cluttered scenes. Cresceptron is a cascade of many layers similar to Neocognitron. But unlike Neocognitron, which required the human programmer to hand-merge features, Cresceptron fully automatically learned an open number of unsupervised features in each layer of the cascade, where each feature is represented by a convolution kernel. In addition, Cresceptron also segmented each learned object from a cluttered scene through back-analysis through the network. Max-pooling, now often adopted by deep neural networks (e.g., ImageNet tests), was first used in Cresceptron to reduce the position resolution by a factor of (2x2) to 1 through the cascade for better generalization. Because of a great lack of understanding of how the brain autonomously wires its biological networks, and because of the computational cost of ANNs at the time, simpler models that use task-specific handcrafted features, such as Gabor filters and support vector machines (SVMs), were the popular choice of the field in the 1990s and 2000s.

In the long history of speech recognition, both shallow and deep forms (e.g., recurrent nets) of artificial neural networks had been explored for many years.[18][19][20]
A number of key difficulties had been methodologically analyzed, including gradient diminishing and weak temporal correlation structure in the neural predictive models.[22][23] All these difficulties were in addition to the lack of big training data and big computing power in those early days. Most speech recognition researchers who understood such barriers hence subsequently moved away from neural nets to pursue generative modeling approaches, until the recent resurgence of deep learning that has overcome all these difficulties. Hinton et al. and Deng et al. reviewed part of this recent history about how their collaboration with each other, and then with cross-group colleagues, ignited the renaissance of neural networks and initiated deep learning research and applications in speech recognition.[24][25][26][27]

The term "deep learning" gained traction in the mid-2000s after a publication by Geoffrey Hinton and Ruslan Salakhutdinov showed how a many-layered feedforward neural network could be effectively pre-trained one layer at a time, treating each layer in turn as an unsupervised restricted Boltzmann machine, then using supervised backpropagation for fine-tuning.[28] In 1992, Schmidhuber had already implemented a very similar idea for the more general case of unsupervised deep hierarchies of recurrent neural networks, and had also experimentally shown its benefits for speeding up supervised learning.[29][30]

Since the resurgence of deep learning, it has become part of many state-of-the-art systems in different disciplines, particularly computer vision and automatic speech recognition (ASR). Results on commonly used evaluation sets such as TIMIT (ASR) and MNIST (image classification), as well as a range of large-vocabulary speech recognition tasks, are constantly being improved with new applications of deep learning.[24][31][32] Currently, it has been shown that deep learning architectures in the form of convolutional neural networks are nearly the best performing;[33][34] however, these are more widely used in computer vision than in ASR.

The real impact of deep learning in industry started in large-scale speech recognition around 2010. In late 2009, Geoff Hinton was invited by Li Deng to work with him and colleagues at Microsoft Research in Redmond to apply deep learning to speech recognition. They co-organized the 2009 NIPS Workshop on Deep Learning for Speech Recognition. The workshop was motivated by the limitations of deep generative models of speech, and the possibility that the big-compute, big-data era warranted a serious try of the deep neural net (DNN) approach. It was then (incorrectly) believed that pre-training of DNNs using generative models of deep belief nets (DBNs) would be the cure for the main difficulties of neural nets encountered during the 1990s.[26] However, soon after the research along this direction started at Microsoft Research, it was discovered that when large amounts of training data are used, and especially when DNNs are designed correspondingly with large, context-dependent output layers, dramatic error reduction occurred over the then-state-of-the-art GMM-HMM and more advanced generative-model-based speech recognition systems, without the need for generative DBN pre-training; this finding was verified subsequently by several other major speech recognition research groups.[24][35] Further, the nature of recognition errors produced by the two types of systems was found to be characteristically different,[25][36] offering technical insights into how to artfully integrate deep learning into the existing, highly efficient run-time speech decoding systems deployed by all major players in the speech recognition industry. The history of this significant development in deep learning has been described and analyzed in recent books.[1][37]

Advances in hardware have also been an important enabling factor for the renewed interest in deep learning. In particular, powerful graphics processing units (GPUs) are highly suited for the kind of number crunching and matrix/vector math involved in machine learning. GPUs have been shown to speed up training algorithms by orders of magnitude, bringing running times of weeks back to days.[38][39]

62.3 Deep learning in artificial neural networks

Some of the most successful deep learning methods involve artificial neural networks. Artificial neural networks are inspired by the 1959 biological model proposed by Nobel laureates David H. Hubel and Torsten Wiesel, who found two types of cells in the primary visual cortex: simple cells and complex cells. Many artificial neural networks can be viewed as cascading models[15][16][17][40] of cell types inspired by these biological observations.

Fukushima's Neocognitron introduced convolutional neural networks partially trained by unsupervised learning, while humans directed features in the neural plane. Yann LeCun et al. (1989) applied supervised backpropagation to such architectures.[41] Weng et al. (1992) published the convolutional neural network Cresceptron[15][16][17] for 3-D object recognition from images of cluttered scenes and for segmentation of such objects from images.

An obvious need for recognizing general 3-D objects is at least shift invariance and tolerance to deformation. Max-pooling appears to have been first proposed by Cresceptron[15][16] to enable the network to tolerate small-to-large deformation in a hierarchical way while using convolution. Max-pooling helps, but still does not fully guarantee, shift-invariance at the pixel level.[17]
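As a concrete illustration of the pooling operation just described, a minimal numpy sketch of non-overlapping 2x2 max-pooling over a single feature map might look as follows (the function name and the toy feature map are illustrative, not drawn from Cresceptron or any particular library):

    import numpy as np

    def max_pool_2x2(feature_map):
        """Non-overlapping 2x2 max-pooling over a single 2-D feature map."""
        h, w = feature_map.shape
        fm = feature_map[:h - h % 2, :w - w % 2]             # trim odd edges
        return fm.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

    fm = np.arange(16.0).reshape(4, 4)
    print(max_pool_2x2(fm))          # 2x2 output; shifting the input by one pixel
                                     # often leaves these pooled maxima unchanged,
                                     # which is the partial shift tolerance noted above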
With the advent of the back-propagation algorithm in the 1970s, many researchers tried to train supervised deep artificial neural networks from scratch, initially with little success. Sepp Hochreiter's diploma thesis of 1991[42][43] formally identified the reason for this failure in the vanishing gradient problem, which affects not only many-layered feedforward networks but also recurrent neural networks. The latter are trained by unfolding them into very deep feedforward networks, where a new layer is created for each time step of an input sequence processed by the network. As errors propagate from layer to layer, they shrink exponentially with the number of layers.

To overcome this problem, several methods were proposed. One is Jürgen Schmidhuber's multi-level hierarchy of networks (1992), pre-trained one level at a time through unsupervised learning and fine-tuned through backpropagation.[29] Here each level learns a compressed representation of the observations that is fed to the next level.

Another method is the long short-term memory (LSTM) network of 1997 by Hochreiter and Schmidhuber.[44] In 2009, deep multidimensional LSTM networks won three ICDAR 2009 competitions in connected handwriting recognition, without any prior knowledge about the three different languages to be learned.[45][46]

Sven Behnke relied only on the sign of the gradient (Rprop) when training his Neural Abstraction Pyramid[47] to solve problems like image reconstruction and face localization.

Other methods also use unsupervised pre-training to structure a neural network, making it first learn generally useful feature detectors. Then the network is trained further by supervised back-propagation to classify labeled data. The deep model of Hinton et al. (2006) involves learning the distribution of a high-level representation using successive layers of binary or real-valued latent variables. It uses a restricted Boltzmann machine (Smolensky, 1986[48]) to model each new layer of higher-level features. Each new layer guarantees an increase of the lower bound on the log likelihood of the data, thus improving the model if trained properly. Once sufficiently many layers have been learned, the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top-level feature activations.[49] Hinton reports that his models are effective feature extractors over high-dimensional, structured data.[50]

The Google Brain team led by Andrew Ng and Jeff Dean created a neural network that learned to recognize higher-level concepts, such as cats, only from watching unlabeled images taken from YouTube videos.[51][52]

Other methods rely on the sheer processing power of modern computers, in particular GPUs. In 2010 it was shown by Dan Ciresan and colleagues[38] in Jürgen Schmidhuber's group at the Swiss AI Lab IDSIA that, despite the above-mentioned vanishing gradient problem, the superior processing power of GPUs makes plain back-propagation feasible for deep feedforward neural networks with many layers. The method outperformed all other machine learning techniques on the old, famous MNIST handwritten digits problem of Yann LeCun and colleagues at NYU.

At about the same time, in late 2009, deep learning made inroads into speech recognition, as marked by the NIPS Workshop on Deep Learning for Speech Recognition. Intensive collaborative work between Microsoft Research and University of Toronto researchers had demonstrated by mid-2010 in Redmond that deep neural networks interfaced with a hidden Markov model, with context-dependent states that define the neural network output layer, can drastically reduce errors in large-vocabulary speech recognition tasks such as voice search. The same deep neural net model was shown to scale up to Switchboard tasks about one year later at Microsoft Research Asia.

As of 2011, the state of the art in deep learning feedforward networks alternates convolutional layers and max-pooling layers,[53][54] topped by several pure classification layers. Training is usually done without any unsupervised pre-training. Since 2011, GPU-based implementations[53] of this approach have won many pattern recognition contests, including the IJCNN 2011 Traffic Sign Recognition Competition,[55] the ISBI 2012 Segmentation of Neuronal Structures in EM Stacks challenge,[56] and others.

Such supervised deep learning methods were also the first artificial pattern recognizers to achieve human-competitive performance on certain tasks.[57]

To break the barriers of weak AI represented by deep learning, it is necessary to go beyond deep learning architectures, because biological brains use both shallow and deep circuits, as reported by brain anatomy,[58] in order to deal with the wide variety of invariance that the brain displays. Weng[59] argued that the brain self-wires largely according to signal statistics and that, therefore, a serial cascade cannot catch all major statistical dependencies. Fully guaranteed shift invariance for ANNs dealing with small and large natural objects in large cluttered scenes became true when the invariance went beyond shift, extending to all ANN-learned concepts, such as location, type (object class label), scale, and lighting, in the Developmental Networks (DNs)[60] whose embodiments are the Where-What Networks, WWN-1 (2008)[61] through WWN-7 (2013).[62]
62.4 Deep learning architectures

There are a huge number of different variants of deep architectures; however, most of them branch from some original parent architectures. It is not always possible to compare the performance of multiple architectures all together, since they are not all implemented on the same data set. Deep learning is a fast-growing field, so new architectures, variants, or algorithms may appear every few weeks.

62.4.1 Deep neural networks

A deep neural network (DNN) is an artificial neural network with multiple hidden layers of units between the input and output layers.[2][4] Similar to shallow ANNs, DNNs can model complex non-linear relationships. DNN architectures, e.g. for object detection and parsing, generate compositional models where the object is expressed as a layered composition of image primitives.[63] The extra layers enable composition of features from lower layers, giving the potential of modeling complex data with fewer units than a similarly performing shallow network.[2]

DNNs are typically designed as feedforward networks, but recent research has successfully applied the deep learning architecture to recurrent neural networks for applications such as language modeling.[64] Convolutional deep neural networks (CNNs) are used in computer vision, where their success is well documented.[65] More recently, CNNs have been applied to acoustic modeling for automatic speech recognition (ASR), where they have shown success over previous models.[34] For simplicity, a look at training DNNs is given here.

A DNN can be discriminatively trained with the standard backpropagation algorithm. The weight updates can be done via stochastic gradient descent using the following equation:

w_ij(t+1) = w_ij(t) − η ∂C/∂w_ij

Here, η is the learning rate and C is the cost function. The choice of the cost function depends on factors such as the learning type (supervised, unsupervised, reinforcement, etc.) and the activation function. For example, when performing supervised learning on a multiclass classification problem, common choices for the activation function and cost function are the softmax function and the cross-entropy function, respectively. The softmax function is defined as

p_j = exp(x_j) / Σ_k exp(x_k)

where p_j represents the class probability and x_j and x_k represent the total inputs to units j and k respectively. Cross entropy is defined as

C = − Σ_j d_j log(p_j)

where d_j represents the target probability for output unit j and p_j is the probability output for j after applying the activation function.[66]
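To make the above concrete, the following is a minimal numpy sketch of one stochastic-gradient step for a single linear layer with a softmax output and cross-entropy cost; the layer size, data, and learning rate are illustrative assumptions, and a real DNN would backpropagate through several such layers:

    import numpy as np

    def softmax(x):
        # p_j = exp(x_j) / sum_k exp(x_k), with the max subtracted for numerical stability
        e = np.exp(x - np.max(x))
        return e / e.sum()

    def cross_entropy(p, d):
        # C = - sum_j d_j * log(p_j); small epsilon avoids log(0)
        return -np.sum(d * np.log(p + 1e-12))

    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(10, 784))   # weights: 10 classes, 784 inputs
    x = rng.random(784)                          # one (synthetic) input vector
    d = np.zeros(10); d[3] = 1.0                 # one-hot target probabilities
    eta = 0.1                                    # learning rate

    p = softmax(W @ x)                           # class probabilities p_j
    grad_W = np.outer(p - d, x)                  # dC/dW; for softmax + cross-entropy, dC/dx = p - d
    W = W - eta * grad_W                         # w_ij(t+1) = w_ij(t) - eta * dC/dw_ij
    print(cross_entropy(softmax(W @ x), d))      # the loss should decrease after the step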
62.4.2 Issues with deep neural networks

As with ANNs, many issues can arise with DNNs if they are naively trained. Two common issues are overfitting and computation time.

DNNs are prone to overfitting because of the added layers of abstraction, which allow them to model rare dependencies in the training data. Regularization methods such as weight decay (ℓ2-regularization) or sparsity (ℓ1-regularization) can be applied during training to help combat overfitting.[67] A more recent regularization method applied to DNNs is dropout regularization. In dropout, some number of units are randomly omitted from the hidden layers during training. This helps to break the rare dependencies that can occur in the training data.[68]
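A minimal sketch of dropout applied to one layer of hidden activations is shown below; it uses the common "inverted" scaling so that the expected activation is unchanged, which is one of several equivalent conventions and an assumption rather than something specified above:

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout_forward(h, p_drop=0.5, training=True):
        """Randomly omit hidden units during training (inverted-dropout scaling)."""
        if not training:
            return h                              # at test time all units are kept
        mask = (rng.random(h.shape) >= p_drop)    # drop each unit with probability p_drop
        return h * mask / (1.0 - p_drop)          # rescale so the expected activation is unchanged

    h = np.array([0.2, 1.5, 0.7, 3.0, 0.9])       # activations of one hidden layer
    print(dropout_forward(h))                     # a random subset is zeroed for this pass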
A DBN can be eciently trained in an unsupervised,
As with ANNs, many issues can arise with DNNs if they layer-by-layer manner where the layers are typically made
are naively trained. Two common issues are overtting of restricted Boltzmann machines (RBM). A description
and computation time. of training a DBN via RBMs is provided below. An RBM
network. They provide a generic structure that can be used in many image and signal processing tasks and can be trained in a way similar to that for deep belief networks. Recently, many benchmark results on standard image datasets like CIFAR[78] have been obtained using CDBNs.[79]

62.4.6 Deep Boltzmann Machines

A Deep Boltzmann Machine (DBM) is a type of binary pairwise Markov random field (an undirected probabilistic graphical model) with multiple layers of hidden random variables. It is a network of symmetrically coupled stochastic binary units. It comprises a set of visible units ν ∈ {0, 1}^D and a series of layers of hidden units h^(1) ∈ {0, 1}^F1, h^(2) ∈ {0, 1}^F2, ..., h^(L) ∈ {0, 1}^FL. There are no connections between units of the same layer (as in an RBM). For the DBM, the probability assigned to a vector ν can be written as:

p(ν) = (1/Z) Σ_h exp( Σ_ij W_ij^(1) ν_i h_j^(1) + Σ_jl W_jl^(2) h_j^(1) h_l^(2) + Σ_lm W_lm^(3) h_l^(2) h_m^(3) )

where h = {h^(1), h^(2), h^(3)} are the sets of hidden units and θ = {W^(1), W^(2), W^(3)} are the model parameters, representing visible-hidden and hidden-hidden symmetric interactions, since they are undirected links. Setting W^(2) = 0 and W^(3) = 0 recovers the well-known restricted Boltzmann machine.[80]

There are several reasons to take advantage of deep Boltzmann machine architectures. Like DBNs, they benefit from the ability to learn complex and abstract internal representations of the input in tasks such as object or speech recognition, using a limited amount of labeled data to fine-tune representations built from a large supply of unlabeled sensory input data. However, unlike DBNs and deep convolutional neural networks, they adopt an inference and training procedure in both directions, with bottom-up and top-down passes, which enables DBMs to better unveil the representations of ambiguous and complex input structures.[81][82]

Since exact maximum likelihood learning is intractable for DBMs, approximate maximum likelihood learning may be performed. Another possibility is to use mean-field inference to estimate data-dependent expectations, in combination with a Markov chain Monte Carlo (MCMC) based stochastic approximation technique to approximate the expected sufficient statistics of the model.[80]

The difference between DBNs and DBMs can be seen as follows: in DBNs, the top two layers form a restricted Boltzmann machine, which is an undirected graphical model, but the lower layers form a directed generative model.

Apart from all the advantages of DBMs discussed so far, they have a crucial disadvantage that limits the performance and functionality of this kind of architecture. The approximate inference, which is based on the mean-field method, is about 25 to 50 times slower than a single bottom-up pass in DBNs. This time-consuming step makes the joint optimization quite impractical for large data sets, and seriously restricts the use of DBMs in tasks such as feature representation (the mean-field inference has to be performed for each new test input).[83]

62.4.7 Stacked (Denoising) Auto-Encoders

The auto-encoder idea is motivated by the concept of a good representation. For instance, for a classifier, a good representation can be defined as one that yields a better-performing classifier.

An encoder is a deterministic mapping f_θ that transforms an input vector x into a hidden representation y, where θ = {W, b}, W is the weight matrix and b is an offset vector (bias). Conversely, a decoder maps the hidden representation y back to the reconstructed input z via g_θ′. The whole process of auto-encoding is to compare this reconstructed input to the original and to minimize the error, making the reconstructed value as close as possible to the original.

In stacked denoising auto-encoders, the partially corrupted output is cleaned (denoised). This idea was introduced in [84] with a specific approach to good representation: a good representation is one that can be obtained robustly from a corrupted input and that will be useful for recovering the corresponding clean input. Implicit in this definition are the following ideas: the higher-level representations are relatively stable and robust to corruption of the input; and it is necessary to extract features that are useful for representing the input distribution.

The algorithm consists of multiple steps. It starts with a stochastic mapping of x to x̃ through q_D(x̃|x); this is the corrupting step. The corrupted input x̃ then passes through a basic auto-encoder process and is mapped to a hidden representation y = f_θ(x̃) = s(W x̃ + b). From this hidden representation we can reconstruct z = g_θ′(y). In the last stage, a minimization algorithm is run in order to make z as close as possible to the uncorrupted input x. The reconstruction error L_H(x, z) might be either the cross-entropy loss with an affine-sigmoid decoder, or the squared error loss with an affine decoder.[84]

In order to make a deep architecture, auto-encoders are stacked one on top of another. Once the encoding function f_θ of the first denoising auto-encoder has been learned and used to reconstruct the corrupted input, we can train the second level.[84]
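A minimal numpy sketch of one such denoising auto-encoder layer, assuming masking-noise corruption, a sigmoid encoder, an affine decoder, and the squared-error loss (all of these choices, and the sizes, are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    # one denoising auto-encoder layer: 20 inputs, 8 hidden units (sizes are illustrative)
    n_in, n_hid = 20, 8
    W,  b  = rng.normal(scale=0.1, size=(n_hid, n_in)), np.zeros(n_hid)   # encoder f_theta
    W2, b2 = rng.normal(scale=0.1, size=(n_in, n_hid)), np.zeros(n_in)    # decoder g_theta'

    def train_step(x, corruption=0.3, lr=0.05):
        global W, b, W2, b2
        # corrupting step q_D(x_tilde | x): randomly zero a fraction of the inputs
        x_tilde = x * (rng.random(n_in) >= corruption)
        y = sigmoid(W @ x_tilde + b)          # hidden representation y = s(W x_tilde + b)
        z = W2 @ y + b2                       # reconstruction z, affine decoder
        dz = 2.0 * (z - x)                    # gradient of the squared error ||x - z||^2
        dW2, db2 = np.outer(dz, y), dz
        dy = W2.T @ dz
        dpre = dy * y * (1.0 - y)             # back through the sigmoid
        dW, db = np.outer(dpre, x_tilde), dpre
        W, b, W2, b2 = W - lr * dW, b - lr * db, W2 - lr * dW2, b2 - lr * db2
        return np.sum((z - x) ** 2)

    x = rng.random(n_in)                      # one synthetic input vector
    for step in range(500):
        err = train_step(x)
    print(err)                                # reconstruction error after training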
Once the stacked auto-encoder is trained, its output can be used as the input to a supervised learning algorithm such as a support vector machine classifier or a multiclass logistic regression.[84]

62.4.9 Tensor Deep Stacking Networks (T-DSN)
deep networks. The features learned by deep architectures such as DBNs,[96] DBMs,[81] deep auto-encoders,[97] convolutional variants,[98][99] ssRBMs,[95] deep coding networks,[100] DBNs with sparse feature learning,[101] recursive neural networks,[102] conditional DBNs,[103] and denoising auto-encoders[104] are able to provide better representations for more rapid and accurate classification tasks with high-dimensional training data sets. However, these architectures are not, by themselves, very powerful at learning novel classes from few examples. In these architectures, all units through the network are involved in the representation of the input (distributed representations), and they have to be adjusted together (a high degree of freedom). However, if we limit the degrees of freedom, we make it easier for the model to learn new classes from few training samples (fewer parameters to learn). Hierarchical Bayesian (HB) models provide learning from few examples, for example[105][106][107][108][109] in computer vision, statistics, and cognitive science.

Compound HD architectures try to integrate the characteristics of both HB and deep networks. The compound HDP-DBM architecture combines a hierarchical Dirichlet process (HDP), as a hierarchical model, with a DBM architecture. It is a full generative model, generalized from abstract concepts flowing through the layers of the model, which is able to synthesize new examples in novel classes that look reasonably natural. Note that all the levels are learned jointly by maximizing a joint log-probability score.[110]

Consider a DBM with three hidden layers; the probability of a visible input ν is:

p(ν, ψ) = (1/Z) Σ_h exp( Σ_ij W_ij^(1) ν_i h_j^(1) + Σ_jl W_jl^(2) h_j^(1) h_l^(2) + Σ_lm W_lm^(3) h_l^(2) h_m^(3) )

where h = {h^(1), h^(2), h^(3)} are the sets of hidden units and ψ = {W^(1), W^(2), W^(3)} are the model parameters, representing visible-hidden and hidden-hidden symmetric interaction terms.

After a DBM model has been learned, we have an undirected model that defines the joint distribution P(ν, h^(1), h^(2), h^(3)). One way to express what has been learned is the conditional model P(ν, h^(1), h^(2) | h^(3)) and a prior term P(h^(3)).

The part P(ν, h^(1), h^(2) | h^(3)) represents a conditional DBM model, which can be viewed as a two-layer DBM but with bias terms given by the states of h^(3):

P(ν, h^(1), h^(2) | h^(3)) = (1/Z(ψ, h^(3))) exp( Σ_ij W_ij^(1) ν_i h_j^(1) + Σ_jl W_jl^(2) h_j^(1) h_l^(2) + Σ_lm W_lm^(3) h_l^(2) h_m^(3) )

62.4.12 Deep Coding Networks

There are several advantages to having a model that can actively update itself to the context in the data. One of these methods arises from the idea of having a model that is able to adjust its prior knowledge dynamically according to the context of the data. The deep coding network (DPCN) is a predictive coding scheme where top-down information is used to empirically adjust the priors needed for the bottom-up inference procedure, by means of a deep, locally connected generative model. This is based on extracting sparse features from time-varying observations using a linear dynamical model. A pooling strategy is then employed in order to learn invariant feature representations. Similar to other deep architectures, these blocks are the building elements of a deeper architecture, where greedy layer-wise unsupervised learning is used. Note that the layers constitute a kind of Markov chain, such that the states at any layer depend only on the succeeding and preceding layers.

The deep predictive coding network (DPCN)[111] predicts the representation of a layer by means of a top-down approach, using the information in the upper layer and also temporal dependencies from the previous states.

It is also possible to extend the DPCN to form a convolutional network.[111]

62.4.13 Multilayer Kernel Machine

The Multilayer Kernel Machine (MKM), as introduced in,[112] is a way of learning highly nonlinear functions by the iterative application of weakly nonlinear kernels. It uses kernel principal component analysis (KPCA)[113] as the method for the unsupervised greedy layer-wise pre-training step of the deep learning architecture.

Layer l+1 learns the representation of the previous layer l, extracting the n_l principal components (PC) of the projection of layer l's output in the feature domain induced by the kernel. For the sake of dimensionality reduction of the updated representation in each layer, a supervised strategy is proposed to select the most informative features among the ones extracted by KPCA; the process, sketched below, is:

1. rank the n_l features according to their mutual information with the class labels;
2. for different values of K and m_l ∈ {1, ..., n_l}, compute the classification error rate of a K-nearest neighbor (K-NN) classifier using only the m_l most informative features on a validation set;
3. the value of m_l with which the classifier has reached the lowest error rate determines the number of features to retain.
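A small sketch of this supervised feature-selection step, using scikit-learn's mutual-information estimator and K-NN classifier on random stand-in features (the KPCA step itself and all data here are illustrative assumptions):

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif
    from sklearn.neighbors import KNeighborsClassifier

    # X_train / X_val stand in for the n_l KPCA features of the current layer's output
    rng = np.random.default_rng(0)
    n_l = 30
    X_train, y_train = rng.random((200, n_l)), rng.integers(0, 3, 200)
    X_val,   y_val   = rng.random((80,  n_l)), rng.integers(0, 3, 80)

    # 1) rank the n_l features by mutual information with the class labels
    order = np.argsort(mutual_info_classif(X_train, y_train))[::-1]

    # 2) for different K and m_l, measure the validation error of a K-NN classifier
    #    that uses only the m_l most informative features
    best = (1.0, None, None)                       # (error rate, m_l, K)
    for K in (1, 3, 5):
        for m_l in range(1, n_l + 1):
            cols = order[:m_l]
            knn = KNeighborsClassifier(n_neighbors=K).fit(X_train[:, cols], y_train)
            err = 1.0 - knn.score(X_val[:, cols], y_val)
            if err < best[0]:
                best = (err, m_l, K)

    # 3) the m_l with the lowest validation error decides how many features to retain
    print("retain", best[1], "features (K =", best[2], ", error =", best[0], ")")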
There are some drawbacks in using the KPCA method as the building cells of an MKM.

Another, more straightforward, method of integrating kernel machines into the deep learning architecture was developed by Microsoft researchers for spoken language understanding applications.[114] The main idea is to use a kernel machine to approximate a shallow neural net with an infinite number of hidden units, and then to use stacking to splice the output of the kernel machine and the raw input in building the next, higher level of the kernel machine. The number of levels in this kernel version of the deep convex network is a hyper-parameter of the overall system, determined by cross validation.

62.4.14 Deep Q-Networks

This is the latest class of deep learning models targeted at reinforcement learning, published in February 2015 in Nature.[115] The application discussed in this paper is limited to ATARI gaming, but the implications for other potential applications are profound.

62.4.15 Memory networks

Neural Turing Machines, developed by Google DeepMind, couple LSTM networks to external memory resources, with which they can interact by attentional processes. The combined system is analogous to a Turing Machine but is differentiable end-to-end, allowing it to be efficiently trained with gradient descent. Preliminary results demonstrate that Neural Turing Machines can infer simple algorithms such as copying, sorting, and associative recall from input and output examples.

Memory Networks[126] are another extension to neural networks incorporating long-term memory, developed by Facebook research. The long-term memory can be read and written to, with the goal of using it for prediction. These models have been applied in the context of question answering (QA), where the long-term memory effectively acts as a (dynamic) knowledge base and the output is a textual response.
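As a rough illustration of what differentiable access to an external memory means, the following numpy sketch performs a content-based read as a softmax-weighted sum over memory slots; it uses dot-product similarity for brevity, whereas the published Neural Turing Machine uses cosine similarity with a sharpening parameter, and the sizes here are illustrative:

    import numpy as np

    def softmax(a):
        e = np.exp(a - np.max(a))
        return e / e.sum()

    # a differentiable content-based read over N memory slots of width M
    rng = np.random.default_rng(0)
    N, M = 8, 5
    memory = rng.random((N, M))          # the external memory matrix
    key = rng.random(M)                  # a key emitted by the controller network

    weights = softmax(memory @ key)      # soft attention over slots (dot-product similarity)
    read_vector = weights @ memory       # weighted sum of slots, differentiable throughout
    print(read_vector)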
62.5 Applications

62.5.1 Automatic speech recognition

One fundamental principle of deep learning is to do away with hand-crafted feature engineering and to use raw features. This principle was first explored successfully in the architecture of a deep autoencoder on the "raw" spectrogram or linear filter-bank features,[133] showing its superiority over the Mel-Cepstral features, which contain a few stages of fixed transformation from spectrograms. The true "raw" features of speech, waveforms, have more recently been shown to produce excellent larger-scale speech recognition results.[134]

Since the initial successful debut of DNNs for speech recognition around 2009-2011, there has been huge progress. This progress (as well as future directions) has been summarized into the following eight major areas:[1][27][37] 1) scaling up/out and speedup of DNN training and decoding; 2) sequence-discriminative training of DNNs; 3) feature processing by deep models with solid understanding of the underlying mechanisms; 4) adaptation of DNNs and of related deep models; 5) multi-task and transfer learning by DNNs and related deep models; 6) convolutional neural networks and how to design them to best exploit domain knowledge of speech; 7) recurrent neural networks and their rich LSTM variants; 8) other types of deep models, including tensor-based models and integrated deep generative/discriminative models.

Large-scale automatic speech recognition is the first and most convincing successful case of deep learning in recent history, embraced by both industry and academia across the board. Between 2010 and 2014, the two major conferences on signal processing and speech recognition, IEEE-ICASSP and Interspeech, saw near-exponential growth in the number of accepted papers on the topic of deep learning for speech recognition. More importantly, all major commercial speech recognition systems (e.g., Microsoft Cortana, Xbox, Skype Translator, Google Now, Apple Siri, Baidu and iFlyTek voice search, and a range of Nuance speech products, etc.) are nowadays based on deep learning methods.[1][135][136] See also the recent media interview with the CTO of Nuance Communications.[137]

The wide-spreading success in speech recognition achieved by 2011 was followed shortly by large-scale image recognition, described next.

62.5.2 Image recognition

A common evaluation set for image classification is the MNIST data set. MNIST is composed of handwritten digits and includes 60,000 training examples and 10,000 test examples. Similar to TIMIT, its small size allows multiple configurations to be tested. A comprehensive list of results on this set can be found in.[138] The current best result on MNIST is an error rate of 0.23%, achieved by Ciresan et al. in 2012.[139]

The real impact of deep learning in image or object recognition, one major branch of computer vision, was felt in the fall of 2012 after the team of Geoff Hinton and his students won the large-scale ImageNet competition by a significant margin over the then-state-of-the-art shallow machine learning methods. The technology is based on 20-year-old deep convolutional nets, but at much larger scale on a much larger task, since it had been learned that deep learning works quite well on large-scale speech recognition. In 2013 and 2014, the error rate on the ImageNet task using deep learning was further reduced at a rapid pace, following a similar trend in large-scale speech recognition.

As with the ambitious moves from automatic speech recognition toward automatic speech translation and understanding, image classification has recently been extended to the more ambitious and challenging task of automatic image captioning, in which deep learning is the essential underlying technology.[140][141][142][143]

One example application is a car computer said to be trained with deep learning, which may be able to let cars interpret 360° camera views.[144]

62.5.3 Natural language processing

Neural networks have been used for implementing language models since the early 2000s.[145] Key techniques in this field are negative sampling[146] and word embedding. A word embedding, such as word2vec, can be thought of as a representational layer in a deep learning architecture that transforms an atomic word into a positional representation of the word relative to other words in the dataset; the position is represented as a point in a vector space. Using a word embedding as an input layer to a recursive neural network (RNN) allows the network to be trained to parse sentences and phrases using an effective compositional vector grammar. A compositional vector grammar can be thought of as a probabilistic context-free grammar (PCFG) implemented by a recursive neural network.[147] Recursive autoencoders built atop word embeddings have been trained to assess sentence similarity and detect paraphrasing.[147] Deep neural architectures have achieved state-of-the-art results in many tasks in natural language processing, such as constituency parsing,[148] sentiment analysis,[149] information retrieval,[150][151] machine translation,[152][153] contextual entity linking,[154] and other areas of NLP.[155]
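To illustrate the word-embedding and negative-sampling techniques mentioned above, the following is a self-contained numpy sketch of skip-gram training with negative sampling on a toy corpus; the corpus, dimensions, and update schedule are illustrative assumptions, not a description of word2vec's actual implementation:

    import numpy as np

    rng = np.random.default_rng(0)

    corpus = "the cat sat on the mat the dog sat on the rug".split()
    vocab = sorted(set(corpus))
    idx = {w: i for i, w in enumerate(vocab)}
    V, dim = len(vocab), 16

    W_in  = rng.normal(scale=0.1, size=(V, dim))   # embedding of the centre word
    W_out = rng.normal(scale=0.1, size=(V, dim))   # embedding used for context prediction

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def sgns_step(center, context, k_neg=3, lr=0.05):
        """One skip-gram step with negative sampling for a (centre, context) pair."""
        # one positive target plus k random negatives (a negative may occasionally
        # coincide with a true context word; ignored here for simplicity)
        targets = [context] + list(rng.integers(0, V, size=k_neg))
        labels = [1.0] + [0.0] * k_neg
        v = W_in[center]
        grad_v = np.zeros(dim)
        for t, label in zip(targets, labels):
            g = sigmoid(v @ W_out[t]) - label      # gradient of the logistic loss
            grad_v += g * W_out[t]
            W_out[t] -= lr * g * v
        W_in[center] -= lr * grad_v

    # train on all (centre word, neighbouring word) pairs within a window of 2
    for epoch in range(200):
        for i, w in enumerate(corpus):
            for j in range(max(0, i - 2), min(len(corpus), i + 3)):
                if j != i:
                    sgns_step(idx[w], idx[corpus[j]])

    def most_similar(word):
        v = W_in[idx[word]]
        sims = W_in @ v / (np.linalg.norm(W_in, axis=1) * np.linalg.norm(v) + 1e-12)
        return [vocab[i] for i in np.argsort(sims)[::-1][1:4]]

    print(most_similar("cat"))   # nearby words in the learned vector space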
62.5.4 Drug discovery and toxicology

The pharmaceutical industry faces the problem that a large percentage of candidate drugs fail to reach the market. These failures of chemical compounds are caused by insufficient efficacy on the biomolecular target (on-target effect), undetected and undesired interactions with other biomolecules (off-target effects), or unanticipated toxic effects.[156][157] In 2012 a team led by George Dahl won the Merck Molecular Activity Challenge using multi-task deep neural networks to predict the biomolecular target of a compound.[158][159] In 2014 Sepp Hochreiter's group used deep learning to detect off-target and toxic effects of environmental chemicals in nutrients, household products, and drugs, and won the Tox21 Data Challenge of NIH, FDA, and NCATS.[160][161] These impressive successes show that deep learning may be superior to other virtual screening methods.[162][163] Researchers from Google and Stanford enhanced deep learning for drug discovery by combining data from a variety of sources.[164]

62.5.5 Customer relationship management

Recently, success has been reported with the application of deep reinforcement learning in direct marketing settings, illustrating the suitability of the method for CRM automation. A neural network was used to approximate the value of possible direct marketing actions over the customer state space, defined in terms of RFM variables. The estimated value function was shown to have a natural interpretation as CLV (customer lifetime value).[165]

62.6 Deep learning in the human brain

Computational deep learning is closely related to a class of theories of brain development (specifically, neocortical development) proposed by cognitive neuroscientists in the early 1990s.[166] An approachable summary of this work is Elman et al.'s 1996 book Rethinking Innateness[167] (see also: Shrager and Johnson;[168] Quartz and Sejnowski[169]). As these developmental theories were also instantiated in computational models, they are technical predecessors of purely computationally motivated deep learning models. These developmental models share the interesting property that various proposed learning dynamics in the brain (e.g., a wave of nerve growth factor) conspire to support the self-organization of just the sort of inter-related neural networks utilized in the later, purely computational deep learning models; and such computational neural networks seem analogous to a view of the brain's neocortex as a hierarchy of filters in which each layer captures some of the information in the operating environment, and then passes the remainder, as well as a modified base signal, to other layers further up the hierarchy. This process yields a self-organizing stack of transducers, well tuned to their operating environment. As described in The New York Times in 1995: "...the infant's brain seems to organize itself under the influence of waves of so-called trophic-factors ... different regions of the brain become connected sequentially, with one layer of tissue maturing before another and so on until the whole brain is mature."[170]

The importance of deep learning with respect to the evolution and development of human cognition did not escape the attention of these researchers. One aspect of human development that distinguishes us from our nearest primate neighbors may be changes in the timing of development.[171] Among primates, the human brain remains relatively plastic until late in the post-natal period, whereas the brains of our closest relatives are more completely formed by birth. Thus, humans have greater access to the complex experiences afforded by being out in the world during the most formative period of brain development. This may enable us to "tune in" to rapidly changing features of the environment that other animals, more constrained by the evolutionary structuring of their brains, are unable to take account of. To the extent that these changes are reflected in similar timing changes in the hypothesized wave of cortical development, they may also lead to changes in the extraction of information from the stimulus environment during the early self-organization of the brain. Of course, along with this flexibility comes an extended period of immaturity, during which we are dependent upon our caretakers and our community for both support and training. The theory of deep learning therefore sees the coevolution of culture and cognition as a fundamental condition of human evolution.[172]

62.7 Commercial activities

Deep learning is often presented as a step towards realising strong AI,[173] and thus many organizations have become interested in its use for particular applications. Most recently, in December 2013, Facebook announced that it hired Yann LeCun to head its new artificial intelligence (AI) lab, which will have operations in California, London, and New York. The AI lab will be used for developing deep learning techniques that will help Facebook do tasks such as automatically tagging uploaded pictures with the names of the people in them.[174]

In March 2013, Geoffrey Hinton and two of his graduate students, Alex Krizhevsky and Ilya Sutskever, were hired by Google. Their work will focus both on improving existing machine learning products at Google and on helping to deal with the growing amount of data Google has. Google also purchased Hinton's company, DNNresearch.

In 2014 Google also acquired DeepMind Technologies, a British start-up that developed a system capable of learning how to play Atari video games using only raw pixels as data input.

Also in 2014, Microsoft established the Deep Learning Technology Center in its MSR division, amassing deep learning experts for application-focused activities.
And Baidu hired Andrew Ng to head their new Silicon Valley-based research lab focusing on deep learning.

62.8 Criticism and comment

Given the far-reaching implications of artificial intelligence, coupled with the realization that deep learning is emerging as one of its most powerful techniques, the subject is understandably attracting both criticism and comment, in some cases from outside the field of computer science itself.

A main criticism of deep learning concerns the lack of theory surrounding many of the methods. Most of the learning in deep architectures is just some form of gradient descent. While gradient descent has been understood for a while now, the theory surrounding other algorithms, such as contrastive divergence, is less clear (i.e., does it converge? If so, how fast? What is it approximating?). Deep learning methods are often looked at as a black box, with most confirmations done empirically rather than theoretically.

Others point out that deep learning should be looked at as a step towards realizing strong AI, not as an all-encompassing solution. Despite the power of deep learning methods, they still lack much of the functionality needed for realizing this goal entirely. Research psychologist Gary Marcus has noted that:

"Realistically, deep learning is only part of the larger challenge of building intelligent machines. Such techniques lack ways of representing causal relationships (...) have no obvious ways of performing logical inferences, and they are also still a long way from integrating abstract knowledge, such as information about what objects are, what they are for, and how they are typically used. The most powerful A.I. systems, like Watson (...) use techniques like deep learning as just one element in a very complicated ensemble of techniques, ranging from the statistical technique of Bayesian inference to deductive reasoning."[175]

To the extent that such a viewpoint implies, without intending to, that deep learning will ultimately constitute nothing more than the primitive discriminatory levels of a comprehensive future machine intelligence, a recent pair of speculations regarding art and artificial intelligence[176] offers an alternative and more expansive outlook. The first such speculation is that it might be possible to train a machine vision stack to perform the sophisticated task of discriminating between "old master" and amateur figure drawings; the second is that such a sensitivity might in fact represent the rudiments of a non-trivial machine empathy. It is suggested, moreover, that such an eventuality would be in line with both anthropology, which identifies a concern with aesthetics as a key element of behavioral modernity, and with a current school of thought which suspects that the allied phenomenon of consciousness, formerly thought of as a purely high-order phenomenon, may in fact have roots deep within the structure of the universe itself.

In further reference to the idea that a significant degree of artistic sensitivity might inhere within relatively low levels, whether biological or digital, of the cognitive hierarchy, there has recently been published a series of graphic representations of the internal states of deep (20-30 layer) neural networks attempting to discern, within essentially random data, the images on which they have been trained,[177] and these show a striking degree of what can only be described as visual creativity. This work, moreover, has captured a remarkable level of public attention, with the original research notice receiving well in excess of one thousand comments, and The Guardian coverage[178] achieving the status of most frequently accessed article on that newspaper's web site.

Some currently popular and successful deep learning architectures display certain problematic behaviors[179] (e.g., confidently classifying random data as belonging to a familiar category of nonrandom images,[180] and misclassifying minuscule perturbations of correctly classified images[181]). The creator of OpenCog, Ben Goertzel, hypothesized[179] that these behaviors are tied to limitations in the internal representations learned by these architectures, and that these same limitations would inhibit integration of these architectures into heterogeneous multi-component AGI architectures. It is suggested that these issues can be worked around by developing deep learning architectures that internally form states homologous to image-grammar[182] decompositions of observed entities and events.[179] Learning a grammar (visual or linguistic) from training data would be equivalent to restricting the system to commonsense reasoning which operates on concepts in terms of the production rules of the grammar, and is a basic goal of both human language acquisition[183] and A.I. (See also Grammar induction.[184])

62.9 Deep learning software libraries

Torch - An open source software library for machine learning based on the Lua programming language.
Theano - An open source machine learning library for Python.
H2O.ai - An open source machine learning platform written in Java with a parallel architecture.
Deeplearning4j - An open source deep learning library written for Java. It provides parallelization with CPUs and GPUs.
OpenNN - An open source C++ library which implements deep neural networks and provides parallelization with CPUs.
NVIDIA cuDNN - A GPU-accelerated library of primitives for deep neural networks.
DeepLearnToolbox - A Matlab/Octave toolbox for deep learning.
convnetjs - A Javascript library for training deep learning models. It contains online demos.
Gensim - A toolkit for natural language processing implemented in the Python programming language.
Caffe - A deep learning framework.
Apache SINGA[185] - A deep learning platform developed for scalability, usability and extensibility.

62.10 See also

Sparse coding
Compressed Sensing
Connectionism
Self-organizing map

62.11 References

[1] L. Deng and D. Yu (2014). Deep Learning: Methods and Applications. http://research.microsoft.com/pubs/209355/DeepLearning-NowPublishing-Vol7-SIG-039.pdf
[2] Bengio, Yoshua (2009). "Learning Deep Architectures for AI" (PDF). Foundations and Trends in Machine Learning 2 (1).
[3] Y. Bengio, A. Courville, and P. Vincent. "Representation Learning: A Review and New Perspectives." IEEE Trans. PAMI, special issue Learning Deep Architectures, 2013.
[4] J. Schmidhuber. "Deep Learning in Neural Networks: An Overview." http://arxiv.org/abs/1404.7828, 2014.
[5] Patrick Glauner (2015). Comparison of Training Methods for Deep Neural Networks. arXiv:1504.06825.
[6] Song, Hyun Ah, and Soo-Young Lee. "Hierarchical Representation Using NMF." Neural Information Processing. Springer Berlin Heidelberg, 2013.
[7] Olshausen, Bruno A. "Emergence of simple-cell receptive field properties by learning a sparse code for natural images." Nature 381.6583 (1996): 607-609.
[8] Ronan Collobert (May 6, 2011). "Deep Learning for Efficient Discriminative Parsing." videolectures.net. Ca. 7:45.
[9] Gomes, Lee (20 October 2014). "Machine-Learning Maestro Michael Jordan on the Delusions of Big Data and Other Huge Engineering Efforts." IEEE Spectrum.
[10] Fukushima, K. (1980). "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position." Biol. Cybern 36: 193-202. doi:10.1007/bf00344251.
[11] P. Werbos. "Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences." PhD thesis, Harvard University, 1974.
[12] LeCun et al. "Backpropagation Applied to Handwritten Zip Code Recognition." Neural Computation, 1, pp. 541-551, 1989.
[13] S. Hochreiter. "Untersuchungen zu dynamischen neuronalen Netzen." Diploma thesis, Institut f. Informatik, Technische Univ. Munich. Advisor: J. Schmidhuber, 1991.
[17] J. Weng, N. Ahuja and T. S. Huang. "Learning recognition and segmentation using the Cresceptron." International Journal of Computer Vision, vol. 25, no. 2, pp. 105-139, Nov. 1997.
[18] Morgan, Bourlard, Renals, Cohen, Franco (1993). "Hybrid neural network/hidden Markov model systems for continuous speech recognition." ICASSP/IJPRAI.
[19] T. Robinson (1992). "A real-time recurrent error propagation network word recognition system." ICASSP.
[20] Waibel, Hanazawa, Hinton, Shikano, Lang (1989). "Phoneme recognition using time-delay neural networks." IEEE Transactions on Acoustics, Speech and Signal Processing.
[21] Baker, J.; Deng, Li; Glass, Jim; Khudanpur, S.; Lee, C.-H.; Morgan, N.; O'Shaughnessy, D. (2009). "Research Developments and Directions in Speech Recognition and Understanding, Part 1." IEEE Signal Processing Magazine 26 (3): 75-80. doi:10.1109/msp.2009.932166.
[22] Y. Bengio (1991). "Artificial Neural Networks and their Application to Speech/Sequence Recognition." Ph.D. thesis, McGill University, Canada.
[23] Deng, L.; Hassanein, K.; Elmasry, M. (1994). "Analysis of correlation structure for a neural predictive model with applications to speech recognition." Neural Networks 7 (2): 331-339. doi:10.1016/0893-6080(94)90027-2.
[24] Hinton, G.; Deng, L.; Yu, D.; Dahl, G.; Mohamed, A.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T.; Kingsbury, B. (2012). "Deep Neural Networks for Acoustic Modeling in Speech Recognition --- The shared views of four research groups." IEEE Signal Processing Magazine 29 (6): 82-97. doi:10.1109/msp.2012.2205597.
[25] Deng, L.; Hinton, G.; Kingsbury, B. (2013). "New types of deep neural network learning for speech recognition and related applications: An overview." ICASSP.
[26] Keynote talk: "Recent Developments in Deep Neural Networks." ICASSP, 2013 (by Geoff Hinton).
[27] Keynote talk: "Achievements and Challenges of Deep Learning - From Speech Analysis and Recognition To Language and Multimodal Processing." Interspeech, September 2014.
[28] G. E. Hinton. "Learning multiple layers of representation." Trends in Cognitive Sciences, 11, pp. 428-434, 2007.
[29] J. Schmidhuber. "Learning complex, extended sequences using the principle of history compression." Neural Computation, 4, pp. 234-242, 1992.
[30] J. Schmidhuber. "My First Deep Learning System of 1991 + Deep Learning Timeline 1962-2013."
[31] http://research.microsoft.com/apps/pubs/default.aspx?id=189004
[32] L. Deng et al. "Recent Advances in Deep Learning for Speech Research at Microsoft." ICASSP, 2013.
[33] L. Deng, O. Abdel-Hamid, and D. Yu. "A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion." ICASSP, 2013.
[34] T. Sainath et al. "Convolutional neural networks for LVCSR." ICASSP, 2013.
[35] D. Yu, L. Deng, G. Li, and F. Seide (2011). "Discriminative pretraining of deep neural networks." U.S. Patent Filing.
[36] NIPS Workshop: Deep Learning for Speech Recognition and Related Applications, Whistler, BC, Canada, Dec. 2009 (Organizers: Li Deng, Geoff Hinton, D. Yu).
[37] Yu, D.; Deng, L. (2014). Automatic Speech Recognition: A Deep Learning Approach. Springer.
[38] D. C. Ciresan et al. "Deep Big Simple Neural Nets for Handwritten Digit Recognition." Neural Computation, 22, pp. 3207-3220, 2010.
[39] R. Raina, A. Madhavan, A. Ng. "Large-scale Deep Unsupervised Learning using Graphics Processors." Proc. 26th Int. Conf. on Machine Learning, 2009.
[40] Riesenhuber, M.; Poggio, T. "Hierarchical models of object recognition in cortex." Nature Neuroscience 1999 (11): 1019-1025.
[41] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel. "Backpropagation Applied to Handwritten Zip Code Recognition." Neural Computation, 1(4):541-551, 1989.
[42] S. Hochreiter. "Untersuchungen zu dynamischen neuronalen Netzen." Diploma thesis, Institut f. Informatik, Technische Univ. Munich, 1991. Advisor: J. Schmidhuber.
[43] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies." In S. C. Kremer and J. F. Kolen, editors, A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.
[44] Hochreiter, Sepp; Schmidhuber, Jürgen. "Long Short-Term Memory." Neural Computation, 9(8):1735-1780, 1997.
[45] Graves, Alex; Schmidhuber, Jürgen. "Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks." In Bengio, Yoshua; Schuurmans, Dale; Lafferty, John; Williams, Chris K. I.; Culotta, Aron (eds.), Advances in Neural Information Processing Systems 22 (NIPS'22), December 7th-10th, 2009, Vancouver, BC. Neural Information Processing Systems (NIPS) Foundation, 2009, pp. 545-552.
[46] Graves, A.; Liwicki, M.; Fernandez, S.; Bertolami, R.; Bunke, H.; Schmidhuber, J. "A Novel Connectionist System for Improved Unconstrained Handwriting Recognition." IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (5): 2009.
[47] Sven Behnke (2003). Hierarchical Neural Networks for Image Interpretation (PDF). Lecture Notes in Computer Science 2766. Springer.
[48] Smolensky, P. (1986). "Information processing in dynamical systems: Foundations of harmony theory." In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. 1. pp. 194-281.
[49] Hinton, G. E.; Osindero, S.; Teh, Y. (2006). "A fast learning algorithm for deep belief nets" (PDF). Neural Computation 18 (7): 1527-1554. doi:10.1162/neco.2006.18.7.1527. PMID 16764513.
[50] Hinton, G. (2009). "Deep belief networks." Scholarpedia 4 (5): 5947. doi:10.4249/scholarpedia.5947.
[51] John Markoff (25 June 2012). "How Many Computers to Identify a Cat? 16,000." New York Times.
[52] Ng, Andrew; Dean, Jeff (2012). "Building High-level Features Using Large Scale Unsupervised Learning" (PDF).
[53] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, J. Schmidhuber. "Flexible, High Performance Convolutional Neural Networks for Image Classification." International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011.
[54] Martines, H.; Bengio, Y.; Yannakakis, G. N. (2013). "Learning Deep Physiological Models of Affect." IEEE Computational Intelligence 8 (2): 20.
[55] D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber. "Multi-Column Deep Neural Network for Traffic Sign Classification." Neural Networks, 2012.
[56] D. Ciresan, A. Giusti, L. Gambardella, J. Schmidhuber. "Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images." In Advances in Neural Information Processing Systems (NIPS 2012), Lake Tahoe, 2012.
[57] D. C. Ciresan, U. Meier, J. Schmidhuber. "Multi-column Deep Neural Networks for Image Classification." IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012.
[58] D. J. Felleman and D. C. Van Essen. "Distributed hierarchical processing in the primate cerebral cortex." Cerebral Cortex, 1, pp. 1-47, 1991.
[59] J. Weng. "Natural and Artificial Intelligence: Introduction to Computational Brain-Mind." BMI Press, ISBN 978-0985875725, 2012.
[60] J. Weng. "Why Have We Passed 'Neural Networks Do not Abstract Well'?" Natural Intelligence: the INNS Magazine, vol. 1, no. 1, pp. 13-22, 2011.
[64] T. Mikolov et al. "Recurrent neural network based language model." Interspeech, 2010.
[65] LeCun, Y. et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86 (11): 2278-2324. doi:10.1109/5.726791.
[66] G. E. Hinton et al. "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The shared views of four research groups." IEEE Signal Processing Magazine, pp. 82-97, November 2012.
[67] Y. Bengio et al. "Advances in optimizing recurrent networks." ICASSP, 2013.
[68] G. Dahl et al. "Improving DNNs for LVCSR using rectified linear units and dropout." ICASSP, 2013.
[69] G. E. Hinton. "A Practical Guide to Training Restricted Boltzmann Machines." Tech. Rep. UTML TR 2010-003, Dept. CS., Univ. of Toronto, 2010.
[70] Huang, Guang-Bin, Qin-Yu Zhu, and Chee-Kheong Siew. "Extreme learning machine: theory and applications." Neurocomputing 70.1 (2006): 489-501.
[71] Widrow, Bernard, et al. "The no-prop algorithm: A new learning algorithm for multilayer neural networks." Neural Networks 37 (2013): 182-188.
[72] Aleksander, Igor, et al. "A brief introduction to Weightless Neural Systems." ESANN. 2009.
[73] Hinton, G.E. "Deep belief networks." Scholarpedia 4 (5): 5947. doi:10.4249/scholarpedia.5947.
[74] H. Larochelle et al. "An empirical evaluation of deep architectures on problems with many factors of variation." In Proc. 24th Int. Conf. Machine Learning, pp. 473-480, 2007.
[75] G. E. Hinton. "Training Products of Experts by Minimizing Contrastive Divergence." Neural Computation, 14, pp. 1771-1800, 2002.
[76] A. Fischer and C. Igel. "Training Restricted Boltzmann Machines: An Introduction." Pattern Recognition 47, pp. 25-39, 2014.
[77] http://ufldl.stanford.edu/tutorial/index.php/Convolutional_Neural_Network
[83] Larochelle, Hugo; Salakhutdinov, Ruslan (2010). "Efficient Learning of Deep Boltzmann Machines" (PDF). pp. 693-700.
[84] Vincent, Pascal; Larochelle, Hugo; Lajoie, Isabelle; Bengio, Yoshua; Manzagol, Pierre-Antoine (2010). "Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion." The Journal of Machine Learning Research 11: 3371-3408.
[85] Deng, Li; Yu, Dong (2011). "Deep Convex Net: A Scalable Architecture for Speech Pattern Classification" (PDF). Proceedings of the Interspeech: 2285-2288.
[86] Deng, Li; Yu, Dong; Platt, John (2012). "Scalable stacking and learning for building deep architectures." 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP): 2133-2136.
[87] David, Wolpert (1992). "Stacked generalization." Neural Networks 5 (2): 241-259. doi:10.1016/S0893-6080(05)80023-1.
[88] Bengio, Yoshua (2009). "Learning deep architectures for AI." Foundations and Trends in Machine Learning 2 (1): 1-127.
[89] Hutchinson, Brian; Deng, Li; Yu, Dong (2012). "Tensor deep stacking networks." IEEE Transactions on Pattern Analysis and Machine Intelligence: 1-15.
[90] Hinton, Geoffrey; Salakhutdinov, Ruslan (2006). "Reducing the Dimensionality of Data with Neural Networks." Science 313: 504-507. doi:10.1126/science.1127647. PMID 16873662.
[91] Dahl, G.; Yu, D.; Deng, L.; Acero, A. (2012). "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition." IEEE Transactions on Audio, Speech, and Language Processing 20 (1): 30-42.
[92] Mohamed, Abdel-rahman; Dahl, George; Hinton, Geoffrey (2012). "Acoustic Modeling Using Deep Belief Networks." IEEE Transactions on Audio, Speech, and Language Processing 20 (1): 14-22.
[93] Courville, Aaron; Bergstra, James; Bengio, Yoshua (2011). "A Spike and Slab Restricted Boltzmann Machine" (PDF). International ... 15: 233-241.
[94] Mitchell, T.; Beauchamp, J. (1988). "Bayesian Variable Selection in Linear Regression." Journal of the American Statistical Association 83 (404): 1023-1032. doi:10.1080/01621459.1988.10478694.
[95] Courville, Aaron; Bergstra, James; Bengio, Yoshua (2011). "Unsupervised Models of Images by Spike-and-Slab RBMs" (PDF). Proceedings of the ... 10: 1-8.
[96] Hinton, Geoffrey; Osindero, Simon; Teh, Yee-Whye (2006). "A Fast Learning Algorithm for Deep Belief Nets." Neural Computation 1554: 1527-1554.
[97] Larochelle, Hugo; Bengio, Yoshua; Louradour, Jérôme; Lamblin, Pascal (2009). "Exploring Strategies for Training Deep Neural Networks." The Journal of Machine Learning Research 10: 1-40.
[98] Coates, Adam; Carpenter, Blake (2011). "Text Detection and Character Recognition in Scene Images with Unsupervised Feature Learning." pp. 440-445.
[99] Lee, Honglak; Grosse, Roger (2009). "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations." Proceedings of the 26th Annual International Conference on Machine Learning - ICML '09: 1-8.
[100] Lin, Yuanqing; Zhang, Tong (2010). "Deep Coding Network" (PDF). Advances in Neural ...: 1-9.
[101] Ranzato, Marc'Aurelio; Boureau, Y-Lan (2007). "Sparse Feature Learning for Deep Belief Networks" (PDF). Advances in neural information ...: 1-8.
[102] Socher, Richard; Lin, Clif (2011). "Parsing Natural Scenes and Natural Language with Recursive Neural Networks" (PDF). Proceedings of the ...
[103] Taylor, Graham; Hinton, Geoffrey (2006). "Modeling Human Motion Using Binary Latent Variables" (PDF). Advances in neural ...
[104] Vincent, Pascal; Larochelle, Hugo (2008). "Extracting and composing robust features with denoising autoencoders." Proceedings of the 25th international conference on Machine learning - ICML '08: 1096-1103.
[105] Kemp, Charles; Perfors, Amy; Tenenbaum, Joshua (2007). "Learning overhypotheses with hierarchical Bayesian models." Developmental Science 10 (3): 307-21. doi:10.1111/j.1467-7687.2007.00585.x. PMID 17444972.
[106] Xu, Fei; Tenenbaum, Joshua (2007). "Word learning as Bayesian inference." Psychol Rev. 114 (2): 245-72. doi:10.1037/0033-295X.114.2.245. PMID 17500627.
[107] Chen, Bo; Polatkan, Gungor (2011). "The Hierarchical Beta Process for Convolutional Factor Analysis and Deep Learning" (PDF). Machine Learning ...
[108] Fei-Fei, Li; Fergus, Rob (2006). "One-shot learning of object categories." IEEE Trans Pattern Anal Mach Intell. 28 (4): 594-611. doi:10.1109/TPAMI.2006.79. PMID 16566508.
[109] Rodriguez, Abel; Dunson, David (2008). "The Nested Dirichlet Process." Journal of the American Statistical Association 103 (483): 1131-1154. doi:10.1198/016214508000000553.
[110] Ruslan, Salakhutdinov; Joshua, Tenenbaum (2012). "Learning with Hierarchical-Deep Models." IEEE Transactions on Pattern Analysis and Machine Intelligence: 1-14. PMID 23267196.
[111] Chalasani, Rakesh; Principe, Jose (2013). "Deep Predictive Coding Networks." arXiv preprint: 1-13.
[112] Cho, Youngmin (2012). "Kernel Methods for Deep Learning" (PDF). pp. 1-9.
[113] Scholkopf, B.; Smola, Alexander (1998). "Nonlinear component analysis as a kernel eigenvalue problem." Neural Computation (44).
[114] L. Deng, G. Tur, X. He, and D. Hakkani-Tur. "Use of Kernel Deep Convex Networks and End-To-End Learning for Spoken Language Understanding." Proc. IEEE Workshop on Spoken Language Technologies, 2012.
[115] Mnih, Volodymyr, et al. (2015). "Human-level control through deep reinforcement learning" (PDF) 518: pp. 529-533.
[116] Hinton, Geoffrey E. "Distributed representations." (1984)
62.11. REFERENCES 405
[117] S. Das, C.L. Giles, G.Z. Sun, Learning Context Free [134] Z. Tuske, P. Golik, R. Schlter and H. Ney (2014).
Grammars: Limitations of a Recurrent Neural Network Acoustic Modeling with Deep Neural Networks Using
with an External Stack Memory, Proc. 14th Annual Raw Time Signal for LVCSR. Interspeech.
Conf. of the Cog. Sci. Soc., p. 79, 1992.
[135] McMillan, R. How Skype Used AI to Build Its Amazing
[118] Mozer, M. C., & Das, S. (1993). A connectionist symbol New Language Translator, Wire, Dec. 2014.
manipulator that discovers the structure of context-free
languages. NIPS 5 (pp. 863-870). [136] Hannun et al. (2014) Deep Speech: Scaling up end-to-
end speech recognition, arXiv:1412.5567.
[119] J. Schmidhuber. Learning to control fast-weight memo-
ries: An alternative to recurrent nets. Neural Computa- [137] Ron Schneiderman (2015) Accuracy, Apps Advance
tion, 4(1):131-139, 1992 Speech Recognition --- Interview with Vlad Sejnoha and
Li Deng, IEEE Signal Processing Magazine, Jan, 2015.
[120] F. Gers, N. Schraudolph, J. Schmidhuber. Learning pre-
cise timing with LSTM recurrent networks. JMLR 3:115- [138] http://yann.lecun.com/exdb/mnist/.
143, 2002.
[139] D. Ciresan, U. Meier, J. Schmidhuber., Multi-column
[121] J. Schmidhuber. An introspective network that can learn Deep Neural Networks for Image Classication, Tech-
to run its own weight change algorithm. In Proc. of the nical Report No. IDSIA-04-12', 2012.
Intl. Conf. on Articial Neural Networks, Brighton, pages
[140] Vinyals et al. (2014)."Show and Tell: A Neural Image
191-195. IEE, 1993.
Caption Generator, arXiv:1411.4555.
[122] Hochreiter, Sepp; Younger, A. Steven; Conwell, Peter R.
(2001). Learning to Learn Using Gradient Descent. [141] Fang et al. (2014)."From Captions to Visual Concepts and
ICANN 2001, 2130: 8794. Back, arXiv:1411.4952.
[123] Salakhutdinov, Ruslan, and Georey Hinton. Semantic [142] Kiros et al. (2014)."Unifying Visual-Semantic Embed-
hashing. International Journal of Approximate Reason- dings with Multimodal Neural Language Models, arXiv:
ing 50.7 (2009): 969-978. 1411.2539.
[124] Le, Quoc V., and Tomas Mikolov. Distributed repre- [143] Zhong, S.; Liu, Y.; Liu, Y. Bilinear Deep Learning for
sentations of sentences and documents. arXiv preprint Image Classication. Proceedings of the 19th ACM Inter-
arXiv:1405.4053 (2014). national Conference on Multimedia 11: 343352.
[125] Graves, Alex, Greg Wayne, and Ivo Danihelka. Neu- [144] Nvidia Demos a Car Computer Trained with Deep
ral Turing Machines. arXiv preprint arXiv:1410.5401 Learning (2015-01-06), David Talbot, MIT Technology
(2014). Review
[126] Weston, Jason, Sumit Chopra, and Antoine Bordes. [145] Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin., A Neu-
Memory networks. arXiv preprint arXiv:1410.3916 ral Probabilistic Language Model, Journal of Machine
(2014). Learning Research 3 (2003) 11371155', 2003.
[127] TIMIT Acoustic-Phonetic Continuous Speech Corpus Lin- [146] Goldberg, Yoav; Levy, Omar. word2vec Explained:
guistic Data Consortium, Philadelphia. Deriving Mikolov et al.s Negative-Sampling Word-
Embedding Method (PDF). Arxiv. Retrieved 26 October
[128] Abdel-Hamid, O. et al. (2014). Convolutional Neural 2014.
Networks for Speech Recognition. IEEE/ACM Transac-
tions on Audio, Speech, and Language Processing 22 (10): [147] Socher, Richard; Manning, Christopher. Deep Learning
15331545. doi:10.1109/taslp.2014.2339736. for NLP (PDF). Retrieved 26 October 2014.
[129] Deng, L.; Platt, J. (2014). Ensemble Deep Learning for [148] Socher, Richard; Bauer, John; Manning, Christopher; Ng,
Speech Recognition. Proc. Interspeech. Andrew (2013). Parsing With Compositional Vector
Grammars (PDF). Proceedings of the ACL 2013 confer-
[130] Yu, D.; Deng, L. (2010). Roles of Pre-Training ence.
and Fine-Tuning in Context-Dependent DBN-HMMs for
Real-World Speech Recognition. NIPS Workshop on [149] Socher, Richard (2013). Recursive Deep Models for
Deep Learning and Unsupervised Feature Learning. Semantic Compositionality Over a Sentiment Treebank
(PDF). EMNLP 2013.
[131] Deng L., Li, J., Huang, J., Yao, K., Yu, D., Seide, F. et al.
Recent Advances in Deep Learning for Speech Research [150] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil (2014)
at Microsoft. ICASSP, 2013. " A Latent Semantic Model with Convolutional-Pooling
Structure for Information Retrieval, Proc. CIKM.
[132] Deng, L.; Li, Xiao (2013). Machine Learning Paradigms
for Speech Recognition: An Overview. IEEE Transac- [151] P. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck
tions on Audio, Speech, and Language Processing. (2013) Learning Deep Structured Semantic Models for
Web Search using Clickthrough Data, Proc. CIKM.
[133] L. Deng, M. Seltzer, D. Yu, A. Acero, A. Mohamed, and
G. Hinton (2010) Binary Coding of Speech Spectrograms [152] I. Sutskever, O. Vinyals, Q. Le (2014) Sequence to Se-
Using a Deep Auto-encoder. Interspeech. quence Learning with Neural Networks, Proc. NIPS.
406 CHAPTER 62. DEEP LEARNING
[153] J. Gao, X. He, W. Yih, and L. Deng(2014) Learning [169] Quartz, SR; Sejnowski, TJ (1997). The neural ba-
Continuous Phrase Representations for Translation Mod- sis of cognitive development: A constructivist mani-
eling, Proc. ACL. festo. Behavioral and Brain Sciences 20 (4): 537556.
doi:10.1017/s0140525x97001581.
[154] J. Gao, P. Pantel, M. Gamon, X. He, L. Deng (2014)
Modeling Interestingness with Deep Neural Networks, [170] S. Blakeslee., In brains early growth, timetable may be
Proc. EMNLP. critical, The New York Times, Science Section, pp. B5
B6, 1995.
[155] J. Gao, X. He, L. Deng (2014) Deep Learning for Natu-
ral Language Processing: Theory and Practice (Tutorial), [171] {BUFILL} E. Bull, J. Agusti, R. Blesa., Human
CIKM. neoteny revisited: The case of synaptic plasticity, Amer-
ican Journal of Human Biology, 23 (6), pp. 729739,
[156] Arrowsmith, J; Miller, P (2013). Trial watch: Phase 2011.
II and phase III attrition rates 2011-2012. Nature Re-
views Drug Discovery 12 (8): 569. doi:10.1038/nrd4090. [172] J. Shrager and M. H. Johnson., Timing in the develop-
PMID 23903212. ment of cortical function: A computational approach, In
B. Julesz and I. Kovacs (Eds.), Maturational windows and
[157] Verbist, B; Klambauer, G; Vervoort, L; Talloen, W; adult cortical plasticity, 1995.
The Qstar, Consortium; Shkedy, Z; Thas, O; Ben-
der, A; Ghlmann, H. W.; Hochreiter, S (2015). [173] D. Hernandez., The Man Behind the Google
Using transcriptomics to guide lead optimiza- Brain: Andrew Ng and the Quest for the New
tion in drug discovery projects: Lessons learned AI, http://www.wired.com/wiredenterprise/2013/
from the QSTAR project. Drug Discovery Today. 05/neuro-artificial-intelligence/all/. Wired, 10 May
doi:10.1016/j.drudis.2014.12.014. PMID 25582842. 2013.
[174] C. Metz., Facebooks 'Deep Learning' Guru Reveals
[158] Announcement of the winners of the Merck Molec-
the Future of AI, http://www.wired.com/wiredenterprise/
ular Activity Challenge https://www.kaggle.com/c/
2013/12/facebook-yann-lecun-qa/. Wired, 12 December
MerckActivity/details/winners.
2013.
[159] Dahl, G. E.; Jaitly, N.; & Salakhutdinov, R. (2014)
[175] G. Marcus., Is Deep Learning a Revolution in Articial
Multi-task Neural Networks for QSAR Predictions,
Intelligence?" The New Yorker, 25 November 2012.
ArXiv, 2014.
[176] Smith, G. W. (March 27, 2015). Art and Articial Intel-
[160] Toxicology in the 21st century Data Challenge https://
ligence. ArtEnt. Retrieved March 27, 2015.
tripod.nih.gov/tox21/challenge/leaderboard.jsp
[177] Alexander Mordvintsev, Christopher Olah, and Mike
[161] NCATS Announces Tox21 Data Challenge Winners Tyka (June 17, 2015). Inceptionism: Going Deeper into
http://www.ncats.nih.gov/news-and-events/features/ Neural Networks. Google Research Blog. Retrieved
tox21-challenge-winners.html June 20, 2015.
[162] Unterthiner, T.; Mayr, A.; Klambauer, G.; Steijaert, M.; [178] Alex Hern (June 18, 2015). Yes, androids do dream of
Ceulemans, H.; Wegner, J. K.; & Hochreiter, S. (2014) electric sheep. The Guardian. Retrieved June 20, 2015.
Deep Learning as an Opportunity in Virtual Screening.
Workshop on Deep Learning and Representation Learn- [179] Ben Goertzel. Are there Deep Reasons Underlying
ing (NIPS2014). the Pathologies of Todays Deep Learning Algorithms?
(2015) Url: http://goertzel.org/DeepLearning_v1.pdf
[163] Unterthiner, T.; Mayr, A.; Klambauer, G.; & Hochre-
iter, S. (2015) Toxicity Prediction using Deep Learning. [180] Nguyen, Anh, Jason Yosinski, and Je Clune. Deep
ArXiv, 2015. Neural Networks are Easily Fooled: High Condence
Predictions for Unrecognizable Images. arXiv preprint
[164] Ramsundar, B.; Kearnes, S.; Riley, P.; Webster, D.; Kon- arXiv:1412.1897 (2014).
erding, D.;& Pande, V. (2015) Massively Multitask Net-
works for Drug Discovery. ArXiv, 2015. [181] Szegedy, Christian, et al. Intriguing properties of neural
networks. arXiv preprint arXiv:1312.6199 (2013).
[165] Tkachenko, Yegor. Autonomous CRM Control via CLV
[182] Zhu, S.C.; Mumford, D. A stochastic grammar of im-
Approximation with Deep Reinforcement Learning in
ages. Found. Trends. Comput. Graph. Vis. 2 (4): 259
Discrete and Continuous Action Space. (April 8, 2015).
362. doi:10.1561/0600000018.
arXiv.org: http://arxiv.org/abs/1504.01840
[183] Miller, G. A., and N. Chomsky. Pattern conception.
[166] Utgo, P. E.; Stracuzzi, D. J. (2002). Many-
Paper for Conference on pattern detection, University of
layered learning. Neural Computation 14: 24972529.
Michigan. 1957.
doi:10.1162/08997660260293319.
[184] Jason Eisner, Deep Learning of Recursive Struc-
[167] J. Elman, et al., Rethinking Innateness, 1996. ture: Grammar Induction, http://techtalks.tv/talks/
[168] Shrager, J.; Johnson, MH (1996). Dynamic plastic- deep-learning-of-recursive-structure-grammar-induction/
ity inuences the emergence of function in a simple 58089/
cortical array. Neural Networks 9 (7): 11191129. [185] http://singa.incubator.apache.org/
doi:10.1016/0893-6080(96)00033-0.
Information theory Source: https://en.wikipedia.org/wiki/Information_theory?oldid=668259571 Contributors: AxelBoldt, Brion VIB-
BER, Timo Honkasalo, Ap, Graham Chapman, XJaM, Toby Bartels, Hannes Hirzel, Edward, D, PhilipMW, Michael Hardy, Isomorphic,
Kku, Bobby D. Bryant, Varunrebel, Vinodmp, AlexanderMalmberg, Karada, (, Iluvcapra, Minesweeper, StephanWehner, Ahoerstemeier,
Angela, LittleDan, Kevin Baas, Poor Yorick, Andres, Novum, Charles Matthews, Guaka, Bemoeial, Ww, Dysprosia, The Anomebot, Mun-
ford, Hyacinth, Ann O'nyme, Shizhao, AnonMoos, MH~enwiki, Robbot, Fredrik, Rvollmert, Seglea, Chancemill, Securiger, SC, Lupo,
Wile E. Heresiarch, ManuelGR, Ancheta Wis, Giftlite, Lee J Haywood, COMPATT, KuniShiro~enwiki, SoWhy, Andycjp, Ynh~enwiki,
Pcarbonn, OverlordQ, L353a1, Rdsmith4, APH, Cihan, Elektron, Creidieki, Neuromancien, Rakesh kumar, Picapica, D6, Jwdietrich2,
Imroy, CALR, Noisy, Rich Farmbrough, Guanabot, NeuronExMachina, Ericamick, Xezbeth, MisterSheik, Crunchy Frog, El C, Spoon!,
Simon South, Bobo192, Smalljim, Rbj, Maurreen, Nothingmuch, Photonique, Andrewbadr, Haham hanuka, Pearle, Mpeisenbr, Mdd,
Msh210, Uncle Bill, Pouya, BryanD, PAR, Cburnett, Jheald, Geraldshields11, Kusma, DV8 2XL, Oleg Alexandrov, FrancisTyers, Velho,
Woohookitty, Linas, Mindmatrix, Ruud Koot, Eatsaq, Eyreland, SeventyThree, Kanenas, Graham87, Josh Parris, Mayumashu, SudoMonas,
Arunkumar, HappyCamper, Alejo2083, Chris Pressey, Mathbot, Annacoder, Nabarry, Srleer, Chobot, Commander Nemet, FrankTobia,
Siddhant, YurikBot, Wavelength, RobotE, RussBot, Michael Slone, Loom91, Grubber, ML, Yahya Abdal-Aziz, Raven4x4x, Moe Epsilon,
DanBri, Allchopin, Light current, Mceliece, Arthur Rubin, Lyrl, GrinBot~enwiki, Sardanaphalus, Lordspaz, SmackBot, Imz, Henri de
Solages, Incnis Mrsi, Reedy, InverseHypercube, Cazort, Gilliam, Metacomet, Octahedron80, Spellchecker, Colonies Chris, Jahiegel, Un-
nikrishnan.am, LouScheer, Calbaer, EPM, Djcmackay, Michael Ross, Tyrson, Jon Awbrey, Het, Bidabadi~enwiki, Chungc, SashatoBot,
Nick Green, Harryboyles, Sina2, Lachico, Almkglor, Bushsf, Sir Nicholas de Mimsy-Porpington, FreezBee, Dicklyon, E-Kartoel, Wiz-
ard191, Matthew Verey, Isvish, ScottHolden, CapitalR, Gnome (Bot), Tawkerbot2, Marty39, Daggerstab, CRGreathouse, Thermochap,
Ale jrb, Thomas Keyes, Mct mht, Pulkitgrover, Grgarza, Maria Vargas, Roman Cheplyaka, Hpalaiya, Vanished User jdksfajlasd, Near-
far, Heidijane, Thijs!bot, N5iln, WikiIT, James086, PoulyM, Edchi, D.H, Jvstone, HSRT, JAnDbot, BenjaminGittins, RainbowCrane,
Jthomp4338, Buettcher, MetsBot, David Eppstein, Pax:Vobiscum, MartinBot, Tamer ih~enwiki, Sigmundg, Jargon777, Policron, Useight,
VolkovBot, Joeoettinger, JohnBlackburne, Jimmaths, Constant314, Starrymessenger, Kjells, Magmi, AllGloryToTheHypnotoad, Bemba,
Radagast3, Newbyguesses, SieBot, Ivan tambuk, Robert Loring, Masgatotkaca, Pcontrop, Algorithms, Anchor Link Bot, Melcombe,
ClueBot, Fleem, Ammarsakaji, Estirabot, 7&6=thirteen, Oldrubbie, Vegetator, Singularity42, Dziewa, Lambtron, Johnuniq, SoxBot III,
HumphreyW, Vanished user uih38riiw4hjlsd, Mitch Ames, Addbot, Deepmath, Peerc, Eweinber, Sun Ladder, C9900, L.exsteens, Xlasne,
Egoistorms, Luckas-bot, Quadrescence, Yobot, TaBOT-zerem, Taxisfolder, Carleas, Twohoos, Cassandra Cathcart, Dbln, Materialscientist,
Informationtheory, Jingluolaodao, Expooz, Raysonik, Xqbot, Isheden, Informationtricks, Dani.gomezdp, PHansen, Masrudin, FrescoBot,
Nageh, Tiramisoo, Sanpitch, Gnomehacker, Pinethicket, Momergil, SkyMachine, SchreyP, Lotje, Miracle Pen, Vanadiumho, Kastchei,
Djjr, EmausBot, WikitanvirBot, Jmencisom, Bethnim, Quondum, Henriqueroscoe, Terra Novus, ClueBot NG, MelbourneStar, Barrel-
Proof, TimeOfDei, Frietjes, Thepigdog, Pzrq, MrJosiahT, Lawsonstu, Helpful Pixie Bot, Leopd, BG19bot, Wiki13, Trevayne08, Citation-
CleanerBot, Brad7777, Schafer510, BattyBot, Bankmichael1, SFK2, Jochen Burghardt, Limit-theorem, Szzoli, 314Username, Roastliras,
Eigentensor, Comp.arch, SakeUPenn, Logan.dunbar, Prof. Michael Bank, DanBalance, JellydPuppy, KasparBot and Anonymous: 262
Computational science Source: https://en.wikipedia.org/wiki/Computational_science?oldid=668832468 Contributors: SebastianHelm,
Charles Matthews, Fredrik, Mpiper, Ancheta Wis, Behnam, Hugh Mason, Alsocal, Discospinster, Cjb88, Andrewwall, Mykej, Wayne
Schroeder, Oleg Alexandrov, Ruud Koot, Vegaswikian, Meawoppl, Raikkonen, JJL, SmackBot, Peter Sloot, Chris the speller, Janm67,
LuchoX, Beetstra, Rhebus, Hu12, Haus, CmdrObot, Pgr94, Aajaja, Mentisto, VoABot II, Jwagnerhki, Mrseacow, SharkD, Antony-22,
JonMcLoone, Arithma, Sargursrihari, Lordvolton, Mcsmom, Douglas256, Kayvan45622, ClueBot, Lgstarn, Eurekaman, Tomtzigt, Qwfp,
Zoidbergmd, T68492, Nicoguaro, Superzoulou, Addbot, Wordsoup, Ekojekoj, Jarble, , Yobot, Caracho, AnomieBOT, Materialsci-
entist, P99am, Isheden, CES1596, FrescoBot, Kiefer.Wolfowitz, Gryllida, Willihans, Lotje, Persian knight shiraz, Almarklein, Ouji-n,
Beatnik8983, Super48paul, Vilietha, RockMagnetist, Vincehradil, ClueBot NG, Aiwing, Dslice00, Northamerica1000, Nsda, Anbu121,
EricEnfermero, Ema--or, Compsim, Ice666, Shtamy, Schimmdog, MariSo87, Jaime hoyos b, Yakitawa and Anonymous: 58
Exploratory data analysis Source: https://en.wikipedia.org/wiki/Exploratory_data_analysis?oldid=670352181 Contributors: Michael
Hardy, PuzzletChung, Schutz, Aetheling, Astaines, Cutler, Filemon, Giftlite, Jason Quinn, Khalid hassani, Bender235, El-
wikipedista~enwiki, Ejrrjs, BlueNovember, Mdd, Oleg Alexandrov, David Haslam, Btyner, Holek, Larman, Rjwilmsi, Chobot, Dppl,
YurikBot, Wavelength, Morphh, Closedmouth, Zvika, SmackBot, DCDuring, Jtneill, CommodiCast, Nbarth, Gragus, Sergio.ballestrero,
Abmac, Ft93110, Smallpond, Talgalili, JAnDbot, Magioladitis, Jim Douglas, Johnbibby, Rlsheehan, Luc.girardin, Ignacio Icke, STBotD,
Botx, Farcaster, Msrasnw, Melcombe, Delphis wk, Linforest, Bea68~enwiki, DragonBot, Ezsdgxrfhcv, Stealth500, BOTarate, JamesX-
inzhiLi, Qwfp, Tayste, Addbot, Bruce rennes, Kapilghodawat, NjardarBot, Srylesmor, Visnut, Delaszk, Yobot, AlotToLearn, Pablo07,
Dpoduval, Citation bot, ArthurBot, GrouchoBot, Omnipaedista, Decisal, Thehelpfulbot, Rdledesma, Boxplot, Kiefer.Wolfowitz, Dmitry
St, DixonDBot, Mean as custard, Valeropedro, Macseven, RockMagnetist, Mathstat, WikiMSL, BG19bot, Therealjoefaith, TwoTwoHello,
Naabou, Municca, Mkhambaty, Dave Braunschweig, Story645, Fratilias, Alakzi, KasparBot, Olgreenwood and Anonymous: 61
Predictive analytics Source: https://en.wikipedia.org/wiki/Predictive_analytics?oldid=669931704 Contributors: The Anome, Edward,
Michael Hardy, Kku, Karada, Ronz, Avani~enwiki, Giftlite, SWAdair, Thorwald, Bender235, Rubicon, Calair, Rajah, Mdd, Andrew-
410 CHAPTER 62. DEEP LEARNING
pmk, Stephen Turner, Dominic, Oleg Alexandrov, OwenX, HughJorgan, RussBot, Leighblackall, Aeusoes1, Gloumouth1, DeadEyeAr-
row, EAderhold, Zzuuzz, Talyian, Allens, Zvika, Yvwv, SmackBot, CommodiCast, Mcld, Gilliam, EncMstr, Eudaemonic3, S2rikanth,
Onorem, JonHarder, Krexer, Lpgeen, Doug Bell, Kuru, IronGargoyle, JHunterJ, Ralf Klinkenberg, CmdrObot, Van helsing, Reques-
tion, Pgr94, Myasuda, LeoHeska, Dancter, Talgalili, Scientio, Marek69, JustAGal, Batra, Mr. Darcy, AntiVandalBot, MER-C, Fbooth,
VoABot II, Baccyak4H, Bellemichelle, Sweet2, Gregheth, Apdevries, Sudheervaishnav, Ekotkie, Dontdoit, Jfroelich, Trusilver, Dvdpwiki,
Ramkumar.krishnan, Atama, Bonadea, BernardZ, Cyricx, GuyRo, Deanabb, Dherman652, Arpabr, Selain03, Hherbert, Cuttysc, Vikx1,
Ralftgehrig, Rpm698, MaynardClark, Drakedirect, Chrisguyot, Melcombe, Maralia, Into The Fray, Kai-Hendrik, Ahyeek, Howie Goodell,
Sterdeus, SpikeToronto, Stephen Milborrow, Jlamro, Isthisthingworking, SchreiberBike, Bateni, Cookiehead, Qwfp, Angoss, Sunsetsky,
Vianello, MystBot, Vaheterdu, BizAnalyst, Addbot, MrOllie, Download, Yobot, SOMart, AnomieBOT, IRP, Nosperantos, Citation bot,
Jtamad, BlaineKohl, Phy31277, CorporateM, GESICC, FrescoBot, Boxplot, I dream of horses, Triplestop, Dmitry St, SpaceFlight89, Jack-
verr, , Peter.borissow, Ethansdad, Cambridgeblue1971, Pamparam, Kmettler, Glenn Maddox, Vrenator, Crysb, DARTH SIDIOUS 2,
Onel5969, WikitanvirBot, Sugarfoot1001, Jssgator, Chire, MainFrame, Synecticsgroup, ClueBot NG, Thirdharringtonskier, Stefanomaz-
zalai, Ricekido, Widr, Luke145, Mikeono, Helpful Pixie Bot, WhartonCAI, JonasJSchreiber, Wbm1058, BG19bot, Rafaelgmonteiro,
Lisasolomonsalford, TLAN38, Lynnlangit, Flaticida, MrBill3, MC Wapiti, HHinman, BattyBot, Raspabill, Jeremy Kolb, TwoTwoHello,
Andrux, Mkhambaty, Cmdima, HeatherMKCampbell, Tommycarney, Tentinator, Sgolestanian, Gbtodd29, Brishtikonna, Thisaccoun-
tisbs, Mitchki.nj, Jvn mht, JaconaFrere, Pablodim91, Cbuyle, Monkbot, AmandaJohnson2014, JSHorwitz, Thomas Speidel, HappyVDZ,
Wikiperson99, Justincahoon, Femiolajiga, Stevennlay, Rlm1188, Bildn, Nivedita1414, Frankecoker, Annaelison, Olosko, Vedanga Ku-
mar, Gary2015, HelpUsStopSpam, Heinrichvk, Rodionos, Olavlaudy and Anonymous: 215
Business intelligence Source: https://en.wikipedia.org/wiki/Business_intelligence?oldid=667392522 Contributors: Manning Bartlett, Ant,
Chuq, Leandrod, Michael Hardy, Norm, Nixdorf, Kku, SebastianHelm, Ellywa, Ronz, Mkoval, Elvis, Mydogategodshat, Jay, Rednblu,
Pedant17, Chuckrussell, Traroth, Robbot, ZimZalaBim, Mirv, Aetheling, Lupo, Wile E. Heresiarch, Mattaschen, Psb777, Ianhowlett,
Beardo, AlistairMcMillan, Intergalacticz9, Macrakis, Joelm, Khalid hassani, Alem~enwiki, Edcolins, Golbez, Lucky 6.9, Roc, Alexf,
Beland, Bharatcit, Heirpixel, Karl-Henner, Gscshoyru, DMG413, Kadambarid, Guppynsoup, KeyStroke, Discospinster, Rhobite, Mart-
pol, S.K., RJHall, Saturnight, Just zis Guy, you know?, Etz Haim, Tjic, Reinyday, John Vandenberg, Maurreen, MPerel, Nsaa, Mdd,
Gwalarn, Alansohn, Gary, PaulHanson, Arthena, ABCD, Snowolf, Wtmitchell, Evil Monkey, Sciurin, Brookie, Stephen, Zntrip, Dr
Gangrene, Woohookitty, Mindmatrix, TigerShark, Camw, Arcann, Je3000, GregorB, Liface, Stefanomione, DePiep, Hans Genten, Dou-
glasGreen~enwiki, Ademkader, Slant, Alberrosidus, AndriuZ, ViriiK, M7bot, Danielsmith, Chrisvonsimson, Bgwhite, Wavelength, Tex-
asAndroid, StuOfInterest, RussBot, AVM, Bhny, DanMS, Manop, Grafen, Welsh, Joel7687, Aaron Brenneman, Muu-karhu, Mikeblas,
Zwobot, Pamela Haas, Langbk01, Zzuuzz, Chase me ladies, I'm the Cavalry, Arthur Rubin, Nraden, Guillom, Katieh5584, Tom Mor-
ris, Veinor, Drcwright, SmackBot, Schniider~enwiki, McGeddon, MeiStone, Brick Thrower, CommodiCast, Eskimbot, Ohnoitsjamie,
Folajimi, Jcarroll, Setti, Chris the speller, Bluebot, Stevage, Swells65, Nick Levine, TheKMan, Xyzzyplugh, Mitrius, Krich, Warren,
Yasst8, Ohconfucius, Wikiolap, Eliyak, Kuru, Tomhubbard, Dreamrequest, ElixirTechnology, Beetstra, Ashil04, Frederikton, Blork-mtl,
Larrymcp, Waggers, MTSbot~enwiki, Peyre, Aspandyar, Apolitano, AdjustablePliers, OnBeyondZebrax, Lancet75, IvanLanin, Az1568,
Nhgaudreau, Codeculturist, HMishko, SkyWalker, Racecarradar, CmdrObot, ShelfSkewed, Nmoureld, Cryptblade, Dancter, X0lani,
Roberta F., FrancoGG, Thijs!bot, Wernight, Qwyrxian, Czenek~enwiki, PerfectStorm, CharlesHoman, Batra, QuiteUnusual, Prolog,
Charlesmnicholls, Kbeneby, Lfstevens, JAnDbot, Sarnholm, MER-C, Rongou, YK Times, Entgroupzd, Technologyvoices, Supercactus,
Magioladitis, VoABot II, Rajashekar iitm, Vanished user ty12kl89jq10, Nposs, Ionium, Peters72, WLU, Halfgoku, Cquels, Iamthenewno2,
R'n'B, Gary a mason, Trusilver, Svetovid, DanDoughty, Siryendor, Extransit, A40220, Wxhat1, Sinotara, Srknet, Edit06, Mark Bosley,
Naniwako, L'Aquatique, Islamomt, Wendecover, Priyank bolia, WinterSpw, Phani96, Seankenalty, Je G., Philip Trueman, TXiKiBoT,
Blackstar138, Perohanych, Technopat, Rich Janis, Fredsmith2, Mcclarke, Bansipatel, Andy Dingley, JukoFF, Wikidan829, Ceranthor,
Quantpole, Ermite~enwiki, Hazel77, Dwandelt, Moonriddengirl, SEOtools, Julianclark, Android Mouse Bot, Ireas, ObserverToSee, Corp
Vision, Janner210, Aadeel, Bcarrdba, Ncw0617, Melcombe, Denisarona, Jvlock527, Ukpremier, Martarius, ClueBot, WriterListener,
Natasha81, John ellenberger, Supertouch, Ryan Rasmussen, Chrisawiki, Niceguyedc, Rickybarron, LeoFrank, Srkview, Aexis, Seth-
Grimes, Kit Berg, Jpnofseattle, Mymallandnews, Tompana82, Sierramadrid, DumZiBoT, Man koznar~enwiki, Jmkim dot com, Ejosse1,
Writerguy71, Addbot, Butterwell, Mehtasanjay, Wsvlqc, Ronhjones, BlackLips, MrOllie, Glane23, Fauxstar, Lightbot, , Pravi-
surabhi, Luckas-bot, Yobot, Fraggle81, Travis.a.buckingham, Evans1982, Becky Sayles, Coolpriyanka10, IW.HG, Sualfradique, Intelli-
gentknowledgeyoudrive, AnomieBOT, Rubinbot, Jsmith1108, Piano non troppo, Materialscientist, Citation bot, Stationcall, Quebec99,
Jehan21, BlaineKohl, Momotoshi, Wperdue, JVRudnick, Prazan, Euthenicsit, Wilcoxaj, RibotBOT, Urchandru, Mathonius, Lovede-
mon84, Shadowjams, Opagecrtr, Forceblue, Force88, , Mark Renier, Wiki episteme, Glaugh, D'ohBot, Greenboite, Pacic202,
Pinethicket, Rayrubenstein, Qqppqqpp, Triplestop, Jim380, Serols, Dnedzel, Jandalhandler, Steelsabre, Ordnascrazy, Hyphen DJW, ITPer-
forms, Ethansdad, Genuinedierence, Wondigoma, Iaantheron, Ansumang, Crysb, Dr.apostrophe, Goyalaishwarya, Sulhan, Vasant Dhar,
Navvod, RjwilmsiBot, Bonanjef, Ananthnats, Rollins83, ITtalker, Helwr, Logical Cowboy, Timtempleton, JEL888, Dewritech, Kellylautt,
K6ka, AsceticRose, Jesaisca, Dnazip, F, Jahub, TheWakeUpFactory, Ruislick0, Alpha Quadrant (alt), Eken7, Makecat, Erianna, Tjtyrrell,
Openstrings, L Kensington, Yorkshiresoul, Bevelson, Alexandra31, Beroneous, Hanantaiber, Outbackprincess, ClueBot NG, CaveJohnson,
AMJBIUser, This lousy T-shirt, Jaej, Qarakesek, Happyinmaine, Robiminer, Mathew105601, Widr, Pmresource, Helpful Pixie Bot, Tim-
Mulherin, Bpm123, Kaimu17, BG19bot, Vaulttech, Mr.Gaebrial, Joshua.pierce84, Xjengvyen, Jwcga, Chafe66, Loripiquet, Einsteinlebt,
Reverend T. R. Malthus, Meclee, Sutanupaul, Y.Kondrykava, Khazar2, Dhavalp453, Jkofron4, Ivytrejo, Cwobeel, Meg2765, Mogism,
Bpmbooks, XXN, Riyadmks, OnTheNet21, Michael.h.zimmerman, Ergoodell, Zkhall, Mangotron, HowardDresner, Mikevandeneijnden,
Yanis ahmed, Dkrapohl, Ginsuloft, ReclaGroup, BIcurious3334, DauphineBI, Lakun.patra, Compprof9000, Mgt88drcr, Julep.hawthorne,
Tastiera, Marc Schnwandt, TechnoTalk, Wiki-jonne, BrettofMoore, Vanished user 9j34rnfjemnrjnasj4, Generalcontributor, Clumsied,
Mihaescu Constantin, Deever21, Xpansa, Galaktikasoft, Frankecoker, Gary2015, BrandonMcBride, Brendonritz, ThatKongregateGuy,
Soheilmamani and Anonymous: 692
Analytics Source: https://en.wikipedia.org/wiki/Analytics?oldid=670788887 Contributors: SimonP, Michael Hardy, Kku, Ronz, Julesd,
Charles Matthews, Dysprosia, Kadambarid, Stephenpace, Visviva, Hanswaarle, Je3000, GrundyCamellia, Rjwilmsi, Intgr, Srleer,
Rick lightburn, DeadEyeArrow, MagneticFlux, SmackBot, C.Fred, CommodiCast, Ohnoitsjamie, BenAveling, PitOfBabel, Deli nk,
Sergio.ballestrero, Wikiolap, Kuru, Ocatecir, 16@r, Beetstra, RichardF, IvanLanin, Hobophobe, Lamiot, Zgemignani, NishithSingh,
Gogo Dodo, Barticus88, Brandoneus, QuiteUnusual, TFinn734, Magioladitis, Prabhu137, Elringo, Kimleonard, KylieTastic, Jevansen,
VolkovBot, Trevorallred, Jimmaths, Tavix, BarryList, Bansipatel, Rpanigassi, Kerenb, Falcon8765, LittleBenW, Planbhups, Sanya r, Mel-
combe, Maralia, Aharol, Apptrain, Ottawahitech, Cyberjacob, GDibyendu, Deineka, Addbot, Mortense, Freakmighty, MrOllie, Glory-
daze716, Luckas-bot, Yobot, Ptbotgourou, Freikorp, AnomieBOT, HikeBandit, Spugsley, BlaineKohl, Kerberus13, Omnipaedista, Em-
cien, FrescoBot, James Doehring, Ethansdad, Jonkerz, RjwilmsiBot, TjBot, DASHBot, Timtempleton, Kellylautt, Tmguru, Simplyuttam,
Idea Farm, KyleAraujo, Gregory787, Paolo787, ClueBot NG, Networld1965, Helpful Pixie Bot, WhartonCAI, Wbm1058, Mkennedy1981,
BG19bot, Pine, Jobin RV, Loripiquet, Analytically, MikeLampaBI, Clkim, Melenc, Cryptodd, TheAdamEvans, Vishal.dani, OnTheNet21,
62.12. TEXT AND IMAGE SOURCES, CONTRIBUTORS, AND LICENSES 411
Municca, Dougs campbell, Gmid associates, Makvar, RupertLipton1986, I am One of Many, Edwinboothnyc, JuanCarlosBrandt, Luke-
bradford, Jenny Rankin, Prussonyc, Rzicari, Ramg iitk, Daph8, Filedelinkerbot, Raj Aradhyula, Aymanmogh, Bildn, Anecdotic, Vidyasnap,
Abcdudtc and Anonymous: 110
Data mining Source: https://en.wikipedia.org/wiki/Data_mining?oldid=670989267 Contributors: Dreamyshade, WojPob, Bryan Derk-
sen, The Anome, Ap, Verloren, Andre Engels, Fcueto, Matusz, Deb, Boleslav Bobcik, Hefaistos, Mswake, N8chz, Michael Hardy, Con-
fusss, Fred Bauder, Isomorphic, Nixdorf, Dhart, Ixfd64, Lament, Alo, CesarB, Ahoerstemeier, Haakon, Ronz, Angela, Den fjttrade
ankan~enwiki, Netsnipe, Jtzg, Tristanb, Hike395, Mydogategodshat, Dcoetzee, Andrevan, Jay, Fuzheado, WhisperToMe, Epic~enwiki,
Tpbradbury, Furrykef, Traroth, Nickshanks, Joy, Shantavira, Pakcw, Robbot, ZimZalaBim, Altenmann, Henrygb, Ojigiri~enwiki, Sun-
ray, Aetheling, Apogr~enwiki, Wile E. Heresiarch, Tobias Bergemann, Filemon, Adam78, Alan Liefting, Giftlite, ShaunMacPherson,
Sepreece, Philwelch, Tom harrison, Jkseppan, Simon Lacoste-Julien, Ianhowlett, Varlaam, LarryGilbert, Kainaw, Siroxo, Adam McMas-
ter, Just Another Dan, Neilc, Comatose51, Chowbok, Gadum, Pgan002, Bolo1729, SarekOfVulcan, Raand, Antandrus, Onco p53, Over-
lordQ, Gscshoyru, Urhixidur, Kadambarid, Mike Rosoft, Monkeyman, KeyStroke, Rich Farmbrough, Nowozin, Stephenpace, Vitamin b,
Bender235, Flyskippy1, Marner, Aaronbrick, Etz Haim, Janna Isabot, Mike Schwartz, John Vandenberg, Maurreen, Ejrrjs, Nsaa, Mdd,
Alansohn, Gary, Walter Grlitz, Denoir, Rd232, Jeltz, Jet57, Jamiemac, Malo, Compo, Caesura, Axeman89, Vonaurum, Oleg Alexan-
drov, Jefgodesky, Nuno Tavares, OwenX, Woohookitty, Mindmatrix, Katyare, TigerShark, LOL, David Haslam, Ralf Mikut, GregorB,
Hynespm, Essjay, MarcoTolo, Joerg Kurt Wegner, Simsong, Lovro, Tslocum, Graham87, Deltabeignet, BD2412, Kbdank71, DePiep,
CoderGnome, Chenxlee, Sjakkalle, Rjwilmsi, Gmelli, Lavishluau, Michal.burda, Bubba73, Bensin, GeorgeBills, GregAsche, HughJor-
gan, Twerbrou, FlaBot, Emarsee, AlexAnglin, Ground Zero, Mathbot, Jrtayloriv, Predictor, Bmicomp, Compuneo, Vonkje, Gurubrahma,
BMF81, Chobot, DVdm, Bgwhite, The Rambling Man, YurikBot, Wavelength, NTBot~enwiki, H005, Phantomsteve, AVM, Hede2000,
Splash, SpuriousQ, Ansell, RadioFan, Hydrargyrum, Gaius Cornelius, Philopedia, Bovineone, Zeno of Elea, EngineerScotty, Nawlin-
Wiki, Grafen, ONEder Boy, Mshecket, Aaron Brenneman, Jpbowen, Tony1, Dlyons493, DryaUnda, Bota47, Tlevine, Ripper234, Gra-
ciella, Deville, Zzuuzz, Lt-wiki-bot, Fang Aili, Pb30, Modify, GraemeL, Wikiant, JoanneB, LeonardoRob0t, ArielGold, Katieh5584,
John Broughton, SkerHawx, Capitalist, Palapa, SmackBot, Looper5920, ThreeDee912, TestPilot, Unyoyega, Cutter, KocjoBot~enwiki,
Bhikubhadwa, Thunderboltz, CommodiCast, Comp8956, Delldot, Eskimbot, Slhumph, Onebravemonkey, Ohnoitsjamie, Skizzik, Some-
wherepurple, Leo505, MK8, Thumperward, DHN-bot~enwiki, Tdelamater, Antonrojo, Dierentview, Janvo, Can't sleep, clown will eat
me, Sergio.ballestrero, Frap, Nixeagle, Serenity-Fr, Thefriedone, JonHarder, Propheci, Joinarnold, Bennose, Mackseem~enwiki, Rada-
gast83, Nibuod, Daqu, DueSouth, Blake-, Krexer, Weregerbil, Vina-iwbot~enwiki, Andrei Stroe, Deepred6502, Spiritia, Lambiam, Wiki-
olap, Kuru, Bmhkim, Vgy7ujm, Calum Macisdean, Athernar, Burakordu, Feraudyh, 16@r, Beetstra, Mr Stephen, Jimmy Pitt, Julthep,
Dicklyon, Waggers, Ctacmo, RichardF, Nabeth, Beefyt, Hu12, Enggakshat, Vijay.babu.k, Ft93110, Dagoldman, Veyklevar, Ralf Klinken-
berg, JHP, IvanLanin, Paul Foxworthy, Adrian.walker, Linkspamremover, CRGreathouse, CmdrObot, Filip*, Van helsing, Shorespirit,
Matt1299, Kushal one, CWY2190, Ipeirotis, Nilfanion, Cydebot, Valodzka, Gogo Dodo, Ar5144-06, Akhil joey, Martin Jensen, Pingku,
Oli2140, Mikeputnam, Talgalili, Malleus Fatuorum, Thijs!bot, Barticus88, Nirvanalulu, Drowne, Scientio, Kxlai, Headbomb, Ubuntu2,
AntiVandalBot, Seaphoto, Ajaysathe, Gwyatt-agastle, Onasraou, Spencer, Alphachimpbot, JAnDbot, Wiki0709, Barek, Sarnholm, MER-
C, The Transhumanist, Bull3t, TFinn734, Andonic, Mkch, Hut 8.5, Leiluo, Jguthaaz, EntropyAS, SiobhanHansa, Timdew, Dmmd123,
Connormah, Bongwarrior, VoABot II, Tedickey, Giggy, JJ Harrison, David Eppstein, Chivista~enwiki, Gomm, Pmbhagat, Fourthcourse,
Kgeischmann, RoboBaby, Quanticle, ERI employee, R'n'B, Jfroelich, Tgeairn, Pharaoh of the Wizards, Trusilver, Bongomatic, Roxy1984,
Andres.santana, Shwapnil, DanDoughty, Foober, Ocarbone, RepubCarrier, Gzkn, AtholM, Salih, LordAnubisBOT, Starnestommy, Jma-
jeremy, A m sheldon, AntiSpamBot, LeighvsOptimvsMaximvs, Ramkumar.krishnan, Shoessss, Josephjthomas, Parikshit Basrur, Doug4,
Cometstyles, DH85868993, DorganBot, Bonadea, WinterSpw, Mark.hornick, Andy Marchbanks, Yecril, BernardZ, RJASE1, Idioma-
bot, RonFredericks, Black Kite, Je G., Jimmaths, DataExp, Philip Trueman, Adamminstead, TXiKiBoT, Deleet, Udufruduhu, Dean-
abb, Valerie928, TyrantX, OlavN, Arpabr, Vlad.gerchikov, Don4of4, Raymondwinn, Mannafredo, 1yesfan, Bearian, Jkosik1, Wykypy-
dya, Billinghurst, Atannir, Hadleywickham, Hherbert, Falcon8765, Sebastjanmm, Monty845, Pjoef, Mattelsen, AlleborgoBot, Burkean-
girl, NHRHS2010, Rknasc, Pdfpdf, Equilibrioception, Calliopejen1, VerySmartNiceGuy, Euryalus, Dawn Bard, Estard, Srp33, Jerryob-
ject, Kexpert, Mark Klamberg, Curuxz, Flyer22, Eikoku, JCLately, Powtroll, Jpcedenog, Strife911, Pyromaniaman, Oxymoron83, Gp-
swiki, Dodabe~enwiki, Gargvikram07, Mtys, Fratrep, Chrisguyot, Odo Benus, Stfg, StaticGull, Sanya r, DixonD, Kjtobo, Melcombe,
48states, LaUs3r, Pinkadelica, Ypouliot, Denisarona, Sbacle, Kotsiantis, Loren.wilton, Sfan00 IMG, Nezza 4 eva, ClueBot, The Thing
That Should Not Be, EoGuy, Supertouch, Kkarimi, Blanchardb, Edayapattiarun, Lbertolotti, Shaw76, Verticalsearch, Sebleouf, Hanif-
bbz, Abrech, Sterdeus, DrCroco, Nano5656, Aseld, Amossin, Dekisugi, SchreiberBike, DyingIce, Atallcostsky, 9Nak, Dank, Versus22,
Katanada, Qwfp, DumZiBoT, Sunsetsky, XLinkBot, Articdawg, Cgfjpfg, Ecmalthouse, Little Mountain 5, WikHead, SilvonenBot, Badger-
net, Foxyliah, Freestyle-69, Texterp, Addbot, DOI bot, Mabdul, Landon1980, Mhahsler, AndrewHZ, Elsendero, Matt90855, Jpoelma13,
Cis411, Drkknightbatman, MrOllie, Download, RTG, M.r santosh kumar., Glane23, Delaszk, Chzz, Swift-Epic (Refectory), AtheWeath-
erman, Fauxstar, Jesuja, Luckas-bot, Yobot, Adelpine, Bunnyhop11, Ptbotgourou, Cm001, Hulek, Alusayman, Ryanscraper, Carleas,
Nallimbot, SOMart, Tiany9027, AnomieBOT, Rjanag, Jim1138, JackieBot, Fahadsadah, OptimisticCynic, Dudukeda, Materialscientist,
Citation bot, Schul253, Cureden, Capricorn42, Gtfjbl, Lark137, Liwaste, The Evil IP address, Tomwsulcer, BluePlateSpecial, Dr Old-
ekop, Rosannel, Rugaaad, RibotBOT, Charvest, Tareq300, Cmccormick8, Smallman12q, Andrzejrauch, Davgrig04, Stekre, Whizzdumb,
Thehelpfulbot, Kyleamiller, OlafvanD, FrescoBot, Mark Renier, Ph92, W Nowicki, X7q, Colewaldron, Er.piyushkp, HamburgerRadio,
Atlantia, Webzie, Citation bot 1, Killian441, Manufan 11, Rustyspatula, Pinethicket, Guerrerocarlos, Toohuman1, BRUTE, Elseviered-
itormath, Stpasha, MastiBot, SpaceFlight89, Jackverr, UngerJ, Juliustch, Priyank782, TobeBot, Pamparam, Btcoal, Kmettler, Jonkerz,
GregKaye, Glenn Maddox, Jayrde, Angelorf, Reaper Eternal, Chenzheruc, Pmauer, DARTH SIDIOUS 2, Mean as custard, Rjwilmsi-
Bot, Mike78465, D vandyke67, Ripchip Bot, Slon02, Aaronzat, Helwr, Ericmortenson, EmausBot, Acather96, BillyPreset, Fly by Night,
WirlWhind, GoingBatty, Emilescheepers444, Stheodor, Lawrykid, Uploadvirus, Wikipelli, Dcirovic, Joanlofe, Anir1uph, Chire, Cronk28,
Zedutchgandalf, Vangelis12, T789, Rick jens, Donner60, Terryholmsby, MainFrame, Phoglenix, Raomohsinkhan, ClueBot NG, Mathstat,
Aiwing, Nuwanmenuka, Statethatiamin, CherryX, Candace Gillhoolley, Robiminer, Leonardo61, Twillisjr, Widr, WikiMSL, Luke145,
EvaJamax, Debuntu, Helpful Pixie Bot, AlbertoBetulla, HMSSolent, Ngorman, Inoshika, Data.mining, ErinRea, BG19bot, Wanming149,
PhnomPencil, Lisasolomonsalford, Uksas, Naeemmalik036, Chafe66, Onewhohelps, Netra Nahar, Aranea Mortem, Jasonem, Flaticida,
Funkykeith777, Moshiurbd, Nathanashleywild, Anilkumar 0587, Mpaye, Rabarbaro70, Thundertide, BattyBot, Aacruzr, Warrenxu, Ijon-
TichyIjonTichy, Harsh 2580, Dexbot, Webclient101, Mogism, TwoTwoHello, Frosty, Bradhill14, 7376a73b3bf0a490fa04bea6b76f4a4b,
L8fortee, Dougs campbell, Mark viking, Cmartines, Epicgenius, THill182, Delaf, Melonkelon, Herpderp1235689999, Revengetechy,
Amykam32, The hello doctor, Mimarios1, Huang cynthia, DavidLeighEllis, Gnust, Rbrandon87, Astigitana, Alihaghi, Philip Habing,
Wccsnow, Jianhui67, Tahmina.tithi, Yeda123, Skr15081997, Charlotth, Jfrench7, Zjl9191, Davidhart007, Routerdecomposer, Augt.pelle,
Justincahoon, Gstoel, Wiki-jonne, MatthewP42, 115ash, LiberumConsilium, Ran0512, Daniel Bachar, Galaktikasoft, Prof PD Hoy, Gold-
CoastPrior, Gary2015, KasparBot, Baharsahu, Hillbilly Dragon Farmer and Anonymous: 987
Big data Source: https://en.wikipedia.org/wiki/Big_data?oldid=670390555 Contributors: William Avery, Heron, Kku, Samw, An-
412 CHAPTER 62. DEEP LEARNING
drewman327, Ryuch, , Topbanana, Paul W, F3meyer, Sunray, Giftlite, Langec, Erik Carson, Utcursch, Beland, Jeremykemp,
David@scatter.com, Discospinster, Rich Farmbrough, Kdammers, ArnoldReinhold, Narsil, Viriditas, Lenov, Gary, Pinar, Tobych, Mi-
ranche, Broeni, Tomlzz1, Axeman89, Woohookitty, Pol098, Qwertyus, Rjwilmsi, ElKevbo, Jehochman, Nihiltres, Lumin~enwiki, Tedder,
DVdm, SteveLoughran, Aeusoes1, Daniel Mietchen, Dimensionsix, Katieh5584, Henryyan, McGeddon, Od Mishehu, Gilliam, Ohnoits-
jamie, Chris the speller, RDBrown, Pegua, Madman2001, Krexer, Kuru, Almaz~enwiki, Dl2000, The Letter J, Chris55, Yragha, Jac16888,
Marc W. Abel, Cydebot, Matrix61312, Quibik, DumbBOT, Malleus Fatuorum, EdJohnston, Nick Number, Cowb0y, Lmusher, Joseph-
marty, Kforeman1, Rmyeid, OhanaUnited, Relyk, Wllm, Magioladitis, Nyq, Tedickey, Steven Walling, Thevoid00, Casieg, Jim.henderson,
Tokyogirl79, MacShimi, McSly, NewEnglandYankee, Lamp90, Asefati, Pchackal, Mgualtieri, VolkovBot, JohnBlackburne, Vincent Lex-
trait, Philip Trueman, Ottb19, Billinghurst, Grinq, Scottywong, Luca Naso, Dawn Bard, Yintan, Jazzwang, Jojikiba, Eikoku, SPACKlick,
CutOTies, Mkbergman, Melcombe, Siskus, PabloStraub, Dilaila, Martarius, Sfan00 IMG, Faalagorn, Apptrain, Morrisjd1, Grantbow,
Mild Bill Hiccup, Ottawahitech, Cirt, Auntof6, Lbertolotti, Gnome de plume, Resoru, Pablomendes, Saisdur, SchreiberBike, MPH007,
Rui Gabriel Correia, Mymallandnews, XLinkBot, Ost316, Benboy00, MystBot, P.r.newman, Addbot, Mortense, Drevicko, Thomas888b,
AndrewHZ, Tothwolf, Ronhjones, Moosehadley, MrOllie, Download, Jarble, Arbitrarily0, Luckas-bot, Yobot, Fraggle81, Manivannan
pk, Elx, Jean.julius, AnomieBOT, Jim1138, Babrodtk, Bluerasberry, Materialscientist, Citation bot, Xqbot, Marko Grobelnik, Bgold12,
Anna Frodesiak, Tomwsulcer, Srich32977, Omnipaedista, Smallman12q, Joaquin008, Jugdev, FrescoBot, Jonathanchaitow, I42, PeterEast-
ern, AtmosNews, B3t, I dream of horses, HRoestBot, Jonesey95, Jandalhandler, Mengxr, Ethansdad, Yzerman123, Msalganik, ,
Sideways713, Stuartzs, Jfmantis, Mean as custard, RjwilmsiBot, Ripchip Bot, Mm479arok, Winchetan, Petermcelwee, DASHBot, Emaus-
Bot, John of Reading, Oliverlyc, Timtempleton, Dewritech, Peaceray, Radshashi, Cmlloyd1969, K6ka, HiW-Bot, Richard asr, ZroBot,
Checkingfax, BobGourley, Josve05a, Xtzou, Chire, Kilopi, Laurawilber, Rcsprinter123, Rick jens, Palosirkka, MainFrame, Chuispaston-
Bot, Sean Quixote, Axelode, Mhiji, Helpsome, ClueBot NG, Behrad3d, Danielg922, Pramanicks, Jj1236, Widr, WikiMSL, Lawsonstu,
Fvillanustre, Helpful Pixie Bot, Lowercase sigmabot, BG19bot, And Adoil Descended, Seppemans123, Jantana, Innocentantic, Northamer-
ica1000, Asplanchna, MusikAnimal, AvocatoBot, Noelwclarke, Matt tubb, Jordanzhang, Bar David, InfoCmplx, Atlasowa, Fylbecatulous,
Camberleybates, BattyBot, WH98, DigitalDev, Haroldpolo, Ryguyrg, Untioencolonia, Shirishnetke, Ampersandian, MarkTraceur, Chris-
Gualtieri, TheJJJunk, Khazar2, Vaibhav017, IjonTichyIjonTichy, Saturdayswiki, Mheikkurinen, Seherrell, Mjvaugh2, ChazzI73, Davi-
dogm, Mherradora, Jkofron4, Stevebillings, Indianbusiness, Toopathnd, Jeremy Kolb, Frosty, Jamesx12345, OnTheNet21, BrighterTo-
morrow, Jacoblarsen net, Epicgenius, DavidKSchneider, Socratesplato9, Anirudhrata, Parasdoshiblog, Edwinboothnyc, JuanCarlosBrandt,
Helenellis, MMeTrew, Warrenpd86, AuthorAnil, ViaJFK, Gary Simon, Bsc, FCA, FBCS, CITP, Mcio, Joe204, Caraconan, Evaluator-
group, Hessmike, TJLaher123, Chengying10, IndustrialAutomationGuru, Dabramsdt, Prussonyc, Abhishek1605, Dilaila123, Willymomo,
Rzicari, Mandruss, Mingminchi, BigDataGuru1, Sugamsha, Sysp, Azra2013, Paul2520, Dudewhereismybike, Shahbazali101, Yeda123,
Miakeay, Stamptrader, Accountdp, Morganmissen, JeanneHolm, Yourconnotation, JenniferAndy, Arcamacho, Amgauna, Bigdatavomit,
Monkbot, Wikientg, Scottishweather, Textractor, Analytics ireland, Lspin01l, ForumOxford Online, Mansoor-siamak, Belasobral, Sight-
estrp, Jwdang4, Amortias, Wikiauthor22, Femiolajiga, Tttcraig, Lepro2, Mythnder, DexterToo, Mr P. Kopee, Pablollopis, SVtechie,
Deathmuncher19, Smaske, Greystoke1337, Prateekkeshari, Hmrv83, KaraHayes, Iqmc, Lalith269, Helloyoubum, Jakesher, IEditEncy-
clopedia, Rajsbhatta123, Ragnar Valgeirsson, Vedanga Kumar, Fgtyg78, Gary2015, EricVSiegel, Benedge46, Friafternoon, KasparBot,
Adzzyman, Pmaiden, Spetrowski88, JuiAmale, Yasirsid and Anonymous: 330
Euclidean distance Source: https://en.wikipedia.org/wiki/Euclidean_distance?oldid=669284157 Contributors: Damian Yerrick, Axel-
Boldt, XJaM, Boleslav Bobcik, Michael Hardy, Nikai, Epl18, AnthonyQBachler, Fredrik, Altenmann, MathMartin, Saforrest, Enochlau,
Giftlite, BenFrantzDale, Bender235, Rgdboer, Bobo192, Dvogel, Obradovic Goran, Fawcett5, Oleg Alexandrov, Warbola, Ruud Koot,
Isnow, Qwertyus, Unused007, Ckelloug, DVdm, Wavelength, Multichill, Number 57, StuRat, Arthur Rubin, Clams, SmackBot, Rev-
erendSam, InverseHypercube, 127, Mcld, Oli Filth, Papa November, Octahedron80, Nbarth, Tsca.bot, OrphanBot, Bombshell, Lambiam,
Delnite, Aldarione, Jminguillona, Thijs!bot, JAnDbot, .anacondabot, Theunicyclegirl, Yesitsapril, Graeme.e.smith, Robertgreer, Comet-
styles, JohnBlackburne, TXiKiBoT, FrederikHertzum, Tiddly Tom, Paolo.dL, Dattorro, Justin W Smith, DragonBot, Freebit50, Triath-
ematician, Qwfp, Zik2, Ali Esfandiari, SilvonenBot, Addbot, Fgnievinski, AkhtaBot, Tanhabot, LaaknorBot, Favonian, West.andrew.g,
Yobot, Ehaussecker, Nallimbot, Ciphers, Materialscientist, Xqbot, Simeon87, Erik9bot, Gleb.svechnikov, Sawomir Biay, RedBot, Emaus-
Bot, Rasim, RA0808, Cskudzu, Quondum, Kweckzilber, EdoBot, ClueBot NG, Wcherowi, Stultiwikia, Papadim.G, Ascoldcaves, Arro-
gantrobot, Soni, Jcarrete, ShuBraque, Erotemic, 7Sidz, Loraof and Anonymous: 69
Hamming distance Source: https://en.wikipedia.org/wiki/Hamming_distance?oldid=669792039 Contributors: Damian Yerrick, Ap,
Pit~enwiki, Kku, Kevin Baas, Poor Yorick, Dcoetzee, Silvonen, David Shay, Altenmann, Wile E. Heresiarch, Tosha, Giftlite, Seabhcan,
BenFrantzDale, Markus Kuhn, CryptoDerk, Beland, Gene s, Mindspillage, Leibniz, Zaslav, Danakil, Aaronbrick, Blotwell, Flammifer,
3mta3, Awolsoldier, Obradovic Goran, ABCD, Pouya, Cburnett, Wsloand, Joepzander, Linas, Kasuga~enwiki, Ruud Koot, Zelse81, Qw-
ertyus, Rjwilmsi, Tizio, Pentalith, Mathbot, Margosbot~enwiki, Quuxplusone, Bgwhite, YurikBot, Personman, Michael Slone, Armistej,
Archelon, Alcides, Ttam, Zwobot, Attilios, SmackBot, BiT, Bluebot, DHN-bot~enwiki, Scray, Frap, Decltype, Slach~enwiki, SashatoBot,
Lambiam, Shir Khan, Loadmaster, ThePacker, DagErlingSmrgrav, ChetTheGray, Eastlaw, CRGreathouse, Krauss, Thijs!bot, Headbomb,
Wainson, Fulkkari~enwiki, Adma84, JAnDbot, Sterrys, JPG-GR, David Eppstein, JMyrleFuller, ANONYMOUS COWARD0xC0DE,
Ksero, DorganBot, JohnBlackburne, Lixo2, Sue Rangell, Svick, Hhbruun, Thegeneralguy, TSylvester, DragonBot, Alexbot, SchreiberBike,
Muro Bot, Cerireid, Ouz Ergin, MystBot, Addbot, LaaknorBot, Gnorthup, Ramses68, Lightbot, Math Champion, Luckas-bot, Ptbot-
gourou, AnomieBOT, Joule36e5, Materialscientist, RibotBOT, Citation bot 1, Kiefer.Wolfowitz, Compvis, Ripchip Bot, Valyt, EmausBot,
Olof nord, Froch514, Jnaranjo86, Sumanah, Jcarrete, Wkschwartz, Tiagofrepereira, Rubenaodom, ScrapIronIV, Some1Redirects4You and
Anonymous: 78
Norm (mathematics) Source: https://en.wikipedia.org/wiki/Norm_(mathematics)?oldid=667112150 Contributors: Zundark, The Anome,
Tomo, Patrick, Michael Hardy, SebastianHelm, Selket, Zero0000, Robbot, Altenmann, MathMartin, Bkell, Tobias Bergemann, Tosha, Con-
nelly, Giftlite, BenFrantzDale, Lethe, Fropu, Sendhil, Dratman, Jason Quinn, Tomruen, Almit39, Urhixidur, Beau~enwiki, PhotoBox,
Sperling, Paul August, Bender235, MisterSheik, EmilJ, Dalf, Bobo192, Army1987, Bestian~enwiki, HasharBot~enwiki, Ncik~enwiki,
ABCD, Oleg Alexandrov, Linas, MFH, Nahabedere, Tlroche, HannsEwald, Mike Segal, Magidin, Mathbot, ChongDae, Jenny Harrison,
Tardis, CiaPan, Chobot, Algebraist, Wavelength, Eraserhead1, Hairy Dude, KSmrq, JosephSilverman, VikC, Trovatore, Vanished user
1029384756, Crasshopper, David Pal, Tribaal, Fmccown, Arthur Rubin, TomJF, Killerandy, Lunch, That Guy, From That Show!, Smack-
Bot, David Kernow, InverseHypercube, Melchoir, Mhss, Bluebot, Oli Filth, Silly rabbit, Nbarth, Sbharris, Tamfang, Ccero, Cybercobra,
DMacks, Lambiam, Dicklyon, SimonD, CBM, Irritate, MaxEnt, Mct mht, Rudjek, Xtv, Thijs!bot, D4g0thur, Headbomb, Steve Kroon,
Urdutext, Selvik, Heysan, JAnDbot, Magioladitis, Reminiscenza, Chutzpan, Sullivan.t.j, ANONYMOUS COWARD0xC0DE, JoergenB,
Robin S, Allispaul, Pharaoh of the Wizards, Lucaswilkins, Singularitarian, Potatoswatter, Idioma-bot, Cerberus0, JohnBlackburne, PMajer,
Don Quixote de la Mancha, Falcongl, Wikimorphism, Synthebot, Free0willy, Dan Polansky, RatnimSnave, Paolo.dL, MiNombreDeGuerra,
JackSchmidt, ClueBot, Veromies, Baldphil, Mpd1989, Rockfang, Brews ohare, Hans Adler, Jaan Vajakas, Addbot, Saavek47, Zorrobot,
, Luckas-bot, Yobot, TaBOT-zerem, Kan8eDie, Ziyuang, SvartMan, Citation bot, Jxramos, ArthurBot, DannyAsher, Bdmy, Dlazesz,
Omnipaedista, RibotBOT, Shadowjams, Quartl, FrescoBot, Paine Ellsworth, Sawomir Biay, Pinethicket, Kiefer.Wolfowitz, NearSe-
tAccount, Stpasha, RedBot, ~enwiki, Dmitri666, Datahaki, JumpDiscont, Weedwhacker128, Xnn, FoxRaweln, Tom Peleg, Jowa
fan, EmausBot, Helptry, Egies, KHamsun, ZroBot, Midas02, Quondum, Bugmenot10, PerimeterProf, Sebjlan, Petrb, ClueBot NG,
Wcherowi, Lovasoa, Snotbot, Helpful Pixie Bot, Rheyik, Aisteco, Deltahedron, Mgkrupa, Laiwoonsiu and Anonymous: 113
Regularization (mathematics) Source: https://en.wikipedia.org/wiki/Regularization_(mathematics)?oldid=668286789 Contributors: The
Anome, Fnielsen, Jitse Niesen, Giftlite, BenFrantzDale, 3mta3, Arcenciel, Oleg Alexandrov, Qwertyus, Patrick Gill, Gareth McCaughan,
Eubot, RussBot, Alexmorgan, SmackBot, Took, Chris the speller, Memming, CBM, Headbomb, David Eppstein, Nigholith, ShambhalaFes-
tival, Denisarona, Alexbot, Skbkekas, Addbot, Luckas-bot, AnomieBOT, Citation bot, Obersachsebot, FrescoBot, Kiefer.Wolfowitz, Jone-
sey95, Noblestats, Chire, Helpful Pixie Bot, Benelot, Illia Connell, Mark viking, Star767, Terrance26, AioftheStorm, Monkbot, Ddunn801
and Anonymous: 14
Loss function Source: https://en.wikipedia.org/wiki/Loss_function?oldid=671232453 Contributors: The Anome, Michael Hardy, Karada,
Delirium, CesarB, Ronz, Den fjttrade ankan~enwiki, A5, Benwing, Henrygb, Cutler, Giftlite, Lethe, Jason Quinn, Nova77, MarkSweep,
Rich Farmbrough, Bender235, MisterSheik, 3mta3, John Quiggin, Jheald, Shoey, Eclecticos, Qwertyus, Rjwilmsi, Chobot, Wavelength,
KSmrq, Zvika, Chris the speller, Nbarth, Hongooi, Zvar, DavidBailey, Robosh, Kvng, Shirahadasha, Headbomb, Yellowbeard, Daniel5Ko,
Kjtobo, Melcombe, Rumping, Koczy, El bot de la dieta, Qwfp, Addbot, Fgnievinski, Yobot, Pasmargo, AnomieBOT, J04n, X7q, Raise-
the-Sail, Nabiw1, Trappist the monk, Duoduoduo, EmausBot, Bnegreve, Akats, Chire, ClueBot NG, BG19bot, Chrish42, Ejjordan, Bodor-
menta, Andy231987, Brirush, Limit-theorem, Mark viking, Loraof and Anonymous: 22
Least squares Source: https://en.wikipedia.org/wiki/Least_squares?oldid=668060355 Contributors: The Anome, HelgeStenstrom, En-
chanter, Maury Markowitz, Michael Hardy, Delirium, William M. Connolley, Snoyes, Mxn, Drz~enwiki, Charles Matthews, Dysprosia,
Jitse Niesen, Robbot, Benwing, Muxxa, Gandalf61, Henrygb, Ashwin, Bkell, Giftlite, BenFrantzDale, Sreyan, Kusunose, DragonySix-
tyseven, Frau Holle, Geof, Paul August, Elwikipedista~enwiki, JustinWick, El C, O18, Oyz, Landroni, Jumbuck, Anthony Appleyard,
John Quiggin, Drbreznjev, WojciechSwiderski~enwiki, Oleg Alexandrov, Gmaxwell, Soultaco, Isnow, Male1979, GeLuxe, Salix alba,
Nneonneo, Tardis, Hugeride, Bgwhite, YurikBot, Wavelength, Jlc46, KSmrq, Philopedia, Grafen, Deodar~enwiki, Witger, Bordaigorl, Scs,
Syrthiss, Arthur Rubin, JahJah, Zvika, KnightRider~enwiki, SmackBot, Unyoyega, DanielPeneld, Tgdwyer, Oli Filth, Kostmo, DHN-
bot~enwiki, Ladislav Mecir, Rludlow, Tamfang, Berland, Memming, G716, Lambiam, Derek farn, Dicklyon, Theswampman, Ichoran,
Tudy77, CapitalR, Harold f, Tensheapz, Ezrakilty, Cydebot, Zzzmarcus, Talgalili, Thijs!bot, Naucer, Tolstoy the Cat, MattWatt, Daniel
il, Escarbot, JEBrown87544, Woollymammoth, And4e, Daytona2, JAnDbot, Eromana, Gavia immer, Magioladitis, Albmont, Jllm06,
Risujin, MyNameIsNeo, Livingthingdan, Jkjo, User A1, AllenDowney, Glrx, Pharaoh of the Wizards, Vi2, Jiuguang Wang, Baiusmc,
DorganBot, PesoSwe, Larryisgood, Nevillerichards, Jmath666, Dfarrar, Forwardmeasure, Sbratu, Bpringlemeir, Petergans, BotMultichill,
Zbvhs, MinorContributor, KoenDelaere, Lourakis, Mika.scher, JackSchmidt, Water and Land, Melcombe, Oekaki, ClueBot, Vikasatkin,
Turbojet, Cipherous, Lbertolotti, Geoeg, Uraza, Bender2k14, Sun Creator, Brews ohare, Muro Bot, ChrisHodgesUK, BOTarate, Qwfp,
XLinkBot, Juliusllb, NellieBly, Addbot, Mmonks, AndrewHZ, Fgnievinski, MrOllie, EconoPhysicist, Publichealthguru, Glane23, LinkFA-
Bot, Kruzmissile, Erutuon, Lightbot, Zorrobot, TeH nOmInAtOr, Meisam, Legobot, Yobot, Sked123, AnomieBOT, SomethingElse-
ToSay, GLRenderer, Ciphers, Rubinbot, Materialscientist, HanPritcher, Citation bot, JmCor, Xqbot, Urbansuperstar~enwiki, Flavio Gui-
tian, Dwlotter, RibotBOT, Hamamelis, FrescoBot, J6w5, Idfah, AstaBOTh15, Kiefer.Wolfowitz, Astropro, Stpasha, Emptiless, Sss41,
Full-date unlinking bot, Jonkerz, Weedwhacker128, Elitropia, EmausBot, WikitanvirBot, Netheril96, Cfg1777, Manyu aditya, ZroBot,
Durka42, JonAWellner, JA(000)Davidson, AManWithNoPlan, Mayur, Zfeinst, Robin48gx, Sigma0 1, JaneCow, Mikhail Ryazanov, Clue-
Bot NG, KlappCK, Demonsquirrel, Habil zare, Hikenstu, Helpful Pixie Bot, Mythirdself, Koertefa, Abryhn, BG19bot, Benelot, Lxlxlx82,
Vanangamudiyan, Op47, Manoguru, Simonfn, Kodiologist, ChrisGualtieri, IPWAI, Gameboy97q, Tschmidt23, Dexbot, Declaration1776,
Dough34, RichardInMiami, Mgfbinae, Leegrc, Velvel2, Yilincau, Gowk, BTM912, KasparBot and Anonymous: 208
Newton's method Source: https://en.wikipedia.org/wiki/Newton%27s_method?oldid=663961763 Contributors: AxelBoldt, Lee Daniel
Crocker, Zundark, Miguel~enwiki, Roadrunner, Formulax~enwiki, Hirzel, Pichai Asokan, Patrick, JohnOwens, Michael Hardy, Pit~enwiki,
Dominus, Dcljr, Loisel, Minesweeper, Ejrh, Looxix~enwiki, Cyp, Poor Yorick, Pizza Puzzle, Hike395, Dcoetzee, Jitse Niesen, Kbk, Saltine,
AaronSw, Robbot, Jaredwf, Fredrik, Wikibot, Giftlite, Rs2, BenFrantzDale, Neilc, MarkSweep, PDH, Torokun, Sam Hocevar, Kutulu,
Fintor, Frau Holle, TheObtuseAngleOfDoom, Paul August, Pt, Aude, Iamunknown, Blotwell, Nk, Haham hanuka, LutzL, Borisblue, Jeltz,
Laug, Olegalexandrov, Oleg Alexandrov, Tbsmith, Joriki, Shreevatsa, LOL, Decrease789, Jimbryho, Robert K S, JonBirge, GregorB,
Casey Abell, Eyu100, Mathbot, Shultzc, Kri, Glenn L, Wikipedia is Nazism, Chobot, YurikBot, Wavelength, Laurentius, Swerty, Jabber-
Wok, KSmrq, Exir Kamalabadi, Tomisti, Alias Flood, Marquez~enwiki, SmackBot, RDBury, Selfworm, Adam majewski, Saravask, Tom
Lougheed, InverseHypercube, Jagged 85, Dulcamara~enwiki, Commander Keane bot, Slaniel, Skizzik, Chris the speller, Berland, Rrburke,
Earlh, ConMan, Jon Awbrey, Henning Makholm, Bdiscoe, Wvbailey, Coredesat, Jim.belk, Magmait, Gco, JRSpriggs, CRGreathouse, Jack-
zhp, David Cooke, Holycow958, Eric Le Bigot, Cydebot, Quibik, Christian75, Billtubbs, Talgalili, Thijs!bot, Epbr123, Nonagonal Spider,
Headbomb, Martin Hedegaard, BigJohnHenry, Ben pcc, Seaphoto, CPMartin, JAnDbot, Coee2theorems, VoABot II, JamesBWatson,
Baccyak4H, Avicennasis, David Eppstein, User A1, GuidoGer, Arithmonic, Glrx, Pbroks13, Kawautar, Rankarana, Nedunuri, K.menin,
Gombang, Chiswick Chap, Goingstuckey, Policron, Juliancolton, Homo logos, JohnBlackburne, Philip Trueman, TXiKiBoT, Anony-
mous Dissident, Broadbot, Aaron Rotenberg, Draconx, Pitel, Katzmik, Psymun747, SieBot, Gex999, Dawn Bard, Bentogoa, Flyer22,
MinorContributor, Jasondet, Smarchesini, Redmarkviolinist, Dreamofthedolphin, Cyfal, PlantTrees, ClueBot, Metaprimer, Wysprgr2005,
JP.Martin-Flatin, Mild Bill Hiccup, CounterVandalismBot, Tesspub, Chrisgolden, Annne furanku, Dekisugi, Xooll, Muro Bot, Jpginn,
RMFan1, Galoisgroupie, Addbot, Some jerk on the Internet, Eweinber, Fluernutter, Ckamas, Protonk, EconoPhysicist, LinkFA-Bot,
Uscitizenjason, AgadaUrbanit, Numbo3-bot, Tide rolls, CountryBot, Xieyihui, Luckas-bot, Yobot, Estudiarme, AnomieBOT, 1exec1, Il-
legal604, Apau98, , Materialscientist, Zhurov, ArthurBot, PavelSolin, Capricorn42, Titolatif, CBoeckle, Nyirenda, Dlazesz,
Point-set topologist, CnkALTDS, Shadowjams, FrescoBot, Pepper, DNA Games, Citation bot 1, Tkuvho, Gaba p, Pinethicket, Eyrryds,
Kiefer.Wolfowitz, White Shadows, Vrenator, ClarkSims, Duoduoduo, KMic, Suusion of Yellow, H.ehsaan, Jfmantis, Hyarmendacil,
123Mike456Winston789, EmausBot, TheJesterLaugh, KHamsun, Mmeijeri, Shuipzv3, D.Lazard, Eniagrom, U+003F, Bomazi, Chris857,
Howard nyc, Kyle.drerup, Elvek, Mikhail Ryazanov, ClueBot NG, KlappCK, Aero-Plex, Chogg, Helpful Pixie Bot, EmadIV, Drift cham-
bers, Toshiki, Scuchina, AWTom, RiabzevMichael, Electricmun11, Khazar2, Qxukhgiels, Dexbot, M.shahriarinia, I am One of Many,
Lakshmi7977, Manoelramon, Ginsuloft, Mohit.del94, Blitztall, Rob Haelterman, Loraof, Akemdh and Anonymous: 276
Supervised learning Source: https://en.wikipedia.org/wiki/Supervised_learning?oldid=643120523 Contributors: Damian Yerrick,
LC~enwiki, Isomorph, Darius Bacon, Boleslav Bobcik, Michael Hardy, Oliver Pereira, Zeno Gantner, Chadloder, Alo, Ahoerstemeier,
Cyp, Snoyes, Rotem Dan, Cherkash, Mxn, Hike395, Shizhao, Topbanana, Unknown, Ancheta Wis, Giftlite, Markus Krtzsch, Dun-
charris, MarkSweep, APH, Gene s, Sam Hocevar, Violetriga, Skeppy, Denoir, Mscnln, Rrenaud, Oleg Alexandrov, Lloydd, Joerg Kurt
Wegner, Marudubshinki, Qwertyus, Mathbot, Chobot, YurikBot, Wavelength, Jlc46, Ritchy, Tony1, Tribaal, BenBildstein, SmackBot,
Reedy, KnowledgeOfSelf, MichaelGasser, Zearin, Dfass, Beetstra, CapitalR, Domanix, Sad1225, Thijs!bot, Mailseth, Escarbot, Pro-
log, Peteymills, Robotman1974, 28421u2232nfenfcenc, A3nm, David Eppstein, Mange01, Paskari, VolkovBot, Naveen Sundar, Jamelan,
Temporaluser, EverGreg, Yintan, Flyer22, Melcombe, Baosheng, Kotsiantis, Doloco, Skbkekas, Magdon~enwiki, DumZiBoT, Addbot,
AndrewHZ, Anders Sandberg, EjsBot, MrOllie, Buster7, Numbo3-bot, Yobot, Twri, Fstonedahl, FrescoBot, LucienBOT, X7q, Mostafa
mahdieh, Classier1234, Erylaos, Zadroznyelkan, Fritq, BertSeghers, WikitanvirBot, Fly by Night, Sun116, Pintaio, Dappermuis, Tdi-
etterich, WikiMSL, EvaJamax, BrutForce, Colbert Sesanker, J.Davis314, Citing, Ferrarisailor, ChrisGualtieri, YFdyh-bot, Alialamifard,
Francisbach, Donjohn1 and Anonymous: 65
Linear regression Source: https://en.wikipedia.org/wiki/Linear_regression?oldid=671189370 Contributors: The Anome, Taw, Ap,
Danny, Miguel~enwiki, Rade Kutil, Edward, Patrick, Michael Hardy, GABaker, Shyamal, Kku, Tomi, TakuyaMurata, Den fjttrade
ankan~enwiki, Kevin Baas, Rossami, Hike395, Jitse Niesen, Andrewman327, Taxman, Donarreiskoer, Robbot, Jaredwf, Benwing, Gak,
ZimZalaBim, Yelyos, Babbage, Henrygb, Jcole, Wile E. Heresiarch, Giftlite, BenFrantzDale, Fleminra, Alison, Duncharris, Jason Quinn,
Ato, Utcursch, Pgan002, MarkSweep, Piotrus, Wurblzap~enwiki, Icairns, Urhixidur, Natrij, Discospinster, Rich Farmbrough, Pak21,
Paul August, Bender235, Violetriga, Elwikipedista~enwiki, Gauge, MisterSheik, Spoon!, Perfecto, O18, Davidswelt, R. S. Shaw, Tobac-
man, Arcadian, NickSchweitzer, 99of9, Crust, Landroni, Storm Rider, Musiphil, Arthena, ABCD, Kotasik, Avenue, Snowolf, LFaraone,
Forderud, Drummond, Oleg Alexandrov, Abanima, Tappancsa, Mindmatrix, BlaiseFEgan, Btyner, Joerg Kurt Wegner, Lacurus~enwiki,
Graham87, Qwertyus, Rjwilmsi, Vegaswikian, Matt Deres, TeaDrinker, Chobot, Manscher, FrankTobia, Wavelength, RussBot, Gaius Cor-
nelius, Bug42, Afelton, Thiseye, Cruise, Moe Epsilon, Voidxor, Dggoldst, Arch o median, Arthur Rubin, Drallim, Anarch21, SolarMcPanel,
SmackBot, NickyMcLean, Quazar777, Prodego, InverseHypercube, Jtneill, DanielPeneld, Evanreyes, Commander Keane bot, Ohnoits-
jamie, Hraefen, Afa86, Markush, Amatulic, Feinstein, Oli Filth, John Reaves, Berland, Wolf87, Cybercobra, Semanticprecision, G716,
Unco, Theblackgecko, Lambiam, Jonas August, Vjeet a, Nijdam, Beetstra, Emurph, Hu12, Pjrm, LAlawMedMBA, JoeBot, Chris53516,
AlainD, Jsorens, Tawkerbot2, CmdrObot, JRavn, CBM, Anakata, Chrike, Harej bot, Thomasmeeks, Neelix, Cassmus, Bumbulski, Max-
Ent, 137 0, Pedrolapinto, Mmmooonnnsssttteeerrr, Farshidforouz~enwiki, FrancoGG, Talgalili, Thijs!bot, Epbr123, Tolstoy the Cat, Jfaller,
Whooooooknows, Natalie Erin, Woollymammoth, Mack2, JAnDbot, MER-C, Je560, Ph.eyes, Hectorlamadrid, Magioladitis, Tripbeetle,
MastCell, Albmont, Baccyak4H, Ddr~enwiki, David Eppstein, Joostw, Apal~enwiki, Yonaa, R'n'B, Noyder, Charlesmartin14, Kawau-
tar, Mbhiii, J.delanoy, Scythe of Death, Salih, TomyDuby, Jaxha, HyDeckar, Jewzip, MrPaul84, Copsi, Llorenzi, VolkovBot, Smarty07,
Muzzamo, Mfreund~enwiki, Jsd115, Zhenqinli, Greswik, P1h3r1e3d13, Ricardo MC, Jhedengren, Petergans, Karthik Sarma, Rlendog,
Zsniew, Paolo.dL, OKBot, Water and Land, Scottyoak2, Melcombe, Gpap.gpap, Tanvir Ahmmed, ClueBot, HairyFotr, Cp111, Rhubbarb,
Dromedario~enwiki, Alexbot, Ecov, Kaspar.jan, Tokorode~enwiki, Skbkekas, Stephen Milborrow, Diaa abdelmoneim, Qwfp, Bigoperm,
Sunsetsky, XLinkBot, Tofallis, W82~enwiki, Tayste, Addbot, RPHv, Fgnievinski, Doronp, MrOllie, Download, Forich, Zorrobot, Et-
trig, Luckas-bot, Yobot, Sked123, Its Been Emotional, AnomieBOT, Rubinbot, IRP, Materialscientist, HanPritcher, Citation bot, Lixi-
aoxu, Sketchmoose, Flavio Guitian, Gtfjbl, Istrill, Mstangeland, Aa77zz, Imran.fanaswala, Fstonedahl, PhysicsJoe, Nickruiz, FrescoBot,
X7q, Citation bot 1, AstaBOTh15, Boxplot, Pinethicket, Elockid, Kiefer.Wolfowitz, Rdecker02, Jonesey95, Stpasha, Oldrrb, Trappist
the monk, Wotnow, Duoduoduo, PAC2, Wombathammer, Diannaa, RjwilmsiBot, Elitropia, John of Reading, Sugarfoot1001, Julienbar-
lan, Wikieconometrician, Bkearb, NGPriest, BartlebytheScrivener, Chewings72, Esaintpierre, Manipande, ClueBot NG, Mathstat, Friet-
jes, BlueScreenD, Helpful Pixie Bot, Grandwgy, Daonng, Mark Arsten, MyWikiNik, ChrisGualtieri, Illia Connell, Dansbecker, Dexbot,
Hkoslik, Sa publishers, Ossifragus, Bha100710, Drvikas74, Melonkelon, Asif usa, Pandadai, Bryanrutherford0, Tertius51, Logan.dunbar,
Monkbot, Jpeterson1346, Bob nau, Moorshed, Velvel2, Whatfoxsays, 18trevor3695, Split97 and Anonymous: 373
Tikhonov regularization Source: https://en.wikipedia.org/wiki/Tikhonov_regularization?oldid=670944714 Contributors: Gareth Owen,
Edward, Michael Hardy, Willem, Charles Matthews, Dysprosia, Benwing, Wile E. Heresiarch, BenFrantzDale, Markus Kuhn, Sietse,
Quadell, Sam Hocevar, Rich Farmbrough, Bender235, Billlion, Arcenciel, Jheald, Oleg Alexandrov, Simetrical, Shoyer, Btyner, BD2412,
Qwertyus, Rjwilmsi, Gseryakov, EricCHill, Wavelength, RussBot, Bruguiea, Dtrebbien, Yahya Abdal-Aziz, Syrthiss, Caliprincess, Sharat
sc, Zvika, SmackBot, CapitalSasha, JesseStone, Oli Filth, NickPenguin, Tarantola, Lavaka, David s gra, Shorespirit, Lklundin, A3nm,
Pablodiazgutierrez, Lantonov, TomyDuby, Asjogren, Sigmundur, STBotD, Thiverus, FghIJklm, Melcombe, Skbkekas, Addbot, Legobot,
Ptbotgourou, AnomieBOT, Angry bee, SassoBot, Richarddonkin, FrescoBot, Fortdj33, Citation bot 1, Wkretzsch, Duoduoduo, Rjwilm-
siBot, EmausBot, ZroBot, Jotaf, Yagola, Koertefa, SciCompTeacher, Manoguru, BattyBot, Viraltux, Wsrosenthal, Evan Aad, Monkbot
and Anonymous: 59
Regression analysis Source: https://en.wikipedia.org/wiki/Regression_analysis?oldid=670597343 Contributors: Berek, Taw,
ChangChienFu, Michael Hardy, Kku, Meekohi, Jeremymiles, Ronz, Den fjttrade ankan~enwiki, Hike395, Quickbeam, Jitse
Niesen, Taxman, Samsara, Bevo, Mazin07, Benwing, Robinh, Giftlite, Bnn, TomViza, BrendanH, Jason Quinn, Noe, Piotrus,
APH, Israel Steinmetz, Urhixidur, Rich Farmbrough, Pak21, Paul August, Bender235, Bobo192, Cretog8, Arcadian, NickSchweitzer,
Photonique, Mdd, Jrme, Denoir, Arthena, Riana, Avenue, Emvee~enwiki, Nvrmnd, Gene Nygaard, Krubo, Oleg Alexandrov, Abanima,
Lkinkade, Woohookitty, LOL, Marc K, Kosher Fan, BlaiseFEgan, Wayward, Btyner, Lacurus~enwiki, Gmelli, Salix alba, MZMcBride,
Pruneau, Mathbot, Valermos, Goudzovski, King of Hearts, Chobot, Jdannan, Krishnavedala, Wavelength, Wimt, Afelton, Brian Crawford,
DavidHouse~enwiki, DeadEyeArrow, Avraham, Jmchen, NorsemanII, Tribaal, Closedmouth, Arthur Rubin, Josh3580, Wikiant, Shawnc,
robot, Veinor, Doubleplusje, SmackBot, NickyMcLean, Deimos 28, Antro5, Cazort, Gilliam, Feinstein, Oli Filth, Nbarth, Ctbolt,
DHN-bot~enwiki, Gruzd, Hve, Berland, EvelinaB, Radagast83, Cybercobra, Krexer, CarlManaster, Nrcprm2026, G716, Mwtoews,
Cosmix, Tedjn, Friend of facts, Danilcha, John, FrozenMan, Tim bates, JorisvS, IronGargoyle, Beetstra, Dicklyon, AdultSwim, Kvng,
Joseph Solis in Australia, Chris53516, AbsolutDan, Ioannes Pragensis, Markjoseph125, CBM, Thomasmeeks, GargoyleMT, Ravens-
fan5252, JohnInDC, Talgalili, Wikid77, Qwyrxian, Sagaciousuk, Tolstoy the Cat, N5iln, Carpentc, AntiVandalBot, Woollymammoth,
Lcalc, JAnDbot, Goskan, Giler, QuantumEngineer, Ph.eyes, SiobhanHansa, DickStartz, JamesBWatson, Username550, Fleagle11,
Marcelobbribeiro, David Eppstein, DerHexer, Apdevries, Thenightowl~enwiki, Mbhiii, Discott, Trippingpixie, Cpiral, Gzkn, Rod57,
TomyDuby, Coppertwig, RenniePet, Policron, Bobianite, Blueharmony, Peepeedia, EconProf86, Qtea, BernardZ, TinJack, CardinalDan,
HughD, DarkArcher, Gpeilon, TXiKiBoT, SueHay, Qxz, Gnomepirate, Sintaku, Antaltamas, JhsBot, Broadbot, Beusson, Cremepu222,
Zain Ebrahim111, Billinghurst, Kusyadi, Traderlion, Asjoseph, Petergans, Rlendog, BotMultichill, Statlearn, Gerakibot, Matthew Yeager,
Timhowardriley, Strife911, Indianarhodes, Amitabha sinha, OKBot, Water and Land, AlanUS, Savedthat, Mangledorf, Randallbsmith,
Amadas, Tesi1700, Melcombe, Denisarona, JL-Bot, Mrfebruary, Kotsiantis, Tdhaene, The Thing That Should Not Be, Sabri76,
Auntof6, DragonBot, Sterdeus, Skbkekas, Stephen Milborrow, Cfn011, Crash D 0T0, SBemper, Qwfp, Antonwg, Sunsetsky, XLinkBot,
Gerhardvalentin, Nomoskedasticity, Veryhuman, Piratejosh85, WikHead, SilvonenBot, Hess88, Addbot, Diegoful, Wootbag, Geced,
MrOllie, LaaknorBot, Lightbot, Luckas-bot, Yobot, Themfromspace, TaBOT-zerem, Andresswift, KamikazeBot, Eaihua, Tempodivalse,
AnomieBOT, Andypost, RandomAct, HanPritcher, Citation bot, Jyngyr, LilHelpa, Obersachsebot, Xqbot, Statisticsblog, TinucherianBot
II, Ilikeed, J04n, GrouchoBot, BYZANTIVM, Fstonedahl, Bartonpoulson, D0kkaebi, Citation bot 1, Dmitronik~enwiki, Boxplot,
Yuanfangdelang, Pinethicket, Kiefer.Wolfowitz, Tom.Reding, Stpasha, Di1000, Jonkerz, Duoduoduo, Diannaa, Tbhotch, RjwilmsiBot,
EmausBot, RA0808, KHamsun, F, Julienbarlan, Hypocritical~enwiki, Kgwet, Zfeinst, Bomazi, ChuispastonBot, 28bot, Rocketrod1960,
ClueBot NG, Mathstat, MelbourneStar, Joel B. Lewis, CH-stat, Helpful Pixie Bot, BG19bot, Giogm2000, CitationCleanerBot, Hakimo99,
Gprobins, Prof. Squirrel, Attleboro, Illia Connell, JYBot, Sinxvin, Francescapelusi, Lugia2453, SimonPerera, Lemnaminor, Inniti4,
EJM86, Francisbach, Eli the King, Monkbot, Bob nau, Moorshed k, Moorshed, KasparBot and Anonymous: 385
Statistical learning theory Source: https://en.wikipedia.org/wiki/Statistical_learning_theory?oldid=655393279 Contributors: Michael
Hardy, Hike395, Jitse Niesen, Bearcat, Tomchiukc, Klemen Kocjancic, Rajah, Qwertyus, Chris the speller, Geach, ClydeC, Katharineamy,
VolkovBot, Melcombe, XLinkBot, Fgnievinski, MichalSylwester, Brightgalrs, RealityApologist, Callanecc, Zephyrus Tavvier, ClueBot
NG, BG19bot, Aisteco, Atastorino, Francisbach, I3roly, Mgfbinae, Jala Daibajna, Slee325 and Anonymous: 15
Vapnik–Chervonenkis theory Source: https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_theory?oldid=666685396 Con-
tributors: Awaterl, Michael Hardy, Hike395, Bearcat, Tomchiukc, Naddy, Gene s, Rajah, Male1979, Qwertyus, SmackBot, Vermorel,
Momet, Vanisaac, JForget, Mmunoz.ag, Ph.eyes, Tremilux, Alfred Legrand, Nadiatalent, Gambette, Rei-bot, Guillaume2303, Melcombe,
Pichpich, Addbot, Yobot, Qorilla, Xqbot, Alfredo ougaowen, Newyorkadam, BG19bot, Crispulop, Brk0 0, Mneykov and Anonymous: 8
Probably approximately correct learning Source: https://en.wikipedia.org/wiki/Probably_approximately_correct_learning?oldid=
601228074 Contributors: The Anome, SimonP, Michael Hardy, Kku, Hike395, Charles Matthews, Dysprosia, Populus, Andris, Daniel
Brockman, APH, Gene s, Wimmerm, Mathbot, Zvika, Serg3d2, German name, John b cassel, A3nm, Martarius, MystBot, Addbot, Wzhx
cc, JonathanWilliford, Twri, Rfgrove, RedBot, Ego White Tray, Zormit and Anonymous: 15
Algorithmic learning theory Source: https://en.wikipedia.org/wiki/Algorithmic_learning_theory?oldid=593887386 Contributors:
Michael Hardy, Hike395, Secretlondon, Onebyone, Mrdice, APH, Alansohn, Dsemeas, Nesbit, Cybercobra, Dreadstar, Jwfour, RichardF,
Ferris37, Mattisse, James086, Melcombe, Yobot, AnomieBOT, John of Reading, Schulte.oliver, Jochen Burghardt and Anonymous: 7
Statistical hypothesis testing Source: https://en.wikipedia.org/wiki/Statistical_hypothesis_testing?oldid=668899473 Contributors: The
Anome, DavidSJ, Edward, Patrick, Michael Hardy, Gabbe, Dcljr, Ronz, Den fjttrade ankan~enwiki, Darkwind, Sir Paul, Poor Yorick,
Cherkash, Trainspotter~enwiki, Dcoetzee, Robbot, Robbyjo~enwiki, Henrygb, Wile E. Heresiarch, Giftlite, J heisenberg, Jackol, Utcursch,
Pgan002, Andycjp, Maneesh, Daniel11, Rich Farmbrough, AlexKepler, Bender235, Cyc~enwiki, Cretog8, Spalding, Johnkarp, Davidruben,
Arcadian, NickSchweitzer, Nsaa, Mdd, Storm Rider, Alansohn, Gary, Guy Harris, Arthena, John Quiggin, Hu, Cburnett, Pixie, Ultrama-
rine, InBalance, Oleg Alexandrov, Woohookitty, JonBirge, BlaiseFEgan, Btyner, Marudubshinki, Graham87, Rjwilmsi, Jake Wartenberg,
Badjoby, Bdolicki, Pete.Hurd, Pstevens, Benlisquare, Adoniscik, YurikBot, Wavelength, RobotE, Xdenizen, Crasshopper, Bota47, Ott2,
NormDor, TrickyTank, Digfarenough, Andrew73, Zvika, SmackBot, Reedy, Tom Lougheed, Mcld, Chris the speller, Feinstein, Jprg1966,
Bazonka, Nbarth, Jyeee, Danielkueh, Jmnbatista, Robma, Radagast83, Cybercobra, Nakon, Richard001, G716, Salamurai, Lambiam, Tim
bates, Nijdam, IronGargoyle, Slakr, Varuag doos, Meritus~enwiki, Hu12, Pejman47, Rory O'Kane, MystRivenExile, Levineps, K, Ja-
son.grossman, AbsolutDan, Jackzhp, Mudd1, Thomasmeeks, Requestion, Pewwer42, Cydebot, Reywas92, B, Michael C Price, Ggchen,
Mattisse, JamesAM, Talgalili, Thijs!bot, Epbr123, Wikid77, Crazy george, Adjespers, Philippe, Urdutext, AntiVandalBot, Lordmetroid,
NYC2TLV, JAnDbot, Mikelove1, Andonic, SiobhanHansa, Acroterion, Magioladitis, VoABot II, Albmont, Pdbogen, Sullivan.t.j, Cydmab,
Kateshortforbob, Mbhiii, Xiphosurus, Ulyssesmsu, Silas S. Brown, Coppertwig, Policron, Dhaluza, Juliancolton, DavidCBryant, DeFault-
Ryan, Speciate, Sam Blacketer, VolkovBot, LeilaniLad, Agricola44, FreeT, Someguy1221, Strategist333, Snowbot, Jcchurch, Finnrind,
Zheric~enwiki, Statlearn, Krawi, Dailyknowledge, Bentogoa, Flyer22, Drakedirect, Ddxc, Larjohn~enwiki, Czap42, Kjtobo, Melcombe,
Bradford rob, Sfan00 IMG, ClueBot, Aua, Shabbychef, Catraeus, Adamjslund, Skbkekas, The-tenth-zdog, Qwfp, DumZiBoT, Wolveri-
neski, Terry0051, Libcub, Tayste, Addbot, Mortense, Some jerk on the Internet, Fgnievinski, Innv, , Ryanblak, MrOllie, Protonk,
CarsracBot, West.andrew.g, Tanath, Luckas-bot, Yobot, Andresswift, Brougham96, AnomieBOT, Materialscientist, Citation bot, Xqbot,
Srich32977, J04n, Thosjleep, JonDePlume, A.amitkumar, FrescoBot, Jollyroger131, Citation bot 1, DrilBot, Boxplot, Wittygrittydude,
RedBot, Ivancho.was.here, Bbarkley2, Thoytpt, Aurorion, Verlainer, Kastchei, J36miles, John of Reading, Philippe (WMF), GoingBatty,
Andreim27, Tsujimasen, Eukaryote89, Hatshepsut1988, ClueBot NG, BWoodrum, Kkddkkdd, Psy1235, JimsMaher, Yg12, Rezabot,
Widr, Trift, Helpful Pixie Bot, BG19bot, MasterMind5991, Mechnikov, AvocatoBot, Ryan.morton, Op47, Emanuele.olivetti, Manoguru,
Leapsword, BattyBot, Attleboro, Viraltux, Tibbyshep, TheJJJunk, Illia Connell, Knappsych, Citruscoconut, GargantuanDan, TejDham,
Valcust, Jamesmcmahon0, Waynechew87, Nullhypothesistester, Penitence, Anrnusna, Kfudala, Tertius51, Monkbot, BethNaught, TedPSS,
1980na, Libarian, SolidPhase, Anon124, Zjillanz and Anonymous: 324
Bayesian inference Source: https://en.wikipedia.org/wiki/Bayesian_inference?oldid=670136415 Contributors: The Anome, Fubar Ob-
fusco, DavidSJ, Jinian, Edward, JohnOwens, Michael Hardy, Lexor, Karada, Ronz, Suisui, Den fjttrade ankan~enwiki, LouI, EdH,
Jonik, Hike395, Novum, Timwi, WhisperToMe, Selket, SEWilco, Jose Ramos, Insightaction, Banno, Robbot, Kiwibird, Benwing, Meduz,
Henrygb, AceMyth, Wile E. Heresiarch, Ancheta Wis, Giftlite, DavidCary, Dratman, Leonard G., JimD, Wmahan, Pcarbonn, Mark-
Sweep, L353a1, FelineAvenger, APH, Sam Hocevar, Perey, Discospinster, Rich Farmbrough, Bender235, ZeroOne, Donsimon~enwiki,
MisterSheik, El C, Edward Z. Yang, DimaDorfman, Cje~enwiki, John Vandenberg, LeonardoGregianin, Jung dalglish, Hooperbloob,
Landroni, Arcenciel, Nurban, Avenue, Cburnett, Jheald, Facopad, Sjara, Oleg Alexandrov, Roylee, Joriki, Mindmatrix, BlaiseFEgan,
Btyner, Magister Mathematicae, Tlroche, Rjwilmsi, Ravik, Jemcneill, Billjeerys, FlaBot, Brendan642, Kri, Chobot, Reetep, Gdrbot,
Adoniscik, Wavelength, Pacaro, Gaius Cornelius, ENeville, Dysmorodrepanis~enwiki, Snek01, BenBildstein, Modify, Mastercampbell,
Nothlit, NielsenGW, Mebden, Bo Jacoby, Cmglee, Boggie~enwiki, Harthacnut, SmackBot, Mmernex, Rtc, Mcld, Cunya, Gilliam, Doc-
torW, Nbarth, G716, Jbergquist, Turms, Bejnar, Gh02t, Wyxel, Josephsieh, JeonghunNoh, Thermochap, BoH, Basar, TheRegicider,
Farzaneh, Lindsay658, Tdunning, Helgus, EdJohnston, Jvstone, Mack2, Lfstevens, Makohn, Stephanhartmannde, Comrade jo, Ph.eyes,
Coee2theorems, Ling.Nut, Charlesbaldo, DAGwyn, User A1, Tercer, STBot, Tobyr2, LittleHow, Policron, Jebadge, Bhepburn, Rob-
calver, James Kidd, VolkovBot, Thedjatclubrock, Maghnus, TXiKiBoT, Andrewaskew, GirasoleDE, SieBot, Doctorfree, Natta.d, Anchor
Link Bot, Melcombe, Kvihill, Rnchdavis, Smithpith, GeneCallahan, Krogstadt, Reovalis, Hussainshafqat, Charledl, ERosa, Qwfp, Td-
slk, XLinkBot, Erreip, Addbot, K-MUS, Metagraph, LaaknorBot, Ozob, Legobot, Yobot, Gongshow, AnomieBOT, Citation bot, Shadak,
Danielshin, VladimirReshetnikov, KingScot, JonDePlume, Thehelpfulbot, FrescoBot, Olexa Riznyk, WhatWasDone, Haeinous, JFK0502,
Kiefer.Wolfowitz, 124Nick, Night Jaguar, Scientist2, Trappist the monk, Gnathan87, Philocentric, Jonkerz, Jowa fan, EmausBot, Blume-
hua, Montgolre, Moswento, McPastry, Bagrowjp, SporkBot, Willy.pregliasco, Floombottle, Epdeloso, ClueBot NG, Mathstat, Bayes
Puppy, Jj1236, Albertttt, Thepigdog, Helpful Pixie Bot, Michael.d.larkin, Jeraphine Gryphon, Whyking thc, Intervallic, CitationCleaner-
Bot, DaleSpam, Kaseton, Simonsm21, Danielribeirosilva, ChrisGualtieri, Alialamifard, Yongli Han, 90b56587, MittensR, Mark viking,
Boomx09, Waynechew87, Hamoudafg, Promise her a denition, Abacenis, Engheta, Avehtari, SolidPhase, LadyLeodia, KasparBot and
Anonymous: 228
Chi-squared distribution Source: https://en.wikipedia.org/wiki/Chi-squared_distribution?oldid=666620239 Contributors: AxelBoldt,
Bryan Derksen, The Anome, Ap, Michael Hardy, Stephen C. Carlson, Tomi, Mdebets, Ronz, Den fjttrade ankan~enwiki, Willem, Jitse
Niesen, Hgamboa, Fibonacci, Zero0000, AaronSw, Robbot, Sander123, Seglea, Henrygb, Robinh, Isopropyl, Weialawaga~enwiki, Giftlite,
Dbenbenn, BenFrantzDale, Herbee, Sietse, MarkSweep, Gauss, Zfr, Fintor, Rich Farmbrough, Dbachmann, Paul August, Bender235,
MisterSheik, O18, TheProject, NickSchweitzer, Iav, Jumbuck, B k, Kotasik, Sligocki, PAR, Cburnett, Shoey, Oleg Alexandrov, Mindma-
trix, Btyner, Rjwilmsi, Pahan~enwiki, Salix alba, FlaBot, Alvin-cs, Pstevens, Philten, Roboto de Ajvol, YurikBot, Wavelength, Schmock,
Tony1, Zwobot, Jspacemen01-wiki, Reyk, Zvika, KnightRider~enwiki, SmackBot, Eskimbot, BiT, Afa86, Bluebot, TimBentley, Master
of Puppets, Silly rabbit, Nbarth, AdamSmithee, Iwaterpolo, Eliezg, Robma, A.R., G716, Saippuakauppias, Rigadoun, Loodog, Mgigan-
teus1, Qiuxing, Funnybunny, Chris53516, Tawkerbot2, Jackzhp, CBM, Rrob, Dgw, FilipeS, Blaisorblade, Talgalili, Thijs!bot, DanSoper,
Lovibond, Pabristow, MER-C, Plantsurfer, Mcorazao, J-stan, Leotolstoy, Wasell, VoABot II, Jaekrystyn, User A1, TheRanger, MartinBot,
STBot, Steve8675309, Neon white, Icseaturtles, It Is Me Here, TomyDuby, Mikael Hggstrm, Quantling, Policron, Nm420, HyDeckar,
Sam Blacketer, DrMicro, LeilaniLad, Gaara144, AstroWiki, Notatoad, Johnlv12, Wesamuels, Tarkashastri, Quietbritishjim, Rlendog,
Sheppa28, Phe-bot, Jason Goldstick, Tombomp, OKBot, Melcombe, Digisus, Volkan.cevher, Loren.wilton, Animeronin, ClueBot, Jdg-
ilbey, MATThematical, UKoch, SamuelTheGhost, EtudiantEco, Bluemaster, Qwfp, XLinkBot, Knetlalala, MystBot, Paulginz, Fergikush,
Tayste, Addbot, Fgnievinski, Fieldday-sunday, MrOllie, Download, LaaknorBot, Renatokeshet, Lightbot, Ettrig, Chaldor, Luckas-bot,
Yobot, Wjastle, Johnlemartirao, AnomieBOT, Microball, MtBell, Materialscientist, Geek1337~enwiki, EOBarnett, DirlBot, LilHelpa,
Lixiaoxu, Xqbot, Eliel Jimenez, Etoombs, Control.valve, NocturneNoir, GrouchoBot, RibotBOT, Entropeter, Shadowjams, Grinofwales,
Constructive editor, FrescoBot, Tom.Reding, Stpasha, MastiBot, Gperjim, Fergusq, Xnn, RjwilmsiBot, Kastchei, Alph Bot, Wassermann7,
Markg0803, EmausBot, Yuzisee, Dai bach, Pet3ris, U+003F, Zephyrus Tavvier, Levdtrotsky, ChuispastonBot, Emilpohl, Brycehughes,
ClueBot NG, BG19bot, Analytics447, Snouy, Drhowey, Dlituiev, Minsbot, HelicopterLlama, Limit-theorem, Ameer diaa, Idoz he,
Zjbranson, DonaghHorgan, Catalin.ghervase, BeyondNormality, Monkbot, Alakzi, Bderrett, Uceeylu and Anonymous: 238
Chi-squared test Source: https://en.wikipedia.org/wiki/Chi-squared_test?oldid=667776610 Contributors: The Anome, Matusz, Michael
Hardy, Tomi, Karada, Ronz, Ciphergoth, Jtzg, Mxn, Silversh, Crissov, Robbot, Giftlite, Andris, Matt Crypto, MarkSweep, Piotrus,
Elektron, Rich Farmbrough, Cap'n Refsmmat, Kwamikagami, Smalljim, PAR, Stefan.karpinski, Spangineer, Wtmitchell, Falcorian, Blue-
moose, Aatombomb, Strait, MZMcBride, Yar Kramer, JoseMires~enwiki, Intgr, Pstevens, YurikBot, Wavelength, Darker Dreams, DY-
LAN LENNON~enwiki, Avraham, Arthur Rubin, Reyk, SmackBot, Turadg, Nbarth, Scwlong, Whpq, Bowlhover, G716, Lambiam, Cron-
holm144, Loodog, Tim bates, Smith609, Beetstra, Chris53516, Usgnus, Dgw, Requestion, WeggeBot, Steel, Karuna8, Talgalili, Thijs!bot,
Adjespers, Itsmejudith, AntiVandalBot, ReviewDude, Seaphoto, Johannes Simon, Ranger2006, Baccyak4H, KenyaSong, Serviscope Mi-
nor, MartinBot, Poeloq, Lbeaumont, Khatterj, JoshuaEyer, STBotD, VolkovBot, Pleasantville, Grotendeels Onschadelijk, Synthebot, Igno-
scient, SieBot, Matthew Yeager, Quest for Truth, Svick, Melcombe, Digisus, Tuxa, Animeronin, ClueBot, Muhandes, Qwfp, Tdslk, Wik-
Head, SilvonenBot, Prax54, Sindbad72, Tayste, Addbot, Luzingit, Doronp, MrOllie, Bhdavis1978, Legobot, Luckas-bot, Yobot, Amirobot,
AnomieBOT, Walter Grassroot, Unara, Jtamad, KuRiZu, GrouchoBot, Joxemai, Thehelpfulbot, Pinethicket, I dream of horses, Madonius,
Kastchei, EmausBot, Kgwet, Lolcatsdeamon13, Orange Suede Sofa, Levdtrotsky, ClueBot NG, Hyiltiz, Ion vasilief, Epfuerst, Helpful
Pixie Bot, Evcifreo, Chafe66, Jf.alcover, Aymankamelwiki, Brirush, RichardMarioFratini, MNikulin, EJM86, BethNaught, Hannasnow,
Iwilsonp, Pentaquark and Anonymous: 140
Goodness of fit Source: https://en.wikipedia.org/wiki/Goodness_of_fit?oldid=664740099 Contributors: Khendon, Michael Hardy, Ronz,
Den fjttrade ankan~enwiki, Benwing, David Edgar, Giftlite, ReallyNiceGuy, Army1987, NickSchweitzer, Keavich, Alkarex, Btyner,
Demian12358, YurikBot, Wavelength, Amakuha, Jon Olav Vik, Carlosguitar, Slashme, Kslays, Chris the speller, BostonMA, Mwtoews,
Nutcracker, Dicklyon, Belizefan, Jayen466, Mr Gronk, Talgalili, Thijs!bot, Danger, Ph.eyes, Fjalokin, Glrx, Bonadea, SueHay, Llam-
abr, Tomaxer, Melcombe, ClueBot, Tomas e, Hoskee, Qwfp, DumZiBoT, Addbot, Fgnievinski, Renatokeshet, AnomieBOT, Gumlicks,
Joxemai, Fortdj33, Kastchei, John of Reading, Dai bach, JordiGH, Mathstat, MerlIwBot, Bhaveshpatil04 and Anonymous: 51
Likelihood-ratio test Source: https://en.wikipedia.org/wiki/Likelihood-ratio_test?oldid=668781783 Contributors: The Anome, Fnielsen,
Torfason, Michael Hardy, Kku, Notheruser, Den fjttrade ankan~enwiki, Jtzg, Cherkash, Unknown, Seglea, Meduz, Babbage, Henrygb,
Elysdir, Robinh, Giftlite, Pgan002, MarkSweep, Corti, Bender235, El C, Arcadian, Seans Potato Business, Cburnett, Jheald, Oleg Alexan-
drov, Btyner, Graham87, NeoUrfahraner, Pete.Hurd, Thecurran, Adoniscik, YurikBot, Cancan101, Draeco, Robertvan1, RL0919, Nescio,
Badgettrg, SmackBot, Tom Lougheed, Rajah9, Nbarth, Yimmieg, Moverly, Tim bates, Dchudz, Smith609, AnRtist, Jackzhp, RobDe68,
AgentPeppermint, Guy Macon, Mack2, Kniwor, JamesBWatson, TomyDuby, Quantling, Cmcnicoll, AlleborgoBot, Arknascar44, Adis-
malscientist, Jeremiahrounds, Melcombe, Mild Bill Hiccup, Wildland, 1ForTheMoney, Qwfp, Jmac2222, Prax54, Jht4060, Tayste, Addbot,
DOI bot, Nilayvaish, Legobot, Zaqrfv, AnomieBOT, Citation bot, Twri, LilHelpa, ArcadianOnUnsecuredLoc, Kristjn Jnasson, Fortdj33,
Vthesniper, Octonion, Aryan1989, HRoestBot, Ridgeback22, Madbix, Kastchei, Salvio giuliano, EmausBot, Wiki091005!!, Fanyavizuri,
Frietjes, Masssly, Sboludo, Helpful Pixie Bot, Fayue1015, NaftaliHarris, BG19bot, Chafe66, Limesave, Chuk.plante, Dexbot, Nm160111,
Penitence, FrB.TG, TedPSS and Anonymous: 88
Statistical classification Source: https://en.wikipedia.org/wiki/Statistical_classification?oldid=666781901 Contributors: The Anome,
Michael Hardy, GTBacchus, Hike395, Robbot, Benwing, Giftlite, Beland, Violetriga, Kierano, Jrme, Anthony Appleyard, Denoir, Oleg
Alexandrov, Bkkbrad, Qwertyus, Bgwhite, Roboto de Ajvol, YurikBot, Jrbouldin, Dtrebbien, Tianicita, Tobi Kellner, SmackBot, Ob-
ject01, Mcld, Chris the speller, Nervexmachina, Can't sleep, clown will eat me, Memming, Cybercobra, Richard001, Bohunk, Beetstra,
Hu12, Billgaitas@hotmail.com, Trauber, Juansempere, Thijs!bot, Prolog, Mack2, Peteymills, VoABot II, Robotman1974, Quocminh9,
RJASE1, Jamelan, ThomHImself, Gdupont, Junling, Melcombe, WikiBotas, Agor153, Addbot, Giggly37, Fgnievinski, SpBot, Movado73,
Yobot, Oleginger, AnomieBOT, Ashershow1, Verbum Veritas, FrescoBot, Gire 3pich2005, DrilBot, Classier1234, Jonkerz, Fly by Night,
Microfries, Chire, Sigma0 1, Rmashhadi, ClueBot NG, Girish280, MerlIwBot, Helpful Pixie Bot, Chyvve, Swsboarder366, Klilidiplomus,
Ferrarisailor, Mark viking, Francisbach, Imphil, I Less than3 Maths, LdyBruin and Anonymous: 65
Binary classification Source: https://en.wikipedia.org/wiki/Binary_classification?oldid=668840507 Contributors: The Anome, Michael
Hardy, Kku, Nichtich~enwiki, Janka~enwiki, Henrygb, Sepreece, Wmahan, Dfrankow, Nonpareility, 3mta3, Oleg Alexandrov, Linas,
Btyner, Qwertyus, Salix alba, FlaBot, Jaraalbe, DRosenbach, RG2, SmackBot, Chris the speller, Nbarth, Mauro Bieg, Ebraminio,
Amit Moscovich, Coolhandscot, Mstillman, STBot, Rlsheehan, Salih, Mikael Hggstrm, HELvet, Dr.007, Synthesis88, Jamelan, The-
fellswooper, Pkgx, Melcombe, Denisarona, Mild Bill Hiccup, Qwfp, AndrewHZ, Fgnievinski, Yobot, AnomieBOT, Twri, Saeidpourbabak,
FrescoBot, Duoduoduo, MartinThoma, Pablo Picossa, Alishahss75ali, Sds57, SoledadKabocha, Alialamifard, Richard Kohar, DoctorTer-
rella, Loraof, Shirleyyoung0812 and Anonymous: 32
Maximum likelihood Source: https://en.wikipedia.org/wiki/Maximum_likelihood?oldid=670023975 Contributors: The Anome,
ChangChienFu, Patrick, Michael Hardy, Lexor, Dcljr, Karada, Ellywa, Den fjttrade ankan~enwiki, Cherkash, Hike395, Samsara, Phil
Boswell, R3m0t, Guan, Henrygb, Robinh, Giftlite, DavidCary, BenFrantzDale, Chinasaur, Jason Quinn, Urhixidur, Rich Farmbrough,
Rama, Chowells, Bender235, Maye, Violetriga, 3mta3, Arthena, Inky, PAR, Cburnett, Algocu, Ultramarine, Oleg Alexandrov, James
I Hall, Rschulz, Btyner, Marudubshinki, Graham87, BD2412, Rjwilmsi, Koavf, Cjpun, Mathbot, Nivix, Jrtayloriv, Chobot, Reetep,
YurikBot, Wavelength, Cancan101, Dysmorodrepanis~enwiki, Avraham, Saric, Bo Jacoby, XpXiXpY, Zvika, SolarMcPanel, SmackBot,
Royalguard11, Warren.cheung, Chris the speller, Nbarth, Hongooi, Ju, Earlh, TedE, Dreadstar, G716, Qwerpoiu, Loodog, Lim Wei
Quan, Rogerbrent, Hu12, Freeside3, Cbrown1023, Lavaka, Simo Kaupinmki, 137 0, Travelbird, Headbomb, John254, Z10x, Nick Num-
ber, Binarybits, Alfalfahotshots, Magioladitis, Albmont, Livingthingdan, Baccyak4H, Julian Brown, A3nm, Cehc84, Algebraic, R'n'B,
Lilac Soul, Samikrc, Rlsheehan, Gill110951, AntiSpamBot, Policron, MJamesCA, Mathuranathan, JereyRMiles, Slaunger, TXiKiBoT,
Ataby, JimJJewett, Jsd115, Ramiromagalhaes, Henrikholm, AlleborgoBot, Matt Gleeson, Logan, Zonuleofzinn, Quietbritishjim, Bot-
Multichill, CurranH, Davidmosen, Svick, Melcombe, Classicalecon, Davyzhu, The Thing That Should Not Be, Drazick, Alexbot, Nu-
clearWarfare, Qwfp, Agravier, Vitanyi, Ninja247, Addbot, DOI bot, Lucifer87, MrOllie, SpBot, Hawk8103, Westcoastr13, Luckas-bot,
Yobot, Amirobot, Mathdrum, Zbodnar, AnomieBOT, DemocraticLuntz, RVS, Citation bot, ArthurBot, Xqbot, Flavio Guitian, Xappppp,
Shadowjams, Af1523, FrescoBot, LucienBOT, Citation bot 1, EduardoValle, Kiefer.Wolfowitz, Jmc200, Stpasha, BPets, Casp11, Dimtsit,
Jingyu.cui, RjwilmsiBot, Brandynwhite, Set theorist, Cal-linux, K6ka, Chadhoward, DavidMCEddy, Alexey.kudinkin, JA(000)Davidson,
SporkBot, Zfeinst, Zueignung, Gjshisha, Mikhail Ryazanov, ClueBot NG, Gareth Grith-Jones, MelbourneStar, Arthurcburigo, Frietjes,
Delusion23, Nak9x, Helpful Pixie Bot, Tony Tan, TheMathAddict, Dlituiev, Khazar2, Illia Connell, JYBot, Oneway3124, Jamesmcmahon0,
Deschwartz, Hamoudafg, Crispulop, LokeshRavindranathan, Monkbot, Engheta, Isambard Kingdom and Anonymous: 200
Linear classifier Source: https://en.wikipedia.org/wiki/Linear_classifier?oldid=669802403 Contributors: The Anome, Hike395, Whis-
perToMe, Rls, Benwing, Bernhard Bauer, Wile E. Heresiarch, BenFrantzDale, Neilc, MarkSweep, Bobo192, Jung dalglish, Arcenciel,
Oleg Alexandrov, Linas, Bluemoose, Qwertyus, Mathbot, Sderose, Daniel Mietchen, SmackBot, Hongooi, Mcswell, Marcuscalabresus,
Thijs!bot, Camphor, AnAj, Dougher, BrotherE, TXiKiBoT, Qxz, Phe-bot, Melcombe, SPiNoZA, Jakarr, Addbot, Yobot, AnomieBOT,
Sgoder and Anonymous: 21
Logistic regression Source: https://en.wikipedia.org/wiki/Logistic_regression?oldid=671175842 Contributors: Twanvl, Michael Hardy,
Tomi, Den fjttrade ankan~enwiki, Benwing, Gak, Giftlite, BrendanH, YapaTi~enwiki, Dfrankow, Pgan002, Bolo1729, Qef, Rich Farm-
brough, Mani1, Bender235, Mdf, O18, NickSchweitzer, CarrKnight, Kierano, Arthena, Velella, Oleg Alexandrov, LOL, BlaiseFEgan,
Qwertyus, Rjwilmsi, Ground Zero, Sderose, Shaggyjacobs, YurikBot, Wavelength, Cancan101, Johndburger, Rodolfo Hermans, Jtneill,
D nath1, Lassefolkersen, Aldaron, G716, Esrever, RomanSpa, Nutcracker, Cbuckley, Ionocube, Kenkleinman, Markjoseph125, David s
gra, Jjoseph, Olberd, Requestion, Future Perfect at Sunrise, Kallerdis, Neoforma, LachlanA, Tomixdf, Mack2, JAnDbot, Every Creek
Counts, Sanchom, Owenozier, Magioladitis, Baccyak4H, Nszilard, Mbhiii, Lilac Soul, Kpmiyapuram, Djjrjr, Ronny Gunnarsson, Ype-
trachenko, Dvdpwiki, Squids and Chips, Mobeets, Ktalon, TXiKiBoT, Harrelfe, Antaltamas, Aaport, Jamelan, Synthebot, Dvandeventer,
Anameofmyveryown, Prakash Nadkarni, BAnstedt, Junling, AlanUS, Melcombe, Sphilbrick, Denisarona, Alpapad, Statone, SchreiberBike,
Cmelan, Aprock, Qwfp, XLinkBot, WikHead, Tayste, Addbot, Kurttg~enwiki, New Image Uploader 929, Luckas-bot, Yobot, Secondsmin-
utes, AnomieBOT, Ciphers, Materialscientist, Xqbot, Gtfjbl, FrescoBot, X7q, Orubt, Chenopodiaceous, Albertzeyer, Trappist the monk,
Duoduoduo, Diannaa, RjwilmsiBot, Grumpfel, Strano.m, Mudx77, EmausBot, RMGunton, Alexey.kudinkin, Zephyrus Tavvier, Kchowd-
hary, Sigma0 1, DASHBotAV, Vldscore, Kjalarr, Mhsmith0, Timutre, Helpful Pixie Bot, Ngocminh.oss, BG19bot, Martha6981, DPL
bot, BattyBot, Guziran99, Eatmajor7th, AndrewSmithQueens, Jey42, JTravelman, SFK2, Yongli Han, Merespiz, Tentinator, Pkalczynski,
Yoboho, Hayes.rachel, E8xE8, Tertius51, Nyashinski, ThuyNgocTran, Monkbot, SantiLak, Bluma.Gelley, Kamerondeckerharris, Srp54,
P.thesling, Alexander Craig Russell, , Velvel2, Anuragsodhi, Hughkf, PushTheButton108, LockemyCock, Gzluyongxi
and Anonymous: 182
Linear discriminant analysis Source: https://en.wikipedia.org/wiki/Linear_discriminant_analysis?oldid=668696480 Contributors: The
Anome, Fnielsen, XJaM, Edward, Michael Hardy, Kku, Den fjttrade ankan~enwiki, Hike395, Jihg, Deus~enwiki, Nonick, Giftlite, Dun-
charris, Dfrankow, Pgan002, 3mta3, Arcenciel, Forderud, Crackerbelly, Qwertyus, Rjwilmsi, Mathbot, Predictor, Adoniscik, YurikBot,
Timholy, Doncram, Tcooke, Shawnc, SmackBot, Maksim-e~enwiki, Slashme, Mcld, Memming, Solarapex, Beetstra, Dicklyon, Lifeartist,
StanfordProgrammer, Petrus Adamus, Sopoforic, Cydebot, Thijs!bot, AlexAlex, AnAj, Mack2, Cpl Syx, Stephenchou0722, Lwaldron,
R'n'B, G.kunter~enwiki, Nechamayaniger, Daviddoria, SieBot, Ivan tambuk, Mverleg, Jonomillin, OKBot, Melcombe, Produit, Statone,
Calimo, Qwfp, Addbot, Mabdul, AndrewHZ, Lightbot, Yobot, Citation bot, Klisanor, Sylwia Ufnalska, Morten Isaksen, Olg wiki, Schnitzel-
MannGreek, Pcoat, FrescoBot, X7q, Citation bot 1, Wkretzsch, Heavy Joke, Www wwwjs1, Jfmantis, EmausBot, Dewritech, Radshashi,
Manyu aditya, Marion.cuny, Vldscore, WikiMSL, Helpful Pixie Bot, BG19bot, CitationCleanerBot, Khazar2, Illia Connell, I am One of
Many, Lcparra, Ashleyleia, ArtfulVampire, SJ Defender, , Degill, Olosko and Anonymous: 107
Naive Bayes classifier Source: https://en.wikipedia.org/wiki/Naive_Bayes_classifier?oldid=669331569 Contributors: The Anome, Awa-
terl, Olivier, Michael Hardy, Bewildebeast, Zeno Gantner, Karada, Cyp, Den fjttrade ankan~enwiki, Hike395, Njoshi3, WhisperToMe,
Toreau, Phil Boswell, RedWolf, Bkell, Wile E. Heresiarch, Giftlite, Akella, JimD, Bovlb, Macrakis, Neilc, Pgan002, MarkSweep, Gene
s, Cagri, Anirvan, Trevor MacInnis, Thorwald, Splatty, Rich Farmbrough, Violetriga, Peterjoel, Smalljim, John Vandenberg, BlueN-
ovember, Jason Davies, Caesura, Oleg Alexandrov, KKramer~enwiki, Btyner, Mandarax, Qwertyus, Rjwilmsi, Hgkamath, Johnnyw,
Mathbot, Intgr, Sderose, YurikBot, Wavelength, PiAndWhippedCream, Cancan101, Bovineone, Arichnad, Karipuf, BOT-Superzerocool,
Evryman, Johndburger, Mebden, XAVeRY, SmackBot, InverseHypercube, CommodiCast, Stimpy, ToddDeLuca, Gilliam, NickGar-
vey, Chris the speller, OrangeDog, PerVognsen, Can't sleep, clown will eat me, Memming, Mitar, Neshatian, Jklin, Ringger, WMod-
NS, Tobym, Shorespirit, Mat1971, Dstanfor, Arauzo, Dantiston, Sytelus, Vera Rita~enwiki, Dkemper, Prolog, Ninjakannon, Jrennie,
MSBOT, Coee2theorems, Tremilux, Saurabh911, Robotman1974, David Eppstein, User A1, HebrewHammerTime, AllenDowney,
Troos, AntiSpamBot, Newtman, STBotD, Mike V, RJASE1, VolkovBot, Maghnus, Anna Lincoln, Mbusux, Anders gorm, EverGreg,
Fcady2007, Jojalozzo, Ddxc, Dchwalisz, AlanUS, Melcombe, Headlessplatter, Kotsiantis, Justin W Smith, Motmahp, Calimo, Diane-
garey, Doobliebop, Alousybum, Sunsetsky, XLinkBot, Herlocker, Addbot, RPHv, Tsunanet, MrOllie, LaaknorBot, Yobot, TaBOT-zerem,
Twexcom, AnomieBOT, Rubinbot, Smk65536, The Almighty Bob, Cantons-de-l'Est, , FrescoBot, X7q, Proviktor, Svour-
droculed, Rickyphyllis, Jonesey95, Georey I Webb, Classier1234, Mwojnars, Wingiii, Larry.europe, Helwr, EmausBot, Orphan Wiki,
Tommy2010, GarouDan, Joseagonzalez, ClueBot NG, Hofmic, NilsHaldenwang, Luoli2000, BG19bot, MusikAnimal, Chafe66, Kavish-
war.wagholikar, Geduowenyang, Hipponix, Fcbarbi, Librawill, ChrisGualtieri, XMU zhangy, Alialamifard, CorvetteC6RVip, Jamesm-
cmahon0, Tonytonov, Jmagasin, ScienceRandomness, Qingyuanxingsi, Micpalmia, Soa Koutsouveli, Yuchsiao, Mvdyck, Don neufeld,
YoniSmolin, Rapanshi, Ananth.sankar.1963, Hmerzic and Anonymous: 184
Cross-validation (statistics) Source: https://en.wikipedia.org/wiki/Cross-validation_(statistics)?oldid=668481251 Contributors: Michael
Hardy, Shyamal, Delirium, Den fjttrade ankan~enwiki, Hike395, Phil Boswell, Cutler, Pgan002, Urhixidur, Discospinster, 3mta3, Bfg,
Jrg Knappen~enwiki, GregorB, Btyner, Qwertyus, Rjwilmsi, Mattopia, BMF81, Wavelength, Rsrikanth05, Bruguiea, Saric, Zvika, Capi-
talist, SmackBot, Glvgfz, Stimpy, Nbarth, Iridescent, CmdrObot, Olaf Davis, Gogo Dodo, Rphirsch, Blaisorblade, Headbomb, Mgierdal,
Fogeltje, Alanmalek, AnAj, Onasraou, JAnDbot, Olaf, Necroforest, Johnbibby, Paresnah, Jiuguang Wang, VolkovBot, Jamelan, SieBot,
Anchor Link Bot, Melcombe, Headlessplatter, Calimo, Skbkekas, Rbeg, XLinkBot, MystBot, Addbot, Sohail stat, Fieldday-sunday, Jo-
sevellezcaldas, Movado73, Legobot, Yobot, Materialscientist, Citation bot, Xqbot, Georg Stillfried, WaysToEscape, Allion, X7q, Code-
monkey87, Duoduoduo, Ulatekh, Noblestats, WikitanvirBot, Mjollnir82, ZroBot, H3llBot, Donner60, Robertschulze, ClueBot NG, Widr,
Helpful Pixie Bot, Beaumont877, AdventurousSquirrel, ChrisGualtieri, Khazar2, Clevera, Rolf h nelson, Pandadai, Winlose378, Cosmin-
stamate, Emmanuel-L.T, Degill, Mberming, Kouroshbehzadian and Anonymous: 79
Unsupervised learning Source: https://en.wikipedia.org/wiki/Unsupervised_learning?oldid=660135356 Contributors: Michael Hardy,
Kku, Alo, Ahoerstemeier, Hike395, Ojigiri~enwiki, Gene s, Urhixidur, Alex Kosoruko, Aaronbrick, Bobo192, 3mta3, Tablizer, De-
noir, Nkour, Qwertyus, Rjwilmsi, Chobot, Roboto de Ajvol, YurikBot, Darker Dreams, Daniel Mietchen, SmackBot, CommodiCast,
Trebor, DHN-bot~enwiki, Lambiam, CRGreathouse, Carstensen, Thijs!bot, Jaxelrod, AnAj, Peteymills, David Eppstein, Agenteseg-
reto, Maheshbest, Timohonkela, Ng.j, EverGreg, Algorithms, Kotsiantis, Auntof6, PixelBot, Edg2103, Addbot, EjsBot, Yobot, Les boys,
AnomieBOT, Salvamoreno, D'ohBot, Skyerise, Ranjan.acharyya, BertSeghers, EmausBot, Fly by Night, Rotcaeroib, Stheodor, Daryakav,
Ida Shaw, Chire, Candace Gillhoolley, WikiMSL, Helpful Pixie Bot, Majidjanz and Anonymous: 40
Cluster analysis Source: https://en.wikipedia.org/wiki/Cluster_analysis?oldid=667300175 Contributors: The Anome, Fnielsen, Nealmcb,
Michael Hardy, Shyamal, Kku, Tomi, GTBacchus, Den fjttrade ankan~enwiki, Cherkash, BAxelrod, Hike395, Dbabbitt, Phil Boswell,
Robbot, Gandalf61, Babbage, Aetheling, Giftlite, Lcgarcia, Cfp, BenFrantzDale, Soundray~enwiki, Ketil, Khalid hassani, Angelo.romano,
Dfrankow, Gadum, Pgan002, Gene s, EBB, Sam Hocevar, Pwaring, Jutta, Abdull, Bryan Barnard, Rich Farmbrough, Mathiasl26, Neu-
ronExMachina, Yersinia~enwiki, Bender235, Alex Kosoruko, Aaronbrick, John Vandenberg, Greenleaf~enwiki, Ahc, NickSchweitzer,
3mta3, Jonsafari, Jumbuck, Jrme, Terrycojones, Denoir, Jnothman, Stefan.karpinski, Hazard, Oleg Alexandrov, Soultaco, Woohookitty,
Linas, Uncle G, Borb, Ruud Koot, Tabletop, Male1979, Joerg Kurt Wegner, DESiegel, Ruziklan, Sideris, BD2412, Qwertyus, Rjwilmsi,
Koavf, Salix alba, Michal.burda, Denis Diderot, Klonimus, FlaBot, Mathbot, BananaLanguage, Kcarnold, Payo, Jrtayloriv, Windharp,
BMF81, Roboto de Ajvol, The Rambling Man, YurikBot, Wavelength, Argav, SpuriousQ, Pseudomonas, NawlinWiki, Gareth Jones,
Bayle Shanks, TCrossland, JFD, Hirak 99, Zzuuzz, Rudrasharman, Zigzaglee, Closedmouth, Dontaskme, Kevin, Killerandy, Airconswitch,
SmackBot, Drakyoko, Jtneill, Pkirlin, Object01, Mcld, Ohnoitsjamie, KaragouniS, Bryan Barnard1, MalafayaBot, Drewnoakes, Tenawy,
DHN-bot~enwiki, Iwaterpolo, Zacronos, MatthewKarlsen, Krexer, Bohunk, MOO, Lambiam, Friend of facts, Benash, ThomasHofmann,
Dfass, Beetstra, Ryulong, Nabeth, Hu12, Iridescent, Ralf Klinkenberg, Madla~enwiki, Alanbino, Origin415, Bairam, Ioannes Pragensis,
Joaoluis, Megannnn, Nczempin, Harej bot, Slack---line, Playtime, Endpoint, Dgtized, Skittleys, DumbBOT, Talgalili, Thijs!bot, Barticus88,
Vinoduec, Mailseth, Danhoppe, Phoolimin, Onasraou, Denaxas, AndreasWittenstein, Daytona2, MikeLynch, JAnDbot, Inverse.chi, .ana-
condabot, Magioladitis, Andrimirzal, Fallschirmjger, JBIdF, David Eppstein, User A1, Eeera, Varun raptor, LedgendGamer, Jiuguang
Wang, Sommersprosse, Koko90, Smite-Meister, McSly, Dvdpwiki, DavidCBryant, AStrathman, Camrn86, TXiKiBoT, Rnc000, Tams
Kdr, Mundhenk, Maxim, Winterschlaefer, Lamro, Wheatin, Arrenbas, Sesilbumu, Tomfy, Kerveros 99, Seemu, WRK, Drdan14,
Harveydrone, Graham853, Wcdriscoll, Zwerglein~enwiki, Osian.h, FghIJklm, Melcombe, Kotsiantis, Freeman77, Victor Chmara, Kl4m,
Mugvin, Manuel freire, Boing! said Zebedee, Tim32, PixelBot, Lartoven, Chaosdruid, Aprock, Practical321, Qwfp, FORTRANslinger,
Sunsetsky, Ocean931, Phantom xxiii, XLinkBot, Pichpich, Gnowor, Sujaykoduri, WikHead, Addbot, Allenchue, DOI bot, Bruce rennes,
Fgnievinski, Gangcai, MrOllie, FerrousTigrus, Delaszk, Tide rolls, Lightbot, PAvdK, Fjrohlf, Tobi, Luckas-bot, Yobot, Gulfera, Hungpuiki,
AnomieBOT, Flamableconcrete, Materialscientist, Citation bot, Xqbot, Erud, Sylwia Ufnalska, Simeon87, Omnipaedista, Kamitsaha,
Playthebass, FrescoBot, Sacomoto, D'ohBot, Dan Golding, JohnMeier, Slowmo0815, Atlantia, Citation bot 1, Boxplot, Edfox0714, Mon-
dalorBot, Lotje, E.V.Krishnamurthy, Capez1, Koozedine, Tbalius, RjwilmsiBot, Ripchip Bot, Jchemmanoor, GodfriedToussaint, Aaron-
zat, Helwr, EmausBot, John of Reading, Stheodor, Elixirrixile, BOUMEDJOUT, ZroBot, Sgoder, Chire, Darthhappyface, Jucypsycho,
RockMagnetist, Wakebrdkid, Fazlican, Anita5192, ClueBot NG, Marion.cuny, Ericfouh, Simeos, Poirel, Robiminer, Michael-stanton,
Girish280, Helpful Pixie Bot, Novusuna, BG19bot, Cpkex0102, Wiki13, TimSwast, Cricetus, Douglas H Fisher, Mu.ting, ColanR, Cor-
nelius3, Illia Connell, Compsim, Mogism, Frosty, Abewley, Mark viking, Metcalm, Ninjarua, Trouveur de faits, TCMemoire, ErezHartuv,
Monkbot, Leegrc, Imsubhashjha, , Olosko, Angelababy00 and Anonymous: 327
Expectation–maximization algorithm Source: https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm?oldid=
671233060 Contributors: Rodrigob, Michael Hardy, Karada, Jrauser, BAxelrod, Hike395, Phil Boswell, Owenman, Robbyjo~enwiki, Ben-
wing, Wile E. Heresiarch, Giftlite, Paisa, Vadmium, Onco p53, MarkSweep, Piotrus, Cataphract, Rama, MisterSheik, Alex Kosoruko,
O18, John Vandenberg, Jjmerelo~enwiki, 3mta3, Terrycojones, B k, Eric Kvaalen, Cburnett, Finfobia, Jheald, Forderud, Sergey Dmitriev,
Igny, Bkkbrad, Bluemoose, Btyner, Qwertyus, Rjwilmsi, KYPark, Salix alba, Hild, Mathbot, Glopk, Kri, BradBeattie, YurikBot, Nils
Grimsmo, Schmock, Rgis B., Klutzy, Hakeem.gadi, Maechler, Ladypine, M.A.Dabbah, SmackBot, Mcld, Nbarth, Tekhnoend, Iwater-
polo, Bilgrau, Joeyo, Raptur, Derek farn, Jrouquie, Dicklyon, Alex Selby, Saviourmachine, Lavaka, Requestion, Cydebot, A876, Kallerdis,
Libro0, Blaisorblade, Skittleys, Andyrew609, Talgalili, Tiedyeina, Rusmike, Headbomb, RobHar, LachlanA, AnAj, Zzpmarco, Deki-
masu, JamesBWatson, Richard Bartholomew, Livingthingdan, Nkwatra, User A1, Edratzer, Osquar F, Numbo3, Salih, GongYi, Douglas-
Lanman, Bigredbrain, Market Eciency, Lamro, Daviddoria, Pine900, Tambal, Mosaliganti1.1, Melcombe, Sitush, Pratx, Alexbot, Hbeigi,
Jakarr, Jwmarck, XLinkBot, Jamshidian, Addbot, Sunjuren, Fgnievinski, LaaknorBot, Aanthony1243, Peni, Luckas-bot, Yobot, Leonar-
doWeiss, AnomieBOT, Citation bot, TechBot, Chuanren, FrescoBot, Nageh, Erhanbas, Nocheenlatierra, Qiemem, Kiefer.Wolfowitz,
Jmc200, Stpasha, Jszymon, GeypycGn, Trappist the monk, Thi Nhi, Ismailari, Dropsciencenotbombs, RjwilmsiBot, Slon02, Emaus-
Bot, Mikealandewar, John of Reading, , Chire, Statna, ClueBot NG, Rezabot, Meea, Qwerty9967, Helpful Pixie Bot, Rxnt, Bibcode
Bot, BG19bot, Chafe66, Whym, Lvilnis, BattyBot, Yasuo2, Illia Connell, JYBot, Blegat, Yogtad, Tentinator, Marko0991, Ginsuloft, Wcc-
snow, Ronniemaor, Monkbot, Nboley, Faror91, DilumA, Rider ranger47, Velvel2, Crimsonslide, Megadata tensor, Surbut, Greatwave and
Anonymous: 151
K-means clustering Source: https://en.wikipedia.org/wiki/K-means_clustering?oldid=671161262 Contributors: Fnielsen, Michael Hardy,
Ixfd64, Den fjttrade ankan~enwiki, Charles Matthews, Dbabbitt, Phil Boswell, Ashwin, Pengo, Giftlite, BenFrantzDale, Dunchar-
ris, Soren.harward, WorldsApart, Ratiocinate, Gazpacho, Rich Farmbrough, Mathiasl26, Greenleaf~enwiki, 3mta3, Jonsafari, Andkaha,
Ricky81682, Jnothman, Alai, Robert K S, Qwertyus, Rjwilmsi, Hgkamath, Miserlou, Gringer, Mathbot, Mahlon, Chobot, Bgwhite, Uk-
Paolo, YurikBot, Wavelength, SpuriousQ, Annabel, Hakkinen, SamuelRiv, Leishi, Killerandy, SmackBot, Zanetu, Mauls, Mcld, Memming,
Cronholm144, Barabum, Denshade, Mauro Bieg, CBM, Chrike, Chrisahn, Talgalili, Thijs!bot, June8th, N5iln, Headbomb, Nick Number,
Phoolimin, Sanchom, Charibdis, Smartcat, Magioladitis, David Eppstein, Kzafer, Gfxguy, Turketwh, Stimpak, Mati22081979, Alirn,
JohnBlackburne, TXiKiBoT, FedeLebron, ChrisDing, Corvus cornix, Ostrouchov, Yannis1962, Billinghurst, Maxlittle2007, Erniepan,
Illuminated, Strife911, Weston.pace, Ntvuok, AlanUS, Melcombe, PerryTachett, MenoBot, DEEJAY JPM, DragonBot, Alexbot, Pot,
Tbmurphy, Rcalhoun, Agor153, Qwfp, Niteskate, Tavlos, Avoided, Addbot, DOI bot, Foma84, Fgnievinski, Homncruse, Wfolta, An-
dresH, Yobot, AnomieBOT, Jim1138, Materialscientist, Citation bot, LilHelpa, Honkkis, Gtfjbl, Gilo1969, Simeon87, Woolleynick,
Wonderful597, Dpf90, Foobarhoge, FrescoBot, Dan Golding, Phillipe Israel, Jonesey95, Cincoutprabu, Amkilpatrick, NedLevine, Ranu-
mao, Larry.europe, Helwr, EmausBot, John of Reading, Lessbread, Manyu aditya, ZroBot, Sgoder, Chire, Toninowiki, 0sm0sm0,
Helpsome, ClueBot NG, Mathstat, Jack Greenmaven, Railwaycat, BlueScreenD, Jsanchezalmeida, BG19bot, MusikAnimal, Mark Ar-
sten, SciCompTeacher, Chmarkine, Utacsecd, Amritamaz, EdwardH, Sundirac, BattyBot, Illia Connell, MarkPundurs, MindAfterMath,
Jamesx12345, MEmreCelebi, Jcallega, Watarok, E8xE8, Quenhitran, Anrnusna, MSheshera, Monkbot, Mazumdarparijat, Joma.huguet,
Niraj Aher, Alvisedt, Laiwoonsiu, Eyurtsev, HelpUsStopSpam, Varunjoshi42 and Anonymous: 199
Hierarchical clustering Source: https://en.wikipedia.org/wiki/Hierarchical_clustering?oldid=670589093 Contributors: Jose Icaza,
Nealmcb, GTBacchus, Hike395, Dmb000006, 3mta3, Mandarax, Qwertyus, Rjwilmsi, Piet Delport, Hakkinen, DoriSmith, Smack-
Bot, Mitar, Mwtoews, Krauss, Skittleys, Talgalili, Headbomb, Magioladitis, David Eppstein, Cypherzero0, Salih, FedeLebron, Kr-
ishna.91, Grscjo3, Qwfp, Eric5000, SleightTrickery, MystBot, Addbot, Netzwerkerin, Yobot, Legendre17, AnomieBOT, GrouchoBot,
FrescoBot, Iamtravis, Citation bot 1, DixonDBot, Ismailari, Saitenschlager, Robtoth1, NedLevine, RjwilmsiBot, WikitanvirBot, Jackiey99,
Jy19870110, ZroBot, Chire, Ars12345, Sgj67, Mathstat, Widr, KLBot2, Kamperh, SciCompTeacher, IluvatarBot, SarahLZ, Astros4477,
Jmajf, Joeinwiki, PeterLFlomPhD, StuartWilsonMaui, Meatybrainstu, and Anonymous: 50
Instance-based learning Source: https://en.wikipedia.org/wiki/Instance-based_learning?oldid=615580426 Contributors: Ehamberg, Qw-
ertyus, SmackBot, Hmains, AlanUS, RjwilmsiBot, Gareldnate, LeviShel, Verhoevenben, Mann.timothy, ChrisGualtieri and Anonymous:
5
K-nearest neighbors algorithm Source: https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm?oldid=665873983 Contributors:
The Anome, B4hand, Michael Hardy, Ronz, Charles Matthews, Topbanana, AnonMoos, Pakaran, Robbot, Altenmann, DHN, Adam
McMaster, Pgan002, Dan aka jack, Thorwald, Rama, Slambo, Barro~enwiki, BlueNovember, Caesura, GiovanniS, RHaworth, SQF-
reak, Btyner, Marudubshinki, BD2412, Qwertyus, Rjwilmsi, Stoph, Debivort, Wavelength, Janto, Garion96, SmackBot, CommodiCast,
Mdd4696, Stimpy, Mcld, DHN-bot~enwiki, Hongooi, MisterHand, Joerite, Memming, Gnack, Hu12, Atreys, Ogerard, Kozuch, AnAj,
MER-C, Olaf, Jbom1, Peteymills, Dustinsmith, User A1, Mach7, McSly, AntiSpamBot, RJASE1, Joeoettinger, TXiKiBoT, ITurtle, Mpx,
SieBot, Prakash Nadkarni, Flyer22, Narasimhanator, AlanUS, Melcombe, Eamon Nerbonne, Svante1, Cibi3d, ClueBot, JP.Martin-Flatin,
Algomaster, Alexbot, Agor153, El bot de la dieta, Rubybrian, Pradtke, XLinkBot, Ploptimist, Addbot, MrOllie, Protonk, Luckas-bot,
Yobot, AnomieBOT, Tappoz, Citation bot, Megatang, Miym, PM800, Leonid Volnitsky, FrescoBot, Paine Ellsworth, X7q, Citation bot
1, Rickyphyllis, Emslo69, Lars Washington, Dinamik-bot, Bracchesimo, Delmonde, Geomwiz, Sideways713, DARTH SIDIOUS 2, Thed-
wards, RjwilmsiBot, Larry.europe, GodfriedToussaint, Nikolaosvasiloglou, EmausBot, Logical Cowboy, Fly by Night, Wijobs, Microfries,
Slightsmile, Manyu aditya, Meng6, Chire, Yc319, Mlguy, Vedantkumar, Lovasoa, Dennis97519, Chafe66, Hipponix, Luvegood, Chris-
Gualtieri, Vbaculum, Jamesx12345, Joeinwiki, Sevamoo, TejDham, Comp.arch, LokeshRavindranathan, Skr15081997, Monkbot, Niraj
Aher, Moshe.benyamin, Crystallizedcarbon, Vermelhomaraj, Sachith500 and Anonymous: 114
Principal component analysis Source: https://en.wikipedia.org/wiki/Principal_component_analysis?oldid=670832593 Contributors:
Ed Poor, Fnielsen, Schewek, Bernfarr, Michael Hardy, Shyamal, Wapcaplet, Ixfd64, Tomi, Jovan, CatherineMunro, Den fjttrade
ankan~enwiki, Kevin Baas, Cherkash, Hike395, A5, Guaka, Dcoetzee, Ike9898, Jfeckstein, Jessel, Sboehringer, Vincent kraeutler,
Metasquares, Phil Boswell, Npettiaux, Benwing, Centic, Smb1001, Saforrest, Giftlite, BenFrantzDale, Lupin, Chinasaur, Amp, Yke, Ja-
son Quinn, Khalid hassani, Dfrankow, Pgan002, Gdm, Fpahl, OverlordQ, Rcs~enwiki, Gene s, Lumidek, Jmeppley, Frau Holle, David-
strauss, Thorwald, Richie, Discospinster, Rich Farmbrough, Pjacobi, Bender235, Gauge, Mdf, Nicolasbock, Lysdexia, Anthony Appleyard,
Denoir, Jason Davies, Eric Kvaalen, BernardH, Pontus, Jheald, BlastOButter42, Falcorian, Jfr26, RzR~enwiki, Waldir, Kesla, Ketiltrout,
Rjwilmsi, AndyKali, FlaBot, Winterstein, Mathbot, Itinerant1, Tomer Ish Shalom, Chobot, Adoniscik, YurikBot, Wavelength, Pmg, Vecter,
Freiberg, HenrikMidtiby~enwiki, Bruguiea, Trovatore, Holon, Jpbowen, Crasshopper, Entropeneur, Bota47, SamuelRiv, DaveWF, JCipri-
ani, H@r@ld, Whaa?, Zvika, Lunch, SmackBot, Slashme, Larry Doolittle, Jtneill, Mdd4696, Wikipedia@natividads.com, Mcld, Misfeldt,
Njerseyguy, AhmedHan, Oli Filth, Metacomet, Mihai preda, Tekhnoend, Huji, Tamfang, Kjetil1001, Dr. Crash, Vina-iwbot~enwiki,
Ck lostsword, Thejerm, Lambiam, Mgiganteus1, Ben Moore, Dicklyon, Hovden, Nwstephens, Eclairs, Hu12, Luwo, Conormct, Dound,
Mishrasknehu, Denizstij, CRGreathouse, Shorespirit, MaxEnt, MC10, Hypersphere, Indeterminate, Carstensen, Markluel, Seicer, Tal-
galili, RichardVeryard, MaTT~enwiki, Javijabot, Dr. Submillimeter, Tillman, GromXXVII, MER-C, JPRBW, .anacondabot, Sirhans,
Meredyth, Brusegadi, Daemun, Destynova, A Haupteisch, User A1, Parunach, Blackcat100, Zefram, R'n'B, Jorgenumata, Jiuguang
Wang, McSly, GongYi, Robertgreer, Qtea, Swatiquantie, VasilievVV, GcSwRhIc, ChrisDing, Amaher, Slysplace, Jmath666, Peter ja
shaw, Sjpajantha, Ericmelse, SieBot, ToePeu.bot, Rl1rl1, Smsarmad, Oxymoron83, Algorithms, AlanUS, Tesi1700, Melcombe, Headless-
platter, DonAByrd, Vectraproject, Anturtle, ClueBot, Ferred, HairyFotr, Mild Bill Hiccup, Robmontagna, Dj.science, SteelSoul, Cal-
imo, NuclearWarfare, Skbkekas, Gundersen53, Agor153, SchreiberBike, Aprock, Ondrejspilka, JamesXinzhiLi, User102, StevenDH,
Kegon, HarrivBOT, Qwfp, XLinkBot, Dkondras, Kakila, Kbdankbot, Tayste, Addbot, Bruce rennes, MrOllie, Delaszk, Mdnahas, Light-
bot, , Legobot, Luckas-bot, Yobot, Crisluengo, AnakngAraw, Chosesdites, Archy33, AnomieBOT, Ciphers, T784303, Citation bot,
Fritsebits, Xqbot, Gtfjbl, Sylwia Ufnalska, Chuanren, Omnipaedista, BulldogBeing, Soon Lee, Joxemai, MuellerJak, Amosdor, Fres-
coBot, Rdledesma, X7q, BenzolBot, Gaba p, Pinethicket, Dront, Hechay, Duoduoduo, Jfmantis, PCAexplorer, RjwilmsiBot, Kastchei,
Helwr, Alfaisanomega, Davoodshamsi, GoingBatty, Fran jo, ZroBot, Josve05a, Drusus 0, Sgoder, Chire, GeorgeBarnick, Mayur, Fjoel-
skaldr, JordiGH, RockMagnetist, Brycehughes, ClueBot NG, Marion.cuny, Ldvbin, WikiMSL, Helpful Pixie Bot, Roybgardner, Nagarajan
paramasivam, BG19bot, Naomi altman, Chafe66, Ga29sic, JiemingChen, Susie8876, Statisfactions, SarahLZ, Cretchen, Fylbecatulous,
Imarkovs, Dfbeaton, Cccddd2012, BereNice V, ChrisGualtieri, GoShow, Aimboy, Jogfalls1947, Stevebillings, Duncanpark, Lugia2453,
The Quirky Kitty, Germanoverlord, SimonPerera, GabeIglesia, Paum89, HalilYurdugul, OhGodItsSoAmazing, Sangdon Lee, Tbouwman,
Poline3939, Pandadai, Tmhuey, Pijjin, Hdchina2010, Chenhow2008, Themtide999, Statistix35, Phleg1, Hilary Hou, Mehr86, Monkbot,
Yrobbers, Bzeitner, JamesMishra, Uprockrhiz, Cyrilauburtin, Potnisanish, Velvel2, Mew95001, CarlJohanI, Olosko, Wanghe07, Embat,
Ben.dichter, Olgreenwood and Anonymous: 344
Dimensionality reduction Source: https://en.wikipedia.org/wiki/Dimensionality_reduction?oldid=620805391 Contributors: Michael
Hardy, Kku, William M. Connolley, Charles Matthews, Stormie, Psychonaut, Texture, Wile E. Heresiarch, Wolfkeeper, Pgan002, Neu-
ronExMachina, Euyyn, Runner1928, Arthena, Zawersh, Oleg Alexandrov, Waldir, Joerg Kurt Wegner, Qwertyus, Ddofer, Tagith, Bg-
white, YurikBot, Wavelength, Soumya.ray, Welsh, Gareth Jones, Voidxor, SmackBot, Mcld, Charivari, Kvng, Laurens-af, CapitalR,
ShelfSkewed, BetacommandBot, Barticus88, Sylenius, Mentisto, Dougher, Xetrov, SieBot, Kerveros 99, Hegh, Melcombe, Agor153,
BOTarate, Lespinats, Addbot, Delaszk, , Movado73, Yobot, Fc renato, Ciphers, FrescoBot, Sa'y, Jonkerz, Helwr, ClueBot NG,
WikiMSL, Helpful Pixie Bot, Craigacp, Cccddd2012, HurriH, OhGodItsSoAmazing, Diman.kham and Anonymous: 46
Greedy algorithm Source: https://en.wikipedia.org/wiki/Greedy_algorithm?oldid=667684717 Contributors: AxelBoldt, Hfastedge,
CatherineMunro, Notheruser, PeterBrooks, Charles Matthews, Dcoetzee, Malcohol, Jaredwf, Meduz, Sverdrup, Wlievens, Enochlau,
Giftlite, Smjg, Kim Bruning, Jason Quinn, Pgan002, Andycjp, Andreas Kaufmann, TomRitchford, Discospinster, ZeroOne, Nabla, Dio-
midis Spinellis, Nandhp, Obradovic Goran, Haham hanuka, CKlunck, Swapspace, Ralphy~enwiki, Ryanmcdaniel, Hammertime, Mechon-
barsa, CloudNine, Mindmatrix, LOL, Cruccone, Ruud Koot, Que, Sango123, FlaBot, New Thought, Kri, Pavel Kotrc, YurikBot, Wave-
length, Hairy Dude, TheMandarin, Nethgirb, Bota47, Marcosw, Darrel francis, SmackBot, Brianyoumans, Unyoyega, KocjoBot~enwiki,
NickShaforosto, Trezatium, SynergyBlades, DHN-bot~enwiki, Emurphy42, Omgoleus, MichaelBillington, Mlpkr, Wleizero, Cjohnzen,
Mcstrother, Suanshsinghal, Cydebot, Jibbist, Thijs!bot, Wikid77, Nkarthiks, Escarbot, Uselesswarrior, Clan-destine, Salgueiro~enwiki,
Chamale, Jddriessen, Albany NY, Magioladitis, Eleschinski2000, Avicennasis, David Eppstein, Mange01, Zangkannt, Policron, BernardZ,
Maghnus, TXiKiBoT, ArzelaAscoli, Monty845, Hobartimus, Denisarona, HairyFotr, Meekywiki, Enmc, Addbot, Legobot, Fraggle81,
Materialscientist, Shrishaster, Hhulzk, Rickproser, , Eirik the Viking, X7q, Kiefer.Wolfowitz, A8UDI, JumpDiscont, Skakkle, Po-
lariseke, John of Reading, Bernard Teo, Optimering, ZroBot, Ziradkar, Chire, AManWithNoPlan, EdoBot, Swfung8, Petrb, ClueBot NG,
ElphiBot, Makecat-bot, Lesser Cartographies, Scarlettail, Kuchayrameez, Srijanrshetty, Amaniitk, Boky90 and Anonymous: 109
Reinforcement learning Source: https://en.wikipedia.org/wiki/Reinforcement_learning?oldid=670259161 Contributors: Wmorgan, Im-
ran, Mrwojo, Michael Hardy, Togelius, DopeshJustin, Kku, Delirium, Hike395, Charles Matthews, Robbot, Altenmann, Giftlite, Dratman,
Gene s, Juxi, Urhixidur, Bender235, Tobacman, Diego Moya, Nvrmnd, Oleg Alexandrov, Olethros, Qwertyus, Seliopou, Mathbot, Banazir,
Kri, Chobot, Bgwhite, YurikBot, Wavelength, Masatran, Digfarenough, SmackBot, Fabrice.Rossi, Vermorel, Jcarroll, Chris the speller,
Ash.dyer, DHN-bot~enwiki, Mitar, Beetstra, Flohack, Ceran, Janrpeters, XApple, ShelfSkewed, Perimosocordiae, Skittleys, Rev.bayes,
Escarbot, Tremilux, Parunach, R'n'B, Wfu, Jiuguang Wang, Shyking, Kpmiyapuram, Qsung, Szepi~enwiki, Nedrutland, Mdchang, Sebast-
janmm, MrinalKalakrishnan, Flyer22, Melcombe, Rinconsoleao, MBK004, XLinkBot, Addbot, DOI bot, MrOllie, Download, Mianarshad,
Yobot, Maderlock, Citation bot, LilHelpa, DSisyphBot, J04n, Gosavia, FrescoBot, Fgpilot, Kartoun, Mr ashyash, D'ohBot, Citation bot
1, Albertzeyer, Wikinacious, Skyerise, Trappist the monk, Dpbert, Stuhlmueller, RjwilmsiBot, Claggierk, EmausBot, Macopema, Chire,
Jcautilli, DrewNoakes, Correction45, Rlguy, ChuispastonBot, Mbdts, Dvir-ad, Albertttt, Uymj, Helpful Pixie Bot, BG19bot, Stephen Bal-
aban, ChrisGualtieri, Rbabuska, Ra ules, Chrislgarry, Awliehr, Monkbot, SoloGen and Anonymous: 118
Decision tree learning Source: https://en.wikipedia.org/wiki/Decision_tree_learning?oldid=671088225 Contributors: Michael Hardy,
TheEternalVortex, Greenrd, Maximus Rex, Tschild, Populus, Khalid hassani, Pgan002, Raand, Dan aka jack, Andreas Kaufmann, Dis-
cospinster, John Vandenberg, Giraedata, Mdd, Equinoxe, VeXocide, Bushytails, GregorB, Qwertyus, Rjwilmsi, Gmelli, Salix alba, Vonkje,
SmackBot, Diegotorquemada, Mcld, Riedl, Zven, Mitar, Krexer, Kuru, Beetstra, Courcelles, Ceran, Pgr94, Yaris678, Talgalili, A3RO,
Nobar, Martinkunev, Destynova, Jessicapierce, Dobi~enwiki, User A1, Jalaska13, A m sheldon, Xs2me, Polyextremophile, Seminalist,
Extabgrad, Dodabe~enwiki, Foxj, Stephen Milborrow, Dank, Pichpich, Addbot, AndrewHZ, MrOllie, Download, Peni, Yobot, TestE-
ditBot, AnomieBOT, Jim1138, Royote, Citation bot, FrescoBot, Hobsonlane, X7q, Kelos omos1, Orchidbox, Citation bot 1, Boxplot,
Thinking of England, Janez Demsar, Mwojnars, Gzorg, Wik-dt, Chad.burrus, Bethnim, ZroBot, Chire, Liorrokach, Pxtreme75, ClueBot
NG, Psorakis, Aristitleism, Frietjes, Widr, Helpful Pixie Bot, BG19bot, QualitycontrolUS, BendelacBOT, Mightyteja, Djplaner, Citation-
CleanerBot, A923812, Sboosali, Douglas H Fisher, Zhang1989cn, JYBot, Lizhengui, Declaration1776, Jey42, Mgibby5, Pimgd, SiraRaven,
Slash1986, Monkbot, 00tau, HossPatrol, DizzyRebel and Anonymous: 88
Information gain in decision trees Source: https://en.wikipedia.org/wiki/Information_gain_in_decision_trees?oldid=664264867 Con-
tributors: Kku, Dcoetzee, Neilc, Andreas Kaufmann, Mathiasl26, Flammifer, Nulli~enwiki, Musiphil, Jheald, Nulli2, SmackBot, Mcld,
Freelance Intellectual, Funnyfarmofdoom, Seminalist, Mild Bill Hiccup, AndrewHZ, MattTait, , Erik9bot, QualitycontrolUS,
A923812, BattyBot, New Children of the Almighty, Monkbot and Anonymous: 24
Ensemble learning Source: https://en.wikipedia.org/wiki/Ensemble_learning?oldid=667910483 Contributors: The Anome, Greenrd, Vi-
oletriga, 3mta3, Jheald, Qwertyus, Vegaswikian, Wavelength, Crasshopper, ToddDeLuca, Littenberg, Mickeyg13, Mandra Oleka, Magiola-
ditis, Destynova, Salih, EverGreg, Headlessplatter, Kotsiantis, Calimo, Skbkekas, Cowwaw, Qwfp, Sameer0s, Addbot, AndrewHZ, Ettrig,
Yobot, Erik9bot, Citation bot 1, John of Reading, Liorrokach, Zephyrus Tavvier, Helpful Pixie Bot, Laughsinthestocks, BG19bot, Rmasba,
Monkbot, Tyler Streeter, Delibzr, Anshurm, Velvel2, Olosko, Meteozay and Anonymous: 26
Random forest Source: https://en.wikipedia.org/wiki/Random_forest?oldid=670688623 Contributors: Michael Hardy, Willsmith, Zeno
Gantner, Ronz, Den fjttrade ankan~enwiki, Hike395, Nstender, Giftlite, Neilc, Pgan002, Sam Hocevar, Urhixidur, Andreas Kaufmann,
Rich Farmbrough, O18, Rajah, Ferkel, 3mta3, Knowledge Seeker, Rrenaud, Qwertyus, Rjwilmsi, Punk5, Nigosh, Mathbot, LuisPedro-
Coelho, Bgwhite, RussBot, Dsol, Diegotorquemada, Mcld, Bluebot, Eep1mp, Cybercobra, Mitar, Ben Moore, Shorespirit, Ninetyone, In-
nohead, Bumbulski, Jason Dunsmore, Talgalili, Thijs!bot, Tolstoy the Cat, Headbomb, Utopiah, Baccyak4H, Hue White, David Eppstein,
Trusilver, Yogeshkumkar12, Dvdpwiki, Gerifalte~enwiki, WereSpielChequers, Melcombe, Headlessplatter, Jashley13, Xiawi, Alexbot,
Dboehmer, MystBot, Addbot, AndrewHZ, Bastion Monk, MrOllie, Jperl, Legobot, Yobot, AnomieBOT, Randomexpert, Jim1138, Citation
bot, Twri, V35b, Nippashish, Sgtf, X7q, Dront, Yurislator, Delmonde, John of Reading, ZroBot, Chire, Jwollbold, Pokbot, V.cheplygina,
Joel B. Lewis, EmmanuelleGouillart, Helpful Pixie Bot, BG19bot, QualitycontrolUS, Spaligo, Stevetihi, Schreckse, A923812, JoshuSasori,
JimmyJimmereeno, ChrisGualtieri, Kosio.the.truthseeker, IOverThoughtThis, Bvlb, Svershin, Austrartsua, Monkbot, HossPatrol, Dongkyu
Kim, Puxiao129, StudentDH and Anonymous: 87
Boosting (machine learning) Source: https://en.wikipedia.org/wiki/Boosting_(machine_learning)?oldid=670591164 Contributors: The
Anome, Michael Hardy, Cherkash, BAxelrod, Hike395, Phil Boswell, Seabhcan, Indelible~enwiki, Pgan002, Beland, MarkSweep, Collino,
APH, Gene s, Urhixidur, Ehamberg, Kate, Nowozin, Violetriga, JustinWick, HasharBot~enwiki, Velella, GJeery, Demiurg, Qwertyus,
Rjwilmsi, Intgr, Bgwhite, YurikBot, Petri Krohn, KYN, Slaniel, COMPFUNK2, Mitar, Trifon Triantallidis, Jminguillona, Nick Ot-
tery, Edchi, Seaphoto, Matuag, Magioladitis, A3nm, Kent37, Kovianyo, Sebastjanmm, AaronArvey, Slett~enwiki, AlanUS, Hughpugh,
Kotsiantis, ClueBot, Piastu, Xiawi, TarzanASG, Excirial, Calimo, Skbkekas, XLinkBot, Glane23, SpBot, Movado73, Legobot, Yobot,
AnomieBOT, Hahahaha4, Cobranet, Yousian, X7q, Lreyzin, Boxplot, Danyaljj, Leopd, BG19bot, Striaten, Ianschillebeeckx, Franois
Robere, Jthurst3, Polar Mermaid, Velvel2, Olosko and Anonymous: 57
Bootstrap aggregating Source: https://en.wikipedia.org/wiki/Bootstrap_aggregating?oldid=664942138 Contributors: Fnielsen, Michael
Hardy, Delirium, Den fjttrade ankan~enwiki, Hike395, Mtcv, Phil Boswell, Neilc, Pgan002, Gdm, Urhixidur, Violetriga, JustinWick,
Alex Kosoruko, BlueNovember, Blobglob, Rrenaud, Alai, Oleg Alexandrov, Demiurg, GregorB, Qwertyus, Koolkao, Vonkje, Splash,
SmackBot, Minhtuanht~enwiki, Stimpy, Hongooi, Radagast83, Mitar, John, Kvng, Beefyt, Bumbulski, Tolstoy the Cat, Rubesam,
Lawrencehu~enwiki, EagleFan, David Eppstein, Melcombe, Kotsiantis, Martarius, MystBot, Addbot, DOI bot, AndrewHZ, Movado73,
AnomieBOT, Nippashish, X7q, Citation bot 1, Boxplot, ELAD3, Chire, ChuispastonBot, Helpful Pixie Bot, Wbm1058, BG19bot,
Chafe66, Joeinwiki, Mark viking, Brendonboshell, Ealfaro, Tc325, Rmasba, Caozhu, Anshurm and Anonymous: 30
Gradient boosting Source: https://en.wikipedia.org/wiki/Gradient_boosting?oldid=667898442 Contributors: Ronz, Topbanana, Benwing,
Violetriga, Qwertyus, Hongooi, Dgianotti, Medovina, Andre.holzner, Jazzcat81, Seminalist, JeDonner, MrOllie, Davedev15, Yobot,
Jtamad, LilHelpa, Sophus Bie, X7q, Yihucha166, Trappist the monk, CristiCbz, Alephnot, Chire, HHinman, Mark viking, Pprettenhofer,
DerHessi, Crowwork, P.thesling, XQQ14, Gary2015 and Anonymous: 24
Semi-supervised learning Source: https://en.wikipedia.org/wiki/Semi-supervised_learning?oldid=667482117 Contributors: Edward,
Kku, Delirium, Furrykef, Benwing, Rajah, Arthena, Facopad, Soultaco, Bkkbrad, Ruud Koot, Qwertyus, Gmelli, Chobot, Dav-
eWF, Cedar101, Jcarroll, Drono, Phoxhat, Rahimiali, Bookuser, Lamro, Tbmurphy, Addbot, MrOllie, Luckas-bot, Yobot, Gelbukh,
AnomieBOT, Xqbot, Omnipaedista, Romainbrasselet, D'ohBot, Wokonen, EmausBot, Grisendo, Stheodor, Rahulkmishra, Pintaio, Helpful
Pixie Bot, BG19bot, CarrieVS, AK456, Techerin, M.shahriarinia, Rcpt2 and Anonymous: 28
Perceptron Source: https://en.wikipedia.org/wiki/Perceptron?oldid=669825263 Contributors: The Anome, Koyaanis Qatsi, Ap, Stever-
tigo, Lexor, Ahoerstemeier, Ronz, Muriel Gottrop~enwiki, Glenn, IMSoP, Hike395, Furrykef, Benwing, Bernhard Bauer, Naddy, Rholton,
Fuelbottle, Giftlite, Markus Krtzsch, BrendanH, Neilc, Pgan002, JimWae, Gene s, AndrewKeenanRichardson, Hydrox, Luqui, Rama,
Nwerneck, Robert.ensor, Poromenos, Caesura, Blahedo, Henry W. Schmitt, Hazard, MartinSpacek, Linas, Olethros, Male1979, Qwer-
tyus, Rjwilmsi, Margosbot~enwiki, Predictor, YurikBot, RussBot, Gaius Cornelius, Hwasungmars, Gareth Jones, SamuelRiv, DaveWF,
Nikkimaria, Closedmouth, Killerandy, SmackBot, Saravask, InverseHypercube, Pkirlin, CommodiCast, Eskimbot, El Cubano, Derkuci,
Frap, Xyzzyplugh, Memming, Sumitkb, Tony esopi patra, Lambiam, Fishoak, Beetstra, CapitalR, Momet, RighteousRaven, Mcstrother,
Tinyfool, Thijs!bot, Nick Number, Binarybits, Mat the w, Seaphoto, QuiteUnusual, Beta16, Jorgenumata, Pwmunro, Paskari, Joshua Is-
sac, Shiggity, VolkovBot, TXiKiBoT, Xiaopengyou7, Ocolon, Hawkins.tim, Kadiddlehopper, SieBot, Truecobb, AlanUS, CharlesGilling-
ham, ClueBot, UKoch, MrKIA11, XLinkBot, Gnowor, Addbot, GVDoubleE, LinkFA-Bot, Lightbot, , Luckas-bot, Yobot, Time-
root, SergeyJ, AnomieBOT, Phantom Hoover, Engshr2000, Materialscientist, Twri, GnawnBot, PabloCastellano, Emchristiansen, Bizso,
Tuetschek, Octavianvoicu, Olympi, FrescoBot, Perceptrive, X7q, Mydimle, LauraHale, Cjlim, Gregman2, Igogo3000, EmausBot, Orphan
Wiki, ZroBot, Ohyusheng, Chire, Amit159, MrWink, JE71, John.naveen, Algernonka, ChuispastonBot, Sigma0 1, ClueBot NG, Biewers,
Arrandale, Jackrae, Ricekido, MchLrn, BG19bot, ElectricUvula, Damjanmk, Whym, Dexbot, JurgenNL, Kn330wn, Ianschillebeeckx,
Toritris, Terrance26, Francisbach, Kevin Leutzinger, Joma.huguet, Bjaress, KasparBot, ALLY FRESH, Kitschcocktail, Inigolv, Elizabeth
goodspeed and Anonymous: 166
Support vector machine Source: https://en.wikipedia.org/wiki/Support_vector_machine?oldid=670820716 Contributors: The Anome,
Gareth Owen, Enchanter, Ryguasu, Edward, Michael Hardy, Dmd3e, Oliver Pereira, Kku, Zeno Gantner, Mark Foskey, Jll, Jimregan,
Hike395, Barak~enwiki, Dysprosia, Pheon, Kgajos, Mpost89, Benwing, Altenmann, Wile E. Heresiarch, Tobias Bergemann, Giftlite,
Sepreece, BenFrantzDale, Fropu, Dj Vu, Dfrankow, Neilc, Pgan002, Gene s, Urhixidur, Rich Farmbrough, Mathiasl26, Nowozin, Ylai,
Andrejj, MisterSheik, Lycurgus, Alex Kosoruko, Aaronbrick, Cyc~enwiki, Rajah, Terrycojones, Diego Moya, Pzoot, Nvrmnd, Gortu,
Boyd Steere, Alai, Stuartyeates, The Belgain, Ralf Mikut, Ruud Koot, Waldir, Joerg Kurt Wegner, Qwertyus, Michal.burda, Brighteror-
ange, Mathbot, Vsion, Gurch, Sderose, Diza, Tedder, Chobot, Adoniscik, YurikBot, Ste1n, CambridgeBayWeather, Rsrikanth05, Pseu-
domonas, Seb35, Korny O'Near, Gareth Jones, Karipuf, Retardo, Bota47, SamuelRiv, Tribaal, Palmin, Thnidu, Digfarenough, CWenger,
Sharat sc, Mebden, Amit man, Lordspaz, Otheus, SmackBot, RDBury, Jwmillerusa, Moeron, InverseHypercube, Vardhanw, Ealdent,
Golwengaud, CommodiCast, Mcdu, Eskimbot, Vermorel, Mcld, Riedl, Srchvrs, Avb, MattOates, Memming, Mitar, Sadeq, Jonas Au-
gust, Rijkbenik, Jrouquie, Adilraja, Dicklyon, Mattsachs, Hu12, Ojan, JoeBot, Atreys, Tawkerbot2, Lavaka, Harold f, Owen214, Cm-
drObot, FunPika, Ezrakilty, Innohead, Bumbulski, Anthonyhcole, Farzaneh, Carstensen, Gnfnrf, Thijs!bot, Drowne, Trevyn, Dkem-
per, Prolog, AnAj, Dougher, Wootery, Sanchom, Shashank4, BrotherE, Coee2theorems, Tremilux, Qjqash3, Americanhero, A3nm,
Parunach, Jacktance~enwiki, Ledona delano, Andreas Mueller, Mschel, FelipeVargasRigo, David Callan, Freeboson, Nickvence, Po-
lusfx, Singularitarian, Senu, Salih, JamesMayeld, Pradeepgk, Supportvector, STBotD, RJASE1, Satyr9, Svm slave, TXiKiBoT, Sana-
tio, Carcinus, Majeru, Seminalist, Canavalia, Simonstr, Domination989, Kerveros 99, Quentin Pradet, AaronArvey, Caltas, Behind The
Wall Of Sleep, Udirock, Savedthat, JeanSenellart, Melcombe, Tiny plastic Grey Knight, Wawe1811, Martarius, Sfan00 IMG, Grant-
bow, Peter.kese, Alexbot, Pot, DeltaQuad, MagnusPI, Chaosdruid, Djgulp3~enwiki, Qwfp, Asymptosis, Sunsetsky, Addbot, DOI bot,
AndrewHZ, MrOllie, Lightbot, Peni, Legobot, Yobot, Legobot II, AnomieBOT, Erel Segal, Jim1138, Materialscientist, Citation bot, Sal-
bang, Tripshall, Twri, LilHelpa, Megatang, Chuanren, RibotBOT, Hamidhaji, WaysToEscape, Alisaleh88, HB28205, Romainbrasselet,
X7q, Elvenhack, Wikisteve316, Citation bot 1, RedBot, SpaceFlight89, Classier1234, Zadroznyelkan, ACKiwi, Onel5969, Rjwilmsi-
Bot, Zhejiangustc, J36miles, EmausBot, Alfaisanomega, Nacopt, NikolasE~enwiki, Datakid1, Msayag, Stheodor, Colmenares jb, Manyu
aditya, ZroBot, Khaled.Boukharouba, Chire, Amit159, Tolly4bolly, Pintaio, Tahir512, ChengHsuanLi, Ecc81, DaChazTech, Randallbrit-
ten, Aaaaa bbbbb1355, Liuyipei, Arafat.sultan, ClueBot NG, Kkddkkdd, MchLrn, Helpful Pixie Bot, Alisneaky, Elferdo, BG19bot, Sci-
CompTeacher, Xlicolts613, Boyander, Bodormenta, Illia Connell, Elmackev, Mohraz2, Deepakiitmandi, Adrienbrunetwiki, Velvel2, Cur-
tisZeng, Cjqed, ZhangJiaqiPKU, KasparBot, Tutuwang, Vijay.singularity.krish and Anonymous: 346
Artificial neural network Source: https://en.wikipedia.org/wiki/Artificial_neural_network?oldid=670995295 Contributors: Magnus
Manske, Ed Poor, Iwnbap, PierreAbbat, Youandme, Susano, Hfastedge, Mrwojo, Michael Hardy, Erik Zachte, Oliver Pereira, Bobby D.
Bryant, Zeno Gantner, Parmentier~enwiki, Delirium, Pieter Suurmond, (, Alo, 168..., Ellywa, Ronz, Snoyes, Den fjttrade ankan~enwiki,
Cgs, Glenn, Cyan, Hike395, Hashar, Novum, Charles Matthews, Guaka, Timwi, Reddi, Andrewman327, Munford, Furrykef, Bevo, Fvw,
Raul654, Nyxos, Unknown, Pakcw, Robbot, Chopchopwhitey, Bkell, Hadal, Wikibot, Diberri, Xanzzibar, Wile E. Heresiarch, Con-
nelly, Giftlite, Rs2, Markus Krtzsch, Spazzm, Seabhcan, BenFrantzDale, Zigger, Everyking, Rpyle731, Wikiwikifast, Foobar, Edrex,
Jabowery, Wildt~enwiki, Wmahan, Neilc, Quadell, Beland, Lylum, Gene s, Sbledsoe, Mozzerati, Karl-Henner, Jmeppley, Asbestos, Fin-
tor, AAAAA, Splatty, Rich Farmbrough, Pak21, NeuronExMachina, Michal Jurosz, Pjacobi, Mecanismo, Zarutian, Dbachmann, Ben-
der235, ZeroOne, Violetriga, Mavhc, One-dimensional Tangent, Gyll, Stephane.magnenat, Mysteronald, .:Ajvol:., Fotinakis, Nk, Tritium6,
JesseHogan, Mdd, Passw0rd, Zachlipton, Alansohn, Jhertel, Anthony Appleyard, Denoir, Arthena, Fritz Saalfeld, Sp00n17, Rickyp, Hu,
Tyrell turing, Cburnett, Notjim, Drbreznjev, Forderud, Oleg Alexandrov, Mogigoma, Madmardigan53, Justinlebar, Olethros, Ylem, Dr.U,
Gengiskanhg, Male1979, Bar0n, Waldir, Eslip17, Yoghurt, Ashmoo, Graham87, Qwertyus, Imersion, Grammarbot, Rjwilmsi, Jeema,
Venullian, SpNeo, Intgr, Predictor, Kri, BradBeattie, Plarroy, Windharp, Mehran.asadi, Commander Nemet, Wavelength, Borgx, Ian-
Manka, Rsrikanth05, Philopedia, Ritchy, David R. Ingham, Grafen, Nrets, Exir Kamalabadi, Deodar~enwiki, Mosquitopsu, Jpbowen,
Dennis!, JulesH, Moe Epsilon, Supten, DeadEyeArrow, Eclipsed, SamuelRiv, Tribaal, Chase me ladies, I'm the Cavalry, CWenger, Don-
halcon, Banus, Shepard, John Broughton, A13ean, SmackBot, PinstripeMonkey, McGeddon, CommodiCast, Jfmiller28, Stimpy, Com-
mander Keane bot, Feshmania, ToddDeLuca, Diegotorquemada, Patrickdepinguin, KYN, Gilliam, Bluebot, Oli Filth, Gardoma, Com-
plexica, Nossac, Hongooi, Pdtl, Izhikevich, Trifon Triantallidis, SeanAhern, Neshatian, Vernedj, Dankonikolic, Rory096, Sina2, SS2005,
Kuru, Plison, Lakinekaki, Bjankuloski06en~enwiki, IronGargoyle, WMod-NS, Dicklyon, Citicat, StanfordProgrammer, Ojan, Chi3x10,
Aeternus, CapitalR, Atreys, George100, Gveret Tered, Devourer09, SkyWalker, CmdrObot, Leonoel, CBM, Mcstrother, MarsRover,
CX, Arauzo, Peterdjones, Josephorourke, Kozuch, ClydeC, NotQuiteEXPComplete, Irigi, Mbell, Oldiowl, Tolstoy the Cat, Headbomb,
Mitchell.E.Timin, Davidhorman, Sbandrews, KrakatoaKatie, QuiteUnusual, Prolog, AnAj, LinaMishima, Whenning, Hamaryns, Daytona2,
JAnDbot, MER-C, Dcooper, Extropian314, Magioladitis, VoABot II, Amitant, Jimjamjak, SSZ, Robotman1974, David Eppstein, User
A1, Martynas Patasius, Pmbhagat, JaGa, Tuhinsubhrakonar, SoyYo, Nikoladie~enwiki, R'n'B, Maproom, K.menin, Gill110951, Tarot-
cards, Plasticup, Margareta, Paskari, Jamesontai, Kiran uvpce, Jamiejoseph, Error9312, Jlaramee, Je G., A4bot, Singleheart, Ebbedc,
Lordvolton, Ask123, CanOfWorms, Mundhenk, Wikiisawesome, M karamanov, Enkya, Blumenkraft, Twikir, Mikemoral, Oldag07, Sm-
sarmad, Flyer22, Janopus, Bwieliczko, Dhateld, F.j.gaze, Mark Lewis Epstein, S2000magician, PuercoPop, Martarius, ClueBot, Ignacio
Javier Igjav, Ahyeek, The Thing That Should Not Be, Fadesga, Zybler, Midiangr, Epsilon60198, Thomas Tvileren, Wduch, Excirial,
Three-quarter-ten, Skbkekas, Chaosdruid, Aprock, Qwfp, Jean-claude perez, Achler, XLinkBot, AgnosticPreachersKid, BodhisattvaBot,
Stickee, Cmr08, Porphyro, Fippy Darkpaw, Addbot, DOI bot, AndrewHZ, Thomblake, Techjerry, Looie496, MrOllie, Transmobilator,
Jarble, Yobot, Blm19732008, Nguyengiap84~enwiki, SparkOfCreation, AnomieBOT, DemocraticLuntz, Tryptosh, Trevithj, Jim1138,
Durran65, MockDuck, JonathanWilliford, Materialscientist, Citation bot, Eumolpo, Twri, NFD9001, Isheden, J04n, Omnipaedista, Mark
Schierbecker, RibotBOT, RoodyBeep, Gunjan verma81, FrescoBot, X7q, mer Cengiz elebi, Outback the koala, Citation bot 1, Ty-
lor.Sampson, Jonesey95, Calmer Waters, Skyerise, Trappist the monk, Krassotkin, Cjlim, Fox Wilson, The Strategist, LilyKitty, Eparo,
, Jfmantis, Mehdiabbasi, VernoWhitney, Wiknn, BertSeghers, DASHBot, EmausBot, Nacopt, Dzkd, Racerx11, Japs 88, Going-
Batty, RaoInWiki, Roposeidon, Epsiloner, Stheodor, Benlansdell, Radshashi, K6ka, D'oh!, Thisisentchris87, Aavindraa, Chire, Glosser.ca,
IGeMiNix, Donner60, Yoshua.Bengio, Shinosin, Venkatarun95, ChuckNorrisPwnedYou, Petrb, ClueBot NG, Raghith, Robiminer, Snot-
bot, Tideat, Frietjes, Gms3591, Ryansandersuk, Widr, MerlIwBot, Helpful Pixie Bot, Trepier, BG19bot, Thwien, Adams7, Rahil2000,
Chafe66, Michaelmalak, Compfreak7, Kirananils, Altar, J.Davis314, Attleboro, Pratyya Ghosh, JoshuSasori, Ferrarisailor, Eugeneche-
ung, Mtschida, ChrisGualtieri, Dave2k6inthemix, Whebzy, APerson, JurgenNL, Oritnk, Stevebillings, Djfrost711, Sa publishers, , Mark
viking, Markus.harz, Deeper Learning, Vinchaud20, Soueumxm, Toritris, Evolution and evolvability, Sboddhu, Sharva029, Paheld, Putting
things straight, Rosario Berganza, Monkbot, Buggiehuggie, Santoshwriter, Likerhayter, Joma.huguet, Bclark401, Rahulpratapsingh06, Don-
keychee, Michaelwine, Xsantostill, Jorge Guerra Pires, Wfwhitney, Loc Bourgois, KasparBot and Anonymous: 497
Deep learning Source: https://en.wikipedia.org/wiki/Deep_learning?oldid=671235450 Contributors: The Anome, Ed Poor, Michael
Hardy, Meekohi, Glenn, Bearcat, Nandhp, Stesmo, Giraedata, Jonsafari, Oleg Alexandrov, Justin Ormont, BD2412, Qwertyus,
Rjwilmsi, Kri, Bgwhite, Tomdooner, Arado, Bhny, Malcolma, Arthur Rubin, Mebden, SeanAhern, Dicklyon, JHP, ChrisCork, Lfstevens,
A3nm, R'n'B, Like.liberation, Popoki, Jshrager, Strife911, Bfx0, Daniel Hershcovich, Pinkpedaller, Dthomsen8, Addbot, Mamikonyana,
Yobot, AnomieBOT, Jonesey95, Zabbarob, Wyverald, RjwilmsiBot, Larry.europe, Helwr, GoingBatty, Sergey WereWolf, SlowByte,
Yoshua.Bengio, JuyangWeng, Renklauf, Bittenus, Widr, BG19bot, Lukas.tencer, Kareltje63, Synchronist, Gameboy97q, IjonTichyIjon-
Tichy, Mogism, AlwaysCoding, Mark viking, Cagarie, Deeper Learning, Prisx, Wikiyant, Underow42, GreyShields, Opokopo, Prof.
Oundest, Gigavanti, Sevensharpnine, Yes deeper, Monkbot, Chieftains337, Samueldg89, Rober9876543210, Nikunj157, Engheta, As-
purdy, Velvel2, TeaLover1996, Deng629, Zhuikov, Stergioc, Jerodlycett, DragonbornXXL, Lzjpaul and Anonymous: 113
62.12.2 Images
File:2013-09-11_Bus_wrapped_with_SAP_Big_Data_parked_outside_IDF13_(9730051783).jpg Source: https://upload.wikimedia.
org/wikipedia/commons/8/8d/2013-09-11_Bus_wrapped_with_SAP_Big_Data_parked_outside_IDF13_%289730051783%29.jpg Li-
cense: CC BY-SA 2.0 Contributors: Bus wrapped with SAP Big Data parked outside IDF13 Original artist: Intel Free Press
File:Ambox_important.svg Source: https://upload.wikimedia.org/wikipedia/commons/b/b4/Ambox_important.svg License: Public do-
main Contributors: Own work, based off of Image:Ambox scales.svg Original artist: Dsmurat (talk contribs)
File:Animation2.gif Source: https://upload.wikimedia.org/wikipedia/commons/c/c0/Animation2.gif License: CC-BY-SA-3.0 Contribu-
tors: Own work Original artist: MG (talk contribs)
File:Ann_dependency_(graph).svg Source: https://upload.wikimedia.org/wikipedia/commons/d/dd/Ann_dependency_%28graph%29.
svg License: CC BY-SA 3.0 Contributors: Vector version of File:Ann dependency graph.png Original artist: Glosser.ca
File:Anscombe's_quartet_3.svg Source: https://upload.wikimedia.org/wikipedia/commons/e/ec/Anscombe%27s_quartet_3.svg Li-
cense: CC BY-SA 3.0 Contributors:
Anscombe.svg Original artist: Anscombe.svg: Schutz
File:ArtificialFictionBrain.png Source: https://upload.wikimedia.org/wikipedia/commons/1/17/ArtificialFictionBrain.png License:
CC-BY-SA-3.0 Contributors: ? Original artist: ?
File:Artificial_neural_network.svg Source: https://upload.wikimedia.org/wikipedia/commons/e/e4/Artificial_neural_network.svg Li-
cense: CC-BY-SA-3.0 Contributors: This vector image was created with Inkscape. Original artist: en:User:Cburnett
File:Automated_online_assistant.png Source: https://upload.wikimedia.org/wikipedia/commons/8/8b/Automated_online_assistant.
png License: Attribution Contributors:
The text is adapted from the Wikipedia merchandise page (this automated customer service itself, however, is fictional), and pasted into a
page in Wikipedia:
Original artist: Mikael Häggström
File:Bayes_icon.svg Source: https://upload.wikimedia.org/wikipedia/commons/e/ed/Bayes_icon.svg License: CC0 Contributors: ?
Original artist: Mikhail Ryazanov
File:Bayes_theorem_visualisation.svg Source: https://upload.wikimedia.org/wikipedia/commons/b/bf/Bayes_theorem_visualisation.
svg License: CC BY-SA 3.0 Contributors: Own work Original artist: Cmglee
File:Bayesian_inference_archaeology_example.jpg Source: https://upload.wikimedia.org/wikipedia/commons/6/6d/Bayesian_
inference_archaeology_example.jpg License: CC0 Contributors: Own work Original artist: Gnathan87
File:Bayesian_inference_event_space.svg Source: https://upload.wikimedia.org/wikipedia/commons/a/ad/Bayesian_inference_event_
space.svg License: CC0 Contributors: Own work Original artist: Gnathan87
File:Bendixen_-_Carl_Friedrich_Gauß,_1828.jpg Source: https://upload.wikimedia.org/wikipedia/commons/3/33/Bendixen_-_Carl_
Friedrich_Gau%C3%9F%2C_1828.jpg License: Public domain Contributors: published in Astronomische Nachrichten 1828 Original
artist: Siegfried Detlev Bendixen
File:Big_data_cartoon_t_gregorius.jpg Source: https://upload.wikimedia.org/wikipedia/commons/b/b3/Big_data_cartoon_t_
gregorius.jpg License: CC BY 2.0 Contributors: Cartoon: Big Data Original artist: Thierry Gregorius