Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change, 2023
Polysemies, or "colexifications", are of great interest in cognitive and historical linguistics, since meanings that are frequently expressed by the same lexeme are likely to be conceptually similar, and lie along a common pathway of semantic change. We argue that these types of inferences can be more reliably drawn from polysemies of cognate sets (which we call "dialexifications") than from polysemies of lexemes. After giving a precise definition of dialexification, we introduce EvoSem, a cross-linguistic database of etymologies scraped from several online sources. Based on this database (publicly available at http://tiny.cc/EvoSem), we measure for each pair of senses how many cognate sets include them both – i.e. how often this pair of senses is "dialexified". This allows us to construct a weighted dialexification graph for any set of senses, indicating the conceptual and historical closeness of each pair. We also present an online interface for browsing our database, including graphs and interactive tables. We then discuss potential applications to NLP tasks and to linguistic research.
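The pair-counting step described here is straightforward to sketch. A minimal illustration, assuming each cognate set is reduced to the set of senses its reflexes express (a hypothetical simplification of the EvoSem data model, not its actual schema):

```python
from collections import Counter
from itertools import combinations

def dialexification_counts(cognate_sets):
    """For each unordered pair of senses, count how many cognate sets
    contain both senses, i.e. how often the pair is 'dialexified'."""
    counts = Counter()
    for senses in cognate_sets:
        for pair in combinations(sorted(set(senses)), 2):
            counts[pair] += 1
    return counts

# Toy data: each cognate set is the set of senses its reflexes express.
cognate_sets = [
    {"tree", "wood"},
    {"tree", "wood", "forest"},
    {"wood", "forest"},
]
weights = dialexification_counts(cognate_sets)
# weights[("tree", "wood")] == 2: this pair is dialexified twice
```

The nonzero counts are then exactly the edge weights of the dialexification graph over the chosen senses.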
We examine a database of 3089 languages coded for 351 morphosyntactic features, including almost all of the morphosyntactic features found in The World Atlas of Language Structures (Dryer & Haspelmath 2013). We apply Factor Analysis of Mixed Data, and determine that the main dimensions of global morphological variation involve (1) word order in clauses and adpositional phrases, (2) head- versus dependent-marking, and (3) a set of features that show an east-west distribution. We find roughly the same features clustering in similar dimensions when we examine individual macro-areas, thus confirming the universal relevance of these groupings of features, as encapsulated in well-known implicational universals. This study confirms established insights in linguistic typology, extending earlier research to a much larger set of languages, and uncovers a number of areal patterns in the data.
Light Warlpiri is a newly emerged Australian mixed language that systematically combines nominal structure from Warlpiri (Australian, Pama-Nyungan) with verbal structure from Kriol (an English-lexified Creole) and English, with additional innovations in the verbal auxiliary system. Lexical items are drawn from both Warlpiri and the two English-lexified sources, Kriol and English. The Light Warlpiri verb system is interesting because of questions raised about how it combines elements of its sources. Most verb stems are derived from Kriol or English, but Warlpiri stems also occur, with reanalysis, and stems of either source host Kriol-derived transitive marking (e.g., hit-im ‘hit-TR’). Transitive marking is productive but also variable. In this paper, we examine transitivity and its marking on Light Warlpiri verbs, drawing on narrative data from an extensive corpus of adult speech. The study finds that transitive marking on verbs in Light Warlpiri is conditioned by six of Hopper and T...
As Virtual Reality emerges as an accessible technology, researchers have begun to experiment with its potential for data visualisation. In this poster we will highlight some conceptual issues around using VR and other immersive 3D technologies to explore the morphosyntactic data from the World Phonotactics Database (Donohue et al. 2013), which is an expansion of the World Atlas of Language Structures (Dryer & Haspelmath 2013). While typological data has an inherent geographical dimension, mapping the space to other features of the data instead illuminates different sorts of clusters and structures. Two-dimensional graphs or visualisations, however, can be difficult to interpret, as information is static and densely layered. In the visualization discussed here, WALS/WPD languages are subjected to multidimensional scaling: essentially the number of features on which any two languages differ is counted, and the total is divided by the number of features that are known for both. A 3D (x,y,z) location is found for each language such that these pairwise distances are represented spatially. The result is a three-dimensional scatterplot. In a VR headset, the user is located at (0,0,0) in the midst of the cloud of points, and can look around. It can also be viewed in a pseudo-3D environment using WebGL in a desktop browser. The scatterplot can map any csv datafile, as long as there are columns for (x,y,z) coordinates as well as (optionally) columns of features that can be used for colouring the points. This therefore has potential uses beyond linguistics as well.
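The pairwise distance just described (features that differ, divided by features known for both) can be sketched as below; the list-of-values representation with `None` for missing data is an assumption for illustration. The resulting distance matrix can then be embedded in three dimensions with any MDS implementation (e.g. `sklearn.manifold.MDS` with `dissimilarity="precomputed"` and `n_components=3`) to obtain the (x,y,z) locations.

```python
def typological_distance(a, b):
    """Share of jointly known features on which two languages differ.
    a and b are per-feature value lists; None marks a missing value."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return 1.0  # no feature known for both; treated as maximally distant (assumption)
    return sum(x != y for x, y in shared) / len(shared)

# Two languages differing on one of their two jointly known features:
d = typological_distance([1, 0, None, 1], [1, 1, 0, None])
# d == 0.5
```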
One of the main advantages of cognitive linguistics (and in particular Cognitive Grammar) over other approaches to the study of language structure is the fact that every descriptive construct is defined in psychological terms. This means, ideally, that any cognitive linguistic description of a word or grammatical construction constitutes a hypothesis about the mental representation of that structure. It should thus be possible to verify such descriptions, or to decide between competing analyses of a phenomenon, by experimentally testing the hypotheses that they entail. Such tests have been rare, however, due to the difficulty of operationalising many of the semantic notions used in Cognitive Grammar. The present thesis reports on attempts to operationalise and test (using questionnaires, production tasks, and reaction time measurements) four descriptive claims formulated in the framework of Cognitive Grammar: that finite complementation constructions are headed by the complement-tak...
Siva Kalyan & Alexandre François. 2019. When the waves meet the trees: A response to Jacques & List. In Siva Kalyan, Alexandre François & Harald Hammarström (eds), Understanding language genealogy: Alternatives to the tree model. Special issue of Journal of Historical Linguistics 9/1: 167–176., 2019
This special issue of the _Journal of Historical Linguistics_, titled “Understanding language genealogy: Alternatives to the tree model”, was created with the idea of opening a constructive debate around the issues raised by the Tree model. We (Kalyan and François) had previously published articles highlighting the shortcomings of the tree for representing language history, and promoting an approach – called Historical Glottometry – based on the Wave model (François 2014, 2017; Kalyan & François 2018). We invited two historical linguists, Guillaume Jacques and Johann-Mattis List, as discussants in our volume; their chapter “Save the trees: Why we need tree models in linguistic reconstruction” [doi:10.1075/jhl.17008.mat] argued in favour of the Tree model – and this is our final response. In this response article, we review our apparently conflicting perspectives, but endeavour to find common ground between them – hence the title “when the waves meet the trees”. We show that our differences are partly due to distinct definitions of key concepts (subgroup, shared innovations…). We also show that the argument of “Incomplete lineage sorting”, which they use to defend the tree, could just as well defend the wave model, since it can capture the key notion of intersecting innovations that is so prevalent in linkages, and so problematic in the traditional tree approach. The final part of our response shows how Historical Glottometry (HG) provides a way to reconstruct the historical process of ‘linkage breaking’, whereby a dialect continuum breaks progressively into separate languages. All in all, our two approaches are fundamentally compatible – even though we find the Wave model, ultimately, to be more realistic.
Siva Kalyan, Alexandre François & Harald Hammarström. 2019. Problems with, and alternatives to, the tree model in historical linguistics. In S. Kalyan, A. François & H. Hammarström (eds), Understanding language genealogy: Alternatives to the tree model. Journal of Historical Linguistics 9/1: 1–8., Jul 1, 2019
There are important reasons to be sceptical of the accuracy and usefulness of the family-tree model in historical linguistics. That model assumes that every linguistic innovation applies to a language considered as an undifferentiated whole, a point with no “width”. But this assumption makes it impossible to use a tree to model the partial diffusion of an innovation within a language community (“internal diffusion”), or the diffusion of an innovation across language communities (“external diffusion”). These limitations have long been noticed by historical linguists (Schmidt 1872, Schuchardt 1900); but they become glaringly obvious in the cases discussed by Ross (1988) and François (2014) under the heading of “linkages” – i.e., language families that arise through the diversification, in situ, of a dialect network. The articles in this special issue all contribute towards addressing this problem, from a range of perspectives.
Since the beginnings of historical linguistics, the family tree has been the most widely accepted model for representing historical relations between languages. While this sort of representation is easy to grasp, and allows for a simple, attractive account of the development of a language family, the assumptions made by the tree model are applicable in only a small number of cases: namely, when a speaker population undergoes successive splits, with subsequent loss of contact among subgroups. A tree structure is unsuited for dealing with dialect continua, as well as language families that develop out of dialect continua (for which Ross 1988 uses the term “linkage”); in these situations, the scopes of innovations (in other words, their isoglosses) are not nested, but rather they persistently intersect, so that any proposed tree representation is met with abundant counterexamples. In this paper, we define “Historical Glottometry”, a new method capable of identifying and representing genealogical subgroups even when they intersect. Finally, we apply this glottometric method to a specific linkage, consisting of 17 Oceanic languages spoken in northern Vanuatu.
Kalyan, Siva and Alexandre François. 2018. Freeing the Comparative Method from the tree model: A framework for Historical Glottometry. In Ritsuko Kikusawa & Lawrence Reid (eds), _Let's talk about trees: Genetic Relationships of Languages and Their Phylogenic Representation_ (Senri Ethnological Studies, 98). Ōsaka: National Museum of Ethnology. 59–89.
Usage-based models of language propose that the acceptability of an element in a constructional slot is determined by its similarity to attested fillers of that slot (Bybee 2010, ch. 4). However, Ambridge and Goldberg (2008) find that the acceptability of a long-distance-dependency (LDD) question does not correlate with the judged similarity of the matrix verb to think and say, which are by far the most frequently attested fillers of this slot. They propose instead that the acceptability of LDD questions is determined by the degree of fit between the information-structure properties of the matrix verb and those specified by the construction—specifically, the degree to which the matrix verb foregrounds its complement clause. This paper explores the possibility of reconciling this explanation with one based on similarity by suggesting that in this case the relevant aspect of similarity is precisely the verb’s foregrounding of its complement. Evidence for this suggestion comes from psychological research showing that in a categorization task, the similarity of an item to the exemplars of a category is judged primarily with respect to the features common to all category members, as well as from the observation that virtually all attested matrix verbs in LDD questions strongly tend to foreground their complements.
This map visualises distances among the grammars of 910 of the world's languages using colours (in the manner of dialectometry). Distances are computed on the basis of the morphosyntax features in the World Phonotactics Database (Donohue et al. 2013), with the features weighted so as to control for dependencies (following Hammarström & O'Connor 2013). The distances are then subjected to multidimensional scaling in 3 dimensions, and the axes are mapped to red, green and blue.
Individual colours are not directly interpretable; however, warm colours (red, yellow etc.) correspond to predominantly head-initial languages, and cool colours (blue, green etc.) correspond to predominantly head-final languages.
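The axis-to-colour step can be sketched as follows, under the assumption that each MDS axis is linearly rescaled to the 0–255 range (the exact rescaling used for the map is not stated):

```python
def mds_to_rgb(coords):
    """Rescale each of the three MDS axes to [0, 255] and read the
    result off as one (R, G, B) colour per language."""
    axes = list(zip(*coords))
    mins = [min(axis) for axis in axes]
    spans = [(max(axis) - lo) or 1.0 for axis, lo in zip(axes, mins)]
    return [tuple(int(round(255 * (v - lo) / span))
                  for v, lo, span in zip(row, mins, spans))
            for row in coords]

# Three languages at the corners and centre of the embedded cloud:
colours = mds_to_rgb([(0, 0, 0), (1, 2, 4), (0.5, 1, 2)])
# the midpoint language comes out mid-grey: (128, 128, 128)
```

Languages close together in the 3D MDS solution thus receive similar colours, which is what makes typological clusters visible on the map.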
This map visualises distances among the sound systems of the world's languages using colours (in the manner of dialectometry). Distances are computed on the basis of the phonological features in the World Phonotactics Database (Donohue et al. 2013), with the features weighted so as to control for dependencies (following Hammarström & O'Connor 2013). The distances are then subjected to multidimensional scaling in 3 dimensions, and the axes are mapped to red, green and blue.
The main division visible is between Australia and the rest of the world.
When searching for language universals in linguistic typology, it is important to choose a sample of the world’s languages that is representative of many different linguistic areas; otherwise, one risks mistaking properties that have spread by contact or inheritance for genuine universal tendencies in human language. In order to avoid this problem, one needs to be able to identify linguistic areas in a consistent and objective manner. This has never been done: as Hammarström & Güldemann (2014: 95–96) note, “Unfortunately, there is no (near-complete) list of established areas…and even partial lists contain uncertainties concerning their delimitation”. This presentation proposes one way of addressing this issue. The dataset used for this project was the WALS subset of the World Phonotactics Database (Donohue et al. 2013), which contains the 257 morphosyntax features from the World Atlas of Language Structures (Dryer & Haspelmath 2013), coded for 1601 languages, 943 of which have non-blank values for more than 50% of the features. The first step was to compute pairwise typological distances among these 943 languages; this was done using the Gower coefficient, weighting the features in a way that accounts for feature dependencies (following the procedure suggested in Hammarström & O’Connor 2013). Figure 1 shows the resulting typological distances, transformed into a 3D MDS solution which is then interpreted as RGB vectors (as in dialectometric studies such as Goebl 2006 and Szmrecsányi 2011), and displayed on a map using a Voronoi tessellation.
The next step in the identification of linguistic areas was to define a measure of “distance” between languages that would reflect typological distance for languages that are close together, and geographic distance for languages that are far apart; this would ensure that geographically close languages are clustered together only if they are also typologically similar, and geographically distant languages are rarely clustered together, regardless of typological similarity. This new distance measure was constructed as follows: First, a geographic adjacency graph was computed, where two languages A and B are linked if and only if there is no third language C which is closer to A and B than they are to each other (following the procedure in Hammarström & Güldemann 2014: 101–102). Then, the “areal distance” between A and B was defined as equal to their typological distance if they were either immediately adjacent or 2 steps apart; if they were more than 2 steps apart, then their areal distance was equal to the sum of typological distances along the shortest path between A and B that connects only geographically-adjacent languages. These areal distances were used to cluster the 943 languages into 47 clusters with an average of about 20 languages each; these clusters were then displayed within the Voronoi tessellation computed earlier. Figure 2 shows a sample cluster, which is clearly a highly plausible candidate for a linguistic area. Likewise, all the clusters that were found are geographically contiguous (by construction), and most of them can be readily identified with established linguistic areas, or with language families or subgroups.
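The two-case definition of areal distance maps directly onto a shortest-path computation over the adjacency graph. A sketch, where `adjacency`, `typ_dist` (symmetric pairwise typological distances) and `steps` (adjacency-graph step counts) are hypothetical precomputed structures:

```python
import heapq

def areal_distance(a, b, adjacency, typ_dist, steps):
    """Typological distance directly if a and b are at most 2 adjacency
    steps apart; otherwise the minimum, over paths through geographically
    adjacent languages, of the summed typological distances (Dijkstra)."""
    if steps[(a, b)] <= 2:
        return typ_dist[(a, b)]
    best = {a: 0.0}
    heap = [(0.0, a)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == b:
            return d
        if d > best.get(u, float("inf")):
            continue  # stale queue entry
        for v in adjacency[u]:
            nd = d + typ_dist[(u, v)]
            if nd < best.get(v, float("inf")):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")  # b not reachable through the adjacency graph
```

The effect is that typologically divergent neighbours act as a barrier: a long chain of small typological distances stays short, while even one dissimilar intermediate language inflates the areal distance.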
Since its popularization by August Schleicher (1853, 1873), the most common way of representing a language family has been as a tree. Tree representations are highly amenable to computational methods (as in biology), and they are sufficiently restrictive that they allow one to deduce the features that characterize each clade. The tree model has well-recognised limitations (Saussure 1995 [1916], Gray et al. 2010, etc.); in particular, it assumes that the language family evolved primarily by successive demographic splits, and so is incapable of representing (former) dialect networks. However, while other ways of representing a language family have occasionally been proposed, starting from Schmidt’s (1872) “Wave Model”, few of these alternatives are as precisely specified or as restrictive as the tree model. In this presentation, we describe a computational method that allows us to represent the structure of a language family without assuming that it evolved by successive splits, and in such a way that we can identify the exact features that characterize each subgroup (allowing that subgroups may overlap). This method is correspondence analysis. Correspondence analysis (also known as optimal scaling, or dual scaling) allows one to take a binary matrix tabulating the presence or absence of m features in each of n items, and produce a solution in min(m – 1, n – 1) dimensions in which there are points that represent the items as well as points that represent the features. Crucially, the point representing each item is close to the points representing the features it exhibits; likewise, the point representing each feature is close to the points representing the items that exhibit it. Correspondence analysis has been successfully used in social network analysis as a representation of bipartite “affiliation” networks (Wasserman & Faust 1994: 334–342).
We applied correspondence analysis (using the FactoMineR package in R: Lê et al. 2008) to a database of “plausible cognacy” judgments for a selection of 26 Germanic languages. In our data, each column represents a language (e.g. English, German, etc.), and each row represents a word that looks similar across at least some of the languages (e.g. water, Wasser, etc.), whether due to shared inheritance or borrowing. We would expect the clusters in the correspondence analysis solution to correspond to genealogical subgroups or to contact zones. Indeed, this is what we find. As seen in Figure 1, the languages divide neatly into the well-established North Germanic (Scandinavian), East Germanic (Gothic) and West Germanic subgroups, as well as a cluster consisting of English and Scots (i.e. languages of the British Isles). In addition, we can see that within each subgroup, there are “central” and “peripheral” members—e.g. Danish (a North Germanic language) approaches the West Germanic languages. Most importantly, these language clusters correspond directly to clusters of cognate sets, which means that (unlike with many other alternatives to the tree model), we have immediate access to the evidence that supports each cluster.
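The analysis itself was run with FactoMineR in R; the core computation, however, is an SVD of the matrix of standardized residuals. A bare-bones Python sketch (not a reimplementation of FactoMineR), where rows are cognate sets and columns are languages:

```python
import numpy as np

def correspondence_analysis(N, n_dims=2):
    """Correspondence analysis of a non-negative matrix N (here: binary
    cognate-set x language incidences). Returns principal coordinates
    for rows and columns; associated rows and columns plot close together."""
    P = N / N.sum()
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    rows = U[:, :n_dims] * s[:n_dims] / np.sqrt(r)[:, None]
    cols = Vt.T[:, :n_dims] * s[:n_dims] / np.sqrt(c)[:, None]
    return rows, cols
```

On a toy two-block incidence matrix, the first axis separates the two blocks, with each language landing on the same side as the cognate sets it attests; this is the property that gives direct access to the evidence behind each cluster.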
When searching for language universals in linguistic typology, it is important to choose a sample of the world’s languages that is representative of many different linguistic areas; otherwise, one risks mistaking properties that have spread by contact or inheritance for genuine universal tendencies in human language. In order to avoid this problem, one needs to be able to identify linguistic areas in a consistent and objective manner. This has never been done: as Hammarström & Güldemann (2014: 95–96) note, “Unfortunately, there is no (near-complete) list of established areas…and even partial lists contain uncertainties concerning their delimitation”. This presentation proposes one way of addressing this issue. The dataset used for this project was the WALS subset of the World Phonotactics Database (Donohue et al. 2013), which contains the 257 morphosyntax features from the World Atlas of Language Structures (Dryer & Haspelmath 2013), coded for 1601 languages, 943 of which have non-blank values for more than 50% of the features. The first step was to compute pairwise typological distances among these 943 languages; this was done using the Gower coefficient, weighting the features in a way that accounts for feature dependencies (following the procedure suggested in Hammarström & O’Connor 2013). Figure 1 shows the resulting typological distances, transformed into a 3D MDS solution which is then interpreted as RGB vectors (as in dialectometric studies such as Goebl 2006 and Szmrecsányi 2011), and displayed on a map using a Voronoi tessellation.
The next step in the identification of linguistic areas was to define a measure of “distance” between languages that would reflect typological distance for languages that are close together, and geographic distance for languages that are far apart; this would ensure that geographically close languages are clustered together only if they are also typologically similar, and geographically distant languages are rarely clustered together, independently of typological similarity. This new distance measure was constructed as follows: First, a geographic adjacency graph was computed, where two languages A and B are linked if and only if there are no more than two other languages which lie within the circle whose diameter is the line segment AB. (This is thus a variant of the “Gabriel graph”: see Gabriel & Sokal 1969.) Then, the “areal distance” between A and B was defined as equal to their typological distance if they were either immediately adjacent or 2 steps apart; if they were more than 2 steps apart, then their areal distance was equal to the sum of typological distances along the shortest path between A and B that connects only geographically-adjacent languages. These areal distances were used to cluster the 943 languages into 47 clusters with an average of about 20 languages each; these clusters were then displayed within the Voronoi tessellation computed earlier. Figures 2–4 show some sample clusters. The areal clusters found are all (by construction) geographically contiguous; further, most of them can be readily identified with established linguistic areas, or with language families or subgroups.
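The adjacency criterion (a relaxed Gabriel graph) reduces to a point-in-circle test: a point c lies strictly inside the circle with diameter ab exactly when the angle acb is obtuse, i.e. when |ac|² + |cb|² < |ab|². A sketch assuming planar (x, y) coordinates; real language locations would call for great-circle geometry:

```python
def linked(a, b, others, k=2):
    """True iff at most k of the other languages fall inside the circle
    whose diameter is the segment ab (the text's Gabriel-graph variant;
    the classic Gabriel graph is the special case k = 0)."""
    def sqdist(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    inside = sum(1 for c in others if sqdist(a, c) + sqdist(c, b) < sqdist(a, b))
    return inside <= k

# One point inside the circle on (0,0)-(4,0), one outside: still linked.
ok = linked((0, 0), (4, 0), [(2, 1), (2, 3)])  # True
```

Allowing up to two intervening languages keeps the graph connected in dense regions, where the strict Gabriel condition would prune too many edges.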
Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change, 2023
Polysemies, or "colexifications", are of great interest in cognitive and historical linguistics, ... more Polysemies, or "colexifications", are of great interest in cognitive and historical linguistics, since meanings that are frequently expressed by the same lexeme are likely to be conceptually similar, and lie along a common pathway of semantic change. We argue that these types of inferences can be more reliably drawn from polysemies of cognate sets (which we call "dialexifications") than from polysemies of lexemes. After giving a precise definition of dialexification, we introduce EvoSem, a cross-linguistic database of etymologies scraped from several online sources. Based on this database (publicly available at http://tiny.cc/EvoSem), we measure for each pair of senses how many cognate sets include them both-i.e. how often this pair of senses is "dialexified". This allows us to construct a weighted dialexification graph for any set of senses, indicating the conceptual and historical closeness of each pair. We also present an online interface for browsing our database, including graphs and interactive tables. We then discuss potential applications to NLP tasks and to linguistic research.
We examine a database of 3089 languages coded for 351 morphosyntactic features, including almost ... more We examine a database of 3089 languages coded for 351 morphosyntactic features, including almost all of the morphosyntactic features found in The World Atlas of Language Structures (Dryer & Haspelmath 2013). We apply Factor Analysis of Mixed Data, and determine that the main dimensions of global morphological variation involve (1) word order in clauses and adpositional phrases, (2) head-versus dependent-marking, and (3) a set of features that show an east-west distribution. We find roughly the same features clustering in similar dimensions when we examine individual macro-areas, thus confirming the universal relevance of these groupings of features, as encapsulated in well-known implicational universals. This study confirms established insights in linguistic typology, extending earlier research to a much larger set of languages, and uncovers a number of areal patterns in the data.
Light Warlpiri is a newly emerged Australian mixed language that systematically combines nominal ... more Light Warlpiri is a newly emerged Australian mixed language that systematically combines nominal structure from Warlpiri (Australian, Pama-Nyungan) with verbal structure from Kriol (an English-lexified Creole) and English, with additional innovations in the verbal auxiliary system. Lexical items are drawn from both Warlpiri and the two English-lexified sources, Kriol and English. The Light Warlpiri verb system is interesting because of questions raised about how it combines elements of its sources. Most verb stems are derived from Kriol or English, but Warlpiri stems also occur, with reanalysis, and stems of either source host Kriol-derived transitive marking (e.g., hit-im ‘hit-TR’). Transitive marking is productive but also variable. In this paper, we examine transitivity and its marking on Light Warlpiri verbs, drawing on narrative data from an extensive corpus of adult speech. The study finds that transitive marking on verbs in Light Warlpiri is conditioned by six of Hopper and T...
As Virtual Reality emerges as an accessible technology, researchers have begun to experiment with... more As Virtual Reality emerges as an accessible technology, researchers have begun to experiment with its potential for data visualisation. In this poster we will highlight some conceptual issues around using VR and other immersive 3D technologies to explore the morphosyntactic data from the World Phonotactics Database (Donohue et al 2013), which is an expansion of the World Atlas of Language Structures (Dryer & Haspelmath 2013). While typological data has an inherent geographical dimension, mapping the space to other features of the data instead illuminate different sorts of clusters and structures. Two-dimensional graphs or visualisations, however, can be difficult to interpret, as information is static and densely layered. In the visualization discussed here, WALS/WPD languages are subjected to multidimensional scaling: essentially the number of features on which any two languages differ are counted, and the total is divided by the number of features that are known for both. A 3D (x,y,z) location is found for each language such that these pairwise distances are represented spatially. The result is a three-dimensional scatterplot. In a VR headset, the user is located at (0,0,0) in the midst of the cloud of points, and can look around. It can also be viewed in a pseudo-3D environment using WebGL in a desktop browser. The scatterplot can map any csv datafile, as long as there are columns for (x,y,z) coordinates as well as (optionally) columns of features that can be used for colouring the points. This therefore has potential uses beyond linguistics as well.
One of the main advantages of cognitive linguistics (and in particular Cognitive Grammar) over ot... more One of the main advantages of cognitive linguistics (and in particular Cognitive Grammar) over other approaches to the study of language structure is the fact that every descriptive construct is defined in psychological terms. This means, ideally, that any cognitive linguistic description of a word or grammatical construction constitutes a hypothesis about the mental representation of that structure. It should thus be possible to verify such descriptions, or to decide between competing analyses of a phenomenon, by experimentally testing the hypotheses that they entail. Such tests have been rare, however, due to the difficulty of operationalising many of the semantic notions used in Cognitive Grammar. The present thesis reports on attempts to operationalise and test (using questionnaires, production tasks, and reaction time measurements) four descriptive claims formulated in the framework of Cognitive Grammar: that finite complementation constructions are headed by the complement-tak...
Siva Kalyan & Alexandre François. 2019. When the waves meet the trees: A response to Jacques & List. In Siva Kalyan, Alexandre François & Harald Hammarström (eds), Understanding language genealogy: Alternatives to the tree model. Special issue of Journal of Historical Linguistics 9/1: 167–176., 2019
This special issue of the _Journal of Historical Linguistics_, titled “Understanding language genealogy: Alternatives to the tree model”, was created with the idea of opening a constructive debate around the issues raised by the Tree model. We (Kalyan and François) had previously published articles highlighting the shortcomings of the tree for representing language history, and promoting an approach – called Historical Glottometry – based on the Wave model (François 2014, 2017; Kalyan & François 2018). We invited two historical linguists, Guillaume Jacques and Johann-Mattis List, as discussants in our volume; their chapter “Save the trees: Why we need tree models in linguistic reconstruction” [doi:10.1075/jhl.17008.mat] argued in favour of the Tree model – and this is our final response. In this response article, we review our apparently conflicting perspectives, but endeavour to find common ground between them – hence the title “when the waves meet the trees”. We show that our differences are partly due to distinct definitions of key concepts (subgroup, shared innovations…). We also show that the argument of “Incomplete lineage sorting”, which they use to defend the tree, could equally well defend the wave model, since it can capture the key notion of intersecting innovations that is so prevalent in linkages, and so problematic in the traditional tree approach. The final part of our response shows how Historical Glottometry (HG) provides a way to reconstruct the historical process of ‘linkage breaking’, whereby a dialect continuum breaks progressively into separate languages. All in all, our two approaches are fundamentally compatible – even though we find the Wave model, ultimately, to be more realistic.
Siva Kalyan, Alexandre François & Harald Hammarström. 2019. Problems with, and alternatives to, the tree model in historical linguistics. In S. Kalyan, A. François & H. Hammarström (eds), Understanding language genealogy: Alternatives to the tree model. Journal of Historical Linguistics 9/1: 1–8.
There are important reasons to be sceptical of the accuracy and usefulness of the family-tree model in historical linguistics. That model assumes that every linguistic innovation applies to a language considered as an undifferentiated whole, a point with no “width”. But this assumption makes it impossible to use a tree to model the partial diffusion of an innovation within a language community (“internal diffusion”), or the diffusion of an innovation across language communities (“external diffusion”). These limitations have long been noticed by historical linguists (Schmidt 1872, Schuchardt 1900); but they become glaringly obvious in the cases discussed by Ross (1988) and François (2014) under the heading of “linkages” – i.e., language families that arise through the diversification, in situ, of a dialect network. The articles in this special issue all contribute towards addressing this problem, from a range of perspectives.
Since the beginnings of historical linguistics, the family tree has been the most widely accepted model for representing historical relations between languages. While this sort of representation is easy to grasp, and allows for a simple, attractive account of the development of a language family, the assumptions made by the tree model are applicable in only a small number of cases: namely, when a speaker population undergoes successive splits, with subsequent loss of contact among subgroups. A tree structure is unsuited for dealing with dialect continua, as well as language families that develop out of dialect continua (for which Ross 1988 uses the term “linkage”); in these situations, the scopes of innovations (in other words, their isoglosses) are not nested, but rather they persistently intersect, so that any proposed tree representation is met with abundant counterexamples. In this paper, we define “Historical Glottometry”, a new method capable of identifying and representing genealogical subgroups even when they intersect. Finally, we apply this glottometric method to a specific linkage, consisting of 17 Oceanic languages spoken in northern Vanuatu.
Kalyan, Siva and Alexandre François. 2018. Freeing the Comparative Method from the tree model: A framework for Historical Glottometry. In Ritsuko Kikusawa & Lawrence Reid (eds), _Let's talk about trees: Genetic Relationships of Languages and Their Phylogenic Representation_ (Senri Ethnological Studies, 98). Ōsaka: National Museum of Ethnology. 59–89.
Usage-based models of language propose that the acceptability of an element in a constructional slot is determined by its similarity to attested fillers of that slot (Bybee 2010, ch. 4). However, Ambridge and Goldberg (2008) find that the acceptability of a long-distance-dependency (LDD) question does not correlate with the judged similarity of the matrix verb to think and say, which are by far the most frequently attested fillers of this slot. They propose instead that the acceptability of LDD questions is determined by the degree of fit between the information-structure properties of the matrix verb and those specified by the construction—specifically, the degree to which the matrix verb foregrounds its complement clause. This paper explores the possibility of reconciling this explanation with one based on similarity by suggesting that in this case the relevant aspect of similarity is precisely the verb’s foregrounding of its complement. Evidence for this suggestion comes from psychological research showing that in a categorization task, the similarity of an item to the exemplars of a category is judged primarily with respect to the features common to all category members, as well as from the observation that virtually all attested matrix verbs in LDD questions strongly tend to foreground their complements.
This map visualises distances among the grammars of 910 of the world's languages using colours (in the manner of dialectometry). Distances are computed on the basis of the morphosyntax features in the World Phonotactics Database (Donohue et al. 2013), with the features weighted so as to control for dependencies (following Hammarström & O'Connor 2013). The distances are then subjected to multidimensional scaling in 3 dimensions, and the axes are mapped to red, green and blue.
Individual colours are not directly interpretable; however, warm colours (red, yellow etc.) correspond to predominantly head-initial languages, and cool colours (blue, green etc.) correspond to predominantly head-final languages.
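The axis-to-colour mapping can be sketched as follows. This is a simplified illustration under the assumption that each MDS axis is independently rescaled to the full colour range; the toy coordinates are invented.

```python
import numpy as np

def coords_to_rgb(coords):
    """Rescale each of the three MDS axes to [0, 255] and read them off
    as the red, green and blue channels of each language's colour."""
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against a degenerate axis
    scaled = (coords - lo) / span * 255
    return [tuple(int(round(v)) for v in row) for row in scaled]

# Hypothetical 3D MDS coordinates for three languages
coords = np.array([[0.0, 1.0, 0.5],
                   [1.0, 0.0, 0.5],
                   [0.5, 0.5, 0.0]])
colours = coords_to_rgb(coords)
```

Because each axis is stretched independently, only relative colour differences are meaningful, which matches the caveat above that individual colours are not directly interpretable.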
This map visualises distances among the sound systems of the world's languages using colours (in the manner of dialectometry). Distances are computed on the basis of the phonological features in the World Phonotactics Database (Donohue et al. 2013), with the features weighted so as to control for dependencies (following Hammarström & O'Connor 2013). The distances are then subjected to multidimensional scaling in 3 dimensions, and the axes are mapped to red, green and blue.
The main division visible is between Australia and the rest of the world.
When searching for language universals in linguistic typology, it is important to choose a sample of the world’s languages that is representative of many different linguistic areas; otherwise, one risks mistaking properties that have spread by contact or inheritance for genuine universal tendencies in human language. In order to avoid this problem, one needs to be able to identify linguistic areas in a consistent and objective manner. This has never been done: as Hammarström & Güldemann (2014: 95–96) note, “Unfortunately, there is no (near-complete) list of established areas…and even partial lists contain uncertainties concerning their delimitation”. This presentation proposes one way of addressing this issue. The dataset used for this project was the WALS subset of the World Phonotactics Database (Donohue et al. 2013), which contains the 257 morphosyntax features from the World Atlas of Language Structures (Dryer & Haspelmath 2013), coded for 1601 languages, 943 of which have non-blank values for more than 50% of the features. The first step was to compute pairwise typological distances among these 943 languages; this was done using the Gower coefficient, weighting the features in a way that accounts for feature dependencies (following the procedure suggested in Hammarström & O’Connor 2013). Figure 1 shows the resulting typological distances, transformed into a 3D MDS solution which is then interpreted as RGB vectors (as in dialectometric studies such as Goebl 2006 and Szmrecsányi 2011), and displayed on a map using a Voronoi tessellation.
The next step in the identification of linguistic areas was to define a measure of “distance” between languages that would reflect typological distance for languages that are close together, and geographic distance for languages that are far apart; this would ensure that geographically close languages are clustered together only if they are also typologically similar, and geographically distant languages are rarely clustered together, regardless of typological similarity. This new distance measure was constructed as follows: First, a geographic adjacency graph was computed, where two languages A and B are linked if and only if there is no third language C which is closer to A and B than they are to each other (following the procedure in Hammarström & Güldemann 2014: 101–102). Then, the “areal distance” between A and B was defined as equal to their typological distance if they were either immediately adjacent or 2 steps apart; if they were more than 2 steps apart, then their areal distance was equal to the sum of typological distances along the shortest path between A and B that connects only geographically-adjacent languages. These areal distances were used to cluster the 943 languages into 47 clusters with an average of about 20 languages each; these clusters were then displayed within the Voronoi tessellation computed earlier. Figure 2 shows a sample cluster, which is clearly a highly plausible candidate for a linguistic area. Likewise, all the clusters that were found are geographically contiguous (by construction), and most of them can be readily identified with established linguistic areas, or with language families or subgroups.
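The areal-distance definition above can be sketched as follows. This is a toy illustration: the adjacency graph and distances are invented, and `typ` stands in for the precomputed pairwise typological distance matrix.

```python
import heapq
from collections import deque

def hop_count(adj, a, b):
    """Number of edges on a shortest path from a to b (breadth-first search)."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        if node == b:
            return d
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return float("inf")

def path_cost(adj, typ, a, b):
    """Dijkstra over the adjacency graph, with each edge weighted by the
    typological distance between the two adjacent languages."""
    best = {a: 0.0}
    heap = [(0.0, a)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == b:
            return d
        if d > best.get(node, float("inf")):
            continue
        for nxt in adj[node]:
            nd = d + typ[frozenset((node, nxt))]
            if nd < best.get(nxt, float("inf")):
                best[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return float("inf")

def areal_distance(adj, typ, a, b):
    """As defined in the abstract: the typological distance if a and b are
    at most 2 steps apart, otherwise the sum of typological distances
    along the shortest geographically-adjacent path."""
    if hop_count(adj, a, b) <= 2:
        return typ[frozenset((a, b))]
    return path_cost(adj, typ, a, b)

# Toy chain of four languages: A - B - C - D
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
typ = {frozenset(p): d for p, d in [
    (("A", "B"), 0.1), (("B", "C"), 0.2), (("C", "D"), 0.1),
    (("A", "C"), 0.5), (("B", "D"), 0.5), (("A", "D"), 0.9),
]}
```

In this toy graph, A and C (2 steps apart) keep their direct typological distance, while A and D (3 steps apart) are assigned the summed distance along the chain, so typologically divergent but geographically distant pairs do not end up spuriously close.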
Since its popularization by August Schleicher (1853, 1873), the most common way of representing a language family has been as a tree. Tree representations are highly amenable to computational methods (as in biology), and they are sufficiently restrictive that they allow one to deduce the features that characterize each clade. The tree model has well-recognised limitations (Saussure 1995 [1916], Gray et al. 2010, etc.); in particular, it assumes that the language family evolved primarily by successive demographic splits, and so is incapable of representing (former) dialect networks. However, while other ways of representing a language family have occasionally been proposed, starting from Schmidt’s (1872) “Wave Model”, few of these alternatives are as precisely-specified or as restrictive as the tree model. In this presentation, we describe a computational method that allows us to represent the structure of a language family without assuming that it evolved by successive splits, and in such a way that we can identify the exact features that characterize each subgroup (allowing that subgroups may overlap). This method is correspondence analysis. Correspondence analysis (also known as optimal scaling, or dual scaling) allows one to take a binary matrix tabulating the presence or absence of m features in each of n items, and produce a solution in min(m – 1, n – 1) dimensions in which there are points that represent the items as well as points that represent the features. Crucially, the point representing each item is close to the points representing the features it exhibits; likewise, the point representing each feature is close to the points representing the items that exhibit it. Correspondence analysis has been successfully used in social network analysis as a representation of bipartite “affiliation” networks (Wasserman & Faust 1994: 334–342).
We applied correspondence analysis (using the FactoMineR package in R: Lê et al. 2008) to a database of “plausible cognacy” judgments for a selection of 26 Germanic languages. In our data, each column represents a language (e.g. English, German, etc.), and each row represents a word that looks similar across at least some of the languages (e.g. water, Wasser, etc.), whether due to shared inheritance or borrowing. We would expect the clusters in the correspondence analysis solution to correspond to genealogical subgroups or to contact zones. Indeed, this is what we find. As seen in Figure 1, the languages divide neatly into the well-established North Germanic (Scandinavian), East Germanic (Gothic) and West Germanic subgroups, as well as a cluster consisting of English and Scots (i.e. languages of the British Isles). In addition, we can see that within each subgroup, there are “central” and “peripheral” members—e.g. Danish (a North Germanic language) approaches the West Germanic languages. Most importantly, these language clusters correspond directly to clusters of cognate sets, which means that (unlike with many other alternatives to the tree model), we have immediate access to the evidence that supports each cluster.
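The abstract's analysis was done with FactoMineR in R; an equivalent plain correspondence analysis can be sketched in Python via an SVD of the standardised residuals. The toy cognacy matrix below is invented purely to show the mechanics.

```python
import numpy as np

def correspondence_analysis(N, dims=2):
    """Plain correspondence analysis of a non-negative items-by-features
    matrix, via SVD of the matrix of standardised residuals. Returns
    principal coordinates for rows and columns; rows and columns that
    co-occur end up close together in the solution."""
    P = N / N.sum()
    r = P.sum(axis=1)                          # row masses
    c = P.sum(axis=0)                          # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U * sv) / np.sqrt(r)[:, None]
    cols = (Vt.T * sv) / np.sqrt(c)[:, None]
    return rows[:, :dims], cols[:, :dims]

# Invented cognacy matrix: rows are words, columns are
# [English, Scots, German, Dutch]; 1 = the languages share the form.
N = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1],
              [1, 1, 1, 1]], dtype=float)
words, languages = correspondence_analysis(N)
```

In this toy solution, English and Scots (identical cognacy profiles) receive identical coordinates, while the English/Scots and German/Dutch pairs are pushed apart along the first dimension, mirroring the subgroup clusters the abstract describes.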
When searching for language universals in linguistic typology, it is important to choose a sample of the world’s languages that is representative of many different linguistic areas; otherwise, one risks mistaking properties that have spread by contact or inheritance for genuine universal tendencies in human language. In order to avoid this problem, one needs to be able to identify linguistic areas in a consistent and objective manner. This has never been done: as Hammarström & Güldemann (2014: 95–96) note, “Unfortunately, there is no (near-complete) list of established areas…and even partial lists contain uncertainties concerning their delimitation”. This presentation proposes one way of addressing this issue. The dataset used for this project was the WALS subset of the World Phonotactics Database (Donohue et al. 2013), which contains the 257 morphosyntax features from the World Atlas of Language Structures (Dryer & Haspelmath 2013), coded for 1601 languages, 943 of which have non-blank values for more than 50% of the features. The first step was to compute pairwise typological distances among these 943 languages; this was done using the Gower coefficient, weighting the features in a way that accounts for feature dependencies (following the procedure suggested in Hammarström & O’Connor 2013). Figure 1 shows the resulting typological distances, transformed into a 3D MDS solution which is then interpreted as RGB vectors (as in dialectometric studies such as Goebl 2006 and Szmrecsányi 2011), and displayed on a map using a Voronoi tessellation.
The next step in the identification of linguistic areas was to define a measure of “distance” between languages that would reflect typological distance for languages that are close together, and geographic distance for languages that are far apart; this would ensure that geographically close languages are clustered together only if they are also typologically similar, and geographically distant languages are rarely clustered together, independently of typological similarity. This new distance measure was constructed as follows: First, a geographic adjacency graph was computed, where two languages A and B are linked if and only if there are no more than two other languages which lie within the circle whose diameter is the line segment AB. (This is thus a variant of the “Gabriel graph”: see Gabriel & Sokal 1969.) Then, the “areal distance” between A and B was defined as equal to their typological distance if they were either immediately adjacent or 2 steps apart; if they were more than 2 steps apart, then their areal distance was equal to the sum of typological distances along the shortest path between A and B that connects only geographically-adjacent languages. These areal distances were used to cluster the 943 languages into 47 clusters with an average of about 20 languages each; these clusters were then displayed within the Voronoi tessellation computed earlier. Figures 2–4 show some sample clusters. The areal clusters found are all (by construction) geographically contiguous; further, most of them can be readily identified with established linguistic areas, or with language families or subgroups.
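The variant Gabriel-graph adjacency criterion can be sketched as follows. A point lies inside the circle whose diameter is the segment AB exactly when it is closer to the segment's midpoint than half the segment's length; planar coordinates are used here as a simplification of real geographic positions, and the toy points are invented.

```python
import math

def inside_diameter_circle(p, a, b):
    """True if point p lies strictly inside the circle whose diameter
    is the segment ab (planar coordinates, a simplification of real
    geographic positions)."""
    centre = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
    return math.dist(p, centre) < math.dist(a, b) / 2

def adjacent(a, b, others, max_inside=2):
    """The variant criterion from the abstract: a and b are linked iff
    no more than `max_inside` other languages fall inside the circle
    with diameter ab (max_inside=0 gives the classic Gabriel graph)."""
    return sum(inside_diameter_circle(p, a, b) for p in others) <= max_inside

a, b = (0.0, 0.0), (4.0, 0.0)
crowded = [(2.0, 1.0), (2.0, -1.0), (1.0, 0.5)]  # three languages in between
sparse = [(2.0, 1.0), (10.0, 10.0)]              # only one in between
```

With three intervening languages, `a` and `b` are not linked; with at most two, they are.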
One task of a typologist is to arrive at a sample of languages that provides an adequate basis for drawing generalisations about Language. Typically, typologists try to choose languages to allow an even representation of language families, or else to allow an even representation of geographical areas. The trouble is that these two criteria are often in opposition: if we balanced a world sample by language family, Australia would be underrepresented. If we balanced the same sample by geography, Pama-Nyungan would be overrepresented. One way to get around the problems of balancing areal and genealogical criteria is to ignore areas and genealogy completely, and simply focus on the distribution of languages in terms of their typological features. Then we can define a “representative sample” of languages as a set of languages that (a) is “maximally dispersed” through the feature space (in order to maximally represent diversity), and (b) is such that no two languages are “too close together” (in order to avoid over-representing very similar language types). Dahl (2008) makes an initial attempt at defining such an a posteriori approach to typological sampling, by simply computing the distance between every pair of languages, and then for every pair of languages that is “too close”, removing the one that is less well-described. However, this approach is limited by two factors: Firstly, it is deterministic, i.e., it produces only one sample for a given input set of languages, even though intuitively there should be many samples that are comparably good.
Secondly, the threshold for languages being “too close” is arbitrarily determined by the desired sample size; in fact, we would like to argue that the sample size (and the criterion for when two languages are “too close”) should be determined by the inherent variation in the total set of languages. We propose that a typologically representative sample can be built in the following way: (1) Rescale the inter-language distances so as to maximise the variation between languages in dense neighbourhoods and languages in sparse neighbourhoods. (2) Pick one language at random, and add it to the sample. (3) Pick another language at random, and decide whether to include it in the sample based on how far (typologically) it is from the language already in the sample (if it’s far away, it’s very likely to be added; if it’s close by, it’s unlikely to be added). Continue picking languages at random, and decide the likelihood of including each one by how well-separated it is from the languages already in the sample. Continue until every language has been considered. We present a posteriori typological samples computed on the basis of two typological feature sets: the morphosyntactic features in WALS (Dryer and Haspelmath 2013), as coded in the extended World Phonotactics Database (Donohue et al. 2013), and the phonological features in the WPD. We show that the number of languages in a morphosyntactically representative sample of the world’s languages is greater than the number of languages in a phonologically representative sample. This shows the economy achieved when using typology to drive typological sampling, and highlights the loci of global linguistic diversity.
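Steps (2)–(3) of the sampling procedure might be sketched as follows. The exponential form of the inclusion probability and the `scale` parameter are assumptions made for this sketch, and the distance-rescaling step (1) is omitted; the toy "languages" are invented.

```python
import math
import random

def representative_sample(languages, dist, scale=1.0, seed=0):
    """Probabilistic a posteriori sampling: visit languages in random
    order, and include each one with a probability that grows with its
    typological distance to the nearest language already in the sample.
    The exponential probability and `scale` are illustrative choices."""
    rng = random.Random(seed)
    order = list(languages)
    rng.shuffle(order)
    sample = [order[0]]
    for lang in order[1:]:
        nearest = min(dist(lang, s) for s in sample)
        p = 1 - math.exp(-nearest / scale)   # far from the sample -> p near 1
        if rng.random() < p:
            sample.append(lang)
    return sample

# Toy example: two clusters of typologically identical languages
coords = {"a1": 0.0, "a2": 0.0, "a3": 0.0, "b1": 10.0, "b2": 10.0}
d = lambda x, y: abs(coords[x] - coords[y])
sample = representative_sample(list(coords), d)
```

Because duplicate languages are at distance 0, their inclusion probability is 0, so the sample can never contain two members of the same cluster; different random seeds yield different but comparably good samples, addressing the determinism objection above.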
In this talk we explore different ways of clustering the languages of Melanesia and Island Southeast Asia. We take a sample of 65 languages, distributed geographically and genealogically. These include 33 Austronesian languages representing most major non-Oceanic subgroups, a number of Papuan languages of Indonesia and New Guinea, and a selection of other languages that might be expected at some point to have had contact with Austronesian languages. We cluster these languages on the basis of (1) judgments of “plausible cognacy” for basic vocabulary lexemes (where potential borrowings are also classified as “cognates”), (2) phonological features from the World Phonotactics Database (Donohue et al. 2013), and (3) morphosyntactic features from WALS (Dryer and Haspelmath 2013). These different types of clustering can be expected to provide clues about both historical and typological patterns, and the process of teasing out the sources of these patterns forms the methodological backbone of our talk. Because of the linkage-like structure of the families involved, we adopted non-cladistic clustering methods to tease apart the data. In particular, we used Correspondence Analysis (Faust 2005, Kalyan and Donohue in preparation), which is unique in allowing us to immediately identify which of the features in our data are responsible for each cluster of languages. Not surprisingly, given the large proportion of Austronesian languages in our data, we consistently find strong lexical evidence for a cluster that contains most Austronesian languages in the sample, but with intriguing disparities on the fringes.
For example, in terms of body-part vocabulary, Tai-Kadai languages (Ong Be, Thai and Zhuang) are surprisingly close to the Austronesian cluster, as is Makasae (Papuan, East Timor); by contrast, Ambai (Austronesian, Cenderawasih Bay) and Atayal (Austronesian, northern Taiwan) are highly divergent from the Austronesian “prototype”. In the domain of “tools”, however, Ambai hews closer to the Austronesian cluster, together with Makasae and a number of Papuan families; but Tai-Kadai is relatively distant. Comparing these patterns with those found when we examine the typological data, we are able to map the linguistic outcomes that reflect different social histories in this part of the world. We briefly present a sample of the features in our data that are directly responsible for the clustering patterns that we see, and speculate on what these may tell us about social histories in Island Southeast Asia.
A long-standing problem in Cognitive Grammar (e.g. Langacker 1987: 187–188) is the characterisation of the relationship between different kinds of salience—a problem which has received renewed attention in recent years (e.g. Langacker forthcoming). This presentation suggests an independently-developed approach, which aims to provide each kind of salience with an operational definition.
Perhaps the most easily-operationalised concept of salience is what is variously known as “accessibility” (Ariel 1990), “activation”, or “givenness” (Chafe 1994). Roughly, this refers to the degree to which an entity is in the “focus of consciousness” of the discourse participants at a particular point in time. It is usually diagnosed by the ease with which the entity can be referred to with a pronoun or an unstressed noun phrase; and in psycholinguistics, it is studied in terms of visual attention and conceptual priming (among many possibilities).
Given the ready availability of “accessibility”/“givenness” as a concept of “salience” that is securely known to be psychologically real, it makes sense to try to relate the many types of salience postulated in Cognitive Grammar to some notion of referential accessibility. This is not a new idea; Langacker (2001) describes possessors, topics, and focal participants (core arguments) as types of “reference points”, using a construct that is also used to describe referential accessibility in the context of pronominal anaphora (van Hoek 1997). This presentation focuses on the notion of “profile”—while also touching on other kinds of salience—and tries to make as explicit as possible the conceptual import of each kind of salience in terms of the dynamics of language processing.
The profile of an expression is that element of the evoked conceptual content that the expression “designates” (Langacker 1987: 183). Langacker (1987: 188) explicitly suggests that the profile is “activated at a...higher level of intensity” than other elements of the evoked content (e.g. in The plane that descended, the conception of the plane is more active than that of its downward motion). This characterisation appears to be supported by experiments using the visual-world paradigm, which clearly show that hearing a noun phrase directs attention to (in other words, visually activates) the entity that the noun phrase designates.
Yet this cannot be the full story, as it is easy to find cases of noun phrases whose referents are not activated (in the sense that they are not subsequently available as antecedents of pronouns). This is the case not only with “non-referential” NPs (e.g. predicate nominals), but also with NPs that are contained in presuppositions (e.g. I met the woman who fixed my computer. *It...).
However, even in these cases, it seems likely that the entity designated by the NP, though it may not be activated in an absolute sense, nonetheless is more active at the end of the NP than it was at the start. Thus, as a first approximation, it is suggested that the profile of an expression is the entity whose activation increases (even if not maximally) during the course of that expression.
A further complication arises, however, when we turn our attention from noun phrases to finite clauses. While it is true that a clause activates the conception of an event (e.g. Sam washed the windows. It took two hours.), it typically also activates the participants that are coded by its arguments (e.g. ...They hadn’t been cleaned for a year.)—indeed, even more so. Yet we would not want to say that these participants are designated by the clause.
A possible solution to this is to argue that in fact, a relational profile consists of multiple entities: the participant(s) in the relation, as well as an entity that is emblematic of the relationship (e.g. a Davidsonian argument in the case of a finite clause, or a location in the case of a locative PP).
While the above proposals are admittedly speculative, they go some way toward fulfilling the pressing need for operational definitions of semantic notions in cognitive linguistics (Dąbrowska 2009). Such definitions are essential in order for cognitive grammatical analyses to be tested empirically.
The traditional analysis of finite complementation in English (e.g. I know that she left) is that the propositional-attitude verb (e.g. know) is the head or “main verb”, whereas the content clause that follows it (e.g. (that) she left) is a “subordinate clause” (Langacker 1991, Halliday 1994, etc.). This is typically established on the basis of structural parallels between content clauses and direct-object NPs.
However, as shown by Thompson (2002), Verhagen (2005) and others, in spoken discourse it is nearly always the content clause that carries the main informational load of the sentence, and serves to “move the discourse forward”. On the basis of this fact, these researchers argue that the head of a finite-complementation construction is actually the content clause, and that the propositional-attitude verb and its subject are better understood as constituting an evidential or epistemic adverbial modifier.
Dąbrowska (2009) points out that disagreements such as this are caused by the lack of clear operational definitions, in this case of the notion “head”—or “profile determinant”, in the terminology of Cognitive Grammar. In this talk, we propose an operational definition of “profile determinant”, and use it to experimentally investigate whether finite complementation constructions are headed by the propositional-attitude verb or the content clause.
In Cognitive Grammar, the “profile determinant” of a construction is the component which designates the same entity as the entire composite structure (Langacker 1987: 288). (For example, the profile determinant of jar lid is lid (not jar), because jar lid designates a type of lid and not a type of jar.) A consequence of this definition (Langacker 1987: 467) is that the semantic properties of the profile determinant are inherited by the composite structure. (Thus, a jar lid has all the properties of a lid, but not those of a jar.) Applied to finite complementation, this means that if know is the profile determinant of I know that she left, then the sentence designates an event of knowing rather than one of leaving (Langacker 1991: 436), and exhibits the semantic properties of know, but not those of left.
We report on two experiments in which we examine a set of sixteen propositional-attitude verbs, and try to determine, in each case, whether it is the propositional-attitude verb or the content clause that is the profile determinant. In the first experiment, subjects were shown one sentence exemplifying each propositional-attitude verb; after each sentence, they saw a list of twelve features that included four features of the propositional-attitude verb and four features of the main verb of the content clause. (These features had been elicited in a preliminary study.) The subjects’ task was to indicate which features best describe the overall meaning of the sentence. In this experiment all subjects saw the same list of sixteen sentences; the second experiment controlled for the (potential) influence of the content clauses by having sixteen conditions, each with different combinations of propositional-attitude verbs and content clauses.
We found that four of the propositional-attitude verbs in our study (thought, remember, suspect and admitted) tend to be profile-determining, and five (found, realised, says, agreed and believe) are non-profile-determining (in other words, the content clause is the head); the other verbs (knew, announced, showed, saw, suggested, concluded and claimed) did not show a significant effect. We consider possible semantic and syntactic explanations for this pattern of results, and find ourselves forced to conclude that what we have found are simply idiosyncratic properties of these verbs.
We hope, however, to have shown that the concept of a “profile determinant” can indeed be given a meaningful operational definition, and that this can be used to test grammatical analyses experimentally.
This paper is an attempt to subgroup the Pama-Nyungan languages of the Pilbara region of Western Australia, i.e., the languages traditionally assigned to the Mantharta, Kanyara, and Ngayarta groups (WHa, b, and c in Dixon 2002). Subgroups are defined on the basis of shared morphological and lexical innovations, using data from Dench (1994), O’Grady (1966), and Austin (1981). However, rather than assuming that innovation-defined subgroups will be nested in a family tree, I allow for subgroups to overlap, and to have varying levels of “subgroupiness” (as defined in Kalyan and François forthcoming). This approach allows one to apply the insights of the Wave Model to the Australian situation, without losing the power of the Comparative Method, and especially the principle of subgrouping by shared innovations.
Since its development in 1853 by August Schleicher, the family tree has become the most widely accepted model for representing historical relations between languages. And yet, it has also been an object of criticism, even among followers of the Comparative Method (e.g. Ross 1988, Bossong 2009, Heggarty et al. 2010), for the problematic assumptions that underlie it: (1) that the genealogy of languages can be traced back by looking exclusively at divergence, to the exclusion of convergence and diffusion; (2) that each modern language thus belongs to a single subgroup, which is itself nested in another discrete subgroup, and so on and so forth.
The tree model may be appropriate when a speaker population undergoes successive splits, with subsequent loss of contact among descendants. For all other scenarios, it fails to provide an accurate representation of language history. In particular, it is unable to deal with dialect continua, or with the language families that develop out of them—for which Ross (1988) proposed the term “linkage”. In such cases, the scopes of innovations (their isoglosses) are not nested but persistently intersect, in ways which cannot be accurately represented by any tree structure. Though Ross's initial observations about linkages concerned the languages of western Melanesia, it is clear that linkages are found in many other areas as well—such as Fiji (Geraghty 1983), Northern India (Toulmin 2009), etc.
In this presentation, we focus on the 17 languages of the Torres and Banks islands in Vanuatu, which form a linkage (François 2011), and attempt to develop adequate representations for this linkage. Our data consist of 474 linguistic innovations reflected in the area—phonological, morphological, lexical or otherwise—identified on the basis of a strict application of the Comparative Method. With these 17 × 474 data points, we illustrate the method of Historical Glottometry, our proposed quantitative approach to language subgrouping in situations of linkage. One tenet of Glottometry is that innovation-defined subgroups may intersect; they define patterns that can be quantified and measured. We calculate the cohesiveness of each subgroup (proportion of time it is confirmed by the data), and define the relative strengths of all subgroups by calculating their subgroupiness (number of exclusively shared innovations weighted by subgroup cohesiveness). The result is a “glottometric diagram” of northern Vanuatu, in which the relative strengths of subgroups can be visually represented, in ways more faithful to historical reality than what the tree model can do.
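The two glottometric measures described above can be sketched in a few lines of code. This is only an illustration under simplifying assumptions: the toy innovation sets are invented, and the exact definitions of “relevant” and “confirming” innovations here are one plausible reading of the prose, not the reference implementation of Historical Glottometry.

```python
def cohesiveness(subgroup, innovations):
    """Proportion of relevant innovations that confirm the subgroup.

    Here an innovation (the set of languages sharing it) counts as
    relevant if it affects any member of the subgroup, and as
    confirming if it affects all of them.
    """
    relevant = [iso for iso in innovations if iso & subgroup]
    if not relevant:
        return 0.0
    confirming = [iso for iso in relevant if subgroup <= iso]
    return len(confirming) / len(relevant)


def subgroupiness(subgroup, innovations):
    """Number of exclusively shared innovations, weighted by cohesiveness."""
    exclusive = [iso for iso in innovations if iso == subgroup]
    return len(exclusive) * cohesiveness(subgroup, innovations)


# Toy data: each innovation is recorded as the set of languages reflecting it.
innovations = [{"A", "B"}, {"A", "B"}, {"B", "C"}, {"A", "B", "C"}]
print(cohesiveness({"A", "B"}, innovations))   # 3 of the 4 relevant innovations confirm {A, B}
print(subgroupiness({"A", "B"}, innovations))  # 2 exclusive innovations, weighted by cohesiveness
```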
Overall, we hope to show that Historical Glottometry enables a fine-grained, reliable and testable representation of language history in genealogical linkages, one that combines the valuable insights of the Comparative Method with a diffusionist, non-cladistic model of language diversification.
Since the beginnings of historical linguistics, the family tree has been the most widely accepted model for representing historical relations between languages. It is even being reinvigorated by current research in computational phylogenetics (e.g. Gray et al. 2009). While this sort of representation is certainly easy to grasp, and allows for a simple, attractive account of the development of a language family, the assumptions made by the tree model are applicable in only a small number of cases.
The tree model is appropriate when a speaker population undergoes successive splits, with subsequent loss of contact among subgroups. For all other scenarios, it fails to provide an accurate representation of language history (cf. Durie & Ross 1996; Pawley 1999; Heggarty et al. 2010). In particular, it is unable to deal with dialect continua, as well as language families that develop out of dialect continua (for which Ross 1988 has proposed the term "linkage"). In such cases, the scopes of innovations (in other words, their isoglosses) are not nested, but rather they persistently intersect, so that any proposed tree representation is met with abundant counterexamples. Though Ross's initial observations about linkages concerned the languages of Western Melanesia, it is clear that linkages are found in many other areas as well (such as Fiji: cf. Geraghty 1983).
In this presentation, we focus on the 17 languages of the Torres and Banks islands in Vanuatu, which form a linkage (Tryon 1996, François 2011), and attempt to develop adequate representations for it. Our data consists of a database of 474 linguistic innovations reflected in the area—whether phonological, morphological, lexical or otherwise. Based on this rich database, we propose to define a new approach to representing and reconstructing language history – an approach we call historical glottometry.
Firstly, we use the tools of dialectometry developed by European dialectologists (Goebl 2006, Nerbonne 2010, Szmrecsányi 2011). For each pair of languages, we compute the ratio of shared innovations to non-shared innovations. Converting this ratio into a distance and applying multidimensional scaling, we are able to accurately visualise the degree of historical divergence among the languages.
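As an illustration of this pipeline, the sketch below turns a toy binary language-by-innovation matrix into pairwise distances and embeds them with classical (Torgerson) MDS. The particular distance formula (one minus the proportion of shared innovations among those attested in either language) is an assumption for the sake of the example, not necessarily the conversion used in the study.

```python
import numpy as np

# Toy binary matrix: rows = languages, columns = innovations (1 = reflected).
X = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
])
n = len(X)

# Distance: 1 minus the share of innovations common to both languages.
D = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        union = np.logical_or(X[i], X[j]).sum()
        shared = np.logical_and(X[i], X[j]).sum()
        D[i, j] = 1 - shared / union if union else 0.0

# Classical (Torgerson) MDS: double-centre the squared distances and scale
# the top eigenvectors by the square roots of their eigenvalues.
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
vals, vecs = np.linalg.eigh(B)
top = np.argsort(vals)[::-1][:2]
coords = vecs[:, top] * np.sqrt(np.clip(vals[top], 0, None))
print(coords.shape)  # one 2D point per language
```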
Secondly, we attempt to answer the question of how closely the Torres-Banks linkage approximates a tree—in other words, to what extent the isoglosses are nested. For each isogloss, we compute the ratio of isoglosses that contain it or are contained in it to those that intersect it; this allows us to determine its “subgroupiness”. In the case of a language family that develops exactly in the manner assumed by the tree model, every isogloss would have a subgroupiness of 100%. We thus propose, as a general representation, an isogloss map in which line darkness depends on subgroupiness; this includes the situation assumed by the tree model as a special case.
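The nestedness computation can be sketched as follows. The treatment of identical and disjoint isoglosses (counted as non-conflicting) is an assumption made for this illustration, and the isogloss sets are invented.

```python
def subgroupiness(iso, isoglosses):
    """Share of overlapping isoglosses that nest with `iso` rather than
    cross it; a value of 1.0 for every isogloss would indicate a tree."""
    nested = crossing = 0
    for other in isoglosses:
        if other == iso or not (iso & other):
            continue  # identical or disjoint isoglosses never conflict
        if iso <= other or other <= iso:
            nested += 1    # one contains the other: tree-compatible
        else:
            crossing += 1  # partial overlap: incompatible with any tree
    total = nested + crossing
    return nested / total if total else 1.0


# Toy isoglosses: {A, B} nests inside {A, B, C} but crosses {B, C}.
isoglosses = [{"A", "B"}, {"A", "B", "C"}, {"B", "C"}]
print(subgroupiness({"A", "B"}, isoglosses))
```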
Overall, the approach of historical glottometry, anchored as it is in the classical comparative method (with its focus on shared innovations), provides a reliable and verifiable representation of language history, one that avoids the misleading assumptions of the tree model.
It is often assumed that the meaning of a polysemous lexeme (such as a preposition) is mentally represented as a network of senses, and that in any given context of use, its interpretation is one of these senses, or possibly an elaboration of one of these senses. Debates about the meaning of a polysemous lexeme are usually about whether two interpretations are sufficiently different (and unpredictable) to constitute distinct senses, or whether they are sufficiently similar (or predictably different) for the differences to be attributable either to pragmatic rules or to “contextual modulation” (Cruse 1986) operating on a single, unified sense.
A question which (to my knowledge) is never addressed is how to determine, in a principled way, what the contextual interpretation of a word is in the first place. (It seems to be widely agreed that it usually goes beyond those aspects of meaning conventionally associated with the word—an assumption that I would like to question.) Having a clear answer to this question is necessary in order to start formulating testable hypotheses about which of these interpretations form part of the mental representation of the word.
I would like to make the following proposals:
1. The contextual interpretation of a word consists of those aspects of meaning that are
a) frequently present in the entity that the word refers to, and
b) present in the entity that it refers to in the current context.
2. The mental representation of the word's meaning consists of all of its attested contextual interpretations (as in an exemplar model, e.g. Bybee 2001);
3. Conventional tests of sensehood are largely tests of whether a certain meaning constitutes an acceptable contextual interpretation of the word.
A long-standing problem in Cognitive Grammar is the characterization of the relationship between different kinds of prominence. Here I consider two kinds of prominence—profiling and pragmatic assertion—and attempt to relate the two by construing them as instances of a more general concept of salience that has a basis in cognitive psychology.
Tversky (1977: 342–344) notes that the salience of a feature (as measured by the rated similarity of objects sharing that feature) depends on (among other things) the set of items in the context of which the salience of the feature is measured. That is, the more likely a person is to use a feature as the basis for classifying a set of items—the greater the “diagnosticity” of the feature within that set—the greater the salience of that feature in that context.
Given that the salience of a feature must be defined with respect to a particular context of items, I propose that profiling and pragmatic assertion are both kinds of diagnosticity, but are defined with respect to different contexts. A profiled feature has high diagnosticity in the context of all units in the language, whereas an asserted feature has high diagnosticity in the context made relevant by the discourse (which, I argue, is defined by those presupposed features that are necessary for the asserted one).
The hypothesized characterization of profiling seems to be supported by various experiments using a pile-sorting task. Miller (1967: 63–66) cites unpublished data from Jeremy M. Anglin showing that when a set of words belonging to different parts of speech (noun, verb, adjective, adverb) is classified, the highest-level clusters (the subsets whose members are least likely to be classified with items outside them) correspond almost perfectly to the part-of-speech categories, suggesting that the properties of these categories have high diagnosticity in the context of English lexemes. This is what we would expect, given the Cognitive Grammar definitions of parts of speech as different kinds of profiles (e.g. Langacker 1987: 189).
Bencini and Goldberg (2000: 645, 647) find that when sentences instantiating the transitive, ditransitive, resultative and caused-motion constructions are classified, the ditransitive sentences are the ones least likely to be sorted together with other sentence types. The interpretation that the property of having an indirect object has greater diagnosticity than the property of having an oblique argument is consistent with Goldbergʼs (1995: 48–49) analysis of “core” arguments as profiled and “non-core” arguments as unprofiled.
There is abundant evidence that the focus of a sentence is salient in the discourse context (e.g. Birch and Rayner 2010, and references cited therein). Given that the focus is defined to be part of the assertion (Lambrecht 1994: 213), this can also be interpreted as evidence for the salience of asserted properties. The identification of the “discourse context” with the context defined by sentence presuppositions is suggested by Millerʼs (1969) finding that people tend to sort words together on the basis of shared presuppositions in those wordsʼ definitions. In other words, the asserted properties of a word are most useful in distinguishing it from other words that have the same presuppositions, and hence they have the highest diagnosticity in this context.
These are the slides for a talk that we gave on April 19, 2017 at the Canberra Functional Programming Group. It is a work in progress, so comments and feedback are welcome.
Ever since the development of the Comparative Method by the Neo-grammarians, the family tree has been the most widely accepted model for representing historical relations between languages. It is even being reinvigorated by the current development of computer approaches to phylogenetic studies (e.g. Gray et al. 2009). Admittedly, its application provides an easily interpretable storyline involving subgroups, protolanguages and population splits, and is amenable to an appealingly simple visual representation. This phylogenetic model, however, rests on a number of problematic assumptions (Bossong 2009): (1) that languages essentially evolve in isolation from their neighbours; (2) that the history of languages can be traced back by looking exclusively at divergence, to the exclusion of convergence and diffusion; (3) that each modern language belongs to a single subgroup, which is itself nested in another discrete subgroup, and so on and so forth.
The tree model suits just one ideal case: when a population went through successive migration pulses with systematic loss of contact. For all other scenarios, it fails to provide any accurate representation of language history, as has been widely observed already (cf. Durie & Ross 1996; Pawley 1999; Heggarty et al. 2010). In particular, it is unable to deal with cases of diffusion across dialect continua. Ross (1988) has proposed the term linkage to refer to “a group of communalects which have arisen by dialect differentiation”, i.e. the modern descendants of an earlier dialect continuum. Just like dialect chains, linkages are not compatible with family trees, because they do not involve discrete subgroups, but constantly intersecting isoglosses. Ross’s important observations, initially made about languages of Western Melanesia, deserve to be extrapolated to other parts of the world. We need to develop an accurate representation of language history in dialect continua and linkages, one that would combine the scientific power of the comparative method with a diffusionist, non-cladistic approach.
Our talk will present a method for unravelling and representing the linguistic history of a specific linkage: Vanuatu. Even though modern Vanuatu languages have long lost any mutual intelligibility, their history is best represented using a wave-model approach (Tryon 1996, François 2011a): each post-dispersal innovation diffused across a social network of small communities in constant interaction speaking mutually intelligible dialects, as in Fiji (Geraghty 1983). Focusing on the 17 languages of the Banks & Torres Islands (François 2011b), we identify 441 linguistic innovations reflected in the area – whether phonological, morphological, lexical or otherwise. Using the tools of dialectometry developed by European dialectologists (Goebl 2006, Nerbonne 2010, Szmrecsányi 2011), notably multidimensional scaling, we track the geographic patterns of linguistic diffusion. We identify historically significant clusters of languages, albeit intersecting ones, and show what they tell us about the social history of the area. Our purpose is to show that it is possible to achieve an accurate and elegant representation of linkages, by taking advantage of the strengths of the Comparative Method, yet steering clear of the phylogenetic model and its unfortunate delusions.
Siva Kalyan, Alexandre François & Harald Hammarström (eds), Understanding language genealogy: Alternatives to the tree model. Special issue of Journal of Historical Linguistics 9/1, 2019.
There are important reasons to be sceptical of the accuracy and usefulness of the family-tree model in historical linguistics. That model assumes that every linguistic innovation applies to a language considered as an undifferentiated whole, a point with no “width”. But this assumption makes it impossible to use a tree to model the partial diffusion of an innovation within a language community (“internal diffusion”), or the diffusion of an innovation across language communities (“external diffusion”). These limitations have long been noticed by historical linguists (Schmidt 1872, Schuchardt 1900); but they become glaringly obvious in the cases discussed by Ross (1988) and François (2014) under the heading of “linkages” – i.e., language families that arise through the diversification, in situ, of a dialect network. The articles in this special issue all contribute towards addressing this problem, from a range of perspectives.
****
Problems with, and alternatives to, the tree model in historical linguistics — Siva Kalyan, Alexandre François & Harald Hammarström
Non-tree-like signal using multiple tree topologies — Annemarie Verkerk
Visualizing the Boni dialects with Historical Glottometry — Alexander Elias
Subgrouping the Sogeram languages — Don Daniels, Danielle Barth & Wolfgang Barth
Save the trees: Why we need tree models in linguistic reconstruction — Guillaume Jacques & Johann-Mattis List
When the waves meet the trees: A response to Jacques and List — Siva Kalyan & Alexandre François
In this response article, we review our apparently conflicting perspectives, but endeavour to find common ground between them – hence the title “when the waves meet the trees”. We show that our differences are partly due to distinct definitions of key concepts (subgroup, shared innovations…). We also show that the argument from “incomplete lineage sorting”, which Jacques and List use to defend the tree, could just as well defend the wave model, since the wave model can capture the key notion of intersecting innovations that is so prevalent in linkages, and so problematic in the traditional tree approach.
The final part of our response shows how Historical Glottometry (HG) provides a way to reconstruct the historical process of ‘linkage breaking’, whereby a dialect continuum breaks progressively into separate languages. All in all, our two approaches are fundamentally compatible – even though we find the Wave model, ultimately, to be more realistic.
Kalyan, Siva and Alexandre François. 2018. Freeing the Comparative Method from the tree model: A framework for Historical Glottometry. In Ritsuko Kikusawa & Lawrence Reid (eds), _Let's talk about trees: Genetic Relationships of Languages and Their Phylogenic Representation_ (Senri Ethnological Studies, 98). Ōsaka: National Museum of Ethnology. 59–89.
Individual colours are not directly interpretable; however, warm colours (red, yellow etc.) correspond to predominantly head-initial languages, and cool colours (blue, green etc.) correspond to predominantly head-final languages.
The main division visible is between Australia and the rest of the world.
The dataset used for this project was the WALS subset of the World Phonotactics Database (Donohue et al. 2013), which contains the 257 morphosyntax features from the World Atlas of Language Structures (Dryer & Haspelmath 2013), coded for 1601 languages, 943 of which have non-blank values for more than 50% of the features. The first step was to compute pairwise typological distances among these 943 languages; this was done using the Gower coefficient, weighting the features in a way that accounts for feature dependencies (following the procedure suggested in Hammarström & O’Connor 2013). Figure 1 shows the resulting typological distances, transformed into a 3D MDS solution which is then interpreted as RGB vectors (as in dialectometric studies such as Goebl 2006 and Szmrecsányi 2011), and displayed on a map using a Voronoi tessellation.
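A minimal sketch of the first step, a weighted Gower distance over categorical features with missing values, is given below. The feature values and weights are invented placeholders; the actual weighting scheme follows the dependency-based procedure of Hammarström & O’Connor (2013), which is not reproduced here.

```python
def gower(a, b, weights):
    """Weighted Gower distance for categorical feature vectors; None marks
    a missing value, and features missing in either language are skipped."""
    num = den = 0.0
    for x, y, w in zip(a, b, weights):
        if x is None or y is None:
            continue  # Gower: ignore features unknown for either language
        num += w * (x != y)
        den += w
    return num / den if den else float("nan")


# Invented feature values and placeholder weights.
l1 = ["SOV", "postpositions", None, "suffixing"]
l2 = ["SVO", "prepositions", "yes", "suffixing"]
weights = [1.0, 0.5, 1.0, 1.0]
print(gower(l1, l2, weights))  # two mismatches out of three comparable features
```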
The next step in the identification of linguistic areas was to define a measure of “distance” between languages that would reflect typological distance for languages that are close together, and geographic distance for languages that are far apart; this would ensure that geographically close languages are clustered together only if they are also typologically similar, and geographically distant languages are rarely clustered together, regardless of typological similarity. This new distance measure was constructed as follows: First, a geographic adjacency graph was computed, where two languages A and B are linked if and only if there is no third language C which is closer to A and B than they are to each other (following the procedure in Hammarström & Güldemann 2014: 101–102). Then, the “areal distance” between A and B was defined as equal to their typological distance if they were either immediately adjacent or 2 steps apart; if they were more than 2 steps apart, then their areal distance was equal to the sum of typological distances along the shortest path between A and B that connects only geographically-adjacent languages.
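The two-part definition above can be sketched as follows, with a toy adjacency chain and invented typological distances; the function names and data are hypothetical, and the shortest-path step uses a standard Dijkstra search.

```python
from collections import deque
import heapq


def hop_count(a, b, adj):
    """Minimum number of adjacency steps from a to b (breadth-first search)."""
    seen, queue = {a: 0}, deque([a])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen[v] = seen[u] + 1
                queue.append(v)
    return seen.get(b, float("inf"))


def areal_distance(a, b, adj, typ):
    """Typological distance if a and b are at most 2 steps apart; otherwise
    the cheapest path through adjacent languages (typological edge weights)."""
    if hop_count(a, b, adj) <= 2:
        return typ[frozenset((a, b))]
    dist, heap = {a: 0.0}, [(0.0, a)]
    while heap:  # Dijkstra over the geographic adjacency graph
        d, u = heapq.heappop(heap)
        if u == b:
            return d
        if d > dist[u]:
            continue
        for v in adj[u]:
            nd = d + typ[frozenset((u, v))]
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")


# Toy chain of four languages, A - B - C - D, with invented distances.
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
typ = {frozenset(p): d for p, d in [
    (("A", "B"), 0.2), (("B", "C"), 0.3), (("C", "D"), 0.1),
    (("A", "C"), 0.5), (("B", "D"), 0.4), (("A", "D"), 0.9),
]}
print(areal_distance("A", "C", adj, typ))            # 2 steps apart: direct distance
print(round(areal_distance("A", "D", adj, typ), 2))  # 3 steps apart: path sum
```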
These areal distances were used to cluster the 943 languages into 47 clusters with an average of about 20 languages each; these clusters were then displayed within the Voronoi tessellation computed earlier. Figure 2 shows a sample cluster, which is clearly a highly plausible candidate for a linguistic area. Likewise, all the clusters that were found are geographically contiguous (by construction), and most of them can be readily identified with established linguistic areas, or with language families or subgroups.
Correspondence analysis (also known as optimal scaling, or dual scaling) allows one to take a binary matrix tabulating the presence or absence of m features in each of n items, and produce a solution in min(m – 1, n – 1) dimensions in which there are points that represent the items as well as points that represent the features. Crucially, the point representing each item is close to the points representing the features it exhibits; likewise, the point representing each feature is close to the points representing the items that exhibit it. Correspondence analysis has been successfully used in social network analysis as a representation of bipartite “affiliation” networks (Wasserman & Faust 1994: 334–342).
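The construction can be sketched directly via the singular value decomposition, which is the standard derivation of correspondence analysis (FactoMineR and similar packages wrap much more than this). The 0/1 table below is invented; with a 4 × 4 input, min(m − 1, n − 1) = 3 dimensions carry information, so the last singular value is zero.

```python
import numpy as np

# Invented 0/1 table: rows = items (e.g. candidate cognates), columns = features
# (e.g. languages in which the form appears).
N = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)

P = N / N.sum()                      # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)  # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates: item points and feature points in the same space,
# so each item lies near the features it exhibits and vice versa.
row_coords = (U * sv) / np.sqrt(r)[:, None]
col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]
print(row_coords.shape, col_coords.shape)  # the final dimension is trivial
```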
We applied correspondence analysis (using the FactoMineR package in R: Lé et al. 2008) to a database of “plausible cognacy” judgments for a selection of 26 Germanic languages. In our data, each column represents a language (e.g. English, German, etc.), and each row represents a word that looks similar across at least some of the languages (e.g. water, Wasser, etc.), whether due to shared inheritance or borrowing. We would expect the clusters in the correspondence analysis solution to correspond to genealogical subgroups or to contact zones.
Indeed, this is what we find. As seen in Figure 1, the languages divide neatly into the well-established North Germanic (Scandinavian), East Germanic (Gothic) and West Germanic subgroups, as well as a cluster consisting of English and Scots (i.e. languages of the British Isles). In addition, we can see that within each subgroup, there are “central” and “peripheral” members—e.g. Danish (a North Germanic language) approaches the West Germanic languages. Most importantly, these language clusters correspond directly to clusters of cognate sets, which means that (unlike with many other alternatives to the tree model) we have immediate access to the evidence that supports each cluster.
The dataset used for this project was the WALS subset of the World Phonotactics Database (Donohue et al. 2013), which contains the 257 morphosyntax features from the World Atlas of Language Structures (Dryer & Haspelmath 2013), coded for 1601 languages, 943 of which have non-blank values for more than 50% of the features. The first step was to compute pairwise typological distances among these 943 languages; this was done using the Gower coefficient, weighting the features in a way that accounts for feature dependencies (following the procedure suggested in Hammarström & O’Connor 2013). Figure 1 shows the resulting typological distances, transformed into a 3D MDS solution which is then interpreted as RGB vectors (as in dialectometric studies such as Goebl 2006 and Szmrecsányi 2011), and displayed on a map using a Voronoi tessellation.
The next step in the identification of linguistic areas was to define a measure of “distance” between languages that would reflect typological distance for languages that are close together, and geographic distance for languages that are far apart; this would ensure that geographically close languages are clustered together only if they are also typologically similar, and geographically distant languages are rarely clustered together, independently of typological similarity. This new distance measure was constructed as follows: First, a geographic adjacency graph was computed, where two languages A and B are linked if and only if there are no more than two other languages which lie within the circle whose diameter is the line segment AB. (This is thus a variant of the “Gabriel graph”: see Gabriel & Sokal 1969.) Then, the “areal distance” between A and B was defined as equal to their typological distance if they were either immediately adjacent or 2 steps apart; if they were more than 2 steps apart, then their areal distance was equal to the sum of typological distances along the shortest path between A and B that connects only geographically-adjacent languages.
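The relaxed adjacency criterion can be sketched with a standard plane-geometry test: a point C lies strictly inside the circle whose diameter is the segment AB exactly when d(A,C)² + d(C,B)² < d(A,B)². Real language locations would call for great-circle rather than planar distances; the coordinates below are invented.

```python
def adjacent(a, b, points, k=2):
    """True if at most k other points lie strictly inside the circle whose
    diameter is the segment ab (planar coordinates; k=0 gives the classic
    Gabriel graph, k=2 the relaxed variant described above)."""
    def sq(p, q):  # squared Euclidean distance
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    inside = sum(1 for c in points
                 if c not in (a, b) and sq(a, c) + sq(c, b) < sq(a, b))
    return inside <= k


# Invented coordinates: three points sit inside the circle over (0,0)-(4,0).
pts = [(0, 0), (4, 0), (2, 0.5), (2, -0.5), (2, 1.0), (10, 10)]
print(adjacent((0, 0), (4, 0), pts))  # three intruders > 2: not adjacent
```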
These areal distances were used to cluster the 943 languages into 47 clusters with an average of about 20 languages each; these clusters were then displayed within the Voronoi tessellation computed earlier. Figures 2–4 show some sample clusters.
The areal clusters found are all (by construction) geographically contiguous; further, most of them can be readily identified with established linguistic areas, or with language families or subgroups.
While typological data has an inherent geographical dimension, mapping the space to other features of the data instead illuminates different sorts of clusters and structures. Two-dimensional graphs and visualisations, however, can be difficult to interpret, as the information they present is static and densely layered.
In the visualization discussed here, the WALS/WPD languages are subjected to multidimensional scaling: essentially, the number of features on which any two languages differ is counted, and the total is divided by the number of features that are known for both. A 3D (x, y, z) location is then found for each language such that these pairwise distances are represented spatially.
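The distance just described can be sketched in a few lines, treating unknown feature values as None; the feature vectors below are invented.

```python
def feature_distance(a, b):
    """Share of disagreements among the features known for both languages
    (None marks an unknown value)."""
    known = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not known:
        return float("nan")  # no shared coverage: distance undefined
    return sum(x != y for x, y in known) / len(known)


# Invented feature vectors: one disagreement out of three comparable features.
l1 = ["SOV", "postpositions", None, "no"]
l2 = ["SVO", "postpositions", "yes", "no"]
print(feature_distance(l1, l2))
```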
The result is a three-dimensional scatterplot. In a VR headset, the user is located at (0,0,0) in the midst of the cloud of points, and can look around. It can also be viewed in a pseudo-3D environment using WebGL in a desktop browser.
The scatterplot can map any csv datafile, as long as there are columns for (x,y,z) coordinates as well as (optionally) columns of features that can be used for colouring the points. This therefore has potential uses beyond linguistics as well.
In this response article, we review our apparently conflicting perspectives, but endeavour to find common ground between them – hence the title “when the waves meet the trees”. We show that our differences are partly due to distinct definitions of key concepts (subgroup, shared innovations…). We also show that the argument of “Incomplete lineage sorting”, which they use to defend the tree, could as well defend the wave model, since it can capture the key notion of intersecting innovations that is so prevalent in linkages, and so problematic in the traditional tree approach.
The final part of our response shows how Historical Glottometry (HG) provides a way to reconstruct the historical process of ‘linkage breaking’, whereby a dialect continuum breaks progressively into separate languages. All in all, our two approaches are fundamentally compatible – even though we find the Wave model, ultimately, to be more realistic.
Kalyan, Siva and Alexandre François. 2018. Freeing the Comparative Method from the tree model: A framework for Historical Glottometry. In Ritsuko Kikusawa & Lawrence Reid (eds), _Let's talk about trees: Genetic Relationships of Languages and Their Phylogenic Representation_ (Senri Ethnological Studies, 98). Ōsaka: National Museum of Ethnology. 59–89.
Individual colours are not directly interpretable; however, warm colours (red, yellow etc.) correspond to predominantly head-initial languages, and cool colours (blue, green etc.) correspond to predominantly head-final languages.
The main division visible is between Australia and the rest of the world.
The dataset used for this project was the WALS subset of the World Phonotactics Database (Donohue et al. 2013), which contains the 257 morphosyntax features from the World Atlas of Language Structures (Dryer & Haspelmath 2013), coded for 1601 languages, 943 of which have non-blank values for more than 50% of the features. The first step was to compute pairwise typological distances among these 943 languages; this was done using the Gower coefficient, weighting the features in a way that accounts for feature dependencies (following the procedure suggested in Hammarström & O’Connor 2013). Figure 1 shows the resulting typological distances, transformed into a 3D MDS solution which is then interpreted as RGB vectors (as in dialectometric studies such as Goebl 2006 and Szmrecsányi 2011), and displayed on a map using a Voronoi tessellation.
The next step in the identification of linguistic areas was to define a measure of “distance” between languages that would reflect typological distance for languages that are close together, and geographic distance for languages that are far apart; this would ensure that geographically close languages are clustered together only if they are also typologically similar, and geographically distant languages are rarely clustered together, regardless of typological similarity. This new distance measure was constructed as follows: First, a geographic adjacency graph was computed, where two languages A and B are linked if and only if there is no third language C which is closer to A and B than they are to each other (following the procedure in Hammarström & Güldemann 2014: 101–102). Then, the “areal distance” between A and B was defined as equal to their typological distance if they were either immediately adjacent or 2 steps apart; if they were more than 2 steps apart, then their areal distance was equal to the sum of typological distances along the shortest path between A and B that connects only geographically-adjacent languages.
These areal distances were used to cluster the 943 languages into 47 clusters with an average of about 20 languages each; these clusters were then displayed within the Voronoi tessellation computed earlier. Figure 2 shows a sample cluster, which is clearly a highly plausible candidate for a linguistic area. Likewise, all the clusters that were found are geographically contiguous (by construction), and most of them can be readily identified with established linguistic areas, or with language families or subgroups.
Correspondence analysis (also known as optimal scaling, or dual scaling) allows one to take a binary matrix tabulating the presence or absence of m features in each of n items, and produce a solution in min(m – 1, n – 1) dimensions in which there are points that represent the items as well as points that represent the features. Crucially, the point representing each item is close to the points representing the features it exhibits; likewise, the point representing each feature is close to the points representing the items that exhibit it. Correspondence analysis has been successfully used in social network analysis as a representation of bipartite “affiliation” networks (Wasserman & Faust 1994: 334–342).
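For readers who want to experiment, correspondence analysis in its standard SVD formulation can be sketched in a few lines. This is a minimal NumPy version for illustration only; full implementations such as FactoMineR add supplementary points, inertia decompositions and other refinements, and may differ in scaling conventions.

```python
# Minimal correspondence analysis on an n-items x m-features matrix
# of counts or binary presence/absence values (assumed to have no
# all-zero rows or columns).
import numpy as np

def correspondence_analysis(N):
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    # Matrix of standardised residuals from independence
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    rows = U * sv / np.sqrt(r)[:, None]      # principal row coordinates
    cols = Vt.T * sv / np.sqrt(c)[:, None]   # principal column coordinates
    return rows, cols, sv
```

On a block-diagonal binary matrix, items sharing the same features receive identical coordinates, and the two blocks separate on the first dimension, illustrating the "items near their features" property described above.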
We applied correspondence analysis (using the FactoMineR package in R: Lê et al. 2008) to a database of “plausible cognacy” judgments for a selection of 26 Germanic languages. In our data, each column represents a language (e.g. English, German, etc.), and each row represents a word that looks similar across at least some of the languages (e.g. water, Wasser, etc.), whether due to shared inheritance or borrowing. We would expect the clusters in the correspondence analysis solution to correspond to genealogical subgroups or to contact zones.
Indeed, this is what we find. As seen in Figure 1, the languages divide neatly into the well-established North Germanic (Scandinavian), East Germanic (Gothic) and West Germanic subgroups, as well as a cluster consisting of English and Scots (i.e. languages of the British Isles). In addition, we can see that within each subgroup, there are “central” and “peripheral” members—e.g. Danish (a North Germanic language) approaches the West Germanic languages. Most importantly, these language clusters correspond directly to clusters of cognate sets, which means that (unlike with many other alternatives to the tree model) we have immediate access to the evidence that supports each cluster.
The dataset used for this project was the WALS subset of the World Phonotactics Database (Donohue et al. 2013), which contains the 257 morphosyntax features from the World Atlas of Language Structures (Dryer & Haspelmath 2013), coded for 1601 languages, 943 of which have non-blank values for more than 50% of the features. The first step was to compute pairwise typological distances among these 943 languages; this was done using the Gower coefficient, weighting the features in a way that accounts for feature dependencies (following the procedure suggested in Hammarström & O’Connor 2013). Figure 1 shows the resulting typological distances, transformed into a 3D MDS solution which is then interpreted as RGB vectors (as in dialectometric studies such as Goebl 2006 and Szmrecsányi 2011), and displayed on a map using a Voronoi tessellation.
The next step in the identification of linguistic areas was to define a measure of “distance” between languages that would reflect typological distance for languages that are close together, and geographic distance for languages that are far apart; this would ensure that geographically close languages are clustered together only if they are also typologically similar, and geographically distant languages are rarely clustered together, independently of typological similarity. This new distance measure was constructed as follows: First, a geographic adjacency graph was computed, where two languages A and B are linked if and only if there are no more than two other languages which lie within the circle whose diameter is the line segment AB. (This is thus a variant of the “Gabriel graph”: see Gabriel & Sokal 1969.) Then, the “areal distance” between A and B was defined as equal to their typological distance if they were either immediately adjacent or 2 steps apart; if they were more than 2 steps apart, then their areal distance was equal to the sum of typological distances along the shortest path between A and B that connects only geographically-adjacent languages.
These areal distances were used to cluster the 943 languages into 47 clusters with an average of about 20 languages each; these clusters were then displayed within the Voronoi tessellation computed earlier. Figures 2–4 show some sample clusters.
The areal clusters found are all (by construction) geographically contiguous; further, most of them can be readily identified with established linguistic areas, or with language families or subgroups.
While typological data has an inherent geographical dimension, mapping the space onto other features of the data instead illuminates different sorts of clusters and structures. Two-dimensional graphs and visualisations, however, can be difficult to interpret, as the information is static and densely layered.
In the visualization discussed here, WALS/WPD languages are subjected to multidimensional scaling: essentially, the number of features on which any two languages differ is counted, and the total is divided by the number of features that are known for both. A 3D (x,y,z) location is found for each language such that these pairwise distances are represented spatially.
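The distance and the 3D embedding can be sketched as follows (illustrative Python using classical metric MDS; the actual visualisation pipeline may differ):

```python
# Pairwise distance: share of disagreeing features among features
# known for both languages (None = unknown), followed by classical
# MDS to embed the resulting distance matrix in 3D.
import numpy as np

def typological_distance(a, b):
    known = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not known:
        return float("nan")
    return sum(x != y for x, y in known) / len(known)

def classical_mds(D, dims=3):
    """Embed a square distance matrix D in `dims` dimensions."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centring matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:dims]    # largest eigenvalues first
    return vecs[:, order] * np.sqrt(np.clip(vals[order], 0, None))
```

For three mutually equidistant languages, the embedding reproduces the input distances exactly; in general, classical MDS gives the best rank-3 approximation, whose coordinates can then be rescaled to RGB or to scatterplot positions.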
The result is a three-dimensional scatterplot. In a VR headset, the user is located at (0,0,0) in the midst of the cloud of points, and can look around. It can also be viewed in a pseudo-3D environment using WebGL in a desktop browser.
The scatterplot can display any CSV data file, as long as it has columns for (x,y,z) coordinates, as well as (optionally) columns of features that can be used for colouring the points. This therefore has potential uses beyond linguistics as well.
One way to get around the problems of balancing areal and genealogical criteria is to ignore areas and genealogy completely, and simply focus on the distribution of languages in terms of their typological features. Then we can define a “representative sample” of languages as a set of languages that (a) is “maximally dispersed” through the feature space (in order to maximally represent diversity), and (b) is such that no two languages are “too close together” (in order to avoid over-representing very similar language types). Dahl (2008) makes an initial attempt at defining such an a posteriori approach to typological sampling, by simply computing the distance between every pair of languages, and then for every pair of languages that is “too close”, removing the one that is less well-described. However, this approach is limited by two factors: Firstly, it is deterministic, i.e., it produces only one sample for a given input set of languages, even though intuitively there should be many samples that are comparably good. Secondly, the threshold for languages being “too close” is arbitrarily determined by the desired sample size; in fact, we would like to argue that the sample size (and the criterion for when two languages are “too close”) should be determined by the inherent variation in the total set of languages.
We propose that a typologically representative sample can be built in the following way: (1) Rescale the inter-language distances so as to maximise the variation between languages in dense neighbourhoods and languages in sparse neighbourhoods. (2) Pick one language at random, and add it to the sample. (3) Pick another language at random, and decide whether to include it in the sample based on how far (typologically) it is from the language already in the sample (if it’s far away, it’s very likely to be added; if it’s close by, it’s unlikely to be added). Continue picking languages at random, and decide the likelihood of including each one by how well-separated it is from the languages already in the sample. Continue until every language has been considered.
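The procedure can be sketched as follows. This is an illustrative Python sketch: the text does not fully specify how distance maps to inclusion probability, so the version below simply uses the (assumed rescaled, 0-to-1) distance to the nearest language already in the sample as the probability.

```python
# Sketch of the a posteriori sampling procedure: consider every
# language once, in random order, and include each one with a
# probability that grows with its typological distance from the
# languages already sampled.
import random

def representative_sample(languages, dist, seed=None):
    """dist(a, b) is assumed to be rescaled to the range [0, 1]."""
    rng = random.Random(seed)
    order = languages[:]
    rng.shuffle(order)             # step (3): random order of consideration
    sample = [order[0]]            # step (2): one language at random
    for lang in order[1:]:
        # Inclusion probability = distance to the nearest sampled
        # language (an assumption; the paper's rescaling differs).
        p = min(dist(lang, s) for s in sample)
        if rng.random() < p:
            sample.append(lang)
    return sample
```

Note that, unlike Dahl's deterministic procedure, repeated runs with different seeds yield different, comparably good samples, and the sample size falls out of the inherent dispersion of the languages rather than being fixed in advance.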
We present a posteriori typological samples computed on the basis of two typological feature sets: the morphosyntactic features in WALS (Dryer and Haspelmath 2013), as coded in the extended World Phonotactics Database (Donohue et al. 2013), and the phonological features in the WPD. We show that the number of languages in a morphosyntactically representative sample of the world’s languages is greater than the number of languages in a phonologically representative sample. This shows the economy achieved when using typology to drive typological sampling, and highlights the loci of global linguistic diversity.
Because of the linkage-like structure of the families involved, we adopted non-cladistic clustering methods to tease apart the data. In particular, we used Correspondence Analysis (Faust 2005, Kalyan and Donohue in preparation), which is unique in allowing us to immediately identify which of the features in our data are responsible for each cluster of languages.
Not surprisingly, given the large proportion of Austronesian languages in our data, we consistently find strong lexical evidence for a cluster that contains most Austronesian languages in the sample, but with intriguing disparities on the fringes. For example, in terms of body-part vocabulary, Tai-Kadai languages (Ong Be, Thai and Zhuang) are surprisingly close to the Austronesian cluster, as is Makasae (Papuan, East Timor); by contrast, Ambai (Austronesian, Cenderawasih Bay) and Atayal (Austronesian, northern Taiwan) are highly divergent from the Austronesian “prototype”. In the domain of “tools”, however, Ambai hews closer to the Austronesian cluster, together with Makasae and a number of Papuan families; but Tai-Kadai is relatively distant. Comparing these patterns with those found when we examine the typological data, we are able to map the linguistic outcomes that reflect different social histories in this part of the world.
We briefly present a sample of the features in our data that are directly responsible for the clustering patterns that we see, and speculate on what these may tell us about social histories in Island Southeast Asia.
Perhaps the most easily-operationalised concept of salience is what is variously known as “accessibility” (Ariel 1990), “activation”, or “givenness” (Chafe 1994). Roughly, this refers to the degree to which an entity is in the “focus of consciousness” of the discourse participants at a particular point in time. It is usually diagnosed by the ease with which the entity can be referred to with a pronoun or an unstressed noun phrase; and in psycholinguistics, it is studied in terms of visual attention and conceptual priming (among many possibilities).
Given the ready availability of “accessibility”/“givenness” as a concept of “salience” that is securely known to be psychologically real, it makes sense to try to relate the many types of salience postulated in Cognitive Grammar to some notion of referential accessibility. This is not a new idea; Langacker (2001) describes possessors, topics, and focal participants (core arguments) as types of “reference points”, using a construct that is also used to describe referential accessibility in the context of pronominal anaphora (van Hoek 1997). This presentation focuses on the notion of “profile”—while also touching on other kinds of salience—and tries to make as explicit as possible the conceptual import of each kind of salience in terms of the dynamics of language processing.
The profile of an expression is that element of the evoked conceptual content that the expression “designates” (Langacker 1987: 183). Langacker (1987: 188) explicitly suggests that the profile is “activated at a...higher level of intensity” than other elements of the evoked content (e.g. in The plane that descended, the conception of the plane is more active than that of its downward motion). This characterisation appears to be supported by experiments using the visual-world paradigm, which clearly show that hearing a noun phrase directs attention to (in other words, visually activates) the entity that the noun phrase designates.
Yet this cannot be the full story, as it is easy to find cases of noun phrases whose referents are not activated (in the sense that they are not subsequently available as antecedents of pronouns). This is the case not only with “non-referential” NPs (e.g. predicate nominals), but also with NPs that are contained in presuppositions (e.g. I met the woman who fixed my computer. *It...).
However, even in these cases, it seems likely that the entity designated by the NP, though it may not be activated in an absolute sense, nonetheless is more active at the end of the NP than it was at the start. Thus, as a first approximation, it is suggested that the profile of an expression is the entity whose activation increases (even if not maximally) during the course of that expression.
A further complication arises, however, when we turn our attention from noun phrases to finite clauses. While it is true that a clause activates the conception of an event (e.g. Sam washed the windows. It took two hours.), it typically also activates the participants that are coded by its arguments (e.g. ...They hadn’t been cleaned for a year.)—indeed, even more so. Yet we would not want to say that these participants are designated by the clause.
A possible solution to this is to argue that in fact, a relational profile consists of multiple entities: the participant(s) in the relation, as well as an entity that is emblematic of the relationship (e.g. a Davidsonian argument in the case of a finite clause, or a location in the case of a locative PP).
While the above proposals are admittedly speculative, they go some way toward fulfilling the pressing need for operational definitions of semantic notions in cognitive linguistics (Dąbrowska 2009). Such definitions are essential in order for cognitive grammatical analyses to be tested empirically.
However, as shown by Thompson (2002), Verhagen (2005) and others, in spoken discourse it is nearly always the content clause that carries the main informational load of the sentence, and serves to “move the discourse forward”. On the basis of this fact, these researchers argue that the head of a finite-complementation construction is actually the content clause, and that the propositional-attitude verb and its subject are better understood as constituting an evidential or epistemic adverbial modifier.
Dąbrowska (2009) points out that disagreements such as this are caused by the lack of clear operational definitions, in this case of the notion “head”—or “profile determinant”, in the terminology of Cognitive Grammar. In this talk, we propose an operational definition of “profile determinant”, and use it to experimentally investigate whether finite complementation constructions are headed by the propositional-attitude verb or the content clause.
In Cognitive Grammar, the “profile determinant” of a construction is the component which designates the same entity as the entire composite structure (Langacker 1987: 288). (For example, the profile determinant of jar lid is lid (not jar), because jar lid designates a type of lid and not a type of jar.) A consequence of this definition (Langacker 1987: 467) is that the semantic properties of the profile determinant are inherited by the composite structure. (Thus, a jar lid has all the properties of a lid, but not those of a jar.) Applying this to finite complementation, this means that if know is the profile determinant of I know that she left, then the sentence designates an event of knowing rather than one of leaving (Langacker 1991: 436), and exhibits the semantic properties of know, but not those of left.
We report on two experiments in which we examine a set of sixteen propositional-attitude verbs, and try to determine, in each case, whether it is the propositional-attitude verb or the content clause that is the profile determinant. In the first experiment, subjects were shown one sentence exemplifying each propositional-attitude verb; after each sentence, they saw a list of twelve features that included four features of the propositional-attitude verb and four features of the main verb of the content clause. (These features had been elicited in a preliminary study.) The subjects’ task was to indicate which features best describe the overall meaning of the sentence. In this experiment all subjects saw the same list of sixteen sentences; the second experiment controlled for the (potential) influence of the content clauses by having sixteen conditions, each with different combinations of propositional-attitude verbs and content clauses.
We found that four of the propositional-attitude verbs in our study (thought, remember, suspect and admitted) tend to be profile-determining, and five (found, realised, says, agreed and believe) are non-profile-determining (in other words, the content clause is the head); the other verbs (knew, announced, showed, saw, suggested, concluded and claimed) did not show a significant effect. We consider possible semantic and syntactic explanations for this pattern of results, and find ourselves forced to conclude that what we have found are simply idiosyncratic properties of these verbs.
We hope, however, to have shown that the concept of a “profile determinant” can indeed be given a meaningful operational definition, and that this can be used to test grammatical analyses experimentally.
The tree model may be appropriate when a speaker population undergoes successive splits, with subsequent loss of contact among descendants. For all other scenarios, it fails to provide an accurate representation of language history. In particular, it is unable to deal with dialect continua, or with the language families that develop out of them—for which Ross (1988) proposed the term “linkage”. In such cases, the scopes of innovations (their isoglosses) are not nested but persistently intersect, in ways which cannot be accurately represented by any tree structure. Though Ross's initial observations about linkages concerned the languages of western Melanesia, it is clear that linkages are found in many other areas as well—such as Fiji (Geraghty 1983), Northern India (Toulmin 2009), etc.
In this presentation, we focus on the 17 languages of the Torres and Banks islands in Vanuatu, which form a linkage (François 2011), and attempt to develop adequate representations for this linkage. Our data consist of 474 linguistic innovations reflected in the area—phonological, morphological, lexical or otherwise—identified on the basis of a strict application of the Comparative Method. With these 17 × 474 data points, we illustrate the method of Historical Glottometry, our proposed quantitative approach to language subgrouping in situations of linkage. One tenet of Glottometry is that innovation-defined subgroups may intersect; they define patterns that can be quantified and measured. We calculate the cohesiveness of each subgroup (the proportion of the time it is confirmed by the data), and define the relative strengths of all subgroups by calculating their subgroupiness (number of exclusively shared innovations weighted by subgroup cohesiveness). The result is a “glottometric diagram” of northern Vanuatu, in which the relative strengths of subgroups can be visually represented, in ways more faithful to historical reality than what the tree model can do.
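The two glottometric measures can be given a minimal operationalisation along the following lines. This is an illustrative Python sketch, not the published definitions: innovations are modelled as sets of languages, and the predicates for "confirming" and "exclusively shared" innovations are simplifying assumptions.

```python
# Sketch of glottometric measures over innovations modelled as
# frozensets of languages. The operationalisation of "confirmed"
# (shared by >= 2 members, not crossing the subgroup's boundary)
# and "exclusively shared" (isogloss equal to the subgroup) is an
# assumption for illustration.
def cohesiveness(subgroup, innovations):
    """Among innovations attested by at least two members of the
    subgroup, the proportion that do not cross its boundary."""
    relevant = [i for i in innovations if len(i & subgroup) >= 2]
    confirming = [i for i in relevant if i <= subgroup]
    return len(confirming) / len(relevant) if relevant else 0.0

def subgroupiness(subgroup, innovations):
    """Exclusively shared innovations, weighted by cohesiveness."""
    exclusive = sum(1 for i in innovations if i == subgroup)
    return exclusive * cohesiveness(subgroup, innovations)
```

A subgroup supported by two exclusive innovations and contradicted by none thus scores higher than one supported by the same two innovations but also cross-cut by isoglosses spanning its boundary.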
Overall, we hope to show that Historical Glottometry enables a fine-grained, reliable and testable representation of language history in genealogical linkages, one that combines the valuable insights of the Comparative Method with a diffusionist, non-cladistic model of language diversification.
The tree model is appropriate when a speaker population undergoes successive splits, with subsequent loss of contact among subgroups. For all other scenarios, it fails to provide an accurate representation of language history (cf. Durie & Ross 1996; Pawley 1999; Heggarty et al. 2010). In particular, it is unable to deal with dialect continua, as well as language families that develop out of dialect continua (for which Ross 1988 has proposed the term "linkage"). In such cases, the scopes of innovations (in other words, their isoglosses) are not nested, but rather they persistently intersect, so that any proposed tree representation is met with abundant counterexamples. Though Ross's initial observations about linkages concerned the languages of Western Melanesia, it is clear that linkages are found in many other areas as well (such as Fiji: cf. Geraghty 1983).
In this presentation, we focus on the 17 languages of the Torres and Banks islands in Vanuatu, which form a linkage (Tryon 1996, François 2011), and attempt to develop adequate representations for it. Our data consists of a database of 474 linguistic innovations reflected in the area—whether phonological, morphological, lexical or otherwise. Based on this rich database, we propose to define a new approach to representing and reconstructing language history – an approach we call historical glottometry.
Firstly, we use the tools of dialectometry developed by European dialectologists (Goebl 2006, Nerbonne 2010, Szmrecsányi 2011). For each pair of languages, we compute the ratio of shared innovations to non-shared innovations. Converting this ratio into a distance and applying multidimensional scaling, we are able to accurately visualise the degree of historical divergence among the languages.
Secondly, we attempt to answer the question of how closely the Torres-Banks linkage approximates a tree—in other words, to what extent the isoglosses are nested. For each isogloss, we compute the ratio of isoglosses that contain it or are contained in it to isoglosses that cross-cut it; this allows us to determine its "subgroupiness". In the case of a language family that develops exactly in the manner assumed by the tree model, every isogloss would have a subgroupiness of 100%. We thus propose an isogloss map with line darkness dependent on subgroupiness as a general representation, which includes the situation assumed by the tree model as a special case.
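One possible operationalisation of this per-isogloss measure is sketched below (illustrative Python; isoglosses are modelled as sets of languages, and disjoint isoglosses are ignored, since they are compatible with any tree):

```python
# Per-isogloss "subgroupiness": among the other isoglosses that
# overlap this one, the proportion that nest with it (contain it
# or are contained in it) rather than cross-cut it.
def isogloss_subgroupiness(iso, isoglosses):
    nested = crossing = 0
    for other in isoglosses:
        if other == iso or not (other & iso):
            continue  # identical or disjoint: compatible with any tree
        if other <= iso or iso <= other:
            nested += 1
        else:
            crossing += 1
    total = nested + crossing
    # No overlapping isoglosses at all: vacuously tree-like.
    return nested / total if total else 1.0
```

In a family that developed exactly as the tree model assumes, every isogloss scores 1.0; persistent cross-cutting, as in a linkage, drags scores toward 0.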
Overall, the approach of historical glottometry, anchored as it is in the classical comparative method (with its focus on shared innovations), provides a reliable and verifiable representation of language history that avoids the misleading assumptions of the tree model.
A question which (to my knowledge) is never addressed is how to determine, in a principled way, what the contextual interpretation of a word is in the first place. (It seems to be widely agreed that it usually goes beyond those aspects of meaning conventionally associated with the word—an assumption that I would like to question.) Having a clear answer to this is necessary in order to start formulating testable hypotheses about which of these interpretations form part of the mental representation of the word.
I would like to make the following proposals:
1. The contextual interpretation of a word consists of those aspects of meaning that are
a) frequently present in the entity that the word refers to, and
b) present in the entity that it refers to in the current context.
2. The mental representation of the word's meaning consists of all of its attested contextual interpretations (as in an exemplar model, e.g. Bybee 2001);
3. Conventional tests of sensehood are largely tests of whether a certain meaning constitutes an acceptable contextual interpretation of the word.
Tversky (1977: 342–344) notes that the salience of a feature (as measured by the rated similarity of objects sharing that feature) depends on (among other things) the set of items in the context of which the salience of the feature is measured. That is, the more likely a person is to use a feature as the basis for classifying a set of items—the greater the “diagnosticity” of the feature within that set—the greater the salience of that feature in that context.
Given that the salience of a feature must be defined with respect to a particular context of items, I propose that profiling and pragmatic assertion are both kinds of diagnosticity, but are defined with respect to different contexts. A profiled feature has high diagnosticity in the context of all units in the language, whereas an asserted feature has high diagnosticity in the context made relevant by the discourse (which, I argue, is defined by those presupposed features that are necessary for the asserted one).
The hypothesized characterization of profiling seems to be supported by various experiments using a pile-sorting task. Miller (1967: 63–66) cites unpublished data from Jeremy M. Anglin showing that when a set of words belonging to different parts of speech (noun, verb, adjective, adverb) are classified, the highest-level clusters (the subsets such that the items contained in them are least likely to be classified with items outside of them) correspond almost perfectly to the part-of-speech categories, suggesting that the properties of these categories have high diagnosticity in the context of English lexemes. This is what we would expect, given the Cognitive Grammar definitions of parts of speech as different kinds of profiles (e.g. Langacker 1987: 189).
Bencini and Goldberg (2000: 645, 647) find that when sentences instantiating the transitive, ditransitive, resultative and caused-motion constructions are classified, the ditransitive sentences are the ones least likely to be sorted together with other sentence types. The interpretation that the property of having an indirect object has greater diagnosticity than the property of having an oblique argument is consistent with Goldbergʼs (1995: 48–49) analysis of “core” arguments as profiled and “non-core” arguments as unprofiled.
There is abundant evidence that the focus of a sentence is salient in the discourse context (e.g. Birch and Rayner 2010, and references cited therein). Given that the focus is defined to be part of the assertion (Lambrecht 1994: 213), this can also be interpreted as evidence for the salience of asserted properties. The identification of the “discourse context” with the context defined by sentence presuppositions is suggested by Millerʼs (1969) finding that people tend to sort words together on the basis of shared presuppositions in those wordsʼ definitions. In other words, the asserted properties of a word are most useful in distinguishing it from other words that have the same presuppositions, and hence they have the highest diagnosticity in this context.
The tree model suits just one ideal case: when a population went through successive migration pulses with systematic loss of contact. For all other scenarios, it fails to provide any accurate representation of language history, as has been widely observed already (cf. Durie & Ross 1996; Pawley 1999; Heggarty et al. 2010). In particular, it is unable to deal with cases of diffusion across dialect continuums. Ross (1988) has proposed the term linkage to refer to “a group of communalects which have arisen by dialect differentiation”, i.e. the modern descendants of an earlier dialect continuum. Just like dialect chains, linkages are not compatible with family trees, because they do not involve discrete subgroups, but constantly intersecting isoglosses. Ross’ important observations, initially made about languages of Western Melanesia, deserve to be extrapolated to other parts of the world. We need to develop an accurate representation of language history in dialect continuums and linkages, that would combine the scientific power of the comparative method with a diffusionist, non-cladistic approach.
Our talk will present a method for unravelling and representing the linguistic history of a specific linkage: Vanuatu. Even though modern Vanuatu languages have long lost any mutual intelligibility, their history is best represented using a wave-model approach (Tryon 1996, François 2011a): each post-dispersal innovation diffused across a social network of small communities in constant interaction speaking mutually intelligible dialects, à la Fiji (Geraghty 1983). Focusing on the 17 languages of the Banks & Torres Islands (François 2011b), we identify 441 linguistic innovations reflected in the area – whether phonological, morphological, lexical or otherwise. Using the tools of dialectometry developed by European dialectologists (Goebl 2006, Nerbonne 2010, Szmrecsányi 2011), notably Multidimensional Scaling, we track the geographic patterns of linguistic diffusion. We identify historically significant clusters of languages, albeit intersecting ones, and show what they tell us about the social history of the area. Our purpose is to show it is possible to achieve an accurate and elegant representation of linkages, by taking advantage of the strengths of the Comparative Method, yet steering clear of the phylogenetic model and its unfortunate delusions.
****
Problems with, and alternatives to, the tree model in historical linguistics — Siva Kalyan, Alexandre François & Harald Hammarström
Non-tree-like signal using multiple tree topologies — Annemarie Verkerk
Visualizing the Boni dialects with Historical Glottometry — Alexander Elias
Subgrouping the Sogeram languages — Don Daniels, Danielle Barth & Wolfgang Barth
Save the trees: Why we need tree models in linguistic reconstruction — Guillaume Jacques & Johann-Mattis List
When the waves meet the trees: A response to Jacques and List — Siva Kalyan & Alexandre François