Journal articles (peer-reviewed)
Indogermanische Forschungen, 2024
This paper presents evidence for a regular sound change in 12th-century Old French, velar onset voicing (VOV), in which onset /k/ voiced to /ɡ/ before /la/ or /ra/. It offers the first unified account, in terms of regular sound change, of a contemporaneous change seen in 13 reflexes of inherited etyma and 4 reflexes of Germanic borrowings. A dating in the late 12th century for VOV is supported by both corpus evidence and how it is bled, counterbled, and fed by other sound changes whose dating is established in prior literature. All apparent exceptions are explicable by sound symbolism or analogy due to membership in one of three derivationally productive word families, each with a root word for which VOV was regularly bled. The ultimate origins of this Old French velar onset voicing most likely lie in an older, phonetically conditioned synchronic voicing process that originally operated both word-internally and at boundaries. Word-internally, where conditioning was static, the alternation became a phonemic difference and thereafter lost productivity relatively early. But synchronic alternation continued at boundaries, where its conditioning continued to depend on the adjacent word. It survived in a progressively reshaped manner at the word onset until the late 12th century, when this voicing alternation, now much narrower in scope, gave rise to VOV as a regular diachronic sound change. I present various points of evidence from Romance and elsewhere supporting this explanation. Previous treatments of the words in question considered each only in isolation, bypassing the chance to consider the proposed regular sound change as a possibility. Thus, this study shows the importance of treating lexically focused approaches and the probing for regular sound changes as necessarily complementary methods. (Accepted January 3, 2024; pre-proofs manuscript -- to appear in 2024)
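The ordering relations this abstract relies on (bleeding and counterbleeding) can be illustrated with a minimal Python sketch. It rests on loud assumptions: the function names, the second rule `raise_a`, and every input string are invented for illustration and are not attested Old French forms or changes proposed in the paper. Only the environment of the voicing rule (onset /k/ before /la/ or /ra/) is taken from the abstract.

```python
import re

G = "\u0261"  # IPA voiced velar stop, as written in the abstract

def velar_onset_voicing(form: str) -> str:
    """Word-initial /k/ becomes /ɡ/ when /la/ or /ra/ follows."""
    return re.sub(r"^k(?=[lr]a)", G, form)

def raise_a(form: str) -> str:
    """Invented toy change (an assumption): /a/ becomes /e/ before /n/."""
    return re.sub(r"a(?=n)", "e", form)

def derive(form: str, rules) -> str:
    """Apply the rules in chronological order to a single form."""
    for rule in rules:
        form = rule(form)
    return form

# Environment intact: voicing applies.
print(derive("klare", [raise_a, velar_onset_voicing]))
# Bleeding order: the earlier change destroys the /ra/ environment first.
print(derive("kran", [raise_a, velar_onset_voicing]))
# Counterbleeding order: voicing applies before its environment is destroyed.
print(derive("kran", [velar_onset_voicing, raise_a]))
```

Because the two orders give different outputs for the same input, attested outcomes can in principle constrain the relative dating of the changes, which is the kind of argument the abstract describes.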
Zeitschrift für Balkanologie, 2023
While recent work has shown the Angevin Kingdom of Albania in the 13th and 14th centuries to have had a significant impact on Albania with effects ranging from religious demography and socioeconomic structure to architectural and artistic heritage, the linguistic impact of contact between Old French and Old Albanian has not yet been investigated. The diverse strata of the Albanian lexicon have long been an object of interest, but while impressive surveys have analyzed layers as thin as Gothic, to date no study of Angevin-Albanian contact appears to have ever been published. This paper aims to address this oversight and build a foundation for further investigation. After considering the historical context and examining the few Albanian etyma previously attributed to Angevin borrowings in past works, I argue there are at least thirteen highly plausible borrowings from the langue d'oïl in Albanian, consisting of two cases that have been supported in recent scholarship and eleven new cases that I present: napë, parriz, pëllas, kurt, trevë, bërsi, beronjë, punjashë, turbë, fe, and pre. The motivation for each is given by the diachronic phonology of Albanian, corpus evidence for attestation of relevant forms in Old French and langue d'oïl varieties, and historical semantic and sociolinguistic considerations.
Diachronica, Nov 28, 2022
Traditionally, historical phonologists have relied on tedious manual derivations to sequence the sound changes that have shaped the phonological evolution of languages. However, humans are prone to errors, and cannot track thousands of parallel derivations in any efficient manner. We demonstrate computerized forward reconstruction (CFR), deriving each etymon in parallel, as a task with metrics to optimize, and as a tool which drastically facilitates inquiry. To this end we present DiaSim, an application which simulates “cascades” of diachronic developments over a language’s lexicon and provides various diagnostics for “debugging” those cascades. We test our method on a Latin-to-French reflex prediction task, using a newly compiled, publicly available dataset FLLex consisting of 1368 paired Latin and modern French forms. We also introduce a second dataset, FLLAPS, which maps 310 reflexes from Latin through five attested intermediate stages up to Modern French, derived from Pope (1934)’s periodic development tables. We present publicly available rule cascades: the baseline BaseCLEF and BaseCLEF* cascades, based on Pope (1934)’s widely-cited view of French development, and DiaCLEF, made from incremental corrections to BaseCLEF aided by DiaSim’s diagnostics. DiaCLEF outperforms the baselines by large margins, improving raw accuracy on FLLex from 3.2% to 84.9% of etyma, with similarly large improvements for each of FLLAPS’ periods. Changes were made to build DiaCLEF considering only the baseline and DiaSim’s diagnostics, but they often independently reproduced past work in French diachronic phonology, corroborating both our procedure and past endeavors; we discuss the implications of some of our findings in detail.
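The general idea of forward reconstruction as a task with an accuracy metric can be sketched in a few lines of Python. This is a toy under stated assumptions, not DiaSim's rule format or interface: the `(pattern, replacement)` rule encoding, the function names, and the four-word lexicon are all hypothetical, and the strings are invented rather than drawn from FLLex.

```python
import re
from typing import List, Tuple

Rule = Tuple[str, str]          # hypothetical encoding: (regex pattern, replacement)
Lexicon = List[Tuple[str, str]]  # (etymon, attested reflex)

def apply_cascade(etymon: str, cascade: List[Rule]) -> str:
    """Derive a predicted reflex by applying each rule in order."""
    form = etymon
    for pattern, replacement in cascade:
        form = re.sub(pattern, replacement, form)
    return form

def accuracy(lexicon: Lexicon, cascade: List[Rule]) -> float:
    """Share of etyma whose predicted reflex equals the attested one."""
    hits = sum(apply_cascade(src, cascade) == gold for src, gold in lexicon)
    return hits / len(lexicon)

def errors(lexicon: Lexicon, cascade: List[Rule]):
    """List (etymon, predicted, attested) for every mismatch, for 'debugging'."""
    out = []
    for src, gold in lexicon:
        pred = apply_cascade(src, cascade)
        if pred != gold:
            out.append((src, pred, gold))
    return out

# Invented toy data (assumption): not real Latin or French forms.
toy_lexicon = [("pata", "pada"), ("tapa", "taba"), ("kata", "kada"), ("apa", "aba")]
baseline = [(r"(?<=a)t(?=a)", "d")]              # voices /t/ between /a/ vowels
revised = baseline + [(r"(?<=a)p(?=a)", "b")]    # correction suggested by the errors

print(accuracy(toy_lexicon, baseline), errors(toy_lexicon, baseline))
print(accuracy(toy_lexicon, revised))
```

Here the baseline's errors all involve an unvoiced /p/ between vowels, which points to the missing rule; adding it raises accuracy on the toy lexicon, mirroring in miniature the incremental correction loop the abstract describes.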
Peer-reviewed Conference Proceedings
LT4HALA, 2020
Traditionally, historical phonologists have relied on tedious manual derivations to calibrate the sequences of sound changes that shaped the phonological evolution of languages. However, humans are prone to errors, and cannot track thousands of parallel word derivations in any efficient manner. We propose to instead automatically derive each lexical item in parallel, and we demonstrate forward reconstruction as both a computational task with metrics to optimize, and as an empirical tool for inquiry. To this end we present DiaSim, a user-facing application that simulates “cascades” of diachronic developments over a language’s lexicon and provides diagnostics for “debugging” those cascades. We test our methodology on a Latin-to-French reflex prediction task, using a newly compiled dataset FLLex with 1368 paired Latin/French forms. We also present FLLAPS, which maps 310 Latin reflexes through five stages until Modern French, derived from Pope (1934)’s sound tables. Our publicly available rule cascades include the baselines BaseCLEF and BaseCLEF*, representing the received view of Latin to French development, and DiaCLEF, built by incremental corrections to BaseCLEF aided by DiaSim’s diagnostics. DiaCLEF vastly outperforms the baselines, improving final accuracy on FLLex from 3.2% to 84.9%, with similar improvements across FLLAPS’ stages.
Bachelor's thesis
Marr, Clayton, "Towards a computational model of long-term diachronic change: simulating the development of Classical Latin to Modern French" (2017). Senior Capstone Projects. 710.
Table of Contents:
1. Abstract
2. Introduction: Historical Phonology
3. Major controversies in Historical Phonology
3a. How does sound change? Neogrammarian regularism confronts lexical diffusionism
3b. Why does sound change? Strict internalism confronts language contact effects
4. Introduction: The French Language and Historical Phonology
4a. Diachronic sound change and Neogrammarian regularity in French
4b. Historical dialect and register relations in French
4c. Language contact effects in French
5. Computational models of diachronic change: reconstruction and simulation
6. Description of this language simulation package
6a. Representation of phonological/phonetic units
6b. A class hierarchy of Phone objects
6c. Modeling of phonological ...
Conference Proceedings
This paper discusses Siena’s Clinical Decision Assistant’s (SCDA) system and its participation in the Text Retrieval Conference (TREC) Clinical Decision Support Track (CDST) of 2015. The overall goal of the 2015 track is to link medical cases to information that is pertinent to patient care. Participants were given a set of 30 topics in the form of medical case narratives and a snapshot of 733,138 articles from PubMed Central (PMC). The 30 topics were annotated into three major subsets: diagnosis, test and treatment, with ten of each type. Each topic serves as an idealized representation of actual medical records and includes both a description, which contains a complete account of the patient visit, and a summary, which is typically a one or two sentence summary of the main points in the description. SCDA used several methods to attempt to improve the accuracy of medical cases retrieved. SCDA used the metathesaurus Unified Medical Language System (UMLS) that was implemented using Meta...
Talks
ICHL 26, Workshop 3: Computational models of diachronic language change, 2023
Computerized forward reconstruction, or CFR (Sims-Williams, 2018), offers an automatic and
systematic means of testing hypotheses about the chronology of sound change in a language. While
computing the effects of historical sound changes over millennia for thousands of etyma is laborious
and extremely time-consuming, this task is accomplished within seconds by a CFR system such as
DiaSim, which was created for not only evaluating hypothesized relative chronologies of sound
changes, or “diachronic cascades”, but also “debugging them” by reporting statistics on how errors
pattern (Marr and Mortensen, 2020). As a test case, past work applied this system to the phonological
evolution of Latin into French, and a CFR-enabled “debugging” procedure improved accuracy from a
3.2% baseline for a cascade based on the 1934 received view to 84.9%. In the process, various proposals
in the post-1934 literature on French were supported by the fact that they were independently
produced as part of a systematic debugging process using DiaSim that was undertaken without
reference to them (Marr and Mortensen, 2022), while the endeavor also may have revealed a new
regular sound change in Old French, which was ultimately robustly supported by additional data
(Marr, 2023b). However, as French boasts both a large corpus since medieval times and extensive past
research, the experiment with French was more of a “laboratory run” to test the validity of the
approach of debugging a language’s historical phonology via CFR, a prelude to bringing it into the field
as an investigative technique.
This paper will bring in CFR to tackle Albanian diachronic phonology, starting with the Latin stratum
of its lexicon. Given the lack or loss of attestation of Albanian before the 15th century and its status
as the only surviving member of its branch of Indo-European (Rusakov, 2018), reconstruction of
Albanian diachronic phonology, and thus of Proto-Albanian, has always leaned heavily on the
outcomes of strata of loanwords in Albanian from better-attested sources (Orel, 2000). Of these, the
Latin layer (Çabej, 1962; Bonnet, 1998) is by far the most significant. Latin loanwords are more
numerous than inheritance from Proto-Indo-European, Proto-Albanian is dated in relation to the time
of contact with Latin, and Albanian diachronic phonology is in a large part an exercise in generalization
from analyses of the outcomes of ancient Latin loans (Orel, 2000; Demiraj, 2006; Rusakov, 2017; De
Vaan, 2018), though with significant contributions from Albanian historical dialectology (Curtis,
2018) and the other “layers”. Nevertheless, issues do remain that concern the Latin layer of Albanian,
such as rival etymologies between imperial-era Latin loans and later Romance loans (Bonnet, 1998),
and these have potential implications for the reconstruction of Proto-Albanian, and the greater
mysteries of the language’s history within the Balkans (Friedman and Joseph, 2022). Thus, an
evaluation and debugging of the received view on Albanian diachronic phonology as applied to its
largest single pillar, the Latin stratum, offers both a new approach to an old but still vexing problem,
and a step for CFR as an empirical method, between the curated “lab” case of French, and the “field”
of understudied languages and language families.
This endeavor will apply DiaSim to CLEA, a dataset compiled in 2020–2022 and to be released with
this paper, of 1007 Albanian etyma of ancient Latin origin as asserted by at least one of a set of reputed
references (Bonnet, 1998; Orel, 1998, 2000; De Vaan, 2018; Topalli, 2017; Çabej, 1986), and will work
from a base cascade representing the views of Orel (2000) and De Vaan (2018). The same debugging
process as Marr and Mortensen (2022) will be applied, with accuracy reported for modern Albanian
outcomes, and discussion of any systematic patterning of errors and possible solutions proposed.