Wikidata:Property proposal/form decomposition

form decomposition

Originally proposed at Wikidata:Property proposal/Lexemes

   Done: form decomposition (P12527) (Talk and documentation)
Description: form decomposition
Data type: Form
Domain: form
Example 1: nin9-zu/𒎏𒍪 (L643660-F2) → nin9/𒎏 (L643660-F1), nin9-zu/𒎏𒍪 (L643660-F2) → zu/𒍪 (L1116255-F1) (see also elaboration of Example 1)
Example 2: lugal/𒈗 (L643713-F10) → lugal[king][-ak][-ø] N.GEN.ABS (see elaboration of Example 2)
Example 3: in-pa3/𒅔𒅆𒊒 (L741253-F2) → i-n-pad[name][-ø] FIN.3-SG-H-A.V.3-SG-P (see elaboration of Example 3)
Planned use: Linking forms to their components (Lexemes) and attaching the grammatical role of these components
See also: combines lexemes (P5238), which allows decomposing a Lexeme into the lexemes that are parts of it

Motivation

Sumerian, as an agglutinative language, derives its grammatical features mainly from chains of suffixes attached to a lexeme.

In Wikidata, we can already model the individual suffixes as Lexemes, and we can create QIDs for the grammatical features that we need to describe a Lexeme Form.

What is missing is a way to decompose a lexeme form so as to show which suffixes express the grammatical features assigned to the form.

One might argue that this is a trivial matter, as only suffixes are added and they can be described sufficiently to represent a grammatical feature.

However, in Sumerian, the interpretation of a word is usually broken down into a description of the chain of suffixes, or even vowels in suffixes, as exemplified here:

http://oracc.museum.upenn.edu/etcsri/parsing/index.html

This interpretation of a Sumerian form can become quite complex and is worth modeling in Wikidata, in my opinion.

To do that, we would need a property that represents the decomposition of a form, similar to "combines lexemes". We could then list the individual suffixes or parts of suffixes in order, e.g. with "series ordinal", and so describe the decomposition of the lexeme form completely in RDF.
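
As a rough illustration of that idea, the following Python sketch builds ordered decomposition statements for Example 1. P12527 is the property ID this proposal was assigned and P1545 is Wikidata's "series ordinal"; the `Part` class, the `decomposition_statements` helper, and the tuple layout are hypothetical names chosen for this sketch and are not part of any Wikidata API.

```python
from dataclasses import dataclass

FORM_DECOMPOSITION = "P12527"  # "form decomposition", as created from this proposal
SERIES_ORDINAL = "P1545"       # Wikidata's "series ordinal" qualifier


@dataclass(frozen=True)
class Part:
    """One component of a form (hypothetical helper type for this sketch)."""
    form_id: str  # ID of the component form, e.g. "L643660-F1"
    ordinal: int  # position of the component within the chain


def decomposition_statements(form_id, parts):
    """Return one (subject, property, object, qualifiers) tuple per part,
    sorted by position in the chain."""
    ordered = sorted(parts, key=lambda part: part.ordinal)
    return [
        (form_id, FORM_DECOMPOSITION, part.form_id,
         {SERIES_ORDINAL: str(part.ordinal)})
        for part in ordered
    ]


# Example 1: nin9-zu (L643660-F2) = nin9 (L643660-F1) + zu (L1116255-F1)
statements = decomposition_statements(
    "L643660-F2",
    [Part("L1116255-F1", 2), Part("L643660-F1", 1)],
)
for statement in statements:
    print(statement)
```

The ordinal is stored as a string because "series ordinal" takes string values; the role of each part (HEAD, 2-SG-POSS) could be attached as a further qualifier in the same dictionary.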

Usage for other languages

This property could also be applied to many other languages, such as:

  • Turkish and Japanese, as agglutinative languages (though perhaps with a clearer representation of suffixes), e.g. all forms of 因る/よる (L11476)
  • Arguably Indo-European languages, e.g. German gehst (L1026-F4) could be separated into "geh" (STEM) and "st" (second person singular present, indicative, active)
  • Akkadian cuneiform will need similar patterns for verbs, but also involves verbal roots; Arabic may then be applicable as well

Elaboration on examples

This section elaborates on the three Sumerian examples listed above.

  • Form: nin9-zu / 𒎏𒍪
  • Grammatical interpretation: nin9=HEAD.zu=2-SG-POSS

This noun carries a second person singular possessive, which is marked with the suffix zu/𒍪 (L1116255).

We would like to express that the possessive is marked with zu/𒍪 (L1116255) and that nin/𒊩𒌆 (L643660) is the HEAD and carries the meaning of the noun.

Representation in Wikidata

Taken from: https://github.com/cdli-gh/CDLI-CoNLL-to-CoNLLU-Converter/blob/master/resources/P100065.conll

  • Genitive absolutive form of lugal (king)
  • r.1.4 lugal lugal[king][-ak][-ø] N.GEN.ABS

This example shows that the three forms (lugal/𒈗 (L643713-F7), lugal/𒈗 (L643713-F9), lugal/𒈗 (L643713-F10)) are written in the same way: "lugal". Additional information is therefore needed to explain how each of these forms is composed.

The genitive absolutive form of lugal/𒈗 (L643713), lugal/𒈗 (L643713-F10), comprises three components:

  1. the STEM (lugal)
  2. the particle (-ak)
  3. the non-written marker for the absolutive case (it is always left empty)

In lugal/𒈗 (L643713-F10), the (-ak) is also not written, so the form is indistinguishable from lugal/𒈗 (L643713-F7) and lugal/𒈗 (L643713-F9) without additional context.

Hence, we would like to break down the grammatical composition with reference to the written and non-written parts of the form.
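
The distinction between written and unwritten parts can be sketched in Python as follows. This is only an illustration under assumptions: the `Morpheme` class, its `written` flag, and the `surface` helper are hypothetical, the role labels are taken from the annotation N.GEN.ABS, and `bare_stem` stands in for a form that consists of the stem alone (the composition of F7 and F9 is not spelled out in this proposal).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    """One component of a form (hypothetical helper type for this sketch)."""
    text: str      # transliteration of the component
    role: str      # grammatical role label, e.g. "STEM", "GEN", "ABS"
    written: bool  # whether the component appears in the written form


def surface(parts):
    """Concatenate only the components that are actually written."""
    return "".join(part.text for part in parts if part.written)


# lugal (L643713-F10): stem + unwritten -ak + unwritten absolutive marker
lugal_f10 = [
    Morpheme("lugal", "STEM", True),
    Morpheme("ak", "GEN", False),
    Morpheme("ø", "ABS", False),
]

# Hypothetical form made up of the stem only, for comparison
bare_stem = [Morpheme("lugal", "STEM", True)]

print(surface(lugal_f10))                        # lugal
print(surface(lugal_f10) == surface(bare_stem))  # True
print([part.role for part in lugal_f10])         # ['STEM', 'GEN', 'ABS']
```

Both forms yield the same written string, while the lists of roles differ; this is the information the decomposition is meant to preserve.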

Representation in Wikidata

Taken from: https://github.com/cdli-gh/CDLI-CoNLL-to-CoNLLU-Converter/blob/master/resources/P100065.conll

  • r.3.3 in-pa3 i-n-pad[name][-ø] FIN.3-SG-H-A.V.3-SG-P

This example shows the form in-pa3/𒅔𒅆𒊒 (L741253-F2), which has the sense "to name" in Sumerian.

The verb form marks both its subject and its direct object, each with its own grammatical features.

  • Subject: The subject is described as "third person singular finite human agent", which manifests itself in the prefix "in"
  • Direct Object: The direct object is described as "third person singular" and manifests itself in the unwritten suffix -ø (L1117775-F1).
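
The pairing of morphemes and grammatical labels in such an annotation line can be sketched as a small Python function. It assumes, based only on the two annotation lines quoted in this proposal, that prefixes and stem are joined by hyphens, that suffixes appear as `[-x]`, that the bracketed translation (e.g. `[name]`) carries no label, and that labels are separated by dots; the function name `align` is hypothetical.

```python
import re


def align(segmentation, gloss):
    """Pair each morpheme of a segmentation with its label in the gloss.

    Assumed layout: prefix-...-stem[translation][-suffix]... and
    dot-separated labels, one per morpheme.
    """
    # Prefixes and stem are everything before the first bracket.
    head = segmentation.split("[", 1)[0]
    # Suffixes are bracketed and start with a hyphen; the translation does not.
    suffixes = re.findall(r"\[-([^\]]+)\]", segmentation)
    morphemes = head.split("-") + suffixes
    labels = gloss.split(".")
    if len(morphemes) != len(labels):
        raise ValueError("morphemes and labels do not line up")
    return list(zip(morphemes, labels))


print(align("i-n-pad[name][-ø]", "FIN.3-SG-H-A.V.3-SG-P"))
# [('i', 'FIN'), ('n', '3-SG-H-A'), ('pad', 'V'), ('ø', '3-SG-P')]

print(align("lugal[king][-ak][-ø]", "N.GEN.ABS"))
# [('lugal', 'N'), ('ak', 'GEN'), ('ø', 'ABS')]
```

Each resulting pair corresponds to one component that a "form decomposition" statement would point to, together with the role that component plays.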
Representation in Wikidata

 – The preceding unsigned comment was added by Situxx (talk • contribs) at 21:45, April 28, 2023 (UTC).

Discussion

  • @Situxx: OK, having thought about it a bit more I have some more specific comments: I am not sure if it is necessary to use subject stated as to qualify the statements. An issue with that property is that it does not allow specifying a language code for the string, and for languages where this information could be presented in multiple ways it is unclear how to use it. One approach I have been going with for "zero morphemes", which are common in Punjabi, is to use non-printing Unicode characters as representations of individual forms. ("Left to Right Mark" and "Arabic Letter Mark" for LTR and RTL representations respectively.) This allows attaching additional data to the zero representation, and indicating that a form is an empty string without using a qualifier. See ਇ/ِؑ (L718607) for example, where this verbal suffix is most often unrealized, but has different forms historically and in some dialects. Rather than using subject named as, I think it would make sense to separate forms like this and select the combining form which has the correct representative string(s).
Then, for example, it could be stated that ਉੱਠ/اُٹھّ (L689060-F14) employs the suffix ‎/؜ (L718607-F1), while ਉੱਠੀ/اُٹھّی (L689060-F15) employs ਈ/ئی (L718607-F3). It would not be clear in the second case how to represent both ਈ and ی using subject named as, whereas using the linked form in both cases we can get a representation of the combining form for each language/script code -عُثمان (talk) 20:25, 7 June 2023 (UTC)
Thank you very much for your remarks. I think using zero morphemes as forms for the suffixes we have in Sumerian that can be omitted is a great idea. I will adapt that and update my proposal accordingly. As for subject has role vs. object has role I think you are right. It should be object has role as the role of the suffix is described and not the grammatical feature of the subject (the form) which is already described in the grammatical feature description. I will adapt that as well and give you a heads up once I am done. Situxx (talk) 13:47, 9 June 2023 (UTC)
@Situxx: have the adaptations you mentioned been done? Mahir256 (talk) 14:51, 24 January 2024 (UTC)
Sorry about that. I forgot about this.
@عُثمان I have used "object has role" in the property proposal now and I used the null morpheme instead of the empty string representation in the examples.
I think that improved the proposal a lot. Situxx (talk) 00:56, 29 January 2024 (UTC)
@Mahir256, عُثمان: pinging for attention. Regards, ZI Jony (Talk) 06:21, 29 January 2024 (UTC)