Wikidata:WikiProject Ontology/Cleaning Task Force


This page serves as the main page for the ontology cleaning task force's efforts to clean up the Wikidata ontology.

The task force arose from a presentation (https://commons.wikimedia.org/wiki/File:Wikidata_Challenges_in_Semantic_Web_Community.pdf) by Andrea Westerinen at the 2023 Wikidata Data Modelling Days (https://www.wikidata.org/wiki/Wikidata:Events/Data_Modelling_Days_2023).

Anyone interested in working to improve the Wikidata ontology is welcome to join the task force.

Scope


The scope of the task force has not yet been determined but just about any effort to improve the Wikidata ontology will probably be considered in scope, ranging from theoretical analyses of the ontology, to techniques for reasoning in Wikidata, to implementation of tools that help improve the ontology, to documentation of problems with the ontology, to direct editing of the ontology. There is a welcome page that gives some background on the Wikidata ontology and the task force.

Participation


Anyone interested in working to improve the Wikidata ontology is welcome. Add your user identification below to join the group. Add issues that you are interested in or currently work on to the sections below.

Participants

  1. Peter Patel-Schneider
  2. Lectrician1
  3. Andrea Westerinen
  4. PKM (talk)
  5. Chris Mungall
  6. Mahir256
  7. Egezort
  8. Rosario Uceda-Sosa

Meetings


The task force uses Google Meet for meetings, as described in the task force's calendar. Currently meetings are on Tuesdays at 11 am ET (currently GMT-4). Members of the task force are also active in the WikiProject Ontology Telegram group.

Next meeting agenda


Task force members should add agenda items for topics they wish to discuss. The default agenda has reports on the current efforts. If there is significant work on any effort please add something to its agenda item.

Tuesday, 26 November 2024, 11am ET

Peter will miss the meeting due to a vacation.

  • Our meeting length is limited to 60 minutes by Google so meetings may be terminated abruptly.
  • Welcome new members
    • If you can't make meetings please add information to this page about your interests and activities.
  • PropBank discussion if there has been recent progress
  • Interests of new members
  • Reports on status of current efforts.
Tuesday, 3 December 2024, 11am ET

Peter will miss the meeting due to a vacation.

Tuesday, 10 December 2024, 11am ET

Peter may miss the meeting due to travel.

Previous meeting notes


See Wikidata:WikiProject Ontology/Cleaning Task Force/Meetings for notes from prior meetings.

Current efforts and Information


Feel free to add efforts that you are participating in here and report on your progress.

See the task force Phabricator board for current and potential efforts being tracked in Phabricator.

Significant Changes


Some actual and proposed significant changes to the Wikidata ontology may be catalogued in the changes page.

Disjointness


Peter Patel-Schneider

There is a paper on disjointness in Wikidata at https://arxiv.org/abs/2410.13707

Wikidata has disjoint union of (P2738), which states that a class is the disjoint union of a list of other classes. This implies that the classes in the list are pairwise disjoint. I am preparing a fuller report on violations of these disjointnesses, expanding on the information below. See User:Peter F. Patel-Schneider/disjoint_violations for a report on violations of the disjointnesses resulting from disjoint union of (P2738) statements.
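As a rough illustration of how one disjoint union of statement expands into pairwise disjointness, here is a minimal Python sketch. The function name and the toy data are hypothetical (labels stand in for Q-ids, and the "state of matter" entry is invented for the example); it is not the code used to produce the report.

```python
from itertools import combinations

def implied_disjoint_pairs(disjoint_unions):
    """Return the set of unordered class pairs implied to be disjoint.

    disjoint_unions maps a class to a list of lists; each inner list holds
    the classes named in one 'disjoint union of' statement on that class.
    """
    pairs = set()
    for union_lists in disjoint_unions.values():
        for members in union_lists:
            # Every two distinct classes in the same list are disjoint.
            for a, b in combinations(sorted(set(members)), 2):
                pairs.add((a, b))
    return pairs

# Hypothetical toy data, not taken from a Wikidata dump.
example = {
    "object": [["concrete object", "abstract object"],
               ["artificial object", "natural object"]],
    "state of matter": [["solid", "liquid", "gas"]],
}
pairs = implied_disjoint_pairs(example)
print(len(pairs))
```

A list of n classes contributes n(n-1)/2 pairs, which is why a few long lists can account for most of the implied pairs.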

Enumeration


There are 740 disjoint union of (P2738) statements on a total of 603 classes, creating 6995 implied disjoint pairs of classes. After excluding all those related to day (Q573), only 1455 remain. Excluding several other not-very-interesting groupings results in 870 implied disjoint pairs. For 123 of these pairs, there are a total of 18329 items that are either subclasses of both classes in the pair (with no superclass also being a subclass of both) or instances of both classes (that have no subclass in common). Some of them appear to be simple mistakes about other classes or individuals, like liquid helium being a gas and Horse Grenadiers being a person. Others appear to be mistakes about disjointness, like game of skill stated as disjoint from game of chance, and some perhaps result from confusion about disjointness between classes, like female given name stated as disjoint from male given name. Someone should probably go through this list and try to cut it down. Some large parts of the list appear to be caused by a single questionable subclass of link or a single questionable disjoint union of statement.
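The instance half of such a violation check could look roughly like the following Python sketch. It is a simplification under stated assumptions: the data is a hypothetical in-memory toy graph, the helper names are invented, and it does not apply the refinements mentioned above (such as skipping items whose superclass is already a violation).

```python
def superclasses(cls, subclass_of):
    """Return cls together with all of its transitive superclasses."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(subclass_of.get(current, ()))
    return seen

def instance_violations(instance_of, subclass_of, disjoint_pairs):
    """Yield (item, a, b) when item is an instance of both a and b."""
    for item, direct_classes in instance_of.items():
        all_classes = set()
        for cls in direct_classes:
            all_classes |= superclasses(cls, subclass_of)
        for a, b in disjoint_pairs:
            if a in all_classes and b in all_classes:
                yield (item, a, b)

# Hypothetical toy graph: "sample" is typed under two disjoint classes.
subclass_of = {"noble gas": {"gas"}}
instance_of = {"sample": {"liquid", "noble gas"}, "water drop": {"liquid"}}
violations = list(instance_violations(instance_of, subclass_of,
                                      [("gas", "liquid")]))
print(violations)
```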

High-level disjointness


The Wikidata ontology does not make many high-level distinctions. Nonetheless, there are disjoint union of statements that make these distinctions, including disjointness of concrete object and abstract object under object, disjointness of artificial object and natural object under object, and disjointness of abstract entity and concrete object under entity. Each of these disjointnesses has many classes that are subclasses of both disjoint classes.

What should be done here? My suggestion is to look at each of the high-level disjointnesses and determine whether the disjointness is correct in the Wikidata ontology. If the disjointness is correct, then edits should be made to both classes and individuals to remove violations of the disjointness. If the disjointness is not correct, it should be removed and replaced with appropriate disjointness between more-specific classes.

Class order


Peter Patel-Schneider

A class is first-order if it has no classes as instances. A class is second-order if it has only first-order classes as instances. Similarly for higher orders. Wikidata has several metaclasses - first-order class (Q104086571), second-order class (Q24017414), third-order class (Q24017465), fourth-order class (Q24027474), and fifth-order class (Q24027515) - that allow stating that a class has a particular order. There are lots of classes that are instances of one (or more) of these metaclasses but that violate the requirements for the class order. Some of these are individual errors but many are caused by a general confusion on orders (e.g., diseases and colors).
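The definition of order above can be sketched in Python as follows. This is only an illustration under simplifying assumptions: it looks at direct instances in a hypothetical in-memory mapping, ignores subclass links, and the function name and toy data are invented for the example.

```python
def class_order(cls, instances, classes, _visiting=None):
    """Return the order of cls (1, 2, ...) or None if it has no fixed order.

    instances maps a class to the set of its direct instances; classes is
    the set of items that are themselves classes.
    """
    visiting = _visiting if _visiting is not None else set()
    if cls in visiting:
        return None              # instance-of cycle: no well-defined order
    members = instances.get(cls, set())
    class_members = [m for m in members if m in classes]
    if not class_members:
        return 1                 # no classes as instances: first-order
    if len(class_members) != len(members):
        return None              # mixes classes and non-classes
    visiting = visiting | {cls}
    orders = {class_order(m, instances, classes, visiting)
              for m in class_members}
    if len(orders) == 1 and None not in orders:
        return orders.pop() + 1  # all instances share order n: order n + 1
    return None

# Hypothetical toy data using labels instead of Q-ids.
instances = {
    "taxon": {"gorilla"},
    "gorilla": {"Moja"},
    "mixed": {"gorilla", "Moja"},
}
classes = {"taxon", "gorilla", "mixed"}
print(class_order("gorilla", instances, classes))
print(class_order("taxon", instances, classes))
print(class_order("mixed", instances, classes))
```

A class for which the function returns None, or a value different from the order stated by its metaclass, would be a candidate violation in the sense described above.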

See https://www.wikidata.org/wiki/User:Peter_F._Patel-Schneider/order_violations for a longer description of class orders and a long list of violations as of 8 January 2024.

See https://www.wikidata.org/wiki/Wikidata:WikiProject_Ontology/Class_Order for some discussion about problems with class order in products.

Inactive Efforts


Alignment with schema.org


Andrea Westerinen

This would involve finding or creating Wikidata classes for each class in schema.org and checking that the generalizations in one are exactly the generalizations in the other.

Another possibility would be to review and update the GitHub Wikidata-schema.org mapping work.

Relevant query


Notes:

  • This query filters equivalent class claims to schema.org references
  • This assumes that the equivalent class property is the proper place to record the relationship from a Wikidata item to a schema.org "term"
  • Not all schema.org URIs follow the same syntax
  • Assuming that this "source file" contains the majority of the schema.org classes, then Wikidata is missing a lot of these linkages
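The query itself is not reproduced on this page. As a hedged sketch of what a query matching the notes above might look like, the following Python helper builds a SPARQL string for the Wikidata Query Service. The helper name, variable names, and the choice to accept both http and https prefixes are assumptions made for illustration; only P1709 (equivalent class) comes from this page.

```python
def schema_org_mapping_query(prop="P1709",
                             prefixes=("http://schema.org/",
                                       "https://schema.org/")):
    """Build a SPARQL query listing items whose `prop` value is a schema.org URI."""
    # Accept several URI prefixes, since not all schema.org URIs share one syntax.
    filters = " || ".join(
        f'STRSTARTS(STR(?target), "{prefix}")' for prefix in prefixes
    )
    return (
        "SELECT ?item ?itemLabel ?target WHERE {\n"
        f"  ?item wdt:{prop} ?target .\n"
        f"  FILTER({filters})\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )

query = schema_org_mapping_query()
print(query)
```

The printed text can be pasted into the query service, where the wdt:, wikibase:, and bd: prefixes are predefined.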

Output of RDF/OWL


Andrea Westerinen

The current RDF output is valuable but insufficient to discover (and possibly correct) errors in Wikidata. There are two main reasons for this: 1) the output does not allow reasoning/consistency checking; and 2) it does not provide a mechanism to view constraint violations for items.

The work involves programmatically converting the RDF to RDF/OWL and including information on mandatory constraint violations.

Review subclasses of entity (Q35120)


Lectrician1

Wikidata Graph Builder

Possible efforts


Add efforts that you are interested in pursuing or that you think should be pursued.

See the task force Phabricator board for current and potential efforts being tracked in Phabricator.

  • Adopt some existing upper ontology as the basis for the Wikidata upper ontology. Care has to be taken here as some upper ontologies are quite prescriptive and Wikidata has to support multiple modeling styles.
    • As opposed to adopting an existing upper ontology, it may be beneficial to collect the concept distinctions that are most valuable across several (BFO, SUMO, DOLCE, etc.) and then map the Wikidata concept model onto these distinctions.
    • It appears that a lot of BFO has already been added, sometimes creating problems for Wikidata.
  • Add facilities to Wikidata that help prevent problems with the ontology.
    • Is the property constraint none-of sufficient to forbid values from a particular class?
    • Disjointness of concepts is also important to define.
  • Suggest enhanced EntitySchema design to aid in use of the upper and middle ontologies

Some Specific Questions to Answer

  • When aligning with schema.org, what is the right set of properties to use to indicate the mapping?
    • Some possibilities are P1709 (equivalent class), P3950 (narrower external class), P4900 (broader class, but does this work for external classes?), P2888 (exact match), P4070 (identifier shared with) and P1628 (equivalent property).
    • Or is there a single mapping property, perhaps with a qualifier indicating the nature of the mapping?
  • Should meta-class be included in the upper ontology? If so, why?
    • Some users don't like the idea of metaclasses. If one accepts that Q16521 (taxon) is a metaclass, then Q876500 (western lowland gorilla) is a class and Q12038481 (Moja) is P31 Q876500. People who don't like metaclasses, however, created P10241 (instance of taxon) to get around using P31 in this way.
  • How much should concepts rely on multiple inheritance to capture ambiguities (such as a geopolitical entity being both an agent/actor and a location)?
  • Is disjointness like a constraint that can be easily violated, or should it be considered inviolable?
  • There are temporally qualified ontology links (and they cause disjointness violations). Should it be required that the best one of these be given preferred rank? There is an example of this for the Berlin population in some Wikidata tutorial material.
  • Some material pulled from other ontologies makes assumptions that are not true in Wikidata, e.g., disjointness of object and property pulled from BFO. Should these be deprecated or removed?
  • The Wikidata ontology does not make firm decisions between some very high-level categorizations. For example, many classes are subclasses of both artificial object and natural object or both concrete object and abstract entity. Nevertheless these distinctions are important and both pairs are in disjoint unions under classes very high in the ontology. What should be done about this?

Assorted Topics


is metaclass for


Peter F. Patel-Schneider

Some thoughts on metaclass relationships. Where can this be publicized?

The Wikidata ontology has a number of situations that are not easily captured with just subclass and instance links.

One example is invasive amphibian (Q111535327), described in English as "amphibian that is spreading outside its original habitat" but whose instances are species; hence it is a subclass of (P279) invasive species (Q183368). But it is also stated to be a subclass of (P279) Amphibia (Q10908), whose instances are individual amphibians, so this subclass link is incorrect.

Instead, instances of instances of invasive amphibian are instances of Amphibia, i.e., instances of invasive amphibian are subclasses of Amphibia. There is an existing property in Wikidata for this relationship: is metaclass for (P8225).
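The intended reading of is metaclass for can be sketched in a few lines of Python: if M is metaclass for C, then every instance of M is a subclass of C. The function name and toy data below are hypothetical (the "cane toad" entry is only an illustrative species, not a claim about what Wikidata currently states), and Wikidata itself does not compute these links.

```python
def implied_subclass_links(is_metaclass_for, instance_of):
    """Return (subclass, superclass) links implied by 'is metaclass for'.

    is_metaclass_for maps a metaclass M to the classes C it is a metaclass
    for; instance_of maps an item to the classes it is an instance of.
    """
    implied = set()
    for item, item_classes in instance_of.items():
        for metaclass in item_classes:
            # Each instance of M is a subclass of every C with M metaclass for C.
            for target in is_metaclass_for.get(metaclass, ()):
                implied.add((item, target))
    return implied

# Hypothetical toy data using labels instead of Q-ids.
is_metaclass_for = {"invasive amphibian": {"Amphibia"}}
instance_of = {"cane toad": {"invasive amphibian"}}
links = implied_subclass_links(is_metaclass_for, instance_of)
print(links)
```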

But is metaclass for (P8225) is lacking support in Wikidata, as far as I know. Implications of statements using is metaclass for are not added to Wikidata. I don't know of any place that records these implications that are not already in Wikidata.

I would like to use is metaclass for when it is appropriate, and remove the incorrect subclass links. I would also like to have is metaclass for better supported.

One extra complication for using is metaclass for in biology is that the stated relationship used there is parent taxon (P171), a subproperty of subclass of (P279). This makes querying for existing subclass of relationships very difficult.


There is a page on what might be done to improve the Wikidata ontology at https://www.wikidata.org/wiki/Wikidata:Ontology_issues_prioritization. This was a result of a survey of Wikidata users on what problems they encountered when using the Wikidata ontology.

https://www.wikidata.org/wiki/Wikidata:Tools/Enhance_user_interface#Classification.js is a useful tool to show parts of the Wikidata ontology.