Help:Dataset sizing
Purpose
This page lists and defines a few standard metrics that can be computed on a subset of Wikidata items.
For metrics used elsewhere, it attempts to provide queries that can be run on the Wikidata Query Service.
Version
This is the version as of 20200823075556. Please use the "permanent link" on the left side when quoting this page.
Introduction
Sample queries to select the items:
- sleds:
SELECT ?item WHERE { ?item wdt:P279* wd:Q181388 }
- tennis:
SELECT ?item WHERE { ?item wdt:P641 wd:Q847 }
Knowledge Graphs on the Web -- an Overview (Q86997852) proposes a few metrics:
- a. # instances
- b. # assertions
- c. average linking degree
- d. median ingoing edges
- e. median outgoing edges
- f. # classes
- g. # relations
- h. average depth of class tree
- i. average branching factor of class tree (average width of class tree)
- j. ontological complexity
They are described at "3. Comparison of Knowledge Graphs" in the paper.
Discussion at Wikidata:Request a query#Dataset sizing.
The queries below are mostly based on truthy main statements (wdt:), not on qualifiers (pq:), references (pr:), sitelinks, or labels/descriptions/aliases. Please help expand them or add alternative ways to calculate the metrics.
A few other metrics are included as well.
Basic metrics
number of instances
- definition
- number of distinct items
# a. # instances
SELECT (COUNT(DISTINCT ?item) as ?nb_instance)
WHERE
{
?item wdt:P279* wd:Q181388 .
# ?item wdt:P641 wd:Q847 .
}
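The queries on this page hard-code the sled selector and keep the tennis selector as a commented-out line. As a minimal sketch, assuming the queries are assembled in Python before being submitted, the selector can be kept in one place and substituted into a template. The names `SELECTORS`, `COUNT_TEMPLATE`, and `build_count_query` are hypothetical and not part of Wikidata or any library; only the SPARQL fragments come from this page.

```python
# Hypothetical helper: swaps the item selector used by the instance-count query.
# The SPARQL fragments are copied from this page; the Python names are illustrative.
SELECTORS = {
    "sleds": "?item wdt:P279* wd:Q181388 .",
    "tennis": "?item wdt:P641 wd:Q847 .",
}

# Doubled braces are literal braces in str.format templates.
COUNT_TEMPLATE = """SELECT (COUNT(DISTINCT ?item) AS ?nb_instance)
WHERE
{{
  {selector}
}}"""


def build_count_query(dataset: str) -> str:
    """Return the instance-count query for one of the sample datasets."""
    return COUNT_TEMPLATE.format(selector=SELECTORS[dataset])


if __name__ == "__main__":
    print(build_count_query("tennis"))
```

The same substitution could be applied to the named subquery (`%a`) that the other queries share, so that switching datasets does not require editing each query by hand.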
number of assertions
# b. # assertions
# Tbd: include sitelinks?
SELECT (SUM(?st) as ?nb_assertions)
WITH
{
SELECT DISTINCT ?item ?st
WHERE
{
?item wdt:P279* wd:Q181388 .
# ?item wdt:P641 wd:Q847 .
?item wikibase:statements ?st .
}
} as %a
{
INCLUDE %a
}
average linking degree
# c. average linking degree
# TBD: include incoming links?
SELECT (AVG(?st) as ?avg_linking_degree)
WITH
{
SELECT DISTINCT ?item ?st
WHERE
{
?item wdt:P279* wd:Q181388 .
# ?item wdt:P641 wd:Q847 .
?item wikibase:statements ?st .
}
} as %a
{
INCLUDE %a
}
median ingoing edges
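# d. median ingoing edges: number of ingoing edges per item
# after running the query below, calculate the median of ?nb_ingoing_edges
# note: items without any ingoing edge produce no row and have to be counted as 0 separately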
SELECT ?item (COUNT(?wdt) as ?nb_ingoing_edges)
WITH
{
SELECT DISTINCT ?item
WHERE
{
?item wdt:P279* wd:Q181388 .
# ?item wdt:P641 wd:Q847 .
}
} as %a
{
INCLUDE %a
?p wikibase:directClaim ?wdt ; wikibase:propertyType wikibase:WikibaseItem .
[] ?wdt ?item
}
GROUP BY ?item
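The query above returns one row per item, so the median still has to be computed outside the query service. The following is a minimal sketch, assuming the result was downloaded in the standard SPARQL JSON results format and that the total number of items is known from the instance count (metric a). The function name `median_edge_count` and its parameters are hypothetical. Because items without any matching edge do not appear in a `GROUP BY` result, the sketch pads the list with zeros up to the dataset size before taking the median.

```python
import json
from statistics import median


def median_edge_count(result_json: str, count_var: str, nb_items: int) -> float:
    """Median of a per-item count, treating items missing from the result as 0."""
    bindings = json.loads(result_json)["results"]["bindings"]
    counts = [int(row[count_var]["value"]) for row in bindings]
    if len(counts) > nb_items:
        raise ValueError("more result rows than items in the dataset")
    # Items with no matching edge produce no row, so pad with zeros.
    counts.extend([0] * (nb_items - len(counts)))
    return median(counts)


if __name__ == "__main__":
    # Made-up sample data in the SPARQL JSON results layout, not real query output.
    sample = json.dumps({
        "head": {"vars": ["item", "nb_ingoing_edges"]},
        "results": {"bindings": [
            {"item": {"type": "uri", "value": "http://example.org/A"},
             "nb_ingoing_edges": {"type": "literal", "value": "4"}},
            {"item": {"type": "uri", "value": "http://example.org/B"},
             "nb_ingoing_edges": {"type": "literal", "value": "2"}},
        ]},
    })
    print(median_edge_count(sample, "nb_ingoing_edges", nb_items=4))
```

The same function could be reused for the outgoing-edge counts by passing `nb_outgoing_edges` as `count_var`. Whether items with zero edges should be part of the median is a definitional choice; passing the number of returned rows as `nb_items` reproduces the median over the returned rows only.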
median outgoing edges
# e. median outgoing edges: number of outgoing edges per item
# after running the query below, calculate the median of ?nb_outgoing_edges
# alternative method: include external id properties
SELECT ?item (COUNT(?wdt) as ?nb_outgoing_edges)
WITH
{
SELECT DISTINCT ?item
WHERE
{
?item wdt:P279* wd:Q181388 .
# ?item wdt:P641 wd:Q847 .
}
} as %a
{
INCLUDE %a
?p wikibase:directClaim ?wdt ; wikibase:propertyType wikibase:WikibaseItem .
?item ?wdt []
}
GROUP BY ?item
number of relations
# g. # relations
# currently counts only properties; could be expanded to other kinds of relations
SELECT (COUNT(DISTINCT ?wdt) as ?nb_relations)
WITH
{
SELECT DISTINCT ?item
WHERE
{
?item wdt:P279* wd:Q181388 .
# ?item wdt:P641 wd:Q847 .
}
} as %a
{
INCLUDE %a
?p wikibase:directClaim ?wdt .
{ ?item ?wdt [] } UNION { [] ?wdt ?item }
}
number of classes (types)
- definition
- number of distinct values used with instance of (P31) or subclass of (P279)
- query
# f. # classes
SELECT (COUNT(DISTINCT ?class) as ?nb_classes)
WITH
{
SELECT DISTINCT ?item
WHERE
{
?item wdt:P279* wd:Q181388 .
# ?item wdt:P641 wd:Q847 .
}
} as %a
{
INCLUDE %a
?item (wdt:P31|wdt:P279) ?class
}
Most frequent
most frequently used properties
- definition
- properties most frequently used as main values (truthy values)
- query
most frequent sitelinks
- definition
- most frequently linked WMF sites (Wikipedia, Commons, Wikisource, etc.)
- query
most frequently used classes (types)
- definition
- most frequent values used with instance of (P31) or subclass of (P279). Sometimes limited to P31.
- query