When testing the new merging updater we found that the number of triples dropped significantly (T231411#5674890).
One suggested reason is that the way we sync blank nodes has changed. We already know that some are duplicated in the triple store (T231515), and because of their nature (no stable unique ID) such blank nodes are hard to keep in sync with the triple store.
This ticket is about tracking how blank nodes are used in the RDF output from Wikibase and making sure that we do not duplicate them during the update process.
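The duplication risk can be illustrated with a toy model. This is only a sketch and not the updater's actual code: triples are plain tuples, the store is a list so that duplicates stay visible, and the blank node labels and the `naive_update` helper are made up for the illustration.

```python
# Toy model of a diff-based update. Not the real updater: all names are hypothetical.

def is_blank(term):
    """Blank nodes are written with the '_:' prefix, as in Turtle."""
    return term.startswith("_:")

def naive_update(store, old_doc, new_doc):
    """Diff two versions of an entity's triples and apply the result to the store.

    Blank node labels only mean something inside one document, so a triple
    holding a blank node never compares equal across versions, and its old
    label cannot be used to address the node for deletion in the store.
    """
    to_delete = set(old_doc) - set(new_doc)
    to_add = set(new_doc) - set(old_doc)
    for triple in to_delete:
        # Only triples without blank nodes can be deleted by value.
        if not any(is_blank(term) for term in triple) and triple in store:
            store.remove(triple)
    store.extend(sorted(to_add))
    return store

# Same statement fetched twice: only the blank node label differs.
v1 = [("s:stmt", "pq:P407", "_:genid1"), ("s:stmt", "ps:P138", "wd:Q10987")]
v2 = [("s:stmt", "pq:P407", "_:genid7"), ("s:stmt", "ps:P138", "wd:Q10987")]

store = list(v1)
naive_update(store, v1, v2)
blank_objects = [t for t in store if is_blank(t[2])]
print(len(blank_objects))  # 2: the "unknown value" is now stored twice
```

Nothing changed in the entity, yet the store ends up with two blank node triples for the same qualifier.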
Blank nodes used to denote "unknown value" in Wikidata
These are never used as subjects.
As the object of a statement qualifier:
s:Q233-a9844587-4029-33bc-7b34-13b0d3c10ed3 a wikibase:Statement, wikibase:BestRank ;
    wikibase:rank wikibase:NormalRank ;
    ps:P138 wd:Q10987 ;
    pq:P407 _:genid1 .
from: https://www.wikidata.org/wiki/Special:EntityData/Q233.ttl
(does not seem to be duplicated currently)
As statement values:
s:Q17619314-5cd290f5-4659-e699-74b9-52714a955c62 a wikibase:Statement, wikibase:BestRank ;
    wikibase:rank wikibase:NormalRank ;
    ps:P268 _:genid4 ;
    pq:P813 "2016-03-14T00:00:00Z"^^xsd:dateTime ;
    pqv:P813 v:bcddb148b45928cdcf857b69eeb88df9 .
from: https://www.wikidata.org/wiki/Special:EntityData/Q17619314.ttl
(does not seem to be duplicated currently)
However, the ps: and wdt: triples for the same value do not seem to point to the same blank node (T239397).
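A check for the ps:/wdt: mismatch could look like the sketch below. It assumes the triples of a single document have already been parsed into tuples (labels are only comparable within one document). The helper names are hypothetical, and the wdt: triple and its `_:genid3` label are invented for the example.

```python
from collections import defaultdict

def blank_objects_by_property(triples):
    """Map property ID -> blank node labels seen as wdt: and as ps: objects."""
    found = defaultdict(lambda: {"wdt": set(), "ps": set()})
    for _subject, predicate, obj in triples:
        prefix, _, prop = predicate.partition(":")
        if prefix in ("wdt", "ps") and obj.startswith("_:"):
            found[prop][prefix].add(obj)
    return dict(found)

def unshared(triples):
    """Property IDs whose wdt: and ps: blank objects have no label in common."""
    return sorted(
        prop
        for prop, sides in blank_objects_by_property(triples).items()
        if sides["wdt"] and sides["ps"] and not (sides["wdt"] & sides["ps"])
    )

# Shaped like the Q17619314 output; the wdt: triple and its label are made up.
triples = [
    ("wd:Q17619314", "wdt:P268", "_:genid3"),
    ("s:Q17619314-5cd290f5-4659-e699-74b9-52714a955c62", "ps:P268", "_:genid4"),
    ("s:Q17619314-5cd290f5-4659-e699-74b9-52714a955c62", "pq:P813",
     '"2016-03-14T00:00:00Z"^^xsd:dateTime'),
]
print(unshared(triples))  # ['P268']
```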
Some use cases that rely on such blank nodes
Blank nodes used to express OWL constraints on properties
wdno:P3418 a owl:Class ;
    owl:complementOf _:genid1 .
_:genid1 a owl:Restriction ;
    owl:onProperty wdt:P3418 ;
    owl:someValuesFrom owl:Thing .
from https://www.wikidata.org/wiki/Special:EntityData/P3418.ttl
These are not properly synced and are duplicated (T231515).
But again, are they really useful in the triple store? These constraints seem to always be the same, and since we use neither an inference engine nor constraint checks, do we really need to sync them?
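If the constraints really are always the same, they can be derived from the property ID alone. The sketch below makes that concrete by rebuilding the P3418 triples from a template. It assumes every property follows the shape of that one example, which has not been verified here, and `no_value_class` is a hypothetical helper.

```python
def no_value_class(property_id, bnode="_:genid1"):
    """Rebuild the wdno: class definition for a property from its ID.

    Assumption: all properties use the same shape as the P3418 example.
    The blank node label is arbitrary and only valid inside this snippet.
    """
    return (
        f"wdno:{property_id} a owl:Class ; owl:complementOf {bnode} . "
        f"{bnode} a owl:Restriction ; owl:onProperty wdt:{property_id} ; "
        f"owl:someValuesFrom owl:Thing ."
    )

print(no_value_class("P3418"))
```

Since the only input is the property ID, skipping these triples during sync would not lose information that cannot be regenerated, provided the assumption holds.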
Find others
We should investigate other uses of blank nodes by extracting all of them from the triple store using this query:
SELECT ?p (COUNT(*) AS ?cnt) { ?s ?p ?o . FILTER (isBlank(?o)) } GROUP BY ?p
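The query result can then be ranked by predicate. The sketch below builds the request URL and parses a response in the standard SPARQL JSON results format. The endpoint URL is an assumption (a scan of the whole store may be too expensive for a public endpoint), and the sample response and its counts are made up.

```python
import json
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"  # assumption: adjust to the store under test
QUERY = "SELECT ?p (COUNT(*) AS ?cnt) { ?s ?p ?o . FILTER (isBlank(?o)) } GROUP BY ?p"

def request_url(endpoint=ENDPOINT, query=QUERY):
    """URL for a GET request asking for JSON results."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})

def predicate_counts(results_json):
    """Return (predicate, count) pairs, most frequent predicate first."""
    rows = json.loads(results_json)["results"]["bindings"]
    counts = [(row["p"]["value"], int(row["cnt"]["value"])) for row in rows]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)

# Made-up response, only to show the expected structure.
sample = json.dumps({
    "head": {"vars": ["p", "cnt"]},
    "results": {"bindings": [
        {"p": {"type": "uri", "value": "http://www.wikidata.org/prop/qualifier/P407"},
         "cnt": {"type": "literal", "value": "3"}},
        {"p": {"type": "uri", "value": "http://www.w3.org/2002/07/owl#complementOf"},
         "cnt": {"type": "literal", "value": "12"}},
    ]},
})

for predicate, count in predicate_counts(sample):
    print(count, predicate)
```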