WDQS Munger should be multi threaded
Closed, Resolved · Public

Assigned To

None

Authored By

	Gehel
	Nov 11 2019, 4:40 PM

Description

During data reload, the munging step takes a significant amount of time. A quick look at resource consumption indicates that munging is CPU bound. It is probably possible to introduce some parallelism to increase throughput.

Related Objects

Mentioned In: T241128: EPIC: Reduce the time needed to do the initial WDQS import

Event Timeline

Gehel created this task. Nov 11 2019, 4:40 PM

Restricted Application added a project: Wikidata. Nov 11 2019, 4:40 PM

Restricted Application added a subscriber: Aklapper.

Per-item data are mostly independent, so different items can easily be processed in parallel. However, that would require splitting the incoming data per item (note that item data do not necessarily have the item URI as subject: there are statements, references, values, sitelinks, etc.).
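
As a rough illustration of the idea above, here is a minimal sketch that assumes the input has already been split into independent per-item batches (which is the hard part). `munge_in_parallel` and `munge_item` are hypothetical names, not part of the WDQS code base.

```python
from concurrent.futures import ThreadPoolExecutor


def munge_in_parallel(item_batches, munge_item, workers=4):
    """Apply munge_item to each per-item batch using a pool of workers.

    Assumes batches are independent of each other; pool.map returns the
    results in input order, so the output order matches the input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(munge_item, item_batches))
```

In CPython, threads will not speed up CPU-bound work because of the global interpreter lock, so this only shows the shape of the approach.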

Separating parsing, munging, and writing into separate threads nearly doubled the speed of the munger (about 1.9x faster by wall-clock time):

old: 
real    1371m34.618s
user    1854m48.672s
sys     24m44.480s

new:
real    731m20.495s
user    1798m42.176s
sys     30m7.888s
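
The separation into parsing, munging, and writing stages could look roughly like the sketch below. This is not the code from the actual change; `run_pipeline`, `parse`, `munge`, and `write` are hypothetical names, and the bounded queues are an assumption about how the stages might hand work to each other.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def run_pipeline(lines, parse, munge, write, maxsize=100):
    """Run parse -> munge -> write as three threads joined by bounded queues.

    Sketch only: there is no error handling, so an exception in one stage
    would leave the other stages waiting forever.
    """
    parsed_q = queue.Queue(maxsize)
    munged_q = queue.Queue(maxsize)

    def parser():
        for line in lines:
            parsed_q.put(parse(line))
        parsed_q.put(_DONE)

    def munger():
        while True:
            item = parsed_q.get()
            if item is _DONE:
                munged_q.put(_DONE)  # propagate end-of-stream downstream
                break
            munged_q.put(munge(item))

    def writer():
        while True:
            item = munged_q.get()
            if item is _DONE:
                break
            write(item)

    threads = [threading.Thread(target=f) for f in (parser, munger, writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With one thread per stage and FIFO queues, output order matches input order. In CPython the interpreter lock prevents a real speedup for CPU-bound stages, so this shows the structure only.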

I should have linked https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/553758 to this task.
Since the RDF parser is the limiting factor, I think we will have to do the entity delimitation without an RDF parser if we want to further improve the speed of this step.
We could also consider switching to the N-Triples (nt) format, which I'm sure will be a lot faster to parse, if the size overhead is acceptable.
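
To illustrate why a line-based format helps: in N-Triples each line holds one triple and the subject is the first whitespace-delimited token, so lines can be grouped with plain string operations. The sketch below is hypothetical (`subject_of` and `group_by_subject` are made-up names) and assumes that triples sharing a subject appear on consecutive lines. Grouping by subject alone is not full entity delimitation, since statements, references, and values have their own subject nodes.

```python
from itertools import groupby


def subject_of(line):
    """Return the subject term: the first whitespace-delimited token."""
    return line.split(None, 1)[0]


def group_by_subject(lines):
    """Yield (subject, [lines]) for runs of consecutive lines sharing a subject.

    Blank lines and '#' comment lines are skipped. No RDF parser is involved.
    """
    triples = (
        line for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    )
    for subject, group in groupby(triples, key=subject_of):
        yield subject, list(group)
```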

dcausse mentioned this in T241128: EPIC: Reduce the time needed to do the initial WDQS import. Dec 19 2019, 10:32 AM

The real solution will come from moving all this processing into Hadoop as part of the new streaming updater.

I'm going to mark this as resolved, as the munge step can now be performed in Hadoop as a standalone step, meaning it is useful to other users and not only as part of the streaming updater.
The main part of this was done in https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/623621, thanks to @dcausse for guiding me through my dream.
I'm currently writing a post about how to make use of this in a context outside of the WMF cluster and will try to link back here when done.
