
Create RDF dump of structured data on Commons
Closed, Resolved (Public)

Description

We should set up a regular dump of all structured data entities on Commons, akin to dumps of Wikidata entities we have now.

Related Objects

Status      Subtype    Assigned
Declined    -          dchen
Open        -          None
Open        -          None
Duplicate   -          None
Open        Feature    None
Open        Feature    None
Duplicate   -          None
Resolved    -          None
Resolved    -          None
Resolved    -          None
Duplicate   -          None
Resolved    -          ArielGlenn
Resolved    -          Smalyshev
Resolved    -          Smalyshev
Resolved    -          Smalyshev
Resolved    -          Smalyshev
Resolved    -          Gehel
Resolved    -          Smalyshev
Resolved    -          None
Resolved    -          Cparle
Resolved    -          Cparle
Resolved    -          ArielGlenn
Resolved    -          Zbyszko

Event Timeline

There are a very large number of changes, so older changes are hidden.

The refactor patchset now checks out with all the wikidata dumps, including JSON. I'd like to deploy it this weekend, giving plenty of time to make sure it's ok, test the structured data patchset, and then be able to deploy that separately.

I sincerely apologize: this weekend the heat baked my brain and I did nothing related to computers at all. And Friday evening I was out. I'll set a notification to remind me this coming Friday earlier in the day, so that this gets done.

Change 517670 merged by ArielGlenn:
[operations/puppet@production] refactor wikidata entity dumps into wikibase + wikidata specific bits

https://gerrit.wikimedia.org/r/517670

I think T222497 should be resolved before this goes live. I can test it in deployment-prep before then, but I don't want to do production tests until there is some sort of batching.

I checked on this week's run (which is with the refactor patch deployed) and didn't see anything amiss so far.

I've added a fix for one of the issues in T222497 already, but it doesn't fix everything. I think it would still be interesting to test what happens in production: maybe not a full dump but just a partial one, to estimate what we're dealing with and how bad it is. Since we don't yet have very many mediainfo records and they're small, we might still be fine.

Also, the patch itself doesn't actually turn the cron on, it just puts the files there. We'd need to flip the "files only" switch to actually produce the working cron.

I'm not thinking about the amount of time it takes, but rather the load on the database servers. Reasonably sized batched queries will be better, as I've seen already with stub dumps and slot retrieval.

I tried to manually dump the mediainfo entries over the weekend; it took 376 minutes for 4 shards (a lot, but less than I expected) and produced 1724656 items. It does not seem to put significant load on the DB so far, but it gives about 20 items/second, which seems too slow. If we ever get to all files having items, that would take about 4 days to process over 8 shards, probably more since DB access will get slower; right now the queries are not too slow because only about 2% of files have items, so there aren't too many DB queries.
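As a rough sanity check of those figures, the arithmetic works out as below; the roughly 60 million total files on Commons is my own assumption for the sake of the estimate, not a number from this task:

awk 'BEGIN {
  rate_per_shard = 1724656 / (376 * 60) / 4       # ~19 items/s per shard
  total_rate     = rate_per_shard * 8             # scaled up to 8 shards
  days           = 60000000 / total_rate / 86400  # full run, assuming ~60M files
  printf "%.0f items/s per shard, about %.1f days over 8 shards\n", rate_per_shard, days
}'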

@ArielGlenn What is this blocked or stalled on? Looks like several of the subtasks have been closed, but not all.

@Abit: I need to get my last question on T241149 answered; if these errors only go to stderr then I can at least run a test dump, but if they go to logstash that's 50 million log entries as the task description says, which would be pretty unacceptable. @Cparle has said he could have a look at that in particular, but really anyone who knows that code can have a look.

Because I've gotten a nice run in beta with the --ignore-missing flag, I'm trying a test run on snapshot1008 in a screen session:

php /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki commonswiki --batch-size 50000 --format nt --flavor full-dump --entity-type mediainfo --no-cache --dbgroupdefault dump  --ignore-missing 2>>/var/lib/dumpsgen/mediainfo-log.txt | gzip > /mnt/dumpsdata/temp/dumpsgen/mediainfo-dumps-test-nt-noshard.gz

If the output looks good, I'll put it somewhere for WDQS testing and move forward with making these weekly runs with the appropriate number of parallel processes.

A batch size of 50k turned out to be too large. Same with 5k. I'm now running with a batch size of 500, which will surely be too small, but at least I am getting output. I'll check on it tomorrow and see how it's doing.

This morning the job was terminated by the oom killer:

[4288057.417443] Out of memory: Kill process 117265 (php) score 868 or sacrifice child
[4288057.425084] Killed process 117265 (php) total-vm:58241128kB, anon-rss:56901636kB, file-rss:0kB, shmem-rss:0kB

It produced a file of size 380M with 2224612 entities in it before being shot. One of the last entries in it is the page File:Gerrardina_foliosa_1.jpg with page id 78846520 and a mediainfo entity (Depicts) added on Jan 10th, 2020. Also, the gz output file is not truncated, so perhaps it is complete. But the max page id currently is 85865111, so everything created after the above page (after May 2019) is likely missing. @Abit Should I move the output file somewhere for folks to test with, or would it not be helpful?
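A couple of quick checks along these lines can confirm whether the gzip stream is intact and roughly how many entities made it into the file (grepping for the M-id entity URIs is my assumption about what to count, not something the dump script reports):

gzip -t /mnt/dumpsdata/temp/dumpsgen/mediainfo-dumps-test-nt-noshard.gz && echo "gzip stream OK"

zcat /mnt/dumpsdata/temp/dumpsgen/mediainfo-dumps-test-nt-noshard.gz | grep -oE 'entity/M[0-9]+' | sort -u | wc -l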

Note to self that a run of

php /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki commonswiki --batch-size 250 --format nt --flavor full-dump --entity-type mediainfo --no-cache --dbgroupdefault dump  --ignore-missing --first-page-id 1 --last-page-id 200001 --shard 0 --sharding-factor 1  2>/var/lib/dumpsgen/mediainfo-log-small.txt | gzip > /mnt/dumpsdata/temp/dumpsgen/mediainfo-dumps-test-nt-noshard-small.gz

worked fine. Going to run one with a sharding factor of 4 and a batch size 4 times larger, to see how that is.

Ran

php /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki commonswiki --batch-size 1000 --format nt --flavor full-dump --entity-type mediainfo --no-cache --dbgroupdefault dump  --ignore-missing --first-page-id 1 --last-page-id 200001 --shard 1 --sharding-factor 4  2>/var/lib/dumpsgen/mediainfo-log-small-shard.txt | gzip > /mnt/dumpsdata/temp/dumpsgen/mediainfo-dumps-test-nt-one-shard-of-4-small.gz

and it also ran fine.

Ran

php /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki commonswiki --batch-size 500 --format nt --flavor full-dump --entity-type mediainfo --no-cache --dbgroupdefault dump  --ignore-missing --first-page-id 78846320 --last-page-id 79046320 --shard 0 --sharding-factor 1  2>/var/lib/dumpsgen/mediainfo-log-small-shard-oom.txt | gzip > /mnt/dumpsdata/temp/dumpsgen/mediainfo-dumps-test-nt-one-shard-small-oom.gz

which should cover the page range where we had the oom; it ran to completion fine. I guess that there is some small memory leak that must accumulate over batches, which is what did us in earlier. As long as we limit runs to some reasonable number of pages each time, we should be fine.
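A sketch of what limiting each invocation to a fixed page-id window might look like; the 2-million-page chunk size, log name, and output name here are assumptions for illustration, not what the eventual production script does:

MAX_PAGE_ID=85865111   # roughly the current max page id mentioned above
CHUNK=2000000          # assumed window size; tune so no single run grows too large
start=1
while [ "$start" -le "$MAX_PAGE_ID" ]; do
  end=$(( start + CHUNK - 1 ))
  php /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php \
    --wiki commonswiki --batch-size 500 --format nt --flavor full-dump \
    --entity-type mediainfo --no-cache --dbgroupdefault dump --ignore-missing \
    --first-page-id "$start" --last-page-id "$end" \
    2>>/var/lib/dumpsgen/mediainfo-log-chunked.txt | gzip >> /mnt/dumpsdata/temp/dumpsgen/mediainfo-chunked.nt.gz
  start=$(( end + 1 ))
done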

Change 516444 merged by ArielGlenn:
[operations/puppet@production] Set up dumps for mediainfo RDF generation

https://gerrit.wikimedia.org/r/516444

I plan to try running

/usr/local/bin/dumpwikibaserdf.sh commons full nt

on Thursday morning and see how long it takes with the 8 shards that are currently configured. @Abit is the nt format the one needed for WDQS testing?

@Abit is the nt format the one needed for WDQS testing?

I cannot answer this. @EBernhardson, any idea?

@EBernhardson or some other person from #search-team would know for certain.
Quickly looking at https://wikitech.wikimedia.org/wiki/Wikidata_query_service#Data_reload_procedure, commands there mention ttl dumps though. Note that I don't really know what I am talking about, just trying to be useful.

I found a ticket that mentions use of ttl files so I'll run

/usr/local/bin/dumpwikibaserdf.sh commons full ttl

and keep an eye on it. Running on snapshot1008 in a screen session. Here we go!

In https://dumps.wikimedia.org/other/wikibase/commonswiki/ there are two ttl files, gz and bz2 compressed. Please have a look!
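For anyone who wants a quick look, something like this should do; the dated filename is only an example following the naming pattern mentioned later in this task, so check the directory listing for what is actually there:

curl -sO https://dumps.wikimedia.org/other/wikibase/commonswiki/20200116/commons-20200116-full.ttl.gz
zcat commons-20200116-full.ttl.gz | head -n 40   # eyeball the prefixes and first few triples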

The bash script producing them complained that

/usr/local/bin/dumpwikibaserdf.sh: line 224: setDcatConfig: command not found

and I see that in commonsrdf_functions.sh there is a comment

# TODO: add DCAT info

so folks might want to look at that.
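Until the DCAT pieces are filled in, one way to keep the script from failing on the missing function would be a no-op stub; this is only a sketch of the idea, and the actual fix may look different:

# in commonsrdf_functions.sh (sketch only):
setDcatConfig() {
    # TODO: add DCAT info; no-op until commons DCAT support exists
    :
}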

@dcausse is going to check over the ttl dump and let me know if it looks ok; if so then I'll flip the switch for generation weekly and make sure there's cleanup too.

Some unexpected (?) triples are popping up that @dcausse is looking into, so the dumps will not be turned on in cron until we have the thumbs up on that. See T243292.

If it turns out the data is all ok, we can move forward.

@ArielGlenn the structured data team keep coming up as blockers on this ticket in scrum-of-scrums, but we're not blocking you, are we?

If not, then maybe I'll remove the tags relevant to us.

@Cparle, No blocks on your side, the ball is now in @dcausse 's court. :-)

Hi, just checking in: any progress on investigating the 'extra' dumps content?

@ArielGlenn no not yet, this is still blocked on T243292 which requires some investigation to determine which component (dump or the wdqs transformation process) is wrong.

I see that we're no longer blocked. Does this mean that we're good to go for weekly runs?

We've loaded the dump on a test server and it looks fine. We can start scheduling the dumps regularly.

Change 598787 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] enable dumps of structured data from commons

https://gerrit.wikimedia.org/r/598787

Change 598787 merged by ArielGlenn:
[operations/puppet@production] enable dumps of structured data from commons

https://gerrit.wikimedia.org/r/598787

Change 599052 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] fix up cron name of commons structured data dumps

https://gerrit.wikimedia.org/r/599052

Change 599052 merged by ArielGlenn:
[operations/puppet@production] fix up cron name of commons structured data dumps

https://gerrit.wikimedia.org/r/599052

@ArielGlenn we plan to make a subtle change to the dump (prefixes); this won't technically be a breaking change, but it could cause some confusion if users start to assume the presence of some prefixes. Would it be possible to pause the publication of the dumps while we change this? Sorry for the late notice.

Just a note on the current problem:
the prefixes defined in ttl dumps are identical to the ones used by wikidata e.g.:

@prefix wdt: <http://commons.wikimedia.org/prop/direct/> .

This is perfectly valid but might cause some confusion because, when using the commons query service, we will likely change these prefixes so that federation with wdqs is more obvious.
I had a look at the code base but haven't found an easy way to override this.
@Lucas_Werkmeister_WMDE do you know if it'd be possible to change the prefixes for the local repository such that for commons http://commons.wikimedia.org/prop/direct/ would not be wdt?
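For illustration, the distinction being asked about might end up looking something like this; the sdc prefix name is only a guess based on T222995, not a final decision:

# as dumped today: wikidata-style prefix, but pointing at commons
@prefix wdt: <http://commons.wikimedia.org/prop/direct/> .

# one possible commons-specific scheme, keeping wdt free for federation with wikidata
@prefix sdc: <https://commons.wikimedia.org/entity/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .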

@dcausse it is possible to customize the prefix. https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/569260 once merged will enable this. The chain of patches is a bit stalled; I will dust it off and hopefully we'll get this in soon.

Looks like it was decided not to use wikidata-specific prefixes for MediaInfo exports but to use a more specific sdc prefix for these (see: T222995).
The code still looks to be hardcoded with wikidata-specific prefixes.
It does not look to me like we could make this happen quickly.
I created T253798 to track this work. Since it seems that some refactoring will have to happen (initially we thought it might just be a config change), I wonder whether making the dumps available should be blocked by T253798, or whether we should go ahead and make them available with a short notice explaining that the prefixes might change in the future, with a link to that same ticket.
@CBogen I'll leave that decision to you.

Please disregard the message above, this is actually possible with a config change.

@WMDE-leszek oops, sorry, I replied before reading your comment and was looking at an old code base... if this is just a config change it can hopefully be merged soon. Thanks!

Change 599856 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] Revert "enable dumps of structured data from commons"

https://gerrit.wikimedia.org/r/599856

Change 599856 merged by Gehel:
[operations/puppet@production] Revert "enable dumps of structured data from commons"

https://gerrit.wikimedia.org/r/599856

Change 601162 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] sdc dumps: add placeholder entry in dcat setup to avoid syntax errors

https://gerrit.wikimedia.org/r/601162

Change 601162 merged by ArielGlenn:
[operations/puppet@production] sdc dumps: add placeholder entry in dcat setup to avoid syntax errors

https://gerrit.wikimedia.org/r/601162

Change 609114 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] snapshots: enable dumps of structured data from commons

https://gerrit.wikimedia.org/r/c/operations/puppet/+/609114

Change 609114 merged by Gehel:
[operations/puppet@production] snapshots: enable dumps of structured data from commons

https://gerrit.wikimedia.org/r/c/operations/puppet/+/609114

Change 609823 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] add a README about the content of the commons structured data dumps

https://gerrit.wikimedia.org/r/609823

Change 609823 merged by ArielGlenn:
[operations/puppet@production] add a README about the content of the commons structured data dumps

https://gerrit.wikimedia.org/r/609823

Change 610816 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] Add commons structured data dumps to the webpage!

https://gerrit.wikimedia.org/r/610816

Change 610816 merged by ArielGlenn:
[operations/puppet@production] Add commons structured data dumps to the webpage!

https://gerrit.wikimedia.org/r/610816

The dumps are now live! And it has been communicated on wikidata-l.

It's linked off the 'other datasets' page near the top. But here's the direct link: https://dumps.wikimedia.org/other/wikibase/commonswiki/

should latest-full.* be deleted to avoid confusion?

Links latest-full.ttl.bz2 -> 20200116/commons-20200116-full.ttl.bz2 and latest-full.ttl.gz -> 20200116/commons-20200116-full.ttl.gz have been cleaned up. Thanks for the suggestion!