
Create RDF dump of structured data on Commons
Closed, Resolved (Public)

Description

We should set up a regular dump of all structured data entities on Commons, akin to dumps of Wikidata entities we have now.

Related Objects

Status      Subtype    Assigned
Declined    -          dchen
Open        -          None
Open        -          None
Duplicate   -          None
Open        Feature    None
Open        Feature    None
Duplicate   -          None
Resolved    -          None
Resolved    -          None
Resolved    -          None
Duplicate   -          None
Resolved    -          ArielGlenn
Resolved    -          Smalyshev
Resolved    -          Smalyshev
Resolved    -          Smalyshev
Resolved    -          Smalyshev
Resolved    -          Gehel
Resolved    -          Smalyshev
Resolved    -          None
Resolved    -          Cparle
Resolved    -          Cparle
Resolved    -          ArielGlenn
Resolved    -          Zbyszko

Event Timeline

There are a very large number of changes, so older changes are hidden.

The refactor patchset now checks out with all the wikidata dumps, including JSON. I'd like to deploy it this weekend, giving plenty of time to make sure it's ok, test the structured data patchset, and then be able to deploy that separately.

I sincerely apologize: this weekend the heat baked my brain and I did nothing related to computers at all. And Friday evening I was out. I'll set a notification to remind me this coming Friday earlier in the day, so that this gets done.

Change 517670 merged by ArielGlenn:
[operations/puppet@production] refactor wikidata entity dumps into wikibase + wikidata specific bits

https://gerrit.wikimedia.org/r/517670

I think T222497 should be resolved before this goes live. I can test it in deployment-prep before then, but I don't want to do production tests until there is some sort of batching.

I checked on this week's run (which is with the refactor patch deployed) and didn't see anything amiss so far.

I've added a fix for one of the issues in T222497 already, but it doesn't fix everything. I think it would still be interesting to test what happens in production: maybe not a full dump but just a partial one, to estimate what we're dealing with and how bad it is. Since we don't yet have very many mediainfo records and they're small, we might still be fine.

Also, the patch itself doesn't actually turn the cron on, it just puts the files there. We'd need to flip the "files only" switch to actually produce the working cron.

I'm not thinking about the amount of time it takes, but rather the load on the database servers. Reasonably sized batched queries will be better, as I've seen already with stub dumps and slot retrieval.

I tried to manually dump the mediainfo entries over the weekend; it took 376 minutes for 4 shards (a lot, but less than I expected) and produced 1724656 items. It does not seem to put significant load on the DB so far, but it gives about 20 items/second, which seems too slow. If we ever get to all files having items, that would take about 4 days to process over 8 shards, probably more since DB access will get slower; right now the queries are not too slow because only about 2% of files have items, so there aren't too many DB queries.
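As a rough sanity check of those figures, the arithmetic works out as below; the roughly 60 million total files on Commons is my own assumption for the sake of the estimate, not a number from this task:

awk 'BEGIN {
  rate_per_shard = 1724656 / (376 * 60) / 4       # ~19 items/s per shard
  total_rate     = rate_per_shard * 8             # scaled up to 8 shards
  days           = 60000000 / total_rate / 86400  # full run, assuming ~60M files
  printf "%.0f items/s per shard, about %.1f days over 8 shards\n", rate_per_shard, days
}'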

@ArielGlenn What is this blocked or stalled on? Looks like several of the subtasks have been closed, but not all.

@Abit: I need to get my last question on T241149 answered; if these errors only go to stderr then I can at least run a test dump, but if they go to logstash that's 50 million log entries as the task description says, which would be pretty unacceptable. @Cparle has said he could have a look at that in particular, but really anyone who knows that code can have a look.

Because I've gotten a nice run in beta with the --ignore-missing flag, I'm trying a test run on snapshot1008 in a screen session:

php /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki commonswiki --batch-size 50000 --format nt --flavor full-dump --entity-type mediainfo --no-cache --dbgroupdefault dump  --ignore-missing 2>>/var/lib/dumpsgen/mediainfo-log.txt | gzip > /mnt/dumpsdata/temp/dumpsgen/mediainfo-dumps-test-nt-noshard.gz

If the output looks good, I'll put it somewhere for WDQS testing and move forward with making these weekly runs with the appropriate number of parallel processes.

A batch size of 50k turned out to be too large. Same with 5k. I'm now running with a batch size of 500, which will surely be too small, but at least I am getting output. I'll check on it tomorrow and see how it's doing.

This morning the job was terminated by the oom killer:

[4288057.417443] Out of memory: Kill process 117265 (php) score 868 or sacrifice child
[4288057.425084] Killed process 117265 (php) total-vm:58241128kB, anon-rss:56901636kB, file-rss:0kB, shmem-rss:0kB

It produced a file of size 380M with 2224612 entities in it before being shot. One of the last entries in it is the page File:Gerrardina_foliosa_1.jpg with page id 78846520 and a mediainfo entity (Depicts) added on Jan 10th, 2020. Also, the gz output file is not truncated, so perhaps it is complete. But the max page id currently is 85865111, so everything created after the above page (after May 2019) is likely missing. @Abit Should I move the output file somewhere for folks to test with, or would it not be helpful?
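A couple of quick checks along these lines can confirm whether the gzip stream is intact and roughly how many entities made it into the file (grepping for the M-id entity URIs is my assumption about what to count, not something the dump script reports):

gzip -t /mnt/dumpsdata/temp/dumpsgen/mediainfo-dumps-test-nt-noshard.gz && echo "gzip stream OK"

zcat /mnt/dumpsdata/temp/dumpsgen/mediainfo-dumps-test-nt-noshard.gz | grep -oE 'entity/M[0-9]+' | sort -u | wc -l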

Note to self that a run of

php /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki commonswiki --batch-size 250 --format nt --flavor full-dump --entity-type mediainfo --no-cache --dbgroupdefault dump  --ignore-missing --first-page-id 1 --last-page-id 200001 --shard 0 --sharding-factor 1  2>/var/lib/dumpsgen/mediainfo-log-small.txt | gzip > /mnt/dumpsdata/temp/dumpsgen/mediainfo-dumps-test-nt-noshard-small.gz

worked fine. Going to run one with a sharding factor of 4 and a batch size 4 times larger, to see how that is.

Ran

php /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki commonswiki --batch-size 1000 --format nt --flavor full-dump --entity-type mediainfo --no-cache --dbgroupdefault dump  --ignore-missing --first-page-id 1 --last-page-id 200001 --shard 1 --sharding-factor 4  2>/var/lib/dumpsgen/mediainfo-log-small-shard.txt | gzip > /mnt/dumpsdata/temp/dumpsgen/mediainfo-dumps-test-nt-one-shard-of-4-small.gz

and it also ran fine.

Ran

php /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki commonswiki --batch-size 500 --format nt --flavor full-dump --entity-type mediainfo --no-cache --dbgroupdefault dump  --ignore-missing --first-page-id 78846320 --last-page-id 79046320 --shard 0 --sharding-factor 1  2>/var/lib/dumpsgen/mediainfo-log-small-shard-oom.txt | gzip > /mnt/dumpsdata/temp/dumpsgen/mediainfo-dumps-test-nt-one-shard-small-oom.gz

which should cover the page range where we had the oom; it ran to completion fine. I guess that there is some small memory leak that must accumulate over batches, which is what did us in earlier. As long as we limit runs to some reasonable number of pages each time, we should be fine.
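A sketch of what limiting each invocation to a fixed page-id window might look like; the 2-million-page chunk size, log name, and output name here are assumptions for illustration, not what the eventual production script does:

MAX_PAGE_ID=85865111   # roughly the current max page id mentioned above
CHUNK=2000000          # assumed window size; tune so no single run grows too large
start=1
while [ "$start" -le "$MAX_PAGE_ID" ]; do
  end=$(( start + CHUNK - 1 ))
  php /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php \
    --wiki commonswiki --batch-size 500 --format nt --flavor full-dump \
    --entity-type mediainfo --no-cache --dbgroupdefault dump --ignore-missing \
    --first-page-id "$start" --last-page-id "$end" \
    2>>/var/lib/dumpsgen/mediainfo-log-chunked.txt | gzip >> /mnt/dumpsdata/temp/dumpsgen/mediainfo-chunked.nt.gz
  start=$(( end + 1 ))
done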

Change 516444 merged by ArielGlenn:
[operations/puppet@production] Set up dumps for mediainfo RDF generation

https://gerrit.wikimedia.org/r/516444

I plan to try running

/usr/local/bin/dumpwikibaserdf.sh commons full nt

on Thursday morning and see how long it takes with the 8 shards that are currently configured. @Abit is the nt format the one needed for WDQS testing?

@Abit is the nt format the one needed for WDQS testing?

I cannot answer this. @EBernhardson, any idea?

@EBernhardson or some other person from #search-team would know for certain.
Quickly looking at https://wikitech.wikimedia.org/wiki/Wikidata_query_service#Data_reload_procedure, commands there mention ttl dumps though. Note that I don't really know what I am talking about, just trying to be useful.

I found a ticket that mentions use of ttl files so I'll run

/usr/local/bin/dumpwikibaserdf.sh commons full ttl

and keep an eye on it. Running on snapshot1008 in a screen session. Here we go!

In https://dumps.wikimedia.org/other/wikibase/commonswiki/ there are two ttl files, gz and bz2 compressed. Please have a look!
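For anyone who wants a quick look, something like this should do; the dated filename is only an example following the naming pattern mentioned later in this task, so check the directory listing for what is actually there:

curl -sO https://dumps.wikimedia.org/other/wikibase/commonswiki/20200116/commons-20200116-full.ttl.gz
zcat commons-20200116-full.ttl.gz | head -n 40   # eyeball the prefixes and first few triples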

The bash script producing them complained that

/usr/local/bin/dumpwikibaserdf.sh: line 224: setDcatConfig: command not found

and I see that in commonsrdf_functions.sh there is a comment

# TODO: add DCAT info

so folks might want to look at that.
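Until the DCAT pieces are filled in, one way to keep the script from failing on the missing function would be a no-op stub; this is only a sketch of the idea, and the actual fix may look different:

# in commonsrdf_functions.sh (sketch only):
setDcatConfig() {
    # TODO: add DCAT info; no-op until commons DCAT support exists
    :
}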

@dcausse is going to check over the ttl dump and let me know if it looks ok; if so then I'll flip the switch for generation weekly and make sure there's cleanup too.

Some unexpected (?) triples are popping up that @dcausse is looking into, so the dumps will not be turned on in cron until we have the thumbs up on that. See T243292.

If it turns out the data is all ok, we can move forward.

@ArielGlenn the structured data team keep coming up as blockers on this ticket in scrum-of-scrums, but we're not blocking you, are we?

If not, then maybe I'll remove the tags relevant to us.

@Cparle, No blocks on your side, the ball is now in @dcausse 's court. :-)

Hi, just checking in: any progress on investigating the 'extra' dumps content?

@ArielGlenn no not yet, this is still blocked on T243292 which requires some investigation to determine which component (dump or the wdqs transformation process) is wrong.

I see that we're no longer blocked. Does this mean that we're good to go for weekly runs?

We've loaded the dump on a test server and it looks fine. We can start scheduling the dumps regularly.

Change 598787 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] enable dumps of structured data from commons

https://gerrit.wikimedia.org/r/598787

Change 598787 merged by ArielGlenn:
[operations/puppet@production] enable dumps of structured data from commons

https://gerrit.wikimedia.org/r/598787

Change 599052 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] fix up cron name of commons structured data dumps

https://gerrit.wikimedia.org/r/599052

Change 599052 merged by ArielGlenn:
[operations/puppet@production] fix up cron name of commons structured data dumps

https://gerrit.wikimedia.org/r/599052

@ArielGlenn we plan to make a subtle change to the dump (prefixes); this won't technically be a breaking change, but it could cause some confusion if users start to assume the presence of some prefixes. Would it be possible to pause the publication of the dumps while we change this? Sorry for the late notice.

Just a note on the current problem:
the prefixes defined in ttl dumps are identical to the ones used by wikidata e.g.:

@prefix wdt: <http://commons.wikimedia.org/prop/direct/> .

This is perfectly valid but might cause some confusion because, when using the commons query service, we will likely change these prefixes so that federation with wdqs is more obvious.
I had a look at the code base but haven't found an easy way to override this.
@Lucas_Werkmeister_WMDE do you know if it'd be possible to change the prefixes for the local repository such that for commons http://commons.wikimedia.org/prop/direct/ would not be wdt?
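For illustration, the distinction being asked about might end up looking something like this; the sdc prefix name is only a guess based on T222995, not a final decision:

# as dumped today: wikidata-style prefix, but pointing at commons
@prefix wdt: <http://commons.wikimedia.org/prop/direct/> .

# one possible commons-specific scheme, keeping wdt free for federation with wikidata
@prefix sdc: <https://commons.wikimedia.org/entity/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .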

@dcausse it is possible to customize the prefix. https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/569260 once merged will enable this. The chain of patches is a bit stalled; I will dust it off and hopefully we'll get this in soon.

Looks like it was decided not to use wikidata-specific prefixes for MediaInfo exports but to use a more specific sdc prefix for these (see: T222995).
The code still looks to be hardcoded with wikidata-specific prefixes.
It does not look to me like we could make this happen quickly.
I created T253798 to track this work. Since it seems that some refactoring will have to happen (initially we thought it might just be a config change), I wonder whether making the dumps available should be blocked by T253798, or whether we should go ahead and make them available with a short notice explaining that the prefixes might change in the future, with a link to that same ticket.
@CBogen I'll leave that decision to you.

Please disregard the message above, this is actually possible with a config change.

@WMDE-leszek oops, sorry, I replied before reading your comment and was looking at an old code base... if this is just a config change it can hopefully be merged soon. Thanks!

Change 599856 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] Revert "enable dumps of structured data from commons"

https://gerrit.wikimedia.org/r/599856

Change 599856 merged by Gehel:
[operations/puppet@production] Revert "enable dumps of structured data from commons"

https://gerrit.wikimedia.org/r/599856

Change 601162 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] sdc dumps: add placeholder entry in dcat setup to avoid syntax errors

https://gerrit.wikimedia.org/r/601162

Change 601162 merged by ArielGlenn:
[operations/puppet@production] sdc dumps: add placeholder entry in dcat setup to avoid syntax errors

https://gerrit.wikimedia.org/r/601162

Change 609114 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] snapshots: enable dumps of structured data from commons

https://gerrit.wikimedia.org/r/c/operations/puppet/+/609114

Change 609114 merged by Gehel:
[operations/puppet@production] snapshots: enable dumps of structured data from commons

https://gerrit.wikimedia.org/r/c/operations/puppet/+/609114

Change 609823 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] add a README about the content of the commons structured data dumps

https://gerrit.wikimedia.org/r/609823

Change 609823 merged by ArielGlenn:
[operations/puppet@production] add a README about the content of the commons structured data dumps

https://gerrit.wikimedia.org/r/609823

Change 610816 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] Add commons structured data dumps to the webpage!

https://gerrit.wikimedia.org/r/610816

Change 610816 merged by ArielGlenn:
[operations/puppet@production] Add commons structured data dumps to the webpage!

https://gerrit.wikimedia.org/r/610816

The dumps are now live! And it has been communicated on wikidata-l.

It's linked off the 'other datasets' page near the top. But here's the direct link: https://dumps.wikimedia.org/other/wikibase/commonswiki/

should latest-full.* be deleted to avoid confusion?

Links latest-full.ttl.bz2 -> 20200116/commons-20200116-full.ttl.bz2 and latest-full.ttl.gz -> 20200116/commons-20200116-full.ttl.gz have been cleaned up. Thanks for the suggestion!