Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Page MenuHomePhabricator

Commons database is growing way too fast
Open, Needs TriagePublic

Description

s4 is now the largest section by a wide margin:

s8 (noch eine Kopie).png (810×1 px, 157 KB)

It has basically doubled since 2019 (while many other sections got smaller with the optimizations in place). Many optimizations we have done improved the storage (e.g. the image metadata storage change in late 2021) but with this rate of growth no optimization can prevent major issues in a year or two.

It's just three tables that are growing really fast: categorylinks, templatelinks, externallinks (in total, they are responsible for 800GB). The rest don't seem to be too problematic:

table (Kopie).png (1×2 px, 452 KB)

By a quick look, I think we can do some easy fixes and it should drastically reduce the database growth:

For externallinks:

  • Use interwiki links/pagelinks instead of raw https links.

For templatelinks (Most used templates):

  • Merge some templates that are both heavily used and only used by the same users
  • Use redirect target in heavily used templates. e.g. Template:Location_dec is used 5 million times, using the redirect target removes 5m rows from templatelinks (I can give the list of heavily used redirect templates)
  • Migrate some functionalities to software or parser functions, etc. to avoid having a dedicated template used in basically every page (e.g. https://commons.wikimedia.org/wiki/Template:Dir)

For categorylinks (Most used categories):

Templates and categories added by https://commons.wikimedia.org/wiki/Module:SDC_tracking might be contributing to the issue.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

I have never run into any pages in MediaWiki namespace managed by Translate extension, are there any pages using this feature I can look at?

Commons previously have pages in MediaWiki namespace marked for translation (such as https://commons.wikimedia.org/wiki/MediaWiki:Linkshere, which was able to be translated by non-admins), but no longer currently. See also T214741.
Wikidata currently has such page.

On finding templates that could be merged together ...

Landsgroup, Thank you for your table. I was just about to ask about something similar but was still thinking about what is needed. It just happen that I wrote or maintain most templates and modules in the first table and most in the second.

Looking at the first table, I think most overlaps are accidental, since every file uploded with Upload wizard has some common templates and modules (see for example today's upload):

  • Template:Information -> Module:Information + Module:Information/styles.css - default infobox
  • Template:Infobox template tag - do-nothing template used to find files with no infobox (like files missing Template:Information or with broken infobox). Such do-nothing tag templates were introduced in 2011 to replace maintenance categories used previously for similar tracking, as less visible and to reduce size of database table tracking categories. Perhaps some other way of tagging and searching for files missing this or that tag can be devised.
  • Template:Dir - used by most Commons templates to figure out language direction of the user
  • Module:DateI18n - internationalization of the date string
  • Module:ISOdate - parsing of the date string (usually YYYY-MM-DD but also few other supported formats)
  • Template:SDC_statement_exist, Template:SDC_statement_has_value -> Module:SDC_tracking - adds maintenance categories to find files missing SDC statements
  • Module:Core - collection of stripped down versions of most used lua functions used by Module:Information and many other Lua modules
  • Template:License template tag - do-nothing template used by bots to find files missing license. Mostly replicated by Extension:CommonsMetadata but still prefered as it gives more reliable results.
  • Template:En -> Template:Description -> Module:Description - mark language of a text

Some pairs definitely have child-parent or sibling relationship:

  • Template:En -> Template:Description -> Module:Description chain can be reduced to Template:En -> Module:Description, for a few high use templates.
  • Template:License template tag retire and use Extension:CommonsMetadata - started discussion about it. The difference between files found using the missing tag template vs. extension might be due to badly formatted license templates that need to be fixed.
  • Module:Information/styles.css is not a lua code but css file used by Extension:TemplateStyles - I am not sure why is it in the transclusion table
  • Module:DateI18n and Module:ISOdate - those 2 mostly go together as ISOdate calls DateI18n. Originally when I wrote them, they were in a single module but I split them in 2014, to allow better unit testing and reduce number of affected files that need to be updated each time I change DateI18n which is also used by other templates/modules. This code is used in few hundreds other projects.
  • Above 2 and Module:Wikidata_date, Module:Calendar, Module:Complex_date - are all codes related to dates. They call each other but they are quite separate.
  • Template:Infobox_template_tag and Module:Information - the tag could be removed from Module:Information and searches for files with missing/broken infobox can be done by a search for files without the tag or Module:Information
  • Template:Self and Template:Self/is-pd-expired - Template:Self is one of the high use templates not written in Lua yet and it calls many other templates. Template:Self/is-pd-expired for example is called 5 times. It can be optimized by a rewrite which will add Module:Self dependency.
  • use of Module:city by other modules is and anachronism and can be probably eliminated. I will look into it.

There are some inefficiencies with most modules being called from wikitext through template shell that does not do anything other than provide convenient place for documentation. Those are minimized when lua codes call each other directly. Also many Commons modules keep translations in a separate "module:I18n/..." to clearly separate the code from the translations. Some of those translations can be moved to translatewiki if there was some easy way to do it.
Most high-use lua codes try to be small and modular with minimum number of dependencies. Some submodules can be moved from always-load to load-as-needed schema, but most are always needed. For years most proposals to make Module:Information or other high-use modules more complicated were rejected based on efficiency arguments, to avoid loading bunch of rarely used code for each file, in order to add some feature or check affecting small percentage of files. We were not thinking about database efficiency, but how much code is loaded for each file, but those 2 optimization criteria go together. Some modules like Message_box, YesNo, etc. are transplants from other wikis maintained by different groups of people and unlikely to be combined. Many of the above modules are transplanted to great many other wikis and changes to Commons might affect them too.

Some ways to minimize amount of templates/modules loaded:

e.g. One case I see is that merging https://commons.wikimedia.org/wiki/Template:Self/is-pd-expired into https://commons.wikimedia.org/wiki/Template:Self can remove 36M rows from templatelinks.

This would, again, hurt readability a lot. However, merging https://commons.wikimedia.org/wiki/Template:Self/is-pd-expired, https://commons.wikimedia.org/wiki/Template:License_migration_is_redundant and https://commons.wikimedia.org/wiki/Template:License_migration_is_redundant_multiple in one Lua module could save some 8.5M links (assuming all pages using {{License migration is redundant}} also use {{License migration is redundant multiple}} and all pages using {{License migration is redundant multiple}} also use {{Self/is-pd-expired}}) while not hurting readability if not even improving it.

Another way to reduce size of transclude table used in the past was to run a bot job of substituting template redirects. Does anybody have a good SQL query to see if we have any templates with often used redirects? I run many such bot jobs in the past but not in the last decade or so.

Another way to reduce size of transclude table used in the past was to run a bot job of substituting template redirects. Does anybody have a good SQL query to see if we have any templates with often used redirects? I run many such bot jobs in the past but not in the last decade or so.

https://quarry.wmcloud.org/query/12286 First field is the redirect, second field is redirect target, third field is transclusions of redirect.

Another way to reduce size of transclude table used in the past was to run a bot job of substituting template redirects. Does anybody have a good SQL query to see if we have any templates with often used redirects? I run many such bot jobs in the past but not in the last decade or so.

https://quarry.wmcloud.org/query/12286 First field is the redirect, second field is redirect target, third field is transclusions of redirect.

Thanks, so I see that we can eliminate 41M transclusions by replacing all template redirects or 31M by replacing redirects of top 50 templates. Most used redirecting template is Template:Location_dec, which I merged with Template:Location a decade ago. I guess I never finished the cleanup afterwards.

Maybe someone knows how to set up a bot scheduled to run with some frequency finding and replacing names of substituted templates. My approach of AWB task with string replace is not the most efficient way to do it.

I eliminated the use of Template:Description in language templates, like Template:En in favor of calling Module:Description directly. It actually makes much more readable wikicode as Module:Description uses named variables as opposed to numbered variables. That should (eventually) eliminate 66.6M templatelinks entries.

I also started replacing some template redirects, but those savings in DB size come at the cost of millions of file edits. I am using pywikibot on https://hub-paws.wmcloud.org and at current rate of editing, with throttle set to 0, I might be done in 9 months...

Thanks. I don't recommend doing 9m edits to remove 9m rows from templatelinks, it just adds the same number but larger rows to another table. Maybe if it could be batched with something else, sure but not on its own.

But generally, thank you so much for cleaning these up and improve the database stability!

I actually missed this:

That’s a good question. As far as I can tell from looking at a random file on wikidata.org and on an external wiki with InstantCommons, both the TemplateStyles and Commons’ MediaWiki:Filepage.css are loaded on other wikis (Filepage code here), whereas its common.css isn’t loaded. So if we want to keep the behavior consistent, I guess MediaWiki:Filepage.css makes more sense. WDYT @Ladsgroup?

Than yes, please move it to Filepage.css

I specced out this proposal in some more detail here: https://commons.wikimedia.org/wiki/Module_talk:Information#Proposal:_Move_template_styles_to_site-wide_CSS

I just made this change to Module:Information, so most of the 87 million templatelinks to Module:Information/styles.css‏‎ should start to disappear now.

Thanks. I don't recommend doing 9m edits to remove 9m rows from templatelinks, it just adds the same number but larger rows to another table.

Good point. I was trying to create SQL query to find template redirects transcluded in other templates but failed so far. There a single edit will fix great many pages. Can anybody with better SQL skills help?

I also wrote Module:Self which implements 7 templates called from Template:Self including already discussed Template:Self/is-pd-expired. Template:Self is used on 36M pages.

Now looking at Module:Autotranslate called from Template:Autotranslate, which in turn is only called from other templates. I think we could call Module:Autotranslate directly from the other templates, without Template:Autotranslate for high volume templates. This would have a nice side-effect of avoiding issues described at T196464: Incorrect template loop detection

Thanks. I don't recommend doing 9m edits to remove 9m rows from templatelinks, it just adds the same number but larger rows to another table.

Good point. I was trying to create SQL query to find template redirects transcluded in other templates but failed so far. There a single edit will fix great many pages. Can anybody with better SQL skills help?

I'll give it a try.

I also wrote Module:Self which implements 7 templates called from Template:Self including already discussed Template:Self/is-pd-expired. Template:Self is used on 36M pages.

Thank you so much! it looks super nice. Once the updates trickle in, I'll do an optimize table of templatelinks to reclaim some space.

Now looking at Module:Autotranslate called from Template:Autotranslate, which in turn is only called from other templates. I think we could call Module:Autotranslate directly from the other templates, without Template:Autotranslate for high volume templates. This would have a nice side-effect of avoiding issues described at T196464: Incorrect template loop detection

Let me know how it goes.

Change 997425 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] Schema: Drop iwl_title from one of indexes

https://gerrit.wikimedia.org/r/997425

Now looking at Module:Autotranslate called from Template:Autotranslate, which in turn is only called from other templates. I think we could call Module:Autotranslate directly from the other templates, without Template:Autotranslate for high volume templates. This would have a nice side-effect of avoiding issues described at T196464: Incorrect template loop detection

Let me know how it goes.

I think I replaced most calls from templates to Template:Autotranslate, (also Template:Autotranslate/clone 1 and few more clones needed due to T196464 issue) with calls directly to Module:Autotranslate. In a month or two when the DB catches up with the change we will see if I missed any.

one way you can reduce future usage of redirects is make TemplateWizard and visualeditor insert the target whenever a redirect is selected.

Change 997425 merged by jenkins-bot:

[mediawiki/core@master] Schema: Drop iwl_prefix_from_title from iwlinks

https://gerrit.wikimedia.org/r/997425

Note: once T61245: Review the PageNotice extension for deployment is resolved, we can include some common wikitext (though rendered differently in different pages) in all file pages, whose content will not be treated as template link, so it will not be updated when the common message is changed.

Potentially we can automatically include an license template, or even infobox, but such will be confusing when there are still manual infobox and license template in wikitext. One solution is hide the automatic template (display:none) by default, and show them via TemplateStyles once the manual ones are removed; I am not sure whether it works.

After some discussion we retired Template:License_template_tag which was used to track files without licenses and was redundant to Extension:CommonsMetadata which was introduced latter. That should eventually remove 100M template links.

Unfortunately I had to revert the removal of Template:License_template_tag. It appears that the UploadWizard extension tests for the presence of {{License template tag}} when a custom license is specified. See mw.UploadWizardLicenseInput.js and CommonSettings.php.

After some discussion we retired Template:License_template_tag which was used to track files without licenses and was redundant to Extension:CommonsMetadata which was introduced latter. That should eventually remove 100M template links.

That template is used for many different things that you probably just broke. I see uploadwizard and no license bot already. Please don't do this in such an uncontrolled way. Please check what is using it and switch it before trying again.

After some discussion we retired Template:License_template_tag which was used to track files without licenses and was redundant to Extension:CommonsMetadata which was introduced latter. That should eventually remove 100M template links.

That template is used for many different things that you probably just broke. I see uploadwizard and no license bot already. Please don't do this in such an uncontrolled way. Please check what is using it and switch it before trying again.

We did identify some tools which were using it in the past, but it seems like no all of them. It was unexpected that Commons:UploadWizard was relying on it, especially since there are other ways of tracking presence of license template.

Change 991921 merged by jenkins-bot:

[mediawiki/extensions/WikimediaMessages@master] Use interwiki to link to Creative Commons

https://gerrit.wikimedia.org/r/991921

Change 991921 merged by jenkins-bot:

[mediawiki/extensions/WikimediaMessages@master] Use interwiki to link to Creative Commons

https://gerrit.wikimedia.org/r/991921

Once this rolls out with the train next week, the number of external links to https://creativecommons.org (currently 146759278, or ~146.8 million) should start to go down gradually, as pages are re-parsed for various reasons and use the new version of the message with an interwiki link instead of an external link. (But we might have to actively mass-purge pages to get all of them to switch to interwiki links.)

I have a question for MediaWiki developers.

Currently Module:Autotranslate, with 72M uses, is used to transclude the best language subtemplate based on user language. It uses Lua's mw.language.getFallbacksFor function to get fallback language codes and mw.title.exists to check which language subpages exist and which do not. Lua function mw.title.exists is considered an expensive function.

User @Tacsipacsi proposed a rewrite which replaces calls to mw.title.exists with calls directly to frame:expandTemplate with pcall function wrapper used to catch errors, like non-existing page. The rewrite seems to work exactly like the current version, but is not using any "expensive" functions and no rows are created in the pagelinks table. So the question is, is the rewrite a good idea? We can write new Lua function mimicking mw.title.exists using frame:expandTemplate with pcall wrapper, which according to current metrics would be less expensive and have smaller BD footprint. However I find it hard to believe that expanding a template is less expensive than checking if it exists, and I am weary that maybe the issue is with the metrics we currently use for tracking expensive calls.

Once this rolls out with the train next week, the number of external links to https://creativecommons.org (currently 146759278, or ~146.8 million) should start to go down gradually, as pages are re-parsed for various reasons and use the new version of the message with an interwiki link instead of an external link. (But we might have to actively mass-purge pages to get all of them to switch to interwiki links.)

I’ve started cleaning up Module:Caller title usage (now that T224810 is ready), which causes a large number of reparses. To help cleaning up the externallinks table, I halt this cleanup on Commons until the train arrives, so that those reparses also remove the external links as a side effect.

However I find it hard to believe that expanding a template is less expensive than checking if it exists, and I am weary that maybe the issue is with the metrics we currently use for tracking expensive calls.

Expanding a template is not less expensive than checking if it exists; it’s less expensive than checking if it exists and expanding it. In this particular module, there are two cases:

  • The given language subpage exists. In this case, currently we manually check if it exists, then expand the template, which consists of a check whether the template exists (MediaWiki cannot know that we’ve already done that check) and actually expanding the template. With my proposed code, the manual check is skipped, which is obviously a net win.
  • The given language doesn’t exist. In this case, currently we manually check if it exists, and finding that it doesn’t, we don’t go further. With my proposed code, the manual check is skipped, so it’s the MediaWiki transclusion code that does the existence check and finds that it fails, reporting the error. It’s the same amount of existence checks, so it shouldn’t have worse performance.

Rewrite of Module:Autotranslate discussed went live last week.

Ebrahim subscribed.

About Template:Dir used on 112M pages, likely dozen times on each page. That template is imbedded in almost every non-lua-based template that shows some text. Lua based templates (like Template:Information and other infoboxes) use mw.language:getDir function. We do have 181k templates on Commons and good fraction of them show something, so it would be a big job to replace that template. The easiest solution would be to add it to "software or parser functions" which would use the current format, the way {{!}} template was replaced by a parser function. Whatever way it is implemented it would be good to keep it in synch with Lua function. Shall we create a ticket for that?

Following your fantastic idea, T365189 and https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1032542

This comment was removed by Bugreporter.

An update: We optimized templatelinks in Commons and now it's 184GB which is roughly 100GB smaller than what it used to be \o/ It's still too large but much better than ~290GB it used to be.

A massive thank you to anyone who has been cleaning this up.

https://commons.wikimedia.org/w/index.php?search=deepcategory:"Files_with_no_machine-readable_author"+dpla

i noticed that out of the 7 million in Category:Files with no machine-readable author, 3.8 are DPLA uploads.

but they had author info in their source repos.

if the uploader could have done their job right...

With implementation of T359761 / T366623 and replacement of {{dir}} and {{bcp47}} templates with brand new parser functions, we should loose some more templatelinks (124M +5M to be precise). I also replaced many calls to {{Autotranslate}} templates in template namespace with the Lua calls directly to c:Module:Autotranslate which might reduce it even further.

This is a good practice I believe, we should see other most used templates in Commons https://commons.wikimedia.org/wiki/Special:MostTranscludedPages and replace them with more efficient solutions. Maybe we can even have a dedicated extension for Commons, if Commons database is that huge it shows how important it is and how more we should make the software more efficient for the sake of it, benefiting all the wikis at the end.

With implementation of T359761 / T366623 and replacement of {{dir}} and {{bcp47}} templates with brand new parser functions, we should loose some more templatelinks (124M +5M to be precise). I also replaced many calls to {{Autotranslate}} templates in template namespace with the Lua calls directly to c:Module:Autotranslate which might reduce it even further.

We currently have multiple ways of translating templates. Maybe have a look if we can switch more to the translation extension to replace autotranslate and int:lang? Not sure if we already have a task for that?

Maybe have a look if we can switch more to the translation extension to replace autotranslate and int:lang? Not sure if we already have a task for that?

We do have Commons:Translators_noticeboard, but not much of a push to move translations from old format to translation extension. I tried to convert a single template: Template:Sisterprojects/i18n since someone had edit request for it. After a over a week of effort I managed to copy 107 out of ~150 languages, before I moved to other stuff (I might come back and finish it). With 10's of thousands of templates translating things it might not be sustainable to switch to translation extension without some better tools for copying.

We discovered another major pain in the neck related to translation extension: when replacing {{dir}} and {{bcp47}} templates with new parser functions we changed non-translated parts of few thousand templates managed by translation extension, like this. None of those changes propagated to the language subtemplates without manual multi-step process of approving each affected template for re-translation. Wider use of translation extension would make changes like this even harder. I wonder if there are bots to help with basic translation extension tasks like mark for translation some list of pages.

Also a funny fact: pages on Commons that use translation extension use {{Autotranslate}} template for displaying the correct subtemplate. See for example Template:Upload_campaign_thanks_ADM20. So translation extension is only used for managing language subtemplates, but not choosing the right one.

We currently have multiple ways of translating templates. Maybe have a look if we can switch more to the translation extension to replace autotranslate and int:lang? Not sure if we already have a task for that?

My understanding is that the Translate extension folks (and/or other mediawiki developers broadly) don't want to support the Commons usecase of automatically translating templates into the interface language (which is what autotranslate does, even for templates where the actual translation is handled by the Translate extension).

This functionality is heavily used in user warning templates, where it could in theory be replaced with a link to a translated page elsewhere, but this would decrease the effectiveness of the warnings for non-English speakers. It is also vital on file description pages to translate license information into the interface language. In theory these templates could be replaced with an interface page that uses structured data to highlight important source, license, authorship, and attribution information for re-users, but that will require significant efforts on improving the structured data data model, upload and editing tool support, and backfilling for existing files.

Change 991921 merged by jenkins-bot:

[mediawiki/extensions/WikimediaMessages@master] Use interwiki to link to Creative Commons

https://gerrit.wikimedia.org/r/991921

Once this rolls out with the train next week, the number of external links to https://creativecommons.org (currently 146759278, or ~146.8 million) should start to go down gradually, as pages are re-parsed for various reasons and use the new version of the message with an interwiki link instead of an external link. (But we might have to actively mass-purge pages to get all of them to switch to interwiki links.)

Regarding this, there will still be 53,506,154 links to https://creativecommons.org/share-your-work/licensing-considerations/compatible-licenses/ coming from MediaWiki:Wm-license-cc-conditions-share_alike-text & friends. These cannot be replaced by the current :creativecommons: interwiki, as its target url is //creativecommons.org/licenses/$1.

I assume that changing that target URL is not feasible for compatibility reasons. Would it be desirable to add, like, a :ccorg: interwiki that targets //creativecommons.org/$1?

Thanks for the replies related to translating. Dead end for this task.

My understanding is that the Translate extension folks (and/or other mediawiki developers broadly) don't want to support the Commons usecase of automatically translating templates into the interface language (which is what autotranslate does, even for templates where the actual translation is handled by the Translate extension).

I would like to support this feature. There are similar cases in translatewiki that prevents from fully using translatable pages. Comments are welcome in T313748: Allow translatable templates to be shown in the user interface language.

I assume that changing that target URL is not feasible for compatibility reasons. Would it be desirable to add, like, a :ccorg: interwiki that targets //creativecommons.org/$1?

I think so, it's probably better than trying hacks like [[creativecommons:../foo]]. Proposed at https://meta.wikimedia.org/wiki/Talk:Interwiki_map#ccorg

Change #1082873 had a related patch set uploaded (by Legoktm; author: Legoktm):

[mediawiki/extensions/WikimediaMessages@master] Use ccorg interwiki for CC license text links

https://gerrit.wikimedia.org/r/1082873

Change #1082874 had a related patch set uploaded (by Legoktm; author: Legoktm):

[mediawiki/extensions/WikimediaMessages@master] Add new wm-license-cc-pd-mark-wikilink message

https://gerrit.wikimedia.org/r/1082874

The new ccorg interwiki has now been added, so I've submitted two patches:

Change #1082873 merged by jenkins-bot:

[mediawiki/extensions/WikimediaMessages@master] Use ccorg interwiki for CC license text links

https://gerrit.wikimedia.org/r/1082873

Change #1082874 merged by jenkins-bot:

[mediawiki/extensions/WikimediaMessages@master] Add new wm-license-cc-pd-mark-wikilink message

https://gerrit.wikimedia.org/r/1082874

Change #1084925 had a related patch set uploaded (by Legoktm; author: Legoktm):

[mediawiki/extensions/WikimediaMessages@master] Remove now-unused wm-license-cc-pd-mark-link message

https://gerrit.wikimedia.org/r/1084925

Change #1084925 merged by jenkins-bot:

[mediawiki/extensions/WikimediaMessages@master] Remove now-unused wm-license-cc-pd-mark-link message

https://gerrit.wikimedia.org/r/1084925