Enable numerical category sorting on Commons
Open, Stalled, Needs Triage, Public

Description

Event Timeline

Anoop changed the subtype of this task from "Administrative Request" to "Task".
Anoop moved this task from Backlog to Working on on the Wikimedia-Site-requests board.

Change #1037006 had a related patch set uploaded (by Anzx; author: Anzx):

[operations/mediawiki-config@master] commonswiki: Enable numeric wgCategoryCollation

https://gerrit.wikimedia.org/r/1037006

I don't want to touch the patch you made without your approval @Anoop. Is it okay if I deploy your patch?

@Ladsgroup , Yes sure please deploy

Change #1039657 had a related patch set uploaded (by Ebrahim; author: Anzx):

[operations/mediawiki-config@master] commonswiki: Enable numeric wgCategoryCollation

https://gerrit.wikimedia.org/r/1039657

Change #1039657 abandoned by Ebrahim:

[operations/mediawiki-config@master] commonswiki: Enable numeric wgCategoryCollation

Reason:

original patch is rebased, this is no longer needed

https://gerrit.wikimedia.org/r/1039657

Change #1037006 merged by jenkins-bot:

[operations/mediawiki-config@master] commonswiki: Enable numeric wgCategoryCollation

https://gerrit.wikimedia.org/r/1037006

Mentioned in SAL (#wikimedia-operations) [2024-06-06T14:04:15Z] <samtar@deploy1002> Started scap: Backport for [[gerrit:1037006|commonswiki: Enable numeric wgCategoryCollation (T362494)]], [[gerrit:1037505|Add project namespace alias for Azerbaijani Wikisource (T365966)]]

Mentioned in SAL (#wikimedia-operations) [2024-06-06T14:06:44Z] <samtar@deploy1002> samtar and anzx and nmw03: Backport for [[gerrit:1037006|commonswiki: Enable numeric wgCategoryCollation (T362494)]], [[gerrit:1037505|Add project namespace alias for Azerbaijani Wikisource (T365966)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-06-06T14:17:13Z] <samtar@deploy1002> Finished scap: Backport for [[gerrit:1037006|commonswiki: Enable numeric wgCategoryCollation (T362494)]], [[gerrit:1037505|Add project namespace alias for Azerbaijani Wikisource (T365966)]] (duration: 12m 58s)

nb. still need to run updateCollation as

mwscript updateCollation.php --wiki commonswiki

I'll try to do this later or in a later deployment window.

Do not do this. It will break the wikis.

Change #1039767 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/mediawiki-config@master] Add a note that you cannot change wgCategoryCollation easily

https://gerrit.wikimedia.org/r/1039767

Anoop removed Anoop as the assignee of this task. Jun 6 2024, 4:03 PM
Anoop subscribed.

In practice, I think this is effectively Declined as impossible?

What makes it impossible (rather than simply difficult) exactly?

Re-building the DB for Commons would take days if not weeks, during which the site would be at best read-only (and much of the category content would be "corrupted"), AIUI. But I defer to Amir.

I remember https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)/Archive_149#Sorting_in_categories_unreliable_for_a_few_days on enwiki, which didn't involve making the database read-only. But I guess Commons is much bigger than enwiki so it would take even longer.

Sorry about the trouble this has caused. It's great that you are adding a comment to the section so that future changes can be made more carefully. Just to note: if you can design some way to change this in the future, please go for numeric instead of uca-default-u-kn, as it was in the initial versions of the reverted patch (weird that it changed at the last minute, probably even my own fault 😞). Maybe uca-default-u-kn will be desirable at some point, but I think another community discussion is needed for that, as some may not like it (T32996#2342009) and the request here was only to enable numeric sorting. (And a quick question for @Jdforrester-WMF: is a change from uppercase to numeric also this costly?)

As a note on the need for the change: numeric sorting would be really good to have. I was working with someone who is uploading a museum album, and they were even suggesting renaming 3000 files and padding the names with zeros to fix the ordering, since the order of the files mattered in that case. So instead of rejecting this, I suggest deferring it and engineering a solution; maybe one can be found with a different design over the next several years, or by simply waiting for possible hardware upgrades.

I'd be curious to hear more about why it's impossible. I realize that Commons is much bigger in some respects than all other wikis, but that should just mean that the update to recalculate categories will take much longer (maybe weeks instead of days), not that it will be impossible. Someone has to monitor the update script during this time and maybe restart it if the server it's running on gets shut down or something, which is inconvenient, but again not impossible. The wiki should otherwise continue functioning normally while this happens.

It's expected that sorting in categories will be messed up while the script is running, though, so if this is unacceptable to the Commons community, I guess that would make it impossible.

@Ebrahim Interesting you mention that, as a change to numeric would be just as slow, but less disruptive – the sort keys for that collation are the same as the current uppercase collation, except for the numeric parts, so the messed up sorting in categories would only affect the numeric parts of file names, not everything. Maybe that is the way to go.
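
To illustrate the point about sort keys, here is a minimal Python sketch. It is not MediaWiki's collation code: `uppercase_key` and `numeric_key` are hypothetical stand-ins that only assume "uppercase" compares upper-cased text and "numeric" additionally compares runs of digits by value. The file names are made up.

```python
import re


def uppercase_key(title):
    """Toy 'uppercase' sort key: compare the upper-cased title as plain text."""
    return title.upper()


def numeric_key(title):
    """Toy 'numeric' sort key: like uppercase, but digit runs compare by value."""
    # re.split with a capturing group alternates text (even index) and digits (odd index).
    parts = re.split(r"([0-9]+)", title.upper())
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


titles = ["Album page 10.jpg", "Album page 2.jpg", "album page 1.jpg"]
by_uppercase = sorted(titles, key=uppercase_key)
by_numeric = sorted(titles, key=numeric_key)

print(by_uppercase)  # page 10 sorts before page 2
print(by_numeric)    # page 2 sorts before page 10
```

Under these assumptions, titles without digits get the same ordering from both keys, which is why only the numeric parts of file names would look out of order while such an update is running.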

So first of all, categorylinks on Commons is a beast that no other wiki can compare to:

root@clouddb1021:/srv# ls -Ssh sqldata.s*/*/categorylinks.ibd | head
 171G sqldata.s4/commonswiki/categorylinks.ibd
  45G sqldata.s1/enwiki/categorylinks.ibd
  27G sqldata.s3/ruwikinews/categorylinks.ibd
  24G sqldata.s7/arwiki/categorylinks.ibd
  17G sqldata.s6/frwiki/categorylinks.ibd
  14G sqldata.s3/arzwiki/categorylinks.ibd
 9.3G sqldata.s5/cebwiki/categorylinks.ibd
 9.2G sqldata.s2/enwiktionary/categorylinks.ibd
 9.2G sqldata.s7/fawiki/categorylinks.ibd
 9.1G sqldata.s6/ruwiki/categorylinks.ibd

In other words, categorylinks on Commons is roughly four times bigger than the second-largest one (enwiki); by itself it is responsible for about 10% of the whole Commons database and one third of all categorylinks tables of all wikis combined. Of course, I had hoped that SDC would improve things, but it has actually made things worse, because many SDC tracking categories now show up among the most used ones (https://commons.wikimedia.org/wiki/Special:MostLinkedCategories e.g. Files with coordinates missing SDC location of creation (18,427,878 members), Pages with camera coordinates from SDC (7,931,992 members), and so on).
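
As a quick check of the "four times" figure, the sizes in the listing above can be compared directly. This Python sketch uses only the ten values shown, so the share it computes is among those ten tables, not among all wikis; the "one third of all categorylinks tables" and "10% of the Commons database" figures cannot be verified from the listing alone.

```python
# Sizes in GB, copied from the `ls -Ssh` listing above (top ten tables only).
sizes_gb = {
    "commonswiki": 171,
    "enwiki": 45,
    "ruwikinews": 27,
    "arwiki": 24,
    "frwiki": 17,
    "arzwiki": 14,
    "cebwiki": 9.3,
    "enwiktionary": 9.2,
    "fawiki": 9.2,
    "ruwiki": 9.1,
}

commons = sizes_gb["commonswiki"]
ratio_to_enwiki = commons / sizes_gb["enwiki"]
share_of_listed = commons / sum(sizes_gb.values())

print(f"commonswiki is {ratio_to_enwiki:.1f}x the size of enwiki")
print(f"commonswiki is {share_of_listed:.0%} of the ten listed tables")
```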

This is not sustainable. Commons needs to move away from this mode of categorization (to a tagging system for example). MediaWiki categories are not built for this. Any change would be extremely impactful (e.g. license categories could be moved to WCQS but we need to fix the OAuth thing, read more about this in T343131).

And the current size is after the optimizations we did (T342854: Drop cl_collation_ext index from categorylinks); otherwise it would have been even bigger. Of course any further improvements would be appreciated (e.g. I was never sure we really need the "cl_timestamp" column), but while those improvements are impactful on other wikis, on Commons they are not going to change much, as this is a Red Queen situation. If we improve things and drop 10% of the table, it immediately gets filled by new categorylinks rows. We have to constantly optimize just to keep its size from exploding even further.


Now, about updateCollation. Currently updateCollation works in two modes:

  • You can set a target table, in which case it creates a new table, copies everything over, and later you can point to the new table and drop the old one. This is rather recent and was done to reduce the impact of such changes. But that won't work for Commons: duplicating the table would put such major stress on the database that it could easily cause outages.
  • Or you do it the old way, which is just updating rows in place. That could work, but (1) it needs to be done with a lot of care: a small batch size, making sure the usual wait for replication AND an extra sleep are in place, making sure DBAs are aware, making sure no other data migration maintenance script runs at the same time, etc.; and (2) it will take a long time (on the order of months, plural), and in the meantime the community is going to see broken lists of files in categories. This needs to be communicated to them *beforehand*, and if there are major objections, that can break this project.
    • In other words, it's not "impossible"; I don't think I said that (sorry for the confusion). But it needs a lot of preparation and planning beforehand, otherwise it's going to cause a decent outage.
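
The in-place mode described above (small batches, a wait for replication, plus an extra sleep) can be sketched roughly as follows. This is an illustration in Python against an in-memory SQLite table, not the actual updateCollation.php logic: the `toy_categorylinks` table, its columns, and the `wait_for_replication` callback are all hypothetical.

```python
import sqlite3
import time


def make_toy_table(conn, titles, collation="uppercase"):
    """Create a hypothetical, heavily simplified stand-in for categorylinks."""
    conn.execute(
        "CREATE TABLE toy_categorylinks ("
        "id INTEGER PRIMARY KEY, title TEXT, sortkey TEXT, collation TEXT)"
    )
    conn.executemany(
        "INSERT INTO toy_categorylinks (title, sortkey, collation) VALUES (?, ?, ?)",
        [(t, t.upper(), collation) for t in titles],
    )
    conn.commit()


def update_in_place(conn, new_collation, make_sortkey, batch_size=2,
                    extra_sleep=0.0, wait_for_replication=lambda: None):
    """Rewrite sort keys batch by batch, pausing after every batch."""
    updated = 0
    while True:
        # Pick the next small batch of rows still on the old collation.
        rows = conn.execute(
            "SELECT id, title FROM toy_categorylinks "
            "WHERE collation != ? ORDER BY id LIMIT ?",
            (new_collation, batch_size),
        ).fetchall()
        if not rows:
            return updated
        with conn:  # one short transaction per batch
            conn.executemany(
                "UPDATE toy_categorylinks SET sortkey = ?, collation = ? WHERE id = ?",
                [(make_sortkey(title), new_collation, row_id) for row_id, title in rows],
            )
        updated += len(rows)
        wait_for_replication()   # hypothetical hook: block until replicas catch up
        time.sleep(extra_sleep)  # extra breathing room on top of the replication wait
```

While such a loop runs, the table holds a mix of old and new sort keys, which is the "broken list of files in categories" effect mentioned above.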

Does this mean we should decline this ticket? I don't think so. We should make improvements on how categories are used in commons to reduce the size of this table to something manageable and then we can revisit this again and plan and prepare for a run.

AntiCompositeNumber changed the task status from Open to Stalled. Jun 9 2024, 11:40 PM

Change #1039767 merged by jenkins-bot:

[operations/mediawiki-config@master] Add a note that you cannot change wgCategoryCollation easily

https://gerrit.wikimedia.org/r/1039767

Mentioned in SAL (#wikimedia-operations) [2024-06-17T20:01:57Z] <jforrester@deploy1002> Started scap: Backport for [[gerrit:1041659|[wikifunctionswiki] Remove right to promote/demote sysops and bureaucrats from staff (T365627)]], [[gerrit:1039767|Add a note that you cannot change wgCategoryCollation easily (T362494 T366809)]]

Mentioned in SAL (#wikimedia-operations) [2024-06-17T20:06:59Z] <jforrester@deploy1002> jforrester: Backport for [[gerrit:1041659|[wikifunctionswiki] Remove right to promote/demote sysops and bureaucrats from staff (T365627)]], [[gerrit:1039767|Add a note that you cannot change wgCategoryCollation easily (T362494 T366809)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

So first of all, categorylinks on Commons is a beast that no other wiki can compare to:

root@clouddb1021:/srv# ls -Ssh sqldata.s*/*/categorylinks.ibd | head
 171G sqldata.s4/commonswiki/categorylinks.ibd
  45G sqldata.s1/enwiki/categorylinks.ibd
  27G sqldata.s3/ruwikinews/categorylinks.ibd
  ...

In other words, categorylinks on Commons is roughly four times bigger than the second-largest one (enwiki); by itself it is responsible for about 10% of the whole Commons database and one third of all categorylinks tables of all wikis combined. (…)

A few months later, I wonder if any of the work done with on-wiki templates at T343131 has affected this at all?

Those changes mostly affected templatelinks, the changes on categories are probably quite small and get offset by the growth anyway. If you want to make large-scale changes:

  • We can normalize the target via linktarget. The code for it is basically ready; we did it for templatelinks and pagelinks, and the infra is there.
  • Maybe get rid of cl_timestamp altogether. It's barely used, it takes a lot of space, and it's unreliable (if a rollback happens, the timestamp resets).
  • The sortkey and the prefix could potentially be rethought. I think we can come up with a better way to keep most of the functionality while getting rid of a lot of data. I will think about it.
  • cl_collation can be normalized. We repeat one string 2 billion times. I don't understand why.
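
For a rough sense of scale on the last point, here is a back-of-the-envelope Python estimate. It rests on several assumptions that are not stated in this thread: that the repeated value is the string "uppercase" (the current collation mentioned earlier), that each stored value costs its byte length plus a one-byte length prefix, and that a normalized ID would take one byte. The 2 billion row count is taken from the comment above. Index copies and per-row storage overhead are ignored, so treat the result as an order-of-magnitude figure only.

```python
def bytes_saved_by_normalizing(rows, collation_name, id_bytes=1, length_prefix_bytes=1):
    """Estimate raw column bytes saved by storing a small ID instead of the string."""
    inline_bytes = len(collation_name.encode("utf-8")) + length_prefix_bytes
    return rows * (inline_bytes - id_bytes)


# Assumed inputs: ~2 billion rows, all carrying the same collation string.
saved = bytes_saved_by_normalizing(2_000_000_000, "uppercase")
print(f"roughly {saved / 1e9:.0f} GB of raw column data")
```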