Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Page MenuHomePhabricator

Reduce db queries in load.php calls
Open, In Progress, MediumPublic

Description

load.php is a critical path that needs to respond as quickly as possible to avoid breaking user experience and given that they are being called a lot (23% of all requests are to load.php = ~2B req/day), this redactions will have a major impact.

Looking at the queries logged by verbose logging:

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Marostegui triaged this task as Medium priority.Jan 7 2022, 6:24 AM
Marostegui moved this task from Triage to Refine on the DBA board.

I added both of these recently for T284097. If there is a better way to have some localisation messages that support full wikitext syntax, and have the ResourceLoader cache invalidated when the
messages change, I'm all ears.

(Maybe we could reduce the number of queries by doing one big query manually instead of using getTouched() for each title, like ResourceLoaderWikiModule does – not sure if that would be much better?)

  • DiscussionTools is triggering several db queries in load.php due to getting called in HookUtils::isAvailableForTitle which sorta doesn't make sense and it seems it's due to the hook handler onParserAfterTidy which gets triggered in every parse including parsing of mediawiki messages (6 of them happen in the load.php I examined) so if this needs to be on parsing of the page, it probably need to hook somewhere else in mediawiki not on ParserAfterTidy

This probably means some other code is parsing wikitext localisation messages without setting the interface flag. Can you track down which ResourceLoader modules are causing that?

(We previously discussed this in T291601, and I fixed this problem for a majority of cases.)

This probably means some other code is parsing wikitext localisation messages without setting the interface flag. Can you track down which ResourceLoader modules are causing that?

Well, I guessed one, maybe that is all: T298822: Copyright messages in ext.visualEditor.data (VisualEditorDataModule) are parsed without the interface flag

For context, I believe use of page_touched here was likely based on my recommendation at: T189229: VisualEditorDataModule spends significant time in hot startup module

Basically, it avoids the cost that's worse than a one cheap db query, which is a parser run and a dozen more queries.

This probably means some other code is parsing wikitext localisation messages without setting the interface flag. Can you track down which ResourceLoader modules are causing that?

I can give you the queries made in load.php requests. Is that useful? https://logstash.wikimedia.org/goto/fad5ac21e85939eebc78761b44dcf582

I added both of these recently for T284097. If there is a better way to have some localisation messages that support full wikitext syntax, and have the ResourceLoader cache invalidated when the
messages change, I'm all ears.

My thinking is that having an APCu cache on it for an hour should be okay, they don't take much space and being distributed in 200 appservers make the actual ttl much smaller and I don't think having outdated information for an hour is that big of a deal (I'm biased though). For example site lookup has APCu with ttl of an hour and when we add a new wiki, it takes an hour to show up in lists.

I can give you the queries made in load.php requests. Is that useful? https://logstash.wikimedia.org/goto/fad5ac21e85939eebc78761b44dcf582

Kind of useful, but I haven't managed to find any other modules doing parsing wrong like ext.visualEditor.data. So maybe it is the only one, which would be nice.

I learned a few other things from it though:

  • WikiLove and PageTriage are doing some queries from load.php, because they use ResourceLoaderWikiModule – I think this is expected
  • GrowthExperiments is doing some queries from load.php for the startup module, because it parses some messages in a packageFiles callback, e.g. when generating GEHelpPanelLinks in a few modules – this is a similar issue to T189229

GuidedTours is making one/two db queries caused by onResourceLoaderGetConfigVars T264817#7601839

At a glance the only nontrivial logic there is Title::exists() which has its internal cache mechanism (via the LinkCache, which stores titles in the WAN cache for one day). Since the check is always done for the same title, the DB traffic should be minimal (and I don't see how that call could ever result in two queries). I guess APC could speed it up marginally, but wouldn't change the number of DB queries. Could something be broken with the tracking? Or with the LinkCache mechanism? (Or maybe there is some TitleExists hook handler doing nontrivial work? But that wouldn't necessarily be related to GuidedTours.)

GrowthExperiments is making several db queries in unloading package files, caused by getSuggestedEditsConfigJson. These need APCu cache: https://performance.wikimedia.org/xhgui/run/symbol?id=61d700876ddcc011535c3a61&symbol=GrowthExperiments%5CHomepageHooks%3A%3AgetSuggestedEditsConfigJson

getSuggestedEditsConfigJson uses GrowthWikiConfig, which is cached (for one day in memcached + in in-process cache). APC would speed it up a little, but 4ms seems way too much for a memcached call. Not sure what's going on there...

Flow is doing similar to GT in getTermsOfUseMessagesVersion (but instead of calling Title::exists(), it calls Title::getTouched() triggering a db query: https://performance.wikimedia.org/xhgui/run/symbol?id=61d700876ddcc011535c3a61&symbol=Title%3A%3AgetTouched)

That should also be handled by LinkCache.

GrowthExperiments is triggering several queries through HelpPanelHooks https://performance.wikimedia.org/xhgui/run/symbol?id=61d700876ddcc011535c3a61&symbol=GrowthExperiments%5CHelpPanelHooks%3A%3AgetModuleData

Also uses GrowthWikiConfig.

  • GrowthExperiments is doing some queries from load.php for the startup module, because it parses some messages in a packageFiles callback, e.g. when generating GEHelpPanelLinks in a few modules – this is a similar issue to T189229

Hm, that's possible, we do parse a configuration variable to allow using {{#time}} in page name patterns. I guess that would need its own caching logic?
In this case we have an easy way out since we don't need that variable anymore - we only use it if $wgGEHelpPanelAskMentor is false and it is always true in production. We just need to make sure the message parsing doesn't happen in that case.

ppelberg subscribed.

Note: the Editing Team does not understand the issues raised with Discussion Tools to be urgent nor do they see a clear way of optimizing the code referenced in this task without sacrificing important functionality (e.g. localizing interface messages). As such, we're not going to prioritize work on this. If there is any information that leads you to a different conclusion from the one I've shared above, please ping us!

Change 756104 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GuidedTour@master] Do not load help/guider URL when the accompanying tour is not used

https://gerrit.wikimedia.org/r/756104

Change 756104 merged by jenkins-bot:

[mediawiki/extensions/GuidedTour@master] Do not load help/guider URL when the accompanying tour is not used

https://gerrit.wikimedia.org/r/756104

This probably means some other code is parsing wikitext localisation messages without setting the interface flag. Can you track down which ResourceLoader modules are causing that?

Well, I guessed one, maybe that is all: T298822: Copyright messages in ext.visualEditor.data (VisualEditorDataModule) are parsed without the interface flag

The patch for this is merged now.

I went looking, and found another one in Wikibase with this search: https://codesearch.wmcloud.org/search/?q=wfMessage&i=nope&files=.*Module.*&excludeFiles=&repos= (other results look like false positives)

Change 757449 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/Wikibase@master] SitesModule: Replace wfMessage() with use of MessageLocalizer

https://gerrit.wikimedia.org/r/757449

Change 757449 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] SitesModule: Replace wfMessage() with use of MessageLocalizer

https://gerrit.wikimedia.org/r/757449

Urbanecm_WMF changed the task status from Open to In Progress.Jan 31 2022, 11:33 AM

A somewhat related issue (but one that cannot be solved with caching) is that these DB calls block T155147: Do not initialise the database in tests when not needed; there should be some way to disable them in tests which are not that extension's hook handler's own tests.

I think the way forward here would be to get statistics on how often there are DB reads in load.php, because most of the remaining cases here are expected to fall back to the DB with an 1/hour/wiki or 1/day/wiki frequency, which shouldn't be problematic (but maybe something is wrong and it happens a lot more often).

It seems that should be doable by setting the queries TransactionProfiler expectation to 0 in load.php?

queries in load.php is way more than you think. Just a random one makes 50 db queries: https://logstash.wikimedia.org/goto/65cb014fa5f87a207bd9872aba396a5e

(some are related to this T322528)

queries in load.php is way more than you think. Just a random one makes 50 db queries: https://logstash.wikimedia.org/goto/65cb014fa5f87a207bd9872aba396a5e

  • 32 of 50 seem like a connection reuse problem, with the SET command and heartbeat check on a new DB connection being repeated 16 times.
  • 2 more connection-related calls (DatabaseMysqlBase::doSelectDomain and Database::beginIfImplied - I wonder what causes an implicit transaction, maybe this was an RL cache miss so the results got written back to objectcache? but there are no non-SELECT queries in the log)
  • an SqlBagOStuff lookup for ResourceLoader's own cache
  • 3 queries for loading the data of the current user
  • another SqlBagOStuff lookup related to AbuseFilter
  • 11 revision lookups for user and gadget JS pages:
    • 7 PageStore::getPageByNameViaLinkCache calls (which in theory should use the WAN cache)
    • a WikiModule::preloadTitleInfo call for the same 7 pages (apparently not really succesful at preloading)
    • 2 RevisionStore::loadSlotRecordsFromDb calls and one RevisionStore::fetchRevisionRowFromConds call (I guess these are pages where the content is fetched for some reason? but why aren't the blob store DB lookups logged then? maybe the WAN cache worked for that layer but not the one above?)

If this is really a random call, something is very broken: on a random call, page and revision lookups should be WAN-cached (at the least for very commonly looked up pages like Group-autoconfirmed.js). In any case, this is a user scripts / gadgets lookup for a logged-in user, which is one of the worst cases (need to check the user identity, need to fetch content from wiki pages, some of which are user-specific and thus more likely to be a cache miss). Most load.php calls return content that's identical for all users, so it should have significantly less queries (apart maybe from the SET/heartbeat ones, not sure what's up with those).

I still think it would be worthwhile to set a 0 queries expectation, at least temporarily, maybe limited to some smaller wikis if we are worried about log volume, and get a better understanding of how frequently these reads happen. I don't think it's reasonable to invest much time in trying optimize load.php DB reads without even trying to measure what we are intending to optimize.

The most common load.php is https://en.wikipedia.org/w/load.php?modules=startup&only=scripts&raw=1. This also happens to generally be the most intense as it has to evaluate all modules in the system. After loading that once on mwdebug1001 to warmup, and then again with WikimediaDebug verbose logs, querying on the mediawiki dashboard channel:DBQuery AND message:("SELECT" AND NOT "fetchSecondsSinceHeartbeat" AND NOT "serverIsReadOnly")

I find essentially 3 queries on this request:

  1. mainstash: One fixed query to batch preload file data for any requested file modules.
  2. wiki db: One fixed query batch preload wiki data for any requested wiki modules.
  3. wiki db: Seven unbatched queries for GrowthExperiments to determine existence of certain wiki pages, via ext.growthExperiments.Help/data.json from HelpPanelHooks::getModuleData (based on this flamegraph).

All of load.php traffic can be grouped into four categories:

  • A: The majority load one or more file modules. These perform one db query (nr 1 above).
  • B: The ones that load one or more wiki modules (gadgets). These perform one db query (nr 2 above).
  • C: The one that loads a GrowthExperiments module. These perform two queries (nr 1 and 3 above).
  • D: The startup module, which performs three (nr 1-3 above).

Beyond that is the "user" module which was described above.

About the GrowthExperiments module, the pages it looks up are specified by https://en.wikipedia.org/wiki/MediaWiki:GrowthExperimentsConfig.json and are user-facing content pages, e.g. not the meta namespaces like MediaWiki/Template/Module etc that qualify for Memcached-based LinkCache. These are explicilty disqualified from LinkCache. I see two broad ways to improve this if it were needed:

  • Make use of LinkBatch in GrowthExperiments so that each LinkRenderer call doesn't produce a separate db query.
  • Or, depending on product/UX requirements, follow in the footsteps of T313462 and don't offer red links. Information about bad configuration could perhaps be given to sysadmins by other means, e.g. via the Special page for that configuration.

I'm not suggesting wrapping these specific calls in WAN-cache for the simple reason that it adds code complexity and need for more purging/checkKeys in various places. That's doable, but not preferred given propagation is already behind a 5-minute delay of the startup module. And thus indeed should be redundant with HTTP caching. We need to quantify first what actual traffic costs are reaches backends, assuming latencies are fine.

ResourceLoader (Grafana dashboard) has a cache hit rate of over 99.9% (afaik higher than any other service or endpoint, including for logged-in users!). Globally for all datacenters/wikis/skins/usergroups we have about ~100 per second load.php, of which about 80% is in category A and B.

Ah, I see. I didn't think of LinkRenderer making queries. Thanks for looking into it!

Yeah, I think we can just use blue links.

Change 890130 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@master] Do not trigger DB lookup in ResourceLoader callbacks

https://gerrit.wikimedia.org/r/890130

Change 890130 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Do not trigger DB lookup in ResourceLoader callbacks

https://gerrit.wikimedia.org/r/890130