
Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css
Closed, Resolved · Public · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

What happens?:

As of 11:34, 3 June 2024 (CEST), I am still getting reports that 2 people do not see the new CSS for infoboxes. Screenshot below.

What should have happened instead?:

I'm an experienced dev and I know caching is hard (-:
But there is a big problem here: resources are being loaded in different versions. Mediawiki:Mobile.css/Common.css must be loaded with the same methods as the modular CSS provided by default gadgets.

A purge of this cache by interface administrators would be fine too, but I really think you need to find a way to manage CSS/JS. Maybe you could build CSS/JS packages and load them in a specific version (e.g. files with an added date-time). This cache should be rebuilt after a specific set of MediaWiki pages is modified.

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia): pl.wikipedia.

Other information (browser name/version, screenshots, etc.):

Desktop version still not loading new gadget for anonymous users:

2024-06-03-11-31_Lotniskowce_typu_Nimitz_pl_wikipedia_PMG.png (1×1 px, 209 KB)

Screenshot's date is 2024-06-03 at 11:31 (4 days after creating the infobox gadget).

Event Timeline

This is because ResourceLoader modules would be inlined into the page HTML, either as a stylesheet link and loaded directly (CSS-only gadgets) or in the RLPAGEMODULES array and loaded by the startup module (general gadgets), while the page HTML can be cached up to one week(?) in wmf-hosted wikis for anon users.

So a new ResourceLoader module or gadget would require at least one week as a grace period before it is loaded for all anon users, but code changes to existing modules would take less time to propagate.

Yeah, I already heard in T362747 that it's probably 7 days, but might be up to 14 days (maybe).

I think this:

...
<link rel="stylesheet" href="/w/load.php?lang=pl&amp;modules=ext.flaggedRevs.basic%2Cicons%7Cext.relatedArticles.styles%7Cext.wikimediaBadges%7Cext.wikimediamessages.styles%7Cmediawiki.hlist%7Cmobile.init.styles%7Cskins.minerva.base.styles%7Cskins.minerva.codex.styles%7Cskins.minerva.content.styles.images%7Cskins.minerva.icons.wikimedia%7Cskins.minerva.mainMenu.icons%2Cstyles%7Cwikibase.client.init&amp;only=styles&amp;skin=minerva">
<script async="" src="/w/load.php?lang=pl&amp;modules=startup&amp;only=scripts&amp;raw=1&amp;skin=minerva"></script>
...

Should be this:

<script async="" src="/static/wgvars-site-20240603.js"></script>
<script async="" src="/static/wgvars-user-20240603.js"></script>
<script id="wgvars-page">...</script>
<link rel="stylesheet" href="/static/minerva-20240603.css">
<script async="" src="/static/minerva-20240603.js"></script>
<script async="" src="/static/gadgets-default-20240603.js"></script>

So some kind of a split of global vars and date for cache busting.

Gadgets and most other ResourceLoader modules are not inlined. /w/load.php?lang=pl&modules=startup will load the module registry (with Cache-Control: max-age=300), and the module registry includes a content-based hash for each module which is used as a cache-busting parameter when loading those modules (which happens via JS). Mobile.css is one of the few exceptions, it is included (along with Common.css etc) as part of the site.styles module via a link hardcoded in the HTML, but like the startup module it has a very short expiry.
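The content-hash cache busting described above can be sketched roughly as follows. This is an illustration only: the `module_version` and `module_url` helpers, the SHA-1 choice, the 5-character truncation and the `version` parameter name are assumptions made for the sketch, not ResourceLoader's actual implementation.

```python
import hashlib
from urllib.parse import urlencode


def module_version(content: str) -> str:
    """Hypothetical helper: derive a short version string from module content."""
    return hashlib.sha1(content.encode("utf-8")).hexdigest()[:5]


def module_url(name: str, content: str, lang: str = "pl", skin: str = "vector-2022") -> str:
    """Hypothetical helper: build a load.php-style URL that changes whenever the content changes."""
    query = urlencode({
        "lang": lang,
        "modules": name,
        "skin": skin,
        "version": module_version(content),  # assumed parameter name for this sketch
    })
    return "/w/load.php?" + query


# Editing the CSS changes the hash, so the URL changes and any cached copy is bypassed.
old_url = module_url("ext.gadget.infobox", ".infobox { border: 1px solid; }")
new_url = module_url("ext.gadget.infobox", ".infobox { border: 2px solid; }")
print(old_url != new_url)
```

The point of the sketch is that this only helps for modules the client already knows it must load; a module missing from the cached page's module list never gets requested at all.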

It's hard to figure out from the task description what you are reporting as an error. The removal of Mobile.css broke the desktop site for some people? How is that supposed to happen?

For this particular case:

<title>Wikipedia, wolna encyklopedia</title>
<script>
...
RLSTATE={"skins.vector.user.styles":"ready","ext.gadget.wikiflex":"ready","ext.gadget.infobox":"ready","ext.gadget.hlist":"ready",...};
RLPAGEMODULES=["ext.scribunto.logs","site","mediawiki.page.ready","skins.vector.js","ext.centralNotice.geoIP","ext.centralNotice.startUp","ext.gadget.ll-script-loader","ext.gadget.veKeepParameters","ext.gadget.szablon-galeria","ext.gadget.NavFrame",...];</script>
<script>(RLQ=window.RLQ||[]).push(function(){mw.loader.impl(function(){return["user.options@12s5i",function($,jQuery,require,module){mw.user.tokens.set({"patrolToken":"+\\","watchToken":"+\\","csrfToken":"+\\"});
}];});});</script>
<link rel="stylesheet" href="/w/load.php?lang=pl&amp;modules=ext.uls.interlanguage%7Cext.visualEditor.desktopArticleTarget.noscript%7Cext.wikimediaBadges%7Cext.wikimediamessages.styles%7Cskins.vector.icons%2Cstyles%7Cskins.vector.search.codex.styles&amp;only=styles&amp;skin=vector-2022">
<script async="" src="/w/load.php?lang=pl&amp;modules=startup&amp;only=scripts&amp;raw=1&amp;skin=vector-2022"></script>
<meta name="ResourceLoaderDynamicStyles" content="">
<link rel="stylesheet" href="/w/load.php?lang=pl&amp;modules=ext.gadget.citation-access-info%2Chlist%2Cinfobox%2Csmall-references%2Csprawdz-problemy-szablony%2Cwikiflex&amp;only=styles&amp;skin=vector-2022">
<link rel="stylesheet" href="/w/load.php?lang=pl&amp;modules=site.styles&amp;only=styles&amp;skin=vector-2022">
<meta name="generator" content="MediaWiki 1.43.0-wmf.7">

CSS-only gadgets as stylesheet come with the HTML:

<link rel="stylesheet" href="/w/load.php?lang=pl&amp;modules=ext.gadget.citation-access-info%2Chlist%2Cinfobox%2Csmall-references%2Csprawdz-problemy-szablony%2Cwikiflex&amp;only=styles&amp;skin=vector-2022">

So the main point here is not the expiry of the startup module URL or the site.styles one being too long, it's the cached HTML that won't direct the new gadget to be loaded, either via stylesheet or RLPAGEMODULES. (Or, it will be fine if the site.styles module response is cached for as long as the HTML, but, no)

By the way, the Cache-Control header seen in the browser may not reflect the actual backend caching behaviour: the HTML even has Cache-Control: private, s-maxage=0, max-age=0, must-revalidate for anon users, probably because of tracking cookies like WMF-Last-Access.

Func renamed this task from "Default gadget not loaded yet while Mediawiki:Mobile.css" to "Newly added default gadget not loaded for anon users for days while migrating away from Common.css and Mobile.css". Jun 3 2024, 11:15 PM

Thanks Func. Yes, I removed infobox styles both from mobile CSS (Mobile.css) and desktop CSS (Common.css). The problem was last reported 4 days after adding the default gadget when the styles were removed from Mobile.css and Common.css.

Additionally, on 10:32, 1 June 2024 (CEST), Msz2001 reported:

I just checked and in incognito, the new gadget with infobox styles is already loading on about 80% of the pages (i.e. 8 out of randomly selected 10).
(You can check in the console: mw.loader.getState('ext.gadget.infobox') === 'ready';).

It seems the new infobox gadget ("ext.gadget.infobox") is still not being loaded when stale HTML is served from cache. The HTML includes RLSTATE (which either does or does not contain the new default gadget: "ext.gadget.infobox":"ready"). The RLSTATE in the HTML seems to be consistent with mw.loader.getState(.).
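A rough way to run the same check against saved page source rather than in the browser console, sketched in Python. The helper names and the two sample HTML strings are made up for illustration, and the regular expression assumes RLSTATE is a flat JSON object terminated by `};` as in the snippets quoted in this task.

```python
import json
import re


def rlstate_from_html(html: str) -> dict:
    """Extract the RLSTATE object embedded in the page HTML (empty dict if absent)."""
    match = re.search(r"RLSTATE=(\{.*?\});", html, re.DOTALL)
    if match is None:
        return {}
    return json.loads(match.group(1))


def gadget_listed(html: str, module: str = "ext.gadget.infobox") -> bool:
    """True when the cached HTML marks the given module as "ready"."""
    return rlstate_from_html(html).get(module) == "ready"


# Illustrative samples only: a fresh page lists the gadget, a stale one does not.
fresh_html = '<script>RLSTATE={"skins.vector.user.styles":"ready","ext.gadget.infobox":"ready"};RLPAGEMODULES=["site"];</script>'
stale_html = '<script>RLSTATE={"skins.vector.user.styles":"ready"};RLPAGEMODULES=["site"];</script>'

print(gadget_listed(fresh_html), gadget_listed(stale_html))
```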

When I look at even some old articles I see this is still false in some cases

image.png (885×1 px, 201 KB)

(times you see in console are CEST timezone, Polish timezone, so just a few minutes ago)

Source of Sofala viewed as anon:


(note that this HTML doesn't include "ext.gadget.infobox":"ready")

The same also happened in Russian Wikipedia as I introduced https://ru.wikipedia.org/wiki/MediaWiki:Gadget-common-site.css as mentioned in other task.

The bug report here is that it breaks the default expectation present in desktop version that the CSS and JS updates typically get propagated faster than ‘weeks’. Local developers should not be expected to keep the code in two places to allow for the update to propagate, caching system is fundamentally broken if it requires this to be the case (as it required in Russian Wikipedia).

So the issue is that you have newly added the "infobox" gadget (in the sense of making it default) and edited Mobile.css (etc), and changing the list of RL modules to be loaded for a page and changing the contents of an RL module are different actions with different caching behavior (since the module list needs to be included in the page HTML while the contents of a module doesn't)? I doubt much can, or should, be done about that. Invalidating caches is expensive; you can't invalidate every page when you make a CSS/JS change, and you don't want to delay all CSS/JS changes until the cache expires.

Given that duplicate CSS rules are mostly harmless, I think the solution here is just to restore the old CSS rules temporarily.

Can you clarify why the desktop version does not have the similar problem with CSS caching, though?

Are you sure it doesn't? I don't think there is any difference in how Mobile.css and e.g. Common.css are loaded.

Yes, I am pretty sure that is the case, desktop CSS cache was not a problem for me while testing anonymously, but mobile CSS cache had to resort to what you described (duplicating the styles for two weeks in JS-loaded Mobile.css until all pages get updated cache).

Can you clarify why the desktop version does not have the similar problem with CSS caching, though?

Nux has provided screenshots and HTML of desktops being affected, so it's not the case.

Hmm. True, but the same screenshot has a functioning infobox styling though. Whereas AFAIK on mobile it functions differently, the old styling would not show up at all but the new styling is also not propagated (I guess since Mobile.css gets updated faster than cached HTML).

That's simply because they reverted the Common.css removal.

So the issue is that you have newly added the "infobox" gadget (in the sense of making it default) and edited Mobile.css (etc), and changing the list of RL modules to be loaded for a page and changing the contents of an RL module are different actions with different caching behavior (since the module list needs to be included in the page HTML while the contents of a module doesn't)? I doubt much can, or should, be done about that. Invalidating caches is expensive; you can't invalidate every page when you make a CSS/JS change, and you don't want to delay all CSS/JS changes until the cache expires.

Given that duplicate CSS rules are mostly harmless, I think the solution here is just to restore the old CSS rules temporarily.

It's not really harmless though: it is causing stress for us devs and straining relationships with the community ¯\_(ツ)_/¯

The cache should be in sync. It wouldn't be a problem if all CSS would be in the same version. The problem is one part of editable CSS is updated and the other part of editable CSS is not updated...

Though I know this is not trivial, cache is always hard, it's definitely a bug. Ideal cache here would probably have few tiers:

  • Software version based (updated weekly), that would be CSS that came with the app and mostly style the „shell” (as PWA call it). Not sure if realistic now, but maybe with Codex...
  • Page version based cache.
  • User preferences based cache.
  • Editable, site-wide based cache.

Or just one single key that caches everything together. So preserves a moment in time until something invalidates it.

Gadgets depend on user preferences (even if a module is default, you can still disable it). So either you have a URL which is the same for everyone but the CSS/JS it returns depends on user preferences (meaning it cannot be cached), or you have a URL which changes when the CSS/JS changes, but then that URL has to come from somewhere, so you need to put it in the HTML in some form, and then it's the HTML you cannot cache (or need to invalidate all the time). Either of those would be significantly bigger problems than having to be careful when deploying new default gadgets.

I guess the gadgets UI could include some sort of warning about this, maybe an edit notice.

Good point! So either default+hidden could be the default package... or create a package optimized for anons. There seem to be some optimizations for anons already.

I guess we could piggyback with some other gadget as a workaround or hackaround ;-)

P.S.: It works (-:
https://pl.wikipedia.org/w/index.php?title=MediaWiki:Gadgets-definition&diff=prev&oldid=73919892

We ran into this again this week during the dark mode rollout. We deployed on the 2nd. According to the data, 1 in 5 pages are still serving cached HTML 3 days later: https://grafana.wikimedia.org/d/000000566/overview?orgId=1&from=1717616356034&to=1720208356035&viewPanel=35
The instrumentation code is here: https://en.wikipedia.org/wiki/MediaWiki:Mobile.js#L-8

Vgutierrez moved this task from Backlog to Traffic team actively servicing on the Traffic board.
Vgutierrez subscribed.

I'm taking a look today and I'll report back, sorry about the delay

As mentioned in Slack, the CDN enforces a maximum TTL of 24 hours, which is not being triggered on /w/load.php because, as @Tgr mentioned above, the TTL is set to 300 seconds by the applayer.

What could lead to confusion is how a revalidated cache hit is reported to the user via the x-cache, x-cache-status and server-timing headers.

After the TTL expires, a conditional request is made to the applayer (the mw-web cluster in the case of /w/load.php). If the applayer replies with a 304, the cached content is served, but x-cache, x-cache-status and server-timing will still report a hit-front.

ATS reporting a 304 from the applayer:
Date:2024-07-16 Time:11:51:55 ConnAttempts:0 ConnReuse:23 TTFetchHeaders:146 ClientTTFB:146 CacheReadTime:0 CacheWriteTime:0 TotalSMTime:146 TotalPluginTime:0 ActivePluginTime:0 TotalTime:146 OriginServer:mw-web-ro.discovery.wmnet OriginServerTime:-1 CacheResultCode:TCP_REFRESH_HIT CacheWriteResult:- ReqMethod:GET RespStatus:304 OriginStatus:304 ReqURL:http://en.wikipedia.org/w/load.php?lang=en&modules=ext.uls.interlanguage%7Cext.visualEditor.desktopArticleTarget.noscript%7Cext.wikimediaBadges%7Cext.wikimediamessages.styles%7Cskins.vector.icons%2Cstyles%7Cskins.vector.search.codex.styles&only=styles&skin=vector-2022&vgutierrez=1 ReqHeader:User-Agent:curl/7.88.1 ReqHeader:Host:en.wikipedia.org ReqHeader:X-Client-IP:<REDACTED> ReqHeader:Cookie: BerespHeader:Set-Cookie:- BerespHeader:Cache-Control:public, max-age=300, s-maxage=300, stale-while-revalidate=60 BerespHeader:Connection:- RespHeader:X-Cache-Int:cp6015 hit RespHeader:Backend-Timing:-
The client gets it reported as a hit-front:
< x-cache: cp6015 miss, cp6012 hit/4                                                    
< x-cache-status: hit-front                                                             
< server-timing: cache;desc="hit-front", host;desc="cp6012"

Regarding the HTML page itself, the applayer is signaling an s-maxage of 3600 seconds to the CDN: cache-control: s-maxage=3600, must-revalidate, max-age=0

Happy to perform additional checks on the CDN if needed, just let me know impacted URLs

@Vgutierrez Out of curiosity, is a 304 response the only way to produce an x-cache of miss, hit/X ?

nope:

vgutierrez@carrot:~$ curl https://en.wikipedia.org/wiki/Pacific_Seabird_Group?vgutierrez=never_used_before_today -v -o /dev/null -s 2>&1 |grep -i x-cache
< x-cache: cp6015 miss, cp6012 miss
< x-cache-status: miss
vgutierrez@carrot:~$ curl https://en.wikipedia.org/wiki/Pacific_Seabird_Group?vgutierrez=never_used_before_today -v -o /dev/null -s 2>&1 |grep -i x-cache
< x-cache: cp6015 miss, cp6012 hit/1
< x-cache-status: hit-front

first request with a cold cache triggers a miss in both layers, second request is a hit on Varnish
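For reading these headers in bulk, a small parser along the following lines may help. It is only a sketch: `parse_x_cache` is a hypothetical helper, and the layer interpretation (the last entry being the Varnish frontend, the first the ATS backend) is taken from how the headers are described elsewhere in this task.

```python
def parse_x_cache(value: str) -> list:
    """Split an x-cache header into (host, state, hit_count) tuples, in header order."""
    layers = []
    for entry in value.split(","):
        host, status = entry.split()
        # "hit/4" carries a hit counter; "miss" and "pass" do not.
        state, _, hits = status.partition("/")
        layers.append((host, state, int(hits) if hits else 0))
    return layers


# The two responses from the curl runs above: cold cache, then a frontend hit.
cold = parse_x_cache("cp6015 miss, cp6012 miss")
warm = parse_x_cache("cp6015 miss, cp6012 hit/1")

print(cold)
print(warm)
```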

Could varnish be behaving differently on the mobile domain?

I'm seeing HTML older than 24hrs as we speak.

When I visit the page https://en.wikipedia.org/wiki/Harmon_S._Cutting in the Chrome browser in incognito mode, I see the following response headers:

Accept-Ch:
Accept-Ranges: bytes
Age: 5
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Content-Encoding: gzip
Content-Language: en
Content-Length: 16899
Content-Type: text/html; charset=UTF-8
Date: Thu, 18 Jul 2024 20:53:02 GMT
Last-Modified: Fri, 12 Jul 2024 16:51:05 GMT
Nel: { "report_to": "wm_nel", "max_age": 604800, "failure_fraction": 0.05, "success_fraction": 0.0}
Origin-Trial: (redacted)
Report-To: { "group": "wm_nel", "max_age": 604800, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
Server: mw-web.codfw.main-8564bc6dcd-r7qdp
Server-Timing: cache;desc="hit-front", host;desc="cp4038"
Set-Cookie: NetworkProbeLimit=0.001;Path=/;Secure;SameSite=Lax;Max-Age=3600
Strict-Transport-Security: max-age=106384710; includeSubDomains; preload
Vary: Accept-Encoding,Cookie,Authorization
X-Cache: cp4038 hit, cp4038 hit/1
X-Cache-Status: hit-front

In the HTML I see:

NewPP limit report
Parsed by mw‐web.codfw.main‐58c7647fd9‐htzj8
Cached time: 20240713045518
Cache expiry: 2592000
Reduced expiry: false
Complications: [vary‐revision‐sha1]
CPU time usage: 0.481 seconds
Real time usage: 0.700 seconds
Preprocessor visited node count: 9453/1000000
Post‐expand include size: 27294/2097152 bytes
Template argument size: 4777/2097152 bytes
Highest expansion depth: 17/100
Expensive parser function count: 2/500
Unstrip recursion depth: 1/20
Unstrip post‐expand size: 25061/5000000 bytes
Lua time usage: 0.285/10.000 seconds
Lua memory usage: 6807874/52428800 bytes
Number of Wikibase entities loaded: 1/400

This was cached on 13th July but still being served on 18th July. Any explanations about what is happening here?

Note: If I append ?vgutierrez=never_used_before_today to the URL it seems to bypass the cache - it is my understanding that any query string parameter will do that.

I just replicated your findings on esams:

my request from my computer looks like this: $ curl -v -s --connect-to en.wikipedia.org:443:$(dig +short text-lb.esams.wikimedia.org) "https://en.wikipedia.org/wiki/Harmon_S._Cutting":

x-cache status confirms it was miss on the CDN on both layers (varnish and ATS):

< x-cache: cp3073 miss, cp3073 miss                                                                                                                                                   
< x-cache-status: miss                                                                                                                                                                
< server-timing: cache;desc="miss", host;desc="cp3073"

in the HTML I get the same NewPP limit report:

<!--                                                                                       
NewPP limit report                                                                         
Parsed by mw‐web.codfw.main‐58c7647fd9‐htzj8                                               
Cached time: 20240713045518                                                                                                                                                           
Cache expiry: 2592000                                                                                                                                                                 
Reduced expiry: false                                                                                                                                                                 
Complications: [vary‐revision‐sha1]                                                                                                                                                                                                                                                                                                                                         
CPU time usage: 0.481 seconds                                                                                                                                                                                                                                                                                                                                               
Real time usage: 0.700 seconds                                                                                                                                                                                                                                                                                                                                              
Preprocessor visited node count: 9453/1000000                                                                                                                                         
Post‐expand include size: 27294/2097152 bytes                                              
Template argument size: 4777/2097152 bytes                                                 
Highest expansion depth: 17/100                                                                                                                                                                                                                                                                                                                                             
Expensive parser function count: 2/500                                                                                                                                                                                                                                                                                                                                      
Unstrip recursion depth: 1/20                                                                                                                                                                                                                                                                                                                                               
Unstrip post‐expand size: 25061/5000000 bytes                                                                                                                                                                                                                                                                                                                               
Lua time usage: 0.285/10.000 seconds                                                                                                                                                                                                                                                                                                                                        
Lua memory usage: 6807874/52428800 bytes                                                                                                                                                                                                                                                                                                                                    
Number of Wikibase entities loaded: 1/400                                                                                                                                                                                                                                                                                                                                   
-->

The content was cached on 20240713045518 for 2592000 seconds / 720 hours / 30 days. However, that cache layer isn't at the CDN; this has been served by the applayer. We can reproduce this from a CDN node by asking the applayer for that URL:

vgutierrez@cp3073:~$ curl -s  --connect-to en.wikipedia.org:443:$(dig +short mw-web.discovery.wmnet):4450 "https://en.wikipedia.org/wiki/Harmon_S._Cutting" |egrep "Cached time|Cache expiry"
Cached time: 20240713045518
Cache expiry: 2592000

so that HTML is being cached by something at the applayer mw-web.discovery.wmnet:4450 but definitely not the CDN
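The arithmetic on the two values from the NewPP report can be checked with a few lines of Python. This is a sketch under one assumption: that the "Cached time" timestamp is in UTC. `parser_cache_expiry` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def parser_cache_expiry(cached_time: str, expiry_seconds: int) -> datetime:
    """Return the moment a parser cache entry would expire (timestamp assumed to be UTC)."""
    cached = datetime.strptime(cached_time, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return cached + timedelta(seconds=expiry_seconds)


expiry_seconds = 2592000
print(expiry_seconds // 3600, "hours")   # 720
print(expiry_seconds // 86400, "days")   # 30

expires = parser_cache_expiry("20240713045518", expiry_seconds)
print(expires.isoformat())
```

Under that assumption, an entry cached on 13 July would remain valid until 12 August, so seeing it on 18 July is within the configured window.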

So the NewPP limit report refers to the MediaWiki parser cache. Given that the last modification of https://en.wikipedia.org/wiki/Harmon_S._Cutting was at 17:10, 13 February 2024 according to its history page https://en.wikipedia.org/w/index.php?title=Harmon_S._Cutting&action=history, it doesn't seem to be a big deal? But I'll let the experts on that part of the infrastructure chime in.

We can ignore the NewPP limit report comment for now!

I am getting a different response from you, though, when I run curl -v -s --connect-to en.wikipedia.org:443:$(dig +short text-lb.esams.wikimedia.org) "https://en.wikipedia.org/wiki/Harmon_S._Cutting" and am seeing a cache hit:

jdlrobson@wmf3447 MobileFrontend % curl -v -s --connect-to en.wikipedia.org:443:$(dig +short text-lb.esams.wikimedia.org) "https://en.wikipedia.org/wiki/Harmon_S._Cutting"
* Connecting to hostname: 185.15.59.224
*   Trying 185.15.59.224:443...
* Connected to 185.15.59.224 (185.15.59.224) port 443
* ALPN: curl offers h2,http/1.1
* (304) (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/ssl/cert.pem
*  CApath: none
* (304) (IN), TLS handshake, Server hello (2):
* (304) (IN), TLS handshake, Unknown (8):
* (304) (IN), TLS handshake, Certificate (11):
* (304) (IN), TLS handshake, CERT verify (15):
* (304) (IN), TLS handshake, Finished (20):
* (304) (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / AEAD-CHACHA20-POLY1305-SHA256
* ALPN: server accepted h2
* Server certificate:
*  subject: C=US; ST=California; L=San Francisco; O=Wikimedia Foundation, Inc.; CN=*.wikipedia.org
*  start date: Oct 18 00:00:00 2023 GMT
*  expire date: Oct 16 23:59:59 2024 GMT
*  subjectAltName: host "en.wikipedia.org" matched cert's "*.wikipedia.org"
*  issuer: C=US; O=DigiCert Inc; CN=DigiCert TLS Hybrid ECC SHA384 2020 CA1
*  SSL certificate verify ok.
* using HTTP/2
* [HTTP/2] [1] OPENED stream for https://en.wikipedia.org/wiki/Harmon_S._Cutting
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: en.wikipedia.org]
* [HTTP/2] [1] [:path: /wiki/Harmon_S._Cutting]
* [HTTP/2] [1] [user-agent: curl/8.4.0]
* [HTTP/2] [1] [accept: */*]
> GET /wiki/Harmon_S._Cutting HTTP/2
> Host: en.wikipedia.org
> User-Agent: curl/8.4.0
> Accept: */*
> 
< HTTP/2 200 
< date: Thu, 18 Jul 2024 21:07:14 GMT
< server: mw-web.eqiad.main-5866b6d8d8-ppqf4
< x-content-type-options: nosniff
< content-language: en
< origin-trial: AonOP4SwCrqpb0nhZbg554z9iJimP3DxUDB8V4yu9fyyepauGKD0NXqTknWi4gnuDfMG6hNb7TDUDTsl0mDw9gIAAABmeyJvcmlnaW4iOiJodHRwczovL3dpa2lwZWRpYS5vcmc6NDQzIiwiZmVhdHVyZSI6IlRvcExldmVsVHBjZCIsImV4cGlyeSI6MTczNTM0Mzk5OSwiaXNTdWJkb21haW4iOnRydWV9
< accept-ch: 
< vary: Accept-Encoding,Cookie,Authorization
< last-modified: Fri, 12 Jul 2024 16:51:05 GMT
< content-type: text/html; charset=UTF-8
< age: 1718
< x-cache: cp3073 miss, cp3073 hit/2
< x-cache-status: hit-front
< server-timing: cache;desc="hit-front", host;desc="cp3073"

Your request is hitting the same cp node in esams that I hit a few minutes ago (cp3073). My request triggered a cache miss and resulted in ATS and Varnish caching that URL, hence it is now being served from their cache. That's totally expected.

This was cached on 13th July but still being served on 18th July. Any explanations about what is happening here?

I believe what you are pointing to is the Parser cache report in the generated article HTML. We configure all wikis with a 30 day (2592000 seconds) $wgParserCacheExpireTime.

If I use a guaranteed cache busting URL to fetch the [[w:Harmon S. Cutting]] page, I get the exact same parser cache report in the HTML. This seems reasonable for an article that has not had an edit since 2024-02-13T17:10:46.

I believe OutputPage::checkLastModified() sets the Last-Modified header to the date of the last edit (ish). So it probably won't tell you anything about Varnish behavior.

This issue may serve as a data point to reconsider T190083. If I understand correctly, the reason plwiki is moving in this direction is that there is no mechanism other than a hidden/default/styles-only gadget to reliably and performantly add a (small, reasonable) CSS on a wiki site-wide. Keep in mind that it is explicitly less performant to use TemplateStyles for things that a majority of pages need, since that would need to be redownloaded with every article instead of leveraging the browser cache (see also Commons and the additional database scale reason to avoid TemplateStyles in popular templates, per T343131#9503575).

The reason no other mechanism is available, is that:

  • MediaWiki core standardises on Common.css, but MobileFrontend disables this.
  • MobileFrontend offers Mobile.css as clean and mutually exclusive alternative, and if you're okay with maintaining a copy on-wiki, this would work, except that its current implementation causes a FOUC and breaks Grade C due to being loaded async via JavaScript (T190083).
  • The suggested alternative of Minerva.css causes styles to load twice for desktop users using Minerva, and similarly requires maintaining a duplicate copy.

I'm chatting with @Vgutierrez right now who has been able to reproduce the issue but we don't have answers. Sorry for the confusion around parser cache. It's not related.

@Krinkle the issue you are describing is a tangential issue from the one we are trying to understand right now. (FWIW I am currently assisting English Wikipedia / @Izno to remove Mobile.css in favor of Common.css).

The issue here is that the CSS was being served correctly but the newly created gadget was not present in the stylesheet in the page after 24hrs e.g. load.php?module=ext.gadget.infobox for anonymous users and we've been telling editors that this should only be a problem for 24hrs.

We have this issue right now with the dark mode roll out - if you click Special:Random in an incognito window the class "vector-feature-night-mode-disabled" is present on the HTML element on 1 in 5 page views. This results in the dark mode control not appearing in the right side of the screen as there is logic that depends on this class being present:

e.g. it looks like this:

image.png (720×405 px, 44 KB)

instead of this:
image.png (720×385 px, 59 KB)

It has not been possible to generate HTML where the class vector-feature-night-mode-disabled is present on the HTML element since Tuesday which is the subject of concern.

ok, I've reproduced the issue and caught the request on ATS after a few attempts:

vgutierrez@carrot:~$ curl -H 'User-Agent: vgutierrez' -L "https://en.wikipedia.org/wiki/Special:Random" -v 2>&1 |egrep -i "x-cache|vector-feature-night-mode-disabled|rel=\"canonical"
< x-cache: cp6013 miss, cp6012 pass
< x-cache-status: pass
< x-cache: cp6013 hit, cp6012 miss
< x-cache-status: hit-local
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-enabled vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-disabled skin-theme-clientpref-day vector-toc-available" lang="en" dir="ltr">
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-enabled vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-disabled skin-theme-clientpref-day vector-toc-available";var cookie=document.cookie.match(/(?:^|; )enwikimwclientpreferences=([^;]+)/);if(cookie){cookie[1].split('%2C').forEach(function(pref){className=className.replace(new RegExp('(^| )'+pref.replace(/-clientpref-\w+$|[^\w-]+/g,'')+'-clientpref-\\w+( |$)'),'$1'+pref+'$2');});}document.documentElement.className=className;}());RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":[
<link rel="canonical" href="https://en.wikipedia.org/wiki/Muirgheas_Ua_hEidhin">

using curl -L I'm following the redirect from /wiki/Special:Random; the affected URL in cp6013 is https://en.wikipedia.org/wiki/Muirgheas_Ua_hEidhin

As reported in the headers, it's a miss on varnish in cp6012 and a hit on ATS in cp6013, hence x-cache-status is set to hit-local:

< x-cache: cp6013 hit, cp6012 miss
< x-cache-status: hit-local
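
For reading these headers in bulk, here is a minimal Python sketch that splits an x-cache value into per-host results. The function name parse_x_cache is made up for illustration, and it assumes every entry has the plain "host status" shape seen in the output above.

```python
def parse_x_cache(value: str) -> list[tuple[str, str]]:
    """Split an x-cache header value into (host, status) pairs."""
    pairs = []
    for entry in value.split(","):
        parts = entry.split()
        # Assumption: each entry is "<host> <status>"; anything else is skipped.
        if len(parts) >= 2:
            pairs.append((parts[0], parts[1]))
    return pairs


for host, status in parse_x_cache("cp6013 hit, cp6012 miss"):
    print(f"{host}: {status}")
```

For the response above this gives a hit for cp6013 and a miss for cp6012, which matches the hit-local status.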

ATS in cp6013 reported this:

Date:2024-07-18 Time:23:11:52 ConnAttempts:0 ConnReuse:7 TTFetchHeaders:141 ClientTTFB:143 CacheReadTime:0 CacheWriteTime:0 TotalSMTime:143 TotalPluginTime:1 ActivePluginTime:1 TotalTime:143 OriginServer:mw-web-ro.discovery.wmnet OriginServerTime:-1 CacheResultCode:TCP_REFRESH_HIT CacheWriteResult:- ReqMethod:GET RespStatus:200 OriginStatus:304 ReqURL:http://en.wikipedia.org/wiki/Muirgheas_Ua_hEidhin ReqHeader:User-Agent:vgutierrez ReqHeader:Host:en.wikipedia.org ReqHeader:X-Client-IP:<REDACTED> ReqHeader:Cookie: BerespHeader:Set-Cookie:- BerespHeader:Cache-Control:s-maxage=1209600, must-revalidate, max-age=0 BerespHeader:Connection:- RespHeader:X-Cache-Int:cp6013 hit RespHeader:Backend-Timing:D=53171 t=1721344312216478

so content was stale on ATS, but it performed a conditional request to mediawiki using If-Modified-Since and it got a 304 back, so it used the content from its cache.

In the words of the ATS documentation:

TCP_REFRESH_HIT: The object was in the cache, but it was stale. Traffic Server made an if-modified-since request to the origin server and the origin server sent a 304 not-modified response. Traffic Server sent the cached object to the client.

I've found a variation of this issue where ATS was returning a cache hit with status TCP_IMS_HIT, again from ATS documentation:

TCP_IMS_HIT: The client issued an if-modified-since request and the object was in cache and fresher than the IMS date, or an if-modified-since request to the origin server revealed the cached object was fresh. Traffic Server served the cached object to the client.

in both cases, "the client" means varnish, not curl running on my machine :)

It looks like, given that the article hasn't been changed since 04:53, 29 August 2023, MediaWiki replies with a 304 even if the CSS classes embedded in the returned HTML have been updated recently. But this is already outside my area of expertise.
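
To make that hypothesis concrete, here is a toy Python model of the revalidation. It is not MediaWiki or ATS code: CachedPage, origin_status and serve are hypothetical names, the store time is invented, and the origin is assumed to compare only the article's last edit time against If-Modified-Since.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CachedPage:
    html_classes: str    # skin classes baked into the stored HTML
    stored_at: datetime  # when the cache stored this copy


def origin_status(article_last_edited: datetime, if_modified_since: datetime) -> int:
    # ASSUMPTION: the origin looks only at the article's last edit time
    # and ignores skin or site configuration changes.
    return 200 if article_last_edited > if_modified_since else 304


def serve(cached: CachedPage, article_last_edited: datetime, current_classes: str) -> str:
    status = origin_status(article_last_edited, cached.stored_at)
    if status == 304:
        return cached.html_classes  # cache keeps serving its old copy
    return current_classes          # origin sent a freshly rendered page


last_edit = datetime(2023, 8, 29, 4, 53)  # last edit time of the article
cached = CachedPage(
    "vector-feature-night-mode-disabled",
    stored_at=datetime(2024, 7, 10),      # hypothetical store time
)
print(serve(cached, last_edit, "<current skin classes>"))
```

Under these assumptions the old class keeps being served until the article is edited or the cached copy can no longer be renewed.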

[…] we've been telling editors that this should only be a problem for 24hrs.

To my knowledge we haven't capped CDN storage to 24 hours. The lowest it has been, and where it has stood since 2016 (History), is 1 day of unconditional freshness, after which storage continues for stale/renewal purposes, with a hard limit on the MediaWiki side that denies renewals of the same object after 14 days (e.g. after every subsequent day has passed, if it's still popular).

https://wikitech.wikimedia.org/wiki/CDN#Retention

"24 hours" is a fine approximation for most page views, e.g. for optimisations or changes with minimal downside. But the long tail of pages that are both infrequently edited and popularly viewed (i.e. never fall out of CDN cache) do get renewed via HTTP 304 for up to 14 days.

This is why (usually) when we change default HTML, we pre-seed the CSS class or module for 2 weeks / 14 days.
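
The 14 day figure lines up with the Cache-Control value in the ATS log earlier in this task (s-maxage=1209600). A quick Python check of the arithmetic; the helper name s_maxage_seconds is made up for illustration.

```python
import re
from typing import Optional

# Cache-Control value copied from the BerespHeader field of the ATS log line.
cache_control = "s-maxage=1209600, must-revalidate, max-age=0"


def s_maxage_seconds(header: str) -> Optional[int]:
    """Return the s-maxage directive in seconds, or None if it is absent."""
    match = re.search(r"(?:^|,)\s*s-maxage=(\d+)", header)
    return int(match.group(1)) if match else None


seconds = s_maxage_seconds(cache_control)
days = seconds / 86400  # 86400 seconds per day
print(days)  # 14.0
```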

We have this issue right now with the dark mode roll out - if you click Special:Random in an incognito window the class "vector-feature-night-mode-disabled" is present on the HTML element on 1 in 5 page views. […]

A lot of stars have to align for the long tail to reach several days beyond 24 hours. If something still happens 1 in 5 times (20%) on a wiki, more than a day after a change was deployed, it seems unlikely to be due to CDN caching. I can't rule it out, but if something is that common, I'd be sceptical and look at other explanations as well.

To summarize the discussion:

  • Adding a new gadget that is enabled by default for anonymous users requires updating the HTML of the page
  • The HTML of the page is cached for up to 24 hrs in Varnish
  • In addition to this, Apache Traffic Server (ATS) revalidates with If-Modified-Since requests and will keep serving an older version of the page as long as MediaWiki answers with a 304
  • A page can typically be served from ATS for up to 2 weeks.
  • Existing gadgets / site CSS are already referenced in the page HTML, so changes to them are not subject to this cache. They are instead subject to the ResourceLoader cache, which lasts for a much shorter period of time.
  • This means pages served from ATS / Varnish do not reference the new gadget, so the new styles are not loading, while the updated site styles (from which the CSS was removed) are. As a result, the infoboxes were not styled on these pages.
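
The interaction summarized above can be sketched as a toy model in Python. It is only an illustration: applied_rules is a hypothetical helper, "site.styles" stands in for the Common.css / Mobile.css styles, and each module's CSS is reduced to a set of labels.

```python
def applied_rules(html_modules: list[str], module_css: dict[str, set[str]]) -> set[str]:
    """Styles a reader gets: the current CSS of every module the HTML references."""
    rules: set[str] = set()
    for name in html_modules:
        rules |= module_css.get(name, set())
    return rules


# Module lists baked into the page HTML (the slowly refreshed part).
stale_html = ["site.styles"]                        # rendered before the gadget existed
fresh_html = ["site.styles", "ext.gadget.infobox"]  # rendered after

# Current module contents (the quickly refreshed part).
css_before = {"site.styles": {"infobox"}}
css_after = {"site.styles": set(), "ext.gadget.infobox": {"infobox"}}

print("infobox" in applied_rules(stale_html, css_before))  # True
print("infobox" in applied_rules(stale_html, css_after))   # False
print("infobox" in applied_rules(fresh_html, css_after))   # True
```

The middle case is the one reported here: stale HTML combined with the already updated site styles leaves the infobox unstyled.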

I hope this helps explain what happened here.

In terms of next steps we have opened T373495 to explore reducing cache retention.