
Cannot download large (2GB) files with 10Mbps or slower network due to ATS timeout
Closed, ResolvedPublic

Description

Large PDFs uploaded by https://commons.wikimedia.org/wiki/User:MidleadingBot cannot be downloaded. Thumbnail generation also fails. The upload method is documented on the user page.

curl -o o.pdf https://upload.wikimedia.org/wikipedia/commons/9/9f/ZHSY000097_%E5%AE%8B%E6%9B%B8%E4%B8%80%E7%99%BE%E5%8D%B7_%28%E6%A2%81%29%E6%B2%88%E7%B4%84_%E6%92%B0_%E5%AE%8B%E5%88%BB%E5%AE%8B%E5%85%83%E6%98%8E%E9%81%9E%E4%BF%AE%E6%9C%AC.pdf
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 15 1953M   15  310M    0     0  1298k      0  0:25:40  0:04:05  0:21:35 1340k
curl: (92) HTTP/2 stream 1 was not closed cleanly: PROTOCOL_ERROR (err 1)

DNS resolves to ulsfo.

It looks like the file is already on Commons, just without thumbnails and not downloadable. I used my 100Mbps network, and the download fails 5:51 after starting, with 1.19GB received.

Event Timeline

Midleading renamed this task from Cannot download certain files uploaded by async=1 to Cannot download large files.Jul 13 2023, 2:57 AM
Midleading updated the task description.
Peachey88 renamed this task from Cannot download large files to Cannot download large files from commons.Jul 13 2023, 9:16 AM
Peachey88 edited projects, added SRE; removed Commons, MediaWiki-File-management.

Eh... these are 3GB PDFs? What in God's name justifies using such large files? That's hours of video!
The systems are NOT designed to handle that kind of material. So while you might be able to upload them, you should not expect thumbnailing to work.

Around 4000 pages, most about 506 × 756 pixels. I haven't been able to download one yet to check which compression is being used and whether it is suitable.

These files are uploaded at the request of another Wikimedia user. I have PDFs up to 8GB, but they will be split into parts below 4GB. Not having to use server-side upload is a huge achievement for Wikimedia engineers and myself; before that, even 2GB PDFs had to be uploaded with server-side upload.
But at that time those 2GB PDFs did get their thumbnails once on Commons, like https://commons.wikimedia.org/wiki/File:%EF%BC%88%E5%85%89%E7%B7%92%EF%BC%89%E7%BA%8C%E4%BF%AE%E5%BB%AC%E5%B7%9E%E5%BA%9C%E5%BF%97_-_%E5%85%89%E7%B7%92%E5%8D%81%E4%B8%80%E5%B9%B4_(1885).pdf. Something must have broken since then.

Aklapper renamed this task from Cannot download large files from commons to Cannot download large (3GB) PDF files from commons.Jul 17 2023, 11:22 AM
Clement_Goubert subscribed.

Hi,

I can't reproduce the download issue going through drmrs:

$ curl -v -o o.pdf https://upload.wikimedia.org/wikipedia/commons/9/9f/ZHSY000097_%E5%AE%8B%E6%9B%B8%E4%B8%80%E7%99%BE%E5%8D%B7_%28%E6%A2%81%29%E6%B2%88%E7%B4%84_%E6%92%B0_%E5%AE%8B%E5%88%BB%E5%AE%8B%E5%85%83%E6%98%8E%E9%81%9E%E4%BF%AE%E6%9C%AC.pdf                                                      
* Expire in 0 ms for 6 (transfer 0x5556bc5d6110) 
* Expire in 1 ms for 1 (transfer 0x5556bc5d6110)                                
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--    
[...]
*   Trying 2a02:ec80:600:ed1a::2:b...           
* TCP_NODELAY set                               
* Expire in 149872 ms for 3 (transfer 0x5556bc5d6110)
* Expire in 200 ms for 4 (transfer 0x5556bc5d6110)
* Connected to upload.wikimedia.org (2a02:ec80:600:ed1a::2:b) port 443 (#0)
* ALPN, offering h2                             
* ALPN, offering http/1.1                       
* successfully set certificate verify locations:
*   CAfile: none                                           
  CApath: /etc/ssl/certs                        
} [5 bytes data]                                
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]                              
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [122 bytes data]                                         
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
{ [19 bytes data]                               
* TLSv1.3 (IN), TLS handshake, Certificate (11): 
{ [3191 bytes data]                              
* TLSv1.3 (IN), TLS handshake, CERT verify (15): 
{ [79 bytes data]                                
* TLSv1.3 (IN), TLS handshake, Finished (20):    
{ [52 bytes data]                                     
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):               
} [1 bytes data]                                                                                                                                                                                                   
* TLSv1.3 (OUT), TLS handshake, Finished (20):   
} [52 bytes data]                                                                                    
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2                
* Server certificate:                            
*  subject: C=US; ST=California; L=San Francisco; O=Wikimedia Foundation, Inc.; CN=*.wikipedia.org    
*  start date: Oct 27 00:00:00 2022 GMT          
*  expire date: Nov 17 23:59:59 2023 GMT         
*  subjectAltName: host "upload.wikimedia.org" matched cert's "*.wikimedia.org"
*  issuer: C=US; O=DigiCert Inc; CN=DigiCert TLS Hybrid ECC SHA384 2020 CA1   
*  SSL certificate verify ok.                           
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
} [5 bytes data]
* Using Stream ID: 1 (easy handle 0x5556bc5d6110)
} [5 bytes data]
> GET /wikipedia/commons/9/9f/ZHSY000097_%E5%AE%8B%E6%9B%B8%E4%B8%80%E7%99%BE%E5%8D%B7_%28%E6%A2%81%29%E6%B2%88%E7%B4%84_%E6%92%B0_%E5%AE%8B%E5%88%BB%E5%AE%8B%E5%85%83%E6%98%8E%E9%81%9E%E4%BF%AE%E6%9C%AC.pdf HTTP/2
> Host: upload.wikimedia.org
> User-Agent: curl/7.64.0
> Accept: */*
>                                               
{ [5 bytes data]                                           
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [265 bytes data]                              
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [265 bytes data]                              
* old SSL session ID is stale, removing         
{ [5 bytes data]                                           
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
} [5 bytes data]                                
< HTTP/2 200                                     
< date: Wed, 16 Aug 2023 10:21:01 GMT            
< content-type: application/pdf                  
< content-length: 2048480535                     
< x-object-meta-sha1base36: pkzs2wjg2vjddpfn1mvwfloxlbyis3i
< etag: c857f45ee9b94937fd03960b159a9bf9              
< last-modified: Wed, 12 Jul 2023 07:29:26 GMT                            
< accept-ranges: bytes                                                                                                                                                                                             
< server: ATS/9.1.4                              
< age: 0                                                                                             
< x-cache: cp6005 miss, cp6004 pass                    
< x-cache-status: pass                           
< server-timing: cache;desc="pass", host;desc="cp6004"
< strict-transport-security: max-age=106384710; includeSubDomains; preload                            
< report-to: { "group": "wm_nel", "max_age": 604800, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
< nel: { "report_to": "wm_nel", "max_age": 604800, "failure_fraction": 0.05, "success_fraction": 0.0}
< x-client-ip: xxxxxxxxx
< x-content-type-options: nosniff                       
< access-control-allow-origin: *                                                                                      
< access-control-expose-headers: Age, Date, Content-Length, Content-Range, X-Content-Duration, X-Cache
< timing-allow-origin: *                                                        
<               
{ [13888 bytes data]                             
100 1953M  100 1953M    0     0  10.9M      0  0:02:59  0:02:59 --:--:-- 11.2M
* Connection #0 to host upload.wikimedia.org left intact                                                                                                                                                           

$ ls -alh o.pdf
-rw-r--r-- 1 claime claime 2.0G Aug 16 10:24 o.pdf

Tagging Traffic for the potential download issue from ulsfo and Thumbor for the potential thumbnailing issue.

I am able to reproduce.

My work laptop with a throughput of ~300Mbps was able to download it just fine:

< last-modified: Wed, 12 Jul 2023 07:30:13 GMT
< accept-ranges: bytes
< server: ATS/9.1.4
< age: 1
< x-cache: cp4050 miss, cp4050 pass
< x-cache-status: pass
< server-timing: cache;desc="pass", host;desc="cp4050"
[...] 
100 1953M  100 1953M    0     0  25.0M      0  0:01:18  0:01:18 --:--:-- 28.4M
* Connection #0 to host upload.wikimedia.org left intact

A server with a 1Gbps link in a similar region was also fine.

A slower server in Washington connecting to eqiad was not:

< last-modified: Wed, 12 Jul 2023 07:29:26 GMT
< accept-ranges: bytes
< server: ATS/9.1.4
< age: 0
< x-cache: cp1078 miss, cp1090 pass
< x-cache-status: pass
< server-timing: cache;desc="pass", host;desc="cp1090"
[...]
 42 1953M   42  834M    0     0  2782k      0  0:11:58  0:05:06  0:06:52 2547k* HTTP/2 stream 1 was reset
 42 1953M   42  835M    0     0  2782k      0  0:11:58  0:05:07  0:06:51 3058k
* Connection #0 to host upload.wikimedia.org left intact
curl: (18) HTTP/2 stream 1 was reset

An older Raspberry Pi on the same network as my laptop, with a 100Mbps link, downloading from ulsfo: same error, and around the same timing as the OP (about 4 minutes).

It looks like a timeout configuration, though I'm not sure why the Washington server died after 5 minutes and the OP/Raspberry Pi after 4.

The file can be curled down from all DCs:

for site in ulsfo eqiad codfw esams eqsin drmrs; do curl -o $site.pdf https://upload.wikimedia.org/wikipedia/commons/9/9f/ZHSY000097_%E5%AE%8B%E6%9B%B8%E4%B8%80%E7%99%BE%E5%8D%B7_%28%E6%A2%81%29%E6%B2%88%E7%B4%84_%E6%92%B0_%E5%AE%8B%E5%88%BB%E5%AE%8B%E5%85%83%E6%98%8E%E9%81%9E%E4%BF%AE%E6%9C%AC.pdf --connect-to ::upload-lb.$site.wikimedia.org ; done
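
For reference, --connect-to ::upload-lb.$site.wikimedia.org keeps the original URL, Host header, and TLS SNI, but opens the TCP connection to the chosen per-datacenter load balancer, so each edge site can be tested without touching DNS. A single-site sketch (my own illustration, not from the task) that also surfaces which cache host served the request:

$ # fetch only the first byte (-r 0-0) so the header check returns quickly
$ curl -sv -r 0-0 -o /dev/null --connect-to ::upload-lb.eqiad.wikimedia.org https://upload.wikimedia.org/wikipedia/commons/9/9f/ZHSY000097_%E5%AE%8B%E6%9B%B8%E4%B8%80%E7%99%BE%E5%8D%B7_%28%E6%A2%81%29%E6%B2%88%E7%B4%84_%E6%92%B0_%E5%AE%8B%E5%88%BB%E5%AE%8B%E5%85%83%E6%98%8E%E9%81%9E%E4%BF%AE%E6%9C%AC.pdf 2>&1 | grep -iE 'x-cache|server-timing'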

Note: that file is "only" 2 GB; the 3 GB mentioned in the task title is misleading. The total file size is 2048480535 bytes, as confirmed by querying the database.

But it does fail if we also add --limit-rate 1M.

It also fails quite consistently at the 3:45 - 4:00 mark, but not always:

$ time for site in ulsfo eqiad codfw esams eqsin drmrs; do curl --limit-rate 1M -o $site.pdf https://upload.wikimedia.org/wikipedia/commons/9/9f/ZHSY000097_%E5%AE%8B%E6%9B%B8%E4%B8%80%E7%99%BE%E5%8D%B7_%28%E6%A2%81%29%E6%B2%88%E7%B4%84_%E6%92%B0_%E5%AE%8B%E5%88%BB%E5%AE%8B%E5%85%83%E6%98%8E%E9%81%9E%E4%BF%AE%E6%9C%AC.pdf --connect-to ::upload-lb.$site.wikimedia.org ; done
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 11 1953M   11  227M    0     0  1030k      0  0:32:21  0:03:45  0:28:36 1273k
curl: (92) HTTP/2 stream 0 was not closed cleanly: PROTOCOL_ERROR (err 1)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 11 1953M   11  224M    0     0  1024k      0  0:32:32  0:03:43  0:28:49 1048k
curl: (92) HTTP/2 stream 0 was not closed cleanly: PROTOCOL_ERROR (err 1)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 11 1953M   11  227M    0     0  1025k      0  0:32:30  0:03:46  0:28:44  983k
curl: (92) HTTP/2 stream 0 was not closed cleanly: PROTOCOL_ERROR (err 1)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 12 1953M   12  242M    0     0  1028k      0  0:32:25  0:04:01  0:28:24 1249k
curl: (92) HTTP/2 stream 0 was not closed cleanly: PROTOCOL_ERROR (err 1)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 23 1953M   23  456M    0     0  1026k      0  0:32:29  0:07:35  0:24:54 1244k
curl: (92) HTTP/2 stream 0 was not closed cleanly: PROTOCOL_ERROR (err 1)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 14 1953M   14  287M    0     0  1027k      0  0:32:26  0:04:45  0:27:41 1248k
curl: (16) Error in the HTTP2 framing layer

$ for site in eqsin drmrs; do curl --limit-rate 1M -o $site.pdf https://upload.wikimedia.org/wikipedia/commons/9/9f/ZHSY000097_%E5%AE%8B%E6%9B%B8%E4%B8%80%E7%99%BE%E5%8D%B7_%28%E6%A2%81%29%E6%B2%88%E7%B4%84_%E6%92%B0_%E5%AE%8B%E5%88%BB%E5%AE%8B%E5%85%83%E6%98%8E%E9%81%9E%E4%BF%AE%E6%9C%AC.pdf --connect-to ::upload-lb.$site.wikimedia.org ; done
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 12 1953M   12  246M    0     0  1029k      0  0:32:22  0:04:04  0:28:18 1274k
curl: (92) HTTP/2 stream 0 was not closed cleanly: PROTOCOL_ERROR (err 1)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 11 1953M   11  232M    0     0  1024k      0  0:32:33  0:03:51  0:28:42  883k
curl: (92) HTTP/2 stream 0 was not closed cleanly: PROTOCOL_ERROR (err 1)

It's also funny to see that it downloaded the same number of bytes from ulsfo and codfw above (238026752).

A quick check on cp3081 shows the following results:

  • HAProxy closes the connection after 245 seconds (327 seconds in a second test)
  • Varnish closes the connection after 353 seconds (393 seconds in a second test)
  • ATS closes the connection after 908 seconds
  • Swift allows slowly fetching the entire object (completed in 32 minutes using --limit-rate 1M)

Considering that Varnish and HAProxy aren't enforcing a total request timeout, it looks like all of the connection closes come back to the 900-second proxy.config.http.transaction_active_timeout_in timeout enforced by ATS:

vgutierrez@cp3081:~$ sudo -i traffic_ctl config match timeout |egrep "net|http|ssl"
proxy.config.net.default_inactivity_timeout: 360
proxy.config.http2.accept_no_activity_timeout: 180
proxy.config.http2.no_activity_timeout_in: 180
proxy.config.http.connect_attempts_timeout: 9
proxy.config.http.post_connect_attempts_timeout: 11
proxy.config.http.transaction_no_activity_timeout_in: 200
proxy.config.http.transaction_no_activity_timeout_out: 180
proxy.config.http.keep_alive_no_activity_timeout_in: 120
proxy.config.http.keep_alive_no_activity_timeout_out: 120
proxy.config.http.transaction_active_timeout_out: 205
proxy.config.http.accept_no_activity_timeout: 180
proxy.config.http.parent_proxy.connect_attempts_timeout: 30
proxy.config.http.transaction_active_timeout_in: 900
proxy.config.http.background_fill_active_timeout: 0
proxy.config.net.poll_timeout: 10
proxy.config.ssl.session_cache.timeout: 0
proxy.config.ssl.handshake_timeout_in: 30
proxy.config.ssl.ocsp.cache_timeout: 3600
proxy.config.ssl.ocsp.request_timeout: 10
proxy.config.http2.active_timeout_in: 0
proxy.config.http2.zombie_debug_timeout_in: 0

Somehow ATS appears to be failing to reset the timeout per transaction, so the user never experiences the same effective timeout value (connection reuse is enabled across the 3 layers of the CDN: TLS termination / frontend cache / backend cache).
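
As a rough back-of-the-envelope check (my own arithmetic, not part of the original analysis): a 900-second active transaction timeout implies a minimum sustained rate for fetching the whole 2048480535-byte object in a single transaction:

$ echo '2048480535 / 900' | bc                  # minimum sustained rate, bytes per second
2276089
$ echo '2048480535 * 8 / 900 / 1000000' | bc    # the same floor, in Mbit/s
18

That is roughly 2.3 MB/s, or about 18 Mbit/s: a client capped at 1 MB/s can never finish inside the window even on a fresh connection, while the 12 MB/s test further down completes in well under three minutes.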

Midleading renamed this task from Cannot download large (3GB) PDF files from commons to Cannot download large (2GB) PDF files with 100.Aug 27 2023, 5:03 AM
Midleading renamed this task from Cannot download large (2GB) PDF files with 100 to Cannot download large (2GB) PDF files with 100Mbps network due to ATS.
Midleading renamed this task from Cannot download large (2GB) PDF files with 100Mbps network due to ATS to Cannot download large (2GB) files with 100Mbps or slower network due to ATS timeout.Aug 27 2023, 5:07 AM

The current task name is misleading: 100Mbps is definitely enough to download the file without triggering the ATS timeout. Your curl output shows an average speed of 1298 kbytes per second, which is consistent with a 10Mbps network, not a 100Mbps one. I just used --limit-rate 12M (12 megabytes per second, or roughly 100Mbps) to test it:

vgutierrez@carrot:~$ curl -o /dev/null https://upload.wikimedia.org/wikipedia/commons/9/9f/ZHSY000097_%E5%AE%8B%E6%9B%B8%E4%B8%80%E7%99%BE%E5%8D%B7_%28%E6%A2%81%29%E6%B2%88%E7%B4%84_%E6%92%B0_%E5%AE%8B%E5%88%BB%E5%AE%8B%E5%85%83%E6%98%8E%E9%81%9E%E4%BF%AE%E6%9C%AC.pdf --limit-rate 12M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1953M  100 1953M    0     0  12.0M      0  0:02:42  0:02:42 --:--:-- 12.2M

An actual 100Mbps WAN transit link suffering from network congestion could likely trigger this timeout. My fastest internet connection is 100Mbps theoretical, but I have never managed to download the file successfully; the actual speed is nearer 30Mbps. Anyway, a 10Mbps link speed could definitely trigger the timeout. Not everyone in the world has access to a true 100Mbps network.
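
Since the responses above advertise accept-ranges: bytes, a client on a slow or flaky link can also work around the resets by resuming the transfer. A minimal sketch (my own workaround suggestion, not from the thread):

$ # retry until curl exits successfully; -C - resumes from the size of the partial o.pdf
$ until curl -C - -o o.pdf https://upload.wikimedia.org/wikipedia/commons/9/9f/ZHSY000097_%E5%AE%8B%E6%9B%B8%E4%B8%80%E7%99%BE%E5%8D%B7_%28%E6%A2%81%29%E6%B2%88%E7%B4%84_%E6%92%B0_%E5%AE%8B%E5%88%BB%E5%AE%8B%E5%85%83%E6%98%8E%E9%81%9E%E4%BF%AE%E6%9C%AC.pdf; do sleep 5; done

Each retry issues a Range request starting where the previous attempt was cut off, so only the remaining bytes are fetched.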

Midleading renamed this task from Cannot download large (2GB) files with 100Mbps or slower network due to ATS timeout to Cannot download large (2GB) files with 10Mbps or slower network due to ATS timeout.Aug 28 2023, 4:26 PM

Change 953587 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] trafficserver: Allow configuring transaction_active_timeout_in

https://gerrit.wikimedia.org/r/953587

Change 953638 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] trafficserver: Set active timeouts to 1h

https://gerrit.wikimedia.org/r/953638

Change 953678 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] varnish: Increase send_timeout in upload

https://gerrit.wikimedia.org/r/953678

Change 953587 merged by Vgutierrez:

[operations/puppet@production] trafficserver: Allow configuring transaction_active_timeout_in

https://gerrit.wikimedia.org/r/953587

Change 953638 merged by Vgutierrez:

[operations/puppet@production] trafficserver: Set active timeouts to 1h in upload

https://gerrit.wikimedia.org/r/953638

Change 953678 merged by Vgutierrez:

[operations/puppet@production] varnish: Increase send_timeout in upload

https://gerrit.wikimedia.org/r/953678

Mentioned in SAL (#wikimedia-operations) [2023-08-31T08:21:27Z] <vgutierrez> set send_timeout to 3620s in the upload cluster via cumin to avoid a varnish restart https://gerrit.wikimedia.org/r/c/operations/puppet/+/953678 - T341755
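
For illustration only, a hedged sketch of the runtime equivalents of those changes (the real rollout went through puppet and cumin; the commands and expected values below are my own assumption based on the patch descriptions): on an ATS cache host the new one-hour active timeout can be inspected with traffic_ctl, and Varnish's send_timeout can be raised on the fly with varnishadm, which is what the cumin run above does to avoid a restart.

$ # inspect the new ATS active timeout (expected: 3600 seconds, i.e. 1h)
$ sudo traffic_ctl config get proxy.config.http.transaction_active_timeout_in
proxy.config.http.transaction_active_timeout_in: 3600
$ # raise Varnish's send_timeout at runtime, matching the value from the cumin run above
$ sudo varnishadm param.set send_timeout 3620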

Vgutierrez claimed this task.
Vgutierrez moved this task from Backlog to Traffic team actively servicing on the Traffic board.
vgutierrez@carrot:~$ curl -o /dev/null https://upload.wikimedia.org/wikipedia/commons/9/9f/ZHSY000097_%E5%AE%8B%E6%9B%B8%E4%B8%80%E7%99%BE%E5%8D%B7_%28%E6%A2%81%29%E6%B2%88%E7%B4%84_%E6%92%B0_%E5%AE%8B%E5%88%BB%E5%AE%8B%E5%85%83%E6%98%8E%E9%81%9E%E4%BF%AE%E6%9C%AC.pdf?vgutierrez=1 --limit-rate 1M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1953M  100 1953M    0     0  1024k      0  0:32:33  0:32:33 --:--:-- 1122k

The problem has been fixed after tuning the Varnish and ATS timeouts in the upload cluster. Thanks for reporting the issue!