User Details
- User Since
- Feb 12 2018, 9:51 AM
- Availability
- Available
- IRC Nick
- vgutierrez
- LDAP User
- Vgutierrez
- MediaWiki User
- VGutiérrez (WMF)
Wed, Dec 4
As noticed while working on https://gitlab.wikimedia.org/repos/sre/liberica/-/merge_requests/87, liberica needs to populate the lru_mapping eBPF map to avoid letting katran fall back to the global LRU:
```
# HELP liberica_fp_katran_fallback_global_lru_total Number of times that katran failed to find the per cpu/core lru, it should be 0 in production
# TYPE liberica_fp_katran_fallback_global_lru_total counter
liberica_fp_katran_fallback_global_lru_total 12006
```
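For a quick check of whether the fallback is still happening on a given host, something like this should work (a rough sketch; the exporter port and the visibility of the katran map name are assumptions and depend on how liberica loads/pins the programs):

```bash
#!/bin/bash
# Scrape the fallback counter from the local liberica exporter
# (the port is an assumption; use the real liberica metrics port).
curl -s http://localhost:9090/metrics | grep liberica_fp_katran_fallback_global_lru_total

# Inspect the lru_mapping map-in-map directly; katran names the map
# lru_mapping, but whether it's reachable by name here depends on how
# liberica loads/pins it.
sudo bpftool map dump name lru_mapping
```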
Tue, Nov 26
right now exim is configured with RSA certs only and not with a dual-stack (RSA+ECDSA) setup; from lists1004's exim configuration:
# TLS
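A quick way to confirm from the outside that only an RSA leaf is being served would be something like this (rough sketch; the hostname and the reachability of port 25 are assumptions):

```bash
#!/bin/bash
HOST=lists1004.wikimedia.org  # assumption, adjust to the actual exim listener

# Offer only ECDSA signature algorithms: with an RSA-only setup this
# handshake fails, with a dual-stack setup it returns the ECDSA leaf.
openssl s_client -connect "$HOST":25 -starttls smtp \
    -sigalgs ecdsa_secp256r1_sha256 -brief </dev/null

# Same check forcing RSA, for comparison.
openssl s_client -connect "$HOST":25 -starttls smtp \
    -sigalgs rsa_pss_rsae_sha256 -brief </dev/null
```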
Nov 22 2024
This triggered errors on every haproxykafka instance after losing producer access to the configured topics:
```
Nov 21 17:42:40 cp5031 haproxykafka[3825906]: %5|1732210960.009|PARTCNT|cp5031#producer-1| [thrd:ssl://kafka-jumbo1011.eqiad.wmnet:9093/bootstrap]: Topic webrequest_frontend_upload partition count changed from 1 to 0
```
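To double check what the brokers actually report for that topic, something like this could be used (sketch; assumes kcat is installed and the plaintext port is reachable, otherwise kafka-topics.sh --describe with the proper TLS settings):

```bash
#!/bin/bash
# List broker metadata for the topic and check the reported partition count.
kcat -L -b kafka-jumbo1011.eqiad.wmnet:9092 -t webrequest_frontend_upload
```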
Nov 21 2024
liberica is using BGP as expected on lvs1013
Nov 19 2024
Removing Traffic given that this kind of request isn't handled by the CDN
Nov 18 2024
as mentioned on the email thread, that sounds like a viable option for us
Nov 14 2024
Nov 7 2024
Nov 6 2024
lvs1013 running liberica is now the primary load balancer for ncredir@eqiad
Oct 30 2024
nice, do we need to add lvs1013 to any ACLs?
Oct 29 2024
Given the limitations of running pybal and liberica on the same hosts, we want to run liberica on separate hosts, announcing the same prefixes with a higher priority than pybal, so we can test liberica.
We want to deploy liberica on lvs1013 and let it handle traffic for ncredir without altering the current pybal config for ncredir@eqiad on lvs1017 and lvs1020. That would mean setting a community string (14907:2?) on liberica on lvs1013 that translates to a local-pref > 100, so it will start receiving traffic for ncredir.
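As a rough way to verify that lvs1013 is the one attracting the traffic once the higher local-pref is in place (sketch; the ncredir-lb service name is an assumption):

```bash
#!/bin/bash
# Resolve the ncredir VIP and confirm on lvs1013 that traffic for it
# is actually arriving on this host.
VIP=$(dig +short ncredir-lb.eqiad.wikimedia.org @ns0.wikimedia.org)
echo "ncredir VIP: $VIP"
sudo timeout 10 tcpdump -ni any -c 20 "host $VIP and tcp port 443"
```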
Oct 28 2024
Oct 14 2024
Oct 10 2024
I like the black-box approach, but it would require maintaining yet another map of DCs and ranges per service. I guess you could infer it from dig +short $cluster-lb.$dc.wikimedia.org and use those as input, something like this:
```bash
#!/bin/bash
DCS=("eqiad" "codfw" "esams" "ulsfo" "eqsin" "drmrs" "magru")
for DC in "${DCS[@]}"; do
    TEXT_IP=$(dig +short text-lb."$DC".wikimedia.org @ns0.wikimedia.org)
    UPLOAD_IP=$(dig +short upload-lb."$DC".wikimedia.org @ns0.wikimedia.org)
    # e.g. print the resolved VIPs; feed them to whatever consumes the map
    echo "$DC text-lb=$TEXT_IP upload-lb=$UPLOAD_IP"
done
```
Oct 8 2024
Oct 7 2024
Oct 4 2024
On a second attempt, after wiping the remaining challenges on the DNS server using acme-chief-designate-tidyup.service, acme-chief issued both certificates as expected:
```
root@traffic-acmechief01:/etc/acme-chief# openssl x509 -issuer -dates -subject -ext subjectAltName -noout -in /var/lib/acme-chief/certs/non-canonical-redirect-1-pki/new/rsa-2048.crt
issuer=C = US, O = Google Trust Services, CN = WR1
notBefore=Oct  4 10:41:09 2024 GMT
notAfter=Jan  2 10:41:08 2025 GMT
subject=CN = wikipedia.com.traffic.wmflabs.org
X509v3 Subject Alternative Name:
    DNS:wikipedia.com.traffic.wmflabs.org, DNS:*.wikipedia.com.traffic.wmflabs.org, DNS:*.en-wp.com.traffic.wmflabs.org, DNS:en-wp.com.traffic.wmflabs.org, DNS:*.en-wp.org.traffic.wmflabs.org, DNS:en-wp.org.traffic.wmflabs.org
root@traffic-acmechief01:/etc/acme-chief# openssl x509 -issuer -dates -subject -ext subjectAltName -noout -in /var/lib/acme-chief/certs/non-canonical-redirect-1-pki/new/ec-prime256v1.crt
issuer=C = US, O = Google Trust Services, CN = WR1
notBefore=Oct  4 10:41:05 2024 GMT
notAfter=Jan  2 10:41:04 2025 GMT
subject=CN = wikipedia.com.traffic.wmflabs.org
X509v3 Subject Alternative Name:
    DNS:wikipedia.com.traffic.wmflabs.org, DNS:*.wikipedia.com.traffic.wmflabs.org, DNS:*.en-wp.com.traffic.wmflabs.org, DNS:en-wp.com.traffic.wmflabs.org, DNS:*.en-wp.org.traffic.wmflabs.org, DNS:en-wp.org.traffic.wmflabs.org
```
after a quick patch:
```
vgutierrez@carrot:~/gitlab.wikimedia.org/sre/acme-chief$ git diff acme_chief/acme_requests.py
diff --git a/acme_chief/acme_requests.py b/acme_chief/acme_requests.py
index 7f677f2..bee4415 100644
--- a/acme_chief/acme_requests.py
+++ b/acme_chief/acme_requests.py
@@ -331,9 +331,6 @@ class ACMERequests:
         self.challenges = {}
         self.orders = {}
```
Oct 3 2024
Assuming that the automated traffic you're describing follows DNS and doesn't hardcode the IP of another of our DCs, the increase should be reflected in that graph.
Sep 27 2024
after a power cycle, puppetserver1001 is responsive again
Sep 26 2024
we should start getting RSA data for TLSv1.3 as soon as puppet runs on the cp nodes
I just -2ed the gerrit change because we don't currently have information about which certificate is being used.
TLSv1.2 includes the authentication mechanism used during the handshake as part of the ciphersuite (ECDHE-RSA-AES256-GCM-SHA384, ECDHE-ECDSA-AES256-GCM-SHA384), and that's what's currently being sent from haproxy to varnish as part of the x-connection-properties header.
In TLSv1.3 the authentication mechanism has been dropped from the ciphersuite name (TLS_AES_128_GCM_SHA256 can be used with either ECDSA or RSA certificates).
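A quick way to see the difference from a client (illustrative only):

```bash
#!/bin/bash
HOST=en.wikipedia.org

# TLSv1.2: the negotiated cipher name encodes the auth algorithm
# (ECDHE-ECDSA-... vs ECDHE-RSA-...).
openssl s_client -connect "$HOST":443 -tls1_2 -brief </dev/null 2>&1 | grep -i cipher

# TLSv1.3: the cipher name says nothing about the certificate; restricting
# the signature algorithms shows which certificate gets served instead.
openssl s_client -connect "$HOST":443 -tls1_3 \
    -sigalgs ecdsa_secp256r1_sha256 -brief </dev/null 2>&1 | grep -iE 'cipher|signature'
```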
Sep 25 2024
- Is there an explanation for why there are users for whom WMF-Last-Access-Global is apparently set but not WMF-Last-Access, and vice versa? It seems that the cookies should be set simultaneously, so any user should have at minimum a single pair of cookies.
Yes, since T174640 varnish doesn't set the WMF-Last-Access-Global cookie for wikimedia.org or its subdomains. And T260943 excluded requests to api.wikimedia.org from getting WMF-Last-Access. Relevant code can be seen here
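This can also be observed from the edge with a quick check like this (illustrative; the exact Set-Cookie behaviour is up to the VCL):

```bash
#!/bin/bash
# Compare which last-access cookies get set per domain.
for URL in https://en.wikipedia.org/wiki/Main_Page \
           https://www.wikimedia.org/ \
           https://api.wikimedia.org/; do
    echo "== $URL"
    curl -s -o /dev/null -D - "$URL" | grep -i '^set-cookie: WMF-Last-Access'
done
```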
Sep 24 2024
Regarding reverse-path filtering, it's enough to disable it on "all" and ipip0/ipip60, per the Linux kernel documentation:
The max value from conf/{all,interface}/rp_filter is used when doing source validation on the {interface}.
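So in practice both the "all" knob and the tunnel interfaces need to be set to 0, e.g.:

```bash
#!/bin/bash
# rp_filter uses max(conf/all, conf/<iface>), so "all" has to be 0 as well
# as the ipip tunnel interfaces for source validation to be disabled.
sudo sysctl -w net.ipv4.conf.all.rp_filter=0
sudo sysctl -w net.ipv4.conf.ipip0.rp_filter=0
sudo sysctl -w net.ipv4.conf.ipip60.rp_filter=0

# Verify the effective values.
sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.ipip0.rp_filter net.ipv4.conf.ipip60.rp_filter
```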
```
vgutierrez@krb1001:~$ sudo manage_principals.py create cyndywikime --email_address=csimiyu@wikimedia.org
Principal successfully created. Make sure to update data.yaml in Puppet.
Successfully sent email to csimiyu@wikimedia.org
```
Sep 23 2024
purged 0.24 survived a "Consumer group session timed out (in join-state steady) after 10458 ms without a successful response from the group coordinator (broker 2001, last error was Broker: Not coordinator): revoking assignment and rejoining group" error this morning; I'll close this task as it seems the bug has been solved.
Sep 20 2024
purged 0.24, shipping the patch mentioned above, has been deployed everywhere. We should now expect the following behavior: purged will log and gather metrics for kafka errors. If the received error is considered fatal (only the "all brokers down" error is flagged as fatal at the moment), purged will exit and systemd should restart it.
If a non-fatal error triggers the same behavior as described in this task, we should add that error code to the list of fatal errors.
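Roughly, this is what to check on a cp host to confirm the new behavior (sketch; the grep patterns are illustrative, not the exact librdkafka error strings):

```bash
#!/bin/bash
# systemd should restart purged after it exits on a fatal Kafka error.
systemctl show purged.service -p Restart -p RestartUSec

# Look for recently logged Kafka errors (patterns are illustrative).
sudo journalctl -u purged.service --since=-24h | grep -iE 'all brokers.*down|kafka error'
```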
Sep 19 2024
@elukey I think I've found the root cause of this mess; this is the patch (confluent-kafka-go delivers error events as kafka.Error values, not *kafka.Error pointers, so the pointer case in the type switch never matched):
```diff
diff --git a/kafka.go b/kafka.go
index 4e495ce..b878db8 100644
--- a/kafka.go
+++ b/kafka.go
@@ -187,7 +187,7 @@ func (k *KafkaReader) manageEvent(event kafka.Event, c chan string) bool {
 		if err != nil {
 			log.Printf("Unable to update promrdkafka metrics: %v\n", err)
 		}
-	case *kafka.Error:
+	case kafka.Error:
 		kafkaErrors.With(prometheus.Labels{typeLabel: e.Code().String()}).Inc()
```
Sep 18 2024
SSH key has been confirmed out of band
per data.yaml we need approval from @odimitrijevic / @Milimetric / @WDoranWMF / @Ahoelzl / @Ottomata (one of them is enough)
The provided URLs are currently handled by mw-web:
```
vgutierrez@carrot:/tmp$ ./T374997.sh
https://donate.wikimedia.org/favicon.ico < server: mw-web.eqiad.main-6646476df4-7z5sc
https://quality.wikimedia.org/favicon.ico < server: mw-web.eqiad.main-6646476df4-hmc86
https://office.wikimedia.org/favicon.ico < server: mw-web.eqiad.main-6646476df4-8xnx7
https://wikimania2011.wikimedia.org/favicon.ico < server: mw-web.eqiad.main-6646476df4-m6l62
https://vote.wikimedia.org/favicon.ico < server: mw-web.eqiad.main-6646476df4-t6cgb
https://ng.wikimedia.org/favicon.ico < server: mw-web.eqiad.main-6646476df4-2x87n
https://collab.wikimedia.org/favicon.ico < server: mw-web.eqiad.main-6646476df4-4qwnw
https://mai.wikimedia.org/favicon.ico < server: mw-web.eqiad.main-6646476df4-zckls
https://u4c.wikimedia.org/favicon.ico < server: mw-web.eqiad.main-6646476df4-fx86f
```
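For reference, a minimal sketch of this kind of check (not necessarily what T374997.sh actually does):

```bash
#!/bin/bash
# Fetch each URL and print which backend answered via the Server header.
URLS=(
    https://donate.wikimedia.org/favicon.ico
    https://quality.wikimedia.org/favicon.ico
    https://office.wikimedia.org/favicon.ico
)
for URL in "${URLS[@]}"; do
    SERVER=$(curl -s -o /dev/null -D - "$URL" | grep -i '^server:' | tr -d '\r')
    echo "$URL < $SERVER"
done
```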
Answering @RobH's question here:
Hey, I made some assumptions on the cp hosts troubleshooting but should check with you: those hosts are under the same weight conditions as all their related hosts, correct? If so, then indeed something is off, likely the thermal paste
That's right, all hosts have the same weight; under normal conditions they should handle the same load on average
Sep 16 2024
After double-checking that I get the very same errors as @Cyndymediawiksim, it looks like it's an issue with that specific superset dashboard, not related to @Cyndymediawiksim's user
The idp configuration states that wmf membership is enough to access superset (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/role/common/idp.yaml#206), and as already mentioned, @Cyndymediawiksim is already a member of wmf:
```
Wikitech: User:Cyndywikime
Shell username: cyndywikime
Email: csimiyu at wikimedia dot org
User ID: 40557
Account created: 20220802154524Z
Groups
Groups that begin with project- refer to Cloud VPS projects.
```
gentle reminder, this is still waiting for @VPuffetMichel approval
Sep 12 2024
hmm, is it entirely possible that cp instances aren't running that mtail program at all?
```
vgutierrez@cp7005:~$ ps auxww |grep "mtail -progs"
root        2830 11.8  0.0 4408172 59920 ?     Sl   Jun04 17071:58 /usr/bin/mtail -progs /etc/mtail-default -port 3903 -logs /dev/stdin -disable_fsnotify
root     2116858 23.1  0.0 3817292 51472 ?     Sl   Sep03 2935:08 /usr/bin/mtail -progs /etc/mtail-internal -port 3913 -logs /dev/stdin -disable_fsnotify
vgutier+ 2492792  0.0  0.0    6240   652 pts/0 S+   12:35   0:00 grep mtail -progs
root     2991153  0.0  0.0    2480   508 ?     Ss   Jun24   0:00 /bin/sh -c atslog-backend | mtail -progs "/etc/atsmtail-backend" -logs /dev/stdin -disable_fsnotify -port "3904"
root     2991156  1.3  0.0 6771692 80696 ?     Sl   Jun24 1558:37 mtail -progs /etc/atsmtail-backend -logs /dev/stdin -disable_fsnotify -port 3904
```
@RobH / @wiki_willy could we get this task prioritized on your side?
Sep 10 2024
I don't think so, I'm failing to see
```
Sep 10 11:48:50 cp7005 kernel: [8456281.652467] mce: CPU15: Core temperature is above threshold, cpu clock is throttled (total events = 170)
Sep 10 11:48:51 cp7005 kernel: [8456282.676501] mce: CPU15: Core temperature/speed normal (total events = 170)
```
reported on grafana, https://grafana.wikimedia.org/goto/ccf0yA6Sg?orgId=1
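The per-core throttle counters in sysfs can be used to cross-check those mce messages (Intel hosts):

```bash
#!/bin/bash
# Show the cores with the highest thermal throttle counts.
grep . /sys/devices/system/cpu/cpu*/thermal_throttle/core_throttle_count | sort -t: -k2 -n | tail
```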
Sep 9 2024
Sep 8 2024
Service should be restored now.
mw-wikifunctions seems to be down in eqiad at the moment:
```
vgutierrez@cp6016:~$ nc -zv mw-wikifunctions.discovery.wmnet 4451
nc: connect to mw-wikifunctions.discovery.wmnet (10.2.2.88) port 4451 (tcp) failed: Connection refused
vgutierrez@cp6016:~$ nc -zv mw-wikifunctions.svc.eqiad.wmnet 4451
nc: connect to mw-wikifunctions.svc.eqiad.wmnet (10.2.2.88) port 4451 (tcp) failed: Connection refused
vgutierrez@cp6016:~$ nc -zv mw-wikifunctions.svc.codfw.wmnet 4451
Connection to mw-wikifunctions.svc.codfw.wmnet (10.2.1.88) 4451 port [tcp/*] succeeded!
```
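For completeness, this is the kind of check that shows which DC the discovery record currently points to and probes both per-DC endpoints (run from an internal host):

```bash
#!/bin/bash
# Where does the discovery record currently point?
dig +short mw-wikifunctions.discovery.wmnet

# Probe both per-DC service endpoints.
for HOST in mw-wikifunctions.svc.eqiad.wmnet mw-wikifunctions.svc.codfw.wmnet; do
    nc -zv -w 3 "$HOST" 4451
done
```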
Sep 6 2024
that specific request triggered a timeout while trying to read the POST request body:
```
Sep 6 06:01:47 cp3069 varnish-frontend-fetcherr[1010101]: @cee: {"time": "2024-09-06T06:01:47.736362", "message": "req.body read error: 11 (Resource temporarily unavailable) [omitted output]
```
esams seems to be as healthy as usual per https://grafana.wikimedia.org/goto/ix3gNVeSR?orgId=1:
this has been triggered again on cp2038 and cp2041:
```
vgutierrez@cumin1002:~$ sudo -i cumin 'cp[2038,2041].codfw.wmnet' 'journalctl -u purged.service --since=-18h |grep "timed out"'
2 hosts will be targeted:
cp[2038,2041].codfw.wmnet
OK to proceed on 2 hosts? Enter the number of affected hosts to confirm or "q" to quit: 2
===== NODE GROUP =====
(1) cp2041.codfw.wmnet
----- OUTPUT of 'journalctl -u pu...grep "timed out"' -----
Sep 05 17:01:19 cp2041 purged[3565730]: %4|1725555679.879|SESSTMOUT|purged#consumer-1| [thrd:main]: Consumer group session timed out (in join-state steady) after 10216 ms without a successful response from the group coordinator (broker 2003, last error was Broker: Not coordinator): revoking assignment and rejoining group
===== NODE GROUP =====
(1) cp2038.codfw.wmnet
----- OUTPUT of 'journalctl -u pu...grep "timed out"' -----
Sep 05 17:01:21 cp2038 purged[4019621]: %4|1725555681.014|SESSTMOUT|purged#consumer-1| [thrd:main]: Consumer group session timed out (in join-state steady) after 10498 ms without a successful response from the group coordinator (broker 2001, last error was Broker: Not coordinator): revoking assignment and rejoining group
================
```