Vgutierrez (Valentín Gutiérrez)
Staff Site Reliability Engineer, Traffic Team

Projects (6)

User Details

User Since
Feb 12 2018, 9:51 AM (358 w, 42 m)
Availability
Available
IRC Nick
vgutierrez
LDAP User
Vgutierrez
MediaWiki User
VGutiérrez (WMF) [ Global Accounts ]

Recent Activity

Wed, Dec 4

Vgutierrez added a comment to T380450: bring katran to liberica.

As noticed while working on https://gitlab.wikimedia.org/repos/sre/liberica/-/merge_requests/87, liberica needs to populate the lru_mapping eBPF map to avoid letting katran fallback to the global LRU:

# HELP liberica_fp_katran_fallback_global_lru_total Number of times that katran failed to find the per cpu/core lru, it should be 0 in production
# TYPE liberica_fp_katran_fallback_global_lru_total counter
liberica_fp_katran_fallback_global_lru_total 12006
Wed, Dec 4, 1:56 PM · Patch-For-Review, Traffic
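
The HELP text in the T380450 comment above says the fallback counter should be 0 in production. A minimal sketch of that check, assuming the metric is scraped as un-labelled Prometheus text; `read_counter` is a hypothetical helper with hand-rolled parsing, not part of liberica or of any Prometheus client library:

```python
def read_counter(exposition_text, metric_name):
    """Return the value of an un-labelled sample in Prometheus text format.

    Hypothetical helper for illustration; labels are not handled and any
    trailing timestamp field is ignored.
    """
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0] == metric_name:
            return float(fields[1])
    return None


SAMPLE = """\
# HELP liberica_fp_katran_fallback_global_lru_total Number of times that katran failed to find the per cpu/core lru, it should be 0 in production
# TYPE liberica_fp_katran_fallback_global_lru_total counter
liberica_fp_katran_fallback_global_lru_total 12006
"""

fallbacks = read_counter(SAMPLE, "liberica_fp_katran_fallback_global_lru_total")
# A non-zero value means katran fell back to the global LRU at least once.
needs_attention = fallbacks is not None and fallbacks > 0
```

With the sample quoted in the comment, `needs_attention` is true, which matches the observation that lru_mapping still has to be populated.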

Tue, Nov 26

Vgutierrez added a comment to T375569: Remove RSA certificates from puppet.

right now exim is configured with RSA certs only and not with a dual stack (RSA+ECDSA) setup, from lists1004's exim configuration:

# TLS
Tue, Nov 26, 4:15 AM · Patch-For-Review, Traffic

Nov 22 2024

Vgutierrez updated the task description for T374128: haproxykafka features.
Nov 22 2024, 12:20 PM · Traffic
Vgutierrez triaged T380583: Avoid logging errors per produced message as High priority.
Nov 22 2024, 12:20 PM · Sustainability (Incident Followup), Traffic
Vgutierrez created T380583: Avoid logging errors per produced message.
Nov 22 2024, 12:19 PM · Sustainability (Incident Followup), Traffic
Vgutierrez added a comment to T380373: Allow TLS authenticated client to write on new topics.

This triggered errors on every haproxykafka instance after losing producer access to the configured topics:

Nov 21 17:42:40 cp5031 haproxykafka[3825906]: %5|1732210960.009|PARTCNT|cp5031#producer-1| [thrd:ssl://kafka-jumbo1011.eqiad.wmnet:9093/bootstrap]: Topic webrequest_frontend_upload partition count changed from 1 to 0
Nov 22 2024, 10:53 AM · Data-Platform-SRE, Traffic

Nov 21 2024

Vgutierrez triaged T380450: bring katran to liberica as High priority.
Nov 21 2024, 9:41 AM · Patch-For-Review, Traffic
Vgutierrez created T380450: bring katran to liberica.
Nov 21 2024, 9:40 AM · Patch-For-Review, Traffic
Vgutierrez closed T378341: harden liberica systemd service units, a subtask of T332027: Replace current L4LB with with Katran-based alternative, as Resolved.
Nov 21 2024, 9:39 AM · Traffic
Vgutierrez closed T378341: harden liberica systemd service units as Resolved.
Nov 21 2024, 9:39 AM · Traffic
Vgutierrez closed T376600: Provide debian packages for liberica, a subtask of T332027: Replace current L4LB with with Katran-based alternative, as Resolved.
Nov 21 2024, 9:39 AM · Traffic
Vgutierrez closed T375464: Test liberica BGP support as Resolved.

liberica is using BGP as expected on lvs1013

Nov 21 2024, 9:39 AM · Traffic
Vgutierrez closed T376600: Provide debian packages for liberica as Resolved.
Nov 21 2024, 9:39 AM · Traffic
Vgutierrez closed T375464: Test liberica BGP support, a subtask of T332027: Replace current L4LB with with Katran-based alternative, as Resolved.
Nov 21 2024, 9:38 AM · Traffic

Nov 19 2024

Vgutierrez removed a project from T379990: 403 on http://dumps.wikimedia.org: Traffic.

Removing Traffic given this kind of request isn't handled by the CDN

Nov 19 2024, 9:45 AM · Data-Platform-SRE, Data-Platform

Nov 18 2024

Vgutierrez added a comment to T352956: Handling inbound IPIP traffic on low traffic LVS k8s based realservers.

as mentioned on the email thread, that sounds like a viable option for us

Nov 18 2024, 5:23 PM · Prod-Kubernetes, Kubernetes, serviceops, Traffic

Nov 14 2024

Vgutierrez triaged T379891: Upgrade haproxy to 2.8.12 on cp hosts as Medium priority.
Nov 14 2024, 8:53 AM · Traffic
Vgutierrez created T379891: Upgrade haproxy to 2.8.12 on cp hosts.
Nov 14 2024, 8:53 AM · Traffic
Vgutierrez added a parent task for T379797: Package and deploy ATS 9.2.6: Unknown Object (Task).
Nov 14 2024, 8:42 AM · Traffic

Nov 7 2024

Vgutierrez created T379238: liberica should offer prometheus endpoints on both IPv4 and IPv6.
Nov 7 2024, 12:27 PM · Traffic
Vgutierrez triaged T379238: liberica should offer prometheus endpoints on both IPv4 and IPv6 as Medium priority.
Nov 7 2024, 12:27 PM · Traffic
Vgutierrez changed the status of T378341: harden liberica systemd service units, a subtask of T332027: Replace current L4LB with with Katran-based alternative, from Open to In Progress.
Nov 7 2024, 9:30 AM · Traffic
Vgutierrez changed the status of T378341: harden liberica systemd service units from Open to In Progress.
Nov 7 2024, 9:30 AM · Traffic

Nov 6 2024

Vgutierrez triaged T379164: BGP settings for liberica as Medium priority.
Nov 6 2024, 2:50 PM · netops, Traffic, Infrastructure-Foundations
Vgutierrez created T379164: BGP settings for liberica.
Nov 6 2024, 2:50 PM · netops, Traffic, Infrastructure-Foundations
Vgutierrez closed T378453: Testing liberica with ncredir@eqiad as Resolved.

lvs1013 running liberica is now the primary load balancer for ncredir@eqiad

Nov 6 2024, 2:42 PM · Infrastructure-Foundations, netops
Vgutierrez closed T377127: liberica puppetization, a subtask of T332027: Replace current L4LB with with Katran-based alternative, as Resolved.
Nov 6 2024, 2:41 PM · Traffic
Vgutierrez closed T377127: liberica puppetization as Resolved.
Nov 6 2024, 2:41 PM · Traffic
Vgutierrez closed T378453: Testing liberica with ncredir@eqiad, a subtask of T377127: liberica puppetization, as Resolved.
Nov 6 2024, 2:41 PM · Traffic

Oct 30 2024

Vgutierrez added a comment to T378453: Testing liberica with ncredir@eqiad.

nice, do we need to add lvs1013 to any ACLs?

Oct 30 2024, 1:03 PM · Infrastructure-Foundations, netops

Oct 29 2024

Vgutierrez created T378453: Testing liberica with ncredir@eqiad.
Oct 29 2024, 10:10 AM · Infrastructure-Foundations, netops
Vgutierrez added a comment to T354839: Support PyBal routes announced with lower priority than "backup".

Given the limitations of running pybal and liberica on the same hosts, we want to run liberica on separate hosts with a higher priority than pybal for the same prefixes so we can test liberica.
We want to deploy liberica on lvs1013 and let it handle traffic for ncredir without altering the current config for ncredir@eqiad on pybal running on lvs1017 and lvs1020. That would mean setting a community string (14907:2?) on liberica in lvs1013 that translates to local-pref >100, so it will start receiving traffic for ncredir.

Oct 29 2024, 9:50 AM · Traffic, netops, Infrastructure-Foundations, SRE
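
The reasoning in the T354839 comment above relies on routers preferring the announcement with the highest local-pref. A rough sketch of that single selection step, with made-up values: the prefix is a documentation range, the local-pref numbers are hypothetical, and the mapping from a community string to a local-pref lives in router policy that is not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str    # host announcing the prefix
    local_pref: int  # assigned by router policy, e.g. from a BGP community


def preferred(routes, prefix):
    """Pick the route with the highest local-pref for a prefix.

    Only the local-pref step of BGP best-path selection is modelled;
    later tie-breakers are ignored.
    """
    candidates = [r for r in routes if r.prefix == prefix]
    return max(candidates, key=lambda r: r.local_pref, default=None)


# Hypothetical values: they only need to satisfy "liberica > 100 >= pybal".
routes = [
    Route("192.0.2.0/24", "lvs1017", 100),
    Route("192.0.2.0/24", "lvs1020", 100),
    Route("192.0.2.0/24", "lvs1013", 150),
]
best = preferred(routes, "192.0.2.0/24")
```

Under these assumptions `best.next_hop` is lvs1013, so the liberica host attracts the traffic while the pybal announcements stay untouched.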

Oct 28 2024

Vgutierrez created T378341: harden liberica systemd service units.
Oct 28 2024, 10:56 AM · Traffic

Oct 14 2024

Vgutierrez renamed T377127: liberica puppetization from liberica puppetization for existing lvs instances to liberica puppetization.
Oct 14 2024, 12:41 PM · Traffic
Vgutierrez triaged T377127: liberica puppetization as Medium priority.
Oct 14 2024, 12:38 PM · Traffic
Vgutierrez created T377127: liberica puppetization.
Oct 14 2024, 12:38 PM · Traffic

Oct 10 2024

Vgutierrez added a comment to T376876: Gather site pooled/depooled information for Grafana.

I like the black box approach, but it would require maintaining yet another map of DCs and ranges per service, but I guess you could infer it from dig +short $cluster-lb.$dc.wikimedia.org and use those as input, something like this:

#!/bin/bash
DCS=("eqiad" "codfw" "esams" "ulsfo" "eqsin" "drmrs" "magru")
for DC in "${DCS[@]}"; do
    TEXT_IP=$(dig +short text-lb."$DC".wikimedia.org @ns0.wikimedia.org)
    UPLOAD_IP=$(dig +short upload-lb."$DC".wikimedia.org @ns0.wikimedia.org)
Oct 10 2024, 9:29 AM · Traffic
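
The idea in the T376876 comment above is to derive the load balancer addresses from DNS instead of maintaining another map. The quoted script is cut off, so the following does not reproduce its remaining steps; it is only a sketch of the lookup part. `lb_hostname` and `lb_addresses` are hypothetical names, and the default resolver uses the system resolver rather than querying ns0.wikimedia.org directly as the `dig` calls do.

```python
import socket

DCS = ("eqiad", "codfw", "esams", "ulsfo", "eqsin", "drmrs", "magru")
CLUSTERS = ("text", "upload")


def lb_hostname(cluster, dc):
    """Build the $cluster-lb.$dc.wikimedia.org name used in the comment."""
    return f"{cluster}-lb.{dc}.wikimedia.org"


def lb_addresses(resolve=socket.gethostbyname, dcs=DCS, clusters=CLUSTERS):
    """Map (cluster, dc) to whatever address `resolve` returns for it."""
    return {(c, dc): resolve(lb_hostname(c, dc)) for dc in dcs for c in clusters}
```

Passing a stub such as `lb_addresses(resolve=lambda host: "192.0.2.1")` exercises the mapping without touching DNS.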

Oct 8 2024

Vgutierrez updated the task description for T376600: Provide debian packages for liberica.
Oct 8 2024, 2:17 PM · Traffic
Vgutierrez closed T376696: Sync liberica etcd library requirements with versions provided on debian bookworm, a subtask of T376600: Provide debian packages for liberica, as Invalid.
Oct 8 2024, 2:03 PM · Traffic
Vgutierrez closed T376696: Sync liberica etcd library requirements with versions provided on debian bookworm as Invalid.
Oct 8 2024, 2:03 PM · Traffic
Vgutierrez created T376696: Sync liberica etcd library requirements with versions provided on debian bookworm.
Oct 8 2024, 7:46 AM · Traffic
Vgutierrez updated the task description for T376600: Provide debian packages for liberica.
Oct 8 2024, 7:44 AM · Traffic

Oct 7 2024

Vgutierrez updated the task description for T376600: Provide debian packages for liberica.
Oct 7 2024, 11:16 AM · Traffic
Vgutierrez updated the task description for T376600: Provide debian packages for liberica.
Oct 7 2024, 10:21 AM · Traffic
Vgutierrez triaged T376600: Provide debian packages for liberica as Medium priority.
Oct 7 2024, 9:59 AM · Traffic
Vgutierrez created T376600: Provide debian packages for liberica.
Oct 7 2024, 9:58 AM · Traffic

Oct 4 2024

Vgutierrez created T376477: Get a WMF/SRE/Traffic GCP account.
Oct 4 2024, 2:27 PM · Traffic
Vgutierrez closed T376460: Test acme-chief ability to use pki.goog, a subtask of T376459: Study the viability of using pki.goog (Google Trust Services) as a 2nd ACME CA, as Resolved.
Oct 4 2024, 11:43 AM · Traffic
Vgutierrez closed T376460: Test acme-chief ability to use pki.goog as Resolved.

On a second attempt after wiping remaining challenges on the DNS server using acme-chief-designate-tidyup.service, acme-chief issued both certificates as expected:

root@traffic-acmechief01:/etc/acme-chief# openssl x509 -issuer -dates -subject -ext subjectAltName -noout  -in /var/lib/acme-chief/certs/non-canonical-redirect-1-pki/new/rsa-2048.crt
issuer=C = US, O = Google Trust Services, CN = WR1
notBefore=Oct  4 10:41:09 2024 GMT
notAfter=Jan  2 10:41:08 2025 GMT
subject=CN = wikipedia.com.traffic.wmflabs.org
X509v3 Subject Alternative Name: 
    DNS:wikipedia.com.traffic.wmflabs.org, DNS:*.wikipedia.com.traffic.wmflabs.org, DNS:*.en-wp.com.traffic.wmflabs.org, DNS:en-wp.com.traffic.wmflabs.org, DNS:*.en-wp.org.traffic.wmflabs.org, DNS:en-wp.org.traffic.wmflabs.org
root@traffic-acmechief01:/etc/acme-chief# openssl x509 -issuer -dates -subject -ext subjectAltName -noout  -in /var/lib/acme-chief/certs/non-canonical-redirect-1-pki/new/ec-prime256v1.crt
issuer=C = US, O = Google Trust Services, CN = WR1
notBefore=Oct  4 10:41:05 2024 GMT
notAfter=Jan  2 10:41:04 2025 GMT
subject=CN = wikipedia.com.traffic.wmflabs.org
X509v3 Subject Alternative Name: 
    DNS:wikipedia.com.traffic.wmflabs.org, DNS:*.wikipedia.com.traffic.wmflabs.org, DNS:*.en-wp.com.traffic.wmflabs.org, DNS:en-wp.com.traffic.wmflabs.org, DNS:*.en-wp.org.traffic.wmflabs.org, DNS:en-wp.org.traffic.wmflabs.org
Oct 4 2024, 11:43 AM · Traffic
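
As a worked check of the `openssl x509 -dates` output quoted above, the validity period of both certificates can be computed from the notBefore/notAfter values. A small sketch using only the standard library; `validity` is a hypothetical helper name:

```python
from datetime import datetime

# The literal "GMT" keeps the parse independent of the local timezone.
OPENSSL_DATE = "%b %d %H:%M:%S %Y GMT"


def validity(not_before, not_after):
    """Return notAfter - notBefore for two `openssl x509 -dates` values."""
    start = datetime.strptime(not_before, OPENSSL_DATE)
    end = datetime.strptime(not_after, OPENSSL_DATE)
    return end - start


rsa = validity("Oct  4 10:41:09 2024 GMT", "Jan  2 10:41:08 2025 GMT")
ec = validity("Oct  4 10:41:05 2024 GMT", "Jan  2 10:41:04 2025 GMT")
```

Both results come out one second short of 90 days.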
Vgutierrez added a comment to T376460: Test acme-chief ability to use pki.goog.

after a quick patch:

vgutierrez@carrot:~/gitlab.wikimedia.org/sre/acme-chief$ git diff acme_chief/acme_requests.py 
diff --git a/acme_chief/acme_requests.py b/acme_chief/acme_requests.py
index 7f677f2..bee4415 100644
--- a/acme_chief/acme_requests.py
+++ b/acme_chief/acme_requests.py
@@ -331,9 +331,6 @@ class ACMERequests:
         self.challenges = {}
         self.orders = {}
Oct 4 2024, 11:34 AM · Traffic
Vgutierrez created T376460: Test acme-chief ability to use pki.goog.
Oct 4 2024, 10:04 AM · Traffic
Vgutierrez created T376459: Study the viability of using pki.goog (Google Trust Services) as a 2nd ACME CA.
Oct 4 2024, 10:00 AM · Traffic

Oct 3 2024

Vgutierrez added a comment to T375562: Investigating unique devices traffic data.

Assuming that the automated traffic you're describing follows DNS and doesn't hardcode the IP of another of our DCs, the increase should be reflected in that graph.

Oct 3 2024, 8:03 AM · Movement-Insights, Traffic

Sep 27 2024

Vgutierrez triaged T375839: puppetserver[1001-1002,2001] crashed on 2024-09-27 00:00 as High priority.

after a powercycle puppetserver1001 is responsive again

Sep 27 2024, 3:35 AM · Infrastructure-Foundations, SRE, Puppet-Infrastructure
Vgutierrez created T375839: puppetserver[1001-1002,2001] crashed on 2024-09-27 00:00.
Sep 27 2024, 3:22 AM · Infrastructure-Foundations, SRE, Puppet-Infrastructure

Sep 26 2024

Vgutierrez closed T375711: HAproxy and varnish misreport the authentication mechanism used in TLSv1.3 traffic as Resolved.

we should start getting RSA data for TLSv1.3 as soon as puppet runs in the cp nodes

Sep 26 2024, 11:18 AM · Patch-For-Review, Traffic
Vgutierrez closed T375711: HAproxy and varnish misreport the authentication mechanism used in TLSv1.3 traffic, a subtask of T370837: Remove RSA certificates and use only ECDSA certificates, as Resolved.
Sep 26 2024, 11:17 AM · User-notice-archive, Patch-For-Review, Traffic
Vgutierrez renamed T375711: HAproxy and varnish misreport the authentication mechanism used in TLSv1.3 traffic from HAproxy misreports the authentication mecanism in TLSv1.3 traffic to HAproxy and varnish misreport the authentication mechanism used in TLSv1.3 traffic.
Sep 26 2024, 7:40 AM · Patch-For-Review, Traffic
Vgutierrez triaged T375711: HAproxy and varnish misreport the authentication mechanism used in TLSv1.3 traffic as High priority.
Sep 26 2024, 7:21 AM · Patch-For-Review, Traffic
Vgutierrez created T375711: HAproxy and varnish misreport the authentication mechanism used in TLSv1.3 traffic.
Sep 26 2024, 7:20 AM · Patch-For-Review, Traffic
Vgutierrez added a comment to T370837: Remove RSA certificates and use only ECDSA certificates.

I just -2ed the gerrit change cause we don't currently have information about which certificate is being used.
TLSv1.2 includes the authentication mechanism used during the handshake as part of the ciphersuite (ECDHE-RSA-AES256-GCM-SHA384, ECDHE-ECDSA-AES256-GCM-SHA384) and that's what's currently being sent from haproxy to varnish as part of x-connection-properties header.
In TLSv1.3 the authentication mechanism has been dropped from the ciphersuite name (TLS_AES_128_GCM_SHA256 could be used with either ECDSA or RSA certificates).

Sep 26 2024, 7:02 AM · User-notice-archive, Patch-For-Review, Traffic
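
The difference described in the T370837 comment above can be illustrated by trying to read the authentication mechanism out of a ciphersuite name. This is a sketch for illustration only, not the haproxy or varnish logic; `auth_from_ciphersuite` is a hypothetical helper and the set of TLSv1.3 names is deliberately incomplete.

```python
# A few TLSv1.3 suite names; none of them encodes an authentication mechanism.
TLS13_SUITES = {
    "TLS_AES_128_GCM_SHA256",
    "TLS_AES_256_GCM_SHA384",
    "TLS_CHACHA20_POLY1305_SHA256",
}


def auth_from_ciphersuite(name):
    """Return "RSA" or "ECDSA" when an OpenSSL-style suite name encodes it.

    Returns None for TLSv1.3 suites and for names this sketch does not
    recognise.
    """
    if name in TLS13_SUITES:
        return None
    parts = name.split("-")
    # TLSv1.2 ECDHE suites look like ECDHE-<auth>-<cipher...>.
    if len(parts) > 2 and parts[0] == "ECDHE" and parts[1] in ("RSA", "ECDSA"):
        return parts[1]
    return None
```

For the names quoted in the comment, the two TLSv1.2 suites yield RSA and ECDSA respectively, while TLS_AES_128_GCM_SHA256 yields nothing, which is why the certificate type has to be reported separately for TLSv1.3.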

Sep 25 2024

Vgutierrez added a comment to T375562: Investigating unique devices traffic data.

What we saw in the data is that for *.wikipedia.org we have webrequests with WMF-Last-Access-Global set but no WMF-Last-Access. I see the conditions that match what you say in the VCL code, but I didn't see anything that explained this happening on a request to wikipedia.org. Happy to hang out and find data we can look at.

Sep 25 2024, 3:15 PM · Movement-Insights, Traffic
Vgutierrez added a comment to T375562: Investigating unique devices traffic data.
  1. Is there an explanation for why some users apparently have WMF-Last-Access-Global set but not WMF-Last-Access, and vice versa? It seems that the cookies should be set simultaneously, such that any user should have at minimum a single pair of cookies.

Yes, since T174640 varnish doesn't set WMF-Last-Access-Global cookie for wikimedia.org or its subdomains. And T260943 excluded requests to api.wikimedia.org from getting WMF-Last-Access. Relevant code can be seen here

Sep 25 2024, 1:00 PM · Movement-Insights, Traffic

Sep 24 2024

Vgutierrez added a comment to T352956: Handling inbound IPIP traffic on low traffic LVS k8s based realservers.

Regarding reverse-path filtering it's enough to disable it on "all" and ipip0/ipip60, per Linux kernel documentation:

The max value from conf/{all,interface}/rp_filter is used when doing source validation on the {interface}.

Sep 24 2024, 2:46 PM · Prod-Kubernetes, Kubernetes, serviceops, Traffic
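
The kernel rule quoted in the T352956 comment above explains why rp_filter has to be disabled on both "all" and the tunnel interface: the larger of the two values wins. A minimal sketch of that rule; the function names are hypothetical.

```python
def effective_rp_filter(conf_all, conf_interface):
    """Value the kernel uses for source validation on an interface.

    rp_filter values: 0 = no source validation, 1 = strict, 2 = loose.
    """
    return max(conf_all, conf_interface)


def rp_filter_disabled(conf_all, conf_interface):
    """True only when both settings are 0, since the maximum is applied."""
    return effective_rp_filter(conf_all, conf_interface) == 0
```

Setting only the interface value to 0 while "all" stays at 1 still leaves strict validation in effect on that interface.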
Vgutierrez closed T375060: Requesting access to stat1007 for cyndywikime as Resolved.
vgutierrez@krb1001:~$ sudo manage_principals.py create cyndywikime --email_address=csimiyu@wikimedia.org
Principal successfully created. Make sure to update data.yaml in Puppet.
Successfully sent email to csimiyu@wikimedia.org
Sep 24 2024, 7:50 AM · Data-Engineering, SRE, SRE-Access-Requests
Vgutierrez changed the status of T375060: Requesting access to stat1007 for cyndywikime from Stalled to In Progress.
Sep 24 2024, 7:38 AM · Data-Engineering, SRE, SRE-Access-Requests
Vgutierrez triaged T375464: Test liberica BGP support as Medium priority.
Sep 24 2024, 7:33 AM · Traffic
Vgutierrez created T375464: Test liberica BGP support.
Sep 24 2024, 7:33 AM · Traffic

Sep 23 2024

Vgutierrez closed T334078: purged issues while kafka brokers are restarted as Resolved.

purged 0.24 survived a "Consumer group session timed out (in join-state steady) after 10458 ms without a successful response from the group coordinator (broker 2001, last error was Broker: Not coordinator): revoking assignment and rejoining group" error this morning, so I'll close this task as it seems the bug has been solved.

Sep 23 2024, 2:33 PM · Traffic

Sep 20 2024

Vgutierrez added a comment to T334078: purged issues while kafka brokers are restarted.

purged 0.24 shipping the patch mentioned above has been deployed everywhere. Now we should expect the following behavior: purged will start logging and gathering metrics of kafka errors. If the received error is considered fatal (only all brokers down error is flagged as fatal at the moment) purged will exit and systemd should restart it.
If a non-fatal error triggers the same behavior as the one described in this task, we should add that error code to the list of fatal errors.

Sep 20 2024, 7:24 AM · Traffic
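
The behavior described in the T334078 comment above (count every Kafka error, exit only on errors flagged as fatal) can be sketched as a small policy function. purged itself is written in Go, so this is an illustration of the policy rather than its code; the error codes below are stand-in strings, not real librdkafka identifiers.

```python
from collections import Counter

# Assumption: placeholder code names used only for this sketch.
FATAL_CODES = frozenset({"ALL_BROKERS_DOWN"})

error_counts = Counter()


def handle_kafka_error(code, fatal_codes=FATAL_CODES):
    """Count the error and report whether the process should exit.

    Returning True models "purged exits and systemd restarts it".
    """
    error_counts[code] += 1
    return code in fatal_codes
```

Promoting another error to fatal then amounts to extending the set, for example `handle_kafka_error("NOT_COORDINATOR", FATAL_CODES | {"NOT_COORDINATOR"})`.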

Sep 19 2024

Vgutierrez added a comment to T334078: purged issues while kafka brokers are restarted.

@elukey I think I've found the root cause of this mess, this is the patch:

diff --git a/kafka.go b/kafka.go
index 4e495ce..b878db8 100644
--- a/kafka.go
+++ b/kafka.go
@@ -187,7 +187,7 @@ func (k *KafkaReader) manageEvent(event kafka.Event, c chan string) bool {
                if err != nil {
                        log.Printf("Unable to update promrdkafka metrics: %v\n", err)
                }
-       case *kafka.Error:
+       case kafka.Error:
                kafkaErrors.With(prometheus.Labels{typeLabel: e.Code().String()}).Inc()
Sep 19 2024, 9:33 AM · Traffic

Sep 18 2024

Vgutierrez added a comment to T375060: Requesting access to stat1007 for cyndywikime.

SSH key has been confirmed out of band

Sep 18 2024, 3:54 PM · Data-Engineering, SRE, SRE-Access-Requests
Vgutierrez updated the task description for T375060: Requesting access to stat1007 for cyndywikime.
Sep 18 2024, 3:53 PM · Data-Engineering, SRE, SRE-Access-Requests
Vgutierrez updated the task description for T375060: Requesting access to stat1007 for cyndywikime.
Sep 18 2024, 3:31 PM · Data-Engineering, SRE, SRE-Access-Requests
Vgutierrez changed the status of T375060: Requesting access to stat1007 for cyndywikime from Open to Stalled.

per data.yaml we need approval from @odimitrijevic / @Milimetric / @WDoranWMF / @Ahoelzl / @Ottomata (one of them is enough)

Sep 18 2024, 3:28 PM · Data-Engineering, SRE, SRE-Access-Requests
Vgutierrez triaged T375060: Requesting access to stat1007 for cyndywikime as Medium priority.
Sep 18 2024, 3:26 PM · Data-Engineering, SRE, SRE-Access-Requests
Vgutierrez placed T374997: Some sites try and fail to serve favicon.ico up for grabs.

Provided URLs are currently handled by mw-web:

vgutierrez@carrot:/tmp$ ./T374997.sh 
https://donate.wikimedia.org/favicon.ico
< server: mw-web.eqiad.main-6646476df4-7z5sc
https://quality.wikimedia.org/favicon.ico
< server: mw-web.eqiad.main-6646476df4-hmc86
https://office.wikimedia.org/favicon.ico
< server: mw-web.eqiad.main-6646476df4-8xnx7
https://wikimania2011.wikimedia.org/favicon.ico
< server: mw-web.eqiad.main-6646476df4-m6l62
https://vote.wikimedia.org/favicon.ico
< server: mw-web.eqiad.main-6646476df4-t6cgb
https://ng.wikimedia.org/favicon.ico
< server: mw-web.eqiad.main-6646476df4-2x87n
https://collab.wikimedia.org/favicon.ico
< server: mw-web.eqiad.main-6646476df4-4qwnw
https://mai.wikimedia.org/favicon.ico
< server: mw-web.eqiad.main-6646476df4-zckls
https://u4c.wikimedia.org/favicon.ico
< server: mw-web.eqiad.main-6646476df4-fx86f
Sep 18 2024, 10:36 AM · Patch-For-Review, serviceops, MW-on-K8s, Traffic
Vgutierrez triaged T374997: Some sites try and fail to serve favicon.ico as Medium priority.
Sep 18 2024, 9:52 AM · Patch-For-Review, serviceops, MW-on-K8s, Traffic
Vgutierrez claimed T374997: Some sites try and fail to serve favicon.ico.
Sep 18 2024, 9:52 AM · Patch-For-Review, serviceops, MW-on-K8s, Traffic
Vgutierrez added a comment to T374986: cp307[12] thermal issues.

Answering @RobH's question here:

Hey I made some assumptions on the cp hosts troubleshooting but should check with you: Those hosts are under the same weight conditions as all their related hosts correct? If so, then indeed something is off and likely thermal paste

That's right, all hosts have the same weight; under normal conditions they should handle the same load on average

Sep 18 2024, 7:12 AM · SRE, ops-esams, DC-Ops, Traffic

Sep 16 2024

Vgutierrez closed T374595: LDAP access to the wmf group for Cyndywikime as Declined.

After double checking that I get the very same errors as @Cyndymediawiksim it looks like it's an issue with that specific superset dashboard, not related to @Cyndymediawiksim user

Sep 16 2024, 1:40 PM · SRE, LDAP-Access-Requests
Vgutierrez reassigned T374595: LDAP access to the wmf group for Cyndywikime from Vgutierrez to Cyndymediawiksim.
Sep 16 2024, 10:09 AM · SRE, LDAP-Access-Requests
Vgutierrez claimed T374595: LDAP access to the wmf group for Cyndywikime.
Sep 16 2024, 10:09 AM · SRE, LDAP-Access-Requests
Vgutierrez changed the status of T374595: LDAP access to the wmf group for Cyndywikime from Open to Stalled.

idp configuration states that wmf membership is enough to access superset (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/role/common/idp.yaml#206) and as already mentioned @Cyndymediawiksim is already a member of wmf:

Wikitech: User:Cyndywikime
Shell username: cyndywikime
Email: csimiyu at wikimedia dot org
User ID: 40557
Account created: 20220802154524Z
Groups
Groups that begin with project- refer to Cloud VPS projects.
Sep 16 2024, 10:07 AM · SRE, LDAP-Access-Requests
Vgutierrez added a comment to T373666: Requesting access to deployment for zoe.

gentle reminder, this is still waiting for @VPuffetMichel approval

Sep 16 2024, 7:00 AM · SRE, SRE-Access-Requests

Sep 12 2024

Vgutierrez added a comment to T373995: CPU thermal throttling: saturation panel isn't working as expected.

hmm, is it entirely possible that cp instances aren't running that mtail program at all?

vgutierrez@cp7005:~$ ps auxww |grep "mtail -progs"
root        2830 11.8  0.0 4408172 59920 ?       Sl   Jun04 17071:58 /usr/bin/mtail -progs /etc/mtail-default -port 3903 -logs /dev/stdin -disable_fsnotify
root     2116858 23.1  0.0 3817292 51472 ?       Sl   Sep03 2935:08 /usr/bin/mtail -progs /etc/mtail-internal -port 3913 -logs /dev/stdin -disable_fsnotify
vgutier+ 2492792  0.0  0.0   6240   652 pts/0    S+   12:35   0:00 grep mtail -progs
root     2991153  0.0  0.0   2480   508 ?        Ss   Jun24   0:00 /bin/sh -c atslog-backend | mtail -progs "/etc/atsmtail-backend" -logs /dev/stdin -disable_fsnotify -port "3904" 
root     2991156  1.3  0.0 6771692 80696 ?       Sl   Jun24 1558:37 mtail -progs /etc/atsmtail-backend -logs /dev/stdin -disable_fsnotify -port 3904
Sep 12 2024, 12:36 PM · SRE Observability (FY2024/2025-Q2)
Vgutierrez updated subscribers of T373993: CPU temperature issues in cp hosts.

@RobH / @wiki_willy could we get this task prioritized on your side?

Sep 12 2024, 11:05 AM · SRE, ops-esams, ops-magru, DC-Ops, Traffic

Sep 10 2024

Vgutierrez added a comment to T373995: CPU thermal throttling: saturation panel isn't working as expected.

I don't think so, I'm failing to see

Sep 10 11:48:50 cp7005 kernel: [8456281.652467] mce: CPU15: Core temperature is above threshold, cpu clock is throttled (total events = 170)
Sep 10 11:48:51 cp7005 kernel: [8456282.676501] mce: CPU15: Core temperature/speed normal (total events = 170)

reported on grafana, https://grafana.wikimedia.org/goto/ccf0yA6Sg?orgId=1

Sep 10 2024, 12:41 PM · SRE Observability (FY2024/2025-Q2)

Sep 9 2024

Vgutierrez moved T374340: No ntp query ACL for new alert hosts from Backlog to Radar/Not for service by Traffic on the Traffic board.
Sep 9 2024, 9:24 AM · Traffic, SRE Observability (FY2024/2025-Q1), Observability-Alerting

Sep 8 2024

Vgutierrez closed T374318: Wikifunctions is down as Resolved.

Service should be restored now.

Sep 8 2024, 7:17 PM · Traffic, Abstract Wikipedia team
Vgutierrez added a comment to T374318: Wikifunctions is down.

mw-wikifunctions seems to be down in eqiad at the moment:

vgutierrez@cp6016:~$ nc -zv mw-wikifunctions.discovery.wmnet 4451
nc: connect to mw-wikifunctions.discovery.wmnet (10.2.2.88) port 4451 (tcp) failed: Connection refused
vgutierrez@cp6016:~$ nc -zv mw-wikifunctions.svc.eqiad.wmnet 4451
nc: connect to mw-wikifunctions.svc.eqiad.wmnet (10.2.2.88) port 4451 (tcp) failed: Connection refused
vgutierrez@cp6016:~$ nc -zv mw-wikifunctions.svc.codfw.wmnet 4451
Connection to mw-wikifunctions.svc.codfw.wmnet (10.2.1.88) 4451 port [tcp/*] succeeded!
Sep 8 2024, 5:56 PM · Traffic, Abstract Wikipedia team
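
The `nc -zv` checks in the T374318 comment above test whether a TCP connection to a service port succeeds. A comparable probe can be sketched with the standard library; `port_open` is a hypothetical helper name and the 3 second timeout is an arbitrary choice.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Called as `port_open("mw-wikifunctions.svc.codfw.wmnet", 4451)` from a host with access to that network, it would report the same reachability that `nc` printed.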

Sep 6 2024

Vgutierrez created T374232: Provide a golang-github-confluentinc-confluent-kafka-go-dev version that matches librdkafka capabilities for bullseye.
Sep 6 2024, 12:33 PM · Traffic
Vgutierrez added a comment to T364691: Elevated 503 backend fetch failed reported by users.

that specific request triggered a timeout while trying to read the POST request body:

Sep  6 06:01:47 cp3069 varnish-frontend-fetcherr[1010101]: @cee: {"time": "2024-09-06T06:01:47.736362", "message": "req.body read error: 11 (Resource temporarily unavailable) [omitted output]
Sep 6 2024, 11:28 AM · Traffic
Vgutierrez added a comment to T364691: Elevated 503 backend fetch failed reported by users.

esams seems to be as healthy as usual per https://grafana.wikimedia.org/goto/ix3gNVeSR?orgId=1

Sep 6 2024, 11:15 AM · Traffic
Vgutierrez changed the status of T334078: purged issues while kafka brokers are restarted from Stalled to In Progress.
Sep 6 2024, 10:59 AM · Traffic
Vgutierrez added a comment to T334078: purged issues while kafka brokers are restarted.

this has been triggered again in cp2038 and cp2041:

vgutierrez@cumin1002:~$ sudo -i cumin 'cp[2038,2041].codfw.wmnet' 'journalctl -u purged.service --since=-18h |grep "timed out"' 
2 hosts will be targeted:
cp[2038,2041].codfw.wmnet
OK to proceed on 2 hosts? Enter the number of affected hosts to confirm or "q" to quit: 2
===== NODE GROUP =====                                                                                                                                                           
(1) cp2041.codfw.wmnet                                                                                                                                                           
----- OUTPUT of 'journalctl -u pu...grep "timed out"' -----                                                                                                                      
Sep 05 17:01:19 cp2041 purged[3565730]: %4|1725555679.879|SESSTMOUT|purged#consumer-1| [thrd:main]: Consumer group session timed out (in join-state steady) after 10216 ms without a successful response from the group coordinator (broker 2003, last error was Broker: Not coordinator): revoking assignment and rejoining group                                
===== NODE GROUP =====                                                                                                                                                           
(1) cp2038.codfw.wmnet                                                                                                                                                           
----- OUTPUT of 'journalctl -u pu...grep "timed out"' -----                                                                                                                      
Sep 05 17:01:21 cp2038 purged[4019621]: %4|1725555681.014|SESSTMOUT|purged#consumer-1| [thrd:main]: Consumer group session timed out (in join-state steady) after 10498 ms without a successful response from the group coordinator (broker 2001, last error was Broker: Not coordinator): revoking assignment and rejoining group                                
================
Sep 6 2024, 4:24 AM · Traffic

Sep 5 2024

Vgutierrez closed Restricted Task, a subtask of T339134: Package and deploy ATS 9.2.5, as Resolved.
Sep 5 2024, 1:06 PM · Traffic
Vgutierrez closed T339134: Package and deploy ATS 9.2.5, a subtask of T342154: Upgrade Traffic hosts to bookworm, as Resolved.
Sep 5 2024, 1:05 PM · Patch-For-Review, Traffic
Vgutierrez closed T339134: Package and deploy ATS 9.2.5 as Resolved.
Sep 5 2024, 1:05 PM · Traffic

Sep 4 2024

Vgutierrez updated the task description for T339134: Package and deploy ATS 9.2.5.
Sep 4 2024, 1:18 PM · Traffic