
BBlack (Brandon Black)
Principal Site Reliability Engineer, SRE Traffic Team

Projects (9)

User Details

User Since
Nov 4 2014, 4:29 PM (527 w, 22 h)
Availability
Available
IRC Nick
bblack
LDAP User
BBlack
MediaWiki User
BBlack (WMF) [ Global Accounts ]

Recent Activity

Yesterday

BBlack edited P71676 WME varnish head pass.
Tue, Dec 10, 12:42 PM · Traffic
BBlack created P71676 WME varnish head pass.
Tue, Dec 10, 12:33 PM · Traffic

Mon, Dec 9

BBlack created P71644 varnishlog full miss, HEAD on >8MB original.
Mon, Dec 9, 6:16 PM · Traffic, SRE

Wed, Dec 4

BBlack added a comment to T362985: Improve how we generate DNS entries from Netbox.

Also, probably the way to standardize this for sanity (avoiding ORIGIN mistakes on both ends) is to follow some simple rules that:

Wed, Dec 4, 1:22 PM · Infrastructure-Foundations, SRE
BBlack added a comment to T362985: Improve how we generate DNS entries from Netbox.

Seems like a net win to me. Reduces some error-prone process stuff and makes life simpler!

Wed, Dec 4, 1:19 PM · Infrastructure-Foundations, SRE

Oct 25 2024

BBlack added a comment to T375256: Cookie % has been rejected because it is foreign and does not have the "Partitioned" attribute.

Ah interesting! We should confirm that and perhaps avoid the set-cookie entirely on cookies that are (or at least are intended to be) ~ SameSite=Lax|Strict then, I guess?

Oct 25 2024, 2:37 PM · Data Products (Data Products Sprint 23), Traffic
BBlack added a comment to T375256: Cookie % has been rejected because it is foreign and does not have the "Partitioned" attribute.

I don't think it's necessarily always up to us to be able to know it's cross-origin, though, right? It would depend on the $random_other_site's CORS whether they tell us about a referrer at all?

Oct 25 2024, 2:21 PM · Data Products (Data Products Sprint 23), Traffic

Oct 24 2024

BBlack added a comment to T375256: Cookie % has been rejected because it is foreign and does not have the "Partitioned" attribute.

Do we have a specific example of a URL and which cookies triggered the rejections? In my own quick repro attempt, I only saw them failing on actually cross-domain traffic (in my case, an enwiki page was loading Math SVG content from https://wikimedia.org/api/..., and it was the cookies coming with that response that were rejected).

Oct 24 2024, 4:48 PM · Data Products (Data Products Sprint 23), Traffic
BBlack added a comment to T375256: Cookie % has been rejected because it is foreign and does not have the "Partitioned" attribute.

Seems like all of these Varnish-level cookies mentioned at the top should at least gain appropriate, explicit SameSite= attributes, in addition to perhaps Partitioned as appropriate (only NetworkProbeLimit currently carries a SameSite attribute at all).

Oct 24 2024, 2:36 PM · Data Products (Data Products Sprint 23), Traffic
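
As a rough illustration of the explicit SameSite= attributes suggested in the T375256 comments above, here is a minimal Python sketch that builds a Set-Cookie value with the standard library. The cookie name `ExampleEdgeCookie`, the one-hour lifetime, and the helper `build_set_cookie` are assumptions made for illustration; none of them are the actual Varnish-level cookies or VCL, and the Partitioned attribute is deliberately left out.

```python
from http.cookies import SimpleCookie


def build_set_cookie(name, value, samesite="Lax", max_age=3600):
    """Return a Set-Cookie header value with an explicit SameSite attribute."""
    cookie = SimpleCookie()
    cookie[name] = value
    morsel = cookie[name]
    morsel["path"] = "/"
    morsel["secure"] = True          # emitted as a bare "Secure" flag
    morsel["max-age"] = max_age      # hypothetical lifetime, in seconds
    morsel["samesite"] = samesite    # "Lax", "Strict", or "None"
    return morsel.OutputString()


# Hypothetical cookie name, used only for this sketch.
header = build_set_cookie("ExampleEdgeCookie", "1", samesite="Lax")
print("Set-Cookie:", header)
```

A cookie that is meant to be first-party only would take "Lax" or "Strict" here; "None" is the value that marks a cookie as intended for cross-site use.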

Oct 11 2024

BBlack added a comment to T327286: Integrate In-App Internet censorship circumvention by domain fronting.

In recent days, WMF somehow changed their GeoDNS so that requests from China would get a not-yet-blocked IPv4 address.

I have to mention that on Oct 1 WMF projects were resolved back to the blocked servers in San Francisco (ulsfo) with the IPv4 address 198.35.26.96.

Oct 11 2024, 11:16 AM · Wikipedia-Android-App-Backlog, Wikipedia-iOS-App-Backlog, Traffic, Security, Privacy Engineering, Infrastructure Security

Sep 20 2024

BBlack created P69382 wmfuniq prototype with key mgmt.
Sep 20 2024, 5:50 PM

Jul 23 2024

BBlack added a comment to T370821: Research and respond to Let's Encrypt's intent to deprecate OCSP in favour of CRLs.

Note also Digicert's annual renewal is coming soon in T368560 . We should maybe look at whether the OCSP URI is optional in the form for making the cert, and turn it off (assuming they also have CRLs working fine). Or if they're not ready for this, I guess Digicert waits another year.

Jul 23 2024, 8:14 PM · Traffic, SRE
BBlack added a comment to T370821: Research and respond to Let's Encrypt's intent to deprecate OCSP in favour of CRLs.

Firefox has historically been the reason we've been stapling OCSP for the past many years. If our certificate has an OCSP URI in its metadata, then Firefox will check OCSP in realtime (which is a privacy risk) unless our servers staple the OCSP to the TLS negotiation (which we do!). This applies to both our Digicert and LE unified certs (and I'm sure some other lesser cases as well!).

Jul 23 2024, 8:12 PM · Traffic, SRE

Jul 2 2024

BBlack added a comment to T368645: Google search results pointing to nonexistent https://donate.m.wikimedia.org/.

^ While we can maintain the VCL-level hacks for now, it would be best to both dig into how this actually happened (most likely, we ourselves emitted donate.m links from a wiki, probably donatewiki itself?), and to come up with a permanent solution at the application layer (fix the wiki to support these links properly and directly). We don't want to keep accumulating hacks like these in our already-overly-complex VCL code if unwarranted.

Jul 2 2024, 3:51 PM · Mobile, Fundraising-Backlog, SEO

Jun 27 2024

BBlack added a comment to T368544: IPIP encapsulation considerations for low-traffic services.

Note there was some phab/brain lag here, I wrote this before I saw joe's last response above, they overlap a bunch

Jun 27 2024, 2:52 PM · Infrastructure-Foundations, serviceops, netops, Traffic

Jun 7 2024

BBlack added a comment to T365690: Make it possible to access the Realtime API and On-demand API without authentication .

One problem that arises from that approach is the inability of the service’s owner to know if there are individual users who are using a disproportionate amount of resources, or even if there is an abuser in the mix. This is the classic “tragedy of the commons” situation. Also, the use of authentication here is not to keep an eye on who is using the API and for what reason. Anyone can use anonymous email accounts to create an account and use that without disclosing any personal information. It is mostly meant to keep users of the APIs updated on changes and to give warnings for deprecations. This all assumes that the user of the APIs has not abandoned the Open source project and someone is still maintaining it.

This could be said for all Wikimedia APIs.

Jun 7 2024, 11:00 PM · Wikimedia Enterprise

Jun 4 2024

BBlack created P64029 sway env.
Jun 4 2024, 4:22 PM
BBlack created P64027 sway start.
Jun 4 2024, 4:20 PM

Jun 3 2024

BBlack added a comment to T366360: Anycast NTP and update the list of timeservers for P:systemd::timesyncd.

Re: "same logic" - they're different protocols, different hierarchies, and much different on the client behavior front as well. It doesn't make sense to share a strategy between the two.

Jun 3 2024, 4:26 PM · SRE, Traffic
BBlack added a comment to T366193: Anycast ns1.wikimedia.org.

I think the difficult part is where to stop the overengineering, for example it could make sense to use Liberica to healthcheck/advertise one of the NS anycast IPs, but it might not be worth using a different AuthDNS software on half the servers, or a different Puppet infra.

Jun 3 2024, 4:19 PM · SRE, Traffic

May 31 2024

BBlack added a comment to T366193: Anycast ns1.wikimedia.org.

Yes, from a resiliency POV, in some senses keeping unicasts in the mix is an answer (and it's the answer we currently rely on). In a world with only very smart and capable resolvers, the simplest answer probably is the current setup. And indeed, not-advertising ns2 from the core DCs would be a very slight resiliency win over that.

May 31 2024, 7:16 PM · SRE, Traffic
BBlack added a comment to T366193: Anycast ns1.wikimedia.org.

Yeah my general thinking was get ns1-anycast going first, and then figure out any of the above about better resiliency before we consider withdrawing ns0-unicast completely.

May 31 2024, 5:02 PM · SRE, Traffic
BBlack added a comment to T366360: Anycast NTP and update the list of timeservers for P:systemd::timesyncd.

Yeah, I've looked at this from the deep-ntp-details POV and it's all pretty sane. We're in alignment with the recommendations in https://www.rfc-editor.org/rfc/rfc8633.html#page-17 and it should result in good time sync stability.

May 31 2024, 4:01 PM · SRE, Traffic
BBlack updated the task description for T366360: Anycast NTP and update the list of timeservers for P:systemd::timesyncd.
May 31 2024, 3:57 PM · SRE, Traffic

May 30 2024

BBlack added a comment to T366193: Anycast ns1.wikimedia.org.

On that future discussion topic (sorry I'm getting nerdsniped!) - Yeah, I had thought about prepending (vs the hard A/B cutoff) as well, but I tend to think it doesn't offer as much resiliency as the clean split.

May 30 2024, 4:50 PM · SRE, Traffic
BBlack added a comment to T366193: Anycast ns1.wikimedia.org.

Re: anycast-ns1 and future plans, etc (I won't quote all the relevant bits from both msgs above):

May 30 2024, 1:11 PM · SRE, Traffic

May 25 2024

BBlack closed T365911: Review "xx.wiki" domains, redirect "xx.wiki/FOO" to "xx.wikipedia.org/wiki/FOO" as Declined.

There are brand/identity dilution and confusion issues with using any of *.wiki in an official capacity, especially as canonical redirectors for Wikipedia itself, which is why we didn't start using these many years ago when they were first offered for free.

May 25 2024, 2:05 PM · Traffic, Domains

May 23 2024

BBlack added a comment to T365630: Remove long term caching and active purging for Parsoid endpoints in RESTBase.

Re-routing sounds better? Or perhaps even-better would be a full-on redirect to the new parsoid URL paths?

The idea is that we don't implement long term caching with active purging for the new endpoints at all. If we don't need it for the old endpoints, we don't need it for the new endpoints. This would make the architecture and the migration a whole lot simpler.

It seems to me like a couple of minutes would be good enough at least for organic spikes. Do you think the edge cache is effective protection against DDoS?

May 23 2024, 5:57 PM · MW-1.43-notes (1.43.0-wmf.10; 2024-06-18), Patch-For-Review, Traffic, Content-Transform-Team, MW-Interfaces-Team, RESTBase Sunsetting

May 22 2024

BBlack added a comment to T365630: Remove long term caching and active purging for Parsoid endpoints in RESTBase.

I'm a little leery of dropping the TTL really-short. I get the argument for the normal case, but we also have to consider the possibility that something out there on the Internet could cause traffic surges to some of these URLs and we'd lose some amount of caching defenses against it with a short TTL (esp if we're also no longer pregenerating them, making such traffic more-expensive on the inside). Re-routing sounds better? Or perhaps even-better would be a full-on redirect to the new parsoid URL paths?

May 22 2024, 4:48 PM · MW-1.43-notes (1.43.0-wmf.10; 2024-06-18), Patch-For-Review, Traffic, Content-Transform-Team, MW-Interfaces-Team, RESTBase Sunsetting

May 17 2024

BBlack added a comment to T364126: Disable Chrome Private Prefetch Proxy.

What a fun deep-dive! :)

May 17 2024, 3:01 PM · Movement-Insights, Traffic

May 16 2024

BBlack added a comment to T364126: Disable Chrome Private Prefetch Proxy.
  • They do have a lot of presence all over the world. Presence we don't have currently and would take us decades to obtain. And via CP3, they effectively end up being a CDN. If we can take advantage of that, we might be able to reap some of the benefits we reap by setting up our own PoPs, all without having to pay the cost of setting up our own PoPs. It is conceivable that we end up giving people in underrepresented areas of the world better performance for our sites, effectively furthering the mission.
May 16 2024, 8:19 PM · Movement-Insights, Traffic
BBlack added a comment to T364126: Disable Chrome Private Prefetch Proxy.

Media (images, video, etc) are served from upload.wikimedia.org and are requested without Cookies.

May 16 2024, 1:26 AM · Movement-Insights, Traffic

May 15 2024

BBlack created P62422 MSS from getsockopt.
May 15 2024, 6:58 PM
BBlack created P62421 tcpdump mss issue.
May 15 2024, 6:01 PM · Traffic

May 14 2024

BBlack added a comment to T363695: Create a Wikimedia login domain that can be served by any wiki.

Also similarly T214998

May 14 2024, 11:54 AM · Security, SUL3, MediaWiki-extensions-CentralAuth, MediaWiki-Platform-Team
BBlack added a comment to T363695: Create a Wikimedia login domain that can be served by any wiki.

T215071 <- throwing this in here for semi-related context. Maybe we can align on a potential common future URI scheme anyways, while not actually yet tackling that one.

May 14 2024, 11:54 AM · Security, SUL3, MediaWiki-extensions-CentralAuth, MediaWiki-Platform-Team

May 10 2024

BBlack closed T363514: Requesting access to analytics-privatedata-users for YLiou_WMF (no server access) as Resolved.

Should be all set, may take up to ~30 minutes for changes to propagate.

May 10 2024, 5:06 PM · SRE, SRE-Access-Requests
BBlack updated the task description for T363514: Requesting access to analytics-privatedata-users for YLiou_WMF (no server access).
May 10 2024, 4:57 PM · SRE, SRE-Access-Requests

May 8 2024

BBlack added a comment to T364359: Restore nshahquinn-wmf and hghani to analytics-product-users.

The patch should fix things up, let me know if there's still problems after ~half an hour to let the change propagate through the systems.

May 8 2024, 7:34 PM · SRE, SRE-Access-Requests, Movement-Insights
BBlack added a comment to T364400: map the /api/ prefix to /w/rest.php.

Could we implement this remapping at the ATS layer rather than the Apache one, in a manner that would mean that when we need to cache we only need to store each effective URL once?

May 8 2024, 12:48 PM · serviceops, Traffic, MW-Interfaces-Team

May 3 2024

BBlack added a comment to T363722: Craft geo-maps file to create lowest-latency routes from south america.

We could choose to use subdivision-level mapping in cases where it makes sense.

May 3 2024, 4:43 PM · Traffic

Mar 27 2024

BBlack closed T361046: Requesting access to analytics-privatedata-users for bblack as Resolved.
Mar 27 2024, 3:05 PM · Patch-For-Review, SRE, SRE-Access-Requests
BBlack updated the task description for T361046: Requesting access to analytics-privatedata-users for bblack.
Mar 27 2024, 2:34 PM · Patch-For-Review, SRE, SRE-Access-Requests
BBlack updated the task description for T361046: Requesting access to analytics-privatedata-users for bblack.
Mar 27 2024, 1:49 AM · Patch-For-Review, SRE, SRE-Access-Requests

Mar 26 2024

BBlack updated the task description for T361046: Requesting access to analytics-privatedata-users for bblack.
Mar 26 2024, 7:17 PM · Patch-For-Review, SRE, SRE-Access-Requests
BBlack created T361046: Requesting access to analytics-privatedata-users for bblack.
Mar 26 2024, 7:13 PM · Patch-For-Review, SRE, SRE-Access-Requests

Feb 12 2024

ayounsi awarded T140365: Lower geodns TTLs from 600 (10min) to 300 (5min) a Like token.
Feb 12 2024, 7:22 AM · Traffic, SRE

Feb 9 2024

CDanis awarded T140365: Lower geodns TTLs from 600 (10min) to 300 (5min) a Love token.
Feb 9 2024, 7:24 PM · Traffic, SRE

Jan 19 2024

BBlack added a comment to T355446: Synchronize and rotate TCP Fastopen keys for various use-cases.

We discussed this in Traffic earlier this week, and I ended up implementing what I think is a reasonable solution already, so now I've made this ticket for the paper trail and to cover the followup work to debianize and usefully-deploy it. The core code for it is published at https://github.com/blblack/tofurkey .

Jan 19 2024, 7:17 PM · Traffic
BBlack triaged T355446: Synchronize and rotate TCP Fastopen keys for various use-cases as Medium priority.
Jan 19 2024, 7:14 PM · Traffic

Dec 5 2023

BBlack added a comment to T352744: OpenSSL 3.x performance issues.

The perf issues are definitely relevant for traffic's use of haproxy (in a couple of different roles). Your option (making a libssl1.1-dev for bookworm that tracks the sec fixes that are still done for the bullseye case, and packaging our haproxy to build against it) would be the easiest path from our POV, for these cases.

Dec 5 2023, 2:21 PM · SRE-swift-storage, Traffic

Nov 29 2023

BBlack added a comment to T345939: Create metrics/monitoring of fifo-log-demux.

Followup: did a 3-minute test of the same pair of parameter changes on cp3066 for a higher-traffic case. No write failures detected via strace in this case (we don't have the error log outputs to go by in 9.1 builds). mtail CPU usage at 10ms polling interval was significantly higher than it was in ulsfo, but still seems within reason overall and not saturating anything.

Nov 29 2023, 3:33 PM · Traffic
BBlack added a comment to T345939: Create metrics/monitoring of fifo-log-demux.

I went on a different tangent with this problem, and tried to figure out why we're having ATS fail writes to the notpurge log pipe in the first place. After some hours of digging around this problem (I'll spare you endless details of temporary test configs and strace outputs of various daemons' behavior, etc), these are the basic issues I see:

Nov 29 2023, 3:19 PM · Traffic

Nov 22 2023

BBlack created P53731 screenlocker script.
Nov 22 2023, 6:59 PM
BBlack created P53730 Triggering go template errors (tpl).
Nov 22 2023, 5:46 PM · Traffic
BBlack created P53729 Triggering go template errors (script).
Nov 22 2023, 5:45 PM · Traffic

Nov 9 2023

BBlack created T350869: cr2-eqiad xe-3/2/2 has errors for the past ~week.
Nov 9 2023, 1:54 PM · Infrastructure-Foundations, netops

Nov 7 2023

BBlack added a comment to T350354: Do we need to generate aggregates for LVS service IP ranges?.

I don't suspect it serves any real purpose at present, unless it was to avoid some filtering that exists elsewhere to avoid cross-site sharing of /32 routes or something.

Nov 7 2023, 2:11 PM · netops, Infrastructure-Foundations, SRE

Oct 19 2023

BBlack edited P53018 Example grafana text panel to pick specific absolute time ranges.
Oct 19 2023, 3:45 PM · Traffic
BBlack updated the task description for T349314: cp3079 bios settings.
Oct 19 2023, 3:37 PM · DC-Ops, ops-esams, SRE, Traffic
BBlack created T349314: cp3079 bios settings.
Oct 19 2023, 3:37 PM · DC-Ops, ops-esams, SRE, Traffic
BBlack created P53018 Example grafana text panel to pick specific absolute time ranges.
Oct 19 2023, 3:25 PM · Traffic

Oct 16 2023

BBlack added a comment to T348837: Investigate IPVS IPIP encapsulation support.

One potential issue with relying solely on MSS reduction is that, obviously, it only affects TCP. For now this is fine, as long as we're only using LVS (or future liberica) for TCP traffic (I think that's currently the case for LVS anyways!), but we could add UDP-based things in the future (e.g. DNS and QUIC/HTTP3), at which point we'll have to solve these problems differently.

Oct 16 2023, 2:46 PM · Patch-For-Review, SRE, Traffic
BBlack added a comment to T348837: Investigate IPVS IPIP encapsulation support.

The one thing you may not be able to control with mtu/advmss on a route is traffic to the local subnet, as that route is added by the kernel when the IP is added to the interface. Not sure if that can be modified to differ from interface MTU.

Oct 16 2023, 2:21 PM · Patch-For-Review, SRE, Traffic
BBlack added a comment to T348837: Investigate IPVS IPIP encapsulation support.

Could we take the opposite approach with the MTU fixup for the tunneling, and arrange the host/interface settings on both sides (the LBs and the target hosts) such that they only use a >1500 MTU on the specific unicast routes for the tunnels, but default to their current 1500 for all other traffic? If per-route MTU can usefully be set higher than base interface MTU, this seems trivial, but even if not, surely with some set of ip commands we could set the iface MTU to the higher value, while clamping it back down to 1500 for all cases except the tunnel.

Oct 16 2023, 1:09 PM · Patch-For-Review, SRE, Traffic
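
The two options discussed in the T348837 comments above (shrinking the TCP MSS versus raising the MTU only on the tunnel routes) come down to simple header arithmetic. The sketch below works it through in Python, assuming option-free headers (20 bytes for IPv4, 40 for IPv6, 20 for TCP) and a 1500-byte base MTU; the function names are made up for illustration and do not correspond to any existing tooling.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
IPV6_HEADER = 40  # bytes, fixed header only
TCP_HEADER = 20   # bytes, assuming no TCP options


def tcp_mss(path_mtu, inner_ip_header, outer_ip_header=0):
    """Largest TCP payload that fits in one packet, with optional IP-in-IP encapsulation."""
    return path_mtu - outer_ip_header - inner_ip_header - TCP_HEADER


def tunnel_route_mtu(inner_mtu, outer_ip_header):
    """MTU the tunnel route needs so inner packets can stay at inner_mtu."""
    return inner_mtu + outer_ip_header


plain_mss = tcp_mss(1500, IPV4_HEADER)                  # no tunnel
ipip_mss = tcp_mss(1500, IPV4_HEADER, IPV4_HEADER)      # IPv4-in-IPv4, MSS reduced
ip6ip6_mss = tcp_mss(1500, IPV6_HEADER, IPV6_HEADER)    # IPv6-in-IPv6, MSS reduced
raised_mtu = tunnel_route_mtu(1500, IPV4_HEADER)        # alternative: bigger tunnel MTU

print(plain_mss, ipip_mss, ip6ip6_mss, raised_mtu)
```

The MSS numbers only help TCP, which is the limitation the comment on UDP-based services points at; the raised tunnel MTU applies to any inner protocol.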

Oct 11 2023

BBlack created P52909 HA example.
Oct 11 2023, 4:25 PM

Oct 3 2023

BBlack added a comment to T348041: Remove static routes for ns[01] and replace their announcements with bird.

Looks about right to me!

Oct 3 2023, 7:52 PM · netops, SRE, Traffic, Infrastructure-Foundations
BBlack added a comment to T346165: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run".

We could add some normalization function at the ferm or puppet-dns-lookup layer perhaps (lowercase and do the zeros in a consistent way)?

Oct 3 2023, 3:12 PM · Patch-For-Review, cloud-services-team, Dumps-Generation, Data-Platform-SRE
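
A normalization step like the one suggested in the T346165 comment above could look roughly like this in Python, assuming the values being compared are IP address literals (the "lowercase and zeros" wording suggests IPv6). The function name `normalize_address` is hypothetical; the actual fix would live in ferm or the Puppet DNS lookup code rather than in Python.

```python
import ipaddress


def normalize_address(text):
    """Return one canonical spelling of an IPv4/IPv6 literal (lowercase, zeros compressed)."""
    return ipaddress.ip_address(text.strip()).compressed


# Different spellings of the same (documentation-range) IPv6 address.
variants = [
    "2001:0DB8:0000:0000:0000:0000:0000:0001",
    "2001:DB8::1",
    "2001:db8:0:0:0:0:0:1",
]
normalized = {normalize_address(v) for v in variants}
print(normalized)
```

Comparing normalized strings instead of raw ones means two runs that resolve the same address in different spellings no longer look like a change.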

Sep 27 2023

BBlack renamed T342159: Q1:rack/setup/install cp11[00-15] from Q1:rack/setup/install cp1[098-113] to Q1:rack/setup/install cp11[00-15].
Sep 27 2023, 6:23 PM · SRE, ops-eqiad, Traffic, DC-Ops

Sep 25 2023

BBlack added a comment to T323723: Alert on Varnish high thread count.

To clarify and expand on my position about this thread count parameter (which is really just a side-issue related to this ticket, which is fundamentally complete):

Sep 25 2023, 3:56 PM · Patch-For-Review, SRE, Traffic

Sep 22 2023

BBlack added a comment to T342159: Q1:rack/setup/install cp11[00-15].

Adding to the confusion: historically, we once used the hostname cp1099 back in 2015 for a one-off host: T96873 - therefore that name already exists in both phab and git history, confusingly.

Sep 22 2023, 1:03 PM · SRE, ops-eqiad, Traffic, DC-Ops
BBlack added a comment to T342159: Q1:rack/setup/install cp11[00-15].

Reading a little deeper on this, I think we still have a hostnames issue. If those other 8 hosts are indeed being brought from ulsfo+eqsin, they would, I presume, be 1091-8, and so these hosts should start at 1099, not 1098?

Sep 22 2023, 1:00 PM · SRE, ops-eqiad, Traffic, DC-Ops
BBlack added a comment to T342159: Q1:rack/setup/install cp11[00-15].

@VRiley-WMF - Sukhbir's out right now, but I've updated the racking plan on his behalf!

Sep 22 2023, 12:48 PM · SRE, ops-eqiad, Traffic, DC-Ops
BBlack updated the task description for T342159: Q1:rack/setup/install cp11[00-15].
Sep 22 2023, 12:47 PM · SRE, ops-eqiad, Traffic, DC-Ops

Sep 18 2023

BBlack updated the task description for T346640: Traffic cache daemon restart scripts need some rework.
Sep 18 2023, 2:27 PM · SRE, Traffic
BBlack triaged T346640: Traffic cache daemon restart scripts need some rework as Medium priority.
Sep 18 2023, 2:26 PM · SRE, Traffic

Sep 15 2023

BBlack added a comment to T337446: Rebuild sanitarium hosts.

There's a followup commit that was never merged, to re-enable pybal health monitoring on all the wikireplicas: https://gerrit.wikimedia.org/r/c/operations/puppet/+/924508/1/hieradata/common/service.yaml

Sep 15 2023, 7:38 PM · User-notice-archive, TaxonBot, cloud-services-team, Data-Engineering, Data-Services, DBA

Sep 14 2023

BBlack added a comment to T345809: Do we need ping offload servers at all POPs?.

https://grafana.wikimedia.org/d/000000513/ping-offload might be a good starting point (might need some updates/tweaking to get the exact data you want, though)

Sep 14 2023, 7:31 PM · Infrastructure-Foundations, Traffic, netops, SRE
BBlack added a comment to T345809: Do we need ping offload servers at all POPs?.

some sort of rate-limiting configured on the switch-side for ICMP echo, which was IP-aware and didn't count packets from our own internal systems

Sep 14 2023, 7:29 PM · Infrastructure-Foundations, Traffic, netops, SRE

Sep 8 2023

BBlack added a comment to T345809: Do we need ping offload servers at all POPs?.

Reading into the code above and the history more and self-correcting: the ratelimiter doesn't apply to PTB packets, just some other informational packets. Apparently we bumped the ratelimiter first as a short-term mitigation (for all the sites), I guess primarily to avoid what looks like ping loss to our monitoring and/or users, then deployed the ping offloader in some places as well as a better way to deal with it (and I guess at thousands per second, the pps reduction probably is useful, although I don't know to what degree).

Sep 8 2023, 12:57 PM · Infrastructure-Foundations, Traffic, netops, SRE
BBlack added a comment to T345809: Do we need ping offload servers at all POPs?.

The current puppetized tuneables are at: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/8ed59718c7a7603b61d7d42e05726fd11dae5eaa/modules/lvs/manifests/kernel_config.pp#49

Sep 8 2023, 12:52 PM · Infrastructure-Foundations, Traffic, netops, SRE
BBlack added a comment to T345809: Do we need ping offload servers at all POPs?.

to reduce load on LVS hosts

Sep 8 2023, 12:48 PM · Infrastructure-Foundations, Traffic, netops, SRE

Sep 5 2023

BBlack added a comment to T345334: Cache thumbs in our caching infrastructure (e.g. ATS).

This topic probably deserves a ~hour meeting w/ Traffic to hash out some of the potential solutions and tradeoffs, but I'm gonna try to bullet-point my way through a few points for now anyways to seed further discussion:

Sep 5 2023, 5:57 PM · SRE, Thumbor, SRE-swift-storage, Traffic

Jun 12 2023

BBlack added a comment to T337535: Figure out what changes are needed in the traffic layer for having codfw be the r/w DC for half a year.

The more I've thought about this issue, I think we should probably stick with the (very approximate) latency mapping we have, and not try to have a second setup to optimize for the codfw-primary case. I do think we should swap the core DCs at the front of the global default entry on switchover, though, and the patch above makes that spot a little more visible. There shouldn't be any hard dependencies between this and other steps, but it could be done around the start of the switchover process asynchronously.

Jun 12 2023, 4:59 PM · SRE, Traffic, serviceops, Datacenter-Switchover

May 31 2023

BBlack added a comment to T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki).

We've got a pair of patches to review now which configure this on the pybal and safe-service-restart sides. We could especially use serviceops input on the latter. None of it's particularly pretty, but at least it's fairly succinct and seems to do the job!

May 31 2023, 2:29 PM · Patch-For-Review, SRE-OnFire, Sustainability (Incident Followup), serviceops, Traffic, conftool

May 30 2023

BBlack added a comment to T337446: Rebuild sanitarium hosts.

Note: I restored+amended https://gerrit.wikimedia.org/r/c/operations/puppet/+/924342 and merged+deployed it on lvs1018+lvs1020. This seems to work and disable the problematic monitoring that impacts LVS itself.

May 30 2023, 1:25 PM · User-notice-archive, TaxonBot, cloud-services-team, Data-Engineering, Data-Services, DBA

May 27 2023

BBlack added a comment to T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki).

As you seem to be working on this I'm bluntly assigning to you as part of the incident followup.

May 27 2023, 12:47 AM · Patch-For-Review, SRE-OnFire, Sustainability (Incident Followup), serviceops, Traffic, conftool

Apr 27 2023

BBlack added a comment to T334048: Cookbook to depool a site in AuthDNS.

I like this direction (etcd). It's not super-trivial, but we've complained a lot even internally about the lack of etcd support for depooling whole sites at the public edge.

Apr 27 2023, 7:41 PM · Traffic

Apr 26 2023

BBlack added a comment to T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki).

Probably needs subtasks for two things:

  1. Fix "safe-service-restart.py" being unsafe (either it or its caller is failing to propagate an error upstream to stop the carnage, and it is also leaving a node depooled when the error happens between the depool and repool operations. At least one of those needs fixing, if not both).
  2. The whole 'template the local appservers.svc IP into the "instrumentation_ips"' thing at the pybal level, plus whatever changes are needed to use it from the scap side of things (so that it only checks one local pybal, and it's the correct one by current pooling).
Apr 26 2023, 9:08 PM · Patch-For-Review, SRE-OnFire, Sustainability (Incident Followup), serviceops, Traffic, conftool
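
The first item in the list above names two failure modes: an error that is not propagated upstream, and a node left depooled when the failure happens between depool and repool. A minimal Python sketch of a wrapper that avoids both is shown here. It is not the real safe-service-restart.py; `depool`, `restart`, and `repool` are hypothetical callables standing in for the real operations, and whether a node should be repooled after a failed restart is a policy choice this sketch does not settle.

```python
def safe_restart(depool, restart, repool):
    """Depool, restart, and always attempt to repool; any error reaches the caller."""
    depool()  # if this raises, we stop before restarting anything
    try:
        restart()
    finally:
        # Runs on success and on failure, so the node is not left depooled.
        repool()


# Demonstration with stand-in callables that record what happened.
events = []


def failing_restart():
    events.append("restart")
    raise RuntimeError("restart failed")


try:
    safe_restart(
        lambda: events.append("depool"),
        failing_restart,
        lambda: events.append("repool"),
    )
    propagated = False
except RuntimeError:
    propagated = True  # the caller sees the failure and can halt a wider rollout

print(events, propagated)
```

Because the exception is not swallowed, a caller looping over many nodes can stop at the first failure instead of continuing.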
BBlack created P47284 Something like this.
Apr 26 2023, 6:48 PM
BBlack added a comment to T334467: Can't retrieve HTML from REST API .

The patch has been rolled out everywhere for a little while at this point, should be able to confirm success

Apr 26 2023, 4:56 PM · RESTBase-API, API Platform, Wikimedia Enterprise
BBlack added a comment to T334467: Can't retrieve HTML from REST API .

We had a brief meeting on this, and I think the actual problem and immediate workaround is actually much simpler than we imagined. We're going to apply the same workaround we did for MediaWiki traffic in T238285 ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/882663/ ) to the Restbase traffic for now. Patch incoming shortly!

Apr 26 2023, 3:00 PM · RESTBase-API, API Platform, Wikimedia Enterprise

Apr 25 2023

BBlack added a comment to T263797: Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh').

I think we need to rewind a step here. We do want mh, but we want it for the current public sh cases (basically: text and upload ports 80+443), and maybe the other three sh cases (kibana + thanos), although we can start with text+upload first and then talk about those others with the respective teams. The current ticket description and patches seem to be going after the opposite: switching the current wrr services to mh via hieradata and spicerack changes. I think this would be actively harmful. sh and mh choose the destination based on hashes of the source address, which is great for public-facing traffic, but would be hashing on our very limited set of internal cache exit IPs (or other internal service clusters for internal LVS'd traffic), and so it wouldn't balance very well at all. One could potentially address that by including the source port in the hash, but it still seems like it would be more-complicated and less-optimal than just sticking with wrr for these cases.

Apr 25 2023, 1:48 PM · Traffic
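
The balancing concern in the T263797 comment above can be seen with a toy model. The Python sketch below uses a plain hash-modulo pick, which is an assumption for illustration and is neither the kernel's sh scheduler nor Maglev hashing; the backend names and all addresses are synthetic. It compares how many backends receive traffic when the sources are thousands of distinct clients versus a handful of internal exit IPs.

```python
import hashlib
import ipaddress
from collections import Counter


def pick_backend(source_ip, backends):
    """Choose a backend from a hash of the source address only (no port)."""
    digest = hashlib.sha256(ipaddress.ip_address(source_ip).packed).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


def spread(sources, backends):
    """Count how many distinct sources land on each backend."""
    return Counter(pick_backend(ip, backends) for ip in sources)


backends = [f"backend{i}" for i in range(16)]  # hypothetical pool

# Many distinct client addresses (synthetic), as with public-facing traffic.
many_sources = [str(ipaddress.ip_address("10.0.0.0") + i) for i in range(5000)]
# A very small set of internal source addresses (synthetic).
few_sources = ["10.64.0.1", "10.64.0.2", "10.64.0.3", "10.64.0.4"]

used_by_many = len(spread(many_sources, backends))
used_by_few = len(spread(few_sources, backends))
print(used_by_many, used_by_few)
```

With only four source addresses, at most four of the sixteen backends can ever be chosen, whatever the hash function, which is why a source-independent scheduler such as wrr suits that case better.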

Apr 17 2023

BBlack added a comment to T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki).

So, the solution quoted from my IRC chat above: that's about making the depool verification code actually track the currently-live "low-traffic" (applayer/internal) LVS routing, as opposed to what it's doing now (which I think checks the primary+secondary for the role as-configured in puppet, which doesn't account for any failure/depool/etc at the LVS layer).

Apr 17 2023, 2:47 PM · Patch-For-Review, SRE-OnFire, Sustainability (Incident Followup), serviceops, Traffic, conftool

Apr 14 2023

BBlack added a comment to T332024: GeoIP mapping experiments.

It's awesome to see this moving along! One minor point:

Apr 14 2023, 7:08 PM · Patch-For-Review, SRE, Infrastructure-Foundations, Traffic

Apr 13 2023

BBlack added a comment to T331356: Wikidata seems to still be utilizing insecure HTTP URIs.

Some remarks:

  • We should consider these canonical HTTP URIs to be names in the first place, which are unique worldwide and issued by the Wikidata project as the "owner" [1] of the wikidata.org domain. The purpose of these names is to identify things.
Apr 13 2023, 6:32 PM · wmde-wikidata-tech, SRE, [DEPRECATED] wdwb-tech, Traffic, Wikidata

Mar 30 2023

BBlack created P46001 Bard on FF/CRLite.
Mar 30 2023, 6:50 PM

Mar 13 2023

BBlack added a comment to T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]00[123]..

Resilient hashing indeed sounds much better (it seems like that's their codeword for some internal "consistent hashing" implementation), but it doesn't look like our current router OS has it, at least not when I looked at cr1-eqiad.

Mar 13 2023, 5:02 PM · SRE, Traffic

Mar 6 2023

BBlack closed T330906: HTTP URIs do not resolve from NL and DE? as Resolved.

The redirects are neither good nor bad, they're instead both necessary (although that necessity is waning) and insecure. We thought we had standardized on all canonical URIs being of the secure variant ~8 years ago, and this oversight has flown under the radar since then, only to be exposed recently when we intentionally (for unrelated operational reasons) partially degraded our port 80 services.

Mar 6 2023, 9:26 PM · [DEPRECATED] wdwb-tech, Wikidata, SRE, Traffic
BBlack triaged T331356: Wikidata seems to still be utilizing insecure HTTP URIs as High priority.
Mar 6 2023, 9:25 PM · wmde-wikidata-tech, SRE, [DEPRECATED] wdwb-tech, Traffic, Wikidata