Maniphest T190960

1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query"
Closed, ResolvedPublicPRODUCTION ERROR
Actions

Assigned To

Authored By

	• mmodell
	Mar 28 2018, 7:59 PM

Description

During the time that 1.31.0-wmf.27 was deployed to group1, implicit temporary tables shot up to ~4x the baseline, as can be seen in this grafana view

The error is coming from line 258 of RefreshLinksJob.php where the code calls commitAndWaitForReplication:

$lbFactory->commitAndWaitForReplication( __METHOD__, $ticket );

stack trace

#0 /srv/mediawiki/php-1.31.0-wmf.27/includes/libs/rdbms/loadbalancer/LoadBalancer.php(639): Wikimedia\Rdbms\DatabaseMysqlBase->masterPosWait(Wikimedia\Rdbms\MySQLMasterPos, double)
#1 /srv/mediawiki/php-1.31.0-wmf.27/includes/libs/rdbms/loadbalancer/LoadBalancer.php(534): Wikimedia\Rdbms\LoadBalancer->doWait(integer, boolean, double)
#2 /srv/mediawiki/php-1.31.0-wmf.27/includes/libs/rdbms/lbfactory/LBFactory.php(367): Wikimedia\Rdbms\LoadBalancer->waitForAll(Wikimedia\Rdbms\MySQLMasterPos, double)
#3 /srv/mediawiki/php-1.31.0-wmf.27/includes/libs/rdbms/lbfactory/LBFactory.php(419): Wikimedia\Rdbms\LBFactory->waitForReplication(array)
#4 /srv/mediawiki/php-1.31.0-wmf.27/includes/jobqueue/jobs/RefreshLinksJob.php(290): Wikimedia\Rdbms\LBFactory->commitAndWaitForReplication(string, integer)
#5 /srv/mediawiki/php-1.31.0-wmf.27/includes/jobqueue/jobs/RefreshLinksJob.php(122): RefreshLinksJob->runForTitle(Title)
#6 /srv/mediawiki/php-1.31.0-wmf.27/extensions/EventBus/includes/JobExecutor.php(59): RefreshLinksJob->run()
#7 /srv/mediawiki/rpc/RunSingleJob.php(79): JobExecutor->execute(array)
#8 {main}

Timeline

19:24	twentyafterfour@tin:	Synchronized php: group1 wikis to 1.31.0-wmf.26 (duration: 01m 17s)
19:22	twentyafterfour@tin:	rebuilt and synchronized wikiversions files: group1 wikis to 1.31.0-wmf.26
19:20	twentyafterfour:	Rolling back to wmf.26 due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query"
19:19	twentyafterfour:	rolling back to wmf.26
19:18	icinga-wm	PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0]
		https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen&from=1522263260081&to=1522265839537
19:17	twentyafterfour:	I'm seeing quite a few "[{exception_id}] {exception_url} Wikimedia\Rdbms\DBExpectedError: Replication wait failed: Lost connection to MySQL server during query
19:12	milimetric@tin:	Finished deploy [analytics/refinery@c22fd1e]: Fixing python import bug (duration: 02m 48s)
19:09	milimetric@tin:	Started deploy [analytics/refinery@c22fd1e]: Fixing python import bug
19:09	milimetric@tin:	Started deploy [analytics/refinery@c22fd1e]: (no justification provided)
19:06	twentyafterfour@tin:	Synchronized php: group1 wikis to 1.31.0-wmf.27 (duration: 01m 17s)
19:05	twentyafterfour@tin:	rebuilt and synchronized wikiversions files: group1 wikis to 1.31.0-wmf.27

Incident report

https://wikitech.wikimedia.org/wiki/Incident_documentation/20180229-Train-1.31.0-wmf.27

Details

Subject	Repo	Branch	Lines +/-
rdbms: avoid lag estimates in getLagFromPtHeartbeat ruined by snapshots	mediawiki/core	wmf/1.31.0-wmf.27	+15 -2
rdbms: avoid lag estimates in getLagFromPtHeartbeat ruined by snapshots	mediawiki/core	master	+15 -2
Normalize and lower the default DB lag wait timeout	mediawiki/core	wmf/1.31.0-wmf.27	+2 -2
Normalize and lower the default DB lag wait timeout	mediawiki/core	master	+2 -2

Customize query in gerrit

Revisions and Commits

rMW MediaWiki
	rMWceb7d61ee7ef rdbms: make getMasterPos() ignore GTIDs outside of gtid_domain_id

Related Objects
Search...

Status	Subtype	Assigned	Task
Open		None	T119626 Eliminate SPOF at the main database infrastructure
Resolved		aaron	T88445 MediaWiki active/active datacenter investigation and work (tracking)
			Restricted Task
Open		None	T193258 Look into ways to identify risky commits that are part of the weekly branch cut, either manually or through automated process
Resolved	PRODUCTION ERROR	• mmodell	T190960 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query"

Event Timeline

• mmodell created this task.Mar 28 2018, 7:59 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 28 2018, 7:59 PM

• mmodell updated the task description. (Show Details)Mar 28 2018, 8:03 PM

• mmodell updated the task description. (Show Details)

• mmodell triaged this task as Unbreak Now! priority.Mar 28 2018, 8:06 PM

Restricted Application added subscribers: Liuxinyu970226, TerraCodes. · View Herald TranscriptMar 28 2018, 8:06 PM

@aaron as our DBAs are out, can you help here or not?

@Niharika found suspect queries in tendril: https://tendril.wikimedia.org/report/slow_queries_checksum?checksum=2d03f574e8b789aec61dc623f2b45ad2&host=family%3Adb1109&user=&schema=&hours=1%2F32

• mmodell added a parent task: T183966: 1.31.0-wmf.27 deployment blockers.Mar 28 2018, 8:16 PM

Change 422547 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Normalize and lower the default DB lag wait timeout

https://gerrit.wikimedia.org/r/422547

gerritbot added a project: Patch-For-Review.Mar 28 2018, 8:50 PM

Change 422565 had a related patch set uploaded (by 20after4; owner: Aaron Schulz):
[mediawiki/core@wmf/1.31.0-wmf.27] Normalize and lower the default DB lag wait timeout

https://gerrit.wikimedia.org/r/422565

Change 422547 merged by jenkins-bot:
[mediawiki/core@master] Normalize and lower the default DB lag wait timeout

https://gerrit.wikimedia.org/r/422547

@aaron: Can you explain how your patch addresses the issue? I'm not confident that it'll fix the root cause, even if it does work around the connection issues.

Change 422565 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.27] Normalize and lower the default DB lag wait timeout

https://gerrit.wikimedia.org/r/422565

ReleaseTaggerBot added a project: MW-1.31-release-notes (WMF-deploy-2018-04-03 (1.31.0-wmf.28)).Mar 28 2018, 10:00 PM

Mentioned in SAL (#wikimedia-operations) [2018-03-28T22:03:43Z] <twentyafterfour> syncing https://gerrit.wikimedia.org/r/#/c/422565/ refs T190960 T183966

Stashbot mentioned this in T183966: 1.31.0-wmf.27 deployment blockers.Mar 28 2018, 10:03 PM

Mentioned in SAL (#wikimedia-operations) [2018-03-28T22:08:49Z] <twentyafterfour> rolling forward group1 to 1.31.0-wmf.27 refs T183966 T190960

Mentioned in SAL (#wikimedia-operations) [2018-03-28T22:09:21Z] <twentyafterfour@tin> rebuilt and synchronized wikiversions files: sync https://gerrit.wikimedia.org/r/#/c/422563/ group1 wikis to 1.31.0-wmf.27 refs T183966 T190960

IKhitron subscribed.Mar 28 2018, 11:57 PM

• Mholloway subscribed.Mar 29 2018, 12:10 AM

The temp table stat increases just seem like noise due to some queries going from "SELECT @@" to "SHOW GLOBAL VARIABLES LIKE 'gtid_%'. E.g.:

mysql> show status like 'Created_%';
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| Created_tmp_disk_tables | 0     |
| Created_tmp_files       | 6     |
| Created_tmp_tables      | 2     |
+-------------------------+-------+
3 rows in set (0.00 sec)

mysql> show variables like 'gtid%';
+----------------------------------+------------------------------------------+
| Variable_name                    | Value                                    |
+----------------------------------+------------------------------------------+
| gtid_executed_compression_period | 1000                                     |
| gtid_mode                        | OFF                                      |
| gtid_next                        | AUTOMATIC                                |
| gtid_owned                       |                                          |
| gtid_purged                      | fae78223-0aae-11e8-ba79-340286975770:1-2 |
+----------------------------------+------------------------------------------+
5 rows in set (0.01 sec)

mysql> show status like 'Created_%';
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| Created_tmp_disk_tables | 0     |
| Created_tmp_files       | 6     |
| Created_tmp_tables      | 3     |
+-------------------------+-------+
3 rows in set (0.00 sec)

Volans subscribed.Mar 29 2018, 8:20 AM

@aaron: can we revert whatever patch made this change (from SELECT to SHOW) so we can roll forward? The DBAs are not around this week and it's a short week.

• mmodell added a commit: rMWceb7d61ee7ef: rdbms: make getMasterPos() ignore GTIDs outside of gtid_domain_id.Mar 29 2018, 6:25 PM

@aaron I think the temp tables are caused by holding open transactions for longer which is what would happen if we were waiting a long time for a slave to catch up before committing the transaction. Setting the timeout lower would make the situation better but not entirely resolved it.

Can we revert rMWceb7d61ee7ef: rdbms: make getMasterPos() ignore GTIDs outside of gtid_domain_id for now and merge it again next week? I'd be more comfortable with doing it when we have a DBA around to help verify that we aren't killing the database.

The reason I don't think the temp tables are noise is because they correlate with the timing of a large jump in failed connections to mysql. So correlation isn't causation but nonetheless something is causing the connection errors. If it's not this change then we have to figure out what is causing it. If we still have the errors without the gtid change then at least we've proved what isn't causing it.

Just to be explicit: we're now passed the point of no return (for this week). We can't safely do group1 and group 2 before the end of the day; especially as tomorrow is a WMF Holiday.

The train is now halted until next week.

I don't mean "noise" as "unrelated to deploy", rather "expected, but doesn't matter".

I'm curious why 60 second timeouts were still showing last go. That should only be possible in CLI mode, which JobExecuter should not be in. Something is wrong there.

@aaron: What I saw with the second deployment was a ton of errors like the following:

SlowTimer [10000ms] at runtime/ext_mysql: slow query: SELECT MASTER_GTID_WAIT(...)

I apologize for not updating the ticket with that last detail. I logged that in the SAL but I forgot to mention T190960 so it didn't get logged here.

SlowTimer [10000ms] at runtime/ext_mysql: slow query: SELECT MASTER_GTID_WAIT(...)

That is HHVM complaining a query took more than 10 seconds when MASTER_GTID_WAIT() is being passed a 10 seconds timeout. That is/has been the cause of spam log previously but I could not find a reference to an existing task :(

The temp table stat increases just seem like noise due to some queries going from "SELECT @@" to "SHOW GLOBAL VARIABLES LIKE 'gtid_%'.

Then I guess those queries should be moved back to use a SELECT. Adding a large baseline of noise to the mysql internal monitoring metrics is probably going to be a pain to the DBA. If later on there is an increase of temporary tables creations which hidden by that noise, it is going to hurt surely?

Not sure whether all SHOW can be converted back to SELECT but surely that should be doable.

For reference this is the max replication lag between all eqiad DBs in that time frame (Mar. 28th, ~18:30-20:30), from where seems pretty clear that there was no sensible lag at all in that time frame.

Change 423069 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: avoid lag estimates in getLagFromPtHeartbeat ruined by snapshots

https://gerrit.wikimedia.org/r/423069

Liuxinyu970226 added a project: User-notice.Mar 31 2018, 3:01 AM

Liuxinyu970226 moved this task from To Triage to In current Tech/News draft on the User-notice board.

Change 423069 merged by jenkins-bot:
[mediawiki/core@master] rdbms: avoid lag estimates in getLagFromPtHeartbeat ruined by snapshots

https://gerrit.wikimedia.org/r/423069

Marostegui moved this task from Triage to Blocked external/Not db team on the DBA board.Apr 2 2018, 12:12 PM

Change 423488 had a related patch set uploaded (by 20after4; owner: Aaron Schulz):
[mediawiki/core@wmf/1.31.0-wmf.27] rdbms: avoid lag estimates in getLagFromPtHeartbeat ruined by snapshots

https://gerrit.wikimedia.org/r/423488

Change 423488 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.27] rdbms: avoid lag estimates in getLagFromPtHeartbeat ruined by snapshots

https://gerrit.wikimedia.org/r/423488

ReleaseTaggerBot edited projects, added MW-1.31-release-notes (WMF-deploy-2018-03-27 (1.31.0-wmf.27)); removed MW-1.31-release-notes (WMF-deploy-2018-04-03 (1.31.0-wmf.28)).Apr 2 2018, 6:00 PM

Mentioned in SAL (#wikimedia-operations) [2018-04-02T19:06:49Z] <twentyafterfour> sync rdbms: avoid lag estimates in getLagFromPtHeartbeat ruined by snapshots Bug: T190960 Change-Id: I57dd8d3d0ca96d6fb2f9e83f062f29b1d53224dd

Mentioned in SAL (#wikimedia-operations) [2018-04-02T19:09:49Z] <twentyafterfour@tin> Synchronized php-1.31.0-wmf.27/includes/libs/rdbms/: sync I57dd8d3d0ca96d6fb2f9e83f062f29b1d53224dd refs T183966 T190960 (duration: 01m 19s)

Mentioned in SAL (#wikimedia-operations) [2018-04-02T19:41:54Z] <twentyafterfour@tin> Synchronized php-1.31.0-wmf.27/includes/libs/rdbms/database/DatabaseMysqlBase.php: sync rMW779e7fd3de08 refs T190960 (duration: 01m 16s)

Mentioned in SAL (#wikimedia-operations) [2018-04-02T20:38:11Z] <twentyafterfour@tin> Synchronized php-1.31.0-wmf.27/includes/libs/rdbms/database/DatabaseMysqlBase.php: sync rMW36c52351487a refs T190960 (duration: 01m 16s)

Mentioned in SAL (#wikimedia-operations) [2018-04-02T20:59:16Z] <twentyafterfour> MediaWiki Train: rolling back to 1.31.0-wmf.26 refs T183966, T190960

Mentioned in SAL (#wikimedia-operations) [2018-04-02T21:22:17Z] <twentyafterfour@tin> Synchronized php-1.31.0-wmf.27/includes/libs/rdbms/database/: Revert ceb7d61ee7ef3edc6705abd41ec86b3afcd9c491 refs T183966 T190960 (duration: 00m 59s)

The original patch that caused this has been reverted, unblocking the train.

I am not sure if this is related but in the last 12h, starting around 7pm, the following message has caused almost 1.2M errors:

message	       	Using cached lag value for 10.64.16.84:3311 due to active transaction
t  method	       	Wikimedia\Rdbms\DatabaseMysqlBase::getLagFromPtHeartbeat
t  mwversion	       	1.31.0-wmf.27
t  normalized_message	       	Using cached lag value for {db_server} due to active transaction

https://logstash.wikimedia.org/goto/e2a6f0cc6b65ccb88081e09216d1c7d7

@Marostegui: AFAIK that's a debugging log message put in place by @aaron

• mmodell updated the task description. (Show Details)Apr 3 2018, 5:43 AM

In T190960#4099499, @mmodell wrote:

@Marostegui: AFAIK that's a debugging log message put in place by @aaron

ok - thanks for the clarification, I thought it was better to report it :-)

I've started working on an incident report for this: https://wikitech.wikimedia.org/wiki/Incident_documentation/20180229-Train-1.31.0-wmf.27 feel free to add to it, it's just a basic outline so far.

• Pchelolo mentioned this in T191282: Wikimedia\Rdbms\LoadBalancer::{closure}: found writes pending.Apr 3 2018, 1:55 PM

I think there's another problem related to wmf.27 rollout, see T191282 - the timings for those log entries correlate with the train deployments precisely

temp tables as part of SHOW are a non-problem, those are complusory tiny tables on memory- they only would be problematic if they created significant latency issues or they were created on disk (io increase, with means performance impact).

@aaron - if you don't like the current model, should we think an alternative, simpler one based on the heartbeat table- or is this still ok for you?

• mmodell added a project: Wikimedia-Incident.Apr 4 2018, 5:32 PM

• mmodell updated the task description. (Show Details)

In T190960#4100608, @jcrespo wrote:

@aaron - if you don't like the current model, should we think an alternative, simpler one based on the heartbeat table- or is this still ok for you?

It's OK for now. I made https://gerrit.wikimedia.org/r/#/c/423651/ to have another go at that patch.

Johan moved this task from In current Tech/News draft to Already announced/Archive on the User-notice board.Apr 5 2018, 6:06 PM

RandomDSdevel awarded a token.Apr 6 2018, 7:43 PM

https://wikitech.wikimedia.org/wiki/Incident_documentation/20180229-Train-1.31.0-wmf.27

• mmodell added a parent task: Restricted Task.Apr 10 2018, 3:02 AM

Liuxinyu970226 awarded a token.Apr 10 2018, 11:43 AM

By reading the IR I guess this task can be closed as the problem was found @mmodell ?

@Marostegui We still have a retrospective meeting scheduled for this one, not sure if we should keep it open for that purpose though.

Liuxinyu970226 unsubscribed.Apr 13 2018, 1:28 PM

• mmodell added a parent task: T193258: Look into ways to identify risky commits that are part of the weekly branch cut, either manually or through automated process.Apr 27 2018, 5:02 PM

• mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:09 PM

Ladsgroup edited projects, added User-notice-archive; removed User-notice.Aug 13 2022, 1:54 PM

Maintenance_bot removed a project: Patch-For-Review.Aug 13 2022, 2:37 PM

1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query"Closed, ResolvedPublicPRODUCTION ERRORActions