While 1.31.0-wmf.27 was deployed to group1, implicit temporary tables rose to roughly 4x the baseline, as can be seen in this Grafana view.
The error comes from line 258 of RefreshLinksJob.php, where the code calls commitAndWaitForReplication:
$lbFactory->commitAndWaitForReplication( __METHOD__, $ticket );
Stack trace
#0 /srv/mediawiki/php-1.31.0-wmf.27/includes/libs/rdbms/loadbalancer/LoadBalancer.php(639): Wikimedia\Rdbms\DatabaseMysqlBase->masterPosWait(Wikimedia\Rdbms\MySQLMasterPos, double)
#1 /srv/mediawiki/php-1.31.0-wmf.27/includes/libs/rdbms/loadbalancer/LoadBalancer.php(534): Wikimedia\Rdbms\LoadBalancer->doWait(integer, boolean, double)
#2 /srv/mediawiki/php-1.31.0-wmf.27/includes/libs/rdbms/lbfactory/LBFactory.php(367): Wikimedia\Rdbms\LoadBalancer->waitForAll(Wikimedia\Rdbms\MySQLMasterPos, double)
#3 /srv/mediawiki/php-1.31.0-wmf.27/includes/libs/rdbms/lbfactory/LBFactory.php(419): Wikimedia\Rdbms\LBFactory->waitForReplication(array)
#4 /srv/mediawiki/php-1.31.0-wmf.27/includes/jobqueue/jobs/RefreshLinksJob.php(290): Wikimedia\Rdbms\LBFactory->commitAndWaitForReplication(string, integer)
#5 /srv/mediawiki/php-1.31.0-wmf.27/includes/jobqueue/jobs/RefreshLinksJob.php(122): RefreshLinksJob->runForTitle(Title)
#6 /srv/mediawiki/php-1.31.0-wmf.27/extensions/EventBus/includes/JobExecutor.php(59): RefreshLinksJob->run()
#7 /srv/mediawiki/rpc/RunSingleJob.php(79): JobExecutor->execute(array)
#8 {main}
Timeline
19:24 | twentyafterfour@tin: | Synchronized php: group1 wikis to 1.31.0-wmf.26 (duration: 01m 17s) |
19:22 | twentyafterfour@tin: | rebuilt and synchronized wikiversions files: group1 wikis to 1.31.0-wmf.26 |
19:20 | twentyafterfour: | Rolling back to wmf.26 due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" |
19:19 | twentyafterfour: | rolling back to wmf.26 |
19:18 | icinga-wm | PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen&from=1522263260081&to=1522265839537 |
19:17 | twentyafterfour: | I'm seeing quite a few "[{exception_id}] {exception_url} Wikimedia\Rdbms\DBExpectedError: Replication wait failed: Lost connection to MySQL server during query" |
19:12 | milimetric@tin: | Finished deploy [analytics/refinery@c22fd1e]: Fixing python import bug (duration: 02m 48s) |
19:09 | milimetric@tin: | Started deploy [analytics/refinery@c22fd1e]: Fixing python import bug |
19:09 | milimetric@tin: | Started deploy [analytics/refinery@c22fd1e]: (no justification provided) |
19:06 | twentyafterfour@tin: | Synchronized php: group1 wikis to 1.31.0-wmf.27 (duration: 01m 17s) |
19:05 | twentyafterfour@tin: | rebuilt and synchronized wikiversions files: group1 wikis to 1.31.0-wmf.27 |
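The Grafana link attached to the 19:18 alert carries its time window in the from and to query parameters, which look like epoch milliseconds. As a rough cross-check against the timeline, the sketch below decodes them to UTC. It assumes the timeline timestamps are UTC; the helper name grafana_window is made up for this illustration and is not part of any Wikimedia tooling.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse, parse_qs

# URL copied from the 19:18 icinga-wm entry in the timeline.
GRAFANA_URL = (
    "https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts"
    "?orgId=1&panelId=2&fullscreen&from=1522263260081&to=1522265839537"
)


def grafana_window(url):
    """Return (start, end) of the dashboard window as UTC datetimes.

    Hypothetical helper: assumes 'from' and 'to' are epoch milliseconds.
    """
    query = parse_qs(urlparse(url).query)

    def to_utc(ms):
        # Integer division drops the millisecond part.
        return datetime.fromtimestamp(int(ms) // 1000, tz=timezone.utc)

    return to_utc(query["from"][0]), to_utc(query["to"][0])


start, end = grafana_window(GRAFANA_URL)
print(start.strftime("%Y-%m-%d %H:%M:%S"), "to", end.strftime("%H:%M:%S"), "UTC")
# prints: 2018-03-28 18:54:20 to 19:37:19 UTC
```

Under those assumptions, the linked window spans the whole sequence from the 19:05 deploy of wmf.27 to the 19:24 rollback to wmf.26.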
Incident report
https://wikitech.wikimedia.org/wiki/Incident_documentation/20180229-Train-1.31.0-wmf.27