
decommission rcs100[12]
Closed, Resolved | Public | 3 Estimated Story Points

Description

This task will track the decommission of rcs100[12].eqiad.wmnet.
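The `rcs100[12]` shorthand above names two hosts. A minimal sketch of how such a pattern could be expanded into full hostnames, assuming the brackets list single-character alternatives; `expand_hosts` is a hypothetical helper, not part of any Wikimedia tooling:

```python
import re
from typing import List


def expand_hosts(pattern: str) -> List[str]:
    """Expand the first [..] group into one hostname per character.

    Assumption: the brackets hold single-character alternatives,
    as in 'rcs100[12]'. Patterns without brackets are returned as-is.
    """
    match = re.search(r"\[([^\]]+)\]", pattern)
    if match is None:
        return [pattern]
    prefix = pattern[:match.start()]
    suffix = pattern[match.end():]
    return [prefix + char + suffix for char in match.group(1)]


hosts = expand_hosts("rcs100[12].eqiad.wmnet")
print(hosts)  # ['rcs1001.eqiad.wmnet', 'rcs1002.eqiad.wmnet']
```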

rcs1001

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp entry (replace with role::spare if system isn't shut down immediately during this process)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host
  • - remove all remaining puppet references (including role::spare)
  • - power down host
  • - disable switch port - asw-c-eqiad:ge-4/0/5
  • - switch port assignment noted on this task (for later removal)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate, salt key removed

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update racktables with result
  • - switch port configuration removed from switch once system is unracked.
  • - mgmt dns entries removed.

rcs1002

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp entry (replace with role::spare if system isn't shut down immediately during this process)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host
  • - remove all remaining puppet references (including role::spare)
  • - power down host
  • - disable switch port - rcs1002 = asw-a-eqiad:ge-4/0/18
  • - switch port assignment noted on this task (for later removal)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate, salt key removed

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update racktables with result
  • - switch port configuration removed from switch once system is unracked.
  • - mgmt dns entries removed.

Event Timeline

Change 364219 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Disable RCStream

https://gerrit.wikimedia.org/r/364219

Change 364252 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Remove rcstream routes from varnish

https://gerrit.wikimedia.org/r/364252

Change 364254 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/services/eventstreams@master] Redirect / to /?doc just to have something useful at /

https://gerrit.wikimedia.org/r/364254

Change 364254 merged by Ottomata:
[mediawiki/services/eventstreams@master] Redirect / to /?doc just to have something useful at /

https://gerrit.wikimedia.org/r/364254

Change 364258 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/services/eventstreams@master] Redirect /rc to /?doc in case an old RCStream client comes looking

https://gerrit.wikimedia.org/r/364258

Change 364258 merged by Ottomata:
[mediawiki/services/eventstreams@master] Redirect /rc to /?doc in case an old RCStream client comes looking

https://gerrit.wikimedia.org/r/364258

Mentioned in SAL (#wikimedia-operations) [2017-07-10T17:55:36Z] <ottomata> disabling RCStream varnish routing: T170157

Change 364252 merged by Ottomata:
[operations/puppet@production] Remove rcstream routes from varnish

https://gerrit.wikimedia.org/r/364252

Ottomata renamed this task from Disable RCStream to Decommission RCStream. Jul 10 2017, 6:08 PM

Mentioned in SAL (#wikimedia-operations) [2017-07-17T14:11:00Z] <ottomata> decommissioning rcs100[12] to spare::system: T170157

Change 364219 merged by Ottomata:
[operations/puppet@production] Decom RCStream

https://gerrit.wikimedia.org/r/364219

@RobH, I started the decommission process for rcs1001 and rcs1002, but may have taken it further than I should have. I followed https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reclaim_to_Spares_OR_Decommission and got all the way to 'Power down system'. I did not do anything under 'disable host's port on switch'.

Anyway, these hosts are off and ready for decom. Thanks!

I already synced up with @Ottomata about this via IRC, and I'll snag from here.

There is a checklist for decoms, which I'll append into the main task description now.

RobH renamed this task from Decommission RCStream to Decommission RCStream (rcs100[12]). Jul 18 2017, 5:15 PM
RobH claimed this task.
RobH updated the task description.

Anyway, these hosts are off and ready for decom. Thanks!

Umm....

$ ping rcs1001
PING rcs1001.eqiad.wmnet (10.64.32.148) 56(84) bytes of data.
64 bytes from rcs1001.eqiad.wmnet (10.64.32.148): icmp_seq=1 ttl=64 time=0.327 ms
64 bytes from rcs1001.eqiad.wmnet (10.64.32.148): icmp_seq=2 ttl=64 time=0.184 ms

$ ping rcs1002
PING rcs1002.eqiad.wmnet (10.64.0.17) 56(84) bytes of data.
64 bytes from rcs1002.eqiad.wmnet (10.64.0.17): icmp_seq=1 ttl=63 time=0.231 ms
64 bytes from rcs1002.eqiad.wmnet (10.64.0.17): icmp_seq=2 ttl=63 time=0.229 ms

And we see a lovely spam of this in the mw logs (other errors appear too):

119 timed out after 1 seconds when connecting to rcs1002.eqiad.wmnet [110]: Connection timed out
119 Failed connecting to redis server at rcs1002.eqiad.wmnet: Connection timed out in /srv/mediawiki/php-1.30.0-wmf.9/includes/libs/redis/RedisConnectionPool.php on line 238
118 timed out after 1 seconds when connecting to rcs1001.eqiad.wmnet [110]: Connection timed out
118 Failed connecting to redis server at rcs1001.eqiad.wmnet: Connection timed out in /srv/mediawiki/php-1.30.0-wmf.9/includes/libs/redis/RedisConnectionPool.php on line 238
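The log summary above can be totalled per host to see how much noise each stale entry produces. A minimal sketch, assuming the leading number on each line is an occurrence count (as `uniq -c` style output would give) and that hostnames end in `.eqiad.wmnet`; `errors_per_host` is a hypothetical helper:

```python
import re
from collections import Counter

# Assumption: each line is "<count> <message mentioning host.eqiad.wmnet>".
LINE_RE = re.compile(r"^\s*(\d+)\s.*?([a-z0-9-]+\.eqiad\.wmnet)")


def errors_per_host(lines):
    """Sum the leading counts of all lines that mention each host."""
    totals = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            totals[match.group(2)] += int(match.group(1))
    return totals


sample = [
    "119 timed out after 1 seconds when connecting to rcs1002.eqiad.wmnet [110]: Connection timed out",
    "119 Failed connecting to redis server at rcs1002.eqiad.wmnet: Connection timed out in /srv/mediawiki/php-1.30.0-wmf.9/includes/libs/redis/RedisConnectionPool.php on line 238",
    "118 timed out after 1 seconds when connecting to rcs1001.eqiad.wmnet [110]: Connection timed out",
    "118 Failed connecting to redis server at rcs1001.eqiad.wmnet: Connection timed out in /srv/mediawiki/php-1.30.0-wmf.9/includes/libs/redis/RedisConnectionPool.php on line 238",
]

for host, total in sorted(errors_per_host(sample).items()):
    print(host, total)
```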

Because of this in CommonSettings:

	// RCStream / stream.wikimedia.org
	if ( $wmfRealm === 'production' ) {
		$wgRCFeeds['rcs1001'] = [
			'uri'	   => "redis://rcs1001.eqiad.wmnet:6379/rc.$wgDBname",
			'formatter' => 'JSONRCFeedFormatter',
		];

		$wgRCFeeds['rcs1002'] = [
			'uri'	   => "redis://rcs1002.eqiad.wmnet:6379/rc.$wgDBname",
			'formatter' => 'JSONRCFeedFormatter',
		];
	}
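A check like the following, run before powering a host down, could catch leftover references such as the `$wgRCFeeds` entries above. This is only a sketch: `find_host_references` is a hypothetical helper, and the sample text is an abbreviated copy of the snippet above rather than the real file.

```python
import re


def find_host_references(config_text, hosts):
    """Return (line_number, stripped_line) pairs that mention any host."""
    if not hosts:
        return []
    pattern = re.compile("|".join(re.escape(host) for host in hosts))
    return [
        (number, line.strip())
        for number, line in enumerate(config_text.splitlines(), start=1)
        if pattern.search(line)
    ]


# Abbreviated sample; in practice the text would be read from the config file.
sample_config = """
$wgRCFeeds['rcs1001'] = [
    'uri' => "redis://rcs1001.eqiad.wmnet:6379/rc.$wgDBname",
    'formatter' => 'JSONRCFeedFormatter',
];
"""

decommissioned = ["rcs1001.eqiad.wmnet", "rcs1002.eqiad.wmnet"]
for number, line in find_host_references(sample_config, decommissioned):
    print(f"line {number}: {line}")
```

Matching on the fully qualified name keeps the check narrow. Passing the short names (`rcs1001`, `rcs1002`) instead would also flag array keys like `$wgRCFeeds['rcs1001']`.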

Change 366175 had a related patch set uploaded (by Reedy; owner: Reedy):
[operations/mediawiki-config@master] Remove rcs100[12] from $wgRCFeeds

https://gerrit.wikimedia.org/r/366175

Change 366175 merged by jenkins-bot:
[operations/mediawiki-config@master] Remove rcs100[12] from $wgRCFeeds

https://gerrit.wikimedia.org/r/366175

Mentioned in SAL (#wikimedia-operations) [2017-07-19T00:35:01Z] <reedy@tin> Synchronized wmf-config/CommonSettings.php: Remove rcs1001 and rcs1002 from CommonSettings wgRCFeeds. Stops a load of logspam T170157 (duration: 00m 48s)

Krinkle moved this task from Next up to Assigned on the EventStreams board.
RobH renamed this task from Decommission RCStream (rcs100[12]) to decommmission RCStream (rcs100[12]). Jul 19 2017, 12:47 AM
RobH renamed this task from decommmission RCStream (rcs100[12]) to decommmission rcs100[12].

Change 366176 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] decom rcs100[12]

https://gerrit.wikimedia.org/r/366176

Change 366176 merged by RobH:
[operations/puppet@production] decom rcs100[12]

https://gerrit.wikimedia.org/r/366176

Reedy renamed this task from decommmission rcs100[12] to decommission rcs100[12]. Jul 19 2017, 12:54 AM
Reedy updated the task description.

Change 366177 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom of rcs100[12] production dns

https://gerrit.wikimedia.org/r/366177

Change 366177 merged by RobH:
[operations/dns@master] decom of rcs100[12] production dns

https://gerrit.wikimedia.org/r/366177

RobH updated the task description.
RobH moved this task from Backlog to Reclaim (Spares/Decommission) on the hardware-requests board.

Because of this in CommonSettings:

Aye yai yai. @Reedy very sorry about that, thanks for cleaning up my mess.

elukey moved this task from Incoming to Radar on the Analytics board.

Change 421564 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Removing mgmt dns rcs1001-1002

https://gerrit.wikimedia.org/r/421564

Change 421564 merged by Cmjohnson:
[operations/dns@master] Removing mgmt dns rcs1001-1002

https://gerrit.wikimedia.org/r/421564

Cmjohnson updated the task description.