Sensu and Sensibility - Puppetconf 2014

Sensu and Sensibility
Tomas Doran
@bobtfish
2014-09-23

Cycle of failure and disappointment
• Manually edited and deployed monitoring
• Changes require two teams
• Low developer visibility about production
• Escalation of issues is hard
• Ops ignore alerts from services
• Postmortems
• High friction, low trust, low visibility.

“Normality”
This is dysfunctional
- http://gunshowcomic.com/648

“51% viewed their ERP implementation as unsuccessful”
The Robbins-Gioia Survey (2001)

The Conference Board Survey (2001)
“40% of the projects failed to achieve their business case within one year of going live”

McKinsey & Company in conjunction with the University of Oxford (2012)
• “17 percent of large IT projects go so badly that they can threaten the very existence of the company”
• “On average, large IT projects run 45 percent over budget and 7 percent over time, while delivering 56 percent less value than predicted”

Failure is an option
- blog.parasoft.com/single-greatest-barrier-with-sw-delivery

Why Sensu?
• Designed to be pluggable / extensible
• Arbitrary check metadata
• Simple model
• Components do exactly one thing
• Ruby
• Not afraid to extend (or fork!)

‘industry standard’
‘enterprise class’

How we use Sensu
• Don’t use all of this!
• ‘Standalone’ checks only
• Default in the Puppet module

Sensu data flow
• Sensu client runs checks on each machine
• Pushes results to RabbitMQ
• Clustered, clients/messages will fail over.
• Sensu server (multiple, HA)
• Processes check results, invokes handlers
• Writes state to Redis
• Redis + Sentinel
• Read by API (2 instances)
• All layers behind HAProxy
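
As a rough illustration of the flow above, the following Python sketch models the stages in-process: a client publishes a check result to a queue (standing in for RabbitMQ), and a server consumes it, records state in a dict (standing in for Redis), and invokes handlers. Every name here (`run_check`, `process_results`, the dict layout) is made up for illustration; this is not Sensu's actual code or wire format.

```python
import queue


def run_check(client_name, check_name, status, output):
    """Build the result a client would publish after running a check."""
    return {"client": client_name,
            "check": {"name": check_name, "status": status, "output": output}}


def process_results(transport, state, handlers):
    """Drain the transport, record the latest state, and invoke handlers."""
    while True:
        try:
            result = transport.get_nowait()
        except queue.Empty:
            break
        key = (result["client"], result["check"]["name"])
        state[key] = result["check"]["status"]   # what an API could later read
        if result["check"]["status"] != 0:       # non-zero status means a problem
            for handler in handlers:
                handler(result)


transport = queue.Queue()   # stand-in for RabbitMQ
state = {}                  # stand-in for Redis
alerts = []                 # a trivial handler that just collects events

transport.put(run_check("web1", "disk_ro_mounts", 0, "OK"))
transport.put(run_check("web2", "disk_ro_mounts", 2, "CRITICAL: / is read-only"))
process_results(transport, state, [alerts.append])
```

After this runs, `state` holds the latest status per client and check, and only the failing result has reached the handler.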

Quis custodiet ipsos custodes?
“Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.”

Mutually assured monitoring
• Multiple independent Sensu installs (per-datacenter)
• Monitor each other!

Machine readable config
• /etc/sensu/conf.d/checks/check_name.json
• Extensible with arbitrary metadata
• Hash merge
• Never edit by hand!
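
"Hash merge" here means that fragments from separate config files end up combined into one configuration. A minimal Python sketch of a recursive merge follows, assuming that nested hashes are merged and that later values win for everything else. The function name `deep_merge` and the two sample fragments are illustrative, and Sensu's own merge rules (for arrays, say) may differ.

```python
def deep_merge(base, override):
    """Return a new dict: nested dicts are merged, other values are overridden."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Two hypothetical fragments, as if loaded from separate files under conf.d
fragment_a = {"checks": {"disk_ro_mounts": {"interval": 60, "team": "operations"}}}
fragment_b = {"checks": {"disk_ro_mounts": {"page": False},
                         "check_cron": {"interval": 300}}}

config = deep_merge(fragment_a, fragment_b)
```

Because each check lives under its own key, one file per check can be added or removed without touching any other file.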

monitoring_check
monitoring_check { 'systems-apache-external':
  page          => true,
  command       => "/usr/lib/nagios/plugins/check_tcp -H ${external_ip_address} -p 443",
  check_every   => '5m',
  alert_after   => '30m',
  realert_every => 10,
  runbook       => 'y/apache',
}
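
The declaration takes human-readable durations such as '5m' and '30m', while the generated check JSON carries plain numbers ("interval": 60). A sketch of how such a string might be turned into seconds is below. The function name `to_seconds` and the supported suffixes are assumptions for illustration, not the module's documented behaviour.

```python
import re

# Seconds per unit suffix; a bare number is taken to be seconds already.
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def to_seconds(value):
    """Convert 30, '30', '5m', '2h' and similar to a number of seconds."""
    if isinstance(value, int):
        return value
    match = re.fullmatch(r"(\d+)([smhd]?)", value.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS.get(unit, 1)
```

With this, `check_every => '5m'` would correspond to an interval of 300 seconds and `alert_after => '30m'` to 1800 seconds.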

sensu::check
• monitoring_check wraps this
• Writes a JSON file for each check
• Comment safe

"disk_ro_mounts": {
  "standalone": true,
  "handlers": ["default"],
  "subscribers": [],
  "command": "/usr/lib/nagios/plugins/yelp/check_ro_mounts",
  "interval": 60,
  "alert_after": 0,
  "realert_every": "-1",
  "dependencies": [],
  "runbook": "http://lmgtfy.com/?q=linux+read+only+disk",
  "annotation": "https://gitweb.yelpcorp.com/?p=puppet.git;a=blob;f=modules/profile/manifests/server.pp#l80",
  "team": "operations",
  "irc_channels": "operations-notifications",
  "notification_email": "undef",
  "ticket": true,
  "project": "OPS",
  "page": false,
  "tip": false
}
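
Fields such as `interval`, `alert_after` and `realert_every` give a handler enough information to decide whether a given failure deserves a notification. The sketch below is one possible reading, not the logic of the real handlers: it assumes the event carries a count of consecutive failures, that `alert_after` is in seconds, and that a `realert_every` of -1 means exponential backoff. It takes integers, so a string value like "-1" would need converting first.

```python
def should_notify(occurrences, interval, alert_after, realert_every):
    """Decide whether the Nth consecutive failure should produce a notification.

    occurrences   -- consecutive failing results so far (1 for the first failure)
    interval      -- seconds between check runs
    alert_after   -- seconds a check must have been failing before alerting
    realert_every -- notify on every Nth eligible failure; -1 is treated here
                     as exponential backoff (an assumption of this sketch)
    """
    # Failures that have to pass before the first alert is due.
    initial_failing = alert_after // interval
    eligible = occurrences - initial_failing   # 1 on the first eligible run
    if eligible < 1:
        return False
    if realert_every == -1:
        # Alert on eligible failures 1, 2, 4, 8, ... (powers of two).
        return (eligible & (eligible - 1)) == 0
    return (eligible - 1) % realert_every == 0
```

With a 5 minute interval, `alert_after` of 30 minutes and `realert_every` of 10, this first notifies on the 7th consecutive failure and then again on the 17th.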

Check scripts
• Same as Nagios checks
• Simple (text) output
• Exit code
• Result sent to server, along with check definition
• Including all the custom metadata
• Our handlers use the extra data.
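
A check in this style only has to print one line of text and exit with a status code. By Nagios plugin convention, 0 is OK, 1 is WARNING, 2 is CRITICAL and 3 is UNKNOWN. The Python sketch below looks for read-only mounts by parsing `/proc/mounts`. It is an illustration only and not the `check_ro_mounts` script referenced earlier; the ignored filesystem types are an assumption.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def find_ro_mounts(mounts_text, ignore_fstypes=("proc", "sysfs", "tmpfs", "squashfs", "iso9660")):
    """Return mount points whose options include 'ro', given /proc/mounts text."""
    read_only = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        _device, mount_point, fstype, options = fields[:4]
        if fstype in ignore_fstypes:
            continue
        if "ro" in options.split(","):
            read_only.append(mount_point)
    return read_only


def check(mounts_text):
    """Return (exit_code, one-line message) in the Nagios plugin style."""
    read_only = find_ro_mounts(mounts_text)
    if read_only:
        return CRITICAL, "CRITICAL: read-only mounts: " + ", ".join(read_only)
    return OK, "OK: no read-only mounts"


def main():
    """Read the live mount table, print the message, return the exit code."""
    try:
        with open("/proc/mounts") as handle:
            mounts_text = handle.read()
    except OSError as error:
        print(f"UNKNOWN: cannot read /proc/mounts: {error}")
        return UNKNOWN
    code, message = check(mounts_text)
    print(message)
    return code
```

Installed as a script, it would finish by passing the return value of `main()` to `sys.exit`, so that the exit code reaches the Sensu client.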

Handlers
• base
• JIRA
• email
• irc
• pagerduty
• awsprune
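
Because the check definition travels with every result, a handler can choose its behaviour from the metadata alone. The Python sketch below maps the fields seen in the example check (`page`, `ticket`, `irc_channels`, `notification_email`) to notification channels. The routing rules and the function name are assumptions for illustration, not the behaviour of the real handlers.

```python
def select_notifications(check):
    """Map a check definition's metadata to the notification channels to use."""
    channels = []
    if check.get("page"):
        channels.append("pagerduty")
    if check.get("ticket"):
        channels.append("jira")
    if check.get("irc_channels"):
        channels.append("irc")
    email = check.get("notification_email")
    if email and email != "undef":   # "undef" appears as a literal string in the example
        channels.append("email")
    return channels


# A hypothetical event, shaped loosely like a check result plus its definition
event = {
    "client": {"name": "web1"},
    "check": {
        "name": "disk_ro_mounts", "status": 2,
        "team": "operations", "irc_channels": "operations-notifications",
        "notification_email": "undef", "ticket": True, "page": False,
    },
}
channels = select_notifications(event["check"])
```

For this event the result is a ticket and an IRC message but no page, which matches the intent of `"ticket": true, "page": false`.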

How do checks get run?
• Every machine runs the client.
• Client managed by Puppet
• Client has a TCP socket you can send JSON to
• Custom checks + pysensu-yelp
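
The client socket lets any program report a result without being scheduled by Sensu. The Python sketch below builds a result and writes it to that socket using only the standard library. It assumes the client listens on 127.0.0.1 port 3030 (the Sensu client default as far as I know; adjust if configured differently) and that `name`, `status` and `output` are the core fields. The helper names are hypothetical, and this is not the pysensu-yelp API.

```python
import json
import socket


def build_result(name, status, output, **metadata):
    """Build a check result; extra keyword arguments become custom metadata."""
    result = {"name": name, "status": status, "output": output}
    result.update(metadata)
    return result


def send_result(result, host="127.0.0.1", port=3030, timeout=5.0):
    """Send one JSON-encoded result to the local client socket."""
    payload = json.dumps(result).encode("utf-8")
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)


# Hypothetical result for a batch job reporting its own success
result = build_result("nightly_backup", 0, "OK: backup finished",
                      team="operations", page=False)
```

Calling `send_result(result)` on a machine running the client would then hand the result to Sensu, custom metadata included.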

Single source of truth
• DNS is canonical for Sensu servers
• Configure things in one place!

Automatic monitoring
• E.g. cron jobs: check they have succeeded recently!
• cron::d
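
One way to check that a job "succeeded recently" is to have the job touch a stamp file when it finishes cleanly and to alert when that file gets too old. The Python sketch below takes that approach. The stamp file, the threshold and the function name are hypothetical, and the source does not say how `cron::d` actually implements its check.

```python
import os
import time

OK, CRITICAL = 0, 2


def check_recent_success(stamp_path, max_age_seconds, now=None):
    """Return (exit_code, message) based on the age of a success stamp file."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(stamp_path)
    except OSError:
        return CRITICAL, f"CRITICAL: stamp file {stamp_path} is missing"
    if age > max_age_seconds:
        return CRITICAL, f"CRITICAL: last success {int(age)}s ago (limit {max_age_seconds}s)"
    return OK, f"OK: last success {int(age)}s ago"
```

Generating such a check automatically alongside each cron job means a new job is monitored from the moment it is declared.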

User specified monitoring
• Data lives in the service config
• Next to the code to emit metrics
• Next to metadata about SLAs and LB timeouts
• Simple checks for free!
• Developers can push without Ops

Cluster checks
• We’re working on this currently
• Assert some % of machines are healthy.
• Use to reduce alert noise.
• If a service becomes fully unavailable to clients, you want to page someone.
• If one machine goes belly up, you don’t (make a JIRA ticket for handling later!)
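
The idea of asserting that some percentage of machines are healthy can be sketched as a small aggregation over per-machine results. In the Python below, the 80% default threshold, the function name and the input shape are all assumptions. Since the cluster checks are described as still in progress, this is only an illustration of the concept.

```python
OK, CRITICAL, UNKNOWN = 0, 2, 3


def cluster_status(statuses, min_healthy_pct=80.0):
    """Aggregate per-machine check statuses into a single cluster-level result.

    statuses        -- mapping of machine name to that machine's exit status
    min_healthy_pct -- percentage of machines that must report OK (0)
    """
    if not statuses:
        return UNKNOWN, "UNKNOWN: no machines reported"
    healthy = sum(1 for status in statuses.values() if status == 0)
    pct = 100.0 * healthy / len(statuses)
    summary = f"{healthy}/{len(statuses)} machines healthy ({pct:.0f}%)"
    if pct < min_healthy_pct:
        return CRITICAL, "CRITICAL: " + summary
    return OK, "OK: " + summary
```

A cluster-level result like this could be the one that pages, while the per-machine checks only open tickets.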

WIP
• This is all still a work in progress.
• We’ve not 100% migrated off of Nagios
• Open sourcing the pieces

Thanks!
• Slides will be online shortly:
• slideshare.net/bobtfish
• @bobtfish
• Some (most?) of our code is open source:
• https://github.com/Yelp/sensu/commit/aa5c43c2fdfde5e8739952c0b8082000934f3ad2
• https://github.com/Yelp/puppet-monitoring_check
• https://github.com/Yelp/puppet-netstdlib
• https://github.com/Yelp/sensu_handlers
• https://github.com/Yelp/pysensu-yelp
