MySQL Performance
monitoring using
Statsd and Graphite
Art van Scheppingen
Head of Database Engineering
Overview
1. Who are we?
2. What monitoring tools do we use?
3. What are StatsD, Collectd and Graphite?
4. How MySQL logs to StatsD
5. Graphing examples
6. Challenges
7. Questions?

Who are we?
Who is Spil Games?
Facts
• Company founded in 2001
• 350+ employees worldwide
• 180M+ unique visitors per month
• Over 60M registered users
• 45 portals in 19 languages
• Casual games
• Social games
• Real time multiplayer games
• Mobile games
• 35+ MySQL clusters
• 60k queries per second (3.5 billion qpd)
Geographic Reach
180 Million Monthly Active Users(*)

Source: (*) Google Analytics, August 2012

Brands
Girls, Teens and Family

spielen.com
juegos.com
gamesgames.com
games.co.uk
Monitoring
We use(d) many many many
monitoring tools so far!
Existing monitoring systems we use(d)
• Opsview/Nagios (mainly availability)
• Cacti (using Baron Schwartz/Percona templates)
• MONYog
• Good ol’ RRD

Challenges
• Problems with existing systems
• Stats gathering through polling
• Data gets averaged out
• (Host) checks are run serially
• A slowdown in a run means less or no data
• Setting up an SSH connection is slow
• Low granularity (1 to 5 minutes)
• Hardly scalable
• Difficult to correlate metrics

Difficult to add a new metric
host065
bash-3.2# netstat -s | grep "listen queue"
26 times the listen queue of a socket overflowed
host066
bash-3.2# netstat -s | grep "listen queue"
33 times the listen queue of a socket overflowed

Statsd + Collectd
+ Graphite
What are they?
What is Collectd?
• Unix daemon that gathers system statistics
• Over 90 (input/output) plugins
• Plugin to send metrics to Graphite/Carbon
• Very useful for system metrics

Collectd
[Diagram: Collectd gathers data through its plugins (CPU, disk, load, …) and sends it to Carbon over TCP at a 30 second interval]

What is StatsD?
• Front-end proxy for Graphite/Carbon (by Etsy)
• NodeJS daemon (also available in other languages)
• Receives UDP (on localhost)
• Buffers metrics locally
• Periodically flushes data to Graphite/Carbon (TCP)
• Client libraries available in just about any language
• Send any metric you like!

StatsD functions
• StatsD functions
• update_stats
• increment/decrement
• set
• gauge
• timers
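
The function types above map onto StatsD's plain-text line protocol, `<name>:<value>|<type>`. A minimal Python sketch, assuming the type codes used by Etsy's StatsD (`c` for counters, `g` for gauges, `ms` for timers, `s` for sets); the helper name `format_metric` and the metric names are made up for illustration:

```python
# Type codes as used by Etsy's StatsD line protocol.
TYPE_CODES = {"counter": "c", "gauge": "g", "timer": "ms", "set": "s"}


def format_metric(name, value, kind, sample_rate=None):
    """Build one StatsD line: <name>:<value>|<type>[|@<sample_rate>]."""
    line = f"{name}:{value}|{TYPE_CODES[kind]}"
    if sample_rate is not None:
        # A sample rate tells StatsD to scale a sampled counter back up.
        line += f"|@{sample_rate}"
    return line


print(format_metric("logins", 1, "counter"))            # increment
print(format_metric("logins", -1, "counter"))           # decrement
print(format_metric("connections.open", 42, "gauge"))   # gauge
print(format_metric("query.time", 320, "timer"))        # timer, in ms
print(format_metric("users.unique", "user123", "set"))  # set
```

Each printed line is what a client library would put in a single UDP packet.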

StatsD Bash examples
echo "some.metric:1|c" | nc -w 1 -u graphite.host 8125
echo "some.metric:1|c" > /dev/udp/localhost/8125
bash-3.2# netstat -s | grep "listen"
26 times the listen queue of a socket overflowed
netstat -s | grep "listen" | awk '{print "hostname.listen.queue.overflowed:"$1"|c"}' > /dev/udp/localhost/8125
hostname.listen.queue.overflowed:26|c
echo "show global status" | mysql -u root | awk '{print "hostname.mysql.status."$1":"$2"|c"}'
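
The same fire-and-forget send can be done from application code. A minimal Python sketch, assuming a StatsD daemon listening on UDP port 8125 on localhost as in the Bash examples; `send_metric` is an illustrative helper, not part of any client library:

```python
import socket


def send_metric(line, host="localhost", port=8125):
    """Send one StatsD line over UDP; returns the number of bytes sent."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # UDP is connectionless: sendto() does not wait for the receiver,
        # so a missing or slow StatsD daemon does not block the caller.
        return sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


if __name__ == "__main__":
    send_metric("hostname.listen.queue.overflowed:26|c")
```

Because nothing is acknowledged, a lost packet means a lost data point; that is the trade-off for never slowing down the application.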

StatsD
[Diagram: application-level metrics (number of logins, cache hit/miss) and MySQL_Statsd (STATUS, INNODB STATUS) are sent over UDP to StatsD on localhost:8125, which flushes to Carbon over TCP at a 2 second interval]

What is Graphite?
• Highly scalable real-time graphing system
• Collects numeric time-series
• Backend daemon Carbon
• Carbon-cache: receives data
• Carbon-aggregator: aggregates data
• Carbon-relay: replication and sharding
• RRD or Whisper database

Graphite’s capabilities
• Each metric is in its own bucket
• Periods make folders
• prod.syseng.mmm.<hostname>.admin_offline
• Metric types
• Counters
• Gauge
• Retention can be set using a regex
• [mysql]
• pattern = ^prod\.syseng\.mysql\..*$
• retentions = 2s:1d,1m:3d,5m:7d,1h:5y
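
To see what a retention line like this costs, each `precision:duration` pair can be turned into a number of stored points. A small Python sketch, assuming Whisper's unit letters (`s`, `m`, `h`, `d`, `w`, `y`, with a year counted as 365 days); `parse_retentions` is a made-up helper, not part of Graphite:

```python
# Seconds per unit, following Whisper's retention syntax (a year is 365 days).
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def to_seconds(spec):
    """Convert a duration such as '2s', '5m' or '5y' to seconds."""
    return int(spec[:-1]) * UNIT_SECONDS[spec[-1]]


def parse_retentions(retentions):
    """Return (seconds_per_point, number_of_points) for each archive."""
    archives = []
    for archive in retentions.split(","):
        precision, duration = archive.strip().split(":")
        step = to_seconds(precision)
        archives.append((step, to_seconds(duration) // step))
    return archives


archives = parse_retentions("2s:1d,1m:3d,5m:7d,1h:5y")
for step, points in archives:
    print(f"{step:>5}s per point, {points:>6} points")
print("total points per metric:", sum(points for _, points in archives))
```

The 2 second archive alone accounts for 43,200 of the 93,336 points per metric. Assuming roughly 12 bytes per point in a Whisper file, that is a little over 1 MB of disk for every metric name.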
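Our Graphite environment
[Diagram: clients requesting graphs go through a load balancer (port 443) to the Graphite rendering cluster; Server-1 to Server-n send metrics through a load balancer (port 2003) to the Carbon relays, which feed Skyline (24h retention) and the storage clusters DEV, SYSENG, SERVICES1 and SERVICES2. Node counts on the slide: 3 nodes, 2 nodes, 1 node, 8 nodes]
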
Our Graphite cluster(s)
[Diagram: the Graphite environment annotated with throughput: 12 graphs/s, 700 get/s, 3M m(etrics)/s(econd), 250K m/s, 1M m/s (SYSENG), 1.5M m/s (SERVICES1), 500K m/s (SERVICES2); the storage tier is labelled Graphite Storage Clusters]

MySQL + StatsD
How do we use them?
Why use StatsD over Collectd?
• MySQL plugin for Collectd
• Sends SHOW STATUS
• No INNODB STATUS
• Plugin not flexible
• DBI plugin for Collectd
• Metrics based on columns
• Different granularity needed
• Separate daemon (with persistent connection)
• StatsD is easy as ABC

MySQL StatsD daemon
• Written in Python
• Rewritten and open sourced during a hackday
• Gathers data every 0.5 seconds
• Sends to StatsD (localhost) after every run
• Easy configuration
• Persistent connection
• Baron Schwartz’ InnoDB status parser (cacti poller)
• Other interesting metrics and counters
• Information Schema
• Performance Schema
• MariaDB specific
• Galera specific
• If you can query it, you can use it as a metric!
MySQL StatsD overview
[Diagram: inside the MySQL StatsD daemon, a MySQL thread runs SHOW GLOBAL VARIABLES, SHOW GLOBAL STATUS and SHOW ENGINE INNODB STATUS against MySQL, and a StatsD thread sends the results to StatsD]
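
The split between a MySQL thread and a StatsD thread can be sketched with a queue between the two. This is an illustrative Python sketch, not the daemon's actual code: `fetch_status` stands in for running SHOW GLOBAL STATUS and `send` for the UDP send, and both are hypothetical callables supplied by the caller.

```python
import queue
import threading


def mysql_thread(fetch_status, out_queue, runs):
    """Poll MySQL a fixed number of times and queue each metric."""
    for _ in range(runs):
        for name, value in fetch_status().items():
            out_queue.put((f"status.{name.lower()}", value))
    out_queue.put(None)  # sentinel: tell the sender to stop


def statsd_thread(in_queue, send):
    """Take metrics off the queue and hand them to the sender."""
    while True:
        item = in_queue.get()
        if item is None:
            break
        name, value = item
        send(f"{name}:{value}|g")


def run(fetch_status, send, runs=1):
    """Start both threads and wait until every queued metric is sent."""
    q = queue.Queue()
    sender = threading.Thread(target=statsd_thread, args=(q, send))
    poller = threading.Thread(target=mysql_thread, args=(fetch_status, q, runs))
    sender.start()
    poller.start()
    poller.join()
    sender.join()


sent = []
run(lambda: {"Threads_connected": 12, "Threads_running": 3}, sent.append)
print(sent)
```

Decoupling the two sides this way means a slow query delays polling but never blocks sending, and the other way around. A real daemon would sleep between polls rather than run a fixed number of times.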

Example configuration
[daemon]
logfile = /var/log/mysql_statsd/daemon.log
pidfile = /var/run/mysql_statsd.pid
[statsd]
host = localhost
port = 8125
prefix = prd.mysql
include_hostname = true
[mysql]
host = localhost
username = mysqlstatsd
password = ub3rs3cr3tp@ss!
stats_types = status,variables,innodb,commit
query_variables = SHOW GLOBAL VARIABLES
interval_variables = 10000
query_status = SHOW GLOBAL STATUS
interval_status = 500
query_innodb = SHOW ENGINE INNODB STATUS
interval_innodb = 10000
query_commit = COMMIT
interval_commit = 5000
sleep_interval = 500
[metrics]
variables.max_connections = g
status.max_used_connections = g
status.connections = c
innodb.spin_waits = c
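
How the `[metrics]` section might drive what gets sent can be sketched with Python's standard `configparser` module. The sketch assumes that `g` marks a gauge (send the current value) and `c` a counter (send the increase since the previous poll, since MySQL status counters are cumulative). The real daemon may handle this differently, and `build_lines` is a hypothetical helper.

```python
import configparser

CONFIG = """
[statsd]
prefix = prd.mysql

[metrics]
variables.max_connections = g
status.max_used_connections = g
status.connections = c
"""


def build_lines(config, current, previous):
    """Turn one poll result into StatsD lines, using the configured types."""
    prefix = config["statsd"]["prefix"]
    lines = []
    for name, kind in config["metrics"].items():
        if name not in current:
            continue
        value = current[name]
        if kind == "c":
            if name not in previous:
                continue  # first poll: no baseline to diff against yet
            value -= previous[name]
        lines.append(f"{prefix}.{name}:{value}|{kind}")
    return lines


config = configparser.ConfigParser()
config.read_string(CONFIG)

previous = {"status.connections": 1000}
current = {
    "variables.max_connections": 500,
    "status.max_used_connections": 87,
    "status.connections": 1012,
}
for line in build_lines(config, current, previous):
    print(line)
```

Only metrics listed in `[metrics]` are sent, so the configuration doubles as a whitelist that keeps the number of metrics under control.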

MySQL Multi Master patch
• Perl (Net::Statsd)
• Sends any status change to StatsD (localhost)
• Non-blocking (thanks to UDP)
• Draw as infinite in Graphite

Other metrics
• Deployments
• User initiated actions
• Logins
• High scores
• Comments / ratings
• Images uploaded
• Payments
• Application metrics
• Error counts
• Cache statistics (cache hit/miss)
• Request timers
• Image sizes
Start graphing!
Now it starts to get
interesting!
What is important for you?
• Identify your KPIs
• Don’t graph everything
• More graphs == less overview
• Combine metrics
• Stack clusters

Correlate!
• Include other metrics into your graphs
• Deployments
• Failover(s)
• Combine application metrics with your database
• Other influences
• Launch of a new game
• Apple keynotes

Graphing
• Graphite Graphing Engine
• DIY
• Giraffe
• Readily available dashboards/tools
• Graph Explorer (vimeo)
• Team Dashboard
• Skyline (Etsy)
• Dashing (Shopify)

[Screenshots: DIY, Giraffe, Graph Explorer, Team Dashboard, Skyline, Dashing]

Graphite Graphing Engine
• URI based rendering API
• Support for wildcards
• stats.prod.syseng.mysql.*.status.com_select
• sumSeries(stats.prod.syseng.mysql.*.status.com_select)
• aliasByNode(stats.prod.syseng.mysql.*.status.com_select, 4)

• Many functions
• Nth percentile
• Holt-Winters Forecast
• Timeshift
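
Since rendering is driven entirely by the URI, graphs are easy to generate from code. A rough Python sketch that builds a render URL from the targets above; `graphitehost` is the placeholder host used in this deck, and `render_url`, the alias text and the `from` value are illustrative choices:

```python
from urllib.parse import urlencode


def render_url(base, targets, **params):
    """Build a Graphite render API URL with one or more targets."""
    query = [("target", t) for t in targets] + sorted(params.items())
    return f"{base}/render/?{urlencode(query)}"


metric = "stats.prod.syseng.mysql.*.status.com_select"
url = render_url(
    "https://graphitehost",
    [
        f"aliasByNode({metric}, 4)",             # one line per host
        f'alias(sumSeries({metric}), "total")',  # all hosts summed
    ],
    width=722,
    height=357,
    **{"from": "-7d"},  # 'from' is a Python keyword, so pass it via a dict
)
print(url)
```

Node 4 in `aliasByNode` is the wildcard position, counting from zero, so each line in the graph is labelled with its hostname.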

Graphite web interface

Graphite Example URL
https://graphitehost/render/?width=722&height=357&_salt=1366550446.553&rightDashed=1&target=alias%28sumSeries%28stats.prod.services.profilar.request.total.count.*%29%2C%22Number%20of%20profile%20requests%22%29&target=alias%28secondYAxis%28sumSeries%28stats_counts.prod.syseng.mysql.<node1>.status.questions%2C%20stats_counts.prod.syseng.mysql.<node2>.status.questions%29%29%2C%22Number%20of%20queries%20profiles%20cluster%22%29&from=00%3A00_20130415&until=23%3A59_20130421

Examples: timeshift

Examples: multiple weeks

Challenges
The road ahead
What challenges do we have?
• Improve MySQL-statsd (extensive issue list)
• No zoom in on graphs
• Get Skyline to work and not cry wolf
• Machine learning
• Eternal hunger for more metrics
• Abuse of the system
• Hitting limits of SSD write performance
• Virident? Fusion-IO?
• Carbon -> OpenTSDB -> Graphite-web?

What lessons have we learned?
• Persistent connections + repeatable read
• History list skyrocketed
• More hackdays are needed!
• Too many metrics slow down graphing
• Too many metrics can kill a host
• EstatsD for Erlang

Questions…
Practical links
• Graphite:
http://graphite.readthedocs.org/en/latest/
• Collectd:
https://collectd.org/
• StatsD on Github by Etsy:
https://github.com/etsy/statsd/wiki
• Etsy on StatsD:
http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/

Thank you!
• Presentation can be found at:
http://spil.com/pluk2013
• MySQL Statsd can be found at:
http://spil.com/mysqlstatsd
http://github.com/spilgames/mysql-statsd
• If you wish to contact me:
art@spilgames.com



MySQL performance monitoring using Statsd and Graphite (PLUK2013)

Editor's Notes

  1. so that may be the reason our name is not widely known.
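  2. The three main brands: Girls, aimed at girls aged 8 to 12; Teens, aimed at boys and girls aged 10 to 15; and Family, basically mothers playing with their children. Strong domains localized over 19 different languages: spielen.com, juegos.com, gamesgames.com, games.co.uk, oyunonya.com. All content is localized.
  3. ----- Meeting Notes (30-11-12 12:00) ----- Abbreviations (try to pronounce). Theory too long, second part too brief. High Availability -> HA. What do we do? Games! 180M+. Query numbers on DBs. Some examples of portal names. SSP is abstraction layer. SSP query example. Explain why horizontal instead of vertical. Functional sharding slide! Explain why satellite DC. Introduction to satellite data centers (moving data to caching) but explain they do not own the data. Instead of example of migrating users, example of adding a new DC. Slide 23: leave out slide. Why we chose erlang: remove pattern matching. Adds productivity: simpler. Add another example for buckets with a different backend. Slide 22: partition on users, bucket and GIDs. It is not a mess in LAMP stack: the backend is just not scalable.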
  3. ----- Meeting Notes (30-11-12 12:00) -----Abbreviations (try to pronounce)Theory too long, second part too brief.High Availability -&gt; HA What do we do? Games!180M+Query numbers on DBsSome examples of portal namesSSP is abstraction layerSSP query exampleExplain why horizontal instead of verticalFunctional sharding slide!Explain why sattelite DCIntroduction to sattelite data centers (moving data to caching) but explain they do not own the dataInstead of example of migrating users, example of adding a new DCSlide 23: leave out slideWhy we chose erlang: remove pattern matching. Adds productivity: simplerAdd another example for buckets with a different backendSlide 22: partition on users, bucket and GIDs.It is not a mess in LAMP stack: the backend is just not scalables