MySQL performance monitoring using Statsd and Graphite (PLUK2013)
Note: this is a placeholder for the presentation next Tuesday at Percona Live London
4. Facts
• Company founded in 2001
• 350+ employees worldwide
• 180M+ unique visitors per month
• Over 60M registered users
• 45 portals in 19 languages
• Casual games
• Social games
• Real time multiplayer games
• Mobile games
• 35+ MySQL clusters
• 60k queries per second (3.5 billion qpd)
8. Existing monitoring systems we use(d)
• Opsview/Nagios (mainly availability)
• Cacti (using Baron Schwartz/Percona templates)
• MONYog
• Good ol’ RRD
9. Challenges
• Problems with existing systems
• Stats gathering through polling
• Data gets averaged out
• (Host) checks are run serially
• A slowdown in a run means less or no data
• Setting up an SSH connection is slow
• Low granularity (1 to 5 minutes)
• Hardly scalable
• Difficult to correlate metrics
10. Difficult to add a new metric
host065
bash-3.2# netstat -s | grep "listen queue"
26 times the listen queue of a socket overflowed
host066
bash-3.2# netstat -s | grep "listen queue"
33 times the listen queue of a socket overflowed
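A counter like this only exists as free-form command output, so something has to parse it before it can be graphed. A minimal sketch, assuming the output format shown above; the function name parse_listen_overflows is hypothetical and not part of any tool mentioned in this deck:

```python
import re

# Matches the line shown on the slide, e.g.
# "26 times the listen queue of a socket overflowed"
_OVERFLOW_RE = re.compile(r"(\d+) times the listen queue of a socket overflowed")


def parse_listen_overflows(netstat_output):
    """Return the overflow count found in `netstat -s` output, or None."""
    match = _OVERFLOW_RE.search(netstat_output)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    sample = "    26 times the listen queue of a socket overflowed\n"
    print(parse_listen_overflows(sample))
```

The parsed value is cumulative since boot, so it would normally be sent as a gauge or converted to a per-interval increase before graphing.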
12. What is Collectd?
• Unix daemon that gathers system statistics
• Over 90 (input/output) plugins
• Plugin to send metrics to Graphite/Carbon
• Very useful for system metrics
14. What is StatsD?
• Front-end proxy for Graphite/Carbon (by Etsy)
• Node.js daemon (ports exist in other languages)
• Receives UDP (on localhost)
• Buffers metrics locally
• Periodically flushes data to Graphite/Carbon (TCP)
• Client libraries available in just about any language
• Send any metric you like!
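Sending a metric takes very little code, because the StatsD wire format is a single text line of the form name:value|type in one UDP datagram. A minimal sketch using only the Python standard library; the function names and the metric name are made up for illustration, and port 8125 is the usual StatsD default:

```python
import socket


def format_metric(name, value, metric_type):
    """Build a StatsD datagram payload, e.g. 'prd.mysql.connections:3|c'."""
    if metric_type not in ("c", "g", "ms"):
        raise ValueError("unsupported metric type: %r" % metric_type)
    return "{}:{}|{}".format(name, value, metric_type)


def send_metric(name, value, metric_type, host="localhost", port=8125):
    """Fire-and-forget: UDP needs no connection and does not wait for a reply."""
    payload = format_metric(name, value, metric_type).encode("ascii")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()


if __name__ == "__main__":
    # Assumed metric name; the send does not block on a reply.
    send_metric("prd.mysql.host065.connections", 3, "c")
```

Because nothing is acknowledged, a lost datagram is simply a lost sample; the application sending it is never held up.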
18. What is Graphite?
• Highly scalable real-time graphing system
• Collects numeric time-series
• Backend daemon Carbon
• Carbon-cache: receives data
• Carbon-aggregator: aggregates data
• Carbon-relay: replication and sharding
• RRD or Whisper database
19. Graphite’s capabilities
• Each metric is in its own bucket
• Periods make folders
• prod.syseng.mmm.<hostname>.admin_offline
• Metric types
• Counters
• Gauges
• Retention can be set using a regex
[mysql]
pattern = ^prod\.syseng\.mysql\..*$
retentions = 2s:1d,1m:3d,5m:7d,1h:5y
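Each precision:duration pair in a retentions line defines one archive, and its size follows from dividing the duration by the precision. A small sketch of that arithmetic, assuming a year counts as 365 days; the helper names are hypothetical:

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "y": 86400 * 365}


def to_seconds(spec):
    """Convert a value such as '5m' or '1d' to seconds."""
    return int(spec[:-1]) * UNIT_SECONDS[spec[-1]]


def archive_points(retentions):
    """Return (seconds_per_point, points) for each 'precision:duration' pair."""
    archives = []
    for pair in retentions.split(","):
        precision, duration = pair.strip().split(":")
        step = to_seconds(precision)
        archives.append((step, to_seconds(duration) // step))
    return archives


if __name__ == "__main__":
    for step, points in archive_points("2s:1d,1m:3d,5m:7d,1h:5y"):
        print("{:>5}s per point -> {:>6} points".format(step, points))
```

For the schema above this comes to 43200 + 4320 + 2016 + 43800 = 93336 points per metric. Whisper stores each point in 12 bytes, so that is a little over 1.1 MB on disk for every metric, which is worth keeping in mind before choosing a 2-second precision for everything.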
24. Why use StatsD over Collectd?
• MySQL plugin for Collectd
• Sends SHOW STATUS
• No INNODB STATUS
• Plugin not flexible
• DBI plugin for Collectd
• Metrics based on columns
• Different granularity needed
• Separate daemon (with persistent connection)
• StatsD is as easy as ABC
25. MySQL StatsD daemon
• Written in Python
• Rewritten and open sourced during a hackday
• Gathers data every 0.5 seconds
• Sends to StatsD (localhost) after every run
• Easy configuration
• Persistent connection
• Baron Schwartz's InnoDB status parser (from the Cacti poller)
• Other interesting metrics and counters
• Information Schema
• Performance Schema
• MariaDB specific
• Galera specific
• If you can query it, you can use it as a metric!
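MySQL status values such as Connections are cumulative, while a StatsD counter expects the increase since the previous send. One way a poller can bridge that is to remember the last sample and send only the difference. The sketch below illustrates the idea; the class name DeltaTracker is hypothetical, and whether mysql-statsd handles counters exactly this way is an assumption:

```python
class DeltaTracker(object):
    """Turn cumulative status values into per-interval increments."""

    def __init__(self):
        self._previous = {}

    def delta(self, name, value):
        """Return the increase since the last sample, or None on first sight.

        A value lower than the previous one is treated as a counter reset
        (for example after a server restart) and the new value is returned.
        """
        previous = self._previous.get(name)
        self._previous[name] = value
        if previous is None:
            return None
        if value < previous:
            return value
        return value - previous


if __name__ == "__main__":
    tracker = DeltaTracker()
    for sample in (1000, 1012, 1030, 5):  # made-up Connections readings
        print(tracker.delta("status.connections", sample))
```

Gauges need none of this: the latest value is sent as-is.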
27. Example configuration
[daemon]
logfile = /var/log/mysql_statsd/daemon.log
pidfile = /var/run/mysql_statsd.pid
[statsd]
host = localhost
port = 8125
prefix = prd.mysql
include_hostname = true
[mysql]
host = localhost
username = mysqlstatsd
password = ub3rs3cr3tp@ss!
stats_types = status,variables,innodb,commit
query_variables = SHOW GLOBAL VARIABLES
interval_variables = 10000
query_status = SHOW GLOBAL STATUS
interval_status = 500
query_innodb = SHOW ENGINE INNODB STATUS
interval_innodb = 10000
query_commit = COMMIT
interval_commit = 5000
sleep_interval = 500
[metrics]
variables.max_connections = g
status.max_used_connections = g
status.connections = c
innodb.spin_waits = c
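The option names suggest that the daemon wakes up every sleep_interval milliseconds and runs each query whose own interval has elapsed. That reading is an assumption, not something the slides state. Under that assumption, a sketch of how such a file could be read with Python's configparser and turned into a schedule; load_schedule and due are hypothetical helpers, and only part of the configuration above is repeated:

```python
import configparser

SAMPLE = """
[mysql]
stats_types = status,variables,innodb,commit
interval_status = 500
interval_variables = 10000
interval_innodb = 10000
interval_commit = 5000
sleep_interval = 500

[metrics]
variables.max_connections = g
status.max_used_connections = g
status.connections = c
innodb.spin_waits = c
"""


def load_schedule(text):
    """Return ({stats_type: interval_ms}, {metric_name: statsd_type})."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    mysql = parser["mysql"]
    types = [t.strip() for t in mysql["stats_types"].split(",")]
    intervals = {t: mysql.getint("interval_" + t) for t in types}
    metrics = dict(parser["metrics"])
    return intervals, metrics


def due(intervals, elapsed_ms):
    """Stats types whose interval divides the elapsed time evenly."""
    return sorted(t for t, ms in intervals.items() if elapsed_ms % ms == 0)


if __name__ == "__main__":
    intervals, metrics = load_schedule(SAMPLE)
    for elapsed in (500, 5000, 10000):
        print(elapsed, due(intervals, elapsed))
    print(metrics["status.connections"])
```

With these values only the status query runs on every wake-up, while the heavier SHOW ENGINE INNODB STATUS runs once per ten seconds.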
28. MySQL Multi Master patch
• Perl (Net::Statsd)
• Sends any status change to StatsD (localhost)
• Non-blocking (thanks to UDP)
• Drawn with drawAsInfinite() in Graphite
31. What is important for you?
• Identify your KPIs
• Don’t graph everything
• More graphs == less overview
• Combine metrics
• Stack clusters
32. Correlate!
• Include other metrics into your graphs
• Deployments
• Failover(s)
• Combine application metrics with your database
• Other influences
• Launch of a new game
• Apple keynotes
33. Graphing
• Graphite Graphing Engine
• DIY
• Giraffe
• Readily available dashboards/tools
• Graph Explorer (Vimeo)
• Team Dashboard
• Skyline (Etsy)
• Dashing (Shopify)
47. What challenges do we have?
• Improve MySQL-statsd (extensive issue list)
• No zoom in on graphs
• Get Skyline to work and not cry wolf
• Machine learning
• Eternal hunger for more metrics
• Abuse of the system
• Hitting limits of SSD write performance
• Virident? Fusion-IO?
• Carbon OpenTSDB Graphite-web?
48. What lessons have we learned?
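• Persistent connections + repeatable read
• History list skyrocketed (a long-open transaction holds back InnoDB purge; compare the periodic COMMIT in the example configuration)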
• More hackdays are needed!
• Too many metrics slow down graphing
• Too many metrics can kill a host
• EstatsD for Erlang
51. Thank you!
• Presentation can be found at:
http://spil.com/pluk2013
• MySQL Statsd can be found at:
http://spil.com/mysqlstatsd
http://github.com/spilgames/mysql-statsd
• If you wish to contact me:
art@spilgames.com
Editor's Notes
so that may be the reason our name is not widely known.
The three main brands:
Girls, aimed at girls aged 8 to 12
Teens, aimed at boys and girls aged 10 to 15
Family, basically mothers playing with their children
Strong domains localized over 19 different languages:
spielen.com, juegos.com, gamesgames.com, games.co.uk, oyunonya.com
All content is localized
----- Meeting Notes (30-11-12 12:00) -----
Abbreviations (try to pronounce)
Theory too long, second part too brief.
High Availability -> HA
What do we do? Games!
180M+
Query numbers on DBs
Some examples of portal names
SSP is abstraction layer
SSP query example
Explain why horizontal instead of vertical
Functional sharding slide!
Explain why satellite DC
Introduction to satellite data centers (moving data to caching) but explain they do not own the data
Instead of example of migrating users, example of adding a new DC
Slide 23: leave out slide
Why we chose Erlang: remove pattern matching. Adds productivity: simpler
Add another example for buckets with a different backend
Slide 22: partition on users, bucket and GIDs.
It is not a mess in LAMP stack: the backend is just not scalable