How do you operate over 1,200 deployments on a single BOSH Director? Many past talks have covered Cloud Foundry at scale, but what about the underlying automation layer? BOSH has its own set of challenges and limits for running VMs and deployments at scale. Learn which obstacles and limits came up and how we solved them with the help of the BOSH core development team. Learn how we monitor the directors, be it via logging, metrics, or performance indicators. We'll also show you how we automate BOSH itself to ensure the best experience for end users, and to keep them blissfully unaware of the complexity of the processes working on their behalf. After this talk you too will be able to run at least 1,200 deployments on your directors.
7. BOSH Setup
Overbosh: Deploys the runtime and Underbosh
Underbosh: Deploys service brokers and services (this is where the 1,200 deployments live); uses a CredHub colocated on Overbosh
Utilsbosh: Utilities and Prometheus monitoring
8. Lessons Learned
● Deploy less with create-env
● Only create-env Utilsbosh
● Deploy the other directors from Utilsbosh (sketch after this list)
● Using Overbosh as the CredHub provider for Underbosh can be
suboptimal
○ A recreate of Overbosh means people cannot create services in the meantime
○ A dedicated CredHub/UAA deployment is better
● Using an external RDS does not solve all problems
○ More about that later
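As a hedged sketch of that setup (file names, ops files, and environment aliases here are illustrative): only the utility director is bootstrapped with `bosh create-env`; the others become ordinary deployments on it, so they get resurrection and `bosh recreate` for free.
  # Bootstrap only Utilsbosh with create-env
  bosh create-env bosh-deployment/bosh.yml \
    --state utilsbosh-state.json \
    --vars-store utilsbosh-creds.yml \
    -o bosh-deployment/aws/cpi.yml \
    -v director_name=utilsbosh \
    -l utilsbosh-vars.yml

  # Deploy the other directors as regular deployments on Utilsbosh
  bosh -e utilsbosh -d overbosh deploy overbosh.yml
  bosh -e utilsbosh -d underbosh deploy underbosh.yml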
10. IO Credits are Fun
● AWS limits IOPS on gp2 SSD disks to a baseline of 3 IOPS/GB, with a burstable credit budget on top
● No IOPS credits on magnetic/HDD volumes (st1, sc1, standard)
● AWS-Stage BOSH DB runs on gp2
● AWS-Prod BOSH DB runs on standard
● You can see the disk's IOPS budget in CloudWatch (the BurstBalance metric)
● Unless it's an RDS instance
● You have to create an alert for each single volume (sketch below)
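A hedged sketch of one such per-volume alert on the BurstBalance metric (volume ID, SNS topic ARN, and threshold are placeholders):
  aws cloudwatch put-metric-alarm \
    --alarm-name "bosh-db-disk-burst-balance" \
    --namespace AWS/EBS \
    --metric-name BurstBalance \
    --dimensions Name=VolumeId,Value=vol-0abc1234def567890 \
    --statistic Average --period 300 \
    --evaluation-periods 3 \
    --threshold 20 --comparison-operator LessThanThreshold \
    --alarm-actions arn:aws:sns:eu-west-1:123456789012:bosh-alerts
Since the metric carries the volume ID as a dimension, you repeat this once per volume.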
11. Effects
● The database in AWS-P is consistently slower, but with no variation in response times
● The database in AWS-S went unresponsive at some points
○ BOSH sometimes sends a few thousand queries in a row which do large joins
● The EU-P BOSH has 50 GB of standard disk ($3/mo)
● The EU-S BOSH has 1 TB of gp2 disk ($119/mo)
12. Things That Drain Your IOPS
● The daily snapshot task, even if snapshots are disabled
○ Has since been made less severe
● `bosh vms`/`bosh deployments`
○ More on that later
● If the IOPS on your director disk get depleted repeatedly, switch to magnetic storage like sc1 or st1 (cloud-config sketch below)
○ Slower than gp2 at max speed
○ Costs about half as much
○ Consistent, and fast enough for BOSH
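A minimal sketch of what that looks like in a BOSH cloud-config for the AWS CPI (disk name and size are illustrative; note that st1 has a large minimum volume size):
  disk_types:
  - name: director-db
    disk_size: 512000        # in MB; st1 volumes start at 500 GB
    cloud_properties:
      type: st1              # throughput-optimized HDD instead of gp2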
14. September 2018
● 670 deployments
● BOSH director is very slow
● Some queries take 2-3 minutes to complete
● Scaling BOSH and the DB brings only minor improvements
● An m4.2xlarge RDS instance is a bit faster, but does not solve it
● More disk IOPS does not help
15. Solution
● Updating the director fixed it
● The reason: every `bosh vms` also made the director select the deployment configs for each deployment separately
○ Even though they were not part of the output
● SAP stumbled over the issue first and fixed it
16. November 2018
● BOSH unresponsive or very slow
● No uploads/deploys possible
● Persistent disk 50% free
● `df -i` showed all inodes exhausted (see the checks below)
● BOSH stores task logs on disk
○ And deletes them regularly
○ But with 900 deployments and the Prometheus BOSH exporter running a `bosh vms` every 5 minutes, you create tasks faster than BOSH cleans them up
○ 1.8m task log folders on disk
○ Each one contained 0-3 log files
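Two quick checks on the director VM (the task log path below is the usual `/var/vcap/store` location; verify it on your stemcell):
  df -i    # inode usage per filesystem; 100% IUse% despite free space matches this failure
  sudo find /var/vcap/store/director/tasks -maxdepth 1 -type d | wc -l    # task log folders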
17. Solution
● Removed some older log files (1.79m of them)
● Scaled the disk
● Notified BOSH core
● Set up an alert for inode usage on all persistent disks (rule sketch below)
● Switched from the BOSH exporter to the Graphite HM plugin
● BOSH core made the director more aggressive at purging old task logs
○ Went from 1.6m task logs on disk to just 18,000
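A sketch of the inode alert as a Prometheus rule over node_exporter metrics (the mountpoint matcher and the 10% threshold are our choices, not part of the original slides):
  groups:
  - name: bosh-disks
    rules:
    - alert: PersistentDiskInodesLow
      expr: node_filesystem_files_free{mountpoint="/var/vcap/store"} / node_filesystem_files{mountpoint="/var/vcap/store"} < 0.10
      for: 10m
      annotations:
        summary: "Less than 10% inodes free on {{ $labels.instance }}"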
18. December 2018
● BOSH very slow
● Sometimes locks up for minutes
● The database works on some queries longer than BOSH is willing to wait
● This happens whenever a service is deployed or updated
19. Investigation
● Turns out `bosh tasks -r` queries the last 30 tasks
● We had 3.5m tasks in the DB
● Query: `SELECT * FROM "tasks" WHERE ("deployment_name" = 'd27eda6') ORDER BY "timestamp" DESC LIMIT 30`
○ No index on deployment_name (see the index sketch below)
○ So if a deployment only has 29 tasks, the query crawls through all 3.5m rows looking for task number 30
○ Most deployments have fewer than 30 tasks in the DB
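The fix amounts to indexing the filtered column; as a sketch, the equivalent PostgreSQL statement would be (index name illustrative):
  CREATE INDEX tasks_deployment_name_idx ON tasks (deployment_name, timestamp);
With the composite index the planner can walk the index backwards per deployment and stop after at most 30 entries, instead of scanning the whole table.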
20. Solution
● Changed to `-r=1` (see below)
● Ran a deploy task for each deployment to make sure every deployment has at least one task
● Filed an issue with BOSH core (No. 2105)
● BOSH core fix:
○ BOSH deletes old tasks faster, so you have fewer of them (10 instead of 2 per cleanup run)
○ Put an index on task types
○ 3.5m tasks → 1,100 tasks in the DB
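On the CLI side the workaround is simply (environment alias illustrative):
  bosh -e underbosh tasks -r=1    # fetch only the most recent task instead of the last 30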
22. Things You Should Monitor
● Network IP exhaustion
○ IaaS-dependent, but running out of IPs during deploys is suboptimal
○ Especially when the customer notices first
● Disk IOPS (depending on IaaS)
● Quota limitations
○ The record holder is Azure, where a limit increase took 9 days
● CPU credits on important instances
● Disk inode usage, not just how full it is in terms of data
● Certificate expiration
● Check whether metrics are missing (alert sketches below)
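Hedged Prometheus sketches for the last two points, assuming a blackbox exporter probe against the director endpoint (job names are placeholders):
  # Missing metrics: the series vanished entirely (e.g. exporter removed or renamed)
  - alert: DirectorMetricsMissing
    expr: absent(up{job="bosh-director"})
    for: 15m

  # Certificate expires in less than 14 days
  - alert: DirectorCertExpiresSoon
    expr: probe_ssl_earliest_cert_expiry{job="blackbox-bosh"} - time() < 14 * 24 * 3600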
23. What 1200 deployments taught us
● The BOSH team is usually rather fast at fixing issues that block the director
● BOSH itself is pretty stable
● Change from the Prometheus BOSH exporter to the Graphite HM plugin
● For most small-to-medium environments a t2.large (2 CPUs, 8 GB RAM with burst CPU) or equivalent is plenty
● For large environments an m5.xlarge or m5.2xlarge is enough
○ Disk I/O or network speed will most likely be the bottleneck
24. Advice
● Don't overdo it on the worker count (ops-file sketch below)
○ Our biggest director still has only 9 workers for tasks
○ The others usually have 3-4 workers
● Otherwise you run the risk of CPU-starving yourself when all workers are busy at once
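The count maps to the director job's `director.workers` property; as a sketch, an ops file against a bosh-deployment-style manifest could look like this:
  # Cap the director at 4 task workers
  - type: replace
    path: /instance_groups/name=bosh/jobs/name=director/properties/director/workers?
    value: 4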