Beyond 1000 BOSH Deployments
~/
→ Sven
● At anynines since 2015
● Ordained Minister
● Working with Cloud Foundry, BOSH and K8s
● Twitter @ShalahAllier
Also Known For
Mentor in the EVE Online alliance Test Alliance Please Ignore
Also where my Slack avatar comes from
All the office plants
The Road to 1200 Deployments
The a9s Data Service Framework
BOSH Setup
Overbosh: Deploys the runtime and the Underbosh
Underbosh: Deploys service brokers and services (this is where the 1200 deployments live); uses the CredHub colocated on the Overbosh
Utilsbosh: Deploys utilities and Prometheus monitoring
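For orientation, a minimal sketch of how these three directors might be addressed from an operator workstation; `bosh alias-env` is the real CLI command, while the IPs and CA cert paths are placeholders:

```bash
# Hypothetical aliases for the three directors; IPs and cert paths are placeholders.
bosh alias-env overbosh  -e 10.0.1.6 --ca-cert overbosh-ca.pem
bosh alias-env underbosh -e 10.0.2.6 --ca-cert underbosh-ca.pem
bosh alias-env utilsbosh -e 10.0.3.6 --ca-cert utilsbosh-ca.pem
```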
Lessons Learned
● Deploy less with create-env
● Only create-env the Utilsbosh
● Deploy the other directors with the Utilsbosh (see the sketch below)
● Using the Overbosh as CredHub provider for the Underbosh can be suboptimal
○ Recreating the Overbosh means people cannot create services in the meantime
○ A dedicated CredHub/UAA deployment is better
● Using an external RDS does not solve all problems
○ More about that later
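A minimal sketch of that bootstrap order, assuming the upstream bosh-deployment repo; the state/creds file names and the manifests are placeholders:

```bash
# Bootstrap only the Utilsbosh with create-env (state and creds kept locally).
bosh create-env bosh-deployment/bosh.yml \
  --state=utilsbosh-state.json \
  --vars-store=utilsbosh-creds.yml \
  -o bosh-deployment/aws/cpi.yml \
  -v director_name=utilsbosh
  # (remaining CPI/network vars omitted)

# Deploy the other directors as regular deployments of the Utilsbosh,
# so recreating them never requires another create-env.
bosh -e utilsbosh deploy -d overbosh overbosh-manifest.yml
bosh -e utilsbosh deploy -d underbosh underbosh-manifest.yml
```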
Running it
IO Credits are Fun
● AWS limits IOPS on gp2 SSD volumes to a baseline of 3 IOPS/GB
● None on magnetic volumes (st1, sc1, standard)
● The AWS-Stage BOSH DB runs on gp2
● The AWS-Prod BOSH DB runs on standard
● You can see the disk's IOPS budget in CloudWatch
● Unless it's an RDS instance
● You have to create an alert for each single volume (see the sketch below)
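A hedged example of such a per-volume alert via the AWS CLI; BurstBalance and the AWS/EBS namespace are the real metric names, while the volume ID, threshold, and SNS topic are placeholders:

```bash
# Alarm when a gp2 volume's IOPS burst balance drops below 20%.
# The volume ID and the SNS topic ARN are placeholders.
aws cloudwatch put-metric-alarm \
  --alarm-name "bosh-db-burst-balance-low" \
  --namespace AWS/EBS \
  --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 20 \
  --comparison-operator LessThanThreshold \
  --alarm-actions arn:aws:sns:eu-central-1:123456789012:ops-alerts
```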
Effects
● The database in AWS-P is consistently slower, but with no variation in response times
● The database in AWS-S went unresponsive at some points
○ BOSH sometimes sends a few thousand queries at once, some of which do large joins
● The EU-P BOSH has 50 GB of standard disk ($3/mo)
● The EU-S BOSH has 1 TB of gp2 disk ($119/mo)
Things That Drain Your IOPS
● The daily snapshot task, even if snapshots are disabled
○ Since made less severe
● `bosh vms` / `bosh deployments`
○ More on that later
● If the IOPS on your director disk get depleted repeatedly, consider magnetic storage like sc1 or st1 (see the sketch below)
○ Slower than gp2 at max speed
○ Costs half as much
○ Consistent and fast enough for BOSH
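EBS elastic volumes let you make that switch on a live volume; a minimal sketch with a placeholder volume ID (note that st1 and sc1 have a minimum size of 125 GiB):

```bash
# Check the current type of the director's persistent disk (placeholder ID).
aws ec2 describe-volumes --volume-ids vol-0123456789abcdef0 \
  --query 'Volumes[0].VolumeType'

# Switch it to throughput-optimized magnetic storage in place.
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --volume-type st1
```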
Some issues we had
September 2018
● 670 deployments
● The BOSH director is very slow
● Some queries take 2-3 minutes to complete
● Scaling BOSH and the DB brings only minor improvements
● An m4.2xlarge RDS instance is a bit faster, but does not solve it
● More disk IOPS does not help
Solution
● Updating the director
● The reason: every `bosh vms` made the director also select the deployment configs for each deployment separately
○ Even though they were not part of the output
● SAP stumbled over the issue first and fixed it
November 2018
● BOSH unresponsive or very slow
● No uploads/deploys possible
● Persistent disk 50% free
● `df -i` showed all inodes exhausted (see the sketch below)
● BOSH stores task logs on disk
○ And deletes them regularly
○ If you have 900 deployments and the Prometheus BOSH exporter does a `bosh vms` every 5 minutes, you create tasks faster than BOSH cleans them up
○ 1.8m task log folders on disk
○ Each one contained 0-3 log files
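A rough sketch of the diagnosis on the director VM; the task log path is an assumption based on the usual /var/vcap/store layout and may differ on your director:

```bash
# Inodes can be exhausted while df -h still shows plenty of free space.
df -i /var/vcap/store

# Count task log folders (path assumed; adjust to your director's layout).
ls /var/vcap/store/director/tasks | wc -l
```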
Solution
● Removing the older task log folders (1.79m of them)
● Scaling the disk
● Notifying BOSH core
● Setting up an alert for inode usage on all persistent disks (see the sketch below)
● Switching from the BOSH exporter to the graphite HM plugin
● BOSH core made the director more aggressive at purging old task logs
○ Went from 1.6m task log folders on disk to just 18,000
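A minimal check you could wire into any alerting cron; `df --output=ipcent` is GNU coreutils, and the 80% threshold is an arbitrary choice:

```bash
#!/usr/bin/env bash
# Warn (exit nonzero) if any mounted filesystem has more than 80%
# of its inodes in use.
threshold=80
df --output=target,ipcent | tail -n +2 | while read -r mount pct; do
  pct=${pct%\%}                              # strip the trailing "%"
  [[ $pct =~ ^[0-9]+$ ]] || continue         # skip filesystems reporting "-"
  if (( pct > threshold )); then
    echo "inode usage on $mount is ${pct}%" >&2
    exit 1
  fi
done
```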
December 2018
● BOSH very slow
● Sometimes locks up for minutes
● The database takes longer on some queries than BOSH is willing to wait
● Happens whenever a service is deployed or updated
Investigation
● Turns out that when you use `bosh tasks -r`, it queries the last 30 tasks
● We had 3.5m tasks in the DB
● Query: `SELECT * FROM "tasks" WHERE ("deployment_name" = 'd27eda6') ORDER BY "timestamp" DESC LIMIT 30`
○ No index on deployment_name (see the EXPLAIN sketch below)
○ So if a deployment has only 29 tasks, the database crawls through all 3.5m rows looking for the 30th
○ Most deployments have fewer than 30 tasks in the DB, so this was the common case
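You can reproduce the access pattern with EXPLAIN; a sketch against a Postgres-backed director, where the psql connection details are assumptions and the deployment name is the sample from the slide:

```bash
# Run against the director's Postgres (user/db names are assumptions).
psql -U bosh -d bosh <<'SQL'
-- Without an index on deployment_name, the planner filters row by row
-- until it finds 30 matches -- or has scanned the whole table.
EXPLAIN ANALYZE
SELECT * FROM "tasks"
WHERE ("deployment_name" = 'd27eda6')
ORDER BY "timestamp" DESC
LIMIT 30;
SQL
```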
Solution
● Changing our tooling to `bosh tasks -r=1`
● Running a deploy task for each deployment to make sure there is at least one task
● Filed an issue with BOSH core (No. 2105)
● BOSH core fix (a do-it-yourself index sketch follows below):
○ BOSH deletes old tasks faster, so you have fewer (10 instead of 2 in each cleanup run)
○ Put an index on task types
○ 3.5m tasks → 1100 tasks in the DB
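If you cannot update the director right away, you could add the missing index yourself; a hedged sketch (the Investigation slide shows the filter is on deployment_name; the psql connection details are assumptions):

```bash
# Index the column the "bosh tasks -r" query filters on, so the planner
# no longer scans the whole table for deployments with few tasks.
psql -U bosh -d bosh <<'SQL'
CREATE INDEX CONCURRENTLY IF NOT EXISTS tasks_deployment_name_idx
  ON tasks (deployment_name);
SQL
```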
Monitoring
Things You Should Monitor
● Network IP exhaustion
○ IaaS dependent, but running out of IPs during deploys is suboptimal
○ Especially when the customer notices first
● Disk IOPS (depending on the IaaS)
● Quota limitations
○ Record holder is Azure, where a limit increase took 9 days
● CPU credits on important instances
● Disk inode usage, not just how full it is in terms of data
● Certificate expiration (see the sketch below)
● Check whether metrics are missing
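For the certificate point, openssl has a built-in expiry check; a minimal sketch where the host, the 25555 director port, and the 30-day window are placeholders:

```bash
# Fail if the director's certificate expires within the next 30 days
# (2592000 seconds); host and port are placeholders.
echo | openssl s_client -connect bosh.example.com:25555 2>/dev/null \
  | openssl x509 -noout -checkend 2592000 \
  || echo "certificate expires within 30 days" >&2
```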
What 1200 deployments taught us
● The BOSH team is usually rather fast at fixing issues that block the director
● BOSH itself is pretty stable
● Change from the Prometheus BOSH exporter to the graphite HM plugin
● For most small to medium environments, a t2.large (2 CPUs, 8 GB RAM, burstable CPU) or equivalent is plenty
● For large environments, an m5.xlarge or m5.2xlarge is enough
○ Disk IO/network speed will most likely be the bottleneck
Advice
● Don't overdo it on the worker count (a sketch of the setting follows below)
○ Our biggest director still has only 9 task workers
○ The others usually have 3-4 workers
● Otherwise you risk CPU-starving the director when all workers run simultaneously
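The worker count maps to the director.workers property of the bosh release; a hedged sketch of an ops file against the upstream bosh-deployment manifest (the exact path may differ in your setup):

```bash
# Hypothetical ops file capping the director at 4 task workers;
# apply it on the next create-env/deploy of the director.
cat > director-workers.yml <<'YML'
- type: replace
  path: /instance_groups/name=bosh/properties/director/workers?
  value: 4
YML
```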
The End
Keep in Touch
anynines.com
@anynines
@ShalahAllier
Questions?
