Creating a Culture of Ownership and Trust with Visibility and Transparency by Shai Peretz @Agile Israel 2019
1. Turn the lights on
Creating a culture of ownership and trust
With Visibility & transparency
Shai Peretz
2. About myself:
- Technologist, devops culture advocate
- Held various technology management positions (Outbrain, Cyota,
Shopping.com among others)
- Piano player, mostly Jazz
- Amateur photographer
- Co-founded a Waldorf education (Anthroposophy) school in Tel Aviv
3. So, what is a culture of Ownership?
You built it, you run it!
- You understand best how it is built, and what can go
wrong, hence:
- Take responsibility for getting your code to production
- Create the tests and monitor them
- Set monitoring and alert thresholds
- Decide what is critical and what isn’t
- Say what actions to take when something goes wrong
- Receive the critical alerts and act upon them
- Automate any possible action, so that it will not wake you
up next time:)
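A minimal sketch of what "set thresholds, decide what is critical, say what action to take, automate it" could look like when the owning team writes it down as code. All names here (AlertRule, evaluate, restart_worker, the metrics and threshold values) are hypothetical and only for illustration; they do not come from the talk or from any specific monitoring tool.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AlertRule:
    """Owner-defined rule: thresholds plus the action to take (hypothetical)."""
    metric: str
    warn_above: float
    critical_above: float
    # Automated remediation to try before paging a human (assumed hook).
    auto_action: Optional[Callable[[], bool]] = None


def evaluate(rule: AlertRule, value: float) -> str:
    """Return 'ok', 'warn', 'auto_healed', or 'page' for one metric sample."""
    if value <= rule.warn_above:
        return "ok"
    if value <= rule.critical_above:
        return "warn"  # shown on a dashboard, nobody is woken up
    # Critical: try the automated action first, page only if it fails.
    if rule.auto_action is not None and rule.auto_action():
        return "auto_healed"
    return "page"


def restart_worker() -> bool:
    # Placeholder for a real remediation step; always "succeeds" here.
    return True


queue_rule = AlertRule("queue_depth", warn_above=1_000, critical_above=10_000,
                       auto_action=restart_worker)
error_rule = AlertRule("error_rate", warn_above=0.01, critical_above=0.05)

print(evaluate(queue_rule, 20_000))  # critical, but the automated action handles it
print(evaluate(error_rule, 0.2))     # critical with no automation: someone is paged
```

The point of the design is that only the critical branch without a working automated action reaches a human, which matches the last bullet: automate the action so it does not wake you up next time.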
4. What is a culture of Ownership?
- Nothing is ‘not my problem’:
- First verify that all my systems are working as they should, then look
elsewhere
- Be transparent and don’t hide your mistakes
- Culture of cooperation and helping others (production party:)
5. What is a culture of Ownership?
Learn from your mistakes:
- Document your actions (automatically)
- Blameless take-ins (post mortem):
- Led by the event manager, as close in time to the event as possible
- Include all stakeholders
- 5 whys methodology
- Create tasks with due dates and priority
- Go back and check that tasks are done
Celebrate failure -
it is the best opportunity to learn!
7. So you have built a new service…
It is a really great app, smart and useful, people love it.
It is responsive, using all the latest technology and buzzwords.
It communicates with a dozen other services via efficient APIs.
It collects tons of data, processes and moves it via the latest message queues, and stores
it in several data stores.
It grabs the data back using a smart search engine and oops…
The user is getting an error. What happened?
8. Something went wrong...
Is it the search engine?
Or maybe the data store?
Or maybe a problem with the network?
Or maybe a broken API call?
Or the reverse proxy is down?
Or...
10. Can someone please turn the lights on?
Well, you can start looking for the problem with a flashlight...
11. But you only have 5 seconds...
That’s because you have committed to a 99.95% SLA, and you have used most of
your allowed downtime already:(
And your system is complicated...
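To make the downtime budget concrete, here is a small worked calculation. The slide gives only the 99.95% figure; the measurement period (30 days or 365 days) and the amount of downtime already used are assumptions chosen for illustration.

```python
def allowed_downtime_seconds(sla_percent: float, period_days: float) -> float:
    """Downtime permitted by an availability SLA over a period, in seconds."""
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (1 - sla_percent / 100)


def remaining_budget_seconds(sla_percent: float, period_days: float,
                             downtime_used_seconds: float) -> float:
    """Budget left after subtracting the downtime already consumed."""
    return allowed_downtime_seconds(sla_percent, period_days) - downtime_used_seconds


monthly = allowed_downtime_seconds(99.95, 30)    # about 1296 s (21.6 minutes)
yearly = allowed_downtime_seconds(99.95, 365)    # about 15768 s (4.38 hours)

# Assumed example: 1291 s of the 30-day budget are already spent.
left = remaining_budget_seconds(99.95, 30, 1291)  # about 5 s remain

print(f"30-day budget: {monthly / 60:.1f} min, 365-day budget: {yearly / 3600:.2f} h")
print(f"remaining this period: {left:.0f} s")
```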
13. We now have:
3 million application metrics per minute +
1 million system metrics per minute +
750,000 log lines per minute
75 different dashboards rotating on six 65” monitors.
Is that enough light?
No. That’s too much light.
Which still leaves us in the dark.
What we need is some filters:
16. Yet, we can’t find where the problem is...
That’s because we have too much information. Well, at least for a human being.
17. Why don’t we let machines deal with it?
That’s exactly what they are built for*:
- Process tons of information in fractions of a second
- Correlate data from many different sources
- Analyze the data, search for anomalies
- Act upon it automatically
(or at least notify someone)
* Assuming the humans who programmed those machines did a good job:)
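As one illustration of "analyze the data, search for anomalies", the sketch below flags samples that deviate strongly from a rolling baseline. This is a simple z-score check chosen for brevity, not a method named in the talk; the function name, window size, threshold, and sample data are all assumptions.

```python
from statistics import mean, stdev
from typing import List, Sequence


def find_anomalies(samples: Sequence[float], window: int = 10,
                   threshold: float = 3.0) -> List[int]:
    """Return indexes whose value is more than `threshold` standard
    deviations away from the mean of the preceding `window` samples."""
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up latency samples in milliseconds, with one obvious spike.
latencies_ms = [100, 102, 99, 101, 100, 98, 103, 100, 101, 99, 100, 450, 101]
print(find_anomalies(latencies_ms))  # [11]
```

A real system would feed the result into an automated action or a notification, as the list above suggests; note that a spike also inflates the baseline for the samples that follow it, which is one reason production detectors tend to be more elaborate.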
18. Great visibility:
Encourages prevention - helps prevent problems before
they occur, by forcing you to consider most possible
problems in advance
Enables automatic self-healing when possible
And if not - provides us with a laser focus pointer into where
the problem is, in a timely manner (near real time) and allows
us to fix quickly (and automate for next time:)
19. So what tools should we use?
It doesn't really matter.
Well, it actually does:)
Choose a set of tools that fits your organization and that you are comfortable with,
as long as you get good coverage of:
- Automatic testing (visibility of the build & deploy pipeline)
- Infrastructure/system metrics and logs
- Application level metrics and logs
- External (user experience) monitoring
- Prediction and anomaly detection
Select tools that you trust and make their availability first priority!
20. Benefits of good visibility
Enabler for Agile and DevOps culture -
easier to take responsibility, better communication
Drives quality up (both code and infrastructure)
Improves MTTD, MTTR and MTTS (better SLA)
Reduces frustration and improves productivity
Helps to achieve business goals
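A rough sketch of how MTTD and MTTR could be tracked from incident records. The definitions are assumptions, since the slides do not define them: MTTD is taken here as the mean time from the start of an incident to its detection, and MTTR as the mean time from start to resolution. MTTS is left out because the slides do not say what it stands for. The Incident type and the timestamps are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    started: datetime
    detected: datetime
    resolved: datetime


def mean_delta(deltas: List[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: List[Incident]) -> timedelta:
    # Assumed definition: start of incident until it was detected.
    return mean_delta([i.detected - i.started for i in incidents])


def mttr(incidents: List[Incident]) -> timedelta:
    # Assumed definition: start of incident until it was resolved.
    return mean_delta([i.resolved - i.started for i in incidents])


incidents = [
    Incident(datetime(2019, 1, 7, 10, 0), datetime(2019, 1, 7, 10, 4),
             datetime(2019, 1, 7, 10, 30)),
    Incident(datetime(2019, 1, 9, 14, 0), datetime(2019, 1, 9, 14, 2),
             datetime(2019, 1, 9, 14, 20)),
]
print(mttd(incidents))  # 0:03:00
print(mttr(incidents))  # 0:25:00
```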
22. Monitoring the monitoring system
No alerts. All dashboards are green. Does it really mean all is good?
Not necessarily…
You have to verify:
- Set another layer of independent monitoring, outside your network
- Create ‘positive’ checks that confirm the system is up
If you don’t trust your monitoring system, it is useless!
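One way to read a 'positive' check is that silence counts as failure: a service is considered healthy only if it has actively reported in recently. The sketch below illustrates that idea; the HeartbeatMonitor class, its method names, and the service names are hypothetical, and in practice it would run on an independent system outside your network, as the slide recommends.

```python
import time
from typing import Dict, List, Optional


class HeartbeatMonitor:
    """Treats silence as failure: healthy means 'reported recently'."""

    def __init__(self, max_age_seconds: float) -> None:
        self.max_age_seconds = max_age_seconds
        self._last_seen: Dict[str, float] = {}

    def expect(self, service: str) -> None:
        """Register a service that must report, even before its first beat."""
        self._last_seen.setdefault(service, float("-inf"))

    def beat(self, service: str, now: Optional[float] = None) -> None:
        """Record a positive 'I am up' signal from a service."""
        self._last_seen[service] = time.time() if now is None else now

    def unhealthy(self, now: Optional[float] = None) -> List[str]:
        """Services whose last positive signal is older than the allowed age."""
        now = time.time() if now is None else now
        return sorted(name for name, seen in self._last_seen.items()
                      if now - seen > self.max_age_seconds)


monitor = HeartbeatMonitor(max_age_seconds=60)
monitor.expect("search")
monitor.expect("alerting-pipeline")   # the monitoring system itself is monitored
monitor.beat("search", now=1000)

print(monitor.unhealthy(now=1030))  # ['alerting-pipeline'] (never reported)
print(monitor.unhealthy(now=1100))  # ['alerting-pipeline', 'search'] (search went silent)
```

With this shape, an all-green dashboard means every expected component has recently confirmed it is alive, rather than merely that no alert happened to fire.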
23. Ownership + Transparency => Trust
Bring facts to your discussions
Take ownership of your stuff
Share your mistakes
Don’t blame others
When trust exists, people are more cooperative and open to learn =>
problems are fixed faster and rarely repeat themselves
24. Transparency
Status pages (if done properly):
- Can save a lot of time while troubleshooting a problem
- Increase transparency, build trust
- Should be automated wherever possible
- Use multi-level pages - different levels of detail for engineering, business and
customers
Share your plans and progress
- Especially when you have delays...
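The multi-level status page idea can be sketched as one automated status record rendered at different levels of detail per audience. The audience names follow the slide; the field names, the AUDIENCE_FIELDS mapping, and the sample component are assumptions made up for illustration.

```python
from typing import Dict, List

# Which fields each audience sees (hypothetical field names).
AUDIENCE_FIELDS: Dict[str, List[str]] = {
    "customers": ["component", "status"],
    "business": ["component", "status", "impact"],
    "engineering": ["component", "status", "impact", "failing_check", "owner"],
}


def render_status(components: List[Dict[str, str]],
                  audience: str) -> List[Dict[str, str]]:
    """Project the full status records down to what one audience should see."""
    fields = AUDIENCE_FIELDS[audience]
    return [{field: comp[field] for field in fields} for comp in components]


components = [
    {
        "component": "search",
        "status": "degraded",
        "impact": "slower results for some users",
        "failing_check": "p99 latency above threshold",
        "owner": "search team",
    },
]

print(render_status(components, "customers"))
print(render_status(components, "engineering"))
```

Keeping a single source record and filtering it per audience means every page reflects the same automated facts, with only the level of detail changing.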
25. How transparent should it be?
My rule of thumb - open up everything that will not hurt your organization
In order to be able to do so:
- People need to respect confidentiality
- People should have effective filters as to what is relevant for them
Trust <=> Transparency
This is a fragile circle, very easy to break!
26. Impact of good visibility and transparency
- Visibility
- Transparency
- Responsibility
- Ownership
- Communication
- Quality
- Frustration
- Fatigue
- MTTD
- MTTR
- MTTS
- Uptime
- SLA
- Revenue
- Customer satisfaction
- Employee satisfaction