Reactive Microsystems
The Evolution of Microservices at Scale
Jonas Bonér
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Reactive Microsystems, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-99433-7
[LSI]
Table of Contents
Introduction
6. Toward Scalable Persistence
    Moving Beyond CRUD
    Event Logging: The Scalable Seamstress
    Transactions: The Anti-Availability Protocol
8. Next Steps
    Further Reading
    Start Hacking
Introduction
We Can't Make the Horse Faster
If I had asked people what they wanted, they would have said faster
horses.
Henry Ford1
1 It's been debated whether Henry Ford actually said this. He probably didn't. Regardless, it's a great quote.
They have had many names over the years (DCOM, CORBA, EJBs,
WebServices, etc.). Today, we call them microservices. We, as an
industry, have gone full circle again. Fortunately, it is more of an
upward spiral as we are getting a little bit better at it every time
around.
take full advantage of the cloud.2 That said, it also can introduce
unnecessary complexity and simply slow you down. In other words,
do not apply microservices blindly. Think for yourself.
2 If approached from the perspective of distributed systems, which is the topic of this
report.
CHAPTER 1
Essential Traits of an Individual Microservice
Isolation is the most important trait and the foundation for many of
the high-level benefits in microservices.
Isolation also has the biggest impact on your design and architecture. It will, and should, slice up the entire architecture, and therefore it needs to be considered from day one.
It will even affect the way you break up and organize the teams and
their responsibilities, as Melvyn Conway discovered in 1967 (later
named Conway's Law):
Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.
Isolation between services makes it natural to adopt Continuous
Delivery (CD). This makes it possible for you to safely deploy applications and roll out and revert changes incrementally, service by service.
Isolation makes it easier to scale each service, as well as allowing
them to be monitored, debugged, and tested independently, something that is very difficult if the services are all tangled up in the big bulky mess of a monolith.
Act Autonomously
In a network of autonomous systems, an agent is only concerned with
assertions about its own policy; no external agent can tell it what to do,
without its consent. This is the crucial difference between autonomy and
centralized management.
Mark Burgess, Promise Theory
Single Responsibility
This is the Unix philosophy: Write programs that do one thing and do it
well. Write programs to work together.
Doug McIlroy
1 The Unix philosophy is described really well in the classic book The Art of Unix Programming by Eric Steven Raymond (Pearson Education).
2 For an in-depth discussion on the Single Responsibility Principle, see Robert C. Martin's website The Principles of Object Oriented Design.
Before we take on the task of slaying the monolith, let's try to understand why its architecture is problematic and why we need to slay it and move to a decoupled architecture using microservices.
Suppose that we have a monolithic Java Platform, Enterprise Edition
(Java EE) application with a classic three-tier architecture that uses
Servlets, Enterprise Java Beans (EJBs) or Spring, and Java Persistence API (JPA), and an Oracle SQL database. Figure 2-1 depicts
this application.
Figure 2-1. A monolithic application with a classic three-tier architecture
Suppose that we want to move away from the application server and
the strongly coupled design and refactor this monolith into a
microservices-based system. By just drinking the Kool-Aid, relying
on a scaffolding tool, and following the path of least resistance,
many people end up with an architecture similar to that shown in
Figure 2-2.
Embrace Uncertainty
What is not surrounded by uncertainty cannot be the truth.
Richard Feynman
It sounds like a scary world.2 But it is also the world that gives us
solutions for resilience, elasticity, and isolation, among others. What
we need is better tools to not just survive, but to thrive in the barren
land of distributed systems.
2 It is; if you have not experienced this first-hand, I suggest that you spend some time thinking through the implications of L. Peter Deutsch's Fallacies of Distributed Computing.
3 The fact that information has latency and that the speed of light represents a hard (and sometimes very frustrating) nonnegotiable limit on its maximum velocity is obvious to anyone who is building internet systems, or who has been on a VoIP call across the Atlantic Ocean.
4 Peter Bailis has a good explanation of the different flavors of strong consistency.
5 A good discussion on different client-side semantics of eventual consistency, including read-your-writes consistency and causal consistency, can be found in Eventually Consistent: Revisited by Werner Vogels.
6 Justin Sheehy's There Is No Now is a great read on the topic.
There has been a lot of buzz about eventual consistency, and for
good reason. It allows us to raise the ceiling on what can be done in
terms of scalability, availability, and reduced coupling.
However, relying on eventual consistency is sometimes not permissible, because it can force us to give up too much of the high-level business semantics. If this is the case, using causal consistency can be a good trade-off. Semantics based on causality are what humans expect and find intuitive. The good news is that causal consistency can be made both scalable and available (and has even been proven8 to be the best we can do in an always-available system).
7 Another excellent paper by Pat Helland, in which he introduced the idea of ACID 2.0, is Building on Quicksand.
8 That causal consistency is the strongest consistency we can achieve in an always-available system was proved by Mahajan et al. in their influential paper Consistency, Availability, and Convergence.
9 For good discussions on vector clocks, see the articles Why Vector Clocks Are Easy and Why Vector Clocks Are Hard.
10 For more information, see Marc Shapiro's paper A comprehensive study of Convergent and Commutative Replicated Data Types.
11 For a great production-grade library for CRDTs, see Akka Distributed Data.
change propagates in the system: things like communication patterns, workflow, figuring out who is talking to whom, who is
responsible for what data, and so on. We need to model the business
domain from a data dependency and communication perspective.
As Greg Young, who coined Command Query Responsibility Segregation (CQRS), says:
When you start modeling events, it forces you to think about the behavior of the system, as opposed to thinking about structure inside the system.
Modeling events forces you to have a temporal focus on what's going on
in the system. Time becomes a crucial factor of the system.
Events represent facts about the domain and should be part of the
Ubiquitous Language of the domain. They should be modelled as
Domain Events and help us define the Bounded Contexts,1 forming
the boundaries for our service.
As Figure 4-1 illustrates, a bounded context is like a bulkhead: it
prevents unnecessary complexity from leaking outside the contextual boundary, while allowing you to use a single and coherent
domain model and domain language within.
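To make this concrete, here is a minimal sketch of how such Domain Events could be modeled in Scala. The Order domain, event names, and fields are hypothetical illustrations, not taken from the text; the point is that events are immutable facts, named in the past tense, that carry everything needed to understand what happened.

    import java.time.Instant

    // Hypothetical Domain Events for an "Orders" bounded context.
    // Each event is an immutable fact, named in the past tense.
    sealed trait OrderEvent {
      def orderId: String
      def occurredAt: Instant
    }

    final case class OrderPlaced(orderId: String, customerId: String, occurredAt: Instant) extends OrderEvent
    final case class ItemAdded(orderId: String, sku: String, quantity: Int, occurredAt: Instant) extends OrderEvent
    final case class OrderShipped(orderId: String, trackingId: String, occurredAt: Instant) extends OrderEvent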
1 For an in-depth discussion on how to design and use bounded contexts, read Vaughn Vernon's book Implementing Domain-Driven Design (Addison-Wesley).
2 An in-depth discussion on event storming is beyond the scope of this report, but a good starting point is Alberto Brandolini's upcoming book Event Storming.
3 Pat Helland's paper, Data on the Outside versus Data on the Inside, talks about guidelines for designing consistency boundaries. It is essential reading for anyone building microservices-based systems.
4 For a good discussion on how to design with aggregates, see Vaughn Vernon's Effective Aggregate Design.
5 You can find a good summary of the design principles for almost-infinite scalability
here.
Figure 4-2 presents the flow of commands between a client and the
services/aggregates (an open arrow indicates that the command or
event was sent asynchronously).
If we add the events to the picture, it looks something like the flow
of commands shown in Figure 4-3.
Please note that this is only the conceptual flow of the events: how they flow between the services. An actual implementation will use subscriptions on the aggregate's event stream to coordinate workflow between multiple services (something we will discuss in depth later on in this report).
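As a rough illustration of that kind of subscription, here is a sketch that uses Akka Persistence Query to follow one aggregate's event stream and trigger follow-up work. The read-journal identifier, persistence ID, and handler are placeholders and would depend on the journal plugin you actually use.

    import akka.actor.ActorSystem
    import akka.persistence.query.{EventEnvelope, PersistenceQuery}
    import akka.persistence.query.scaladsl.EventsByPersistenceIdQuery
    import akka.stream.ActorMaterializer
    import akka.stream.scaladsl.Sink

    object OrderWorkflow extends App {
      implicit val system = ActorSystem("orders")
      implicit val materializer = ActorMaterializer()

      // Look up whichever read journal the system is configured with
      // ("example.read-journal" is a placeholder plugin identifier).
      val readJournal = PersistenceQuery(system)
        .readJournalFor[EventsByPersistenceIdQuery]("example.read-journal")

      // Subscribe to the live event stream of a single aggregate instance
      // and coordinate follow-up work for each event it emits.
      readJournal
        .eventsByPersistenceId("Order-42", 0L, Long.MaxValue)
        .runWith(Sink.foreach[EventEnvelope] { envelope =>
          println(s"coordinating workflow for event: ${envelope.event}")
        })
    }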
After this lengthy discussion about events and immutable facts you
might be wondering if mutable state deserves a place at the table at
all.
It's a fact that mutable state, often in the form of variables, can be problematic. One problem is that the assignment statement, as discussed by John Backus in his Turing Award lecture, is a destructive
6 For example, session state, credentials for authentication, cached data, and so on.
both techniques to see how we can use them on different levels
throughout our design.
Go Asynchronous
Asynchronous and nonblocking I/O is about not blocking threads of execution: a process should not hold a thread hostage, hogging resources that it does not use. It can help eliminate the biggest threat to scalability: contention.1
Asynchronous and nonblocking execution and I/O are often more cost-efficient because they make better use of resources. They help minimize contention (congestion) on shared resources in the system, which is one of the biggest hurdles to scalability, low latency, and high throughput.
As an example, let's take a service that needs to make 10 requests to 10 other services and compose their responses. Suppose that each request takes 100 milliseconds. If it needs to execute these in a synchronous, sequential fashion, the total processing time will be roughly 1 second, as demonstrated in Figure 5-1.
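Executed asynchronously and in parallel, by contrast, the combined latency approaches that of a single call, roughly 100 milliseconds. Here is a minimal sketch of the fan-out using Scala Futures; callService is a stand-in for a real remote call:

    import scala.concurrent.{Await, Future, blocking}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    object ComposeRequests extends App {
      // Stand-in for a call to another service that takes ~100 milliseconds.
      def callService(i: Int): Future[String] =
        Future { blocking { Thread.sleep(100) }; s"response-$i" }

      // All 10 calls run concurrently, so the composed result is ready
      // after roughly 100 ms rather than roughly 1 second.
      val responses: Future[Seq[String]] =
        Future.sequence((1 to 10).map(callService))

      println(Await.result(responses, 2.seconds).mkString(", "))
    }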
1 As Gene Amdahl, who coined Amdahl's Law, has shown us.
2 With more threads comes more context switching, which is very costly. For more information on this, go to the How long does it take to make a context switch? blog post on Tsuna's blog.
3 Started by Lightbend in 2014; together with Netflix, Pivotal, Oracle, and Twitter, it has created a standard for backpressure on the JVM, now staged for inclusion in Java 9 as the Flow API.
Figure 5-5. Circuit breakers can help improve the resilience of the service
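A minimal sketch of guarding a downstream call with a circuit breaker, here using Akka's akka.pattern.CircuitBreaker; the thresholds, timeouts, and the callOtherService function are illustrative assumptions:

    import akka.actor.ActorSystem
    import akka.pattern.CircuitBreaker
    import scala.concurrent.Future
    import scala.concurrent.duration._

    object GuardedCall extends App {
      implicit val system = ActorSystem("resilience")
      import system.dispatcher

      // Open the circuit after 5 consecutive failures, time out individual
      // calls after 2 seconds, and probe the service again after 30 seconds.
      val breaker = new CircuitBreaker(
        system.scheduler,
        maxFailures = 5,
        callTimeout = 2.seconds,
        resetTimeout = 30.seconds)

      def callOtherService(): Future[String] =
        Future { "response" } // placeholder for a real remote call

      // While the circuit is open, calls fail fast instead of piling up.
      val result: Future[String] = breaker.withCircuitBreaker(callOtherService())
    }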
4 For an example of an asynchronous JDBC wrapper, see Peter Lawrey's post, A JDBC Gateway Microservice.
5 Most NoSQL databases have asynchronous drivers. In the SQL world, you can turn to
Postgres or MySQL.
7 As brilliantly explained by Joel Spolsky in his classic piece The Law of Leaky Abstractions.
8 The fallacies of RPC have never been better explained than in Steve Vinoski's Convenience over Correctness.
9 As explained by Jim Waldo et al., in their classic paper A Note on Distributed Computing.
12 Inspired by David Wheeler, who once said, All problems in computer science can be
solved by another level of indirection.
Bulkheading is most well known for being used in the ship construction industry as a way to divide the ship into isolated, watertight compartments. If a few compartments fill with water, the leak is contained and the ship can continue to function.
The same thinking and technique can be applied successfully to
software. It can help us to arrive at a design that prevents failures
from propagating outside the failed component, avoiding cascading failures that take down an entire system.
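One common way to bulkhead software is to give each risky or blocking dependency its own bounded thread pool, so that exhausting it cannot starve the rest of the system. Here is a sketch using a dedicated Akka dispatcher; the dispatcher name, pool size, and payment-service example are illustrative assumptions:

    import akka.actor.ActorSystem
    import com.typesafe.config.ConfigFactory
    import scala.concurrent.{ExecutionContext, Future}

    object BulkheadExample extends App {
      // A bounded pool reserved for one downstream dependency, defined
      // inline here for brevity (normally this lives in application.conf).
      val config = ConfigFactory.parseString(
        """
        payment-dispatcher {
          type = Dispatcher
          executor = "thread-pool-executor"
          thread-pool-executor { fixed-pool-size = 8 }
        }
        """).withFallback(ConfigFactory.load())

      val system = ActorSystem("bulkheads", config)

      // Calls to the payment provider run on their own pool; if that pool
      // is exhausted, the rest of the system keeps its threads.
      implicit val paymentEc: ExecutionContext =
        system.dispatchers.lookup("payment-dispatcher")

      def chargeCard(orderId: String): Future[String] =
        Future { /* blocking call to the payment provider goes here */ s"charged $orderId" }
    }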
Some people might think of the Titanic as a counterexample. It is actually an interesting study13 in what happens when you don't have proper isolation between the compartments, and how that can lead to cascading failures, eventually taking down the entire system. The Titanic did use bulkheads, but the walls that were supposed to isolate the compartments did not reach all the way up to the ceiling. So, when 6 of its 16 compartments were ripped open by the iceberg, the ship began to tilt and water spilled over the bulkheads from one compartment to the next, eventually filling up all of the compartments and sinking the Titanic, killing 1,500 people.
13 For an in-depth analysis of what made Titanic sink, see the article Causes and Effects
of the Rapid Sinking of the Titanic.
14 For a detailed discussion on this pattern, see the Akka documentation on Supervision
and Monitoring.
15 Joe Armstrong's thesis Making reliable distributed systems in the presence of software errors is essential reading on the subject. According to Armstrong, Mike Williams at Ericsson Labs came up with the idea of links between processes, as a way of monitoring process health and lifecycle, forming the foundation for process supervision.
16 A couple of great, and highly influential, papers on this topic are Crash-Only Software and Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel, both
by George Candea and Armando Fox.
Figure 5-11. Separating the stateless processing from the stateful aggregate
17 As always, it depends. A stateless architecture could work fine for sharded workloads,
using consistent hashing algorithms, and a highly available database like Cassandra.
18 There are ways to support multiple concurrently active aggregates. As an example, we
are currently building out a solution for Akka Persistence that is based on operation-based CRDTs and vector clocks.
19 Akka Cluster is a decentralized, node ring-based cluster management library for Akka that uses epidemic gossip protocols and failure detection, in a similar fashion to Amazon's Dynamo.
Disk space used to be very expensive. This is one of the reasons why
most SQL databases use update-in-place, overwriting existing records with new data as it arrives.
As Jim Gray, Turing Award winner and legend in database and
transaction processing research, once said, "Update-in-place strikes many systems designers as a cardinal sin: it violates traditional accounting practices that have been observed for hundreds of years."
Still, money talked, and CRUD was born.
The good news is that today disk space is incredibly cheap, so there is little to no reason to use update-in-place for a System of Record. We
can afford to store all data that has ever been created in a system,
giving us the entire history of everything that has ever happened in
it.
We don't need Update and Delete anymore.1 We just Create new facts, either by adding more knowledge or by drawing new conclusions from existing knowledge, and Read facts, from any point in the history of the system. CRUD is no longer necessary.
The question is this: how can we do this efficiently? Let's turn our attention to Event Logging.
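As a toy illustration of the idea (not how a production event log is built), the following hypothetical in-memory log only ever Creates and Reads facts:

    import java.time.Instant

    // A hypothetical fact stored in the log.
    final case class Fact(entityId: String, data: String, recordedAt: Instant)

    // A toy append-only event log: facts are only ever appended and read,
    // never updated in place or deleted.
    final class EventLog {
      private var log: Vector[Fact] = Vector.empty

      def append(fact: Fact): Unit =
        log = log :+ fact

      // Read the full history of an entity, or its history as of a point in time.
      def history(entityId: String, asOf: Instant = Instant.now()): Vector[Fact] =
        log.filter(f => f.entityId == entityId && !f.recordedAt.isAfter(asOf))
    }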
1 This is general advice that is meant to guide the design and thought process; by no
means is it a rule. There might be legal (data-retention laws) or moral (users requesting their account to be deleted) requirements to physically delete data after a particular
period of time. Still, using traditional CRUD is most often the wrong way to think
about the design of such a system.
2 The quote is taken from Pat Helland's insightful paper Immutability Changes Everything.
You'll notice that all those data stores have very different sweet
spots and are optimized for different access patterns.
As an example, if you want to build a graph of friends and run queries along the lines of "who is my friend's best friend?", this query will be most efficiently answered by a graph-oriented database (such as Neo4j). A graph database is easily populated dynamically, as friends add each other on your social site, by subscribing to the Friend aggregate's Event Stream. In the same way, you can build other views from your event streams in real time, using read-specialized databases.
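Here is a rough sketch of such a read-side projection, folding hypothetical friendship events into an in-memory adjacency map; a real system would subscribe to the journal and write into a graph database such as Neo4j instead:

    // A hypothetical event published on the Friend aggregate's event stream.
    final case class FriendAdded(userId: String, friendId: String)

    // A read-side projection: fold the event stream into a friend graph
    // optimized for "who is my friend's best friend?"-style queries.
    final class FriendGraphProjection {
      private var graph: Map[String, Set[String]] = Map.empty

      private def friendsOf(userId: String): Set[String] =
        graph.getOrElse(userId, Set.empty)

      def handle(event: FriendAdded): Unit =
        graph = graph.updated(event.userId, friendsOf(event.userId) + event.friendId)

      def friendsOfFriends(userId: String): Set[String] =
        friendsOf(userId).flatMap(friendsOf) - userId
    }

    object FriendGraphExample extends App {
      val projection = new FriendGraphProjection
      Seq(FriendAdded("alice", "bob"), FriendAdded("bob", "carol")).foreach(projection.handle)
      println(projection.friendsOfFriends("alice")) // Set(carol)
    }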
One trade-off is that CQRS with event sourcing forces you to tackle
the essential complexity4 of the problem head on. This is often a good
thing, but if you are building a minimum viable product (MVP) or
prototype, a throwaway that you need to get to market quickly in
order to test an idea, you might be better off starting with CRUD
(and a monolith), moving to a more sustainable design after it has
proved its value in the market.
Another trade-off in moving to an architecture based on CQRS with
event sourcing is that the write side and read side will be eventually
consistent. It takes time for events to propagate between the two
storage models, which often reside on separate nodes or clusters.
The delay is often only a matter of milliseconds to a few seconds, but
it can have a big impact on the design of your system.
Techniques such as reactive and event-driven design, denormalization, and minimized units of consistency are essential here, and make these trade-offs less of an issue.
In general, it is important to take a step back from years of preconceived knowledge and biases, and look at how the world actually works. The world is seldom strongly consistent, and embracing reality, and the actual semantics in the domain, often opens up opportunities for relaxing the consistency requirements.
One problem that all persistence approaches have, but for which event sourcing together with CQRS offers a straightforward solution, is the fact that saving the event and executing the side-effect that the event represents often can't be performed as a single atomic
operation. The best strategy is to rely on acknowledgements (the persisted events themselves) and retries, using one of the two patterns below (a sketch of the second pattern follows the list):
4 You can find a good definition of the difference between essential complexity and accidental complexity here.
• Perform the side-effect before persisting the event, with the risk that the side-effect is performed but the event is never stored. This pattern works well when you can depend on the upstream to retry an operation, by resubmitting the command, until it is successful. It can be done by subscribing to the aggregate's event stream (as previously discussed), waiting for the acknowledgement (the event itself) that the event was persisted successfully, and, if it is not received within a specific time window, resubmitting the command.
• Store an event representing the intent to perform the side-effect, perform the side-effect itself, and finally persist an event, the acknowledgment, that the side-effect was successfully performed. In this style, you take on the responsibility of executing the action, so upon replay you can then choose to reexecute the side-effect if the acknowledgement is missing.
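A schematic sketch of the second pattern, with hypothetical intent and acknowledgement events and a replay check; it is not tied to any particular event-sourcing library:

    // Hypothetical events bracketing a side-effect (here: sending an email).
    sealed trait Event
    final case class EmailRequested(orderId: String, to: String) extends Event
    final case class EmailSent(orderId: String) extends Event

    final class EmailSideEffect(persist: Event => Unit, sendEmail: (String, String) => Unit) {

      // Normal path: record the intent, perform the side-effect, record the ack.
      def run(orderId: String, to: String): Unit = {
        persist(EmailRequested(orderId, to))
        sendEmail(orderId, to)
        persist(EmailSent(orderId))
      }

      // On replay: re-execute the side-effect only if the intent was stored
      // but the acknowledgement never made it into the log.
      def replay(events: Seq[Event]): Unit =
        events.collect { case EmailRequested(orderId, to) => (orderId, to) }
          .filterNot { case (orderId, _) => events.contains(EmailSent(orderId)) }
          .foreach { case (orderId, to) => sendEmail(orderId, to) }
    }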
Finally, one thing to take into account is that using CQRS with event sourcing makes it more difficult to delete data, which is something that is becoming increasingly important, for both legal and moral reasons. First, we need to manage deletion of the events in the event log. There are many different strategies for this, but discussing them is beyond the scope of this report. Second, the events have also been made available through the aggregate's event stream, where they
could have been picked up by several other databases, systems, and
services. Keeping track of where all events end up is a nontrivial
exercise that needs to be thought through carefully when designing
the system (and maintained throughout the long-term evolution of the system).
So, what should we do? Let's take a step back and think about how we deal with partial and inconsistent information in real life.
For example, suppose that we are chatting with a friend in a noisy bar. If we can't catch everything that our friend is saying, what do we do? We usually (hopefully) have a little bit of patience and allow
5 The infamous, and far too common, anti-pattern Integrating over Database comes to
mind.
6 This quote is from Pat Helland's excellent paper Life Beyond Distributed Transactions.
7 It's worth reading Pat Helland's insightful article Memories, Guesses, and Apologies.
8 See Clemens Vasters' post Sagas for a short but good introduction to the idea. For a more in-depth discussion, putting it in context, see Roland Kuhn's excellent book Reactive Design Patterns (Manning).
9 Originally defined in the paper Sagas by Hector Garcia-Molina and Kenneth Salem.
10 For an in-depth discussion, see Caitie McCaffrey's great talk on Distributed Sagas.
11 For an understanding of how Spanner works, see the original paper, Spanner: Google's Globally-Distributed Database.
12 If you are interested in this, be sure to read Eric Brewer's Spanner, TrueTime, and the CAP Theorem.
13 For more information, see Highly Available Transactions: Virtues and Limitations, by
Peter Bailis et al.
14 One fascinating paper on this topic is Coordination Avoidance in Database Systems
by Peter Bailis et al.
15 A must-see talk, explaining the essence of the problem and painting a vision for where we need to go as an industry, is Peter Alvaro's excellent RICON 2014 keynote Outwards from the Middle of the Maze.
You could not step twice into the same river. Everything flows and nothing stays.
Heraclitus
1 We are using Tyler Akidau's definition of streaming: "A type of data processing engine that is designed with infinite data sets in mind," from his article The world beyond batch: Streaming 101.
time, when it happens, to perform continuous queries or aggregations of inbound data and feed it, in real time, back into the application to affect the way it is operating.
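As a small illustration of that idea using Akka Streams (the event source and aggregation are placeholders), a continuous query can group inbound data into short windows and feed each result straight back into the application:

    import akka.actor.ActorSystem
    import akka.stream.ActorMaterializer
    import akka.stream.scaladsl.{Sink, Source}
    import scala.concurrent.duration._
    import scala.util.Random

    object ContinuousQuery extends App {
      implicit val system = ActorSystem("streaming")
      implicit val materializer = ActorMaterializer()

      // Placeholder for an infinite stream of inbound measurements.
      val measurements = Source.tick(0.seconds, 10.millis, ()).map(_ => Random.nextDouble())

      // A continuous aggregation: average over one-second windows,
      // fed back to the application as soon as each window closes.
      measurements
        .groupedWithin(1000, 1.second)
        .map(window => window.sum / window.size)
        .runWith(Sink.foreach(avg => println(f"rolling average: $avg%.3f")))
    }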
We have covered a lot of ground in this report, yet for some of the
topics we have just scratched the surface. I hope it has inspired you
to learn more and to roll up your sleeves and try these ideas out in
practice.
Further Reading
Learning from past failures1 and successes3 in distributed systems
and collaborative services-based architectures is paramount. Thanks
to books and papers, we don't need to live through it all ourselves but have a chance to learn from other people's successes, failures,
mistakes, and experiences.
There are a lot of references throughout this report; I very much encourage you to read them.
When it comes to books, there are so many to recommend. If I had
to pick two that take this story further and provide practical real-world advice, they would be Roland Kuhn's excellent Reactive Design
1 The failures of SOA, CORBA, EJB,2 and synchronous RPC are well worth studying and
understanding.
2 Check out Bruce Tate, Mike Clark, Bob Lee, and Patrick Linskey's book, Bitter EJB (Manning).
3 Successful platforms with tons of great design ideas and architectural patterns have so much to teach us; for example, Tandem Computers' NonStop platform, the Erlang
platform, and the BitTorrent protocol.
Patterns (Manning) and Vaughn Vernon's thorough and practical
Implementing Domain-Driven Design (Addison-Wesley).
Start Hacking
The good news is that you do not need to build all of the necessary
infrastructure and implement all the patterns from scratch yourself.
The important thing is understanding the design principles and philosophy. When it comes to implementations and tools, there are many off-the-shelf products that can help you with the implementation of most of the things we have discussed.
One of them is the Lagom4 microservices framework, an open source, Apache 2-licensed framework with Java and Scala APIs. Lagom pulls together most of the practices and design patterns discussed in this report into a single, unified framework. It is a formalization of all the knowledge and design principles learned over the
past eight years of building microservices and general distributed
systems in Akka and Play Framework.
Lagom is a thin layer on top of Akka and Play, which ensures that it
works for massively scalable and always-available distributed systems, hardened by thousands of companies for close to a decade. It is also highly opinionated, making it easy to do the right thing in
terms of design and implementation strategies, giving the developer
more time to focus on building business value.
Here are just some of the things that Lagom provides out of the box:

• Asynchronous by default:
  - Async IO
  - Async Streaming, over WebSockets and Reactive Streams
  - Async Pub/Sub messaging, over Kafka
  - Intuitive DSL for REST over HTTP, when you need it
• Event-based persistence:
  - CQRS and Event Sourcing, over Akka Persistence and Cassandra
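As a taste of the programming model, here is a minimal sketch of a Lagom Scala service definition, along the lines of the framework's documented hello-service example; the service name, path, and greeting logic are illustrative:

    import akka.NotUsed
    import com.lightbend.lagom.scaladsl.api.{Descriptor, Service, ServiceCall}
    import com.lightbend.lagom.scaladsl.api.Service._
    import scala.concurrent.Future

    // The service interface: an asynchronous, typed service call exposed
    // as REST over HTTP through the service descriptor.
    trait GreetingService extends Service {
      def greet(name: String): ServiceCall[NotUsed, String]

      override final def descriptor: Descriptor =
        named("greeting").withCalls(
          pathCall("/api/greet/:name", greet _)
        )
    }

    // A simple implementation; in a real Lagom service this would be wired
    // up in the application loader together with its persistent entities.
    class GreetingServiceImpl extends GreetingService {
      override def greet(name: String): ServiceCall[NotUsed, String] =
        ServiceCall { _ => Future.successful(s"Hello, $name!") }
    }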
4 Lagom means "just right," or "just the right size," in Swedish, and is a humorous answer to the common but nonsensical question, "What is the right size for a microservice?"
About the Author
Jonas Bonér is Founder and CTO of Lightbend, inventor of the
Akka project, coauthor of the Reactive Manifesto, a Java Champion,
and author of Reactive Microservices Architecture (O'Reilly). Learn
more about Jonas at his website.