Patterns and Practices: Microservices
Patterns and Practices eMag Issue 59 - Mar 2018
Thomas Betts
While the underlying technology and patterns are certainly interesting, microservices have always been about helping development teams be more productive. Whether used as a technique for architects to manage complexity or to make small teams more independent and responsible for supporting the software they create, the human aspect of microservices cannot be ignored.

Many of the experts who spoke about microservices patterns and practices at QCon San Francisco 2017 did not simply talk about the technical details of microservices. They included a focus on the business side and more human-oriented aspects of developing distributed software systems.

At Netflix, the cloud database engineering team is responsible for providing several flavors of data persistence as a service to microservice development teams. Roopa Tangirala explained how her team has created self-service tools that help developers easily implement the appropriate data store for each project's needs.

Drawing on his experience with developing a microservices application at Datawire in 2013, Rafael Schloming argued that one of the most important — although often ignored — questions a development lead should ask is "How do I break up my monolithic process?" as the development process is critical to establishing and maintaining velocity.

With microservices distributed across containers, how is a developer able to step into the code and debug what is happening? Idit Levine discussed the problem and introduced Squash, an open-source platform for debugging microservices applications.

Randy Shoup provided practical examples of how to manage data in microservices, with an emphasis on migrating from a monolithic database. He also strongly advocated for building a monolith first, and only migrating to microservices after you actually require the scaling and other benefits they provide.

The microservices track also included a panel discussion where several experts shared their experiences and advice for being successful with microservices. Questions from the audience highlighted common themes, such as dealing with deployments, communication between microservices, and looking at what future trends might follow microservices.
CONTRIBUTORS
Thomas Betts
is a principal software engineer at IHS Markit, with two
decades of professional software development experience.
His focus has always been on providing software solutions
that delight his customers. He has worked in a variety of
industries, including retail, finance, health care, defense and
travel. Thomas lives in Denver with his wife and son, and they
love hiking and otherwise exploring beautiful Colorado.
Daniel Bryant
is leading change within organisations and technology.
His current work includes enabling agility within
organisations by introducing better requirement gathering
and planning techniques, focusing on the relevance of
architecture within agile development, and facilitating
continuous integration/delivery.
Watch presentation online on InfoQ
…across regions or data centers.

Netflix maintains three sets of Redis structures for each queue. One is a sorted set that contains queue elements by score. The second is a hash set that contains the payload, and the key is the message ID. The third is a sorted set that contains messages consumed by the client, but which have yet to be acknowledged. So the third is the unacknowledged set.
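As a rough illustration of that layout, here is a minimal sketch in Python using the redis-py client. The key names, the score-as-timestamp choice, and the ack flow are assumptions for illustration, not the actual Netflix implementation.

```python
import time
import redis

r = redis.Redis()  # assumes a local Redis instance; illustrative only

# Hypothetical key names for one queue.
PENDING = "queue:pending"    # sorted set: message IDs ordered by score
PAYLOADS = "queue:payloads"  # hash: message ID -> payload
UNACKED = "queue:unacked"    # sorted set: delivered but not yet acknowledged

def publish(msg_id, payload):
    """Store the payload and enqueue the message ID by score."""
    r.hset(PAYLOADS, msg_id, payload)
    r.zadd(PENDING, {msg_id: time.time()})

def consume():
    """Take the lowest-scored message and park it in the unacknowledged set."""
    ids = r.zrange(PENDING, 0, 0)
    if not ids:
        return None
    msg_id = ids[0]
    # A real implementation would do the next two steps atomically,
    # e.g. in a Lua script or a MULTI/EXEC transaction.
    r.zrem(PENDING, msg_id)
    r.zadd(UNACKED, {msg_id: time.time()})
    return msg_id, r.hget(PAYLOADS, msg_id)

def ack(msg_id):
    """Acknowledge delivery: drop the message from the unacked set and its payload."""
    r.zrem(UNACKED, msg_id)
    r.hdel(PAYLOADS, msg_id)
```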
Identifying the challenges
I love this quote, but I don't think my on-call team feels like this: "I expected times like this — but I never felt that they'd be so bad, so long, and so frequent."

The first challenge my team faces is the wide variety and the scale. We have so many different flavors of data store, and we have to manage and monitor all these different technologies. We need to build a team that is capable of doing all this while making sure the team has the skills to cater to all of these different technologies. Handling that variety, especially with a small team, becomes a challenge to manage.

The next challenge is predicting the future. With a combination of all of these technologies, we have thousands of clusters, tens of thousands of nodes, and petabytes of data. We need to predict when our cluster risks running out of capacity.

My central-platform team also handles maintenance and upgrades across all clusters, software or hardware, and we need to know whether we can perform maintenance without impacting production services. Can we build our own solution or should we buy something that's out there?

Another challenge is monitoring. We have tens of thousands of instances, and all of these instances are sending metrics. When there's a problem, we should know which metrics make the most sense and which we should be looking at. We must maintain a high signal-to-noise ratio.

Overcoming challenges
The very first step in meeting these challenges is to have experts. We have two or three core people in our Cassandra cloud database engineering team that we call subject-matter experts. These people provide best practices and work closely with the microservice teams to understand their requirements and suggest a back-end data store. They are the ones who drive the features and best practices, as well as the product future and vision.

Everybody in the team goes on call for all of these technologies, so it's useful to have a core set of people that understand what's happening and how we can really fix the back end. Instead of building automation that applies patches on top of what is broken, we can contribute to the open source or to the back-end data tier — and produce a feature.
Next, we build intelligent systems to work for us. These systems take on all automation and remediation. They accept the alerts, look at the config, and use the latency thresholds we have for each application to make decisions, saving people from getting paged for each and every alert.

CDE Service
CDE Service helps the CDE team provide data stores as a service. Its first component captures the thresholds and SLAs. We have thousands of microservices; how do we know which service requires what 99th-percentile latency? We need a way to look at the clusters and see both the requirements and what we have promised, so that we can tell if a cluster is sized effectively or needs to scale up.

Cluster metadata helps provide a global view of all the clusters: the software and kernel version each runs, its size, and the cost of managing it. The metadata helps the application team understand the cost associated with a particular back end and the data they are trying to store, and whether or not their approach makes sense.
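A per-cluster metadata record of the kind described might look roughly like the following. The fields mirror the ones mentioned in the talk (versions, size, cost, ownership, SLA, maintenance window), but the structure and names are illustrative assumptions rather than CDE Service's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ClusterMetadata:
    """Illustrative per-cluster record a service like CDE might keep."""
    name: str
    datastore: str                     # e.g. "cassandra", "elasticsearch", "dynomite"
    environment: str                   # e.g. "test" or "prod"
    regions: list = field(default_factory=list)
    software_version: str = ""
    kernel_version: str = ""
    node_count: int = 0
    monthly_cost_usd: float = 0.0
    owner_email: str = ""              # who to contact / which team to page
    pagerduty_key: str = ""
    p99_latency_sla_ms: float = 0.0    # the promise made to the application team
    maintenance_window: str = ""       # e.g. "02:00-04:00 UTC"

# Example entry for a hypothetical cluster.
membership_db = ClusterMetadata(
    name="membership-cass",
    datastore="cassandra",
    environment="prod",
    regions=["us-east-1", "eu-west-1"],
    software_version="3.0.14",
    node_count=48,
    monthly_cost_usd=21000.0,
    owner_email="membership-team@example.com",
    p99_latency_sla_ms=10.0,
    maintenance_window="02:00-04:00 UTC",
)
```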
The self-service capability of CDE Service allows application users to create clusters on their own, without the CDE team getting in the way. The users don't need to understand all the nitty-gritty details of the back-end YAML; they only need to provide minimal information. We create the cluster and make sure that it is using the right settings, it has the right version, and it has the best practices built in.

Before CDE Service, contact information only sat outside the system. For each application, we'd need to know who to contact and which team to page. It becomes tricky when you're managing so many clusters, and having some central place to capture this metadata is crucial.

Lastly, we track maintenance windows. Some clusters can have maintenance windows at night, while others receive high traffic at the same time. We decide on an appropriate maintenance window for a cluster's use case and traffic pattern.

Architecture
Figure 1 shows the architecture, with the datastore in the center. For the scheduler on the left, we use Jenkins, which is based on cron and which allows us to click a button to do upgrades or node replacements. Under that is CDE Service, which captures the cluster metadata and is the source of all information like SLAs, PagerDuty information, and much more. On the top is the monitoring system. At Netflix, we use Atlas, an open-source telemetry system, to capture all of the metrics. Whenever there's a problem and we cannot meet the 99th-percentile latency, the alert will go off. On the very right is the remediation system, an execution framework that runs on containers and that can execute automation.

Anytime an alert fires, the monitoring system will send the alert to the remediation system. That system will perform automated remediation on the data store and won't even let the alert go to the CDE team. Only in situations for which we have not yet built automation will alerts come directly to us. It is in our team's best interest to build as much automation as possible, to limit the number of on-call pages we need to respond to.
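The routing described, automated remediation first and a page only when no automation exists, might look roughly like this sketch. The alert fields, the runbook registry, and the paging call are hypothetical, not Netflix's actual remediation framework.

```python
def run_compaction(cluster): ...        # hypothetical remediation action
def replace_node(cluster, node): ...    # hypothetical remediation action
def page_oncall(alert): ...             # hypothetical PagerDuty call

# Hypothetical registry mapping alert types to automated remediations.
RUNBOOKS = {
    "disk_full": lambda a: run_compaction(a["cluster"]),
    "node_down": lambda a: replace_node(a["cluster"], a["node"]),
}

def handle_alert(alert, p99_sla_ms):
    """Automated remediation first; a page only when no runbook exists."""
    if alert["p99_latency_ms"] <= p99_sla_ms:
        return                       # still within the promised latency
    runbook = RUNBOOKS.get(alert["type"])
    if runbook:
        runbook(alert)               # fix it without waking anyone up
    else:
        page_oncall(alert)           # no automation yet, so a human is paged
```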
SLA
Figure 2 shows the cluster view where I can look at all of my clusters. I can see what version they are running, which environment and region they are in, and the number of nodes. This view also shows the customer email, the Cassandra version, the software version, the hardware version, the average node count, and various costs. I can also look at my oldest node, so I can see if the cluster has a very old node we need to replace; then we will just run remediations. There's a job that scans for old nodes and runs terminations. In the interest of space, I have not shown many columns, but you can pick what information you want to see.

We have another UI for creating new clusters, specific to each data store. An application user needs to provide only a cluster name, email address, the amount of data they are planning to store, and the regions in which to create the cluster — then the automation kicks off the cluster creation in the background. This process makes it easy for a user to create clusters whenever they want, and since we own the infrastructure, we make sure that the cluster creation is using the right version of the data store with all of the best practices built in.
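The "minimal information" request might look something like this; the field names and the provisioning entry point are invented for illustration, with everything else filled in by the platform automation.

```python
def provision_cluster(request):
    """Hypothetical entry point that kicks off cluster creation in the background."""
    ...

# The application team supplies only the minimal fields; version, settings,
# and best practices are filled in by the automation.
provision_cluster({
    "cluster_name": "recommendations-cass",
    "owner_email": "recs-team@example.com",
    "estimated_data_gb": 500,
    "regions": ["us-east-1", "us-west-2"],
})
```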
When an upgrade is running, it can be tricky to figure out what percentage of the test clusters and prod clusters have been upgraded across a fleet that numbers in the thousands. We have a self-service UI to which application teams can log in to see how far along we are in the upgrade process.

Machine learning
Earlier, I mentioned having to predict the future. Our telemetry system stores two weeks of metrics, and previous historical data is pushed to S3. We analyze this data using Kibana dashboards to predict when the cluster will run out of capacity.

We have a system called predictive analysis, which runs models to predict when a cluster will run out of capacity. The system runs in the background and pages us or notifies us on a Slack channel when it expects a cluster to exceed capacity in 90 days. With Cassandra, we only want to use a…
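One simple way to produce that kind of 90-day warning is to fit a trend line to recent disk-usage samples and extrapolate to the cluster's capacity. This sketch uses a plain least-squares fit and invented numbers; it is not the actual model Netflix's predictive-analysis system runs.

```python
def days_until_full(samples, capacity_gb):
    """Fit a line to (day, used_gb) samples and extrapolate to capacity.

    Returns the estimated days from the last sample until the cluster is
    full, or None if usage is flat or shrinking.
    """
    n = len(samples)
    xs = [d for d, _ in samples]
    ys = [u for _, u in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / \
            sum((x - mean_x) ** 2 for x in xs)
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (capacity_gb - intercept) / slope
    return day_full - xs[-1]

# Two weeks of daily disk-usage samples (day index, GB used) - invented numbers.
usage = [(d, 600 + 4.5 * d) for d in range(14)]
remaining = days_until_full(usage, capacity_gb=1000)
if remaining is not None and remaining < 90:
    print(f"cluster predicted to exceed capacity in {remaining:.0f} days")
```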
Q: We know that we're a monolithic application, and we know that we want to get to business-context-type services. But where does that line get drawn? Is it a product level, an API level, a microservice level? Is it just what feels right?

Rafael Schloming: That is a hard question, but I think one of the ways to, sort of, think about it is actually something Randy said earlier, which is don't think about the size of a microservice in terms of its lines of code; think about the scope. And how do you define scope? Well, you need to understand what it is you are trying to achieve at a high level, in one or two sentences.

It is really a negotiation between the user of a service and the team that delivers that service. You need to track the usage; if your users are happy, then you're done. It really helps to think in terms of that framing, understanding who the user of the service is, and going from there. And, from that perspective, you can just try a lot of different kinds of APIs that will sort of serve the same mission and figure out what you need. And, again, you can track how successful you're making your users in order to measure your progress as you iterate through the difficult design space.

Shoup: So this is a little bit of a flip response, but I don't mean it in any aggressive way. Do you guys build one class, like one language class in Java or whatever? How do you know what the scope of the classes are? That's a design thing. The class is a single responsibility; we try to make the interface minimal and try not to be chatty. The reason we ask it that way is not to put you on the spot, and the people that are working for me are laughing right now: this is a thing that I have done many times with my team, where this is a legit thing to say.

That's the answer. You know more than you think about how to design services. If you know how to design classes, for the most part you know how to design services. The only part is recognizing that you cannot be as chatty with services as you can be with something that is in process.
Richardson: Decomposition applies at many levels. In a sense, you decompose methods, classes, packages, and modules, and so microservices are just yet another level in that kind of hierarchy. One comment I would add is that I think microservices kind of have this important relationship with team structure as well. I think there are two models for microservices. There is this super fine-grain model, which is one service per developer, that seems to be happening at some companies. Or when you have thousands of services — that order of magnitude. Another way of thinking about services is as a small enough "application" that its team can remain nimble and agile. That is a much coarser-grain model of microservices. And so that impacts decomposition.

Ryan: I think it is probably a common problem for a lot of people in the room, that they have a monolith yak that they want to shave, and that is totally fine. Start shaving where you think shaving adds value, and stop shaving where you are not getting any more value. It is okay to have a monolith if it is doing what it is supposed to do. I know that might be heresy here, but if it is doing what it is supposed to do, why touch it? If it is not, shave it, and iterate.

Shoup: A related, excellent question is, more or less, are microservices worth it? And the answer, for most of us, is maybe not. As I tried to say in my talk, it is the 0.1 percent, or 0.01 percent, that get really large, where you absolutely need them — there is no way Google, Amazon, Netflix, or Stitch Fix work without microservices. But if you don't have a huge load, it is fine to stick with a monolith. When should you go with microservices? Well, when are you unable to scale things independently, when does it slow down, when do things evolve at different rates? That's the wall that you have to scale with microservices.

Richardson: And I want to add to that. If your development velocity is not where it needs to be, I would actually start to review your development practices before switching to microservices. So, for instance, maybe you are not doing automated testing thoroughly; I think probably 70-plus percent of organizations, according to a SourceLab report, have not completely embraced automated testing. If you are one of those, work on that first. And then, you know, once you have the hang of that and you really are able to automate as much as possible, then think about the microservice architecture. It is kind of like trying to walk before you run.

Schloming: That's a great point, and a great thing to do is just — it doesn't need to be super heavyweight — to track where you spend your time. If you are doing lots of manual testing and that is slowing you down, you don't necessarily think about that on a day-to-day basis. And, you know, if you are spending a lot of time wrestling with particular areas of your monolith, maybe that's the time you should start shaving that particular patch of yak.

Ryan: So I think Randy gave a couple examples of why you might want to do that, scale be…
…But, in terms of writing code from scratch, I feel like it is an individual developer muddling along somehow, scratching their head. And, if that hasn't changed, we have not had a Moore's law for software development in that regard.

Ryan: So, if we are in the realm of predictions, I think some of the answers are sitting outside in the vendor booths. More and more of your code is running on the same network, and I don't mean only yours, I mean all of you at the same time. You are all putting your code into big cloud vendors; it is much more local with everyone else in this room than it was before. So we have this interesting networking effect. Microservices are not just a way for you to build services. It is also a way for you to consume services that other people have built for you.

When I look out there and I see vendors selling certain types of services, the thing that strikes me is that they're smaller versions of things that bigger vendors used to sell. I look at the APM space when I see that. And you will see the trend continue when there are more micro-vendors; there will be more marketplaces that help you acquire services that can do interesting things. Somebody asked about geolocation. You can buy that as a service. It is a tiny little service; it does very little in terms of an API and a huge amount in terms of the back end. So that is one thing that we might expect to see going forward.

Schloming: I think that those two answers spark a lot. I don't think a developer writes more lines of code, but they are way more productive because they figure out how to assemble a lot of things — and the other things, or what Louis just mentioned, are examples of that marketplace of other things to assemble.

Q: When I log into an application like Netflix, it is a pretty frictionless user experience. I log in once and I don't get a sense that I'm logging into the microservice for my user profile, customer history, etc. How do you maintain this frictionless UI in a microservices architecture? Most of us are writing applications that span multiple services, but it is really just one application users are trying to go to. How do I maintain the advantages of a share-nothing architecture, where I can deploy independently without dependencies between services, yet maintain a user experience that is frictionless, unified, and with a consistent look and feel?

Tangirala: So, there are different tiers in the microservice layer. There is a front-end tier, which takes all the user traffic, and then we have a middle tier and back-end tier, which are your membership and all the core services that give that data set to you. And so, in terms of the UI integration, there is a lot of interaction between these services, but at any given time the source of truth is just one service.

I don't have a lot of insights into the UI layer. But our UI team does a great job in making sure all the interactions between these microservices and the results that they are getting in the UI are seamless. There's a lot of work that goes on behind the scenes, but each microservice is not related to the other. That way, you know which service to call.

Though there's a lot of interaction, you have fallbacks as well. From the UI point of view, you don't see that you are having a degraded experience. If you are not able to get your personalized list of movies to watch from that service, if you cannot reach that service, then it may fall back to a fallback page. So you might not experience degraded service; you do not realize that you are not seeing your active list of movies, as the service is giving you the fallback experience.
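The fallback behaviour described, where a failure in the personalization service degrades to a generic list rather than an error, is commonly implemented by wrapping the call and supplying a default response. This sketch is a generic illustration with invented names, not Netflix's actual client code (Netflix historically popularized this pattern with its Hystrix library).

```python
import requests

PERSONALIZATION_URL = "https://personalization.internal/lists"  # invented endpoint

FALLBACK_LIST = ["Trending Now", "New Releases", "Top Picks"]  # non-personalized

def movie_list_for(member_id):
    """Return the member's personalized list, or a generic fallback on any failure."""
    try:
        resp = requests.get(f"{PERSONALIZATION_URL}/{member_id}", timeout=0.2)
        resp.raise_for_status()
        return resp.json()["titles"]
    except requests.RequestException:
        # The UI still renders a sensible page; the user may never notice.
        return FALLBACK_LIST
```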
Twitter
Twitter has gone through a similar evolution, and is on roughly its third generation. It started as a Rails application, nicknamed the Monorail. The second generation pulled the front end out into…
Every search engine that we use almost certainly is doing some form of joining one particular entity with another particular entity. Every analytical system on the planet is joining lots of different pieces of data, because that is what analytical systems are about.

I hope this technique now sounds a little bit more familiar.

Transactions
The wonderful thing about relational databases is this concept of a transaction. In a relational database, a single transaction embodies the ACID properties: it is atomic, consistent, isolated, and durable. We can do that in a monolithic database. That's one of the wonderful things about having THE database in our system. It is easy to have a transaction cross multiple entities. In our SQL statement, we begin the transaction, do our inserts and updates, then commit, and that either all happens or it doesn't happen at all.

Splitting data across services makes transactions hard. I will even replace "hard" with "impossible". How do I know it's impossible? There are techniques known in the database community for doing distributed transactions, like two-phase commit, but nobody does them in practice. As evidence of that fact, consider that no cloud service on the planet implements a distributed transaction. Why? Because it is a scalability killer.

So, we can't have a transaction — but here is what we can do. We turn a transaction where we want to update A, B, and C, all together as a unit or not at all, into a saga. To create a saga, we model the transaction as a state machine of individual atomic events. Figure 3 may help clarify this. We re-implement that idea of updating A, updating B, and updating C as a workflow. Updating the A side produces an event that is consumed by the B service. The B service does its thing and produces an event that is consumed by the C service. At the end of all of this, at the end of the state machine, we are in a terminal state where A and B and C are all updated.
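A minimal sketch of that A, B, C saga as an event-driven workflow follows. The in-memory event bus and service names are stand-ins for whatever messaging infrastructure would actually carry the events, and a real saga also needs compensating actions for failures, which this sketch omits.

```python
from collections import defaultdict

# A stand-in for a real message broker: handlers subscribe to event types.
handlers = defaultdict(list)

def subscribe(event_type, handler):
    handlers[event_type].append(handler)

def publish(event_type, data):
    for handler in handlers[event_type]:
        handler(data)

# Service A performs its local, atomic update and emits an event.
def update_a(data):
    print("A updated", data)
    publish("a_updated", data)

# Service B reacts to A's event, does its own atomic update, and emits the next one.
def on_a_updated(data):
    print("B updated", data)
    publish("b_updated", data)

# Service C completes the workflow; this is the terminal state of the saga.
def on_b_updated(data):
    print("C updated", data)
    print("saga complete: A, B, and C are all updated")

subscribe("a_updated", on_a_updated)
subscribe("b_updated", on_b_updated)

update_a({"order_id": 42})  # kicks off the saga
```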