
Architecting For The Cloud


u Final Projects Due Friday, 12/2 by midnight


u Includes whitepaper, AMI (optional) and any startup instructions
u If you are submitting an AMI, please include your team # (e.g. TermProject-Team9)
u Term Project Peer Evaluations - Due Friday, 12/2 by midnight
u These are confidential and will not be shared with anyone
u These should take less than 10 minutes to fill out and will be used to make any
adjustments in grading
u Not responding will impact your participation grade

u Student Ratings of Teaching Effectiveness (SRATE) due Tuesday, 12/6


u +1 points on final exam if 90% response rate
u +2 points on final exam if 100% response rate

u Final Exam – Thursday, 12/8 from 7-9:30pm in GOL 1550


Architecting for the Cloud:
Avoiding Common Pitfalls

SWEN 514/614: Engineering Cloud Software Systems

Department of Software Engineering


Rochester Institute of Technology
You are Presented with this Architecture to Build on the Cloud
u This is a brand-new web application your
company is building for the first time
u Your company has no previous
experience building in the cloud but
understands the basics of AWS
u As a “cloud architect”, what are some of
the mistakes you would want to avoid
when you build this in the cloud?
u Furthermore, what are some of the things
you would recommend to look out for
after day-1?
Top 10 Mistakes to Avoid

1. Not knowing your AWS infrastructure limits (way) ahead of time


2. Choosing the wrong AWS region
3. Not accurately predicting and controlling AWS cost
4. Not designing for failure
5. Not using Auto-Scaling
6. Avoiding Automation
7. Not using Infrastructure as Code (IaC)
8. Not properly managing Identity, Access and Security
9. Not doing Continuous Integration / Continuous Delivery/Deployment
10. Avoiding monitoring and metrics
Mistake #1: Not Knowing your AWS Infrastructure Limits (way) Ahead of Time

u The AWS infrastructure you provisioned on day-1 may eventually reach a limit
and could crash
u Not knowing what that limit is and when it will be reached is a sure way to hit
a wall
u How can you avoid this?
u Execute load tests
u You should know exactly what type of traffic your applications
can handle with your current AWS infrastructure
u You should know what the customer experience will be, and you
should know at which point things will start to break
u Capacity Planning
u Based on load test results and your growth goals, you should
know well in advance when it’s time to beef up your AWS
infrastructure
u Don’t wait until your customers are having a bad day using your
product. It will be too late!
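The capacity-planning advice above can be reduced to simple arithmetic. The sketch below is illustrative only: the traffic numbers, growth rate, and per-instance limit are hypothetical placeholders, and the 70% headroom factor is an assumed safety margin, not an AWS recommendation. The per-instance limit is the kind of number a load test should give you.

```python
# Capacity-planning sketch: all numbers below are illustrative, not benchmarks.
# Given a load-tested per-instance request limit and a monthly growth rate,
# estimate how many months remain before current capacity is exhausted.
import math

def months_until_capacity(current_rps: float,
                          monthly_growth: float,
                          instances: int,
                          max_rps_per_instance: float,
                          headroom: float = 0.7) -> int:
    """Months until traffic exceeds `headroom` of total tested capacity."""
    capacity = instances * max_rps_per_instance * headroom
    if current_rps >= capacity:
        return 0   # already past the safety margin: beef up now
    # current_rps * (1+g)^m >= capacity  =>  m = log(capacity/current) / log(1+g)
    return math.floor(math.log(capacity / current_rps) / math.log(1 + monthly_growth))

# Example: 300 RPS today, 10% monthly growth, 4 instances tested at 150 RPS each.
print(months_until_capacity(300, 0.10, 4, 150))   # → 3
```

The point is knowing the date in advance: if the answer is "3 months," the scaling work goes on the roadmap now, not after customers have a bad day.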
Mistake #2: Choosing the wrong AWS region
u A very common mistake is to choose an AWS region simply based on the
location of your customers
u Not all AWS regions are created equal, there are substantial differences in
price and feature availability
u If you think you’ll use a certain AWS service in the near future, make sure it’s
available in your chosen region
u Some configurations can cost almost double,
depending on the region you choose
u Data transfer in São Paulo is 177% more
expensive compared to N. Virginia
u Even within the US, some EC2 instances (like a
t2.large) in N. California will cost you 30%
more compared to N. Virginia

Source: https://www.concurrencylabs.com/blog/choose-your-aws-region-wisely/
Mistake #3: Not Accurately Predicting and Controlling AWS Cost
u An application that costs more than it’s worth will eat your profit
u For example, if you are paying $2,000/month for cloud resources for an application
that brings in $1,000/month, something is wrong
u Choosing the right instances for your AWS system is a fundamental decision one
needs to confront
u How many instances to choose?
u What’s the right size for an instance?
u How to keep track of all the instances?
u All these are basic yet vital decisions to be made
u AWS pricing is complicated, but not impossible to calculate.
You need to know the type of resources your application
consumes, their quantity and their corresponding AWS price
dimension.
u Example: data transfer (out to the internet, inter-AZ, inter-region), ELB data processed, instance hours,
EBS/S3 storage, billable API calls, Lambda executions and memory, etc. All this must be accounted for!
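Accounting for all those dimensions amounts to a sum of (usage × unit price) terms. A back-of-the-envelope sketch: every unit price below is a hypothetical placeholder, not a current AWS price; look up real prices for your chosen region before trusting any estimate.

```python
# Back-of-the-envelope monthly cost model. All unit prices below are
# hypothetical placeholders -- look up current prices for your region.
UNIT_PRICES = {
    "instance_hours": 0.0928,   # illustrative $/hour for a mid-size instance
    "ebs_gb_month":   0.10,     # illustrative $ per GB-month of EBS storage
    "data_out_gb":    0.09,     # illustrative $ per GB out to the internet
    "lambda_1m_reqs": 0.20,     # illustrative $ per million Lambda requests
}

def monthly_cost(usage: dict) -> float:
    """Sum each usage dimension times its unit price."""
    return round(sum(UNIT_PRICES[k] * v for k, v in usage.items()), 2)

estimate = monthly_cost({
    "instance_hours": 2 * 730,   # two instances running all month
    "ebs_gb_month": 200,
    "data_out_gb": 500,
    "lambda_1m_reqs": 3,
})
print(estimate)   # compare against the revenue the application brings in
```

The discipline matters more than the tool: if you cannot fill in the `usage` dictionary for your own application, you cannot predict your bill.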
Mistake #4: Not Designing for Failure

u Imagine that during the night, the
hardware running the virtual machine for
your online web application fails
u Until the next morning when you go into
work, your customers can no longer
access your web application
u During the 8-hour downtime, your
customers (who are now angry) search
for an alternative and stop buying from
you
u That’s a disaster for any business
Mistake #4: Not Designing for Failure
u Virtual machines are not highly available by default
u The following scenarios could cause an outage of your virtual machine:
u A software issue occurs on the host machine, causing the VM to crash
(either the OS of the host machine crashes, or the virtualization layer does)
u The computing, storage, or networking hardware of
the physical host fails
u Parts of the data center that the virtual machine
depends on fail: network connectivity, the power
supply, or the cooling system
u Remember: If the computing hardware of a physical
host fails, all EC2 instances running on this host will fail
Mistake #4: Not Designing for Failure
u You don’t want to over-engineer things at the beginning. But you also need to be
prepared for failure. Here are two critical areas to consider:
u Minimize likelihood of failure
u Achieved by proper testing of your applications (e.g. failovers) as
well as deployment steps, which can help to reduce human error
u Using managed services such as RDS, Auto-Scaling or Elastic Load
Balancer (ELB) can help to minimize the impact of these issues
u Reduce time to recovery
u You need to think about which tools and processes to put in
place to make your applications recover as quickly as possible
u These include: escalation, up-to-date documentation, automated
recovery, updated runbooks, good logging, appropriate monitoring
and alarming, ticketing systems
u “The best way to avoid failure is to fail constantly”
u If you aren’t constantly testing your ability to succeed despite failure,
then it isn’t likely to work when it matters most — in the event of an
unexpected outage (i.e. Chaos Engineering)
Mistake #5: Not using Auto-Scaling

u Every EC2 instance should be launched inside an Auto-Scaling Group, even if
it's a single EC2 instance
u The Auto-Scaling Group takes care of monitoring the EC2 instance, it acts as a
logical group of virtual machines, and it's free*
u Auto-Scaling is achieved by setting alarms on
metrics like CPU usage (of the logical group) or
number of requests the load balancer received
u If the alarm threshold is reached you can define an
action like increase the number of machines in the
Auto-Scaling Group

u The most common purpose for an Auto-Scaling groups is resiliency


u Instances are put into a fixed-size Auto Scaling group so that if an instance fails, it is
automatically replaced
u Remember: The simplest use case is an Auto-Scaling group has a min size of 1 and a max of 1
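The scaling decision that an alarm-driven policy automates can be sketched as a pure function: compare a group-level metric against thresholds and clamp the result to the group's size limits. The thresholds and group sizes below are illustrative, not AWS defaults.

```python
# Sketch of the decision an Auto-Scaling policy automates: compare a
# group-level metric (e.g. average CPU %) to alarm thresholds and clamp
# the result to [min_size, max_size]. All numbers are illustrative.
def desired_capacity(current: int, cpu_percent: float,
                     scale_out_at: float = 70.0, scale_in_at: float = 30.0,
                     min_size: int = 1, max_size: int = 4) -> int:
    if cpu_percent >= scale_out_at:
        target = current + 1      # alarm action: add an instance
    elif cpu_percent <= scale_in_at:
        target = current - 1      # alarm action: remove an instance
    else:
        target = current          # within the comfortable band: no change
    return max(min_size, min(max_size, target))

print(desired_capacity(2, 85.0))   # high CPU: scale out
print(desired_capacity(1, 10.0))   # already at min size: stays at 1
```

Note that the min=1/max=1 resiliency case falls out of the same logic: the clamp always returns 1, so the group's only job is replacing a failed instance.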
Mistake #6: Avoiding Automation
u Not having automation in place is the best way to guarantee your team
won’t scale
u You don’t need to have automation mechanisms for every single task, but
you need to have some automation and gradually add tasks
u Don’t wait until your team is overwhelmed with manual, tedious, time sucking
activities
u Some suggestions:
u Ask your team about the most painful manual tasks they
execute. Measure by time spent and level of frustration.
u Based on that, sort those tasks from most urgent to least
urgent
u Identify an automation task for each manual task and
estimate how much time it would take to implement
u Prioritize automation for those tasks your team hates and
where you’ll get the biggest bang for your buck
Mistake #7: Not using Infrastructure as Code (IaC)
u IaC has a bit of a learning curve, and some templates can take time to build,
but it’s worth it
u Not using an IaC tool like CloudFormation means you'll have to manually
create environments, and it also means you won't have a good way to keep
track of configuration changes
u Imagine you can create a big
production-like environment in less
than 10 minutes, run some load tests,
shut it all down and launch it again the
next day for more testing
u Or rollback some configuration change
that didn’t work well in production, in
minutes
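That "launch it again tomorrow" workflow depends on the environment being a template under version control. A minimal sketch: the resource name and AMI ID below are placeholders, and the commented `create_stack` call would require boto3 and AWS credentials.

```python
# Minimal CloudFormation template expressed as JSON, so the environment
# definition lives in version control alongside the code. The resource
# name and AMI ID are placeholders, not real values.
import json

TEMPLATE = json.dumps({
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "WebServer": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "InstanceType": "t2.micro",
                "ImageId": "ami-00000000000000000",   # placeholder AMI ID
            },
        }
    },
})

parsed = json.loads(TEMPLATE)
print(sorted(parsed["Resources"]))   # → ['WebServer']

# To launch it (requires boto3 and credentials), something like:
#   boto3.client("cloudformation").create_stack(
#       StackName="demo-stack", TemplateBody=TEMPLATE)
```

Because the template is just text, "rollback a configuration change" becomes "redeploy the previous commit," which is exactly the capability the slide describes.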
Mistake #8: Don’t Properly Manage Identity, Access and Security 14

u Your team will grow, you’ll have more integrations with other systems,
you’ll use more API keys, you’ll create more IAM Policies, Roles and
Users, you’ll set up more EC2 Security Groups, Network ACLs, you’ll
need to manage more encryption keys and certificates
u If not managed properly, all these things add up
and turn very quickly into a dangerous mess
u Given that security is the factor with potentially
the most devastating consequences, you should
not lose sight of security best practices from
day-1 and as you grow
Mistake #8: Don’t Properly Manage Identity, Access and Security 15
u Question: What is one of the most commonly seen and dangerous mistakes
companies make when they use AWS services?
u Answer: Allowing sensitive data stored in S3 buckets to be publicly accessible
u These buckets are secure by default but customers often
loosen these access measures to enable development
u In many cases, they are not re-assigned once
development is complete and the service goes live
u As a result, multiple, very public breaches of S3 security
continually make the news
u AWS administrators need to grant access to S3
(or any other AWS service) following the principle of "least
privilege", making certain that only those properly trained
and authorized to make changes within S3 have the
access necessary to do so
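Two concrete guards against the public-bucket mistake are sketched below: keeping S3 Block Public Access enabled, and writing policies that grant only the actions a role actually needs. The bucket name is a placeholder, and the commented boto3 call would require credentials.

```python
# Two guards against the public-S3-bucket mistake described above.
# The bucket name is a placeholder; the commented call needs boto3 + creds.
import json

BUCKET = "example-app-data"   # placeholder bucket name

# 1) Keep S3 Block Public Access enabled on the bucket:
public_access_block = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}
# boto3.client("s3").put_public_access_block(
#     Bucket=BUCKET, PublicAccessBlockConfiguration=public_access_block)

# 2) Least privilege: grant only the single action this role needs.
policy = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject"],              # read-only, one action
        "Resource": f"arn:aws:s3:::{BUCKET}/*",
    }],
})
print(json.loads(policy)["Statement"][0]["Action"])   # → ['s3:GetObject']
```

The development-time pattern the slide warns about is the inverse of this: a broad `s3:*` grant that was never tightened back down before go-live.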
Source: https://www.threatstack.com/blog/21-infosec-and-aws-experts-reveal-the-1-mistake-companies-make-when-it-comes-to-aws-security-and-how-to-avoid-it
Mistake #9: Don’t do Continuous Integration Continuous Delivery/Deployment 16
u This is another way to guarantee your team won’t scale
u You don’t need to have the most sophisticated workflows in place from
day-1, but you need to at least have a pipeline, which you’ll gradually
upgrade
u Once you start getting more feature requests, bugs, and your codebase
grows, it will be more difficult to implement a good foundation
u With AWS products such as CodeCommit,
CodeBuild, CodeDeploy and CodePipeline,
there’s really no excuse for postponing at
least some basic code automation
Mistake #10: Avoiding Monitoring and Metrics
u “You can’t manage what you can’t measure” is extremely important here
u Not having visibility on important metrics is a recipe for disaster
u Consider measuring the following categories:
u Business Metrics - They tell you if you’re still making money (or not). They measure
in real time your performance against business goals and objectives.
u Customer Experience Metrics - They tell you what
your customers see, if they’re having a good day
using your application (or not). They reflect
“symptoms” when things are not going well.
u System Metrics - They tell you the root causes
behind Customer Experience metrics. They also tell
you if your systems are healthy, at risk, or already
creating a bad experience for your customers.

Source: https://iaasacademy.com/aws-certified-sysops-administrator-associate-exam/monitoring-metrics-exam-tips-amazon-s3/
Mistake #10: Avoiding Monitoring and Metrics

u Customer Experience Metrics
u Examples: response times, page load times, error 500s, etc.
u Questions they answer: How fast is the application responding? Is there a
server down?
u System Metrics
u Examples: CPU utilization, memory utilization, disk I/O, queue length, etc.
u Questions they answer: How high is the CPU %? What's the memory
utilization? Am I low on disk space? How many connections do I have to the
DB? Am I able to connect to the DB?
u Business Metrics
u Examples: revenue from the application (e.g. orders submitted) vs. how
much it all costs to run in the cloud
u Questions they answer: How much is this whole thing costing me to run? Am
I making or losing money?
System Monitoring & Alerts

u Previously we would only alert on a few system statistics:
u 90% CPU utilization for over 10 minutes
u Free disk less than 10%
u Many times, when an alert was triggered, either
the development team wasn't called, or when
they were, it was already too late
u We decided to take proactive measures
Kibana
u Free and open-source front-end application that sits on top of the Elastic
Stack, providing search and data visualization capabilities for data
indexed in Elasticsearch
u All log files are automatically pulled into a Kibana dashboard, which
makes troubleshooting much quicker
u Alerts can be created based on
predefined conditions
u e.g. certain error messages
appearing in log files
DynaTrace – Game Changer
u Application monitoring that provides all performance metrics in real time and detects
and diagnoses problems automatically
u 3 main components:
u OneAgent – a single agent that automatically discovers, instruments, and collects high-
fidelity monitoring data from everything in your IT environment
u Smartscape – an interactive environment topology map that visualizes the dynamic
relationships among all your application components across every tier
u Davis AI engine – analyzes everything and tells you when there is a problem, the business
impact of the problem, and the root cause of the problem so that you can fix it quickly
DynaTrace – Synthetic Monitoring
u Runs every 5 minutes
u Person A sends $1 to
Person B
u 5 minutes later Person B
sends $1 to Person A
u If this process ever fails,
automatically contact
support
u This process has allowed us to react to issues much more quickly,
before they have significant customer impact
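The synthetic-transaction idea generalizes: script a real round-trip through the system on a schedule and page support the moment it fails. In the sketch below, `send_payment` and `notify_support` are stand-ins for the real payment calls and alerting integration; in production the check would be scheduled (e.g. every 5 minutes) rather than called inline.

```python
# Sketch of a synthetic transaction check. `send_payment` and
# `notify_support` are hypothetical stand-ins for real integrations.
def synthetic_check(send_payment, notify_support) -> bool:
    """Round-trip: A pays B $1, then B pays A $1 back. Alert on any failure."""
    try:
        assert send_payment("A", "B", 1.00)   # step 1 of the round-trip
        assert send_payment("B", "A", 1.00)   # step 2 returns the dollar
        return True
    except Exception:
        notify_support("synthetic payment round-trip failed")
        return False

# Exercising the check with fake integrations: one healthy, one failing.
alerts = []
ok = synthetic_check(lambda src, dst, amt: True, alerts.append)
bad = synthetic_check(lambda src, dst, amt: False, alerts.append)
print(ok, bad, alerts)
```

The value over plain metrics is that a synthetic check exercises the whole customer-visible path end to end, so it catches failures no single system metric would.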
Monitoring and Metrics in the Cloud – CloudWatch
u CloudWatch is a service for monitoring AWS resources and the
applications you run on AWS
u It allows developers, system architects, and administrators to monitor
their AWS applications in the cloud, in near-real-time
u It is automatically configured to provide metrics on request counts,
latency, and CPU usage
u You can use CloudWatch to
collect and track metrics, collect
and monitor log files, set alarms
and automatically react to
changes in AWS resources
CloudWatch – How it works

u CloudWatch is basically a metrics
repository
u An AWS service (e.g. EC2) puts
metrics into the repository, and you
retrieve statistics based on those
metrics
u You can use metrics to calculate
statistics and then present the data
graphically in the CloudWatch
console (e.g. Dashboard)

u You can configure alarm actions to stop, start, or terminate an Amazon EC2
instance when certain criteria are met
u In addition, you can create alarms that initiate Amazon EC2 Auto-Scaling
and Amazon Simple Notification Service (Amazon SNS) actions on your behalf
CloudWatch – Dashboard Example
CloudWatch – Alarm Actions Example

u Assume we have set up an alarm for an EC2 instance…
u "Tell me the CPU utilization, network traffic and disk I/O"
u "If the EC2 instance is non-responsive, terminate it"
u "If the CPU utilization reaches 70%, automatically add a new EC2 instance"
u Actions can be combined to provide automated resiliency and scalability
capabilities
u Example: "If the EC2 instance reports a status of StatusCheckFailed_System,
reboot the EC2 instance and send a notification to the support team"
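The 70% CPU alarm above can be expressed as parameters for CloudWatch's `put_metric_alarm` API. The alarm name, evaluation settings, and scaling-policy ARN below are placeholders; the commented call itself would require boto3 and credentials.

```python
# Parameters for the 70% CPU alarm described above, shaped for CloudWatch's
# put_metric_alarm API. Name, periods, and the ARN are placeholders.
alarm_params = {
    "AlarmName": "high-cpu-scale-out",
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Statistic": "Average",
    "Period": 300,                  # evaluate 5-minute averages...
    "EvaluationPeriods": 2,         # ...for two consecutive periods
    "Threshold": 70.0,
    "ComparisonOperator": "GreaterThanThreshold",
    # Placeholder ARN of an Auto-Scaling scale-out policy to trigger:
    "AlarmActions": ["arn:aws:autoscaling:...:scalingPolicy:placeholder"],
}
# boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
print(alarm_params["Threshold"], alarm_params["ComparisonOperator"])
```

Requiring two consecutive evaluation periods before firing is a common way to avoid scaling on a momentary CPU spike; tune both numbers against your own load-test data.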
Advantages of CloudWatch
u One dashboard, access all data
u Applications produce a lot of data, as they are highly distributed
u All the collected data can be displayed on a single CloudWatch
dashboard

u Improve total cost of ownership
u CloudWatch can be used to set alarms and take automated actions when a
defined limit is breached, which can help minimize the costs spent on AWS services
u Insights from logs
u You receive detailed insights on separate AWS services and the
applications you run on the infrastructure.
u Data like memory, CPU utilization, and capacity utilization can be
monitored to gain insights from it
u Optimize Applications and resources
u Using the log and metric data, you can optimize your AWS services to
provide maximum throughput and performance
CloudWatch Limitations

u Limited data retention, which is restricted to 2 weeks of metrics data
u CloudWatch can't monitor non-AWS cloud services or user-managed
infrastructure
u Graphs and text-based widgets are limited in functionality
u Basic monitoring collects data at a frequency of 5 minutes, which
might not be frequent enough for managing scale in a critical
environment
u Detailed monitoring comes at an additional
cost, which can add up quickly
u For larger organizations, it might make more
sense to use it in conjunction with industry-
standard monitoring tools
Cloud Monitoring Solutions

u To recap, CloudWatch focuses on information for health and
performance monitoring
u There are other 3rd-party solutions that can combine CloudWatch data
to provide more holistic reporting
Datadog + CloudWatch
CloudWatch – Last Activity (Due next class)
u You will create a simple alarm and dashboard to monitor an
application we've run before (WordPress)
u The dashboard will monitor a
basic metric like CPU usage
u An alarm will be created for
when the server is under high
CPU utilization
u Located at: Assignments >
Activity #23 – Create a
CloudWatch Dashboard and
Alarm
Last Mistake to Avoid: Do You Really Need to Build it?
u It can be hard for development teams to accept that buying
might be a better option than building
u For commodity-based solutions, look at SaaS and objectively compare the
costs
u A "buy" solution might cost more upfront (e.g. licensing), but long
term the savings may far outweigh the support headaches of a "build"
u Avoid “Not invented here syndrome” (NIHS)
u This is the name for the tendency of both individual developers
and entire organizations to reject suitable external solutions to
software development problems in favor of internally
developed solutions
u NIHS can be defined as a situation where an external solution is
rejected only because it was not internally developed - in
other words, there are no other factors that dictate an
internally developed solution would be superior
u For your final discussion topic, you’ve been
presented with the following architecture that
needs to run on AWS
u Some requirements:
u Needs to scale with web traffic (e.g. Black Friday)
u Downtime == lost revenue
u Jobs management must be low maintenance
u Business barely has a development staff
u All application layers need to be monitored
u Needs to be performant for global/regional users
u Costs need to be minimized
u Each team has 10 minutes to come up with their
recommendations for a better solution
u Template and diagram can be found here:
u Assignments > Activity #22 - AWS Architecture
Recommendations
Example Solution
u Use of AWS Cloudfront, S3, ElastiCache,
and Lambda keep operating costs low
(Serverless, on-demand)
u Deployed worldwide for low user latency
u Monitored by APM -
AppDynamics/NewRelic
u Amazon RDS is a managed DB with minimal
overhead; it is the only server 'running' at
any given moment
u AWS Lambda gives us on-demand code
execution with no server maintenance
u Monitored by - Sumologic, Splunk,
DataDog

u Final Projects Due Friday, 12/2 by midnight


u Includes whitepaper, AMI (optional) and any startup instructions
u Remember: If we cannot run it, we cannot grade it
u Term Project Peer Evaluations - Due Friday, 12/2 by midnight
u These are confidential and will not be shared with anyone
u These should take less than 10 minutes to fill out and will be used to make any
adjustments in grading
u Not responding will impact your participation grade

u Student Ratings of Teaching Effectiveness (SRATE) due Tuesday, 12/6


u +1 points on final exam if 90% response rate
u +2 points on final exam if 100% response rate

u Final Exam – Thursday, 12/8 from 7-9:30pm in GOL 1550
