Title: Design and Operation of OpenStack Cloud on 100 Physical Servers - OpenStack Summit 2014 Paris Presentation
Date: Tuesday, November 4 • 14:00 - 14:40
Speakers: Hiromichi Itou, Ken Igarashi, Akihiro Motoki
Agenda:
http://openstacksummitnovember2014paris.sched.org/event/142815bb3b425e4d6374e34bc81d871e
Video:
https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/design-and-operation-of-openstack-cloud-on-100-physical-servers-ntt-docomo
You will face many problems when you start designing your OpenStack cloud because of a lack of full design architecture information. For example, there are many Neutron plugins, but it is difficult to choose the best plugin and configuration to get high throughput from a Virtual Machine (VM) and achieve High Availability (HA) of the L3 Agent. Also, we couldn't find information about how much computing resource (CPU, memory and HDD) is required for management and operation servers (e.g. API, RabbitMQ, MySQL, monitoring, etc.).
We built an OpenStack Icehouse cloud on 100 physical servers (1,600 physical cores) without using commercial software, and ran several performance and long-run tests to address these problems.
In this talk, we will present a performance comparison of Neutron ML2 plugin implementations (Open vSwitch and Linux Bridge), tunneling protocols (GRE and VXLAN) and physical network configurations (network interface bonding and server-side Equal Cost Multi Path) to achieve 10Gbps at a VM, as well as the L3 Agent HA we implemented. We will also present how much computing resource we used and the load on each server while operating the cloud. Finally, we will share our Ansible-based OpenStack deployment and management tool.
Key topics include:
- Performance comparison of OSS Neutron ML2 plugins (Open vSwitch and Linux Bridge) and tunneling protocols (GRE and VXLAN)
- Performance comparison of redundant network configurations (network interface bonding and server-side Equal Cost Multi Path)
- HA of L3 Agent (ACT/STBY) we implemented
- Ansible based deployment/operation tools
- Items to watch when operating OpenStack
- Hardware specifications and resources we used to operate the Cloud
We will share a full design architecture and hardware sizing information for a large-scale cloud, and prove that OSS-based Neutron can handle a hundred servers.
2. About Us
Ken Igarashi
○ Leading the OpenStack project at NTT DOCOMO
○ One of the first members to propose OpenStack Bare Metal Provisioning (now called "Ironic") - bit.ly/1stuN2E
Hiromichi Ito
○ CTO of Virtualtech Japan Inc.
Akihiro Motoki
○ Senior Research Engineer, NEC
○ Core developer of Neutron and Horizon
3. Design OpenStack Cloud
○ Information required
Ø Hardware resources/performance
– Management resources
– User resources
ü Nova, Cinder – depends on individual requirements
Ø Hardware/software configuration
– High Availability
– Network configuration (e.g. Neutron)
Ø Deployment tool
– Juju/MAAS, Fuel, Helion, RDO, etc.
○ How we got it
Ø Ran a simulation using 100 physical hosts
– Total 3,200 vCPUs, 12.8TB memory
– Collaboration with: National Institute of Information and Communications Technology, VirtualTech Japan Inc., NTT Advanced Technology Corporation, Japan Advanced Institute of Science and Technology, Tokyo University and Dell Japan Inc.
4. Test Environment
National Institute of Information and Communications Technology, Ishikawa prefecture
About 1,400 servers in a single site
5. StarBED – http://bit.ly/10gYttm
○ Open to any companies and organizations
○ Realistic and flexible experiments based on a bare-wire environment
Ø Research and development: new locator protocol development, home network protocol development, virtual node migration algorithms, HEMS management protocol, new tunnel protocols, inter-AS traceback
Ø Protocol / product evaluation: TCP behavior comparison, proxy server performance evaluation, evaluation of X-ray sharing, video conference protocol switching, FW benchmarking
Ø Education: security operation competition, cyber-range training, remote hands-on for Asian students, competition of cloud computing ideas
Ø Simulation: testbed federation algorithms, supporting software for controlling testbeds, wireless link simulation on wired links, IPv6 support on network testbeds
6. 100 Physical Servers on StarBED
○ OpenStack Icehouse
[Topology diagram: 2 spine switches (S6000) connected to 4 leaf switches (S4810) over 40Gb x 2 links; compute nodes (36 + 37 + 6), 21 management servers and 2 load balancers (BIG-IP 5200V) attached to the leaf switches over 10Gb x 1 and 10Gb x 4 links]
8. Network Redundancy
○ Multi-Chassis Link Aggregation (MLAG)
Ø Bonding: eth1/eth2 combined into bond0 (z.z.z.z), MLAG with VRRP across two switches, ECMP above
Ø Mature, but needs expensive switches
○ Endhost Equal Cost Multi Path (ECMP)
Ø Routing protocol on the host: eth1 (x.x.x.x) and eth2 (y.y.y.y) with a loopback address (z.z.z.z)
Ø Removes network complexity, but maturity is a concern
9. Neutron Configuration
○ Virtual network creation is essential to increase network security
Ø ML2 with a tunnel network configuration
– Type drivers
ü VXLAN
ü GRE
– We chose VXLAN (see the sketch below)
ü VXLAN uses MAC-in-UDP (MAC Address in User Datagram Protocol) encapsulation
ü The load balancing algorithm works effectively by hashing on the UDP port number
ü Much network hardware supports VXLAN
Ø Mechanism drivers
– Open vSwitch (OVS)
– Linux Bridge
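For reference, a minimal ml2_conf.ini sketch of this choice (the option names are standard Icehouse ML2 settings; the VNI range and local_ip value are illustrative assumptions, not the presenters' production values):

  # /etc/neutron/plugins/ml2/ml2_conf.ini
  [ml2]
  type_drivers = vxlan,gre
  tenant_network_types = vxlan
  mechanism_drivers = openvswitch      # or linuxbridge
  [ml2_type_vxlan]
  vni_ranges = 1:10000                 # assumed range
  [ovs]
  local_ip = 192.0.2.10                # this host's VTEP address (example)
  [agent]
  tunnel_types = vxlan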
10. Throughput for Different Network Configurations
○ Throughput between 1 VM and 1 VM on different physical hosts (1 TCP connection)
Ø Not much difference between OVS and Linux Bridge
Ø MLAG performs better than ECMP
[Chart: throughput (Gbps, 3.4–4.6 scale) for ovs_mlag, ovs_ecmp and bridge_mlag]
11. Throughput for Different Network Configurations (cont.)
○ MLAG with OVS seems the best configuration today
Ø Performance, potential, stability
[Chart: the same throughput comparison (Gbps, 3.4–4.6 scale) for ovs_mlag, ovs_ecmp and bridge_mlag]
We increased the VMs' MTU to 8950 to get this performance; the physical network bandwidth is 20Gbps.
12. Throughput for Different Numbers of VMs
○ Each VM communicates with a random VM on a different physical host (1 connection per VM)
○ Only 50% of the total bandwidth is consumed even though all physical resources are allocated to VMs
[Chart: per-VM throughput (0–3.5 Gbps) and physical-host throughput (0–12 Gbps) vs. number of servers (100–477), for MTU 1500 and MTU 8950]
* PHY: the VMs' total throughput measured at a physical host
13. Slow Throughput
○ We could get 19Gbps (MTU 1500) between physical hosts
○ Enabling VXLAN
Ø We could get only 10Gbps (MTU 8950)
Ø VM's CPU load during the communication (top output below)
Ø The throughput is greatly reduced by turning on VXLAN
– The CPU is overloaded by VTEP software processing
ü Packet encapsulation and de-capsulation
[top output on the sending ("Server") and receiving hosts; columns: %CPU, %MEM, TIME+, COMMAND]
  Server:    89.3  0.0  391:31.82  vhost-yyyyy
             49.3  0.8  257:06.66  qemu-system-x86
  Receiver:  98.4  0.0  462:41.90  vhost-xxxxx
             42.9  0.9  294:34.67  qemu-system+
14. NIC with VXLAN Offload Support
○ A NIC with VXLAN offload should be able to reduce the CPU load
○ Available devices
Ø Mellanox ConnectX-3 Pro
– World's first VXLAN offload NIC
Ø Intel X710, XL710
– Released September 2014
Ø Emulex XE102
Ø QLogic 8300 series
– Supported since the October 21, 2013 software release
Ø QLogic NetXtreme II 57800 series
– Broadcom sold its NetXtreme II line of 10GbE controllers and adapters to QLogic
15. Throughput using VXLAN Offload NIC
○ Throughput between VMs on 4 different physical hosts (2 senders, 2 receivers)
○ It can consume 98% of the total physical bandwidth
Ø VXLAN offload with MTU 8950
[Chart: per-VM throughput (0–4.0 Gbps) and physical-host throughput (0–25 Gbps) vs. number of servers (10–38), with offload ON/OFF at MTU 1500 and MTU 8950; annotated gains: 3.5 ~ 5.6 x and 1.3 ~ 1.4 x]
* PHY: the VMs' total throughput measured at a physical host
16. CPU Load
[Charts: CPU [%] per Gbps vs. number of servers (1–16), offload ON vs. OFF; sender (Tx) CPU reduced by 27.1%, receiver (Rx) CPU reduced by 28.5%]
17. VXLAN Offload NIC
○ We could get 1.3~5.5 times the throughput compared to a NIC without offload capability
○ CPU load on a physical host was reduced by 27~28%
○ MTU 8950 showed 1.5~1.6 times better throughput than MTU 1500
Ø We decided to set MTU 9000 on the physical hosts, but deliver MTU 1500 to VMs via the DHCP server
Ø Users can raise the MTU themselves
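A minimal sketch of delivering MTU 1500 through Neutron's dnsmasq-based DHCP agent (DHCP option 26 is the standard interface-MTU option; the file paths are assumptions, not the presenters' exact setup):

  # /etc/neutron/dhcp_agent.ini
  [DEFAULT]
  dnsmasq_config_file = /etc/neutron/dnsmasq-neutron.conf

  # /etc/neutron/dnsmasq-neutron.conf
  dhcp-option-force=26,1500    # advertise MTU 1500 to VMs; users may raise it themselves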
19. 24/7 Support
○ You need 10–12 people
Ø 4 groups plus a few extra people are required
○ If fixing a problem can be deferred, we only have to work on weekdays
Ø High Availability is the key to achieving this
○ Our design
Ø Double redundancy for hardware
Ø Triple redundancy for software ⇒ survives double failures
20. High Availability
○ Load balancer based
Ø OpenStack APIs (e.g. Nova), Zabbix, and MySQL (Galera: DB1–DB4 + Arbitrator)
Ø The LB pair provides load balancing, SSL termination and health checks
○ Others
Ø Neutron agents, PXE/DNS/DHCP, MAAS, RabbitMQ
21. MySQL HA
○ 4 nodes + 1 Arbitrator (quorum-based voting)
○ Read/write to a single node; DB1–DB4 are ranked Priority 1–4 on the LB pair
○ LB health check (see the sketch below)
Ø Check TCP port 3306
Ø Cluster status: show status like 'wsrep_ready' must return 'ON'
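A minimal health-check sketch along these lines (the monitoring credentials and the stand-alone shell script are assumptions; a load balancer would typically invoke such a script through its health monitor):

  #!/bin/sh
  # A TCP connect to 3306 is implied by a successful mysql login;
  # then verify the Galera cluster state.
  STATUS=$(mysql -h 127.0.0.1 -u monitor -pMONITOR_PW -N -B \
    -e "SHOW STATUS LIKE 'wsrep_ready'" | awk '{print $2}')
  [ "$STATUS" = "ON" ] && exit 0    # healthy
  exit 1                            # unhealthy: the LB fails over by priority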
22. Galera Cluster State Transition
[State diagram: Open → Primary → Joiner → Joined [3] → Synced [4], with Donor [2] entered during IST and SST; wsrep_ready='ON']
○ Checking only WSREP states 2 (Donor) and 4 (Synced) can't cover all the states
23. MySQL HA
○ Node recovery (1): the health check detects DB1's failure
Ø Check TCP port 3306
Ø Cluster status: show status like 'wsrep_ready'
[Diagram: LB pair and Arbitrator with DB1 (Priority 1, failed) and DB2–DB4 (Priorities 2–4)]
24. MySQL HA
○ Node recovery (2): the designated DB is changed from DB1 to DB2
Ø Cluster status: wsrep_ready changes 'ON' -> 'OFF' on DB1
[Diagram: the LB pair now directs reads/writes to DB2 (Priority 2)]
25. MySQL HA
○ Node recovery (3): DB1 is restored from DB4 (the lowest-priority node) using IST or SST
Ø IST: Incremental State Transfer
Ø SST: State Snapshot Transfer
[Diagram: synchronization from DB4 to DB1]
26. MySQL HA
○ Node recovery (4): DB1's priority is changed before it rejoins the cluster
[Diagram: priorities reassigned — DB2 → 1, DB3 → 2, DB4 → 3, DB1 → 4]
27. MySQL HA
○ Node recovery (5): the cluster is back in its normal state
[Diagram: DB2 Priority 1, DB3 Priority 2, DB4 Priority 3, DB1 Priority 4]
28. Recovery Time
○ Time for IST
[Chart: recovery time (up to ~350 s) vs. background traffic (120/240 TPS), split into JOINER→JOINED and JOINED→SYNCED phases, for two node performance levels (max 340 TPS and max 1356 TPS)]
29. Recovery Time (cont.)
○ Time for SST
[Chart: recovery time (up to ~2,000 s) vs. background traffic (120/240 TPS), split into JOINER→JOINED and JOINED→SYNCED phases, for the same two performance levels (max 340 TPS and max 1356 TPS)]
30. Disaster Recovery
○ Losing all databases (DB size: 3GB)
[Diagram: recovery sequence — fix the network, restore DB1 from backup and run MySQL (DB1 becomes DONOR), then SST to the stand-by nodes DB2–DB4; measured times: 11 seconds, 70 seconds, 70 seconds and 98.2 minutes, plus 97.5 minutes of bin-log recovery for 12 hours of logs]
31. MAAS-HA
○ MAAS includes
Ø DNS, DHCP, tftp
○ DNS
Ø Master – Slave
○ DHCP (ISC DHCP)
Ø Replication (delivering fixed IP addresses through DHCP)
○ MAAS and tftp
Ø Backed up by a VM
[Diagram: a MAAS VM image kept on storage is activated when the MAAS server fails]
32. RabbitMQ-HA
○ Option 1: add multiple RabbitMQ addresses to the configuration files
Ø Easy configuration, and application-level health monitoring
Ø At least 3 RabbitMQ hosts (5 ideally) are required to guard against split-brain
○ Option 2: read/write to a single node through a load balancer
Ø No need to worry about split-brain – 3 RabbitMQ hosts
Ø Network-level health monitoring
[Diagram: two Nova deployments — one listing multiple RabbitMQ addresses directly, one going through the LB pair; the cluster is configured with cluster_partition_handling = 'autoheal'. A configuration sketch follows.]
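A minimal sketch of the two pieces of configuration involved (cluster_partition_handling is a real RabbitMQ setting and rabbit_hosts is the Icehouse-era oslo.messaging option; the host names are placeholders):

  %% /etc/rabbitmq/rabbitmq.config (Erlang terms)
  [{rabbit, [{cluster_partition_handling, autoheal}]}].

  # nova.conf, option 1: list every RabbitMQ host
  rabbit_hosts = mq1:5672,mq2:5672,mq3:5672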
34. Network Setup
○ DHCP agent
Ø Supports Active-Active; assign a virtual network to multiple agents
ü dhcp_agents_per_network = 3 (should be <= 3; see the snippet below)
○ L3 agent
Ø Supports only Active-Standby
Ø If it fails, we need to migrate its routers to another agent
○ Metadata agent
Ø Has no state ⇒ just keep metadata-agent running on all nodes
[Diagram: 3 network nodes, each running L3, DHCP and metadata agents, connected to the Neutron server and message queue over the control plane, and to the compute nodes over the VXLAN data plane and the external network]
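The DHCP scheduling option above lives in neutron.conf on the Neutron server; a minimal placement sketch (value as on the slide):

  # neutron.conf (Icehouse)
  [DEFAULT]
  dhcp_agents_per_network = 3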
35. Monitoring Points
[Diagram: the slide-34 topology, with a GW router on the external net, annotated with four checks — [1] PING from the internal net, [2] PING from the external net, [3] agent state check via REST API against the Neutron server, [4] PING from the C-plane]
36. Health Checks against Failures
○ Data plane connectivity
Ø [1] Internal network for VXLAN (ping)
Ø [2] External network (ping)
– If these fail, users cannot communicate through routers
○ Network agent health check (L3 agent, DHCP agent)
Ø [3] Agent alive state from the Neutron server (REST API agent-list; sketch below)
– Each Neutron agent reports its state via the message queue
○ [4] Control network connectivity (ping)
– If it fails, we are no longer able to control the node
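A minimal sketch of check [3] against the Neutron REST API (GET /v2.0/agents is the real v2.0 endpoint; $TOKEN and $NEUTRON_URL stand in for a Keystone token and the Neutron endpoint):

  #!/bin/sh
  # Flag agents that are administratively up but no longer reporting alive.
  curl -s -H "X-Auth-Token: $TOKEN" "$NEUTRON_URL/v2.0/agents" | python -c '
  import json, sys
  for a in json.load(sys.stdin)["agents"]:
      if a["admin_state_up"] and not a["alive"]:
          print("DEAD: %s on %s" % (a["agent_type"], a["host"]))
  '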
37. Recovery from Failures
○ (1) Disable the agents on the failed host (a CLI sketch of steps (1)–(2) follows this sequence)
[Diagram: the slide-34 topology; the L3 and DHCP agents on the failed network node are marked disabled]
38. Recovery from Failures (cont.)
○ (2) Migrate the networks/routers
[Diagram: routers and networks rescheduled from the failed node's agents to the two healthy network nodes]
39. Recovery from Failures (cont.)
[Diagram: the migrated state — all routers and networks now served by agents on the two healthy network nodes]
40. Recovery from Failures (cont.)
○ (3) Shut down the failed node's NICs (or the node)
[Diagram: the failed network node isolated from the data and control planes]
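A sketch of steps (1) and (2) using the Neutron v2.0 API and the standard Icehouse CLI (the IDs, $TOKEN and $NEUTRON_URL are placeholders; this mirrors, rather than reproduces, the presenters' own tooling):

  #!/bin/sh
  # (1) Disable the agents on the failed host (PUT /v2.0/agents/{id})
  curl -s -X PUT -H "X-Auth-Token: $TOKEN" -H "Content-Type: application/json" \
    -d '{"agent": {"admin_state_up": false}}' \
    "$NEUTRON_URL/v2.0/agents/$DEAD_L3_AGENT_ID"
  # (2) Migrate a router to a live L3 agent, and a network to a live DHCP agent
  neutron l3-agent-router-remove $DEAD_L3_AGENT_ID $ROUTER_ID
  neutron l3-agent-router-add $LIVE_L3_AGENT_ID $ROUTER_ID
  neutron dhcp-agent-network-remove $DEAD_DHCP_AGENT_ID $NET_ID
  neutron dhcp-agent-network-add $LIVE_DHCP_AGENT_ID $NET_ID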
41. Tips: Checking External Network Connectivity
○ A dedicated network namespace on the network node for external connectivity checking
Ø Gives the network node reachability to the external network
Ø Using an IP address inside an isolated namespace avoids exposing the node host itself to the public network
[Diagram: network node with router netns instances and a dedicated checking netns (holding the IP address), all attached to the external bridge via ethN; the PING check targets the GW router, with no access to the host]
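A minimal sketch of building such a checking namespace by hand (the namespace name, the bridge name br-ex and the addresses are illustrative assumptions):

  # Create an isolated namespace with a leg on the external bridge
  ip netns add extcheck
  ip link add extcheck0 type veth peer name extcheck1
  ip link set extcheck1 netns extcheck
  ovs-vsctl add-port br-ex extcheck0      # attach the host side to the external bridge
  ip link set extcheck0 up
  ip netns exec extcheck ip addr add 203.0.113.50/24 dev extcheck1
  ip netns exec extcheck ip link set extcheck1 up
  ip netns exec extcheck ip link set lo up
  # PING check against the GW router, without exposing the host itself
  ip netns exec extcheck ping -c 1 -W 2 203.0.113.1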
42. Traffic During Router Migration
○ Throughput from an external node to a VM
○ Injected a control-plane failure and migrated the router to another L3 agent
[Chart: throughput (0–1,000 Mbps) vs. elapsed time (0–90 s); traffic drops for about 10 seconds during the migration]
43. Router Migration Progress
○ Migrated 88 routers from one L3 agent to two other L3 agents
[Chart: number of routers processed (0–100) vs. elapsed time (0:00:00–0:02:18), for REST API requested, REST API processed, L3-agent processed, and L3-agent processed (aggregated)]
44. Possible Improvements
○ Integration with the L3-Agent HA feature
Ø It greatly improves data-plane availability
Ø Monitoring of external network connectivity needs to be improved in L3-HA (no monitoring for the external network now)
Ø Router migration based on C-plane monitoring is still required
[Diagram: the topology with HA L3 agents across the network nodes; HA covers internal network failures, while external network monitoring and C-plane monitoring remain necessary]
45. Possible Improvements (cont.)
○ Integration with Juno Neutron features (option sketch below)
Ø Using the L3-Agent HA feature (previous page)
Ø Leveraging L3-agent automatic rescheduling
– Helps us reduce the number of REST API calls
– Juno Neutron supports rescheduling routers away from inactive L3 agents
– "admin_state" is not considered for rescheduling ← needs to be improved
○ Possible contributions to Neutron upstream
Ø DHCP agent automatic rescheduling
Ø LBaaS agent rescheduling
– There is no way to reassign the LBaaS agent for the HAProxy driver
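For reference, a sketch of the Juno neutron.conf options behind these two features (the option names are real Juno settings; the values are illustrative):

  # neutron.conf (Juno)
  l3_ha = True                              # VRRP-based L3-agent HA
  max_l3_agents_per_router = 3
  allow_automatic_l3agent_failover = True   # reschedule routers off dead agents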
55. Management Resources
[Chart: management resource breakdown across the Controller, RabbitMQ, MySQL, Neutron, Zabbix, log/backup storage, etc.]
56. Management Resources (cont.)
[Chart: the same breakdown with Nova compute shown for comparison]
57. Scalability Test
○ We measured VM boot time for 0–5,000 instances
[Chart: average elapsed time to boot an instance (up to ~50 s) and error rate (up to ~60%) per 1,000-VM interval (0–1000 … 4000–5000)]
58. Database Size – Zabbix
Record sizes and retention: History 50 bytes / 30 days; Trend 128 bytes / 90 days; Event 130 bytes / 90 days.

                    Servers                       Switch (per port)            Tempest
                    Health Check   Usage          Health Check   Usage         System Check
                    (30 seconds)   (180 seconds)
  Item number       69             557            1              24            500
  Size (history)    15GB           40GB           687MB          5GB           108MB
  Size (trend)      2GB            15GB           88MB           2GB           138MB
  Size (event) *    1GB            1GB            1GB            1GB           1GB
  Total size        18GB           57GB           2GB            8GB           1GB

* Assume 1 event/second. Grand total: 86 GB.
59. Database Size – OpenStack

                          Sep 14 2014   Sep 25 2014
  OpenStack related
    Keystone *            1.4GB         1.4GB
    Nova (28k -> 55k)     451MB         856MB
    Neutron (7k -> 9k)    78MB          235MB
    Glance                64MB          89MB
    Heat                  45MB          55MB
    Cinder                39MB          43MB
    Subtotal              2.1GB         2.7GB
  MySQL related
    Transaction log       4.1GB         4.1GB
    ibdata1               268MB         268MB
  Total size              6.4GB         7.0GB

* Ran "keystone-manage token_flush" every hour
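The hourly token flush in the footnote is typically wired up as a cron job; a minimal sketch (the cron.d path and the keystone user are assumptions):

  # /etc/cron.d/keystone-token-flush
  0 * * * * keystone /usr/bin/keystone-manage token_flush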
60. Deployment Tools
○ We can change the configuration easily (e.g. HA and Neutron)
○ We can use Ansible for deployment and operation (a sketch follows below)
HA approach by deployment tool:
– DOCOMO (Ansible based): MySQL HA: LB + Percona; RabbitMQ HA: config-file based (pause_minority); LB HA: commercial products; Network: Neutron + own HA
– Mirantis Fuel: MySQL HA: haproxy + corosync + pacemaker + Galera; RabbitMQ HA: RabbitMQ cluster (autoheal) + LB; LB HA: haproxy (nameserver) + corosync + pacemaker; Network: Neutron
– HP Helion: MySQL HA: haproxy + keepalived + Galera; RabbitMQ HA: config-file based (pause_minority); LB HA: haproxy + keepalived; Network: Neutron DVR
– Canonical Juju/MAAS: MySQL HA: haproxy + corosync + pacemaker + Percona; RabbitMQ HA: config-file based (ignore); LB HA: haproxy + corosync + pacemaker; Network: Neutron
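A minimal sketch of the Ansible layout such a tool implies (the group and role names here are invented for illustration and are not the actual DOCOMO playbooks):

  # site.yml – assumed top-level playbook; re-running it applies configuration changes
  - hosts: controllers
    roles: [mysql_galera, rabbitmq, keystone, nova_controller, neutron_server]
  - hosts: network_nodes
    roles: [neutron_agents]
  - hosts: compute_nodes
    roles: [nova_compute]
  # Deploy or reconfigure with: ansible-playbook -i inventory site.yml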
61. Tips Learned from Scalability Tests
○ Default security group
Ø iptables entries are added to/deleted from all VMs whenever you create/delete a VM
⇒ ovs-agent became busy when we created many VMs
○ Number of Neutron workers (consolidated sketch below)
Ø neutron.conf
– api_workers = 'number of cores'
– rpc_workers = 'number of cores'
Ø metadata_agent.ini
– metadata_workers = 'number of cores'
○ Number of file descriptors
Ø Default: 1024
Ø RabbitMQ: more than 5,000 connections
Ø metadata-ns-proxy (L3 agent, DHCP agent): requests x 2
○ VM creation retries
Ø nova.conf
– scheduler_max_attempts = 1 ⇒ no difference between 1 and 3
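A consolidated sketch of these settings (the option names are the ones on the slide; the core count of 8 and the limits.conf values are illustrative assumptions):

  # neutron.conf
  api_workers = 8            # set to the number of cores
  rpc_workers = 8

  # metadata_agent.ini
  metadata_workers = 8

  # nova.conf
  scheduler_max_attempts = 1

  # /etc/security/limits.conf – raise the default 1024 FD limit, e.g. for RabbitMQ
  rabbitmq  soft  nofile  65536
  rabbitmq  hard  nofile  65536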