
Specification Title: VMX Architecture
Specification Owner:
Specification Revision: 1.1

Virtual MX Platform Architecture

Copyright © 2013, Juniper Networks, Inc.

NOTICE: This document contains proprietary and confidential information of Juniper Networks, Inc. and must not be
distributed outside of the company without the permission of Juniper Networks engineering.

Process Template: J3.02.P05.T01S - Template Revision 1.6


Table of Contents
1. Introduction
  1.1 Document Revision History
2. Functionality
  2.1 Phase 1: Single RE + Single PFE support
    2.1.1 Working Feature List
    2.1.2 Non-Goals
    2.1.3 Limitations
    2.1.4 Packaging Considerations
    2.1.5 Control Plane
    2.1.6 Data Plane
    2.1.7 Interface Mapping
  2.2 Phase 2: Dual-RE and Multiple PFE support
    2.2.1 Choice of guest OS on FPC VM
    2.2.2 Packaging Considerations
    2.2.3 Control Plane
    2.2.4 Data Plane
    2.2.5 Interface Mapping
    2.2.6 GRES
    2.2.7 Chassis FRU management on the VMX platform
      2.2.7.1 Online/Offline/Restart of an FPC
      2.2.7.2 Offline/Online of a MIC
      2.2.7.3 Environment monitoring of the FPCs
  2.3 Phase 3: ISSU Support [more details TBD]
  2.4 Usability
  2.5 Core/Log Management
  2.6 Occam Implications
  2.7 64-bit considerations
  2.8 Assumptions
  2.9 Constraints and Limitations
  2.10 Dependencies and Interactions with other Components in the System
  2.11 Free Software Considerations
3. Implementation Details
  3.1.1 Design considerations for future
4. Performance
  4.1 Target Performance



1. Introduction
VMX is a new product in the line of virtual Junos (VJX series) that closely emulates an MX Series router. The driving principles of this product are:
- It must closely resemble an MX Series router, including the PFE capabilities.
- It should be a viable alternative to an MX router for developers and the test community.
- It should be feature compatible with a Junos MX router. This requires that the same software modules that comprise an MX router be used in VMX. The approach taken is to ensure that all software features are retained; PFE features are expected to work by using the RIOT simulation. Other hardware-specific features such as the fabric and its management, environment monitoring modules, the CB FPGA, etc. will not be supported. Queuing and timing-related features cannot be supported because the corresponding hardware simulation is not present.

1.1 Document Revision History


Revision Author Date Description of Changes
1.1



2. Functionality
The above set of broad goals can be further refined into multiple phases based on the complexity of the solution.
Phase 1: Single RE, single PFE, no fabric, no queuing features.
Phase 2: Extend to a dual-RE architecture with multiple PFEs – dual RE, multiple PFEs, supports GRES, supports inter-PFE traffic (no ISSU, no queuing features).
Phase 3: Extend to support ISSU. Include multi-chip functionality.

2.1 Phase 1: Single RE + Single PFE support


VMX is a separate platform from MX. In Phase 1 it will support a single-RE router with a single PFE, where the PFE runs a simulation of the LU chip. The data plane forwarding for vMX is achieved by running the PFE microkernel as a process within the RE (called vmxt). vmxt connects to two entities:
- The RE kernel and chassisd. This ensures it gets the forwarding state updates in the form of IPCs from the RE. The chassis manager connection to chassisd ensures the FPC/PIC/interface state is created appropriately in the PFE.
- The trinity chip simulator, running on a separate Junos VM. This simulates the trinity LU chip. (Note: we do not use the multi-chip mode of the trinity cosim.)
In the rest of the document, ukern and vmxt are used interchangeably in the context of VMX.
VMX comprises the JUNOS image (or VM), code which runs as a guest in the hypervisor.
It depends on having an environment to run on, which consists of a hypervisor, a processor and device emulator, and a virtual network model to interconnect the VMs.
VMX phase 1 was developed using Olive as the baseline. Some of the key design choices that this entailed are as follows.
- The PFE microkernel runs as a process within Junos, versus a real MX where the ukern runs on a separate processor.
- PFE-to-RE communication is done over local sockets (a local TTP socket for TTP packets and local RDP sockets for the various control plane connections such as cm, pfeman, etc.). On a real MX these would run an IP control plane using TCP/UDP sockets for IPC and TTPoIP for data packets.
- A new package, jinstall-vmx, has been chosen, but the underlying modules included are mostly the same as MX.



2.1.1 Working Feature List
Phase 1 has feature parity with the MX for the following features.
a) Routing/Forwarding
The following features are supported for both the v4 and v6 families:
[1] Static routes
[2] ARP.
[3] OSPF, ISIS
[4] BGP
[5] RSVP, LDP
[6] LSPs
[7] IGMP, PIM
[8] Ping to and from RE

b) Applications
[1] Layer3 VPNs.
[2] Layer2 VPNs
[3] Multicast VPNs
[4] BGP Multipaths
[5] RSVP-signaled LSPs
[6] LDP-signaled LSPs
[7] MPLS Fast Reroute, Node Protection and Link Protection
[8] Point-to-multipoint (RSVP) LSPs
[9] BGP MPLS Multicast VPNs
[10] L2-circuits



[11] Layer 2 Switching Cross-connects
[12] Ethernet TCC and VLAN-TCC

2.1.2 Non-Goals
- We do not support the multi-chip model of cosim.
- For the first release we will not support running VMX outside a Juniper cloud environment that uses VMM.
- Multi-chassis and backup RE support are not present in the first phase (no dual-RE support).
- Multiple PFEs are not supported in the first phase.
2.1.3 Limitations
- TTRACE does not work with the RIOT simulation. It works with lu-cosim, but lu-cosim is very slow. We need to find an alternative to debug LU ucode with the RIOT simulation; RIOT does support some trace functionality, which will be evaluated and leveraged as an interim measure until full ttrace functionality is made available.
- Since the MQ simulation is not included, features such as fragmentation and reassembly will not work, in addition to the missing queuing functionality.
- Since only the LU chip of the PFE hardware is simulated, the functionality of the fabric, IX, QX and MQ chips is not supported.
2.1.4 Packaging Considerations
The VMX image is packaged into jinstall-vmx, which differs slightly from the regular MX jinstall. The main differences are:
- It includes the cosim and vmxt binaries instead of the other PFE binaries.
- vmx.properties is used to set the RE type in Junos to JNX_RE_TYPE_VMX and ch_model to MODEL_VMX.
- A new fake MIC with a single PIC is included to support 10x1GE interfaces.
- A file /etc/vplat.conf is included to specify the VMX platform type, which is used by the rpio and ukern modules to do VMX-specific initialization.
- The PFE ukern is built from a new makefile with the target name 'vmxt', based on the olivet makefile. The following modules, which are present in the npc Makefile, are currently not included in the vmxt makefile:
o Diagnostic modules like ae11010_diag, phy_diags, bcm87xx_diags, bcm54xxx_diag, etc.
o if_ipv4.co (if_tnp.co is used instead)
o host_loopback.co
o Zarlink library
o ISSU blobs
o Drivers for TDM interfaces
o sntp_pfe_ipv4
o mlpfe_npc (uses mlpfe_absent)
o bulkget (uses bulkget absent)
o ppman (uses ppman absent)
- Some of the above, such as ppman, bulkget and mlpfe, need to be added to enable the corresponding functionality. Note that we might not be able to test reassembly and fragmentation without MQ proxy support, but single-fragment packets and LFI can be handled without the MQ proxy. These will be fixed in phase 1.



2.1.5 Control Plane
One of the key differences between the MX Series routers and VMX phase 1 is that the PFE ukern (vmxt) runs as a process within Junos, and there is no etsec connectivity between the ukern and the rest of the Junos world. Local RDP sockets are used for communication between the daemons running on the RE and the ukern threads. This means that in phase 1 we use the TNP control plane.
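As a conceptual illustration of this difference, the following Python sketch uses a UNIX-domain socket as a stand-in for the phase 1 local RDP/TTP sockets and a TCP socket for the IP control plane used on a real MX (and in phase 2). The socket path, address, port and framing here are hypothetical; this is not the actual Junos IPC implementation.

import socket

def connect_to_ukern(phase1=True, ukern_addr="128.0.0.16", ukern_port=33000):
    if phase1:
        # Phase 1: the daemon and vmxt share one Junos instance, so a local
        # (UNIX-domain) socket is sufficient; there is no etsec link.
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect("/var/run/vmxt_cm.sock")        # hypothetical path
    else:
        # Real MX / phase 2: the ukern is reachable over the internal
        # Ethernet, so a TCP connection carries the control plane.
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.connect((ukern_addr, ukern_port))       # hypothetical address/port
    return sock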

2.1.6 Data Plane


A localttp socket is used for packet communication between the RE and the PFE in VMX. The TNP control plane is used in VMX versus the IP control plane used in MX. [TBD: associate the text with the diagram using numbers.]
o In the case of host-injected packets, the RE sends a TTPoTNP packet to the PFE using the localttp socket. The PFE ukern, after processing the packet, will inject it into the LU simulation via the RPIO tunnel. The RPIO tunnel will send the packet (or header) to cosim via a unix socket.
o COSIM will now process the packet and return the result to the rpio tunnel. Based on the queue number in the L2M cookie, the rpio tunnel will determine whether the packet is destined to the host or is to be sent out via a Gig interface (see the dispatch sketch after this list).
o If destined to the host, the packet will be sent to the ukern, where the ukern will punt it as TTPoTNP through the localttp socket to the RE.
o If destined to a Gig interface, the packet will be sent out via BPF (Berkeley Packet Filter) to the corresponding EM interface.



o When the peer receives a packet on an EM interface, it is handled by the RPIO tunnel using a BPF filter. It then injects this packet into the LU via the WAN stream corresponding to the Gig interface, where the cosim will process the packet and return the result to the RPIO tunnel, which will examine the L2M cookie and punt the packet or forward it to another Gig interface.
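A minimal Python sketch of the dispatch decision described above, as a reading aid only: the queue split, cookie handling and helper behaviour are assumptions, and the real logic lives in the RPIO tunnel code.

HOST_QUEUE_BASE = 8      # assumption: queues at or above this value are host bound

def punt_to_re(packet):
    # Stand-in for handing the packet back to the ukern, which punts it
    # to the RE as TTPoTNP over the localttp socket.
    print("punt %d bytes to RE" % len(packet))

def send_via_bpf(ifname, packet):
    # Stand-in for transmitting the packet through BPF on an em interface.
    print("tx %d bytes on %s" % (len(packet), ifname))

def dispatch(l2m_queue, packet):
    if l2m_queue >= HOST_QUEUE_BASE:
        punt_to_re(packet)
    else:
        # ge-0/0/N is backed by em(N+2) in the phase 1 interface mapping.
        send_via_bpf("em%d" % (l2m_queue + 2), packet)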

The following figures illustrate host injected, host bound and transit traffic paths.



2.1.7 Interface Mapping

EM Interface    Junos Interface
em0             fxp0
em1             unused
em2             ge-0/0/0
em3             ge-0/0/1
...             ...
em11            ge-0/0/9
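The phase 1 mapping above is static enough to capture in a few lines; the helper below is purely illustrative and assumes the em2-through-em11 pattern shown in the table.

def em_to_junos(em_index):
    # em0 carries management (fxp0), em1 is unused, and em2..em11 back
    # ge-0/0/0..ge-0/0/9 in the phase 1 model.
    if em_index == 0:
        return "fxp0"
    if em_index == 1:
        return None
    if 2 <= em_index <= 11:
        return "ge-0/0/%d" % (em_index - 2)
    raise ValueError("no phase 1 mapping for em%d" % em_index)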

2.2 Phase 2: Dual-RE and Multiple PFE support


In this phase VMX will support dual REs, with each RE running in a separate VM, and the PFE ukern will be co-hosted with RIOT (COSIM) in a separate FPC VM. Moving the ukern and cosim to a separate VM is required to ensure that when the active RE VM is restarted, the ukern is not lost. For performance reasons we would like to run only one ukern and one cosim per FPC VM.
2.2.1 Choice of guest OS on FPC VM
- FreeBSD:
Pros:
1. The vmxt image compiled and included in the RE should be downloadable and run as-is in a FreeBSD environment.
2. Ensures that no other unwanted daemons are running on the FPC VMs.
Cons:
1. The vmxt image and riot image need to be downloaded to the FreeBSD VM, which requires additional development (similar to rom monitor code).



2. vmxt depends on additional Junos libraries, which also need to be downloaded.
- Junos:
Pros:
1. The vmxt image does not need to be downloaded; the same RE image can be used as the FPC image, in which case the vmxt and riot binaries would already be packaged. However, a download utility is still required to handle a change in the RE image.
2. Library dependencies need not be worried about, as packaging is already complete.
Cons/work involved:
1. At runtime, Junos needs to determine whether it is a standalone RE+FPC (phase 1), an RE, or an FPC, and based on that the right set of daemons will need to be spawned.

Based on the above choices, for phase 2 we will go with Junos as the guest OS for the FPC VM.

Key changes from phase 1 that would enable us to achieve this are:
- The same Junos image will be used for the RE and the FPC, with roles passed to the runtime instances via a config file (perhaps loader.conf?). Using the same Junos image for the FPC avoids creating different image bundles for different roles and avoids downloading an FPC image from the RE. However, downloading support will be added to allow the user to upgrade/downgrade the image on the RE VM without having to perform the same steps on each FPC VM.
- The RE will enable em1 for inter-RE communication and for RE-to-PFE communication.
- FPC VMs will enable em1 for communication with the RE. Since there is no real etsec driver in the ukern, a fake etsec driver will be used which uses BPF to transmit and receive via the em driver for the packet path. This fake etsec interface will use the internal GE ifd and ifl used for communication with the RE. This also means that ukern-to-RE communication will no longer be based on a local RDP socket as in phase 1; instead the ukern side will support an IP control plane over the etsec interface, similar to current MPCs.
- FPC VMs will enable em2 for use as the fabric data path between PFEs. The RPIO tunnel module will be enhanced to check the destination PFE in the fabric header and send the packet to the right PFE VM (the MAC address will include the PFE ID).
- FPC VMs will enable em3 to em13 to represent GigabitEthernet interfaces.
- Similar to the MX960, phase 2 will support a maximum of 12 FPCs. The user may choose how many FPCs, and which FPCs, are present via the vmm configuration.
- HA arbitration will be done by chassisd passing messages between the two REs without using the CB FPGA. The CB FPGA will be absent. It might, however, be possible to simulate the CB FPGA logic using a VIRTIO shared file system, which can be explored if we need this functionality.
- In this phase we will also allow XE interfaces in Junos for VMX. The underlying interfaces will still be em interfaces; this allows existing scripts that work on XE interfaces to be run on VMX. A static model will be followed of supporting a 20x1GE MIC on MIC0 and a 2x10GE MIC on MIC1 per FPC. In the future this can be made dynamic by passing additional configuration from the vmm config.
- In this phase an MQ proxy will be added to the RPIO tunnel module to support LU-based fragmentation and reassembly. This will leverage the XM proxy that is being used by the XL-only cosim.
2.2.2 Packaging Considerations
In phase 2 the packaging changes are minimal. The VMX image is still packaged into jinstall-vmx. The main differences from phase 1 are:
- From the vmm config file, VMM/QEMU will set up the VM role as either RE+FPC (phase 1), RE0, RE1 or FPCn in a config file (loader.conf?). (Exact details are TBD.)
- Based on the VM role, the following will be done (see the role-selection sketch after this list):
o In the case of RE+FPC, the image is supposed to work as in phase 1, so there will be no change in behavior. We will not try to support the phase 1-like model where the RE and FPC run in a single VM.
o In the case of RE0 or RE1, the COSIM and VMXT daemons will not be launched. The RE will be initialized to use the IP control plane and will specify em1 as the interface to be used for RE-to-ORE and RE-to-PFE communication. The VMM configuration will put em1 of the REs and of the FPC VMs in the same VDE bridge.
o In the case of FPC VMs, an FPC image download utility will be run (new functionality) which will download the COSIM and VMXT binaries from the master RE via the EM connectivity and spawn the COSIM and VMXT daemons. Other Junos daemons will not be launched. (The exact way to do this is TBD; having other daemons running may not impact functionality in a negative way, but it is not required.)
o In the case of FPC VMs, the control plane will be initialized using the IP control plane, as the FPC VM now more closely resembles an MPC by having an etsec interface (aliased to em1) which connects it to the RE.
o The em2 interfaces of the FPC VMs (which are used for fabric connectivity) will not be programmed in promiscuous mode. The MAC address will be set based on the FPC slot ID.



o The RPIO tunnel module will be made aware of the PFE mapping. When COSIM sends an L2M parcel to the RPIO tunnel, it will look at the PFE ID in the fabric header and add a destination MAC address to the packet that is specific to the destination FPC. The RPIO module of the destination FPC will strip the source and destination MAC addresses added to these packets and inject them into the COSIM on the fabric stream.
- The file /etc/vplat.conf in Junos will be updated with the VM role information so that modules such as rpio and the ukern can do the right initialization.
- One requirement we have when we separate the RE and FPCs into different VMs is the ability to stop/start/restart VMs from within Junos. This is required to ensure that FPC control from chassisd and the ability to restart the other RE are available. Currently this is under investigation.
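A hedged sketch of the role selection described above. The /etc/vplat.conf key name, role strings and daemon lists are assumptions for illustration; the exact file format and launch mechanism are still TBD as noted.

def read_role(path="/etc/vplat.conf"):
    role = "re+fpc"                        # phase 1 style default
    with open(path) as f:
        for line in f:
            key, _, value = line.partition("=")
            if key.strip() == "vm-role":   # hypothetical key name
                role = value.strip()
    return role

def daemons_for(role):
    if role in ("re0", "re1"):
        # RE VMs: normal Junos daemons, no cosim/vmxt; IP control plane over em1.
        return ["chassisd", "rpd", "dcd"]
    if role.startswith("fpc"):
        # FPC VMs: download and run only the PFE pieces.
        return ["cosim", "vmxt"]
    # re+fpc: single-VM phase 1 behaviour.
    return ["chassisd", "rpd", "dcd", "cosim", "vmxt"]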
2.2.3 Control Plane
One of the key differences in phase 2 is that when we have separate VMs for the RE and the FPCs, the IP control plane can be enabled. The control and data packets will no longer use local sockets; instead, control will use TCP/UDP sockets while the data plane will operate over the etsec interface as TTPoIP.
The following picture depicts the control plane for phase 2.

2.2.4 Data Plane


The data plane will use TTPoIP. In this case the RE and FPCs will learn reachability through TNP discovery over the em1 interfaces of the respective VMs, and host-injected/host-bound packets will be routed via this interface.



o In the case of host-injected packets, the RE sends a TTPoIP packet to the PFE using the em1 interface. The PFE ukern, after processing the packet, will inject it into the LU simulation via the RPIO tunnel. The RPIO tunnel will send the packet (or header) to cosim via a unix socket.
o COSIM will now process the packet and return the result to the rpio tunnel. Based on the queue number in the L2M cookie, the rpio tunnel will determine whether this packet is destined to the host, is to be sent out via an EM interface, or is destined to a different PFE via the fabric.
o If destined to the host, the packet will be sent to the ukern, where the ukern will punt it as TTPoIP through the em1 interface to the RE.
o If destined to an EM interface, the packet will be sent out via BPF (Berkeley Packet Filter) to the corresponding EM interface.
o If destined to a different PFE, the packet will be sent via BPF to the em2 interface after adding the right source and destination MAC addresses to the packet (sketched below).
o Note that in the FPC VM, the role of the guest OS (Junos) is limited to providing the underlying EM interface driver and providing the CPU to run the VMXT, RIOT and RPIO tunnel components.
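To make the fabric hop concrete, here is an illustrative sketch of encoding the destination PFE into the Ethernet header before sending over em2. The MAC layout and ethertype are assumptions; the text above only states that the MAC address will include the PFE ID.

def fabric_mac(fpc_slot):
    # Locally administered unicast MAC whose last octet carries the FPC slot.
    return bytes([0x02, 0x00, 0x00, 0x00, 0xFA, fpc_slot & 0xFF])

def fabric_frame(packet, dest_slot, my_slot):
    # Prepend dst MAC, src MAC and an (assumed) experimental ethertype;
    # the peer RPIO tunnel strips this header and injects the packet on
    # its fabric stream towards the local COSIM.
    return fabric_mac(dest_slot) + fabric_mac(my_slot) + b"\x88\xb5" + packet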



2.2.5 Interface Mapping

RE EM Interface     Junos Interface
em0                 fxp0
em1                 RE-to-ORE and RE-to-FPC link
em2                 unused
em3                 unused
...                 ...
em12                unused

FPCx EM Interface   Junos Interface
em0                 management port for the VM
em1                 FPC-to-RE etsec interface
em2                 fabric connectivity
em3                 ge-x/0/0 [x is the FPC slot]
...                 ...
em13                ge-x/0/9 [x is the FPC slot]

2.2.6 GRES
The platform-independent infrastructure of GRES is expected to work without any change on vMX. The platform-specific part of GRES will be slightly different for vMX owing to the lack of the CB hardware. The following is a brief summary of the platform-specific components of MX that are employed for GRES.

- Hardware arbitration in the CB FPGA uses the slot number information, the presence status of the individual RE slots and the preparedness of each RE's software to take up the master role as inputs to decide on the master. The arbitration algorithm gives priority to RE0 over RE1 when both are present and prepared to become master. The hardware arbitration state machine guarantees that only one of the REs becomes the master at any given point in time. This is important on a real MX, as many of the hardware resources, such as the i2c path, are to be accessed only from the master RE.

- The hardware maintains its own version of which RE is the master. This can be read back from either RE through its respective CB.

- Once mastership is acquired, the CB hardware also aids in early detection of a mastership failure, such as a kernel hang condition. This is done through a mastership watchdog timer in the CB FPGA. The master RE has to keep stroking this watchdog periodically to prevent it from expiring and releasing the mastership in the hardware. The RE kernel on an mx240/mx480/mx960 router uses a "refresh" timer to stroke this watchdog periodically.



- Chassisd has a mastership state machine that keeps track of and integrates the various events that impact mastership. Although the CB FPGA maintains who the current master is, it is chassisd that is responsible for determining which RE should be the master. The chassisd instances on the master and backup REs coordinate to determine this. The Ethernet interface em1 interconnects one RE with the other. Chassisd uses this link to send periodic "keepalive" and "re info" messages to the other RE. A loss-of-keepalive event can trigger a mastership switch in chassisd's mastership state machine.

- Since chassisd is responsible for determining which RE should be the master, a failure of the master chassisd should trigger a mastership switch. The kernel maintains a "relinquish" timer for this purpose, which expires after a pre-determined time and triggers a mastership relinquish operation in the FPGA. Mastership is preserved by chassisd periodically invoking the "hw.re.mastership" sysctl. The periodic keepalives and the mastership strobing are done from a real-time thread within the chassis-control daemon.

The following sections discuss the key functionality offered by the CB FPGA hardware on the MX and broadly outline how similar capability can be achieved on the vMX platform, which does not have the required FPGA hardware.
2.2.6.1 Arbitration:
The CB FPGA arbitrates contention when both REs stake a claim to mastership. Normally, the state machine and the packet exchanges between the chassis control daemons on the two REs should prevent this from happening. However, if this communication were to break down for some reason, the chassisd on one RE could potentially be unaware of the presence of the chassisd on the other RE, and consequently both could stake a claim to mastership simultaneously. The hardware arbitration helps prevent this "split-brain" behavior. On the vMX platform, there are several ways of achieving the same thing:
a) Implement a virtual CB FPGA device through QEMU emulation. This can use standard QEMU device definition mechanisms such as 'qdev'. This virtual CB FPGA would then be passed to the JUNOS RE guest as a PCI device. The device need only support the most important functionality of the FPGA, not all of it. The guest JUNOS would access it as if it were a real FPGA, through memory-mapped I/O. These MMIO dereferences would trap into QEMU, which would do the necessary emulation using standard QEMU constructs. The signals between the two CBs that cross the midplane boundary can be emulated using shared memory and interrupt constructs offered by QEMU functionality such as Nahanni/ivshmem. This method offers "full virtualization" of the CB hardware and as such is the most ideal. Although standard interfaces such as 'qdev' could be employed to ensure that the interaction between the "emulated FPGA" and the rest of the QEMU code is through well-defined interfaces, this approach still has the disadvantage that the "device" emulation code has to be added to, and maintained as part of, the QEMU source. In addition, it also requires that this custom QEMU then be deployed on the target vhost servers or standalone systems. The effort required to add this is another aspect that needs to be kept in mind.
b) Do not implement the hardware arbitration and consequently rely only on the Ethernet-based communication mechanism currently present in MX. The arbitration scheme that favors RE0 over RE1 can be implemented in software if it is not already present (TBD). To help solve the split-brain situation, the two REs could potentially be interconnected through a second em interface to provide additional redundancy. The current JUNOS kernel supports platforms that do not implement hardware-aided mastership; please refer to the macro PRODUCT_SUPPORTS_HARDWARE_MASTERSHIP in the code. A minimal sketch of such software-only arbitration is shown below.
Approach (b) outlined above appears the simplest at the time of writing this document.
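A minimal sketch of the software-only arbitration in approach (b), assuming the same RE0-over-RE1 preference the CB FPGA implements; the inputs and data structure are illustrative, not the chassisd state machine.

def elect_master(re_state):
    # re_state: {slot: {"present": bool, "ready": bool}} for slots 0 and 1.
    candidates = [slot for slot, state in sorted(re_state.items())
                  if state.get("present") and state.get("ready")]
    # Lowest slot (RE0) wins a tie; a lone candidate wins by default;
    # no candidate means no master has been elected yet.
    return candidates[0] if candidates else None

# Both REs present and ready -> RE0 becomes master, as on a real MX.
assert elect_master({0: {"present": True, "ready": True},
                     1: {"present": True, "ready": True}}) == 0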



2.2.6.2 Quick failure detection
The second piece of functionality offered by the CB FPGA is "quick" detection of RE failures through the watchdog hardware. Since VMX does not have the real hardware, this functionality can be offered through the following means:
a) The virtual CB FPGA implementation in QEMU, explained in option (a) of the previous section, can be extended to include this watchdog timer implementation.
b) Heartbeat more frequently than the real MX hardware. However, care should be taken in this case to prevent thrashing due to false failovers from 'glitches' introduced by latencies in the VM/VDE server loads (see the sketch below).
c) The Nahanni/ivshmem shared memory infrastructure in QEMU exports "posix" memory on the VM host as a PCI device to user-space applications on participating VMs. This allows multiple VMs to heartbeat to each other over a second, "faster" intercommunication mechanism (shared memory). An arbiter could be implemented that "ages/clears" the heartbeat signature written by the two REs. This would help the REs determine whether the other is still present and kicking.
Approach (b) outlined above appears the simplest at the time of writing this document, albeit at the expense of some functionality loss.
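A sketch of the debounce idea behind approach (b): heartbeat faster than the hardware watchdog would, but only declare the peer failed after several consecutive misses, so a scheduling glitch on a loaded vhost does not trigger a false failover. The interval and threshold values are illustrative assumptions.

import time

HEARTBEAT_INTERVAL = 0.2   # seconds between keepalives (assumed)
MISS_THRESHOLD = 5         # consecutive misses before declaring failure (assumed)

def peer_failed(last_seen, now=None):
    # Return True only after the peer has been silent for several intervals.
    now = time.time() if now is None else now
    missed = int((now - last_seen) / HEARTBEAT_INTERVAL)
    return missed >= MISS_THRESHOLD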

2.2.6.3 Mastership Change


The hardware offers the facility to relinquish and acquire mastership. This is used to support user-initiated mastership switches. The same functionality, however, is already available as part of the inter-chassisd messaging.

2.2.7 Chassis FRU management on the VMX platform

2.2.7.1 Online/Offline/Restart of an FPC


On a normal mx240/mx480/mx960 router, the chassis control daemon uses the i2cs (i2c slave) CPLD on the MPC to control power to it. The i2cs slave CPLD is accessible from the master RE. Since that hardware is not present on the vMX, the online/offline/restart functionality can be achieved using the following alternate methods:
a) The "vmm" infrastructure has a "vmm manager" (a socket client) and a "vmm server" (a socket server) that work together with each other and with components on the vhost hardware to start, stop, bind and unbind VMs. The chassis control daemon could behave like the "vmm manager" client to start/stop/restart VMs.
b) Since vmxt, riot and rpio tunnel are essentially processes on the FPC VM guest powered by JUNOS, a second approach is to start/stop/restart just these processes when the offline/online/restart operation is triggered by chassisd, while the JUNOS VM implementing the FPC continues to run. This would require message exchanges between chassisd on the master RE and, say, a "rom"-equivalent process on the FPC VM (sketched below). This "rom" process could also be extended to automatically download the vmxt, riot and rpio_tunnel images from the master RE over tftp during a reboot/restart operation. The RE and FPC VMs themselves shall be manually launched through the "vmm manager" tool, as is done today with the phase 1 vMX implementation.
The exact approach we will take needs to be decided after further investigation.
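To illustrate option (b), a toy supervisor for the FPC VM is sketched below: it restarts just the PFE processes while the Junos guest keeps running. The binary paths and the way chassisd would deliver the request are assumptions; only the process names come from the text.

import subprocess

PFE_PROCS = ["vmxt", "riot", "rpio_tunnel"]

class FpcSupervisor:
    def __init__(self, bindir="/usr/sbin"):   # hypothetical install location
        self.bindir = bindir
        self.children = {}

    def start(self):
        for name in PFE_PROCS:
            self.children[name] = subprocess.Popen([self.bindir + "/" + name])

    def stop(self):
        for proc in self.children.values():
            proc.terminate()
        self.children.clear()

    def restart(self):
        # What chassisd's "restart fpc" request would ultimately trigger on this VM.
        self.stop()
        self.start()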
2.2.7.2 Offline/Online of a MIC
The MIC is entirely a software abstraction in VMX. The behavior of a MIC offline/online will be inherited as-is from vMX phase 1.

2.2.7.3 Environment monitoring of the FPCs


On a real MX, the chassis control daemon is also tasked with monitoring the vitals of a FRU, such as the temperature of various ASICs, intake and exhaust air flows, and the voltage outputs of the various dc-dc converters on the MPC and MICs. However, these parameters have no significance for the vMX platform. Consequently, this functionality shall not be implemented for vMX phase 2. Also, FRUs such as the FPM, PEMs and fan trays shall not be implemented. The existing chassisd design already supports these kinds of per-platform customization.

2.3 Phase 3: ISSU Support [more details TBD]


In this phase ISSU will be supported on the VMX platform. To achieve this, the following needs to be done:
- The ability to patch ucode on the fly with RIOT. During the ISSU prepare phase, counter-related instructions are modified to suppress updating of counters. This allows the counter EDMEM to be re-used during the ISSU SW sync phase. These counter-related instructions are restored during ISSU HW sync phase 2. The current riot image converts the ucode to C, compiles it and runs it once at boot. The ability to patch is thus essential to support ISSU.
- The ability to download the ukern image from Junos and warm boot it. This is essential, as the new ukern image from the new Junos will need to be downloaded from the standby RE to do the upgrade.
- The ability to store part of the ukern memory as an ISSU blob which will be retained across the warm boot.
- In this phase the RIOT LU cosim model will support full ttrace functionality.

2.4 Usability
The following Matrix documents describe the usability and debugging techniques that can be used with VMX:
- VMX quick start guide: https://matrix.juniper.net/docs/DOC-137243
- Serviceability: https://matrix.juniper.net/docs/DOC-149431
- Boot image options:
http://pkg.dcbg.juniper.net/twiki/bin/view/Virtualization/VMMInfoFAQ#Basedisk_V_s_Bootdisk

2.5 Core/Log Management


The relevant concepts in this context are the image format, the image type specification used when launching a VM, and base images vs. derived images. QEMU supports various file formats for boot images, for example raw, cow, qcow, vdi, vmdk, cloop, dmg, bochs, vpc, vvfat, qcow2, parallels, nbd, etc.
[Reference: the qemu-img help page, i.e. "qemu-img -h".]
The VMM system supports the concept of a basedisk and a bootdisk. The difference between them is as follows:
[Reference: http://pkg.dcbg.juniper.net/twiki/bin/view/Virtualization/VMMInfoFAQ#Basedisk_V_s_Bootdisk]

- basedisk:
o A basedisk, when specified as part of a VM directive, is used as the base from which a COW image will be created for each VM. The basedisk is never modified. Each VM will have a new COW image created from this basedisk and will boot from that COW image.
- bootdisk:
o A bootdisk, when specified as part of a VM directive, is used as the actual image that the VM will boot from. By default, each bootdisk is passed to QEMU in read-only mode, so changes made to the bootdisk while the VM is running are not written back to the disk; a bootdisk will remain unchanged from when it was created, regardless of what the VM using it writes to it.
- bootdisk_rw:
o This is the same as a 'bootdisk', with the difference that when specified, it is passed to QEMU in read-write mode. All changes made to the disk by the VM when booted are committed back to the bootdisk when the VM is terminated. It is up to the user to ensure that they do not corrupt their disks in this way. If you shut down your VM in a non-graceful manner or cause any disk corruption while the VM is running, this corruption will be written back to the bootdisk_rw image and may render it unusable.

A basedisk or base image can be in raw format. A basedisk is also called a backing file. A bootdisk or derived image is derived from a basedisk. Several derived images may depend on one base image at the same time; therefore, changing the base image can damage the dependencies. When using a derived image, QEMU writes changes to it and uses the base image only for reading. Because the raw format does not support the backing_file option, qcow2 is a popular choice for derived images.

[Reference: http://doc.opensuse.org/products/draft/SLES/SLES-kvm_sd_draft/cha.qemu.guest_inst.html]

Creating qcow2 derived bootdisk_rw image from base raw image:


svpod1-vmm.englab.juniper.net:data/user_disks/sanjeevm> /vmm/bin/qemu-img create -b
/volume/vmcores/sanjeevm/jinstall-vmx-13.2I20130509_1211_sanjeevm-domestic.img -f qcow2
/vmm/data/user_disks/sanjeevm/jinstall-vmx-13.2I20130509_1211_sanjeevm-domestic-rw-vmx1.img

Specifying a basedisk in vmm.conf [many VMs can use the same basedisk]:


#define VMX_DISK basedisk "/volume/vmcores/sanjeevm/jinstall-vmx-13.2I20130509_1211_sanjeevm-
domestic.img";

Specifying bootdisk_rw in vmm.conf [each VM must have an exclusive bootdisk_rw]:


#define VMX_RW_DISK1 bootdisk_rw "/vmm/data/user_disks/sanjeevm/jinstall-vmx-
13.2I20130509_1211_sanjeevm-domestic-rw-vmx1.img";

- Moving data between host and guest


There are many ways to move information between guest and host; some of them are described here.
1. Data can be shared between the host and guest OS using any network protocol that can transfer files, such as NFS, SMB, NBD, HTTP, FTP, or SSH, provided that the network is appropriately set up and the appropriate services are enabled.
2. QEMU's SMB server [https://wiki.archlinux.org/index.php/QEMU#QEMU.27s_built-in_SMB_server]
3. Mounting the images, e.g. a qcow2 image can be mounted on the host using the qemu-nbd command.
4. Sharing a folder between host and guest OS using VirtFS: http://doc.opensuse.org/products/draft/SLES/SLES-kvm_sd_draft/cha.qemu.running.html#kvm.qemu.virtfs

- Accessing the information generated on the device


Now, with the relevant context established, we can consider the cases for vMX.
1. If the vMX VMs are launched with a read-only base or boot disk, then data (i.e. cores, logs, etc.) persists only until "unbind" is done. Therefore, information must be moved out of the VM before "unbind".
2. If the vMX VMs are launched with bootdisk_rw, then information (i.e. cores, logs, etc.) persists across unbind. In this case we have many more options for moving data out.



Note:
1. After unbind, a bootdisk_rw image can be preserved (as needed) to access information at a later time.
2. QEMU provides a "snapshot" facility: virtual machine snapshots are snapshots of the complete environment in which a VM guest is running. The snapshot includes the state of the processor (CPU), memory (RAM), devices, and all writable disks. This could be a very powerful feature from a debugging point of view. The "qemu-img snapshot" command can be used to create snapshots. [Reference: "qemu-img -h"]

2.6 Occam Implications


Occam is moving Junos from FreeBSD 6.1 to FreeBSD 10. At the same time it is also bringing in required modularity (a separate sandbox for FreeBSD), componentization and ownership, new features (64-bit support, SMP), etc. One implication from the vMX perspective is build and packaging. Since vMX derives its code mainly from MX and vJX, and both of these are moving onto the Occam base prior to vMX, there may not be any significant effort needed for vMX to move to the Occam base. Moving to Occam will, however, enable VMX to leverage new features, for example 64-bit and SMP support. Additional effort will be needed to leverage these new features for better scale, performance, etc.

2.7 64-bit considerations


VMX in phases 1 and 2 uses a 32-bit kernel. In future phases we can move to a 64-bit kernel, perhaps along with the Occam release. However, applications can continue to be 32-bit. We need to look at whether applications like vmxt, RIOT, rpio_tunnel, etc. need any change in this context. Any required changes need to be evaluated from an ABI and data-model perspective (with 64-bit we would be moving from the ILP32 data model to the LP64 data model).

2.8 Assumptions
The QEMU image and parameters must be invoked in a controlled manner; hence we assume the presence of the VMM system, which will create, interconnect and destroy VMs.
QEMU runs in 64-bit mode (again, this is enforced by the VMM system).

2.9 Constraints and Limitations


- Only the LU simulation is available, hence no support for queuing/scheduling.
- No fabric simulation.
- Limited pps throughput.
- Time-critical activities cannot be supported.
- Policing will not be accurate.
- Protocols with sub-100 ms timeouts (BFD, CFM with aggressive timers, performance/loss measurement) may not be supported.
- Not suitable for scaled-scenario testing.
- Fake interface driver, and hence very restricted support for interface features.
- No support for non-Ethernet interfaces, or Ethernet interfaces other than GE (phase 2 will extend this to XE).
- No support for timing features.



- Not suitable for MX platform testing or tests involving hardware.
- We will be limited by the maximum number of interfaces that the virtual environment is capable of providing on a single VM invocation. Currently an effort is under way to support 20x1GE ports on a single VM. In the case of phase 2, this will translate to 20x1GE ports per FPC.

2.10 Dependencies and Interactions with other Components in the System


VMX code depends on the hypervisor and its corresponding technologies, such as the virtual network model, as described in the Functionality section. Note that the actual environment the VMX image runs on can change; it could potentially run over any other hypervisor such as VMware. The other dependency is on the trinity chip simulator features.
Some of the parameters that define the requirements are:
1. QEMU: The current version of QEMU used is 0.13.0. There is a Juniper repository of it, which may have additional changes compared to what is available in the open source repository. We depend on QEMU to provide the e1000 virtualized Ethernet ports that are used for the underlying transport. In theory it may not matter which interface we use for the underlying transport, so other VM architectures may also be feasible.
2. Bridging software: To interconnect VMs, we need the EM interfaces to be part of a simulated switch. VDE is the software we currently use to provide this capability; any other technology, such as TAP interfaces, can also be used to achieve the same.
3. CPU: A CPU core for a VM need not be dedicated; it should be possible to run many VMs per CPU. [TBD: the exact performance and the optimal number of VMs per CPU core running at a specific speed need to be characterized and documented.]
4. RAM: Preliminary characterization for phase 1 indicates that each VM may take up to 2 GB of memory from the host.
5. Disk space: Preliminary characterization for phase 1 indicates that each VM may require around 30 GB of disk space (see the sizing sketch after this list).
6. HyBridge: It may be required in some cases to connect virtual topologies to real network devices (routers/test equipment); in the VMM architecture the HyBridge appliance provides this capability. In a standalone environment it should be possible to connect physical ports to VDE, and hence to the virtual topology, without using the HyBridge appliance.
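A back-of-the-envelope sizing helper based on the phase 1 numbers above (about 2 GB of RAM and 30 GB of disk per VM); treating phase 2 FPC VMs as having the same footprint is an assumption.

def topology_footprint(num_res=2, num_fpcs=12, ram_gb_per_vm=2, disk_gb_per_vm=30):
    vms = num_res + num_fpcs
    return {"vms": vms,
            "ram_gb": vms * ram_gb_per_vm,
            "disk_gb": vms * disk_gb_per_vm}

# A full dual-RE, 12-FPC topology works out to 14 VMs,
# roughly 28 GB of RAM and 420 GB of disk on the host.
print(topology_footprint())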

2.11 Free Software Considerations


No free software is being added to Junos as part of VMX. However, the environment it will be qualified in is an open source environment. The entire product is based on a CentOS 5.4 image and QEMU 0.13.0.



3. Implementation Details

3.1.1 Design considerations for future


Some of the vMX design choices were adopted for simplicity. In the future, we may change the architecture in some respects:
- Use MQ along with LU.
- Support other types of PFEs (XL, XL+XM, EA, TQ, etc.).
- Support multi-PFE FPCs.



4. Performance
Performance will depend on the cosim performance numbers. Currently, the benchmark shows 100-200K pps, in a
non-VM setup. We expect to have numbers close to this range in vMX.

4.1 Target Performance


<TBD>
