Virtual MX Platform Architecture: VMX Architecture Specification
NOTICE: This document contains proprietary and confidential information of Juniper Networks, Inc. and must not be
distributed outside of the company without the permission of Juniper Networks engineering.
b) Applications
[1] Layer 3 VPNs
[2] Layer 2 VPNs
[3] Multicast VPNs
[4] BGP Multipath
[5] RSVP-signaled LSPs
[6] LDP-signaled LSPs
[7] MPLS Fast Reroute, Node Protection, and Link Protection
[8] Point-to-multipoint (RSVP) LSPs
[9] BGP MPLS Multicast VPNs
[10] L2 circuits
2.1.2 Non-Goals
We do not support the multi-chip model of cosim.
For the first release we will not support running VMX outside a Juniper cloud environment that uses VMM.
Multi-chassis and backup RE support are not present in the first phase (no dual-RE support).
Multiple PFEs are not supported in the first phase.
2.1.3 Limitations
TTRACE does not work with riot simulation. It works with lu-cosim, but lu-cosim is very slow. We need to find an alternative for debugging LU ucode with riot simulation; riot does support some trace functionality, which will be evaluated and leveraged as an interim measure until full TTRACE functionality is made available.
Since MQ simulation is not included, apart from queuing functionality, features such as fragmentation and reassembly will not work.
Since only the LU chip of the PFE hardware is simulated, the functionality of the fabric, IX, QX, and MQ chips is not supported.
2.1.4 Packaging Considerations
The VMX image is packaged into jinstall-vmx, which differs slightly from the regular MX jinstall. The main differences are:
- It includes the cosim and vmxt binaries instead of the other PFE binaries.
- vmx.properties is used to set the RE type in Junos to JNX_RE_TYPE_VMX and ch_model to MODEL_VMX.
- A new fake MIC with a single PIC is included to support 10x1GE interfaces.
- A file /etc/vplat.conf is included to specify the VMX platform type, which is used by the rpio and ukern modules to do VMX-specific initialization (see the sketch after this list).
- The PFE ukern is built from a new makefile with the target name 'vmxt', based on the olivet makefile. The following modules, which are present in the npc makefile, are currently not included in the vmxt makefile:
  o Diagnostic modules such as ae11010_diag, phy_diags, bcm87xx_diags, bcm54xxx_diag, etc.
  o if_ipv4.co (uses if_tnp.co instead)
  o host_loopback.co
  o Zarlink library
  o ISSU blobs
  o Drivers for TDM interfaces
  o sntp_pfe_ipv4
  o mlpfe_npc (uses mlpfe_absent)
  o bulkget (uses bulkget absent)
  o ppman (uses ppman absent)
Some of the above, such as ppman, bulkget, and mlpfe, need to be added to enable the corresponding functionality. Note that we might not be able to test reassembly and fragmentation without MQ proxy support, but single-fragment packets and LFI can be handled without the MQ proxy. These will be fixed in phase 1.
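As a rough illustration of how a module might key its initialization off /etc/vplat.conf, here is a hedged C sketch. The file name comes from this document, but the key/value format ("platform=vmx") and the helper name are assumptions made purely for illustration and are not taken from the rpio or ukern sources.

    /* Hedged sketch: detect the VMX platform type from /etc/vplat.conf.
     * The "platform=<name>" format is an assumption for illustration. */
    #include <stdio.h>
    #include <string.h>

    #define VPLAT_CONF "/etc/vplat.conf"

    /* Returns 1 if the config identifies the platform as VMX, else 0. */
    static int platform_is_vmx(void)
    {
        char line[128];
        FILE *fp = fopen(VPLAT_CONF, "r");
        int is_vmx = 0;

        if (fp == NULL) {
            return 0;                   /* no file: not a VMX platform */
        }
        while (fgets(line, sizeof(line), fp) != NULL) {
            if (strncmp(line, "platform=vmx", 12) == 0) {
                is_vmx = 1;             /* VMX-specific init would branch here */
                break;
            }
        }
        fclose(fp);
        return is_vmx;
    }

    int main(void)
    {
        printf("VMX platform: %s\n", platform_is_vmx() ? "yes" : "no");
        return 0;
    }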
The following figures illustrate the host-injected, host-bound, and transit traffic paths.
EM0   fxp0
EM1   unused
EM2   ge-0/0/0
EM3   ge-0/0/1
...
EM11  ge-0/0/9
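A hedged illustration of this mapping follows. The helper name and the assumption that EM4 through EM10 continue the same linear pattern are illustrative, not taken from VMX code.

    /* Hedged sketch of the em-to-ge mapping above. */
    #include <stdio.h>

    /* Returns the ge-0/0/<port> number for a host emN interface, or -1 if
     * the interface is not a revenue port (EM0 = fxp0, EM1 = unused). */
    static int em_to_ge_port(int em)
    {
        if (em < 2 || em > 11) {
            return -1;
        }
        return em - 2;          /* EM2 -> ge-0/0/0 ... EM11 -> ge-0/0/9 */
    }

    int main(void)
    {
        for (int em = 0; em <= 11; em++) {
            int port = em_to_ge_port(em);
            if (port < 0) {
                printf("em%d -> (not a ge port)\n", em);
            } else {
                printf("em%d -> ge-0/0/%d\n", em, port);
            }
        }
        return 0;
    }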
Based on the above choices, for phase 2 we will go with Junos as the guest OS for the FPC VM.
The key changes from phase 1 that would enable us to achieve this are:
- The same Junos image will be used for the RE and the FPC, with roles passed to the runtime instances via a config file (perhaps loader.conf?). Using the same Junos image for the FPC avoids creating different image bundles for different roles and downloading the FPC image from the RE. However, downloading support
EM2   unused
EM3   unused
EM12  unused
2.2.6 GRES
The platform-independent infrastructure of GRES is expected to work without any change on VMX. The platform-specific part of GRES will be slightly different for VMX owing to the lack of the CB hardware. The following is a brief summary of the platform-specific components of MX that are employed for GRES.
- Hardware arbitration in the CB FPGA uses the slot number information, the presence status of the individual RE slots, and the preparedness of each RE's software to take up the master role as inputs to decide on the master. The arbitration algorithm gives priority to RE0 over RE1 when both are present and prepared to become master. The hardware arbitration state machine guarantees that only one of the REs becomes the master at any given point in time. This is important on a real MX because many of the hardware resources, such as the i2c path, are to be accessed only from the master RE.
- The hardware maintains its own version of which RE is the master. This can be read back from either RE through its respective CB.
- Once mastership is acquired, the CB hardware also aids in early detection of a mastership failure, such as a kernel hang condition. This is done through a mastership watchdog timer in the CB FPGA. The master RE has to keep stroking this watchdog periodically to prevent it from expiring and releasing mastership in the hardware. The RE kernel on an mx240/mx480/mx960 router uses a "refresh" timer to stroke this watchdog periodically.
- Since chassisd is responsible for determining which RE should be the master, a failure in the master chassisd should trigger a mastership switch. The kernel maintains a "relinquish" timer for this purpose, which expires after a pre-determined time and triggers a mastership relinquish operation in the FPGA. Mastership is preserved by chassisd periodically invoking the "hw.re.mastership" sysctl. The periodic keepalives and the mastership strobing are done from a real-time thread within the chassis-control daemon (see the sketch below).
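As an illustration only, the following is a hedged C sketch of how such a strobing thread might look, assuming a FreeBSD-style sysctlbyname() interface and using the "hw.re.mastership" sysctl name mentioned above. The value written, the strobe interval, and the thread setup are assumptions and are not taken from the chassisd sources.

    /* Hedged sketch: periodically stroke the mastership watchdog via the
     * "hw.re.mastership" sysctl from a dedicated thread. */
    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <pthread.h>
    #include <unistd.h>

    static void *mastership_strobe_thread(void *arg)
    {
        int master = 1;                  /* assumed value: keep/assert mastership */

        for (;;) {
            /* Writing the sysctl periodically prevents the (real or emulated)
             * mastership watchdog from expiring and relinquishing mastership. */
            (void)sysctlbyname("hw.re.mastership", NULL, NULL,
                               &master, sizeof(master));
            sleep(1);                    /* assumed strobe interval */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;

        pthread_create(&tid, NULL, mastership_strobe_thread, NULL);
        pthread_join(tid, NULL);
        return 0;
    }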
The following section discusses the key functionality offered by the CB FPGA hardware on the MX and broadly outlines how similar capability can be achieved on the VMX platform, which does not have the required FPGA hardware.
2.2.6.1 Arbitration:
The CB FPGA arbitrates in resolving contention when both REs stake a claim to mastership. Normally, the state machine and the packet exchanges between the chassis control daemons on the two REs should prevent this from happening. However, if this communication were to break down for some reason, the chassisd on one RE could potentially be unaware of the presence of the chassisd on the other RE, and consequently both could stake a claim to mastership simultaneously. The hardware arbitration helps prevent this "split-brain" behavior. On the VMX platform, there are several ways of achieving the same thing:
a) Implement a virtual CB FPGA device through QEMU emulation. This can use standard QEMU device definition mechanisms such as 'qdev'. The virtual CB FPGA would then be passed to the JUNOS RE guest as a PCI device. The device need only support the most important functionality of the FPGA, not all of it. The guest JUNOS would access it as if it were a real FPGA, through memory-mapped I/O. These MMIO dereferences would trap into QEMU, which would do the necessary emulation using standard QEMU constructs. The signals between the two CBs that cross the midplane boundary can be emulated using shared-memory and interrupt constructs offered by QEMU functionality such as Nahanni/ivshmem. This method offers "full virtualization" of the CB hardware and as such is the most ideal. Although standard interfaces such as 'qdev' could be employed to ensure that the interaction between the "emulated FPGA" and the rest of the QEMU code is through well-defined interfaces, this approach still has the disadvantage that the device emulation code has to be added to and maintained as part of the QEMU source. In addition, it requires that this custom QEMU then be deployed on the target vhost servers or standalone systems. The effort required to add this is another aspect that needs to be kept in mind. (A sketch of such a device model is shown after this list.)
b) Do not implement the hardware arbitration and consequently rely only on the Ethernet-based communication mechanism currently present in MX. The arbitration scheme that favors RE0 over RE1 can be implemented in software if it is not already present (TBD). To help solve the split-brain situation, the two REs could potentially be interconnected through a second em interface to provide additional redundancy. The current JUNOS kernel supports platforms that do not implement hardware-aided mastership.
Please refer to macro PRODUCT_SUPPORTS_HARDWARE_MASTERSHIP in the code.
Approach [b] outlined above appears the simplest at the time of writing this document.
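For approach (a), the following is a minimal, hedged sketch of what the MMIO emulation core of such a virtual CB FPGA could look like inside QEMU. The register layout (mastership and watchdog registers), the vmx_cb_ names, and the region size are illustrative assumptions; only the generic QEMU constructs (MemoryRegionOps, memory_region_init_io()) are standard. Exposing the region to the guest as a PCI BAR and coordinating with the peer RE over ivshmem are not shown.

    /* Hedged sketch of a virtual CB FPGA MMIO model for QEMU; names and
     * register offsets are illustrative, not existing code. */
    #include "qemu/osdep.h"
    #include "exec/memory.h"

    #define CB_REG_MASTERSHIP 0x00   /* which RE currently holds mastership */
    #define CB_REG_WATCHDOG   0x04   /* master strokes this to stay master  */

    typedef struct VmxCbFpgaState {
        MemoryRegion mmio;
        uint32_t mastership;
        uint32_t watchdog;
    } VmxCbFpgaState;

    static uint64_t vmx_cb_read(void *opaque, hwaddr addr, unsigned size)
    {
        VmxCbFpgaState *s = opaque;

        switch (addr) {
        case CB_REG_MASTERSHIP:
            return s->mastership;        /* guest reads arbitration result */
        case CB_REG_WATCHDOG:
            return s->watchdog;
        default:
            return 0;
        }
    }

    static void vmx_cb_write(void *opaque, hwaddr addr, uint64_t val,
                             unsigned size)
    {
        VmxCbFpgaState *s = opaque;

        switch (addr) {
        case CB_REG_MASTERSHIP:
            /* The arbitration policy (e.g. prefer RE0) would be applied
             * here, coordinating with the peer RE via shared memory. */
            s->mastership = (uint32_t)val;
            break;
        case CB_REG_WATCHDOG:
            s->watchdog = (uint32_t)val; /* stroke the mastership watchdog */
            break;
        }
    }

    static const MemoryRegionOps vmx_cb_ops = {
        .read  = vmx_cb_read,
        .write = vmx_cb_write,
    };

    /* The owning device would call this and map the region, e.g. as a
     * PCI BAR, so the guest can access it through MMIO. */
    static void vmx_cb_init(VmxCbFpgaState *s, Object *owner)
    {
        memory_region_init_io(&s->mmio, owner, &vmx_cb_ops, s,
                              "vmx-cb-fpga", 0x1000);
    }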
2.4 Usability
The following Matrix documents highlight the usability and debugging techniques that can be used with VMX:
- VMX quick start guide: https://matrix.juniper.net/docs/DOC-137243
- Serviceability: https://matrix.juniper.net/docs/DOC-149431
- Boot image option:
http://pkg.dcbg.juniper.net/twiki/bin/view/Virtualization/VMMInfoFAQ#Basedisk_V_s_Bootdisk
basedisk:
o A basedisk, when specified as part of a VM directive, is used as the base from which a COW image will be created for each VM. The basedisk is never modified. Each VM will have a new COW image created from this basedisk and will then boot from the COW image.
bootdisk:
o A bootdisk, when specified as part of a VM directive, is used as the actual image that the VM will boot from. By default, each bootdisk is passed to QEMU in read-only mode, so changes made to the bootdisk while the VM is running are not written back to the disk so a
The basedisk or base image can be in raw format. The basedisk is also called the backing file. Bootdisks, or derived images, are derived from the basedisk. Several derived images may depend on one base image at the same time; therefore, changing the base image can damage the dependencies. While using a derived image, QEMU writes changes to it and uses the base image only for reading. Because the raw format does not support the backing_file option, qcow2 is a popular choice for derived images.
[Reference: http://doc.opensuse.org/products/draft/SLES/SLES-kvm_sd_draft/cha.qemu.guest_inst.html]
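To make the basedisk/COW relationship concrete, the following hedged C snippet shows a host-side helper that creates a derived qcow2 image backed by a basedisk by invoking the standard qemu-img tool; the file paths are examples, and this helper is not part of VMX or VMM itself. Recent qemu-img versions may additionally expect the backing format to be stated explicitly (e.g. -F raw).

    /* Hedged sketch: create a qcow2 COW overlay backed by a raw basedisk.
     * Equivalent shell command:
     *   qemu-img create -f qcow2 -b /images/vmx-base.img /images/vm1-overlay.qcow2
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        execlp("qemu-img", "qemu-img", "create",
               "-f", "qcow2",
               "-b", "/images/vmx-base.img",      /* example basedisk path  */
               "/images/vm1-overlay.qcow2",       /* example derived image  */
               (char *)NULL);

        perror("execlp qemu-img");                /* reached only on failure */
        return EXIT_FAILURE;
    }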
2.8 Assumptions
The QEMU image and parameters must be invoked in a controlled manner; hence, there is an assumption of the presence of the VMM system, which will create, interconnect, and destroy VMs.
QEMU runs in 64-bit mode (again, this is enforced by the VMM system).