Cinnober
Copyright 2011 Cinnober Financial Technology AB. All rights reserved. Cinnober Financial Technology AB reserves the right to make changes to the information contained herein without prior notice. No part of this document may be reproduced, copied, published, transmitted, or sold in any form or by any means without the expressed written permission of Cinnober Financial Technology AB. Cinnober and TRADExpress are trademarks or registered trademarks of Cinnober Financial Technology AB in Sweden and other countries. Other product or company names mentioned herein may be the trademarks of their respective owners.
In the last few years, latency has been the most important characteristic of financial transaction systems. Cinnober has focused its research on several aspects of low-latency technology in order to develop a new, ultra-fast trading system: TRADExpress Ultra. In this paper we present key findings relating to how TRADExpress Ultra achieves single-digit microsecond latency with commercially available equipment.
Low latency comes at a cost. You need first-class data centers and network infrastructure, but you also need a fuss-free, streamlined transaction model; you must simplify the transaction processing.
Measuring latency
As we have shown in our previous white papers, to demonstrate low latency you need not only the fastest software, processors and network available; you also need to define clearly where and how you measure it. After all, a single, one-way network hop using the fastest commercially available technology will incur delays of 12 microseconds. In practice, this makes single-digit microsecond latency beyond the first firewall, gateway or directly connected client very hard to achieve. In this paper, our transaction figures are measured both at the server (door-to-door) and at a co-located client after the firewall.
Minimalistic design
Having efficiency as the chief guiding principle in this research has led to a set of design decisions:
- Use a binary transaction protocol to eliminate data conversions and complex parsing.
- Use Remote Direct Memory Access (RDMA) verbs and zero-copy mechanisms to eliminate network protocol stack processing.
- Use preallocated data structures to eliminate memory turnover and the associated garbage collections completely.
- Use a highly streamlined matching engine.
- Tune the hardware and OS for low latency.
Processing efficiency
If latency is the key quality requirement from the user's point of view, efficiency is the chief concern from the designer's. There should be no wasted bytes or extraneous processing instructions in the system, no temporary data structures created on demand, no unnecessary wait states.
The rest is just applied Java engineering.
High-speed Java
Java is high-speed. For over a decade, Cinnober has built high-performance, large-scale systems in pure Java. Building on this expertise, we have implemented TRADExpress Ultra with exceptional performance characteristics. While every implementation language has characteristics that make it more or less fit for a particular purpose, Java is eminently suitable for building efficient, fast, large-scale systems, and for building them quickly enough to meet the most aggressive time-to-market requirements.
Execution efficiency
To create the fastest possible matching engine we have implemented a limit order book which matches orders in near constant time. Complex order strategies may be layered upon the basic model outside the latency-critical path of the matching engine, or be left to the clients to implement.
Memory efficiency
Modern computer languages provide a plethora of mechanisms and constructs that ease the expression of complex algorithms and hide implicit complexities, such as data management, conversion and lookup. While these features are well-intended, they also hide the amount of processing required, which can degrade performance. At Cinnober, we have always taken great care to tune and control Java memory management, but to achieve single-digit microsecond latency we have had to take this to new levels: we have eliminated memory turnover, which means that no temporary memory is created anywhere in the transaction execution path. This also accelerates processing, since there is no unnecessary data copying.
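To make the two ideas above concrete, the following is a minimal sketch, not the TRADExpress Ultra implementation, of how preallocation and constant-time operations can go together in Java. It assumes a hypothetical one-sided book with a fixed ladder of price ticks; the class name PreallocatedBook and all of its fields are illustrative. All storage is allocated once in the constructor, so adding and matching orders creates no objects and gives the garbage collector nothing to do.

```java
import java.util.Arrays;

/** Illustrative only: one side of a book, fixed tick ladder, no allocation after construction. */
final class PreallocatedBook {
    static final int NONE = -1;

    // Order state lives in parallel primitive arrays that are allocated exactly once.
    private final long[] qty;     // remaining quantity per order slot
    private final int[] next;     // next order at the same tick, or next free slot
    private final int[] headAt;   // oldest resting order per price tick
    private final int[] tailAt;   // newest resting order per price tick
    private int freeHead;         // head of the free-slot list

    PreallocatedBook(int maxOrders, int priceTicks) {
        qty = new long[maxOrders];
        next = new int[maxOrders];
        headAt = new int[priceTicks];
        tailAt = new int[priceTicks];
        Arrays.fill(headAt, NONE);
        Arrays.fill(tailAt, NONE);
        // Chain every slot into the free list: 0 -> 1 -> ... -> NONE.
        for (int i = 0; i < maxOrders; i++) {
            next[i] = (i + 1 < maxOrders) ? i + 1 : NONE;
        }
        freeHead = (maxOrders > 0) ? 0 : NONE;
    }

    /** Rests an order at the given tick in O(1); returns its slot, or NONE if the pool is full. */
    int add(int tick, long quantity) {
        int slot = freeHead;
        if (slot == NONE) {
            return NONE;
        }
        freeHead = next[slot];
        qty[slot] = quantity;
        next[slot] = NONE;
        if (tailAt[tick] == NONE) {
            headAt[tick] = slot;
        } else {
            next[tailAt[tick]] = slot;
        }
        tailAt[tick] = slot;
        return slot;
    }

    /** Fills up to the given quantity against resting orders at one tick, oldest first. */
    long match(int tick, long quantity) {
        long filled = 0;
        int slot = headAt[tick];
        while (slot != NONE && filled < quantity) {
            long take = Math.min(qty[slot], quantity - filled);
            qty[slot] -= take;
            filled += take;
            if (qty[slot] == 0) {
                // Unlink the exhausted order and return its slot to the free list.
                int done = slot;
                slot = next[slot];
                headAt[tick] = slot;
                if (slot == NONE) {
                    tailAt[tick] = NONE;
                }
                next[done] = freeHead;
                freeHead = done;
            }
        }
        return filled;
    }
}
```

Each call does a fixed amount of work per order touched, and slot indices replace object references, which is one common way to keep a matching path free of temporary objects. A real engine would also need two sides, best-price tracking, cancels and trade reporting, all omitted here.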
Network efficiency
We also needed a method for transporting data efficiently across the network. Conventional verbose protocols, such as FIX, are inherently less efficient to process than a binary protocol, since data must be converted by both sender and receiver. A binary protocol makes transporting native data between compatible hosts trivial. We have designed such a minimalistic and highly efficient binary network protocol. Since no conversions are involved, it is easy to use the fastest technologies available: RDMA transport mechanisms on InfiniBand and 10 Gbit Ethernet. As there is currently no native support for RDMA in Java, this required creating some native bridge code.
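As an illustration of why a fixed binary layout is cheap to process, here is a minimal Java sketch using the standard java.nio.ByteBuffer API. The message layout, the field names and the OrderCodec class are assumptions made for this example; they are not the TRADExpress Ultra wire format, and the RDMA bridge is not shown.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustrative fixed-layout order message; the layout is hypothetical. */
final class OrderCodec {
    // Assumed layout: orderId (8 bytes) | bookId (4) | side (1) | price in ticks (8) | quantity (8)
    static final int ORDER_ID = 0;
    static final int BOOK_ID = 8;
    static final int SIDE = 12;
    static final int PRICE = 13;
    static final int QUANTITY = 21;
    static final int SIZE = 29;

    /** Allocates one reusable off-heap buffer in the host's native byte order. */
    static ByteBuffer newBuffer() {
        return ByteBuffer.allocateDirect(SIZE).order(ByteOrder.nativeOrder());
    }

    /** Writes the fields at fixed offsets; no strings, no parsing, no temporary objects. */
    static void encode(ByteBuffer buf, long orderId, int bookId, byte side, long price, long quantity) {
        buf.clear();
        buf.putLong(orderId).putInt(bookId).put(side).putLong(price).putLong(quantity);
        buf.flip();
    }

    // Decoding is a direct read at a known offset.
    static long orderId(ByteBuffer buf)  { return buf.getLong(ORDER_ID); }
    static int bookId(ByteBuffer buf)    { return buf.getInt(BOOK_ID); }
    static byte side(ByteBuffer buf)     { return buf.get(SIDE); }
    static long price(ByteBuffer buf)    { return buf.getLong(PRICE); }
    static long quantity(ByteBuffer buf) { return buf.getLong(QUANTITY); }
}
```

Because sender and receiver agree on offsets and byte order, a field is read with a single memory access, whereas a tag=value text protocol must be scanned and converted from characters to numbers. Using native byte order assumes that both hosts share the same endianness, in line with the "compatible hosts" condition above.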
Test set-up
Our test system for measuring system latency in TRADExpress Ultra consists of a number of hosts interconnected with 10 Gbit Ethernet and InfiniBand. The 10 Gbit Ethernet is used as the transaction medium, while InfiniBand is used for point-to-point connections between primary and standby hosts. We also use 1 Gbit Ethernet for control and monitoring.
[Figure: test set-up. Primary and standby matching engines connected by InfiniBand, with 10 Gbit Ethernet as the transaction medium. Marked measurement points: co-located latency and door-to-door latency.]
Hardware
The tests were done on commercially available hardware: a mix of HP DL380 G7 and SL390s servers, all with Intel X5690 CPUs running at 3.46 GHz, and Mellanox ConnectX-2 NICs.
OS and hardware tuning
HP assisted us in tuning the servers for the best low-latency performance. In addition, we tuned the Linux IP stack.
Test
To test the system we used an order generator simulating 100 users, giving a sustained aggregated load of 100,000 TPS distributed uniformly over 10 order books totaling 15,000 active orders on a single primary/standby server pair. This is an order of magnitude greater than the maximum load of a world-class venue, and the system can be scaled up further as needed.
Results
We measured the latency in TRADExpress Ultra at two points (see the test set-up figure):
- Door-to-door, with and without synchronous replication to the standby matching engine
- At the co-located client after the firewall, with and without synchronous replication to the standby matching engine
The graphs below compare the response time distributions for the test runs. The graph figures are in nanoseconds (thousandths of a microsecond).
As can be seen from the data, the median door-to-door latency without replication is 1.4 microseconds (µs), and 6.6 µs with replication added. At the client we measure 16 and 21 µs for the same tests, respectively.
Run                              Median (µs)   90% (µs)
Door-to-door, no replication         1.4          2.0
At client, no replication           16.0         18.0
Door-to-door, synch. repl.           6.6          7.9
At client, synch. repl.             21.0         24.0
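The paper does not state how the median and 90% figures were computed. As a hedged sketch of one common approach, the nearest-rank percentile over recorded per-transaction latencies in nanoseconds can be calculated as follows. The LatencyStats class and its method names are assumptions made for this example.

```java
import java.util.Arrays;

/** Illustrative nearest-rank percentile over latency samples recorded in nanoseconds. */
final class LatencyStats {

    /** Returns the smallest sample such that at least p percent of all samples are less than or equal to it. */
    static long percentile(long[] samplesNanos, double p) {
        if (samplesNanos.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0.0 || p > 100.0) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesNanos.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Converts nanoseconds to microseconds for reporting. */
    static double toMicros(long nanos) {
        return nanos / 1000.0;
    }
}
```

For example, `LatencyStats.toMicros(LatencyStats.percentile(samples, 50))` gives a median in microseconds and the same call with 90 gives the 90% figure. Sorting a copy is acceptable here because the analysis runs after the test, outside the latency-critical path.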
Conclusion
We have demonstrated a full matching engine with single-digit microsecond latency, implemented with Java on commercially available hardware. Our conclusions are:
- Java remains the language nonpareil for implementing large, efficient and fast financial systems under aggressive time-to-market constraints.
- The utmost performance can be achieved with standard equipment; you just need to utilize it efficiently. This is a major benefit, since specialized hardware solutions are often costly and inflexible. Using commercially available equipment is often more cost-efficient and enables more frequent upgrades of the infrastructure to take advantage of new technologies and advances in hardware. We still see opportunities for specialized solutions, such as hardware-assisted matchers, but software can also deliver outstanding performance.
- The software is now so fast that the network is the new bottleneck. Sending a single transaction through the network takes orders of magnitude longer than processing the transaction itself, and the delays incurred are further compounded by each network hop. To alleviate these delays we have used RDMA, which has proved successful.
- Trading venues wishing to offer extremely low latency must begin by streamlining their business functionality. Complex order types or transaction schemes are better implemented in the trading client or as super constructs, layered outside the latency-critical path.
Passion for change
Cinnober provides mission-critical solutions to the world's most demanding financial marketplaces. We are passionate about one thing: applying advanced financial technology to help trading and clearing venues seize new opportunities in times of change. We build partnerships with our customers based on trust and transparency. We serve investment banks, exchanges, clearinghouses and other actors that have extreme demands on business functionality, high throughput and low latency. We currently have product-based offerings for a number of areas, such as marketplaces, post-trade management and binary markets. All solutions use our TRADExpress technology, designed for scalability and flexibility. Since our start over ten years ago, we have established ourselves as a leading provider of innovative solutions to premier financial institutions around the world. These include mission-critical systems for leading exchanges such as the Chicago Board Options Exchange, Deutsche Börse, London Metal Exchange and NYSE Liffe. We also power new initiatives and alternative trading systems such as Alpha Trading Systems, Burgundy and Markit BOAT. We are an independent provider of marketplace solutions, and do not operate a market of our own, avoiding any conflicts of interest. Our track record says it all. We help our customers turn change into a competitive advantage.