Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article
Public Access

Enabling Lightweight Transactions with Precision Time

Published: 04 April 2017 Publication History

Abstract

Distributed transactional storage is an important service in today's data centers. Achieving high performance without high complexity is often a challenge for these systems due to sophisticated consistency protocols and multiple layers of abstraction. In this paper we show how to combine two emerging technologies---Software-Defined Flash (SDF) and precise synchronized clocks---to improve performance and reduce complexity for transactional storage within the data center.
We present a distributed transactional system (called MILANA) as a layer above a durable multi-version key-value store (called SEMEL) for read-heavy workloads within a data center. SEMEL exploits write behavior of SSDs to maintain a time-ordered sequence of versions for each key efficiently and durably. MILANA adds a variant of optimistic concurrency control above SEMEL's API to service read requests from a consistent snapshot and to enable clients to make fast local commit or abort decisions for read-only transactions.
Experiments with the prototype reveal up to 43% lower transaction abort rates using IEEE Precision Time Protocol (PTP) vs. the standard Network Time Protocol (NTP). Under the Retwis benchmark, client-local validation of read-only transactions yields a 35% reduction in latency and 55% increase in transaction throughput.

References

[1]
Atul Adya, Robert Gruber, Barbara Liskov, and Umesh Maheshwari. Efficient optimistic concurrency control using loosely synchronized clocks. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, SIGMOD '95, pages 23--34. ACM, 1995.
[2]
Nitin Agrawal, Vijayan Prabhakaran, Ted Wobber, John D. Davis, Mark Manasse, and Rina Panigrahy. Design tradeoffs for ssd performance. In USENIX 2008 Annual Technical Conference, ATC'08, pages 57--70. USENIX Association, 2008.
[3]
Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat. A scalable, commodity data center network architecture. In Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication, SIGCOMM '08, pages 63--74, New York, NY, USA, 2008. ACM.
[4]
David G. Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee, Lawrence Tan, and Vijay Vasudevan. Fawn: A fast array of wimpy nodes. In Proceedings of the ACM SIGOPS 22Nd Symposium on Operating Systems Principles, SOSP '09, pages 1--14. ACM, 2009.
[5]
Berk Atikoglu, Yuehai Xu, Eitan Frachtenberg, Song Jiang, and Mike Paleczny. Workload analysis of a large-scale key-value store. In Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '12, pages 53--64, New York, NY, USA, 2012. ACM.
[6]
Jason Baker, Chris Bond, James C Corbett, JJ Furman, Andrey Khorlin, James Larson, Jean-Michel Leon, Yawei Li, Alexander Lloyd, and Vadim Yushprakh. Megastore: Providing scalable, highly available storage for interactive services. In CIDR, volume 11, pages 223--234, 2011.
[7]
Ilya Baldin, Jeff Chase, Yufeng Xin, Anirban Mandal, Paul Ruth, Claris Castillo, Victor Orlikowski, Chris Heermann, and Jonathan Mills. ExoGENI: A Multi-Domain Infrastructure-as-a-Service Testbed, pages 279--315. Springer International Publishing, Cham, 2016.
[8]
Philip A. Bernstein and Nathan Goodman. Multiversion concurrency control - theory and algorithms. ACM Transactions on Database Systems, 8(4):465--483, December 1983.
[9]
Philip A. Bernstein, Vassco Hadzilacos, and Nathan Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1987.
[10]
Philip A. Bernstein, Colin W. Reid, and Sudipto Das. Hyder - A transactional record manager for shared flash. In CIDR 2011, Fifth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 9--12, 2011, Online Proceedings, pages 9--20, 2011.
[11]
Matias Bjørling, Javier González, and Philippe Bonnet. Lightnvm: The linux open-channel ssd subsystem. In 15th USENIX Conference on File and Storage Technologies (FAST). USENIX, 2017.
[12]
Matias Bjørling. Operating System Support for High-Performance Solid State Drives. PhD thesis, Denmark, 2016.
[13]
Adrian M. Caulfield, Todor I. Mollov, Louis Alex Eisner, Arup De, Joel Coburn, and Steven Swanson. Providing safe, user space access to fast, solid state disks. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVII, pages 387--400, New York, NY, USA, 2012. ACM.
[14]
Joel Coburn, Trevor Bunker, Meir Schwarz, Rajesh Gupta, and Steven Swanson. From aries to mars: Transaction support for next-generation, solid-state drives. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 197--212, New York, NY, USA, 2013. ACM.
[15]
Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, and Ramana Yerneni. Pnuts: Yahoo!'s hosted data serving platform. Proc. VLDB Endow., 1(2):1277--1288, August 2008.
[16]
James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. Spanner: Google's globally-distributed database. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI'12, pages 251--264. USENIX Association, 2012.
[17]
Biplob Debnath, Sudipta Sengupta, and Jin Li. Flashstore: High throughput persistent key-value store. Proc. VLDB Endow., 3(1--2):1414--1425, September 2010.
[18]
Biplob Debnath, Sudipta Sengupta, and Jin Li. Skimpystash: Ram space skimpy key-value store on flash-based storage. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, pages 25--36. ACM, 2011.
[19]
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's highly available key-value store. In Proceedings of Twenty-first ACM SIGOPS Symposium on Operating Systems Principles, SOSP '07, pages 205--220. ACM, 2007.
[20]
Bailu Ding, Lucja Kot, Alan Demers, and Johannes Gehrke. Centiman: Elastic, high performance optimistic concurrency control by watermarking. In Proceedings of the Sixth ACM Symposium on Cloud Computing, SoCC '15, pages 262--275. ACM, 2015.
[21]
Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, and Miguel Castro. Farm: Fast remote memory. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation, NSDI'14, pages 401--414, Berkeley, CA, USA, 2014. USENIX Association.
[22]
Aleksandar Dragojević, Dushyanth Narayanan, Edmund B. Nightingale, Matthew Renzelmann, Alex Shamis, Anirudh Badam, and Miguel Castro. No compromises: Distributed transactions with consistency, availability, and performance. In Proceedings of the 25th Symposium on Operating Systems Principles, SOSP '15, pages 54--70, New York, NY, USA, 2015. ACM.
[23]
C. J. Fidge. Timestamps in message-passing systems that preserve the partial ordering. Proceedings of the 11th Australian Computer Science Conference, 10(1):56--66, 1988.
[24]
C. Gray and D. Cheriton. Leases: An efficient fault-tolerant mechanism for distributed file cache consistency. In Proceedings of the Twelfth ACM Symposium on Operating Systems Principles, SOSP '89, pages 202--210, New York, NY, USA, 1989. ACM.
[25]
Aayush Gupta, Youngjae Kim, and Bhuvan Urgaonkar. Dftl: A flash translation layer employing demand-based selective caching of page-level address mappings. In Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XIV, pages 229--240, New York, NY, USA, 2009. ACM.
[26]
Jian Huang, Anirudh Badam, Moinuddin K. Qureshi, and Karsten Schwan. Unified address translation for memory-mapped ssds with flashmap. In Proceedings of the 42Nd Annual International Symposium on Computer Architecture, ISCA '15, pages 580--591. ACM, 2015.
[27]
Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed. Zookeeper: Wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference, USENIXATC'10, pages 11--11. USENIX Association, 2010.
[28]
IEEE. Ieee standard for a precision clock synchronization protocol for networked measurement and control systems. IEEE Std 1588--2008 (Revision of IEEE Std 1588--2002), pages 1--269, 2008.
[29]
William K. Josephson, Lars A. Bongo, Kai Li, and David Flynn. Dfs: A file system for virtualized flash storage. Trans. Storage, 6(3):14:1--14:25, September 2010.
[30]
Anuj Kalia, Michael Kaminsky, and David G. Andersen. Using rdma efficiently for key-value services. In Proceedings of the 2014 ACM Conference on SIGCOMM, SIGCOMM '14, pages 295--306, New York, NY, USA, 2014. ACM.
[31]
David Karger, Eric Lehman, Tom Leighton, Rina Panigrahy, Matthew Levine, and Daniel Lewin. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web. In Proceedings of the Twenty-ninth Annual ACM Symposium on Theory of Computing, STOC '97, pages 654--663, New York, NY, USA, 1997. ACM.
[32]
H. T. Kung and John T. Robinson. On optimistic methods for concurrency control. ACM Transactions on Database Systems, 6(2):213--226, June 1981.
[33]
Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Operating Systems Review, 44(2):35--40, April 2010.
[34]
Leslie Lamport. Paxos made simple. ACM Sigact News, 32(4):18--25, 2001.
[35]
Costin Leau. Spring data redis -- retwis-j, 2013. http://docs.spring.io/spring-data/data-keyvalue/examples/retwisj/current/.
[36]
Collin Lee, Seo Jin Park, Ankita Kejriwal, Satoshi Matsushita, and John Ousterhout. Implementing linearizability at large scale and low latency. In Proceedings of the 25th Symposium on Operating Systems Principles, SOSP '15, pages 71--86, New York, NY, USA, 2015. ACM.
[37]
Ki Suh Lee, Han Wang, Vishal Shrivastav, and Hakim Weatherspoon. Globally synchronized time via datacenter networks. In To appear in Proceedings of ACM SIGCOMM, August 2016.
[38]
Charles E. Leiserson. Fat-trees: Universal networks for hardware-efficient supercomputing. IEEE Trans. Comput., 34(10):892--901, October 1985.
[39]
Hyeontaek Lim, Bin Fan, David G. Andersen, and Michael Kaminsky. Silt: A memory-efficient, high-performance key-value store. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, SOSP '11, pages 1--13. ACM, 2011.
[40]
Barbara Liskov and James Cowling. Viewstamped replication revisited. Technical Report MIT-CSAIL-TR-2012-021, MIT, July 2012.
[41]
Leonardo Marmol, Swaminathan Sundararaman, Nisha Talagala, Raju Rangaswami, Sushma Devendrappa, Bharath Ramsundar, and Sriram Ganesan. Nvmkv: A scalable and lightweight flash aware key-value store. In 6th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 14), Philadelphia, PA, June 2014. USENIX Association.
[42]
Friedemann Mattern. Virtual time and global states of distributed systems. In Parallel and Distributed Algorithms, pages 215--226. North-Holland, 1989.
[43]
Christopher Mitchell, Yifeng Geng, and Jinyang Li. Using one-sided rdma reads to build a fast, cpu-efficient key-value store. In Proceedings of the 2013 USENIX Conference on Annual Technical Conference, USENIX ATC'13, pages 103--114, Berkeley, CA, USA, 2013. USENIX Association.
[44]
Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, David Stafford, Tony Tung, and Venkateshwaran Venkataramani. Scaling memcache at facebook. In Presented as part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13), pages 385--398, Lombard, IL, 2013. USENIX.
[45]
Brian M. Oki and Barbara H. Liskov. Viewstamped replication: A new primary copy method to support highly-available distributed systems. In Proceedings of the Seventh Annual ACM Symposium on Principles of Distributed Computing, PODC '88, pages 8--17. ACM, 1988.
[46]
Jian Ouyang, Shiding Lin, Song Jiang, Zhenyu Hou, Yong Wang, and Yuanzheng Wang. SDF: Software-defined Flash for Web-scale Internet Storage Systems. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '14, pages 471--484, New York, NY, USA, 2014. ACM.
[47]
Xiangyong Ouyang, David Nellans, Robert Wipfel, David Flynn, and Dhabaleswar K. Panda. Beyond block i/o: Rethinking traditional storage primitives. In Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture, HPCA '11, pages 301--311, Washington, DC, USA, 2011. IEEE Computer Society.
[48]
Vijayan Prabhakaran, Thomas L. Rodeheffer, and Lidong Zhou. Transactional flash. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI'08, pages 147--160, Berkeley, CA, USA, 2008. USENIX Association.
[49]
Stephen M. Rumble, Ankita Kejriwal, and John Ousterhout. Log-structured memory for dram-based storage. In Proceedings of the 12th USENIX Conference on File and Storage Technologies, FAST'14, pages 1--16, Berkeley, CA, USA, 2014. USENIX Association.
[50]
Mohit Saxena, Michael M. Swift, and Yiying Zhang. Flashtier: A lightweight, consistent and durable storage cache. In Proceedings of the 7th ACM European Conference on Computer Systems, EuroSys '12, pages 267--280. ACM, 2012.
[51]
Sudharsan Seshadri, Mark Gahagan, Sundaram Bhaskaran, Trevor Bunker, Arup De, Yanqin Jin, Yang Liu, and Steven Swanson. Willow: A user-programmable ssd. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI'14, pages 67--80, Berkeley, CA, USA, 2014. USENIX Association.
[52]
Sriram Subramanian, Swaminathan Sundararaman, Nisha Talagala, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Snapshots in a flash with iosnap. In Proceedings of the Ninth European Conference on Computer Systems, EuroSys '14, pages 23:1--23:14. ACM, 2014.
[53]
Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. Calvin: Fast distributed transactions for partitioned database systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12, pages 1--12. ACM, 2012.
[54]
Xingda Wei, Jiaxin Shi, Yanzhe Chen, Rong Chen, and Haibo Chen. Fast in-memory transaction processing using rdma and htm. In Proceedings of the 25th Symposium on Operating Systems Principles, SOSP '15, pages 87--104, New York, NY, USA, 2015. ACM.
[55]
Zev Weiss, Sriram Subramanian, Swaminathan Sundararaman, Vinay Sridhar, Nisha Talagala, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. MjÖlnir: Collecting trash in a demanding new world. In Proceedings of the 3rd Workshop on Interactions of NVM/FLASH with Operating Systems and Workloads, INFLOW '15, pages 4:1--4:10. ACM, 2015.
[56]
Zev Weiss, Sriram Subramanian, Swaminathan Sundararaman, Nisha Talagala, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Anvil: Advanced virtualization for modern non-volatile memory devices. In Proceedings of the 13th USENIX Conference on File and Storage Technologies, FAST'15, pages 111--118. USENIX Association, 2015.
[57]
Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan R. K. Ports. Building consistent transactions with inconsistent replication. In Proceedings of the 25th Symposium on Operating Systems Principles, SOSP '15, pages 263--278. ACM, 2015.
[58]
Yiying Zhang, Leo Prasath Arulraj, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. De-indirection for flash-based ssds with nameless writes. In Proceedings of the 10th USENIX Conference on File and Storage Technologies, FAST'12, pages 1--1. USENIX Association, 2012.
[59]
Yiying Zhang, Jian Yang, Amirsaman Memaripour, and Steven Swanson. Mojim: A reliable and highly-available non-volatile memory system. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '15, pages 3--18, New York, NY, USA, 2015. ACM.

Index Terms

  1. Enabling Lightweight Transactions with Precision Time

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM SIGPLAN Notices
    ACM SIGPLAN Notices  Volume 52, Issue 4
    ASPLOS '17
    April 2017
    811 pages
    ISSN:0362-1340
    EISSN:1558-1160
    DOI:10.1145/3093336
    Issue’s Table of Contents
    • cover image ACM Conferences
      ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems
      April 2017
      856 pages
      ISBN:9781450344654
      DOI:10.1145/3037697
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 04 April 2017
    Published in SIGPLAN Volume 52, Issue 4

    Check for updates

    Author Tags

    1. clock-synchronization
    2. distributed storage systems
    3. non-volatile memory
    4. strong consistency
    5. transactions

    Qualifiers

    • Research-article

    Funding Sources

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • 0
      Total Citations
    • 699
      Total Downloads
    • Downloads (Last 12 months)155
    • Downloads (Last 6 weeks)28
    Reflects downloads up to 09 Nov 2024

    Other Metrics

    Citations

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Get Access

    Login options

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media