research-article

Query Fresh: Log Shipping on Steroids

Published: 01 December 2017

    Abstract

    Hot standby systems often have to trade safety (i.e., not losing committed work) and freshness (i.e., having access to recent updates) for performance. Guaranteeing safety requires synchronous log shipping that blocks the primary until the log records are durably replicated in one or multiple backups; maintaining freshness necessitates fast log replay on backups, but is often defeated by the dual-copy architecture and serial replay: a backup must generate the "real" data from the log to make recent updates accessible to read-only queries.
    This paper proposes Query Fresh, a hot standby system that provides both safety and freshness while maintaining high performance on the primary. The crux is an append-only storage architecture used in conjunction with fast networks (e.g., InfiniBand) and byte-addressable, non-volatile memory (NVRAM). Query Fresh avoids the dual-copy design and treats the log as the database, enabling lightweight, parallel log replay that does not block the primary.
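The "log as the database" idea can be illustrated with a toy sketch. This is not the paper's implementation: the class and method names (`Backup`, `receive`, `replay`, `read`) are hypothetical, and the sketch only assumes that replay installs a reference to each log record in an index instead of materializing a second copy of the data.

```python
from dataclasses import dataclass, field


@dataclass
class LogRecord:
    key: str
    value: bytes


@dataclass
class Backup:
    """Toy backup node: the shipped log is the only copy of the data."""
    log: list = field(default_factory=list)    # append-only log shipped by the primary
    index: dict = field(default_factory=dict)  # key -> position of the latest record
    replayed: int = 0                          # number of log records already replayed

    def receive(self, records):
        # Shipping only appends to the log; no data is copied elsewhere.
        self.log.extend(records)

    def replay(self):
        # Replay installs positions into the index (hypothetical design),
        # so it touches metadata only and never rewrites record payloads.
        for pos in range(self.replayed, len(self.log)):
            self.index[self.log[pos].key] = pos
        self.replayed = len(self.log)

    def read(self, key):
        # Read-only queries follow the index straight into the log.
        pos = self.index.get(key)
        return None if pos is None else self.log[pos].value
```

In this toy model a read-only query sees an update as soon as `replay` has installed its position, which is why cheap replay matters for freshness.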
    Experimental results using the TPC-C benchmark show that under Query Fresh, backup servers can replay log records faster than they are generated by the primary server, using one quarter of the available compute resources. With a 56 Gbps network, Query Fresh can support up to 4–5 synchronous replicas, each of which receives and replays ~1.4 GB of log records per second, with up to 4–6% overhead on the primary compared to a standalone server that achieves 620 kTPS without replication.
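A rough bandwidth check is consistent with the replica count quoted above. The sketch below is a back-of-the-envelope calculation, not a result from the paper: it assumes decimal units, ignores protocol overhead, and assumes the primary sends a separate copy of the log to each replica over a single link. The function name `max_replicas` is hypothetical.

```python
def max_replicas(link_gbps: int, per_replica_mb_per_s: int) -> int:
    """Upper bound on replicas one link can feed (assumes one log copy per replica)."""
    link_mb_per_s = link_gbps * 1000 // 8  # gigabits/s -> megabytes/s, decimal units
    return link_mb_per_s // per_replica_mb_per_s


# 56 Gbps is 7000 MB/s; at roughly 1400 MB/s per replica the link saturates at 5.
print(max_replicas(56, 1400))  # prints 5
```

Under these assumptions the network, rather than replay speed, is what caps the number of synchronous replicas.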


    Published In

    Proceedings of the VLDB Endowment, Volume 11, Issue 4
    December 2017
    133 pages
    ISSN: 2150-8097

    Publisher

    VLDB Endowment


    Cited By

    • (2023) DecLog: Decentralized Logging in Non-Volatile Memory for Time Series Database Systems. Proceedings of the VLDB Endowment 17:1 (1-14). DOI: 10.14778/3617838.3617839. Online publication date: 1-Sep-2023
    • (2023) PolarDB-SCC: A Cloud-Native Database Ensuring Low Latency for Strongly Consistent Reads. Proceedings of the VLDB Endowment 16:12 (3754-3767). DOI: 10.14778/3611540.3611562. Online publication date: 1-Aug-2023
    • (2022) EFA: A Viable Alternative to RDMA over InfiniBand for DBMSs? Proceedings of the 18th International Workshop on Data Management on New Hardware (1-5). DOI: 10.1145/3533737.3538506. Online publication date: 12-Jun-2022
    • (2022) Hihooi: A Database Replication Middleware for Scaling Transactional Databases Consistently. IEEE Transactions on Knowledge and Data Engineering 34:2 (691-707). DOI: 10.1109/TKDE.2020.2987560. Online publication date: 1-Feb-2022
    • (2022) Scalable and adaptive log manager in distributed systems. Frontiers of Computer Science: Selected Publications from Chinese Universities 17:2. DOI: 10.1007/s11704-022-1357-5. Online publication date: 8-Aug-2022
    • (2022) High-availability in-memory key-value store using RDMA and Optane DCPMM. Frontiers of Computer Science: Selected Publications from Chinese Universities 17:1. DOI: 10.1007/s11704-022-1123-8. Online publication date: 8-Aug-2022
    • (2021) In-network support for transaction triaging. Proceedings of the VLDB Endowment 14:9 (1626-1639). DOI: 10.14778/3461535.3461551. Online publication date: 1-May-2021
    • (2021) Validity Tracking Based Log Management for In-Memory Databases. IEEE Access 9 (111493-111504). DOI: 10.1109/ACCESS.2021.3103862. Online publication date: 2021
    • (2021) An exploratory semantic analysis of logging questions. Journal of Software: Evolution and Process 33:7. DOI: 10.1002/smr.2361. Online publication date: 1-Jul-2021
    • (2020) CoroBase. Proceedings of the VLDB Endowment 14:3 (431-444). DOI: 10.14778/3430915.3430932. Online publication date: 1-Nov-2020
