skip to main content
research-article

Efficient and Available In-Memory KV-Store with Hybrid Erasure Coding and Replication

Published: 18 September 2017 Publication History

Abstract

In-memory key/value store (KV-store) is a key building block for many systems like databases and large websites. Two key requirements for such systems are efficiency and availability, which demand a KV-store to continuously handle millions of requests per second. A common approach to availability is using replication, such as primary-backup (PBR), which, however, requires M+1 times memory to tolerate M failures. This renders scarce memory unable to handle useful user jobs.
This article makes the first case of building highly available in-memory KV-store by integrating erasure coding to achieve memory efficiency, while not notably degrading performance. A main challenge is that an in-memory KV-store has much scattered metadata. A single KV put may cause excessive coding operations and parity updates due to excessive small updates to metadata. Our approach, namely Cocytus, addresses this challenge by using a hybrid scheme that leverages PBR for small-sized and scattered data (e.g., metadata and key), while only applying erasure coding to relatively large data (e.g., value). To mitigate well-known issues like lengthy recovery of erasure coding, Cocytus uses an online recovery scheme by leveraging the replicated metadata information to continuously serve KV requests. To further demonstrate the usefulness of Cocytus, we have built a transaction layer by using Cocytus as a fast and reliable storage layer to store database records and transaction logs. We have integrated the design of Cocytus to Memcached and extend it to support in-memory transactions. Evaluation using YCSB with different KV configurations shows that Cocytus incurs low overhead for latency and throughput, can tolerate node failures with fast online recovery, while saving 33% to 46% memory compared to PBR when tolerating two failures. A further evaluation using the SmallBank OLTP benchmark shows that in-memory transactions can run atop Cocytus with high throughput, low latency, and low abort rate and recover fast from consecutive failures.

References

[1]
Mohammad Alomari, Michael Cahill, Alan Fekete, and Uwe Rohm. 2008. The cost of serializability on platforms that use snapshot isolation. In Proceedings of the IEEE 24th International Conference on Data Engineering (ICDE’08). IEEE, 576--585.
[2]
Berk Atikoglu, Yuehai Xu, Eitan Frachtenberg, Song Jiang, and Mike Paleczny. 2012. Workload analysis of a large-scale key-value store. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’12). ACM, 53--64.
[3]
William J. Bolosky, Dexter Bradshaw, Randolph B. Haagens, Norbert P. Kusters, and Peng Li. 2011. Paxos replicated state machines as the basis of a high-performance data store. In Proceedings of the Conference on Network Systems Design and Implementation (NSDI’11).
[4]
Thomas C. Bressoud and Fred B. Schneider. 1996. Hypervisor-based fault tolerance. ACM Trans. Comput. Syst. (TOCS) 14, 1 (1996), 80--107.
[5]
Yingyi Bu, Vinayak Borkar, Guoqing Xu, and Michael J. Carey. 2013. A bloat-aware design for big data applications. In Proceedings of the ACM SIGPLAN International Symposium on Memory Management. ACM, 119--130.
[6]
Navin Budhiraja, Keith Marzullo, Fred B. Schneider, and Sam Toueg. 1993. The primary-backup approach. Distrib. Syst. 2 (1993), 199--216.
[7]
Michael J. Cahill, Uwe Röhm, and Alan D. Fekete. 2009. Serializable isolation for snapshot databases. ACM Trans. Database Syst. (TODS) 34, 4 (2009), 20.
[8]
K. Mani Chandy and Leslie Lamport. 1985. Distributed snapshots: Determining global states of distributed systems. ACM Trans. Comput. Syst. (TOCS) 3, 1 (1985), 63--75.
[9]
Shimin Chen, Anastasia Ailamaki, Manos Athanassoulis, Phillip B. Gibbons, Ryan Johnson, Ippokratis Pandis, and Radu Stoica. 2011. TPC-E vs. TPC-C: Characterizing the new TPC-E benchmark via an I/O comparison study. ACM SIGMOD Rec. 39, 3 (2011), 5--10.
[10]
Yanzhe Chen, Xingda Wei, Jiaxin Shi, Rong Chen, and Haibo Chen. 2016. Fast and general distributed transactions using rdma and htm. In Proceedings of the 11th European Conference on Computer Systems. ACM, 26.
[11]
Allen Clement, Manos Kapritsos, Sangmin Lee, Yang Wang, Lorenzo Alvisi, Mike Dahlin, and Taylor Riche. 2009. Upright cluster services. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles. ACM, 277--290.
[12]
Joel Coburn, Adrian M. Caulfield, Ameen Akel, Laura M. Grupp, Rajesh K. Gupta, Ranjit Jhala, and Steven Swanson. 2011. NV-Heaps: Making persistent objects fast and safe with next-generation, non-volatile memories. In Proceedings of the ACM Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 105--118.
[13]
Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing. ACM, 143--154.
[14]
James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2012. Spanner: Google’s globally distributed database. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI’12). USENIX Association, 251--264.
[15]
James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, Jeffrey John Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild et al. 2013. Spanner: Googles globally distributed database. ACM Trans. Comput. Syst. (TOCS) 31, 3 (2013), 8.
[16]
Cristian Diaconu, Craig Freedman, Erik Ismert, Per-Ake Larson, Pravin Mittal, Ryan Stonecipher, Nitin Verma, and Mike Zwilling. 2013. Hekaton: SQL server’s memory-optimized OLTP engine. In Proceedings of the 2013 International Conference on Management of Data. ACM, 1243--1254.
[17]
Djellel Eddine Difallah, Andrew Pavlo, Carlo Curino, and Philippe Cudre-Mauroux. 2013. Oltp-bench: An extensible testbed for benchmarking relational databases. Proc. VLDB Endow. 7, 4 (2013), 277--288.
[18]
Aleksandar Dragojević, Dushyanth Narayanan, Edmund B. Nightingale, Matthew Renzelmann, Alex Shamis, Anirudh Badam, and Miguel Castro. 2015. No compromises: Distributed transactions with consistency, availability, and performance. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP’15). ACM, New York, NY, 54--70.
[19]
Leo Egghe. 2005. Zipfian and lotkaian continuous concentration theory. J. Amer. Soc. Info. Sci. Technol. 56, 9 (2005), 935--945.
[20]
Bin Fan, David G. Andersen, and Michael Kaminsky. 2013. MemC3: Compact and concurrent memcache with dumber caching and smarter hashing. In Proceedings of the Conference on Network Systems Design and Implementation (NSDI’13), Vol. 13. 385--398.
[21]
Franz Färber, Norman May, Wolfgang Lehner, Philipp Große, Ingo Müller, Hannes Rauhe, and Jonathan Dees. 2012. The SAP HANA database--An architecture overview. IEEE Data Eng. Bull. 35, 1 (2012), 28--33.
[22]
Brad Fitzpatrick. 2004. Distributed caching with memcached. Linux J. 2004, 124 (2004), 5.
[23]
Aakash Goel, Bhuwan Chopra, Ciprian Gerea, Dhrúv Mátáni, Josh Metzler, Fahim Ul Haq, and Janet Wiener. 2014. Fast database restarts at facebook. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 541--549.
[24]
Cheng Huang, Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Parikshit Gopalan, Jin Li, Sergey Yekhanin, and others. 2012. Erasure coding in windows azure storage. In Proceedings of the USENIX Annual Technical Conference. 15--26.
[25]
Anuj Kalia, Michael Kaminsky, and David G. Andersen. 2014. Using rdma efficiently for key-value services. In Proceedings of the 2014 ACM Conference of the Special Interest Group on Data Communications (SIGCOMM’14). ACM, 295--306.
[26]
KLab Inc. 2011. Homepage. Retrieved from http://repcached.lab.klab.org.
[27]
Tirthankar Lahiri, Marie-Anne Neimat, and Steve Folkman. 2013. Oracle timesten: An in-memory database for enterprise applications. IEEE Data Eng. Bull. 36, 2 (2013), 6--13.
[28]
Chunbo Lai, Song Jiang, Liqiong Yang, Shiding Lin, Guangyu Sun, Zhenyu Hou, Can Cui, and Jason Cong. 2015. Atlas: Baidu’s key-value storage system for cloud data. In Proceedings of the 2015 31st Symposium on Mass Storage Systems and Technologies (MSST’15). IEEE, 1--14.
[29]
Leslie Lamport. 2001. Paxos made simple. ACM Sigact News 32, 4 (2001), 18--25.
[30]
Xiaozhou Li, David G. Andersen, Michael Kaminsky, and Michael J. Freedman. 2014. Algorithmic improvements for fast concurrent cuckoo hashing. In Proceedings of the 9th European Conference on Computer Systems. ACM, 27.
[31]
Ran Liu, Heng Zhang, and Haibo Chen. 2014. Scalable read-mostly synchronization using passive reader-writer locks. In Proceedings of the 2014 USENIX Annual Technical Conference, USENIX ATC, Vol. 14. 219--230.
[32]
Lanyue Lu, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2016. WiscKey: Separating keys from values in SSD-conscious storage. In Proceedings of the Conference on File and Storage Technologies (FAST’16). 133--148.
[33]
Christopher Mitchell, Yifeng Geng, and Jinyang Li. 2013. Using one-sided RDMA reads to build a fast, CPU-efficient key-value store. In Proceedings of the USENIX Annual Technical Conference. 103--114.
[34]
Subramanian Muralidhar, Wyatt Lloyd, Sabyasachi Roy, Cory Hill, Ernest Lin, Weiwen Liu, Satadru Pan, Shiva Shankar, Viswanath Sivakumar, Linpeng Tang et al. 2014. F4: Facebooks warm BLOB storage system. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation. USENIX Association, 383--398.
[35]
Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab et al. 2013. Scaling memcache at facebook. In Proceedings of the Conference on Network Systems Design and Implementation (NSDI’13). 385--398.
[36]
Diego Ongaro, Stephen M. Rumble, Ryan Stutsman, John Ousterhout, and Mendel Rosenblum. 2011. Fast crash recovery in RAMCloud. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles. ACM, 29--41.
[37]
J. S. Plank, E. L. Miller, and W. B. Houston. 2013. GF-Complete: A Comprehensive Open Source Library for Galois Field Arithmetic. Technical Report UT-CS-13-703. University of Tennessee.
[38]
J. S. Plank, S. Simmerman, and C. D. Schuman. 2008. Jerasure: A Library in C/C++ Facilitating Erasure Coding for Storage Applications—Version 1.2. Technical Report CS-08-627. University of Tennessee.
[39]
K. V. Rashmi, Preetum Nakkiran, Jingyan Wang, Nihar B. Shah, and Kannan Ramchandran. 2015. Having your cake and eating it too: Jointly optimal erasure codes for I/O, storage, and network-bandwidth. In Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST’15). USENIX Association, 81--94.
[40]
K. V. Rashmi, Nihar B. Shah, Dikang Gu, Hairong Kuang, Dhruba Borthakur, and Kannan Ramchandran. 2013. A solution to the network challenges of data recovery in erasure-coded distributed storage systems: A study on the Facebook warehouse cluster. Proceedings of the Conference on USENIX HotStorage (2013).
[41]
K. V. Rashmi, Mosharaf Chowdhury, Jack Kosaian, Ion Stoica, and Kannan Ramchandran. 2016. EC-Cache: Load-balanced, low-latency cluster caching with online erasure coding. In Proceedings of the Conference on Operating Systems Design and Implementation (OSDI’16).
[42]
Irving S. Reed and Gustave Solomon. 1960. Polynomial codes over certain finite fields. J. Soc. Industr. Appl. Math. 8, 2 (1960), 300--304.
[43]
Luigi Rizzo. 1997. Effective erasure codes for reliable computer communication protocols. ACM SIGCOMM Comput. Commun. Rev. 27, 2 (1997), 24--36.
[44]
Maheswaran Sathiamoorthy, Megasthenis Asteris, Dimitris Papailiopoulos, Alexandros G. Dimakis, Ramkumar Vadali, Scott Chen, and Dhruba Borthakur. 2013. Xoring elephants: Novel erasure codes for big data. In Proceedings of the Very Large Data Base Endowment. VLDB Endowment, 325--336.
[45]
Nihar B. Shah, K. V. Rashmi, P. Vijay Kumar, and Kannan Ramchandran. 2012. Distributed storage codes with repair-by-transfer and nonachievability of interior points on the storage-bandwidth tradeoff. IEEE Trans. Info. Theory 58, 3 (2012), 1837--1852.
[46]
Mark Silberstein, Lakshmi Ganesh, Yang Wang, Lorenzo Alvisi, and Mike Dahlin. 2014. Lazy means smart: Reducing repair bandwidth costs in erasure-coded distributed storage. In Proceedings of International Conference on Systems and Storage. ACM, 1--7.
[47]
SNIA. 2015. NVDIMM Special Interest Group. Retrieved from http://www.snia.org/forums/sssi/NVDIMM (2015).
[48]
Patrick Stuedi, Animesh Trivedi, and Bernard Metzler. 2012. Wimpy nodes with 10GbE: Leveraging one-sided operations in Soft-RDMA to boost memcached. In Proceedings of the USENIX Annual Technical Conference. 347--353.
[49]
Viking Technology. 2014. ArxCis-NV (TM): Non-Volatile DIMM. Retrieved from http://www.vikingtechnology.com/arxcis-nv.
[50]
Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. 2012. Calvin: Fast distributed transactions for partitioned database systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD’12). ACM, 1--12.
[51]
Twitter Inc. 2012. Twemcache is the Twitter Memcached. Retrieved from https://github.com/twitter/twemcache (2012).
[52]
Robbert van Renesse and Fred B. Schneider. 2004. Chain replication for supporting high throughput and availability. In Proceedings of the Conference on Operating Systems Design and Implementation (OSDI’04), Vol. 4. 91--104.
[53]
Shivaram Venkataraman, Niraj Tolia, Parthasarathy Ranganathan, Roy H. Campbell, and others. 2011. Consistent and durable data structures for non-volatile byte-addressable memory. In Proceedings of the Conference on File and Storage Technologies (FAST’11). 61--75.
[54]
Peng Wang, Kaiyuan Zhang, Rong Chen, Haibo Chen, and Haibing Guan. 2014b. Replication-based fault-tolerance for large-scale graph processing. In Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’14). IEEE, 562--573.
[55]
Yang Wang, Lorenzo Alvisi, and Mike Dahlin. 2012. Gnothi: Separating data and metadata for efficient and available storage replication. In Proceedings of the USENIX Annual Technical Conference. 413--424.
[56]
Zhaoguo Wang, Hao Qian, Jinyang Li, and Haibo Chen. 2014a. Using restricted transactional memory to build a scalable in-memory database. In Proceedings of the 9th European Conference on Computer Systems. ACM, 26.
[57]
Xingda Wei, Jiaxin Shi, Yanzhe Chen, Rong Chen, and Haibo Chen. 2015. Fast in-memory transaction processing using RDMA and HTM. In Proceedings of the 25th Symposium on Operating Systems Principles. ACM, 87--104.
[58]
Brent Welch, Marc Unangst, Zainul Abbasi, Garth A Gibson, Brian Mueller, Jason Small, Jim Zelenka, and Bin Zhou. 2008. Scalable performance of the panasas parallel file system. In Proceedings of the Conference on File and Storage Technologies (FAST’08), Vol. 8. 1--17.
[59]
Jun Yang, Qingsong Wei, Cheng Chen, Chundong Wang, Khai Leong Yong, and Bingsheng He. 2015. NV-Tree: Reducing consistency cost for NVM-based single level systems. In Proceedings of the 13th USENIX Conference on File and Storage Technologies. USENIX Association, 167--181.
[60]
Jian Yin, Jean-Philippe Martin, Arun Venkataramani, Lorenzo Alvisi, and Mike Dahlin. 2003. Separating agreement from execution for byzantine fault tolerant services. In Proceedings of the Symposium on Operating Systems Principles (SOSP’03). ACM, 253--267.
[61]
Jeremy Zawodny. 2009. Redis: Lightweight key/value store that goes the extra mile. Linux Mag. 79 (2009).
[62]
Heng Zhang, Mingkai Dong, and Haibo Chen. 2016. Efficient and available in-memory KV-store with hybrid erasure coding and replication. In Proceedings of the USENIX Conference on File and Storage Technologies. 167--180.
[63]
Yiying Zhang, Jian Yang, Amirsaman Memaripour, and Steven Swanson. 2015. Mojim: A reliable and highly-available non-volatile memory system. In Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 3--18.

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Storage
ACM Transactions on Storage  Volume 13, Issue 3
Special Issue on FAST 2017 and Regular Papers
August 2017
265 pages
ISSN:1553-3077
EISSN:1553-3093
DOI:10.1145/3141876
  • Editor:
  • Sam H. Noh
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 September 2017
Accepted: 01 March 2017
Revised: 01 March 2017
Received: 01 September 2016
Published in TOS Volume 13, Issue 3

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. KV-store
  2. Primary-backup replication
  3. erasure coding
  4. transactions

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • National Key Research 8 Development Program of China
  • China National Natural Science Foundation
  • Zhangjiang Hi-Tech program
  • Top-notch Youth Talents Program of China
  • Shanghai Science and Technology Development Fund

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)106
  • Downloads (Last 6 weeks)8
Reflects downloads up to 25 Oct 2024

Other Metrics

Citations

Cited By

View all
  • (2024)ELECTProceedings of the 22nd USENIX Conference on File and Storage Technologies10.5555/3650697.3650715(293-310)Online publication date: 27-Feb-2024
  • (2024)Hybrid fault tolerance in distributed in-memory storage systemsJUSTC10.52396/JUSTC-2022-012554:4(0406)Online publication date: 2024
  • (2024)Designing Non-uniform Locally Repairable Codes for Wide Stripes under Skewed File AccessesProceedings of the 53rd International Conference on Parallel Processing10.1145/3673038.3673103(1197-1206)Online publication date: 12-Aug-2024
  • (2024)Achieving Tunable Erasure Coding with Cluster-Aware Redundancy TransitioningACM Transactions on Architecture and Code Optimization10.1145/3672077Online publication date: 10-Jun-2024
  • (2024)Elastic Reed-Solomon Codes for Efficient Redundancy Transitioning in Distributed Key-Value StoresIEEE/ACM Transactions on Networking10.1109/TNET.2023.330386532:1(670-685)Online publication date: Feb-2024
  • (2024)Advanced Elastic Reed–Solomon Codes for Erasure-Coded Key–Value StoresIEEE Internet of Things Journal10.1109/JIOT.2023.329957411:3(4747-4762)Online publication date: 1-Feb-2024
  • (2023)Optimal Rack-Coordinated Updates in Erasure-Coded Data Centers: Design and AnalysisIEEE Transactions on Computers10.1109/TC.2023.3234215(1-14)Online publication date: 2023
  • (2023)PMDB: A Range-Based Key-Value Store on Hybrid NVM-Storage SystemsIEEE Transactions on Computers10.1109/TC.2022.320275572:5(1274-1285)Online publication date: 1-May-2023
  • (2023)AdaCache: A Disaggregated Cache System with Adaptive Block Size for Cloud Block Storage2023 IEEE 16th International Conference on Cloud Computing (CLOUD)10.1109/CLOUD60044.2023.00048(348-359)Online publication date: Jul-2023
  • (2022)LEGOStoreProceedings of the VLDB Endowment10.14778/3547305.354732315:10(2201-2215)Online publication date: 7-Sep-2022
  • Show More Cited By

View Options

Get Access

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media