Skip to main content

Showing 1–50 of 71 results for author: Lemire, D

  1. Batched Ranged Random Integer Generation

    Authors: Nevin Brackett-Rozinsky, Daniel Lemire

    Abstract: Pseudorandom values are often generated as 64-bit binary words. These random words need to be converted into ranged values without statistical bias. We present an efficient algorithm to generate multiple independent uniformly-random bounded integers from a single uniformly-random binary word, without any bias. In the common case, our method uses one multiplication and no division operations per va… ▽ More

    Submitted 20 August, 2024; v1 submitted 12 August, 2024; originally announced August 2024.

    Comments: software: https://github.com/lemire/batched_random

  2. On-Demand JSON: A Better Way to Parse Documents?

    Authors: John Keiser, Daniel Lemire

    Abstract: JSON is a popular standard for data interchange on the Internet. Ingesting JSON documents can be a performance bottleneck. A popular parsing strategy consists in converting the input text into a tree-based data structure -- sometimes called a Document Object Model or DOM. We designed and implemented a novel JSON parsing interface -- called On-Demand -- that appears to the programmer like a convent… ▽ More

    Submitted 31 July, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

    Journal ref: Software: Practice and Experience 54 (6), 2024

  3. Parsing Millions of URLs per Second

    Authors: Yagiz Nizipli, Daniel Lemire

    Abstract: URLs are fundamental elements of web applications. By applying vector algorithms, we built a fast standard-compliant C++ implementation. Our parser uses three times fewer instructions than competing parsers following the WHATWG standard (e.g., Servo's rust-url) and up to eight times fewer instructions than the popular curl parser. The Node.js environment adopted our C++ library. In our tests on re… ▽ More

    Submitted 9 December, 2023; v1 submitted 17 November, 2023; originally announced November 2023.

    Journal ref: Software: Practice and Experience 54 (5), 2024

  4. Exact Short Products From Truncated Multipliers

    Authors: Daniel Lemire

    Abstract: We sometimes need to compute the most significant digits of the product of small integers with a multiplier requiring much storage: e.g., a large integer (e.g., $5^{100}$) or an irrational number ($π$). We only need to access the most significant digits of the multiplier-as long as the integers are sufficiently small. We provide an efficient algorithm to compute the range of integers given a trunc… ▽ More

    Submitted 24 March, 2023; originally announced March 2023.

    Comments: Software at https://github.com/lemire/exactshortlib

    Journal ref: Computer Journal 67 (4), 2024

  5. Fast Number Parsing Without Fallback

    Authors: Noble Mushtak, Daniel Lemire

    Abstract: In recent work, Lemire (2021) presented a fast algorithm to convert number strings into binary floating-point numbers. The algorithm has been adopted by several important systems: e.g., it is part of the runtime libraries of GCC 12, Rust 1.55, and Go 1.16. The algorithm parses any number string with a significand containing no more than 19 digits into an IEEE floating-point number. However, there… ▽ More

    Submitted 27 February, 2023; v1 submitted 13 December, 2022; originally announced December 2022.

    Journal ref: Software: Practice and Experience 53 (6), 2023

  6. Transcoding Unicode Characters with AVX-512 Instructions

    Authors: Robert Clausecker, Daniel Lemire

    Abstract: Intel includes in its recent processors a powerful set of instructions capable of processing 512-bit registers with a single instruction (AVX-512). Some of these instructions have no equivalent in earlier instruction sets. We leverage these instructions to efficiently transcode strings between the most common formats: UTF-8 and UTF-16. With our novel algorithms, we are often twice as fast as the p… ▽ More

    Submitted 5 August, 2023; v1 submitted 9 December, 2022; originally announced December 2022.

    Journal ref: Software: Practice and Experience 53 (12) 2023

  7. Binary Fuse Filters: Fast and Smaller Than Xor Filters

    Authors: Thomas Mueller Graf, Daniel Lemire

    Abstract: Bloom and cuckoo filters provide fast approximate set membership while using little memory. Engineers use them to avoid expensive disk and network accesses. The recently introduced xor filters can be faster and smaller than Bloom and cuckoo filters. The xor filters are within 23% of the theoretical lower bound in storage as opposed to 44% for Bloom filters. Inspired by Dietzfelbinger and Walzer, w… ▽ More

    Submitted 4 January, 2022; originally announced January 2022.

    Journal ref: Journal of Experimental Algorithmics 27, 2022

  8. Unicode at Gigabytes per Second

    Authors: Daniel Lemire

    Abstract: We often represent text using Unicode formats (UTF-8 and UTF-16). The UTF-8 format is increasingly popular, especially on the web (XML, HTML, JSON, Rust, Go, Swift, Ruby). The UTF-16 format is most common in Java, .NET, and inside operating systems such as Windows. Software systems frequently have to convert text from one Unicode format to the other. While recent disks have bandwidths of 5 GiB/s… ▽ More

    Submitted 19 May, 2023; v1 submitted 14 November, 2021; originally announced November 2021.

    Comments: SPIRE 2021: String Processing and Information Retrieval

    Journal ref: Software: Practice and Experience, Volume52, Issue2 February 2022

  9. Transcoding Billions of Unicode Characters per Second with SIMD Instructions

    Authors: Daniel Lemire, Wojciech Muła

    Abstract: In software, text is often represented using Unicode formats (UTF-8 and UTF-16). We frequently have to convert text from one format to the other, a process called transcoding. Popular transcoding functions are slower than state-of-the-art disks and networks. These transcoding functions make little use of the single-instruction-multiple-data (SIMD) instructions available on commodity processors. By… ▽ More

    Submitted 14 November, 2022; v1 submitted 21 September, 2021; originally announced September 2021.

    Comments: Software: https://github.com/simdutf/simdutf

    Journal ref: Software: Practice and Experience, Volume 52, Issue 2 February 2022 Pages 555-575

  10. Number Parsing at a Gigabyte per Second

    Authors: Daniel Lemire

    Abstract: With disks and networks providing gigabytes per second, parsing decimal numbers from strings becomes a bottleneck. We consider the problem of parsing decimal numbers to the nearest binary floating-point value. The general problem requires variable-precision arithmetic. However, we need at most 17 digits to represent 64-bit standard floating-point numbers (IEEE 754). Thus we can represent the decim… ▽ More

    Submitted 3 November, 2022; v1 submitted 11 January, 2021; originally announced January 2021.

    Comments: Software at https://github.com/fastfloat/fast_float and https://github.com/lemire/simple_fastfloat_benchmark/

    Journal ref: Software: Practice and Experience 51 (8), 2021

  11. Integer Division by Constants: Optimal Bounds

    Authors: Daniel Lemire, Colin Bartlett, Owen Kaser

    Abstract: The integer division of a numerator n by a divisor d gives a quotient q and a remainder r. Optimizing compilers accelerate software by replacing the division of n by d with the division of c * n (or c * n + c) by m for convenient integers c and m chosen so that they approximate the reciprocal: c/m ~= 1/d. Such techniques are especially advantageous when m is chosen to be a power of two and when d… ▽ More

    Submitted 29 June, 2021; v1 submitted 22 December, 2020; originally announced December 2020.

    Report number: TR-20-001, Dept of CS, UNB Saint John

    Journal ref: Heliyon 7 (6), 2021

  12. Validating UTF-8 In Less Than One Instruction Per Byte

    Authors: John Keiser, Daniel Lemire

    Abstract: The majority of text is stored in UTF-8, which must be validated on ingestion. We present the lookup algorithm, which outperforms UTF-8 validation routines used in many libraries and languages by more than 10 times using commonly available SIMD instructions. To ensure reproducibility, our work is freely available as open source software.

    Submitted 15 December, 2021; v1 submitted 6 October, 2020; originally announced October 2020.

    Journal ref: Software: Practice and Experience 51 (5), 2021

  13. Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters

    Authors: Thomas Mueller Graf, Daniel Lemire

    Abstract: The Bloom filter provides fast approximate set membership while using little memory. Engineers often use these filters to avoid slow operations such as disk or network accesses. As an alternative, a cuckoo filter may need less space than a Bloom filter and it is faster. Chazelle et al. proposed a generalization of the Bloom filter called the Bloomier filter. Dietzfelbinger and Pagh described a var… ▽ More

    Submitted 27 January, 2020; v1 submitted 17 December, 2019; originally announced December 2019.

    Journal ref: Journal of Experimental Algorithmics 25 (1), 2020

  14. Efficient Computation of Positional Population Counts Using SIMD Instructions

    Authors: Marcus D. R. Klarqvist, Wojciech Muła, Daniel Lemire

    Abstract: In several fields such as statistics, machine learning, and bioinformatics, categorical variables are frequently represented as one-hot encoded vectors. For example, given 8 distinct values, we map each value to a byte where only a single bit has been set. We are motivated to quickly compute statistics over such encodings. Given a stream of k-bit words, we seek to compute k distinct sums correspon… ▽ More

    Submitted 11 May, 2021; v1 submitted 6 November, 2019; originally announced November 2019.

    Journal ref: Concurrency and Computation: Practice and Experience 33 (17), 2021

  15. Base64 encoding and decoding at almost the speed of a memory copy

    Authors: Wojciech Muła, Daniel Lemire

    Abstract: Many common document formats on the Internet are text-only such as email (MIME) and the Web (HTML, JavaScript, JSON and XML). To include images or executable code in these documents, we first encode them as text using base64. Standard base64 encoding uses 64~ASCII characters: both lower and upper case Latin letters, digits and two other symbols. We show how we can encode and decode base64 data at… ▽ More

    Submitted 2 October, 2019; originally announced October 2019.

    Journal ref: Software: Practice and Experience 50 (2), 2020

  16. Parsing Gigabytes of JSON per Second

    Authors: Geoff Langdale, Daniel Lemire

    Abstract: JavaScript Object Notation or JSON is a ubiquitous data exchange format on the Web. Ingesting JSON documents can become a performance bottleneck due to the sheer volume of data. We are thus motivated to make JSON parsing as fast as possible. Despite the maturity of the problem of JSON parsing, we show that substantial speedups are possible. We present the first standard-compliant JSON parser to… ▽ More

    Submitted 23 July, 2024; v1 submitted 21 February, 2019; originally announced February 2019.

    Comments: software: https://github.com/lemire/simdjson

    Journal ref: The VLDB Journal, 28(6), 2019

  17. Faster Remainder by Direct Computation: Applications to Compilers and Software Libraries

    Authors: Daniel Lemire, Owen Kaser, Nathan Kurz

    Abstract: On common processors, integer multiplication is many times faster than integer division. Dividing a numerator n by a divisor d is mathematically equivalent to multiplication by the inverse of the divisor (n / d = n x 1/d). If the divisor is known in advance---or if repeated integer divisions will be performed with the same divisor---it can be beneficial to substitute a less costly multiplication f… ▽ More

    Submitted 20 November, 2019; v1 submitted 5 February, 2019; originally announced February 2019.

    Journal ref: Software: Practice and Experience 49 (6), 2019

  18. Xorshift1024*, Xorshift1024+, Xorshift128+ and Xoroshiro128+ Fail Statistical Tests for Linearity

    Authors: Daniel Lemire, Melissa E. O'Neill

    Abstract: L'Ecuyer & Simard's Big Crush statistical test suite has revealed statistical flaws in many popular random number generators including Marsaglia's Xorshift generators. Vigna recently proposed some 64-bit variations on the Xorshift scheme that are further scrambled (i.e., Xorshift1024*, Xorshift1024+, Xorshift128+, Xoroshiro128+). Unlike their unscrambled counterparts, they pass Big Crush when inte… ▽ More

    Submitted 25 October, 2018; v1 submitted 11 October, 2018; originally announced October 2018.

    Journal ref: Computational and Applied Mathematics 350, 2019

  19. Fast Random Integer Generation in an Interval

    Authors: Daniel Lemire

    Abstract: In simulations, probabilistic algorithms and statistical tests, we often generate random integers in an interval (e.g., [0,s)). For example, random integers in an interval are essential to the Fisher-Yates random shuffle. Consequently, popular languages like Java, Python, C++, Swift and Go include ranged random integer generation functions as part of their runtime libraries. Pseudo-random values… ▽ More

    Submitted 28 December, 2018; v1 submitted 28 May, 2018; originally announced May 2018.

    Comments: to appear in ACM Transactions on Modeling and Computer Simulation

    Journal ref: ACM Transactions on Modeling and Computer Simulation 29 (1), 2019

  20. Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

    Authors: Edmon Begoli, Jesús Camacho Rodríguez, Julian Hyde, Michael J. Mior, Daniel Lemire

    Abstract: Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. Calcite's architecture consists of a modular and extensible query optimizer with hundreds of built-in optimization rules, a query processor capable of proces… ▽ More

    Submitted 27 February, 2018; originally announced February 2018.

    Comments: SIGMOD'18

  21. Stream VByte: Faster Byte-Oriented Integer Compression

    Authors: Daniel Lemire, Nathan Kurz, Christoph Rupp

    Abstract: Arrays of integers are often compressed in search engines. Though there are many ways to compress integers, we are interested in the popular byte-oriented integer compression techniques (e.g., VByte or Google's Varint-GB). They are appealing due to their simplicity and engineering convenience. Amazon's varint-G8IU is one of the fastest byte-oriented compression technique published so far. It makes… ▽ More

    Submitted 27 September, 2017; v1 submitted 25 September, 2017; originally announced September 2017.

    Journal ref: Information Processing Letters 130, February 2018, Pages 1-6

  22. Roaring Bitmaps: Implementation of an Optimized Software Library

    Authors: Daniel Lemire, Owen Kaser, Nathan Kurz, Luca Deri, Chris O'Hara, François Saint-Jacques, Gregory Ssi-Yan-Kai

    Abstract: Compressed bitmap indexes are used in systems such as Git or Oracle to accelerate queries. They represent sets and often support operations such as unions, intersections, differences, and symmetric differences. Several important systems such as Elasticsearch, Apache Spark, Netflix's Atlas, LinkedIn's Pinot, Metamarkets' Druid, Pilosa, Apache Hive, Apache Tez, Microsoft Visual Studio Team Services… ▽ More

    Submitted 7 February, 2022; v1 submitted 22 September, 2017; originally announced September 2017.

    Journal ref: Software: Practice and Experience Volume 48, Issue 4 April 2018 Pages 867-895

  23. arXiv:1704.00605  [pdf, other

    cs.MS cs.PF

    Faster Base64 Encoding and Decoding Using AVX2 Instructions

    Authors: Wojciech Muła, Daniel Lemire

    Abstract: Web developers use base64 formats to include images, fonts, sounds and other resources directly inside HTML, JavaScript, JSON and XML files. We estimate that billions of base64 messages are decoded every day. We are motivated to improve the efficiency of base64 encoding and decoding. Compared to state-of-the-art implementations, we multiply the speeds of both the encoding (~10x) and the decoding (… ▽ More

    Submitted 14 June, 2018; v1 submitted 30 March, 2017; originally announced April 2017.

    Comments: software at https://github.com/lemire/fastbase64

    Journal ref: ACM Transactions on the Web 12 (3), 2018

  24. On Desirable Semantics of Functional Dependencies over Databases with Incomplete Information

    Authors: Antonio Badia, Daniel Lemire

    Abstract: Codd's relational model describes just one possible world. To better cope with incomplete information, extended database models allow several possible worlds. Vague tables are one such convenient extended model where attributes accept sets of possible values (e.g., the manager is either Jill or Bob). However, conceptual database design in such cases remains an open problem. In particular, there is… ▽ More

    Submitted 6 April, 2017; v1 submitted 23 March, 2017; originally announced March 2017.

    Comments: to appear in Fundamenta Informaticae

    Journal ref: Fundamenta Informaticae 158 (2018) 327-352

  25. Faster Population Counts Using AVX2 Instructions

    Authors: Wojciech Muła, Nathan Kurz, Daniel Lemire

    Abstract: Counting the number of ones in a binary stream is a common operation in database, information-retrieval, cryptographic and machine-learning applications. Most processors have dedicated instructions to count the number of ones in a word (e.g., popcnt on x64 processors). Maybe surprisingly, we show that a vectorized approach using SIMD instructions can be twice as fast as using the dedicated instruc… ▽ More

    Submitted 5 September, 2018; v1 submitted 22 November, 2016; originally announced November 2016.

    Comments: Software is at https://github.com/CountOnes/hamming_weight

    Journal ref: Computer Journal, Volume 61, Issue 1, 1 January 2018

  26. Upscaledb: Efficient Integer-Key Compression in a Key-Value Store using SIMD Instructions

    Authors: Daniel Lemire, Christoph Rupp

    Abstract: Compression can sometimes improve performance by making more of the data available to the processors faster. We consider the compression of integer keys in a B+-tree index. For this purpose, systems such as IBM DB2 use variable-byte compression over differentially coded keys. We revisit this problem with various compression alternatives such as Google's VarIntGB, Binary Packing and Frame-of-Refere… ▽ More

    Submitted 9 January, 2017; v1 submitted 16 November, 2016; originally announced November 2016.

    Journal ref: Information Systems Volume 66, June 2017, Pages 13-23

  27. Regular and almost universal hashing: an efficient implementation

    Authors: Dmytro Ivanchykhin, Sergey Ignatchenko, Daniel Lemire

    Abstract: Random hashing can provide guarantees regarding the performance of data structures such as hash tables---even in an adversarial setting. Many existing families of hash functions are universal: given two data objects, the probability that they have the same hash value is low given that we pick hash functions at random. However, universality fails to ensure that all hash functions are well behaved.… ▽ More

    Submitted 18 October, 2016; v1 submitted 30 September, 2016; originally announced September 2016.

    Comments: accepted for publication in Software: Practice and Experience in September 2016

    Journal ref: Software: Practice and Experience 47 (10), 2017

  28. Consistently faster and smaller compressed bitmaps with Roaring

    Authors: Daniel Lemire, Gregory Ssi-Yan-Kai, Owen Kaser

    Abstract: Compressed bitmap indexes are used in databases and search engines. Many bitmap compression techniques have been proposed, almost all relying primarily on run-length encoding (RLE). However, on unsorted data, we can get superior performance with a hybrid compression technique that uses both uncompressed bitmaps and packed arrays inside a two-level tree. An instance of this technique, Roaring, has… ▽ More

    Submitted 2 March, 2018; v1 submitted 21 March, 2016; originally announced March 2016.

    Journal ref: Software: Practice and Experience Volume 46, Issue 11, pages 1547-1569, November 2016

  29. arXiv:1503.07387  [pdf, other

    cs.IR

    Vectorized VByte Decoding

    Authors: Jeff Plaisance, Nathan Kurz, Daniel Lemire

    Abstract: We consider the ubiquitous technique of VByte compression, which represents each integer as a variable length sequence of bytes. The low 7 bits of each byte encode a portion of the integer, and the high bit of each byte is reserved as a continuation flag. This flag is set to 1 for all bytes except the last, and the decoding of each integer is complete when a byte with a high bit of 0 is encountere… ▽ More

    Submitted 13 January, 2017; v1 submitted 20 February, 2015; originally announced March 2015.

    Comments: First International Symposium on Web Algorithms (June 2015)

  30. Faster 64-bit universal hashing using carry-less multiplications

    Authors: Daniel Lemire, Owen Kaser

    Abstract: Intel and AMD support the Carry-less Multiplication (CLMUL) instruction set in their x64 processors. We use CLMUL to implement an almost universal 64-bit hash family (CLHASH). We compare this new family with what might be the fastest almost universal family on x64 processors (VHASH). We find that CLHASH is at least 60% faster. We also compare CLHASH with a popular hash function designed for speed… ▽ More

    Submitted 4 November, 2015; v1 submitted 11 March, 2015; originally announced March 2015.

    Journal ref: Journal of Cryptographic Engineering, Volume 6, Issue 3, pp 171-185, 2016

  31. A General SIMD-based Approach to Accelerating Compression Algorithms

    Authors: Wayne Xin Zhao, Xudong Zhang, Daniel Lemire, Dongdong Shan, Jian-Yun Nie, Hongfei Yan, Ji-Rong Wen

    Abstract: Compression algorithms are important for data oriented tasks, especially in the era of Big Data. Modern processors equipped with powerful SIMD instruction sets, provide us an opportunity for achieving better compression performance. Previous research has shown that SIMD-based optimizations can multiply decoding speeds. Following these pioneering studies, we propose a general approach to accelerate… ▽ More

    Submitted 6 February, 2015; originally announced February 2015.

    ACM Class: E.4; H.3.1; C.1.2

    Journal ref: ACM Trans. Inf. Syst. 33, 3, Article 15 (March 2015)

  32. arXiv:1501.06587  [pdf, other

    cs.DL cs.CL cs.LG

    Measuring academic influence: Not all citations are equal

    Authors: Xiaodan Zhu, Peter Turney, Daniel Lemire, André Vellino

    Abstract: The importance of a research article is routinely measured by counting how many times it has been cited. However, treating all citations with equal weight ignores the wide variety of functions that citations perform. We want to automatically identify the subset of references in a bibliography that have a central academic influence on the citing paper. For this purpose, we examine the effectiveness… ▽ More

    Submitted 26 January, 2015; originally announced January 2015.

    Journal ref: Journal of the Association for Information Science and Technology, 66: 408-427

  33. Bloofi: Multidimensional Bloom Filters

    Authors: Adina Crainiceanu, Daniel Lemire

    Abstract: Bloom filters are probabilistic data structures commonly used for approximate membership problems in many areas of Computer Science (networking, distributed systems, databases, etc.). With the increase in data size and distribution of data, problems arise where a large number of Bloom filters are available, and all them need to be searched for potential matches. As an example, in a federated cloud… ▽ More

    Submitted 21 September, 2016; v1 submitted 8 January, 2015; originally announced January 2015.

    Journal ref: Information Systems Volume 54, December 2015, Pages 311-324

  34. Functional dependencies with null markers

    Authors: Antonio Badia, Daniel Lemire

    Abstract: Functional dependencies are an integral part of database design. However, they are only defined when we exclude null markers. Yet we commonly use null markers in practice. To bridge this gap between theory and practice, researchers have proposed definitions of functional dependencies over relations with null markers. Though sound, these definitions lack some qualities that we find desirable. For e… ▽ More

    Submitted 15 May, 2014; v1 submitted 19 April, 2014; originally announced April 2014.

    Comments: accepted at the Computer Journal (April 2014)

    Journal ref: Computer Journal 58 (5), 2015

  35. Better bitmap performance with Roaring bitmaps

    Authors: Samy Chambi, Daniel Lemire, Owen Kaser, Robert Godin

    Abstract: Bitmap indexes are commonly used in databases and search engines. By exploiting bit-level parallelism, they can significantly accelerate queries. However, they can use much memory, and thus we might prefer compressed bitmap indexes. Following Oracle's lead, bitmaps are often compressed using run-length encoding (RLE). Building on prior work, we introduce the Roaring compressed bitmap format: it us… ▽ More

    Submitted 15 March, 2016; v1 submitted 25 February, 2014; originally announced February 2014.

    Journal ref: Software: Practice and Experience Volume 46, Issue 5, pages 709-719, May 2016

  36. arXiv:1402.4466  [pdf, other

    cs.DB cs.DS

    Compressed bitmap indexes: beyond unions and intersections

    Authors: Owen Kaser, Daniel Lemire

    Abstract: Compressed bitmap indexes are used to speed up simple aggregate queries in databases. Indeed, set operations like intersections, unions and complements can be represented as logical operations (AND,OR,NOT) that are ideally suited for bitmaps. However, it is less obvious how to apply bitmaps to more advanced queries. For example, we might seek products in a store that meet some, but maybe not all,… ▽ More

    Submitted 15 August, 2014; v1 submitted 18 February, 2014; originally announced February 2014.

    Comments: Accepted for publication in Software: Practice and Experience on August 14th 2014. Note that arXiv:1402.4073 [cs:DB] is a companion to this paper; while they share some text, each contains many results not in the other

    Journal ref: Software: Practice & Experience 46 (2), 2016

  37. arXiv:1402.4073  [pdf, other

    cs.DB cs.DS

    Threshold and Symmetric Functions over Bitmaps

    Authors: Owen Kaser, Daniel Lemire

    Abstract: Bitmap indexes are routinely used to speed up simple aggregate queries in databases. Set operations such as intersections, unions and complements can be represented as logical operations (AND, OR, NOT). However, less is known about the application of bitmap indexes to more advanced queries. We want to extend the applicability of bitmap indexes. As a starting point, we consider symmetric Boolean qu… ▽ More

    Submitted 14 November, 2016; v1 submitted 17 February, 2014; originally announced February 2014.

    Comments: This paper uses small fonts and colours and is only intended for electronic viewing

    Report number: TR-14-001 ACM Class: H.2.2; H.3.2

  38. arXiv:1401.6399  [pdf, other

    cs.IR cs.DB cs.PF

    SIMD Compression and the Intersection of Sorted Integers

    Authors: Daniel Lemire, Leonid Boytsov, Nathan Kurz

    Abstract: Sorted lists of integers are commonly used in inverted indexes and database systems. They are often compressed in memory. We can use the SIMD instructions available in common processors to boost the speed of integer compression schemes. Our S4-BP128-D4 scheme uses as little as 0.7 CPU cycles per decoded integer while still providing state-of-the-art compression. However, if the subsequent proces… ▽ More

    Submitted 20 April, 2020; v1 submitted 24 January, 2014; originally announced January 2014.

    Journal ref: Software: Practice and Experience Volume 46, Issue 6, pages 723-749, June 2016

  39. arXiv:1209.2137  [pdf, other

    cs.IR cs.DB

    Decoding billions of integers per second through vectorization

    Authors: Daniel Lemire, Leonid Boytsov

    Abstract: In many important applications -- such as search engines and relational database systems -- data is stored in the form of arrays of integers. Encoding and, most importantly, decoding of these arrays consumes considerable CPU time. Therefore, substantial effort has been made to reduce costs associated with compression and decompression. In particular, researchers have exploited the superscalar natu… ▽ More

    Submitted 30 January, 2021; v1 submitted 10 September, 2012; originally announced September 2012.

    Comments: For software, see https://github.com/lemire/FastPFor, For data, see http://boytsov.info/datasets/clueweb09gap/

    Journal ref: Software: Practice & Experience 45 (1), 2015

  40. Reordering Rows for Better Compression: Beyond the Lexicographic Order

    Authors: Daniel Lemire, Owen Kaser, Eduardo Gutarra

    Abstract: Sorting database tables before compressing them improves the compression rate. Can we do better than the lexicographical order? For minimizing the number of runs in a run-length encoding compression scheme, the best approaches to row-ordering are derived from traveling salesman heuristics, although there is a significant trade-off between running time and compression. A new heuristic, Multiple Lis… ▽ More

    Submitted 3 February, 2014; v1 submitted 9 July, 2012; originally announced July 2012.

    Comments: to appear in ACM TODS

    ACM Class: H.4.0

    Journal ref: ACM Trans. Database Syst. 37, 3, Article 20 (September 2012)

  41. Strongly universal string hashing is fast

    Authors: Owen Kaser, Daniel Lemire

    Abstract: We present fast strongly universal string hashing families: they can process data at a rate of 0.2 CPU cycle per byte. Maybe surprisingly, we find that these families---though they require a large buffer of random numbers---are often faster than popular hash functions with weaker theoretical guarantees. Moreover, conventional wisdom is that hash functions with fewer multiplications are faster. Yet… ▽ More

    Submitted 20 September, 2018; v1 submitted 22 February, 2012; originally announced February 2012.

    Comments: Software is available at http://code.google.com/p/variablelengthstringhashing/ and https://github.com/lemire/StronglyUniversalStringHashing

    Journal ref: Computer Journal (2014) 57 (11): 1624-1638

  42. arXiv:1108.4041  [pdf, ps, other

    cs.DL

    Extracting, Transforming and Archiving Scientific Data

    Authors: Daniel Lemire, Andre Vellino

    Abstract: It is becoming common to archive research datasets that are not only large but also numerous. In addition, their corresponding metadata and the software required to analyse or display them need to be archived. Yet the manual curation of research data can be difficult and expensive, particularly in very large digital repositories, hence the importance of models and tools for automating digital cura… ▽ More

    Submitted 22 August, 2011; v1 submitted 19 August, 2011; originally announced August 2011.

    Comments: 8 pages, Fourth Workshop on Very Large Digital Libraries, 2011

  43. A Call to Arms: Revisiting Database Design

    Authors: Antonio Badia, Daniel Lemire

    Abstract: Good database design is crucial to obtain a sound, consistent database, and - in turn - good database design methodologies are the best way to achieve the right design. These methodologies are taught to most Computer Science undergraduates, as part of any Introduction to Database class. They can be considered part of the "canon", and indeed, the overall approach to database design has been unchang… ▽ More

    Submitted 2 October, 2012; v1 submitted 30 May, 2011; originally announced May 2011.

    Comments: Removed spurious column break. Nothing else was changed

    Journal ref: Antonio Badia and Daniel Lemire. A call to arms: revisiting database design. SIGMOD Record 40 (3), pages 61-69, 2011

  44. Time Series Classification by Class-Specific Mahalanobis Distance Measures

    Authors: Zoltán Prekopcsák, Daniel Lemire

    Abstract: To classify time series by nearest neighbors, we need to specify or learn one or several distance measures. We consider variations of the Mahalanobis distance measures which rely on the inverse covariance matrix of the data. Unfortunately --- for time series data --- the covariance matrix has often low rank. To alleviate this problem we can either use a pseudoinverse, covariance shrinking or limit… ▽ More

    Submitted 2 July, 2012; v1 submitted 7 October, 2010; originally announced October 2010.

    Journal ref: Advances in Data Analysis and Classification 6 (3), 2012

  45. The universality of iterated hashing over variable-length strings

    Authors: Daniel Lemire

    Abstract: Iterated hash functions process strings recursively, one character at a time. At each iteration, they compute a new hash value from the preceding hash value and the next character. We prove that iterated hashing can be pairwise independent, but never 3-wise independent. We show that it can be almost universal over strings much longer than the number of hash values; we bound the maximal string leng… ▽ More

    Submitted 24 November, 2011; v1 submitted 10 August, 2010; originally announced August 2010.

    Journal ref: Discrete Applied Mathematics 160 (4-5), 604--617 (2012)

  46. Diamond Dicing

    Authors: Hazel Webb, Daniel Lemire, Owen Kaser

    Abstract: In OLAP, analysts often select an interesting sample of the data. For example, an analyst might focus on products bringing revenues of at least 100 000 dollars, or on shops having sales greater than 400 000 dollars. However, current systems do not allow the application of both of these thresholds simultaneously, selecting products and shops satisfying both thresholds. For such purposes, we introdu… ▽ More

    Submitted 23 July, 2014; v1 submitted 18 June, 2010; originally announced June 2010.

    Comments: 29 pages

    Journal ref: Data & Knowledge Engineering, Volume 86, July 2013, Pages 1-18

  47. Reordering Columns for Smaller Indexes

    Authors: Daniel Lemire, Owen Kaser

    Abstract: Column-oriented indexes-such as projection or bitmap indexes-are compressed by run-length encoding to reduce storage and increase speed. Sorting the tables improves compression. On realistic data sets, permuting the columns in the right order before sorting can reduce the number of runs by a factor of two or more. Unfortunately, determining the best column order is NP-hard. For many cases, we prov… ▽ More

    Submitted 22 February, 2011; v1 submitted 7 September, 2009; originally announced September 2009.

    Comments: to appear in Information Sciences

    Journal ref: Daniel Lemire and Owen Kaser, Reordering Columns for Smaller Indexes, Information Sciences 181 (12), 2011

  48. arXiv:0906.0910  [pdf

    cs.DB cs.HC

    On the Challenges of Collaborative Data Processing

    Authors: Sylvie Noel, Daniel Lemire

    Abstract: The last 30 years have seen the creation of a variety of electronic collaboration tools for science and business. Some of the best-known collaboration tools support text editing (e.g., wikis). Wikipedia's success shows that large-scale collaboration can produce highly valuable content. Meanwhile much structured data is being collected and made publicly available. We have never had access to more… ▽ More

    Submitted 4 June, 2009; originally announced June 2009.

    Comments: to appear as a chapter in an upcoming book (Collaborative Information Behavior)

  49. Web 2.0 OLAP: From Data Cubes to Tag Clouds

    Authors: Kamel Aouiche, Daniel Lemire, Robert Godin

    Abstract: Increasingly, business projects are ephemeral. New Business Intelligence tools must support ad-lib data sources and quick perusal. Meanwhile, tag clouds are a popular community-driven visualization technique. Hence, we investigate tag-cloud views with support for OLAP operations such as roll-ups, slices, dices, clustering, and drill-downs. As a case study, we implemented an application where users… ▽ More

    Submitted 15 March, 2016; v1 submitted 16 May, 2009; originally announced May 2009.

    Comments: Software at https://github.com/lemire/OLAPTagCloud. arXiv admin note: substantial text overlap with arXiv:0710.2156

    Journal ref: Lecture Notes in Business Information Processing Vol. 18, pages 51-64, 2009

  50. Sorting improves word-aligned bitmap indexes

    Authors: Daniel Lemire, Owen Kaser, Kamel Aouiche

    Abstract: Bitmap indexes must be compressed to reduce input/output costs and minimize CPU usage. To accelerate logical operations (AND, OR, XOR) over bitmaps, we use techniques based on run-length encoding (RLE), such as Word-Aligned Hybrid (WAH) compression. These techniques are sensitive to the order of the rows: a simple lexicographical sort can divide the index size by 9 and make indexes several times f… ▽ More

    Submitted 29 July, 2016; v1 submitted 23 January, 2009; originally announced January 2009.

    Journal ref: Data & Knowledge Engineering, Volume 69, Issue 1, 2010, Pages 3-28