subscribe to arXiv mailings

doi 10.1002/spe3369

Batched Ranged Random Integer Generation

Authors: Nevin Brackett-Rozinsky, Daniel Lemire

Abstract: Pseudorandom values are often generated as 64-bit binary words. These random words need to be converted into ranged values without statistical bias. We present an efficient algorithm to generate multiple independent uniformly-random bounded integers from a single uniformly-random binary word, without any bias. In the common case, our method uses one multiplication and no division operations per va… ▽ More Pseudorandom values are often generated as 64-bit binary words. These random words need to be converted into ranged values without statistical bias. We present an efficient algorithm to generate multiple independent uniformly-random bounded integers from a single uniformly-random binary word, without any bias. In the common case, our method uses one multiplication and no division operations per value produced. In practice, our algorithm can more than double the speed of unbiased random shuffling for small to moderately large arrays. △ Less

Submitted 20 August, 2024; v1 submitted 12 August, 2024; originally announced August 2024.

Comments: software: https://github.com/lemire/batched_random

arXiv:2312.17149 [pdf, other]

doi 10.1002/spe.3313

On-Demand JSON: A Better Way to Parse Documents?

Authors: John Keiser, Daniel Lemire

Abstract: JSON is a popular standard for data interchange on the Internet. Ingesting JSON documents can be a performance bottleneck. A popular parsing strategy consists in converting the input text into a tree-based data structure -- sometimes called a Document Object Model or DOM. We designed and implemented a novel JSON parsing interface -- called On-Demand -- that appears to the programmer like a convent… ▽ More JSON is a popular standard for data interchange on the Internet. Ingesting JSON documents can be a performance bottleneck. A popular parsing strategy consists in converting the input text into a tree-based data structure -- sometimes called a Document Object Model or DOM. We designed and implemented a novel JSON parsing interface -- called On-Demand -- that appears to the programmer like a conventional DOM-based approach. However, the underlying implementation is a pointer iterating through the content, only materializing the results (objects, arrays, strings, numbers) lazily.On recent commodity processors, an implementation of our approach provides superior performance in multiple benchmarks. To ensure reproducibility, our work is freely available as open source software. Several systems use On-Demand: e.g., Apache Doris, the Node.js JavaScript runtime, Milvus, and Velox. △ Less

Submitted 31 July, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

Journal ref: Software: Practice and Experience 54 (6), 2024

arXiv:2311.10533 [pdf, other]

doi 10.1002/spe.3296

Parsing Millions of URLs per Second

Authors: Yagiz Nizipli, Daniel Lemire

Abstract: URLs are fundamental elements of web applications. By applying vector algorithms, we built a fast standard-compliant C++ implementation. Our parser uses three times fewer instructions than competing parsers following the WHATWG standard (e.g., Servo's rust-url) and up to eight times fewer instructions than the popular curl parser. The Node.js environment adopted our C++ library. In our tests on re… ▽ More URLs are fundamental elements of web applications. By applying vector algorithms, we built a fast standard-compliant C++ implementation. Our parser uses three times fewer instructions than competing parsers following the WHATWG standard (e.g., Servo's rust-url) and up to eight times fewer instructions than the popular curl parser. The Node.js environment adopted our C++ library. In our tests on realistic data, a recent Node.js version (20.0) with our parser is four to five times faster than the last version with the legacy URL parser. △ Less

Submitted 9 December, 2023; v1 submitted 17 November, 2023; originally announced November 2023.

Journal ref: Software: Practice and Experience 54 (5), 2024

arXiv:2303.14321 [pdf, ps, other]

doi 10.1093/comjnl/bxad077

Exact Short Products From Truncated Multipliers

Authors: Daniel Lemire

Abstract: We sometimes need to compute the most significant digits of the product of small integers with a multiplier requiring much storage: e.g., a large integer (e.g., $5^{100}$) or an irrational number ($π$). We only need to access the most significant digits of the multiplier-as long as the integers are sufficiently small. We provide an efficient algorithm to compute the range of integers given a trunc… ▽ More We sometimes need to compute the most significant digits of the product of small integers with a multiplier requiring much storage: e.g., a large integer (e.g., $5^{100}$) or an irrational number ($π$). We only need to access the most significant digits of the multiplier-as long as the integers are sufficiently small. We provide an efficient algorithm to compute the range of integers given a truncated multiplier and a desired number of digits. △ Less

Submitted 24 March, 2023; originally announced March 2023.

Comments: Software at https://github.com/lemire/exactshortlib

Journal ref: Computer Journal 67 (4), 2024

arXiv:2212.06644 [pdf, ps, other]

doi 10.1002/spe.3198

Fast Number Parsing Without Fallback

Authors: Noble Mushtak, Daniel Lemire

Abstract: In recent work, Lemire (2021) presented a fast algorithm to convert number strings into binary floating-point numbers. The algorithm has been adopted by several important systems: e.g., it is part of the runtime libraries of GCC 12, Rust 1.55, and Go 1.16. The algorithm parses any number string with a significand containing no more than 19 digits into an IEEE floating-point number. However, there… ▽ More In recent work, Lemire (2021) presented a fast algorithm to convert number strings into binary floating-point numbers. The algorithm has been adopted by several important systems: e.g., it is part of the runtime libraries of GCC 12, Rust 1.55, and Go 1.16. The algorithm parses any number string with a significand containing no more than 19 digits into an IEEE floating-point number. However, there is a check leading to a fallback function to ensure correctness. This fallback function is never called in practice. We prove that the fallback is unnecessary. Thus we can slightly simplify the algorithm and its implementation. △ Less

Submitted 27 February, 2023; v1 submitted 13 December, 2022; originally announced December 2022.

Journal ref: Software: Practice and Experience 53 (6), 2023

arXiv:2212.05098 [pdf, ps, other]

doi 10.1002/spe.3261

Transcoding Unicode Characters with AVX-512 Instructions

Authors: Robert Clausecker, Daniel Lemire

Abstract: Intel includes in its recent processors a powerful set of instructions capable of processing 512-bit registers with a single instruction (AVX-512). Some of these instructions have no equivalent in earlier instruction sets. We leverage these instructions to efficiently transcode strings between the most common formats: UTF-8 and UTF-16. With our novel algorithms, we are often twice as fast as the p… ▽ More Intel includes in its recent processors a powerful set of instructions capable of processing 512-bit registers with a single instruction (AVX-512). Some of these instructions have no equivalent in earlier instruction sets. We leverage these instructions to efficiently transcode strings between the most common formats: UTF-8 and UTF-16. With our novel algorithms, we are often twice as fast as the previous best solutions. For example, we transcode Chinese text from UTF-8 to UTF-16 at more than 5 GiB/s using fewer than 2 CPU instructions per character. To ensure reproducibility, we make our software freely available as an open source library. Our library is part of the popular Node.js JavaScript runtime. △ Less

Submitted 5 August, 2023; v1 submitted 9 December, 2022; originally announced December 2022.

Journal ref: Software: Practice and Experience 53 (12) 2023

arXiv:2201.01174 [pdf, other]

doi 10.1145/3510449

Binary Fuse Filters: Fast and Smaller Than Xor Filters

Authors: Thomas Mueller Graf, Daniel Lemire

Abstract: Bloom and cuckoo filters provide fast approximate set membership while using little memory. Engineers use them to avoid expensive disk and network accesses. The recently introduced xor filters can be faster and smaller than Bloom and cuckoo filters. The xor filters are within 23% of the theoretical lower bound in storage as opposed to 44% for Bloom filters. Inspired by Dietzfelbinger and Walzer, w… ▽ More Bloom and cuckoo filters provide fast approximate set membership while using little memory. Engineers use them to avoid expensive disk and network accesses. The recently introduced xor filters can be faster and smaller than Bloom and cuckoo filters. The xor filters are within 23% of the theoretical lower bound in storage as opposed to 44% for Bloom filters. Inspired by Dietzfelbinger and Walzer, we build probabilistic filters -- called binary fuse filters -- that are within 13% of the storage lower bound -- without sacrificing query speed. As an additional benefit, the construction of the new binary fuse filters can be more than twice as fast as the construction of xor filters. By slightly sacrificing query speed, we further reduce storage to within 8% of the lower bound. We compare the performance against a wide range of competitive alternatives such as Bloom filters, blocked Bloom filters, vector quotient filters, cuckoo filters, and the recent ribbon filters. Our experiments suggest that binary fuse filters are superior to xor filters. △ Less

Submitted 4 January, 2022; originally announced January 2022.

Journal ref: Journal of Experimental Algorithmics 27, 2022

arXiv:2111.08692 [pdf, other]

doi 10.1007/978-3-030-86692-1_2

Unicode at Gigabytes per Second

Authors: Daniel Lemire

Abstract: We often represent text using Unicode formats (UTF-8 and UTF-16). The UTF-8 format is increasingly popular, especially on the web (XML, HTML, JSON, Rust, Go, Swift, Ruby). The UTF-16 format is most common in Java, .NET, and inside operating systems such as Windows. Software systems frequently have to convert text from one Unicode format to the other. While recent disks have bandwidths of 5 GiB/s… ▽ More We often represent text using Unicode formats (UTF-8 and UTF-16). The UTF-8 format is increasingly popular, especially on the web (XML, HTML, JSON, Rust, Go, Swift, Ruby). The UTF-16 format is most common in Java, .NET, and inside operating systems such as Windows. Software systems frequently have to convert text from one Unicode format to the other. While recent disks have bandwidths of 5 GiB/s or more, conventional approaches transcode non-ASCII text at a fraction of a gigabyte per second. We show that we can validate and transcode Unicode text at gigabytes per second on current systems (x64 and ARM) without sacrificing safety. Our open-source library can be ten times faster than the popular ICU library on non-ASCII strings and even faster on ASCII strings. △ Less

Submitted 19 May, 2023; v1 submitted 14 November, 2021; originally announced November 2021.

Comments: SPIRE 2021: String Processing and Information Retrieval

Journal ref: Software: Practice and Experience, Volume52, Issue2 February 2022

arXiv:2109.10433 [pdf, other]

doi 10.1002/spe.3036

Transcoding Billions of Unicode Characters per Second with SIMD Instructions

Authors: Daniel Lemire, Wojciech Muła

Abstract: In software, text is often represented using Unicode formats (UTF-8 and UTF-16). We frequently have to convert text from one format to the other, a process called transcoding. Popular transcoding functions are slower than state-of-the-art disks and networks. These transcoding functions make little use of the single-instruction-multiple-data (SIMD) instructions available on commodity processors. By… ▽ More In software, text is often represented using Unicode formats (UTF-8 and UTF-16). We frequently have to convert text from one format to the other, a process called transcoding. Popular transcoding functions are slower than state-of-the-art disks and networks. These transcoding functions make little use of the single-instruction-multiple-data (SIMD) instructions available on commodity processors. By designing transcoding algorithms for SIMD instructions, we multiply the speed of transcoding on current systems (x64 and ARM). To ensure reproducibility, we make our software freely available as an open source library. △ Less

Submitted 14 November, 2022; v1 submitted 21 September, 2021; originally announced September 2021.

Comments: Software: https://github.com/simdutf/simdutf

Journal ref: Software: Practice and Experience, Volume 52, Issue 2 February 2022 Pages 555-575

arXiv:2101.11408 [pdf, other]

doi 10.1002/spe.2984

Number Parsing at a Gigabyte per Second

Authors: Daniel Lemire

Abstract: With disks and networks providing gigabytes per second, parsing decimal numbers from strings becomes a bottleneck. We consider the problem of parsing decimal numbers to the nearest binary floating-point value. The general problem requires variable-precision arithmetic. However, we need at most 17 digits to represent 64-bit standard floating-point numbers (IEEE 754). Thus we can represent the decim… ▽ More With disks and networks providing gigabytes per second, parsing decimal numbers from strings becomes a bottleneck. We consider the problem of parsing decimal numbers to the nearest binary floating-point value. The general problem requires variable-precision arithmetic. However, we need at most 17 digits to represent 64-bit standard floating-point numbers (IEEE 754). Thus we can represent the decimal significand with a single 64-bit word. By combining the significand and precomputed tables, we can compute the nearest floating-point number using as few as one or two 64-bit multiplications. Our implementation can be several times faster than conventional functions present in standard C libraries on modern 64-bit systems (Intel, AMD, ARM and POWER9). Our work is available as open source software used by major systems such as Apache Arrow and Yandex ClickHouse. The Go standard library has adopted a version of our approach. △ Less

Submitted 3 November, 2022; v1 submitted 11 January, 2021; originally announced January 2021.

Comments: Software at https://github.com/fastfloat/fast_float and https://github.com/lemire/simple_fastfloat_benchmark/

Journal ref: Software: Practice and Experience 51 (8), 2021

arXiv:2012.12369 [pdf, other]

doi 10.1016/j.heliyon.2021.e07442

Integer Division by Constants: Optimal Bounds

Authors: Daniel Lemire, Colin Bartlett, Owen Kaser

Abstract: The integer division of a numerator n by a divisor d gives a quotient q and a remainder r. Optimizing compilers accelerate software by replacing the division of n by d with the division of c * n (or c * n + c) by m for convenient integers c and m chosen so that they approximate the reciprocal: c/m ~= 1/d. Such techniques are especially advantageous when m is chosen to be a power of two and when d… ▽ More The integer division of a numerator n by a divisor d gives a quotient q and a remainder r. Optimizing compilers accelerate software by replacing the division of n by d with the division of c * n (or c * n + c) by m for convenient integers c and m chosen so that they approximate the reciprocal: c/m ~= 1/d. Such techniques are especially advantageous when m is chosen to be a power of two and when d is a constant so that c and m can be precomputed. The literature contains many bounds on the distance between c/m and the divisor d. Some of these bounds are optimally tight, while others are not. We present optimally tight bounds for quotient and remainder computations. △ Less

Submitted 29 June, 2021; v1 submitted 22 December, 2020; originally announced December 2020.

Report number: TR-20-001, Dept of CS, UNB Saint John

Journal ref: Heliyon 7 (6), 2021

arXiv:2010.03090 [pdf, other]

doi 10.1002/spe.2920

Validating UTF-8 In Less Than One Instruction Per Byte

Authors: John Keiser, Daniel Lemire

Abstract: The majority of text is stored in UTF-8, which must be validated on ingestion. We present the lookup algorithm, which outperforms UTF-8 validation routines used in many libraries and languages by more than 10 times using commonly available SIMD instructions. To ensure reproducibility, our work is freely available as open source software. The majority of text is stored in UTF-8, which must be validated on ingestion. We present the lookup algorithm, which outperforms UTF-8 validation routines used in many libraries and languages by more than 10 times using commonly available SIMD instructions. To ensure reproducibility, our work is freely available as open source software. △ Less

Submitted 15 December, 2021; v1 submitted 6 October, 2020; originally announced October 2020.

Journal ref: Software: Practice and Experience 51 (5), 2021

arXiv:1912.08258 [pdf, other]

doi 10.1145/3376122

Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters

Authors: Thomas Mueller Graf, Daniel Lemire

Abstract: The Bloom filter provides fast approximate set membership while using little memory. Engineers often use these filters to avoid slow operations such as disk or network accesses. As an alternative, a cuckoo filter may need less space than a Bloom filter and it is faster. Chazelle et al. proposed a generalization of the Bloom filter called the Bloomier filter. Dietzfelbinger and Pagh described a var… ▽ More The Bloom filter provides fast approximate set membership while using little memory. Engineers often use these filters to avoid slow operations such as disk or network accesses. As an alternative, a cuckoo filter may need less space than a Bloom filter and it is faster. Chazelle et al. proposed a generalization of the Bloom filter called the Bloomier filter. Dietzfelbinger and Pagh described a variation on the Bloomier filter that can be used effectively for approximate membership queries. It has never been tested empirically, to our knowledge. We review an efficient implementation of their approach, which we call the xor filter. We find that xor filters can be faster than Bloom and cuckoo filters while using less memory. We further show that a more compact version of xor filters (xor+) can use even less space than highly compact alternatives (e.g., Golomb-compressed sequences) while providing speeds competitive with Bloom filters. △ Less

Submitted 27 January, 2020; v1 submitted 17 December, 2019; originally announced December 2019.

Journal ref: Journal of Experimental Algorithmics 25 (1), 2020

arXiv:1911.02696 [pdf, other]

doi 10.1002/cpe.6304

Efficient Computation of Positional Population Counts Using SIMD Instructions

Authors: Marcus D. R. Klarqvist, Wojciech Muła, Daniel Lemire

Abstract: In several fields such as statistics, machine learning, and bioinformatics, categorical variables are frequently represented as one-hot encoded vectors. For example, given 8 distinct values, we map each value to a byte where only a single bit has been set. We are motivated to quickly compute statistics over such encodings. Given a stream of k-bit words, we seek to compute k distinct sums correspon… ▽ More In several fields such as statistics, machine learning, and bioinformatics, categorical variables are frequently represented as one-hot encoded vectors. For example, given 8 distinct values, we map each value to a byte where only a single bit has been set. We are motivated to quickly compute statistics over such encodings. Given a stream of k-bit words, we seek to compute k distinct sums corresponding to bit values at indexes 0, 1, 2, ..., k-1. If the k-bit words are one-hot encoded then the sums correspond to a frequency histogram. This multiple-sum problem is a generalization of the population-count problem where we seek the sum of all bit values. Accordingly, we refer to the multiple-sum problem as a positional population-count. Using SIMD (Single Instruction, Multiple Data) instructions from recent Intel processors, we describe algorithms for computing the 16-bit position population count using less than half of a CPU cycle per 16-bit word. Our best approach uses up to 400 times fewer instructions and is up to 50 times faster than baseline code using only regular (non-SIMD) instructions, for sufficiently large inputs. △ Less

Submitted 11 May, 2021; v1 submitted 6 November, 2019; originally announced November 2019.

Journal ref: Concurrency and Computation: Practice and Experience 33 (17), 2021

arXiv:1910.05109 [pdf, other]

doi 10.1002/spe.2777

Base64 encoding and decoding at almost the speed of a memory copy

Authors: Wojciech Muła, Daniel Lemire

Abstract: Many common document formats on the Internet are text-only such as email (MIME) and the Web (HTML, JavaScript, JSON and XML). To include images or executable code in these documents, we first encode them as text using base64. Standard base64 encoding uses 64~ASCII characters: both lower and upper case Latin letters, digits and two other symbols. We show how we can encode and decode base64 data at… ▽ More Many common document formats on the Internet are text-only such as email (MIME) and the Web (HTML, JavaScript, JSON and XML). To include images or executable code in these documents, we first encode them as text using base64. Standard base64 encoding uses 64~ASCII characters: both lower and upper case Latin letters, digits and two other symbols. We show how we can encode and decode base64 data at nearly the speed of a memory copy (memcpy) on recent Intel processors, as long as the data does not fit in the first-level (L1) cache. We use the SIMD (Single Instruction Multiple Data) instruction set AVX-512 available on commodity processors. Our implementation generates several times fewer instructions than previous SIMD-accelerated base64 codecs. It is also more versatile, as it can be adapted---even at runtime---to any base64 variant by only changing constants. △ Less

Submitted 2 October, 2019; originally announced October 2019.

Journal ref: Software: Practice and Experience 50 (2), 2020

arXiv:1902.08318 [pdf, other]

doi 10.1007/s00778-019-00578-5

Parsing Gigabytes of JSON per Second

Authors: Geoff Langdale, Daniel Lemire

Abstract: JavaScript Object Notation or JSON is a ubiquitous data exchange format on the Web. Ingesting JSON documents can become a performance bottleneck due to the sheer volume of data. We are thus motivated to make JSON parsing as fast as possible. Despite the maturity of the problem of JSON parsing, we show that substantial speedups are possible. We present the first standard-compliant JSON parser to… ▽ More JavaScript Object Notation or JSON is a ubiquitous data exchange format on the Web. Ingesting JSON documents can become a performance bottleneck due to the sheer volume of data. We are thus motivated to make JSON parsing as fast as possible. Despite the maturity of the problem of JSON parsing, we show that substantial speedups are possible. We present the first standard-compliant JSON parser to process gigabytes of data per second on a single core, using commodity processors. We can use a quarter or fewer instructions than a state-of-the-art reference parser like RapidJSON. Unlike other validating parsers, our software (simdjson) makes extensive use of Single Instruction, Multiple Data (SIMD) instructions. To ensure reproducibility, simdjson is freely available as open-source software under a liberal license. △ Less

Submitted 23 July, 2024; v1 submitted 21 February, 2019; originally announced February 2019.

Comments: software: https://github.com/lemire/simdjson

Journal ref: The VLDB Journal, 28(6), 2019

arXiv:1902.01961 [pdf, other]

doi 10.1002/spe.2689

Faster Remainder by Direct Computation: Applications to Compilers and Software Libraries

Authors: Daniel Lemire, Owen Kaser, Nathan Kurz

Abstract: On common processors, integer multiplication is many times faster than integer division. Dividing a numerator n by a divisor d is mathematically equivalent to multiplication by the inverse of the divisor (n / d = n x 1/d). If the divisor is known in advance---or if repeated integer divisions will be performed with the same divisor---it can be beneficial to substitute a less costly multiplication f… ▽ More On common processors, integer multiplication is many times faster than integer division. Dividing a numerator n by a divisor d is mathematically equivalent to multiplication by the inverse of the divisor (n / d = n x 1/d). If the divisor is known in advance---or if repeated integer divisions will be performed with the same divisor---it can be beneficial to substitute a less costly multiplication for an expensive division. Currently, the remainder of the division by a constant is computed from the quotient by a multiplication and a subtraction. But if just the remainder is desired and the quotient is unneeded, this may be suboptimal. We present a generally applicable algorithm to compute the remainder more directly. Specifically, we use the fractional portion of the product of the numerator and the inverse of the divisor. On this basis, we also present a new, simpler divisibility algorithm to detect nonzero remainders. We also derive new tight bounds on the precision required when representing the inverse of the divisor. Furthermore, we present simple C implementations that beat the optimized code produced by state-of-art C compilers on recent x64 processors (e.g., Intel Skylake and AMD Ryzen), sometimes by more than 25%. On all tested platforms including 64-bit ARM and POWER8, our divisibility-test functions are faster than state-of-the-art Granlund-Montgomery divisibility-test functions, sometimes by more than 50%. △ Less

Submitted 20 November, 2019; v1 submitted 5 February, 2019; originally announced February 2019.

Journal ref: Software: Practice and Experience 49 (6), 2019

arXiv:1810.05313 [pdf, ps, other]

doi 10.1016/j.cam.2018.10.019

Xorshift1024*, Xorshift1024+, Xorshift128+ and Xoroshiro128+ Fail Statistical Tests for Linearity

Authors: Daniel Lemire, Melissa E. O'Neill

Abstract: L'Ecuyer & Simard's Big Crush statistical test suite has revealed statistical flaws in many popular random number generators including Marsaglia's Xorshift generators. Vigna recently proposed some 64-bit variations on the Xorshift scheme that are further scrambled (i.e., Xorshift1024*, Xorshift1024+, Xorshift128+, Xoroshiro128+). Unlike their unscrambled counterparts, they pass Big Crush when inte… ▽ More L'Ecuyer & Simard's Big Crush statistical test suite has revealed statistical flaws in many popular random number generators including Marsaglia's Xorshift generators. Vigna recently proposed some 64-bit variations on the Xorshift scheme that are further scrambled (i.e., Xorshift1024*, Xorshift1024+, Xorshift128+, Xoroshiro128+). Unlike their unscrambled counterparts, they pass Big Crush when interleaving blocks of 32 bits for each 64-bit word (most significant, least significant, most significant, least significant, etc.). We report that these scrambled generators systematically fail Big Crush---specifically the linear-complexity and matrix-rank tests that detect linearity---when taking the 32 lowest-order bits in reverse order from each 64-bit word. △ Less

Submitted 25 October, 2018; v1 submitted 11 October, 2018; originally announced October 2018.

Journal ref: Computational and Applied Mathematics 350, 2019

arXiv:1805.10941 [pdf, other]

doi 10.1145/3230636

Fast Random Integer Generation in an Interval

Authors: Daniel Lemire

Abstract: In simulations, probabilistic algorithms and statistical tests, we often generate random integers in an interval (e.g., [0,s)). For example, random integers in an interval are essential to the Fisher-Yates random shuffle. Consequently, popular languages like Java, Python, C++, Swift and Go include ranged random integer generation functions as part of their runtime libraries. Pseudo-random values… ▽ More In simulations, probabilistic algorithms and statistical tests, we often generate random integers in an interval (e.g., [0,s)). For example, random integers in an interval are essential to the Fisher-Yates random shuffle. Consequently, popular languages like Java, Python, C++, Swift and Go include ranged random integer generation functions as part of their runtime libraries. Pseudo-random values are usually generated in words of a fixed number of bits (e.g., 32 bits, 64 bits) using algorithms such as a linear congruential generator. We need functions to convert such random words to random integers in an interval ([0,s)) without introducing statistical biases. The standard functions in programming languages such as Java involve integer divisions. Unfortunately, division instructions are relatively expensive. We review an unbiased function to generate ranged integers from a source of random words that avoids integer divisions with high probability. To establish the practical usefulness of the approach, we show that this algorithm can multiply the speed of unbiased random shuffling on x64 processors. Our proposed approach has been adopted by the Go language for its implementation of the shuffle function. △ Less

Submitted 28 December, 2018; v1 submitted 28 May, 2018; originally announced May 2018.

Comments: to appear in ACM Transactions on Modeling and Computer Simulation

Journal ref: ACM Transactions on Modeling and Computer Simulation 29 (1), 2019

arXiv:1802.10233 [pdf, other]

doi 10.1145/3183713.3190662

Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

Authors: Edmon Begoli, Jesús Camacho Rodríguez, Julian Hyde, Michael J. Mior, Daniel Lemire

Abstract: Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. Calcite's architecture consists of a modular and extensible query optimizer with hundreds of built-in optimization rules, a query processor capable of proces… ▽ More Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. Calcite's architecture consists of a modular and extensible query optimizer with hundreds of built-in optimization rules, a query processor capable of processing a variety of query languages, an adapter architecture designed for extensibility, and support for heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial). This flexible, embeddable, and extensible architecture is what makes Calcite an attractive choice for adoption in big-data frameworks. It is an active project that continues to introduce support for the new types of data sources, query languages, and approaches to query processing and optimization. △ Less

Submitted 27 February, 2018; originally announced February 2018.

Comments: SIGMOD'18

arXiv:1709.08990 [pdf, other]

doi 10.1016/j.ipl.2017.09.011

Stream VByte: Faster Byte-Oriented Integer Compression

Authors: Daniel Lemire, Nathan Kurz, Christoph Rupp

Abstract: Arrays of integers are often compressed in search engines. Though there are many ways to compress integers, we are interested in the popular byte-oriented integer compression techniques (e.g., VByte or Google's Varint-GB). They are appealing due to their simplicity and engineering convenience. Amazon's varint-G8IU is one of the fastest byte-oriented compression technique published so far. It makes… ▽ More Arrays of integers are often compressed in search engines. Though there are many ways to compress integers, we are interested in the popular byte-oriented integer compression techniques (e.g., VByte or Google's Varint-GB). They are appealing due to their simplicity and engineering convenience. Amazon's varint-G8IU is one of the fastest byte-oriented compression technique published so far. It makes judicious use of the powerful single-instruction-multiple-data (SIMD) instructions available in commodity processors. To surpass varint-G8IU, we present Stream VByte, a novel byte-oriented compression technique that separates the control stream from the encoded data. Like varint-G8IU, Stream VByte is well suited for SIMD instructions. We show that Stream VByte decoding can be up to twice as fast as varint-G8IU decoding over real data sets. In this sense, Stream VByte establishes new speed records for byte-oriented integer compression, at times exceeding the speed of the memcpy function. On a 3.4GHz Haswell processor, it decodes more than 4 billion differentially-coded integers per second from RAM to L1 cache. △ Less

Submitted 27 September, 2017; v1 submitted 25 September, 2017; originally announced September 2017.

Journal ref: Information Processing Letters 130, February 2018, Pages 1-6

arXiv:1709.07821 [pdf, other]

doi 10.1002/spe.2560

Roaring Bitmaps: Implementation of an Optimized Software Library

Authors: Daniel Lemire, Owen Kaser, Nathan Kurz, Luca Deri, Chris O'Hara, François Saint-Jacques, Gregory Ssi-Yan-Kai

Abstract: Compressed bitmap indexes are used in systems such as Git or Oracle to accelerate queries. They represent sets and often support operations such as unions, intersections, differences, and symmetric differences. Several important systems such as Elasticsearch, Apache Spark, Netflix's Atlas, LinkedIn's Pinot, Metamarkets' Druid, Pilosa, Apache Hive, Apache Tez, Microsoft Visual Studio Team Services… ▽ More Compressed bitmap indexes are used in systems such as Git or Oracle to accelerate queries. They represent sets and often support operations such as unions, intersections, differences, and symmetric differences. Several important systems such as Elasticsearch, Apache Spark, Netflix's Atlas, LinkedIn's Pinot, Metamarkets' Druid, Pilosa, Apache Hive, Apache Tez, Microsoft Visual Studio Team Services and Apache Kylin rely on a specific type of compressed bitmap index called Roaring. We present an optimized software library written in C implementing Roaring bitmaps: CRoaring. It benefits from several algorithms designed for the single-instruction-multiple-data (SIMD) instructions available on commodity processors. In particular, we present vectorized algorithms to compute the intersection, union, difference and symmetric difference between arrays. We benchmark the library against a wide range of competitive alternatives, identifying weaknesses and strengths in our software. Our work is available under a liberal open-source license. △ Less

Submitted 7 February, 2022; v1 submitted 22 September, 2017; originally announced September 2017.

Journal ref: Software: Practice and Experience Volume 48, Issue 4 April 2018 Pages 867-895

arXiv:1704.00605 [pdf, other]

doi 10.1145/3132709

Faster Base64 Encoding and Decoding Using AVX2 Instructions

Authors: Wojciech Muła, Daniel Lemire

Abstract: Web developers use base64 formats to include images, fonts, sounds and other resources directly inside HTML, JavaScript, JSON and XML files. We estimate that billions of base64 messages are decoded every day. We are motivated to improve the efficiency of base64 encoding and decoding. Compared to state-of-the-art implementations, we multiply the speeds of both the encoding (~10x) and the decoding (… ▽ More Web developers use base64 formats to include images, fonts, sounds and other resources directly inside HTML, JavaScript, JSON and XML files. We estimate that billions of base64 messages are decoded every day. We are motivated to improve the efficiency of base64 encoding and decoding. Compared to state-of-the-art implementations, we multiply the speeds of both the encoding (~10x) and the decoding (~7x). We achieve these good results by using the single-instruction-multiple-data (SIMD) instructions available on recent Intel processors (AVX2). Our accelerated software abides by the specification and reports errors when encountering characters outside of the base64 set. It is available online as free software under a liberal license. △ Less

Submitted 14 June, 2018; v1 submitted 30 March, 2017; originally announced April 2017.

Comments: software at https://github.com/lemire/fastbase64

Journal ref: ACM Transactions on the Web 12 (3), 2018

arXiv:1703.08198 [pdf, other]

doi 10.3233/FI-2018-1651

On Desirable Semantics of Functional Dependencies over Databases with Incomplete Information

Authors: Antonio Badia, Daniel Lemire

Abstract: Codd's relational model describes just one possible world. To better cope with incomplete information, extended database models allow several possible worlds. Vague tables are one such convenient extended model where attributes accept sets of possible values (e.g., the manager is either Jill or Bob). However, conceptual database design in such cases remains an open problem. In particular, there is… ▽ More Codd's relational model describes just one possible world. To better cope with incomplete information, extended database models allow several possible worlds. Vague tables are one such convenient extended model where attributes accept sets of possible values (e.g., the manager is either Jill or Bob). However, conceptual database design in such cases remains an open problem. In particular, there is no canonical definition of functional dependencies (FDs) over possible worlds (e.g., each employee has just one manager). We identify several desirable properties that the semantics of such FDs should meet including Armstrong's axioms, the independence from irrelevant attributes, seamless satisfaction and implied by strong satisfaction. We show that we can define FDs such that they have all our desirable properties over vague tables. However, we also show that no notion of FD can satisfy all our desirable properties over a more general model (disjunctive tables). Our work formalizes a trade-off between having a general model and having well-behaved FDs. △ Less

Submitted 6 April, 2017; v1 submitted 23 March, 2017; originally announced March 2017.

Comments: to appear in Fundamenta Informaticae

Journal ref: Fundamenta Informaticae 158 (2018) 327-352

arXiv:1611.07612 [pdf, other]

doi 10.1093/comjnl/bxx046

Faster Population Counts Using AVX2 Instructions

Authors: Wojciech Muła, Nathan Kurz, Daniel Lemire

Abstract: Counting the number of ones in a binary stream is a common operation in database, information-retrieval, cryptographic and machine-learning applications. Most processors have dedicated instructions to count the number of ones in a word (e.g., popcnt on x64 processors). Maybe surprisingly, we show that a vectorized approach using SIMD instructions can be twice as fast as using the dedicated instruc… ▽ More Counting the number of ones in a binary stream is a common operation in database, information-retrieval, cryptographic and machine-learning applications. Most processors have dedicated instructions to count the number of ones in a word (e.g., popcnt on x64 processors). Maybe surprisingly, we show that a vectorized approach using SIMD instructions can be twice as fast as using the dedicated instructions on recent Intel processors. The benefits can be even greater for applications such as similarity measures (e.g., the Jaccard index) that require additional Boolean operations. Our approach has been adopted by LLVM: it is used by its popular C compiler (clang). △ Less

Submitted 5 September, 2018; v1 submitted 22 November, 2016; originally announced November 2016.

Comments: Software is at https://github.com/CountOnes/hamming_weight

Journal ref: Computer Journal, Volume 61, Issue 1, 1 January 2018

arXiv:1611.05428 [pdf, other]

doi 10.1016/j.is.2017.01.002

Upscaledb: Efficient Integer-Key Compression in a Key-Value Store using SIMD Instructions

Authors: Daniel Lemire, Christoph Rupp

Abstract: Compression can sometimes improve performance by making more of the data available to the processors faster. We consider the compression of integer keys in a B+-tree index. For this purpose, systems such as IBM DB2 use variable-byte compression over differentially coded keys. We revisit this problem with various compression alternatives such as Google's VarIntGB, Binary Packing and Frame-of-Refere… ▽ More Compression can sometimes improve performance by making more of the data available to the processors faster. We consider the compression of integer keys in a B+-tree index. For this purpose, systems such as IBM DB2 use variable-byte compression over differentially coded keys. We revisit this problem with various compression alternatives such as Google's VarIntGB, Binary Packing and Frame-of-Reference. In all cases, we describe algorithms that can operate directly on compressed data. Many of our alternatives exploit the single-instruction-multiple-data (SIMD) instructions supported by modern CPUs. We evaluate our techniques in a database environment provided by Upscaledb, a production-quality key-value database. Our best techniques are SIMD accelerated: they simultaneously reduce memory usage while improving single-threaded speeds. In particular, a differentially coded SIMD binary-packing techniques (BP128) can offer a superior query speed (e.g., 40% better than an uncompressed database) while providing the best compression (e.g., by a factor of ten). For analytic workloads, our fast compression techniques offer compelling benefits. Our software is available as open source. △ Less

Submitted 9 January, 2017; v1 submitted 16 November, 2016; originally announced November 2016.

Journal ref: Information Systems Volume 66, June 2017, Pages 13-23

arXiv:1609.09840 [pdf, other]

doi 10.1002/spe.2461

Regular and almost universal hashing: an efficient implementation

Authors: Dmytro Ivanchykhin, Sergey Ignatchenko, Daniel Lemire

Abstract: Random hashing can provide guarantees regarding the performance of data structures such as hash tables---even in an adversarial setting. Many existing families of hash functions are universal: given two data objects, the probability that they have the same hash value is low given that we pick hash functions at random. However, universality fails to ensure that all hash functions are well behaved.… ▽ More Random hashing can provide guarantees regarding the performance of data structures such as hash tables---even in an adversarial setting. Many existing families of hash functions are universal: given two data objects, the probability that they have the same hash value is low given that we pick hash functions at random. However, universality fails to ensure that all hash functions are well behaved. We further require regularity: when picking data objects at random they should have a low probability of having the same hash value, for any fixed hash function. We present the efficient implementation of a family of non-cryptographic hash functions (PM+) offering good running times, good memory usage as well as distinguishing theoretical guarantees: almost universality and component-wise regularity. On a variety of platforms, our implementations are comparable to the state of the art in performance. On recent Intel processors, PM+ achieves a speed of 4.7 bytes per cycle for 32-bit outputs and 3.3 bytes per cycle for 64-bit outputs. We review vectorization through SIMD instructions (e.g., AVX2) and optimizations for superscalar execution. △ Less

Submitted 18 October, 2016; v1 submitted 30 September, 2016; originally announced September 2016.

Comments: accepted for publication in Software: Practice and Experience in September 2016

Journal ref: Software: Practice and Experience 47 (10), 2017

arXiv:1603.06549 [pdf, ps, other]

doi 10.1002/spe.2402

Consistently faster and smaller compressed bitmaps with Roaring

Authors: Daniel Lemire, Gregory Ssi-Yan-Kai, Owen Kaser

Abstract: Compressed bitmap indexes are used in databases and search engines. Many bitmap compression techniques have been proposed, almost all relying primarily on run-length encoding (RLE). However, on unsorted data, we can get superior performance with a hybrid compression technique that uses both uncompressed bitmaps and packed arrays inside a two-level tree. An instance of this technique, Roaring, has… ▽ More Compressed bitmap indexes are used in databases and search engines. Many bitmap compression techniques have been proposed, almost all relying primarily on run-length encoding (RLE). However, on unsorted data, we can get superior performance with a hybrid compression technique that uses both uncompressed bitmaps and packed arrays inside a two-level tree. An instance of this technique, Roaring, has recently been proposed. Due to its good performance, it has been adopted by several production platforms (e.g., Apache Lucene, Apache Spark, Apache Kylin and Druid). Yet there are cases where run-length encoded bitmaps are smaller than the original Roaring bitmaps---typically when the data is sorted so that the bitmaps contain long compressible runs. To better handle these cases, we build a new Roaring hybrid that combines uncompressed bitmaps, packed arrays and RLE compressed segments. The result is a new Roaring format that compresses better. Overall, our new implementation of Roaring can be several times faster (up to two orders of magnitude) than the implementations of traditional RLE-based alternatives (WAH, Concise, EWAH) while compressing better. We review the design choices and optimizations that make these good results possible. △ Less

Submitted 2 March, 2018; v1 submitted 21 March, 2016; originally announced March 2016.

Journal ref: Software: Practice and Experience Volume 46, Issue 11, pages 1547-1569, November 2016

arXiv:1503.07387 [pdf, other]

Vectorized VByte Decoding

Authors: Jeff Plaisance, Nathan Kurz, Daniel Lemire

Abstract: We consider the ubiquitous technique of VByte compression, which represents each integer as a variable length sequence of bytes. The low 7 bits of each byte encode a portion of the integer, and the high bit of each byte is reserved as a continuation flag. This flag is set to 1 for all bytes except the last, and the decoding of each integer is complete when a byte with a high bit of 0 is encountere… ▽ More We consider the ubiquitous technique of VByte compression, which represents each integer as a variable length sequence of bytes. The low 7 bits of each byte encode a portion of the integer, and the high bit of each byte is reserved as a continuation flag. This flag is set to 1 for all bytes except the last, and the decoding of each integer is complete when a byte with a high bit of 0 is encountered. VByte decoding can be a performance bottleneck especially when the unpredictable lengths of the encoded integers cause frequent branch mispredictions. Previous attempts to accelerate VByte decoding using SIMD vector instructions have been disappointing, prodding search engines such as Google to use more complicated but faster-to-decode formats for performance-critical code. Our decoder (Masked VByte) is 2 to 4 times faster than a conventional scalar VByte decoder, making the format once again competitive with regard to speed. △ Less

Submitted 13 January, 2017; v1 submitted 20 February, 2015; originally announced March 2015.

Comments: First International Symposium on Web Algorithms (June 2015)

arXiv:1503.03465 [pdf, other]

doi 10.1007/s13389-015-0110-5

Faster 64-bit universal hashing using carry-less multiplications

Authors: Daniel Lemire, Owen Kaser

Abstract: Intel and AMD support the Carry-less Multiplication (CLMUL) instruction set in their x64 processors. We use CLMUL to implement an almost universal 64-bit hash family (CLHASH). We compare this new family with what might be the fastest almost universal family on x64 processors (VHASH). We find that CLHASH is at least 60% faster. We also compare CLHASH with a popular hash function designed for speed… ▽ More Intel and AMD support the Carry-less Multiplication (CLMUL) instruction set in their x64 processors. We use CLMUL to implement an almost universal 64-bit hash family (CLHASH). We compare this new family with what might be the fastest almost universal family on x64 processors (VHASH). We find that CLHASH is at least 60% faster. We also compare CLHASH with a popular hash function designed for speed (Google's CityHash). We find that CLHASH is 40% faster than CityHash on inputs larger than 64 bytes and just as fast otherwise. △ Less

Submitted 4 November, 2015; v1 submitted 11 March, 2015; originally announced March 2015.

Journal ref: Journal of Cryptographic Engineering, Volume 6, Issue 3, pp 171-185, 2016

arXiv:1502.01916 [pdf]

doi 10.1145/2735629

A General SIMD-based Approach to Accelerating Compression Algorithms

Authors: Wayne Xin Zhao, Xudong Zhang, Daniel Lemire, Dongdong Shan, Jian-Yun Nie, Hongfei Yan, Ji-Rong Wen

Abstract: Compression algorithms are important for data oriented tasks, especially in the era of Big Data. Modern processors equipped with powerful SIMD instruction sets, provide us an opportunity for achieving better compression performance. Previous research has shown that SIMD-based optimizations can multiply decoding speeds. Following these pioneering studies, we propose a general approach to accelerate… ▽ More Compression algorithms are important for data oriented tasks, especially in the era of Big Data. Modern processors equipped with powerful SIMD instruction sets, provide us an opportunity for achieving better compression performance. Previous research has shown that SIMD-based optimizations can multiply decoding speeds. Following these pioneering studies, we propose a general approach to accelerate compression algorithms. By instantiating the approach, we have developed several novel integer compression algorithms, called Group-Simple, Group-Scheme, Group-AFOR, and Group-PFD, and implemented their corresponding vectorized versions. We evaluate the proposed algorithms on two public TREC datasets, a Wikipedia dataset and a Twitter dataset. With competitive compression ratios and encoding speeds, our SIMD-based algorithms outperform state-of-the-art non-vectorized algorithms with respect to decoding speeds. △ Less

Submitted 6 February, 2015; originally announced February 2015.

ACM Class: E.4; H.3.1; C.1.2

Journal ref: ACM Trans. Inf. Syst. 33, 3, Article 15 (March 2015)

arXiv:1501.06587 [pdf, other]

doi 10.1002/asi.23179

Measuring academic influence: Not all citations are equal

Authors: Xiaodan Zhu, Peter Turney, Daniel Lemire, André Vellino

Abstract: The importance of a research article is routinely measured by counting how many times it has been cited. However, treating all citations with equal weight ignores the wide variety of functions that citations perform. We want to automatically identify the subset of references in a bibliography that have a central academic influence on the citing paper. For this purpose, we examine the effectiveness… ▽ More The importance of a research article is routinely measured by counting how many times it has been cited. However, treating all citations with equal weight ignores the wide variety of functions that citations perform. We want to automatically identify the subset of references in a bibliography that have a central academic influence on the citing paper. For this purpose, we examine the effectiveness of a variety of features for determining the academic influence of a citation. By asking authors to identify the key references in their own work, we created a data set in which citations were labeled according to their academic influence. Using automatic feature selection with supervised machine learning, we found a model for predicting academic influence that achieves good performance on this data set using only four features. The best features, among those we evaluated, were those based on the number of times a reference is mentioned in the body of a citing paper. The performance of these features inspired us to design an influence-primed h-index (the hip-index). Unlike the conventional h-index, it weights citations by how many times a reference is mentioned. According to our experiments, the hip-index is a better indicator of researcher performance than the conventional h-index. △ Less

Submitted 26 January, 2015; originally announced January 2015.

Journal ref: Journal of the Association for Information Science and Technology, 66: 408-427

arXiv:1501.01941 [pdf, other]

doi 10.1016/j.is.2015.01.002

Bloofi: Multidimensional Bloom Filters

Authors: Adina Crainiceanu, Daniel Lemire

Abstract: Bloom filters are probabilistic data structures commonly used for approximate membership problems in many areas of Computer Science (networking, distributed systems, databases, etc.). With the increase in data size and distribution of data, problems arise where a large number of Bloom filters are available, and all them need to be searched for potential matches. As an example, in a federated cloud… ▽ More Bloom filters are probabilistic data structures commonly used for approximate membership problems in many areas of Computer Science (networking, distributed systems, databases, etc.). With the increase in data size and distribution of data, problems arise where a large number of Bloom filters are available, and all them need to be searched for potential matches. As an example, in a federated cloud environment, each cloud provider could encode the information using Bloom filters and share the Bloom filters with a central coordinator. The problem of interest is not only whether a given element is in any of the sets represented by the Bloom filters, but which of the existing sets contain the given element. This problem cannot be solved by just constructing a Bloom filter on the union of all the sets. Instead, we effectively have a multidimensional Bloom filter problem: given an element, we wish to receive a list of candidate sets where the element might be. To solve this problem, we consider 3 alternatives. Firstly, we can naively check many Bloom filters. Secondly, we propose to organize the Bloom filters in a hierarchical index structure akin to a B+ tree, that we call Bloofi. Finally, we propose another data structure that packs the Bloom filters in such a way as to exploit bit-level parallelism, which we call Flat-Bloofi. Our theoretical and experimental results show that Bloofi and Flat-Bloofi provide scalable and efficient solutions alternatives to search through a large number of Bloom filters. △ Less

Submitted 21 September, 2016; v1 submitted 8 January, 2015; originally announced January 2015.

Journal ref: Information Systems Volume 54, December 2015, Pages 311-324

arXiv:1404.4963 [pdf, other]

doi 10.1093/comjnl/bxu039

Functional dependencies with null markers

Authors: Antonio Badia, Daniel Lemire

Abstract: Functional dependencies are an integral part of database design. However, they are only defined when we exclude null markers. Yet we commonly use null markers in practice. To bridge this gap between theory and practice, researchers have proposed definitions of functional dependencies over relations with null markers. Though sound, these definitions lack some qualities that we find desirable. For e… ▽ More Functional dependencies are an integral part of database design. However, they are only defined when we exclude null markers. Yet we commonly use null markers in practice. To bridge this gap between theory and practice, researchers have proposed definitions of functional dependencies over relations with null markers. Though sound, these definitions lack some qualities that we find desirable. For example, some fail to satisfy Armstrong's axioms---while these axioms are part of the foundation of common database methodologies. We propose a set of properties that any extension of functional dependencies over relations with null markers should possess. We then propose two new extensions having these properties. These extensions attempt to allow null markers where they make sense to practitioners. They both support Armstrong's axioms and provide realizable null markers: at any time, some or all of the null markers can be replaced by actual values without causing an anomaly. Our proposals may improve database designs. △ Less

Submitted 15 May, 2014; v1 submitted 19 April, 2014; originally announced April 2014.

Comments: accepted at the Computer Journal (April 2014)

Journal ref: Computer Journal 58 (5), 2015

arXiv:1402.6407 [pdf, other]

doi 10.1002/spe.2325

Better bitmap performance with Roaring bitmaps

Authors: Samy Chambi, Daniel Lemire, Owen Kaser, Robert Godin

Abstract: Bitmap indexes are commonly used in databases and search engines. By exploiting bit-level parallelism, they can significantly accelerate queries. However, they can use much memory, and thus we might prefer compressed bitmap indexes. Following Oracle's lead, bitmaps are often compressed using run-length encoding (RLE). Building on prior work, we introduce the Roaring compressed bitmap format: it us… ▽ More Bitmap indexes are commonly used in databases and search engines. By exploiting bit-level parallelism, they can significantly accelerate queries. However, they can use much memory, and thus we might prefer compressed bitmap indexes. Following Oracle's lead, bitmaps are often compressed using run-length encoding (RLE). Building on prior work, we introduce the Roaring compressed bitmap format: it uses packed arrays for compression instead of RLE. We compare it to two high-performance RLE-based bitmap encoding techniques: WAH (Word Aligned Hybrid compression scheme) and Concise (Compressed `n' Composable Integer Set). On synthetic and real data, we find that Roaring bitmaps (1) often compress significantly better (e.g., 2 times) and (2) are faster than the compressed alternatives (up to 900 times faster for intersections). Our results challenge the view that RLE-based bitmap compression is best. △ Less

Submitted 15 March, 2016; v1 submitted 25 February, 2014; originally announced February 2014.

Journal ref: Software: Practice and Experience Volume 46, Issue 5, pages 709-719, May 2016

arXiv:1402.4466 [pdf, other]

doi 10.1002/spe.2289

Compressed bitmap indexes: beyond unions and intersections

Authors: Owen Kaser, Daniel Lemire

Abstract: Compressed bitmap indexes are used to speed up simple aggregate queries in databases. Indeed, set operations like intersections, unions and complements can be represented as logical operations (AND,OR,NOT) that are ideally suited for bitmaps. However, it is less obvious how to apply bitmaps to more advanced queries. For example, we might seek products in a store that meet some, but maybe not all,… ▽ More Compressed bitmap indexes are used to speed up simple aggregate queries in databases. Indeed, set operations like intersections, unions and complements can be represented as logical operations (AND,OR,NOT) that are ideally suited for bitmaps. However, it is less obvious how to apply bitmaps to more advanced queries. For example, we might seek products in a store that meet some, but maybe not all, criteria. Such threshold queries generalize intersections and unions; they are often used in information-retrieval and data-mining applications. We introduce new algorithms that are sometimes three orders of magnitude faster than a naive approach. Our work shows that bitmap indexes are more broadly applicable than is commonly believed. △ Less

Submitted 15 August, 2014; v1 submitted 18 February, 2014; originally announced February 2014.

Comments: Accepted for publication in Software: Practice and Experience on August 14th 2014. Note that arXiv:1402.4073 [cs:DB] is a companion to this paper; while they share some text, each contains many results not in the other

Journal ref: Software: Practice & Experience 46 (2), 2016

arXiv:1402.4073 [pdf, other]

Threshold and Symmetric Functions over Bitmaps

Authors: Owen Kaser, Daniel Lemire

Abstract: Bitmap indexes are routinely used to speed up simple aggregate queries in databases. Set operations such as intersections, unions and complements can be represented as logical operations (AND, OR, NOT). However, less is known about the application of bitmap indexes to more advanced queries. We want to extend the applicability of bitmap indexes. As a starting point, we consider symmetric Boolean qu… ▽ More Bitmap indexes are routinely used to speed up simple aggregate queries in databases. Set operations such as intersections, unions and complements can be represented as logical operations (AND, OR, NOT). However, less is known about the application of bitmap indexes to more advanced queries. We want to extend the applicability of bitmap indexes. As a starting point, we consider symmetric Boolean queries (e.g., threshold functions). For example, we might consider stores as sets of products, and ask for products that are on sale in 2 to 10 stores. Such symmetric Boolean queries generalize intersection, union, and T-occurrence queries. It may not be immediately obvious to an engineer how to use bitmap indexes for symmetric Boolean queries. Yet, maybe surprisingly, we find that the best of our bitmap-based algorithms are competitive with the state-of-the-art algorithms for important special cases (e.g., MergeOpt, MergeSkip, DivideSkip, ScanCount). Moreover, unlike the competing algorithms, the result of our computation is again a bitmap which can be further processed within a bitmap index. We review algorithmic design issues such as the aggregation of many compressed bitmaps. We conclude with a discussion on other advanced queries that bitmap indexes might be able to support efficiently. △ Less

Submitted 14 November, 2016; v1 submitted 17 February, 2014; originally announced February 2014.

Comments: This paper uses small fonts and colours and is only intended for electronic viewing

Report number: TR-14-001 ACM Class: H.2.2; H.3.2

arXiv:1401.6399 [pdf, other]

doi 10.1002/spe.2326

SIMD Compression and the Intersection of Sorted Integers

Authors: Daniel Lemire, Leonid Boytsov, Nathan Kurz

Abstract: Sorted lists of integers are commonly used in inverted indexes and database systems. They are often compressed in memory. We can use the SIMD instructions available in common processors to boost the speed of integer compression schemes. Our S4-BP128-D4 scheme uses as little as 0.7 CPU cycles per decoded integer while still providing state-of-the-art compression. However, if the subsequent proces… ▽ More Sorted lists of integers are commonly used in inverted indexes and database systems. They are often compressed in memory. We can use the SIMD instructions available in common processors to boost the speed of integer compression schemes. Our S4-BP128-D4 scheme uses as little as 0.7 CPU cycles per decoded integer while still providing state-of-the-art compression. However, if the subsequent processing of the integers is slow, the effort spent on optimizing decoding speed can be wasted. To show that it does not have to be so, we (1) vectorize and optimize the intersection of posting lists; (2) introduce the SIMD Galloping algorithm. We exploit the fact that one SIMD instruction can compare 4 pairs of integers at once. We experiment with two TREC text collections, GOV2 and ClueWeb09 (Category B), using logs from the TREC million-query track. We show that using only the SIMD instructions ubiquitous in all modern CPUs, our techniques for conjunctive queries can double the speed of a state-of-the-art approach. △ Less

Submitted 20 April, 2020; v1 submitted 24 January, 2014; originally announced January 2014.

Journal ref: Software: Practice and Experience Volume 46, Issue 6, pages 723-749, June 2016

arXiv:1209.2137 [pdf, other]

doi 10.1002/spe.2203

Decoding billions of integers per second through vectorization

Authors: Daniel Lemire, Leonid Boytsov

Abstract: In many important applications -- such as search engines and relational database systems -- data is stored in the form of arrays of integers. Encoding and, most importantly, decoding of these arrays consumes considerable CPU time. Therefore, substantial effort has been made to reduce costs associated with compression and decompression. In particular, researchers have exploited the superscalar natu… ▽ More In many important applications -- such as search engines and relational database systems -- data is stored in the form of arrays of integers. Encoding and, most importantly, decoding of these arrays consumes considerable CPU time. Therefore, substantial effort has been made to reduce costs associated with compression and decompression. In particular, researchers have exploited the superscalar nature of modern processors and SIMD instructions. Nevertheless, we introduce a novel vectorized scheme called SIMD-BP128 that improves over previously proposed vectorized approaches. It is nearly twice as fast as the previously fastest schemes on desktop processors (varint-G8IU and PFOR). At the same time, SIMD-BP128 saves up to 2 bits per integer. For even better compression, we propose another new vectorized scheme (SIMD-FastPFOR) that has a compression ratio within 10% of a state-of-the-art scheme (Simple-8b) while being two times faster during decoding. △ Less

Submitted 30 January, 2021; v1 submitted 10 September, 2012; originally announced September 2012.

Comments: For software, see https://github.com/lemire/FastPFor, For data, see http://boytsov.info/datasets/clueweb09gap/

Journal ref: Software: Practice & Experience 45 (1), 2015

arXiv:1207.2189 [pdf, other]

doi 10.1145/2338626.2338627

Reordering Rows for Better Compression: Beyond the Lexicographic Order

Authors: Daniel Lemire, Owen Kaser, Eduardo Gutarra

Abstract: Sorting database tables before compressing them improves the compression rate. Can we do better than the lexicographical order? For minimizing the number of runs in a run-length encoding compression scheme, the best approaches to row-ordering are derived from traveling salesman heuristics, although there is a significant trade-off between running time and compression. A new heuristic, Multiple Lis… ▽ More Sorting database tables before compressing them improves the compression rate. Can we do better than the lexicographical order? For minimizing the number of runs in a run-length encoding compression scheme, the best approaches to row-ordering are derived from traveling salesman heuristics, although there is a significant trade-off between running time and compression. A new heuristic, Multiple Lists, which is a variant on Nearest Neighbor that trades off compression for a major running-time speedup, is a good option for very large tables. However, for some compression schemes, it is more important to generate long runs rather than few runs. For this case, another novel heuristic, Vortex, is promising. We find that we can improve run-length encoding up to a factor of 3 whereas we can improve prefix coding by up to 80%: these gains are on top of the gains due to lexicographically sorting the table. We prove that the new row reordering is optimal (within 10%) at minimizing the runs of identical values within columns, in a few cases. △ Less

Submitted 3 February, 2014; v1 submitted 9 July, 2012; originally announced July 2012.

Comments: to appear in ACM TODS

ACM Class: H.4.0

Journal ref: ACM Trans. Database Syst. 37, 3, Article 20 (September 2012)

arXiv:1202.4961 [pdf, other]

doi 10.1093/comjnl/bxt070

Strongly universal string hashing is fast

Authors: Owen Kaser, Daniel Lemire

Abstract: We present fast strongly universal string hashing families: they can process data at a rate of 0.2 CPU cycle per byte. Maybe surprisingly, we find that these families---though they require a large buffer of random numbers---are often faster than popular hash functions with weaker theoretical guarantees. Moreover, conventional wisdom is that hash functions with fewer multiplications are faster. Yet… ▽ More We present fast strongly universal string hashing families: they can process data at a rate of 0.2 CPU cycle per byte. Maybe surprisingly, we find that these families---though they require a large buffer of random numbers---are often faster than popular hash functions with weaker theoretical guarantees. Moreover, conventional wisdom is that hash functions with fewer multiplications are faster. Yet we find that they may fail to be faster due to operation pipelining. We present experimental results on several processors including low-powered processors. Our tests include hash functions designed for processors with the Carry-Less Multiplication (CLMUL) instruction set. We also prove, using accessible proofs, the strong universality of our families. △ Less

Submitted 20 September, 2018; v1 submitted 22 February, 2012; originally announced February 2012.

Comments: Software is available at http://code.google.com/p/variablelengthstringhashing/ and https://github.com/lemire/StronglyUniversalStringHashing

Journal ref: Computer Journal (2014) 57 (11): 1624-1638

arXiv:1108.4041 [pdf, ps, other]

Extracting, Transforming and Archiving Scientific Data

Authors: Daniel Lemire, Andre Vellino

Abstract: It is becoming common to archive research datasets that are not only large but also numerous. In addition, their corresponding metadata and the software required to analyse or display them need to be archived. Yet the manual curation of research data can be difficult and expensive, particularly in very large digital repositories, hence the importance of models and tools for automating digital cura… ▽ More It is becoming common to archive research datasets that are not only large but also numerous. In addition, their corresponding metadata and the software required to analyse or display them need to be archived. Yet the manual curation of research data can be difficult and expensive, particularly in very large digital repositories, hence the importance of models and tools for automating digital curation tasks. The automation of these tasks faces three major challenges: (1) research data and data sources are highly heterogeneous, (2) future research needs are difficult to anticipate, (3) data is hard to index. To address these problems, we propose the Extract, Transform and Archive (ETA) model for managing and mechanizing the curation of research data. Specifically, we propose a scalable strategy for addressing the research-data problem, ranging from the extraction of legacy data to its long-term storage. We review some existing solutions and propose novel avenues of research. △ Less

Submitted 22 August, 2011; v1 submitted 19 August, 2011; originally announced August 2011.

Comments: 8 pages, Fourth Workshop on Very Large Digital Libraries, 2011

arXiv:1105.6001 [pdf, ps, other]

doi 10.1145/2070736.2070750

A Call to Arms: Revisiting Database Design

Authors: Antonio Badia, Daniel Lemire

Abstract: Good database design is crucial to obtain a sound, consistent database, and - in turn - good database design methodologies are the best way to achieve the right design. These methodologies are taught to most Computer Science undergraduates, as part of any Introduction to Database class. They can be considered part of the "canon", and indeed, the overall approach to database design has been unchang… ▽ More Good database design is crucial to obtain a sound, consistent database, and - in turn - good database design methodologies are the best way to achieve the right design. These methodologies are taught to most Computer Science undergraduates, as part of any Introduction to Database class. They can be considered part of the "canon", and indeed, the overall approach to database design has been unchanged for years. Moreover, none of the major database research assessments identify database design as a strategic research direction. Should we conclude that database design is a solved problem? Our thesis is that database design remains a critical unsolved problem. Hence, it should be the subject of more research. Our starting point is the observation that traditional database design is not used in practice - and if it were used it would result in designs that are not well adapted to current environments. In short, database design has failed to keep up with the times. In this paper, we put forth arguments to support our viewpoint, analyze the root causes of this situation and suggest some avenues of research. △ Less

Submitted 2 October, 2012; v1 submitted 30 May, 2011; originally announced May 2011.

Comments: Removed spurious column break. Nothing else was changed

Journal ref: Antonio Badia and Daniel Lemire. A call to arms: revisiting database design. SIGMOD Record 40 (3), pages 61-69, 2011

arXiv:1010.1526 [pdf, other]

doi 10.1007/s11634-012-0110-6

Time Series Classification by Class-Specific Mahalanobis Distance Measures

Authors: Zoltán Prekopcsák, Daniel Lemire

Abstract: To classify time series by nearest neighbors, we need to specify or learn one or several distance measures. We consider variations of the Mahalanobis distance measures which rely on the inverse covariance matrix of the data. Unfortunately --- for time series data --- the covariance matrix has often low rank. To alleviate this problem we can either use a pseudoinverse, covariance shrinking or limit… ▽ More To classify time series by nearest neighbors, we need to specify or learn one or several distance measures. We consider variations of the Mahalanobis distance measures which rely on the inverse covariance matrix of the data. Unfortunately --- for time series data --- the covariance matrix has often low rank. To alleviate this problem we can either use a pseudoinverse, covariance shrinking or limit the matrix to its diagonal. We review these alternatives and benchmark them against competitive methods such as the related Large Margin Nearest Neighbor Classification (LMNN) and the Dynamic Time Warping (DTW) distance. As we expected, we find that the DTW is superior, but the Mahalanobis distance measures are one to two orders of magnitude faster. To get best results with Mahalanobis distance measures, we recommend learning one distance measure per class using either covariance shrinking or the diagonal approach. △ Less

Submitted 2 July, 2012; v1 submitted 7 October, 2010; originally announced October 2010.

Journal ref: Advances in Data Analysis and Classification 6 (3), 2012

arXiv:1008.1715 [pdf, other]

doi 10.1016/j.dam.2011.11.009

The universality of iterated hashing over variable-length strings

Authors: Daniel Lemire

Abstract: Iterated hash functions process strings recursively, one character at a time. At each iteration, they compute a new hash value from the preceding hash value and the next character. We prove that iterated hashing can be pairwise independent, but never 3-wise independent. We show that it can be almost universal over strings much longer than the number of hash values; we bound the maximal string leng… ▽ More Iterated hash functions process strings recursively, one character at a time. At each iteration, they compute a new hash value from the preceding hash value and the next character. We prove that iterated hashing can be pairwise independent, but never 3-wise independent. We show that it can be almost universal over strings much longer than the number of hash values; we bound the maximal string length given the collision probability. △ Less

Submitted 24 November, 2011; v1 submitted 10 August, 2010; originally announced August 2010.

Journal ref: Discrete Applied Mathematics 160 (4-5), 604--617 (2012)

arXiv:1006.3726 [pdf, other]

doi 10.1016/j.datak.2013.01.001

Diamond Dicing

Authors: Hazel Webb, Daniel Lemire, Owen Kaser

Abstract: In OLAP, analysts often select an interesting sample of the data. For example, an analyst might focus on products bringing revenues of at least 100 000 dollars, or on shops having sales greater than 400 000 dollars. However, current systems do not allow the application of both of these thresholds simultaneously, selecting products and shops satisfying both thresholds. For such purposes, we introdu… ▽ More In OLAP, analysts often select an interesting sample of the data. For example, an analyst might focus on products bringing revenues of at least 100 000 dollars, or on shops having sales greater than 400 000 dollars. However, current systems do not allow the application of both of these thresholds simultaneously, selecting products and shops satisfying both thresholds. For such purposes, we introduce the diamond cube operator, filling a gap among existing data warehouse operations. Because of the interaction between dimensions the computation of diamond cubes is challenging. We compare and test various algorithms on large data sets of more than 100 million facts. We find that while it is possible to implement diamonds in SQL, it is inefficient. Indeed, our custom implementation can be a hundred times faster than popular database engines (including a row-store and a column-store). △ Less

Submitted 23 July, 2014; v1 submitted 18 June, 2010; originally announced June 2010.

Comments: 29 pages

Journal ref: Data & Knowledge Engineering, Volume 86, July 2013, Pages 1-18

arXiv:0909.1346 [pdf, other]

doi 10.1016/j.ins.2011.02.002

Reordering Columns for Smaller Indexes

Authors: Daniel Lemire, Owen Kaser

Abstract: Column-oriented indexes-such as projection or bitmap indexes-are compressed by run-length encoding to reduce storage and increase speed. Sorting the tables improves compression. On realistic data sets, permuting the columns in the right order before sorting can reduce the number of runs by a factor of two or more. Unfortunately, determining the best column order is NP-hard. For many cases, we prov… ▽ More Column-oriented indexes-such as projection or bitmap indexes-are compressed by run-length encoding to reduce storage and increase speed. Sorting the tables improves compression. On realistic data sets, permuting the columns in the right order before sorting can reduce the number of runs by a factor of two or more. Unfortunately, determining the best column order is NP-hard. For many cases, we prove that the number of runs in table columns is minimized if we sort columns by increasing cardinality. Experimentally, sorting based on Hilbert space-filling curves is poor at minimizing the number of runs. △ Less

Submitted 22 February, 2011; v1 submitted 7 September, 2009; originally announced September 2009.

Comments: to appear in Information Sciences

Journal ref: Daniel Lemire and Owen Kaser, Reordering Columns for Smaller Indexes, Information Sciences 181 (12), 2011

arXiv:0906.0910 [pdf]

On the Challenges of Collaborative Data Processing

Authors: Sylvie Noel, Daniel Lemire

Abstract: The last 30 years have seen the creation of a variety of electronic collaboration tools for science and business. Some of the best-known collaboration tools support text editing (e.g., wikis). Wikipedia's success shows that large-scale collaboration can produce highly valuable content. Meanwhile much structured data is being collected and made publicly available. We have never had access to more… ▽ More The last 30 years have seen the creation of a variety of electronic collaboration tools for science and business. Some of the best-known collaboration tools support text editing (e.g., wikis). Wikipedia's success shows that large-scale collaboration can produce highly valuable content. Meanwhile much structured data is being collected and made publicly available. We have never had access to more powerful databases and statistical packages. Is large-scale collaborative data analysis now possible? Using a quantitative analysis of Web 2.0 data visualization sites, we find evidence that at least moderate open collaboration occurs. We then explore some of the limiting factors of collaboration over data. △ Less

Submitted 4 June, 2009; originally announced June 2009.

Comments: to appear as a chapter in an upcoming book (Collaborative Information Behavior)

arXiv:0905.2657 [pdf, other]

doi 10.1007/978-3-642-01344-7_5

Web 2.0 OLAP: From Data Cubes to Tag Clouds

Authors: Kamel Aouiche, Daniel Lemire, Robert Godin

Abstract: Increasingly, business projects are ephemeral. New Business Intelligence tools must support ad-lib data sources and quick perusal. Meanwhile, tag clouds are a popular community-driven visualization technique. Hence, we investigate tag-cloud views with support for OLAP operations such as roll-ups, slices, dices, clustering, and drill-downs. As a case study, we implemented an application where users… ▽ More Increasingly, business projects are ephemeral. New Business Intelligence tools must support ad-lib data sources and quick perusal. Meanwhile, tag clouds are a popular community-driven visualization technique. Hence, we investigate tag-cloud views with support for OLAP operations such as roll-ups, slices, dices, clustering, and drill-downs. As a case study, we implemented an application where users can upload data and immediately navigate through its ad hoc dimensions. To support social networking, views can be easily shared and embedded in other Web sites. Algorithmically, our tag-cloud views are approximate range top-k queries over spontaneous data cubes. We present experimental evidence that iceberg cuboids provide adequate online approximations. We benchmark several browser-oblivious tag-cloud layout optimizations. △ Less

Submitted 15 March, 2016; v1 submitted 16 May, 2009; originally announced May 2009.

Comments: Software at https://github.com/lemire/OLAPTagCloud. arXiv admin note: substantial text overlap with arXiv:0710.2156

Journal ref: Lecture Notes in Business Information Processing Vol. 18, pages 51-64, 2009

arXiv:0901.3751 [pdf, other]

doi 10.1016/j.datak.2009.08.006

Sorting improves word-aligned bitmap indexes

Authors: Daniel Lemire, Owen Kaser, Kamel Aouiche

Abstract: Bitmap indexes must be compressed to reduce input/output costs and minimize CPU usage. To accelerate logical operations (AND, OR, XOR) over bitmaps, we use techniques based on run-length encoding (RLE), such as Word-Aligned Hybrid (WAH) compression. These techniques are sensitive to the order of the rows: a simple lexicographical sort can divide the index size by 9 and make indexes several times f… ▽ More Bitmap indexes must be compressed to reduce input/output costs and minimize CPU usage. To accelerate logical operations (AND, OR, XOR) over bitmaps, we use techniques based on run-length encoding (RLE), such as Word-Aligned Hybrid (WAH) compression. These techniques are sensitive to the order of the rows: a simple lexicographical sort can divide the index size by 9 and make indexes several times faster. We investigate row-reordering heuristics. Simply permuting the columns of the table can increase the sorting efficiency by 40%. Secondary contributions include efficient algorithms to construct and aggregate bitmaps. The effect of word length is also reviewed by constructing 16-bit, 32-bit and 64-bit indexes. Using 64-bit CPUs, we find that 64-bit indexes are slightly faster than 32-bit indexes despite being nearly twice as large. △ Less

Submitted 29 July, 2016; v1 submitted 23 January, 2009; originally announced January 2009.

Journal ref: Data & Knowledge Engineering, Volume 69, Issue 1, 2010, Pages 3-28

Showing 1–50 of 71 results for author: Lemire, D