-
Expected Density of Random Minimizers
Authors:
Shay Golan,
Arseny M. Shur
Abstract:
Minimizer schemes, or just minimizers, are a very important computational primitive in sampling and sketching biological strings. Assuming a fixed alphabet of size $σ$, a minimizer is defined by two integers $k,w\ge2$ and a total order $ρ$ on strings of length $k$ (also called $k$-mers). A string is processed by a sliding window algorithm that chooses, in each window of length $w+k-1$, its minimal…
▽ More
Minimizer schemes, or just minimizers, are a very important computational primitive in sampling and sketching biological strings. Assuming a fixed alphabet of size $σ$, a minimizer is defined by two integers $k,w\ge2$ and a total order $ρ$ on strings of length $k$ (also called $k$-mers). A string is processed by a sliding window algorithm that chooses, in each window of length $w+k-1$, its minimal $k$-mer with respect to $ρ$. A key characteristic of the minimizer is the expected density of chosen $k$-mers among all $k$-mers in a random infinite $σ$-ary string. Random minimizers, in which the order $ρ$ is chosen uniformly at random, are often used in applications. However, little is known about their expected density $\mathcal{DR}_σ(k,w)$ besides the fact that it is close to $\frac{2}{w+1}$ unless $w\gg k$.
We first show that $\mathcal{DR}_σ(k,w)$ can be computed in $O(kσ^{k+w})$ time. Then we attend to the case $w\le k$ and present a formula that allows one to compute $\mathcal{DR}_σ(k,w)$ in just $O(w^2)$ time. Further, we describe the behaviour of $\mathcal{DR}_σ(k,w)$ in this case, establishing the connection between $\mathcal{DR}_σ(k,w)$, $\mathcal{DR}_σ(k+1,w)$, and $\mathcal{DR}_σ(k,w+1)$. In particular, we show that $\mathcal{DR}_σ(k,w)<\frac{2}{w+1}$ (by a tiny margin) unless $w$ is small. We conclude with some partial results and conjectures for the case $w>k$.
△ Less
Submitted 22 October, 2024;
originally announced October 2024.
-
Hamming Distance Oracle
Authors:
Itai Boneh,
Dvir Fried,
Shay Golan,
Matan Kraus
Abstract:
In this paper, we present and study the \emph{Hamming distance oracle problem}. In this problem, the task is to preprocess two strings $S$ and $T$ of lengths $n$ and $m$, respectively, to obtain a data-structure that is able to answer queries regarding the Hamming distance between a substring of $S$ and a substring of $T$.
For a constant size alphabet strings, we show that for every $x\le nm$ th…
▽ More
In this paper, we present and study the \emph{Hamming distance oracle problem}. In this problem, the task is to preprocess two strings $S$ and $T$ of lengths $n$ and $m$, respectively, to obtain a data-structure that is able to answer queries regarding the Hamming distance between a substring of $S$ and a substring of $T$.
For a constant size alphabet strings, we show that for every $x\le nm$ there is a data structure with $\tilde{O}(nm/x)$ preprocess time and $O(x)$ query time. We also provide a combinatorial conditional lower bound, showing that for every $\varepsilon > 0$ and $x \le nm$ there is no data structure with query time $O(x)$ and preprocess time $O((\frac{nm}{x})^{1-\varepsilon})$ unless combinatorial fast matrix multiplication is possible.
For strings over general alphabet, we present a data structure with $\tilde{O}(nm/\sqrt{x})$ preprocess time and $O(x)$ query time for every $x \le nm$.
△ Less
Submitted 7 July, 2024;
originally announced July 2024.
-
Enhancing Consistency-Based Image Generation via Adversarialy-Trained Classification and Energy-Based Discrimination
Authors:
Shelly Golan,
Roy Ganz,
Michael Elad
Abstract:
The recently introduced Consistency models pose an efficient alternative to diffusion algorithms, enabling rapid and good quality image synthesis. These methods overcome the slowness of diffusion models by directly mapping noise to data, while maintaining a (relatively) simpler training. Consistency models enable a fast one- or few-step generation, but they typically fall somewhat short in sample…
▽ More
The recently introduced Consistency models pose an efficient alternative to diffusion algorithms, enabling rapid and good quality image synthesis. These methods overcome the slowness of diffusion models by directly mapping noise to data, while maintaining a (relatively) simpler training. Consistency models enable a fast one- or few-step generation, but they typically fall somewhat short in sample quality when compared to their diffusion origins. In this work we propose a novel and highly effective technique for post-processing Consistency-based generated images, enhancing their perceptual quality. Our approach utilizes a joint classifier-discriminator model, in which both portions are trained adversarially. While the classifier aims to grade an image based on its assignment to a designated class, the discriminator portion of the very same network leverages the softmax values to assess the proximity of the input image to the targeted data manifold, thereby serving as an Energy-based Model. By employing example-specific projected gradient iterations under the guidance of this joint machine, we refine synthesized images and achieve an improved FID scores on the ImageNet 64x64 dataset for both Consistency-Training and Consistency-Distillation techniques.
△ Less
Submitted 25 May, 2024;
originally announced May 2024.
-
String 2-Covers with No Length Restrictions
Authors:
Itai Boneh,
Shay Golan,
Arseny Shur
Abstract:
A $λ$-cover of a string $S$ is a set of strings $\{C_i\}_1^λ$ such that every index in $S$ is contained in an occurrence of at least one string $C_i$. The existence of a $1$-cover defines a well-known class of quasi-periodic strings. Quasi-periodicity can be decided in linear time, and all $1$-covers of a string can be reported in linear time plus the size of the output. Since in general it is NP-…
▽ More
A $λ$-cover of a string $S$ is a set of strings $\{C_i\}_1^λ$ such that every index in $S$ is contained in an occurrence of at least one string $C_i$. The existence of a $1$-cover defines a well-known class of quasi-periodic strings. Quasi-periodicity can be decided in linear time, and all $1$-covers of a string can be reported in linear time plus the size of the output. Since in general it is NP-complete to decide whether a string has a $λ$-cover, the natural next step is the development of efficient algorithms for $2$-covers. Radoszewski and Straszyński [ESA 2020] analysed the particular case where the strings in a $2$-cover must be of the same length. They provided an algorithm that reports all such $2$-covers of $S$ in time near-linear in $|S|$ and in the size of the output.
In this work, we consider $2$-covers in full generality. Since every length-$n$ string has $Ω(n^2)$ trivial $2$-covers (every prefix and suffix of total length at least $n$ constitute such a $2$-cover), we state the reporting problem as follows: given a string $S$ and a number $m$, report all $2$-covers $\{C_1,C_2\}$ of $S$ with length $|C_1|+|C_2|$ upper bounded by $m$. We present an $\tilde{O}(n + Output)$ time algorithm solving this problem, with Output being the size of the output. This algorithm admits a simpler modification that finds a $2$-cover of minimum length. We also provide an $\tilde{O}(n)$ time construction of a $2$-cover oracle which, given two substrings $C_1,C_2$ of $S$, reports in poly-logarithmic time whether $\{C_1,C_2\}$ is a $2$-cover of $S$.
△ Less
Submitted 19 May, 2024;
originally announced May 2024.
-
Hairpin Completion Distance Lower Bound
Authors:
Itai Boneh,
Dvir Fried,
Shay Golan,
Matan Kraus
Abstract:
Hairpin completion, derived from the hairpin formation observed in DNA biochemistry, is an operation applied to strings, particularly useful in DNA computing. Conceptually, a right hairpin completion operation transforms a string $S$ into $S\cdot S'$ where $S'$ is the reverse complement of a prefix of $S$. Similarly, a left hairpin completion operation transforms a string $S$ into $S'\cdot S$ wher…
▽ More
Hairpin completion, derived from the hairpin formation observed in DNA biochemistry, is an operation applied to strings, particularly useful in DNA computing. Conceptually, a right hairpin completion operation transforms a string $S$ into $S\cdot S'$ where $S'$ is the reverse complement of a prefix of $S$. Similarly, a left hairpin completion operation transforms a string $S$ into $S'\cdot S$ where $S'$ is the reverse complement of a suffix of $S$. The hairpin completion distance from $S$ to $T$ is the minimum number of hairpin completion operations needed to transform $S$ into $T$. Recently Boneh et al. showed an $O(n^2)$ time algorithm for finding the hairpin completion distance between two strings of length at most $n$. In this paper we show that for any $\varepsilon>0$ there is no $O(n^{2-\varepsilon})$-time algorithm for the hairpin completion distance problem unless the Strong Exponential Time Hypothesis (SETH) is false. Thus, under SETH, the time complexity of the hairpin completion distance problem is quadratic, up to sub-polynomial factors.
△ Less
Submitted 17 April, 2024;
originally announced April 2024.
-
Quantum Amplitude Loading for Rainbow Options Pricing
Authors:
Francesca Cibrario,
Or Samimi Golan,
Giacomo Ranieri,
Emanuele Dri,
Mattia Ippoliti,
Ron Cohen,
Christian Mattia,
Bartolomeo Montrucchio,
Amir Naveh,
Davide Corbelletto
Abstract:
This work introduces a novel approach to price rainbow options, a type of path-independent multi-asset derivatives, with quantum computers. Leveraging the Iterative Quantum Amplitude Estimation method, we present an end-to-end quantum circuit implementation, emphasizing efficiency by delaying the transition to price space. Moreover, we analyze two different amplitude loading techniques for handlin…
▽ More
This work introduces a novel approach to price rainbow options, a type of path-independent multi-asset derivatives, with quantum computers. Leveraging the Iterative Quantum Amplitude Estimation method, we present an end-to-end quantum circuit implementation, emphasizing efficiency by delaying the transition to price space. Moreover, we analyze two different amplitude loading techniques for handling exponential functions. Experiments on the IBM QASM simulator validate our quantum pricing model, contributing to the evolving field of quantum finance.
△ Less
Submitted 1 October, 2024; v1 submitted 8 February, 2024;
originally announced February 2024.
-
Searching 2D-Strings for Matching Frames
Authors:
Itai Boneh,
Dvir Fried,
Shay Golan,
Matan Kraus,
Adrian Miclaus,
Arseny Shur
Abstract:
We introduce the natural notion of a matching frame in a $2$-dimensional string. A matching frame in a $2$-dimensional $n\times m$ string $M$, is a rectangle such that the strings written on the horizontal sides of the rectangle are identical, and so are the strings written on the vertical sides of the rectangle. Formally, a matching frame in $M$ is a tuple $(u,d,\ell,r)$ such that…
▽ More
We introduce the natural notion of a matching frame in a $2$-dimensional string. A matching frame in a $2$-dimensional $n\times m$ string $M$, is a rectangle such that the strings written on the horizontal sides of the rectangle are identical, and so are the strings written on the vertical sides of the rectangle. Formally, a matching frame in $M$ is a tuple $(u,d,\ell,r)$ such that $M[u][\ell ..r] = M[d][\ell ..r]$ and $M[u..d][\ell] = M[u..d][r]$.
In this paper, we present an algorithm for finding the maximum perimeter matching frame in a matrix $M$ in $\tilde{O}(n^{2.5})$ time (assuming $n \ge m)$. Additionally, for every constant $ε> 0$ we present a near-linear $(1-ε)$-approximation algorithm for the maximum perimeter of a matching frame.
In the development of the aforementioned algorithms, we introduce inventive technical elements and uncover distinctive structural properties that we believe will captivate the curiosity of the community.
△ Less
Submitted 18 April, 2024; v1 submitted 4 October, 2023;
originally announced October 2023.
-
Near-Optimal Dynamic Time Warping on Run-Length Encoded Strings
Authors:
Itai Boneh,
Shay Golan,
Shay Mozes,
Oren Weimann
Abstract:
We give an $\tilde O(n^2)$ time algorithm for computing the exact Dynamic Time Warping distance between two strings whose run-length encoding is of size at most $n$. This matches (up to log factors) the known (conditional) lower bound, and should be compared with the previous fastest $O(n^3)$ time exact algorithm and the $\tilde O(n^2)$ time approximation algorithm.
We give an $\tilde O(n^2)$ time algorithm for computing the exact Dynamic Time Warping distance between two strings whose run-length encoding is of size at most $n$. This matches (up to log factors) the known (conditional) lower bound, and should be compared with the previous fastest $O(n^3)$ time exact algorithm and the $\tilde O(n^2)$ time approximation algorithm.
△ Less
Submitted 13 February, 2023;
originally announced February 2023.
-
Mapping Disruption Sources in the Power Grid and Implications for Resilience
Authors:
Maureen S. Golan,
Javad Mohammadi
Abstract:
Developing models and metrics that can address resilience against disruptions is vital to ensure power grid reliability and that adequate recovery and adaptation mechanisms are in place. In this paper, we propose a novel disruption mapping approach and apply it to the publicly available U.S. Department of Energy DOE-417 Electric Emergency and Disturbance Report to holistically analyze the origin o…
▽ More
Developing models and metrics that can address resilience against disruptions is vital to ensure power grid reliability and that adequate recovery and adaptation mechanisms are in place. In this paper, we propose a novel disruption mapping approach and apply it to the publicly available U.S. Department of Energy DOE-417 Electric Emergency and Disturbance Report to holistically analyze the origin of anomalous events and their propagation through the cyber, physical and human domains. We show that capturing the disruption process onset has implications for quantifying, mitigating, and reporting power grid resilience.
△ Less
Submitted 17 July, 2022;
originally announced July 2022.
-
Cyber Resilience: by Design or by Intervention?
Authors:
Alexander Kott,
Maureen S. Golan,
Benjamin D. Trump,
Igor Linkov
Abstract:
The term "cyber resilience by design" is growing in popularity. Here, by cyber resilience we refer to the ability of the system to resist, minimize and mitigate a degradation caused by a successful cyber-attack on a system or network of computing and communicating devices. Some use the term "by design" when arguing that systems must be designed and implemented in a provable mission assurance fashi…
▽ More
The term "cyber resilience by design" is growing in popularity. Here, by cyber resilience we refer to the ability of the system to resist, minimize and mitigate a degradation caused by a successful cyber-attack on a system or network of computing and communicating devices. Some use the term "by design" when arguing that systems must be designed and implemented in a provable mission assurance fashion, with the system's intrinsic properties ensuring that a cyber-adversary is unable to cause a meaningful degradation. Others recommend that a system should include a built-in autonomous intelligent agent responsible for thinking and acting towards continuous observation, detection, minimization and remediation of a cyber degradation. In all cases, the qualifier "by design" indicates that the source of resilience is somehow inherent in the structure and operation of the system. But what, then, is the other resilience, not by design? Clearly, there has to be another type of resilience, otherwise what's the purpose of the qualifier "by design"? Indeed, while mentioned less frequently, there exists an alternative form of resilience called "resilience by intervention." In this article we explore differences and mutual reliance of resilience by design and resilience by intervention.
△ Less
Submitted 26 January, 2022;
originally announced January 2022.
-
An Improved Algorithm for The $k$-Dyck Edit Distance Problem
Authors:
Dvir Fried,
Shay Golan,
Tomasz Kociumaka,
Tsvi Kopelowitz,
Ely Porat,
Tatiana Starikovskaya
Abstract:
A Dyck sequence is a sequence of opening and closing parentheses (of various types) that is balanced. The Dyck edit distance of a given sequence of parentheses $S$ is the smallest number of edit operations (insertions, deletions, and substitutions) needed to transform $S$ into a Dyck sequence. We consider the threshold Dyck edit distance problem, where the input is a sequence of parentheses $S$ an…
▽ More
A Dyck sequence is a sequence of opening and closing parentheses (of various types) that is balanced. The Dyck edit distance of a given sequence of parentheses $S$ is the smallest number of edit operations (insertions, deletions, and substitutions) needed to transform $S$ into a Dyck sequence. We consider the threshold Dyck edit distance problem, where the input is a sequence of parentheses $S$ and a positive integer $k$, and the goal is to compute the Dyck edit distance of $S$ only if the distance is at most $k$, and otherwise report that the distance is larger than $k$. Backurs and Onak [PODS'16] showed that the threshold Dyck edit distance problem can be solved in $O(n+k^{16})$ time.
In this work, we design new algorithms for the threshold Dyck edit distance problem which costs $O(n+k^{4.544184})$ time with high probability or $O(n+k^{4.853059})$ deterministically. Our algorithms combine several new structural properties of the Dyck edit distance problem, a refined algorithm for fast $(\min,+)$ matrix product, and a careful modification of ideas used in Valiant's parsing algorithm.
△ Less
Submitted 22 August, 2022; v1 submitted 3 November, 2021;
originally announced November 2021.
-
The Vaccine Supply Chain: A Call for Resilience Analytics to Support COVID-19 Vaccine Production and Distribution
Authors:
Maureen S. Golan,
Benjamin D. Trump,
Jeffrey C. Cegan,
Igor Linkov
Abstract:
The COVID19 pandemic has highlighted the lack of resilience in supply chains, as global networks fail from disruptions at single nodes and connections. Through an overview of the existing vaccine and pharmaceutical supply chain publications focusing on resilience, as well as recent papers reporting modeling of resilience in supply chains across multiple fields, we find that models for supply chain…
▽ More
The COVID19 pandemic has highlighted the lack of resilience in supply chains, as global networks fail from disruptions at single nodes and connections. Through an overview of the existing vaccine and pharmaceutical supply chain publications focusing on resilience, as well as recent papers reporting modeling of resilience in supply chains across multiple fields, we find that models for supply chain resilience are few and most of them are focused on individual dimensions of resilience rather than on comprehensive strategy necessary for scaling up vaccine production and distribution in emergency settings. We find that COVID19 resulted in a wave of interest to supply chain resilience, but publications from 2020 are narrow in focus and largely qualitative in nature; evidence-based models and measures are rare. Further, publications often focus exclusively on specific portions of the specific supply chain of interest, excluding associated supporting networks, such as transportation, social and command and control (C2) necessary for vaccine production and equitable distribution. This lack of network analysis is a major gap in the literature that needs to be bridged in order to create methods of real-time analysis and decision tools for the COVID19 vaccine supply chain. We conclude that a comprehensive, quantitative approach to network resilience that encompasses the supply chain in the context of other social and physical networks is needed in order to address the emerging challenges of a large-scale COVID-19 vaccination program. We further find that the COVID-19 pandemic underscores the necessity of positioning supply chain resilience within a multi-network context and formally incorporating temporal dimensions into analysis through the NAS definition of resilience, plan, absorb, recover, adapt, to ensure essential needs are met across all dimensions of society.
△ Less
Submitted 28 November, 2020;
originally announced November 2020.
-
Improved Circular $k$-Mismatch Sketches
Authors:
Shay Golan,
Tomasz Kociumaka,
Tsvi Kopelowitz,
Ely Porat,
Przemysław Uznański
Abstract:
The shift distance $\mathsf{sh}(S_1,S_2)$ between two strings $S_1$ and $S_2$ of the same length is defined as the minimum Hamming distance between $S_1$ and any rotation (cyclic shift) of $S_2$. We study the problem of sketching the shift distance, which is the following communication complexity problem: Strings $S_1$ and $S_2$ of length $n$ are given to two identical players (encoders), who inde…
▽ More
The shift distance $\mathsf{sh}(S_1,S_2)$ between two strings $S_1$ and $S_2$ of the same length is defined as the minimum Hamming distance between $S_1$ and any rotation (cyclic shift) of $S_2$. We study the problem of sketching the shift distance, which is the following communication complexity problem: Strings $S_1$ and $S_2$ of length $n$ are given to two identical players (encoders), who independently compute sketches (summaries) $\mathtt{sk}(S_1)$ and $\mathtt{sk}(S_2)$, respectively, so that upon receiving the two sketches, a third player (decoder) is able to compute (or approximate) $\mathsf{sh}(S_1,S_2)$ with high probability.
This paper primarily focuses on the more general $k$-mismatch version of the problem, where the decoder is allowed to declare a failure if $\mathsf{sh}(S_1,S_2)>k$, where $k$ is a parameter known to all parties. Andoni et al. (STOC'13) introduced exact circular $k$-mismatch sketches of size $\widetilde{O}(k+D(n))$, where $D(n)$ is the number of divisors of $n$. Andoni et al. also showed that their sketch size is optimal in the class of linear homomorphic sketches.
We circumvent this lower bound by designing a (non-linear) exact circular $k$-mismatch sketch of size $\widetilde{O}(k)$; this size matches communication-complexity lower bounds. We also design $(1\pm \varepsilon)$-approximate circular $k$-mismatch sketch of size $\widetilde{O}(\min(\varepsilon^{-2}\sqrt{k}, \varepsilon^{-1.5}\sqrt{n}))$, which improves upon an $\widetilde{O}(\varepsilon^{-2}\sqrt{n})$-size sketch of Crouch and McGregor (APPROX'11).
△ Less
Submitted 24 June, 2020;
originally announced June 2020.
-
The Streaming k-Mismatch Problem: Tradeoffs between Space and Total Time
Authors:
Shay Golan,
Tomasz Kociumaka,
Tsvi Kopelowitz,
Ely Porat
Abstract:
We revisit the $k$-mismatch problem in the streaming model on a pattern of length $m$ and a streaming text of length $n$, both over a size-$σ$ alphabet. The current state-of-the-art algorithm for the streaming $k$-mismatch problem, by Clifford et al. [SODA 2019], uses $\tilde O(k)$ space and $\tilde O\big(\sqrt k\big)$ worst-case time per character. The space complexity is known to be (uncondition…
▽ More
We revisit the $k$-mismatch problem in the streaming model on a pattern of length $m$ and a streaming text of length $n$, both over a size-$σ$ alphabet. The current state-of-the-art algorithm for the streaming $k$-mismatch problem, by Clifford et al. [SODA 2019], uses $\tilde O(k)$ space and $\tilde O\big(\sqrt k\big)$ worst-case time per character. The space complexity is known to be (unconditionally) optimal, and the worst-case time per character matches a conditional lower bound. However, there is a gap between the total time cost of the algorithm, which is $\tilde O(n\sqrt k)$, and the fastest known offline algorithm, which costs $\tilde O\big(n + \min\big(\frac{nk}{\sqrt m},σn\big)\big)$ time. Moreover, it is not known whether improvements over the $\tilde O(n\sqrt k)$ total time are possible when using more than $O(k)$ space.
We address these gaps by designing a randomized streaming algorithm for the $k$-mismatch problem that, given an integer parameter $k\le s \le m$, uses $\tilde O(s)$ space and costs $\tilde O\big(n+\min\big(\frac {nk^2}m,\frac{nk}{\sqrt s},\frac{σnm}s\big)\big)$ total time. For $s=m$, the total runtime becomes $\tilde O\big(n + \min\big(\frac{nk}{\sqrt m},σn\big)\big)$, which matches the time cost of the fastest offline algorithm. Moreover, the worst-case time cost per character is still $\tilde O\big(\sqrt k\big)$.
△ Less
Submitted 27 April, 2020;
originally announced April 2020.
-
Time-Space Tradeoffs for Finding a Long Common Substring
Authors:
Stav Ben-Nun,
Shay Golan,
Tomasz Kociumaka,
Matan Kraus
Abstract:
We consider the problem of finding, given two documents of total length $n$, a longest string occurring as a substring of both documents. This problem, known as the Longest Common Substring (LCS) problem, has a classic $O(n)$-time solution dating back to the discovery of suffix trees (Weiner, 1973) and their efficient construction for integer alphabets (Farach-Colton, 1997). However, these solutio…
▽ More
We consider the problem of finding, given two documents of total length $n$, a longest string occurring as a substring of both documents. This problem, known as the Longest Common Substring (LCS) problem, has a classic $O(n)$-time solution dating back to the discovery of suffix trees (Weiner, 1973) and their efficient construction for integer alphabets (Farach-Colton, 1997). However, these solutions require $Θ(n)$ space, which is prohibitive in many applications. To address this issue, Starikovskaya and Vildhøj (CPM 2013) showed that for $n^{2/3} \le s \le n^{1-o(1)}$, the LCS problem can be solved in $O(s)$ space and $O(\frac{n^2}{s})$ time. Kociumaka et al. (ESA 2014) generalized this tradeoff to $1 \leq s \leq n$, thus providing a smooth time-space tradeoff from constant to linear space. In this paper, we obtain a significant speed-up for instances where the length $L$ of the sought LCS is large. For $1 \leq s \leq n$, we show that the LCS problem can be solved in $O(s)$ space and $\tilde{O}(\frac{n^2}{L\cdot s}+n)$ time. The result is based on techniques originating from the LCS with Mismatches problem (Flouri et al., 2015; Charalampopoulos et al., CPM 2018), on space-efficient locally consistent parsing (Birenzwige et al., SODA 2020), and on the structure of maximal repetitions (runs) in the input documents.
△ Less
Submitted 28 April, 2020; v1 submitted 4 March, 2020;
originally announced March 2020.
-
Approximating Text-to-Pattern Hamming Distances
Authors:
Timothy M. Chan,
Shay Golan,
Tomasz Kociumaka,
Tsvi Kopelowitz,
Ely Porat
Abstract:
We revisit a fundamental problem in string matching: given a pattern of length m and a text of length n, both over an alphabet of size $σ$, compute the Hamming distance between the pattern and the text at every location. Several $(1+ε)$-approximation algorithms have been proposed in the literature, with running time of the form $O(ε^{-O(1)}n\log n\log m)$, all using fast Fourier transform (FFT). W…
▽ More
We revisit a fundamental problem in string matching: given a pattern of length m and a text of length n, both over an alphabet of size $σ$, compute the Hamming distance between the pattern and the text at every location. Several $(1+ε)$-approximation algorithms have been proposed in the literature, with running time of the form $O(ε^{-O(1)}n\log n\log m)$, all using fast Fourier transform (FFT). We describe a simple $(1+ε)$-approximation algorithm that is faster and does not need FFT. Combining our approach with additional ideas leads to numerous new results:
- We obtain the first linear-time approximation algorithm; the running time is $O(ε^{-2}n)$.
- We obtain a faster exact algorithm computing all Hamming distances up to a given threshold k; its running time improves previous results by logarithmic factors and is linear if $k\le\sqrt m$.
- We obtain approximation algorithms with better $ε$-dependence using rectangular matrix multiplication. The time-bound is $Õ(n)$ when the pattern is sufficiently long: $m\ge ε^{-28}$. Previous algorithms require $Õ(ε^{-1}n)$ time.
- When k is not too small, we obtain a truly sublinear-time algorithm to find all locations with Hamming distance approximately (up to a constant factor) less than k, in $O((n/k^{Ω(1)}+occ)n^{o(1)})$ time, where occ is the output size. The algorithm leads to a property tester, returning true if an exact match exists and false if the Hamming distance is more than $δm$ at every location, running in $Õ(δ^{-1/3}n^{2/3}+δ^{-1}n/m)$ time.
- We obtain a streaming algorithm to report all locations with Hamming distance approximately less than k, using $Õ(ε^{-2}\sqrt k)$ space. Previously, streaming algorithms were known for the exact problem with Õ(k) space or for the approximate problem with $Õ(ε^{-O(1)}\sqrt m)$ space.
△ Less
Submitted 1 January, 2020;
originally announced January 2020.
-
Littlewood Polynomials, Spectral-Null Codes, and Equipowerful Partitions
Authors:
Joe Buhler,
Shahar Golan,
Rob Pratt,
Stan Wagon
Abstract:
Let $[n]$ denote $\{0,1, ... , n-1\}$. A polynomial $f(x) = \sum a_i x^i$ is a Littlewood polynomial (LP) of length $n$ if the $a_i$ are $\pm 1$ for $i \in [n]$, and $a_i = 0$ for $i \ge n$. Such an LP is said to have order $m$ if it is divisible by $(x-1)^m$. The problem of finding the set $L_m$ of lengths of LPs of order $m$ is equivalent to finding the lengths of spectral-null codes of order…
▽ More
Let $[n]$ denote $\{0,1, ... , n-1\}$. A polynomial $f(x) = \sum a_i x^i$ is a Littlewood polynomial (LP) of length $n$ if the $a_i$ are $\pm 1$ for $i \in [n]$, and $a_i = 0$ for $i \ge n$. Such an LP is said to have order $m$ if it is divisible by $(x-1)^m$. The problem of finding the set $L_m$ of lengths of LPs of order $m$ is equivalent to finding the lengths of spectral-null codes of order $m$, and to finding $n$ such that $[n]$ admits a partition into two subsets whose first $m$ moments are equal. Extending the techniques and results of Boyd and others, we completely determine $L_7$ and $L_8$ and prove that 192 is the smallest element of $L_9$. Our primary tools are the use of carefully targeted searches using integer linear programming (both to find LPs and to disprove their existence for specific $n$ and $m$), and an unexpected new concept (that arose out of observed symmetry properties of LPs) that we call "regenerative pairs," which produce infinite arithmetic progressions in $L_m$. We prove that for $m \le$ 8, whenever there is an LP of length $n$ and order $m$, there is one of length $n$ and order $m$ that is symmetric (resp.~antisymmetric) if m is even (resp.~odd).
△ Less
Submitted 7 December, 2019;
originally announced December 2019.
-
Locally Consistent Parsing for Text Indexing in Small Space
Authors:
Or Birenzwige,
Shay Golan,
Ely Porat
Abstract:
We consider two closely related problems of text indexing in a sub-linear working space. The first problem is the Sparse Suffix Tree (SST) construction of a set of suffixes $B$ using only $O(|B|)$ words of space. The second problem is the Longest Common Extension (LCE) problem, where for some parameter $1\leτ\le n$, the goal is to construct a data structure that uses $O(\frac {n}τ)$ words of space…
▽ More
We consider two closely related problems of text indexing in a sub-linear working space. The first problem is the Sparse Suffix Tree (SST) construction of a set of suffixes $B$ using only $O(|B|)$ words of space. The second problem is the Longest Common Extension (LCE) problem, where for some parameter $1\leτ\le n$, the goal is to construct a data structure that uses $O(\frac {n}τ)$ words of space and can compute the longest common prefix length of any pair of suffixes. We show how to use ideas based on the Locally Consistent Parsing technique, that was introduced by Sahinalp and Vishkin [STOC '94], in some non-trivial ways in order to improve the known results for the above problems. We introduce new Las-Vegas and deterministic algorithms for both problems.
We introduce the first Las-Vegas SST construction algorithm that takes $O(n)$ time. This is an improvement over the last result of Gawrychowski and Kociumaka [SODA '17] who obtained $O(n)$ time for Monte-Carlo algorithm, and $O(n\sqrt{\log |B|})$ time for Las-Vegas algorithm. In addition, we introduce a randomized Las-Vegas construction for an LCE data structure that can be constructed in linear time and answers queries in $O(τ)$ time.
For the deterministic algorithms, we introduce an SST construction algorithm that takes $O(n\log \frac{n}{|B|})$ time (for $|B|=Ω(\log n)$). This is the first almost linear time, $O(n\cdot poly\log{n})$, deterministic SST construction algorithm, where all previous algorithms take at least $Ω\left(\min\{n|B|,\frac{n^2}{|B|}\}\right)$ time. For the LCE problem, we introduce a data structure that answers LCE queries in $O(τ\sqrt{\log^*n})$ time, with $O(n\logτ)$ construction time (for $τ=O(\frac{n}{\log n})$). This data structure improves both query time and construction time upon the results of Tanimura et al. [CPM '16].
△ Less
Submitted 1 January, 2020; v1 submitted 2 December, 2018;
originally announced December 2018.
-
Streaming Pattern Matching with d Wildcards
Authors:
Shay Golan,
Tsvi Kopelowitz,
Ely Porat
Abstract:
In the pattern matching with $d$ wildcards problem one is given a text $T$ of length $n$ and a pattern $P$ of length $m$ that contains $d$ wildcard characters, each denoted by a special symbol $'?'$. A wildcard character matches any other character. The goal is to establish for each $m$-length substring of $T$ whether it matches $P$. In the streaming model variant of the pattern matching with $d$…
▽ More
In the pattern matching with $d$ wildcards problem one is given a text $T$ of length $n$ and a pattern $P$ of length $m$ that contains $d$ wildcard characters, each denoted by a special symbol $'?'$. A wildcard character matches any other character. The goal is to establish for each $m$-length substring of $T$ whether it matches $P$. In the streaming model variant of the pattern matching with $d$ wildcards problem the text $T$ arrives one character at a time and the goal is to report, before the next character arrives, if the last $m$ characters match $P$ while using only $o(m)$ words of space.
In this paper we introduce two new algorithms for the $d$ wildcard pattern matching problem in the streaming model. The first is a randomized Monte Carlo algorithm that is parameterized by a constant $0\leq δ\leq 1$. This algorithm uses $\tilde{O}(d^{1-δ})$ amortized time per character and $\tilde{O}(d^{1+δ})$ words of space. The second algorithm, which is used as a black box in the first algorithm, is a randomized Monte Carlo algorithm which uses $O(d+\log m)$ worst-case time per character and $O(d\log m)$ words of space.
△ Less
Submitted 5 April, 2017;
originally announced April 2017.
-
Budget-Constrained Item Cold-Start Handling in Collaborative Filtering Recommenders via Optimal Design
Authors:
Oren Anava,
Shahar Golan,
Nadav Golbandi,
Zohar Karnin,
Ronny Lempel,
Oleg Rokhlenko,
Oren Somekh
Abstract:
It is well known that collaborative filtering (CF) based recommender systems provide better modeling of users and items associated with considerable rating history. The lack of historical ratings results in the user and the item cold-start problems. The latter is the main focus of this work. Most of the current literature addresses this problem by integrating content-based recommendation technique…
▽ More
It is well known that collaborative filtering (CF) based recommender systems provide better modeling of users and items associated with considerable rating history. The lack of historical ratings results in the user and the item cold-start problems. The latter is the main focus of this work. Most of the current literature addresses this problem by integrating content-based recommendation techniques to model the new item. However, in many cases such content is not available, and the question arises is whether this problem can be mitigated using CF techniques only. We formalize this problem as an optimization problem: given a new item, a pool of available users, and a budget constraint, select which users to assign with the task of rating the new item in order to minimize the prediction error of our model. We show that the objective function is monotone-supermodular, and propose efficient optimal design based algorithms that attain an approximation to its optimum. Our findings are verified by an empirical study using the Netflix dataset, where the proposed algorithms outperform several baselines for the problem at hand.
△ Less
Submitted 20 September, 2016; v1 submitted 10 June, 2014;
originally announced June 2014.