skip to main content
research-article

Finding Correlations in Subquadratic Time, with Applications to Learning Parities and the Closest Pair Problem

Published: 06 May 2015 Publication History

Abstract

Given a set of n d-dimensional Boolean vectors with the promise that the vectors are chosen uniformly at random with the exception of two vectors that have Pearson correlation coefficient ρ (Hamming distance d· 1-ρ/ 2), how quickly can one find the two correlated vectors? We present an algorithm which, for any constant ϵ>0, and constant ρ>0, runs in expected time O(n5-ω / 4-ω+ϵ +nd) < O(n1.62 +nd), where ω < 2.4 is the exponent of matrix multiplication. This is the first subquadratic--time algorithm for this problem for which ρ does not appear in the exponent of n, and improves upon O(n2-O(ρ)), given by Paturi et al. [1989], the Locality Sensitive Hashing approach of Motwani [1998] and the Bucketing Codes approach of Dubiner [2008].
Applications and extensions of this basic algorithm yield significantly improved algorithms for several other problems.
Approximate Closest Pair. For any sufficiently small constant ϵ>0, given n d-dimensional vectors, there exists an algorithm that returns a pair of vectors whose Euclidean (or Hamming) distance differs from that of the closest pair by a factor of at most 1+ϵ, and runs in time O(n2-Θ(√ϵ)). The best previous algorithms (including Locality Sensitive Hashing) have runtime O(n2-O(ϵ)).
Learning Sparse Parities with Noise. Given samples from an instance of the learning parities with noise problem where each example has length n, the true parity set has size at most k « n, and the noise rate is η, there exists an algorithm that identifies the set of k indices in time nω+ϵ/3 k poly(1/1-2η) < n0.8k poly(1/1-2 η). This is the first algorithm with no dependence on η in the exponent of n, aside from the trivial O((nk)) ≈ O(nk) brute-force algorithm, and for large noise rates (η > 0.4), improves upon the results of Grigorescu et al. [2011] that give a runtime of n(1+(2 η)2 + o(1))k/2 poly(1/1-2η).
Learning k-Juntas with Noise. Given uniformly random length n Boolean vectors, together with a label, which is some function of just k « n of the bits, perturbed by noise rate η, return the set of relevant indices. Leveraging the reduction of Feldman et al. [2009], our result for learning k-parities implies an algorithm for this problem with runtime nω+ϵ/3 k poly(1/1-2η) < n0.8k poly(1/1-2 η), which is the first runtime for this problem of the form nck with an absolute constant c < 1.
Learning k-Juntas without Noise. Given uniformly random length n Boolean vectors, together with a label, which is some function of k « n of the bits, return the set of relevant indices. Using a modification of the algorithm of Mossel et al. [2004], and employing our algorithm for learning sparse parities with noise via the reduction of Feldman et al. [2009], we obtain an algorithm for this problem with runtime nω+ ϵ/4 k poly(n) < n0.6k poly(n), which improves on the previous best of nω+1/ωkn0.7k poly(n) of Mossel et al. [2004].

References

[1]
M. Ajtai, R. Kumar, and D. Sivakumar. 2001. A sieve algorithm for the shortest lattice vector problem. In Proceedings of the ACM Symposium on Theory of Computing (STOC). 601--610.
[2]
N. Alon and A. Naor. 2004. Approximating the cut-norm via GrothendieckÕs inequality. In Proceedings of the ACM Symposium on Theory of Computing (STOC). 72--80.
[3]
A. Andoni and P. Indyk. 2006. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS). 459--468.
[4]
A. Andoni and P. Indyk. 2008. Near--optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM 51, 1, 117--122.
[5]
S. Arora and R. Ge. 2011. New algorithms for learning in presence of errors. In Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP). 403--415.
[6]
J. L. Bentley. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9, 509--517.
[7]
A. Blum, M. Furst, J. Jackson, M. Kearns, Y. Mansour, and S. Rudich. 1994. Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In Proceedings of the ACM Symposium on Theory of Computing (STOC). 253--262.
[8]
A. Blum, A. Kalai, and H. Wasserman. 2003. Noise-tolerant learning, the parity problem, and the statistical query model. J. ACM 50, 4, 507--519.
[9]
Z. Brakerski and V. Vaikuntanathan. 2011. Efficient fully homomorphic encryption from (standard) LWE. In Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS).
[10]
M. Charikar. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of the Symposium on Theory of Computing (STOC).
[11]
K. Clarkson. 1988. A randomized algorithm for closest--point queries. SIAM J. Comput. 17, 4, 830--847.
[12]
D. Coppersmith. 1997. Rectangular matrix multiplication revisited. J. Complex. 13, 1, 42--49.
[13]
M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. 2004. Locality--sensitive hashing scheme based on p-stable distributions. In Proceedings of the 20th ACM Symposium on Computational Geometry (SoCG). 253--262.
[14]
M. Dubiner. 2008. Bucketing coding and information theory for the statistical high dimensional nearest neighbor problem. CoRR abs/0810.4182.
[15]
V. Feldman, P. Gopalan, S. Khot, and A. Ponnuswami. 2009. On agnostic learning of parities, monomials and halfspaces. SIAM J. Comput. 39, 2, 606--645.
[16]
E. Grigorescu, L. Reyzin, and S. Vempala. 2011. On noise-tolerant learning of sparse parities and related problems. In Proceedings of the 22nd International Conference on Algorithmic Learning Theory (ALT).
[17]
N. J. Hopper and M. Blum. 2001. Secure human identification protocols. In Proceedings of the ASIACRYPT. 52--66.
[18]
R. Impagliazzo and D. Zuckerman. 1989. How to recycle random bits. In Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS). 248--253.
[19]
P. Indyk and R. Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the ACM Symposium on Theory of Computing (STOC).
[20]
M. Kearns. 1998. Efficient noise-tolerant learning from statistical queries. J. ACM 45, 6, 983--1006.
[21]
E. Kushilevitz, R. Ostrovsky, and Y. Rabani. 2000. Efficient search for approximate nearest neighbor in high dimensional spaces. SIAM J. Comput. 30, 2, 457--474.
[22]
V. Lyubashevsky. 2005. The parity problem in the presence of noise, decoding random linear codes, and the subset sum problem. In Proceedings of the RANDOM. 378--389.
[23]
J. Marchini, P. Donnelly, and L. R. Cardon. 2005. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat. Genet. 37, 4, 413--417.
[24]
S. Meiser. 1993. Point location in arrangements of hyperplanes. Inf. Computat. 106, 2, 286--303.
[25]
E. Mossel, R. O'Donnell, and R. Servedio. 2004. Learning functions of k relevant variables. J. Comput. System Sci. 69, 3, 421--434.
[26]
R. Motwani, A. Naor, and R. Panigrahy. 2006. Lower bounds on locality sensitive hashing. In Proceedings of the ACM Symposium on Computational Geometry (SoCG). 154--157.
[27]
R. O'Donnell, Y. Wu, and Y. Zhou. 2011. Optimal lower bounds for locality sensitive hashing (except when q is tiny). In Proceedings of the Innovations in Theoretical Computer Science Conference (ITCS). 275--283.
[28]
R. Pagh. 2012. Compressed matrix multiplication. In Proceedings of the Innovations in Theoretical Computer Science Conference (ITCS).
[29]
R. Panigrahy. 2006. Entropy-based nearest neighbor search in high dimensions. In Proceedings of the 17th ACM-SIAM Symposium on Discrete Algorithms (SODA).
[30]
Ramamohan Paturi, Sanguthevar Rajasekaran, and John H. Reif. 1989. The light bulb problem. In Proceedings of the Conference on Learning Theory (COLT). 261--268.
[31]
C. Peikert. 2009. Public--key cryptosystems from the worst-case shortest vector problem. In Proceedings of the ACM Symposium on Theory of Computing (STOC). 333--342.
[32]
O. Regev. 2009. On lattices, learning with errors, random linear codes, and cryptography. J. ACM 56, 6, 1--40.
[33]
O. Regev. 2010. The learning with errors problem. In Proceedings of the IEEE Conference on Computational Complexity (CCC) (Invited Survey).
[34]
T. J. Rivlin. 1974. The Chebyshev Polynomials. Wiley.
[35]
H. Samet. 2006. Foundations of Multidimensional and Metric Data Structures. Elsevier.
[36]
I. J. Schoenberg. 1942. Positive definite functions on spheres. Duke Math. J. 9, 1, 96--108.
[37]
G. Szegö. 1975. Orthogonal Polynomials, 4th Ed. American Mathematical Society, Colloquium Publications, 23. Providence, RI.
[38]
G. Valiant. 2012. Finding correlations in subquadratic time, with applications to learning parities and juntas. In Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS).
[39]
L. Valiant. 1988. Functionality in neural nets. In Proceedings of the 1st Workshop on Computational Learning Theory. 28--39.
[40]
K. A. Verbeurgt. 1990. Learning DNF under the uniform distribution in quasipolynomial time. In Proceedings of the Conference on Learning Theory (COLT). 314--326.
[41]
X. Wan, C. Yang, H. Xue, N. Tang, and W. Yu. 2010. Detecting two-locus associations allowing for interactions in genome-wide association studies. Bioinformatics 26, 20, 2517--2525.
[42]
R. Weber, H. J. Schek, and S. Blott. 1998. A quantitative analysis and performance study for similarity--search methods in high--dimensional spaces. In Proceedings of the 24th International Conference on Very Large Databases (VLDB).
[43]
V. Vassilevska Williams. 2012. Multiplying matrices faster than Coppersmith--Winograd. In Proceedings of the ACM Symposium on Theory of Computing (STOC).

Cited By

View all
  • (2023)New Lower Bounds for Adaptive Tolerant Junta Testing2023 IEEE 64th Annual Symposium on Foundations of Computer Science (FOCS)10.1109/FOCS57990.2023.00108(1778-1786)Online publication date: 6-Nov-2023
  • (2023)A strong composition theorem for junta complexity and the boosting of property testers2023 IEEE 64th Annual Symposium on Foundations of Computer Science (FOCS)10.1109/FOCS57990.2023.00107(1757-1777)Online publication date: 6-Nov-2023
  • (2022)Properly Learning Decision Trees in almost Polynomial TimeJournal of the ACM10.1145/356104769:6(1-19)Online publication date: 24-Nov-2022
  • Show More Cited By

Index Terms

  1. Finding Correlations in Subquadratic Time, with Applications to Learning Parities and the Closest Pair Problem

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image Journal of the ACM
    Journal of the ACM  Volume 62, Issue 2
    May 2015
    304 pages
    ISSN:0004-5411
    EISSN:1557-735X
    DOI:10.1145/2772377
    Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 06 May 2015
    Accepted: 01 January 2015
    Revised: 01 August 2014
    Received: 01 August 2013
    Published in JACM Volume 62, Issue 2

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. Correlations
    2. approximate closest pair
    3. asymmetric embeddings
    4. learning juntas
    5. locality sensitive hashing
    6. metric embedding
    7. nearest neighbor
    8. parity with noise

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)54
    • Downloads (Last 6 weeks)11
    Reflects downloads up to 22 Oct 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2023)New Lower Bounds for Adaptive Tolerant Junta Testing2023 IEEE 64th Annual Symposium on Foundations of Computer Science (FOCS)10.1109/FOCS57990.2023.00108(1778-1786)Online publication date: 6-Nov-2023
    • (2023)A strong composition theorem for junta complexity and the boosting of property testers2023 IEEE 64th Annual Symposium on Foundations of Computer Science (FOCS)10.1109/FOCS57990.2023.00107(1757-1777)Online publication date: 6-Nov-2023
    • (2022)Properly Learning Decision Trees in almost Polynomial TimeJournal of the ACM10.1145/356104769:6(1-19)Online publication date: 24-Nov-2022
    • (2022)Properly learning decision trees in almost polynomial time2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS)10.1109/FOCS52979.2021.00093(920-929)Online publication date: Feb-2022
    • (2021)Discovering Higher-Order Interactions Through Neural Information DecompositionEntropy10.3390/e2301007923:1(79)Online publication date: 7-Jan-2021
    • (2021)Active Sampling Count Sketch (ASCS) for Online Sparse Estimation of a Trillion Scale Covariance MatrixProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3457327(352-364)Online publication date: 9-Jun-2021
    • (2021)An improved algorithm for learning sparse parities in the presence of noiseTheoretical Computer Science10.1016/j.tcs.2021.04.026873(76-86)Online publication date: Jun-2021
    • (2020)On efficient low distortion ultrametric embeddingProceedings of the 37th International Conference on Machine Learning10.5555/3524938.3525132(2078-2088)Online publication date: 13-Jul-2020
    • (2020)On Closest Pair in Euclidean Metric: Monochromatic is as Hard as BichromaticCombinatorica10.1007/s00493-019-4113-140:4(539-573)Online publication date: 1-Aug-2020
    • (2020)Explicit Correlation Amplifiers for Finding Outlier Correlations in Deterministic Subquadratic TimeAlgorithmica10.1007/s00453-020-00727-182:11(3306-3337)Online publication date: 1-Nov-2020
    • Show More Cited By

    View Options

    Get Access

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media