Skip to main content

Showing 1–50 of 63 results for author: Indyk, P

  1. arXiv:2406.10746  [pdf, other

    cs.CL cs.IR

    SparseCL: Sparse Contrastive Learning for Contradiction Retrieval

    Authors: Haike Xu, Zongyu Lin, Yizhou Sun, Kai-Wei Chang, Piotr Indyk

    Abstract: Contradiction retrieval refers to identifying and extracting documents that explicitly disagree with or refute the content of a query, which is important to many downstream applications like fact checking and data cleaning. To retrieve contradiction argument to the query from large document corpora, existing methods such as similarity search and crossencoder models exhibit significant limitations.… ▽ More

    Submitted 15 June, 2024; originally announced June 2024.

  2. arXiv:2406.02891  [pdf, other

    cs.IR cs.DS cs.LG

    A Bi-metric Framework for Fast Similarity Search

    Authors: Haike Xu, Sandeep Silwal, Piotr Indyk

    Abstract: We propose a new "bi-metric" framework for designing nearest neighbor data structures. Our framework assumes two dissimilarity functions: a ground-truth metric that is accurate but expensive to compute, and a proxy metric that is cheaper but less accurate. In both theory and practice, we show how to construct data structures using only the proxy metric such that the query procedure achieves the ac… ▽ More

    Submitted 4 June, 2024; originally announced June 2024.

  3. arXiv:2312.13490  [pdf, ps, other

    cs.DS

    Dimension-Accuracy Tradeoffs in Contrastive Embeddings for Triplets, Terminals & Top-k Nearest Neighbors

    Authors: Vaggos Chatziafratis, Piotr Indyk

    Abstract: Metric embeddings traditionally study how to map $n$ items to a target metric space such that distance lengths are not heavily distorted; but what if we only care to preserve the relative order of the distances (and not their length)? In this paper, we are motivated by the following basic question: given triplet comparisons of the form ``item $i$ is closer to item $j$ than to item $k$,'' can we fi… ▽ More

    Submitted 29 December, 2023; v1 submitted 20 December, 2023; originally announced December 2023.

    Comments: Abstract shortened for arxiv

  4. arXiv:2311.17868  [pdf, ps, other

    cs.DS

    Space-Optimal Profile Estimation in Data Streams with Applications to Symmetric Functions

    Authors: Justin Y. Chen, Piotr Indyk, David P. Woodruff

    Abstract: We revisit the problem of estimating the profile (also known as the rarity) in the data stream model. Given a sequence of $m$ elements from a universe of size $n$, its profile is a vector $φ$ whose $i$-th entry $φ_i$ represents the number of distinct elements that appear in the stream exactly $i$ times. A classic paper by Datar and Muthukrishan from 2002 gave an algorithm which estimates any entry… ▽ More

    Submitted 29 November, 2023; originally announced November 2023.

    Comments: To appear in ITCS 2024

  5. arXiv:2310.19126  [pdf, other

    cs.DS cs.CG cs.LG

    Worst-case Performance of Popular Approximate Nearest Neighbor Search Implementations: Guarantees and Limitations

    Authors: Piotr Indyk, Haike Xu

    Abstract: Graph-based approaches to nearest neighbor search are popular and powerful tools for handling large datasets in practice, but they have limited theoretical guarantees. We study the worst-case performance of recent graph-based approximate nearest neighbor search algorithms, such as HNSW, NSG and DiskANN. For DiskANN, we show that its "slow preprocessing" version provably supports approximate neares… ▽ More

    Submitted 29 October, 2023; originally announced October 2023.

    Comments: Accepted by NeurIPS 2023

  6. arXiv:2307.03043  [pdf, other

    cs.DS cs.CG cs.GR cs.LG

    A Near-Linear Time Algorithm for the Chamfer Distance

    Authors: Ainesh Bakshi, Piotr Indyk, Rajesh Jayaram, Sandeep Silwal, Erik Waingarten

    Abstract: For any two point sets $A,B \subset \mathbb{R}^d$ of size up to $n$, the Chamfer distance from $A$ to $B$ is defined as $\text{CH}(A,B)=\sum_{a \in A} \min_{b \in B} d_X(a,b)$, where $d_X$ is the underlying distance measure (e.g., the Euclidean or Manhattan distance). The Chamfer distance is a popular measure of dissimilarity between point clouds, used in many machine learning, computer vision, an… ▽ More

    Submitted 6 July, 2023; originally announced July 2023.

  7. arXiv:2306.11312  [pdf, other

    cs.DS cs.LG stat.ML

    Data Structures for Density Estimation

    Authors: Anders Aamand, Alexandr Andoni, Justin Y. Chen, Piotr Indyk, Shyam Narayanan, Sandeep Silwal

    Abstract: We study statistical/computational tradeoffs for the following density estimation problem: given $k$ distributions $v_1, \ldots, v_k$ over a discrete domain of size $n$, and sampling access to a distribution $p$, identify $v_i$ that is "close" to $p$. Our main result is the first data structure that, given a sublinear (in $n$) number of samples from $p$, identifies $v_i$ in time sublinear in $k$.… ▽ More

    Submitted 20 June, 2023; originally announced June 2023.

    Comments: To appear at ICML'23

  8. arXiv:2304.07652  [pdf, other

    cs.DS cs.DB cs.LG

    Learned Interpolation for Better Streaming Quantile Approximation with Worst-Case Guarantees

    Authors: Nicholas Schiefer, Justin Y. Chen, Piotr Indyk, Shyam Narayanan, Sandeep Silwal, Tal Wagner

    Abstract: An $\varepsilon$-approximate quantile sketch over a stream of $n$ inputs approximates the rank of any query point $q$ - that is, the number of input points less than $q$ - up to an additive error of $\varepsilon n$, generally with some probability of at least $1 - 1/\mathrm{poly}(n)$, while consuming $o(n)$ space. While the celebrated KLL sketch of Karnin, Lang, and Liberty achieves a provably opt… ▽ More

    Submitted 15 April, 2023; originally announced April 2023.

    Comments: 11 pages, 5 figures, published at SIAM ACDA 2023

  9. arXiv:2212.00642  [pdf, other

    cs.LG cs.DS

    Sub-quadratic Algorithms for Kernel Matrices via Kernel Density Estimation

    Authors: Ainesh Bakshi, Piotr Indyk, Praneeth Kacham, Sandeep Silwal, Samson Zhou

    Abstract: Kernel matrices, as well as weighted graphs represented by them, are ubiquitous objects in machine learning, statistics and other related fields. The main drawback of using kernel methods (learning and inference using kernel matrices) is efficiency -- given $n$ input points, most kernel-based algorithms need to materialize the full $n \times n$ kernel matrix before performing any subsequent comput… ▽ More

    Submitted 1 December, 2022; originally announced December 2022.

  10. arXiv:2211.03232  [pdf, other

    cs.LG stat.ML

    Exponentially Improving the Complexity of Simulating the Weisfeiler-Lehman Test with Graph Neural Networks

    Authors: Anders Aamand, Justin Y. Chen, Piotr Indyk, Shyam Narayanan, Ronitt Rubinfeld, Nicholas Schiefer, Sandeep Silwal, Tal Wagner

    Abstract: Recent work shows that the expressive power of Graph Neural Networks (GNNs) in distinguishing non-isomorphic graphs is exactly the same as that of the Weisfeiler-Lehman (WL) graph test. In particular, they show that the WL test can be simulated by GNNs. However, those simulations involve neural networks for the 'combine' function of size polynomial or even exponential in the number of graph nodes… ▽ More

    Submitted 21 December, 2022; v1 submitted 6 November, 2022; originally announced November 2022.

    Comments: 22 pages,5 figures, published at NeurIPS 2022. Updated funding statements

  11. arXiv:2210.15114  [pdf, other

    cs.DS

    Faster Linear Algebra for Distance Matrices

    Authors: Piotr Indyk, Sandeep Silwal

    Abstract: The distance matrix of a dataset $X$ of $n$ points with respect to a distance function $f$ represents all pairwise distances between points in $X$ induced by $f$. Due to their wide applicability, distance matrices and related families of matrices have been the focus of many recent algorithmic works. We continue this line of research and take a broad view of algorithm design for distance matrices w… ▽ More

    Submitted 26 October, 2022; originally announced October 2022.

    Comments: Selected as Oral for NeurIPS 2022

  12. arXiv:2207.08686  [pdf, other

    cs.DS

    Streaming Algorithms for Support-Aware Histograms

    Authors: Justin Y. Chen, Piotr Indyk, Tal Wagner

    Abstract: Histograms, i.e., piece-wise constant approximations, are a popular tool used to represent data distributions. Traditionally, the difference between the histogram and the underlying distribution (i.e., the approximation error) is measured using the $L_p$ norm, which sums the differences between the two functions over all items in the domain. Although useful in many applications, the drawback of th… ▽ More

    Submitted 18 July, 2022; originally announced July 2022.

    Comments: Appearing in ICML'22

  13. arXiv:2206.07886  [pdf, other

    cs.LG cs.DS

    Generalization Bounds for Data-Driven Numerical Linear Algebra

    Authors: Peter Bartlett, Piotr Indyk, Tal Wagner

    Abstract: Data-driven algorithms can adapt their internal structure or parameters to inputs from unknown application-specific distributions, by learning from a training sample of inputs. Several recent works have applied this approach to problems in numerical linear algebra, obtaining significant empirical gains in performance. However, no theoretical explanation for their success was known. In this work… ▽ More

    Submitted 15 June, 2022; originally announced June 2022.

    Comments: COLT 2022

  14. arXiv:2203.09572  [pdf, other

    cs.DS cs.LG

    Triangle and Four Cycle Counting with Predictions in Graph Streams

    Authors: Justin Y. Chen, Talya Eden, Piotr Indyk, Honghao Lin, Shyam Narayanan, Ronitt Rubinfeld, Sandeep Silwal, Tal Wagner, David P. Woodruff, Michael Zhang

    Abstract: We propose data-driven one-pass streaming algorithms for estimating the number of triangles and four cycles, two fundamental problems in graph analytics that are widely studied in the graph data stream literature. Recently, (Hsu 2018) and (Jiang 2020) applied machine learning techniques in other data stream problems, using a trained oracle that can predict certain properties of the stream elements… ▽ More

    Submitted 17 March, 2022; originally announced March 2022.

    Comments: To be presented at ICLR 2022

  15. arXiv:2111.13998  [pdf, other

    cs.CV

    Targeted Supervised Contrastive Learning for Long-Tailed Recognition

    Authors: Tianhong Li, Peng Cao, Yuan Yuan, Lijie Fan, Yuzhe Yang, Rogerio Feris, Piotr Indyk, Dina Katabi

    Abstract: Real-world data often exhibits long tail distributions with heavy class imbalance, where the majority classes can dominate the training process and alter the decision boundaries of the minority classes. Recently, researchers have investigated the potential of supervised contrastive learning for long-tailed recognition, and demonstrated that it provides a strong performance gain. In this paper, we… ▽ More

    Submitted 2 May, 2022; v1 submitted 27 November, 2021; originally announced November 2021.

    Comments: The first two authors contributed equally to this paper

  16. arXiv:2111.10041  [pdf, other

    cs.CG cs.DS cs.LG

    Embeddings and labeling schemes for A*

    Authors: Talya Eden, Piotr Indyk, Haike Xu

    Abstract: A* is a classic and popular method for graphs search and path finding. It assumes the existence of a heuristic function $h(u,t)$ that estimates the shortest distance from any input node $u$ to the destination $t$. Traditionally, heuristics have been handcrafted by domain experts. However, over the last few years, there has been a growing interest in learning heuristic functions. Such learned heuri… ▽ More

    Submitted 18 November, 2021; originally announced November 2021.

    Comments: ITCS 2022

  17. arXiv:2111.03953  [pdf, ps, other

    cs.DS cs.CC

    Frequency Estimation with One-Sided Error

    Authors: Piotr Indyk, Shyam Narayanan, David P. Woodruff

    Abstract: Frequency estimation is one of the most fundamental problems in streaming algorithms. Given a stream $S$ of elements from some universe $U=\{1 \ldots n\}$, the goal is to compute, in a single pass, a short sketch of $S$ so that for any element $i \in U$, one can estimate the number $x_i$ of times $i$ occurs in $S$ based on the sketch alone. Two state of the art solutions to this problems are the C… ▽ More

    Submitted 6 November, 2021; originally announced November 2021.

    Comments: To appear in SODA 2022. Abstract abridged to meet arXiv requirements - see pdf for full abstract

  18. arXiv:2110.11439  [pdf, other

    cs.DS cs.LG

    (Optimal) Online Bipartite Matching with Degree Information

    Authors: Anders Aamand, Justin Y. Chen, Piotr Indyk

    Abstract: We propose a model for online graph problems where algorithms are given access to an oracle that predicts (e.g., based on modeling assumptions or on past data) the degrees of nodes in the graph. Within this model, we study the classic problem of online bipartite matching, and a natural greedy matching algorithm called MinPredictedDegree, which uses predictions of the degrees of offline nodes. For… ▽ More

    Submitted 14 November, 2022; v1 submitted 21 October, 2021; originally announced October 2021.

    Comments: To appear in NeurIPS'22. A prior version of this work was titled "(Optimal) Online Bipartite Matching with Predicted Degrees"

  19. arXiv:2110.03152  [pdf, other

    cs.CG cs.DS

    Optimal (Euclidean) Metric Compression

    Authors: Piotr Indyk, Tal Wagner

    Abstract: We study the problem of representing all distances between $n$ points in $\mathbb R^d$, with arbitrarily small distortion, using as few bits as possible. We give asymptotically tight bounds for this problem, for Euclidean metrics, for $\ell_1$ (a.k.a.~Manhattan) metrics, and for general metrics. Our bounds for Euclidean metrics mark the first improvement over compression schemes based on discret… ▽ More

    Submitted 6 October, 2021; originally announced October 2021.

  20. arXiv:2107.01804  [pdf, other

    cs.DS cs.LG math.PR

    Randomized Dimensionality Reduction for Facility Location and Single-Linkage Clustering

    Authors: Shyam Narayanan, Sandeep Silwal, Piotr Indyk, Or Zamir

    Abstract: Random dimensionality reduction is a versatile tool for speeding up algorithms for high-dimensional problems. We study its application to two clustering problems: the facility location problem, and the single-linkage hierarchical clustering problem, which is equivalent to computing the minimum spanning tree. We show that if we project the input pointset $X$ onto a random $d = O(d_X)$-dimensional s… ▽ More

    Submitted 5 July, 2021; originally announced July 2021.

    Comments: 25 pages. Published as a conference paper in ICML 2021

  21. arXiv:2106.08396  [pdf, other

    cs.LG cs.DS math.ST

    Learning-based Support Estimation in Sublinear Time

    Authors: Talya Eden, Piotr Indyk, Shyam Narayanan, Ronitt Rubinfeld, Sandeep Silwal, Tal Wagner

    Abstract: We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the su… ▽ More

    Submitted 15 June, 2021; originally announced June 2021.

    Comments: 17 pages. Published as a conference paper in ICLR 2021

  22. arXiv:2102.08341  [pdf, other

    cs.DS cs.LG math.NA

    Faster Kernel Matrix Algebra via Density Estimation

    Authors: Arturs Backurs, Piotr Indyk, Cameron Musco, Tal Wagner

    Abstract: We study fast algorithms for computing fundamental properties of a positive semidefinite kernel matrix $K \in \mathbb{R}^{n \times n}$ corresponding to $n$ points $x_1,\ldots,x_n \in \mathbb{R}^d$. In particular, we consider estimating the sum of kernel matrix entries, along with its top eigenvalue and eigenvector. We show that the sum of matrix entries can be estimated to $1+ε$ relative error i… ▽ More

    Submitted 17 June, 2021; v1 submitted 16 February, 2021; originally announced February 2021.

  23. arXiv:2012.09962  [pdf, other

    cs.LG cs.CV

    Addressing Feature Suppression in Unsupervised Visual Representations

    Authors: Tianhong Li, Lijie Fan, Yuan Yuan, Hao He, Yonglong Tian, Rogerio Feris, Piotr Indyk, Dina Katabi

    Abstract: Contrastive learning is one of the fastest growing research areas in machine learning due to its ability to learn useful representations without labeled data. However, contrastive learning is susceptible to feature suppression, i.e., it may discard important information relevant to the task of interest, and learn irrelevant features. Past work has addressed this limitation via handcrafted data aug… ▽ More

    Submitted 27 November, 2021; v1 submitted 17 December, 2020; originally announced December 2020.

    Comments: The first two authors contributed equally to this paper

  24. arXiv:2006.05028  [pdf, other

    cs.DS cs.LG

    Online Page Migration with ML Advice

    Authors: Piotr Indyk, Frederik Mallmann-Trenn, Slobodan Mitrović, Ronitt Rubinfeld

    Abstract: We consider online algorithms for the {\em page migration problem} that use predictions, potentially imperfect, to improve their performance. The best known online algorithms for this problem, due to Westbrook'94 and Bienkowski et al'17, have competitive ratios strictly bounded away from 1. In contrast, we show that if the algorithm is given a prediction of the input sequence, then it can achieve… ▽ More

    Submitted 8 June, 2020; originally announced June 2020.

  25. arXiv:1911.07976  [pdf, ps, other

    cs.IT cs.DS cs.LG

    Estimating Entropy of Distributions in Constant Space

    Authors: Jayadev Acharya, Sourbh Bhadane, Piotr Indyk, Ziteng Sun

    Abstract: We consider the task of estimating the entropy of $k$-ary distributions from samples in the streaming model, where space is limited. Our main contribution is an algorithm that requires $O\left(\frac{k \log (1/\varepsilon)^2}{\varepsilon^3}\right)$ samples and a constant $O(1)$ memory words of space and outputs a $\pm\varepsilon$ estimate of $H(p)$. Without space limitations, the sample complexity… ▽ More

    Submitted 18 November, 2019; originally announced November 2019.

    Comments: NeurIPS 2019

  26. arXiv:1910.13984  [pdf, other

    cs.LG cs.DS stat.ML

    Learning-Based Low-Rank Approximations

    Authors: Piotr Indyk, Ali Vakilian, Yang Yuan

    Abstract: We introduce a "learning-based" algorithm for the low-rank decomposition problem: given an $n \times d$ matrix $A$, and a parameter $k$, compute a rank-$k$ matrix $A'$ that minimizes the approximation loss $\|A-A'\|_F$. The algorithm uses a training set of input matrices in order to optimize its performance. Specifically, some of the most efficient approximate algorithms for computing low-rank app… ▽ More

    Submitted 30 October, 2019; originally announced October 2019.

    Comments: NeurIPS 2019

  27. arXiv:1910.04126  [pdf, other

    cs.DS

    Scalable Nearest Neighbor Search for Optimal Transport

    Authors: Arturs Backurs, Yihe Dong, Piotr Indyk, Ilya Razenshteyn, Tal Wagner

    Abstract: The Optimal Transport (a.k.a. Wasserstein) distance is an increasingly popular similarity measure for rich data domains, such as images or text documents. This raises the necessity for fast nearest neighbor search algorithms according to this distance, which poses a substantial computational bottleneck on massive datasets. In this work we introduce Flowtree, a fast and accurate approximation algor… ▽ More

    Submitted 28 September, 2020; v1 submitted 9 October, 2019; originally announced October 2019.

    Comments: ICML 2020

  28. arXiv:1908.05198  [pdf, other

    cs.DS

    (Learned) Frequency Estimation Algorithms under Zipfian Distribution

    Authors: Anders Aamand, Piotr Indyk, Ali Vakilian

    Abstract: \begin{abstract} The frequencies of the elements in a data stream are an important statistical measure and the task of estimating them arises in many applications within data analysis and machine learning. Two of the most popular algorithms for this problem, Count-Min and Count-Sketch, are widely used in practice. In a recent work [Hsu et al., ICLR'19], it was shown empirically that augmenting C… ▽ More

    Submitted 11 August, 2020; v1 submitted 14 August, 2019; originally announced August 2019.

  29. arXiv:1907.03197  [pdf, other

    cs.DS cs.LG

    Composable Core-sets for Determinant Maximization: A Simple Near-Optimal Algorithm

    Authors: Piotr Indyk, Sepideh Mahabadi, Shayan Oveis Gharan, Alireza Rezaei

    Abstract: ``Composable core-sets'' are an efficient framework for solving optimization problems in massive data models. In this work, we consider efficient construction of composable core-sets for the determinant maximization problem. This can also be cast as the MAP inference task for determinantal point processes, that have recently gained a lot of interest for modeling diversity and fairness. The problem… ▽ More

    Submitted 6 July, 2019; originally announced July 2019.

    Comments: This paper has appeared in the 36th International Conference on Machine Learning (ICML), 2019. This is an equal contribution paper

    ACM Class: F.2.0; G.1.2; G.1.6; G.2.2

  30. arXiv:1906.00339  [pdf, other

    cs.DS cs.LG

    Sample-Optimal Low-Rank Approximation of Distance Matrices

    Authors: Piotr Indyk, Ali Vakilian, Tal Wagner, David Woodruff

    Abstract: A distance matrix $A \in \mathbb R^{n \times m}$ represents all pairwise distances, $A_{ij}=\mathrm{d}(x_i,y_j)$, between two point sets $x_1,...,x_n$ and $y_1,...,y_m$ in an arbitrary metric space $(\mathcal Z, \mathrm{d})$. Such matrices arise in various computational contexts such as learning image manifolds, handwriting recognition, and multi-dimensional unfolding. In this work we study algo… ▽ More

    Submitted 2 June, 2019; originally announced June 2019.

    Comments: COLT 2019

  31. arXiv:1902.03534  [pdf, ps, other

    cs.DS cs.DM

    Set Cover in Sub-linear Time

    Authors: Piotr Indyk, Sepideh Mahabadi, Ronitt Rubinfeld, Ali Vakilian, Anak Yodpinyanee

    Abstract: We study the classic set cover problem from the perspective of sub-linear algorithms. Given access to a collection of $m$ sets over $n$ elements in the query model, we show that sub-linear algorithms derived from existing techniques have almost tight query complexities. On one hand, first we show an adaptation of the streaming algorithm presented in Har-Peled et al. [2016] to the sub-linear quer… ▽ More

    Submitted 9 February, 2019; originally announced February 2019.

  32. arXiv:1902.03519  [pdf, other

    cs.DS cs.LG

    Scalable Fair Clustering

    Authors: Arturs Backurs, Piotr Indyk, Krzysztof Onak, Baruch Schieber, Ali Vakilian, Tal Wagner

    Abstract: We study the fair variant of the classic $k$-median problem introduced by Chierichetti et al. [2017]. In the standard $k$-median problem, given an input pointset $P$, the goal is to find $k$ centers $C$ and assign each input point to one of the centers in $C$ such that the average distance of points to their cluster center is minimized. In the fair variant of $k$-median, the points are colored,… ▽ More

    Submitted 10 June, 2019; v1 submitted 9 February, 2019; originally announced February 2019.

    Comments: ICML 2019

  33. arXiv:1901.08544  [pdf, other

    cs.LG cs.CG cs.DS stat.ML

    Learning Space Partitions for Nearest Neighbor Search

    Authors: Yihe Dong, Piotr Indyk, Ilya Razenshteyn, Tal Wagner

    Abstract: Space partitions of $\mathbb{R}^d$ underlie a vast and important class of fast nearest neighbor search (NNS) algorithms. Inspired by recent theoretical work on NNS for general metric spaces [Andoni, Naor, Nikolov, Razenshteyn, Waingarten STOC 2018, FOCS 2018], we develop a new framework for building space partitions reducing the problem to balanced graph partitioning followed by supervised classif… ▽ More

    Submitted 28 September, 2020; v1 submitted 24 January, 2019; originally announced January 2019.

    Comments: ICLR 2020

  34. arXiv:1807.11648  [pdf, ps, other

    cs.DS cs.LG stat.ML

    Composable Core-sets for Determinant Maximization Problems via Spectral Spanners

    Authors: Piotr Indyk, Sepideh Mahabadi, Shayan Oveis Gharan, Alireza Rezaei

    Abstract: We study a spectral generalization of classical combinatorial graph spanners to the spectral setting. Given a set of vectors $V\subseteq \Re^d$, we say a set $U\subseteq V$ is an $α$-spectral spanner if for all $v\in V$ there is a probability distribution $μ_v$ supported on $U$ such that $$vv^\intercal \preceq α\cdot\mathbb{E}_{u\simμ_v} uu^\intercal.$$ We show that any set $V$ has an… ▽ More

    Submitted 16 November, 2019; v1 submitted 30 July, 2018; originally announced July 2018.

    Comments: To appear in SODA 2020

  35. arXiv:1807.00112  [pdf, other

    cs.DS cs.CG

    Approximate Nearest Neighbors in Limited Space

    Authors: Piotr Indyk, Tal Wagner

    Abstract: We consider the $(1+ε)$-approximate nearest neighbor search problem: given a set $X$ of $n$ points in a $d$-dimensional space, build a data structure that, given any query point $y$, finds a point $x \in X$ whose distance to $y$ is at most $(1+ε) \min_{x \in X} \|x-y\|$ for an accuracy parameter $ε\in (0,1)$. Our main result is a data structure that occupies only $O(ε^{-2} n \log(n) \log(1/ε))$ bi… ▽ More

    Submitted 29 June, 2018; originally announced July 2018.

    Comments: COLT 2018

  36. arXiv:1806.09823  [pdf, ps, other

    cs.DS cs.CG cs.DB stat.ML

    Approximate Nearest Neighbor Search in High Dimensions

    Authors: Alexandr Andoni, Piotr Indyk, Ilya Razenshteyn

    Abstract: The nearest neighbor problem is defined as follows: Given a set $P$ of $n$ points in some metric space $(X,D)$, build a data structure that, given any point $q$, returns a point in $P$ that is closest to $q$ (its "nearest neighbor" in $P$). The data structure stores additional information about the set $P$, which is then used to find the nearest neighbor without computing all distances between… ▽ More

    Submitted 26 June, 2018; originally announced June 2018.

    Comments: 27 pages, no figures; to appear in the proceedings of ICM 2018 (accompanying the talk by P. Indyk)

  37. arXiv:1711.01520  [pdf, other

    cs.DS

    Practical Data-Dependent Metric Compression with Provable Guarantees

    Authors: Piotr Indyk, Ilya Razenshteyn, Tal Wagner

    Abstract: We introduce a new distance-preserving compact representation of multi-dimensional point-sets. Given $n$ points in a $d$-dimensional space where each coordinate is represented using $B$ bits (i.e., $dB$ bits per point), it produces a representation of size $O( d \log(d B/ε) + \log n)$ bits per point from which one can approximate the distances up to a factor of $1 \pm ε$. Our algorithm almost matc… ▽ More

    Submitted 4 November, 2017; originally announced November 2017.

    Comments: NIPS 2017

  38. arXiv:1706.06935  [pdf, other

    cs.NI

    Agile Millimeter Wave Networks with Provable Guarantees

    Authors: Haitham Hassanieh, Omid Abari, Michael Rodreguez, Mohammed Abdelghany, Dina Katabi, Piotr Indyk

    Abstract: There is much interest in integrating millimeter wave radios (mmWave) into wireless LANs and 5G cellular networks to benefit from their multiple GHz of available spectrum. Yet unlike existing technologies, e.g., WiFi, mmWave radios require highly directional antennas. Since the antennas have pencil-beams, the transmitter and receiver need to align their antenna beams before they can communicate. E… ▽ More

    Submitted 21 June, 2017; originally announced June 2017.

  39. arXiv:1704.02958  [pdf, ps, other

    cs.CC cs.DS cs.LG stat.ML

    On the Fine-Grained Complexity of Empirical Risk Minimization: Kernel Methods and Neural Networks

    Authors: Arturs Backurs, Piotr Indyk, Ludwig Schmidt

    Abstract: Empirical risk minimization (ERM) is ubiquitous in machine learning and underlies most supervised learning methods. While there has been a large body of work on algorithms for various ERM problems, the exact computational complexity of ERM is still not understood. We address this issue for multiple popular ERM problems including kernel SVMs, kernel ridge regression, and training the final layer of… ▽ More

    Submitted 10 April, 2017; originally announced April 2017.

  40. Approximate Sparse Linear Regression

    Authors: Sariel Har-Peled, Piotr Indyk, Sepideh Mahabadi

    Abstract: In the Sparse Linear Regression (SLR) problem, given a $d \times n$ matrix $M$ and a $d$-dimensional query $q$, the goal is to compute a $k$-sparse $n$-dimensional vector $τ$ such that the error $||M τ-q||$ is minimized. This problem is equivalent to the following geometric problem: given a set $P$ of $n$ points and a query point $q$ in $d$ dimensions, find the closest $k$-dimensional subspace to… ▽ More

    Submitted 28 April, 2018; v1 submitted 27 September, 2016; originally announced September 2016.

  41. arXiv:1609.06295  [pdf, other

    cs.CG

    Near-Optimal (Euclidean) Metric Compression

    Authors: Piotr Indyk, Tal Wagner

    Abstract: The metric sketching problem is defined as follows. Given a metric on $n$ points, and $ε>0$, we wish to produce a small size data structure (sketch) that, given any pair of point indices, recovers the distance between the points up to a $1+ε$ distortion. In this paper we consider metrics induced by $\ell_2$ and $\ell_1$ norms whose spread (the ratio of the diameter to the closest pair distance) is… ▽ More

    Submitted 28 November, 2016; v1 submitted 20 September, 2016; originally announced September 2016.

    Comments: SODA'17

  42. Simultaneous Nearest Neighbor Search

    Authors: Piotr Indyk, Robert Kleinberg, Sepideh Mahabadi, Yang Yuan

    Abstract: Motivated by applications in computer vision and databases, we introduce and study the Simultaneous Nearest Neighbor Search (SNN) problem. Given a set of data points, the goal of SNN is to design a data structure that, given a collection of queries, finds a collection of close points that are compatible with each other. Formally, we are given $k$ query points $Q=q_1,\cdots,q_k$, and a compatibilit… ▽ More

    Submitted 7 April, 2016; originally announced April 2016.

    ACM Class: F.2.2

  43. arXiv:1511.07070  [pdf, ps, other

    cs.CC cs.DS

    Which Regular Expression Patterns are Hard to Match?

    Authors: Arturs Backurs, Piotr Indyk

    Abstract: Regular expressions constitute a fundamental notion in formal language theory and are frequently used in computer science to define search patterns. A classic algorithm for these problems constructs and simulates a non-deterministic finite automaton corresponding to the expression, resulting in an $O(mn)$ running time (where $m$ is the length of the pattern and $n$ is the length of the text). This… ▽ More

    Submitted 26 September, 2016; v1 submitted 22 November, 2015; originally announced November 2015.

  44. arXiv:1509.02897  [pdf, other

    cs.DS cs.CG cs.IR

    Practical and Optimal LSH for Angular Distance

    Authors: Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, Ludwig Schmidt

    Abstract: We show the existence of a Locality-Sensitive Hashing (LSH) family for the angular distance that yields an approximate Near Neighbor Search algorithm with the asymptotically optimal running time exponent. Unlike earlier algorithms with this property (e.g., Spherical LSH [Andoni, Indyk, Nguyen, Razenshteyn 2014], [Andoni, Razenshteyn 2015]), our algorithm is also practical, improving upon the well-… ▽ More

    Submitted 9 September, 2015; originally announced September 2015.

    Comments: 22 pages, an extended abstract is to appear in the proceedings of the 29th Annual Conference on Neural Information Processing Systems (NIPS 2015)

  45. Towards Tight Bounds for the Streaming Set Cover Problem

    Authors: Sariel Har-Peled, Piotr Indyk, Sepideh Mahabadi, Ali Vakilian

    Abstract: We consider the classic Set Cover problem in the data stream model. For $n$ elements and $m$ sets ($m\geq n$) we give a $O(1/δ)$-pass algorithm with a strongly sub-linear $\tilde{O}(mn^δ)$ space and logarithmic approximation factor. This yields a significant improvement over the earlier algorithm of Demaine et al. [DIMV14] that uses exponentially larger number of passes. We complement this result… ▽ More

    Submitted 2 May, 2016; v1 submitted 31 August, 2015; originally announced September 2015.

    Comments: A preliminary version of this paper is to appear in PODS 2016

  46. arXiv:1504.07648  [pdf, ps, other

    cs.IT cs.CC cs.LG math.FA

    Nearly Optimal Deterministic Algorithm for Sparse Walsh-Hadamard Transform

    Authors: Mahdi Cheraghchi, Piotr Indyk

    Abstract: For every fixed constant $α> 0$, we design an algorithm for computing the $k$-sparse Walsh-Hadamard transform of an $N$-dimensional vector $x \in \mathbb{R}^N$ in time $k^{1+α} (\log N)^{O(1)}$. Specifically, the algorithm is given query access to $x$ and computes a $k$-sparse $\tilde{x} \in \mathbb{R}^N$ satisfying $\|\tilde{x} - \hat{x}\|_1 \leq c \|\hat{x} - H_k(\hat{x})\|_1$, for an absolute c… ▽ More

    Submitted 28 April, 2015; originally announced April 2015.

  47. arXiv:1504.01076  [pdf, ps, other

    cs.DS cs.CG cs.IT

    Nearly-optimal bounds for sparse recovery in generic norms, with applications to $k$-median sketching

    Authors: Arturs Backurs, Piotr Indyk, Eric Price, Ilya Razenshteyn, David P. Woodruff

    Abstract: We initiate the study of trade-offs between sparsity and the number of measurements in sparse recovery schemes for generic norms. Specifically, for a norm $\|\cdot\|$, sparsity parameter $k$, approximation factor $K>0$, and probability of failure $P>0$, we ask: what is the minimal value of $m$ so that there is a distribution over $m \times n$ matrices $A$ with the property that for any $x$, given… ▽ More

    Submitted 4 April, 2015; originally announced April 2015.

    Comments: 29 pages

  48. arXiv:1412.3040  [pdf, other

    cs.DB

    Rapid Sampling for Visualizations with Ordering Guarantees

    Authors: Albert Kim, Eric Blais, Aditya Parameswaran, Piotr Indyk, Sam Madden, Ronitt Rubinfeld

    Abstract: Visualizations are frequently used as a means to understand trends and gather insights from datasets, but often take a long time to generate. In this paper, we focus on the problem of rapidly generating approximate visualizations while preserving crucial visual proper- ties of interest to analysts. Our primary focus will be on sampling algorithms that preserve the visual property of ordering; our… ▽ More

    Submitted 9 December, 2014; originally announced December 2014.

    Comments: Tech Report. 17 pages. Condensed version to appear in VLDB Vol. 8 No. 5

  49. arXiv:1412.0348  [pdf, ps, other

    cs.CC cs.DS

    Edit Distance Cannot Be Computed in Strongly Subquadratic Time (unless SETH is false)

    Authors: Arturs Backurs, Piotr Indyk

    Abstract: The edit distance (a.k.a. the Levenshtein distance) between two strings is defined as the minimum number of insertions, deletions or substitutions of symbols needed to transform one string into another. The problem of computing the edit distance between two strings is a classical computational task, with a well-known algorithm based on dynamic programming. Unfortunately, all known algorithms for t… ▽ More

    Submitted 15 August, 2017; v1 submitted 30 November, 2014; originally announced December 2014.

    Comments: STOC'15

  50. arXiv:1406.1579  [pdf, ps, other

    cs.IT cs.DS

    Approximation Algorithms for Model-Based Compressive Sensing

    Authors: Chinmay Hegde, Piotr Indyk, Ludwig Schmidt

    Abstract: Compressive Sensing (CS) stipulates that a sparse signal can be recovered from a small number of linear measurements, and that this recovery can be performed efficiently in polynomial time. The framework of model-based compressive sensing (model-CS) leverages additional structure in the signal and prescribes new recovery schemes that can reduce the number of measurements even further. However, mod… ▽ More

    Submitted 20 April, 2015; v1 submitted 6 June, 2014; originally announced June 2014.