subscribe to arXiv mailings

Credit Attribution and Stable Compression

Authors: Roi Livni, Shay Moran, Kobbi Nissim, Chirag Pabbaraju

Abstract: Credit attribution is crucial across various fields. In academic research, proper citation acknowledges prior work and establishes original contributions. Similarly, in generative models, such as those trained on existing artworks or music, it is important to ensure that any generated content influenced by these works appropriately credits the original creators. We study credit attribution by ma… ▽ More Credit attribution is crucial across various fields. In academic research, proper citation acknowledges prior work and establishes original contributions. Similarly, in generative models, such as those trained on existing artworks or music, it is important to ensure that any generated content influenced by these works appropriately credits the original creators. We study credit attribution by machine learning algorithms. We propose new definitions--relaxations of Differential Privacy--that weaken the stability guarantees for a designated subset of $k$ datapoints. These $k$ datapoints can be used non-stably with permission from their owners, potentially in exchange for compensation. Meanwhile, the remaining datapoints are guaranteed to have no significant influence on the algorithm's output. Our framework extends well-studied notions of stability, including Differential Privacy ($k = 0$), differentially private learning with public data (where the $k$ public datapoints are fixed in advance), and stable sample compression (where the $k$ datapoints are selected adaptively by the algorithm). We examine the expressive power of these stability notions within the PAC learning framework, provide a comprehensive characterization of learnability for algorithms adhering to these principles, and propose directions and questions for future research. △ Less

Submitted 22 June, 2024; originally announced June 2024.

Comments: 15 pages, 1 figure

arXiv:2404.04931 [pdf, other]

The Sample Complexity of Gradient Descent in Stochastic Convex Optimization

Authors: Roi Livni

Abstract: We analyze the sample complexity of full-batch Gradient Descent (GD) in the setup of non-smooth Stochastic Convex Optimization. We show that the generalization error of GD, with common choice of hyper-parameters, can be $\tilde Θ(d/m + 1/\sqrt{m})$, where $d$ is the dimension and $m$ is the sample size. This matches the sample complexity of \emph{worst-case} empirical risk minimizers. That means t… ▽ More We analyze the sample complexity of full-batch Gradient Descent (GD) in the setup of non-smooth Stochastic Convex Optimization. We show that the generalization error of GD, with common choice of hyper-parameters, can be $\tilde Θ(d/m + 1/\sqrt{m})$, where $d$ is the dimension and $m$ is the sample size. This matches the sample complexity of \emph{worst-case} empirical risk minimizers. That means that, in contrast with other algorithms, GD has no advantage over naive ERMs. Our bound follows from a new generalization bound that depends on both the dimension as well as the learning rate and number of iterations. Our bound also shows that, for general hyper-parameters, when the dimension is strictly larger than number of samples, $T=Ω(1/ε^4)$ iterations are necessary to avoid overfitting. This resolves an open problem by Schlisserman et al.23 and Amir er Al.21, and improves over previous lower bounds that demonstrated that the sample size must be at least square root of the dimension. △ Less

Submitted 10 April, 2024; v1 submitted 7 April, 2024; originally announced April 2024.

arXiv:2403.17691 [pdf, other]

Not All Similarities Are Created Equal: Leveraging Data-Driven Biases to Inform GenAI Copyright Disputes

Authors: Uri Hacohen, Adi Haviv, Shahar Sarfaty, Bruria Friedman, Niva Elkin-Koren, Roi Livni, Amit H Bermano

Abstract: The advent of Generative Artificial Intelligence (GenAI) models, including GitHub Copilot, OpenAI GPT, and Stable Diffusion, has revolutionized content creation, enabling non-professionals to produce high-quality content across various domains. This transformative technology has led to a surge of synthetic content and sparked legal disputes over copyright infringement. To address these challenges,… ▽ More The advent of Generative Artificial Intelligence (GenAI) models, including GitHub Copilot, OpenAI GPT, and Stable Diffusion, has revolutionized content creation, enabling non-professionals to produce high-quality content across various domains. This transformative technology has led to a surge of synthetic content and sparked legal disputes over copyright infringement. To address these challenges, this paper introduces a novel approach that leverages the learning capacity of GenAI models for copyright legal analysis, demonstrated with GPT2 and Stable Diffusion models. Copyright law distinguishes between original expressions and generic ones (Scènes à faire), protecting the former and permitting reproduction of the latter. However, this distinction has historically been challenging to make consistently, leading to over-protection of copyrighted works. GenAI offers an unprecedented opportunity to enhance this legal analysis by revealing shared patterns in preexisting works. We propose a data-driven approach to identify the genericity of works created by GenAI, employing "data-driven bias" to assess the genericity of expressive compositions. This approach aids in copyright scope determination by utilizing the capabilities of GenAI to identify and prioritize expressive elements and rank them according to their frequency in the model's dataset. The potential implications of measuring expressive genericity for copyright law are profound. Such scoring could assist courts in determining copyright scope during litigation, inform the registration practices of Copyright Offices, allowing registration of only highly original synthetic works, and help copyright owners signal the value of their works and facilitate fairer licensing deals. More generally, this approach offers valuable insights to policymakers grappling with adapting copyright law to the challenges posed by the era of GenAI. △ Less

Submitted 7 May, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

Comments: Presented at ACM CSLAW 2024

arXiv:2402.09327 [pdf, other]

Information Complexity of Stochastic Convex Optimization: Applications to Generalization and Memorization

Authors: Idan Attias, Gintare Karolina Dziugaite, Mahdi Haghifam, Roi Livni, Daniel M. Roy

Abstract: In this work, we investigate the interplay between memorization and learning in the context of \emph{stochastic convex optimization} (SCO). We define memorization via the information a learning algorithm reveals about its training data points. We then quantify this information using the framework of conditional mutual information (CMI) proposed by Steinke and Zakynthinou (2020). Our main result is… ▽ More In this work, we investigate the interplay between memorization and learning in the context of \emph{stochastic convex optimization} (SCO). We define memorization via the information a learning algorithm reveals about its training data points. We then quantify this information using the framework of conditional mutual information (CMI) proposed by Steinke and Zakynthinou (2020). Our main result is a precise characterization of the tradeoff between the accuracy of a learning algorithm and its CMI, answering an open question posed by Livni (2023). We show that, in the $L^2$ Lipschitz--bounded setting and under strong convexity, every learner with an excess error $\varepsilon$ has CMI bounded below by $Ω(1/\varepsilon^2)$ and $Ω(1/\varepsilon)$, respectively. We further demonstrate the essential role of memorization in learning problems in SCO by designing an adversary capable of accurately identifying a significant fraction of the training samples in specific SCO problems. Finally, we enumerate several implications of our results, such as a limitation of generalization bounds based on CMI and the incompressibility of samples in SCO problems. △ Less

Submitted 14 February, 2024; originally announced February 2024.

Comments: 44 Pages

arXiv:2311.05398 [pdf, other]

The Sample Complexity Of ERMs In Stochastic Convex Optimization

Authors: Daniel Carmon, Roi Livni, Amir Yehudayoff

Abstract: Stochastic convex optimization is one of the most well-studied models for learning in modern machine learning. Nevertheless, a central fundamental question in this setup remained unresolved: "How many data points must be observed so that any empirical risk minimizer (ERM) shows good performance on the true population?" This question was proposed by Feldman (2016), who proved that… ▽ More Stochastic convex optimization is one of the most well-studied models for learning in modern machine learning. Nevertheless, a central fundamental question in this setup remained unresolved: "How many data points must be observed so that any empirical risk minimizer (ERM) shows good performance on the true population?" This question was proposed by Feldman (2016), who proved that $Ω(\frac{d}ε+\frac{1}{ε^2})$ data points are necessary (where $d$ is the dimension and $ε>0$ is the accuracy parameter). Proving an $ω(\frac{d}ε+\frac{1}{ε^2})$ lower bound was left as an open problem. In this work we show that in fact $\tilde{O}(\frac{d}ε+\frac{1}{ε^2})$ data points are also sufficient. This settles the question and yields a new separation between ERMs and uniform convergence. This sample complexity holds for the classical setup of learning bounded convex Lipschitz functions over the Euclidean unit ball. We further generalize the result and show that a similar upper bound holds for all symmetric convex bodies. The general bound is composed of two terms: (i) a term of the form $\tilde{O}(\frac{d}ε)$ with an inverse-linear dependence on the accuracy parameter, and (ii) a term that depends on the statistical complexity of the class of $\textit{linear}$ functions (captured by the Rademacher complexity). The proof builds a mechanism for controlling the behavior of stochastic convex optimization problems. △ Less

Submitted 9 November, 2023; originally announced November 2023.

arXiv:2305.14822 [pdf, other]

Can Copyright be Reduced to Privacy?

Authors: Niva Elkin-Koren, Uri Hacohen, Roi Livni, Shay Moran

Abstract: There is a growing concern that generative AI models will generate outputs closely resembling the copyrighted materials for which they are trained. This worry has intensified as the quality and complexity of generative models have immensely improved, and the availability of extensive datasets containing copyrighted material has expanded. Researchers are actively exploring strategies to mitigate th… ▽ More There is a growing concern that generative AI models will generate outputs closely resembling the copyrighted materials for which they are trained. This worry has intensified as the quality and complexity of generative models have immensely improved, and the availability of extensive datasets containing copyrighted material has expanded. Researchers are actively exploring strategies to mitigate the risk of generating infringing samples, with a recent line of work suggesting to employ techniques such as differential privacy and other forms of algorithmic stability to provide guarantees on the lack of infringing copying. In this work, we examine whether such algorithmic stability techniques are suitable to ensure the responsible use of generative models without inadvertently violating copyright laws. We argue that while these techniques aim to verify the presence of identifiable information in datasets, thus being privacy-oriented, copyright law aims to promote the use of original works for the benefit of society as a whole, provided that no unlicensed use of protected expression occurred. These fundamental differences between privacy and copyright must not be overlooked. In particular, we demonstrate that while algorithmic stability may be perceived as a practical tool to detect copying, such copying does not necessarily constitute copyright infringement. Therefore, if adopted as a standard for detecting an establishing copyright infringement, algorithmic stability may undermine the intended objectives of copyright law. △ Less

Submitted 24 March, 2024; v1 submitted 24 May, 2023; originally announced May 2023.

arXiv:2302.04925 [pdf, ps, other]

Information Theoretic Lower Bounds for Information Theoretic Upper Bounds

Authors: Roi Livni

Abstract: We examine the relationship between the mutual information between the output model and the empirical sample and the generalization of the algorithm in the context of stochastic convex optimization. Despite increasing interest in information-theoretic generalization bounds, it is uncertain if these bounds can provide insight into the exceptional performance of various learning algorithms. Our stud… ▽ More We examine the relationship between the mutual information between the output model and the empirical sample and the generalization of the algorithm in the context of stochastic convex optimization. Despite increasing interest in information-theoretic generalization bounds, it is uncertain if these bounds can provide insight into the exceptional performance of various learning algorithms. Our study of stochastic convex optimization reveals that, for true risk minimization, dimension-dependent mutual information is necessary. This indicates that existing information-theoretic generalization bounds fall short in capturing the generalization capabilities of algorithms like SGD and regularized ERM, which have dimension-independent sample complexity. △ Less

Submitted 14 January, 2024; v1 submitted 9 February, 2023; originally announced February 2023.

arXiv:2206.03098 [pdf, ps, other]

Better Best of Both Worlds Bounds for Bandits with Switching Costs

Authors: Idan Amir, Guy Azov, Tomer Koren, Roi Livni

Abstract: We study best-of-both-worlds algorithms for bandits with switching cost, recently addressed by Rouyer, Seldin and Cesa-Bianchi, 2021. We introduce a surprisingly simple and effective algorithm that simultaneously achieves minimax optimal regret bound of $\mathcal{O}(T^{2/3})$ in the oblivious adversarial setting and a bound of $\mathcal{O}(\min\{\log (T)/Δ^2,T^{2/3}\})$ in the stochastically-const… ▽ More We study best-of-both-worlds algorithms for bandits with switching cost, recently addressed by Rouyer, Seldin and Cesa-Bianchi, 2021. We introduce a surprisingly simple and effective algorithm that simultaneously achieves minimax optimal regret bound of $\mathcal{O}(T^{2/3})$ in the oblivious adversarial setting and a bound of $\mathcal{O}(\min\{\log (T)/Δ^2,T^{2/3}\})$ in the stochastically-constrained regime, both with (unit) switching costs, where $Δ$ is the gap between the arms. In the stochastically constrained case, our bound improves over previous results due to Rouyer et al., that achieved regret of $\mathcal{O}(T^{1/3}/Δ)$. We accompany our results with a lower bound showing that, in general, $\tildeΩ(\min\{1/Δ^2,T^{2/3}\})$ regret is unavoidable in the stochastically-constrained case for algorithms with $\mathcal{O}(T^{2/3})$ worst-case regret. △ Less

Submitted 2 November, 2022; v1 submitted 7 June, 2022; originally announced June 2022.

arXiv:2204.08809 [pdf, ps, other]

Making Progress Based on False Discoveries

Authors: Roi Livni

Abstract: The study of adaptive data analysis examines how many statistical queries can be answered accurately using a fixed dataset while avoiding false discoveries (statistically inaccurate answers). In this paper, we tackle a question that precedes the field of study: Is data only valuable when it provides accurate answers to statistical queries? To answer this question, we use Stochastic Convex Optimiza… ▽ More The study of adaptive data analysis examines how many statistical queries can be answered accurately using a fixed dataset while avoiding false discoveries (statistically inaccurate answers). In this paper, we tackle a question that precedes the field of study: Is data only valuable when it provides accurate answers to statistical queries? To answer this question, we use Stochastic Convex Optimization as a case study. In this model, algorithms are considered as analysts who query an estimate of the gradient of a noisy function at each iteration and move towards its minimizer. It is known that $O(1/ε^2)$ examples can be used to minimize the objective function, but none of the existing methods depend on the accuracy of the estimated gradients along the trajectory. Therefore, we ask: How many samples are needed to minimize a noisy convex function if we require $ε$-accurate estimates of $O(1/ε^2)$ gradients? Or, might it be that inaccurate gradient estimates are \emph{necessary} for finding the minimum of a stochastic convex function at an optimal statistical rate? We provide two partial answers to this question. First, we show that a general analyst (queries that may be maliciously chosen) requires $Ω(1/ε^3)$ samples, ruling out the possibility of a foolproof mechanism. Second, we show that, under certain assumptions on the oracle, $\tilde Ω(1/ε^{2.5})$ samples are necessary for gradient descent to interact with the oracle. Our results are in contrast to classical bounds that show that $O(1/ε^2)$ samples can optimize the population risk to an accuracy of $O(ε)$, but with spurious gradients. △ Less

Submitted 7 February, 2023; v1 submitted 19 April, 2022; originally announced April 2022.

arXiv:2202.13361 [pdf, other]

Benign Underfitting of Stochastic Gradient Descent

Authors: Tomer Koren, Roi Livni, Yishay Mansour, Uri Sherman

Abstract: We study to what extent may stochastic gradient descent (SGD) be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to training data. We consider the fundamental stochastic convex optimization framework, where (one pass, without-replacement) SGD is classically known to minimize the population risk at rate $O(1/\sqrt n)$, and prove that, su… ▽ More We study to what extent may stochastic gradient descent (SGD) be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to training data. We consider the fundamental stochastic convex optimization framework, where (one pass, without-replacement) SGD is classically known to minimize the population risk at rate $O(1/\sqrt n)$, and prove that, surprisingly, there exist problem instances where the SGD solution exhibits both empirical risk and generalization gap of $Ω(1)$. Consequently, it turns out that SGD is not algorithmically stable in any sense, and its generalization ability cannot be explained by uniform convergence or any other currently known generalization bound technique for that matter (other than that of its classical analysis). We then continue to analyze the closely related with-replacement SGD, for which we show that an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate. Finally, we interpret our main results in the context of without-replacement SGD for finite-sum convex optimization problems, and derive upper and lower bounds for the multi-epoch regime that significantly improve upon previously known results. △ Less

Submitted 12 January, 2023; v1 submitted 27 February, 2022; originally announced February 2022.

arXiv:2202.13328 [pdf, ps, other]

Thinking Outside the Ball: Optimal Learning with Gradient Descent for Generalized Linear Stochastic Convex Optimization

Authors: Idan Amir, Roi Livni, Nathan Srebro

Abstract: We consider linear prediction with a convex Lipschitz loss, or more generally, stochastic convex optimization problems of generalized linear form, i.e.~where each instantaneous loss is a scalar convex function of a linear function. We show that in this setting, early stopped Gradient Descent (GD), without any explicit regularization or projection, ensures excess error at most $ε$ (compared to the… ▽ More We consider linear prediction with a convex Lipschitz loss, or more generally, stochastic convex optimization problems of generalized linear form, i.e.~where each instantaneous loss is a scalar convex function of a linear function. We show that in this setting, early stopped Gradient Descent (GD), without any explicit regularization or projection, ensures excess error at most $ε$ (compared to the best possible with unit Euclidean norm) with an optimal, up to logarithmic factors, sample complexity of $\tilde{O}(1/ε^2)$ and only $\tilde{O}(1/ε^2)$ iterations. This contrasts with general stochastic convex optimization, where $Ω(1/ε^4)$ iterations are needed Amir et al. [2021b]. The lower iteration complexity is ensured by leveraging uniform convergence rather than stability. But instead of uniform convergence in a norm ball, which we show can guarantee suboptimal learning using $Θ(1/ε^4)$ samples, we rely on uniform convergence in a distribution-dependent ball. △ Less

Submitted 30 October, 2022; v1 submitted 27 February, 2022; originally announced February 2022.

arXiv:2107.00469 [pdf, ps, other]

Never Go Full Batch (in Stochastic Convex Optimization)

Authors: Idan Amir, Yair Carmon, Tomer Koren, Roi Livni

Abstract: We study the generalization performance of $\text{full-batch}$ optimization algorithms for stochastic convex optimization: these are first-order methods that only access the exact gradient of the empirical risk (rather than gradients with respect to individual data points), that include a wide range of algorithms such as gradient descent, mirror descent, and their regularized and/or accelerated va… ▽ More We study the generalization performance of $\text{full-batch}$ optimization algorithms for stochastic convex optimization: these are first-order methods that only access the exact gradient of the empirical risk (rather than gradients with respect to individual data points), that include a wide range of algorithms such as gradient descent, mirror descent, and their regularized and/or accelerated variants. We provide a new separation result showing that, while algorithms such as stochastic gradient descent can generalize and optimize the population risk to within $ε$ after $O(1/ε^2)$ iterations, full-batch methods either need at least $Ω(1/ε^4)$ iterations or exhibit a dimension-dependent sample complexity. △ Less

Submitted 29 June, 2021; originally announced July 2021.

arXiv:2106.13513 [pdf, ps, other]

Littlestone Classes are Privately Online Learnable

Authors: Noah Golowich, Roi Livni

Abstract: We consider the problem of online classification under a privacy constraint. In this setting a learner observes sequentially a stream of labelled examples $(x_t, y_t)$, for $1 \leq t \leq T$, and returns at each iteration $t$ a hypothesis $h_t$ which is used to predict the label of each new example $x_t$. The learner's performance is measured by her regret against a known hypothesis class… ▽ More We consider the problem of online classification under a privacy constraint. In this setting a learner observes sequentially a stream of labelled examples $(x_t, y_t)$, for $1 \leq t \leq T$, and returns at each iteration $t$ a hypothesis $h_t$ which is used to predict the label of each new example $x_t$. The learner's performance is measured by her regret against a known hypothesis class $\mathcal{H}$. We require that the algorithm satisfies the following privacy constraint: the sequence $h_1, \ldots, h_T$ of hypotheses output by the algorithm needs to be an $(ε, δ)$-differentially private function of the whole input sequence $(x_1, y_1), \ldots, (x_T, y_T)$. We provide the first non-trivial regret bound for the realizable setting. Specifically, we show that if the class $\mathcal{H}$ has constant Littlestone dimension then, given an oblivious sequence of labelled examples, there is a private learner that makes in expectation at most $O(\log T)$ mistakes -- comparable to the optimal mistake bound in the non-private case, up to a logarithmic factor. Moreover, for general values of the Littlestone dimension $d$, the same mistake bound holds but with a doubly-exponential in $d$ factor. A recent line of work has demonstrated a strong connection between classes that are online learnable and those that are differentially-private learnable. Our results strengthen this connection and show that an online learning algorithm can in fact be directly privatized (in the realizable setting). We also discuss an adaptive setting and provide a sublinear regret bound of $O(\sqrt{T})$. △ Less

Submitted 25 June, 2021; originally announced June 2021.

arXiv:2102.01646 [pdf, ps, other]

Online Learning with Simple Predictors and a Combinatorial Characterization of Minimax in 0/1 Games

Authors: Steve Hanneke, Roi Livni, Shay Moran

Abstract: Which classes can be learned properly in the online model? -- that is, by an algorithm that at each round uses a predictor from the concept class. While there are simple and natural cases where improper learning is necessary, it is natural to ask how complex must the improper predictors be in such cases. Can one always achieve nearly optimal mistake/regret bounds using "simple" predictors? In th… ▽ More Which classes can be learned properly in the online model? -- that is, by an algorithm that at each round uses a predictor from the concept class. While there are simple and natural cases where improper learning is necessary, it is natural to ask how complex must the improper predictors be in such cases. Can one always achieve nearly optimal mistake/regret bounds using "simple" predictors? In this work, we give a complete characterization of when this is possible, thus settling an open problem which has been studied since the pioneering works of Angluin (1987) and Littlestone (1988). More precisely, given any concept class C and any hypothesis class H, we provide nearly tight bounds (up to a log factor) on the optimal mistake bounds for online learning C using predictors from H. Our bound yields an exponential improvement over the previously best known bound by Chase and Freitag (2020). As applications, we give constructive proofs showing that (i) in the realizable setting, a near-optimal mistake bound (up to a constant factor) can be attained by a sparse majority-vote of proper predictors, and (ii) in the agnostic setting, a near-optimal regret bound (up to a log factor) can be attained by a randomized proper algorithm. A technical ingredient of our proof which may be of independent interest is a generalization of the celebrated Minimax Theorem (von Neumann, 1928) for binary zero-sum games. A simple game which fails to satisfy Minimax is "Guess the Larger Number", where each player picks a number and the larger number wins. The payoff matrix is infinite triangular. We show this is the only obstruction: if a game does not contain triangular submatrices of unbounded sizes then the Minimax Theorem holds. This generalizes von Neumann's Minimax Theorem by removing requirements of finiteness (or compactness), and captures precisely the games of interest in online learning. △ Less

Submitted 2 February, 2021; originally announced February 2021.

arXiv:2102.01117 [pdf, ps, other]

SGD Generalizes Better Than GD (And Regularization Doesn't Help)

Authors: Idan Amir, Tomer Koren, Roi Livni

Abstract: We give a new separation result between the generalization performance of stochastic gradient descent (SGD) and of full-batch gradient descent (GD) in the fundamental stochastic convex optimization model. While for SGD it is well-known that $O(1/ε^2)$ iterations suffice for obtaining a solution with $ε$ excess expected risk, we show that with the same number of steps GD may overfit and emit a solu… ▽ More We give a new separation result between the generalization performance of stochastic gradient descent (SGD) and of full-batch gradient descent (GD) in the fundamental stochastic convex optimization model. While for SGD it is well-known that $O(1/ε^2)$ iterations suffice for obtaining a solution with $ε$ excess expected risk, we show that with the same number of steps GD may overfit and emit a solution with $Ω(1)$ generalization error. Moreover, we show that in fact $Ω(1/ε^4)$ iterations are necessary for GD to match the generalization performance of SGD, which is also tight due to recent work by Bassily et al. (2020). We further discuss how regularizing the empirical risk minimized by GD essentially does not change the above result, and revisit the concepts of stability, implicit bias and the role of the learning algorithm in generalization. △ Less

Submitted 29 June, 2021; v1 submitted 1 February, 2021; originally announced February 2021.

Journal ref: Conference on Learning Theory 2021

arXiv:2006.13508 [pdf, ps, other]

A Limitation of the PAC-Bayes Framework

Authors: Roi Livni, Shay Moran

Abstract: PAC-Bayes is a useful framework for deriving generalization bounds which was introduced by McAllester ('98). This framework has the flexibility of deriving distribution- and algorithm-dependent bounds, which are often tighter than VC-related uniform convergence bounds. In this manuscript we present a limitation for the PAC-Bayes framework. We demonstrate an easy learning task that is not amenable… ▽ More PAC-Bayes is a useful framework for deriving generalization bounds which was introduced by McAllester ('98). This framework has the flexibility of deriving distribution- and algorithm-dependent bounds, which are often tighter than VC-related uniform convergence bounds. In this manuscript we present a limitation for the PAC-Bayes framework. We demonstrate an easy learning task that is not amenable to a PAC-Bayes analysis. Specifically, we consider the task of linear classification in 1D; it is well-known that this task is learnable using just $O(\log(1/δ)/ε)$ examples. On the other hand, we show that this fact can not be proved using a PAC-Bayes analysis: for any algorithm that learns 1-dimensional linear classifiers there exists a (realizable) distribution for which the PAC-Bayes bound is arbitrarily large. △ Less

Submitted 3 September, 2021; v1 submitted 24 June, 2020; originally announced June 2020.

arXiv:2003.06152 [pdf, other]

Can Implicit Bias Explain Generalization? Stochastic Convex Optimization as a Case Study

Authors: Assaf Dauber, Meir Feder, Tomer Koren, Roi Livni

Abstract: The notion of implicit bias, or implicit regularization, has been suggested as a means to explain the surprising generalization ability of modern-days overparameterized learning algorithms. This notion refers to the tendency of the optimization algorithm towards a certain structured solution that often generalizes well. Recently, several papers have studied implicit regularization and were able to… ▽ More The notion of implicit bias, or implicit regularization, has been suggested as a means to explain the surprising generalization ability of modern-days overparameterized learning algorithms. This notion refers to the tendency of the optimization algorithm towards a certain structured solution that often generalizes well. Recently, several papers have studied implicit regularization and were able to identify this phenomenon in various scenarios. We revisit this paradigm in arguably the simplest non-trivial setup, and study the implicit bias of Stochastic Gradient Descent (SGD) in the context of Stochastic Convex Optimization. As a first step, we provide a simple construction that rules out the existence of a \emph{distribution-independent} implicit regularizer that governs the generalization ability of SGD. We then demonstrate a learning problem that rules out a very general class of \emph{distribution-dependent} implicit regularizers from explaining generalization, which includes strongly convex regularizers as well as non-degenerate norm-based regularizations. Certain aspects of our constructions point out to significant difficulties in providing a comprehensive explanation of an algorithm's generalization performance by solely arguing about its implicit regularization properties. △ Less

Submitted 22 December, 2020; v1 submitted 13 March, 2020; originally announced March 2020.

arXiv:2003.00563 [pdf, other]

An Equivalence Between Private Classification and Online Prediction

Authors: Mark Bun, Roi Livni, Shay Moran

Abstract: We prove that every concept class with finite Littlestone dimension can be learned by an (approximate) differentially-private algorithm. This answers an open question of Alon et al. (STOC 2019) who proved the converse statement (this question was also asked by Neel et al.~(FOCS 2019)). Together these two results yield an equivalence between online learnability and private PAC learnability. We in… ▽ More We prove that every concept class with finite Littlestone dimension can be learned by an (approximate) differentially-private algorithm. This answers an open question of Alon et al. (STOC 2019) who proved the converse statement (this question was also asked by Neel et al.~(FOCS 2019)). Together these two results yield an equivalence between online learnability and private PAC learnability. We introduce a new notion of algorithmic stability called "global stability" which is essential to our proof and may be of independent interest. We also discuss an application of our results to boosting the privacy and accuracy parameters of differentially-private learners. △ Less

Submitted 22 June, 2021; v1 submitted 1 March, 2020; originally announced March 2020.

Comments: An earlier version of this manuscript claimed an upper bound over the sample complexity that is exponential in the Littlestone dimension. The argument was erranous, and the current version contains a correction, which leads to double-exponential dependence in the Littlestone-dimension

arXiv:2002.10286 [pdf, other]

Prediction with Corrupted Expert Advice

Authors: Idan Amir, Idan Attias, Tomer Koren, Roi Livni, Yishay Mansour

Abstract: We revisit the fundamental problem of prediction with expert advice, in a setting where the environment is benign and generates losses stochastically, but the feedback observed by the learner is subject to a moderate adversarial corruption. We prove that a variant of the classical Multiplicative Weights algorithm with decreasing step sizes achieves constant regret in this setting and performs opti… ▽ More We revisit the fundamental problem of prediction with expert advice, in a setting where the environment is benign and generates losses stochastically, but the feedback observed by the learner is subject to a moderate adversarial corruption. We prove that a variant of the classical Multiplicative Weights algorithm with decreasing step sizes achieves constant regret in this setting and performs optimally in a wide range of environments, regardless of the magnitude of the injected corruption. Our results reveal a surprising disparity between the often comparable Follow the Regularized Leader (FTRL) and Online Mirror Descent (OMD) frameworks: we show that for experts in the corrupted stochastic regime, the regret performance of OMD is in fact strictly inferior to that of FTRL. △ Less

Submitted 20 October, 2020; v1 submitted 24 February, 2020; originally announced February 2020.

Comments: NeurIPS 2020 Camera Ready

Journal ref: Conference on Neural Information Processing Systems 2020

arXiv:1906.00264 [pdf, ps, other]

Graph-based Discriminators: Sample Complexity and Expressiveness

Authors: Roi Livni, Yishay Mansour

Abstract: A basic question in learning theory is to identify if two distributions are identical when we have access only to examples sampled from the distributions. This basic task is considered, for example, in the context of Generative Adversarial Networks (GANs), where a discriminator is trained to distinguish between a real-life distribution and a synthetic distribution. % Classically, we use a hypothes… ▽ More A basic question in learning theory is to identify if two distributions are identical when we have access only to examples sampled from the distributions. This basic task is considered, for example, in the context of Generative Adversarial Networks (GANs), where a discriminator is trained to distinguish between a real-life distribution and a synthetic distribution. % Classically, we use a hypothesis class $H$ and claim that the two distributions are distinct if for some $h\in H$ the expected value on the two distributions is (significantly) different. Our starting point is the following fundamental problem: "is having the hypothesis dependent on more than a single random example beneficial". To address this challenge we define $k$-ary based discriminators, which have a family of Boolean $k$-ary functions $\mathcal{G}$. Each function $g\in \mathcal{G}$ naturally defines a hyper-graph, indicating whether a given hyper-edge exists. A function $g\in \mathcal{G}$ distinguishes between two distributions, if the expected value of $g$, on a $k$-tuple of i.i.d examples, on the two distributions is (significantly) different. We study the expressiveness of families of $k$-ary functions, compared to the classical hypothesis class $H$, which is $k=1$. We show a separation in expressiveness of $k+1$-ary versus $k$-ary functions. This demonstrate the great benefit of having $k\geq 2$ as distinguishers. For $k\geq 2$ we introduce a notion similar to the VC-dimension, and show that it controls the sample complexity. We proceed and provide upper and lower bounds as a function of our extended notion of VC-dimension. △ Less

Submitted 1 June, 2019; originally announced June 2019.

arXiv:1902.04782 [pdf, ps, other]

On the Expressive Power of Kernel Methods and the Efficiency of Kernel Learning by Association Schemes

Authors: Pravesh K. Kothari, Roi Livni

Abstract: We study the expressive power of kernel methods and the algorithmic feasibility of multiple kernel learning for a special rich class of kernels. Specifically, we define \emph{Euclidean kernels}, a diverse class that includes most, if not all, families of kernels studied in literature such as polynomial kernels and radial basis functions. We then describe the geometric and spectral structure of t… ▽ More We study the expressive power of kernel methods and the algorithmic feasibility of multiple kernel learning for a special rich class of kernels. Specifically, we define \emph{Euclidean kernels}, a diverse class that includes most, if not all, families of kernels studied in literature such as polynomial kernels and radial basis functions. We then describe the geometric and spectral structure of this family of kernels over the hypercube (and to some extent for any compact domain). Our structural results allow us to prove meaningful limitations on the expressive power of the class as well as derive several efficient algorithms for learning kernels over different domains. △ Less

Submitted 13 February, 2019; originally announced February 2019.

arXiv:1902.03468 [pdf, ps, other]

Synthetic Data Generators: Sequential and Private

Authors: Olivier Bousquet, Roi Livni, Shay Moran

Abstract: We study the sample complexity of private synthetic data generation over an unbounded sized class of statistical queries, and show that any class that is privately proper PAC learnable admits a private synthetic data generator (perhaps non-efficient). Previous work on synthetic data generators focused on the case that the query class $\mathcal{D}$ is finite and obtained sample complexity bounds th… ▽ More We study the sample complexity of private synthetic data generation over an unbounded sized class of statistical queries, and show that any class that is privately proper PAC learnable admits a private synthetic data generator (perhaps non-efficient). Previous work on synthetic data generators focused on the case that the query class $\mathcal{D}$ is finite and obtained sample complexity bounds that scale logarithmically with the size $|\mathcal{D}|$. Here we construct a private synthetic data generator whose sample complexity is independent of the domain size, and we replace finiteness with the assumption that $\mathcal{D}$ is privately PAC learnable (a formally weaker task, hence we obtain equivalence between the two tasks). △ Less

Submitted 7 December, 2020; v1 submitted 9 February, 2019; originally announced February 2019.

arXiv:1806.00949 [pdf, other]

Private PAC learning implies finite Littlestone dimension

Authors: Noga Alon, Roi Livni, Maryanthe Malliaris, Shay Moran

Abstract: We show that every approximately differentially private learning algorithm (possibly improper) for a class $H$ with Littlestone dimension~$d$ requires $Ω\bigl(\log^*(d)\bigr)$ examples. As a corollary it follows that the class of thresholds over $\mathbb{N}$ can not be learned in a private manner; this resolves open question due to [Bun et al., 2015, Feldman and Xiao, 2015]. We leave as an open qu… ▽ More We show that every approximately differentially private learning algorithm (possibly improper) for a class $H$ with Littlestone dimension~$d$ requires $Ω\bigl(\log^*(d)\bigr)$ examples. As a corollary it follows that the class of thresholds over $\mathbb{N}$ can not be learned in a private manner; this resolves open question due to [Bun et al., 2015, Feldman and Xiao, 2015]. We leave as an open question whether every class with a finite Littlestone dimension can be learned by an approximately differentially private algorithm. △ Less

Submitted 8 March, 2019; v1 submitted 4 June, 2018; originally announced June 2018.

Comments: STOC camera-ready version

arXiv:1711.05893 [pdf, other]

On Communication Complexity of Classification Problems

Authors: Daniel M. Kane, Roi Livni, Shay Moran, Amir Yehudayoff

Abstract: This work studies distributed learning in the spirit of Yao's model of communication complexity: consider a two-party setting, where each of the players gets a list of labelled examples and they communicate in order to jointly perform some learning task. To naturally fit into the framework of learning theory, the players can send each other examples (as well as bits) where each example/bit costs o… ▽ More This work studies distributed learning in the spirit of Yao's model of communication complexity: consider a two-party setting, where each of the players gets a list of labelled examples and they communicate in order to jointly perform some learning task. To naturally fit into the framework of learning theory, the players can send each other examples (as well as bits) where each example/bit costs one unit of communication. This enables a uniform treatment of infinite classes such as half-spaces in $\mathbb{R}^d$, which are ubiquitous in machine learning. We study several fundamental questions in this model. For example, we provide combinatorial characterizations of the classes that can be learned with efficient communication in the proper-case as well as in the improper-case. These findings imply unconditional separations between various learning contexts, e.g.\ realizable versus agnostic learning, proper versus improper learning, etc. The derivation of these results hinges on a type of decision problems we term "{\it realizability problems}" where the goal is deciding whether a distributed input sample is consistent with an hypothesis from a pre-specified class. From a technical perspective, the protocols we use are based on ideas from machine learning theory and the impossibility results are based on ideas from communication complexity theory. △ Less

Submitted 23 April, 2018; v1 submitted 15 November, 2017; originally announced November 2017.

arXiv:1710.08997 [pdf, ps, other]

Multi-Armed Bandits with Metric Movement Costs

Authors: Tomer Koren, Roi Livni, Yishay Mansour

Abstract: We consider the non-stochastic Multi-Armed Bandit problem in a setting where there is a fixed and known metric on the action space that determines a cost for switching between any pair of actions. The loss of the online learner has two components: the first is the usual loss of the selected actions, and the second is an additional loss due to switching between actions. Our main contribution gives… ▽ More We consider the non-stochastic Multi-Armed Bandit problem in a setting where there is a fixed and known metric on the action space that determines a cost for switching between any pair of actions. The loss of the online learner has two components: the first is the usual loss of the selected actions, and the second is an additional loss due to switching between actions. Our main contribution gives a tight characterization of the expected minimax regret in this setting, in terms of a complexity measure $\mathcal{C}$ of the underlying metric which depends on its covering numbers. In finite metric spaces with $k$ actions, we give an efficient algorithm that achieves regret of the form $\widetilde{O}(\max\{\mathcal{C}^{1/3}T^{2/3},\sqrt{kT}\})$, and show that this is the best possible. Our regret bound generalizes previous known regret bounds for some special cases: (i) the unit-switching cost regret $\widetildeΘ(\max\{k^{1/3}T^{2/3},\sqrt{kT}\})$ where $\mathcal{C}=Θ(k)$, and (ii) the interval metric with regret $\widetildeΘ(\max\{T^{2/3},\sqrt{kT}\})$ where $\mathcal{C}=Θ(1)$. For infinite metrics spaces with Lipschitz loss functions, we derive a tight regret bound of $\widetildeΘ(T^{\frac{d+1}{d+2}})$ where $d \ge 1$ is the Minkowski dimension of the space, which is known to be tight even when there are no switching costs. △ Less

Submitted 24 October, 2017; originally announced October 2017.

arXiv:1709.03871 [pdf, ps, other]

Agnostic Learning by Refuting

Authors: Pravesh K. Kothari, Roi Livni

Abstract: The sample complexity of learning a Boolean-valued function class is precisely characterized by its Rademacher complexity. This has little bearing, however, on the sample complexity of \emph{efficient} agnostic learning. We introduce \emph{refutation complexity}, a natural computational analog of Rademacher complexity of a Boolean concept class and show that it exactly characterizes the sample c… ▽ More The sample complexity of learning a Boolean-valued function class is precisely characterized by its Rademacher complexity. This has little bearing, however, on the sample complexity of \emph{efficient} agnostic learning. We introduce \emph{refutation complexity}, a natural computational analog of Rademacher complexity of a Boolean concept class and show that it exactly characterizes the sample complexity of \emph{efficient} agnostic learning. Informally, refutation complexity of a class $\mathcal{C}$ is the minimum number of example-label pairs required to efficiently distinguish between the case that the labels correlate with the evaluation of some member of $\mathcal{C}$ (\emph{structure}) and the case where the labels are i.i.d. Rademacher random variables (\emph{noise}). The easy direction of this relationship was implicitly used in the recent framework for improper PAC learning lower bounds of Daniely and co-authors via connections to the hardness of refuting random constraint satisfaction problems. Our work can be seen as making the relationship between agnostic learning and refutation implicit in their work into an explicit equivalence. In a recent, independent work, Salil Vadhan discovered a similar relationship between refutation and PAC-learning in the realizable (i.e. noiseless) case. △ Less

Submitted 30 November, 2017; v1 submitted 12 September, 2017; originally announced September 2017.

arXiv:1702.07444 [pdf, ps, other]

Bandits with Movement Costs and Adaptive Pricing

Authors: Tomer Koren, Roi Livni, Yishay Mansour

Abstract: We extend the model of Multi-armed Bandit with unit switching cost to incorporate a metric between the actions. We consider the case where the metric over the actions can be modeled by a complete binary tree, and the distance between two leaves is the size of the subtree of their least common ancestor, which abstracts the case that the actions are points on the continuous interval $[0,1]$ and the… ▽ More We extend the model of Multi-armed Bandit with unit switching cost to incorporate a metric between the actions. We consider the case where the metric over the actions can be modeled by a complete binary tree, and the distance between two leaves is the size of the subtree of their least common ancestor, which abstracts the case that the actions are points on the continuous interval $[0,1]$ and the switching cost is their distance. In this setting, we give a new algorithm that establishes a regret of $\widetilde{O}(\sqrt{kT} + T/k)$, where $k$ is the number of actions and $T$ is the time horizon. When the set of actions corresponds to whole $[0,1]$ interval we can exploit our method for the task of bandit learning with Lipschitz loss functions, where our algorithm achieves an optimal regret rate of $\widetildeΘ(T^{2/3})$, which is the same rate one obtains when there is no penalty for movements. As our main application, we use our new algorithm to solve an adaptive pricing problem. Specifically, we consider the case of a single seller faced with a stream of patient buyers. Each buyer has a private value and a window of time in which they are interested in buying, and they buy at the lowest price in the window, if it is below their value. We show that with an appropriate discretization of the prices, the seller can achieve a regret of $\widetilde{O}(T^{2/3})$ compared to the best fixed price in hindsight, which outperform the previous regret bound of $\widetilde{O}(T^{3/4})$ for the problem. △ Less

Submitted 23 February, 2017; originally announced February 2017.

arXiv:1606.05316 [pdf, other]

Learning Infinite-Layer Networks: Without the Kernel Trick

Authors: Roi Livni, Daniel Carmon, Amir Globerson

Abstract: Infinite--Layer Networks (ILN) have recently been proposed as an architecture that mimics neural networks while enjoying some of the advantages of kernel methods. ILN are networks that integrate over infinitely many nodes within a single hidden layer. It has been demonstrated by several authors that the problem of learning ILN can be reduced to the kernel trick, implying that whenever a certain in… ▽ More Infinite--Layer Networks (ILN) have recently been proposed as an architecture that mimics neural networks while enjoying some of the advantages of kernel methods. ILN are networks that integrate over infinitely many nodes within a single hidden layer. It has been demonstrated by several authors that the problem of learning ILN can be reduced to the kernel trick, implying that whenever a certain integral can be computed analytically they are efficiently learnable. In this work we give an online algorithm for ILN, which avoids the kernel trick assumption. More generally and of independent interest, we show that kernel methods in general can be exploited even when the kernel cannot be efficiently computed but can only be estimated via sampling. We provide a regret analysis for our algorithm, showing that it matches the sample complexity of methods which have access to kernel values. Thus, our method is the first to demonstrate that the kernel trick is not necessary as such, and random features suffice to obtain comparable performance. △ Less

Submitted 28 July, 2017; v1 submitted 16 June, 2016; originally announced June 2016.

arXiv:1603.06352 [pdf, ps, other]

Online Learning with Low Rank Experts

Authors: Elad Hazan, Tomer Koren, Roi Livni, Yishay Mansour

Abstract: We consider the problem of prediction with expert advice when the losses of the experts have low-dimensional structure: they are restricted to an unknown $d$-dimensional subspace. We devise algorithms with regret bounds that are independent of the number of experts and depend only on the rank $d$. For the stochastic model we show a tight bound of $Θ(\sqrt{dT})$, and extend it to a setting of an ap… ▽ More We consider the problem of prediction with expert advice when the losses of the experts have low-dimensional structure: they are restricted to an unknown $d$-dimensional subspace. We devise algorithms with regret bounds that are independent of the number of experts and depend only on the rank $d$. For the stochastic model we show a tight bound of $Θ(\sqrt{dT})$, and extend it to a setting of an approximate $d$ subspace. For the adversarial model we show an upper bound of $O(d\sqrt{T})$ and a lower bound of $Ω(\sqrt{dT})$. △ Less

Submitted 23 May, 2016; v1 submitted 21 March, 2016; originally announced March 2016.

arXiv:1501.03273 [pdf, other]

Classification with Low Rank and Missing Data

Authors: Elad Hazan, Roi Livni, Yishay Mansour

Abstract: We consider classification and regression tasks where we have missing data and assume that the (clean) data resides in a low rank subspace. Finding a hidden subspace is known to be computationally hard. Nevertheless, using a non-proper formulation we give an efficient agnostic algorithm that classifies as good as the best linear classifier coupled with the best low-dimensional subspace in which th… ▽ More We consider classification and regression tasks where we have missing data and assume that the (clean) data resides in a low rank subspace. Finding a hidden subspace is known to be computationally hard. Nevertheless, using a non-proper formulation we give an efficient agnostic algorithm that classifies as good as the best linear classifier coupled with the best low-dimensional subspace in which the data resides. A direct implication is that our algorithm can linearly (and non-linearly through kernels) classify provably as well as the best classifier that has access to the full data. △ Less

Submitted 14 January, 2015; originally announced January 2015.

arXiv:1410.1141 [pdf, other]

On the Computational Efficiency of Training Neural Networks

Authors: Roi Livni, Shai Shalev-Shwartz, Ohad Shamir

Abstract: It is well-known that neural networks are computationally hard to train. On the other hand, in practice, modern day neural networks are trained efficiently using SGD and a variety of tricks that include different activation functions (e.g. ReLU), over-specification (i.e., train networks which are larger than needed), and regularization. In this paper we revisit the computational complexity of trai… ▽ More It is well-known that neural networks are computationally hard to train. On the other hand, in practice, modern day neural networks are trained efficiently using SGD and a variety of tricks that include different activation functions (e.g. ReLU), over-specification (i.e., train networks which are larger than needed), and regularization. In this paper we revisit the computational complexity of training neural networks from a modern perspective. We provide both positive and negative results, some of them yield new provably efficient and practical algorithms for training certain types of neural networks. △ Less

Submitted 28 October, 2014; v1 submitted 5 October, 2014; originally announced October 2014.

Comments: Section 2 is revised due to a mistake

arXiv:1304.7045 [pdf, other]

An Algorithm for Training Polynomial Networks

Authors: Roi Livni, Shai Shalev-Shwartz, Ohad Shamir

Abstract: We consider deep neural networks, in which the output of each node is a quadratic function of its inputs. Similar to other deep architectures, these networks can compactly represent any function on a finite training set. The main goal of this paper is the derivation of an efficient layer-by-layer algorithm for training such networks, which we denote as the \emph{Basis Learner}. The algorithm is a… ▽ More We consider deep neural networks, in which the output of each node is a quadratic function of its inputs. Similar to other deep architectures, these networks can compactly represent any function on a finite training set. The main goal of this paper is the derivation of an efficient layer-by-layer algorithm for training such networks, which we denote as the \emph{Basis Learner}. The algorithm is a universal learner in the sense that the training error is guaranteed to decrease at every iteration, and can eventually reach zero under mild conditions. We present practical implementations of this algorithm, as well as preliminary experimental results. We also compare our deep architecture to other shallow architectures for learning polynomials, in particular kernel learning. △ Less

Submitted 20 February, 2014; v1 submitted 25 April, 2013; originally announced April 2013.

Showing 1–32 of 32 results for author: Livni, R