Differentially Private Representation Learning via Image Captioning

Authors: Tom Sander, Yaodong Yu, Maziar Sanjabi, Alain Durmus, Yi Ma, Kamalika Chaudhuri, Chuan Guo

Abstract: Differentially private (DP) machine learning is considered the gold-standard solution for training a model from sensitive data while still preserving privacy. However, a major barrier to achieving this ideal is its sub-optimal privacy-accuracy trade-off, which is particularly visible in DP representation learning. Specifically, it has been shown that under modest privacy budgets, most models learn representations that are not significantly better than hand-crafted features. In this work, we show that effective DP representation learning can be done via image captioning and scaling up to internet-scale multimodal datasets. Through a series of engineering tricks, we successfully train a DP image captioner (DP-Cap) on a 233M subset of LAION-2B from scratch using a reasonable amount of computation, and obtaining unprecedented high-quality image features that can be used in a variety of downstream vision and vision-language tasks. For example, under a privacy budget of $\varepsilon=8$, a linear classifier trained on top of learned DP-Cap features attains 65.8% accuracy on ImageNet-1K, considerably improving the previous SOTA of 56.5%. Our work challenges the prevailing sentiment that high-utility DP representation learning cannot be achieved by training from scratch.

Submitted 4 March, 2024; originally announced March 2024.

arXiv:2402.14904 [pdf, other]

Watermarking Makes Language Models Radioactive

Authors: Tom Sander, Pierre Fernandez, Alain Durmus, Matthijs Douze, Teddy Furon

Abstract: This paper investigates the radioactivity of LLM-generated texts, i.e. whether it is possible to detect that such input was used as training data. Conventional methods like membership inference can carry out this detection with some level of accuracy. We show that watermarked training data leaves traces easier to detect and much more reliable than membership inference. We link the contamination level to the watermark robustness, its proportion in the training set, and the fine-tuning process. We notably demonstrate that training on watermarked synthetic instructions can be detected with high confidence (p-value < 1e-5) even when as little as 5% of training text is watermarked. Thus, LLM watermarking, originally designed for detecting machine-generated text, gives the ability to easily identify if the outputs of a watermarked LLM were used to fine-tune another LLM.

Submitted 22 February, 2024; originally announced February 2024.

arXiv:2402.08344 [pdf, other]

Implicit Bias in Noisy-SGD: With Applications to Differentially Private Training

Authors: Tom Sander, Maxime Sylvestre, Alain Durmus

Abstract: Training Deep Neural Networks (DNNs) with small batches using Stochastic Gradient Descent (SGD) yields superior test performance compared to larger batches. The specific noise structure inherent to SGD is known to be responsible for this implicit bias. DP-SGD, used to ensure differential privacy (DP) in DNNs' training, adds Gaussian noise to the clipped gradients. Surprisingly, large-batch training still results in a significant decrease in performance, which poses an important challenge because strong DP guarantees necessitate the use of massive batches. We first show that the phenomenon extends to Noisy-SGD (DP-SGD without clipping), suggesting that the stochasticity (and not the clipping) is the cause of this implicit bias, even with additional isotropic Gaussian noise. We theoretically analyse the solutions obtained with continuous versions of Noisy-SGD for the Linear Least Square and Diagonal Linear Network settings, and reveal that the implicit bias is indeed amplified by the additional noise. Thus, the performance issues of large-batch DP-SGD training are rooted in the same underlying principles as SGD, offering hope for potential improvements in large batch training strategies.

Submitted 13 February, 2024; originally announced February 2024.

arXiv:2308.00420 [pdf, other]

The complexity of the Timetable-Based Railway Network Design Problem

Authors: Nadine Friesen, Tim Sander, Karl Nachtigall, Nils Nießen

Abstract: Because of the long planning periods and their long life cycle, railway infrastructure has to be outlined long ahead. At the present, the infrastructure is designed while only little about the intended operation is known. Hence, the timetable and the operation are adjusted to the infrastructure. Since space, time and money for extension measures of railway infrastructure are limited, each modification has to be done carefully and long lasting and should be appropriate for the future unknown demand. To take this into account, we present the robust network design problem for railway infrastructure under capacity constraints and uncertain timetables. Here, we plan the required expansion measures for an uncertain long-term timetable. We show that this problem is NP-hard even when restricted to bipartite graphs and very simple timetables and present easier solvable special cases. This problem corresponds to the fixed-charge network design problem where the expansion costs are minimized such that the timetable is conductible. We model this problem by an integer linear program using time expanded networks. To incorporate the uncertainty of the future timetable, we use a scenario-based approach. We define scenarios with individual departure and arrival times and optional trains. The network is then optimized such that a given percentage of the scenarios can be operated while minimizing the expansion costs and potential penalty costs for not scheduled optional trains.

Submitted 1 August, 2023; originally announced August 2023.

arXiv:2210.03403 [pdf, other]

TAN Without a Burn: Scaling Laws of DP-SGD

Authors: Tom Sander, Pierre Stock, Alexandre Sablayrolles

Abstract: Differentially Private methods for training Deep Neural Networks (DNNs) have progressed recently, in particular with the use of massive batches and aggregated data augmentations for a large number of training steps. These techniques require much more computing resources than their non-private counterparts, shifting the traditional privacy-accuracy trade-off to a privacy-accuracy-compute trade-off and making hyper-parameter search virtually impossible for realistic scenarios. In this work, we decouple privacy analysis and experimental behavior of noisy training to explore the trade-off with minimal computational requirements. We first use the tools of Rényi Differential Privacy (RDP) to highlight that the privacy budget, when not overcharged, only depends on the total amount of noise (TAN) injected throughout training. We then derive scaling laws for training models with DP-SGD to optimize hyper-parameters with more than a $100\times$ reduction in computational budget. We apply the proposed method on CIFAR-10 and ImageNet and, in particular, strongly improve the state-of-the-art on ImageNet with a +9 points gain in top-1 accuracy for a privacy budget epsilon=8.

Submitted 24 May, 2023; v1 submitted 7 October, 2022; originally announced October 2022.

Showing 1–5 of 5 results for author: Sander, T