subscribe to arXiv mailings

Scalable Optimization in the Modular Norm

Authors: Tim Large, Yang Liu, Minyoung Huh, Hyojin Bahng, Phillip Isola, Jeremy Bernstein

Abstract: To improve performance in contemporary deep learning, one is interested in scaling up the neural network in terms of both the number and the size of the layers. When ramping up the width of a single layer, graceful scaling of training has been linked to the need to normalize the weights and their updates in the "natural norm" particular to that layer. In this paper, we significantly generalize thi… ▽ More To improve performance in contemporary deep learning, one is interested in scaling up the neural network in terms of both the number and the size of the layers. When ramping up the width of a single layer, graceful scaling of training has been linked to the need to normalize the weights and their updates in the "natural norm" particular to that layer. In this paper, we significantly generalize this idea by defining the modular norm, which is the natural norm on the full weight space of any neural network architecture. The modular norm is defined recursively in tandem with the network architecture itself. We show that the modular norm has several promising applications. On the practical side, the modular norm can be used to normalize the updates of any base optimizer so that the learning rate becomes transferable across width and depth. This means that the user does not need to compute optimizer-specific scale factors in order to scale training. On the theoretical side, we show that for any neural network built from "well-behaved" atomic modules, the gradient of the network is Lipschitz-continuous in the modular norm, with the Lipschitz constant admitting a simple recursive formula. This characterization opens the door to porting standard ideas in optimization theory over to deep learning. We have created a Python package called Modula that automatically normalizes weight updates in the modular norm of the architecture. The package is available via "pip install modula" with source code at https://github.com/jxbz/modula. △ Less

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.07987 [pdf, other]

The Platonic Representation Hypothesis

Authors: Minyoung Huh, Brian Cheung, Tongzhou Wang, Phillip Isola

Abstract: We argue that representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the ways by which different neural networks represent data are becoming more aligned. Next, we demonstrate convergence across data modalities: as vision models and language models get larger, they measure dis… ▽ More We argue that representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the ways by which different neural networks represent data are becoming more aligned. Next, we demonstrate convergence across data modalities: as vision models and language models get larger, they measure distance between datapoints in a more and more alike way. We hypothesize that this convergence is driving toward a shared statistical model of reality, akin to Plato's concept of an ideal reality. We term such a representation the platonic representation and discuss several possible selective pressures toward it. Finally, we discuss the implications of these trends, their limitations, and counterexamples to our analysis. △ Less

Submitted 13 May, 2024; originally announced May 2024.

Comments: Equal contributions

arXiv:2402.16828 [pdf, other]

Training Neural Networks from Scratch with Parallel Low-Rank Adapters

Authors: Minyoung Huh, Brian Cheung, Jeremy Bernstein, Phillip Isola, Pulkit Agrawal

Abstract: The scalability of deep learning models is fundamentally limited by computing resources, memory, and communication. Although methods like low-rank adaptation (LoRA) have reduced the cost of model finetuning, its application in model pre-training remains largely unexplored. This paper explores extending LoRA to model pre-training, identifying the inherent constraints and limitations of standard LoR… ▽ More The scalability of deep learning models is fundamentally limited by computing resources, memory, and communication. Although methods like low-rank adaptation (LoRA) have reduced the cost of model finetuning, its application in model pre-training remains largely unexplored. This paper explores extending LoRA to model pre-training, identifying the inherent constraints and limitations of standard LoRA in this context. We introduce LoRA-the-Explorer (LTE), a novel bi-level optimization algorithm designed to enable parallel training of multiple low-rank heads across computing nodes, thereby reducing the need for frequent synchronization. Our approach includes extensive experimentation on vision transformers using various vision datasets, demonstrating that LTE is competitive with standard pre-training. △ Less

Submitted 26 February, 2024; originally announced February 2024.

arXiv:2402.10382 [pdf, other]

doi 10.1145/3613904.3642839

Making Short-Form Videos Accessible with Hierarchical Video Summaries

Authors: Tess Van Daele, Akhil Iyer, Yuning Zhang, Jalyn C. Derry, Mina Huh, Amy Pavel

Abstract: Short videos on platforms such as TikTok, Instagram Reels, and YouTube Shorts (i.e. short-form videos) have become a primary source of information and entertainment. Many short-form videos are inaccessible to blind and low vision (BLV) viewers due to their rapid visual changes, on-screen text, and music or meme-audio overlays. In our formative study, 7 BLV viewers who regularly watched short-form… ▽ More Short videos on platforms such as TikTok, Instagram Reels, and YouTube Shorts (i.e. short-form videos) have become a primary source of information and entertainment. Many short-form videos are inaccessible to blind and low vision (BLV) viewers due to their rapid visual changes, on-screen text, and music or meme-audio overlays. In our formative study, 7 BLV viewers who regularly watched short-form videos reported frequently skipping such inaccessible content. We present ShortScribe, a system that provides hierarchical visual summaries of short-form videos at three levels of detail to support BLV viewers in selecting and understanding short-form videos. ShortScribe allows BLV users to navigate between video descriptions based on their level of interest. To evaluate ShortScribe, we assessed description accuracy and conducted a user study with 10 BLV participants comparing ShortScribe to a baseline interface. When using ShortScribe, participants reported higher comprehension and provided more accurate summaries of video content. △ Less

Submitted 15 February, 2024; originally announced February 2024.

Comments: To appear at CHI 2024

arXiv:2401.06354 [pdf, other]

Initial Analysis of Data-Driven Haptic Search for the Smart Suction Cup

Authors: Jungpyo Lee, Sebastian D. Lee, Tae Myung Huh, Hannah S. Stuart

Abstract: Suction cups offer a useful gripping solution, particularly in industrial robotics and warehouse applications. Vision-based grasp algorithms, like Dex-Net, show promise but struggle to accurately perceive dark or reflective objects, sub-resolution features, and occlusions, resulting in suction cup grip failures. In our prior work, we designed the Smart Suction Cup, which estimates the flow state w… ▽ More Suction cups offer a useful gripping solution, particularly in industrial robotics and warehouse applications. Vision-based grasp algorithms, like Dex-Net, show promise but struggle to accurately perceive dark or reflective objects, sub-resolution features, and occlusions, resulting in suction cup grip failures. In our prior work, we designed the Smart Suction Cup, which estimates the flow state within the cup and provides a mechanically resilient end-effector that can inform arm feedback control through a sense of touch. We then demonstrated how this cup's signals enable haptically-driven search behaviors for better grasping points on adversarial objects. This prior work uses a model-based approach to predict the desired motion direction, which opens up the question: does a data-driven approach perform better? This technical report provides an initial analysis harnessing the data previously collected. Specifically, we compare the model-based method with a preliminary data-driven approach to accurately estimate lateral pose adjustment direction for improved grasp success. △ Less

Submitted 21 October, 2023; originally announced January 2024.

arXiv:2309.07360 [pdf, other]

doi 10.1109/TRO.2023.3331063

Haptic search with the Smart Suction Cup on adversarial objects

Authors: Jungpyo Lee, Sebastian D. Lee, Tae Myung Huh, Hannah S. Stuart

Abstract: Suction cups are an important gripper type in industrial robot applications, and prior literature focuses on using vision-based planners to improve grasping success in these tasks. Vision-based planners can fail due to adversarial objects or lose generalizability for unseen scenarios, without retraining learned algorithms. We propose haptic exploration to improve suction cup grasping when visual g… ▽ More Suction cups are an important gripper type in industrial robot applications, and prior literature focuses on using vision-based planners to improve grasping success in these tasks. Vision-based planners can fail due to adversarial objects or lose generalizability for unseen scenarios, without retraining learned algorithms. We propose haptic exploration to improve suction cup grasping when visual grasp planners fail. We present the Smart Suction Cup, an end-effector that utilizes internal flow measurements for tactile sensing. We show that model-based haptic search methods, guided by these flow measurements, improve grasping success by up to 2.5x as compared with using only a vision planner during a bin-picking task. In characterizing the Smart Suction Cup on both geometric edges and curves, we find that flow rate can accurately predict the ideal motion direction even with large postural errors. The Smart Suction Cup includes no electronics on the cup itself, such that the design is easy to fabricate and haptic exploration does not damage the sensor. This work motivates the use of suction cups with autonomous haptic search capabilities in especially adversarial scenarios. △ Less

Submitted 15 January, 2024; v1 submitted 13 September, 2023; originally announced September 2023.

Comments: Accepted final version to appear in the IEEE Transactions on Robotics

arXiv:2307.07589 [pdf, other]

GenAssist: Making Image Generation Accessible

Authors: Mina Huh, Yi-Hao Peng, Amy Pavel

Abstract: Blind and low vision (BLV) creators use images to communicate with sighted audiences. However, creating or retrieving images is challenging for BLV creators as it is difficult to use authoring tools or assess image search results. Thus, creators limit the types of images they create or recruit sighted collaborators. While text-to-image generation models let creators generate high-fidelity images b… ▽ More Blind and low vision (BLV) creators use images to communicate with sighted audiences. However, creating or retrieving images is challenging for BLV creators as it is difficult to use authoring tools or assess image search results. Thus, creators limit the types of images they create or recruit sighted collaborators. While text-to-image generation models let creators generate high-fidelity images based on a text description (i.e. prompt), it is difficult to assess the content and quality of generated images. We present GenAssist, a system to make text-to-image generation accessible. Using our interface, creators can verify whether generated image candidates followed the prompt, access additional details in the image not specified in the prompt, and skim a summary of similarities and differences between image candidates. To power the interface, GenAssist uses a large language model to generate visual questions, vision-language models to extract answers, and a large language model to summarize the results. Our study with 12 BLV creators demonstrated that GenAssist enables and simplifies the process of image selection and generation, making visual authoring more accessible to all. △ Less

Submitted 14 July, 2023; originally announced July 2023.

Comments: For accessibility tagged pdf, please refer to the ancillary file

arXiv:2305.08842 [pdf, other]

Straightening Out the Straight-Through Estimator: Overcoming Optimization Challenges in Vector Quantized Networks

Authors: Minyoung Huh, Brian Cheung, Pulkit Agrawal, Phillip Isola

Abstract: This work examines the challenges of training neural networks using vector quantization using straight-through estimation. We find that a primary cause of training instability is the discrepancy between the model embedding and the code-vector distribution. We identify the factors that contribute to this issue, including the codebook gradient sparsity and the asymmetric nature of the commitment los… ▽ More This work examines the challenges of training neural networks using vector quantization using straight-through estimation. We find that a primary cause of training instability is the discrepancy between the model embedding and the code-vector distribution. We identify the factors that contribute to this issue, including the codebook gradient sparsity and the asymmetric nature of the commitment loss, which leads to misaligned code-vector assignments. We propose to address this issue via affine re-parameterization of the code vectors. Additionally, we introduce an alternating optimization to reduce the gradient error introduced by the straight-through estimation. Moreover, we propose an improvement to the commitment loss to ensure better alignment between the codebook representation and the model embedding. These optimization methods improve the mathematical approximation of the straight-through estimation and, ultimately, the model performance. We demonstrate the effectiveness of our methods on several common model architectures, such as AlexNet, ResNet, and ViT, across various tasks, including image classification and generative modeling. △ Less

Submitted 15 May, 2023; originally announced May 2023.

arXiv:2303.00510 [pdf, other]

A Comparison of Speech Data Augmentation Methods Using S3PRL Toolkit

Authors: Mina Huh, Ruchira Ray, Corey Karnei

Abstract: Data augmentations are known to improve robustness in speech-processing tasks. In this study, we summarize and compare different data augmentation strategies using S3PRL toolkit. We explore how HuBERT and wav2vec perform using different augmentation techniques (SpecAugment, Gaussian Noise, Speed Perturbation) for Phoneme Recognition (PR) and Automatic Speech Recognition (ASR) tasks. We evaluate mo… ▽ More Data augmentations are known to improve robustness in speech-processing tasks. In this study, we summarize and compare different data augmentation strategies using S3PRL toolkit. We explore how HuBERT and wav2vec perform using different augmentation techniques (SpecAugment, Gaussian Noise, Speed Perturbation) for Phoneme Recognition (PR) and Automatic Speech Recognition (ASR) tasks. We evaluate model performance in terms of phoneme error rate (PER) and word error rate (WER). From the experiments, we observed that SpecAugment slightly improves the performance of HuBERT and wav2vec on the original dataset. Also, we show that models trained using the Gaussian Noise and Speed Perturbation dataset are more robust when tested with augmented test sets. △ Less

Submitted 29 March, 2024; v1 submitted 27 February, 2023; originally announced March 2023.

arXiv:2302.14117 [pdf, other]

doi 10.1145/3544548.3581494

AVscript: Accessible Video Editing with Audio-Visual Scripts

Authors: Mina Huh, Saelyne Yang, Yi-Hao Peng, Xiang 'Anthony' Chen, Young-Ho Kim, Amy Pavel

Abstract: Sighted and blind and low vision (BLV) creators alike use videos to communicate with broad audiences. Yet, video editing remains inaccessible to BLV creators. Our formative study revealed that current video editing tools make it difficult to access the visual content, assess the visual quality, and efficiently navigate the timeline. We present AVscript, an accessible text-based video editor. AVscr… ▽ More Sighted and blind and low vision (BLV) creators alike use videos to communicate with broad audiences. Yet, video editing remains inaccessible to BLV creators. Our formative study revealed that current video editing tools make it difficult to access the visual content, assess the visual quality, and efficiently navigate the timeline. We present AVscript, an accessible text-based video editor. AVscript enables BLV creators to edit their video using a script that embeds the video's visual content, visual errors (e.g., dark or blurred footage), and speech. BLV creators can use AVscript to efficiently navigate between scenes and visual errors or to locate objects in the frame or spoken words of interest. A comparison study (N=12) showed that AVscript significantly lowered BLV creators' mental demands while increasing confidence and independence in video editing. We further demonstrate the potential of AVscript through an exploratory study (N=3) where BLV creators edited their own footage. △ Less

Submitted 27 February, 2023; originally announced February 2023.

Comments: CHI 2023

arXiv:2209.13032 [pdf, other]

Totems: Physical Objects for Verifying Visual Integrity

Authors: Jingwei Ma, Lucy Chai, Minyoung Huh, Tongzhou Wang, Ser-Nam Lim, Phillip Isola, Antonio Torralba

Abstract: We introduce a new approach to image forensics: placing physical refractive objects, which we call totems, into a scene so as to protect any photograph taken of that scene. Totems bend and redirect light rays, thus providing multiple, albeit distorted, views of the scene within a single image. A defender can use these distorted totem pixels to detect if an image has been manipulated. Our approach… ▽ More We introduce a new approach to image forensics: placing physical refractive objects, which we call totems, into a scene so as to protect any photograph taken of that scene. Totems bend and redirect light rays, thus providing multiple, albeit distorted, views of the scene within a single image. A defender can use these distorted totem pixels to detect if an image has been manipulated. Our approach unscrambles the light rays passing through the totems by estimating their positions in the scene and using their known geometric and material properties. To verify a totem-protected image, we detect inconsistencies between the scene reconstructed from totem viewpoints and the scene's appearance from the camera viewpoint. Such an approach makes the adversarial manipulation task more difficult, as the adversary must modify both the totem and image pixels in a geometrically consistent manner without knowing the physical properties of the totem. Unlike prior learning-based approaches, our method does not require training on datasets of specific manipulations, and instead uses physical properties of the scene and camera to solve the forensics problem. △ Less

Submitted 26 September, 2022; originally announced September 2022.

Comments: ECCV 2022 camera ready version; project page https://jingweim.github.io/totems/

arXiv:2203.08864 [pdf, other]

doi 10.1145/3491102.3517490

"It Feels Like Taking a Gamble": Exploring Perceptions, Practices, and Challenges of Using Makeup and Cosmetics for People with Visual Impairments

Authors: Franklin Mingzhe Li, Franchesca Spektor, Meng Xia, Mina Huh, Peter Cederberg, Yuqi Gong, Kristen Shinohara, Patrick Carrington

Abstract: Makeup and cosmetics offer the potential for self-expression and the reshaping of social roles for visually impaired people. However, there exist barriers to conducting a beauty regime because of the reliance on visual information and color variances in makeup. We present a content analysis of 145 YouTube videos to demonstrate visually impaired individuals' unique practices before, during, and aft… ▽ More Makeup and cosmetics offer the potential for self-expression and the reshaping of social roles for visually impaired people. However, there exist barriers to conducting a beauty regime because of the reliance on visual information and color variances in makeup. We present a content analysis of 145 YouTube videos to demonstrate visually impaired individuals' unique practices before, during, and after doing makeup. Based on the makeup practices, we then conducted semi-structured interviews with 12 visually impaired people to discuss their perceptions of and challenges with the makeup process in more depth. Overall, through our findings and discussion, we present novel perceptions of makeup from visually impaired individuals (e.g., broader representations of blindness and beauty). The existing challenges provide opportunities for future research to address learning barriers, insufficient feedback, and physical and environmental barriers, making the experience of doing makeup more accessible to people with visual impairments. △ Less

Submitted 16 March, 2022; originally announced March 2022.

Comments: In CHI Conference on Human Factors in Computing Systems (CHI '22), April 29-May 5, 2022, New Orleans, LA, USA. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3491102.3517490

arXiv:2110.15349 [pdf, other]

Learning to Ground Multi-Agent Communication with Autoencoders

Authors: Toru Lin, Minyoung Huh, Chris Stauffer, Ser-Nam Lim, Phillip Isola

Abstract: Communication requires having a common language, a lingua franca, between agents. This language could emerge via a consensus process, but it may require many generations of trial and error. Alternatively, the lingua franca can be given by the environment, where agents ground their language in representations of the observed world. We demonstrate a simple way to ground language in learned represent… ▽ More Communication requires having a common language, a lingua franca, between agents. This language could emerge via a consensus process, but it may require many generations of trial and error. Alternatively, the lingua franca can be given by the environment, where agents ground their language in representations of the observed world. We demonstrate a simple way to ground language in learned representations, which facilitates decentralized multi-agent communication and coordination. We find that a standard representation learning algorithm -- autoencoding -- is sufficient for arriving at a grounded common language. When agents broadcast these representations, they learn to understand and respond to each other's utterances and achieve surprisingly strong task performance across a variety of multi-agent communication environments. △ Less

Submitted 28 October, 2021; originally announced October 2021.

Comments: Project page, code, and videos can be found at https://toruowo.github.io/marl-ae-comm/

arXiv:2105.02345 [pdf, other]

A Multi-Chamber Smart Suction Cup for Adaptive Gripping and Haptic Exploration

Authors: Tae Myung Huh, Kate Sanders, Michael Danielczuk, Monica Li, Yunliang Chen, Ken Goldberg, Hannah S. Stuart

Abstract: We present a novel robot end-effector for gripping and haptic exploration. Tactile sensing through suction flow monitoring is applied to a new suction cup design that contains multiple chambers for air flow. Each chamber connects with its own remote pressure transducer, which enables both absolute and differential pressure measures between chambers. By changing the overall vacuum applied to this s… ▽ More We present a novel robot end-effector for gripping and haptic exploration. Tactile sensing through suction flow monitoring is applied to a new suction cup design that contains multiple chambers for air flow. Each chamber connects with its own remote pressure transducer, which enables both absolute and differential pressure measures between chambers. By changing the overall vacuum applied to this smart suction cup, it can perform different functions such as gentle haptic exploration (low pressure) and monitoring breaks in the seal during strong astrictive gripping (high pressure). Haptic exploration of surfaces through sliding and palpation can guide the selection of suction grasp locations and help to identify the local surface geometry. During suction gripping, this design localizes breaks in the suction seal between four quadrants with up to 97% accuracy and detects breaks in the suction seal early enough to avoid total grasp failure. △ Less

Submitted 18 October, 2021; v1 submitted 5 May, 2021; originally announced May 2021.

arXiv:2103.10427 [pdf, other]

The Low-Rank Simplicity Bias in Deep Networks

Authors: Minyoung Huh, Hossein Mobahi, Richard Zhang, Brian Cheung, Pulkit Agrawal, Phillip Isola

Abstract: Modern deep neural networks are highly over-parameterized compared to the data on which they are trained, yet they often generalize remarkably well. A flurry of recent work has asked: why do deep networks not overfit to their training data? In this work, we make a series of empirical observations that investigate and extend the hypothesis that deeper networks are inductively biased to find solutio… ▽ More Modern deep neural networks are highly over-parameterized compared to the data on which they are trained, yet they often generalize remarkably well. A flurry of recent work has asked: why do deep networks not overfit to their training data? In this work, we make a series of empirical observations that investigate and extend the hypothesis that deeper networks are inductively biased to find solutions with lower effective rank embeddings. We conjecture that this bias exists because the volume of functions that maps to low effective rank embedding increases with depth. We show empirically that our claim holds true on finite width linear and non-linear models on practical learning paradigms and show that on natural data, these are often the solutions that generalize well. We then show that the simplicity bias exists at both initialization and after training and is resilient to hyper-parameters and learning methods. We further demonstrate how linear over-parameterization of deep non-linear models can be used to induce low-rank bias, improving generalization performance on CIFAR and ImageNet without changing the modeling capacity. △ Less

Submitted 23 March, 2023; v1 submitted 18 March, 2021; originally announced March 2021.

arXiv:2005.01703 [pdf, other]

Transforming and Projecting Images into Class-conditional Generative Networks

Authors: Minyoung Huh, Richard Zhang, Jun-Yan Zhu, Sylvain Paris, Aaron Hertzmann

Abstract: We present a method for projecting an input image into the space of a class-conditional generative neural network. We propose a method that optimizes for transformation to counteract the model biases in generative neural networks. Specifically, we demonstrate that one can solve for image translation, scale, and global color transformation, during the projection optimization to address the object-c… ▽ More We present a method for projecting an input image into the space of a class-conditional generative neural network. We propose a method that optimizes for transformation to counteract the model biases in generative neural networks. Specifically, we demonstrate that one can solve for image translation, scale, and global color transformation, during the projection optimization to address the object-center bias and color bias of a Generative Adversarial Network. This projection process poses a difficult optimization problem, and purely gradient-based optimizations fail to find good solutions. We describe a hybrid optimization strategy that finds good projections by estimating transformations and class parameters. We show the effectiveness of our method on real images and further demonstrate how the corresponding projections lead to better editability of these images. △ Less

Submitted 27 August, 2020; v1 submitted 4 May, 2020; originally announced May 2020.

Comments: Accepted to ECCV2020 (oral)

arXiv:1911.02104 [pdf, other]

doi 10.1109/LRA.2021.3140132

Perceived Intensities of Normal and Shear Skin Stimuli using a Wearable Haptic Bracelet

Authors: Mine Sarac, Tae Myung Huh, Hojung Choi, Mark Cutkosky, Massimiliano Di Luca, Allison M. Okamura

Abstract: Our aim is to provide effective interaction with virtual objects, despite the lack of co-location of virtual and real-world contacts, while taking advantage of relatively large skin area and ease of mounting on the forearm. We performed two human participant studies to determine the effects of haptic feedback in the normal and shear directions during virtual manipulation using haptic devices worn… ▽ More Our aim is to provide effective interaction with virtual objects, despite the lack of co-location of virtual and real-world contacts, while taking advantage of relatively large skin area and ease of mounting on the forearm. We performed two human participant studies to determine the effects of haptic feedback in the normal and shear directions during virtual manipulation using haptic devices worn near the wrist. In the first study, participants performed significantly better while discriminating stiffness values of virtual objects when the feedback consisted of normal displacements compared to shear displacements. Participants also commented that they could detect normal cues much easier than shear, which motivated us to perform a second study to find the point of subjective equality (PSE) between normal and shear stimuli. Our results show that shear stimuli require a larger actuator displacement but less force than normal stimuli to achieve perceptual equality for our haptic bracelets. We found that normal and shear stimuli cannot be equalized through skin displacement nor the interaction forces across all users. Rather, a calibration method is needed to find the point of equality for each user where normal and shear stimuli create the same intensity on the user's skin. △ Less

Submitted 12 April, 2022; v1 submitted 5 November, 2019; originally announced November 2019.

Comments: 8 pages, In Press IEEE Robotic Automation Letters, IEEE International Conference on Robotics and Automation

arXiv:1805.04096 [pdf, other]

Fighting Fake News: Image Splice Detection via Learned Self-Consistency

Authors: Minyoung Huh, Andrew Liu, Andrew Owens, Alexei A. Efros

Abstract: Advances in photo editing and manipulation tools have made it significantly easier to create fake imagery. Learning to detect such manipulations, however, remains a challenging problem due to the lack of sufficient amounts of manipulated training data. In this paper, we propose a learning algorithm for detecting visual image manipulations that is trained only using a large dataset of real photogra… ▽ More Advances in photo editing and manipulation tools have made it significantly easier to create fake imagery. Learning to detect such manipulations, however, remains a challenging problem due to the lack of sufficient amounts of manipulated training data. In this paper, we propose a learning algorithm for detecting visual image manipulations that is trained only using a large dataset of real photographs. The algorithm uses the automatically recorded photo EXIF metadata as supervisory signal for training a model to determine whether an image is self-consistent -- that is, whether its content could have been produced by a single imaging pipeline. We apply this self-consistency model to the task of detecting and localizing image splices. The proposed method obtains state-of-the-art performance on several image forensics benchmarks, despite never seeing any manipulated images at training. That said, it is merely a step in the long quest for a truly general purpose visual forensics tool. △ Less

Submitted 5 September, 2018; v1 submitted 10 May, 2018; originally announced May 2018.

arXiv:1608.08614 [pdf, other]

What makes ImageNet good for transfer learning?

Authors: Minyoung Huh, Pulkit Agrawal, Alexei A. Efros

Abstract: The tremendous success of ImageNet-trained deep features on a wide range of transfer tasks begs the question: what are the properties of the ImageNet dataset that are critical for learning good, general-purpose features? This work provides an empirical investigation of various facets of this question: Is more pre-training data always better? How does feature quality depend on the number of trainin… ▽ More The tremendous success of ImageNet-trained deep features on a wide range of transfer tasks begs the question: what are the properties of the ImageNet dataset that are critical for learning good, general-purpose features? This work provides an empirical investigation of various facets of this question: Is more pre-training data always better? How does feature quality depend on the number of training examples per class? Does adding more object classes improve performance? For the same data budget, how should the data be split into classes? Is fine-grained recognition necessary for learning good features? Given the same number of training classes, is it better to have coarse classes or fine-grained classes? Which is better: more classes or more examples per class? To answer these and related questions, we pre-trained CNN features on various subsets of the ImageNet dataset and evaluated transfer performance on PASCAL detection, PASCAL action classification, and SUN scene classification tasks. Our overall findings suggest that most changes in the choice of pre-training data long thought to be critical do not significantly affect transfer performance.? Given the same number of training classes, is it better to have coarse classes or fine-grained classes? Which is better: more classes or more examples per class? △ Less

Submitted 10 December, 2016; v1 submitted 30 August, 2016; originally announced August 2016.

Showing 1–19 of 19 results for author: Huh, M