
A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval

Published: 10 October 2022
  Abstract

    Every hour, huge amounts of visual content are posted on social media and user-generated content platforms. To find relevant videos by means of a natural language query, text-video retrieval methods have received increasing attention over the past few years. Data augmentation techniques were introduced to improve performance on unseen test examples by creating new training samples through semantics-preserving transformations, such as color-space or geometric transformations on images. Yet, these techniques are usually applied to raw data, leading to more resource-demanding solutions; they also require the raw data to be shareable, which may not always be the case, e.g., due to copyright issues with clips from movies or TV series. To address this shortcoming, we propose a multimodal data augmentation technique which works in the feature space and creates new videos and captions by mixing semantically similar samples. We evaluate our solution on a large-scale public dataset, EPIC-Kitchens-100, achieving considerable improvements over a baseline method and improved state-of-the-art performance, and we perform multiple ablation studies. We release code and pretrained models on GitHub at https://github.com/aranciokov/FSMMDA_VideoRetrieval.
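    The core idea in the abstract (creating new aligned video-caption training pairs by mixing semantically similar samples directly in the feature space) can be sketched as follows. The paper's exact mixing strategy is not given in this excerpt, so this is only a minimal illustration assuming a mixup-style convex combination with one shared coefficient per pair; the function names (`mix_pair`, `augment_batch`) and the label-based notion of "semantically similar" are illustrative assumptions, not the authors' implementation.

    ```python
    import numpy as np

    def mix_pair(feat_a, feat_b, lam):
        """Convex combination of two feature vectors (mixup-style)."""
        return lam * feat_a + (1.0 - lam) * feat_b

    def augment_batch(video_feats, text_feats, labels, alpha=0.2, seed=0):
        """For each sample, pick another sample with the same semantic label
        and mix both the video and the caption features with one shared
        coefficient, yielding a new aligned (video, text) training pair."""
        rng = np.random.default_rng(seed)
        new_v, new_t = [], []
        for i, lab in enumerate(labels):
            peers = [j for j, l in enumerate(labels) if l == lab and j != i]
            if not peers:
                continue  # no semantically similar partner available
            j = int(rng.choice(peers))
            lam = float(rng.beta(alpha, alpha))
            # The same lambda is applied to both modalities so the mixed
            # video and the mixed caption stay semantically aligned.
            new_v.append(mix_pair(video_feats[i], video_feats[j], lam))
            new_t.append(mix_pair(text_feats[i], text_feats[j], lam))
        return np.array(new_v), np.array(new_t)
    ```

    Because the augmentation operates on precomputed features rather than raw frames or raw text, it sidesteps the data-sharing issues the abstract mentions and applies identically to both modalities.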

    Supplementary Material

    MP4 File (MM22-fp2827.mp4)
    Presentation video of the paper "A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval". We propose a multimodal data augmentation technique for semantic text-video retrieval which works in the latent space. Compared to techniques working on raw data, this offers several advantages, including fewer problems (privacy, copyright, etc.) with data sharing and easier applicability of the same technique to multiple modalities (e.g., video and text). Several comparisons and experiments show the advantages over previously published techniques, while also improving on state-of-the-art techniques on two public datasets.


    Cited By

    • (2024) Semantic Fusion Augmentation and Semantic Boundary Detection: A Novel Approach to Multi-Target Video Moment Retrieval. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 6769-6778. DOI: 10.1109/WACV57701.2024.00664. Online publication date: 3 Jan 2024.
    • (2024) Latent Filling: Latent Space Data Augmentation for Zero-Shot Speech Synthesis. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 11166-11170. DOI: 10.1109/ICASSP48485.2024.10446098. Online publication date: 14 Apr 2024.
    • (2023) FArMARe: a Furniture-Aware Multi-task methodology for Recommending Apartments based on the user interests. 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 4295-4305. DOI: 10.1109/ICCVW60793.2023.00464. Online publication date: 2 Oct 2023.
    • (2023) Verbs in Action: Improving verb understanding in video-language models. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 15533-15545. DOI: 10.1109/ICCV51070.2023.01428. Online publication date: 1 Oct 2023.
    • (2023) Heterogeneous Graph Learning for Acoustic Event Classification. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1-5. DOI: 10.1109/ICASSP49357.2023.10095073. Online publication date: 4 Jun 2023.
    • (2023) Auxiliary Cross-Modal Representation Learning With Triplet Loss Functions for Online Handwriting Recognition. IEEE Access, 11, 94148-94172. DOI: 10.1109/ACCESS.2023.3310819. Online publication date: 2023.


        Published In

        MM '22: Proceedings of the 30th ACM International Conference on Multimedia
        October 2022
        7537 pages
        ISBN:9781450392037
        DOI:10.1145/3503161

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. cross-modal video retrieval
        2. data augmentation
        3. vision and language

        Qualifiers

        • Research-article

        Conference

        MM '22

        Acceptance Rates

        Overall Acceptance Rate 995 of 4,171 submissions, 24%



