
A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval

Published: 10 October 2022
  Abstract

    Every hour, huge amounts of visual content are posted on social media and user-generated content platforms. To find relevant videos by means of a natural language query, text-video retrieval methods have received increasing attention over the past few years. Data augmentation techniques were introduced to improve performance on unseen test examples by creating new training samples through semantics-preserving transformations, such as color-space or geometric transformations on images. Yet, these techniques are usually applied to raw data, leading to more resource-demanding solutions; they also require the raw data to be shareable, which may not always be the case, e.g., due to copyright issues with clips from movies or TV series. To address this shortcoming, we propose a multimodal data augmentation technique which works in the feature space and creates new videos and captions by mixing semantically similar samples. We evaluate our solution on a large-scale public dataset, EPIC-Kitchens-100, achieving considerable improvements over a baseline method and improved state-of-the-art performance, and we perform multiple ablation studies. We release code and pretrained models on GitHub at https://github.com/aranciokov/FSMMDA_VideoRetrieval.
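    The core idea in the abstract (creating new aligned video-caption training pairs by mixing semantically similar samples directly in the feature space) can be sketched as follows. The paper's exact mixing strategy is not given in this excerpt, so this is only a minimal illustration assuming a mixup-style convex combination with one shared coefficient per pair; the function names (`mix_pair`, `augment_batch`) and the label-based notion of "semantically similar" are illustrative assumptions, not the authors' implementation.

    ```python
    import numpy as np

    def mix_pair(feat_a, feat_b, lam):
        """Convex combination of two feature vectors (mixup-style)."""
        return lam * feat_a + (1.0 - lam) * feat_b

    def augment_batch(video_feats, text_feats, labels, alpha=0.2, seed=0):
        """For each sample, pick another sample with the same semantic label
        and mix both the video and the caption features with one shared
        coefficient, yielding a new aligned (video, text) training pair."""
        rng = np.random.default_rng(seed)
        new_v, new_t = [], []
        for i, lab in enumerate(labels):
            peers = [j for j, l in enumerate(labels) if l == lab and j != i]
            if not peers:
                continue  # no semantically similar partner available
            j = int(rng.choice(peers))
            lam = float(rng.beta(alpha, alpha))
            # The same lambda is applied to both modalities so the mixed
            # video and the mixed caption stay semantically aligned.
            new_v.append(mix_pair(video_feats[i], video_feats[j], lam))
            new_t.append(mix_pair(text_feats[i], text_feats[j], lam))
        return np.array(new_v), np.array(new_t)
    ```

    Because the augmentation operates on precomputed features rather than raw frames or raw text, it sidesteps the data-sharing issues the abstract mentions and applies identically to both modalities.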

    Supplementary Material

    MP4 File (MM22-fp2827.mp4)
    Presentation video of the paper "A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval". We propose a multimodal data augmentation technique for semantic text-video retrieval which works in the latent space. Compared to techniques working on raw data, this offers several advantages, including fewer problems (privacy, copyright, etc.) with data sharing and easier applicability of the same technique to multiple modalities (e.g., video and text). Several comparisons and experiments show the advantages over previously published techniques, while also improving on state-of-the-art techniques on two public datasets.


    Cited By

    • (2024) Semantic Fusion Augmentation and Semantic Boundary Detection: A Novel Approach to Multi-Target Video Moment Retrieval. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 6769-6778. DOI: 10.1109/WACV57701.2024.00664. Online publication date: 3 Jan 2024.
    • (2024) Latent Filling: Latent Space Data Augmentation for Zero-Shot Speech Synthesis. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 11166-11170. DOI: 10.1109/ICASSP48485.2024.10446098. Online publication date: 14 Apr 2024.
    • (2023) FArMARe: a Furniture-Aware Multi-task methodology for Recommending Apartments based on the user interests. 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 4295-4305. DOI: 10.1109/ICCVW60793.2023.00464. Online publication date: 2 Oct 2023.
    • (2023) Verbs in Action: Improving verb understanding in video-language models. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 15533-15545. DOI: 10.1109/ICCV51070.2023.01428. Online publication date: 1 Oct 2023.
    • (2023) Heterogeneous Graph Learning for Acoustic Event Classification. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1-5. DOI: 10.1109/ICASSP49357.2023.10095073. Online publication date: 4 Jun 2023.
    • (2023) Auxiliary Cross-Modal Representation Learning With Triplet Loss Functions for Online Handwriting Recognition. IEEE Access, 11, 94148-94172. DOI: 10.1109/ACCESS.2023.3310819. Online publication date: 2023.


        Published In

        MM '22: Proceedings of the 30th ACM International Conference on Multimedia
        October 2022
        7537 pages
        ISBN:9781450392037
        DOI:10.1145/3503161

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. cross-modal video retrieval
        2. data augmentation
        3. vision and language

        Qualifiers

        • Research-article

        Conference

        MM '22

        Acceptance Rates

        Overall Acceptance Rate 995 of 4,171 submissions, 24%



