
Hierarchical Audio-Visual Information Fusion with Multi-label Joint Decoding for MER 2023

Published: 27 October 2023
  Abstract

    In this paper, we propose a novel framework for recognizing both discrete and dimensional emotions. In our framework, deep features extracted from foundation models serve as robust acoustic and visual representations of the raw video. Three different structures based on attention-guided feature gathering (AFG) are designed for deep feature fusion. We then introduce a joint decoding structure that performs emotion classification and valence regression together in the decoding stage. An uncertainty-based multi-task loss is designed to optimize the whole process. Finally, by combining the three structures at the posterior-probability level, we obtain the final predictions of discrete and dimensional emotions. When tested on the dataset of the Multimodal Emotion Recognition Challenge (MER 2023), the proposed framework yields consistent improvements in both emotion classification and valence regression. Our final system achieves state-of-the-art performance and ranks third on the leaderboard of the MER-MULTI sub-challenge.
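
    To make the pipeline concrete, here is a minimal PyTorch sketch (not the authors' released code) of two ingredients the abstract names: attention-guided gathering of frame-level features into a clip-level vector, and an uncertainty-weighted multi-task loss in the style of Kendall et al. that balances emotion classification against valence regression. All module names (AFGPool, UncertaintyWeightedLoss), feature dimensions, and the six-class emotion set are illustrative assumptions, not details taken from the paper.

    import torch
    import torch.nn as nn

    class AFGPool(nn.Module):
        """Attention-guided feature gathering: score each frame, softmax the
        scores over time, and take the weighted sum as a clip-level vector."""
        def __init__(self, dim: int):
            super().__init__()
            self.score = nn.Linear(dim, 1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, frames, dim)
            w = torch.softmax(self.score(x), dim=1)   # (batch, frames, 1)
            return (w * x).sum(dim=1)                 # (batch, dim)

    class UncertaintyWeightedLoss(nn.Module):
        """Kendall-style multi-task weighting with learned log-variances
        s_i = log(sigma_i^2); each task loss is scaled by exp(-s_i) and the
        log-variance is penalized so it cannot grow without cost."""
        def __init__(self):
            super().__init__()
            self.s_cls = nn.Parameter(torch.zeros(()))
            self.s_reg = nn.Parameter(torch.zeros(()))

        def forward(self, loss_cls, loss_reg):
            return (torch.exp(-self.s_cls) * loss_cls + 0.5 * self.s_cls
                    + 0.5 * torch.exp(-self.s_reg) * loss_reg + 0.5 * self.s_reg)

    # Illustrative joint decoding over fused audio-visual features.
    audio = torch.randn(4, 100, 512)   # stand-in for frame-level acoustic features
    video = torch.randn(4, 50, 512)    # stand-in for frame-level visual features
    fused = torch.cat([AFGPool(512)(audio), AFGPool(512)(video)], dim=-1)

    head = nn.Linear(1024, 6 + 1)      # 6 emotion logits + 1 valence (sizes assumed)
    out = head(fused)
    logits, valence = out[:, :6], out[:, 6]

    criterion = UncertaintyWeightedLoss()
    loss = criterion(
        nn.functional.cross_entropy(logits, torch.randint(0, 6, (4,))),
        nn.functional.mse_loss(valence, torch.zeros(4)),
    )

    Because each task loss is divided by a learned variance (with the log-variance penalized), the balance between classification and regression is optimized jointly with the network rather than hand-tuned; the posterior-level combination described in the abstract would then reduce to averaging the class posteriors and valence outputs of the three fusion structures.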


    Cited By

    • (2024) Improving Multi-Modal Emotion Recognition Using Entropy-Based Fusion and Pruning-Based Network Architecture Optimization. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 11766-11770. https://doi.org/10.1109/ICASSP48485.2024.10447231. Online publication date: 14-Apr-2024.

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. deep feature fusion
    2. joint decoding
    3. mer2023
    4. multi-task learning

    Qualifiers

    • Research-article

    Conference

    MM '23
    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 995 of 4,171 submissions, 24%
