Citation: LIU Yunxiang, ZHANG Kexin. Bimodal cross-corpus speech emotion recognition[J]. Journal of Technology, 2024, 24(1): 77-84. DOI: 10.3969/j.issn.2096-3424.2024.01.008

Bimodal cross-corpus speech emotion recognition

Abstract: In speech emotion recognition (SER), a heterogeneity gap exists between modalities, few cross-corpus studies go beyond the audio modality, and cross-corpus methods that reduce the discrepancy between datasets too aggressively tend to discard emotion-discriminative features. To address these issues, a YouTube dataset was selected as the source data and the Interactive Emotional Dyadic Motion Capture database (IEMOCAP) as the target data. The Opensmile toolbox was used to extract speech features from both the source and target data, and the extracted features were fed into a convolutional neural network (CNN) and a bidirectional long short-term memory network (BLSTM) to obtain higher-level speech features; the text modality consisted of transcripts of the speech signals. First, the text was vectorized with Bidirectional Encoder Representations from Transformers (BERT) and text features were extracted with a BLSTM; a modality-invariance loss was then designed to form a common representation space for the two modalities. To solve the cross-corpus SER problem, a common subspace of the source and target data was learned by jointly optimizing linear discriminant analysis (LDA), maximum mean discrepancy (MMD), graph embedding (GE), and label smoothing regularization (LSR). To preserve emotion-discriminative features, an emotion-aware center loss was combined with MMD+GE+LDA+LSR. An SVM classifier performed the final emotion classification in the transferred common subspace. Experimental results on IEMOCAP show that the proposed method outperforms other state-of-the-art cross-corpus and bimodal SER methods.
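As a rough illustration of the corpus-alignment term described above, the following is a minimal NumPy sketch of the squared MMD between source and target feature matrices under an RBF kernel. The function names, the kernel bandwidth `gamma`, and the toy data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise squared Euclidean distances between rows of X and rows of Y
    sq_dists = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(Y ** 2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-gamma * sq_dists)

def mmd2(Xs, Xt, gamma=1.0):
    # Biased empirical estimate of the squared MMD between two samples
    k_ss = rbf_kernel(Xs, Xs, gamma)
    k_tt = rbf_kernel(Xt, Xt, gamma)
    k_st = rbf_kernel(Xs, Xt, gamma)
    return k_ss.mean() + k_tt.mean() - 2.0 * k_st.mean()

rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, size=(100, 64))  # stand-in for source (YouTube) features
Xt = rng.normal(0.5, 1.0, size=(80, 64))   # stand-in for target (IEMOCAP) features
print(f"squared MMD: {mmd2(Xs, Xt):.4f}")
```

In the same spirit, a center loss of the kind used to keep the subspace emotion-discriminative can be sketched as the mean squared distance of each feature vector to its emotion-class center; again, every name below is a hypothetical illustration rather than the authors' emotion-aware formulation.

```python
def center_loss(features, labels, centers):
    # Mean squared distance from each feature vector to its class center
    diffs = features - centers[labels]
    return np.mean(np.sum(diffs ** 2, axis=1))

labels = rng.integers(0, 4, size=100)   # four emotion classes, randomly assigned
centers = rng.normal(size=(4, 64))      # per-class centers in the common subspace
print(f"center loss: {center_loss(Xs, labels, centers):.4f}")
```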

     
