Optimized cross-corpus speech emotion recognition framework based on Normalized 1D Convolutional Neural Network

Authors

  • Nishant Barsainyan
  • Dileep Kumar Singh

DOI:

https://doi.org/10.6977/IJoSI.202502_9(1).0008

Keywords:

Convolutional Neural Networks, Cross-Corpus, Deep Learning, Feature Extraction, Signal Processing, Speech Emotion Recognition, XGB Classifier

Abstract

Human-computer interaction (HCI) can be improved by detecting emotions from voice. Speech Emotion Recognition (SER) software typically detects the presence of various emotions in the speaker. However, there are significant challenges in combining information from multidisciplinary domains, notably speech-emotion recognition and applied psychology. Some researchers have used handcrafted attributes to categorize emotions and obtained high classification accuracy. However, these attributes reduce the categorization accuracy in multi-lingual environments. Deep learning algorithms have been used to automatically learn local representations from the input speech data. However, these strategies cannot extract the most valuable characteristics from challenging speech inputs. To address this constraint, we propose an SER framework that applies data augmentation before generating relevant feature sets from each utterance and selecting the most discriminative features. The selected feature vector is then fed into the Normalized 1D CNN for emotion recognition using multi-lingual databases. This study also evaluates the effectiveness of an XGB classifier for multi-lingual emotion recognition in a cross-corpus setting, training it on one corpus and testing it on a different corpus. The test results showed that the proposed SER architecture outperformed existing SER approaches.
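The cross-corpus protocol mentioned in the abstract (train on one corpus, test on another) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the two corpora are synthetic, the emotion labels and two-dimensional features are invented, the normalization is a simple corpus-wise z-score, and a nearest-centroid classifier stands in for the Normalized 1D CNN and the XGB classifier.

```python
import random
import statistics


def zscore_per_corpus(features):
    """Scale each feature column with this corpus's own mean and std (assumed normalization)."""
    columns = list(zip(*features))
    means = [statistics.fmean(c) for c in columns]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in features]


def fit_centroids(features, labels):
    """Stand-in classifier (not the paper's model): one mean vector per emotion label."""
    grouped = {}
    for row, label in zip(features, labels):
        grouped.setdefault(label, []).append(row)
    return {
        label: [statistics.fmean(c) for c in zip(*rows)]
        for label, rows in grouped.items()
    }


def predict(centroids, row):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def make_toy_corpus(rng, offset, n_per_class=50):
    """Synthetic corpus; `offset` mimics a corpus-specific shift such as recording conditions."""
    features, labels = [], []
    for label, center in (("angry", 2.0), ("sad", -2.0)):
        for _ in range(n_per_class):
            features.append(
                [rng.gauss(center + offset, 1.0), rng.gauss(-center + offset, 1.0)]
            )
            labels.append(label)
    return features, labels


def cross_corpus_accuracy(train, test):
    """Fit on the training corpus only, then score on a different corpus."""
    train_x, train_y = train
    test_x, test_y = test
    centroids = fit_centroids(zscore_per_corpus(train_x), train_y)
    preds = [predict(centroids, row) for row in zscore_per_corpus(test_x)]
    return sum(p == y for p, y in zip(preds, test_y)) / len(test_y)


rng = random.Random(0)
corpus_a = make_toy_corpus(rng, offset=0.0)  # hypothetical training corpus
corpus_b = make_toy_corpus(rng, offset=5.0)  # hypothetical test corpus with shifted features
accuracy = cross_corpus_accuracy(corpus_a, corpus_b)
print(f"train on A, test on B: accuracy = {accuracy:.2f}")
```

The point of the sketch is the evaluation split: no utterance from the test corpus is used to fit the classifier, so the score reflects how well the model transfers across corpora rather than how well it fits a single one.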

Published

2025-02-19