Speech emotion recognition (SER) is the task of recognizing emotions from speech signals. Although people perform this task efficiently as a natural part of spoken communication, automating it on programmable devices remains a work in progress. SER plays an important role in the development of human-computer interaction, since machines that account for emotion appear and act in a more human-like manner. Various SER techniques have been developed over the last few decades, but the problem has not yet been completely solved. This paper proposes a speech emotion recognition technique based on a hybrid of two deep learning architectures, namely the Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM). Deep CNNs have demonstrated their effectiveness in local feature selection, whereas LSTMs have shown great success in the sequential processing of long texts. The proposed Convolutional LSTM (Co-LSTM) approach aims to provide an efficient automatic method of emotion detection in human-machine communication. In the proposed method, Mel Frequency Cepstral Coefficients (MFCCs) are used to extract a matrix of spectral features from the speech signal, which is then converted to a 1-dimensional (1D) array. Co-LSTM is then employed for both feature selection and classification to learn the emotion recognition model. The experiments covered the classification of all eight emotions in the speech from the RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) and TESS (Toronto Emotional Speech Set) databases. Co-LSTM achieved an accuracy of 86.7% using MFCC spectrogram features. The results demonstrate the effectiveness of the proposed algorithm compared with previous work and other well-known classifiers.
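
The feature-extraction step described above (an MFCC matrix reduced to a 1D array) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the frame size, hop length, number of mel filters, number of coefficients, and the choice of averaging over time to obtain the 1D array are all assumptions, since the abstract does not specify them. The function names (`mel_filterbank`, `mfcc_matrix`, `to_1d`) are hypothetical, and the Co-LSTM classifier itself is not shown.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    bank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def mfcc_matrix(signal, sr, n_mfcc=13, n_mels=40, n_fft=512, hop=256):
    """Return an (n_mfcc, n_frames) matrix of cepstral coefficients."""
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping, Hann-windowed frames.
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Log mel energies; the small constant avoids log(0).
    log_mel = np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
    # Unnormalized DCT-II over the mel axis gives the cepstral coefficients.
    k = np.arange(n_mfcc)[:, None]
    n = np.arange(n_mels)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_mels))
    return (log_mel @ dct.T).T

def to_1d(mfcc):
    """Assumed reduction: average each coefficient over time."""
    return mfcc.mean(axis=1)

# Synthetic one-second tone standing in for a speech recording.
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440.0 * t)

features_2d = mfcc_matrix(signal, sr)
features_1d = to_1d(features_2d)
print(features_2d.shape, features_1d.shape)
```

In practice an established audio library would likely be used for the MFCC computation. The resulting 1D vector is what a model such as the proposed Co-LSTM would take as input.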

Reviewers' Suggestions