Skip to content

The neural network model is capable of detecting five different male/female emotions from audio speeches. (Deep Learning, NLP, Python)

Notifications You must be signed in to change notification settings

LekhaCodes/Speech-Emotion-Recognition-Using-Deep-CNN

 
 

Repository files navigation

Speech Emotion Recognition

Introduction

  • This repository handles building and training Speech Emotion Recognition System.
  • The basic idea behind this tool is to build and train/test a suited machine learning ( as well as deep learning ) algorithm that could recognize and detects human emotions from speech.
  • This is useful for many industry fields such as making product recommendations, affective computing, etc.
  • Check this tutorial for more information.

Requirements

  • Python 3.6+

Python Packages

  • librosa==0.6.3
  • numpy
  • pandas
  • soundfile==0.9.0
  • wave
  • sklearn
  • tqdm==4.28.1
  • matplotlib==2.2.3
  • pyaudio==0.2.11
  • ffmpeg (optional): used if you want to add more sample audio by converting to 16000Hz sample rate and mono channel which is provided in convert_wavs.py

Install these libraries by the following command:

pip3 install -r requirements.txt

Dataset

This repository used 4 datasets (including this repo's custom dataset) which are downloaded and formatted already in data folder:

  • RAVDESS : The Ryson Audio-Visual Database of Emotional Speech and Song that contains 24 actors (12 male, 12 female), vocalizing two lexically-matched statements in a neutral North American accent.
  • TESS : Toronto Emotional Speech Set that was modeled on the Northwestern University Auditory Test No. 6 (NU-6; Tillman & Carhart, 1966). A set of 200 target words were spoken in the carrier phrase "Say the word _____' by two actresses (aged 26 and 64 years).
  • EMO-DB : As a part of the DFG funded research project SE462/3-1 in 1997 and 1999 we recorded a database of emotional utterances spoken by actors. The recordings took place in the anechoic chamber of the Technical University Berlin, department of Technical Acoustics. Director of the project was Prof. Dr. W. Sendlmeier, Technical University of Berlin, Institute of Speech and Communication, department of communication science. Members of the project were mainly Felix Burkhardt, Miriam Kienast, Astrid Paeschke and Benjamin Weiss.
  • Custom : Some unbalanced noisy dataset that is located in data/train-custom for training and data/test-custom for testing in which you can add/remove recording samples easily by converting the raw audio to 16000 sample rate, mono channel (this is provided in create_wavs.py script in convert_audio(audio_path) method which requires ffmpeg to be installed and in PATH) and adding the emotion to the end of audio file name separated with '_' (e.g "20190616_125714_happy.wav" will be parsed automatically as happy)

Emotions available

There are 9 emotions available: "neutral", "calm", "happy" "sad", "angry", "fear", "disgust", "ps" (pleasant surprise) and "boredom".

Feature Extraction

Feature extraction is the main part of the speech emotion recognition system. It is basically accomplished by changing the speech waveform to a form of parametric representation at a relatively lesser data rate.

In this repository, we have used the most used features that are available in librosa library including:

  • MFCC
  • Chromagram
  • MEL Spectrogram Frequency (mel)
  • Contrast
  • Tonnetz (tonal centroid features)

Example 1: Using 3 Emotions

The way to build and train a model for classifying 3 emotions is as shown below:

from emotion_recognition import EmotionRecognizer
from sklearn.svm import SVC
# init a model, let's use SVC
my_model = SVC()
# pass my model to EmotionRecognizer instance
# and balance the dataset
rec = EmotionRecognizer(model=my_model, emotions=['sad', 'neutral', 'happy'], balance=True, verbose=0)
# train the model
rec.train()
# check the test accuracy for that model
print("Test score:", rec.test_score())
# check the train accuracy for that model
print("Train score:", rec.train_score())

Output:

Test score: 0.8148148148148148
Train score: 1.0

Determining the best model

In order to determine the best model, you can by:

# loads the best estimators from `grid` folder that was searched by GridSearchCV in `grid_search.py`,
# and set the model to the best in terms of test score, and then train it
rec.determine_best_model(train=True)
# get the determined sklearn model name
print(rec.model.__class__.__name__, "is the best")
# get the test accuracy score for the best estimator
print("Test score:", rec.test_score())

Output:

MLPClassifier is the best
Test Score: 0.8958333333333334

Predicting

Just pass an audio path to the rec.predict() method as shown below:

# this is a neutral speech from emo-db
print("Prediction:", rec.predict("data/emodb/wav/15a04Nc.wav"))
# this is a sad speech from TESS
print("Prediction:", rec.predict("data/tess_ravdess/validation/Actor_25/25_01_01_01_mob_sad.wav"))

Output:

Prediction: neutral
Prediction: sad

About

The neural network model is capable of detecting five different male/female emotions from audio speeches. (Deep Learning, NLP, Python)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Jupyter Notebook 96.4%
  • Python 3.6%