Commit 55ddeb8

Merge pull request #32 from jsingh811/paper
Paper
2 parents b9341d0 + ba8895c commit 55ddeb8

4 files changed

+85
-0
lines changed

paper/gfcc.png

91.8 KB

paper/mfcc.png

57.4 KB

paper/paper.bib

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
1+
@INPROCEEDINGS{6639061,
2+
author={Zhao, Xiaojia and Wang, DeLiang},
3+
booktitle={2013 IEEE International Conference on Acoustics, Speech and Signal Processing},
4+
title={Analyzing noise robustness of MFCC and GFCC features in speaker identification},
5+
year={2013},
6+
volume={},
7+
number={},
8+
pages={7204-7208},
9+
doi={10.1109/ICASSP.2013.6639061}}
10+
11+
@inbook{inbook,
12+
author = {Jeevan, Medikonda and Dhingra, Atul and Hanmandlu, M. and Panigrahi, Bijaya},
13+
year = {2017},
14+
month = {10},
15+
pages = {85-91},
16+
title = {Robust Speaker Verification Using GFCC Based i-Vectors},
17+
volume = {395},
18+
isbn = {978-81-322-3590-3},
19+
doi = {10.1007/978-81-322-3592-7_9}
20+
}
21+
22+
@misc{opensource,
23+
author = {Jyotika Singh},
24+
title = {An introduction to audio processing and machine learning using Python},
25+
year = {2019},
26+
publisher = {Opensource},
27+
journal = {Opensource article},
28+
url = {https://opensource.com/article/19/9/audio-processing-machine-learning-python}
29+
}
30+
31+
@INPROCEEDINGS{6921394,
32+
author={Chauhan, Paresh M. and Desai, Nikita P.},
33+
booktitle={2014 International Conference on Green Computing Communication and Electrical Engineering (ICGCCEE)},
34+
title={Mel Frequency Cepstral Coefficients (MFCC) based speaker identification in noisy environment using wiener filter},
35+
year={2014},
36+
volume={},
37+
number={},
38+
pages={1-5},
39+
doi={10.1109/ICGCCEE.2014.6921394}}

paper/paper.md

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
1+
---
2+
title: 'pyAudioProcessing: Audio Processing, Feature Extraction and Building Machine Learning Models from Audio Data'
3+
tags:
4+
- Python
5+
- audio
6+
- audio processing
7+
- feature extraction
8+
- machine learning
9+
- gfcc
10+
- mfcc
11+
- cepstral coefficients
12+
- spectral coefficients
13+
authors:
14+
- name: Jyotika Singh
15+
orcid: 0000-0002-5442-3004
16+
date: 2 June 2021
17+
bibliography: paper.bib
18+
19+
---
20+
21+
# Summary
22+
23+
pyAudioProcessing is a Python-based library for processing audio data, extracting numerical features from audio, and building and testing machine learning models. The library extracts features such as MFCC, GFCC, spectral features, chroma features, and other beat-based and cepstrum-based features from audio, for use with the user's own classification backend or with popular scikit-learn classifiers.
24+
25+
# Statement of need
26+
27+
pyAudioProcessing is a Python-based library for processing audio data into features and building machine learning models. Audio processing and feature extraction research is popular in MATLAB, and comparatively few resources exist for audio processing and classification in Python. This tool implements a range of popular audio feature extraction methods that can be used in combination with most scikit-learn classifiers. Models can be trained on audio data, tested, and used to classify new audio with pyAudioProcessing. The output consists of cross-validation scores and the results of testing on custom audio files.
28+
29+
The library lets the user extract aggregated features calculated per audio file. Feature extractors such as Mel Frequency Cepstral Coefficients (MFCC) [@6921394], Gammatone Frequency Cepstral Coefficients (GFCC) [@inbook], spectral coefficients, chroma features, and others are available to extract and use in combination with different backend classifiers. While MFCC features are used in most common audio processing tasks, such as audio type classification and speech classification, GFCC features have been found to have application in speaker identification and speaker diarization. Many such applications, comparisons, and uses are discussed in [@6639061]. All of these features are also helpful for a variety of other audio classification tasks.
30+
31+
# Audio features
32+
33+
An introduction to getting started with audio processing is given in [@opensource].
34+
35+
![MFCC from audio spectrum.\label{fig:mfcc}](mfcc.png)
36+
37+
The Mel scale relates perceived frequency, or pitch, of a pure tone to its actual measured frequency. Humans are much better at discerning small changes in pitch at low frequencies than they are at high frequencies. Incorporating this scale makes our features match more closely what humans hear. The Mel-frequency scale is approximately linear for frequencies below 1 kHz and logarithmic for frequencies above 1 kHz, as shown in \autoref{fig:mfcc}. This is motivated by the fact that the human auditory system becomes less frequency-selective as frequency increases above 1 kHz.
38+
Passing a spectrum through the Mel filter bank, taking the log magnitude, and then applying a discrete cosine transform (DCT) produces the Mel cepstrum. The DCT, which is also widely used in JPEG and MPEG compression, concentrates the signal's main information and peaks into a small number of coefficients. These peaks carry the gist of the audio information. Typically, the first 13 coefficients of the Mel cepstrum are used as the MFCCs. They hold very useful information about the audio and are often used to train machine learning models. This process is illustrated in \autoref{fig:mfcc}.
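
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not pyAudioProcessing's implementation: it assumes the common `2595 * log10(1 + f / 700)` Mel convention, triangular filters, 26 filters, an unnormalised DCT-II, and a toy power spectrum. All function names here are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """Hz to mel, using the common 2595 * log10(1 + f / 700) convention."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_bank(n_filters, n_bins, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    nyquist = sample_rate / 2.0
    mel_max = hz_to_mel(nyquist)
    # n_filters + 2 edge positions, expressed as fractional spectrum bin indices.
    edges = [
        mel_to_hz(mel_max * i / (n_filters + 1)) / nyquist * (n_bins - 1)
        for i in range(n_filters + 2)
    ]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for k in range(n_bins):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))  # rising slope
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))  # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def dct_ii(values):
    """Unnormalised type-II discrete cosine transform."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n)
    ]


def mfcc_from_power_spectrum(power, sample_rate, n_filters=26, n_coeffs=13):
    """Mel filter bank -> log -> DCT, keeping the first n_coeffs coefficients."""
    bank = mel_filter_bank(n_filters, len(power), sample_rate)
    energies = [sum(w * p for w, p in zip(row, power)) for row in bank]
    log_energies = [math.log(max(e, 1e-10)) for e in energies]  # floor avoids log(0)
    return dct_ii(log_energies)[:n_coeffs]


# Toy power spectrum for one frame: 257 bins with a single broad peak.
frame_power = [1.0 / (1.0 + ((k - 40) / 10.0) ** 2) for k in range(257)]
coeffs = mfcc_from_power_spectrum(frame_power, sample_rate=16000)
print(len(coeffs), [round(c, 2) for c in coeffs[:3]])
```

With this convention, 1000 Hz maps to roughly 1000 mel, and equal steps in Hz cover fewer mels as frequency rises, which is why the filters widen towards the high end of the spectrum.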
39+
40+
![GFCC from audio spectrum.\label{fig:gfcc}](gfcc.png)
41+
42+
Another filter bank inspired by human hearing is the Gammatone filter bank. Gammatone filters are considered a good approximation of the human auditory filters and are used as a front-end simulation of the cochlea. Because the human ear is remarkably good at receiving and distinguishing speakers, with or without background noise, constructing Gammatone filters that mimic the auditory filters became desirable. GFCCs are formed by passing the spectrum through a Gammatone filter bank, followed by loudness compression and a DCT, as seen in \autoref{fig:gfcc}. The first (approximately) 22 coefficients are called GFCCs. Because they aim to replicate how we hear, GFCCs have a number of applications in speech processing, such as speaker identification.
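
As a rough sketch of the building blocks of a Gammatone front end, the snippet below spaces centre frequencies on the ERB-rate scale and generates a Gammatone impulse response. It is not taken from pyAudioProcessing. It assumes the Glasberg and Moore ERB approximation, a fourth-order filter with bandwidth `1.019 * ERB`, 64 channels, and cubic-root loudness compression, which is a common choice in the GFCC literature. All names are hypothetical.

```python
import math


def erb_bandwidth(fc_hz):
    """Equivalent rectangular bandwidth (Hz) at centre frequency fc_hz."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)


def hz_to_erb_rate(f_hz):
    """Map frequency in Hz to the ERB-rate scale."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)


def erb_rate_to_hz(erb_rate):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb_rate / 21.4) - 1.0) * 1000.0 / 4.37


def centre_frequencies(n_filters, f_min, f_max):
    """Centre frequencies equally spaced on the ERB-rate scale."""
    lo, hi = hz_to_erb_rate(f_min), hz_to_erb_rate(f_max)
    step = (hi - lo) / (n_filters - 1)
    return [erb_rate_to_hz(lo + i * step) for i in range(n_filters)]


def gammatone_impulse_response(fc_hz, sample_rate, duration=0.025, order=4):
    """g(t) = t^(order-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t), with unit gain factor."""
    b = 1.019 * erb_bandwidth(fc_hz)
    n_samples = int(duration * sample_rate)
    response = []
    for i in range(n_samples):
        t = i / sample_rate
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * b * t)
        response.append(envelope * math.cos(2.0 * math.pi * fc_hz * t))
    return response


def cubic_root_compress(energies):
    """Loudness compression for non-negative channel energies (assumed cubic root)."""
    return [e ** (1.0 / 3.0) for e in energies]


centres = centre_frequencies(64, 50.0, 8000.0)
impulse = gammatone_impulse_response(centres[32], sample_rate=16000)
print(round(centres[0]), round(centres[-1]), len(impulse))
```

Filtering each frame with every channel, compressing the channel energies, and applying a DCT would then give cepstral coefficients analogous to the MFCC case.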
43+
44+
Other features useful in audio processing tasks (especially speech) include linear prediction coefficients (LPC), linear prediction cepstral coefficients (LPCC), Bark frequency cepstral coefficients (BFCC), power normalized cepstral coefficients (PNCC), and spectral features such as spectral flux, entropy, roll-off, centroid, spread, and energy entropy.
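
To make a few of these spectral features concrete, here is a small sketch using common textbook definitions. Exact definitions and normalisations vary between libraries, so these should not be read as pyAudioProcessing's formulas. The function names, the 85% roll-off fraction, and the toy spectra are assumptions for illustration.

```python
import math


def bin_frequencies(n_bins, sample_rate):
    """Frequency (Hz) of each bin of a one-sided spectrum from 0 to Nyquist."""
    return [k * (sample_rate / 2.0) / (n_bins - 1) for k in range(n_bins)]


def spectral_centroid(magnitudes, sample_rate):
    """Magnitude-weighted mean frequency."""
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    freqs = bin_frequencies(len(magnitudes), sample_rate)
    return sum(f * m for f, m in zip(freqs, magnitudes)) / total


def spectral_spread(magnitudes, sample_rate):
    """Magnitude-weighted standard deviation around the centroid."""
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    freqs = bin_frequencies(len(magnitudes), sample_rate)
    centroid = spectral_centroid(magnitudes, sample_rate)
    variance = sum((f - centroid) ** 2 * m for f, m in zip(freqs, magnitudes)) / total
    return math.sqrt(variance)


def spectral_rolloff(magnitudes, sample_rate, fraction=0.85):
    """Lowest frequency below which `fraction` of the spectral energy lies."""
    energies = [m * m for m in magnitudes]
    threshold = fraction * sum(energies)
    freqs = bin_frequencies(len(magnitudes), sample_rate)
    cumulative = 0.0
    for f, e in zip(freqs, energies):
        cumulative += e
        if cumulative >= threshold:
            return f
    return freqs[-1]


def spectral_flux(previous, current):
    """Squared difference between two sum-normalised magnitude spectra."""
    def normalise(mags):
        total = sum(mags)
        return [m / total for m in mags] if total else list(mags)

    prev_n, cur_n = normalise(previous), normalise(current)
    return sum((c - p) ** 2 for p, c in zip(prev_n, cur_n))


# Two toy frames: a single spectral peak, then two equal peaks.
frame_a = [0.0] * 101
frame_a[10] = 1.0
frame_b = [0.0] * 101
frame_b[10] = 1.0
frame_b[30] = 1.0
print(spectral_centroid(frame_b, 2000), spectral_flux(frame_a, frame_b))
```

Each function returns one number per frame, so a per-file feature vector can be built by summarising these values across frames.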
45+
46+
# References
