Skip to content

Commit 983c238

Browse files
committed
Modify paper
1 parent 0931a1e commit 983c238

File tree

4 files changed

+22
-10
lines changed

4 files changed

+22
-10
lines changed

gfcc.png

-60.3 KB
Loading

mfcc.png

-66.3 KB
Loading

paper.bib

Lines changed: 11 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -26,4 +26,14 @@ @misc{opensource
2626
publisher = {Opensource},
2727
journal = {Opensource article},
2828
url = {https://opensource.com/article/19/9/audio-processing-machine-learning-python}
29-
}
29+
}
30+
31+
@INPROCEEDINGS{6921394,
32+
author={Chauhan, Paresh M. and Desai, Nikita P.},
33+
booktitle={2014 International Conference on Green Computing Communication and Electrical Engineering (ICGCCEE)},
34+
title={Mel Frequency Cepstral Coefficients (MFCC) based speaker identification in noisy environment using wiener filter},
35+
year={2014},
36+
volume={},
37+
number={},
38+
pages={1-5},
39+
doi={10.1109/ICGCCEE.2014.6921394}}

paper.md

Lines changed: 11 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -20,25 +20,27 @@ bibliography: paper.bib
2020

2121
# Summary
2222

23-
PyAudioProcessing is a Python based library for processing audio data, forming and extracting numerical features from audio and further bulding machine learning models. This library allows you to extract features such as MFCC, GFCC, spectral features, chroma features and other beat based and cepstrum based features from audio to use with one's own classification backend or popular scikit-learn classifiers.
23+
PyAudioProcessing is a Python based library for processing audio data, forming and extracting numerical features from audio and further building and testing machine learning models. This library allows you to extract features such as MFCC, GFCC, spectral features, chroma features and other beat based and cepstrum based features from audio to use with one's own classification backend or popular scikit-learn classifiers.
2424

2525
# Statement of need
2626

27-
PyAudioProcessing is a Python based library for processing audio data into features and building Machine Learning models. Audio processing and feature extraction research is popular in MATLAB. There are comparatively fewer resources for audio processing and classification in Python. This tool contains implementation of popular and different audio feature extraction that can be use in combination with most scikit-learn classifiers. Unique feature extractions such as Mel Frequency Cepstral Coefficients (MFCC), Gammatone Frequency Cepstral Coefficients (GFCC) [@inbook], spectral coefficients, chroma features and others are available to extract and use in combination with different backend classifiers. While MFCC features find use in most commonly encountered audio processing tasks such as audio type classification, speech classification, GFCC features have been found to have application in speaker identification/diarization. Many such applications, comparisons and uses can be found in this IEEE paper [@6639061].
27+
PyAudioProcessing is a Python based library for processing audio data into features and building Machine Learning models. Audio processing and feature extraction research is popular in MATLAB. There are comparatively fewer resources for audio processing and classification in Python. This tool contains implementation of popular and different audio feature extraction that can be use in combination with most scikit-learn classifiers. Audio data can be trained, tested and classified using pyAudioProcessing. The output consists of cross validation scores and results of testing on custom audio files.
2828

29-
# Audio features
29+
The library lets the user extract aggregated data features calculated per audio file. Unique feature extractions such as Mel Frequency Cepstral Coefficients (MFCC) [@6921394], Gammatone Frequency Cepstral Coefficients (GFCC) [@inbook], spectral coefficients, chroma features and others are available to extract and use in combination with different backend classifiers. While MFCC features find use in most commonly encountered audio processing tasks such as audio type classification, speech classification, GFCC features have been found to have application in speaker identification or speaker diarization. Many such applications, comparisons and uses can be found in this IEEE paper [@6639061]. All these features are also helpful for a variety of other audio classification tasks.
3030

31-
Information about getting started with audio processing is described in @opensource.
31+
# Audio features
3232

33-
Passing a spectrum through the Mel filter bank, followed by taking the log magnitude and a discrete cosine transform (DCT) produces the Mel cepstrum. DCT extracts the signal's main information and peaks. It is also widely used in JPEG and MPEG compressions. The peaks are the gist of the audio information. Typically, the first 13 coefficients extracted from the Mel cepstrum are called the MFCCs. These hold very useful information about audio and are often used to train machine learning models. This can be further seen in the form of an illustration in \autoref{fig:mfcc}.
33+
Information about getting started with audio processing is described in [@opensource].
3434

35-
Another filter inspired by human hearing is the Gammatone filter bank. This filter bank is used as a front-end simulation of the cochlea. Thus, it has many applications in speech processing because it aims to replicate how we hear. GFCCs are formed by passing the spectrum through Gammatone filter bank, followed by loudness compression and DCT, as seen in \autoref{fig:gfcc}. The first (approximately) 22 features are called GFCCs. GFCCs have a number of applications in speech processing, such as speaker identification.
35+
![MFCC from audio spectrum.\label{fig:mfcc}](mfcc.png)
3636

37-
Other features useful in audio processing tasks (especially speech) include LPCC, BFCC, PNCC, and spectral features like spectral flux, entropy, roll off, centroid, spread, and energy entropy.
37+
The Mel scale relates perceived frequency, or pitch, of a pure tone to its actual measured frequency. Humans are much better at discerning small changes in pitch at low frequencies than they are at high frequencies. Incorporating this scale makes our features match more closely what humans hear. The Mel-frequency scale is approximately linear for frequencies below 1 kHz and logarithmic for frequencies above 1 kHz, as shown in \autoref{fig:mfcc}. This is motivated by the fact that the human auditory system becomes less frequency-selective as frequency increases above 1 kHz.
38+
Passing a spectrum through the Mel filter bank, followed by taking the log magnitude and a discrete cosine transform (DCT) produces the Mel cepstrum. DCT extracts the signal's main information and peaks. It is also widely used in JPEG and MPEG compressions. The peaks are the gist of the audio information. Typically, the first 13 coefficients extracted from the Mel cepstrum are called the MFCCs. These hold very useful information about audio and are often used to train machine learning models. This can be further seen in the form of an illustration in \autoref{fig:mfcc}.
3839

40+
![GFCC from audio spectrum.\label{fig:gfcc}](gfcc.png)
3941

40-
![MFCC from audio spectrum.\label{fig:mfcc}](mfcc.png)
42+
Another filter inspired by human hearing is the Gammatone filter bank. Gammatone filters are conceived to be a good approximation to the human auditory filters and are used as a front-end simulation of the cochlea. Since a human ear is the perfect receiver and distinguisher of speakers in the presence of noise or no noise, construction of gammatone filters that mimic auditory filters became desirable. Thus, it has many applications in speech processing because it aims to replicate how we hear. GFCCs are formed by passing the spectrum through Gammatone filter bank, followed by loudness compression and DCT, as seen in \autoref{fig:gfcc}. The first (approximately) 22 features are called GFCCs. GFCCs have a number of applications in speech processing, such as speaker identification.
4143

42-
![GFCC from audio spectrum.\label{fig:gfcc}](gfcc.png)
44+
Other features useful in audio processing tasks (especially speech) include Linear prediction coefficients and Linear prediction cepstral coefficients (LPCC), Bark frequency cepstral coefficients (BFCC), Power normalized cepstral coefficients (PNCC), and spectral features like spectral flux, entropy, roll off, centroid, spread, and energy entropy.
4345

4446
# References

0 commit comments

Comments
 (0)