The extracted features are saved in `audio_features.json`.
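A minimal sketch for loading the saved features back into your own backend is shown below; the exact JSON layout is an assumption here, so inspect the file produced by your version of the library before relying on it.

```
import json

import numpy as np

# Load the features previously saved by the feature extraction step.
with open("audio_features.json") as f:
    features = json.load(f)

# NOTE: the layout below is assumed for illustration; inspect the file to
# confirm how features are keyed in your version of the library.
# Here we assume a mapping of {file_name: {feature_name: value, ...}}.
first_file = next(iter(features))
vector = np.array(list(features[first_file].values()), dtype=float)
print(first_file, vector.shape)
```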
## Audio format conversion
You can convert your audio files in `.mp4`, `.mp3`, `.m4a`, and `.aac` formats to `.wav`. This allows you to use the audio feature generation and classification functionalities on them.
To convert your audio files, the following code sample can be used.
```
from pyAudioProcessing.convert_audio import convert_files_to_wav
# dir_path is the path to the directory/folder on your machine containing audio files
dir_path = "data/mp4_files"
# simply change audio_format to "mp3", "m4a" or "aac" depending on the
# format of the source audio you are converting to .wav
convert_files_to_wav(dir_path, audio_format="mp4")
```
---
title: 'pyAudioProcessing: Audio Processing, Feature Extraction and Building Machine Learning Models from Audio Data'
tags:
- Python
- audio
bibliography: paper.bib
# Summary
PyAudioProcessing is a Python-based library for processing audio data, constructing and extracting numerical features from audio, building and testing machine learning models, and classifying data with existing pre-trained audio classification models or custom user-built models. PyAudioProcessing provides five core functionalities covering different stages of audio signal processing.
1. Converting audio files to ".wav" format, giving users the ability to work with different types of audio files by bringing them into a format compatible with the library's feature extraction and classification code.
2. Building numerical features from audio that can be used to train machine learning models. The set of supported features evolves over time as research informs new and improved algorithms.
3. Exporting the features built with this library for use with any custom machine learning backend of the user's choosing (a sketch of this export-and-train workflow, using scikit-learn as the backend, follows this list).
4. Training scikit-learn classifiers directly from raw data using features of the user's choosing. The library
   a. runs automatic hyper-parameter tuning,
   b. returns training metrics along with a cross-validation confusion matrix for model evaluation, and
   c. lets the user test the resulting classifier with the same features used for training.
5. Providing pre-trained models that give users baseline audio classifiers.
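To make the export-and-train path (items 3 and 4) concrete, here is a minimal sketch that feeds features exported by pyAudioProcessing into a scikit-learn backend with grid-search hyper-parameter tuning and a cross-validated confusion matrix. The JSON layout assumed in the loading step is hypothetical; only the scikit-learn calls are standard.

```
import json

import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_val_predict
from sklearn.svm import SVC

# Hypothetical loading step: assumes the exported features were collected
# into {file_name: {"features": [...], "label": "music" | "speech" | ...}}.
with open("audio_features.json") as f:
    data = json.load(f)

X = np.array([entry["features"] for entry in data.values()], dtype=float)
y = np.array([entry["label"] for entry in data.values()])

# Grid search over SVM hyper-parameters, mirroring the automatic tuning
# performed by the built-in training path.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}, cv=5)
search.fit(X, y)

# Cross-validated confusion matrix for model evaluation.
pred = cross_val_predict(search.best_estimator_, X, y, cv=5)
print(search.best_params_)
print(confusion_matrix(y, pred))
```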
It is an end-to-end solution for converting between audio file formats, building features from raw audio samples, and training a machine learning model that can then be used to classify unseen raw audio samples. The library allows the user to extract features such as MFCC, GFCC, spectral features, chroma features, and other beat-based and cepstrum-based features from audio, to use either with one's own classification backend or with the popular scikit-learn classifiers built into pyAudioProcessing.
MATLAB is the language of choice for a vast amount of research in the audio and speech processing domain. Python, on the other hand, remains the language of choice for the vast majority of machine learning research and functionality. This library contains features originally built in MATLAB as part of a research effort and converted to Python. It contributes to the available open-source software by enabling users to pair a Python-based machine learning backend with highly researched audio features such as GFCC, which are actively used in many audio classification applications but are not readily available in Python because the bulk of that research has been done in MATLAB.
This software aims to provide machine learning engineers, data scientists, researchers, and students with a set of baseline models to classify audio; the ability to build features on custom training data; the ability to automatically train a scikit-learn classifier and perform hyper-parameter tuning; the ability to export the built features for integration with any machine learning backend; and the ability to classify audio files. It further aims to aid users in research efforts using GFCC and other evolving, actively researched audio features in Python.
# Statement of need
The motivation behind this software is the popularity of Python for machine learning and the need for solutions that compute complex audio features in Python. This implies not only a need for resources that guide audio processing solutions, but also for Python guides and implementations that solve audio and speech classification tasks. The classifier implementation examples that are part of this software and its README aim to give users a sample solution to audio classification problems and help build the foundation to tackle new and unseen problems.
Different data processing techniques work well for different types of data. For example, word vector formations work well for text data [@nlp]. However, passing numerical data, an audio signal, or an image through word vector formation is unlikely to yield a meaningful numerical representation that can be used to train machine learning models. Different data types call for feature formation techniques specific to their domain rather than a "one size fits all" approach.
PyAudioProcessing is a Python-based library for processing audio data into features and building machine learning models. Audio processing and feature extraction research is popular in MATLAB, and there are comparatively fewer resources for audio processing and classification in Python. This tool contains implementations of popular and diverse audio feature extraction algorithms that can be used in combination with most scikit-learn classifiers. Audio data can be trained, tested, and classified using pyAudioProcessing. The output consists of cross-validation scores and results of testing on custom audio files.
The library lets the user extract aggregated features calculated per audio file. Feature extractions such as Mel Frequency Cepstral Coefficients (MFCC) [@6921394], Gammatone Frequency Cepstral Coefficients (GFCC) [@inbook], spectral coefficients, chroma features, and others are available to extract and use in combination with different backend classifiers. While MFCC features find use in the most commonly encountered audio processing tasks, such as audio type classification and speech classification, GFCC features have been found to be useful for speaker identification and speaker diarization. Many such applications, comparisons, and uses can be found in [@6639061]. All of these features are also helpful for a variety of other audio classification tasks.
Some other popular libraries in the audio processing domain include librosa [@mcfee2015librosa] and pyAudioAnalysis [@giannakopoulos2015pyaudioanalysis]. Librosa is a Python package for music and audio analysis that provides the building blocks necessary to create music information retrieval systems. PyAudioAnalysis is a Python library for audio feature extraction, classification, segmentation, and applications that allows the user to train scikit-learn models on MFCC, spectral, and chroma features.
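To illustrate the per-file aggregation described above, the sketch below computes frame-level MFCC and chroma features and summarizes them into one fixed-length vector per file. It uses librosa purely as an illustration, not pyAudioProcessing's own implementation, and the mean/standard-deviation aggregation scheme is an assumption for demonstration.

```
import librosa
import numpy as np

def aggregated_features(path):
    # Load the audio at its native sampling rate.
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape (13, n_frames)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)    # shape (12, n_frames)
    # Aggregate frame-level features into one fixed-length vector per file.
    return np.concatenate([
        mfcc.mean(axis=1), mfcc.std(axis=1),
        chroma.mean(axis=1), chroma.std(axis=1),
    ])

vec = aggregated_features("example.wav")  # "example.wav" is a placeholder path
print(vec.shape)  # (50,) = 2 * (13 + 12)
```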
PyAudioProcessing adds multiple capabilities on top of these. The library includes an implementation of GFCC features converted from MATLAB-based research, allowing users to leverage Python for speech classification and speaker identification tasks in addition to the MFCC and spectral features that are useful for music and other audio classification tasks. It lets the user choose from the different feature options and use single features or combinations of them. The features can be run through a variety of scikit-learn models, including a grid search for the best model and hyper-parameters, along with a final confusion matrix and cross-validation performance statistics. It further allows saving and exporting the different audio features per audio file so that the user can leverage them with a custom classifier backend that is not part of scikit-learn's models.
The library also provides pre-built audio classification models, such as the `speechVSmusic` and `speechVSmusicVSbirds` sound classifiers and the `music genre` classifier, to give users a baseline of pre-trained models for common audio classification tasks. Users can build custom classifiers with the help of the instructions in the README.
An additional functionality lets users convert their audio files to ".wav" format so that analysis and feature extraction can be applied to them.
The use of this software in the community today inspires its continued need and growth. It is referenced in the textbook `Artificial Intelligence with Python Cookbook`, published by Packt Publishing in October 2020 [@packt]. Additionally, pyAudioProcessing is part of a specific admissions requirement for a funded PhD project at the University of Portsmouth <sup id="portsmouth">[1](#footnote_portsmouth)</sup>. It is further referenced in the thesis titled "Master Thesis AI Methodologies for Processing Acoustic Signals AI Usage for Processing Acoustic Signals" [@phdthesis].
This software offers pre-trained models. This is an evolving feature, as new datasets and classification problems gain prominence in research. Some of the pre-trained models include the following.
1. Audio type classifier to determine speech versus music: a trained SVM classifier that classifies audio into two possible classes, music and speech. This classifier was trained using MFCC, spectral, and chroma features. Its confusion matrix is as follows.

|  | music | speech |
| --- | --- | --- |
| music | 48.80 | 1.20 |
| speech | 0.60 | 49.40 |
2. Audio type classifier to determine speech versus music versus bird sounds: a trained SVM classifier that classifies audio into three possible classes, music, speech, and birds. This classifier was trained using MFCC, spectral, and chroma features. Its confusion matrix is as follows.

|  | music | speech | birds |
| --- | --- | --- | --- |
| music | 31.53 | 0.73 | 1.07 |
| speech | 1.00 | 32.33 | 0.00 |
| birds | 0.00 | 0.00 | 33.33 |
3. Music genre classifier using the GTZAN [@tzanetakis_essl_cook_2001] dataset: an SVM classifier trained using GFCC, MFCC, spectral, and chroma features to classify music into 10 genre classes: blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, and rock. Its confusion matrix is as follows.

|  | pop | met | dis | blu | reg | cla | rock | hip | cou | jazz |
These baseline models aim to demonstrate the capability of the audio feature generation algorithms in extracting meaningful numeric patterns from audio data. Users can train their own classifiers using similar features and a different machine learning backend to research and explore improvements.