# Vector-Quantized Contrastive Predictive Coding

To learn discrete representations of speech for the [ZeroSpeech challenges](https://zerospeech.com/), we propose vector-quantized contrastive predictive coding (VQ-CPC).
An encoder maps input speech into a sequence of discrete codes.
Next, an autoregressive model summarises the codes up to time t into a context vector.
Using this context, the model learns to discriminate future frames from negative examples sampled randomly from other utterances.
Finally, an RNN-based vocoder is trained to generate audio from the discrete representation.

<p align="center">
  <img width="784" height="340" alt="VQ-CPC model summary"
    src="https://raw.githubusercontent.com/bshall/VectorQuantizedCPC/master/model.png">
</p>
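
For intuition, here is a minimal PyTorch sketch of the three pieces described above: a frame-wise encoder, a nearest-neighbour vector quantizer, and an autoregressive model that summarises the code sequence into context vectors. It is an illustration only; the layer types, dimensions, and codebook size are assumptions, not the repository's actual architecture.

```
import torch
import torch.nn as nn

class VQCPC(nn.Module):
    """Toy sketch: encoder -> vector quantizer -> autoregressive summary."""

    def __init__(self, n_mels=80, z_dim=64, c_dim=256, n_codes=512):
        super().__init__()
        # Frame-wise encoder (the real model uses a different architecture).
        self.encoder = nn.Sequential(
            nn.Linear(n_mels, z_dim), nn.ReLU(), nn.Linear(z_dim, z_dim)
        )
        # Learnable codebook for vector quantization.
        self.codebook = nn.Parameter(torch.randn(n_codes, z_dim))
        # Autoregressive model summarising the codes up to time t.
        self.rnn = nn.GRU(z_dim, c_dim, batch_first=True)

    def forward(self, mels):                      # mels: (B, T, n_mels)
        z = self.encoder(mels)                    # (B, T, z_dim)
        # Nearest-neighbour lookup into the codebook.
        dists = (z.unsqueeze(2) - self.codebook).pow(2).sum(-1)  # (B, T, n_codes)
        codes = dists.argmin(-1)                  # discrete sequence (B, T)
        q = self.codebook[codes]                  # quantized latents (B, T, z_dim)
        # Straight-through estimator so gradients reach the encoder.
        q = z + (q - z).detach()
        context, _ = self.rnn(q)                  # context vectors (B, T, c_dim)
        return z, codes, context

# Usage: z, codes, context = VQCPC()(torch.randn(2, 100, 80))
```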

## Requirements

1. Ensure you have Python 3 and PyTorch 1.4 or greater.

2. Install [NVIDIA/apex](https://github.com/NVIDIA/apex) for mixed precision training.

3. Install pip dependencies:
   ```
   pip install -r requirements.txt
   ```

4. For evaluation, install [bootphon/zerospeech2020](https://github.com/bootphon/zerospeech2020).
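
If you are unfamiliar with apex, its mixed-precision API is typically wired in as below. This is a generic sketch of the documented `amp` usage; the repo's training scripts handle this themselves, so nothing here needs to be run by hand.

```
import torch
import torch.nn as nn
from apex import amp

# A stand-in model and optimizer; apex requires a CUDA device.
model = nn.Linear(10, 1).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)

# Patch model and optimizer for mixed precision ("O1" casts ops to fp16 where safe).
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(8, 10, device="cuda")
loss = model(x).pow(2).mean()

# Scale the loss to avoid fp16 gradient underflow, then backprop and step.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```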

## Data and Preprocessing

1. Download and extract the [ZeroSpeech2020 datasets](https://download.zerospeech.com/).

2. Download the train/test splits [here](https://github.com/bshall/VectorQuantizedCPC/releases/tag/v0.1)
   and extract them in the root directory of the repo.

3. Preprocess audio and extract train/test log-Mel spectrograms:
   ```
   python preprocess.py in_dir=/path/to/dataset dataset=[2019/english or 2019/surprise]
   ```
   Note: `in_dir` must be the path to the `2019` folder.
   For `dataset`, choose between `2019/english` and `2019/surprise`.
   Other datasets will be added in the future.
   ```
   e.g. python preprocess.py in_dir=../datasets/2020/2019 dataset=2019/english
   ```
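
For reference, log-Mel extraction of this kind can be sketched with librosa as below. The parameter values (sample rate, FFT size, hop length, number of Mel bands) and the file name are illustrative assumptions; the repo's preprocessing config defines the real ones.

```
import librosa
import numpy as np

# Illustrative parameters; the repo's config files define the actual settings.
wav, sr = librosa.load("example.wav", sr=16000)
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=2048, hop_length=160, win_length=400, n_mels=80
)
log_mel = np.log(np.maximum(mel, 1e-10))  # floor before log to avoid -inf
print(log_mel.shape)  # (n_mels, frames)
```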

## Training

1. Train the VQ-CPC model (pretrained weights will be released soon; the contrastive objective is sketched after these steps):
   ```
   python train_cpc.py checkpoint_dir=path/to/checkpoint_dir dataset=[2019/english or 2019/surprise]
   ```
   ```
   e.g. python train_cpc.py checkpoint_dir=checkpoints/cpc/2019english dataset=2019/english
   ```

2. Train the vocoder:
   ```
   python train_vocoder.py cpc_checkpoint=path/to/cpc/checkpoint checkpoint_dir=path/to/checkpoint_dir dataset=[2019/english or 2019/surprise]
   ```
   ```
   e.g. python train_vocoder.py cpc_checkpoint=checkpoints/cpc/english2019/model.ckpt-24000.pt checkpoint_dir=checkpoints/vocoder/english2019 dataset=2019/english
   ```
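
The contrastive objective used in step 1 works as described in the introduction: given a context vector, discriminate the encoding of the true future frame from negatives drawn from other utterances. Below is a minimal InfoNCE-style sketch of that idea with assumed shapes and a dummy usage example; it is illustrative, not the repository's exact implementation.

```
import torch
import torch.nn as nn
import torch.nn.functional as F

def contrastive_loss(context, future, negatives, predictor):
    """InfoNCE-style sketch with assumed shapes.

    context:   (B, c_dim)    context vector at time t
    future:    (B, z_dim)    encoding of the frame k steps ahead (positive)
    negatives: (B, N, z_dim) encodings sampled from other utterances
    predictor: maps the context to a prediction of the future encoding
    """
    pred = predictor(context)                                   # (B, z_dim)
    pos = (pred * future).sum(-1, keepdim=True)                 # (B, 1)
    neg = torch.bmm(negatives, pred.unsqueeze(-1)).squeeze(-1)  # (B, N)
    logits = torch.cat([pos, neg], dim=1)  # the true future is class 0
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)

# Usage with dummy tensors:
B, N, c_dim, z_dim = 4, 10, 256, 64
loss = contrastive_loss(
    torch.randn(B, c_dim), torch.randn(B, z_dim),
    torch.randn(B, N, z_dim), nn.Linear(c_dim, z_dim),
)
```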

## Evaluation

### Voice conversion

```
python convert.py cpc_checkpoint=path/to/cpc/checkpoint vocoder_checkpoint=path/to/vocoder/checkpoint in_dir=path/to/wavs out_dir=path/to/out_dir synthesis_list=path/to/synthesis_list dataset=[2019/english or 2019/surprise]
```
Note: the `synthesis_list` is a `json` file:
```
[
    [
        "english/test/S002_0379088085",
        "V002",
        "V002_0379088085"
    ]
]
```
containing a list of items with a) the path (relative to `in_dir`) of the source `wav` file;
b) the target speaker (see `datasets/2019/english/speakers.json` for a list of options);
and c) the target file name.
```
e.g. python convert.py cpc_checkpoint=checkpoints/cpc/english2019/model.ckpt-25000.pt vocoder_checkpoint=checkpoints/vocoder/english2019/model.ckpt-150000.pt in_dir=../datasets/2020/2019 out_dir=submission/2019/english/test synthesis_list=datasets/2019/english/synthesis.json dataset=2019/english
```
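
To build a custom synthesis list, writing the `json` by hand or with a few lines of Python is enough. The output file name below is hypothetical; the item follows the format described above.

```
import json

# [source path relative to in_dir, target speaker, output file name]
items = [
    ["english/test/S002_0379088085", "V002", "V002_0379088085"],
]
with open("my_synthesis.json", "w") as f:
    json.dump(items, f, indent=4)
```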

Voice conversion samples will be available soon.

### ABX Score

1. Encode test data for evaluation:
   ```
   python encode.py checkpoint=path/to/checkpoint out_dir=path/to/out_dir dataset=[2019/english or 2019/surprise]
   ```
   ```
   e.g. python encode.py checkpoint=checkpoints/2019english/model.ckpt-500000.pt out_dir=submission/2019/english/test dataset=2019/english
   ```

2. Run the ABX evaluation script (see [bootphon/zerospeech2020](https://github.com/bootphon/zerospeech2020)).

The ABX score for the pretrained English model is:
```
{
    "2019": {
        "english": {
            "scores": {
                "abx": 13.444869807551896,
                "bitrate": 421.3347459545065
            },
            "details_bitrate": {
                "test": 421.3347459545065,
                "auxiliary_embedding1": 817.3706731019037,
                "auxiliary_embedding2": 817.6857350383482
            },
            "details_abx": {
                "test": {
                    "cosine": 13.444869807551896,
                    "KL": 50.0,
                    "levenshtein": 27.836903478166363
                },
                "auxiliary_embedding1": {
                    "cosine": 12.47147337307366,
                    "KL": 50.0,
                    "levenshtein": 43.91132599798928
                },
                "auxiliary_embedding2": {
                    "cosine": 12.29162067184495,
                    "KL": 50.0,
                    "levenshtein": 44.29540315886812
                }
            }
        }
    }
}
```
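
For context on the `bitrate` entries above: the ZeroSpeech bitrate is, to our understanding, the symbol rate multiplied by the empirical entropy of the symbol distribution. A small sketch of that computation (an illustration, not the challenge's reference implementation):

```
import math
from collections import Counter

def bitrate(symbols, duration_seconds):
    """Symbols per second times the empirical entropy (in bits) of the symbols."""
    counts = Counter(symbols)
    total = len(symbols)
    entropy = -sum(c / total * math.log2(c / total) for c in counts.values())
    return total / duration_seconds * entropy

# e.g. codes drawn uniformly from a 512-entry codebook at 100 symbols/s
# would give 100 * log2(512) = 900 bits/s.
```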

## References

This work is based on:

1. Aaron van den Oord, Yazhe Li, and Oriol Vinyals. ["Representation learning with contrastive predictive coding."](https://arxiv.org/abs/1807.03748)
   arXiv preprint arXiv:1807.03748 (2018).

2. Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. ["Neural discrete representation learning."](https://arxiv.org/abs/1711.00937)
   Advances in Neural Information Processing Systems. 2017.