Currently, parallel training with the Paddle backend is implemented in the form of Paddle Distributed Data Parallel [DDP](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/06_distributed_training/cluster_quick_start_collective_cn.html).

DeePMD-kit will decide whether to launch the training in parallel (distributed) mode or in serial mode depending on your execution command.
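
Below is a minimal sketch of how this decision can be made, assuming it is inferred from the distributed environment that the Paddle launcher prepares; the actual logic inside DeePMD-kit may differ.

```python
# A minimal sketch, assuming the mode is inferred from the environment that
# `python -m paddle.distributed.launch` (or `fleetrun`) sets up; the real
# DeePMD-kit entry point may implement this differently.
import paddle.distributed as dist


def launch_mode() -> str:
    # The launcher starts one trainer process per GPU, so a world size larger
    # than 1 indicates parallel (DDP) training; otherwise we run serially.
    if dist.get_world_size() > 1:
        dist.init_parallel_env()  # set up communication between ranks
        return "distributed"
    return "serial"
```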
### Dataloader and Dataset
First, we establish a `DeepmdData` class for each system, which is consistent with the TensorFlow version at this level. Then, we create a dataloader for each system, resulting in as many dataloaders as there are systems. Next, we create a dataset over the dataloaders obtained in the previous step, which allows us to query the data of each system through this dataset while the iteration pointer of each system is maintained by its own dataloader. Finally, a dataloader is created for the outermost dataset, as sketched below.
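
The sketch below illustrates this nesting with plain `paddle.io` primitives. The `DeepmdData` accessors (`nframes`, `get_frame`) and the class names are placeholders named after the nodes in the diagram further down; the real implementation in DeePMD-kit differs in detail.

```python
# Illustrative sketch of the nested data pipeline; accessor and class names
# are assumptions, not the exact DeePMD-kit API.
from paddle.io import DataLoader, Dataset


class SystemDataset(Dataset):
    """One Dataset per system, wrapping a DeepmdData instance."""

    def __init__(self, deepmd_data):
        self.data = deepmd_data

    def __len__(self):
        return self.data.nframes  # assumed attribute

    def __getitem__(self, idx):
        return self.data.get_frame(idx)  # assumed accessor


class DpLoaderSet(Dataset):
    """Outermost dataset: index i yields the next mini-batch of system i,
    while each system's dataloader keeps its own iteration pointer."""

    def __init__(self, system_loaders):
        self.loaders = list(system_loaders)
        self.iters = [iter(dl) for dl in self.loaders]

    def __len__(self):
        return len(self.loaders)

    def __getitem__(self, i):
        try:
            return next(self.iters[i])
        except StopIteration:  # restart an exhausted system
            self.iters[i] = iter(self.loaders[i])
            return next(self.iters[i])


def build_outer_dataset(deepmd_systems, batch_size=4):
    # One dataloader per system, then one dataset covering all of them;
    # the final dataloader over this dataset is shown in the next sketch.
    loaders = [DataLoader(SystemDataset(s), batch_size=batch_size) for s in deepmd_systems]
    return DpLoaderSet(loaders)
```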
We achieve custom sampling methods using a weighted sampler. The length of the sampler is set to `total_batch_num * num_workers`. The parameter `num_workers` defines the number of threads involved in multi-threaded loading, which can be modified by setting the environment variable `NUM_WORKERS` (default: `min(8, ncpus)`).
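
A sketch of this weighted sampling follows, assuming `paddle.io.WeightedRandomSampler` with per-system weights; the weight formula and helper names are illustrative only, not the exact DeePMD-kit implementation.

```python
# Illustrative sketch of the weighted sampling over systems.
import os

from paddle.io import BatchSampler, DataLoader, WeightedRandomSampler

# Number of loading workers, overridable through the NUM_WORKERS variable.
NUM_WORKERS = int(os.environ.get("NUM_WORKERS", min(8, os.cpu_count() or 1)))


def _first(items):
    # Each outer "sample" is already a mini-batch built by a system dataloader.
    return items[0]


def build_training_loader(outer_dataset, system_batch_counts):
    total_batch_num = sum(system_batch_counts)
    # Draw system indices with probability proportional to each system's
    # number of batches; sampler length is total_batch_num * num_workers.
    sampler = WeightedRandomSampler(
        weights=[n / total_batch_num for n in system_batch_counts],
        num_samples=total_batch_num * NUM_WORKERS,
        replacement=True,
    )
    # Each drawn index selects one system; that system's own dataloader then
    # supplies the actual mini-batch, so the outer batch size stays 1.
    batch_sampler = BatchSampler(sampler=sampler, batch_size=1)
    return DataLoader(
        outer_dataset,
        batch_sampler=batch_sampler,
        num_workers=NUM_WORKERS,
        collate_fn=_first,
    )
```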
> **Note** The underlying dataloader will use a distributed sampler in parallel mode, so that each GPU receives batches with different content, and a sequential sampler in serial mode. In the TensorFlow version, Horovod shuffles the dataset with different random seeds for the same purpose.
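
A sketch of that per-system sampler choice, assuming the decision follows the world size; the helper name is illustrative.

```python
# Illustrative sketch of the per-system sampler choice; not the exact
# DeePMD-kit code. The returned sampler would be passed as `batch_sampler`
# to that system's DataLoader.
import paddle.distributed as dist
from paddle.io import BatchSampler, DistributedBatchSampler


def make_system_batch_sampler(system_dataset, batch_size):
    if dist.get_world_size() > 1:
        # Parallel mode: shard every system so each GPU sees different frames.
        return DistributedBatchSampler(system_dataset, batch_size=batch_size, shuffle=True)
    # Serial mode: plain sequential batching.
    return BatchSampler(system_dataset, batch_size=batch_size, shuffle=False)
```

The complete data flow, from the per-system frames to the buffered queue that feeds the model, is summarized in the diagram below.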
```mermaid
flowchart LR

    subgraph systems
        subgraph system1
            direction LR
            frame1[frame 1]
            frame2[frame 2]
        end
        subgraph system2
            direction LR
            frame3[frame 3]
            frame4[frame 4]
            frame5[frame 5]
        end
    end

    subgraph dataset
        dataset1[dataset 1]
        dataset2[dataset 2]
    end
    system1 -- frames --> dataset1
    system2 --> dataset2

    subgraph distributed sampler
        ds1[distributed sampler 1]
        ds2[distributed sampler 2]
    end
    dataset1 --> ds1
    dataset2 --> ds2

    subgraph dataloader
        dataloader1[dataloader 1]
        dataloader2[dataloader 2]
    end
    ds1 -- mini batch --> dataloader1
    ds2 --> dataloader2

    subgraph index[index on Rank 0]
        dl11[dataloader 1, entry 1]
        dl21[dataloader 2, entry 1]
        dl22[dataloader 2, entry 2]
    end
    dataloader1 --> dl11
    dataloader2 --> dl21
    dataloader2 --> dl22

    index -- for each step, choose 1 system --> WeightedSampler
    WeightedSampler --> dploaderset --> bufferedq[buffered queue] --> model
```
### How to use
We use [`paddle.distributed.fleet`](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/06_distributed_training/cluster_quick_start_collective_cn.html) to launch a DDP training session.
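
Below is a minimal sketch following the linked fleet collective quick start. The launch command and the model constructor are illustrative only; in practice the DeePMD-kit `dp train` command takes the place of the `train.py` script shown here.

```python
# Minimal sketch of a fleet-based DDP session, following the linked quick
# start; `build_model` is a hypothetical constructor, not DeePMD-kit API.
#
# Launch on 4 GPUs of one node, e.g.:
#   python -m paddle.distributed.launch --gpus="0,1,2,3" train.py
import paddle
from paddle.distributed import fleet

fleet.init(is_collective=True)  # set up the collective (DDP) environment

model = build_model()  # hypothetical network constructor
optimizer = paddle.optimizer.Adam(parameters=model.parameters())

# Wrap model and optimizer so gradients are synchronized across all ranks.
model = fleet.distributed_model(model)
optimizer = fleet.distributed_optimizer(optimizer)
```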