1 parent 6d4039d commit e7f2f87
README.md
@@ -39,6 +39,7 @@ When the dataset you download from Hugging Face is too large, running the applic
## Todos
- [ ] Introduce more dimensions to evaluate the dataset quality
+ - Another indicator of poor-quality data is excessive repetition of certain words or phrases within a document (a heuristic described in the Gopher paper)
- [ ] Optimize the deduplication example using parallel computing techniques
- [ ] More robust junk data detection
- [ ] Use a classifier to evaluate the quality of data
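
The repetition indicator added in this commit could be approximated with a couple of simple statistics. The following is a minimal sketch, not the project's implementation: the function names (`duplicate_line_fraction`, `top_ngram_char_fraction`, `looks_repetitive`) and the default thresholds are hypothetical placeholders chosen for illustration, and would need tuning against real data.

```python
from collections import Counter


def duplicate_line_fraction(text: str) -> float:
    """Fraction of non-empty lines that repeat an earlier line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    # Every occurrence beyond the first counts as a duplicate.
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(lines)


def top_ngram_char_fraction(text: str, n: int = 2) -> float:
    """Approximate share of word characters covered by the most common word n-gram.

    Overlapping occurrences are counted, so the value is capped at 1.0.
    """
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    top, count = ngrams.most_common(1)[0]
    if count < 2:
        # Nothing repeats, so there is no repetition signal.
        return 0.0
    total_chars = sum(len(w) for w in words)
    return min(1.0, count * sum(len(w) for w in top) / total_chars)


def looks_repetitive(
    text: str,
    max_dup_lines: float = 0.3,   # assumed threshold, for illustration only
    max_top_ngram: float = 0.2,   # assumed threshold, for illustration only
) -> bool:
    """Flag a document when either repetition statistic exceeds its threshold."""
    return (
        duplicate_line_fraction(text) > max_dup_lines
        or top_ngram_char_fraction(text) > max_top_ngram
    )
```

A call such as `looks_repetitive(doc)` could then serve as one more quality dimension alongside the existing checks. Both statistics are cheap to compute per document, which keeps the check compatible with the parallel processing mentioned in the Todos.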