Commit e7f2f87

Add todo
1 parent 6d4039d commit e7f2f87

1 file changed

+1
-0
lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -39,6 +39,7 @@ When the dataset you download from Hugging Face is too large, running the applic
 ## Todos
 
 - [ ] Introduce more dimensions to evaluate the dataset quality
+- Another indicator of poor quality data is excessive repetition of certain words or phrases within a document (Gopher)
 - [ ] Optimize the deduplication example using parallel computing technique
 - [ ] More robust junk data detecting
 - [ ] Use a classifier to evaluate the quality of data
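
The todo added in this commit names excessive repetition of words or phrases as a quality signal. A minimal sketch of one way to measure it is shown below. It is not this repository's code and does not reproduce the exact Gopher rules: the function names `duplicate_ngram_fraction` and `looks_repetitive`, the n-gram size, and the `0.3` threshold are all hypothetical choices for illustration.

```python
from collections import Counter


def duplicate_ngram_fraction(text: str, n: int = 3) -> float:
    """Return the share of word n-grams that occur more than once in `text`.

    0.0 means no n-gram repeats; values near 1.0 mean the document is
    mostly made of repeated phrases.
    """
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        # Document is shorter than n words: nothing to measure.
        return 0.0
    counts = Counter(ngrams)
    # Count every occurrence of an n-gram that appears at least twice.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


def looks_repetitive(text: str, n: int = 3, threshold: float = 0.3) -> bool:
    """Flag a document whose repeated n-gram share exceeds `threshold`.

    The default threshold is an illustrative assumption, not a tuned value.
    """
    return duplicate_ngram_fraction(text, n) > threshold


if __name__ == "__main__":
    spam = "buy now buy now buy now buy now"
    prose = "the quick brown fox jumps over the lazy dog"
    print(looks_repetitive(spam, n=2))
    print(looks_repetitive(prose, n=2))
```

In practice the threshold and n-gram size would likely need tuning on a labeled sample of the dataset, and a score like this would be one dimension among several rather than a standalone filter.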
