Commit ee5169a ("first commit"): 1 file changed, README.md, +46 lines.
# Hugging Face Datasets Text Quality Analysis

![](./screenshot.png)

This repository lets people evaluate the quality of text datasets hosted on Hugging Face. It retrieves a dataset's Parquet files from Hugging Face and identifies junk data, duplication, contamination, biased content, and other quality issues within a given dataset.
7+
8+
## Covered Dimension
9+
10+
* Junk Data
11+
* Short Document
12+
* Duplication
13+
* Contamination
14+
* Biased Content
15+
16+
17+
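As a rough illustration of the short-document and duplication dimensions above, the sketch below flags both in a pandas DataFrame. The `flag_quality_issues` helper, the `text` column name, and the five-word threshold are illustrative assumptions, not the repository's actual logic.

```python
import pandas as pd

def flag_quality_issues(df: pd.DataFrame, text_col: str = "text",
                        min_words: int = 5) -> pd.DataFrame:
    """Flag short documents and exact duplicates in a text column."""
    out = df.copy()
    # A document with fewer than min_words whitespace-separated tokens is "short".
    word_counts = out[text_col].str.split().str.len()
    out["is_short"] = word_counts < min_words
    # Exact duplicates: every occurrence after the first is flagged.
    out["is_duplicate"] = out.duplicated(subset=[text_col], keep="first")
    return out

docs = pd.DataFrame({"text": [
    "The quick brown fox jumps over the lazy dog.",
    "ok",
    "The quick brown fox jumps over the lazy dog.",
]})
flags = flag_quality_issues(docs)
print(flags[["is_short", "is_duplicate"]])
```

Real pipelines typically extend this with near-duplicate detection (e.g. MinHash) rather than exact matching only.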
## Instructions

1. Prerequisites

Note that the code only works with `Python >= 3.9` and `streamlit >= 1.23.1`.

```
$ conda create -n datasets-quality python=3.9
$ conda activate datasets-quality
```

2. Install dependencies

```
$ cd HuggingFace-Datasets-Text-Quality-Analysis
$ pip install -r requirements.txt
```

3. Run the Streamlit application

```
$ python -m streamlit run app.py
```

## Todos

- [ ] Introduce more dimensions for evaluating dataset quality
- [ ] Optimize the deduplication example using parallel computing techniques
- [ ] More robust junk data detection
- [ ] Use a classifier to evaluate data quality
- [ ] Test the code with larger datasets in a cluster environment
- [ ] Support more data frame manipulation backends, e.g. Dask
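The deduplication and contamination ideas in the todos above can be sketched with a simple character n-gram Jaccard similarity. The `ngram_jaccard` helper and the n-gram size are illustrative assumptions, not the repository's actual implementation; real deduplication at scale would use MinHash/LSH instead of pairwise comparison.

```python
def ngrams(text: str, n: int = 3) -> set:
    # Character-level n-grams are robust to small edits and whitespace noise.
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_jaccard(a: str, b: str, n: int = 3) -> float:
    # Jaccard similarity of the two documents' n-gram sets.
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)

# Near-duplicates score high; unrelated text scores low.
print(ngram_jaccard("The quick brown fox.", "The quick brown fox!"))
print(ngram_jaccard("The quick brown fox.", "Lorem ipsum dolor sit amet."))
```

The same similarity can flag contamination by comparing training documents against a benchmark's test set and reporting pairs above a chosen threshold.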
