Open
Labels
Good First Issue (Low hanging fruits)
Description
Currently, datasets are given as markdown files with lots of unused columns:
| Project name | Domain | Source code available (yes/no)? | Is it a git repository (yes/no)? | Repository URL | Clone URL | Estimated number of commits |
|---|---|---|---|---|---|---|
| apache-httpd | web server | y | y | https://github.com/apache/httpd | https://github.com/DiffDetective/httpd.git | 32,927 |
| berkeley-db-libdb | database system | y | y | https://github.com/berkeleydb/libdb | https://github.com/DiffDetective/libdb.git | 7 |
Our dataset loader in fact only uses the project name and the clone URL. Hence, the dataset files and the loading code should be simplified. The Domain and Repository URL columns are interesting but not essential, so maybe they could stay in the files as the last two columns.
Also, except for line 2 of the file (the |---| separator row), a markdown file containing just a single table like this is effectively a CSV file with | as the separator instead of , or ;. So maybe we could reuse our CSV IO classes here.
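To illustrate the idea, here is a minimal sketch of such a loader. It is written in Python purely for illustration and does not use the project's CSV IO classes; the function name `load_datasets` is hypothetical. It assumes the file contains exactly one table, that line 2 is the separator row, and that the header cells are named "Project name" and "Clone URL" as in the example above. Columns are looked up by header name, so reordering or dropping the other columns would not break it.

```python
import csv


def load_datasets(markdown_text):
    """Return (project name, clone URL) pairs from a single-table markdown file.

    Hypothetical helper for illustration only.
    """
    lines = markdown_text.strip().splitlines()
    # Line 2 is the markdown separator row (|---|---|...), the only non-CSV line.
    data_lines = lines[:1] + lines[2:]
    reader = csv.reader(data_lines, delimiter="|")
    # Each row starts and ends with "|", so the first and last fields are empty.
    rows = [[cell.strip() for cell in row[1:-1]] for row in reader if row]
    header, records = rows[0], rows[1:]
    # Look the two required columns up by name instead of by position.
    name_col = header.index("Project name")
    clone_col = header.index("Clone URL")
    return [(record[name_col], record[clone_col]) for record in records]


if __name__ == "__main__":
    sample = """\
| Project name | Domain | Source code available (yes/no)? | Is it a git repository (yes/no)? | Repository URL | Clone URL | Estimated number of commits |
|---|---|---|---|---|---|---|
| apache-httpd | web server | y | y | https://github.com/apache/httpd | https://github.com/DiffDetective/httpd.git | 32,927 |
| berkeley-db-libdb | database system | y | y | https://github.com/berkeleydb/libdb | https://github.com/DiffDetective/libdb.git | 7 |
"""
    for name, clone_url in load_datasets(sample):
        print(name, clone_url)
```

Note that the comma in "32,927" is harmless here because the separator is |, not a comma.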