Commit 66fe179

uci datasets
1 parent 27f50d5 commit 66fe179

30 files changed, with 944 additions and 29 deletions.

README.md

Lines changed: 50 additions & 29 deletions
@@ -164,7 +164,7 @@ We have tried to resolve any conflicts in the *best* possible manner.
 `chameleon_t4_8k` and suggests its relation with CHAMELEON, but
 its screenshot does not appear in the paper.

-* `iris`, `iris5` - the (? - see Bezdek et al., 1999 for discussion)
+* `iris`, `iris5` - "the" (for discussion see Bezdek et al., 1999)
 famous Iris dataset and its imbalanced version considered
 in (Gagolewski et al., 2016).

@@ -176,7 +176,7 @@ We have tried to resolve any conflicts in the *best* possible manner.
 datasets available at the SIPU (Speech and Image Processing Unit,
 School of Computing, University of Eastern Finland) website

-Many datasets were proposed by Fränti et al., see
+Many datasets were proposed by P. Fränti et al., see
 (Fränti, Sieranoja, 2018). However, some datasets gathered from other
 sources (see the referenced catalogue for citations) but available
 for download via the SIPU website are also included.
@@ -190,22 +190,34 @@ We have tried to resolve any conflicts in the *best* possible manner.
 We excluded the `DIM`-sets as they turn out to be too easy
 for most algorithms.

-5. [`wut`](catalogue/wut.md) -
+5. [`uci`](catalogue/uci.md) -
+a selection of datasets available at the University of California, Irvine,
+[Machine Learning Repository](http://archive.ics.uci.edu/ml/)
+(Dua and Graff, 2019)
+
+Some datasets in this selection were considered
+for benchmark purposes
+in - among others - (Graves and Pedrycz, 2010); they are
+also listed in the SIPU repository.
+Note that "the" Iris dataset is available elsewhere (see `other`).
+
+6. [`wut`](catalogue/wut.md) -
 authored by the fantastic students
 of Marek's [Python for Data Analysis course](http://www.gagolewski.com/teaching/padpy/) @
 [Warsaw University of Technology](https://ww4.mini.pw.edu.pl/):
 Przemysław Kosewski, Jędrzej Krauze, Eliza Kaczorek, Anna Gierlak,
 Adam Wawrzyniak, Aleksander Truszczyński, Mateusz Kobyłka and Michał Maciąg.


-5. [`g2mg`](catalogue/g2mg.md) -
+7. [`g2mg`](catalogue/g2mg.md) -
 a modified version of the SIPU `G2`-sets with variances
-dependent on datasets' dimensionalities
+dependent on datasets' dimensionalities, i.e., s*np.sqrt(d/2),
+which makes these problems more difficult.

 Each dataset consists of 2048 observations belonging
 to either of two Gaussian clusters in 1, 2, ..., 128 dimensions.

-6. [`h2mg`](catalogue/h2mg.md) -
+8. [`h2mg`](catalogue/h2mg.md) -
 two Gaussian-like hubs with spread dependent on datasets' dimensionalities

 Each dataset consists of 2048 observations in 1, 2, ..., 128 dimensions.
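
The `g2mg` entry in the hunk above quotes a spread of `s*np.sqrt(d/2)`. The following is only a sketch of what such a scaling could look like, not the generator used for the benchmark files. It assumes that the factor is applied to the standard deviation, that the clusters are isotropic and balanced, and that the centres differ by a constant in every coordinate; the function name `g2mg_like`, the centre values and the seed are all hypothetical.

```python
import math
import random


def g2mg_like(d, s, n=2048, centres=(500.0, 600.0), seed=123):
    """Draw n points from two isotropic Gaussian clusters in d dimensions.

    Hypothetical sketch: the centres, the balanced split and the use of
    s*sqrt(d/2) as a standard deviation are assumptions, not facts taken
    from the benchmark's own generator.
    """
    rng = random.Random(seed)
    sigma = s * math.sqrt(d / 2)  # the scaling quoted in the README diff
    data, labels = [], []
    for i in range(n):
        label = i % 2  # alternate between the two clusters
        mu = centres[label]
        data.append([rng.gauss(mu, sigma) for _ in range(d)])
        labels.append(label + 1)  # clusters are numbered 1 and 2 here
    return data, labels


X, y = g2mg_like(d=8, s=10.0, n=100)
print(len(X), len(X[0]), sorted(set(y)))
```

Under these assumptions, the distance between the two centres grows like the square root of `d`, so a fixed spread would make higher-dimensional sets progressively easier; multiplying the spread by `sqrt(d/2)` keeps the ratio of separation to spread constant, and for `d = 2` the factor equals 1.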
@@ -266,28 +278,37 @@ We have tried to resolve any conflicts in the *best* possible manner.
 |43 |sipu/s4 | 5000| 2|
 |44 |sipu/spiral | 312| 2|
 |45 |sipu/unbalance | 6500| 2|
-|46 |wut/circles | 4000| 2|
-|47 |wut/cross | 2000| 2|
-|48 |wut/graph | 2500| 2|
-|49 |wut/isolation | 9000| 2|
-|50 |wut/labirynth | 3546| 2|
-|51 |wut/mk1 | 300| 2|
-|52 |wut/mk2 | 1000| 2|
-|53 |wut/mk3 | 600| 3|
-|54 |wut/mk4 | 1500| 3|
-|55 |wut/olympic | 5000| 2|
-|56 |wut/smile | 1000| 2|
-|57 |wut/stripes | 5000| 2|
-|58 |wut/trajectories | 10000| 2|
-|59 |wut/trapped_lovers | 5000| 3|
-|60 |wut/twosplashes | 400| 2|
-|61 |wut/windows | 2977| 2|
-|62 |wut/x1 | 120| 2|
-|63 |wut/x2 | 120| 2|
-|64 |wut/x3 | 185| 2|
-|65 |wut/z1 | 192| 2|
-|66 |wut/z2 | 900| 2|
-|67 |wut/z3 | 1000| 2|
+|46 |uci/ecoli | 336| 7|
+|47 |uci/glass | 214| 9|
+|48 |uci/ionosphere | 351| 34|
+|49 |uci/sonar | 208| 60|
+|50 |uci/statlog | 2310| 19|
+|51 |uci/wdbc | 569| 30|
+|52 |uci/wine | 178| 13|
+|53 |uci/yeast | 1484| 8|
+|54 |wut/circles | 4000| 2|
+|55 |wut/cross | 2000| 2|
+|56 |wut/graph | 2500| 2|
+|57 |wut/isolation | 9000| 2|
+|58 |wut/labirynth | 3546| 2|
+|59 |wut/mk1 | 300| 2|
+|60 |wut/mk2 | 1000| 2|
+|61 |wut/mk3 | 600| 3|
+|62 |wut/mk4 | 1500| 3|
+|63 |wut/olympic | 5000| 2|
+|64 |wut/smile | 1000| 2|
+|65 |wut/stripes | 5000| 2|
+|66 |wut/trajectories | 10000| 2|
+|67 |wut/trapped_lovers | 5000| 3|
+|68 |wut/twosplashes | 400| 2|
+|69 |wut/windows | 2977| 2|
+|70 |wut/x1 | 120| 2|
+|71 |wut/x2 | 120| 2|
+|72 |wut/x3 | 185| 2|
+|73 |wut/z1 | 192| 2|
+|74 |wut/z2 | 900| 2|
+|75 |wut/z3 | 1000| 2|
+


 We recommend that `h2mg` sets should be studied separately
@@ -458,7 +479,7 @@ Dasgupta S., Ng V. (2009). *Single Data, Multiple Clusterings*, In:
 Proc. NIPS Workshop *Clustering: Science or Art? Towards Principled Approaches*.
 Available at [clusteringtheory.org](http://clusteringtheory.org)

-Dua D., Karra Taniskidou E. (2018). *UCI Machine Learning Repository*
+Dua D., Graff C. (2019). *UCI Machine Learning Repository*
 [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California,
 School of Information and Computer Science.


catalogue/uci.csv

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+dataset,n,d,labels,k,noise,g
+uci/ecoli,336,7,labels0,8,0,0.6454081632653061
+uci/glass,214,9,labels0,6,0,0.48411214953271026
+uci/ionosphere,351,34,labels0,2,0,0.28205128205128205
+uci/sonar,208,60,labels0,2,0,0.0673076923076923
+uci/statlog,2310,19,labels0,7,0,0.0
+uci/wdbc,569,30,labels0,2,0,0.2548330404217926
+uci/wine,178,13,labels0,3,0,0.12921348314606743
+uci/yeast,1484,8,labels0,10,0,0.6323749625636418
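
The commit does not say how the `g` column is computed. One formula that appears to reproduce the values listed for these eight datasets from their `label_counts` (as given in `catalogue/uci.md`) is the normalised Gini index of the cluster sizes: the sum of absolute pairwise differences divided by `(k - 1) * n`. The sketch below is an assumption based on that numerical agreement; the function name `normalised_gini` is hypothetical and is not taken from the repository's scripts.

```python
from itertools import combinations


def normalised_gini(counts):
    """Normalised Gini index of cluster sizes.

    Returns 0.0 when all clusters have the same size and approaches 1.0
    as one cluster dominates. Assumed formula, not the repository's code.
    """
    k = len(counts)
    n = sum(counts)
    if k < 2 or n == 0:
        return 0.0
    # Sum of |x_i - x_j| over all unordered pairs of clusters.
    pairwise = sum(abs(a - b) for a, b in combinations(counts, 2))
    return pairwise / ((k - 1) * n)


# Label counts copied from the uci/wine and uci/statlog entries.
print(round(normalised_gini([59, 71, 48]), 3))
print(round(normalised_gini([330] * 7), 3))
```

The same counts also give a quick consistency check on the `n` column, since the sizes of all clusters in a dataset should add up to its number of observations.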

catalogue/uci.md

Lines changed: 235 additions & 0 deletions
@@ -0,0 +1,235 @@
+**[Benchmark Suite for Clustering Algorithms -- Version 1](https://github.com/gagolews/clustering_benchmarks_v1)
+is maintained by [Marek Gagolewski](http://www.gagolewski.com)**
+
+
+--------------------------------------------------------------------------------
+
+**Datasets**
+
+* [uci/ecoli](#uci_ecoli)
+* [uci/glass](#uci_glass)
+* [uci/ionosphere](#uci_ionosphere)
+* [uci/sonar](#uci_sonar)
+* [uci/statlog](#uci_statlog)
+* [uci/wdbc](#uci_wdbc)
+* [uci/wine](#uci_wine)
+* [uci/yeast](#uci_yeast)
+
+--------------------------------------------------------------------------------
+
+## uci/ecoli (n=336, d=7) <a name="uci_ecoli"></a>
+
+Ecoli
+
+More information: https://archive.ics.uci.edu/ml/datasets/Ecoli
+
+Please cite as:
+Dua, D. and Graff, C. (2019). UCI Machine Learning Repository
+[http://archive.ics.uci.edu/ml].
+Irvine, CA: University of California, School of Information and Computer Science.
+
+`labels0` come from the author(s).
+
+
+
+#### `labels0`
+
+true_k= 8, noise= 0, true_g=0.645
+
+label_counts=[143, 77, 52, 35, 20, 5, 2, 2]
+
+> **(preview generation suppressed)**
+
+
+
+
+
+## uci/glass (n=214, d=9) <a name="uci_glass"></a>
+
+Glass Identification
+
+More information: https://archive.ics.uci.edu/ml/datasets/glass+identification
+
+Please cite as:
+Dua, D. and Graff, C. (2019). UCI Machine Learning Repository
+[http://archive.ics.uci.edu/ml].
+Irvine, CA: University of California, School of Information and Computer Science.
+
+`labels0` come from the author(s).
+
+
+
+#### `labels0`
+
+true_k= 6, noise= 0, true_g=0.484
+
+label_counts=[70, 76, 17, 29, 13, 9]
+
+> **(preview generation suppressed)**
+
+
+
+
+
+## uci/ionosphere (n=351, d=34) <a name="uci_ionosphere"></a>
+
+Ionosphere
+
+More information: https://archive.ics.uci.edu/ml/datasets/Ionosphere
+
+Please cite as:
+Dua, D. and Graff, C. (2019). UCI Machine Learning Repository
+[http://archive.ics.uci.edu/ml].
+Irvine, CA: University of California, School of Information and Computer Science.
+
+`labels0` come from the author(s).
+
+
+
+#### `labels0`
+
+true_k= 2, noise= 0, true_g=0.282
+
+label_counts=[225, 126]
+
+> **(preview generation suppressed)**
+
+
+
+
+
+## uci/sonar (n=208, d=60) <a name="uci_sonar"></a>
+
+Connectionist Bench (Sonar, Mines vs. Rocks)
+
+More information: https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)
+
+Please cite as:
+Dua, D. and Graff, C. (2019). UCI Machine Learning Repository
+[http://archive.ics.uci.edu/ml].
+Irvine, CA: University of California, School of Information and Computer Science.
+
+`labels0` come from the author(s).
+
+
+
+#### `labels0`
+
+true_k= 2, noise= 0, true_g=0.067
+
+label_counts=[97, 111]
+
+> **(preview generation suppressed)**
+
+
+
+
+
+## uci/statlog (n=2310, d=19) <a name="uci_statlog"></a>
+
+Statlog (Image Segmentation)
+
+More information: https://archive.ics.uci.edu/ml/datasets/Statlog+(Image+Segmentation)
+
+Please cite as:
+Dua, D. and Graff, C. (2019). UCI Machine Learning Repository
+[http://archive.ics.uci.edu/ml].
+Irvine, CA: University of California, School of Information and Computer Science.
+
+`labels0` come from the author(s).
+
+
+
+#### `labels0`
+
+true_k= 7, noise= 0, true_g=0.000
+
+label_counts=[330, 330, 330, 330, 330, 330, 330]
+
+> **(preview generation suppressed)**
+
+
+
+
+
+## uci/wdbc (n=569, d=30) <a name="uci_wdbc"></a>
+
+Breast Cancer Wisconsin (Diagnostic)
+
+More information: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
+
+Please cite as:
+Dua, D. and Graff, C. (2019). UCI Machine Learning Repository
+[http://archive.ics.uci.edu/ml].
+Irvine, CA: University of California, School of Information and Computer Science.
+
+`labels0` come from the author(s).
+
+
+
+#### `labels0`
+
+true_k= 2, noise= 0, true_g=0.255
+
+label_counts=[212, 357]
+
+> **(preview generation suppressed)**
+
+
+
+
+
+## uci/wine (n=178, d=13) <a name="uci_wine"></a>
+
+Wine
+
+More information: https://archive.ics.uci.edu/ml/datasets/wine
+
+Please cite as:
+Dua, D. and Graff, C. (2019). UCI Machine Learning Repository
+[http://archive.ics.uci.edu/ml].
+Irvine, CA: University of California, School of Information and Computer Science.
+
+`labels0` come from the author(s).
+
+
+
+#### `labels0`
+
+true_k= 3, noise= 0, true_g=0.129
+
+label_counts=[59, 71, 48]
+
+> **(preview generation suppressed)**
+
+
+
+
+
+## uci/yeast (n=1484, d=8) <a name="uci_yeast"></a>
+
+Yeast
+
+More information: https://archive.ics.uci.edu/ml/datasets/Yeast
+
+Please cite as:
+Dua, D. and Graff, C. (2019). UCI Machine Learning Repository
+[http://archive.ics.uci.edu/ml].
+Irvine, CA: University of California, School of Information and Computer Science.
+
+`labels0` come from the author(s).
+
+
+
+#### `labels0`
+
+true_k=10, noise= 0, true_g=0.632
+
+label_counts=[244, 429, 463, 44, 51, 163, 35, 30, 20, 5]
+
+> **(preview generation suppressed)**
+
+
+
+
+

catalogue_generate_all.sh

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@
 ./catalogue_generate.py graves
 ./catalogue_generate.py other
 ./catalogue_generate.py sipu
+./catalogue_generate.py uci
 ./catalogue_generate.py wut
 ./catalogue_generate.py h2mg
 ./catalogue_generate.py g2mg

catalogue_summarise.R

File mode changed from 100644 to 100755.

uci/ecoli.data.gz

2.43 KB
Binary file not shown.

uci/ecoli.labels0.gz

60 Bytes
Binary file not shown.

uci/ecoli.txt

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+Ecoli
+
+More information: https://archive.ics.uci.edu/ml/datasets/Ecoli
+
+Please cite as:
+Dua, D. and Graff, C. (2019). UCI Machine Learning Repository
+[http://archive.ics.uci.edu/ml].
+Irvine, CA: University of California, School of Information and Computer Science.
+
+`labels0` come from the author(s).

uci/glass.data.gz

3.11 KB
Binary file not shown.

uci/glass.labels0.gz

53 Bytes
Binary file not shown.
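
The `*.data.gz` and `*.labels0.gz` files are listed only as binary blobs, so their layout is not visible in this diff. The sketch below assumes a common layout: gzip-compressed plain text with one observation per line and whitespace-separated numeric values, plus one integer label per line in the labels file. Both helper names are hypothetical, and the paths in the usage note are relative to a local checkout of the repository.

```python
import gzip


def read_matrix(path):
    """Read a gzip-compressed text matrix, one observation per line.

    Assumes whitespace-separated numeric columns; blank lines are skipped.
    """
    with gzip.open(path, "rt") as handle:
        return [[float(v) for v in line.split()]
                for line in handle if line.strip()]


def read_labels(path):
    """Read a gzip-compressed text file holding one integer label per line."""
    with gzip.open(path, "rt") as handle:
        return [int(float(line)) for line in handle if line.strip()]
```

If the assumed layout holds, `read_matrix("uci/ecoli.data.gz")` should return 336 rows of 7 values each, and `read_labels("uci/ecoli.labels0.gz")` should return 336 labels, matching the `n` and `d` columns in `catalogue/uci.csv`.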
