Commit 26e52f0

fix README.md
1 parent 35ddeba commit 26e52f0

1 file changed: +9 -8 lines changed

README.md

Lines changed: 9 additions & 8 deletions
@@ -108,7 +108,7 @@ In other words, use
 
 
 On a side note, we discourage the use of the (raw) Fowlkes-Mallows (FM) index,
-because its expected value for two "uncorrelated" partitions is 1/k,
+because its expected value for two unrelated partitions is 1/k,
 therefore averaging of FM scores for partitions of different cardinalities
 becomes meaningless.
 
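The 1/k claim in the hunk above can be checked empirically. The following is a minimal sketch, not code from the repository: `fm_index` and `mean_fm_unrelated` are hypothetical helper names, and "unrelated" is modelled here as two label vectors drawn independently and uniformly from `k` classes (an assumption; the README does not spell out the sampling model).

```python
import math
import random
from collections import Counter


def count_pairs(counts):
    """Number of unordered pairs of points that share a group."""
    return sum(c * (c - 1) // 2 for c in counts)


def fm_index(x, y):
    """Raw (unadjusted) Fowlkes-Mallows index of two label sequences."""
    # Pairs placed together in both partitions (contingency-table cells).
    together_both = count_pairs(Counter(zip(x, y)).values())
    # Pairs placed together in x, and in y, respectively.
    together_x = count_pairs(Counter(x).values())
    together_y = count_pairs(Counter(y).values())
    return together_both / math.sqrt(together_x * together_y)


def mean_fm_unrelated(k, n=2000, trials=20, seed=0):
    """Average FM score over independently drawn label vectors.

    Assumption: labels are i.i.d. uniform on {0, ..., k-1}.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = [rng.randrange(k) for _ in range(n)]
        y = [rng.randrange(k) for _ in range(n)]
        total += fm_index(x, y)
    return total / trials


if __name__ == "__main__":
    for k in (2, 5, 10):
        print(k, round(mean_fm_unrelated(k), 3), "vs 1/k =", round(1 / k, 3))
```

Under this sampling assumption, the mean score tracks 1/k rather than 0, so raw FM scores from datasets with different `k` sit on different baselines and should not be averaged together.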

@@ -198,14 +198,14 @@ We have tried to resolve any conflicts in the *best* possible manner.
 
 
 5. [`g2mg`](catalogue/g2mg.md) -
-a "corrected" version of the SIPU `G2`-sets with variances
+a modified version of the SIPU `G2`-sets with variances
 dependent on datasets' dimensionalities
 
 Each dataset consists of 2048 observations belonging
 to either of two Gaussian clusters in 1, 2, ..., 128 dimensions.
 
 6. [`h2mg`](catalogue/h2mg.md) -
-two Gaussian-like "hubs" with spread dependent on datasets' dimensionalities
+two Gaussian-like hubs with spread dependent on datasets' dimensionalities
 
 Each dataset consists of 2048 observations in 1, 2, ..., 128 dimensions.
 Each point is sampled from a sphere centred at its own cluster's centre,
@@ -277,7 +277,7 @@ We have tried to resolve any conflicts in the *best* possible manner.
 
 
 We recommend that `h2mg` sets should be studied separately
-(there are too many of them - they can easily "overshadow" the
+(there are too many of them -- they can easily overshadow the
 above ones).
 
 | |dataset | n| d|
@@ -355,11 +355,12 @@ ground truth label vectors
 
 * a gzipped text file with exactly `n` integers, one per each line
 * the `i`-th label (line) corresponds to the `i`-th data point
-* `0` denotes the noise class (if present), first "meaningful" cluster is
+* `0` denotes the noise class (if present), first meaningful cluster is
 named `1`
 * hence, class labels are consecutive integers: `0`, `1`, `2`, ..., `k`,
-where `k` is the total number of "meaningful" clusters
-* `labels0` usually denotes the "original" label vector as defined by
+where `k` is the total number of clusters (noise not included in
+the counting)
+* `labels0` usually denotes the original label vector as defined by
 the dataset's creator (if provided)
 
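The label-vector format described in this hunk can be read with the standard library alone. This is a minimal sketch under stated assumptions: `read_labels` and `summarise_labels` are hypothetical names that do not come from the repository, and the file is assumed to follow the format exactly (one integer per line, `0` reserved for noise, clusters numbered `1`, ..., `k`).

```python
import gzip
from collections import Counter


def read_labels(path):
    """Read a gzipped text file holding one integer label per line."""
    with gzip.open(path, "rt") as f:
        return [int(line) for line in f if line.strip()]


def summarise_labels(labels):
    """Count clusters and noise points in a label vector.

    Assumes 0 marks noise and clusters are labelled 1, ..., k,
    so k (noise excluded) equals the largest label.
    """
    counts = Counter(labels)
    k = max(labels)
    return {
        "k": k,
        "noise": counts.get(0, 0),
        "sizes": [counts.get(j, 0) for j in range(1, k + 1)],
    }
```

For a label vector `[0, 1, 1, 2, 2, 2]`, `summarise_labels` reports `k` as 2, one noise point, and cluster sizes `[2, 3]`, which matches the counting rule in the changed lines (noise is not counted towards `k`).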

@@ -385,7 +386,7 @@ scientific computing packages.
 
 ```python
 import numpy as np
-dataset = "..." # e.g., wut/smile
+dataset = "..." # e.g., "wut/smile" (UNIX-like) or r"wut\smile" (Windows)
 data = np.loadtxt(dataset+".data.gz", ndmin=2)
 labels = np.loadtxt(dataset+".labels0.gz", dtype=np.intc)
 # recall that 0 denotes the noise class, 1 - 1st cluster, 2 - 2nd one, etc.
