@@ -108,7 +108,7 @@ In other words, use
 
 
 On a side note, we discourage the use of the (raw) Fowlkes-Mallows (FM) index,
-because its expected value for two "uncorrelated" partitions is 1/k,
+because its expected value for two unrelated partitions is 1/k,
 therefore averaging of FM scores for partitions of different cardinalities
 becomes meaningless.
 
@@ -198,14 +198,14 @@ We have tried to resolve any conflicts in the *best* possible manner.
 
 
 5. [`g2mg`](catalogue/g2mg.md) -
-    a "corrected" version of the SIPU `G2`-sets with variances
+    a modified version of the SIPU `G2`-sets with variances
     dependent on datasets' dimensionalities
 
     Each dataset consists of 2048 observations belonging
     to either of two Gaussian clusters in 1, 2, ..., 128 dimensions.
 
 6. [`h2mg`](catalogue/h2mg.md) -
-    two Gaussian-like "hubs" with spread dependent on datasets' dimensionalities
+    two Gaussian-like hubs with spread dependent on datasets' dimensionalities
 
     Each dataset consists of 2048 observations in 1, 2, ..., 128 dimensions.
     Each point is sampled from a sphere centred at its own cluster's centre,
@@ -277,7 +277,7 @@ We have tried to resolve any conflicts in the *best* possible manner.
 
 
 We recommend that `h2mg` sets should be studied separately
-(there are too many of them - they can easily "overshadow" the
+(there are too many of them -- they can easily overshadow the
 above ones).
 
 | | dataset | n| d|
@@ -355,11 +355,12 @@ ground truth label vectors
 
     * a gzipped text file with exactly `n` integers, one per each line
     * the `i`-th label (line) corresponds to the `i`-th data point
-    * `0` denotes the noise class (if present), first "meaningful" cluster is
+    * `0` denotes the noise class (if present), first meaningful cluster is
       named `1`
     * hence, class labels are consecutive integers: `0`, `1`, `2`, ..., `k`,
-      where `k` is the total number of "meaningful" clusters
-    * `labels0` usually denotes the "original" label vector as defined by
+      where `k` is the total number of clusters (noise not included in
+      the counting)
+    * `labels0` usually denotes the original label vector as defined by
       the dataset's creator (if provided)
 
 
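Note on the hunk above: the label-vector format it describes (gzipped text, one integer per line, `0` for noise, clusters numbered `1` to `k`) could be read and summarised roughly as follows. This is only a sketch using the Python standard library; `read_labels`, `summarise_labels`, and the demo file are hypothetical names made up for illustration, not part of the repository.

```python
import gzip
import os
import tempfile


def read_labels(path):
    """Read a gzipped label vector: one integer per line."""
    with gzip.open(path, "rt") as f:
        return [int(line) for line in f if line.strip()]


def summarise_labels(labels):
    """Return (k, noise_count); k does not count the noise class 0."""
    noise_count = sum(1 for y in labels if y == 0)
    # Labels are assumed to be consecutive integers, so the largest one is k.
    k = max(labels, default=0)
    return k, noise_count


# Demo on a made-up label vector (not a real benchmark file).
demo_labels = [1, 1, 2, 0, 3, 2, 0]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.labels0.gz")
    with gzip.open(demo_path, "wt") as f:
        f.write("\n".join(map(str, demo_labels)) + "\n")
    loaded = read_labels(demo_path)

k, noise_count = summarise_labels(loaded)
print(k, noise_count)
```

Taking `k` as the maximum label relies on the "consecutive integers" guarantee stated in the list; a stricter reader might also verify that every value in `1..k` actually occurs.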
@@ -385,7 +386,7 @@ scientific computing packages.
 
 ```python
 import numpy as np
-dataset = "..."  # e.g., wut/smile
+dataset = "..."  # e.g., "wut/smile" (UNIX-like) or r"wut\smile" (Windows)
 data = np.loadtxt(dataset+".data.gz", ndmin=2)
 labels = np.loadtxt(dataset+".labels0.gz", dtype=np.intc)
 # recall that 0 denotes the noise class, 1 - 1st cluster, 2 - 2nd one, etc.