Commit b6ad790

Merge pull request #35 from lucasimi/develop
Develop
2 parents dc3c02b + 3d2ca3a commit b6ad790

27 files changed

+1368
-782
lines changed

.github/workflows/test.yml

Lines changed: 1 addition & 2 deletions
@@ -18,8 +18,7 @@ jobs:
1818
python-version: '3.10'
1919
- name: Install dependencies
2020
run: |
21-
python -m pip install coverage
22-
python -m pip install -e .
21+
python -m pip install -e .[dev]
2322
- name: Run tests and code coverage
2423
run: |
2524
coverage run --source=src -m unittest discover -s tests

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -4,3 +4,7 @@
44
**/*.egg-info
55
**/.ipynb_checkpoints
66
**/*.log
7+
8+
.coverage
9+
.vscode
10+
dist

README.md

Lines changed: 69 additions & 72 deletions
@@ -2,124 +2,121 @@
22

33
![test](https://github.com/lucasimi/tda-mapper-python/actions/workflows/test.yml/badge.svg) [![codecov](https://codecov.io/github/lucasimi/tda-mapper-python/graph/badge.svg?token=FWSD8JUG6R)](https://codecov.io/github/lucasimi/tda-mapper-python)
44

5-
In recent years, an ever growing interest in **Topological Data Analysis** (TDA) emerged in the field of data science. The core principle of TDA is to gain insights from data by using topological methods, as they show good resilience to noise, and they are often more stable than many traditional techniques. This Python package provides an implementation of the **Mapper Algorithm**, one of the most common tools from TDA.
5+
In recent years, an ever-growing interest in **Topological Data Analysis** (TDA) has emerged in the field of data science. The core idea of TDA is to gain insights from data by using topological methods that are provably robust to noise and that behave well with respect to dimension. This Python package provides an implementation of the **Mapper Algorithm**, a well-known tool from TDA.
66

7-
The mapper algorithm takes any dataset $X$ (usually high dimensional), and returns a graph $G$, called **Mapper Graph**. Surprisingly enough, despite living in a 2-dimensional space, the mapper graph $G$ represents a reliable summary for the shape of $X$ (they share the same number of connected components). This feature makes the mapper algorithm a very appealing choice over more traditional approaches, for example those based on projections, because they often give you no way to control shape distortions. Moreover, preventing artifacts is especially important for data visualization: the mapper graph is often a capable tool, which can help you identify hidden patterns in high-dimensional data.
7+
The Mapper Algorithm takes any dataset $X$ and returns a *shape-summary* in the form of a graph $G$, called the **Mapper Graph**. It's possible to prove, under reasonable conditions, that $X$ and $G$ share the same number of connected components.
88

99
## Basics
1010

11-
Here we'll give just a brief description of the core ideas around the mapper, but the interested reader is advised to take a look at the original [paper](https://research.math.osu.edu/tgda/mapperPBG.pdf). The Mapper Algorithm follows these steps:
11+
Let $f$ be any chosen *lens*, i.e. a continuous map $f \colon X \to Y$, where $Y$ is any parameter space (*typically* low dimensional). To build the Mapper Graph, follow these steps:
1212

13-
1. Take any *lens* you want. A lens is just a continuous map $f \colon X \to Y$, where $Y$ is any parameter space, usually having dimension lower than $X$. You can think of $f$ as a set of KPIs, or features of particular interest for the domain of study. Some common choices for $f$ are *statistics* (of any order), *projections*, *entropy*, *density*, *eccentricity*, and so forth.
13+
1. Build an *open cover* for $f(X)$, i.e. a collection of *open sets* whose union contains the whole image $f(X)$.
1414

15-
![Step 1](https://raw.githubusercontent.com/lucasimi/tda-mapper-python/main/resources/mapper_1.png)
15+
2. Run clustering on the preimage of each open set. All these local clusters together make a *refined open cover* for $X$.
1616

17-
2. Build an *open cover* for $f(X)$. An open cover is a collection of open sets (like open balls, or open intervals) whose union makes the whole image $f(X)$, and can possibly intersect.
17+
3. Build the mapper graph $G$ by taking a node for each local cluster, and by drawing an edge between two nodes whenever their corresponding local clusters intersect.
1818

19-
![Step 2](https://raw.githubusercontent.com/lucasimi/tda-mapper-python/main/resources/mapper_2.png)
19+
As an illustration, the following picture shows $X$ as an X-shaped point cloud in $\mathbb{R}^2$, with $f$ being the *height function*, i.e. the projection on the $y$-axis. In the leftmost part we cover the projection of $X$ with three open sets, each represented with a different color. Then we take the preimages of these sets, cluster them, and finally build the graph according to their intersections.
2020

21-
3. For each element $U$ of the open cover of $f(X)$, let $f^{-1}(U)$ be the preimage of $U$ under $f$. Then the collection of all the $f^{-1}(U)$'s makes an open cover of $X$. At this point, split every preimage $f^{-1}(U)$ into clusters, by running any chosen *clustering* algorithm, and keep track of all the local clusters obtained. All these local clusters together make a *refined open cover* for $X$.
21+
![Steps](resources/mapper.png)
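
The three steps can also be sketched in a few lines of Python. This is a minimal illustration written for this example only, not the implementation used by `tdamapper`: the helper `mapper_sketch`, the hand-picked intervals, and the choice of `DBSCAN` as the local clustering are all assumptions made here.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def mapper_sketch(X, lens, intervals, clusterer):
    """Hypothetical helper: return (nodes, edges) of a Mapper graph.

    nodes is a list of sets of point indices (the local clusters),
    edges is a set of index pairs (i, j) whose clusters share points.
    """
    nodes = []
    for lo, hi in intervals:
        # Step 2a: preimage of the open interval (lo, hi) under the lens.
        idx = np.flatnonzero((lens > lo) & (lens < hi))
        if idx.size == 0:
            continue
        # Step 2b: cluster the preimage; each cluster becomes a node.
        labels = clusterer.fit_predict(X[idx])
        for label in np.unique(labels):
            if label == -1:  # DBSCAN marks noise points with -1
                continue
            nodes.append(set(idx[labels == label].tolist()))
    # Step 3: connect two nodes whenever their clusters intersect.
    edges = {
        (i, j)
        for i in range(len(nodes))
        for j in range(i + 1, len(nodes))
        if nodes[i] & nodes[j]
    }
    return nodes, edges


# X-shaped point cloud: the two diagonals of the square [-1, 1] x [-1, 1].
t = np.linspace(-1.0, 1.0, 201)
X = np.vstack([np.column_stack([t, t]), np.column_stack([t, -t])])

# Lens: the height function, i.e. the projection on the y-axis.
lens = X[:, 1]

# Step 1: three overlapping open intervals covering the image of the lens.
intervals = [(-1.1, -0.3), (-0.5, 0.5), (0.3, 1.1)]

nodes, edges = mapper_sketch(X, lens, intervals, DBSCAN(eps=0.1, min_samples=3))
print(len(nodes), len(edges))
```

With these choices the bottom and top intervals each split into two clusters (the separate arms), while the middle interval gives a single cluster around the crossing point, so the resulting graph is itself X-shaped.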
2222

23-
![Step 3](https://raw.githubusercontent.com/lucasimi/tda-mapper-python/main/resources/mapper_3.png)
24-
25-
4. Build the mapper graph $G$ by taking a node for each local cluster, and by drawing an edge between two nodes whenever their corresponding local clusters intersect.
26-
27-
![Step 4](https://raw.githubusercontent.com/lucasimi/tda-mapper-python/main/resources/mapper_4.png)
28-
29-
N.B.: The choice of the lens $f$ has a deep practical impact on the mapper graph. Theoretically, if clusters were able to perfectly identify connected components (and if they were "reasonably well behaved"), choosing any $f$ would give the same mapper graph (see the [Nerve Theorem](https://en.wikipedia.org/wiki/Nerve_complex#Nerve_theorems) for a more precise statement). In this case, there would be no need for a tool like the mapper, since clustering algorithms would provide a complete tool to understand the shape of data. Unfortunately, clustering algorithms are not that good. Think for example about the case of $f$ being a constant function: in this case computing the mapper graph would be equivalent to performing clustering on the whole dataset. For this reason a good choice for $f$ would be any continuous map which is somewhat *sensitive* to the data: the more sublevel sets are apart, the higher the chance of a good local clustering.
23+
The choice of the lens has the largest impact on the shape of the Mapper Graph. Some common choices are *statistics*, *projections*, *entropy*, *density*, *eccentricity*, and so forth. However, picking a good lens usually requires domain knowledge specific to the data at hand. For an in-depth description of Mapper please read [the original paper](https://research.math.osu.edu/tgda/mapperPBG.pdf).
3024

3125
## Installation
3226

33-
First, clone this repo, `cd` into the local repo, and install via `pip` from your local repo
27+
Clone this repo, and install via `pip` from your local directory
3428
```
3529
python -m pip install .
3630
```
31+
Alternatively, you can use `pip` to install directly from GitHub
32+
```
33+
pip install git+https://github.com/lucasimi/tda-mapper-python.git
34+
```
35+
If you want to install the version from a specific branch, for example `develop`, you can run
36+
```
37+
pip install git+https://github.com/lucasimi/tda-mapper-python.git@develop
38+
```
3739

38-
## How to use this package - A First Example
40+
## A worked out example
3941

40-
In the following example, we use the mapper to perform some analysis on the famous Iris dataset. This dataset consists of 150 records, having 4 numerical features and a label which represents a class. As lens, we chose the PCA on two components.
42+
To show how to use this package, we perform some analysis on the well-known dataset of handwritten digits (more info [here](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html)), consisting of fewer than 2000 8x8 pictures represented as arrays of 64 elements.
4143

4244
```python
43-
from sklearn.datasets import load_iris
45+
import numpy as np
46+
47+
from sklearn.datasets import load_digits
4448
from sklearn.cluster import AgglomerativeClustering
4549
from sklearn.decomposition import PCA
4650

47-
import matplotlib
48-
4951
from tdamapper.core import *
5052
from tdamapper.cover import *
5153
from tdamapper.clustering import *
5254
from tdamapper.plot import *
5355

54-
iris_data = load_iris()
55-
X, y = iris_data.data, iris_data.target
56-
lens = PCA(2).fit_transform(X)
56+
import matplotlib
5757

58-
cover = CubicCover(n_intervals=7, overlap_frac=0.25)
59-
clustering = AgglomerativeClustering(n_clusters=2, linkage='single')
58+
digits = load_digits()
59+
X, y = [np.array(x) for x in digits.data], digits.target
60+
lens = PCA(2).fit_transform(X)
6061

61-
mapper_algo = MapperAlgorithm(cover, clustering)
62+
mapper_algo = MapperAlgorithm(
63+
cover=GridCover(n_intervals=10, overlap_frac=0.65),
64+
clustering=AgglomerativeClustering(10),
65+
verbose=True,
66+
n_jobs=8)
6267
mapper_graph = mapper_algo.fit_transform(X, lens)
63-
mapper_plot = MapperPlot(X, mapper_graph)
64-
colored = mapper_plot.with_colors(colors=list(y), agg=np.nanmedian)
6568

66-
fig, ax = plt.subplots(1, 1, figsize=(7, 7))
67-
colored.plot_static(title='class', ax=ax)
69+
mapper_plot = MapperPlot(X, mapper_graph,
70+
colors=y,
71+
cmap='jet',
72+
agg=np.nanmean,
73+
dim=2,
74+
iterations=400)
75+
fig_mean = mapper_plot.plot(title='digit (mean)', width=600, height=600)
76+
fig_mean.show(config={'scrollZoom': True})
6877
```
6978

70-
![The mapper graph of the iris dataset](https://raw.githubusercontent.com/lucasimi/tda-mapper-python/main/resources/iris.png)
71-
72-
As you can see from the plot, we can identify two major connected components, one which corresponds precisely to a single class, and the other which is shared by the other two classes.
79+
![The mapper graph of the digits dataset, colored according to mean value](resources/digits_mean.png)
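
To give a feeling for what the `n_intervals` and `overlap_frac` parameters control, here is a one-dimensional sketch of an overlapping interval cover. It follows one possible convention (each interval is widened so that neighbours overlap by `overlap_frac` times the base width); the function `interval_cover` is hypothetical, and the exact definition used by `GridCover` may differ.

```python
import numpy as np


def interval_cover(values, n_intervals, overlap_frac):
    """Hypothetical helper: cover [min, max] of values with overlapping intervals.

    The range is split into n_intervals pieces of equal width, then every
    piece is widened by overlap_frac * width / 2 on each side, so that two
    neighbouring intervals overlap by overlap_frac * width.
    """
    lo, hi = float(np.min(values)), float(np.max(values))
    width = (hi - lo) / n_intervals
    pad = overlap_frac * width / 2.0
    return [
        (lo + i * width - pad, lo + (i + 1) * width + pad)
        for i in range(n_intervals)
    ]


cover = interval_cover(np.array([0.0, 10.0]), n_intervals=5, overlap_frac=0.5)
for a, b in cover:
    print(f"({a:.2f}, {b:.2f})")
```

A larger overlap makes neighbouring preimages share more points, which tends to produce more edges in the Mapper Graph; more intervals tend to produce more, smaller nodes.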
7380

74-
## A Second Example
75-
76-
In this second example we try to take a look at the shape of the digits dataset. This dataset consists of less than 2000 pictures of handwritten digits, represented as dim-64 arrays (8x8 pictures)
81+
It's also possible to obtain a new plot colored according to different values, while keeping the same computed geometry. For example, to visualize how much dispersion there is in each cluster, we can color the nodes according to the standard deviation:
7782

7883
```python
79-
from sklearn.datasets import load_digits
80-
from sklearn.cluster import KMeans
81-
from sklearn.decomposition import PCA
82-
83-
from tdamapper.core import *
84-
from tdamapper.cover import *
85-
from tdamapper.clustering import *
86-
from tdamapper.plot import *
87-
88-
import matplotlib
89-
90-
digits = load_digits()
91-
X, y = [np.array(x) for x in digits.data], digits.target
92-
lens = PCA(2).fit_transform(X)
84+
fig_std = mapper_plot.with_colors(
85+
colors=y,
86+
cmap='viridis',
87+
agg=np.nanstd,
88+
).plot(title='digit (std)', width=600, height=600)
89+
fig_std.show(config={'scrollZoom': True})
90+
```
9391

94-
cover = CubicCover(n_intervals=15, overlap_frac=0.25)
95-
clustering = KMeans(10, n_init='auto')
92+
![The mapper graph of the digits dataset, colored according to std](resources/digits_std.png)
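
The role of the `agg` argument can be illustrated with plain NumPy. In this sketch each node is assumed to be a list of indices of the points it contains, and its color is the aggregation of the values of those points; the dictionary `node_members` and the helper `aggregate_node_values` are hypothetical and do not reflect how `MapperPlot` stores its data internally.

```python
import numpy as np


def aggregate_node_values(node_members, values, agg):
    """Hypothetical helper: apply agg to the values of the points in each node."""
    values = np.asarray(values, dtype=float)
    return {
        node: float(agg(values[list(members)]))
        for node, members in node_members.items()
    }


# Toy data: node 0 contains only points labelled 4,
# node 1 mixes a point labelled 4 with a point labelled 7.
node_members = {0: [0, 1, 2], 1: [2, 3]}
labels = [4, 4, 4, 7]

means = aggregate_node_values(node_members, labels, np.nanmean)
stds = aggregate_node_values(node_members, labels, np.nanstd)
print(means)  # {0: 4.0, 1: 5.5}
print(stds)   # {0: 0.0, 1: 1.5}
```

A node with zero standard deviation contains a single digit class, while a high value signals a node where different digits are mixed.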
9693

97-
mapper_algo = MapperAlgorithm(cover, clustering)
98-
mapper_graph = mapper_algo.fit_transform(X, lens)
99-
mapper_plot = MapperPlot(X, mapper_graph, iterations=100)
94+
The mapper graph of the digits dataset shows a few interesting patterns. For example, we can make the following observations:
10095

101-
fig = mapper_plot.with_colors(colors=y, cmap='jet', agg=np.nanmedian).plot_interactive_2d(title='digit', width=512, height=512)
102-
fig.show(config={'scrollZoom': True})
103-
```
104-
105-
![The mapper graph of the digits dataset](https://raw.githubusercontent.com/lucasimi/tda-mapper-python/main/resources/digits.png)
96+
* Clusters that share the same color are all connected together, and located in the same area of the graph. This behavior is present in those digits which are easy to tell apart from the others, for example digits 0 and 4.
10697

107-
As you can see the mapper graph shows interesting patterns. Note that the shape of the graph is obtained by looking only at the 8x8 pictures, discarding any information about the actual label (the digit). You can see that those local clusters which share the same labels are located in the same area of the graph. This tells you (as you would expect) that the labelling is *compatible with the shape of data*.
98+
* Some clusters are not well separated and tend to overlap one another. This mixed behavior is present in those digits which can be easily confused with one another, for example digits 5 and 6.
10899

109-
![Digits 4 and 7](https://raw.githubusercontent.com/lucasimi/tda-mapper-python/main/resources/digits_4_7.png)
100+
* Clusters located across the "boundary" of two different digits show a transition either due to a change in distribution or due to distortions in the handwritten text, for example digits 8 and 2.
110101

111-
Moreover, by zooming in, you can see that some clusters are located next to others. For example in the picture you can see the details of digits '4' (cyan) and '7' (red) being located one next to the other.
112102

113103
### Development - Supported Features
114104

115105
- [x] Topology
116-
- [x] Any custom lens
117-
- [x] Any custom metric
106+
- [x] custom lenses
107+
- [x] custom metrics
108+
118109
- [x] Cover algorithms:
119-
- [x] Cubic Cover
120-
- [x] Ball Cover
121-
- [x] Knn Cover
110+
- [x] `GridCover`
111+
- [x] `BallCover`
112+
- [x] `KnnCover`
113+
122114
- [x] Clustering algorithms
123-
- [x] Any sklearn clustering algorithm
124-
- [x] Skip clustering
125-
- [x] Clustering induced by cover
115+
- [x] `sklearn.cluster`-compatible algorithms
116+
- [x] `TrivialClustering` to skip clustering
117+
- [x] `CoverClustering` for clustering induced by cover
118+
119+
- [x] Plot
120+
- [x] 2d interactive plot
121+
- [x] 3d interactive plot
122+
- [ ] HTML embeddable plot

pyproject.toml

Lines changed: 5 additions & 5 deletions
@@ -4,8 +4,8 @@ build-backend = "setuptools.build_meta"
44

55
[project]
66
name = "tda-mapper"
7-
version = "0.0.2"
8-
description = "A simple and efficient implementation of the Mapper Algorithm for Topological Data Analysis (TDA)"
7+
version = "0.1.0"
8+
description = "A simple and efficient implementation of the Mapper Algorithm from Topological Data Analysis (TDA)"
99
readme = "README.md"
1010
authors = [{ name = "Luca Simi", email = "lucasimi90@gmail.com" }]
1111
license = { file = "LICENSE" }
@@ -19,13 +19,13 @@ dependencies = [
1919
"matplotlib>=3.3.4",
2020
"networkx>=2.5",
2121
"numpy>=1.20.1",
22-
"scikit-learn>=0.24.1",
23-
"plotly>=4.14.3"
22+
"plotly>=4.14.3",
23+
"joblib>=1.2.0"
2424
]
2525
requires-python = ">=3.6"
2626

2727
[project.optional-dependencies]
28-
dev = ["coverage"]
28+
dev = ["coverage", "pandas", "scikit-learn"]
2929

3030
[project.urls]
3131
Homepage = "https://github.com/lucasimi/tda-mapper-python"

resources/digits.png

-125 KB
Binary file not shown.

resources/digits_4_7.png

-132 KB
Binary file not shown.

resources/digits_mean.png

141 KB

resources/digits_std.png

136 KB

resources/mapper.png

37 KB
