
Commit 67eb1e3

Update readme (#231)
* update README
* Update images
* more updates
* Update blog link

1 parent 1fb60e4 commit 67eb1e3

3 files changed: +79 additions, -67 deletions


README.md

Lines changed: 79 additions & 67 deletions

# Overview

The SDMetrics library evaluates synthetic data by comparing it to the real data that you're trying to mimic. It includes a variety of metrics to capture different aspects of the data, for example **quality and privacy**. It also includes reports that you can run to generate insights and share with your team.

The SDMetrics library is **model-agnostic**, meaning you can use any synthetic data. The library does not need to know how you created the data.

| Important Links | |
| --------------------------------------------- | -------------------------------------------------------------------- |
| :computer: **[Website]** | Check out the SDV Website for more information about the project. |
| :orange_book: **[Blog]** | A deeper look at open source, synthetic data creation and evaluation. |
| :book: **[Documentation]** | Quickstarts, User and Development Guides, and API Reference. |
| :octocat: **[Repository]** | The link to the GitHub Repository of this library. |
| :scroll: **[License]** | The library is published under the MIT License. |
| :keyboard: **[Development Status]** | This software is in its Pre-Alpha stage. |
| [![][Slack Logo] **Community**][Community] | Join our Slack Workspace for announcements and discussions. |
| [![][Google Colab Logo] **Tutorials**][Tutorials] | Get started with SDMetrics in a notebook. |

[Website]: https://sdv.dev
[Blog]: https://datacebo.com/blog
[Documentation]: https://docs.sdv.dev/sdmetrics
[Repository]: https://github.com/sdv-dev/SDMetrics
[License]: https://github.com/sdv-dev/SDMetrics/blob/master/LICENSE
[Development Status]: https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha
[Slack Logo]: https://github.com/sdv-dev/SDV/blob/master/docs/images/slack.png
[Community]: https://bit.ly/sdv-slack-invite
[Google Colab Logo]: https://github.com/sdv-dev/SDV/blob/master/docs/images/google_colab.png
[Tutorials]: https://bit.ly/sdmetrics-demo

## Features

Quickly generate insights and share results with your team using **SDMetrics Reports**. For example, the Diagnostic Report quickly checks for common problems, and the Quality Report provides visualizations comparing the real and synthetic data.

<img align="center" src="docs/images/column_comparison.png"></img>
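
For instance, the Diagnostic Report can be generated in just a few lines. Here is a minimal sketch that uses the built-in demo data, assuming `DiagnosticReport` follows the same `generate` interface as the Quality Report shown in the Usage section below:

```python
# minimal sketch: run the Diagnostic Report on the built-in demo data
# (assumes DiagnosticReport mirrors the QualityReport.generate interface)
from sdmetrics import load_demo
from sdmetrics.reports.single_table import DiagnosticReport

real_data, synthetic_data, metadata = load_demo(modality='single_table')

diagnostic_report = DiagnosticReport()
diagnostic_report.generate(real_data, synthetic_data, metadata)
```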

You can also explore and apply individual metrics as needed. The SDMetrics library includes a variety of metrics for different goals:

* Privacy metrics evaluate whether the synthetic data is leaking information about the real data
* ML Efficacy metrics estimate the outcomes of using the synthetic data to solve machine learning problems
* … and more!

Some of these metrics are experimental and actively being researched by the data science community.

# Install

Install SDMetrics using pip or conda. We recommend using a virtual environment to avoid conflicts with other software on your device.

**Using `pip`:**

```bash
pip install sdmetrics
```

**Using `conda`:**

```bash
conda install -c conda-forge -c pytorch sdmetrics
```

For more installation options please visit the [SDMetrics installation Guide](https://github.com/sdv-dev/SDMetrics/blob/master/INSTALL.md).

# Usage

Get started with **SDMetrics Reports** using some demo data:

```python
from sdmetrics import load_demo
from sdmetrics.reports.single_table import QualityReport

real_data, synthetic_data, metadata = load_demo(modality='single_table')

my_report = QualityReport()
my_report.generate(real_data, synthetic_data, metadata)
```
```
Creating report: 100%|██████████| 4/4 [00:00<00:00, 5.22it/s]

Overall Quality Score: 82.84%

Properties:
Column Shapes: 82.78%
Column Pair Trends: 82.9%
```

Once you generate the report, you can drill down on the details and visualize the results.

```python
my_report.get_visualization(property_name='Column Pair Trends')
```
<img align="center" src="docs/images/column_pairs.png"></img>
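
You can also pull the underlying scores behind each property as a table. A minimal sketch, assuming the report exposes a `get_details` method keyed by property name:

```python
# drill down into the per-column scores behind a report property
# (assumes a get_details(property_name=...) accessor; see the docs)
my_report.get_details(property_name='Column Shapes')
```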

Save the report and share it with your team.
```python
my_report.save(filepath='demo_data_quality_report.pkl')

# load it at any point in the future
my_report = QualityReport.load(filepath='demo_data_quality_report.pkl')
```

**Want more metrics?** You can also manually apply any of the metrics in this library to your data.

```python
# calculate whether the synthetic data respects the min/max bounds
# set by the real data
from sdmetrics.single_table import BoundaryAdherence

BoundaryAdherence.compute(
    real_data['start_date'],
    synthetic_data['start_date']
)
```
```
0.8503937007874016
```

```python
# calculate whether an attacker will be able to guess sensitive
# information based on a combination of synthetic data and their
# own information
from sdmetrics.single_table import CategoricalCAP

CategoricalCAP.compute(
    real_data,
    synthetic_data,
    key_fields=['gender', 'work_experience'],
    sensitive_fields=['degree_type']
)
```
```
0.4601209799017264
```
155+
156+
# What's next?
142157

143-
* Single Column Metrics: [sdmetrics/single_column](sdmetrics/single_column)
144-
* Single Table Metrics: [sdmetrics/single_table](sdmetrics/single_table)
145-
* Multi Table Metrics: [sdmetrics/multi_table](sdmetrics/multi_table)
146-
* Time Series Metrics: [sdmetrics/timeseries](sdmetrics/timeseries)
158+
To learn more about the reports and metrics, visit the [SDMetrics Documentation](https://docs.sdv.dev/sdmetrics).
147159

148160
---
149161

docs/images/column_comparison.png (177 KB)

docs/images/column_pairs.png (217 KB)
