# Overview

The SDMetrics library evaluates synthetic data by comparing it to the real data that you're trying to mimic. It includes a variety of metrics to capture different aspects of the data, for example **quality and privacy**. It also includes reports that you can run to generate insights and share with your team.

The SDMetrics library is **model-agnostic**, meaning you can use any synthetic data. The library does not need to know how you created the data.
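
For example, because every metric operates directly on the data you pass in, a single column can be scored in just a few lines. The snippet below is a minimal sketch using hypothetical values and the `KSComplement` metric from `sdmetrics.single_column`; the synthetic column could come from any generator.

```python
# Minimal sketch: SDMetrics only needs the real and synthetic data themselves.
# The values below are hypothetical; it does not matter which model produced
# the synthetic column.
import pandas as pd

from sdmetrics.single_column import KSComplement

real_ages = pd.Series([23, 35, 41, 29, 52, 47])
synthetic_ages = pd.Series([25, 33, 44, 30, 50, 45])

# Returns a score between 0.0 (very different distributions) and 1.0 (very similar)
score = KSComplement.compute(real_data=real_ages, synthetic_data=synthetic_ages)
print(score)
```
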
| Important Links                                   |                                                                        |
| ------------------------------------------------- | ---------------------------------------------------------------------- |
| :computer: **[Website]**                          | Check out the SDV Website for more information about the project.      |
| :orange_book: **[Blog]**                          | A deeper look at open source, synthetic data creation and evaluation.  |
| :book: **[Documentation]**                        | Quickstarts, User and Development Guides, and API Reference.           |
| :octocat: **[Repository]**                        | The link to the Github Repository of this library.                     |
| :scroll: **[License]**                            | The library is published under the MIT License.                        |
| :keyboard: **[Development Status]**               | This software is in its Pre-Alpha stage.                               |
| [![][Slack Logo] **Community**][Community]        | Join our Slack Workspace for announcements and discussions.            |
| [![][Google Colab Logo] **Tutorials**][Tutorials] | Get started with SDMetrics in a notebook.                              |

[Website]: https://sdv.dev
[Blog]: https://datacebo.com/blog
[Documentation]: https://docs.sdv.dev/sdmetrics
[Repository]: https://github.com/sdv-dev/SDMetrics
[License]: https://github.com/sdv-dev/SDMetrics/blob/master/LICENSE
[Development Status]: https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha
[Slack Logo]: https://github.com/sdv-dev/SDV/blob/master/docs/images/slack.png
[Community]: https://bit.ly/sdv-slack-invite
[Google Colab Logo]: https://github.com/sdv-dev/SDV/blob/master/docs/images/google_colab.png
[Tutorials]: https://bit.ly/sdmetrics-demo

## Features

Quickly generate insights and share results with your team using **SDMetrics Reports**. For example, the Diagnostic Report quickly checks for common problems, and the Quality Report provides visualizations comparing the real and synthetic data.

<img align="center" src="docs/images/column_comparison.png"></img>
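
As an illustration, here is a minimal sketch of running the Diagnostic Report, assuming the `DiagnosticReport` class follows the same `generate` workflow as the `QualityReport` shown in the Usage section below and exposes a `get_results` summary.

```python
# Sketch: run the Diagnostic Report on the single-table demo data.
# Assumes DiagnosticReport mirrors the QualityReport workflow shown below.
from sdmetrics import load_demo
from sdmetrics.reports.single_table import DiagnosticReport

real_data, synthetic_data, metadata = load_demo(modality='single_table')

diagnostic_report = DiagnosticReport()
diagnostic_report.generate(real_data, synthetic_data, metadata)

# Summarize the checks for common problems (e.g. out-of-bounds values)
print(diagnostic_report.get_results())
```
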
You can also explore and apply individual metrics as needed. The SDMetrics library includes a variety of metrics for different goals:

* Privacy metrics evaluate whether the synthetic data is leaking information about the real data
* ML Efficacy metrics estimate the outcomes of using the synthetic data to solve machine learning problems (see the sketch below)
* … and more!

Some of these metrics are experimental and actively being researched by the data science community.
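
As an illustration of the ML Efficacy idea, the sketch below trains a classifier on synthetic data and scores it on the real data, reusing the `real_data`, `synthetic_data` and `metadata` objects from the Usage section below. The `'employed'` target column is hypothetical and the exact `compute` keyword arguments are an assumption; check the documentation for the metric you choose.

```python
# Sketch of an ML Efficacy metric: train on synthetic data, evaluate on real data.
# The 'employed' target column is hypothetical; the keyword arguments are assumed.
from sdmetrics.single_table import BinaryDecisionTreeClassifier

score = BinaryDecisionTreeClassifier.compute(
    test_data=real_data,        # evaluate the trained model on the real data
    train_data=synthetic_data,  # fit the model on the synthetic data
    target='employed',          # hypothetical boolean column to predict
    metadata=metadata
)
print(score)
```
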
# Install

Install SDMetrics using pip or conda. We recommend using a virtual environment to avoid conflicts with other software on your device.

**Using `pip`:**

```bash
pip install sdmetrics
```

**Using `conda`:**

```bash
conda install -c conda-forge -c pytorch sdmetrics
```

For more installation options please visit the [SDMetrics Installation Guide](https://github.com/sdv-dev/SDMetrics/blob/master/INSTALL.md).
# Usage

Get started with **SDMetrics Reports** using some demo data.

```python
from sdmetrics import load_demo
from sdmetrics.reports.single_table import QualityReport

real_data, synthetic_data, metadata = load_demo(modality='single_table')

my_report = QualityReport()
my_report.generate(real_data, synthetic_data, metadata)
```
```
Creating report: 100%|██████████| 4/4 [00:00<00:00, 5.22it/s]

Overall Quality Score: 82.84%

Properties:
Column Shapes: 82.78%
Column Pair Trends: 82.9%
```

Once you generate the report, you can drill down on the details and visualize the results.

```python
my_report.get_visualization(property_name='Column Pair Trends')
```
<img align="center" src="docs/images/column_pairs.png"></img>
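
You can also pull the underlying scores behind each property as a table. The short sketch below assumes the report's `get_details` method, which returns a `pandas.DataFrame`.

```python
# Drill down into the per-column scores behind one property of the report
details = my_report.get_details(property_name='Column Shapes')
print(details.head())
```
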
Save the report and share it with your team.

```python
my_report.save(filepath='demo_data_quality_report.pkl')

# load it at any point in the future
my_report = QualityReport.load(filepath='demo_data_quality_report.pkl')
```

**Want more metrics?** You can also manually apply any of the metrics in this library to your data.

```python
# calculate whether the synthetic data respects the min/max bounds
# set by the real data
from sdmetrics.single_table import BoundaryAdherence

BoundaryAdherence.compute(
    real_data['start_date'],
    synthetic_data['start_date']
)
```
```
0.8503937007874016
```

```python
# calculate whether an attacker will be able to guess sensitive
# information based on a combination of synthetic data and their
# own information
from sdmetrics.single_table import CategoricalCAP

CategoricalCAP.compute(
    real_data,
    synthetic_data,
    key_fields=['gender', 'work_experience'],
    sensitive_fields=['degree_type']
)
```
```
0.4601209799017264
```

# What's next?

To learn more about the reports and metrics, visit the [SDMetrics Documentation](https://docs.sdv.dev/sdmetrics).

---