28 changes: 28 additions & 0 deletions .github/workflows/draft-pdf.yml
@@ -0,0 +1,28 @@
name: Draft PDF
on:
  push:
    paths:
      - paper/**
      - .github/workflows/draft-pdf.yml

jobs:
  paper:
    runs-on: ubuntu-latest
    name: Paper Draft
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Build draft PDF
        uses: openjournals/openjournals-draft-action@master
        with:
          journal: joss
          # This should be the path to the paper within your repo.
          paper-path: paper/paper.md
      - name: Upload
        uses: actions/upload-artifact@v4
        with:
          name: paper
          # This is the output path where Pandoc will write the compiled
          # PDF. Note, this should be the same directory as the input
          # paper.md
          path: paper/paper.pdf
65 changes: 65 additions & 0 deletions paper/paper.bib
@@ -0,0 +1,65 @@
@software{histomicsui,
  title = {HistomicsUI: Organize, visualize, annotate, and analyze histology images},
  author = {{Kitware, Inc}},
  year = {2025},
  note = {Package version 1.7.0},
  url = {https://github.com/DigitalSlideArchive/HistomicsUI},
  doi = {10.5281/zenodo.5474914},
}

@software{histomicstk,
  title = {HistomicsTK: a Python package for the analysis of digital pathology images},
  author = {{Kitware, Inc}},
  year = {2025},
  note = {Package version 1.4.0},
  url = {https://github.com/DigitalSlideArchive/HistomicsTK},
  doi = {10.5281/zenodo.14833780},
}

@software{digitalslidearchive,
  title = {Digital Slide Archive: a system for working with large microscopy images},
  author = {{Kitware, Inc}},
  year = {2025},
  note = {Commit 2da1bfc7365dd72011854b5aebf4a744cfcf98a1; Accessed: 2025-04-30},
  url = {https://github.com/DigitalSlideArchive/digital_slide_archive},
}

@article{batchbald2019,
  author = {Andreas Kirsch and Joost van Amersfoort and Yarin Gal},
  title = {BatchBALD: Efficient and Diverse Batch Acquisition for Deep Bayesian Active Learning},
  journal = {CoRR},
  volume = {abs/1906.08158},
  year = {2019},
  url = {http://arxiv.org/abs/1906.08158},
  eprinttype = {arXiv},
  eprint = {1906.08158},
}

@article{Gutman2017,
  title = {The Digital Slide Archive: A Software Platform for Management, Integration, and Analysis of Histology for Cancer Research},
  volume = {77},
  issn = {1538-7445},
  url = {http://dx.doi.org/10.1158/0008-5472.can-17-0629},
  doi = {10.1158/0008-5472.can-17-0629},
  number = {21},
  journal = {Cancer Research},
  publisher = {American Association for Cancer Research (AACR)},
  author = {Gutman, David A. and Khalilia, Mohammed and Lee, Sanghoon and Nalisnik, Michael and Mullen, Zach and Beezley, Jonathan and Chittajallu, Deepak R. and Manthey, David and Cooper, Lee A.D.},
  year = {2017},
  month = oct,
  pages = {e75--e78}
}

@misc{TCGAData,
  author = {{National Cancer Institute} and {National Human Genome Research Institute}},
  title = {The Cancer Genome Atlas (TCGA) Program},
  year = {2022},
  url = {https://www.cancer.gov/tcga},
  note = {Accessed: 2022-11-10}
}
85 changes: 85 additions & 0 deletions paper/paper.md
@@ -0,0 +1,85 @@
---
title: 'WSI Superpixel Guided Labeling'
tags:
  - Python
  - histology
  - bioimage informatics
  - whole slide annotation
  - whole slide images
  - guided labeling
# (add orcid for anyone who has one)
authors:
  - name: Brianna Major
    affiliation: 1
    orcid: 0000-0003-4968-5701
  - name: Jeffery A. Goldstein
    affiliation: 2
    orcid: 0000-0002-4086-057X
  - name: Michael Nagler
    affiliation: 1
    orcid: 0000-0003-3531-6630
  - name: Lee A. Newberg
    affiliation: 1
    orcid: 0000-0003-4644-8874
  - name: Abhishek Sharma
    affiliation: 2
  - name: Anders Sildnes
    affiliation: 2
  - name: Faiza Ahmed
    affiliation: 1
    orcid: 0000-0001-6687-9941
  - name: Jeff Baumes
    affiliation: 1
    orcid: 0000-0002-4719-3490
  - name: Lee A.D. Cooper
    affiliation: 2
    orcid: 0000-0002-3504-4965
  - name: David Manthey
    affiliation: 1
    orcid: 0000-0002-4580-8770
affiliations:
  - index: 1
    name: Kitware, Inc., New York, United States
  - index: 2
    name: Northwestern University Feinberg School of Medicine, Illinois, United States
date: 30 April 2025
bibliography: paper.bib
---

# Summary

`WSI Superpixel Guided Labeling` facilitates active learning on whole slide images. It provides a user interface built on top of HistomicsUI [@histomicsui] and deployed as part of the Digital Slide Archive [@Gutman2017; @digitalslidearchive], and it uses the HistomicsTK [@histomicstk] toolkit as part of the process.

Users label superpixel regions or other segmented areas of whole slide images, which serve as classification input for machine learning algorithms. An example algorithm is included that generates superpixels, features, and machine learning models for active learning on a directory of images. The interface supports bulk labeling, labeling the superpixels that are most impactful for improving the model, and reviewing labeled and predicted categories.

# Statement of need

One of the limitations in generating accurate models is the need for labeled data. Given a model and a few labeled samples, a variety of algorithms can determine which additional samples should be labeled to improve the model most efficiently. To actually obtain labeled data, this sample selection must be combined with an efficient workflow so that domain experts can spend their labeling time as effectively as possible.

`WSI Superpixel Guided Labeling` provides a user interface and workflow for this guided labeling process. Given a set of whole slide images, the images are segmented according to user choices; this segmentation is the basis for labeling. The user can specify any number of label categories, including labels that are excluded from training (for instance, for segmented regions whose categories cannot be accurately determined). After a few initial segments have been labeled, a model is trained and used both to predict the category of every segment and to identify the segments whose labels would most improve the model. The user can retrain the model at any time and review both the model's predictions and the labels assigned by other users.
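
The loop below is a minimal sketch of this label-train-predict cycle. It assumes a scikit-learn-style classifier with `fit` and `predict_proba` methods and a `certainty` scoring function where lower scores mean less certain; the helper names are illustrative, not the package's actual API.

```python
import numpy as np

def next_segments_to_label(model, features, labels, certainty, batch_size=8):
    """One retrain/predict/select cycle of the guided labeling loop.

    features:  (n_segments, n_features) array, one row per segment.
    labels:    dict mapping segment index -> category for labeled segments.
    certainty: function mapping (n, n_classes) probabilities to n scores,
               where lower means the model is less certain.
    Returns indices of the unlabeled segments to present to the user next.
    """
    labeled = sorted(labels)
    unlabeled = [i for i in range(len(features)) if i not in labels]

    # Retrain on everything labeled so far (categories excluded from
    # training would be filtered out before this point).
    model.fit(features[labeled], [labels[i] for i in labeled])

    # Predict category probabilities for the unlabeled segments.
    probs = model.predict_proba(features[unlabeled])

    # Surface the least certain segments for confirmation or correction.
    order = np.argsort(certainty(probs))[:batch_size]
    return [unlabeled[i] for i in order]
```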

For development, the initial segmentation uses superpixels generated with the SLIC algorithm. These are computed on whole slide images in a tiled manner so that arbitrarily large images can be handled, with tile boundaries treated carefully to avoid visible artifacts. Either of two basic model types can be trained and used for prediction: a small-scale CNN on image features, implemented in TensorFlow/Keras or PyTorch, or a Hugging Face foundation model that generates a one-dimensional feature vector. The certainty criterion that determines which segments should be labeled next can also be selected; the options include confidence, margin, negative entropy, and the BatchBALD [@batchbald2019] algorithm.
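
To illustrate the first three certainty options, here is one plausible formulation that scores each segment's predicted class probabilities; the exact definitions used internally may differ, and BatchBALD is omitted because it operates on batches of stochastic forward passes rather than single probability vectors. In each case a lower score means the model is less certain.

```python
import numpy as np

def confidence(probs):
    """Highest predicted class probability; low values mean uncertain."""
    return probs.max(axis=1)

def margin(probs):
    """Gap between the top two class probabilities."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def negative_entropy(probs):
    """Negative Shannon entropy; most negative for the most uncertain."""
    return np.sum(probs * np.log(probs + 1e-12), axis=1)

# Three segments' predicted probabilities over three categories:
p = np.array([[0.90, 0.05, 0.05],
              [0.40, 0.35, 0.25],
              [0.34, 0.33, 0.33]])
print(confidence(p))        # [0.9  0.4  0.34]
print(margin(p))            # [0.85 0.05 0.01]
print(negative_entropy(p))  # most uniform (last) row scores lowest
```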

A placental pathologist provided feedback during development to validate the efficiency of the user interface and the utility of the process.

# Basic Workflow

When starting a new labeling project, the user selects how superpixels are generated, which certainty metric determines the optimal labeling order, and which features are used for model training. The labeling mode allows defining project labels and performing initial labeling. This mode can also be used to add new label categories or to merge two categories that should not have been distinct.

![The Bulk Labeling interface showing one of the project images divided into superpixels with some categories defined. A user can "paint" areas with known labels as an initial seed for the guided labeling process.](../docs/screenshots/initial_labels.png)

Once some segments have been labeled and an initial training pass has run, additional segments are shown with their predicted labels. The user can confirm or correct these labels with keyboard shortcuts or the mouse. Segments are presented in the order that most improves the model, as determined by the certainty metric selected at project creation.

![The Guided Labeling interface showing a row of superpixels to be labeled and part of a whole slide image](../docs/screenshots/active_learning_view.png)

To check overall behavior or to correct mistakes, a review mode shows all labeled segments with various filtering and sorting options. It can be used to check agreement between pathologists or to determine how well the model agrees with the manually labeled data.

![The Review interface showing labeled superpixels in each category](../docs/screenshots/reviewmode.png)

The whole slide images in these figures come from data generated by the TCGA Research Network [@TCGAData].

# Acknowledgements

This work has been funded in part by National Library of Medicine grant 5R01LM013523 entitled "Guiding humans to create better labeled datasets for machine learning in biomedical research".

# References