Commit c4ce826

update marker genes slides
1 parent a57b382 commit c4ce826

2 files changed

+3176
-255
lines changed

Slides/09_ClusterMarkerGenes.Rmd

Lines changed: 46 additions & 128 deletions
@@ -1,7 +1,7 @@
11
---
22
title: "Marker Gene Identification"
3-
author: "Zeynep Kalender-Atak; Hugo Tavares"
4-
date: 'Nov 2021'
3+
author: "Ashley Sawle, Zeynep Kalender-Atak, Hugo Tavares"
4+
date: 'Jun 2022'
55
output:
66
ioslides_presentation:
77
widescreen: true
@@ -18,167 +18,85 @@ knitr::include_graphics('Images/workflow2_cluMarkerGenes.png')
1818

1919
## Identifying Cluster Marker Genes
2020

21-
```{r echo=FALSE, out.width= "50%", out.extra='style="float:left; padding:20px"'}
21+
```{r echo=FALSE, out.width= "50%", out.extra='style="float:left; padding:30px"'}
2222
knitr::include_graphics('Images/tsne_clusters.png')
2323
```
2424

2525
Our goal is to identify genes that are differentially expressed between clusters
2626

27-
* exclusively expressed in a single cluster or not
27+
Calculate effect sizes that capture differences in:
2828

29-
* different methods that test for:
29+
* mean expression level
30+
* rank of expression
31+
* proportion of cells expressing the gene
3032

31-
* differences in the mean expression level
32-
33-
* differences in the rank of expression
34-
35-
* differences in the proportion of cells expressing the gene
36-
37-
* compile a summary table
33+
These are calculated in pairwise cluster comparisons.
3834

3935

40-
## Differential expression
36+
## Cohen's _d_: mean difference
4137

42-
* Differential expression is comparative. Common comparisons include:
38+
* Log(fold change) of mean gene expression, standardized by the average standard deviation across the groups.
39+
* A positive value indicates upregulation in the cluster of interest; a negative value indicates downregulation.
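
As a rough illustration of the Cohen's _d_ bullets above, here is a minimal base R sketch for a single gene in two clusters. The expression values and the `cohens_d()` helper are hypothetical, and pooling the spread as the square root of the mean of the two variances is an assumption made for this sketch; `scran` may differ in the details.

```r
# Toy log-normalised expression values for one gene in two clusters
# (hypothetical numbers, not taken from the course data).
cluster_a <- c(2.1, 2.5, 3.0, 2.8, 2.2)
cluster_b <- c(1.0, 1.4, 0.9, 1.6, 1.1)

cohens_d <- function(x, y) {
  # Difference in mean log-expression, i.e. the log-fold change
  lfc <- mean(x) - mean(y)
  # Assumed pooling: square root of the average of the two variances
  pooled_sd <- sqrt((var(x) + var(y)) / 2)
  lfc / pooled_sd
}

d_ab <- cohens_d(cluster_a, cluster_b)  # cluster A is the cluster of interest
d_ba <- cohens_d(cluster_b, cluster_a)  # same comparison, roles swapped
```

Swapping the two clusters flips the sign of the score, which is why a positive value reads as upregulation in the cluster of interest.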
4340

44-
* pairwise cluster comparisons
4541

46-
+ eg. cluster 1 vs cluster2, cluster 1 vs cluster 3, cluster 2 vs cluster 3, etc...
42+
## AUC: rank difference
4743

48-
* for a given cluster find 'marker genes' that are:
44+
* "Area Under the Curve" quantifies the ability to distinguish between two gene expression distributions.
45+
* Measures the likelihood that a random cell in the cluster of interest has a higher expression of the gene than a random cell in the other cluster.
46+
* Takes no account of the magnitude of gene expression.
47+
* Ranges from 0 to 1, which can be interpreted as:
48+
+ 1 = upregulation
49+
+ 0 = downregulation
50+
+ 0.5 = no difference
4951

50-
+ DE compared to at least one other cluster
51-
+ DE compared to each of the other clusters
52-
+ DE compared to “most” of the other clusters
53-
+ DE and up-regulated (easier to interpret)
54-
55-
* cell-type comparisons (if cell type is known) - with and without clustering
5652

57-
## `findMarkers`
53+
## Detection rate difference
5854

59-
```{r, eval=FALSE}
60-
findMarkers(
61-
sce,
62-
groups = sce$louvain, # clusters to compare
63-
block = sce$SampleGroup, # covariates in statistical model
64-
test.type = "t", # t-test (default)
65-
direction = "any", # test for either higher or lower expression (default)
66-
lfc = 0, # null hypothesis log-fold-change = 0 (default)
67-
pval.type = "any" # ranking of p-values based on any comparison (default)
68-
)
69-
```
55+
* Log(fold change) in the proportion of cells in which the gene is detected (counts > 0) in the cluster of interest versus the proportion of cells in which the gene is detected in the other cluster.
56+
* Takes no account of the magnitude of gene expression.
57+
* Positive values indicate that the gene is detected in more cells in the cluster of interest than the other cluster.
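
A minimal base R sketch of the detection rate score described above, using made-up counts for one gene. The `detected_lfc()` helper is hypothetical, and the use of log base 2 with no shrinkage is an assumption of this sketch; without any adjustment the ratio is undefined when a gene is detected in no cells of one cluster.

```r
# Toy raw counts for one gene in two clusters (hypothetical values)
counts_a <- c(0, 3, 1, 0, 5, 2, 0, 4)  # detected in 5 of 8 cells
counts_b <- c(0, 0, 1, 0, 0, 0, 2, 0)  # detected in 2 of 8 cells

detected_lfc <- function(x, y) {
  # Proportion of cells with at least one count in each cluster
  prop_x <- mean(x > 0)
  prop_y <- mean(y > 0)
  # Assumed base-2 log-fold change of the two proportions, no shrinkage
  log2(prop_x / prop_y)
}

lfc_detected <- detected_lfc(counts_a, counts_b)
```

Only whether a count is above zero enters the calculation, so a cell with 1 count and a cell with 500 counts contribute equally.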
7058

7159

60+
## `scran::scoreMarkers()` function
7261

73-
## `findMarkers`
62+
For each cluster the function computes the effect size scores between it and every
63+
other cluster.
7464

75-
```{r, eval=FALSE}
76-
findMarkers(
65+
```r
66+
scoreMarkers(
7767
sce,
78-
groups = sce$louvain, # clusters to compare
68+
groups = sce$louvain15, # clusters to compare
7969
block = sce$SampleGroup, # covariates in statistical model
80-
test.type = "t", # t-test (default)
81-
direction = "any", # test for either higher or lower expression (default)
82-
lfc = 0, # null hypothesis log-fold-change = 0 (default)
83-
pval.type = "any" # ranking of p-values based on any comparison (default)
8470
)
8571
```
8672

73+
Outputs a list of `DataFrame` objects, one per cluster, with summary statistics for the metrics we just covered (columns named with suffix `cohen`, `AUC` and `detected`).
8774

88-
```{r echo=FALSE, out.width='90%', fig.align='center'}
89-
knitr::include_graphics('Images/findmarkers_testType.png')
90-
```
91-
92-
93-
## Gene-wise null hypothesis
94-
95-
* t-test: "Is the mean expression of a gene in cluster 1 and cluster 2 the same?"
96-
97-
* Wilcoxon rank-sum test: "Is it equally likely that a randomly selected cell from cluster 1 has higher or lower expression of a gene than a randomly selected cell from cluster 2?"
98-
99-
* Binomial test: "Is the probability of a gene being expressed the same in cluster 1 and cluster 2?"
100-
101-
102-
## Statistical challenges
103-
104-
```{r echo=FALSE, out.width= "45%", out.extra='style="float:left; padding:30px"'}
105-
knitr::include_graphics('Images/marker_gene_distributions.png')
106-
```
107-
108-
To an extent, all these models poorly capture the underlying features of the data.
109-
110-
* high noise levels (technical and biological factors)
111-
112-
* small library sizes
113-
114-
* small amounts of available mRNAs result in amplification biases and dropout events
115-
116-
* 3' bias, partial coverage, uneven depth of transcripts
117-
118-
* stochastic nature of transcription
119-
120-
* multimodality in gene expression (presence of multiple possible cell states within a cell population)
121-
122-
<!--
123-
## Parametric vs Non-parametric tests
124-
125-
* The better a model captures the data, the better (more accurate) are the statistical inferences
12675

127-
* When we cannot fit a model to our data, we resort to non-parametric models (e.g. Wilcoxon rank-sum test, Kruskal-Wallis, Kolmogorov-Smirnov test)
76+
## `scran::scoreMarkers()`: summary statistics
12877

129-
* Non-parametric tests generally convert observed expression values to ranks
130-
131-
* They test whether the distribution of ranks for one group are significantly different from the distribution of ranks for the other group
132-
133-
* May fail in presence of large number of tied values, such as the case of dropouts (zeros) in scRNA-seq
134-
135-
* If the conditions for a parametric test hold, then it will be typically more powerful than a non-parametric test
136-
-->
137-
138-
139-
## Performance of different tests
140-
141-
```{r echo=FALSE, out.width= "40%", out.extra='style="float:left; padding:30px; margin-top: -85px"'}
142-
knitr::include_graphics('Images/DE_performance.png')
143-
```
144-
145-
However:
146-
147-
* t-test and Wilcoxon rank-sum test work well in practice, given at least a few dozen cells to compare
148-
149-
* Bulk RNA-seq analysis methods do not generally perform worse than those specifically developed for scRNA-seq
150-
151-
* Filtering out lowly expressed genes is quite important for good performance of bulk methods (edgeR, DESeq2)
152-
153-
(source: [Soneson & Robinson 2018](https://doi.org/10.1038/nmeth.4612))
78+
* **mean.X** - mean score across all pairwise comparisons.
79+
* **min.X** - minimum score obtained across all pairwise comparisons. Most stringent statistic: high score indicates upregulation relative to *all* other clusters.
80+
* **median.X** - median score across all pairwise comparisons. More robust to outliers than the mean.
81+
* **max.X** - maximum score obtained across all pairwise comparisons. The least stringent summary statistic: a high score only indicates that the gene is upregulated relative to *at least one* other cluster.
82+
* **rank.X** - minimum ranking ("min-rank") of that gene's score across all clusters. A rank of 1 indicates that gene had the highest score in at least one of the pairwise comparisons.
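
To make the summary statistics above concrete, here is a base R sketch that derives them from a small matrix of pairwise scores for one cluster of interest. The gene names, comparison names and score values are all hypothetical; this only mimics the definitions in the bullets and is not the `scran` implementation.

```r
# Pairwise scores (e.g. Cohen's d) for one cluster of interest:
# rows are genes, columns are comparisons against each other cluster.
scores <- matrix(
  c( 2.0, 0.1,  1.5,
     1.0, 1.2,  0.8,
    -0.5, 3.0, -1.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("geneA", "geneB", "geneC"),
    c("vs_cluster2", "vs_cluster3", "vs_cluster4")
  )
)

# Rank genes within each comparison (1 = highest score)
ranks <- apply(-scores, 2, rank)

summary_stats <- data.frame(
  mean = rowMeans(scores),
  min = apply(scores, 1, min),
  median = apply(scores, 1, median),
  max = apply(scores, 1, max),
  rank = apply(ranks, 1, min)  # best rank across comparisons ("min-rank")
)
```

In this toy table `geneC` has the largest `max` and a min-rank of 1 because it stands out against a single cluster, whereas `geneB` has the largest `min` because it is consistently higher than every other cluster.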
15483

15584

15685
## So, what's really important?
15786

158-
* understand what are we trying to compare with the different tests (difference in mean expression, difference in probability of being expressed, probability of being highly/lowly expressed)
159-
160-
* It’s important to understand the underlying data
161-
162-
* It’s important to assess and **validate the results**
163-
164-
* Strictly speaking, identifying genes differentially expressed between clusters is statistically flawed, since the clusters were themselves defined based on the gene expression data itself. Validation is crucial as a follow-up from these analyses.
165-
166-
167-
## Things to think about: during analysis
87+
* Understand what we are trying to compare with the different scores:
88+
+ difference in mean expression
89+
+ probability of being highly/lowly expressed
90+
    + difference in probability of being expressed
16891

169-
* Do not use batch-integrated expression data for differential analysis
92+
* Strictly speaking, identifying genes differentially expressed between clusters is statistically flawed, since the clusters were themselves defined based on the gene expression data itself. Validation is crucial as a follow-up from these analyses.
17093

171-
* Instead, **include batch in the statistical model** (the `findMarkers()` function has the `block` argument to achieve this)
94+
* Do not use batch-integrated expression data for calculating marker gene scores; instead, **include batch in the statistical model** (the `scoreMarkers()` function has the `block` argument to achieve this).
17295

173-
* Depending on the method you choose use: counts, normalised counts or log-normalized counts.
174-
175-
* Normalization strategy has a big influence on the results in differential expression.
176-
177-
* e.g comparing cell types with few expressed genes vs a cell type with many genes.
178-
179-
180-
## Things to think about: after analysis
96+
* Normalization strategy has a big influence on the differences in expression observed between cells and between clusters.
18197

18298
* A lot of what you get might be noise. Take two random sets of cells and run DE, and you will probably have a few significant genes with most of the commonly used tests.
18399

184-
* Think of the results as hypotheses that need independent verification (e.g. microscopy, qPCR)
100+
* It’s important to assess and **validate the results**. Think of the results as
101+
hypotheses that need independent verification (e.g. microscopy, qPCR)
102+

Slides/09_ClusterMarkerGenes.html

Lines changed: 3130 additions & 127 deletions
Large diffs are not rendered by default.
