Code
if(! "pacman" %in% installed.packages()) install.packages("pacman")
::p_load("Seurat", "tidyverse", "R.utils", "ggpubr", "patchwork",
pacman"celldex", "SingleR", "scRNAseq", "pheatmap",
"ggsci", "scds")
Case study - NKG2A and HLA-E define an alternative immune checkpoint axis in bladder cancer
Clarice Groeneveld
February 14, 2025
Download this notebook: https://raw.githubusercontent.com/csgroen/teaching/refs/heads/main/20250214_Tutorial_DU_Seurat.qmd
For the best understanding of this tutorial, please make sure you have at least an intermediate understanding of R. Here are some recommended resources to learn the basics and necessary intermediate concepts:
Specifically courses R Programming, Getting and Cleaning Data and Exploratory Data Analysis may provide a good jump start.
The very basics of R programming as also explained in this thorough book from the Posit team (that builds RStudio).
This goes further than we need, but a base for most of the programming questions a data analyst working in R could have to start with.
Seurat has many very useful tutorials for all of its features. Elements of this case study were taken from the following:
To ensure we have all installed libraries, we’ll use R package “pacman”, which allows us to check if packages are not installed and load them. This is not a replacement for more comprehensive version control (e.g. using renv
), but it’s simple to install and very lightweight.
I’m personally not a big fan of the standard look of Seurat plots, so I’ll apply this ggplot2 theme to (almost) all Seurat plots we generate:
Note: Seurat plots are ggplot-based and therefore can be “added to” in a grammar-of-graphics manner. Plots for multiple features are patchworks, which use a slightly different syntax to add to (this is why sometimes you’ll see
&
instead of+
being used).
Our dataset comes from Salomé et al (2022), “NKG2A and HLA-E define an alternative immune checkpoint axis in bladder cancer”. This single-cell data is a bit particular: they have FACS-sorted their cells into CD45+ and CD45-, notably to enrich their dataset in hematopoetic cells.
We’ll download the data from the link provided in the article:
[1] "385_Patient49_CD45pos" "394_Patient50_CD45pos" "397_Patient51_CD45pos"
[4] "421_Patient52_CD45pos" "434_Patient53_CD45pos" "435_Patient54_CD45pos"
[7] "453_Patient01_CD45pos" "454_Patient01_CD45neg" "456_Patient02_CD45pos"
[10] "457_Patient02_CD45neg" "584_Patient03_bulk" "595_Patient04_bulk"
[13] "613_Patient55_bulk" "651_Patient05_bulk" "656_Patient06_bulk"
[16] "663_Patient07_bulk" "695_Patient08_bulk"
We can see the data is distributed as one folder per dataset, and includes the bulk data, which we’re not going to use. For this reason, I will filter out the “bulk” folders and to get my vector of sc_dts
(single-cell datasets).
[1] "385_Patient49_CD45pos" "394_Patient50_CD45pos" "397_Patient51_CD45pos"
[4] "421_Patient52_CD45pos" "434_Patient53_CD45pos" "435_Patient54_CD45pos"
[7] "453_Patient01_CD45pos" "454_Patient01_CD45neg" "456_Patient02_CD45pos"
[10] "457_Patient02_CD45neg"
Syntax note:
ref_dir |> list.files() |> str_subset("bulk", negate = TRUE)
is equivalent tostr_subset(list.files(ref_dir), "bulk", negate = TRUE)
. The|>
was first introduced as%>%
from R package magrittr and officially incorporated into base R>=4.0. It’s purpose is to increase code legibility through “chaining” commands instead of “nesting” them. If you would like to learn more, you can check the help file? |>
We can check inside each folder:
[1] "barcodes.tsv.gz" "features.tsv.gz" "matrix.mtx.gz"
This is a common export of single-cell data and can be read directly into Seurat using function Read10X
.
This snippet will read all of our samples at once:
read_exp_sample <- function(sc_file) {
message('Reading file: ', sc_file)
so <- ref_dir |> filePath(sc_file) |> Read10X()
so <- CreateSeuratObject(so, project = "BLCA_NKG2A")
so[["sample"]] <- str_remove(sc_file, "^.{4}")
so[["patient"]] <- str_extract(sc_file, "Patient..")
so[["CD45"]] <- str_extract(sc_file, "(?<=CD45).*")
return(so)
}
sc_data <- lapply(sc_dts, read_exp_sample)
Reading file: 385_Patient49_CD45pos
Reading file: 394_Patient50_CD45pos
Reading file: 397_Patient51_CD45pos
Reading file: 421_Patient52_CD45pos
Reading file: 434_Patient53_CD45pos
Reading file: 435_Patient54_CD45pos
Reading file: 453_Patient01_CD45pos
Reading file: 454_Patient01_CD45neg
Reading file: 456_Patient02_CD45pos
Reading file: 457_Patient02_CD45neg
The script we ran generated 10 Seurat objects, which we will merge into one for further analysis. Each count matrix, at first, will be stored as a Layer
in our Seurat object.
Here are some “essential commands” for accession of information inside the Seurat object: https://satijalab.org/seurat/articles/essential_commands.html.
The document also outlines the “basic pipeline”.
To see some general info about the object, we can just print it:
An object of class Seurat
33538 features across 50940 samples within 1 assay
Active assay: RNA (33538 features, 0 variable features)
10 layers present: counts.1, counts.2, counts.3, counts.4, counts.5, counts.6, counts.7, counts.8, counts.9, counts.10
Note the layers, for the different samples. We can also see what analyses have been run in this object (none yet).
[[]]
allows us to access the cell metadata in a concise manner:
orig.ident nCount_RNA nFeature_RNA sample
AAACCTGAGGCCCTCA-1_1 BLCA_NKG2A 4917 1671 Patient49_CD45pos
AAACCTGAGTTCGCGC-1_1 BLCA_NKG2A 891 369 Patient49_CD45pos
AAACCTGCACTGTTAG-1_1 BLCA_NKG2A 1121 606 Patient49_CD45pos
AAACCTGCAGACGCTC-1_1 BLCA_NKG2A 3877 1484 Patient49_CD45pos
AAACCTGGTACAGCAG-1_1 BLCA_NKG2A 1393 552 Patient49_CD45pos
AAACCTGGTCACACGC-1_1 BLCA_NKG2A 1491 709 Patient49_CD45pos
patient CD45
AAACCTGAGGCCCTCA-1_1 Patient49 pos
AAACCTGAGTTCGCGC-1_1 Patient49 pos
AAACCTGCACTGTTAG-1_1 Patient49 pos
AAACCTGCAGACGCTC-1_1 Patient49 pos
AAACCTGGTACAGCAG-1_1 Patient49 pos
AAACCTGGTCACACGC-1_1 Patient49 pos
To get matrices from the Seurat objects, there are many “official” ways, but one of these two, which have exactly the same output, should work well:
5 x 5 sparse Matrix of class "dgCMatrix"
AAACCTGAGGCCCTCA-1_1 AAACCTGAGTTCGCGC-1_1 AAACCTGCACTGTTAG-1_1
MIR1302-2HG . . .
FAM138A . . .
OR4F5 . . .
AL627309.1 . . .
AL627309.3 . . .
AAACCTGCAGACGCTC-1_1 AAACCTGGTACAGCAG-1_1
MIR1302-2HG . .
FAM138A . .
OR4F5 . .
AL627309.1 . .
AL627309.3 . .
5 x 5 sparse Matrix of class "dgCMatrix"
AAACCTGAGGCCCTCA-1_1 AAACCTGAGTTCGCGC-1_1 AAACCTGCACTGTTAG-1_1
MIR1302-2HG . . .
FAM138A . . .
OR4F5 . . .
AL627309.1 . . .
AL627309.3 . . .
AAACCTGCAGACGCTC-1_1 AAACCTGGTACAGCAG-1_1
MIR1302-2HG . .
FAM138A . .
OR4F5 . .
AL627309.1 . .
AL627309.3 . .
To get the “ids” of the cells, we can simply call:
For a single-cell RNA-seq object, we can see the genes (features) like so:
There are very standard steps for pre-processing, but that require some level of decision on the part of the user. The basic tutorial of Seurat, on PBMCs, covers this in detail: https://satijalab.org/seurat/articles/pbmc3k_tutorial
This command will figure out the percentage of reads of mitchondrial RNA:
There are three major pieces of information we’d like to check to make filtering decisions: number of expressed features, number of reads and the percentage of mitochondrial RNA. Here, I plot per sample:
Warning: Default search for "data" layer in "RNA" assay yielded no results;
utilizing "counts" layer instead.
We can also plot this for a particular known variable. For example, here I show for whether the cells were FACS-sorted as CD45 positive or negative:
Warning: Default search for "data" layer in "RNA" assay yielded no results;
utilizing "counts" layer instead.
We’ll reproduce the filtering and normalization steps proposed by the paper as closely as possible:
For each sample, cells were first selected as expressing less than 16–20% mitochondrial genes and displaying a minimum of 200–300 and a maximum of 2500–3500 features.
As we’re standardizing pre-processing for all samples, we’ll take the least stringent:
Data were then log-normalized using a scale factor of 10,000.
Normalizing layer: counts.1
Normalizing layer: counts.2
Normalizing layer: counts.3
Normalizing layer: counts.4
Normalizing layer: counts.5
Normalizing layer: counts.6
Normalizing layer: counts.7
Normalizing layer: counts.8
Normalizing layer: counts.9
Normalizing layer: counts.10
The 2,000 most variable features were then identified, data were scaled based on all the features […]
Finding variable features for layer counts.1
Finding variable features for layer counts.2
Finding variable features for layer counts.3
Finding variable features for layer counts.4
Finding variable features for layer counts.5
Finding variable features for layer counts.6
Finding variable features for layer counts.7
Finding variable features for layer counts.8
Finding variable features for layer counts.9
Finding variable features for layer counts.10
Centering and scaling data matrix
[…] and principal component analysis was performed.
PC_ 1
Positive: S100P, SPINK1, KRT18, ADIRF, PSCA, KRT8, KRT19, MGST1, GDF15, KLF5
CD24, HPGD, DHRS2, HES1, VSIG2, SDC4, SMIM22, C19orf33, SLPI, S100A6
AGR2, SPINT2, TNFAIP2, GPRC5A, TSPAN6, S100A13, CLDN7, CD9, EFNA1, WFDC2
Negative: CD52, IGKC, LAPTM5, RGS1, S100A4, IGLC2, IL32, CD3D, CD74, IGHA1
IGHG1, CD69, LSP1, HSPA1A, VIM, LTB, CCL5, TRBC2, HLA-DPB1, COTL1
IGLC3, ALOX5AP, TRBC1, IGHG2, GPR183, RGS2, GZMA, TYROBP, NKG7, CD7
PC_ 2
Positive: SPARC, IGFBP7, CALD1, COL1A2, COL3A1, COL6A2, COL4A1, IFITM3, MYL9, COL1A1
FN1, TPM2, CAVIN3, THY1, BGN, FSTL1, COX7A1, TIMP1, COL6A1, DCN
C1S, TAGLN, C1R, RGS5, COL5A2, LGALS1, SPARCL1, MFGE8, CTGF, COL6A3
Negative: HPGD, GATA3, CD24, S100P, RAB11FIP1, CD3D, SPINK1, PSCA, DHRS2, KRT8
KRT19, KLF5, CD52, UPK2, TRBC2, TMEM97, GDPD3, GDF15, TMPRSS2, VSIG2
UPK1A, SDC4, KRT20, SMIM22, SPTSSB, MGST1, CNGA1, WFDC2, CXADR, UPK3A
PC_ 3
Positive: CALD1, SPARC, COL3A1, COL6A2, COL1A2, IGFBP7, IL32, COL4A1, COL1A1, RGS5
THY1, MYL9, TPM2, SELENOM, CAVIN3, COL6A1, FSTL1, SOD3, MFGE8, BGN
COL5A2, COX7A1, MGP, COL6A3, NDUFA4L2, SPARCL1, PCOLCE, CTGF, C1R, LHFPL6
Negative: TYROBP, FCER1G, LYZ, MS4A6A, CD14, HLA-DRA, FTL, CTSS, CD68, C1QC
C1QA, C1QB, MS4A7, FCGR2A, SPI1, OLR1, LST1, MS4A4A, C5AR1, PSAP
HLA-DRB5, SERPINA1, APOC1, CTSB, CST3, SAT1, CD163, HLA-DPA1, BCL2A1, HLA-DMA
PC_ 4
Positive: RAMP2, PCAT19, VWF, CLEC14A, HSPG2, EGFL7, RAMP3, CALCRL, AQP1, ECSCR
PCDH17, PODXL, FLT1, CD34, CDH5, VWA1, FAM167B, CLDN5, SLCO2A1, CXorf36
ENG, PECAM1, CYYR1, MMRN2, NOTCH4, SLC9A3R2, INSR, HYAL2, TM4SF1, ESM1
Negative: COL1A2, COL3A1, COL1A1, DCN, COL6A3, TPM2, C1S, PCOLCE, C1R, COL5A2
TAGLN, LUM, SOD3, COL6A1, RARRES2, BGN, ACTA2, MYL9, AEBP1, SERPINF1
COL6A2, CYGB, MXRA8, RGS5, MFGE8, NDUFA4L2, MEG3, PPP1R14A, SDC2, SERPING1
PC_ 5
Positive: RNASE1, UPK2, GDPD3, UBE2C, TMEM97, CNGA1, UPK1A, RAB11FIP1, S100A13, UPK3A
BHMT, TNFAIP2, STMN1, CD24, TMPRSS2, PLPP5, SDC4, BIRC5, NQO1, SPRR3
EMX2, KRT20, MKI67, ERBB2, CCNB2, CDK1, HMGB2, PCLAF, MYCL, MESP1
Negative: KRT17, FABP4, SERPINB5, DKK1, PLA2G2A, CLU, CLCA4, CRTAC1, LAMB3, S100A2
AQP3, TSPAN1, IGFBP2, IGFBP5, ERRFI1, IRF6, ANXA1, BTBD16, ANXA10, FHL2
SHH, PERP, FABP5, PRAC1, CYP1B1, FOSL1, HSD17B2, SOX15, SFN, GPR87
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 13337773 712.4 19671861 1050.6 19671861 1050.6
Vcells 1682607264 12837.3 5535284006 42230.9 5765848852 43990.0
We can visualize the PCA with DimPlot
, coloring by different categorical variables. Here, we see the influence of both the sample (batch) and CD45 positive vs negative (which is correlated to batch but might also contribute to variance).
Dimensionality of the dataset was then assessed using the JackStraw and ElbowPlot functions.
From Seurat documentation, we see that JackStraw is quite computationally intensive:
In Macosko et al, we implemented a resampling test inspired by the JackStraw procedure. While still available in Seurat (see previous vignette), this is a slow and computationally expensive procedure, and we is no longer routinely used in single cell analysis.
This is not the most beautiful Elbow plot ever, but the idea is to find the “inflection point” of the plot (I’m estimating at about 10) and go a little bit further into the “stable” area, where adding more dimensions doesn’t add more infromation.
For methods that require the dimensionality, we’ll set it at 15.
It’s never a bad idea to add some doublet detection.
Doublets are UMIs that contained more than one cell and thefore have a mixed profile.
This benchmark shows some of the main methods (though new ones may emerge): https://www.sciencedirect.com/science/article/pii/S2405471220304592
We’ll use cxds
from scdc
because it’s fast and reasonably accurate.
However, it uses a SingleCellExperiment
standard. Luckily, we can easily convert between SeuratObject
and SingleCellExperiment
.
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 13355280 713.3 19671861 1050.6 19671861 1050.6
Vcells 1682219141 12834.4 5535284006 42230.9 5765848852 43990.0
Clusters were calculated […]
Before clustering, we need to establish a neighborhood graph and compute shared nearest neighbors (SNN). We’ll do this on 15 PCs:
From this, we can use “FindClusters” for Louvain clustering, which uses a rather temperamental “resolution” parameter, very dependent on datasets usually.
The largest the “resolution”, the more clusters are obtained.
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 44699
Number of edges: 1534079
Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9465
Number of communities: 26
Elapsed time: 7 seconds
UMAP and t-SNE are dimensionality reductions for visualization. They are very popular and we’ll use them widely. They minimize local distances, often making “neat” looking clusters.
Be wary of not using the UMAP projection itself as an input for analysis, as they contain lots of data distortions (which is normal of anything that is reducing a space of 3000 features to 2 dimensions)
[…] and data dimensions were reduced using the t-SNE and UMAP methods.
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
16:34:12 UMAP embedding parameters a = 0.9922 b = 1.112
16:34:12 Read 44699 rows and found 15 numeric columns
16:34:12 Using Annoy for neighbor search, n_neighbors = 30
16:34:12 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
16:34:15 Writing NN index file to temp file /tmp/RtmpPri6SU/file12975263d7093
16:34:15 Searching Annoy index using 1 thread, search_k = 3000
16:34:27 Annoy recall = 100%
16:34:28 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
16:34:29 Initializing from normalized Laplacian + noise (using RSpectra)
16:34:31 Commencing optimization for 200 epochs, with 1900728 positive edges
16:34:45 Optimization finished
We can check the state of our Seurat object:
An object of class Seurat
33538 features across 44699 samples within 1 assay
Active assay: RNA (33538 features, 2000 variable features)
21 layers present: counts.1, counts.2, counts.3, counts.4, counts.5, counts.6, counts.7, counts.8, counts.9, counts.10, data.1, data.2, data.3, data.4, data.5, data.6, data.7, data.8, data.9, data.10, scale.data
2 dimensional reductions calculated: pca, umap
And plot the Seurat clusters… only to find we’re definitely clustering per sample.
Batch effect is rampant in single-cell RNA-seq and there are lots of integration methods proposed to deal with this. They often return a dimensionality reduction (more precisely, latent space) that minimizes this batch effect.
There are 5 methods implemented in Seurat alone. We’ll use Harmony, because it’s fairly fast and effective, but see this benchmark for the most performant methods:
See the reference tutorial: https://satijalab.org/seurat/articles/seurat5_integration
Bulk cells were merged for the HLA analysis, while CD8 T cells only were merged for the CD8 T cell analysis. CD8 T cells were identified using the FindAllMarkers and the VlnPlot functions.
Warning in harmony::HarmonyMatrix(data_mat = Embeddings(object = orig), :
HarmonyMatrix is deprecated and will be removed in the future from the API in
the future
Warning: Warning: The parameters do_pca and npcs are deprecated. They will be ignored for this function call and please remove parameters do_pca and npcs and pass to harmony cell_embeddings directly.
This warning is displayed once per session.
Warning: Warning: The parameter tau is deprecated. It will be ignored for this function call and please remove parameter tau in future function calls. Advanced users can set value of parameter tau by using parameter .options and function harmony_options().
This warning is displayed once per session.
Warning: Warning: The parameter block.size is deprecated. It will be ignored for this function call and please remove parameter block.size in future function calls. Advanced users can set value of parameter block.size by using parameter .options and function harmony_options().
This warning is displayed once per session.
Warning: Warning: The parameter max.iter.harmony is replaced with parameter max_iter. It will be ignored for this function call and please use parameter max_iter in future function calls.
This warning is displayed once per session.
Warning: Warning: The parameter max.iter.cluster is deprecated. It will be ignored for this function call and please remove parameter max.iter.cluster in future function calls. Advanced users can set value of parameter max.iter.cluster by using parameter .options and function harmony_options().
This warning is displayed once per session.
Warning: Warning: The parameter epsilon.cluster is deprecated. It will be ignored for this function call and please remove parameter epsilon.cluster in future function calls. Advanced users can set value of parameter epsilon.cluster by using parameter .options and function harmony_options().
This warning is displayed once per session.
Warning: Warning: The parameter epsilon.harmony is deprecated. It will be ignored for this function call and please remove parameter epsilon.harmony in future function calls. If users want to control if harmony would stop early or not, use parameter early_stop. Advanced users can set value of parameter epsilon.harmony by using parameter .options and function harmony_options().
This warning is displayed once per session.
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 13450658 718.4 19671861 1050.6 19671861 1050.6
Vcells 1691092990 12902.1 5535284006 42230.9 5765848852 43990.0
We’ll need re-run clustering algorithm in the Harmony latent space (here, using 30 dimensions)
Computing nearest neighbor graph
Computing SNN
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 44699
Number of edges: 1710955
Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9238
Number of communities: 18
Elapsed time: 10 seconds
We can also run a UMAP in this latent space:
16:36:20 UMAP embedding parameters a = 0.9922 b = 1.112
16:36:20 Read 44699 rows and found 30 numeric columns
16:36:20 Using Annoy for neighbor search, n_neighbors = 30
16:36:20 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
16:36:23 Writing NN index file to temp file /tmp/RtmpPri6SU/file12975e7a5eda
16:36:23 Searching Annoy index using 1 thread, search_k = 3000
16:36:35 Annoy recall = 100%
16:36:36 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
16:36:37 Initializing from normalized Laplacian + noise (using RSpectra)
16:36:39 Commencing optimization for 200 epochs, with 2061946 positive edges
16:36:54 Optimization finished
We can check back on our features that could be markers of bias in a cluster, and check back in with our doublet scores:
AAACCTGAGGCCCTCA-1_1 AAACCTGAGTTCGCGC-1_1 AAACCTGCACTGTTAG-1_1
266320.09 29805.91 219085.93
AAACCTGCAGACGCTC-1_1 AAACCTGGTACAGCAG-1_1
100458.58 193619.85
Picking joint bandwidth of 569
Picking joint bandwidth of 123
Picking joint bandwidth of 0.469
Picking joint bandwidth of 6840
We can see, for example, that cluster 17 seems to be enriched in doublets.
To compare the co-occurence of two categorical variables, we can do a heatmap:
From this, we can exclude cluster 17:
Differential expression analysis will by default run on the “identity” feature, which we make sure to set as “harmony_clusters” at this point:
AAACCTGAGGCCCTCA-1_1 AAACCTGAGTTCGCGC-1_1 AAACCTGCACTGTTAG-1_1
2 6 1
AAACCTGCAGACGCTC-1_1 AAACCTGGTACAGCAG-1_1
3 1
Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
For differential analysis to use all layers at the same time, we need to join them before finding markers (each clusters vs all others):
Calculating cluster 0
Calculating cluster 1
Calculating cluster 2
Calculating cluster 3
Calculating cluster 4
Calculating cluster 5
Calculating cluster 6
Calculating cluster 7
Calculating cluster 8
Calculating cluster 9
Calculating cluster 10
Calculating cluster 11
Calculating cluster 12
Calculating cluster 13
Calculating cluster 14
Calculating cluster 15
Calculating cluster 16
For annotation, we can now use this table:
p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
IL7R 0 2.929902 0.634 0.158 0 0 IL7R
CD3D 0 1.512433 0.750 0.281 0 0 CD3D
IL32 0 1.225477 0.865 0.404 0 0 IL32
TRAC 0 1.525893 0.722 0.273 0 0 TRAC
LTB 0 2.055308 0.713 0.273 0 0 LTB
CD2 0 1.805078 0.660 0.230 0 0 CD2
And visualize either top markers or any chosen features in many different ways:
The heatmap takes a while to run, but provides a comprehensive view of a marker expression over the cells:
The dotplot summarizes the information into the heatmap and gives the same importance to all clusters, independent of size:
FeaturePlot
is the same as DimPlot
, but using one or many continuous values instead:
There are an infinity of methods for cell type annotation. I’ve found that using a good reference is more important than the method for annotation.
We’ll show reference-based annotation using SingleR, mostly because it’s easy to use but also well performing.
Every automatic annotation method uses a reference. celldex
provides an easy way to access some annotation sources, notably the Human Primary Cell Atlas, which we will use in this example.
# A tibble: 7 × 10
name version path title description taxonomy_id genome samples labels
<chr> <chr> <chr> <chr> <chr> <list> <list> <int> <list>
1 dice 2024-0… <NA> Huma… Human bulk… <list [1]> <list> 1561 <list>
2 blueprint_e… 2024-0… <NA> Huma… Human bulk… <list [1]> <list> 259 <list>
3 immgen 2024-0… <NA> Mous… Mouse micr… <list [1]> <list> 830 <list>
4 mouse_rnaseq 2024-0… <NA> Bulk… Bulk RNA-s… <list [1]> <list> 358 <list>
5 hpca 2024-0… <NA> Micr… Microarray… <list [1]> <list> 713 <list>
6 novershtern… 2024-0… <NA> Bulk… Bulk micro… <list [1]> <list> 211 <list>
7 monaco_immu… 2024-0… <NA> Huma… Human bulk… <list [1]> <list> 114 <list>
# ℹ 1 more variable: sources <list>
class: SummarizedExperiment
dim: 19363 713
metadata(0):
assays(1): logcounts
rownames(19363): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
rowData names(0):
colnames(713): GSM112490 GSM112491 ... GSM92233 GSM92234
colData names(3): label.main label.fine label.ont
It’s given as a SummarizedExperiment
of purified bulk tumors.=, annotated as follows:
[1] "DC" "Smooth_muscle_cells" "Epithelial_cells"
[4] "B_cell" "Neutrophils" "T_cells"
[7] "Monocyte" "Erythroblast" "BM & Prog."
[10] "Endothelial_cells" "Gametocytes" "Neurons"
[13] "Keratinocytes" "HSC_-G-CSF" "Macrophage"
[16] "NK_cell" "Embryonic_stem_cells" "Tissue_stem_cells"
[19] "Chondrocytes" "Osteoblasts" "BM"
[22] "Platelets" "Fibroblasts" "iPS_cells"
[25] "Hepatocytes" "MSC" "Neuroepithelial_cell"
[28] "Astrocyte" "HSC_CD34+" "CMP"
[31] "GMP" "MEP" "Myelocyte"
[34] "Pre-B_cell_CD34-" "Pro-B_cell_CD34+" "Pro-Myelocyte"
There’s also a finer annotation, which we won’t use:
[1] "DC:monocyte-derived:immature" "DC:monocyte-derived:Galectin-1"
[3] "DC:monocyte-derived:LPS" "DC:monocyte-derived"
[5] "Smooth_muscle_cells:bronchial:vit_D" "Smooth_muscle_cells:bronchial"
[7] "Epithelial_cells:bronchial" "B_cell"
[9] "Neutrophil" "T_cell:CD8+_Central_memory"
[11] "T_cell:CD8+" "T_cell:CD4+"
[13] "T_cell:CD8+_effector_memory_RA" "T_cell:CD8+_effector_memory"
[15] "T_cell:CD8+_naive" "Monocyte"
[17] "Erythroblast" "BM"
[19] "DC:monocyte-derived:rosiglitazone" "DC:monocyte-derived:AM580"
SingleR is really simple to use, we just need to provide an expression matrix and reference as SummarizedExperiment
:
DataFrame with 44535 rows and 4 columns
scores labels
<matrix> <character>
AAACCTGAGGCCCTCA-1_1 0.2029037:0.213122:0.201966:... Epithelial_cells
AAACCTGAGTTCGCGC-1_1 0.0681688:0.198615:0.175877:... B_cell
AAACCTGCACTGTTAG-1_1 0.0876658:0.201213:0.204570:... T_cells
AAACCTGCAGACGCTC-1_1 0.1877214:0.195236:0.193748:... Epithelial_cells
AAACCTGGTACAGCAG-1_1 0.0911602:0.192878:0.188423:... T_cells
... ... ...
TTTGTCAGTAAGTAGT-1_10 0.1291041:0.126240:0.0979679:... Endothelial_cells
TTTGTCAGTCCTCCAT-1_10 0.2087580:0.136874:0.1276952:... Smooth_muscle_cells
TTTGTCAGTCTAGTCA-1_10 0.1394180:0.249968:0.2645048:... Monocyte
TTTGTCAGTGCTGTAT-1_10 0.2323622:0.210428:0.1944528:... Endothelial_cells
TTTGTCATCTAGCACA-1_10 0.0968535:0.202014:0.1887533:... NK_cell
delta.next pruned.labels
<numeric> <character>
AAACCTGAGGCCCTCA-1_1 0.0539724 Epithelial_cells
AAACCTGAGTTCGCGC-1_1 0.0693563 B_cell
AAACCTGCACTGTTAG-1_1 0.1751957 T_cells
AAACCTGCAGACGCTC-1_1 0.5717106 Epithelial_cells
AAACCTGGTACAGCAG-1_1 0.0322466 T_cells
... ... ...
TTTGTCAGTAAGTAGT-1_10 0.05911158 Endothelial_cells
TTTGTCAGTCCTCCAT-1_10 0.00671502 Smooth_muscle_cells
TTTGTCAGTCTAGTCA-1_10 0.02419472 Monocyte
TTTGTCAGTGCTGTAT-1_10 0.06293215 Endothelial_cells
TTTGTCATCTAGCACA-1_10 0.09986082 NK_cell
Once we have the predictions, I add it to the Seurat object:
We can check the prediction results and “harmonize” them by cluster, i.e. take the cluster-level mode of the annotation:
T_cells T_cells Epithelial_cells Epithelial_cells
"0" "1" "2" "3"
Monocyte Macrophage B_cell Tissue_stem_cells
"4" "5" "6" "7"
T_cells B_cell Epithelial_cells Epithelial_cells
"8" "9" "10" "11"
Endothelial_cells Monocyte CMP Pre-B_cell_CD34-
"12" "13" "14" "15"
DC
"16"
If we plot, we can see the difference between the “pred” and “clean”:
Most methods will work best with a relevant single-cell reference.
Here is a not-so-recent but independent benchmark: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1795-z
It can be interesting to try to use this tutorial to annotate a query dataset based on a reference atlas: https://satijalab.org/seurat/articles/integration_mapping
Note: always take automatic annotations with a grain of salt. They are a good departure point, but the assignment to a cell type needs to be checked manually. A good way is to go back to the differentially expressed markers using your group annotations.
In their manuscript, Salomé et al selected the T cells and subclustered them. This is a common strategy when there is a main population of interest of which we would like to characterize functional states or subpopulations. It can also be interesting to provide more detailed annotations, since the space of analysis changes.
To use subclusters, just subset the object to the annotated cells and re-run the pipeline to find new relevant subgroups.
Trajectory analysis is used to reconstruct and visualize the dynamic processes of cellular development, differentiation, and state transitions. It infers the order of cells in a particular trajectory, allows us to calculate continuous “pseudotimes”.
It can be used to map out potential lineage relationships, identify intermediate cell states, and elucidate the pathways underlying tissue development, immune responses, or disease progression, for example.
In Figure 4, Salomé et al used this type of analysis to show a subset of differentiated PD-1+ TRM CD8+ T Cells that retain independent function.
A not so new but very comprehensive benchmark of trajectory inference by Saelens et al (2019) is a great jumping point: https://www.nature.com/articles/s41587-019-0071-9
Accompanied by the superb dynverse
wrappers, which provide all methods in the benchmark, as well as guidance on what to choose: https://dynverse.org/
There are many methods to infer cell cycle stage and score cell cycle in Seurat, but it also has a built in one that works quite decently:
We can see very little cycling except in one “epithelial” (tumor) area we previously annotated.
RNA velocity is a computation method that attempts to predict the future transcriptional state of individual cells. It distinguishes between unspliced (pre-mRNA) and spliced (mature mRNA) transcripts within each cell, and therefore necessitates a special input in which these reads have been separated. This differentiation allows the estimation of the rate at which genes are being transcribed and processed, providing insights into the “direction” and “speed” of cellular state changes.
The primary goal of RNA velocity analysis is to elucidate the trajectories of cellular differentiation, activation, and other dynamic processes, thereby offering a deeper understanding of cellular development and lineage relationships. By forecasting how cells are likely to evolve, it can help in identifying transient cell states, and uncovering mechanisms underlying tissue development, regeneration, and disease progression.
It can be used together with traditional trajectory analysis.
In Salomé et al (2022), this analysis is used in Figure 5 to further characterize the axis of differentiation/exhaustion of CD8+ TRM cells, shown below:
Most RNA velocity tools work in Python, the two most well-known being velocyto
and scVelo
. Both have tutorials showing how to use reticulate
to run their workflows in R.
Pathway enrichment aims to identify biological pathways, processes, or functional categories that are significantly overrepresented within a set of genes of interest, such as those differentially expressed between distinct cell populations or states.
This analysis typically involves mapping the selected genes to curated pathway databases like Gene Ontology, KEGG, or Reactome and applying statistical methods to assess the significance of their enrichment.
There are many methods to do this, but a easy to use is decoupleR
, in which AUCell
is implemented, which is a well-performing scoring method: https://www.bioconductor.org/packages/release/bioc/vignettes/decoupleR/inst/doc/decoupleR.html
Gene regulatory network (GRN) analysis seeks to identify key transcription factors, regulatory motifs, and signaling pathways that drive cell states, differentiation, and responses to environmental stimuli. The ultimate goal of GRN analysis is to map out the regulatory architecture that underpins cellular heterogeneity, enabling us to pinpoint master regulators and understand the molecular mechanisms underlying various biological processes and disease states.
From an algorithmic point of view, this involves inferring relationships such as gene co-expression, regulatory influences, and causal interactions.
SCENIC
is the most popular method to do this, with the Python implementation being much more performant (https://github.com/aertslab/pySCENIC). The protocol is probably the best way to run this type of analysis currently: https://github.com/aertslab/SCENICprotocol
Copy Number Variation (CNV) analysis is a computational approach used to infer genomic copy number alterations from transcriptomic data. It leverages the variability in gene expression levels across the genome to detect gains or losses of genomic regions.
This method is particularly valuable in studies of cancer and other genetic disorders, where CNVs play a crucial role in disease progression and heterogeneity.
The most well-known tool used for this analysis is inferCNV
, which can be extremely slow and doesn’t perform very well with cells with low-read count. This is why we implemented fastCNV
, available to install for R on Github: https://github.com/must-bioinfo/fastCNVdata
Ligand-receptor analysis is a computational strategy to infer cell-cell communication by identifying interactions between signaling molecules (ligands) and their corresponding receptors expressed on different cell types or states.
This analysis involves systematically pairing ligands expressed by one or more cell populations with receptors expressed by neighboring or interacting cells, thereby predicting potential signaling pathways that facilitate intercellular communication.
Well known tools are CellPhoneDB
and CellChat
, but LIANA+
provides wrappers and an all-in-one framework for this kind of analysis: https://liana-py.readthedocs.io/en/latest/
Notably, this analysis can be much more powerful in spatially-resolved data.
Format conversion is often necessary when working with single-cell RNA-seq to use tools available for different objects.
This can be easy between SingleCellExperiment and SeuratObject, as we used it in this tutorial with as.SingleCellExperiment
and as.Seurat
.
However, these functions sometimes don’t work and the Seurat solution for h5 objects, SeuratDisk
, even fails its own tutorial.
sceasy
(https://github.com/cellgeni/sceasy) was made to resolve these conversion problems. Note that it requires reticulate
(to make Python work in R) to be installed.
---
title: R and **Seurat** to analyse single cell RNA-seq
subtitle: Case study - NKG2A and HLA-E define an alternative immune checkpoint axis in bladder cancer
author: Clarice Groeneveld
editor:
markdown:
wrap: 72
format:
html:
toc: true
toc-depth: 4
toc-title: Contents
lightbox: true
code-fold: true
code-tools: true
categories:
- r
- tutorial
- single-cell
date: "14 February 2025"
---
Download this notebook:
<https://raw.githubusercontent.com/csgroen/teaching/refs/heads/main/20250214_Tutorial_DU_Seurat.qmd>
## Pre-requisites
For the best understanding of this tutorial, please make sure you have
at least an intermediate understanding of R. Here are some recommended
resources to learn the basics and necessary intermediate concepts:
1. swirl: <https://swirlstats.com/students.html>
Specifically courses *R Programming*, *Getting and Cleaning Data* and
*Exploratory Data Analysis* may provide a good jump start.
2. Hands On Programming with R:
<https://rstudio-education.github.io/hopr/>
The very basics of R programming as also explained in this thorough book
from the Posit team (that builds RStudio).
3. R for Data Science: <https://r4ds.hadley.nz/>
This goes further than we need, but a base for most of the programming
questions a data analyst working in R could have to start with.
## Resources
Seurat has many very useful tutorials for all of its features. Elements
of this case study were taken from the following:
1. Seurat - Guided Clustering Tutorial
<https://satijalab.org/seurat/archive/v3.0/pbmc3k_tutorial.html>
2. Introduction to scRNA-seq integration:
<https://satijalab.org/seurat/articles/integration_introduction>
3. Mapping and annotating query datasets:
<https://satijalab.org/seurat/articles/integration_mapping>
## Install packages
To ensure we have all installed libraries, we'll use R package "pacman",
which allows us to check if packages are not installed and load them.
This is not a replacement for more comprehensive version control (e.g.
using `renv`), but it's simple to install and very lightweight.
```{r}
if(! "pacman" %in% installed.packages()) install.packages("pacman")
pacman::p_load("Seurat", "tidyverse", "R.utils", "ggpubr", "patchwork",
"celldex", "SingleR", "scRNAseq", "pheatmap",
"ggsci", "scds")
```
I'm personally not a big fan of the standard look of Seurat plots, so
I'll apply this ggplot2 theme to (almost) all Seurat plots we generate:
```{r}
theme_clean <-
theme_linedraw() +
theme(panel.grid = element_blank(),
panel.border = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
```
> Note: Seurat plots are ggplot-based and therefore can be "added to" in
> a grammar-of-graphics manner. Plots for multiple features are
> patchworks, which use a slightly different syntax to add to (this is
> why sometimes you'll see `&` instead of `+` being used).
## Download data
Our dataset comes from Salomé et al (2022), ["NKG2A and HLA-E define an
alternative immune checkpoint axis in bladder
cancer"](https://www.sciencedirect.com/science/article/pii/S1535610822003695).
This single-cell data is a bit particular: they have FACS-sorted their
cells into CD45+ and CD45-, notably to enrich their dataset in
hematopoetic cells.
We'll download the data from the link provided in the article:
```{r}
#| eval: false
dir.create("data")
download.file("https://prod-dcd-datasets-cache-zipfiles.s3.eu-west-1.amazonaws.com/7yb7s9769c-1.zip", 'data/Salome_2022.zip')
unzip('data/Salome_2022.zip', exdir = "data")
```
## Read 10X data
```{r}
ref_dir <- "data/NKG2A and HLA-E define an alternative immune checkpoint axis in bladder cancer/"
list.files(ref_dir)
```
We can see the data is distributed as one folder per dataset, and
includes the bulk data, which we're not going to use. For this reason, I
will filter out the "bulk" folders and to get my vector of `sc_dts`
(single-cell datasets).
```{r}
sc_dts <- ref_dir |> list.files() |> str_subset("bulk", negate = TRUE)
sc_dts
```
> **Syntax note:**
> `ref_dir |> list.files() |> str_subset("bulk", negate = TRUE)` is
> equivalent to
> `str_subset(list.files(ref_dir), "bulk", negate = TRUE)`. The `|>` was
> first introduced as `%>%` from R package magrittr and officially
> incorporated into base R\>=4.0. It's purpose is to increase code
> legibility through "chaining" commands instead of "nesting" them. If
> you would like to learn more, you can check the help file `? |>`
We can check inside each folder:
```{r}
ref_dir |> filePath(sc_dts[1]) |> list.files()
```
This is a common export of single-cell data and can be read directly
into Seurat using function `Read10X`.
This snippet will read all of our samples at once:
```{r}
read_exp_sample <- function(sc_file) {
message('Reading file: ', sc_file)
so <- ref_dir |> filePath(sc_file) |> Read10X()
so <- CreateSeuratObject(so, project = "BLCA_NKG2A")
so[["sample"]] <- str_remove(sc_file, "^.{4}")
so[["patient"]] <- str_extract(sc_file, "Patient..")
so[["CD45"]] <- str_extract(sc_file, "(?<=CD45).*")
return(so)
}
sc_data <- lapply(sc_dts, read_exp_sample)
names(sc_data) <- sc_dts
```
### Merge samples
The script we ran generated 10 Seurat objects, which we will merge into
one for further analysis. Each count matrix, at first, will be stored as
a `Layer` in our Seurat object.
```{r}
sc_data <- merge(sc_data[[1]], sc_data[2:length(sc_data)])
```
## Seurat syntax basics for object accession
Here are some "essential commands" for accession of information inside
the Seurat object:
<https://satijalab.org/seurat/articles/essential_commands.html>.
The document also outlines the "basic pipeline".
To see some general info about the object, we can just print it:
```{r}
sc_data
```
Note the layers, for the different samples. We can also see what
analyses have been run in this object (none yet).
#### Cell metadata
`[[]]` allows us to access the cell metadata in a concise manner:
```{r}
sc_data[[]] |> head()
```
#### Assay matrix (e.g. counts)
To get matrices from the Seurat objects, there are many "official" ways,
but one of these two, which have exactly the same output, should work
well:
```{r}
GetAssayData(sc_data, layer = "counts.1")[1:5,1:5]
sc_data[["RNA"]]$counts.1[1:5,1:5]
```
### UMI ids
To get the "ids" of the cells, we can simply call:
```{r}
Cells(sc_data)[1:5]
```
### Features
For a single-cell RNA-seq object, we can see the genes (features) like
so:
```{r}
Features(sc_data)[1:5]
```
## Pre-processing workflow
There are very standard steps for pre-processing, but that require some
level of decision on the part of the user. The basic tutorial of Seurat,
on PBMCs, covers this in detail:
https://satijalab.org/seurat/articles/pbmc3k_tutorial
### QC and selecting cells for further analysis
This command will figure out the percentage of reads of mitchondrial
RNA:
```{r}
sc_data[["percent.mt"]] <- PercentageFeatureSet(sc_data, pattern = "^MT-")
```
There are three major pieces of information we'd like to check to make
filtering decisions: number of expressed features, number of reads and
the percentage of mitochondrial RNA. Here, I plot per sample:
```{r fig.width=10, fig.height=4}
(VlnPlot(sc_data,
features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
group.by = "sample",
alpha = 0.3,
ncol = 3) &
theme_clean) +
plot_layout(guides = "collect")
```
We can also plot this for a particular known variable. For example, here
I show for whether the cells were FACS-sorted as CD45 positive or
negative:
```{r fig.width=10, fig.height=4}
(VlnPlot(sc_data, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
alpha = 0.3,
group.by = "CD45", ncol = 3) &
stat_compare_means() &
theme_clean) +
plot_layout(guides = "collect") +
plot_annotation(title = "CD45")
```
### Filtering and Normalization
We'll reproduce the filtering and normalization steps proposed by the
paper as closely as possible:
> For each sample, cells were first selected as expressing less than
> 16–20% mitochondrial genes and displaying a minimum of 200–300 and a
> maximum of 2500–3500 features.
As we're standardizing pre-processing for all samples, we'll take the
least stringent:
```{r}
sc_data <- subset(sc_data, subset = nFeature_RNA > 200 & nFeature_RNA < 3500 & percent.mt < 20)
```
> Data were then log-normalized using a scale factor of 10,000.
```{r}
sc_data <- NormalizeData(sc_data, normalization.method = "LogNormalize", scale.factor = 1e4)
```
### Feature selection and PCA
> The 2,000 most variable features were then identified, data were
> scaled based on all the features \[...\]
```{r}
sc_data <- FindVariableFeatures(sc_data, nfeatures = 2000)
sc_data <- ScaleData(sc_data, features = rownames(sc_data))
```
> \[...\] and principal component analysis was performed.
```{r}
sc_data <- RunPCA(sc_data)
gc()
```
We can visualize the PCA with `DimPlot`, coloring by different
categorical variables. Here, we see the influence of both the sample
(batch) and CD45 positive vs negative (which is correlated to batch but
might also contribute to variance).
```{r fig.width=8, fig.height = 4}
DimPlot(sc_data, group.by = c("sample", "CD45"), reduction = "pca") &
theme_clean
```
#### Dimensionality determination
> Dimensionality of the dataset was then assessed using the JackStraw
> and ElbowPlot functions.
From Seurat documentation, we see that JackStraw is quite
computationally intensive:
> In Macosko et al, we implemented a resampling test inspired by the
> JackStraw procedure. While still available in Seurat (see previous
> vignette), this is a slow and computationally expensive procedure, and
> we is no longer routinely used in single cell analysis.
This is not the most beautiful Elbow plot ever, but the idea is to find
the "inflection point" of the plot (I'm estimating at about 10) and go a
little bit further into the "stable" area, where adding more dimensions
doesn't add more infromation.
```{r fig.width = 5, fig.height=3}
ElbowPlot(sc_data, ndims = 30)
```
For methods that require the dimensionality, we'll set it at 15.
### Doublet detection
It's never a bad idea to add some doublet detection.
Doublets are UMIs that contained more than one cell and thefore have a
mixed profile.
This benchmark shows some of the main methods (though new ones may
emerge):
<https://www.sciencedirect.com/science/article/pii/S2405471220304592>
We'll use `cxds` from `scdc` because it's fast and reasonably accurate.
However, it uses a `SingleCellExperiment` standard. Luckily, we can
easily convert between `SeuratObject` and `SingleCellExperiment`.
```{r}
cxds_scores <- lapply(SplitObject(sc_data, split.by = "sample"), function(samp_sc) {
sce <- samp_sc |>
as.SingleCellExperiment() |>
cxds()
return(sce$cxds_score)
})
gc()
names(cxds_scores) <- NULL
sc_data$cxds_scores <- do.call(c, cxds_scores)[Cells(sc_data)]
```
## Clustering
> Clusters were calculated \[...\]
Before clustering, we need to establish a neighborhood graph and compute
shared nearest neighbors (SNN). We'll do this on 15 PCs:
```{r}
sc_data <- FindNeighbors(sc_data, dims = 1:15)
```
From this, we can use "FindClusters" for Louvain clustering, which uses
a rather temperamental "resolution" parameter, very dependent on
datasets usually.
The largest the "resolution", the more clusters are obtained.
```{r}
sc_data <- FindClusters(sc_data, resolution = 0.5, cluster.name = "unintegrated_clusters")
```
UMAP and t-SNE are dimensionality reductions for visualization. They are
very popular and we'll use them widely. They minimize local distances,
often making "neat" looking clusters.
Be wary of not using the UMAP projection itself as an input for
analysis, as they contain lots of data distortions (which is normal of
anything that is reducing a space of 3000 features to 2 dimensions)
> \[...\] and data dimensions were reduced using the t-SNE and UMAP
> methods.
```{r}
sc_data <- RunUMAP(sc_data, dims = 1:15)
```
We can check the state of our Seurat object:
```{r}
sc_data
```
And plot the Seurat clusters... only to find we're definitely clustering
per sample.
```{r fig.width = 12, fig.height=5}
DimPlot(sc_data, group.by = c("unintegrated_clusters", "sample"), reduction = "umap", alpha = 0.3) &
theme_clean
```
## Integration
Batch effect is rampant in single-cell RNA-seq and there are lots of
integration methods proposed to deal with this. They often return a
dimensionality reduction (more precisely, latent space) that minimizes
this batch effect.
There are 5 methods implemented in Seurat alone. We'll use Harmony,
because it's fairly fast and effective, but see this benchmark for the
most performant methods:
See the reference tutorial:
<https://satijalab.org/seurat/articles/seurat5_integration>
> Bulk cells were merged for the HLA analysis, while CD8 T cells only
> were merged for the CD8 T cell analysis. CD8 T cells were identified
> using the FindAllMarkers and the VlnPlot functions.
```{r}
sc_data <- IntegrateLayers(sc_data,
method = HarmonyIntegration,
orig.reduction = "pca",
new.reduction = "harmony",
verbose = FALSE
)
gc()
```
We'll need re-run clustering algorithm in the Harmony latent space
(here, using 30 dimensions)
```{r}
sc_data <- FindNeighbors(sc_data, reduction = "harmony", dims = 1:30)
sc_data <- FindClusters(sc_data, resolution = 0.5, cluster.name = "harmony_clusters")
```
We can also run a UMAP in this latent space:
```{r}
sc_data <- RunUMAP(sc_data, reduction = "harmony", dims = 1:30, reduction.name = "umap_harmony")
```
```{r fig.width = 12, fig.height=7}
DimPlot(sc_data,
group.by = c("unintegrated_clusters", "harmony_clusters", "sample"),
reduction = "umap_harmony",
alpha = 0.2,
ncol = 2) &
theme_clean
```
### Check biases
We can check back on our features that could be markers of bias in a
cluster, and check back in with our doublet scores:
```{r}
sc_data$cxds_scores[1:5]
```
```{r fig.width = 12}
RidgePlot(sc_data,
features = c("nCount_RNA", "nFeature_RNA",
"percent.mt","cxds_scores"),
ncol = 2) &
guides(fill = "none") &
theme_clean
```
We can see, for example, that cluster 17 seems to be enriched in
doublets.
To compare the co-occurence of two categorical variables, we can do a
heatmap:
```{r}
table(sc_data$harmony_clusters, sc_data$sample) |>
pheatmap(scale = "row")
```
From this, we can exclude cluster 17:
```{r}
sc_data <- subset(sc_data, subset = harmony_clusters != 17)
```
## Annotation
### Differential Expression
Differential expression analysis will by default run on the "identity"
feature, which we make sure to set as "harmony_clusters" at this point:
```{r}
Idents(sc_data) <- sc_data$harmony_clusters
Idents(sc_data)[1:5]
```
For differential analysis to use all layers at the same time, we need to
join them before finding markers (each clusters vs all others):
```{r}
sc_data <- JoinLayers(sc_data)
markers <- FindAllMarkers(sc_data)
```
For annotation, we can now use this table:
```{r}
markers |> head()
```
And visualize either top markers or any chosen features in many
different ways:
```{r}
top_markers <- markers |>
group_by(cluster) |>
dplyr::filter(avg_log2FC > 1) |>
arrange(p_val_adj) |>
slice_head(n = 5) |>
ungroup()
top5 <- unique(top_markers$gene)
```
#### Heatmap
The heatmap takes a while to run, but provides a comprehensive view of a
marker expression over the cells:
```{r fig.width = 10, fig.heigth = 10, eval = FALSE}
DoHeatmap(subset(sc_data, downsample = 1e4), features = top5, raster = TRUE) +
scale_fill_viridis_c()
```
#### Dot plot
The dotplot summarizes the information into the heatmap and gives the
same importance to all clusters, independent of size:
```{r fig.width=15}
DotPlot(sc_data, features = top5, cluster.idents = TRUE) + theme_clean
```
#### FeaturePlot
`FeaturePlot` is the same as `DimPlot`, but using one or many continuous
values instead:
```{r fig.width=8, fig.height=12}
FeaturePlot(sc_data, reduction = "umap_harmony",
features = c("CD3D", "MS4A1",
"CD14", "PECAM1",
"KRT20", "KRT5")) &
theme_void()
```
### Reference-based
There are an infinity of methods for cell type annotation. I've found
that using a good reference is more important than the method for
annotation.
We'll show reference-based annotation using SingleR, mostly because it's
easy to use but also well performing.
#### "Built-in" references
Every automatic annotation method uses a reference. `celldex` provides
an easy way to access some annotation sources, notably the Human Primary
Cell Atlas, which we will use in this example.
```{r}
library(celldex)
surveyReferences() |> as_tibble()
```
```{r}
hpca_se <- celldex::HumanPrimaryCellAtlasData()
hpca_se
```
It's given as a `SummarizedExperiment` of purified bulk tumors.=,
annotated as follows:
```{r}
unique(hpca_se$label.main)
```
There's also a finer annotation, which we won't use:
```{r}
unique(hpca_se$label.fine) |> head(n = 20)
```
SingleR is really simple to use, we just need to provide an expression
matrix and reference as `SummarizedExperiment`:
```{r}
library(SingleR)
pred_hpca <- SingleR(test = GetAssayData(sc_data), ref = hpca_se, assay.type.test=1,
labels = hpca_se$label.main)
pred_hpca
```
Once we have the predictions, I add it to the Seurat object:
```{r}
sc_data@tools[["SingleR"]] <- pred_hpca |> as.data.frame()
sc_data$singleR_pred <- pred_hpca$pruned.labels
```
##### Harmonizing automatic predictions
We can check the prediction results and "harmonize" them by cluster,
i.e. take the cluster-level mode of the annotation:
```{r}
harmony_singler <- table(sc_data$singleR_pred, factor(sc_data$harmony_clusters))
pheatmap::pheatmap(harmony_singler, scale = "column")
```
```{r}
harmony_map1 <- apply(harmony_singler, 2, function(x) names(which.max(x)))
harmony_map1 <- structure(names(harmony_map1), names = harmony_map1)
harmony_map1
```
```{r}
sc_data$singleR_clean <- fct_recode(sc_data$harmony_clusters, !!! harmony_map1)
```
If we plot, we can see the difference between the "pred" and "clean":
```{r fig.width=12, fig.height=7}
DimPlot(sc_data,
group.by = c("harmony_clusters", "singleR_pred",
"singleR_clean", "sample"),
reduction = "umap_harmony",
ncol = 2,
alpha = 0.2) &
theme_clean &
scale_color_igv()
```
#### Single-cell
Most methods will work best with a relevant single-cell reference.
Here is a not-so-recent but independent benchmark:
<https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1795-z>
It can be interesting to try to use this tutorial to annotate a query
dataset based on a reference atlas:
<https://satijalab.org/seurat/articles/integration_mapping>
Note: always take automatic annotations with a grain of salt. They are a
good departure point, but the assignment to a cell type needs to be
checked manually. A good way is to go back to the differentially
expressed markers using your group annotations.
## Extras
### Sub-cluster
In their manuscript, Salomé et al selected the T cells and subclustered
them. This is a common strategy when there is a main population of
interest of which we would like to characterize functional states or
subpopulations. It can also be interesting to provide more detailed
annotations, since the space of analysis changes.
To use subclusters, just subset the object to the annotated cells and
re-run the pipeline to find new relevant subgroups.
### Trajectory analysis
Trajectory analysis is used to reconstruct and visualize the dynamic
processes of cellular development, differentiation, and state
transitions. It infers the order of cells in a particular trajectory,
allows us to calculate continuous "*pseudotimes*".
It can be used to map out potential lineage relationships, identify
intermediate cell states, and elucidate the pathways underlying tissue
development, immune responses, or disease progression, for example.
In Figure 4, Salomé et al used this type of analysis to show a subset of
differentiated PD-1+ T~RM~ CD8+ T Cells that retain independent
function.

A not so new but very comprehensive benchmark of trajectory inference by
Saelens et al (2019) is a great jumping point:
<https://www.nature.com/articles/s41587-019-0071-9>
Accompanied by the superb `dynverse` wrappers, which provide all methods
in the benchmark, as well as guidance on what to choose:
<https://dynverse.org/>
### Cell cycle
There are many methods to infer cell cycle stage and score cell cycle in
Seurat, but it also has a built in one that works quite decently:
```{r}
sc_data <- CellCycleScoring(sc_data,
s.features = cc.genes.updated.2019$s.genes,
g2m.features = cc.genes.updated.2019$g2m.genes)
```
```{r fig.width=8, fig.height=4}
FeaturePlot(sc_data,
features = c("S.Score", "G2M.Score"),
reduction = "umap_harmony",
ncol = 2,
alpha = 0.2) &
theme_clean
```
We can see very little cycling except in one "epithelial" (tumor) area
we previously annotated.
### RNA velocity
RNA velocity is a computation method that attempts to predict the future
transcriptional state of individual cells. It distinguishes between
unspliced (pre-mRNA) and spliced (mature mRNA) transcripts within each
cell, and therefore necessitates a special input in which these reads
have been separated. This differentiation allows the estimation of the
rate at which genes are being transcribed and processed, providing
insights into the "direction" and "speed" of cellular state changes.
The primary goal of RNA velocity analysis is to elucidate the
trajectories of cellular differentiation, activation, and other dynamic
processes, thereby offering a deeper understanding of cellular
development and lineage relationships. By forecasting how cells are
likely to evolve, it can help in identifying transient cell states, and
uncovering mechanisms underlying tissue development, regeneration, and
disease progression.
It can be used together with traditional trajectory analysis.
In Salomé et al (2022), this analysis is used in Figure 5 to further
characterize the axis of differentiation/exhaustion of CD8+ T~RM~ cells,
shown below:

Most RNA velocity tools work in Python, the two most well-known being
`velocyto` and `scVelo`. Both have tutorials showing how to use
`reticulate` to run their workflows in R.
### Pathway enrichment
Pathway enrichment aims to identify biological pathways, processes, or
functional categories that are significantly overrepresented within a
set of genes of interest, such as those differentially expressed between
distinct cell populations or states.
This analysis typically involves mapping the selected genes to curated
pathway databases like Gene Ontology, KEGG, or Reactome and applying
statistical methods to assess the significance of their enrichment.
There are many methods to do this, but a easy to use is `decoupleR`, in
which `AUCell` is implemented, which is a well-performing scoring
method:
<https://www.bioconductor.org/packages/release/bioc/vignettes/decoupleR/inst/doc/decoupleR.html>
### Gene regulatory network
Gene regulatory network (GRN) analysis seeks to identify key
transcription factors, regulatory motifs, and signaling pathways that
drive cell states, differentiation, and responses to environmental
stimuli. The ultimate goal of GRN analysis is to map out the regulatory
architecture that underpins cellular heterogeneity, enabling us to
pinpoint master regulators and understand the molecular mechanisms
underlying various biological processes and disease states.
From an algorithmic point of view, this involves inferring relationships
such as gene co-expression, regulatory influences, and causal
interactions.
`SCENIC` is the most popular method to do this, with the Python
implementation being much more performant
(<https://github.com/aertslab/pySCENIC>). The protocol is probably the
best way to run this type of analysis currently:
<https://github.com/aertslab/SCENICprotocol>
### CNV analysis
Copy Number Variation (CNV) analysis is a computational approach used to
infer genomic copy number alterations from transcriptomic data. It
leverages the variability in gene expression levels across the genome to
detect gains or losses of genomic regions.
This method is particularly valuable in studies of cancer and other
genetic disorders, where CNVs play a crucial role in disease progression
and heterogeneity.
The most well-known tool used for this analysis is `inferCNV`, which can
be extremely slow and doesn't perform very well with cells with low-read
count. This is why we implemented `fastCNV`, available to install for R
on Github: <https://github.com/must-bioinfo/fastCNVdata>
### Ligand-receptor analysis
Ligand-receptor analysis is a computational strategy to infer cell-cell
communication by identifying interactions between signaling molecules
(ligands) and their corresponding receptors expressed on different cell
types or states.
This analysis involves systematically pairing ligands expressed by one
or more cell populations with receptors expressed by neighboring or
interacting cells, thereby predicting potential signaling pathways that
facilitate intercellular communication.
Well known tools are `CellPhoneDB` and `CellChat`, but `LIANA+` provides
wrappers and an all-in-one framework for this kind of analysis:
<https://liana-py.readthedocs.io/en/latest/>
Notably, this analysis can be much more powerful in spatially-resolved
data.
### Object conversion
Format conversion is often necessary when working with single-cell
RNA-seq to use tools available for different objects.
This can be easy between SingleCellExperiment and SeuratObject, as we
used it in this tutorial with `as.SingleCellExperiment` and `as.Seurat`.
However, these functions sometimes don't work and the Seurat solution
for h5 objects, `SeuratDisk`, even fails its own tutorial.
`sceasy` (<https://github.com/cellgeni/sceasy>) was made to resolve
these conversion problems. Note that it requires `reticulate` (to make
Python work in R) to be installed.