R and Seurat to analyse single cell RNA-seq

Case study - NKG2A and HLA-E define an alternative immune checkpoint axis in bladder cancer

r
tutorial
single-cell
Author

Clarice Groeneveld

Published

February 14, 2025

Download this notebook: https://raw.githubusercontent.com/csgroen/teaching/refs/heads/main/20250214_Tutorial_DU_Seurat.qmd

Pre-requisites

For the best understanding of this tutorial, please make sure you have at least an intermediate understanding of R. Here are some recommended resources to learn the basics and necessary intermediate concepts:

  1. swirl: https://swirlstats.com/students.html

Specifically, the courses R Programming, Getting and Cleaning Data, and Exploratory Data Analysis may provide a good jump start.

  2. Hands-On Programming with R: https://rstudio-education.github.io/hopr/

The very basics of R programming are also explained in this thorough book from the Posit team (the company that builds RStudio).

  3. R for Data Science: https://r4ds.hadley.nz/

This goes further than we need, but it covers most of the programming questions a data analyst working in R will have when starting out.

Resources

Seurat has many very useful tutorials for all of its features. Elements of this case study were taken from the following:

  1. Seurat - Guided Clustering Tutorial https://satijalab.org/seurat/archive/v3.0/pbmc3k_tutorial.html
  2. Introduction to scRNA-seq integration: https://satijalab.org/seurat/articles/integration_introduction
  3. Mapping and annotating query datasets: https://satijalab.org/seurat/articles/integration_mapping

Install packages

To make sure all required libraries are installed, we'll use the R package "pacman", which installs any missing packages and loads them. This is not a replacement for more comprehensive version control (e.g. using renv), but it's simple to install and very lightweight.

Code
if(! "pacman" %in% installed.packages()) install.packages("pacman")
pacman::p_load("Seurat", "tidyverse", "R.utils", "ggpubr", "patchwork", 
               "celldex", "SingleR", "scRNAseq", "pheatmap",
               "ggsci", "scds")

I’m personally not a big fan of the standard look of Seurat plots, so I’ll apply this ggplot2 theme to (almost) all Seurat plots we generate:

Code
theme_clean <-
  theme_linedraw() +
  theme(panel.grid = element_blank(),
        panel.border = element_blank(),
        axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))

Note: Seurat plots are ggplot-based and therefore can be “added to” in a grammar-of-graphics manner. Plots for multiple features are patchworks, which use a slightly different syntax to add to (this is why sometimes you’ll see & instead of + being used).
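
As a small illustration (these calls use the sc_data object we only build later in the tutorial, so treat them as a sketch rather than something to run right now):

Code
# A single-feature VlnPlot returns a plain ggplot, so "+" adds the theme
VlnPlot(sc_data, features = "nCount_RNA") + theme_clean
# A multi-feature VlnPlot returns a patchwork, so "&" applies the theme to every panel
VlnPlot(sc_data, features = c("nCount_RNA", "nFeature_RNA")) & theme_clean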

Download data

Our dataset comes from Salomé et al (2022), "NKG2A and HLA-E define an alternative immune checkpoint axis in bladder cancer". This single-cell data is a bit particular: they have FACS-sorted their cells into CD45+ and CD45-, notably to enrich their dataset in hematopoietic cells.

We’ll download the data from the link provided in the article:

Code
dir.create("data")
download.file("https://prod-dcd-datasets-cache-zipfiles.s3.eu-west-1.amazonaws.com/7yb7s9769c-1.zip", 'data/Salome_2022.zip')
unzip('data/Salome_2022.zip', exdir = "data")

Read 10X data

Code
ref_dir <- "data/NKG2A and HLA-E define an alternative immune checkpoint axis in bladder cancer/"
list.files(ref_dir)
 [1] "385_Patient49_CD45pos" "394_Patient50_CD45pos" "397_Patient51_CD45pos"
 [4] "421_Patient52_CD45pos" "434_Patient53_CD45pos" "435_Patient54_CD45pos"
 [7] "453_Patient01_CD45pos" "454_Patient01_CD45neg" "456_Patient02_CD45pos"
[10] "457_Patient02_CD45neg" "584_Patient03_bulk"    "595_Patient04_bulk"   
[13] "613_Patient55_bulk"    "651_Patient05_bulk"    "656_Patient06_bulk"   
[16] "663_Patient07_bulk"    "695_Patient08_bulk"   

We can see the data is distributed as one folder per dataset, and includes the bulk data, which we're not going to use. For this reason, I will filter out the "bulk" folders to get my vector of sc_dts (single-cell datasets).

Code
sc_dts <- ref_dir |> list.files() |> str_subset("bulk", negate = TRUE)
sc_dts
 [1] "385_Patient49_CD45pos" "394_Patient50_CD45pos" "397_Patient51_CD45pos"
 [4] "421_Patient52_CD45pos" "434_Patient53_CD45pos" "435_Patient54_CD45pos"
 [7] "453_Patient01_CD45pos" "454_Patient01_CD45neg" "456_Patient02_CD45pos"
[10] "457_Patient02_CD45neg"

Syntax note: ref_dir |> list.files() |> str_subset("bulk", negate = TRUE) is equivalent to str_subset(list.files(ref_dir), "bulk", negate = TRUE). The pipe was first popularized as %>% in the R package magrittr, and a native version, |>, was officially incorporated into base R from version 4.1 onwards. Its purpose is to increase code legibility by "chaining" commands instead of "nesting" them. If you would like to learn more, you can check the help file with ?"|>".

We can check inside each folder:

Code
ref_dir |> filePath(sc_dts[1]) |> list.files()
[1] "barcodes.tsv.gz" "features.tsv.gz" "matrix.mtx.gz"  

This is a common export format for single-cell data and can be read directly into Seurat using the function Read10X().

This snippet will read all of our samples at once:

Code
read_exp_sample <- function(sc_file) {
  message('Reading file: ', sc_file)
  so <- ref_dir |> filePath(sc_file) |> Read10X()
  so <- CreateSeuratObject(so, project = "BLCA_NKG2A")
  so[["sample"]] <- str_remove(sc_file, "^.{4}")
  so[["patient"]] <- str_extract(sc_file, "Patient..")
  so[["CD45"]] <- str_extract(sc_file, "(?<=CD45).*")
  return(so)
}

sc_data <- lapply(sc_dts, read_exp_sample)
Reading file: 385_Patient49_CD45pos
Reading file: 394_Patient50_CD45pos
Reading file: 397_Patient51_CD45pos
Reading file: 421_Patient52_CD45pos
Reading file: 434_Patient53_CD45pos
Reading file: 435_Patient54_CD45pos
Reading file: 453_Patient01_CD45pos
Reading file: 454_Patient01_CD45neg
Reading file: 456_Patient02_CD45pos
Reading file: 457_Patient02_CD45neg
Code
names(sc_data) <- sc_dts

Merge samples

The script we ran generated 10 Seurat objects, which we will merge into one for further analysis. Each count matrix, at first, will be stored as a Layer in our Seurat object.

Code
sc_data <- merge(sc_data[[1]], sc_data[2:length(sc_data)])
Warning: Some cell names are duplicated across objects provided. Renaming to
enforce unique cell names.

Seurat syntax basics for object accession

Here are some “essential commands” for accession of information inside the Seurat object: https://satijalab.org/seurat/articles/essential_commands.html.

The document also outlines the “basic pipeline”.

To see some general info about the object, we can just print it:

Code
sc_data
An object of class Seurat 
33538 features across 50940 samples within 1 assay 
Active assay: RNA (33538 features, 0 variable features)
 10 layers present: counts.1, counts.2, counts.3, counts.4, counts.5, counts.6, counts.7, counts.8, counts.9, counts.10

Note the layers, one for each sample. We can also see which analyses have been run on this object (none yet).

Cell metadata

[[]] allows us to access the cell metadata in a concise manner:

Code
sc_data[[]] |> head()
                     orig.ident nCount_RNA nFeature_RNA            sample
AAACCTGAGGCCCTCA-1_1 BLCA_NKG2A       4917         1671 Patient49_CD45pos
AAACCTGAGTTCGCGC-1_1 BLCA_NKG2A        891          369 Patient49_CD45pos
AAACCTGCACTGTTAG-1_1 BLCA_NKG2A       1121          606 Patient49_CD45pos
AAACCTGCAGACGCTC-1_1 BLCA_NKG2A       3877         1484 Patient49_CD45pos
AAACCTGGTACAGCAG-1_1 BLCA_NKG2A       1393          552 Patient49_CD45pos
AAACCTGGTCACACGC-1_1 BLCA_NKG2A       1491          709 Patient49_CD45pos
                       patient CD45
AAACCTGAGGCCCTCA-1_1 Patient49  pos
AAACCTGAGTTCGCGC-1_1 Patient49  pos
AAACCTGCACTGTTAG-1_1 Patient49  pos
AAACCTGCAGACGCTC-1_1 Patient49  pos
AAACCTGGTACAGCAG-1_1 Patient49  pos
AAACCTGGTCACACGC-1_1 Patient49  pos

Assay matrix (e.g. counts)

To get matrices out of a Seurat object, there are several "official" ways; either of these two, which have exactly the same output, works well:

Code
GetAssayData(sc_data, layer = "counts.1")[1:5,1:5]
5 x 5 sparse Matrix of class "dgCMatrix"
            AAACCTGAGGCCCTCA-1_1 AAACCTGAGTTCGCGC-1_1 AAACCTGCACTGTTAG-1_1
MIR1302-2HG                    .                    .                    .
FAM138A                        .                    .                    .
OR4F5                          .                    .                    .
AL627309.1                     .                    .                    .
AL627309.3                     .                    .                    .
            AAACCTGCAGACGCTC-1_1 AAACCTGGTACAGCAG-1_1
MIR1302-2HG                    .                    .
FAM138A                        .                    .
OR4F5                          .                    .
AL627309.1                     .                    .
AL627309.3                     .                    .
Code
sc_data[["RNA"]]$counts.1[1:5,1:5]
5 x 5 sparse Matrix of class "dgCMatrix"
            AAACCTGAGGCCCTCA-1_1 AAACCTGAGTTCGCGC-1_1 AAACCTGCACTGTTAG-1_1
MIR1302-2HG                    .                    .                    .
FAM138A                        .                    .                    .
OR4F5                          .                    .                    .
AL627309.1                     .                    .                    .
AL627309.3                     .                    .                    .
            AAACCTGCAGACGCTC-1_1 AAACCTGGTACAGCAG-1_1
MIR1302-2HG                    .                    .
FAM138A                        .                    .
OR4F5                          .                    .
AL627309.1                     .                    .
AL627309.3                     .                    .

Cell barcodes

To get the barcodes ("ids") of the cells, we can simply call:

Code
Cells(sc_data)[1:5]
[1] "AAACCTGAGGCCCTCA-1_1" "AAACCTGAGTTCGCGC-1_1" "AAACCTGCACTGTTAG-1_1"
[4] "AAACCTGCAGACGCTC-1_1" "AAACCTGGTACAGCAG-1_1"

Features

For a single-cell RNA-seq object, we can see the genes (features) like so:

Code
Features(sc_data)[1:5]
[1] "MIR1302-2HG" "FAM138A"     "OR4F5"       "AL627309.1"  "AL627309.3" 

Pre-processing workflow

The pre-processing steps are fairly standard, but they require some decisions on the part of the user. The basic Seurat tutorial, on PBMCs, covers this in detail: https://satijalab.org/seurat/articles/pbmc3k_tutorial

QC and selecting cells for further analysis

This command computes the percentage of counts coming from mitochondrial RNA:

Code
sc_data[["percent.mt"]] <- PercentageFeatureSet(sc_data, pattern = "^MT-")

There are three major pieces of information we'd like to check to make filtering decisions: the number of expressed features, the number of counts, and the percentage of mitochondrial RNA. Here, I plot them per sample:

Code
(VlnPlot(sc_data, 
        features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), 
        group.by = "sample",
        alpha = 0.3, 
        ncol = 3) &
  theme_clean) +
  plot_layout(guides = "collect")
Warning: Default search for "data" layer in "RNA" assay yielded no results;
utilizing "counts" layer instead.

We can also plot this by a particular known variable. For example, here I split by whether the cells were FACS-sorted as CD45 positive or negative:

Code
(VlnPlot(sc_data, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), 
         alpha = 0.3,
         group.by = "CD45", ncol = 3) &
  stat_compare_means() &
  theme_clean) +
  plot_layout(guides = "collect") +
  plot_annotation(title = "CD45")
Warning: Default search for "data" layer in "RNA" assay yielded no results;
utilizing "counts" layer instead.

Filtering and Normalization

We’ll reproduce the filtering and normalization steps proposed by the paper as closely as possible:

For each sample, cells were first selected as expressing less than 16–20% mitochondrial genes and displaying a minimum of 200–300 and a maximum of 2500–3500 features.

As we're standardizing pre-processing across all samples, we'll take the least stringent thresholds:

Code
sc_data <- subset(sc_data, subset = nFeature_RNA > 200 & nFeature_RNA < 3500 & percent.mt < 20)

Data were then log-normalized using a scale factor of 10,000.

Code
sc_data <- NormalizeData(sc_data, normalization.method = "LogNormalize", scale.factor = 1e4)
Normalizing layer: counts.1
Normalizing layer: counts.2
Normalizing layer: counts.3
Normalizing layer: counts.4
Normalizing layer: counts.5
Normalizing layer: counts.6
Normalizing layer: counts.7
Normalizing layer: counts.8
Normalizing layer: counts.9
Normalizing layer: counts.10

Feature selection and PCA

The 2,000 most variable features were then identified, data were scaled based on all the features […]

Code
sc_data <- FindVariableFeatures(sc_data, nfeatures = 2000)
Finding variable features for layer counts.1
Finding variable features for layer counts.2
Finding variable features for layer counts.3
Finding variable features for layer counts.4
Finding variable features for layer counts.5
Finding variable features for layer counts.6
Finding variable features for layer counts.7
Finding variable features for layer counts.8
Finding variable features for layer counts.9
Finding variable features for layer counts.10
Code
sc_data <- ScaleData(sc_data, features = rownames(sc_data))
Centering and scaling data matrix

[…] and principal component analysis was performed.

Code
sc_data <- RunPCA(sc_data)
PC_ 1 
Positive:  S100P, SPINK1, KRT18, ADIRF, PSCA, KRT8, KRT19, MGST1, GDF15, KLF5 
       CD24, HPGD, DHRS2, HES1, VSIG2, SDC4, SMIM22, C19orf33, SLPI, S100A6 
       AGR2, SPINT2, TNFAIP2, GPRC5A, TSPAN6, S100A13, CLDN7, CD9, EFNA1, WFDC2 
Negative:  CD52, IGKC, LAPTM5, RGS1, S100A4, IGLC2, IL32, CD3D, CD74, IGHA1 
       IGHG1, CD69, LSP1, HSPA1A, VIM, LTB, CCL5, TRBC2, HLA-DPB1, COTL1 
       IGLC3, ALOX5AP, TRBC1, IGHG2, GPR183, RGS2, GZMA, TYROBP, NKG7, CD7 
PC_ 2 
Positive:  SPARC, IGFBP7, CALD1, COL1A2, COL3A1, COL6A2, COL4A1, IFITM3, MYL9, COL1A1 
       FN1, TPM2, CAVIN3, THY1, BGN, FSTL1, COX7A1, TIMP1, COL6A1, DCN 
       C1S, TAGLN, C1R, RGS5, COL5A2, LGALS1, SPARCL1, MFGE8, CTGF, COL6A3 
Negative:  HPGD, GATA3, CD24, S100P, RAB11FIP1, CD3D, SPINK1, PSCA, DHRS2, KRT8 
       KRT19, KLF5, CD52, UPK2, TRBC2, TMEM97, GDPD3, GDF15, TMPRSS2, VSIG2 
       UPK1A, SDC4, KRT20, SMIM22, SPTSSB, MGST1, CNGA1, WFDC2, CXADR, UPK3A 
PC_ 3 
Positive:  CALD1, SPARC, COL3A1, COL6A2, COL1A2, IGFBP7, IL32, COL4A1, COL1A1, RGS5 
       THY1, MYL9, TPM2, SELENOM, CAVIN3, COL6A1, FSTL1, SOD3, MFGE8, BGN 
       COL5A2, COX7A1, MGP, COL6A3, NDUFA4L2, SPARCL1, PCOLCE, CTGF, C1R, LHFPL6 
Negative:  TYROBP, FCER1G, LYZ, MS4A6A, CD14, HLA-DRA, FTL, CTSS, CD68, C1QC 
       C1QA, C1QB, MS4A7, FCGR2A, SPI1, OLR1, LST1, MS4A4A, C5AR1, PSAP 
       HLA-DRB5, SERPINA1, APOC1, CTSB, CST3, SAT1, CD163, HLA-DPA1, BCL2A1, HLA-DMA 
PC_ 4 
Positive:  RAMP2, PCAT19, VWF, CLEC14A, HSPG2, EGFL7, RAMP3, CALCRL, AQP1, ECSCR 
       PCDH17, PODXL, FLT1, CD34, CDH5, VWA1, FAM167B, CLDN5, SLCO2A1, CXorf36 
       ENG, PECAM1, CYYR1, MMRN2, NOTCH4, SLC9A3R2, INSR, HYAL2, TM4SF1, ESM1 
Negative:  COL1A2, COL3A1, COL1A1, DCN, COL6A3, TPM2, C1S, PCOLCE, C1R, COL5A2 
       TAGLN, LUM, SOD3, COL6A1, RARRES2, BGN, ACTA2, MYL9, AEBP1, SERPINF1 
       COL6A2, CYGB, MXRA8, RGS5, MFGE8, NDUFA4L2, MEG3, PPP1R14A, SDC2, SERPING1 
PC_ 5 
Positive:  RNASE1, UPK2, GDPD3, UBE2C, TMEM97, CNGA1, UPK1A, RAB11FIP1, S100A13, UPK3A 
       BHMT, TNFAIP2, STMN1, CD24, TMPRSS2, PLPP5, SDC4, BIRC5, NQO1, SPRR3 
       EMX2, KRT20, MKI67, ERBB2, CCNB2, CDK1, HMGB2, PCLAF, MYCL, MESP1 
Negative:  KRT17, FABP4, SERPINB5, DKK1, PLA2G2A, CLU, CLCA4, CRTAC1, LAMB3, S100A2 
       AQP3, TSPAN1, IGFBP2, IGFBP5, ERRFI1, IRF6, ANXA1, BTBD16, ANXA10, FHL2 
       SHH, PERP, FABP5, PRAC1, CYP1B1, FOSL1, HSD17B2, SOX15, SFN, GPR87 
Code
gc()
             used    (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells   13337773   712.4   19671861  1050.6   19671861  1050.6
Vcells 1682607264 12837.3 5535284006 42230.9 5765848852 43990.0

We can visualize the PCA with DimPlot, coloring by different categorical variables. Here, we see the influence of both the sample (batch) and CD45 positive vs negative (which is correlated with batch but might also contribute to variance).

Code
DimPlot(sc_data, group.by = c("sample", "CD45"), reduction = "pca") &
  theme_clean

Dimensionality determination

Dimensionality of the dataset was then assessed using the JackStraw and ElbowPlot functions.

From the Seurat documentation, we see that JackStraw is quite computationally intensive:

In Macosko et al, we implemented a resampling test inspired by the JackStraw procedure. While still available in Seurat (see previous vignette), this is a slow and computationally expensive procedure, and it is no longer routinely used in single cell analysis.

This is not the most beautiful Elbow plot ever, but the idea is to find the "inflection point" of the plot (I'm estimating it at about 10) and go a little bit further into the "stable" area, where adding more dimensions doesn't add more information.

Code
ElbowPlot(sc_data, ndims = 30)

For methods that require the dimensionality, we’ll set it at 15.

Doublet detection

It’s never a bad idea to add some doublet detection.

Doublets are droplets (barcodes) that contained more than one cell and therefore have a mixed expression profile.

This benchmark shows some of the main methods (though new ones may emerge): https://www.sciencedirect.com/science/article/pii/S2405471220304592

We'll use cxds from scds because it's fast and reasonably accurate.

However, it expects a SingleCellExperiment object. Luckily, we can easily convert between SeuratObject and SingleCellExperiment.

Code
cxds_scores <- lapply(SplitObject(sc_data, split.by = "sample"), function(samp_sc) {
  sce <- samp_sc |>
    as.SingleCellExperiment() |>
    cxds()
  return(sce$cxds_score)
})
gc()
             used    (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells   13355280   713.3   19671861  1050.6   19671861  1050.6
Vcells 1682219141 12834.4 5535284006 42230.9 5765848852 43990.0
Code
names(cxds_scores) <- NULL
sc_data$cxds_scores <- do.call(c, cxds_scores)[Cells(sc_data)]

Clustering

Clusters were calculated […]

Before clustering, we need to establish a neighborhood graph and compute shared nearest neighbors (SNN). We’ll do this on 15 PCs:

Code
sc_data <- FindNeighbors(sc_data, dims = 1:15)
Computing nearest neighbor graph
Computing SNN

From this, we can use FindClusters for Louvain clustering, which depends on a rather temperamental "resolution" parameter whose best value usually varies between datasets.

The larger the "resolution", the more clusters are obtained.

Code
sc_data <- FindClusters(sc_data, resolution = 0.5, cluster.name = "unintegrated_clusters")
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 44699
Number of edges: 1534079

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9465
Number of communities: 26
Elapsed time: 7 seconds
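
To see the effect of the resolution parameter mentioned above, one quick sketch (not run in this tutorial; the values are only illustrative) is to loop over a few resolutions and count the resulting clusters:

Code
# Sketch: count how many clusters we get at a few resolutions
for (res in c(0.2, 0.5, 1)) {
  tmp <- FindClusters(sc_data, resolution = res)
  message("resolution ", res, ": ", length(levels(Idents(tmp))), " clusters")
}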

UMAP and t-SNE are dimensionality reductions for visualization. They are very popular and we'll use them widely. They prioritize preserving local structure, often producing "neat"-looking clusters.

Be wary of using the UMAP projection itself as an input for downstream analysis, as it contains many distortions (which is expected of anything that reduces a space of thousands of features to 2 dimensions).

[…] and data dimensions were reduced using the t-SNE and UMAP methods.

Code
sc_data <- RunUMAP(sc_data, dims = 1:15)
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
16:34:12 UMAP embedding parameters a = 0.9922 b = 1.112
16:34:12 Read 44699 rows and found 15 numeric columns
16:34:12 Using Annoy for neighbor search, n_neighbors = 30
16:34:12 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
16:34:15 Writing NN index file to temp file /tmp/RtmpPri6SU/file12975263d7093
16:34:15 Searching Annoy index using 1 thread, search_k = 3000
16:34:27 Annoy recall = 100%
16:34:28 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
16:34:29 Initializing from normalized Laplacian + noise (using RSpectra)
16:34:31 Commencing optimization for 200 epochs, with 1900728 positive edges
16:34:45 Optimization finished

We can check the state of our Seurat object:

Code
sc_data
An object of class Seurat 
33538 features across 44699 samples within 1 assay 
Active assay: RNA (33538 features, 2000 variable features)
 21 layers present: counts.1, counts.2, counts.3, counts.4, counts.5, counts.6, counts.7, counts.8, counts.9, counts.10, data.1, data.2, data.3, data.4, data.5, data.6, data.7, data.8, data.9, data.10, scale.data
 2 dimensional reductions calculated: pca, umap

And plot the Seurat clusters… only to find we’re definitely clustering per sample.

Code
DimPlot(sc_data, group.by = c("unintegrated_clusters", "sample"), reduction = "umap", alpha = 0.3) & 
  theme_clean

Integration

Batch effects are rampant in single-cell RNA-seq, and many integration methods have been proposed to deal with them. These often return a dimensionality reduction (more precisely, a latent space) that minimizes the batch effect.

There are 5 methods implemented in Seurat alone. We'll use Harmony because it's fairly fast and effective, though published benchmarks are worth consulting to pick the most performant method for your data.

See the reference tutorial: https://satijalab.org/seurat/articles/seurat5_integration

Bulk cells were merged for the HLA analysis, while CD8 T cells only were merged for the CD8 T cell analysis. CD8 T cells were identified using the FindAllMarkers and the VlnPlot functions.

Code
sc_data <- IntegrateLayers(sc_data, 
                           method = HarmonyIntegration,
                           orig.reduction = "pca", 
                           new.reduction = "harmony",
  verbose = FALSE
)
Warning in harmony::HarmonyMatrix(data_mat = Embeddings(object = orig), :
HarmonyMatrix is deprecated and will be removed in the future from the API in
the future
Warning: Warning: The parameters do_pca and npcs are deprecated. They will be ignored for this function call and please remove parameters do_pca and npcs and pass to harmony cell_embeddings directly.
This warning is displayed once per session.
Warning: Warning: The parameter tau is deprecated. It will be ignored for this function call and please remove parameter tau in future function calls. Advanced users can set value of parameter tau by using parameter .options and function harmony_options().
This warning is displayed once per session.
Warning: Warning: The parameter block.size is deprecated. It will be ignored for this function call and please remove parameter block.size in future function calls. Advanced users can set value of parameter block.size by using parameter .options and function harmony_options().
This warning is displayed once per session.
Warning: Warning: The parameter max.iter.harmony is replaced with parameter max_iter. It will be ignored for this function call and please use parameter max_iter in future function calls.
This warning is displayed once per session.
Warning: Warning: The parameter max.iter.cluster is deprecated. It will be ignored for this function call and please remove parameter max.iter.cluster in future function calls. Advanced users can set value of parameter max.iter.cluster by using parameter .options and function harmony_options().
This warning is displayed once per session.
Warning: Warning: The parameter epsilon.cluster is deprecated. It will be ignored for this function call and please remove parameter epsilon.cluster in future function calls. Advanced users can set value of parameter epsilon.cluster by using parameter .options and function harmony_options().
This warning is displayed once per session.
Warning: Warning: The parameter epsilon.harmony is deprecated. It will be ignored for this function call and please remove parameter epsilon.harmony in future function calls. If users want to control if harmony would stop early or not, use parameter early_stop. Advanced users can set value of parameter epsilon.harmony by using parameter .options and function harmony_options().
This warning is displayed once per session.
Code
gc()
             used    (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells   13450658   718.4   19671861  1050.6   19671861  1050.6
Vcells 1691092990 12902.1 5535284006 42230.9 5765848852 43990.0

We'll need to re-run the clustering algorithm in the Harmony latent space (here, using 30 dimensions):

Code
sc_data <- FindNeighbors(sc_data, reduction = "harmony", dims = 1:30)
Computing nearest neighbor graph
Computing SNN
Code
sc_data <- FindClusters(sc_data, resolution = 0.5, cluster.name = "harmony_clusters")
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 44699
Number of edges: 1710955

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9238
Number of communities: 18
Elapsed time: 10 seconds

We can also run a UMAP in this latent space:

Code
sc_data <- RunUMAP(sc_data, reduction = "harmony", dims = 1:30, reduction.name = "umap_harmony")
16:36:20 UMAP embedding parameters a = 0.9922 b = 1.112
16:36:20 Read 44699 rows and found 30 numeric columns
16:36:20 Using Annoy for neighbor search, n_neighbors = 30
16:36:20 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
16:36:23 Writing NN index file to temp file /tmp/RtmpPri6SU/file12975e7a5eda
16:36:23 Searching Annoy index using 1 thread, search_k = 3000
16:36:35 Annoy recall = 100%
16:36:36 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
16:36:37 Initializing from normalized Laplacian + noise (using RSpectra)
16:36:39 Commencing optimization for 200 epochs, with 2061946 positive edges
16:36:54 Optimization finished
Code
DimPlot(sc_data, 
        group.by = c("unintegrated_clusters", "harmony_clusters", "sample"), 
        reduction = "umap_harmony",
        alpha = 0.2,
        ncol = 2) & 
  theme_clean

Check biases

We can revisit the features that could indicate bias in a cluster, and check back in with our doublet scores:

Code
sc_data$cxds_scores[1:5]
AAACCTGAGGCCCTCA-1_1 AAACCTGAGTTCGCGC-1_1 AAACCTGCACTGTTAG-1_1 
           266320.09             29805.91            219085.93 
AAACCTGCAGACGCTC-1_1 AAACCTGGTACAGCAG-1_1 
           100458.58            193619.85 
Code
RidgePlot(sc_data, 
        features = c("nCount_RNA", "nFeature_RNA", 
                     "percent.mt","cxds_scores"), 
        ncol = 2) &
  guides(fill = "none") &
  theme_clean 
Picking joint bandwidth of 569
Picking joint bandwidth of 123
Picking joint bandwidth of 0.469
Picking joint bandwidth of 6840

We can see, for example, that cluster 17 seems to be enriched in doublets.

To compare the co-occurrence of two categorical variables, we can do a heatmap:

Code
table(sc_data$harmony_clusters, sc_data$sample) |>
  pheatmap(scale = "row")

From this, we can exclude cluster 17:

Code
sc_data <- subset(sc_data, subset = harmony_clusters != 17)

Annotation

Differential Expression

Differential expression analysis runs by default on the active "identity", which we make sure to set to "harmony_clusters" at this point:

Code
Idents(sc_data) <- sc_data$harmony_clusters
Idents(sc_data)[1:5]
AAACCTGAGGCCCTCA-1_1 AAACCTGAGTTCGCGC-1_1 AAACCTGCACTGTTAG-1_1 
                   2                    6                    1 
AAACCTGCAGACGCTC-1_1 AAACCTGGTACAGCAG-1_1 
                   3                    1 
Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

For differential analysis to use all layers at the same time, we need to join them before finding markers (each cluster vs all others):

Code
sc_data <- JoinLayers(sc_data)
markers <- FindAllMarkers(sc_data)
Calculating cluster 0
Calculating cluster 1
Calculating cluster 2
Calculating cluster 3
Calculating cluster 4
Calculating cluster 5
Calculating cluster 6
Calculating cluster 7
Calculating cluster 8
Calculating cluster 9
Calculating cluster 10
Calculating cluster 11
Calculating cluster 12
Calculating cluster 13
Calculating cluster 14
Calculating cluster 15
Calculating cluster 16

For annotation, we can now use this table:

Code
markers |> head()
     p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
IL7R     0   2.929902 0.634 0.158         0       0 IL7R
CD3D     0   1.512433 0.750 0.281         0       0 CD3D
IL32     0   1.225477 0.865 0.404         0       0 IL32
TRAC     0   1.525893 0.722 0.273         0       0 TRAC
LTB      0   2.055308 0.713 0.273         0       0  LTB
CD2      0   1.805078 0.660 0.230         0       0  CD2

And visualize either top markers or any chosen features in many different ways:

Code
top_markers <- markers |>
    group_by(cluster) |>
    dplyr::filter(avg_log2FC > 1) |>
    arrange(p_val_adj) |>
    slice_head(n = 5) |>
    ungroup()
top5 <- unique(top_markers$gene)

Heatmap

The heatmap takes a while to run, but provides a comprehensive view of marker expression across the cells:

Code
DoHeatmap(subset(sc_data, downsample = 1e4), features = top5, raster = TRUE) +
  scale_fill_viridis_c()

Dot plot

The dot plot summarizes the information in the heatmap and gives the same weight to all clusters, independent of their size:

Code
DotPlot(sc_data, features = top5, cluster.idents = TRUE) + theme_clean

FeaturePlot

FeaturePlot is like DimPlot, but colors cells by one or more continuous features instead:

Code
FeaturePlot(sc_data, reduction = "umap_harmony",
            features = c("CD3D", "MS4A1",
                         "CD14", "PECAM1",
                         "KRT20", "KRT5")) &
  theme_void()

Reference-based

There is an infinity of methods for cell type annotation. I've found that using a good reference matters more than the choice of annotation method.

We'll show reference-based annotation using SingleR, mostly because it's easy to use, but also because it performs well.

“Built-in” references

Every automatic annotation method uses a reference. celldex provides an easy way to access some annotation sources, notably the Human Primary Cell Atlas, which we will use in this example.

Code
library(celldex)
surveyReferences() |>  as_tibble()
# A tibble: 7 × 10
  name         version path  title description taxonomy_id genome samples labels
  <chr>        <chr>   <chr> <chr> <chr>       <list>      <list>   <int> <list>
1 dice         2024-0… <NA>  Huma… Human bulk… <list [1]>  <list>    1561 <list>
2 blueprint_e… 2024-0… <NA>  Huma… Human bulk… <list [1]>  <list>     259 <list>
3 immgen       2024-0… <NA>  Mous… Mouse micr… <list [1]>  <list>     830 <list>
4 mouse_rnaseq 2024-0… <NA>  Bulk… Bulk RNA-s… <list [1]>  <list>     358 <list>
5 hpca         2024-0… <NA>  Micr… Microarray… <list [1]>  <list>     713 <list>
6 novershtern… 2024-0… <NA>  Bulk… Bulk micro… <list [1]>  <list>     211 <list>
7 monaco_immu… 2024-0… <NA>  Huma… Human bulk… <list [1]>  <list>     114 <list>
# ℹ 1 more variable: sources <list>
Code
hpca_se <- celldex::HumanPrimaryCellAtlasData()
hpca_se
class: SummarizedExperiment 
dim: 19363 713 
metadata(0):
assays(1): logcounts
rownames(19363): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
rowData names(0):
colnames(713): GSM112490 GSM112491 ... GSM92233 GSM92234
colData names(3): label.main label.fine label.ont

It's given as a SummarizedExperiment of purified bulk samples, annotated as follows:

Code
unique(hpca_se$label.main)
 [1] "DC"                   "Smooth_muscle_cells"  "Epithelial_cells"    
 [4] "B_cell"               "Neutrophils"          "T_cells"             
 [7] "Monocyte"             "Erythroblast"         "BM & Prog."          
[10] "Endothelial_cells"    "Gametocytes"          "Neurons"             
[13] "Keratinocytes"        "HSC_-G-CSF"           "Macrophage"          
[16] "NK_cell"              "Embryonic_stem_cells" "Tissue_stem_cells"   
[19] "Chondrocytes"         "Osteoblasts"          "BM"                  
[22] "Platelets"            "Fibroblasts"          "iPS_cells"           
[25] "Hepatocytes"          "MSC"                  "Neuroepithelial_cell"
[28] "Astrocyte"            "HSC_CD34+"            "CMP"                 
[31] "GMP"                  "MEP"                  "Myelocyte"           
[34] "Pre-B_cell_CD34-"     "Pro-B_cell_CD34+"     "Pro-Myelocyte"       

There’s also a finer annotation, which we won’t use:

Code
unique(hpca_se$label.fine) |> head(n = 20)
 [1] "DC:monocyte-derived:immature"        "DC:monocyte-derived:Galectin-1"     
 [3] "DC:monocyte-derived:LPS"             "DC:monocyte-derived"                
 [5] "Smooth_muscle_cells:bronchial:vit_D" "Smooth_muscle_cells:bronchial"      
 [7] "Epithelial_cells:bronchial"          "B_cell"                             
 [9] "Neutrophil"                          "T_cell:CD8+_Central_memory"         
[11] "T_cell:CD8+"                         "T_cell:CD4+"                        
[13] "T_cell:CD8+_effector_memory_RA"      "T_cell:CD8+_effector_memory"        
[15] "T_cell:CD8+_naive"                   "Monocyte"                           
[17] "Erythroblast"                        "BM"                                 
[19] "DC:monocyte-derived:rosiglitazone"   "DC:monocyte-derived:AM580"          

SingleR is really simple to use: we just need to provide an expression matrix and a reference SummarizedExperiment:

Code
library(SingleR)
pred_hpca <- SingleR(test = GetAssayData(sc_data), ref = hpca_se, assay.type.test=1,
    labels = hpca_se$label.main)
pred_hpca
DataFrame with 44535 rows and 4 columns
                                                scores              labels
                                              <matrix>         <character>
AAACCTGAGGCCCTCA-1_1   0.2029037:0.213122:0.201966:...    Epithelial_cells
AAACCTGAGTTCGCGC-1_1   0.0681688:0.198615:0.175877:...              B_cell
AAACCTGCACTGTTAG-1_1   0.0876658:0.201213:0.204570:...             T_cells
AAACCTGCAGACGCTC-1_1   0.1877214:0.195236:0.193748:...    Epithelial_cells
AAACCTGGTACAGCAG-1_1   0.0911602:0.192878:0.188423:...             T_cells
...                                                ...                 ...
TTTGTCAGTAAGTAGT-1_10 0.1291041:0.126240:0.0979679:...   Endothelial_cells
TTTGTCAGTCCTCCAT-1_10 0.2087580:0.136874:0.1276952:... Smooth_muscle_cells
TTTGTCAGTCTAGTCA-1_10 0.1394180:0.249968:0.2645048:...            Monocyte
TTTGTCAGTGCTGTAT-1_10 0.2323622:0.210428:0.1944528:...   Endothelial_cells
TTTGTCATCTAGCACA-1_10 0.0968535:0.202014:0.1887533:...             NK_cell
                      delta.next       pruned.labels
                       <numeric>         <character>
AAACCTGAGGCCCTCA-1_1   0.0539724    Epithelial_cells
AAACCTGAGTTCGCGC-1_1   0.0693563              B_cell
AAACCTGCACTGTTAG-1_1   0.1751957             T_cells
AAACCTGCAGACGCTC-1_1   0.5717106    Epithelial_cells
AAACCTGGTACAGCAG-1_1   0.0322466             T_cells
...                          ...                 ...
TTTGTCAGTAAGTAGT-1_10 0.05911158   Endothelial_cells
TTTGTCAGTCCTCCAT-1_10 0.00671502 Smooth_muscle_cells
TTTGTCAGTCTAGTCA-1_10 0.02419472            Monocyte
TTTGTCAGTGCTGTAT-1_10 0.06293215   Endothelial_cells
TTTGTCATCTAGCACA-1_10 0.09986082             NK_cell

Once we have the predictions, I add them to the Seurat object:

Code
sc_data@tools[["SingleR"]] <- pred_hpca |> as.data.frame()
sc_data$singleR_pred <- pred_hpca$pruned.labels

Harmonizing automatic predictions

We can check the prediction results and “harmonize” them by cluster, i.e. take the cluster-level mode of the annotation:

Code
harmony_singler <- table(sc_data$singleR_pred, factor(sc_data$harmony_clusters))
pheatmap::pheatmap(harmony_singler, scale = "column")

Code
harmony_map1 <- apply(harmony_singler, 2, function(x) names(which.max(x)))
harmony_map1 <- structure(names(harmony_map1), names = harmony_map1)
harmony_map1
          T_cells           T_cells  Epithelial_cells  Epithelial_cells 
              "0"               "1"               "2"               "3" 
         Monocyte        Macrophage            B_cell Tissue_stem_cells 
              "4"               "5"               "6"               "7" 
          T_cells            B_cell  Epithelial_cells  Epithelial_cells 
              "8"               "9"              "10"              "11" 
Endothelial_cells          Monocyte               CMP  Pre-B_cell_CD34- 
             "12"              "13"              "14"              "15" 
               DC 
             "16" 
Code
sc_data$singleR_clean <- fct_recode(sc_data$harmony_clusters, !!! harmony_map1)

If we plot, we can see the difference between the “pred” and “clean”:

Code
DimPlot(sc_data, 
        group.by = c("harmony_clusters", "singleR_pred", 
                     "singleR_clean", "sample"), 
        reduction = "umap_harmony",
        ncol = 2,
        alpha = 0.2) & 
  theme_clean &
  scale_color_igv()
Warning: Removed 67 rows containing missing values or values outside the scale range
(`geom_point()`).

Single-cell

Most methods will work best with a relevant single-cell reference.

Here is a not-so-recent but independent benchmark: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1795-z

It can be interesting to try to use this tutorial to annotate a query dataset based on a reference atlas: https://satijalab.org/seurat/articles/integration_mapping
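
As a rough sketch of that workflow (hypothetical, not run here; it assumes a reference Seurat object ref_sc with a cell_type metadata column), Seurat's transfer-anchor functions look roughly like this:

Code
# Sketch: map our query onto a hypothetical annotated reference (ref_sc)
anchors <- FindTransferAnchors(reference = ref_sc, query = sc_data, dims = 1:30)
predictions <- TransferData(anchorset = anchors, refdata = ref_sc$cell_type, dims = 1:30)
sc_data <- AddMetaData(sc_data, metadata = predictions)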

Note: always take automatic annotations with a grain of salt. They are a good starting point, but the assignment to a cell type needs to be checked manually. A good way is to go back to the differentially expressed markers using your group annotations, as sketched below.
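
For example, one quick sanity check (a sketch, not run in this tutorial) is to recompute markers using the harmonized labels as identities and verify that the expected markers come out on top:

Code
# Sketch: recompute markers per annotated cell type instead of per cluster
Idents(sc_data) <- sc_data$singleR_clean
annot_markers <- FindAllMarkers(sc_data, only.pos = TRUE)
annot_markers |> group_by(cluster) |> slice_head(n = 5)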

Extras

Sub-cluster

In their manuscript, Salomé et al selected the T cells and subclustered them. This is a common strategy when there is a main population of interest for which we would like to characterize functional states or subpopulations. It can also help provide more detailed annotations, since the space of analysis changes.

To use subclusters, just subset the object to the annotated cells and re-run the pipeline to find new relevant subgroups.
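
A minimal sketch of what that could look like (not run here; the cell-type label and parameter choices are illustrative, and in practice you may also want to re-run integration on the subset):

Code
# Sketch: subset to T cells using our harmonized labels and re-run the pipeline
t_cells <- subset(sc_data, subset = singleR_clean == "T_cells")
t_cells <- t_cells |>
  FindVariableFeatures(nfeatures = 2000) |>
  ScaleData() |>
  RunPCA() |>
  FindNeighbors(dims = 1:15) |>
  FindClusters(resolution = 0.5) |>
  RunUMAP(dims = 1:15)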

Trajectory analysis

Trajectory analysis is used to reconstruct and visualize the dynamic processes of cellular development, differentiation, and state transitions. It infers the order of cells along a particular trajectory, allowing us to calculate continuous "pseudotimes".

It can be used to map out potential lineage relationships, identify intermediate cell states, and elucidate the pathways underlying tissue development, immune responses, or disease progression, for example.

In Figure 4, Salomé et al used this type of analysis to show a subset of differentiated PD-1+ TRM CD8+ T Cells that retain independent function.

A not-so-new but very comprehensive benchmark of trajectory inference by Saelens et al (2019) is a great jumping-off point: https://www.nature.com/articles/s41587-019-0071-9

It is accompanied by the superb dynverse wrappers, which provide all the methods in the benchmark, as well as guidance on which to choose: https://dynverse.org/

Cell cycle

There are many methods to infer cell cycle stage and score the cell cycle, but Seurat also has a built-in one (CellCycleScoring) that works quite decently:

Code
sc_data <- CellCycleScoring(sc_data, 
                 s.features = cc.genes.updated.2019$s.genes, 
                 g2m.features = cc.genes.updated.2019$g2m.genes)
Code
FeaturePlot(sc_data, 
        features = c("S.Score", "G2M.Score"),
        reduction = "umap_harmony",
        ncol = 2,
        alpha = 0.2) & 
  theme_clean

We can see very little cycling except in one “epithelial” (tumor) area we previously annotated.

RNA velocity

RNA velocity is a computational method that attempts to predict the future transcriptional state of individual cells. It distinguishes between unspliced (pre-mRNA) and spliced (mature mRNA) transcripts within each cell, and therefore requires a special input in which these reads have been separated. This distinction allows estimation of the rate at which genes are being transcribed and processed, providing insights into the "direction" and "speed" of cellular state changes.

The primary goal of RNA velocity analysis is to elucidate the trajectories of cellular differentiation, activation, and other dynamic processes, thereby offering a deeper understanding of cellular development and lineage relationships. By forecasting how cells are likely to evolve, it can help in identifying transient cell states, and uncovering mechanisms underlying tissue development, regeneration, and disease progression.

It can be used together with traditional trajectory analysis.

In Salomé et al (2022), this analysis is used in Figure 5 to further characterize the axis of differentiation/exhaustion of CD8+ TRM cells.

Most RNA velocity tools work in Python, the two most well-known being velocyto and scVelo. Both have tutorials showing how to use reticulate to run their workflows in R.

Pathway enrichment

Pathway enrichment aims to identify biological pathways, processes, or functional categories that are significantly overrepresented within a set of genes of interest, such as those differentially expressed between distinct cell populations or states.

This analysis typically involves mapping the selected genes to curated pathway databases like Gene Ontology, KEGG, or Reactome and applying statistical methods to assess the significance of their enrichment.

There are many methods to do this, but an easy-to-use one is decoupleR, which implements AUCell, a well-performing scoring method: https://www.bioconductor.org/packages/release/bioc/vignettes/decoupleR/inst/doc/decoupleR.html
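
As a minimal illustration of per-cell gene-set scoring (using Seurat's built-in AddModuleScore rather than the decoupleR/AUCell workflow linked above; the gene set here is just an example):

Code
# Sketch: score an example cytotoxicity gene set per cell
cytotox_genes <- list(c("GZMA", "GZMB", "PRF1", "NKG7", "GNLY"))
sc_data <- AddModuleScore(sc_data, features = cytotox_genes, name = "Cytotox")
FeaturePlot(sc_data, features = "Cytotox1", reduction = "umap_harmony") + theme_clean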

Gene regulatory network

Gene regulatory network (GRN) analysis seeks to identify key transcription factors, regulatory motifs, and signaling pathways that drive cell states, differentiation, and responses to environmental stimuli. The ultimate goal of GRN analysis is to map out the regulatory architecture that underpins cellular heterogeneity, enabling us to pinpoint master regulators and understand the molecular mechanisms underlying various biological processes and disease states.

From an algorithmic point of view, this involves inferring relationships such as gene co-expression, regulatory influences, and causal interactions.

SCENIC is the most popular method to do this, with the Python implementation being much more performant (https://github.com/aertslab/pySCENIC). The protocol is probably the best way to run this type of analysis currently: https://github.com/aertslab/SCENICprotocol

CNV analysis

Copy Number Variation (CNV) analysis is a computational approach used to infer genomic copy number alterations from transcriptomic data. It leverages the variability in gene expression levels across the genome to detect gains or losses of genomic regions.

This method is particularly valuable in studies of cancer and other genetic disorders, where CNVs play a crucial role in disease progression and heterogeneity.

The most well-known tool for this analysis is inferCNV, which can be extremely slow and doesn't perform very well on cells with low read counts. This is why we implemented fastCNV, which is available for R on GitHub: https://github.com/must-bioinfo/fastCNVdata

Ligand-receptor analysis

Ligand-receptor analysis is a computational strategy to infer cell-cell communication by identifying interactions between signaling molecules (ligands) and their corresponding receptors expressed on different cell types or states.

This analysis involves systematically pairing ligands expressed by one or more cell populations with receptors expressed by neighboring or interacting cells, thereby predicting potential signaling pathways that facilitate intercellular communication.

Well known tools are CellPhoneDB and CellChat, but LIANA+ provides wrappers and an all-in-one framework for this kind of analysis: https://liana-py.readthedocs.io/en/latest/

Notably, this analysis can be much more powerful in spatially-resolved data.

Object conversion

Format conversion is often necessary when working with single-cell RNA-seq, to use tools built around different object types.

This is easy between SingleCellExperiment and SeuratObject, as we saw in this tutorial with as.SingleCellExperiment (and the reverse, as.Seurat).

However, these functions sometimes don’t work and the Seurat solution for h5 objects, SeuratDisk, even fails its own tutorial.

sceasy (https://github.com/cellgeni/sceasy) was made to resolve these conversion problems. Note that it requires reticulate (to make Python work in R) to be installed.
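
For example, a conversion to AnnData (for use with Python tools) looks roughly like this, following the sceasy README (a sketch, not run here; it also requires the Python package anndata to be available):

Code
# Sketch: convert the Seurat object to an AnnData .h5ad file with sceasy
library(sceasy)
library(reticulate)
sceasy::convertFormat(sc_data, from = "seurat", to = "anndata",
                      outFile = "data/sc_data.h5ad")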