Integration of cell lines¶
Introduction¶
If you haven't checked out our basic pipeline, we'd avise starting there as it will explain the different steps in greater detail.
We'll start by loading the single-cell RNA-seq dataset from the CCLE, which contains data on 200 cell lines. This is a very large dataset, so it may take some time to download.
import scycle as cc
import scanpy as sc
sc200 = cc.data.get_data('sc200_CCLE')
sc200
-- Loading data from cache...
AnnData object with n_obs × n_vars = 42362 × 30314 obs: 'Cell_line', 'Pool_ID', 'Cancer_type', 'Genes_expressed', 'Discrete_cluster_minpts5_eps1.8', 'Discrete_cluster_minpts5_eps1.5', 'Discrete_cluster_minpts5_eps1.2', 'CNA_subclone', 'SkinPig_score', 'EMTI_score', 'EMTII_score', 'EMTIII_score', 'IFNResp_score', 'p53Sen_score', 'EpiSen_score', 'StressResp_score', 'ProtMatu_score', 'ProtDegra_score', 'G1/S_score', 'G2/M_score', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt' var: 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
We'll first apply standard pre-processing using a scanpy
recipe without pooling to visualize the whole dataset.
cc.pp.filter_cells(sc200)
sc.pp.recipe_zheng17(sc200, n_top_genes = 10000)
34920 cells pass the count filter 42362 cells pass the mt filter Cells selected 34920
sc.pp.pca(sc200)
sc.pp.neighbors(sc200)
sc.tl.umap(sc200)
sc.pl.umap(sc200, color='Cancer_type')
/home/clarice/.local/lib/python3.8/site-packages/sklearn/manifold/_spectral_embedding.py:245: UserWarning: Graph is not fully connected, spectral embedding may not work as expected.
We can see, as was also described by the publication of these data, that cell lines are quite heterogenous by tissue of origin. We can better inspect it using an interactive plot:
import pandas as pd
import plotly.express as px
sc200_df = pd.DataFrame(sc200.obsm['X_umap'], index = sc200.obs.index, columns = ['x', 'y'])
sc200_df['Cancer_type'] = sc200.obs['Cancer_type'].values
sc200_df['Cell_line'] = sc200.obs['Cell_line'].values
ct_colors = sc200.uns['Cancer_type_colors']
sc200_df = sc200_df.sort_values('Cancer_type')
fig = (px.scatter(sc200_df, x='x', y='y', color ='Cancer_type',
color_discrete_sequence = ct_colors, hover_data=['Cell_line']))
fig.update_traces(marker = dict(size = 2))
fig.update_layout(scene = dict(xaxis_title = 'UMAP1', yaxis_title = 'UMAP2'))
fig
sc200_df.groupby('Cancer_type').count()['x'].sort_values(ascending = False)
Cancer_type Lung Cancer 8692 Head and Neck Cancer 4719 Skin Cancer 2777 Breast Cancer 2200 Brain Cancer 2038 Esophageal Cancer 1662 Pancreatic Cancer 1601 Ovarian Cancer 1427 Endometrial/Uterine Cancer 1412 Colon/Colorectal Cancer 1406 Kidney Cancer 1103 Bladder Cancer 967 Liver Cancer 959 Gastric Cancer 900 Thyroid Cancer 684 Bone Cancer 683 Sarcoma 608 Bile Duct Cancer 353 Neuroblastoma 263 Prostate Cancer 238 Fibroblast 147 Gallbladder Cancer 81 Name: x, dtype: int64
sc200_df.reset_index()[['Cancer_type', 'Cell_line']].drop_duplicates().groupby('Cancer_type').count()['Cell_line'].sort_values(ascending = False)
Cancer_type Lung Cancer 40 Head and Neck Cancer 19 Skin Cancer 16 Breast Cancer 15 Ovarian Cancer 13 Brain Cancer 12 Pancreatic Cancer 11 Colon/Colorectal Cancer 10 Endometrial/Uterine Cancer 10 Esophageal Cancer 7 Liver Cancer 7 Gastric Cancer 6 Bladder Cancer 6 Kidney Cancer 6 Bile Duct Cancer 4 Thyroid Cancer 4 Sarcoma 3 Bone Cancer 3 Neuroblastoma 2 Prostate Cancer 2 Gallbladder Cancer 1 Fibroblast 1 Name: Cell_line, dtype: int64
The two cancer types with larger representation of cells and cell lines are Lung Cancer and Head and Neck cancer. We'll be subsetting the big dataset to look at cell cycle in the Head and Neck cancer cell lines.
Head and neck cell line integration¶
We'll reload and subset the data so we can process using scycle
.
schn = sc200[sc200.obs['Cancer_type'] == 'Head and Neck Cancer']
We can now look into the cell lines. we see that we have a few cell lines with a low cell count, and that SCC255 and JHU011 have the highest number of cells.
schn.obs.reset_index()[['NAME', 'Cell_line']].groupby('Cell_line').count().sort_values('NAME', ascending = False)['NAME']
Cell_line SCC25_UPPER_AERODIGESTIVE_TRACT 575 JHU011_UPPER_AERODIGESTIVE_TRACT 541 YD38_UPPER_AERODIGESTIVE_TRACT 426 JHU029_UPPER_AERODIGESTIVE_TRACT 408 93VU_UPPER_AERODIGESTIVE_TRACT 391 SCC9_UPPER_AERODIGESTIVE_TRACT 348 SCC47_UPPER_AERODIGESTIVE_TRACT 332 BICR6_UPPER_AERODIGESTIVE_TRACT 243 BICR31_UPPER_AERODIGESTIVE_TRACT 200 BICR16_UPPER_AERODIGESTIVE_TRACT 175 PECAPJ49_UPPER_AERODIGESTIVE_TRACT 167 SCC90_UPPER_AERODIGESTIVE_TRACT 153 SNU1214_UPPER_AERODIGESTIVE_TRACT 152 JHU006_UPPER_AERODIGESTIVE_TRACT 137 SNU46_UPPER_AERODIGESTIVE_TRACT 126 DETROIT562_UPPER_AERODIGESTIVE_TRACT 117 BHY_UPPER_AERODIGESTIVE_TRACT 110 BICR56_UPPER_AERODIGESTIVE_TRACT 71 SNU899_UPPER_AERODIGESTIVE_TRACT 47 Name: NAME, dtype: int64
We'll follow the initial standard steps of scycle pipeline:
cc.tl.dimensionality_reduction(schn)
cc.pl.cell_cycle_pca(schn)
Trying to set attribute `.uns` of view, copying. /home/clarice/miniconda3/envs/scycle/lib/python3.8/site-packages/scycle/tools/_dimensionality_reduction.py:75: UserWarning: Data has not been pre-processed using scycle (`pp.prep_pooling`or `pp.prep_simple`), before dimensionality reduction
Dimensionality reduction using PCA...
<ggplot: (8781371292889)>
We can see in the visualization that we don't quite get a "circle" where we can fit a trajectory. We'll dissect this further:
cc.pl.cell_cycle_pca(schn, col_var = 'Cell_line', alpha = 0.5, size = 1)
<ggplot: (8781371219914)>
It seems that cells from the same cell line are following a roughly circular (or 'triangular') trajectory, and we're seeing all their trajectories stacked. Here's an example of single cell lines:
cc.pl.cell_cycle_pca(schn[schn.obs['Cell_line'] == 'SCC25_UPPER_AERODIGESTIVE_TRACT'], col_var = 'Cell_line', alpha = 0.5, size = 1)
<ggplot: (8781363244029)>
cc.pl.cell_cycle_pca(schn[schn.obs['Cell_line'] == 'JHU011_UPPER_AERODIGESTIVE_TRACT'], col_var = 'Cell_line', alpha = 0.5, size = 1)
<ggplot: (8781370340241)>
As JHU011 had a cleaner trajectory, we'll use it as the reference for integration. In scycle
, integration is implemented using weighted optimal transport (WOTi):
jhu011 = schn[schn.obs['Cell_line'] == 'JHU011_UPPER_AERODIGESTIVE_TRACT']
cc.tl.integration(schn, jhu011)
--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) <ipython-input-15-ebf635749bfe> in <module> ----> 1 cc.tl.integration(schn, jhu011) ~/miniconda3/envs/scycle/lib/python3.8/site-packages/scycle/tools/_integration.py in integration(_adata_src, _adata_ref, components, method, entropy, hreg, unbalanced, mreg, weighting_strategy, downsampling, normalize, jitter_std, verbose, max_iter) 82 print("-- Automatically detecting cell-cycle components...") 83 if "find_cc_components" not in _adata_ref.uns["scycle"].keys(): ---> 84 find_cc_components(_adata_ref) 85 components = list(_adata_ref.uns["scycle"]["find_cc_components"]["indices"].values()) 86 ~/miniconda3/envs/scycle/lib/python3.8/site-packages/scycle/tools/_find_cc_components.py in find_cc_components(adata, thr, verbose) 33 34 #-- Get scores ---> 35 g1s_scores = _compute_scores(adata, g1s_markers) 36 g2m_scores = _compute_scores(adata, g2m_markers) 37 g2mi_scores = _compute_scores(adata, g2m_inhibitory_markers) ~/miniconda3/envs/scycle/lib/python3.8/site-packages/scycle/tools/_find_cc_components.py in _compute_scores(adata, marker_genes) 124 marker_genes = adata.var_names.intersection(marker_genes) 125 positions = [adata.var_names.get_loc(g) for g in marker_genes] --> 126 return adata.uns["dimRed"].S_[:, positions].mean(axis=1) 127 128 def enrich_components(adata, thr=3, verbose=True): AttributeError: 'PCA' object has no attribute 'S_'
We can inspect the results of the integration below:
cc.pl.cell_cycle_pca3d(schn, col_var = 'Cell_line', alpha = 0.5, size = 1.5)
Now we can proceed with the analysis of all the head and neck cell lines.
Run scycle on integrated dataset¶
Now we can simply follow the pipeline:
- cc.tl.trajectory finds the circular trajectory
- cc.tl.cell_division cuts the trajectory by finding the moment of cell division based on total_counts
- cc.tl.pseudotime calculates a pseudotime between 0 and 1 based on the trajectory
- cc.tl.cell_cycle_phase assigns a cell cycle phase to each cell based on their pseudotime and on the curvature of the trajectory in the embedded space
cc.tl.trajectory(schn)
WARNING: The initial number of nodes must be at least 3. This will be fixed
cc.pl.cell_cycle_pca(schn, trajectory = True)
<ggplot: (8756503093932)>
cc.tl.cell_division(schn)
Suggested moment of cell division: [4 5] Direction of cell cycle: 1 Remapping edges using [4 5] ...
cc.pl.cell_division(schn)
<ggplot: (8756511319372)>
cc.tl.pseudotime(schn)
Calculating pseudotimes for each cell...
cc.pl.pseudotime_histogram(schn)
<ggplot: (8756494910850)>
cc.tl.cell_cycle_phase(schn)
-- Suggested cell cycle division: G1: 0 - 0.3333333333333333 S: 0.3333333333333333 - 0.6666666666666666 G2-M: 0.6666666666666666 - 1
cc.pl.cell_cycle_scores(schn, alpha = 0.25)
/home/clarice/.local/lib/python3.8/site-packages/plotnine/layer.py:467: PlotnineWarning: geom_vline : Removed 1 rows containing missing values.
<ggplot: (8756600572353)>
cc.pl.cell_cycle_phase_pieplot(schn)
([<matplotlib.patches.Wedge at 0x7f6c84029970>, <matplotlib.patches.Wedge at 0x7f6c8401ae80>, <matplotlib.patches.Wedge at 0x7f6c842785e0>], [Text(0.5811813489928175, 0.9923347416990342, 'G2-M'), Text(-1.149458539442629, 0.03528549422096812, 'S'), Text(0.5504598632792116, -1.0096999251850185, 'G1')], [Text(0.2779562973443909, 0.47459487646475546, '33.1%'), Text(-0.5497410406029966, 0.016875671149158666, '32.8%'), Text(0.2632634128726664, -0.48289996421892184, '34.1%')])
Inspect results¶
Now that the pipeline is complete, the results generated using scycle
can also be easily accessed by scanpy and used in relation to other variables. Here, we'll simply plot them in the UMAP projection.
sc.pp.pca(schn)
sc.pp.neighbors(schn)
sc.tl.umap(schn)
schn.uns['cell_cycle_phase_colors'] = ['#8da0cb', '#fc8d62', '#66c2a5']
sc.pl.umap(schn, color = ['Cell_line', 'G1-S', 'G2-M', 'pseudotime', 'cell_cycle_phase'], ncols = 2)