Selecting genes for a MERFISH gene panel
This notebook shows how to use a reference scRNA-seq dataset and NSForest to identify genes to include in a MERFISH panel in order to resolve the cell types in the MERFISH results.
We will use the mammary gland droplet dataset from Tabula Muris Senis: https://figshare.com/articles/dataset/Processed_files_to_use_with_scanpy_/8273102/2
[ ]:
import scanpy as sc
adata = sc.read("/storage/RNA_MERFISH/Collaborations/CellSenescence/TabulaMuris/Mammary_Gland_droplet.h5ad")
[116]:
sc.set_figure_params(facecolor="w", dpi=100)
sc.pl.umap(adata, color="cluster_names", legend_loc="on data", legend_fontsize="xx-small", legend_fontoutline=True, frameon=False)
We will use the “cluster_names” annotation for this example. First, we will run NSForest. If the count matrix is stored in the scanpy object as a sparse matrix, NSForest will generate an error, so we first need to convert it to a dense matrix.
[14]:
adata.X
[14]:
<11392x19860 sparse matrix of type '<class 'numpy.float32'>'
with 22792088 stored elements in Compressed Sparse Row format>
[15]:
adata.X = adata.X.toarray()
There is also an NSForest error that can occur with certain gene names. One way to address this is to keep re-running NSForest and adding the genes it gives an error on one-by-one to the “bad_genes” list below, however that can be a slow process. It doesn’t seem to like the group of genes with names such as “2010001M09Rik”, so below we just remove all such genes.
[16]:
bad_genes = ["2010001M09Rik"]
genes = [x for x in adata.var_names if x not in bad_genes and not x.endswith("Rik")]
adata = adata[:, genes]
Make sure NSForest_v3.py is in the same folder as this notebook. It can be downloaded here: https://github.com/JCVenterInstitute/NSForest
Some parameters can be changed in the NS_Forest function: - Median_Expression_Level (default 0). This is a cutoff for median expression level to remove negative markers. Negative markers may be useful for clustering, though! - Genes_to_testing (default 6). This is how many top genes will be considered for selecting the best set of markers. Every permutation of these genes is tested, so raising the number will increase the runtime exponentially. - betaValue (default 0.5). This is the beta value for the f-measure. 1 means precision and recall are weighted equally. Closer to 0 weights precision more and greater than 0 weights towards recall.
[ ]:
from NSForest_v3 import NS_Forest
adata_markers = NS_Forest(adata, clusterLabelcolumnHeader="cluster_names")
In the NSForest result, the NSForest_Markers column contains what NSForest considers to be the minimal set of genes necessary to differentiate the clusters, while the f-measure column is the accuracy of these markers for each cluster. The final column “Binary_Genes” contains the top genes that differentiated the cluster when considered independently of each other. These were what were tested to find the best combination for the NSForest_Markers column.
[18]:
adata_markers
[18]:
| clusterName | f-measure | markerCount | NSForest_Markers | True Positive | True Negative | False Positive | False Negative | 1 | 2 | 3 | 4 | 5 | 6 | index | Binary_Genes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0_basal cell | 0.980783 | 2 | [Krt14, Acta2] | 1378.0 | 9927.0 | 16.0 | 71.0 | Krt14 | Acta2 | 0_basal cell&Krt14>=1.7194600105285645&Acta2>=... | [Col9a2, Krt17, Krt5, Cxcl14, Krt14, Acta2, Ta... | ||||
| 1 | 10_T cell | 0.560000 | 2 | [Dapl1, Cd3g] | 168.0 | 10777.0 | 71.0 | 376.0 | Dapl1 | Cd3g | 10_T cell&Dapl1>=2.660470962524414&Cd3g>=2.809... | [Dapl1, Hspa1a, Cd8b1, Cd3g, Cd3e, Cd3d, Coro1... | ||||
| 2 | 11_macrophage | 0.720748 | 2 | [Alox5ap, Cfp] | 239.0 | 10825.0 | 45.0 | 283.0 | Alox5ap | Cfp | 11_macrophage&Alox5ap>=0.917403906583786&Cfp>=... | [Alox5ap, Lyz1, Lyz2, Mpeg1, Plbd1, Cfp, Ccl9,... | ||||
| 3 | 12_T cell | 0.789474 | 2 | [Cxcr6, Icos] | 141.0 | 11084.0 | 7.0 | 160.0 | Cxcr6 | Icos | 12_T cell&Cxcr6>=6.030929327011108&Icos>=8.281... | [Cxcr6, Actn2, Ccr2, Furin, Icos, Ramp3, Il7r,... | ||||
| 4 | 13_luminal epithelial cell of mammary gland | 0.738523 | 2 | [Krt6a, Ppp1r14c] | 74.0 | 11202.0 | 5.0 | 111.0 | Krt6a | Ppp1r14c | 13_luminal epithelial cell of mammary gland&Kr... | [Krt6a, Aldh3a1, Ppp1r14c, Ghr, Cldn10, F3, Da... | ||||
| 5 | 14_T cell | 0.858748 | 2 | [Apol7c, H2-M2] | 107.0 | 11209.0 | 4.0 | 72.0 | Apol7c | H2-M2 | 14_T cell&Apol7c>=4.285170078277588&H2-M2>=4.8... | [Ccl22, Apol7c, H2-M2, Cacnb3, Nudt17, Rogdi, ... | ||||
| 6 | 15_T cell | 0.931373 | 3 | [Glycam1, Cd3d, Rps16] | 133.0 | 11213.0 | 1.0 | 45.0 | Glycam1 | Cd3d | Rps16 | 15_T cell&Glycam1>=9.890417098999023&Cd3d>=2.3... | [Glycam1, Lck, Cd3d, Rps15a-ps4, Rps15a, Rps16... | |||
| 7 | 16_T cell | 0.790514 | 2 | [Tnfrsf4, Ikzf2] | 80.0 | 11224.0 | 6.0 | 82.0 | Tnfrsf4 | Ikzf2 | 16_T cell&Tnfrsf4>=8.367594718933105&Ikzf2>=5.... | [Tnfrsf4, Foxp3, Ikzf2, Tnfrsf18, Stx11, Folr4... | ||||
| 8 | 17_stromal cell | 0.873786 | 1 | [Myl9] | 72.0 | 11280.0 | 4.0 | 36.0 | Myl9 | 17_stromal cell&Myl9>=7.554762363433838 | [Pcp4l1, Nrip2, Ppp1r14a, Myl9, S1pr3, Des, Mu... | |||||
| 9 | 18_B cell | 0.878963 | 2 | [Glycam1, Cd79a] | 61.0 | 11310.0 | 7.0 | 14.0 | Glycam1 | Cd79a | 18_B cell&Glycam1>=9.848810195922852&Cd79a>=1.... | [Scd1, H2-Ob, Glycam1, Faim3, H2-DMa, Cd79a, C... | ||||
| 10 | 19_B cell | 0.921986 | 2 | [Igj, Cacna1s] | 52.0 | 11318.0 | 0.0 | 22.0 | Igj | Cacna1s | 19_B cell&Igj>=8.139252662658691&Cacna1s>=7.40... | [Igj, Derl3, Cacna1s, Edem1, Creld2, Txndc5, S... | ||||
| 11 | 1_B cell | 0.889702 | 2 | [Cd79a, H2-Aa] | 997.0 | 10110.0 | 111.0 | 174.0 | Cd79a | H2-Aa | 1_B cell&Cd79a>=2.0949586629867554&H2-Aa>=2.20... | [Ms4a1, Faim3, H2-Ob, Cd79b, Cd79a, H2-Aa, H2-... | ||||
| 12 | 20_stromal cell | 1.000000 | 2 | [Col6a5, Daglb] | 62.0 | 11330.0 | 0.0 | 0.0 | Col6a5 | Daglb | 20_stromal cell&Col6a5>=9.840846061706543&Dagl... | [Col6a5, Pmm1, Tdo2, Car4, Tcf21, Daglb, Ltc4s... | ||||
| 13 | 21_basal cell | 0.823529 | 2 | [Ccl19, Vcam1] | 14.0 | 11363.0 | 0.0 | 15.0 | Ccl19 | Vcam1 | 21_basal cell&Ccl19>=8.919580459594727&Vcam1>=... | [Ccl19, Vcam1, Ptgs2, Des, Cxcl12, C3, Cxcl1, ... | ||||
| 15 | 2_T cell | 0.658449 | 3 | [Satb1, Vps37b, Tmem66] | 438.0 | 10238.0 | 140.0 | 576.0 | Satb1 | Vps37b | Tmem66 | 2_T cell&Satb1>=2.778969645500183&Vps37b>=3.89... | [Ramp3, Gramd3, Satb1, Emb, Vps37b, Tmem66, Cd... | |||
| 16 | 3_stromal cell | 0.793898 | 2 | [Col6a3, Col4a1] | 510.0 | 10343.0 | 41.0 | 498.0 | Col6a3 | Col4a1 | 3_stromal cell&Col6a3>=3.2078611850738525&Col4... | [Smoc2, Entpd2, Col5a3, Cygb, Col6a3, Col4a1, ... | ||||
| 17 | 4_luminal epithelial cell of mammary gland | 0.825826 | 2 | [Fcgbp, Expi] | 660.0 | 10414.0 | 126.0 | 192.0 | Fcgbp | Expi | 4_luminal epithelial cell of mammary gland&Fcg... | [Fcgbp, Csn3, Elf5, Atp6v1b1, Rhov, Expi, Cldn... | ||||
| 18 | 5_luminal epithelial cell of mammary gland | 0.945302 | 2 | [Tmem56, Slc12a2] | 674.0 | 10550.0 | 9.0 | 159.0 | Tmem56 | Slc12a2 | 5_luminal epithelial cell of mammary gland&Tme... | [Tmem56, Cxcl15, Mtmr7, Fam25c, Slc7a2, Slc12a... | ||||
| 19 | 6_stromal cell | 0.902455 | 2 | [Sema3c, Pi16] | 544.0 | 10611.0 | 19.0 | 218.0 | Sema3c | Pi16 | 6_stromal cell&Sema3c>=3.4733753204345703&Pi16... | [Sema3c, Efhd1, Pi16, Fndc1, Cd248, Cxcr7, Fn1... | ||||
| 20 | 7_stromal cell | 0.757024 | 2 | [Itm2a, Hsd11b1] | 291.0 | 10691.0 | 19.0 | 391.0 | Itm2a | Hsd11b1 | 7_stromal cell&Itm2a>=2.9673908948898315&Hsd11... | [Penk, Cfh, Itm2a, Apod, Hsd11b1, Srpx, Enpp2,... | ||||
| 21 | 8_T cell | 0.818103 | 2 | [Nkg7, Ctla2a] | 376.0 | 10739.0 | 47.0 | 230.0 | Nkg7 | Ctla2a | 8_T cell&Nkg7>=3.0162134170532227&Ctla2a>=1.47... | [Xcl1, Ctsw, Cst7, Ly6c2, Nkg7, Ctla2a, Ccl5, ... | ||||
| 22 | 9_endothelial cell | 0.959259 | 2 | [Cdh5, Rasip1] | 518.0 | 10788.0 | 8.0 | 78.0 | Cdh5 | Rasip1 | 9_endothelial cell&Cdh5>=1.579826295375824&Ras... | [Cdh5, Egfl7, Aqp1, Rasip1, Emcn, Pecam1, Cd36... |
Now we want to validate this result. We can do this by subsetting our original reference dataset to only these marker genes to simulate a MERFISH experiment, and then check how well the original clusters are preserved.
[104]:
from functools import reduce
markers = reduce(lambda x, y: x+y, adata_markers["NSForest_Markers"])
markers = list(set(markers)) # Remove any duplicates
[105]:
len(markers)
[105]:
43
[106]:
test = adata[:,markers].copy()
[107]:
sc.tl.pca(test)
sc.pp.neighbors(test)
sc.tl.umap(test, min_dist=0.3)
[108]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 2, figsize=(8,4))
sc.pl.umap(adata, color="cluster_names", title="Original clustering", legend_loc="on data", legend_fontsize="xx-small", legend_fontoutline=True, frameon=False, ax=ax[0], show=False)
sc.pl.umap(test, color="cluster_names", title="Re-clustering", legend_loc="on data", legend_fontsize="xx-small", legend_fontoutline=True, frameon=False, ax=ax[1], show=False);
In the UMAP plots above, we can see that some clusters such as 9_endothelial cell remain well separated in the re-clustering, however other clusters may have potential problems. For example cluster 17_stromal cell and 0_basal cell may be difficult to separate.
To quantify the difference between the original and re-clustering, we can use the silhouette score.
[109]:
from sklearn.metrics import silhouette_score, silhouette_samples
test.obs["silhouette_score"] = silhouette_samples(test.obsm["X_pca"], test.obs["cluster_names"])
[110]:
sc.pl.umap(test, color="silhouette_score", frameon=False, cmap="coolwarm", vcenter=0)
A silhouette score above 0 means that cell is closer on average to other cells in its cluster than to cells in the nearest neighboring cluster. Let’s find the average silhouette score per cluster.
[111]:
test_sils = test.obs.groupby("cluster_names").mean()["silhouette_score"].sort_values()
test_sils
[111]:
cluster_names
11_macrophage -0.166316
8_T cell -0.014398
3_stromal cell 0.047796
7_stromal cell 0.100698
10_T cell 0.118983
13_luminal epithelial cell of mammary gland 0.124980
15_T cell 0.125226
12_T cell 0.134146
2_T cell 0.138142
14_T cell 0.142750
17_stromal cell 0.167218
5_luminal epithelial cell of mammary gland 0.189672
9_endothelial cell 0.200774
16_T cell 0.229811
0_basal cell 0.236031
21_basal cell 0.255785
6_stromal cell 0.322757
1_B cell 0.341201
18_B cell 0.419972
19_B cell 0.425086
4_luminal epithelial cell of mammary gland 0.491056
20_stromal cell 0.560053
Name: silhouette_score, dtype: float32
Now we can see which clusters are the most problematic. With this gene list, we may have trouble separating cells in the 11_macrophage cluster from other clusters. But we should compare this with silhouette scores from the original clustering, as that may not have been perfect either.
[112]:
adata.obs["silhouette_score"] = silhouette_samples(adata.obsm["X_pca"], adata.obs["cluster_names"])
[113]:
import pandas as pd
adata_sils = adata.obs.groupby("cluster_names").mean()["silhouette_score"].sort_values()
sils = pd.DataFrame([adata_sils, test_sils], index=["Original", "MERFISH"]).T
sils["Loss"] = sils["Original"] - sils["MERFISH"]
sils.sort_values(by="MERFISH")
[113]:
| Original | MERFISH | Loss | |
|---|---|---|---|
| cluster_names | |||
| 11_macrophage | -0.015457 | -0.166316 | 0.150859 |
| 8_T cell | 0.097721 | -0.014398 | 0.112120 |
| 3_stromal cell | 0.155432 | 0.047796 | 0.107636 |
| 7_stromal cell | 0.204229 | 0.100698 | 0.103532 |
| 10_T cell | 0.274723 | 0.118983 | 0.155740 |
| 13_luminal epithelial cell of mammary gland | 0.334078 | 0.124980 | 0.209099 |
| 15_T cell | 0.125604 | 0.125226 | 0.000378 |
| 12_T cell | 0.263566 | 0.134146 | 0.129420 |
| 2_T cell | 0.279595 | 0.138142 | 0.141453 |
| 14_T cell | 0.394002 | 0.142750 | 0.251251 |
| 17_stromal cell | 0.421020 | 0.167218 | 0.253802 |
| 5_luminal epithelial cell of mammary gland | 0.309351 | 0.189672 | 0.119679 |
| 9_endothelial cell | 0.265348 | 0.200774 | 0.064573 |
| 16_T cell | 0.342089 | 0.229811 | 0.112278 |
| 0_basal cell | 0.362051 | 0.236031 | 0.126020 |
| 21_basal cell | 0.438598 | 0.255785 | 0.182813 |
| 6_stromal cell | 0.384681 | 0.322757 | 0.061924 |
| 1_B cell | 0.280059 | 0.341201 | -0.061142 |
| 18_B cell | 0.351161 | 0.419972 | -0.068810 |
| 19_B cell | 0.517053 | 0.425086 | 0.091967 |
| 4_luminal epithelial cell of mammary gland | 0.362803 | 0.491056 | -0.128252 |
| 20_stromal cell | 0.588689 | 0.560053 | 0.028635 |
The 11_macrophage cluster had the lowest silhouette score in the original clustering as well. The performance of this gene list can be summarized into a single metric by adding all the differences between per-cluster silhouette scores.
[114]:
sils["Loss"].sum()
[114]:
2.144974
Let’s try using the Binary_Genes from NSForest instead and see if the result is better.
[81]:
from functools import reduce
markers = reduce(lambda x, y: list(x)+list(y), adata_markers["Binary_Genes"])
markers = list(set(markers)) # Remove any duplicates
[82]:
len(markers)
[82]:
204
[83]:
test = adata[:,markers].copy()
[84]:
sc.tl.pca(test)
sc.pp.neighbors(test)
sc.tl.umap(test, min_dist=0.3)
[85]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 2, figsize=(8,4))
sc.pl.umap(adata, color="cluster_names", title="Original clustering", legend_loc="on data", legend_fontsize="xx-small", legend_fontoutline=True, frameon=False, ax=ax[0], show=False)
sc.pl.umap(test, color="cluster_names", title="Re-clustering", legend_loc="on data", legend_fontsize="xx-small", legend_fontoutline=True, frameon=False, ax=ax[1], show=False);
[86]:
from sklearn.metrics import silhouette_score, silhouette_samples
test.obs["silhouette_score"] = silhouette_samples(test.obsm["X_pca"], test.obs["cluster_names"])
[87]:
sc.pl.umap(test, color="silhouette_score", frameon=False, cmap="coolwarm", vcenter=0)
[90]:
test_sils = test.obs.groupby("cluster_names").mean()["silhouette_score"].sort_values()
sils = pd.DataFrame([adata_sils, test_sils], index=["Original", "MERFISH"]).T
sils["Loss"] = sils["Original"] - sils["MERFISH"]
sils.sort_values(by="MERFISH")
[90]:
| Original | MERFISH | Loss | |
|---|---|---|---|
| cluster_names | |||
| 11_macrophage | -0.015457 | -0.066147 | 0.050690 |
| 8_T cell | 0.097721 | 0.065990 | 0.031732 |
| 12_T cell | 0.263566 | 0.118317 | 0.145249 |
| 3_stromal cell | 0.155432 | 0.120713 | 0.034720 |
| 2_T cell | 0.279595 | 0.130157 | 0.149437 |
| 15_T cell | 0.125604 | 0.132801 | -0.007198 |
| 10_T cell | 0.274723 | 0.138944 | 0.135779 |
| 7_stromal cell | 0.204229 | 0.217068 | -0.012839 |
| 16_T cell | 0.342089 | 0.241534 | 0.100556 |
| 14_T cell | 0.394002 | 0.245596 | 0.148406 |
| 13_luminal epithelial cell of mammary gland | 0.334078 | 0.258517 | 0.075561 |
| 5_luminal epithelial cell of mammary gland | 0.309351 | 0.283646 | 0.025705 |
| 9_endothelial cell | 0.265348 | 0.294066 | -0.028718 |
| 21_basal cell | 0.438598 | 0.357455 | 0.081143 |
| 4_luminal epithelial cell of mammary gland | 0.362803 | 0.394781 | -0.031978 |
| 6_stromal cell | 0.384681 | 0.401664 | -0.016983 |
| 19_B cell | 0.517053 | 0.409960 | 0.107093 |
| 1_B cell | 0.280059 | 0.418698 | -0.138640 |
| 18_B cell | 0.351161 | 0.433797 | -0.082636 |
| 17_stromal cell | 0.421020 | 0.441464 | -0.020443 |
| 0_basal cell | 0.362051 | 0.478556 | -0.116506 |
| 20_stromal cell | 0.588689 | 0.579484 | 0.009205 |
[94]:
sils["Loss"].sum()
[94]:
0.63933223
This has significantly reduced the total loss of silhouette score, but at the expense of needing over 200 genes. Further experimenting could be done to find the best trade-off between the number of genes and silhouette score loss, such as only adding more genes for the Binary_Genes column for the clusters with the highest silhouette score. Alternative methods of selecting genes could be used as well, such as the rank_genes_groups scanpy function to identify differentially expressed genes between clusters.