Turrial: How to use geneGATer#
geneGATer is a Python package designed to facilitate the identification and ranking of potential communication genes from spatial transcriptomics data. The package integrates an adapted metagene construction method by Hue et. al. [2022] and a Graph Attention Network (GAT) to enable the exploration of cell-to-cell communication without relying on explicit cell-type annotations.
First we need to download the latest build of the package from the geneGATer github repo, if we haven’t done so yet:
!python -m pip install git+https://github.com/dertrotl/geneGATer.git@main
import geneGATer as gt
import squidpy as sq
import numpy as np
import pandas as pd
import scanpy as sc
We demonstrate the geneGATer pipeline on the squidpy coronal mouse brain section Visium dataset. The turorial is split in two parts. In part one, we apply the gt.tl.getComGenes() function, to extract the potential communication gene candidates. In part two, we show how we can use geneGATer to learn a graph attention model (GAT) to extract the most “influential” genes by ranking them. We start by loading the dataset and preprocessing it.
adata = sq.datasets.visium_hne_adata()
adata.raw = adata
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.neighbors(adata)
Part I#
Next we apply a standard pre-clustering. The “cluster-key” parameter specifies where to store the clusterin in the adata.obs section.
new_cluster = gt.pp.pre_clustering(adata, cluster_key = "new_cluster", resolution = 1)
sq.pl.spatial_scatter(adata, color="new_cluster")
We see, that we clustered our spots in 12 different regions. Next, we want to extract potential communication candidates by building metagenes for each cluster in our pre-clustering. For this, we have to input our adata object and the pre-clustering from the first steps. This will take some minutes.
result = gt.tl.getComGenes(adata, new_cluster, tresh=0.0, groupby="new_cluster", verbosse = False)
plot = gt.pl.roc_auc_classification(adata=adata, all_metas = result[2], raw_cluster=new_cluster)
“result” saves multiple outputs, indexing “result[2]” gives us the metagene expression of each cluster for each spot. To validate the results of our metagenes, we plotted the above matrix, which highlights the roc-auc scores of the spot clustering by metagenes.
Part II#
Next, we want to train a GAT model to rank the our communication candidates by their importance, according to our model and validate these results. We do this by simply running the gt.tl.learn_model() function and setting a model type and loss function (nn negative binomial, nn poisson or mse loss). For this we need a list of genes, which are ideally the genes we extracted in part I (stored in “result[4]”).
np.save('squid_genes.npy', result[4])
gene_list = result[4]
model, data = gt.tl.learn_model(adata, gene_list = gene_list, model_type = "GAT", loss = "mse", epochs = 100)
/Users/benjaminweinert/gnn_test/wandb/run-20231024_165811-s03qgv1qGAT(
(conv1): GATv2Conv(293, 293, heads=1)
(conv2): GATv2Conv(293, 18078, heads=1)
)
Epoch: 10.0 Loss: 9.472657203674316
Epoch: 20.0 Loss: 1.3256076574325562
Epoch: 30.0 Loss: 0.6214395761489868
Epoch: 40.0 Loss: 0.47567155957221985
Epoch: 50.0 Loss: 0.14159516990184784
Epoch: 60.0 Loss: 0.10813288390636444
Epoch: 70.0 Loss: 0.07258807122707367
Epoch: 80.0 Loss: 0.0632665753364563
Epoch: 90.0 Loss: 0.06014275178313255
Run history:
| Test Eval Error | █▂▂▁▁▁▁▁▁▁ |
| Test R2 | ▁▇▇███████ |
| Test R2_lin | ▁▁▂▄▄▆▇███ |
| Train Eval Error | █▂▂▁▁▁▁▁▁▁ |
| Train R2 | ▁▇▇███████ |
| Train R2_lin | ▁▁▂▄▄▆▇███ |
| Validation Eval Error | █▂▂▁▁▁▁▁▁▁ |
| Validation R2 | ▁▆▇███████ |
| Validation R2_lin | ▁▁▂▄▄▆▇███ |
| epoch | ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
| loss | ▁█▂▂▃▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |
Run summary:
| Test Eval Error | 0.05811 |
| Test R2 | -0.11428 |
| Test R2_lin | 0.6006 |
| Train Eval Error | 0.0591 |
| Train R2 | -0.1429 |
| Train R2_lin | 0.59605 |
| Validation Eval Error | 0.05957 |
| Validation R2 | -0.11087 |
| Validation R2_lin | 0.59305 |
| epoch | 100 |
| loss | 0.05902 |
View job at https://wandb.ai/thesis_weinert/my_model/jobs/QXJ0aWZhY3RDb2xsZWN0aW9uOjEwOTUyNDA2Mw==/version_details/v0
Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
./wandb/run-20231024_165811-s03qgv1q/logsFor monitoring the learning performance, we use “weights & biases” (wandb). The results and the progress is directly uploaded to a wandb instance. Furthermore, we can analyse our results by using some of the plotting functions of this package. (The results indicated here can be significantly improved by learning up to 1000 epochs.)
gt.pl.attention_umap(model, data, adata, n_components=700, cluster_key="new_cluster", umap_model="umap") #tsne works as well
(<Figure size 1000x1000 with 1 Axes>,
array([[5.2995124, 4.820972 ],
[4.659695 , 4.6405478],
[4.6481504, 4.5228744],
...,
[2.0355809, 4.9695773],
[6.297827 , 5.3710666],
[6.3939495, 1.6471777]], dtype=float32))
We can get the top ranked genes by two different ranking methods:
Using classical saliancy scores using the
gt.pl.get_top_k_genes_saliency()function.Using receiving and sending weight signals using the
gt.pl.get_top_k_genes()function.
Furthermore we can compare the top ranked genes by a list of marker genes. In this tutorial, we want to compare our results with found ligand and receptor interactions using the CellphoneDB implementation in Squidpy. Genes which are also present in the marker gene list, are marked with an asterisk (*).
# Extracting L&R interactions using CellphoneDB and Squidpy
adata.obs["new_cluster"] = new_cluster
res = sq.gr.ligrec(adata, "new_cluster",
fdr_method=None,
copy=True,
interactions_params={"resources": "CellPhoneDB"},
threshold=0.1, seed=0, n_perms=10000, n_jobs=1, use_raw=False)
df = res["pvalues"]
print("Number of CellPhoneDB interactions:", len(df))
ligand_receptor_space = []
for tuples in range(0,res["metadata"].index.values.shape[0]):
for idx in range(0,2):
ligand_receptor_space.append(res["metadata"].index.values[tuples][idx])
ligand_receptor_space = list( set(ligand_receptor_space) )
print("Number of unique Ligand / Receptor Genes:", len(ligand_receptor_space))
gene_list_upper = [x.upper() for x in gene_list]
Number of CellPhoneDB interactions: 654
Number of unique Ligand / Receptor Genes: 493
gt.pl.get_top_k_genes_saliency(model, data, gene_list_upper, 10, marker_genes = ligand_receptor_space) # Plotting ranked genes by ranking method 1
(<Figure size 1000x600 with 1 Axes>,
all_genes
0 PVALB
1 NOV
2 TYRO3
3 CABP7
4 RAB3C
5 ST6GAL2
6 SH3GL3
7 CAMK2N1
8 SEPT8
9 RPS8)
res = gt.pl.get_top_k_genes(model, gene_list_upper, 10, marker_genes = ligand_receptor_space) # Plotting ranked genes by ranking method 2
More plotting functions can be found in the documentation.