Multi Repertoire Cluster Matrix
When combining TCR repertoires from multiple people into a metarepertoire, we lose sight of each subject's or sample's own data. To solve this, we generate a cluster matrix, which lets us compare data from different sources.
Here’s an example:
| | 1 | 2 | 3 | 4 | … |
|---|---|---|---|---|---|
| Subject_1 | 29 | 38 | 16 | 8 | … |
| Subject_2 | 16 | 8 | 17 | 4 | … |
| Subject_3 | 2 | 3 | 0 | 1 | … |
| Subject_4 | 0 | 2 | 0 | 0 | … |
Each row in the matrix is one person’s data and each column is a cluster (corresponding to the cluster-id from the clustering output). Each cell contains the number of TCRs from that person found in that cluster. For example, 29 TCRs from Subject_1 are found in cluster 1.
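To make the counting concrete, here is a minimal sketch in plain pandas, independent of the library's own implementation. It assumes a hypothetical long-format table with one row per TCR, a subject column and a cluster column; the column names and the toy values are purely illustrative.

```python
import pandas as pd

# Hypothetical toy input: one row per TCR, with the subject it came from
# and the id of the cluster it was assigned to.
tcrs = pd.DataFrame({
    "subject": ["Subject_1", "Subject_1", "Subject_1", "Subject_2", "Subject_2", "Subject_3"],
    "cluster": [1, 1, 2, 1, 3, 2],
})

# Count TCRs per (subject, cluster) pair: rows are subjects, columns are cluster ids.
cluster_matrix = pd.crosstab(tcrs["subject"], tcrs["cluster"])
print(cluster_matrix)
```

In this toy input, Subject_1 contributes two TCRs to cluster 1, and subject/cluster combinations that never occur are filled with 0.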
Implementation
To generate this matrix, we use the batch clustering methods with a couple of added tweaks.
After initializing the Clustering object as described on the batch clustering page, we perform the batch preclustering. This time, however, we add a name for every dataset we precluster. This is the name that will be used in the cluster matrix.
In our case, we simply used the filenames:
for filename, data in loaded_datasets:
    clustering.batch_precluster(data, name=filename)
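The loop above expects loaded_datasets to yield (name, data) pairs. How the data is loaded is not shown here, so the following is only one possible sketch. It uses a hypothetical helper, iter_named_datasets, and assumes each repertoire is a tab-separated file with a CDR3 column named cdr3. The helper name, the file layout, the column name, and the choice of passing a pandas Series are all assumptions to check against the input format the library expects.

```python
from pathlib import Path

import pandas as pd


def iter_named_datasets(paths, cdr3_column="cdr3"):
    """Yield (name, data) pairs, one per repertoire file.

    Hypothetical helper: the tab-separated layout and the column name
    are assumptions, not part of the library.
    """
    for path in map(Path, paths):
        table = pd.read_csv(path, sep="\t")
        # The file name doubles as the label passed as name=... when preclustering.
        yield path.name, table[cdr3_column]
```

With such a helper, loaded_datasets could be created as iter_named_datasets(list_of_paths), where list_of_paths is your own list of repertoire files.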
Afterwards, we do the batch clustering and simply specify that we want the cluster matrix to be calculated. Note that this process takes around 1 extra second per precluster.
for cluster in clustering.batch_cluster(calc_cluster_matrix=True):
    print(cluster.clusters_df)
Lastly, the cluster matrix can be accessed:
clustering.batch_cluster_matrix()
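Raw counts depend on repertoire size, so one possible next step is to convert each row to fractions before comparing subjects. The sketch below assumes the matrix can be handled as a pandas DataFrame with subjects as rows (the return type of batch_cluster_matrix() is not stated here). Rather than calling the library, it rebuilds the first four columns of the example table by hand, so the fractions only cover those four clusters.

```python
import pandas as pd

# First four columns of the example table, entered by hand.
# Assumption: the real matrix can be handled as a DataFrame like this one.
matrix = pd.DataFrame(
    [[29, 38, 16, 8], [16, 8, 17, 4], [2, 3, 0, 1], [0, 2, 0, 0]],
    index=["Subject_1", "Subject_2", "Subject_3", "Subject_4"],
    columns=[1, 2, 3, 4],
)

# Divide each row by its total so that every row sums to 1
# (fraction of a subject's counted TCRs that fall in each cluster).
fractions = matrix.div(matrix.sum(axis=1), axis=0)

# Number of subjects that contribute at least one TCR to each cluster.
subjects_per_cluster = (matrix > 0).sum(axis=0)

print(fractions.round(2))
print(subjects_per_cluster)
```

Clusters shared by many subjects stand out in subjects_per_cluster, while fractions makes subjects with very different repertoire sizes easier to compare side by side.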