Usage

Clustering

Our interface follows scikit-learn conventions, so it may feel familiar.
To perform clustering, import the Clustering class from clustcr:

from clustcr import Clustering
clustering = Clustering()

All parameters should be passed to the Clustering object at creation. The following are available:

parameter	explanation	default
chain	A or B. Specifies the alpha (A) or beta (B) chain. This choice does not influence the clustering process, but the information will be used during the downstream cluster analysis.	B
method	mcl, faiss or two-step. We recommend using mcl for data sets containing < 50,000 CDR3 sequences, and two-step for all data sets with > 50,000 sequences. For more information check out the methods page.	two-step
n_cpus	Number of CPUs used in the MCL clustering. This drastically increases the speed of clusTCR. When set to ‘all’, all of your CPUs will be used.	1
use_gpu	Usage of GPU in the Faiss Clustering training step (needs clustcr-gpu to be installed)	False
faiss_cluster_size	The size of the clusters that faiss will generate, either using the faiss or the two-step method.	5000
mcl_params	MCL hyperparameters, which should be a list of [inflation, expansion]	[1.2, 2]
faiss_training_data, fitting_data_size, max_sequence_size	Only used for clustering in batches, see clustering large data	/
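The size-based recommendation for the method parameter can be written as a small helper. This is only a sketch: recommended_method is a hypothetical function, not part of clustcr, and the table does not say what to do at exactly 50,000 sequences, so sending that boundary case to two-step is an assumption.

```python
def recommended_method(n_sequences, threshold=50_000):
    """Return the clustering method suggested in the parameter table.

    Hypothetical helper, not part of clustcr. A data set of exactly
    `threshold` sequences goes to 'two-step', which is an assumption.
    """
    if n_sequences < 0:
        raise ValueError("n_sequences must be non-negative")
    return "mcl" if n_sequences < threshold else "two-step"


print(recommended_method(10_000))   # small repertoire
print(recommended_method(200_000))  # large repertoire
```

The returned string can then be passed as the method parameter when the Clustering object is created.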

To perform the clustering, our interface provides a fit method which expects a pandas.Series. This method can be called multiple times, each time returning the clustering result.
For this tutorial, we use the test_cdr3 dataset.

from clustcr import datasets
cdr3 = datasets.test_cdr3()
output = clustering.fit(cdr3)

Including CDR3 alpha chain

If your data includes the alpha chain, you can optionally use it in the clustering (beware that this will drastically change your results). To include it, pass a Series of your alpha chain sequences to the fit method. As an example, we provide a VDJdb dataset that includes this data.

data = datasets.vdjdb_paired()
cdr3, alpha = data['CDR3_beta'], data['CDR3_alpha']
output = clustering.fit(cdr3, alpha=alpha)

[BETA] Including V gene information

In addition to CDR3-based clustering, ClusTCR can include V gene information in the clustering process. When V gene clustering is enabled, TCR sequences are first sorted by V gene family, and clustering is applied within each group of sequences that belong to the same V gene family. This increases clustering accuracy at the cost of clustering retention (i.e. fewer sequences will end up in a cluster).

You can include V gene information in the clustering process by setting include_vgene = True in the .fit() method. In addition, you must specify the names of the columns containing the CDR3 and V gene information. Below, you can find an example of how this works in practice:

import pandas as pd
from clustcr import Clustering, datasets

# Import a file that contains at least a CDR3 column and a V gene column
data = pd.read_csv("mytcrfile.csv")
# Initiate a Clustering object
clustering = Clustering()
# Include V gene information by setting the include_vgene parameter to True
output = clustering.fit(
    data,
    include_vgene = True, # Enable V gene clustering
    cdr3_col = "cdr3", # Specify CDR3 column name
    v_gene_col = "vgene" # Specify V gene column name
)
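The idea of grouping by V gene first and clustering within each group can be illustrated with a small standalone sketch. It does not reproduce the clusTCR implementation: cluster_within_v_gene and hamming_one_clusters are hypothetical names, the Hamming-distance grouping is a toy stand-in for the real clustering step, and the V gene labels are made up for illustration.

```python
from collections import defaultdict


def hamming_one_clusters(sequences):
    """Toy stand-in for a clustering step (not the clusTCR algorithm).

    Links equal-length sequences that differ at no more than one position
    and returns the connected groups, dropping singletons.
    """
    unique = sorted(set(sequences))
    parent = {seq: seq for seq in unique}

    def find(seq):
        # Follow parent links to the group representative.
        while parent[seq] != seq:
            parent[seq] = parent[parent[seq]]
            seq = parent[seq]
        return seq

    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            if len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1:
                parent[find(a)] = find(b)

    groups = defaultdict(list)
    for seq in unique:
        groups[find(seq)].append(seq)
    return [members for members in groups.values() if len(members) > 1]


def cluster_within_v_gene(records, cluster_fn):
    """Group (cdr3, v_gene) records by V gene, then cluster each group."""
    by_v_gene = defaultdict(list)
    for cdr3, v_gene in records:
        by_v_gene[v_gene].append(cdr3)
    clusters = []
    for v_gene in sorted(by_v_gene):
        for members in cluster_fn(by_v_gene[v_gene]):
            clusters.append((v_gene, members))
    return clusters


# V gene labels below are illustrative assumptions.
records = [
    ("CASSPSGTPYEQYF", "TRBV7"),
    ("CASSPSGTPYERYF", "TRBV7"),
    ("CASSELASGTDTQYF", "TRBV6"),
    ("CASSALASGTDTQYF", "TRBV9"),
]
with_v_gene = cluster_within_v_gene(records, hamming_one_clusters)
without_v_gene = hamming_one_clusters([cdr3 for cdr3, _ in records])
print(len(with_v_gene), len(without_v_gene))
```

In this toy data the last two sequences differ at a single position but carry different V gene labels, so they form a cluster only when V gene information is ignored. That is the retention cost described above.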

ClusteringResult

Dataframe

A dataframe containing the clusters can be accessed through the clusters_df attribute:

output.clusters_df

	CDR3	cluster
0	CASSPSGTPYEQYF	0
1	CASSPSGTPYERYF	0
2	CASNELASGTDTQYF	1
3	CASSELASGTDTQYF	1
4	CASSALASGTDTQYF	1
..	…	…
637	CASSPRTSGTYEQYF	199
638	CASSFTLGTGGVEQYF	200
639	CASSITLGTGGVEQYF	200
640	CASSLIGVSSYNEQFF	201
641	CASSLRGVSSYNEQFF	201

CSV

To quickly store the clusters in a file, use the write_to_csv method. The path is optional; by default, clusTCR saves the file in the current directory.

output.write_to_csv()

Cluster Contents

To get the CDR3 sequences in each cluster as a list of lists, use the following method:

output.cluster_contents()
>> [
    ['CASSPSGTPYEQYF', 'CASSPSGTPYERYF'], 
    ['CASSFTLGTGGVEQYF', 'CASSITLGTGGVEQYF'], ...
]
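To make the relationship between the dataframe view and the list-of-lists view concrete, the following sketch groups (CDR3, cluster) pairs by cluster index in plain Python. group_cluster_contents is a hypothetical helper, not the clustcr implementation, and ordering the result by cluster index is a choice made here rather than documented behaviour.

```python
from collections import defaultdict


def group_cluster_contents(rows):
    """Group (cdr3, cluster_id) pairs into one list of CDR3s per cluster.

    Hypothetical helper for illustration; clusters are returned in
    ascending order of cluster_id.
    """
    members = defaultdict(list)
    for cdr3, cluster_id in rows:
        members[cluster_id].append(cdr3)
    return [members[cluster_id] for cluster_id in sorted(members)]


# First five rows of the example dataframe shown earlier.
rows = [
    ("CASSPSGTPYEQYF", 0),
    ("CASSPSGTPYERYF", 0),
    ("CASNELASGTDTQYF", 1),
    ("CASSELASGTDTQYF", 1),
    ("CASSALASGTDTQYF", 1),
]
contents = group_cluster_contents(rows)
print(contents)
```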

Summary

You can explore the clustering results by calling .summary() on the ClusteringResult object. This returns a pandas.DataFrame that contains the cluster index, the number of sequences in the cluster, and a consensus motif for that cluster. The motif is built position by position from the amino acid frequencies. If the frequency of the most dominant amino acid exceeds the predefined cut-off (default = 0.7), that position is represented by the dominant amino acid in upper case. Otherwise, if the summed frequency of the two most dominant amino acids exceeds the cut-off, both are considered. If one of the two is at least twice as frequent as the other, the more dominant one represents the position; it is shown in lower case to emphasize that its frequency on its own does not exceed the cut-off. If neither dominates, shared dominance is indicated by square brackets (‘[ ]’). Positions where neither criterion is met are indicated with a wildcard symbol (‘.’).

output.summary()

	cluster_idx	size	motif
0	25	24	CASSgg.YGYTF
1	0	23	CASS.RSTDTQYF
2	15	15	CASSEA[AS]GGFYNEQFF
3	41	15	CASSL[LM]GPGQPQHF
4	5	14	CSAR.GLNNEQFF
..	…	…	…
235	135	2	CASS[LP]GWGLDQPQHF
236	137	2	CASSLLGQ[DY]NSPLHF
237	26	2	CASSLEG[DY]TEAFF
238	139	2	CASSSTGGGG[AT]EAFF
239	239	2	CASS[EL]GRETQYF
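The motif rules described above can be sketched in a few lines of standalone Python. This is an illustration of the described rules, not the clusTCR source: consensus_motif is a hypothetical name, all sequences are assumed to have the same length, "exceeds the cut-off" is read as "greater than or equal to", and "2x larger" is read as "at least twice as frequent". Both readings are assumptions.

```python
from collections import Counter


def consensus_motif(sequences, cutoff=0.7):
    """Derive a consensus motif from equal-length sequences.

    Sketch of the rules described in the text; the threshold comparisons
    (>= cutoff, >= 2x) are assumptions about the boundary cases.
    """
    if not sequences:
        raise ValueError("at least one sequence is required")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")

    total = len(sequences)
    motif = []
    for position in range(length):
        counts = Counter(seq[position] for seq in sequences)
        ranked = counts.most_common(2)
        first_aa, first_n = ranked[0]
        # A position with a single amino acid has no runner-up.
        second_aa, second_n = ranked[1] if len(ranked) > 1 else ("", 0)

        if first_n / total >= cutoff:
            # One amino acid dominates on its own: upper case.
            motif.append(first_aa.upper())
        elif (first_n + second_n) / total >= cutoff:
            if first_n >= 2 * second_n:
                # Dominant, but below the cut-off by itself: lower case.
                motif.append(first_aa.lower())
            else:
                # Shared dominance: both residues in square brackets.
                motif.append("[" + "".join(sorted((first_aa, second_aa))) + "]")
        else:
            # Neither criterion is met: wildcard.
            motif.append(".")
    return "".join(motif)


# Toy cluster of ten 4-residue sequences, built column by column.
columns = ["CCCCCCCCCC", "AAAAAAGGTT", "AAAAASSSST", "AAAGGGTTSS"]
toy_cluster = ["".join(residues) for residues in zip(*columns)]
print(consensus_motif(toy_cluster))  # prints Ca[AS].
```

The four positions of the toy cluster exercise the four cases in order: a fully conserved residue, a dominant residue below the cut-off, shared dominance, and a wildcard.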