ClusTCR demo

In this section we will go over some typical scenarios that you might encounter when you will use ClusTCR.

> Scenario 1: clustering a single repertoire file

Say you want to cluster an individual RepSeq file, you can simply load it in and perform clustering. Make sure you are using one of the supported input formats. For more info about input formats supported by clusTCR, see the importing section. Let’s start by importing the appropriate modules:

from clustcr import read_cdr3, Clustering

Next, load in your data. In this demo we will use the example of an immuneACCESS file. First, adjust the file parameter with the path to the file you want to cluster.

data = read_cdr3(file='/path_to_data/HIP00110.tsv',
                data_format='immuneaccess')

This should return a pandas.Series of CDR3 sequences. Example:

print(data)
>> 0        CATTGTSGGPSQSTQYF
1           CASSLRVGGYGYTF
2          CASRRLGGLNTEAFF
3           CASSLRGSGNTIYF
4            CASRQDGSTEAFF
               ...        
77475            RVGSCEQYF
77476    CASSRYDLPGPRDTQYF
77477    CASSLVFYGQGQETQYF
77478           CATMGHGYTF
77479          CATSFRGEAFF
Length: 77480, dtype: object

Next, we will perform the actual clustering. First, we must define the clustering parameters (for more info, see the Clustering section), creating a Clustering object. Then we fit our data onto the newly created Clustering object.

clustering = Clustering(n_cpus=8) # Clustering parameters
result = clustering.fit(data) # This will generate the ClusteringResult object

That’s it! You have successfully clustered your RepSeq file with ClusTCR. To retrieve the clusters, you can call the clusters_df method on the ClusteringResult object…

result.clusters_df
  CDR3 cluster
0 CASSYSRASSGNTIYF 0
1 CASSYSRQSSGNTIYF 0
2 CASSYSGQGSGNTIYF 0
3 CASSYSRQGSGNTIYF 0
4 CASSYSRPSSGNTIYF 0
25065 CASSYGSSSTDTQYF 4516
25066 CASSPQWATGNTIYF 4517
25067 CSSSPQWATGNTIYF 4517
25068 CATSRDQGGYNEQFF 4518
25069 CATSRDRGGYNEQFF 4518

…or directly save the clusters to a file using the write_to_csv() method. This will require you to specify a path to where the results should be saved, else they will be stored in the current working directory with the generic clusTCR_clusters.csv file name.

result.write_to_csv(path='/results_folder/myclusters.csv')

Alternatively, you can just get a flat notation of the clustering result in the form of a list, in which each element is a list of the sequences within a cluster.

result.cluster_contents()
>> [
	['CASSRWTGTNTGELFF', 'CASSSWTGTNTGELFF'],
	['CASSLPGQGMNTEAFF', 'CASSLVGQGMNTEAFF', 'CASSLVGLGMNTEAFF'], ...
]

If you want a more concise overview of the clustering results, you can call the summary() method on the ClusteringResult object.

result.summary()
  cluster_idx size motif
0 2238 245 CASS[RS]WTGTNTGELFF
1 2249 241 CASSLVGQGMNTEAFF
2 1252 237 CASSLALQ[RG]YGNTIYF
3 3329 231 C[AS]SSGARLGYREKLFF
4 1246 226 CASSYS[RK]GGAGIWAFF
4514 1593 2 CASS[FT]TTGGGNEQFF
4515 3640 2 CASSPPRG[RQ]GETQYF
4516 1577 2 CASSSYDRK[AV]YEQYF
4517 3624 2 CASSFSGT[GL]GNTIYF
4518 0 2 CASSF[GS]GGAGDEQFF

> Scenario 2: clustering a set of repertoires simultaneously

Suppose that you have a data set containing various TCR repertoire samples, which you want to cluster simultaneously. Most likely, the size of this data will be larger than your maximum RAM capacity. To solve this problem, we can use clusTCR’s batch clustering functionality.

Start by importing the different modules that we’ll need:

from clustcr import read_cdr3, metarepertoire, Clustering
import os

First, make sure your files are stored in a separate directory. Batch clustering requires three specific parameters:

  • faiss_training_data: A training sample, from which it will compute the cluster centroids.
  • fitting_data_size: The total number of sequences in the data set.
  • max_sequence_size: The length of the largest CDR3 sequence in the data set.

We will start by determining the size of the data set. Then, we can calculate the recommended training sample size. For the sake of this example, we will illustrate this procedure for an immuneACCESS data set.

# First, we define the path to the data directory
datadir = '/path_to_data/'

# Now we count the number of sequences that are present in the data set
# We use os.listdir() to list all the files in the specified directory
total_cdr3s = 0
for file in os.listdir(datadir):
    total_cdr3s += len(read_cdr3(datadir + file, 
                                 data_format='immuneaccess'))

From this number, we can calculated the recommended sample size (see clustering large data section). We can then take a sample from the data set using the metarepertoire() function.

training_sample_size = round(1000 * (total_cdr3s / 5000))
training_sample = metarepertoire(directory=datadir,
                                 data_format='immuneaccess',
                                 n_sequences=training_sample_size)

We assume that the largest sequence in the sample also corresponds to the maximum sequence size in the complete data set. If we allow this assumption, the maximum sequence size can be easily determined:

max_seq_len = training_sample.str.len().max()

Next, we make a Clustering object. However, this time we need to specify the batch clustering-specific parameters.

clustering = Clustering(faiss_training_data=training_sample,
                        fitting_data_size=total_cdr3s,
                        max_sequence_size=max_seq_len,
                        n_cpus=8) # Multiprocessing using 8 CPUs

Finally, we cluster the sequences, starting first with the batch pre-clustering…

for file in files:
    # Load your data
    data = read_cdr3(file=os.path.join(datadir, file),
                    data_format='immuneaccess')
    clustering.batch_precluster(data)

…followed by the actual clustering step:

for cluster in clustering.batch_cluster():
	# Do something with the clusters
	# For example: - print(cluster.clusters_df)
	#			   - cluster.write_to_csv()
	#			   - ...

Once finished, you can clean up the intermediate results using the following command:

clustering.batch_cleanup()