Optimization of the first clustering step
ClusTCR was designed with the aim of clustering large AIRR data sets exceeding millions of unique sequences. To accommodate algorithmic runtime and memory limitations, we have introduced a batch clustering procedure. Nonetheless, better performance can be achieved by tuning the basic hyperparameters. This provides the user with even more flexibility, allowing them to decide on the trade-off between clustering speed and accuracy.
Adjusting supercluster size for speed improvement
The supercluster size (described by the `faiss_cluster_size` parameter) is an important determinant of the speed of ClusTCR. The MCL clustering step imposes a bottleneck on the speed of the algorithm. Therefore, by changing the `faiss_cluster_size` parameter, one can increase or decrease the total number of MCL processes. MCL is extremely efficient at clustering small groups of sequences, but becomes much slower as the size of the input data increases. In some cases, it may therefore be beneficial to tune down the size of the superclusters. The `faiss_cluster_size` hyperparameter can be set as a parameter of the `Clustering` object.
Let’s illustrate this with an example:
```python
from clustcr import metarepertoire, Clustering

# Generate a sample of 1 million sequences
data = metarepertoire(directory='/emerson-2017/', data_format='immuneaccess')
```
We will now run ClusTCR with different preset values for the supercluster size.
```python
clustering_1 = Clustering(faiss_cluster_size=5000, n_cpus=16)
clustering_2 = Clustering(faiss_cluster_size=3000, n_cpus=16)
```
And observe the differences in algorithmic runtime.
```python
clustering_1.fit(data)
# 199.008 seconds (~3 min 19 sec)
clustering_2.fit(data)
# 162.839 seconds (~2 min 43 sec)
```
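The runtime difference tracks the number of MCL jobs that the superclusters create: smaller superclusters mean more, but individually faster, MCL runs. As a rough illustration (the `approx_mcl_jobs` helper below is hypothetical, not part of the ClusTCR API):

```python
import math

def approx_mcl_jobs(n_sequences, faiss_cluster_size):
    # Each supercluster is clustered by one MCL run, so the number of
    # MCL jobs is roughly the number of sequences divided by the
    # supercluster size, rounded up.
    return math.ceil(n_sequences / faiss_cluster_size)

# For the 1 million sequence example above:
print(approx_mcl_jobs(1_000_000, 5000))  # 200 MCL runs on larger superclusters
print(approx_mcl_jobs(1_000_000, 3000))  # 334 MCL runs on smaller superclusters
```

Because MCL scales poorly with input size, the configuration with more but smaller superclusters finished faster in the example above.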
Adjusting supercluster size for batch clustering
As discussed in the section on batch clustering, the appropriate size of the training sample can be calculated as `1000 * (fitting_data_size / faiss_cluster_size)`. When working with extremely large data sets, even the training sample may be too large to fit into memory. This can be remedied by increasing the `faiss_cluster_size` parameter. Conversely, you can decrease `faiss_cluster_size` to improve the speed of the batch clustering procedure. Suppose our data set contains 10,000,000 unique CDR3 sequences; with the default `faiss_cluster_size` of 5,000, the optimal sample size is `1000 * (10,000,000 / 5,000) = 2,000,000` sequences. By increasing `faiss_cluster_size` we increase the total clustering time, but we won't need such a large training sample.
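The interaction between data size, supercluster size, and training sample size can be sketched directly from the formula above (a plain-Python illustration; `training_sample_size` is a hypothetical helper, not part of ClusTCR, and it assumes the default `faiss_cluster_size` of 5,000):

```python
def training_sample_size(fitting_data_size, faiss_cluster_size=5000):
    # Training sample size for batch clustering:
    #   1000 * (fitting_data_size / faiss_cluster_size)
    return int(1000 * fitting_data_size / faiss_cluster_size)

# 10,000,000 unique CDR3 sequences with the default supercluster size of 5,000
print(training_sample_size(10_000_000))          # 2,000,000 sequences
# Doubling faiss_cluster_size halves the required training sample
print(training_sample_size(10_000_000, 10_000))  # 1,000,000 sequences
```

This makes the memory trade-off concrete: a larger `faiss_cluster_size` shrinks the training sample that must fit into memory, at the cost of slower MCL runs on the larger superclusters.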