## Optimization of the first clustering step

ClusTCR was designed to cluster large AIRR data sets containing millions of unique sequences. To work around runtime and memory limitations, we introduced a batch clustering procedure. Performance can be improved further by tuning the basic hyperparameters, which lets the user choose the trade-off between clustering speed and accuracy.

### Adjusting supercluster size for speed improvement

The supercluster size (set by the `faiss_cluster_size` parameter) is an important determinant of ClusTCR's speed. The MCL clustering step is the bottleneck of the algorithm, and changing `faiss_cluster_size` increases or decreases the total number of MCL processes. MCL is extremely efficient at clustering small groups of sequences, but becomes much slower as the size of the input data increases. In some cases, it may therefore be beneficial to reduce the size of the superclusters. The `faiss_cluster_size` hyperparameter can be set as an argument when creating the `Clustering` object.

Let’s illustrate this with an example:

```
from clustcr import metarepertoire, Clustering
# Generate a sample of 1 million sequences
data = metarepertoire(directory='/emerson-2017/', data_format='immuneaccess')
```

We will now run ClusTCR with different preset values for the supercluster size.

```
clustering_1 = Clustering(faiss_cluster_size=5000, n_cpus=16)
clustering_2 = Clustering(faiss_cluster_size=3000, n_cpus=16)
```

We can then compare the algorithmic runtime of both configurations:

```
clustering_1.fit(data)
# > 199.008 seconds (~3min19sec)
clustering_2.fit(data)
# > 162.839 seconds (~2min43sec)
```
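To reproduce this kind of comparison on your own data, the runtime can be measured with a small wrapper. The sketch below is a minimal illustration using only the standard library. The helper name `timed` and the stand-in workload `dummy_fit` are hypothetical and not part of ClusTCR; they are there so the example runs without ClusTCR installed.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def dummy_fit(n):
    """Stand-in workload (hypothetical); replace with a real fit call."""
    return sum(range(n))


result, seconds = timed(dummy_fit, 100_000)
print(f"{seconds:.3f} seconds")
```

With the objects defined above, the same helper would be applied as `timed(clustering_1.fit, data)` and `timed(clustering_2.fit, data)`. Timings depend on hardware and on `n_cpus`, so the absolute numbers will differ from those shown here.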

### Adjusting supercluster size for batch clustering

As discussed in the section on batch clustering, the appropriate size for the training sample can be calculated as `1000 * (fitting_data_size / faiss_cluster_size)`. When working with extremely large data sets, even the training sample may be too large to fit into memory. This can be addressed by increasing the `faiss_cluster_size` parameter. Conversely, you can decrease `faiss_cluster_size` to improve the speed of the batch clustering procedure. Suppose our data set contains 10,000,000 unique CDR3 sequences. With the default `faiss_cluster_size` of 5000, the optimal sample size is `1000 * (10,000,000 / 5000) = 2,000,000`. By increasing `faiss_cluster_size` we increase the total clustering time, but we won’t need such a large training sample:
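The sample-size arithmetic can be sketched in plain Python. The function name `training_sample_size` is hypothetical and not part of ClusTCR; it only evaluates the formula given above. The value 5000 is used as the baseline (assumed here to match the default), and 10000 is an arbitrary larger value chosen for illustration.

```python
def training_sample_size(fitting_data_size, faiss_cluster_size):
    """Evaluate 1000 * (fitting_data_size / faiss_cluster_size) as an integer."""
    return 1000 * fitting_data_size // faiss_cluster_size


n_sequences = 10_000_000

# Baseline supercluster size (assumed to be the default).
baseline = training_sample_size(n_sequences, 5000)

# Larger superclusters: a smaller training sample is sufficient.
larger = training_sample_size(n_sequences, 10000)

print(baseline)  # 2000000
print(larger)    # 1000000
```

Doubling the supercluster size halves the required training sample, at the cost of larger inputs for each MCL run.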