Evaluating Clustering Quality

When epitope annotations are available for the CDR3 sequences, clusTCR can evaluate the quality of the generated clustering. We start by clustering the sequences as usual.

from clustcr import Clustering, datasets
cdr3 = datasets.test_cdr3()
result = Clustering().fit(cdr3)

Epitopes

To plug in the epitope data, clusTCR expects a pandas.DataFrame containing CDR3 and Epitope columns. As an example, the epitope data for the test_cdr3 dataset is available.

epitopes = datasets.test_epitopes()

Metrics

To access the metrics, simply call the metrics method on the clustering result and supply the epitope data.

metrics = result.metrics(epitopes)

Note: The results are compared to a baseline clustering, which is generated by randomly permuting the assigned clusters, so as to mimic an algorithm that performs random clustering.
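The idea behind the permutation baseline can be illustrated with a minimal sketch. This is not clusTCR's own implementation: the function name permuted_baseline, the seed argument, and the toy labels are assumptions made purely for illustration.

```python
import random
from collections import Counter

# Hypothetical toy assignment: one cluster label per clustered sequence.
clusters = [0, 0, 0, 1, 1, 2, 2, 2]


def permuted_baseline(labels, seed=0):
    """Return a shuffled copy of the cluster labels (illustrative helper)."""
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


baseline = permuted_baseline(clusters)

# Shuffling keeps the cluster sizes but randomizes which sequence
# ends up in which cluster.
print(Counter(baseline) == Counter(clusters))  # True
```

Because only the assignment is shuffled, the baseline has the same number and sizes of clusters as the real result, so any difference in the metrics reflects how sequences were grouped.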

Retention

Fraction of sequences that are assigned to any cluster.

metrics.retention()
>> 0.225184
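As a rough illustration of the definition, retention can be computed by hand as below. The helper name retention and the toy sequences are hypothetical, and clusTCR may restrict the calculation to epitope-annotated sequences, so treat this as a sketch rather than the library's method.

```python
def retention(all_sequences, clustered_sequences):
    """Fraction of the input sequences that appear in any cluster."""
    total = set(all_sequences)
    clustered = set(clustered_sequences) & total
    return len(clustered) / len(total)


# Hypothetical toy input: four sequences, two of which were clustered.
all_seqs = ["CASSA", "CASSB", "CASSC", "CASSD"]
clustered = ["CASSA", "CASSB"]

print(retention(all_seqs, clustered))  # 0.5
```

Permuting cluster labels does not change which sequences are clustered, which is consistent with the baseline retention in the summary being identical to the actual value.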

Purity

Fraction of sequences within a cluster targeting the same epitope.
Returns a tuple of (result, baseline_result).

metrics.purity()
>> (0.872675, 0.346209)
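One common way to read this definition is to count, per cluster, the sequences that share the cluster's most frequent epitope and divide by the number of clustered sequences. The sketch below follows that reading; the function name purity and the toy (cluster, epitope) pairs are assumptions, and clusTCR's implementation may differ in detail.

```python
from collections import Counter, defaultdict


def purity(assignments):
    """assignments: iterable of (cluster_id, epitope) pairs."""
    by_cluster = defaultdict(Counter)
    for cluster_id, epitope in assignments:
        by_cluster[cluster_id][epitope] += 1
    total = sum(sum(counts.values()) for counts in by_cluster.values())
    # Size of the largest epitope group in each cluster.
    majority = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return majority / total


# Hypothetical toy data: cluster 0 is mixed, cluster 1 is pure.
toy = [(0, "A"), (0, "A"), (0, "B"), (1, "B"), (1, "B")]

print(purity(toy))  # 0.8
```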

Purity90

Fraction of clusters that have a purity greater than 90%.
Returns a tuple of (result, baseline_result).

metrics.purity_90()
>> (0.727723, 0.019802)
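Under the same per-cluster reading of purity (share of the most frequent epitope), this metric can be sketched as follows. The function name purity_90, the threshold parameter, and the toy data are illustrative assumptions, not clusTCR's API.

```python
from collections import Counter, defaultdict


def purity_90(assignments, threshold=0.9):
    """Fraction of clusters whose purity is strictly above the threshold."""
    by_cluster = defaultdict(Counter)
    for cluster_id, epitope in assignments:
        by_cluster[cluster_id][epitope] += 1
    pure_clusters = sum(
        1
        for counts in by_cluster.values()
        if counts.most_common(1)[0][1] / sum(counts.values()) > threshold
    )
    return pure_clusters / len(by_cluster)


# Hypothetical toy data: cluster 0 has purity 2/3, cluster 1 has purity 1.0.
toy = [(0, "A"), (0, "A"), (0, "B"), (1, "B"), (1, "B")]

print(purity_90(toy))  # 0.5
```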

Consistency

Fraction of sequences targeting the same epitope that are assigned to the same cluster.
Returns a tuple of (result, baseline_result).

metrics.consistency()
>> (0.360515, 0.123033)
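A simplified reading of this definition is sketched below: for each epitope, take the cluster that holds most of its sequences, count those sequences, and divide by the total. This is an assumption for illustration only; the function name consistency and the toy data are hypothetical, and clusTCR may match epitopes to clusters differently (for example, by allowing each cluster to represent only one epitope).

```python
from collections import Counter, defaultdict


def consistency(assignments):
    """assignments: iterable of (cluster_id, epitope) pairs."""
    by_epitope = defaultdict(Counter)
    for cluster_id, epitope in assignments:
        by_epitope[epitope][cluster_id] += 1
    total = sum(sum(counts.values()) for counts in by_epitope.values())
    # Sequences of each epitope that sit in that epitope's largest cluster.
    together = sum(counts.most_common(1)[0][1] for counts in by_epitope.values())
    return together / total


# Hypothetical toy data: epitope A is split 2/1, epitope B is split 1/1.
toy = [(0, "A"), (0, "A"), (1, "A"), (1, "B"), (2, "B")]

print(consistency(toy))  # 0.6
```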

Summary

All available clustering metrics, returned as a pandas.DataFrame.

metrics.summary()

The pandas.DataFrame will look like this:

actual    baseline  metric
0.225184  0.225184  retention
0.872675  0.346209  purity
0.727723  0.019802  purity_90
0.360515  0.123033  consistency
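One way to read the summary is to express each metric as a multiple of its baseline. The sketch below copies the values from the table above into a plain dictionary; the variable names are illustrative, and this ratio is not something clusTCR reports itself.

```python
# (actual, baseline) values copied from the summary table above.
summary = {
    "retention": (0.225184, 0.225184),
    "purity": (0.872675, 0.346209),
    "purity_90": (0.727723, 0.019802),
    "consistency": (0.360515, 0.123033),
}

# Ratio of the actual score to the random-permutation baseline.
fold_over_baseline = {
    metric: actual / baseline for metric, (actual, baseline) in summary.items()
}

for metric, ratio in fold_over_baseline.items():
    print(f"{metric}: {ratio:.2f}x baseline")
```

A ratio near 1 means the clustering does no better than a random assignment on that metric; retention is always 1 here because permutation leaves the set of clustered sequences unchanged.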