Evaluating Clustering Quality
When the epitopes targeted by the CDR3 sequences are known, clusTCR can evaluate the quality of the generated clustering. We start by creating a clustering as usual.
from clustcr import Clustering, datasets
cdr3 = datasets.test_cdr3()
result = Clustering().fit(cdr3)
Epitopes
To plug in the epitope data, clusTCR expects a pandas.DataFrame containing CDR3 and Epitope columns. As an example, the epitope data for the test_cdr3 dataset is available.
epitopes = datasets.test_epitopes()
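For your own data, a table of the shape described above can be assembled directly in pandas. The following is a minimal sketch: the sequences and epitope labels are made-up placeholders, and the column check is an illustrative precaution, not something clusTCR is known to require.

```python
import pandas as pd

# Hypothetical CDR3 sequences and epitope labels, for illustration only.
epitopes = pd.DataFrame({
    "CDR3": ["CASSLGQAYEQYF", "CASSLGQGYEQYF", "CASSPGTGGYEQYF"],
    "Epitope": ["EPITOPE_A", "EPITOPE_A", "EPITOPE_B"],
})

# Verify that both expected columns are present before passing the table on.
required = {"CDR3", "Epitope"}
missing = required - set(epitopes.columns)
if missing:
    raise ValueError(f"missing columns: {sorted(missing)}")

print(epitopes)
```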
Metrics
To access the metrics, simply call the metrics method on the clustering result and supply the epitope data.
metrics = result.metrics(epitopes)
Note
The results are compared to a baseline clustering, which is generated by randomly permuting the cluster assignments to mimic an algorithm that clusters at random.
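The idea of a permuted baseline can be sketched with NumPy and pandas. This is an assumption about the general technique, not clusTCR's internal code; the table, its `cluster` column, and the seed are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical clustering result: one row per clustered sequence.
clusters = pd.DataFrame({
    "CDR3": ["AAA", "AAB", "CCC", "CCD", "EEE"],
    "cluster": [0, 0, 1, 1, 2],
})

# Shuffle the cluster labels across sequences. Every label is kept,
# so cluster sizes stay the same; only the membership becomes random.
rng = np.random.default_rng(seed=0)
baseline = clusters.copy()
baseline["cluster"] = rng.permutation(clusters["cluster"].to_numpy())

print(baseline)
```

Because a permutation keeps every label, the number of clustered sequences does not change, which is consistent with retention being identical in the actual and baseline columns of the summary table.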
Retention
Fraction of sequences that are assigned to any cluster.
metrics.retention()
>> 0.225184
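To make the definition concrete, here is a small hand-computed sketch on toy data. The sequences and the `cluster` column are hypothetical, and the calculation is one straightforward reading of the definition, not necessarily how clusTCR computes it.

```python
import pandas as pd

# Hypothetical input: eight sequences, of which two ended up in a cluster.
all_cdr3 = pd.Series(["AAA", "AAB", "CCC", "CCD", "EEE", "FFF", "GGG", "HHH"])
clustered = pd.DataFrame({"CDR3": ["AAA", "AAB"], "cluster": [0, 0]})

# Retention: clustered sequences divided by all input sequences.
retention = clustered["CDR3"].nunique() / all_cdr3.nunique()
print(retention)  # 0.25
```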
Purity
Fraction of sequences within a cluster targeting the same epitope.
Returns a tuple of (result, baseline_result).
metrics.purity()
>> (0.872675, 0.346209)
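One common way to compute a purity score is to count, in each cluster, the sequences that share the cluster's most frequent epitope, then divide by the number of clustered sequences. The sketch below follows that reading on hypothetical data; clusTCR's own implementation may differ in detail.

```python
import pandas as pd

# Hypothetical clustered sequences with known epitopes.
df = pd.DataFrame({
    "CDR3": ["AAA", "AAB", "AAC", "CCC", "CCD"],
    "cluster": [0, 0, 0, 1, 1],
    "Epitope": ["X", "X", "Y", "Y", "Y"],
})

# Number of sequences per (cluster, epitope) pair.
counts = df.groupby(["cluster", "Epitope"]).size()

# Size of the majority epitope in each cluster.
majority = counts.groupby(level="cluster").max()

# Cluster 0 contributes 2 (epitope X), cluster 1 contributes 2 (epitope Y).
purity = majority.sum() / len(df)
print(purity)  # 0.8
```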
Purity90
Fraction of clusters that have a purity greater than 90%.
Returns a tuple of (result, baseline_result).
metrics.purity_90()
>> (0.727723, 0.019802)
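The same per-cluster view gives a sketch of this metric: compute the purity of each cluster separately and take the share of clusters above the 0.9 threshold. The data are hypothetical and the per-cluster purity (majority epitope count over cluster size) is an assumed definition.

```python
import pandas as pd

# Hypothetical clustered sequences with known epitopes.
df = pd.DataFrame({
    "CDR3": ["AAA", "AAB", "CCC", "CCD", "CCE", "EEE", "EEF"],
    "cluster": [0, 0, 1, 1, 1, 2, 2],
    "Epitope": ["X", "X", "Y", "Y", "X", "Z", "Z"],
})

counts = df.groupby(["cluster", "Epitope"]).size()
by_cluster = counts.groupby(level="cluster")

# Purity of each cluster: majority epitope count over cluster size.
per_cluster_purity = by_cluster.max() / by_cluster.sum()

# Fraction of clusters whose purity exceeds 90%.
# Clusters 0 and 2 are fully pure; cluster 1 has purity 2/3.
purity_90 = (per_cluster_purity > 0.9).mean()
print(purity_90)
```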
Consistency
Fraction of sequences targeting the same epitope that are assigned to the same cluster.
Returns a tuple of (result, baseline_result).
metrics.consistency()
>> (0.360515, 0.123033)
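Consistency looks at the clustering from the epitope's side. A simple reading, sketched below on hypothetical data, is to find for each epitope the cluster that holds most of its sequences and sum those counts over all sequences. clusTCR may match epitopes to clusters differently, so treat this as an illustration of the idea only.

```python
import pandas as pd

# Hypothetical clustered sequences with known epitopes.
df = pd.DataFrame({
    "CDR3": ["AAA", "AAB", "AAC", "CCC", "CCD", "CCE"],
    "cluster": [0, 0, 1, 1, 1, 2],
    "Epitope": ["X", "X", "X", "Y", "Y", "Y"],
})

# Number of sequences per (epitope, cluster) pair.
counts = df.groupby(["Epitope", "cluster"]).size()

# For each epitope, the size of the cluster holding most of its sequences.
largest = counts.groupby(level="Epitope").max()

# Epitope X: 2 of 3 in cluster 0. Epitope Y: 2 of 3 in cluster 1.
consistency = largest.sum() / len(df)
print(consistency)
```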
Summary
All available clustering metrics, returned as a pandas.DataFrame.
metrics.summary()
The pandas.DataFrame will look like this:
actual | baseline | metric |
---|---|---|
0.225184 | 0.225184 | retention |
0.872675 | 0.346209 | purity |
0.727723 | 0.019802 | purity_90 |
0.360515 | 0.123033 | consistency |
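A table of this shape is easy to post-process. As a sketch, the snippet below rebuilds the values shown above by hand (rather than calling clusTCR) and adds a hypothetical `ratio` column expressing how far each metric exceeds its baseline.

```python
import pandas as pd

# Values copied from the table above; built by hand for illustration.
summary = pd.DataFrame({
    "actual": [0.225184, 0.872675, 0.727723, 0.360515],
    "baseline": [0.225184, 0.346209, 0.019802, 0.123033],
    "metric": ["retention", "purity", "purity_90", "consistency"],
}).set_index("metric")

# Hypothetical extra column: actual score relative to the random baseline.
summary["ratio"] = summary["actual"] / summary["baseline"]
print(summary)
```

A ratio of 1 for retention is expected, since the baseline contains the same clustered sequences as the actual result.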