Features
Clusters can be represented by a feature matrix that describes several properties of the amino acid sequence within that cluster, including a number of physicochemical properties, entropy, size, length and generation probability. Table 1 contains a list of all features calculated by clusTCR, and provides a description for all of them.
Table 1: Cluster features calculated by clusTCR, and their description.
feature | description |
---|---|
h | Cluster entropy. This value describes the average information content per amino acid position (ignoring C and F at the first and last position respectively). as determined by the Shannon entropy at each position. A correction is applied that normalizes for cluster size. |
size | Cluster size. Number of sequences in the cluster. |
length | Length of CDR3 sequences in the cluster. Due to the fact that we use Hamming distance, (HD) all sequences within a cluster have exactly equal lengths (HD assumes equal length of sequences). |
basicity_avg | Average basicity of sequences in the cluster. |
basicity_var | Varience in basicity of sequences in the cluster. |
hydrophobicity_avg | Average hydrophobicity of sequences in the cluster. |
hydrophobicity_var | Varience in hydrophobicity of sequences in the cluster. |
helicity_avg | Average helicity of sequences in the cluster. |
helicity_var | Varience in helicity of sequences in the cluster. |
mutationstability_avg | Average mutation stability of sequences in the cluster. |
mutationstability_var | Varience in mutation stability of sequences in the cluster. |
pgen_avg | Average generation probability of CDR3 sequences in the cluster. Generation probability is calculated using the olga module. |
pgen_var | Variance in generation probability within the cluster. |
Note: if you want to compute the generation probability for the TCR alpha chain, user must specify the chain='A'
parameter in the Clustering
object.
These features can be calculated by calling the compute_features()
function on a ClusteringResult
object (see clustering section). The code block below shows a brief example of a workflow for calculating cluster features.
from clustcr import datasets, Clustering
# Load some data
data = datasets.test_cdr3()
# Perform clustering
clustering = Clustering(chain='B') # change to chain='A' for alpha chain
output = clustering.fit(data)
# Compute features of ClusteringResult object
features = output.compute_features(compute_pgen=True)
Calculating generation probabilities (pgen) is time-consuming. A compute_pgen
parameter is provided, so the user can indicate whether they want to calculate pgen values. If this parameter is set to False
, a feature matrix without pgen values is computed. Note that pgen is necessary if you want to use the cluster quality classifier’s functionality (see section on exploring clustering results).