Importing data

Data import functions are provided within the datasets method. A method for reading common rep-seq formats is also provided within clusTCR. Additionally, ClusTCR offers a function that randomly samples sequences from a data set to construct metarepertoires.

from clustcr import datasets, read_cdr3, metarepertoire

Built-in data

Test data

ClusTCR provides a test data set, which contains all high-quality human TCRB entries from the VDJdb. The example data sets can be used for benchmarking and/or exploration.

cdr3 = datasets.test_cdr3() # pd.Series of CDR3 sequences
epitope_data = datasets.test_epitopes() # CDR3 data with their corresponding epitopes

VDJdb

A complete, recent version of the VDJdb is provided within the package. Users can specify whether they want to solely import beta, alpha or paired chains. Corresponding epitope information can also be retrieved.

TCR sequences

You can import alpha, beta or paired chain sequences from VDJdb:

vdjdb_b = datasets.vdjdb_beta() # beta chain
vdjdb_a = datasets.vdjdb_alpha() # alpha chain
vdjdb_ab = datasets.vdjdb_paired() # paired alpha-beta

print(vdjdb_b.head()) # example

This will return a pandas.Series of TCRB sequences:

0    CASSYLPGQGDHYSNQPQHF
1     CASSFEAGQGFFSNQPQHF
2     CASSFEPGQGFYSNQPQHF
3    CASSYEPGQVSHYSNQPQHF
4           CASSFGVEDEQYF
Name: CDR3_beta, dtype: object
Epitope information

Similarly, you can add epitope information by specifying the epitopes=True argument.

vdjdb_b_epitopes = datasets.vdjdb_beta(epitopes=True)
print(vdjdb_b_epitopes.head())

This will return a two-column pandas.DataFrame:

              CDR3_beta   Epitope
0  CASSYLPGQGDHYSNQPQHF  FLKEKGGL
1   CASSFEAGQGFFSNQPQHF  FLKEKGGL
2   CASSFEPGQGFYSNQPQHF  FLKEKGGL
3  CASSYEPGQVSHYSNQPQHF  FLKEKGGL
4  CASSYLPGQGDHYSNQPQHF  FLKEQGGL
Quality cut-off

All VDJdb entries are assigned a score (q) that represents the confidence of the recorded TCR-epitope interaction. The VDJdb data incorporated in ClusTCR can be filtered based on the q score by specifying a value for the q argument. The authors of VDJdb give the following descriptions for the different scores:

score description
0 Low confidence/no information - a critical aspect of sequencing/specificity validation is missing
1 Moderate confidence - no verification / poor TCR sequence confidence
2 High confidence - has some specificity verification, good TCR sequence confidence
3 Very high confidence - has extensive verification or structural data

Example:

vdjdb_b_highqual = datasets.vdjdb_beta(q=2)

Importing a repertoire file

ClusTCR groups TCR sequences based on their CDR3 amino acid sequence, V/J gene information is not used. To parse and extract CDR3 sequences from a repertoire sequencing (rep-seq) file, you can use the .read_cdr3() method. This function takes two argument: the path to the file and the input format. For example, importing an immuneACCESS file might look something like this:

data = read_cdr3('immuneACCESS_file.csv', data_format='immuneaccess')

ClusTCR will return a pandas.Series of all unique CDR3 sequences from that file.

Supported input formats

You can import rep-seq data from different sources using the read_cdr3() method. clusTCR supports the following input formats.

Format Method Info
immuneACCESS read_cdr3(file, data_format='immuneaccess') Version automatically detected. More info about the immuneACCESS format: https://clients.adaptivebiotech.com/immuneaccess.
AIRR standards read_cdr3(file, data_format='airr') More info about the AIRR standards data representation: https://docs.airr-community.org/en/stable/datarep/rearrangements.html.
TCRex read_cdr3(file, data_format='tcrex') More info about the TCRex format: https://tcrex.biodatamining.be/instructions/.