Using peptdeep for MHC class I immunopeptidomics¶
This notebook introduces how to generate spectral libraries for immunopeptidomics analysis from a list of protein sequences. This entails several steps:
unspecific digestion of protein sequences
selection of peptide sequences used for library prediction by peptdeep-hla predicition 2.1 using the pretrained model 2.2 using an improved model by including a transfer learning step
spectral library prediction
matching the peptides back to the proteins (this can be done before or after library prediction or seach)
Note that pydivsufsort package is not installed by peptdeep by default. Install by:
pip install "peptdeep[development,hla]"
Or install within jupyter notebook:
[1]:
%pip install -q pydivsufsort
Note: you may need to restart the kernel to use updated packages.
[2]:
import torch # noqa: 401, to prevent crash in Mac Arm
1. Unspecific digestion in alphabase¶
The unspecific digestion workflow uses the longest common prefix (LCP) algorithm, which is based on suffix array data structure, has been proven to be very efficient for unspecific digestion [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-577]. Here we used pydivsufsort, a Python wrapper of a high-performance C library libdivsufsort [https://github.com/y-256/libdivsufsort], to facilitate LCP-based digestion.
This means, the digestion is performed on a single sequence of strings and retrives both the peptide sequence as well as the start and stop indices of the peptide within the complete sequence. Therefore, unspecific digestion in alphabase involves two steps:
concatenation of protein sequences into a single sequence
unspecific digestion
1.1 Concatenate protein sequences into a single sequence¶
The protein sequences are concatenated into a single sequence. The sequences are seperated by a sentinel character, in this case ‘$’, so that no peptides across proteins are formed. Note that the first and last sentinel characters are crutial as well.
[3]:
from peptdeep.hla.hla_utils import load_prot_df
fasta_path = "data/example.fasta"
protein_df = load_prot_df(fasta_path)
protein_df
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
[3]:
| protein_id | full_name | gene_name | gene_org | description | sequence | nAA | |
|---|---|---|---|---|---|---|---|
| tr|A0A024R161|A0A024R161_HUMAN | A0A024R161 | tr|A0A024R161|A0A024R161_HUMAN | DNAJC25-GNG10 | A0A024R161_HUMAN | tr|A0A024R161|A0A024R161_HUMAN Guanine nucleot... | MGAPLLSPGWGAGAAGRRWWMLLAPLLPALLLVRPAGALVEGLYCG... | 153 |
| tr|A0A024RAP8|A0A024RAP8_HUMAN | A0A024RAP8 | tr|A0A024RAP8|A0A024RAP8_HUMAN | KLRC4-KLRK1 | A0A024RAP8_HUMAN | tr|A0A024RAP8|A0A024RAP8_HUMAN HCG2009644, iso... | MGWIRGRRSRHSWEMSEFHNYNLDLKKSDFSTRWQKQRCPVVKSKC... | 216 |
[4]:
from peptdeep.hla.hla_utils import cat_proteins
cat_sequence = cat_proteins(protein_df["sequence"])
cat_sequence
[4]:
'$MGAPLLSPGWGAGAAGRRWWMLLAPLLPALLLVRPAGALVEGLYCGTRDCYEVLGVSRSAGKAEIARAYRQLARRYHPDRYRPQPGDEGPGRTPQSAEEAFLLVATAYETLKVSQAAAELQQYCMQNACKDALLVGVPAGSNPFREPRSCALL$MGWIRGRRSRHSWEMSEFHNYNLDLKKSDFSTRWQKQRCPVVKSKCRENASPFFFCCFIAVAMGIRFIIMVTIWSAVFLNSLFNQEVQIPLTESYCGPCPKNWICYKNNCYQFFDESKNWYESQASCMSQNASLLKVYSKEDQDLLKLVKSYHWMGLVHIPTNGSWQWEDGSILSPNLLTIIEMQKGDCALYASSFKGYIENCSTPNTYICMQRTV$'
1.2 Unspecific digestion¶
Use alphabase.protein.lcp_digest.get_substring_indices to get all non-redundant non-specific peptide sequences from the concatenated protein sequence. The digested peptide sequences are stored in a dataframe based on their start and stop indices in the concantenated protein sequence string. To save the RAM, the peptdeep.hla module works on start and stop indices instead of on peptide sequences directly. This will save about 8 times of the RAM for HLA-I peptides (length from 7 to 14,
deomnstrated below). For a large protein sequence database, there will be millions of unspecific peptides, so working with strings is not feasible for a complete human fasta due to the requirements of extremely large RAM. (~ 70M unspecific sequences from the reviewed swissprot fasta require ~ 4-5 GB RAM already).
Using the get_substring_indices function we extract the start and stop indices of all peptide sequences between 7 and 14 aa (min_len, max_len) from the concatenated protein sequences. All peptides sequences are unique, guranteed by the LCP algorithm.
[5]:
from alphabase.protein.lcp_digest import get_substring_indices
import pandas as pd
import sys
start_idxes, stop_idxes = get_substring_indices(
cat_sequence, min_len=8, max_len=14, stop_char="$"
)
digest_pos_df = pd.DataFrame({
"start_pos": start_idxes,
"stop_pos": stop_idxes,
})
digest_pos_df
[5]:
| start_pos | stop_pos | |
|---|---|---|
| 0 | 1 | 9 |
| 1 | 1 | 10 |
| 2 | 1 | 11 |
| 3 | 1 | 12 |
| 4 | 1 | 13 |
| ... | ... | ... |
| 2438 | 361 | 370 |
| 2439 | 361 | 371 |
| 2440 | 362 | 370 |
| 2441 | 362 | 371 |
| 2442 | 363 | 371 |
2443 rows × 2 columns
[6]:
RAM_use_idxes = sys.getsizeof(digest_pos_df)*1e-6
The unspecific peptide sequences can be localted by the start_pos and stop_pos.
[7]:
digest_pos_df["sequence"] = digest_pos_df[
["start_pos","stop_pos"]
].apply(lambda x: cat_sequence[slice(*x)], axis=1)
digest_pos_df
[7]:
| start_pos | stop_pos | sequence | |
|---|---|---|---|
| 0 | 1 | 9 | MGAPLLSP |
| 1 | 1 | 10 | MGAPLLSPG |
| 2 | 1 | 11 | MGAPLLSPGW |
| 3 | 1 | 12 | MGAPLLSPGWG |
| 4 | 1 | 13 | MGAPLLSPGWGA |
| ... | ... | ... | ... |
| 2438 | 361 | 370 | NTYICMQRT |
| 2439 | 361 | 371 | NTYICMQRTV |
| 2440 | 362 | 370 | TYICMQRT |
| 2441 | 362 | 371 | TYICMQRTV |
| 2442 | 363 | 371 | YICMQRTV |
2443 rows × 3 columns
[8]:
RAM_use_seqs = sys.getsizeof(digest_pos_df["sequence"])*1e-6
[9]:
f"seq RAM = {RAM_use_seqs:.5f} Mb, idxes RAM = {RAM_use_idxes:.5f}, ratio = {RAM_use_seqs/RAM_use_idxes:.5f}"
[9]:
'seq RAM = 0.16623 Mb, idxes RAM = 0.01971, ratio = 8.43475'
2. Selection of peptide sequences used for library prediction¶
The digest_prot_df contains all unspecifically digested peptide sequences between 7 and 14 aa generatable from the concatenated protein sequences. This list is reduced using a HLA1_Binding_Classifier from peptdeep.hla.hla_class1. Two different model architectures are available, an LSTM model (HLA_Class_I_LSTM) and a BERT model (HLA_Class_I_BERT). A pretrained model is only available for the LSTM model architecture. The HLA1_Binding_Classifer can be used with a pretrained model, tuned with existing peptide data or trained from scratch. Training of a new model should be considered carefully and will not be covered in this tutorial.
Selection of peptide sequences for library predicition using the pretrained model can be done in a few steps. First, the Classifier model needs to be initialized and the pretrained model is loaded. Next, we can use any kind of dataframe containing peptide sequences to predict how likely there are HLA peptides, the only requirement beeing that the column containing the peptides is called ‘sequence’.
[10]:
from peptdeep.hla.hla_class1 import HLA1_Binding_Classifier
model = HLA1_Binding_Classifier()
model.load_pretrained_hla_model()
manual_prediction = model.predict(digest_pos_df)
manual_prediction
[10]:
| start_pos | stop_pos | sequence | nAA | HLA_prob_pred | |
|---|---|---|---|---|---|
| 0 | 1 | 9 | MGAPLLSP | 8 | 0.239477 |
| 1 | 145 | 153 | REPRSCAL | 8 | 0.061692 |
| 2 | 146 | 154 | EPRSCALL | 8 | 0.137313 |
| 3 | 155 | 163 | MGWIRGRR | 8 | 0.056462 |
| 4 | 156 | 164 | GWIRGRRS | 8 | 0.001298 |
| ... | ... | ... | ... | ... | ... |
| 2438 | 112 | 126 | KVSQAAAELQQYCM | 14 | 0.243115 |
| 2439 | 317 | 331 | NGSWQWEDGSILSP | 14 | 0.021114 |
| 2440 | 79 | 93 | DRYRPQPGDEGPGR | 14 | 0.060634 |
| 2441 | 113 | 127 | VSQAAAELQQYCMQ | 14 | 0.355900 |
| 2442 | 190 | 204 | KQRCPVVKSKCREN | 14 | 0.000362 |
2443 rows × 5 columns
Next, we can filter the list based on the HLA_prob_pred. The higher the probability, the more likely it is for the peptide sequence to be present in a immunopeptidomics sample. It is not recommended to use a cut-off below 0.7 as this inflates the spectral library. It is rather recommended to use more conservative cut-offs.
[11]:
manual_prediction[manual_prediction['HLA_prob_pred'] > 0.7]
[11]:
| start_pos | stop_pos | sequence | nAA | HLA_prob_pred | |
|---|---|---|---|---|---|
| 17 | 168 | 176 | EMSEFHNY | 8 | 0.793702 |
| 24 | 130 | 138 | KDALLVGV | 8 | 0.817415 |
| 31 | 137 | 145 | VPAGSNPF | 8 | 0.751329 |
| 37 | 170 | 178 | SEFHNYNL | 8 | 0.940019 |
| 67 | 181 | 189 | KSDFSTRW | 8 | 0.895964 |
| ... | ... | ... | ... | ... | ... |
| 2318 | 95 | 109 | QSAEEAFLLVATAY | 14 | 0.969541 |
| 2378 | 329 | 343 | SPNLLTIIEMQKGD | 14 | 0.756001 |
| 2382 | 5 | 19 | LLSPGWGAGAAGRR | 14 | 0.733784 |
| 2408 | 110 | 124 | TLKVSQAAAELQQY | 14 | 0.891976 |
| 2419 | 6 | 20 | LSPGWGAGAAGRRW | 14 | 0.842583 |
148 rows × 5 columns
As described above, directly using the sequences for classification can be memory intense for large lists of sequences. Thereby, the manual concatenation, unspecific digestion, predicition and filtering is only suggested for small sets of proteins or integration of selected sequences (e.g mutations, nuORFs etc.). This can be circumvented by directly predicting and filtering from a fasta using model.predict_from_proteins(). This executes the concatenation, unspecific digestion, predicition and filtering automatically in batches. Thereby the whole process can be done more efficient and be performed without a specialized computation infrastructure.
[12]:
sequence_df = model.predict_from_proteins(protein_df, prob_threshold=0.7)
sequence_df
100%|██████████| 1/1 [00:01<00:00, 1.27s/it]
[12]:
| start_pos | stop_pos | nAA | HLA_prob_pred | sequence | |
|---|---|---|---|---|---|
| 0 | 168 | 176 | 8 | 0.793702 | EMSEFHNY |
| 1 | 130 | 138 | 8 | 0.817415 | KDALLVGV |
| 2 | 137 | 145 | 8 | 0.751329 | VPAGSNPF |
| 3 | 170 | 178 | 8 | 0.940019 | SEFHNYNL |
| 4 | 181 | 189 | 8 | 0.895964 | KSDFSTRW |
| ... | ... | ... | ... | ... | ... |
| 143 | 95 | 109 | 14 | 0.969541 | QSAEEAFLLVATAY |
| 144 | 329 | 343 | 14 | 0.756001 | SPNLLTIIEMQKGD |
| 145 | 5 | 19 | 14 | 0.733784 | LLSPGWGAGAAGRR |
| 146 | 110 | 124 | 14 | 0.891976 | TLKVSQAAAELQQY |
| 147 | 6 | 20 | 14 | 0.842583 | LSPGWGAGAAGRRW |
148 rows × 5 columns
To perform transferlearning we need a list of peptide sequences we expect to be present in our sample. These peptides can be retrived from several different sources like DDA or directDIA search results. It is recommended to use at the very least 1000 sequences for transferlearning. The more sequences available the better the transferlearning step works. The model performance can be assessed after transferlearning and should be assessed before predicition.
First, the Classifier model needs to be initialized and the pretrained model is loaded. Next, a protein dataframe is added, in this example the previousely loaded fasta file. The protein dataframe is used by the Classifier internaly to draw negative training data during model training and testing.
[13]:
model = HLA1_Binding_Classifier()
model.load_pretrained_hla_model()
model.load_proteins(fasta_path)
[14]:
test_seq_df = digest_pos_df.sample(frac=0.2)
train_seq_df = digest_pos_df.drop(index=test_seq_df.index)
len(train_seq_df), len(test_seq_df)
[14]:
(1954, 489)
Now, we train the model using the training sequence dataframe. In this example we use 10 training epochs, in a real experiment more should be used. Good starting points are 40 epochs for a training dataset of around 10000 sequences or 100 epochs for a training dataset of around 1000 sequences. For a real experiment the warmup_epochs can be increased to 10.
[15]:
model.train(train_seq_df,
epoch=10, warmup_epoch=5,
verbose=True)
2024-07-23 14:22:06> Training with fixed sequence length: 0
[Training] Epoch=1, lr=4e-05, loss=1.39779794216156
[Training] Epoch=2, lr=6e-05, loss=1.0070140702383858
[Training] Epoch=3, lr=8e-05, loss=0.7982760497501918
[Training] Epoch=4, lr=0.0001, loss=0.7397338407380241
[Training] Epoch=5, lr=0.0001, loss=0.7099559647696358
[Training] Epoch=6, lr=9.045084971874738e-05, loss=0.7016251683235168
[Training] Epoch=7, lr=6.545084971874738e-05, loss=0.6965694086892265
[Training] Epoch=8, lr=3.4549150281252636e-05, loss=0.697939566203526
[Training] Epoch=9, lr=9.549150281252633e-06, loss=0.6959438664572579
[Training] Epoch=10, lr=1.0000000000000002e-14, loss=0.6928229417119708
We can assess the model performance after transferlearning using the model.test() function on the training and testing data. This can also be done before transferlearning to assess how well the model fits the available data already. The test assesses the precision, recall and fals positive rate of the model at different probability cut offs. As a rule of thumb a false postitve rate above 7% (@FENG adjust in case lower/higher) is not recomendable because the peptide list gets disproportionally larger, leading to lower IDs during the search. In case of a high false postitive rate, the probability cut off at which the peptides are predicted should be increased.
[16]:
model.test(train_seq_df)
[16]:
| HLA_prob_pred | precision | recall | false_positive | |
|---|---|---|---|---|
| 0 | 0.5 | 0.511434 | 0.595189 | 0.568577 |
| 1 | 0.6 | 0.416667 | 0.017912 | 0.025077 |
| 2 | 0.7 | 0.333333 | 0.000512 | 0.001024 |
| 3 | 0.8 | NaN | 0.000000 | 0.000000 |
| 4 | 0.9 | NaN | 0.000000 | 0.000000 |
[17]:
model.test(test_seq_df)
[17]:
| HLA_prob_pred | precision | recall | false_positive | |
|---|---|---|---|---|
| 0 | 0.5 | 0.450192 | 0.480573 | 0.586912 |
| 1 | 0.6 | 0.470588 | 0.016360 | 0.018405 |
| 2 | 0.7 | NaN | 0.000000 | 0.000000 |
| 3 | 0.8 | NaN | 0.000000 | 0.000000 |
| 4 | 0.9 | NaN | 0.000000 | 0.000000 |
After transferlearning and testing the new model, peptides can be predicted as with the pretrained model.
[18]:
model.predict_from_proteins(fasta_path, prob_threshold=0.6)
100%|██████████| 1/1 [00:01<00:00, 1.32s/it]
[18]:
| start_pos | stop_pos | nAA | HLA_prob_pred | sequence | |
|---|---|---|---|---|---|
| 0 | 170 | 178 | 8 | 0.711809 | SEFHNYNL |
| 1 | 62 | 70 | 8 | 0.627015 | KAEIARAY |
| 2 | 106 | 114 | 8 | 0.628822 | TAYETLKV |
| 3 | 299 | 307 | 8 | 0.605544 | LLKLVKSY |
| 4 | 346 | 354 | 8 | 0.646759 | YASSFKGY |
| 5 | 258 | 266 | 8 | 0.624555 | ICYKNNCY |
| 6 | 294 | 303 | 9 | 0.610476 | KEDQDLLKL |
| 7 | 298 | 307 | 9 | 0.645020 | DLLKLVKSY |
| 8 | 235 | 244 | 9 | 0.629079 | SLFNQEVQI |
| 9 | 257 | 266 | 9 | 0.623247 | WICYKNNCY |
| 10 | 267 | 276 | 9 | 0.611738 | FFDESKNWY |
| 11 | 17 | 26 | 9 | 0.605875 | RRWWMLLAP |
| 12 | 327 | 336 | 9 | 0.616737 | ILSPNLLTI |
| 13 | 74 | 83 | 9 | 0.611590 | RRYHPDRYR |
| 14 | 344 | 354 | 10 | 0.662783 | ALYASSFKGY |
| 15 | 232 | 242 | 10 | 0.651600 | FLNSLFNQEV |
| 16 | 221 | 231 | 10 | 0.617175 | FIIMVTIWSA |
| 17 | 222 | 232 | 10 | 0.600623 | IIMVTIWSAV |
| 18 | 74 | 84 | 10 | 0.614895 | RRYHPDRYRP |
| 19 | 221 | 232 | 11 | 0.608950 | FIIMVTIWSAV |
| 20 | 353 | 364 | 11 | 0.613787 | YIENCSTPNTY |
| 21 | 74 | 85 | 11 | 0.605368 | RRYHPDRYRPQ |
| 22 | 112 | 124 | 12 | 0.612270 | KVSQAAAELQQY |
| 23 | 42 | 54 | 12 | 0.607715 | GLYCGTRDCYEV |
| 24 | 351 | 363 | 12 | 0.616891 | KGYIENCSTPNT |
| 25 | 74 | 86 | 12 | 0.602210 | RRYHPDRYRPQP |
| 26 | 86 | 99 | 13 | 0.644656 | GDEGPGRTPQSAE |
| 27 | 351 | 364 | 13 | 0.603497 | KGYIENCSTPNTY |
| 28 | 73 | 86 | 13 | 0.622453 | ARRYHPDRYRPQP |
| 29 | 74 | 87 | 13 | 0.611441 | RRYHPDRYRPQPG |
| 30 | 334 | 347 | 13 | 0.604354 | TIIEMQKGDCALY |
| 31 | 141 | 154 | 13 | 0.601309 | SNPFREPRSCALL |
| 32 | 32 | 45 | 13 | 0.622797 | LVRPAGALVEGLY |
| 33 | 130 | 143 | 13 | 0.604786 | KDALLVGVPAGSN |
| 34 | 333 | 347 | 14 | 0.613545 | LTIIEMQKGDCALY |
| 35 | 60 | 74 | 14 | 0.607648 | AGKAEIARAYRQLA |
| 36 | 85 | 99 | 14 | 0.606241 | PGDEGPGRTPQSAE |
| 37 | 229 | 243 | 14 | 0.606759 | SAVFLNSLFNQEVQ |
| 38 | 86 | 100 | 14 | 0.622891 | GDEGPGRTPQSAEE |
| 39 | 167 | 181 | 14 | 0.611953 | WEMSEFHNYNLDLK |
| 40 | 117 | 131 | 14 | 0.619257 | AAELQQYCMQNACK |
| 41 | 73 | 87 | 14 | 0.608767 | ARRYHPDRYRPQPG |
| 42 | 329 | 343 | 14 | 0.600299 | SPNLLTIIEMQKGD |
3. Spectral library prediciton¶
Now the spectral library for the filtered peptide list can be predicted using PredictSpecLibFasta. First, one needs to select the models for rt/ccs/ms2 prediction using the ModelManager. One can select from a set of pretrained models or load externally trained models. Here we load the ‘HLA’ model (at the moment this still loads the generic model, but in the futer this is supposed to be replaced by an HLA specfic internal model).
[19]:
from peptdeep.spec_lib.predict_lib import ModelManager
from peptdeep.protein.fasta import PredictSpecLibFasta
model_mgr = ModelManager()
model_mgr.load_installed_models(model_type='HLA')
In the next step, the PredictSpecLibFasta is initialized using the preloaded model. The presettings here are selected for the prediction of tryptic libraries so some parameters need to be adjusted, in particular precursor_charge_min, precursor_charge_max. By default Carbamidomethylation is set as a fixed modification (fix_mod) and Acetylation and Oxidation are set as variable modifications (var_mod). Those can be removed by adding an empty list as shown for the variable modifications.
Of note, PredictSpecLibFasta can also be used to predict a library from a fasta file. Therfore one can also set the protease (default trypsin) and the minimum and maximum peptide length (7 to 35). Wee dont need to change those parameters here, as we wont make use of the digestion functions but rather provide a already digested sequence table.
[20]:
speclib = PredictSpecLibFasta(model_manager=model_mgr,
precursor_charge_min=1,
precursor_charge_max=3,
fix_mods=[])
To reduce the size of the dataframe and predicted library we give each peptide sequence a unique protein identifier (number). This enables the use of search engines that rely on protein information (such as AlphaDIA) but one needs to keep in mind to remove filtering steps based on how many peptides per protein are identified during data analysis. Alternatively, proteins of the peptide sequences may originate from can be infered using alphabase.protein.fasta.annotate_precursor_df()
(demonstrated below).
[21]:
sequence_df['protein_id'] = [str(i) for i in range(len(sequence_df))]
sequence_df['protein_idxes'] = sequence_df.protein_id.astype("U")
sequence_df['full_name'] = sequence_df['protein_id']
sequence_df['gene_org'] = sequence_df['protein_id']
sequence_df['gene_name'] = sequence_df['protein_id']
sequence_df["is_prot_nterm"] = False
sequence_df["is_prot_cterm"] = False
sequence_df
[21]:
| start_pos | stop_pos | nAA | HLA_prob_pred | sequence | protein_id | protein_idxes | full_name | gene_org | gene_name | is_prot_nterm | is_prot_cterm | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 168 | 176 | 8 | 0.793702 | EMSEFHNY | 0 | 0 | 0 | 0 | 0 | False | False |
| 1 | 130 | 138 | 8 | 0.817415 | KDALLVGV | 1 | 1 | 1 | 1 | 1 | False | False |
| 2 | 137 | 145 | 8 | 0.751329 | VPAGSNPF | 2 | 2 | 2 | 2 | 2 | False | False |
| 3 | 170 | 178 | 8 | 0.940019 | SEFHNYNL | 3 | 3 | 3 | 3 | 3 | False | False |
| 4 | 181 | 189 | 8 | 0.895964 | KSDFSTRW | 4 | 4 | 4 | 4 | 4 | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 143 | 95 | 109 | 14 | 0.969541 | QSAEEAFLLVATAY | 143 | 143 | 143 | 143 | 143 | False | False |
| 144 | 329 | 343 | 14 | 0.756001 | SPNLLTIIEMQKGD | 144 | 144 | 144 | 144 | 144 | False | False |
| 145 | 5 | 19 | 14 | 0.733784 | LLSPGWGAGAAGRR | 145 | 145 | 145 | 145 | 145 | False | False |
| 146 | 110 | 124 | 14 | 0.891976 | TLKVSQAAAELQQY | 146 | 146 | 146 | 146 | 146 | False | False |
| 147 | 6 | 20 | 14 | 0.842583 | LSPGWGAGAAGRRW | 147 | 147 | 147 | 147 | 147 | False | False |
148 rows × 12 columns
The sequence dataframe contains all the relevant information to be passed to the protein_df and the precursor_df.
[22]:
speclib.protein_df = sequence_df[
["sequence","protein_id","nAA", 'full_name', 'gene_org', 'gene_name']
].copy()
speclib.protein_df
[22]:
| sequence | protein_id | nAA | full_name | gene_org | gene_name | |
|---|---|---|---|---|---|---|
| 0 | EMSEFHNY | 0 | 8 | 0 | 0 | 0 |
| 1 | KDALLVGV | 1 | 8 | 1 | 1 | 1 |
| 2 | VPAGSNPF | 2 | 8 | 2 | 2 | 2 |
| 3 | SEFHNYNL | 3 | 8 | 3 | 3 | 3 |
| 4 | KSDFSTRW | 4 | 8 | 4 | 4 | 4 |
| ... | ... | ... | ... | ... | ... | ... |
| 143 | QSAEEAFLLVATAY | 143 | 14 | 143 | 143 | 143 |
| 144 | SPNLLTIIEMQKGD | 144 | 14 | 144 | 144 | 144 |
| 145 | LLSPGWGAGAAGRR | 145 | 14 | 145 | 145 | 145 |
| 146 | TLKVSQAAAELQQY | 146 | 14 | 146 | 146 | 146 |
| 147 | LSPGWGAGAAGRRW | 147 | 14 | 147 | 147 | 147 |
148 rows × 6 columns
[23]:
speclib.precursor_df = sequence_df[
["sequence","protein_idxes","start_pos","stop_pos",
"nAA","HLA_prob_pred", 'is_prot_nterm', 'is_prot_cterm']
].copy()
speclib.precursor_df
[23]:
| sequence | protein_idxes | start_pos | stop_pos | nAA | HLA_prob_pred | is_prot_nterm | is_prot_cterm | |
|---|---|---|---|---|---|---|---|---|
| 0 | EMSEFHNY | 0 | 168 | 176 | 8 | 0.793702 | False | False |
| 1 | KDALLVGV | 1 | 130 | 138 | 8 | 0.817415 | False | False |
| 2 | VPAGSNPF | 2 | 137 | 145 | 8 | 0.751329 | False | False |
| 3 | SEFHNYNL | 3 | 170 | 178 | 8 | 0.940019 | False | False |
| 4 | KSDFSTRW | 4 | 181 | 189 | 8 | 0.895964 | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 143 | QSAEEAFLLVATAY | 143 | 95 | 109 | 14 | 0.969541 | False | False |
| 144 | SPNLLTIIEMQKGD | 144 | 329 | 343 | 14 | 0.756001 | False | False |
| 145 | LLSPGWGAGAAGRR | 145 | 5 | 19 | 14 | 0.733784 | False | False |
| 146 | TLKVSQAAAELQQY | 146 | 110 | 124 | 14 | 0.891976 | False | False |
| 147 | LSPGWGAGAAGRRW | 147 | 6 | 20 | 14 | 0.842583 | False | False |
148 rows × 8 columns
[24]:
speclib.precursor_df
[24]:
| sequence | protein_idxes | start_pos | stop_pos | nAA | HLA_prob_pred | is_prot_nterm | is_prot_cterm | |
|---|---|---|---|---|---|---|---|---|
| 0 | EMSEFHNY | 0 | 168 | 176 | 8 | 0.793702 | False | False |
| 1 | KDALLVGV | 1 | 130 | 138 | 8 | 0.817415 | False | False |
| 2 | VPAGSNPF | 2 | 137 | 145 | 8 | 0.751329 | False | False |
| 3 | SEFHNYNL | 3 | 170 | 178 | 8 | 0.940019 | False | False |
| 4 | KSDFSTRW | 4 | 181 | 189 | 8 | 0.895964 | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 143 | QSAEEAFLLVATAY | 143 | 95 | 109 | 14 | 0.969541 | False | False |
| 144 | SPNLLTIIEMQKGD | 144 | 329 | 343 | 14 | 0.756001 | False | False |
| 145 | LLSPGWGAGAAGRR | 145 | 5 | 19 | 14 | 0.733784 | False | False |
| 146 | TLKVSQAAAELQQY | 146 | 110 | 124 | 14 | 0.891976 | False | False |
| 147 | LSPGWGAGAAGRRW | 147 | 6 | 20 | 14 | 0.842583 | False | False |
148 rows × 8 columns
Next, the modifications and charges can be added to the peptide dataframe using add_modifications and add_charge. This creates a unique entry for every combination of charge and modification for all the sequences in the precursor dataframe.
[25]:
speclib.add_modifications()
speclib.add_charge()
speclib.precursor_df
[25]:
| sequence | protein_idxes | start_pos | stop_pos | nAA | HLA_prob_pred | is_prot_nterm | is_prot_cterm | mods | mod_sites | charge | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EMSEFHNY | 0 | 168 | 176 | 8 | 0.793702 | False | False | Oxidation@M | 2 | 1 |
| 1 | EMSEFHNY | 0 | 168 | 176 | 8 | 0.793702 | False | False | Oxidation@M | 2 | 2 |
| 2 | EMSEFHNY | 0 | 168 | 176 | 8 | 0.793702 | False | False | Oxidation@M | 2 | 3 |
| 3 | EMSEFHNY | 0 | 168 | 176 | 8 | 0.793702 | False | False | 1 | ||
| 4 | EMSEFHNY | 0 | 168 | 176 | 8 | 0.793702 | False | False | 2 | ||
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 493 | TLKVSQAAAELQQY | 146 | 110 | 124 | 14 | 0.891976 | False | False | 2 | ||
| 494 | TLKVSQAAAELQQY | 146 | 110 | 124 | 14 | 0.891976 | False | False | 3 | ||
| 495 | LSPGWGAGAAGRRW | 147 | 6 | 20 | 14 | 0.842583 | False | False | 1 | ||
| 496 | LSPGWGAGAAGRRW | 147 | 6 | 20 | 14 | 0.842583 | False | False | 2 | ||
| 497 | LSPGWGAGAAGRRW | 147 | 6 | 20 | 14 | 0.842583 | False | False | 3 |
498 rows × 11 columns
Now ccs, rt and ms2 can be predicted for each entry
[26]:
speclib.predict_all()
2024-07-23 14:22:43> Predicting RT/IM/MS2 for 400 precursors ...
2024-07-23 14:22:43> Predicting RT ...
100%|██████████| 7/7 [00:00<00:00, 27.54it/s]
2024-07-23 14:22:43> Predicting mobility ...
100%|██████████| 7/7 [00:00<00:00, 50.06it/s]
2024-07-23 14:22:44> Predicting MS2 ...
100%|██████████| 7/7 [00:00<00:00, 23.73it/s]
2024-07-23 14:22:44> End predicting RT/IM/MS2
iRTs can be added using translate_rt_to_irt_pred. This is not neccessary for search engines like DIA-NN or AlphaDIA but required for Spectronaut.
[27]:
speclib.translate_rt_to_irt_pred()
Predict RT for 11 iRT precursors.
Linear regression of `rt_pred` to `irt`:
R_square R slope intercept test_num
0 0.99007 0.995022 152.235639 -39.232164 11
[27]:
| sequence | protein_idxes | start_pos | stop_pos | nAA | HLA_prob_pred | is_prot_nterm | is_prot_cterm | mods | mod_sites | ... | precursor_mz | rt_pred | rt_norm_pred | ccs_pred | mobility_pred | nce | instrument | frag_start_idx | frag_stop_idx | irt_pred | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EMSEFHNY | 0 | 168 | 176 | 8 | 0.793702 | False | False | Oxidation@M | 2 | ... | 1072.404037 | 0.189650 | 0.189650 | 254.195892 | 1.253140 | 30.0 | Lumos | 0 | 7 | -10.360738 |
| 1 | EMSEFHNY | 0 | 168 | 176 | 8 | 0.793702 | False | False | Oxidation@M | 2 | ... | 536.705657 | 0.189650 | 0.189650 | 337.328583 | 0.831494 | 30.0 | Lumos | 7 | 14 | -10.360738 |
| 2 | EMSEFHNY | 0 | 168 | 176 | 8 | 0.793702 | False | False | ... | 1056.409123 | 0.289261 | 0.289261 | 255.103699 | 1.257373 | 30.0 | Lumos | 14 | 21 | 4.803681 | ||
| 3 | EMSEFHNY | 0 | 168 | 176 | 8 | 0.793702 | False | False | ... | 528.708200 | 0.289261 | 0.289261 | 337.444641 | 0.831621 | 30.0 | Lumos | 21 | 28 | 4.803681 | ||
| 4 | KDALLVGV | 1 | 130 | 138 | 8 | 0.817415 | False | False | ... | 814.503280 | 0.433791 | 0.433791 | 256.615204 | 1.260001 | 30.0 | Lumos | 28 | 35 | 26.806270 | ||
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 395 | TLKVSQAAAELQQY | 146 | 110 | 124 | 14 | 0.891976 | False | False | ... | 775.414662 | 0.489545 | 0.489545 | 429.360901 | 1.062514 | 30.0 | Lumos | 3810 | 3823 | 35.294030 | ||
| 396 | TLKVSQAAAELQQY | 146 | 110 | 124 | 14 | 0.891976 | False | False | ... | 517.278867 | 0.489545 | 0.489545 | 463.231049 | 0.764225 | 30.0 | Lumos | 3823 | 3836 | 35.294030 | ||
| 397 | LSPGWGAGAAGRRW | 147 | 6 | 20 | 14 | 0.842583 | False | False | ... | 1441.744742 | 0.377743 | 0.377743 | 289.200989 | 1.430378 | 30.0 | Lumos | 3836 | 3849 | 18.273780 | ||
| 398 | LSPGWGAGAAGRRW | 147 | 6 | 20 | 14 | 0.842583 | False | False | ... | 721.376009 | 0.377743 | 0.377743 | 404.633698 | 1.000659 | 30.0 | Lumos | 3849 | 3862 | 18.273780 | ||
| 399 | LSPGWGAGAAGRRW | 147 | 6 | 20 | 14 | 0.842583 | False | False | ... | 481.253098 | 0.377743 | 0.377743 | 476.655701 | 0.785851 | 30.0 | Lumos | 3862 | 3875 | 18.273780 |
400 rows × 21 columns
Now, the predicted library can be exported in an hdf format (AlphaDIA) or translated to a tsv. The tsv translation can be very time consuming. Before the spectral library can be translated, the gene and protein column need to be mapped from the protein_df into the precursor_df.
[28]:
# hdf_path = "D:\Software\FASTA\Human\speclib_example.hdf"
# tsv_path = "D:\Software\FASTA\Human\speclib_example.tsv"
# speclib.save_hdf(hdf_path) # save as hdf speclib
[29]:
from peptdeep.spec_lib.translate import translate_to_tsv
speclib.append_protein_name()
# translate_to_tsv(speclib=speclib, tsv = tsv_path) # save as tsv speclib
4. Matching peptides back to proteins¶
The peptide sequnces can be matched back to proteins using annotate_precursor_df, requiring a ‘sequence’ column and a protein_df like the previously loaded fasta file. This can be done with the sequence output of any search engine or before the library is generated.
[30]:
from alphabase.protein.fasta import annotate_precursor_df
inferred_sequence_df = annotate_precursor_df(sequence_df, protein_df)
inferred_sequence_df
100%|██████████| 2/2 [00:00<00:00, 7639.90it/s]
[30]:
| start_pos | stop_pos | nAA | HLA_prob_pred | sequence | protein_id | protein_idxes | full_name | gene_org | gene_name | is_prot_nterm | is_prot_cterm | genes | proteins | cardinality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 168 | 176 | 8 | 0.793702 | EMSEFHNY | 0 | 0 | 0 | 0 | 0 | False | False | A0A024RAP8_HUMAN | A0A024RAP8 | 1 |
| 1 | 130 | 138 | 8 | 0.817415 | KDALLVGV | 1 | 1 | 1 | 1 | 1 | False | False | A0A024R161_HUMAN | A0A024R161 | 1 |
| 2 | 137 | 145 | 8 | 0.751329 | VPAGSNPF | 2 | 2 | 2 | 2 | 2 | False | False | A0A024R161_HUMAN | A0A024R161 | 1 |
| 3 | 170 | 178 | 8 | 0.940019 | SEFHNYNL | 3 | 3 | 3 | 3 | 3 | False | False | A0A024RAP8_HUMAN | A0A024RAP8 | 1 |
| 4 | 181 | 189 | 8 | 0.895964 | KSDFSTRW | 4 | 4 | 4 | 4 | 4 | False | False | A0A024RAP8_HUMAN | A0A024RAP8 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 143 | 95 | 109 | 14 | 0.969541 | QSAEEAFLLVATAY | 143 | 143 | 143 | 143 | 143 | False | False | A0A024R161_HUMAN | A0A024R161 | 1 |
| 144 | 329 | 343 | 14 | 0.756001 | SPNLLTIIEMQKGD | 144 | 144 | 144 | 144 | 144 | False | False | A0A024RAP8_HUMAN | A0A024RAP8 | 1 |
| 145 | 5 | 19 | 14 | 0.733784 | LLSPGWGAGAAGRR | 145 | 145 | 145 | 145 | 145 | False | False | A0A024R161_HUMAN | A0A024R161 | 1 |
| 146 | 110 | 124 | 14 | 0.891976 | TLKVSQAAAELQQY | 146 | 146 | 146 | 146 | 146 | False | False | A0A024R161_HUMAN | A0A024R161 | 1 |
| 147 | 6 | 20 | 14 | 0.842583 | LSPGWGAGAAGRRW | 147 | 147 | 147 | 147 | 147 | False | False | A0A024R161_HUMAN | A0A024R161 | 1 |
148 rows × 15 columns
[ ]: