Using peptdeep for MHC class I immunopeptidomics

This notebook introduces how to generate spectral libraries for immunopeptidomics analysis from a list of protein sequences. This entails several steps:

  1. unspecific digestion of protein sequences

  2. selection of peptide sequences used for library prediction by peptdeep-hla predicition 2.1 using the pretrained model 2.2 using an improved model by including a transfer learning step

  3. spectral library prediction

  4. matching the peptides back to the proteins (this can be done before or after library prediction or seach)

Note that pydivsufsort package is not installed by peptdeep by default. Install by:

pip install "peptdeep[development,hla]"

Or install within jupyter notebook:

[1]:
%pip install -q pydivsufsort
Note: you may need to restart the kernel to use updated packages.
[2]:
import torch # noqa: 401, to prevent crash in Mac Arm

1. Unspecific digestion in alphabase

The unspecific digestion workflow uses the longest common prefix (LCP) algorithm, which is based on suffix array data structure, has been proven to be very efficient for unspecific digestion [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-577]. Here we used pydivsufsort, a Python wrapper of a high-performance C library libdivsufsort [https://github.com/y-256/libdivsufsort], to facilitate LCP-based digestion.

This means, the digestion is performed on a single sequence of strings and retrives both the peptide sequence as well as the start and stop indices of the peptide within the complete sequence. Therefore, unspecific digestion in alphabase involves two steps:

  1. concatenation of protein sequences into a single sequence

  2. unspecific digestion

1.1 Concatenate protein sequences into a single sequence

The protein sequences are concatenated into a single sequence. The sequences are seperated by a sentinel character, in this case ‘$’, so that no peptides across proteins are formed. Note that the first and last sentinel characters are crutial as well.

[3]:
from peptdeep.hla.hla_utils import load_prot_df
fasta_path = "data/example.fasta"
protein_df = load_prot_df(fasta_path)
protein_df
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
[3]:
protein_id full_name gene_name gene_org description sequence nAA
tr|A0A024R161|A0A024R161_HUMAN A0A024R161 tr|A0A024R161|A0A024R161_HUMAN DNAJC25-GNG10 A0A024R161_HUMAN tr|A0A024R161|A0A024R161_HUMAN Guanine nucleot... MGAPLLSPGWGAGAAGRRWWMLLAPLLPALLLVRPAGALVEGLYCG... 153
tr|A0A024RAP8|A0A024RAP8_HUMAN A0A024RAP8 tr|A0A024RAP8|A0A024RAP8_HUMAN KLRC4-KLRK1 A0A024RAP8_HUMAN tr|A0A024RAP8|A0A024RAP8_HUMAN HCG2009644, iso... MGWIRGRRSRHSWEMSEFHNYNLDLKKSDFSTRWQKQRCPVVKSKC... 216
[4]:
from peptdeep.hla.hla_utils import cat_proteins
cat_sequence = cat_proteins(protein_df["sequence"])
cat_sequence
[4]:
'$MGAPLLSPGWGAGAAGRRWWMLLAPLLPALLLVRPAGALVEGLYCGTRDCYEVLGVSRSAGKAEIARAYRQLARRYHPDRYRPQPGDEGPGRTPQSAEEAFLLVATAYETLKVSQAAAELQQYCMQNACKDALLVGVPAGSNPFREPRSCALL$MGWIRGRRSRHSWEMSEFHNYNLDLKKSDFSTRWQKQRCPVVKSKCRENASPFFFCCFIAVAMGIRFIIMVTIWSAVFLNSLFNQEVQIPLTESYCGPCPKNWICYKNNCYQFFDESKNWYESQASCMSQNASLLKVYSKEDQDLLKLVKSYHWMGLVHIPTNGSWQWEDGSILSPNLLTIIEMQKGDCALYASSFKGYIENCSTPNTYICMQRTV$'

1.2 Unspecific digestion

Use alphabase.protein.lcp_digest.get_substring_indices to get all non-redundant non-specific peptide sequences from the concatenated protein sequence. The digested peptide sequences are stored in a dataframe based on their start and stop indices in the concantenated protein sequence string. To save the RAM, the peptdeep.hla module works on start and stop indices instead of on peptide sequences directly. This will save about 8 times of the RAM for HLA-I peptides (length from 7 to 14, deomnstrated below). For a large protein sequence database, there will be millions of unspecific peptides, so working with strings is not feasible for a complete human fasta due to the requirements of extremely large RAM. (~ 70M unspecific sequences from the reviewed swissprot fasta require ~ 4-5 GB RAM already).

Using the get_substring_indices function we extract the start and stop indices of all peptide sequences between 7 and 14 aa (min_len, max_len) from the concatenated protein sequences. All peptides sequences are unique, guranteed by the LCP algorithm.

[5]:
from alphabase.protein.lcp_digest import get_substring_indices
import pandas as pd
import sys

start_idxes, stop_idxes = get_substring_indices(
    cat_sequence, min_len=8, max_len=14, stop_char="$"
)
digest_pos_df = pd.DataFrame({
    "start_pos": start_idxes,
    "stop_pos": stop_idxes,
})
digest_pos_df
[5]:
start_pos stop_pos
0 1 9
1 1 10
2 1 11
3 1 12
4 1 13
... ... ...
2438 361 370
2439 361 371
2440 362 370
2441 362 371
2442 363 371

2443 rows × 2 columns

[6]:
RAM_use_idxes = sys.getsizeof(digest_pos_df)*1e-6

The unspecific peptide sequences can be localted by the start_pos and stop_pos.

[7]:
digest_pos_df["sequence"] = digest_pos_df[
    ["start_pos","stop_pos"]
].apply(lambda x: cat_sequence[slice(*x)], axis=1)
digest_pos_df
[7]:
start_pos stop_pos sequence
0 1 9 MGAPLLSP
1 1 10 MGAPLLSPG
2 1 11 MGAPLLSPGW
3 1 12 MGAPLLSPGWG
4 1 13 MGAPLLSPGWGA
... ... ... ...
2438 361 370 NTYICMQRT
2439 361 371 NTYICMQRTV
2440 362 370 TYICMQRT
2441 362 371 TYICMQRTV
2442 363 371 YICMQRTV

2443 rows × 3 columns

[8]:
RAM_use_seqs = sys.getsizeof(digest_pos_df["sequence"])*1e-6
[9]:
f"seq RAM = {RAM_use_seqs:.5f} Mb, idxes RAM = {RAM_use_idxes:.5f}, ratio = {RAM_use_seqs/RAM_use_idxes:.5f}"
[9]:
'seq RAM = 0.16623 Mb, idxes RAM = 0.01971, ratio = 8.43475'

2. Selection of peptide sequences used for library prediction

The digest_prot_df contains all unspecifically digested peptide sequences between 7 and 14 aa generatable from the concatenated protein sequences. This list is reduced using a HLA1_Binding_Classifier from peptdeep.hla.hla_class1. Two different model architectures are available, an LSTM model (HLA_Class_I_LSTM) and a BERT model (HLA_Class_I_BERT). A pretrained model is only available for the LSTM model architecture. The HLA1_Binding_Classifer can be used with a pretrained model, tuned with existing peptide data or trained from scratch. Training of a new model should be considered carefully and will not be covered in this tutorial.

Selection of peptide sequences for library predicition using the pretrained model can be done in a few steps. First, the Classifier model needs to be initialized and the pretrained model is loaded. Next, we can use any kind of dataframe containing peptide sequences to predict how likely there are HLA peptides, the only requirement beeing that the column containing the peptides is called ‘sequence’.

[10]:
from peptdeep.hla.hla_class1 import HLA1_Binding_Classifier

model = HLA1_Binding_Classifier()
model.load_pretrained_hla_model()
manual_prediction = model.predict(digest_pos_df)
manual_prediction

[10]:
start_pos stop_pos sequence nAA HLA_prob_pred
0 1 9 MGAPLLSP 8 0.239477
1 145 153 REPRSCAL 8 0.061692
2 146 154 EPRSCALL 8 0.137313
3 155 163 MGWIRGRR 8 0.056462
4 156 164 GWIRGRRS 8 0.001298
... ... ... ... ... ...
2438 112 126 KVSQAAAELQQYCM 14 0.243115
2439 317 331 NGSWQWEDGSILSP 14 0.021114
2440 79 93 DRYRPQPGDEGPGR 14 0.060634
2441 113 127 VSQAAAELQQYCMQ 14 0.355900
2442 190 204 KQRCPVVKSKCREN 14 0.000362

2443 rows × 5 columns

Next, we can filter the list based on the HLA_prob_pred. The higher the probability, the more likely it is for the peptide sequence to be present in a immunopeptidomics sample. It is not recommended to use a cut-off below 0.7 as this inflates the spectral library. It is rather recommended to use more conservative cut-offs.

[11]:
manual_prediction[manual_prediction['HLA_prob_pred'] > 0.7]
[11]:
start_pos stop_pos sequence nAA HLA_prob_pred
17 168 176 EMSEFHNY 8 0.793702
24 130 138 KDALLVGV 8 0.817415
31 137 145 VPAGSNPF 8 0.751329
37 170 178 SEFHNYNL 8 0.940019
67 181 189 KSDFSTRW 8 0.895964
... ... ... ... ... ...
2318 95 109 QSAEEAFLLVATAY 14 0.969541
2378 329 343 SPNLLTIIEMQKGD 14 0.756001
2382 5 19 LLSPGWGAGAAGRR 14 0.733784
2408 110 124 TLKVSQAAAELQQY 14 0.891976
2419 6 20 LSPGWGAGAAGRRW 14 0.842583

148 rows × 5 columns

As described above, directly using the sequences for classification can be memory intense for large lists of sequences. Thereby, the manual concatenation, unspecific digestion, predicition and filtering is only suggested for small sets of proteins or integration of selected sequences (e.g mutations, nuORFs etc.). This can be circumvented by directly predicting and filtering from a fasta using model.predict_from_proteins(). This executes the concatenation, unspecific digestion, predicition and filtering automatically in batches. Thereby the whole process can be done more efficient and be performed without a specialized computation infrastructure.

[12]:
sequence_df = model.predict_from_proteins(protein_df, prob_threshold=0.7)
sequence_df
100%|██████████| 1/1 [00:01<00:00,  1.27s/it]
[12]:
start_pos stop_pos nAA HLA_prob_pred sequence
0 168 176 8 0.793702 EMSEFHNY
1 130 138 8 0.817415 KDALLVGV
2 137 145 8 0.751329 VPAGSNPF
3 170 178 8 0.940019 SEFHNYNL
4 181 189 8 0.895964 KSDFSTRW
... ... ... ... ... ...
143 95 109 14 0.969541 QSAEEAFLLVATAY
144 329 343 14 0.756001 SPNLLTIIEMQKGD
145 5 19 14 0.733784 LLSPGWGAGAAGRR
146 110 124 14 0.891976 TLKVSQAAAELQQY
147 6 20 14 0.842583 LSPGWGAGAAGRRW

148 rows × 5 columns

To perform transferlearning we need a list of peptide sequences we expect to be present in our sample. These peptides can be retrived from several different sources like DDA or directDIA search results. It is recommended to use at the very least 1000 sequences for transferlearning. The more sequences available the better the transferlearning step works. The model performance can be assessed after transferlearning and should be assessed before predicition.

First, the Classifier model needs to be initialized and the pretrained model is loaded. Next, a protein dataframe is added, in this example the previousely loaded fasta file. The protein dataframe is used by the Classifier internaly to draw negative training data during model training and testing.

[13]:
model = HLA1_Binding_Classifier()
model.load_pretrained_hla_model()
model.load_proteins(fasta_path)
Next, we load the peptide sequences wee use for transferlearning and split it into a training and testing dataset. This step is very important to assess the model performance after transferlearning. Here, we use the digest_pos_df generated above. As these are no immunopeptides, but a list of unspecifically digested proteins, the model performance will not improve, but the pronciples remain the same.
@ Feng should we include a example file so that the model is actually improved or just use this?
[14]:
test_seq_df = digest_pos_df.sample(frac=0.2)
train_seq_df = digest_pos_df.drop(index=test_seq_df.index)
len(train_seq_df), len(test_seq_df)
[14]:
(1954, 489)

Now, we train the model using the training sequence dataframe. In this example we use 10 training epochs, in a real experiment more should be used. Good starting points are 40 epochs for a training dataset of around 10000 sequences or 100 epochs for a training dataset of around 1000 sequences. For a real experiment the warmup_epochs can be increased to 10.

[15]:
model.train(train_seq_df,
            epoch=10, warmup_epoch=5,
            verbose=True)
2024-07-23 14:22:06> Training with fixed sequence length: 0
[Training] Epoch=1, lr=4e-05, loss=1.39779794216156
[Training] Epoch=2, lr=6e-05, loss=1.0070140702383858
[Training] Epoch=3, lr=8e-05, loss=0.7982760497501918
[Training] Epoch=4, lr=0.0001, loss=0.7397338407380241
[Training] Epoch=5, lr=0.0001, loss=0.7099559647696358
[Training] Epoch=6, lr=9.045084971874738e-05, loss=0.7016251683235168
[Training] Epoch=7, lr=6.545084971874738e-05, loss=0.6965694086892265
[Training] Epoch=8, lr=3.4549150281252636e-05, loss=0.697939566203526
[Training] Epoch=9, lr=9.549150281252633e-06, loss=0.6959438664572579
[Training] Epoch=10, lr=1.0000000000000002e-14, loss=0.6928229417119708

We can assess the model performance after transferlearning using the model.test() function on the training and testing data. This can also be done before transferlearning to assess how well the model fits the available data already. The test assesses the precision, recall and fals positive rate of the model at different probability cut offs. As a rule of thumb a false postitve rate above 7% (@FENG adjust in case lower/higher) is not recomendable because the peptide list gets disproportionally larger, leading to lower IDs during the search. In case of a high false postitive rate, the probability cut off at which the peptides are predicted should be increased.

[16]:
model.test(train_seq_df)
[16]:
HLA_prob_pred precision recall false_positive
0 0.5 0.511434 0.595189 0.568577
1 0.6 0.416667 0.017912 0.025077
2 0.7 0.333333 0.000512 0.001024
3 0.8 NaN 0.000000 0.000000
4 0.9 NaN 0.000000 0.000000
[17]:
model.test(test_seq_df)
[17]:
HLA_prob_pred precision recall false_positive
0 0.5 0.450192 0.480573 0.586912
1 0.6 0.470588 0.016360 0.018405
2 0.7 NaN 0.000000 0.000000
3 0.8 NaN 0.000000 0.000000
4 0.9 NaN 0.000000 0.000000

After transferlearning and testing the new model, peptides can be predicted as with the pretrained model.

[18]:
model.predict_from_proteins(fasta_path, prob_threshold=0.6)
100%|██████████| 1/1 [00:01<00:00,  1.32s/it]
[18]:
start_pos stop_pos nAA HLA_prob_pred sequence
0 170 178 8 0.711809 SEFHNYNL
1 62 70 8 0.627015 KAEIARAY
2 106 114 8 0.628822 TAYETLKV
3 299 307 8 0.605544 LLKLVKSY
4 346 354 8 0.646759 YASSFKGY
5 258 266 8 0.624555 ICYKNNCY
6 294 303 9 0.610476 KEDQDLLKL
7 298 307 9 0.645020 DLLKLVKSY
8 235 244 9 0.629079 SLFNQEVQI
9 257 266 9 0.623247 WICYKNNCY
10 267 276 9 0.611738 FFDESKNWY
11 17 26 9 0.605875 RRWWMLLAP
12 327 336 9 0.616737 ILSPNLLTI
13 74 83 9 0.611590 RRYHPDRYR
14 344 354 10 0.662783 ALYASSFKGY
15 232 242 10 0.651600 FLNSLFNQEV
16 221 231 10 0.617175 FIIMVTIWSA
17 222 232 10 0.600623 IIMVTIWSAV
18 74 84 10 0.614895 RRYHPDRYRP
19 221 232 11 0.608950 FIIMVTIWSAV
20 353 364 11 0.613787 YIENCSTPNTY
21 74 85 11 0.605368 RRYHPDRYRPQ
22 112 124 12 0.612270 KVSQAAAELQQY
23 42 54 12 0.607715 GLYCGTRDCYEV
24 351 363 12 0.616891 KGYIENCSTPNT
25 74 86 12 0.602210 RRYHPDRYRPQP
26 86 99 13 0.644656 GDEGPGRTPQSAE
27 351 364 13 0.603497 KGYIENCSTPNTY
28 73 86 13 0.622453 ARRYHPDRYRPQP
29 74 87 13 0.611441 RRYHPDRYRPQPG
30 334 347 13 0.604354 TIIEMQKGDCALY
31 141 154 13 0.601309 SNPFREPRSCALL
32 32 45 13 0.622797 LVRPAGALVEGLY
33 130 143 13 0.604786 KDALLVGVPAGSN
34 333 347 14 0.613545 LTIIEMQKGDCALY
35 60 74 14 0.607648 AGKAEIARAYRQLA
36 85 99 14 0.606241 PGDEGPGRTPQSAE
37 229 243 14 0.606759 SAVFLNSLFNQEVQ
38 86 100 14 0.622891 GDEGPGRTPQSAEE
39 167 181 14 0.611953 WEMSEFHNYNLDLK
40 117 131 14 0.619257 AAELQQYCMQNACK
41 73 87 14 0.608767 ARRYHPDRYRPQPG
42 329 343 14 0.600299 SPNLLTIIEMQKGD

3. Spectral library prediciton

Now the spectral library for the filtered peptide list can be predicted using PredictSpecLibFasta. First, one needs to select the models for rt/ccs/ms2 prediction using the ModelManager. One can select from a set of pretrained models or load externally trained models. Here we load the ‘HLA’ model (at the moment this still loads the generic model, but in the futer this is supposed to be replaced by an HLA specfic internal model).

[19]:
from peptdeep.spec_lib.predict_lib import  ModelManager
from peptdeep.protein.fasta import PredictSpecLibFasta

model_mgr = ModelManager()
model_mgr.load_installed_models(model_type='HLA')

In the next step, the PredictSpecLibFasta is initialized using the preloaded model. The presettings here are selected for the prediction of tryptic libraries so some parameters need to be adjusted, in particular precursor_charge_min, precursor_charge_max. By default Carbamidomethylation is set as a fixed modification (fix_mod) and Acetylation and Oxidation are set as variable modifications (var_mod). Those can be removed by adding an empty list as shown for the variable modifications.

Of note, PredictSpecLibFasta can also be used to predict a library from a fasta file. Therfore one can also set the protease (default trypsin) and the minimum and maximum peptide length (7 to 35). Wee dont need to change those parameters here, as we wont make use of the digestion functions but rather provide a already digested sequence table.

[20]:
speclib = PredictSpecLibFasta(model_manager=model_mgr,
                              precursor_charge_min=1,
                              precursor_charge_max=3,
                              fix_mods=[])

To reduce the size of the dataframe and predicted library we give each peptide sequence a unique protein identifier (number). This enables the use of search engines that rely on protein information (such as AlphaDIA) but one needs to keep in mind to remove filtering steps based on how many peptides per protein are identified during data analysis. Alternatively, proteins of the peptide sequences may originate from can be infered using alphabase.protein.fasta.annotate_precursor_df() (demonstrated below).

[21]:
sequence_df['protein_id'] = [str(i) for i in range(len(sequence_df))]
sequence_df['protein_idxes'] = sequence_df.protein_id.astype("U")
sequence_df['full_name'] = sequence_df['protein_id']
sequence_df['gene_org'] = sequence_df['protein_id']
sequence_df['gene_name'] = sequence_df['protein_id']
sequence_df["is_prot_nterm"] = False
sequence_df["is_prot_cterm"] = False
sequence_df
[21]:
start_pos stop_pos nAA HLA_prob_pred sequence protein_id protein_idxes full_name gene_org gene_name is_prot_nterm is_prot_cterm
0 168 176 8 0.793702 EMSEFHNY 0 0 0 0 0 False False
1 130 138 8 0.817415 KDALLVGV 1 1 1 1 1 False False
2 137 145 8 0.751329 VPAGSNPF 2 2 2 2 2 False False
3 170 178 8 0.940019 SEFHNYNL 3 3 3 3 3 False False
4 181 189 8 0.895964 KSDFSTRW 4 4 4 4 4 False False
... ... ... ... ... ... ... ... ... ... ... ... ...
143 95 109 14 0.969541 QSAEEAFLLVATAY 143 143 143 143 143 False False
144 329 343 14 0.756001 SPNLLTIIEMQKGD 144 144 144 144 144 False False
145 5 19 14 0.733784 LLSPGWGAGAAGRR 145 145 145 145 145 False False
146 110 124 14 0.891976 TLKVSQAAAELQQY 146 146 146 146 146 False False
147 6 20 14 0.842583 LSPGWGAGAAGRRW 147 147 147 147 147 False False

148 rows × 12 columns

The sequence dataframe contains all the relevant information to be passed to the protein_df and the precursor_df.

[22]:
speclib.protein_df = sequence_df[
    ["sequence","protein_id","nAA", 'full_name', 'gene_org', 'gene_name']
].copy()
speclib.protein_df
[22]:
sequence protein_id nAA full_name gene_org gene_name
0 EMSEFHNY 0 8 0 0 0
1 KDALLVGV 1 8 1 1 1
2 VPAGSNPF 2 8 2 2 2
3 SEFHNYNL 3 8 3 3 3
4 KSDFSTRW 4 8 4 4 4
... ... ... ... ... ... ...
143 QSAEEAFLLVATAY 143 14 143 143 143
144 SPNLLTIIEMQKGD 144 14 144 144 144
145 LLSPGWGAGAAGRR 145 14 145 145 145
146 TLKVSQAAAELQQY 146 14 146 146 146
147 LSPGWGAGAAGRRW 147 14 147 147 147

148 rows × 6 columns

[23]:
speclib.precursor_df = sequence_df[
    ["sequence","protein_idxes","start_pos","stop_pos",
     "nAA","HLA_prob_pred", 'is_prot_nterm', 'is_prot_cterm']
].copy()
speclib.precursor_df
[23]:
sequence protein_idxes start_pos stop_pos nAA HLA_prob_pred is_prot_nterm is_prot_cterm
0 EMSEFHNY 0 168 176 8 0.793702 False False
1 KDALLVGV 1 130 138 8 0.817415 False False
2 VPAGSNPF 2 137 145 8 0.751329 False False
3 SEFHNYNL 3 170 178 8 0.940019 False False
4 KSDFSTRW 4 181 189 8 0.895964 False False
... ... ... ... ... ... ... ... ...
143 QSAEEAFLLVATAY 143 95 109 14 0.969541 False False
144 SPNLLTIIEMQKGD 144 329 343 14 0.756001 False False
145 LLSPGWGAGAAGRR 145 5 19 14 0.733784 False False
146 TLKVSQAAAELQQY 146 110 124 14 0.891976 False False
147 LSPGWGAGAAGRRW 147 6 20 14 0.842583 False False

148 rows × 8 columns

[24]:
speclib.precursor_df
[24]:
sequence protein_idxes start_pos stop_pos nAA HLA_prob_pred is_prot_nterm is_prot_cterm
0 EMSEFHNY 0 168 176 8 0.793702 False False
1 KDALLVGV 1 130 138 8 0.817415 False False
2 VPAGSNPF 2 137 145 8 0.751329 False False
3 SEFHNYNL 3 170 178 8 0.940019 False False
4 KSDFSTRW 4 181 189 8 0.895964 False False
... ... ... ... ... ... ... ... ...
143 QSAEEAFLLVATAY 143 95 109 14 0.969541 False False
144 SPNLLTIIEMQKGD 144 329 343 14 0.756001 False False
145 LLSPGWGAGAAGRR 145 5 19 14 0.733784 False False
146 TLKVSQAAAELQQY 146 110 124 14 0.891976 False False
147 LSPGWGAGAAGRRW 147 6 20 14 0.842583 False False

148 rows × 8 columns

Next, the modifications and charges can be added to the peptide dataframe using add_modifications and add_charge. This creates a unique entry for every combination of charge and modification for all the sequences in the precursor dataframe.

[25]:
speclib.add_modifications()
speclib.add_charge()
speclib.precursor_df
[25]:
sequence protein_idxes start_pos stop_pos nAA HLA_prob_pred is_prot_nterm is_prot_cterm mods mod_sites charge
0 EMSEFHNY 0 168 176 8 0.793702 False False Oxidation@M 2 1
1 EMSEFHNY 0 168 176 8 0.793702 False False Oxidation@M 2 2
2 EMSEFHNY 0 168 176 8 0.793702 False False Oxidation@M 2 3
3 EMSEFHNY 0 168 176 8 0.793702 False False 1
4 EMSEFHNY 0 168 176 8 0.793702 False False 2
... ... ... ... ... ... ... ... ... ... ... ...
493 TLKVSQAAAELQQY 146 110 124 14 0.891976 False False 2
494 TLKVSQAAAELQQY 146 110 124 14 0.891976 False False 3
495 LSPGWGAGAAGRRW 147 6 20 14 0.842583 False False 1
496 LSPGWGAGAAGRRW 147 6 20 14 0.842583 False False 2
497 LSPGWGAGAAGRRW 147 6 20 14 0.842583 False False 3

498 rows × 11 columns

Now ccs, rt and ms2 can be predicted for each entry

[26]:
speclib.predict_all()
2024-07-23 14:22:43> Predicting RT/IM/MS2 for 400 precursors ...
2024-07-23 14:22:43> Predicting RT ...
100%|██████████| 7/7 [00:00<00:00, 27.54it/s]
2024-07-23 14:22:43> Predicting mobility ...

100%|██████████| 7/7 [00:00<00:00, 50.06it/s]
2024-07-23 14:22:44> Predicting MS2 ...

100%|██████████| 7/7 [00:00<00:00, 23.73it/s]
2024-07-23 14:22:44> End predicting RT/IM/MS2

iRTs can be added using translate_rt_to_irt_pred. This is not neccessary for search engines like DIA-NN or AlphaDIA but required for Spectronaut.

[27]:
speclib.translate_rt_to_irt_pred()
Predict RT for 11 iRT precursors.
Linear regression of `rt_pred` to `irt`:
   R_square         R       slope  intercept  test_num
0   0.99007  0.995022  152.235639 -39.232164        11
[27]:
sequence protein_idxes start_pos stop_pos nAA HLA_prob_pred is_prot_nterm is_prot_cterm mods mod_sites ... precursor_mz rt_pred rt_norm_pred ccs_pred mobility_pred nce instrument frag_start_idx frag_stop_idx irt_pred
0 EMSEFHNY 0 168 176 8 0.793702 False False Oxidation@M 2 ... 1072.404037 0.189650 0.189650 254.195892 1.253140 30.0 Lumos 0 7 -10.360738
1 EMSEFHNY 0 168 176 8 0.793702 False False Oxidation@M 2 ... 536.705657 0.189650 0.189650 337.328583 0.831494 30.0 Lumos 7 14 -10.360738
2 EMSEFHNY 0 168 176 8 0.793702 False False ... 1056.409123 0.289261 0.289261 255.103699 1.257373 30.0 Lumos 14 21 4.803681
3 EMSEFHNY 0 168 176 8 0.793702 False False ... 528.708200 0.289261 0.289261 337.444641 0.831621 30.0 Lumos 21 28 4.803681
4 KDALLVGV 1 130 138 8 0.817415 False False ... 814.503280 0.433791 0.433791 256.615204 1.260001 30.0 Lumos 28 35 26.806270
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
395 TLKVSQAAAELQQY 146 110 124 14 0.891976 False False ... 775.414662 0.489545 0.489545 429.360901 1.062514 30.0 Lumos 3810 3823 35.294030
396 TLKVSQAAAELQQY 146 110 124 14 0.891976 False False ... 517.278867 0.489545 0.489545 463.231049 0.764225 30.0 Lumos 3823 3836 35.294030
397 LSPGWGAGAAGRRW 147 6 20 14 0.842583 False False ... 1441.744742 0.377743 0.377743 289.200989 1.430378 30.0 Lumos 3836 3849 18.273780
398 LSPGWGAGAAGRRW 147 6 20 14 0.842583 False False ... 721.376009 0.377743 0.377743 404.633698 1.000659 30.0 Lumos 3849 3862 18.273780
399 LSPGWGAGAAGRRW 147 6 20 14 0.842583 False False ... 481.253098 0.377743 0.377743 476.655701 0.785851 30.0 Lumos 3862 3875 18.273780

400 rows × 21 columns

Now, the predicted library can be exported in an hdf format (AlphaDIA) or translated to a tsv. The tsv translation can be very time consuming. Before the spectral library can be translated, the gene and protein column need to be mapped from the protein_df into the precursor_df.

[28]:
# hdf_path = "D:\Software\FASTA\Human\speclib_example.hdf"
# tsv_path = "D:\Software\FASTA\Human\speclib_example.tsv"
# speclib.save_hdf(hdf_path) # save as hdf speclib
[29]:
from peptdeep.spec_lib.translate import translate_to_tsv
speclib.append_protein_name()
# translate_to_tsv(speclib=speclib, tsv = tsv_path) # save as tsv speclib

4. Matching peptides back to proteins

The peptide sequnces can be matched back to proteins using annotate_precursor_df, requiring a ‘sequence’ column and a protein_df like the previously loaded fasta file. This can be done with the sequence output of any search engine or before the library is generated.

[30]:
from alphabase.protein.fasta import annotate_precursor_df
inferred_sequence_df = annotate_precursor_df(sequence_df, protein_df)
inferred_sequence_df
100%|██████████| 2/2 [00:00<00:00, 7639.90it/s]
[30]:
start_pos stop_pos nAA HLA_prob_pred sequence protein_id protein_idxes full_name gene_org gene_name is_prot_nterm is_prot_cterm genes proteins cardinality
0 168 176 8 0.793702 EMSEFHNY 0 0 0 0 0 False False A0A024RAP8_HUMAN A0A024RAP8 1
1 130 138 8 0.817415 KDALLVGV 1 1 1 1 1 False False A0A024R161_HUMAN A0A024R161 1
2 137 145 8 0.751329 VPAGSNPF 2 2 2 2 2 False False A0A024R161_HUMAN A0A024R161 1
3 170 178 8 0.940019 SEFHNYNL 3 3 3 3 3 False False A0A024RAP8_HUMAN A0A024RAP8 1
4 181 189 8 0.895964 KSDFSTRW 4 4 4 4 4 False False A0A024RAP8_HUMAN A0A024RAP8 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
143 95 109 14 0.969541 QSAEEAFLLVATAY 143 143 143 143 143 False False A0A024R161_HUMAN A0A024R161 1
144 329 343 14 0.756001 SPNLLTIIEMQKGD 144 144 144 144 144 False False A0A024RAP8_HUMAN A0A024RAP8 1
145 5 19 14 0.733784 LLSPGWGAGAAGRR 145 145 145 145 145 False False A0A024R161_HUMAN A0A024R161 1
146 110 124 14 0.891976 TLKVSQAAAELQQY 146 146 146 146 146 False False A0A024R161_HUMAN A0A024R161 1
147 6 20 14 0.842583 LSPGWGAGAAGRRW 147 147 147 147 147 False False A0A024R161_HUMAN A0A024R161 1

148 rows × 15 columns

[ ]: