Using peptdeep for MHC class I immunopeptidomics¶

This notebook introduces how to generate spectral libraries for immunopeptidomics analysis from a list of protein sequences. This entails several steps:

unspecific digestion of protein sequences
selection of peptide sequences used for library prediction by peptdeep-hla predicition 2.1 using the pretrained model 2.2 using an improved model by including a transfer learning step
spectral library prediction
matching the peptides back to the proteins (this can be done before or after library prediction or seach)

Note that pydivsufsort package is not installed by peptdeep by default. Install by:

pip install "peptdeep[development,hla]"

Or install within jupyter notebook:

[1]:

%pip install -q pydivsufsort

Note: you may need to restart the kernel to use updated packages.

[2]:

import torch # noqa: 401, to prevent crash in Mac Arm

1. Unspecific digestion in alphabase¶

The unspecific digestion workflow uses the longest common prefix (LCP) algorithm, which is based on suffix array data structure, has been proven to be very efficient for unspecific digestion [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-577]. Here we used pydivsufsort, a Python wrapper of a high-performance C library libdivsufsort [https://github.com/y-256/libdivsufsort], to facilitate LCP-based digestion.

This means, the digestion is performed on a single sequence of strings and retrives both the peptide sequence as well as the start and stop indices of the peptide within the complete sequence. Therefore, unspecific digestion in alphabase involves two steps:

concatenation of protein sequences into a single sequence
unspecific digestion

1.1 Concatenate protein sequences into a single sequence¶

The protein sequences are concatenated into a single sequence. The sequences are seperated by a sentinel character, in this case ‘$’, so that no peptides across proteins are formed. Note that the first and last sentinel characters are crutial as well.

[3]:

from peptdeep.hla.hla_utils import load_prot_df
fasta_path = "data/example.fasta"
protein_df = load_prot_df(fasta_path)
protein_df

OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.

[3]:

	protein_id	full_name	gene_name	gene_org	description	sequence	nAA
tr\|A0A024R161\|A0A024R161_HUMAN	A0A024R161	tr\|A0A024R161\|A0A024R161_HUMAN	DNAJC25-GNG10	A0A024R161_HUMAN	tr\|A0A024R161\|A0A024R161_HUMAN Guanine nucleot...	MGAPLLSPGWGAGAAGRRWWMLLAPLLPALLLVRPAGALVEGLYCG...	153
tr\|A0A024RAP8\|A0A024RAP8_HUMAN	A0A024RAP8	tr\|A0A024RAP8\|A0A024RAP8_HUMAN	KLRC4-KLRK1	A0A024RAP8_HUMAN	tr\|A0A024RAP8\|A0A024RAP8_HUMAN HCG2009644, iso...	MGWIRGRRSRHSWEMSEFHNYNLDLKKSDFSTRWQKQRCPVVKSKC...	216

[4]:

from peptdeep.hla.hla_utils import cat_proteins
cat_sequence = cat_proteins(protein_df["sequence"])
cat_sequence

[4]:

'$MGAPLLSPGWGAGAAGRRWWMLLAPLLPALLLVRPAGALVEGLYCGTRDCYEVLGVSRSAGKAEIARAYRQLARRYHPDRYRPQPGDEGPGRTPQSAEEAFLLVATAYETLKVSQAAAELQQYCMQNACKDALLVGVPAGSNPFREPRSCALL$MGWIRGRRSRHSWEMSEFHNYNLDLKKSDFSTRWQKQRCPVVKSKCRENASPFFFCCFIAVAMGIRFIIMVTIWSAVFLNSLFNQEVQIPLTESYCGPCPKNWICYKNNCYQFFDESKNWYESQASCMSQNASLLKVYSKEDQDLLKLVKSYHWMGLVHIPTNGSWQWEDGSILSPNLLTIIEMQKGDCALYASSFKGYIENCSTPNTYICMQRTV$'

1.2 Unspecific digestion¶

Use alphabase.protein.lcp_digest.get_substring_indices to get all non-redundant non-specific peptide sequences from the concatenated protein sequence. The digested peptide sequences are stored in a dataframe based on their start and stop indices in the concantenated protein sequence string. To save the RAM, the peptdeep.hla module works on start and stop indices instead of on peptide sequences directly. This will save about 8 times of the RAM for HLA-I peptides (length from 7 to 14, deomnstrated below). For a large protein sequence database, there will be millions of unspecific peptides, so working with strings is not feasible for a complete human fasta due to the requirements of extremely large RAM. (~ 70M unspecific sequences from the reviewed swissprot fasta require ~ 4-5 GB RAM already).

Using the get_substring_indices function we extract the start and stop indices of all peptide sequences between 7 and 14 aa (min_len, max_len) from the concatenated protein sequences. All peptides sequences are unique, guranteed by the LCP algorithm.

[5]:

from alphabase.protein.lcp_digest import get_substring_indices
import pandas as pd
import sys

start_idxes, stop_idxes = get_substring_indices(
    cat_sequence, min_len=8, max_len=14, stop_char="$"
)
digest_pos_df = pd.DataFrame({
    "start_pos": start_idxes,
    "stop_pos": stop_idxes,
})
digest_pos_df

[5]:

	start_pos	stop_pos
0	1	9
1	1	10
2	1	11
3	1	12
4	1	13
...	...	...
2438	361	370
2439	361	371
2440	362	370
2441	362	371
2442	363	371

2443 rows × 2 columns

[6]:

RAM_use_idxes = sys.getsizeof(digest_pos_df)*1e-6

The unspecific peptide sequences can be localted by the start_pos and stop_pos.

[7]:

digest_pos_df["sequence"] = digest_pos_df[
    ["start_pos","stop_pos"]
].apply(lambda x: cat_sequence[slice(*x)], axis=1)
digest_pos_df

[7]:

	start_pos	stop_pos	sequence
0	1	9	MGAPLLSP
1	1	10	MGAPLLSPG
2	1	11	MGAPLLSPGW
3	1	12	MGAPLLSPGWG
4	1	13	MGAPLLSPGWGA
...	...	...	...
2438	361	370	NTYICMQRT
2439	361	371	NTYICMQRTV
2440	362	370	TYICMQRT
2441	362	371	TYICMQRTV
2442	363	371	YICMQRTV

2443 rows × 3 columns

[8]:

RAM_use_seqs = sys.getsizeof(digest_pos_df["sequence"])*1e-6

[9]:

f"seq RAM = {RAM_use_seqs:.5f} Mb, idxes RAM = {RAM_use_idxes:.5f}, ratio = {RAM_use_seqs/RAM_use_idxes:.5f}"

[9]:

'seq RAM = 0.16623 Mb, idxes RAM = 0.01971, ratio = 8.43475'

2. Selection of peptide sequences used for library prediction¶

The digest_prot_df contains all unspecifically digested peptide sequences between 7 and 14 aa generatable from the concatenated protein sequences. This list is reduced using a HLA1_Binding_Classifier from peptdeep.hla.hla_class1. Two different model architectures are available, an LSTM model (HLA_Class_I_LSTM) and a BERT model (HLA_Class_I_BERT). A pretrained model is only available for the LSTM model architecture. The HLA1_Binding_Classifer can be used with a pretrained model, tuned with existing peptide data or trained from scratch. Training of a new model should be considered carefully and will not be covered in this tutorial.

Selection of peptide sequences for library predicition using the pretrained model can be done in a few steps. First, the Classifier model needs to be initialized and the pretrained model is loaded. Next, we can use any kind of dataframe containing peptide sequences to predict how likely there are HLA peptides, the only requirement beeing that the column containing the peptides is called ‘sequence’.

[10]:

from peptdeep.hla.hla_class1 import HLA1_Binding_Classifier

model = HLA1_Binding_Classifier()
model.load_pretrained_hla_model()
manual_prediction = model.predict(digest_pos_df)
manual_prediction

[10]:

	start_pos	stop_pos	sequence	nAA	HLA_prob_pred
0	1	9	MGAPLLSP	8	0.239477
1	145	153	REPRSCAL	8	0.061692
2	146	154	EPRSCALL	8	0.137313
3	155	163	MGWIRGRR	8	0.056462
4	156	164	GWIRGRRS	8	0.001298
...	...	...	...	...	...
2438	112	126	KVSQAAAELQQYCM	14	0.243115
2439	317	331	NGSWQWEDGSILSP	14	0.021114
2440	79	93	DRYRPQPGDEGPGR	14	0.060634
2441	113	127	VSQAAAELQQYCMQ	14	0.355900
2442	190	204	KQRCPVVKSKCREN	14	0.000362

2443 rows × 5 columns

Next, we can filter the list based on the HLA_prob_pred. The higher the probability, the more likely it is for the peptide sequence to be present in a immunopeptidomics sample. It is not recommended to use a cut-off below 0.7 as this inflates the spectral library. It is rather recommended to use more conservative cut-offs.

[11]:

manual_prediction[manual_prediction['HLA_prob_pred'] > 0.7]

[11]:

	start_pos	stop_pos	sequence	nAA	HLA_prob_pred
17	168	176	EMSEFHNY	8	0.793702
24	130	138	KDALLVGV	8	0.817415
31	137	145	VPAGSNPF	8	0.751329
37	170	178	SEFHNYNL	8	0.940019
67	181	189	KSDFSTRW	8	0.895964
...	...	...	...	...	...
2318	95	109	QSAEEAFLLVATAY	14	0.969541
2378	329	343	SPNLLTIIEMQKGD	14	0.756001
2382	5	19	LLSPGWGAGAAGRR	14	0.733784
2408	110	124	TLKVSQAAAELQQY	14	0.891976
2419	6	20	LSPGWGAGAAGRRW	14	0.842583

148 rows × 5 columns

As described above, directly using the sequences for classification can be memory intense for large lists of sequences. Thereby, the manual concatenation, unspecific digestion, predicition and filtering is only suggested for small sets of proteins or integration of selected sequences (e.g mutations, nuORFs etc.). This can be circumvented by directly predicting and filtering from a fasta using model.predict_from_proteins(). This executes the concatenation, unspecific digestion, predicition and filtering automatically in batches. Thereby the whole process can be done more efficient and be performed without a specialized computation infrastructure.

[12]:

sequence_df = model.predict_from_proteins(protein_df, prob_threshold=0.7)
sequence_df

100%|██████████| 1/1 [00:01<00:00,  1.27s/it]

[12]:

	start_pos	stop_pos	nAA	HLA_prob_pred	sequence
0	168	176	8	0.793702	EMSEFHNY
1	130	138	8	0.817415	KDALLVGV
2	137	145	8	0.751329	VPAGSNPF
3	170	178	8	0.940019	SEFHNYNL
4	181	189	8	0.895964	KSDFSTRW
...	...	...	...	...	...
143	95	109	14	0.969541	QSAEEAFLLVATAY
144	329	343	14	0.756001	SPNLLTIIEMQKGD
145	5	19	14	0.733784	LLSPGWGAGAAGRR
146	110	124	14	0.891976	TLKVSQAAAELQQY
147	6	20	14	0.842583	LSPGWGAGAAGRRW

148 rows × 5 columns

To perform transferlearning we need a list of peptide sequences we expect to be present in our sample. These peptides can be retrived from several different sources like DDA or directDIA search results. It is recommended to use at the very least 1000 sequences for transferlearning. The more sequences available the better the transferlearning step works. The model performance can be assessed after transferlearning and should be assessed before predicition.

First, the Classifier model needs to be initialized and the pretrained model is loaded. Next, a protein dataframe is added, in this example the previousely loaded fasta file. The protein dataframe is used by the Classifier internaly to draw negative training data during model training and testing.

[13]:

model = HLA1_Binding_Classifier()
model.load_pretrained_hla_model()
model.load_proteins(fasta_path)

Next, we load the peptide sequences wee use for transferlearning and split it into a training and testing dataset. This step is very important to assess the model performance after transferlearning. Here, we use the digest_pos_df generated above. As these are no immunopeptides, but a list of unspecifically digested proteins, the model performance will not improve, but the pronciples remain the same.

@ Feng should we include a example file so that the model is actually improved or just use this?

[14]:

test_seq_df = digest_pos_df.sample(frac=0.2)
train_seq_df = digest_pos_df.drop(index=test_seq_df.index)
len(train_seq_df), len(test_seq_df)

[14]:

(1954, 489)

Now, we train the model using the training sequence dataframe. In this example we use 10 training epochs, in a real experiment more should be used. Good starting points are 40 epochs for a training dataset of around 10000 sequences or 100 epochs for a training dataset of around 1000 sequences. For a real experiment the warmup_epochs can be increased to 10.

[15]:

model.train(train_seq_df,
            epoch=10, warmup_epoch=5,
            verbose=True)

2024-07-23 14:22:06> Training with fixed sequence length: 0
[Training] Epoch=1, lr=4e-05, loss=1.39779794216156
[Training] Epoch=2, lr=6e-05, loss=1.0070140702383858
[Training] Epoch=3, lr=8e-05, loss=0.7982760497501918
[Training] Epoch=4, lr=0.0001, loss=0.7397338407380241
[Training] Epoch=5, lr=0.0001, loss=0.7099559647696358
[Training] Epoch=6, lr=9.045084971874738e-05, loss=0.7016251683235168
[Training] Epoch=7, lr=6.545084971874738e-05, loss=0.6965694086892265
[Training] Epoch=8, lr=3.4549150281252636e-05, loss=0.697939566203526
[Training] Epoch=9, lr=9.549150281252633e-06, loss=0.6959438664572579
[Training] Epoch=10, lr=1.0000000000000002e-14, loss=0.6928229417119708

We can assess the model performance after transferlearning using the model.test() function on the training and testing data. This can also be done before transferlearning to assess how well the model fits the available data already. The test assesses the precision, recall and fals positive rate of the model at different probability cut offs. As a rule of thumb a false postitve rate above 7% (@FENG adjust in case lower/higher) is not recomendable because the peptide list gets disproportionally larger, leading to lower IDs during the search. In case of a high false postitive rate, the probability cut off at which the peptides are predicted should be increased.

[16]:

model.test(train_seq_df)

[16]:

	HLA_prob_pred	precision	recall	false_positive
0	0.5	0.511434	0.595189	0.568577
1	0.6	0.416667	0.017912	0.025077
2	0.7	0.333333	0.000512	0.001024
3	0.8	NaN	0.000000	0.000000
4	0.9	NaN	0.000000	0.000000

[17]:

model.test(test_seq_df)

[17]:

	HLA_prob_pred	precision	recall	false_positive
0	0.5	0.450192	0.480573	0.586912
1	0.6	0.470588	0.016360	0.018405
2	0.7	NaN	0.000000	0.000000
3	0.8	NaN	0.000000	0.000000
4	0.9	NaN	0.000000	0.000000

After transferlearning and testing the new model, peptides can be predicted as with the pretrained model.

[18]:

model.predict_from_proteins(fasta_path, prob_threshold=0.6)

100%|██████████| 1/1 [00:01<00:00,  1.32s/it]

[18]:

	start_pos	stop_pos	nAA	HLA_prob_pred	sequence
0	170	178	8	0.711809	SEFHNYNL
1	62	70	8	0.627015	KAEIARAY
2	106	114	8	0.628822	TAYETLKV
3	299	307	8	0.605544	LLKLVKSY
4	346	354	8	0.646759	YASSFKGY
5	258	266	8	0.624555	ICYKNNCY
6	294	303	9	0.610476	KEDQDLLKL
7	298	307	9	0.645020	DLLKLVKSY
8	235	244	9	0.629079	SLFNQEVQI
9	257	266	9	0.623247	WICYKNNCY
10	267	276	9	0.611738	FFDESKNWY
11	17	26	9	0.605875	RRWWMLLAP
12	327	336	9	0.616737	ILSPNLLTI
13	74	83	9	0.611590	RRYHPDRYR
14	344	354	10	0.662783	ALYASSFKGY
15	232	242	10	0.651600	FLNSLFNQEV
16	221	231	10	0.617175	FIIMVTIWSA
17	222	232	10	0.600623	IIMVTIWSAV
18	74	84	10	0.614895	RRYHPDRYRP
19	221	232	11	0.608950	FIIMVTIWSAV
20	353	364	11	0.613787	YIENCSTPNTY
21	74	85	11	0.605368	RRYHPDRYRPQ
22	112	124	12	0.612270	KVSQAAAELQQY
23	42	54	12	0.607715	GLYCGTRDCYEV
24	351	363	12	0.616891	KGYIENCSTPNT
25	74	86	12	0.602210	RRYHPDRYRPQP
26	86	99	13	0.644656	GDEGPGRTPQSAE
27	351	364	13	0.603497	KGYIENCSTPNTY
28	73	86	13	0.622453	ARRYHPDRYRPQP
29	74	87	13	0.611441	RRYHPDRYRPQPG
30	334	347	13	0.604354	TIIEMQKGDCALY
31	141	154	13	0.601309	SNPFREPRSCALL
32	32	45	13	0.622797	LVRPAGALVEGLY
33	130	143	13	0.604786	KDALLVGVPAGSN
34	333	347	14	0.613545	LTIIEMQKGDCALY
35	60	74	14	0.607648	AGKAEIARAYRQLA
36	85	99	14	0.606241	PGDEGPGRTPQSAE
37	229	243	14	0.606759	SAVFLNSLFNQEVQ
38	86	100	14	0.622891	GDEGPGRTPQSAEE
39	167	181	14	0.611953	WEMSEFHNYNLDLK
40	117	131	14	0.619257	AAELQQYCMQNACK
41	73	87	14	0.608767	ARRYHPDRYRPQPG
42	329	343	14	0.600299	SPNLLTIIEMQKGD

3. Spectral library prediciton¶

Now the spectral library for the filtered peptide list can be predicted using PredictSpecLibFasta. First, one needs to select the models for rt/ccs/ms2 prediction using the ModelManager. One can select from a set of pretrained models or load externally trained models. Here we load the ‘HLA’ model (at the moment this still loads the generic model, but in the futer this is supposed to be replaced by an HLA specfic internal model).

[19]:

from peptdeep.spec_lib.predict_lib import  ModelManager
from peptdeep.protein.fasta import PredictSpecLibFasta

model_mgr = ModelManager()
model_mgr.load_installed_models(model_type='HLA')

In the next step, the PredictSpecLibFasta is initialized using the preloaded model. The presettings here are selected for the prediction of tryptic libraries so some parameters need to be adjusted, in particular precursor_charge_min, precursor_charge_max. By default Carbamidomethylation is set as a fixed modification (fix_mod) and Acetylation and Oxidation are set as variable modifications (var_mod). Those can be removed by adding an empty list as shown for the variable modifications.

Of note, PredictSpecLibFasta can also be used to predict a library from a fasta file. Therfore one can also set the protease (default trypsin) and the minimum and maximum peptide length (7 to 35). Wee dont need to change those parameters here, as we wont make use of the digestion functions but rather provide a already digested sequence table.

[20]:

speclib = PredictSpecLibFasta(model_manager=model_mgr,
                              precursor_charge_min=1,
                              precursor_charge_max=3,
                              fix_mods=[])

To reduce the size of the dataframe and predicted library we give each peptide sequence a unique protein identifier (number). This enables the use of search engines that rely on protein information (such as AlphaDIA) but one needs to keep in mind to remove filtering steps based on how many peptides per protein are identified during data analysis. Alternatively, proteins of the peptide sequences may originate from can be infered using alphabase.protein.fasta.annotate_precursor_df() (demonstrated below).

[21]:

sequence_df['protein_id'] = [str(i) for i in range(len(sequence_df))]
sequence_df['protein_idxes'] = sequence_df.protein_id.astype("U")
sequence_df['full_name'] = sequence_df['protein_id']
sequence_df['gene_org'] = sequence_df['protein_id']
sequence_df['gene_name'] = sequence_df['protein_id']
sequence_df["is_prot_nterm"] = False
sequence_df["is_prot_cterm"] = False
sequence_df

[21]:

	start_pos	stop_pos	nAA	HLA_prob_pred	sequence	protein_id	protein_idxes	full_name	gene_org	gene_name	is_prot_nterm	is_prot_cterm
0	168	176	8	0.793702	EMSEFHNY	0	0	0	0	0	False	False
1	130	138	8	0.817415	KDALLVGV	1	1	1	1	1	False	False
2	137	145	8	0.751329	VPAGSNPF	2	2	2	2	2	False	False
3	170	178	8	0.940019	SEFHNYNL	3	3	3	3	3	False	False
4	181	189	8	0.895964	KSDFSTRW	4	4	4	4	4	False	False
...	...	...	...	...	...	...	...	...	...	...	...	...
143	95	109	14	0.969541	QSAEEAFLLVATAY	143	143	143	143	143	False	False
144	329	343	14	0.756001	SPNLLTIIEMQKGD	144	144	144	144	144	False	False
145	5	19	14	0.733784	LLSPGWGAGAAGRR	145	145	145	145	145	False	False
146	110	124	14	0.891976	TLKVSQAAAELQQY	146	146	146	146	146	False	False
147	6	20	14	0.842583	LSPGWGAGAAGRRW	147	147	147	147	147	False	False

148 rows × 12 columns

The sequence dataframe contains all the relevant information to be passed to the protein_df and the precursor_df.

[22]:

speclib.protein_df = sequence_df[
    ["sequence","protein_id","nAA", 'full_name', 'gene_org', 'gene_name']
].copy()
speclib.protein_df

[22]:

	sequence	protein_id	nAA	full_name	gene_org	gene_name
0	EMSEFHNY	0	8	0	0	0
1	KDALLVGV	1	8	1	1	1
2	VPAGSNPF	2	8	2	2	2
3	SEFHNYNL	3	8	3	3	3
4	KSDFSTRW	4	8	4	4	4
...	...	...	...	...	...	...
143	QSAEEAFLLVATAY	143	14	143	143	143
144	SPNLLTIIEMQKGD	144	14	144	144	144
145	LLSPGWGAGAAGRR	145	14	145	145	145
146	TLKVSQAAAELQQY	146	14	146	146	146
147	LSPGWGAGAAGRRW	147	14	147	147	147

148 rows × 6 columns

[23]:

speclib.precursor_df = sequence_df[
    ["sequence","protein_idxes","start_pos","stop_pos",
     "nAA","HLA_prob_pred", 'is_prot_nterm', 'is_prot_cterm']
].copy()
speclib.precursor_df

[23]:

	sequence	protein_idxes	start_pos	stop_pos	nAA	HLA_prob_pred	is_prot_nterm	is_prot_cterm
0	EMSEFHNY	0	168	176	8	0.793702	False	False
1	KDALLVGV	1	130	138	8	0.817415	False	False
2	VPAGSNPF	2	137	145	8	0.751329	False	False
3	SEFHNYNL	3	170	178	8	0.940019	False	False
4	KSDFSTRW	4	181	189	8	0.895964	False	False
...	...	...	...	...	...	...	...	...
143	QSAEEAFLLVATAY	143	95	109	14	0.969541	False	False
144	SPNLLTIIEMQKGD	144	329	343	14	0.756001	False	False
145	LLSPGWGAGAAGRR	145	5	19	14	0.733784	False	False
146	TLKVSQAAAELQQY	146	110	124	14	0.891976	False	False
147	LSPGWGAGAAGRRW	147	6	20	14	0.842583	False	False

148 rows × 8 columns

[24]:

speclib.precursor_df

[24]:

	sequence	protein_idxes	start_pos	stop_pos	nAA	HLA_prob_pred	is_prot_nterm	is_prot_cterm
0	EMSEFHNY	0	168	176	8	0.793702	False	False
1	KDALLVGV	1	130	138	8	0.817415	False	False
2	VPAGSNPF	2	137	145	8	0.751329	False	False
3	SEFHNYNL	3	170	178	8	0.940019	False	False
4	KSDFSTRW	4	181	189	8	0.895964	False	False
...	...	...	...	...	...	...	...	...
143	QSAEEAFLLVATAY	143	95	109	14	0.969541	False	False
144	SPNLLTIIEMQKGD	144	329	343	14	0.756001	False	False
145	LLSPGWGAGAAGRR	145	5	19	14	0.733784	False	False
146	TLKVSQAAAELQQY	146	110	124	14	0.891976	False	False
147	LSPGWGAGAAGRRW	147	6	20	14	0.842583	False	False

148 rows × 8 columns

Next, the modifications and charges can be added to the peptide dataframe using add_modifications and add_charge. This creates a unique entry for every combination of charge and modification for all the sequences in the precursor dataframe.

[25]:

speclib.add_modifications()
speclib.add_charge()
speclib.precursor_df

[25]:

	sequence	protein_idxes	start_pos	stop_pos	nAA	HLA_prob_pred	is_prot_nterm	is_prot_cterm	mods	mod_sites	charge
0	EMSEFHNY	0	168	176	8	0.793702	False	False	Oxidation@M	2	1
1	EMSEFHNY	0	168	176	8	0.793702	False	False	Oxidation@M	2	2
2	EMSEFHNY	0	168	176	8	0.793702	False	False	Oxidation@M	2	3
3	EMSEFHNY	0	168	176	8	0.793702	False	False			1
4	EMSEFHNY	0	168	176	8	0.793702	False	False			2
...	...	...	...	...	...	...	...	...	...	...	...
493	TLKVSQAAAELQQY	146	110	124	14	0.891976	False	False			2
494	TLKVSQAAAELQQY	146	110	124	14	0.891976	False	False			3
495	LSPGWGAGAAGRRW	147	6	20	14	0.842583	False	False			1
496	LSPGWGAGAAGRRW	147	6	20	14	0.842583	False	False			2
497	LSPGWGAGAAGRRW	147	6	20	14	0.842583	False	False			3

498 rows × 11 columns

Now ccs, rt and ms2 can be predicted for each entry

[26]:

speclib.predict_all()

2024-07-23 14:22:43> Predicting RT/IM/MS2 for 400 precursors ...
2024-07-23 14:22:43> Predicting RT ...

100%|██████████| 7/7 [00:00<00:00, 27.54it/s]

2024-07-23 14:22:43> Predicting mobility ...


100%|██████████| 7/7 [00:00<00:00, 50.06it/s]

2024-07-23 14:22:44> Predicting MS2 ...


100%|██████████| 7/7 [00:00<00:00, 23.73it/s]

2024-07-23 14:22:44> End predicting RT/IM/MS2

iRTs can be added using translate_rt_to_irt_pred. This is not neccessary for search engines like DIA-NN or AlphaDIA but required for Spectronaut.

[27]:

speclib.translate_rt_to_irt_pred()

Predict RT for 11 iRT precursors.
Linear regression of `rt_pred` to `irt`:
   R_square         R       slope  intercept  test_num
0   0.99007  0.995022  152.235639 -39.232164        11

[27]:

	sequence	protein_idxes	start_pos	stop_pos	nAA	HLA_prob_pred	is_prot_nterm	is_prot_cterm	mods	mod_sites	...	precursor_mz	rt_pred	rt_norm_pred	ccs_pred	mobility_pred	nce	instrument	frag_start_idx	frag_stop_idx	irt_pred
0	EMSEFHNY	0	168	176	8	0.793702	False	False	Oxidation@M	2	...	1072.404037	0.189650	0.189650	254.195892	1.253140	30.0	Lumos	0	7	-10.360738
1	EMSEFHNY	0	168	176	8	0.793702	False	False	Oxidation@M	2	...	536.705657	0.189650	0.189650	337.328583	0.831494	30.0	Lumos	7	14	-10.360738
2	EMSEFHNY	0	168	176	8	0.793702	False	False			...	1056.409123	0.289261	0.289261	255.103699	1.257373	30.0	Lumos	14	21	4.803681
3	EMSEFHNY	0	168	176	8	0.793702	False	False			...	528.708200	0.289261	0.289261	337.444641	0.831621	30.0	Lumos	21	28	4.803681
4	KDALLVGV	1	130	138	8	0.817415	False	False			...	814.503280	0.433791	0.433791	256.615204	1.260001	30.0	Lumos	28	35	26.806270
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
395	TLKVSQAAAELQQY	146	110	124	14	0.891976	False	False			...	775.414662	0.489545	0.489545	429.360901	1.062514	30.0	Lumos	3810	3823	35.294030
396	TLKVSQAAAELQQY	146	110	124	14	0.891976	False	False			...	517.278867	0.489545	0.489545	463.231049	0.764225	30.0	Lumos	3823	3836	35.294030
397	LSPGWGAGAAGRRW	147	6	20	14	0.842583	False	False			...	1441.744742	0.377743	0.377743	289.200989	1.430378	30.0	Lumos	3836	3849	18.273780
398	LSPGWGAGAAGRRW	147	6	20	14	0.842583	False	False			...	721.376009	0.377743	0.377743	404.633698	1.000659	30.0	Lumos	3849	3862	18.273780
399	LSPGWGAGAAGRRW	147	6	20	14	0.842583	False	False			...	481.253098	0.377743	0.377743	476.655701	0.785851	30.0	Lumos	3862	3875	18.273780

400 rows × 21 columns

Now, the predicted library can be exported in an hdf format (AlphaDIA) or translated to a tsv. The tsv translation can be very time consuming. Before the spectral library can be translated, the gene and protein column need to be mapped from the protein_df into the precursor_df.

[28]:

# hdf_path = "D:\Software\FASTA\Human\speclib_example.hdf"
# tsv_path = "D:\Software\FASTA\Human\speclib_example.tsv"
# speclib.save_hdf(hdf_path) # save as hdf speclib

[29]:

from peptdeep.spec_lib.translate import translate_to_tsv
speclib.append_protein_name()
# translate_to_tsv(speclib=speclib, tsv = tsv_path) # save as tsv speclib

4. Matching peptides back to proteins¶

The peptide sequnces can be matched back to proteins using annotate_precursor_df, requiring a ‘sequence’ column and a protein_df like the previously loaded fasta file. This can be done with the sequence output of any search engine or before the library is generated.

[30]:

from alphabase.protein.fasta import annotate_precursor_df
inferred_sequence_df = annotate_precursor_df(sequence_df, protein_df)
inferred_sequence_df

100%|██████████| 2/2 [00:00<00:00, 7639.90it/s]

[30]:

	start_pos	stop_pos	nAA	HLA_prob_pred	sequence	protein_id	protein_idxes	full_name	gene_org	gene_name	is_prot_nterm	is_prot_cterm	genes	proteins	cardinality
0	168	176	8	0.793702	EMSEFHNY	0	0	0	0	0	False	False	A0A024RAP8_HUMAN	A0A024RAP8	1
1	130	138	8	0.817415	KDALLVGV	1	1	1	1	1	False	False	A0A024R161_HUMAN	A0A024R161	1
2	137	145	8	0.751329	VPAGSNPF	2	2	2	2	2	False	False	A0A024R161_HUMAN	A0A024R161	1
3	170	178	8	0.940019	SEFHNYNL	3	3	3	3	3	False	False	A0A024RAP8_HUMAN	A0A024RAP8	1
4	181	189	8	0.895964	KSDFSTRW	4	4	4	4	4	False	False	A0A024RAP8_HUMAN	A0A024RAP8	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
143	95	109	14	0.969541	QSAEEAFLLVATAY	143	143	143	143	143	False	False	A0A024R161_HUMAN	A0A024R161	1
144	329	343	14	0.756001	SPNLLTIIEMQKGD	144	144	144	144	144	False	False	A0A024RAP8_HUMAN	A0A024RAP8	1
145	5	19	14	0.733784	LLSPGWGAGAAGRR	145	145	145	145	145	False	False	A0A024R161_HUMAN	A0A024R161	1
146	110	124	14	0.891976	TLKVSQAAAELQQY	146	146	146	146	146	False	False	A0A024R161_HUMAN	A0A024R161	1
147	6	20	14	0.842583	LSPGWGAGAAGRRW	147	147	147	147	147	False	False	A0A024R161_HUMAN	A0A024R161	1

148 rows × 15 columns

[ ]: