Tutorial: Predicting Spectral Library from Fasta

[ ]:
%reload_ext autoreload
%autoreload 2

Predict fasta libray and save as HDF file using this notebook. And then use alphapeptdeep_hdf_to_tsv.ipynb to translate hdf into tsv (diann/spectronaut) format.

Prepare the data and settings

[ ]:
from alphabase.peptide.fragment import get_charged_frag_types
import pandas as pd

fasta_list = [
    r"y:\User\Feng\fasta\uniprot_human_reviewed_20210309.fasta"
]
# output spectral library in hdf format
hdf_path = r'y:\User\Feng\speclib\human_swissprot.speclib.hdf'

protease="trypsin"
nce = 30
instrument = 'timsTOF'

add_phos=False

protease_dict = {
    "trypsin": "([KR])", # this is in fact the "trypsin/P"
    "lysc": "([K])",
    "lysn": "\w(?=K)",
}
min_pep_len = 7
max_pep_len = 35
max_miss_cleave = 1
max_var_mods = 1
min_pep_mz = 400
max_pep_mz = 1200
precursor_charge_min = 2
precursor_charge_max = 4

var_mods = []
var_mods += ['Oxidation@M']
#var_mods += ['Phospho@S','Phospho@T','Phospho@Y']


frag_types = get_charged_frag_types(
    ['b','y']+
    (['b_modloss','y_modloss'] if add_phos else []),
    2
)
[ ]:
digest = protease_dict[protease] # Or digest = "trypsin/P"

protease and digest are designed by regular expression. alphabase provides several built-in enzymes, we don’t need to design the regular expression for most of the enzymes. Here are all the built-in enzymes:

[ ]:
from alphabase.protein.fasta import protease_dict
protease_dict
{'arg-c': 'R',
 'asp-n': '\\w(?=D)',
 'bnps-skatole': 'W',
 'caspase 1': '(?<=[FWYL]\\w[HAT])D(?=[^PEDQKR])',
 'caspase 2': '(?<=DVA)D(?=[^PEDQKR])',
 'caspase 3': '(?<=DMQ)D(?=[^PEDQKR])',
 'caspase 4': '(?<=LEV)D(?=[^PEDQKR])',
 'caspase 5': '(?<=[LW]EH)D',
 'caspase 6': '(?<=VE[HI])D(?=[^PEDQKR])',
 'caspase 7': '(?<=DEV)D(?=[^PEDQKR])',
 'caspase 8': '(?<=[IL]ET)D(?=[^PEDQKR])',
 'caspase 9': '(?<=LEH)D',
 'caspase 10': '(?<=IEA)D',
 'chymotrypsin high specificity': '([FY](?=[^P]))|(W(?=[^MP]))',
 'chymotrypsin low specificity': '([FLY](?=[^P]))|(W(?=[^MP]))|(M(?=[^PY]))|(H(?=[^DMPW]))',
 'chymotrypsin': '([FLY](?=[^P]))|(W(?=[^MP]))|(M(?=[^PY]))|(H(?=[^DMPW]))',
 'clostripain': 'R',
 'cnbr': 'M',
 'enterokinase': '(?<=[DE]{3})K',
 'factor xa': '(?<=[AFGILTVM][DE]G)R',
 'formic acid': 'D',
 'glutamyl endopeptidase': 'E',
 'glu-c': 'E',
 'granzyme b': '(?<=IEP)D',
 'hydroxylamine': 'N(?=G)',
 'iodosobenzoic acid': 'W',
 'lys-c': 'K',
 'lys-n': '\\w(?=K)',
 'ntcb': '\\w(?=C)',
 'pepsin ph1.3': '((?<=[^HKR][^P])[^R](?=[FL][^P]))|((?<=[^HKR][^P])[FL](?=\\w[^P]))',
 'pepsin ph2.0': '((?<=[^HKR][^P])[^R](?=[FLWY][^P]))|((?<=[^HKR][^P])[FLWY](?=\\w[^P]))',
 'proline endopeptidase': '(?<=[HKR])P(?=[^P])',
 'proteinase k': '[AEFILTVWY]',
 'staphylococcal peptidase i': '(?<=[^E])E',
 'thermolysin': '[^DE](?=[AFILMV])',
 'thrombin': '((?<=G)R(?=G))|((?<=[AFGILTVM][AFGILTVWA]P)R(?=[^DE][^DE]))',
 'trypsin_full': '([KR](?=[^P]))|((?<=W)K(?=P))|((?<=M)R(?=P))',
 'trypsin_exception': '((?<=[CD])K(?=D))|((?<=C)K(?=[HY]))|((?<=C)R(?=K))|((?<=R)R(?=[HR]))',
 'trypsin': '([KR](?=[^P]))',
 'trypsin/P': '([KR])',
 'non-specific': '()',
 'no-cleave': '_'}

Initialize a PredictSpecLibFasta object

[ ]:
from peptdeep.protein.fasta import PredictSpecLibFasta
from peptdeep.pretrained_models import ModelManager

model_mgr = ModelManager(device='gpu')

model_mgr.nce = nce
model_mgr.instrument = instrument

fasta_lib = PredictSpecLibFasta(
    model_mgr,
    protease=digest,
    charged_frag_types=frag_types,
    var_mods=var_mods,
    fix_mods=['Carbamidomethyl@C'],
    max_missed_cleavages=max_miss_cleave,
    max_var_mod_num=max_var_mods,
    peptide_length_max=max_pep_len,
    peptide_length_min=min_pep_len,
    precursor_charge_min=precursor_charge_min,
    precursor_charge_max=precursor_charge_max,
    precursor_mz_min=min_pep_mz,
    precursor_mz_max=max_pep_mz,
    decoy=None
)

Digest

[ ]:
fasta_lib.get_peptides_from_fasta_list(fasta_list)

If we have a sequence DataFrame (seq_df) containing peptide sequences in the sequence column, we can skip get_peptides_from_fasta_list. Just assign seq_df to fasta_lib._precursor_df and perform all following steps.

fasta_lib._precursor_df = seq_df

Append decoy sequences and add modifications

[ ]:
fasta_lib.append_decoy_sequence()
fasta_lib.add_modifications()

We will get a protein DataFrame (protein_df) after digestion

[ ]:
fasta_lib.protein_df
protein_id full_name gene_name description sequence
0 Q9H9K5 sp|Q9H9K5|MER34_HUMAN ERVMER34-1 sp|Q9H9K5|MER34_HUMAN Endogenous retroviral en... MGSLSNYALLQLTLTAFLTILVQPQHLLAPVFRTLSILTNQSNCWL...
1 P04439 sp|P04439|HLAA_HUMAN HLA-A sp|P04439|HLAA_HUMAN HLA class I histocompatib... MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFFTSVSRPGRGEPRF...
2 P01911 sp|P01911|DRB1_HUMAN HLA-DRB1 sp|P01911|DRB1_HUMAN HLA class II histocompati... MVCLKLPGGSCMTALTVTLMVLSSPLALSGDTRPRFLWQPKRECHF...
3 P01889 sp|P01889|HLAB_HUMAN HLA-B sp|P01889|HLAB_HUMAN HLA class I histocompatib... MLVMAPRTVLLLLSAALALTETWAGSHSMRYFYTSVSRPGRGEPRF...
4 P31689 sp|P31689|DNJA1_HUMAN DNAJA1 sp|P31689|DNJA1_HUMAN DnaJ homolog subfamily A... MVKETTYYDVLGVKPNATQEELKKAYRKLALKYHPDKNPNEGEKFK...
... ... ... ... ... ...
20391 Q8WVZ7 sp|Q8WVZ7|RN133_HUMAN RNF133 sp|Q8WVZ7|RN133_HUMAN E3 ubiquitin-protein lig... MHLLKVGTWRNNTASSWLMKFSVLWLVSQNCCRASVVWMAYMNISF...
20392 P05387 sp|P05387|RLA2_HUMAN RPLP2 sp|P05387|RLA2_HUMAN 60S acidic ribosomal prot... MRYVASYLLAALGGNSSPSAKDIKKILDSVGIEADDDRLNKVISEL...
20393 P51991 sp|P51991|ROA3_HUMAN HNRNPA3 sp|P51991|ROA3_HUMAN Heterogeneous nuclear rib... MEVKPPPGRPQPDSGRRRRRRGEEGHDPKEPEQLRKLFIGGLSFET...
20394 Q9BZX4 sp|Q9BZX4|ROP1B_HUMAN ROPN1B sp|Q9BZX4|ROP1B_HUMAN Ropporin-1B OS=Homo sapi... MAQTDKPTCIPPELPKMLKEFAKAAIRAQPQDLIQWGADYFEALSR...
20395 P34096 sp|P34096|RNAS4_HUMAN RNASE4 sp|P34096|RNAS4_HUMAN Ribonuclease 4 OS=Homo s... MALQRTHSLLLLLLLTLLGLGLVQPSYGQDGMYQRFLRQHVHPEET...

20396 rows × 5 columns

precursor_df contains the main information of peptides.

[ ]:
fasta_lib.precursor_df['nAA'] = fasta_lib.precursor_df.sequence.str.len()
fasta_lib.precursor_df.sort_values('nAA', inplace=True)
fasta_lib.precursor_df.reset_index(drop=True, inplace=True)

Check precursor_df after adding charge states.

[ ]:
fasta_lib.add_charge()
fasta_lib.precursor_df
sequence protein_idxes miss_cleavage is_prot_nterm is_prot_cterm mods mod_sites nAA charge
0 RIHTGQR 19786 1 False False 7 2
1 RIHTGQR 19786 1 False False 7 3
2 RIHTGQR 19786 1 False False 7 4
3 LVDSAYK 12819 0 False False 7 2
4 LVDSAYK 12819 0 False False 7 3
... ... ... ... ... ... ... ... ... ...
5617819 KNQAADDDDEDLNDTNYDEFNGYAGSLFSSGPYEK 2299 1 False False 35 3
5617820 KNQAADDDDEDLNDTNYDEFNGYAGSLFSSGPYEK 2299 1 False False 35 4
5617821 AYDADSGFNGKVLFTISDGNTDSCFNIDMETGQLK 10080 1 False False Carbamidomethyl@C 24 35 2
5617822 AYDADSGFNGKVLFTISDGNTDSCFNIDMETGQLK 10080 1 False False Carbamidomethyl@C 24 35 3
5617823 AYDADSGFNGKVLFTISDGNTDSCFNIDMETGQLK 10080 1 False False Carbamidomethyl@C 24 35 4

5617824 rows × 9 columns

PredictSpecLibFasta.calc_precursor_mz will append a precursor_mz column to the precursor_df.

PredictSpecLibFasta.hash_precursor_df will append mod_seq_hash and mod_seq_charge_hash columns to the precursor_df. mod_seq_hash column contains the unique signatures (np.int64) for corresponding peptides ( sequence,mods and mod_sites). mod_seq_charge_hash column contains the unique signatures (np.int64) for corresponding precursors ( sequence,mods, mod_sites and charge).

[ ]:
fasta_lib.hash_precursor_df()
fasta_lib.calc_precursor_mz()
fasta_lib.precursor_df
sequence protein_idxes miss_cleavage is_prot_nterm is_prot_cterm mods mod_sites nAA charge mod_seq_hash mod_seq_charge_hash precursor_mz
0 RIHTGQR 19786 1 False False 7 2 471662500970219628 471662500970219630 434.249018
1 PMPMPVR 9448 0 False False 7 2 -5301076820607700090 -5301076820607700088 414.216952
2 PMPMPVR 9448 0 False False Oxidation@M 4 7 2 6057464136741449831 6057464136741449833 422.214409
3 PMPMPVR 9448 0 False False Oxidation@M 2 7 2 -6431722582867031756 -6431722582867031754 422.214409
4 QEWFCTR 12819 0 False False Carbamidomethyl@C 5 7 2 -7409729050206298801 -7409729050206298799 513.726727
... ... ... ... ... ... ... ... ... ... ... ... ...
3654202 NLTYVRGSVGPATSTLMFVAGVVGNGLALGILSAR 978 1 False False 35 4 7192344052213098704 7192344052213098708 866.228888
3654203 NLTYVRGSVGPATSTLMFVAGVVGNGLALGILSAR 978 1 False False Oxidation@M 17 35 3 -1485306056792248111 -1485306056792248108 1159.967730
3654204 NLTYVRGSVGPATSTLMFVAGVVGNGLALGILSAR 978 1 False False Oxidation@M 17 35 4 -1485306056792248111 -1485306056792248107 870.227616
3654205 KNQAADDDDEDLNDTNYDEFNGYAGSLFSSGPYEK 2299 1 False False 35 4 5191231126132273751 5191231126132273755 976.910866
3654206 AYDADSGFNGKVLFTISDGNTDSCFNIDMETGQLK 10080 1 False False Carbamidomethyl@C 24 35 4 -7707559913944666938 -7707559913944666934 958.434460

3654207 rows × 12 columns

Predict MS2/RT/CCS(mobility)

[ ]:
fasta_lib.precursor_df['instrument'] = model_mgr.instrument
fasta_lib.precursor_df['nce'] = model_mgr.nce
res = fasta_lib.model_manager.predict_all(
    fasta_lib.precursor_df,
    predict_items=['rt','mobility','ms2'],
    frag_types = frag_types,
)
fasta_lib.set_precursor_and_fragment(
    **res
)
2022-08-03 14:14:41> Predicting RT ...
100%|██████████| 29/29 [01:30<00:00,  3.11s/it]
2022-08-03 14:16:12> Predicting mobility ...

100%|██████████| 29/29 [01:31<00:00,  3.14s/it]
2022-08-03 14:18:10> Predicting MS2 ...
100%|██████████| 29/29 [04:53<00:00, 10.13s/it]

Check memory usage for the library

[ ]:
import os, psutil
import numpy as np
process = psutil.Process(os.getpid())
print(f'{len(fasta_lib.precursor_df)*1e-6:.2f}M precursors with {np.prod(fasta_lib.fragment_mz_df.values.shape, dtype=float)*(1e-6):.2f}M fragments used {process.memory_info().rss/1024**3:.4f} GB memory')
3.65M precursors with 241.62M fragments used 6.9472 GB memory

The predicted fragment intensities

[ ]:
fasta_lib.fragment_intensity_df
b_z1 b_z2 y_z1 y_z2
0 0.000000 0.0 0.611678 0.0
1 0.056326 0.0 1.000000 0.0
2 0.437313 0.0 0.729849 0.0
3 0.219575 0.0 0.292181 0.0
4 0.346306 0.0 0.033992 0.0
... ... ... ... ...
60404997 0.000000 0.0 0.322072 0.0
60404998 0.000000 0.0 0.206371 0.0
60404999 0.000000 0.0 0.033532 0.0
60405000 0.000000 0.0 0.040032 0.0
60405001 0.000000 0.0 0.000000 0.0

60405002 rows × 4 columns

The calculated fragment m/z values

[ ]:
fasta_lib.fragment_mz_df
b_z1 b_z2 y_z1 y_z2
0 157.108387 79.057832 711.389648 356.198462
1 270.192451 135.599864 598.305584 299.656430
2 407.251363 204.129320 461.246672 231.126974
3 508.299042 254.653159 360.198993 180.603135
4 565.320506 283.163891 303.177530 152.092403
... ... ... ... ...
60404997 3285.398701 1643.202989 546.324588 273.665932
60404998 3386.446379 1693.726828 445.276909 223.142093
60404999 3443.467843 1722.237560 388.255446 194.631361
60405000 3571.526420 1786.266848 260.196868 130.602072
60405001 3684.610484 1842.808880 147.112804 74.060040

60405002 rows × 4 columns

PredictSpecLibFasta.rt_to_irt_pred will translate the predicted RT values to iRT values (rt_pred to irt_pred). This is useful for DiaNN and Spectronaut search.

[ ]:
fasta_lib.translate_rt_to_irt_pred()
fasta_lib.precursor_df
sequence protein_idxes miss_cleavage is_prot_nterm is_prot_cterm mods mod_sites nAA charge mod_seq_hash ... precursor_mz instrument nce rt_pred rt_norm_pred ccs_pred mobility_pred frag_stop_idx frag_start_idx irt_pred
0 RIHTGQR 19786 1 False False 7 2 471662500970219628 ... 434.249018 timsTOF 30 0.115377 0.115377 315.529022 0.775438 6 0 -37.187631
1 PMPMPVR 9448 0 False False 7 2 -5301076820607700090 ... 414.216952 timsTOF 30 0.208976 0.208976 304.965790 0.748912 12 6 -16.331142
2 PMPMPVR 9448 0 False False Oxidation@M 4 7 2 6057464136741449831 ... 422.214409 timsTOF 30 0.158058 0.158058 304.080536 0.746970 18 12 -27.677099
3 PMPMPVR 9448 0 False False Oxidation@M 2 7 2 -6431722582867031756 ... 422.214409 timsTOF 30 0.157143 0.157143 305.825348 0.751256 24 18 -27.881022
4 QEWFCTR 12819 0 False False Carbamidomethyl@C 5 7 2 -7409729050206298801 ... 513.726727 timsTOF 30 0.423747 0.423747 330.547638 0.814317 30 24 31.526291
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3654202 NLTYVRGSVGPATSTLMFVAGVVGNGLALGILSAR 978 1 False False 35 4 7192344052213098704 ... 866.228888 timsTOF 30 0.831350 0.831350 891.748413 1.108824 60404866 60404832 122.352159
3654203 NLTYVRGSVGPATSTLMFVAGVVGNGLALGILSAR 978 1 False False Oxidation@M 17 35 3 -1485306056792248111 ... 1159.967730 timsTOF 30 0.826977 0.826977 785.478699 1.302269 60404900 60404866 121.377815
3654204 NLTYVRGSVGPATSTLMFVAGVVGNGLALGILSAR 978 1 False False Oxidation@M 17 35 4 -1485306056792248111 ... 870.227616 timsTOF 30 0.826977 0.826977 892.459656 1.109729 60404934 60404900 121.377815
3654205 KNQAADDDDEDLNDTNYDEFNGYAGSLFSSGPYEK 2299 1 False False 35 4 5191231126132273751 ... 976.910866 timsTOF 30 0.670129 0.670129 791.322266 0.984398 60404968 60404934 86.427514
3654206 AYDADSGFNGKVLFTISDGNTDSCFNIDMETGQLK 10080 1 False False Carbamidomethyl@C 24 35 4 -7707559913944666938 ... 958.434460 timsTOF 30 0.725150 0.725150 823.819214 1.024754 60405002 60404968 98.687774

3654207 rows × 21 columns

Save the library into a HDF5 (.hdf) file

[ ]:
fasta_lib.save_hdf(hdf_path)
hdf_path
'y:\\User\\Feng\\speclib\\human_swissprot.speclib.hdf'

Now use alphapeptdeep_hdf_to_tsv.ipynb to translate hdf into TSV (diann/spectronaut) format. Translation is quite slow because writing TSV file is slow for large libraries.