Tutorial: Predicting Spectral Library from Fasta¶
[ ]:
%reload_ext autoreload
%autoreload 2
Predict fasta libray and save as HDF file using this notebook. And then use alphapeptdeep_hdf_to_tsv.ipynb to translate hdf into tsv (diann/spectronaut) format.
Prepare the data and settings¶
[ ]:
from alphabase.peptide.fragment import get_charged_frag_types
import pandas as pd
fasta_list = [
r"y:\User\Feng\fasta\uniprot_human_reviewed_20210309.fasta"
]
# output spectral library in hdf format
hdf_path = r'y:\User\Feng\speclib\human_swissprot.speclib.hdf'
protease="trypsin"
nce = 30
instrument = 'timsTOF'
add_phos=False
protease_dict = {
"trypsin": "([KR])", # this is in fact the "trypsin/P"
"lysc": "([K])",
"lysn": "\w(?=K)",
}
min_pep_len = 7
max_pep_len = 35
max_miss_cleave = 1
max_var_mods = 1
min_pep_mz = 400
max_pep_mz = 1200
precursor_charge_min = 2
precursor_charge_max = 4
var_mods = []
var_mods += ['Oxidation@M']
#var_mods += ['Phospho@S','Phospho@T','Phospho@Y']
frag_types = get_charged_frag_types(
['b','y']+
(['b_modloss','y_modloss'] if add_phos else []),
2
)
[ ]:
digest = protease_dict[protease] # Or digest = "trypsin/P"
protease and digest are designed by regular expression. alphabase provides several built-in enzymes, we don’t need to design the regular expression for most of the enzymes. Here are all the built-in enzymes:
[ ]:
from alphabase.protein.fasta import protease_dict
protease_dict
{'arg-c': 'R',
'asp-n': '\\w(?=D)',
'bnps-skatole': 'W',
'caspase 1': '(?<=[FWYL]\\w[HAT])D(?=[^PEDQKR])',
'caspase 2': '(?<=DVA)D(?=[^PEDQKR])',
'caspase 3': '(?<=DMQ)D(?=[^PEDQKR])',
'caspase 4': '(?<=LEV)D(?=[^PEDQKR])',
'caspase 5': '(?<=[LW]EH)D',
'caspase 6': '(?<=VE[HI])D(?=[^PEDQKR])',
'caspase 7': '(?<=DEV)D(?=[^PEDQKR])',
'caspase 8': '(?<=[IL]ET)D(?=[^PEDQKR])',
'caspase 9': '(?<=LEH)D',
'caspase 10': '(?<=IEA)D',
'chymotrypsin high specificity': '([FY](?=[^P]))|(W(?=[^MP]))',
'chymotrypsin low specificity': '([FLY](?=[^P]))|(W(?=[^MP]))|(M(?=[^PY]))|(H(?=[^DMPW]))',
'chymotrypsin': '([FLY](?=[^P]))|(W(?=[^MP]))|(M(?=[^PY]))|(H(?=[^DMPW]))',
'clostripain': 'R',
'cnbr': 'M',
'enterokinase': '(?<=[DE]{3})K',
'factor xa': '(?<=[AFGILTVM][DE]G)R',
'formic acid': 'D',
'glutamyl endopeptidase': 'E',
'glu-c': 'E',
'granzyme b': '(?<=IEP)D',
'hydroxylamine': 'N(?=G)',
'iodosobenzoic acid': 'W',
'lys-c': 'K',
'lys-n': '\\w(?=K)',
'ntcb': '\\w(?=C)',
'pepsin ph1.3': '((?<=[^HKR][^P])[^R](?=[FL][^P]))|((?<=[^HKR][^P])[FL](?=\\w[^P]))',
'pepsin ph2.0': '((?<=[^HKR][^P])[^R](?=[FLWY][^P]))|((?<=[^HKR][^P])[FLWY](?=\\w[^P]))',
'proline endopeptidase': '(?<=[HKR])P(?=[^P])',
'proteinase k': '[AEFILTVWY]',
'staphylococcal peptidase i': '(?<=[^E])E',
'thermolysin': '[^DE](?=[AFILMV])',
'thrombin': '((?<=G)R(?=G))|((?<=[AFGILTVM][AFGILTVWA]P)R(?=[^DE][^DE]))',
'trypsin_full': '([KR](?=[^P]))|((?<=W)K(?=P))|((?<=M)R(?=P))',
'trypsin_exception': '((?<=[CD])K(?=D))|((?<=C)K(?=[HY]))|((?<=C)R(?=K))|((?<=R)R(?=[HR]))',
'trypsin': '([KR](?=[^P]))',
'trypsin/P': '([KR])',
'non-specific': '()',
'no-cleave': '_'}
Initialize a PredictSpecLibFasta object¶
[ ]:
from peptdeep.protein.fasta import PredictSpecLibFasta
from peptdeep.pretrained_models import ModelManager
model_mgr = ModelManager(device='gpu')
model_mgr.nce = nce
model_mgr.instrument = instrument
fasta_lib = PredictSpecLibFasta(
model_mgr,
protease=digest,
charged_frag_types=frag_types,
var_mods=var_mods,
fix_mods=['Carbamidomethyl@C'],
max_missed_cleavages=max_miss_cleave,
max_var_mod_num=max_var_mods,
peptide_length_max=max_pep_len,
peptide_length_min=min_pep_len,
precursor_charge_min=precursor_charge_min,
precursor_charge_max=precursor_charge_max,
precursor_mz_min=min_pep_mz,
precursor_mz_max=max_pep_mz,
decoy=None
)
Digest¶
[ ]:
fasta_lib.get_peptides_from_fasta_list(fasta_list)
If we have a sequence DataFrame (seq_df) containing peptide sequences in the sequence column, we can skip get_peptides_from_fasta_list. Just assign seq_df to fasta_lib._precursor_df and perform all following steps.
fasta_lib._precursor_df = seq_df
Append decoy sequences and add modifications¶
[ ]:
fasta_lib.append_decoy_sequence()
fasta_lib.add_modifications()
We will get a protein DataFrame (protein_df) after digestion
[ ]:
fasta_lib.protein_df
| protein_id | full_name | gene_name | description | sequence | |
|---|---|---|---|---|---|
| 0 | Q9H9K5 | sp|Q9H9K5|MER34_HUMAN | ERVMER34-1 | sp|Q9H9K5|MER34_HUMAN Endogenous retroviral en... | MGSLSNYALLQLTLTAFLTILVQPQHLLAPVFRTLSILTNQSNCWL... |
| 1 | P04439 | sp|P04439|HLAA_HUMAN | HLA-A | sp|P04439|HLAA_HUMAN HLA class I histocompatib... | MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFFTSVSRPGRGEPRF... |
| 2 | P01911 | sp|P01911|DRB1_HUMAN | HLA-DRB1 | sp|P01911|DRB1_HUMAN HLA class II histocompati... | MVCLKLPGGSCMTALTVTLMVLSSPLALSGDTRPRFLWQPKRECHF... |
| 3 | P01889 | sp|P01889|HLAB_HUMAN | HLA-B | sp|P01889|HLAB_HUMAN HLA class I histocompatib... | MLVMAPRTVLLLLSAALALTETWAGSHSMRYFYTSVSRPGRGEPRF... |
| 4 | P31689 | sp|P31689|DNJA1_HUMAN | DNAJA1 | sp|P31689|DNJA1_HUMAN DnaJ homolog subfamily A... | MVKETTYYDVLGVKPNATQEELKKAYRKLALKYHPDKNPNEGEKFK... |
| ... | ... | ... | ... | ... | ... |
| 20391 | Q8WVZ7 | sp|Q8WVZ7|RN133_HUMAN | RNF133 | sp|Q8WVZ7|RN133_HUMAN E3 ubiquitin-protein lig... | MHLLKVGTWRNNTASSWLMKFSVLWLVSQNCCRASVVWMAYMNISF... |
| 20392 | P05387 | sp|P05387|RLA2_HUMAN | RPLP2 | sp|P05387|RLA2_HUMAN 60S acidic ribosomal prot... | MRYVASYLLAALGGNSSPSAKDIKKILDSVGIEADDDRLNKVISEL... |
| 20393 | P51991 | sp|P51991|ROA3_HUMAN | HNRNPA3 | sp|P51991|ROA3_HUMAN Heterogeneous nuclear rib... | MEVKPPPGRPQPDSGRRRRRRGEEGHDPKEPEQLRKLFIGGLSFET... |
| 20394 | Q9BZX4 | sp|Q9BZX4|ROP1B_HUMAN | ROPN1B | sp|Q9BZX4|ROP1B_HUMAN Ropporin-1B OS=Homo sapi... | MAQTDKPTCIPPELPKMLKEFAKAAIRAQPQDLIQWGADYFEALSR... |
| 20395 | P34096 | sp|P34096|RNAS4_HUMAN | RNASE4 | sp|P34096|RNAS4_HUMAN Ribonuclease 4 OS=Homo s... | MALQRTHSLLLLLLLTLLGLGLVQPSYGQDGMYQRFLRQHVHPEET... |
20396 rows × 5 columns
precursor_df contains the main information of peptides.
[ ]:
fasta_lib.precursor_df['nAA'] = fasta_lib.precursor_df.sequence.str.len()
fasta_lib.precursor_df.sort_values('nAA', inplace=True)
fasta_lib.precursor_df.reset_index(drop=True, inplace=True)
Check precursor_df after adding charge states.
[ ]:
fasta_lib.add_charge()
fasta_lib.precursor_df
| sequence | protein_idxes | miss_cleavage | is_prot_nterm | is_prot_cterm | mods | mod_sites | nAA | charge | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | RIHTGQR | 19786 | 1 | False | False | 7 | 2 | ||
| 1 | RIHTGQR | 19786 | 1 | False | False | 7 | 3 | ||
| 2 | RIHTGQR | 19786 | 1 | False | False | 7 | 4 | ||
| 3 | LVDSAYK | 12819 | 0 | False | False | 7 | 2 | ||
| 4 | LVDSAYK | 12819 | 0 | False | False | 7 | 3 | ||
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5617819 | KNQAADDDDEDLNDTNYDEFNGYAGSLFSSGPYEK | 2299 | 1 | False | False | 35 | 3 | ||
| 5617820 | KNQAADDDDEDLNDTNYDEFNGYAGSLFSSGPYEK | 2299 | 1 | False | False | 35 | 4 | ||
| 5617821 | AYDADSGFNGKVLFTISDGNTDSCFNIDMETGQLK | 10080 | 1 | False | False | Carbamidomethyl@C | 24 | 35 | 2 |
| 5617822 | AYDADSGFNGKVLFTISDGNTDSCFNIDMETGQLK | 10080 | 1 | False | False | Carbamidomethyl@C | 24 | 35 | 3 |
| 5617823 | AYDADSGFNGKVLFTISDGNTDSCFNIDMETGQLK | 10080 | 1 | False | False | Carbamidomethyl@C | 24 | 35 | 4 |
5617824 rows × 9 columns
PredictSpecLibFasta.calc_precursor_mz will append a precursor_mz column to the precursor_df.
PredictSpecLibFasta.hash_precursor_df will append mod_seq_hash and mod_seq_charge_hash columns to the precursor_df. mod_seq_hash column contains the unique signatures (np.int64) for corresponding peptides ( sequence,mods and mod_sites). mod_seq_charge_hash column contains the unique signatures (np.int64) for corresponding precursors ( sequence,mods, mod_sites and charge).
[ ]:
fasta_lib.hash_precursor_df()
fasta_lib.calc_precursor_mz()
fasta_lib.precursor_df
| sequence | protein_idxes | miss_cleavage | is_prot_nterm | is_prot_cterm | mods | mod_sites | nAA | charge | mod_seq_hash | mod_seq_charge_hash | precursor_mz | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RIHTGQR | 19786 | 1 | False | False | 7 | 2 | 471662500970219628 | 471662500970219630 | 434.249018 | ||
| 1 | PMPMPVR | 9448 | 0 | False | False | 7 | 2 | -5301076820607700090 | -5301076820607700088 | 414.216952 | ||
| 2 | PMPMPVR | 9448 | 0 | False | False | Oxidation@M | 4 | 7 | 2 | 6057464136741449831 | 6057464136741449833 | 422.214409 |
| 3 | PMPMPVR | 9448 | 0 | False | False | Oxidation@M | 2 | 7 | 2 | -6431722582867031756 | -6431722582867031754 | 422.214409 |
| 4 | QEWFCTR | 12819 | 0 | False | False | Carbamidomethyl@C | 5 | 7 | 2 | -7409729050206298801 | -7409729050206298799 | 513.726727 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3654202 | NLTYVRGSVGPATSTLMFVAGVVGNGLALGILSAR | 978 | 1 | False | False | 35 | 4 | 7192344052213098704 | 7192344052213098708 | 866.228888 | ||
| 3654203 | NLTYVRGSVGPATSTLMFVAGVVGNGLALGILSAR | 978 | 1 | False | False | Oxidation@M | 17 | 35 | 3 | -1485306056792248111 | -1485306056792248108 | 1159.967730 |
| 3654204 | NLTYVRGSVGPATSTLMFVAGVVGNGLALGILSAR | 978 | 1 | False | False | Oxidation@M | 17 | 35 | 4 | -1485306056792248111 | -1485306056792248107 | 870.227616 |
| 3654205 | KNQAADDDDEDLNDTNYDEFNGYAGSLFSSGPYEK | 2299 | 1 | False | False | 35 | 4 | 5191231126132273751 | 5191231126132273755 | 976.910866 | ||
| 3654206 | AYDADSGFNGKVLFTISDGNTDSCFNIDMETGQLK | 10080 | 1 | False | False | Carbamidomethyl@C | 24 | 35 | 4 | -7707559913944666938 | -7707559913944666934 | 958.434460 |
3654207 rows × 12 columns
Predict MS2/RT/CCS(mobility)¶
[ ]:
fasta_lib.precursor_df['instrument'] = model_mgr.instrument
fasta_lib.precursor_df['nce'] = model_mgr.nce
res = fasta_lib.model_manager.predict_all(
fasta_lib.precursor_df,
predict_items=['rt','mobility','ms2'],
frag_types = frag_types,
)
fasta_lib.set_precursor_and_fragment(
**res
)
2022-08-03 14:14:41> Predicting RT ...
100%|██████████| 29/29 [01:30<00:00, 3.11s/it]
2022-08-03 14:16:12> Predicting mobility ...
100%|██████████| 29/29 [01:31<00:00, 3.14s/it]
2022-08-03 14:18:10> Predicting MS2 ...
100%|██████████| 29/29 [04:53<00:00, 10.13s/it]
Check memory usage for the library
[ ]:
import os, psutil
import numpy as np
process = psutil.Process(os.getpid())
print(f'{len(fasta_lib.precursor_df)*1e-6:.2f}M precursors with {np.prod(fasta_lib.fragment_mz_df.values.shape, dtype=float)*(1e-6):.2f}M fragments used {process.memory_info().rss/1024**3:.4f} GB memory')
3.65M precursors with 241.62M fragments used 6.9472 GB memory
The predicted fragment intensities
[ ]:
fasta_lib.fragment_intensity_df
| b_z1 | b_z2 | y_z1 | y_z2 | |
|---|---|---|---|---|
| 0 | 0.000000 | 0.0 | 0.611678 | 0.0 |
| 1 | 0.056326 | 0.0 | 1.000000 | 0.0 |
| 2 | 0.437313 | 0.0 | 0.729849 | 0.0 |
| 3 | 0.219575 | 0.0 | 0.292181 | 0.0 |
| 4 | 0.346306 | 0.0 | 0.033992 | 0.0 |
| ... | ... | ... | ... | ... |
| 60404997 | 0.000000 | 0.0 | 0.322072 | 0.0 |
| 60404998 | 0.000000 | 0.0 | 0.206371 | 0.0 |
| 60404999 | 0.000000 | 0.0 | 0.033532 | 0.0 |
| 60405000 | 0.000000 | 0.0 | 0.040032 | 0.0 |
| 60405001 | 0.000000 | 0.0 | 0.000000 | 0.0 |
60405002 rows × 4 columns
The calculated fragment m/z values
[ ]:
fasta_lib.fragment_mz_df
| b_z1 | b_z2 | y_z1 | y_z2 | |
|---|---|---|---|---|
| 0 | 157.108387 | 79.057832 | 711.389648 | 356.198462 |
| 1 | 270.192451 | 135.599864 | 598.305584 | 299.656430 |
| 2 | 407.251363 | 204.129320 | 461.246672 | 231.126974 |
| 3 | 508.299042 | 254.653159 | 360.198993 | 180.603135 |
| 4 | 565.320506 | 283.163891 | 303.177530 | 152.092403 |
| ... | ... | ... | ... | ... |
| 60404997 | 3285.398701 | 1643.202989 | 546.324588 | 273.665932 |
| 60404998 | 3386.446379 | 1693.726828 | 445.276909 | 223.142093 |
| 60404999 | 3443.467843 | 1722.237560 | 388.255446 | 194.631361 |
| 60405000 | 3571.526420 | 1786.266848 | 260.196868 | 130.602072 |
| 60405001 | 3684.610484 | 1842.808880 | 147.112804 | 74.060040 |
60405002 rows × 4 columns
PredictSpecLibFasta.rt_to_irt_pred will translate the predicted RT values to iRT values (rt_pred to irt_pred). This is useful for DiaNN and Spectronaut search.
[ ]:
fasta_lib.translate_rt_to_irt_pred()
fasta_lib.precursor_df
| sequence | protein_idxes | miss_cleavage | is_prot_nterm | is_prot_cterm | mods | mod_sites | nAA | charge | mod_seq_hash | ... | precursor_mz | instrument | nce | rt_pred | rt_norm_pred | ccs_pred | mobility_pred | frag_stop_idx | frag_start_idx | irt_pred | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RIHTGQR | 19786 | 1 | False | False | 7 | 2 | 471662500970219628 | ... | 434.249018 | timsTOF | 30 | 0.115377 | 0.115377 | 315.529022 | 0.775438 | 6 | 0 | -37.187631 | ||
| 1 | PMPMPVR | 9448 | 0 | False | False | 7 | 2 | -5301076820607700090 | ... | 414.216952 | timsTOF | 30 | 0.208976 | 0.208976 | 304.965790 | 0.748912 | 12 | 6 | -16.331142 | ||
| 2 | PMPMPVR | 9448 | 0 | False | False | Oxidation@M | 4 | 7 | 2 | 6057464136741449831 | ... | 422.214409 | timsTOF | 30 | 0.158058 | 0.158058 | 304.080536 | 0.746970 | 18 | 12 | -27.677099 |
| 3 | PMPMPVR | 9448 | 0 | False | False | Oxidation@M | 2 | 7 | 2 | -6431722582867031756 | ... | 422.214409 | timsTOF | 30 | 0.157143 | 0.157143 | 305.825348 | 0.751256 | 24 | 18 | -27.881022 |
| 4 | QEWFCTR | 12819 | 0 | False | False | Carbamidomethyl@C | 5 | 7 | 2 | -7409729050206298801 | ... | 513.726727 | timsTOF | 30 | 0.423747 | 0.423747 | 330.547638 | 0.814317 | 30 | 24 | 31.526291 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3654202 | NLTYVRGSVGPATSTLMFVAGVVGNGLALGILSAR | 978 | 1 | False | False | 35 | 4 | 7192344052213098704 | ... | 866.228888 | timsTOF | 30 | 0.831350 | 0.831350 | 891.748413 | 1.108824 | 60404866 | 60404832 | 122.352159 | ||
| 3654203 | NLTYVRGSVGPATSTLMFVAGVVGNGLALGILSAR | 978 | 1 | False | False | Oxidation@M | 17 | 35 | 3 | -1485306056792248111 | ... | 1159.967730 | timsTOF | 30 | 0.826977 | 0.826977 | 785.478699 | 1.302269 | 60404900 | 60404866 | 121.377815 |
| 3654204 | NLTYVRGSVGPATSTLMFVAGVVGNGLALGILSAR | 978 | 1 | False | False | Oxidation@M | 17 | 35 | 4 | -1485306056792248111 | ... | 870.227616 | timsTOF | 30 | 0.826977 | 0.826977 | 892.459656 | 1.109729 | 60404934 | 60404900 | 121.377815 |
| 3654205 | KNQAADDDDEDLNDTNYDEFNGYAGSLFSSGPYEK | 2299 | 1 | False | False | 35 | 4 | 5191231126132273751 | ... | 976.910866 | timsTOF | 30 | 0.670129 | 0.670129 | 791.322266 | 0.984398 | 60404968 | 60404934 | 86.427514 | ||
| 3654206 | AYDADSGFNGKVLFTISDGNTDSCFNIDMETGQLK | 10080 | 1 | False | False | Carbamidomethyl@C | 24 | 35 | 4 | -7707559913944666938 | ... | 958.434460 | timsTOF | 30 | 0.725150 | 0.725150 | 823.819214 | 1.024754 | 60405002 | 60404968 | 98.687774 |
3654207 rows × 21 columns
Save the library into a HDF5 (.hdf) file¶
[ ]:
fasta_lib.save_hdf(hdf_path)
hdf_path
'y:\\User\\Feng\\speclib\\human_swissprot.speclib.hdf'
Now use alphapeptdeep_hdf_to_tsv.ipynb to translate hdf into TSV (diann/spectronaut) format. Translation is quite slow because writing TSV file is slow for large libraries.