Tutorial: Predicting Spectral Library from Fasta¶

[ ]:

%reload_ext autoreload
%autoreload 2

Predict fasta libray and save as HDF file using this notebook. And then use alphapeptdeep_hdf_to_tsv.ipynb to translate hdf into tsv (diann/spectronaut) format.

Prepare the data and settings¶

[ ]:

from alphabase.peptide.fragment import get_charged_frag_types
import pandas as pd

fasta_list = [
    r"y:\User\Feng\fasta\uniprot_human_reviewed_20210309.fasta"
]
# output spectral library in hdf format
hdf_path = r'y:\User\Feng\speclib\human_swissprot.speclib.hdf'

protease="trypsin"
nce = 30
instrument = 'timsTOF'

add_phos=False

protease_dict = {
    "trypsin": "([KR])", # this is in fact the "trypsin/P"
    "lysc": "([K])",
    "lysn": "\w(?=K)",
}
min_pep_len = 7
max_pep_len = 35
max_miss_cleave = 1
max_var_mods = 1
min_pep_mz = 400
max_pep_mz = 1200
precursor_charge_min = 2
precursor_charge_max = 4

var_mods = []
var_mods += ['Oxidation@M']
#var_mods += ['Phospho@S','Phospho@T','Phospho@Y']


frag_types = get_charged_frag_types(
    ['b','y']+
    (['b_modloss','y_modloss'] if add_phos else []),
    2
)

[ ]:

digest = protease_dict[protease] # Or digest = "trypsin/P"

protease and digest are designed by regular expression. alphabase provides several built-in enzymes, we don’t need to design the regular expression for most of the enzymes. Here are all the built-in enzymes:

[ ]:

from alphabase.protein.fasta import protease_dict
protease_dict

{'arg-c': 'R',
 'asp-n': '\\w(?=D)',
 'bnps-skatole': 'W',
 'caspase 1': '(?<=[FWYL]\\w[HAT])D(?=[^PEDQKR])',
 'caspase 2': '(?<=DVA)D(?=[^PEDQKR])',
 'caspase 3': '(?<=DMQ)D(?=[^PEDQKR])',
 'caspase 4': '(?<=LEV)D(?=[^PEDQKR])',
 'caspase 5': '(?<=[LW]EH)D',
 'caspase 6': '(?<=VE[HI])D(?=[^PEDQKR])',
 'caspase 7': '(?<=DEV)D(?=[^PEDQKR])',
 'caspase 8': '(?<=[IL]ET)D(?=[^PEDQKR])',
 'caspase 9': '(?<=LEH)D',
 'caspase 10': '(?<=IEA)D',
 'chymotrypsin high specificity': '([FY](?=[^P]))|(W(?=[^MP]))',
 'chymotrypsin low specificity': '([FLY](?=[^P]))|(W(?=[^MP]))|(M(?=[^PY]))|(H(?=[^DMPW]))',
 'chymotrypsin': '([FLY](?=[^P]))|(W(?=[^MP]))|(M(?=[^PY]))|(H(?=[^DMPW]))',
 'clostripain': 'R',
 'cnbr': 'M',
 'enterokinase': '(?<=[DE]{3})K',
 'factor xa': '(?<=[AFGILTVM][DE]G)R',
 'formic acid': 'D',
 'glutamyl endopeptidase': 'E',
 'glu-c': 'E',
 'granzyme b': '(?<=IEP)D',
 'hydroxylamine': 'N(?=G)',
 'iodosobenzoic acid': 'W',
 'lys-c': 'K',
 'lys-n': '\\w(?=K)',
 'ntcb': '\\w(?=C)',
 'pepsin ph1.3': '((?<=[^HKR][^P])[^R](?=[FL][^P]))|((?<=[^HKR][^P])[FL](?=\\w[^P]))',
 'pepsin ph2.0': '((?<=[^HKR][^P])[^R](?=[FLWY][^P]))|((?<=[^HKR][^P])[FLWY](?=\\w[^P]))',
 'proline endopeptidase': '(?<=[HKR])P(?=[^P])',
 'proteinase k': '[AEFILTVWY]',
 'staphylococcal peptidase i': '(?<=[^E])E',
 'thermolysin': '[^DE](?=[AFILMV])',
 'thrombin': '((?<=G)R(?=G))|((?<=[AFGILTVM][AFGILTVWA]P)R(?=[^DE][^DE]))',
 'trypsin_full': '([KR](?=[^P]))|((?<=W)K(?=P))|((?<=M)R(?=P))',
 'trypsin_exception': '((?<=[CD])K(?=D))|((?<=C)K(?=[HY]))|((?<=C)R(?=K))|((?<=R)R(?=[HR]))',
 'trypsin': '([KR](?=[^P]))',
 'trypsin/P': '([KR])',
 'non-specific': '()',
 'no-cleave': '_'}

Initialize a `PredictSpecLibFasta` object¶

[ ]:

from peptdeep.protein.fasta import PredictSpecLibFasta
from peptdeep.pretrained_models import ModelManager

model_mgr = ModelManager(device='gpu')

model_mgr.nce = nce
model_mgr.instrument = instrument

fasta_lib = PredictSpecLibFasta(
    model_mgr,
    protease=digest,
    charged_frag_types=frag_types,
    var_mods=var_mods,
    fix_mods=['Carbamidomethyl@C'],
    max_missed_cleavages=max_miss_cleave,
    max_var_mod_num=max_var_mods,
    peptide_length_max=max_pep_len,
    peptide_length_min=min_pep_len,
    precursor_charge_min=precursor_charge_min,
    precursor_charge_max=precursor_charge_max,
    precursor_mz_min=min_pep_mz,
    precursor_mz_max=max_pep_mz,
    decoy=None
)

Digest¶

[ ]:

fasta_lib.get_peptides_from_fasta_list(fasta_list)

If we have a sequence DataFrame (seq_df) containing peptide sequences in the sequence column, we can skip get_peptides_from_fasta_list. Just assign seq_df to fasta_lib._precursor_df and perform all following steps.

fasta_lib._precursor_df = seq_df

Append decoy sequences and add modifications¶

[ ]:

fasta_lib.append_decoy_sequence()
fasta_lib.add_modifications()

We will get a protein DataFrame (protein_df) after digestion

[ ]:

fasta_lib.protein_df

	protein_id	full_name	gene_name	description	sequence
0	Q9H9K5	sp\|Q9H9K5\|MER34_HUMAN	ERVMER34-1	sp\|Q9H9K5\|MER34_HUMAN Endogenous retroviral en...	MGSLSNYALLQLTLTAFLTILVQPQHLLAPVFRTLSILTNQSNCWL...
1	P04439	sp\|P04439\|HLAA_HUMAN	HLA-A	sp\|P04439\|HLAA_HUMAN HLA class I histocompatib...	MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFFTSVSRPGRGEPRF...
2	P01911	sp\|P01911\|DRB1_HUMAN	HLA-DRB1	sp\|P01911\|DRB1_HUMAN HLA class II histocompati...	MVCLKLPGGSCMTALTVTLMVLSSPLALSGDTRPRFLWQPKRECHF...
3	P01889	sp\|P01889\|HLAB_HUMAN	HLA-B	sp\|P01889\|HLAB_HUMAN HLA class I histocompatib...	MLVMAPRTVLLLLSAALALTETWAGSHSMRYFYTSVSRPGRGEPRF...
4	P31689	sp\|P31689\|DNJA1_HUMAN	DNAJA1	sp\|P31689\|DNJA1_HUMAN DnaJ homolog subfamily A...	MVKETTYYDVLGVKPNATQEELKKAYRKLALKYHPDKNPNEGEKFK...
...	...	...	...	...	...
20391	Q8WVZ7	sp\|Q8WVZ7\|RN133_HUMAN	RNF133	sp\|Q8WVZ7\|RN133_HUMAN E3 ubiquitin-protein lig...	MHLLKVGTWRNNTASSWLMKFSVLWLVSQNCCRASVVWMAYMNISF...
20392	P05387	sp\|P05387\|RLA2_HUMAN	RPLP2	sp\|P05387\|RLA2_HUMAN 60S acidic ribosomal prot...	MRYVASYLLAALGGNSSPSAKDIKKILDSVGIEADDDRLNKVISEL...
20393	P51991	sp\|P51991\|ROA3_HUMAN	HNRNPA3	sp\|P51991\|ROA3_HUMAN Heterogeneous nuclear rib...	MEVKPPPGRPQPDSGRRRRRRGEEGHDPKEPEQLRKLFIGGLSFET...
20394	Q9BZX4	sp\|Q9BZX4\|ROP1B_HUMAN	ROPN1B	sp\|Q9BZX4\|ROP1B_HUMAN Ropporin-1B OS=Homo sapi...	MAQTDKPTCIPPELPKMLKEFAKAAIRAQPQDLIQWGADYFEALSR...
20395	P34096	sp\|P34096\|RNAS4_HUMAN	RNASE4	sp\|P34096\|RNAS4_HUMAN Ribonuclease 4 OS=Homo s...	MALQRTHSLLLLLLLTLLGLGLVQPSYGQDGMYQRFLRQHVHPEET...

20396 rows × 5 columns

precursor_df contains the main information of peptides.

[ ]:

fasta_lib.precursor_df['nAA'] = fasta_lib.precursor_df.sequence.str.len()
fasta_lib.precursor_df.sort_values('nAA', inplace=True)
fasta_lib.precursor_df.reset_index(drop=True, inplace=True)

Check precursor_df after adding charge states.

[ ]:

fasta_lib.add_charge()
fasta_lib.precursor_df

	sequence	protein_idxes	miss_cleavage	is_prot_nterm	is_prot_cterm	mods	mod_sites	nAA	charge
0	RIHTGQR	19786	1	False	False			7	2
1	RIHTGQR	19786	1	False	False			7	3
2	RIHTGQR	19786	1	False	False			7	4
3	LVDSAYK	12819	0	False	False			7	2
4	LVDSAYK	12819	0	False	False			7	3
...	...	...	...	...	...	...	...	...	...
5617819	KNQAADDDDEDLNDTNYDEFNGYAGSLFSSGPYEK	2299	1	False	False			35	3
5617820	KNQAADDDDEDLNDTNYDEFNGYAGSLFSSGPYEK	2299	1	False	False			35	4
5617821	AYDADSGFNGKVLFTISDGNTDSCFNIDMETGQLK	10080	1	False	False	Carbamidomethyl@C	24	35	2
5617822	AYDADSGFNGKVLFTISDGNTDSCFNIDMETGQLK	10080	1	False	False	Carbamidomethyl@C	24	35	3
5617823	AYDADSGFNGKVLFTISDGNTDSCFNIDMETGQLK	10080	1	False	False	Carbamidomethyl@C	24	35	4

5617824 rows × 9 columns

PredictSpecLibFasta.calc_precursor_mz will append a precursor_mz column to the precursor_df.

PredictSpecLibFasta.hash_precursor_df will append mod_seq_hash and mod_seq_charge_hash columns to the precursor_df. mod_seq_hash column contains the unique signatures (np.int64) for corresponding peptides ( sequence,mods and mod_sites). mod_seq_charge_hash column contains the unique signatures (np.int64) for corresponding precursors ( sequence,mods, mod_sites and charge).

[ ]:

fasta_lib.hash_precursor_df()
fasta_lib.calc_precursor_mz()
fasta_lib.precursor_df

	sequence	protein_idxes	miss_cleavage	is_prot_nterm	is_prot_cterm	mods	mod_sites	nAA	charge	mod_seq_hash	mod_seq_charge_hash	precursor_mz
0	RIHTGQR	19786	1	False	False			7	2	471662500970219628	471662500970219630	434.249018
1	PMPMPVR	9448	0	False	False			7	2	-5301076820607700090	-5301076820607700088	414.216952
2	PMPMPVR	9448	0	False	False	Oxidation@M	4	7	2	6057464136741449831	6057464136741449833	422.214409
3	PMPMPVR	9448	0	False	False	Oxidation@M	2	7	2	-6431722582867031756	-6431722582867031754	422.214409
4	QEWFCTR	12819	0	False	False	Carbamidomethyl@C	5	7	2	-7409729050206298801	-7409729050206298799	513.726727
...	...	...	...	...	...	...	...	...	...	...	...	...
3654202	NLTYVRGSVGPATSTLMFVAGVVGNGLALGILSAR	978	1	False	False			35	4	7192344052213098704	7192344052213098708	866.228888
3654203	NLTYVRGSVGPATSTLMFVAGVVGNGLALGILSAR	978	1	False	False	Oxidation@M	17	35	3	-1485306056792248111	-1485306056792248108	1159.967730
3654204	NLTYVRGSVGPATSTLMFVAGVVGNGLALGILSAR	978	1	False	False	Oxidation@M	17	35	4	-1485306056792248111	-1485306056792248107	870.227616
3654205	KNQAADDDDEDLNDTNYDEFNGYAGSLFSSGPYEK	2299	1	False	False			35	4	5191231126132273751	5191231126132273755	976.910866
3654206	AYDADSGFNGKVLFTISDGNTDSCFNIDMETGQLK	10080	1	False	False	Carbamidomethyl@C	24	35	4	-7707559913944666938	-7707559913944666934	958.434460

3654207 rows × 12 columns

Predict MS2/RT/CCS(mobility)¶

[ ]:

fasta_lib.precursor_df['instrument'] = model_mgr.instrument
fasta_lib.precursor_df['nce'] = model_mgr.nce
res = fasta_lib.model_manager.predict_all(
    fasta_lib.precursor_df,
    predict_items=['rt','mobility','ms2'],
    frag_types = frag_types,
)
fasta_lib.set_precursor_and_fragment(
    **res
)

2022-08-03 14:14:41> Predicting RT ...

100%|██████████| 29/29 [01:30<00:00,  3.11s/it]

2022-08-03 14:16:12> Predicting mobility ...


100%|██████████| 29/29 [01:31<00:00,  3.14s/it]

2022-08-03 14:18:10> Predicting MS2 ...

100%|██████████| 29/29 [04:53<00:00, 10.13s/it]

Check memory usage for the library

[ ]:

import os, psutil
import numpy as np
process = psutil.Process(os.getpid())
print(f'{len(fasta_lib.precursor_df)*1e-6:.2f}M precursors with {np.prod(fasta_lib.fragment_mz_df.values.shape, dtype=float)*(1e-6):.2f}M fragments used {process.memory_info().rss/1024**3:.4f} GB memory')

3.65M precursors with 241.62M fragments used 6.9472 GB memory

The predicted fragment intensities

[ ]:

fasta_lib.fragment_intensity_df

	b_z1	b_z2	y_z1	y_z2
0	0.000000	0.0	0.611678	0.0
1	0.056326	0.0	1.000000	0.0
2	0.437313	0.0	0.729849	0.0
3	0.219575	0.0	0.292181	0.0
4	0.346306	0.0	0.033992	0.0
...	...	...	...	...
60404997	0.000000	0.0	0.322072	0.0
60404998	0.000000	0.0	0.206371	0.0
60404999	0.000000	0.0	0.033532	0.0
60405000	0.000000	0.0	0.040032	0.0
60405001	0.000000	0.0	0.000000	0.0

60405002 rows × 4 columns

The calculated fragment m/z values

[ ]:

fasta_lib.fragment_mz_df

	b_z1	b_z2	y_z1	y_z2
0	157.108387	79.057832	711.389648	356.198462
1	270.192451	135.599864	598.305584	299.656430
2	407.251363	204.129320	461.246672	231.126974
3	508.299042	254.653159	360.198993	180.603135
4	565.320506	283.163891	303.177530	152.092403
...	...	...	...	...
60404997	3285.398701	1643.202989	546.324588	273.665932
60404998	3386.446379	1693.726828	445.276909	223.142093
60404999	3443.467843	1722.237560	388.255446	194.631361
60405000	3571.526420	1786.266848	260.196868	130.602072
60405001	3684.610484	1842.808880	147.112804	74.060040

60405002 rows × 4 columns

PredictSpecLibFasta.rt_to_irt_pred will translate the predicted RT values to iRT values (rt_pred to irt_pred). This is useful for DiaNN and Spectronaut search.

[ ]:

fasta_lib.translate_rt_to_irt_pred()
fasta_lib.precursor_df

	sequence	protein_idxes	miss_cleavage	is_prot_nterm	is_prot_cterm	mods	mod_sites	nAA	charge	mod_seq_hash	...	precursor_mz	instrument	nce	rt_pred	rt_norm_pred	ccs_pred	mobility_pred	frag_stop_idx	frag_start_idx	irt_pred
0	RIHTGQR	19786	1	False	False			7	2	471662500970219628	...	434.249018	timsTOF	30	0.115377	0.115377	315.529022	0.775438	6	0	-37.187631
1	PMPMPVR	9448	0	False	False			7	2	-5301076820607700090	...	414.216952	timsTOF	30	0.208976	0.208976	304.965790	0.748912	12	6	-16.331142
2	PMPMPVR	9448	0	False	False	Oxidation@M	4	7	2	6057464136741449831	...	422.214409	timsTOF	30	0.158058	0.158058	304.080536	0.746970	18	12	-27.677099
3	PMPMPVR	9448	0	False	False	Oxidation@M	2	7	2	-6431722582867031756	...	422.214409	timsTOF	30	0.157143	0.157143	305.825348	0.751256	24	18	-27.881022
4	QEWFCTR	12819	0	False	False	Carbamidomethyl@C	5	7	2	-7409729050206298801	...	513.726727	timsTOF	30	0.423747	0.423747	330.547638	0.814317	30	24	31.526291
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3654202	NLTYVRGSVGPATSTLMFVAGVVGNGLALGILSAR	978	1	False	False			35	4	7192344052213098704	...	866.228888	timsTOF	30	0.831350	0.831350	891.748413	1.108824	60404866	60404832	122.352159
3654203	NLTYVRGSVGPATSTLMFVAGVVGNGLALGILSAR	978	1	False	False	Oxidation@M	17	35	3	-1485306056792248111	...	1159.967730	timsTOF	30	0.826977	0.826977	785.478699	1.302269	60404900	60404866	121.377815
3654204	NLTYVRGSVGPATSTLMFVAGVVGNGLALGILSAR	978	1	False	False	Oxidation@M	17	35	4	-1485306056792248111	...	870.227616	timsTOF	30	0.826977	0.826977	892.459656	1.109729	60404934	60404900	121.377815
3654205	KNQAADDDDEDLNDTNYDEFNGYAGSLFSSGPYEK	2299	1	False	False			35	4	5191231126132273751	...	976.910866	timsTOF	30	0.670129	0.670129	791.322266	0.984398	60404968	60404934	86.427514
3654206	AYDADSGFNGKVLFTISDGNTDSCFNIDMETGQLK	10080	1	False	False	Carbamidomethyl@C	24	35	4	-7707559913944666938	...	958.434460	timsTOF	30	0.725150	0.725150	823.819214	1.024754	60405002	60404968	98.687774

3654207 rows × 21 columns

Save the library into a HDF5 (.hdf) file¶

[ ]:

fasta_lib.save_hdf(hdf_path)
hdf_path

'y:\\User\\Feng\\speclib\\human_swissprot.speclib.hdf'

Now use alphapeptdeep_hdf_to_tsv.ipynb to translate hdf into TSV (diann/spectronaut) format. Translation is quite slow because writing TSV file is slow for large libraries.