{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial: Predicting a Spectral Library from FASTA" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use this notebook to predict a spectral library from FASTA files and save it as an HDF file.\n", "Then use [alphapeptdeep_hdf_to_tsv.ipynb](alphapeptdeep_hdf_to_tsv.ipynb) to translate the HDF file into TSV (DIA-NN/Spectronaut) format." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Prepare the data and settings" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from alphabase.peptide.fragment import get_charged_frag_types\n", "import pandas as pd\n", "\n", "fasta_list = [\n", " r\"y:\\User\\Feng\\fasta\\uniprot_human_reviewed_20210309.fasta\"\n", "]\n", "# output spectral library in hdf format\n", "hdf_path = r'y:\\User\\Feng\\speclib\\human_swissprot.speclib.hdf'\n", "\n", "protease=\"trypsin\"\n", "nce = 30\n", "instrument = 'timsTOF'\n", "\n", "add_phos=False\n", "\n", "protease_dict = {\n", " \"trypsin\": \"([KR])\", # this is in fact the \"trypsin/P\"\n", " \"lysc\": \"([K])\",\n", " \"lysn\": \"\\w(?=K)\",\n", "}\n", "min_pep_len = 7\n", "max_pep_len = 35\n", "max_miss_cleave = 1\n", "max_var_mods = 1\n", "min_pep_mz = 400\n", "max_pep_mz = 1200\n", "precursor_charge_min = 2\n", "precursor_charge_max = 4\n", "\n", "var_mods = []\n", "var_mods += ['Oxidation@M']\n", "#var_mods += ['Phospho@S','Phospho@T','Phospho@Y']\n", "\n", "\n", "frag_types = get_charged_frag_types(\n", " ['b','y']+\n", " (['b_modloss','y_modloss'] if add_phos else []), \n", " 2\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "digest = protease_dict[protease] # Or digest = \"trypsin/P\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`protease` and `digest` are defined by 
regular expressions. alphabase provides several built-in enzymes, so for most enzymes we don't need to write the regular expression ourselves. Here are all the built-in enzymes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'arg-c': 'R',\n", " 'asp-n': '\\\\w(?=D)',\n", " 'bnps-skatole': 'W',\n", " 'caspase 1': '(?<=[FWYL]\\\\w[HAT])D(?=[^PEDQKR])',\n", " 'caspase 2': '(?<=DVA)D(?=[^PEDQKR])',\n", " 'caspase 3': '(?<=DMQ)D(?=[^PEDQKR])',\n", " 'caspase 4': '(?<=LEV)D(?=[^PEDQKR])',\n", " 'caspase 5': '(?<=[LW]EH)D',\n", " 'caspase 6': '(?<=VE[HI])D(?=[^PEDQKR])',\n", " 'caspase 7': '(?<=DEV)D(?=[^PEDQKR])',\n", " 'caspase 8': '(?<=[IL]ET)D(?=[^PEDQKR])',\n", " 'caspase 9': '(?<=LEH)D',\n", " 'caspase 10': '(?<=IEA)D',\n", " 'chymotrypsin high specificity': '([FY](?=[^P]))|(W(?=[^MP]))',\n", " 'chymotrypsin low specificity': '([FLY](?=[^P]))|(W(?=[^MP]))|(M(?=[^PY]))|(H(?=[^DMPW]))',\n", " 'chymotrypsin': '([FLY](?=[^P]))|(W(?=[^MP]))|(M(?=[^PY]))|(H(?=[^DMPW]))',\n", " 'clostripain': 'R',\n", " 'cnbr': 'M',\n", " 'enterokinase': '(?<=[DE]{3})K',\n", " 'factor xa': '(?<=[AFGILTVM][DE]G)R',\n", " 'formic acid': 'D',\n", " 'glutamyl endopeptidase': 'E',\n", " 'glu-c': 'E',\n", " 'granzyme b': '(?<=IEP)D',\n", " 'hydroxylamine': 'N(?=G)',\n", " 'iodosobenzoic acid': 'W',\n", " 'lys-c': 'K',\n", " 'lys-n': '\\\\w(?=K)',\n", " 'ntcb': '\\\\w(?=C)',\n", " 'pepsin ph1.3': '((?<=[^HKR][^P])[^R](?=[FL][^P]))|((?<=[^HKR][^P])[FL](?=\\\\w[^P]))',\n", " 'pepsin ph2.0': '((?<=[^HKR][^P])[^R](?=[FLWY][^P]))|((?<=[^HKR][^P])[FLWY](?=\\\\w[^P]))',\n", " 'proline endopeptidase': '(?<=[HKR])P(?=[^P])',\n", " 'proteinase k': '[AEFILTVWY]',\n", " 'staphylococcal peptidase i': '(?<=[^E])E',\n", " 'thermolysin': '[^DE](?=[AFILMV])',\n", " 'thrombin': '((?<=G)R(?=G))|((?<=[AFGILTVM][AFGILTVWA]P)R(?=[^DE][^DE]))',\n", " 'trypsin_full': '([KR](?=[^P]))|((?<=W)K(?=P))|((?<=M)R(?=P))',\n", " 'trypsin_exception': 
'((?<=[CD])K(?=D))|((?<=C)K(?=[HY]))|((?<=C)R(?=K))|((?<=R)R(?=[HR]))',\n", " 'trypsin': '([KR](?=[^P]))',\n", " 'trypsin/P': '([KR])',\n", " 'non-specific': '()',\n", " 'no-cleave': '_'}" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from alphabase.protein.fasta import protease_dict\n", "protease_dict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Initialize a `PredictSpecLibFasta` object" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from peptdeep.protein.fasta import PredictSpecLibFasta\n", "from peptdeep.pretrained_models import ModelManager\n", "\n", "model_mgr = ModelManager(device='gpu')\n", "\n", "model_mgr.nce = nce\n", "model_mgr.instrument = instrument\n", "\n", "fasta_lib = PredictSpecLibFasta(\n", " model_mgr, \n", " protease=digest,\n", " charged_frag_types=frag_types, \n", " var_mods=var_mods, \n", " fix_mods=['Carbamidomethyl@C'],\n", " max_missed_cleavages=max_miss_cleave,\n", " max_var_mod_num=max_var_mods,\n", " peptide_length_max=max_pep_len,\n", " peptide_length_min=min_pep_len,\n", " precursor_charge_min=precursor_charge_min,\n", " precursor_charge_max=precursor_charge_max,\n", " precursor_mz_min=min_pep_mz,\n", " precursor_mz_max=max_pep_mz,\n", " decoy=None\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Digest" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fasta_lib.get_peptides_from_fasta_list(fasta_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we have a sequence DataFrame (`seq_df`) containing peptide sequences in the `sequence` column, we can skip `get_peptides_from_fasta_list`. 
Just assign `seq_df` to `fasta_lib._precursor_df` and perform all following steps.\n", "\n", "```\n", "fasta_lib._precursor_df = seq_df\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Append decoy sequences and add modifications" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fasta_lib.append_decoy_sequence()\n", "fasta_lib.add_modifications()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will get a protein DataFrame (`protein_df`) after digestion" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " protein_id full_name gene_name \\\n", "0 Q9H9K5 sp|Q9H9K5|MER34_HUMAN ERVMER34-1 \n", "1 P04439 sp|P04439|HLAA_HUMAN HLA-A \n", "2 P01911 sp|P01911|DRB1_HUMAN HLA-DRB1 \n", "3 P01889 sp|P01889|HLAB_HUMAN HLA-B \n", "4 P31689 sp|P31689|DNJA1_HUMAN DNAJA1 \n", "... ... ... ... \n", "20391 Q8WVZ7 sp|Q8WVZ7|RN133_HUMAN RNF133 \n", "20392 P05387 sp|P05387|RLA2_HUMAN RPLP2 \n", "20393 P51991 sp|P51991|ROA3_HUMAN HNRNPA3 \n", "20394 Q9BZX4 sp|Q9BZX4|ROP1B_HUMAN ROPN1B \n", "20395 P34096 sp|P34096|RNAS4_HUMAN RNASE4 \n", "\n", " description \\\n", "0 sp|Q9H9K5|MER34_HUMAN Endogenous retroviral en... \n", "1 sp|P04439|HLAA_HUMAN HLA class I histocompatib... \n", "2 sp|P01911|DRB1_HUMAN HLA class II histocompati... \n", "3 sp|P01889|HLAB_HUMAN HLA class I histocompatib... \n", "4 sp|P31689|DNJA1_HUMAN DnaJ homolog subfamily A... \n", "... ... \n", "20391 sp|Q8WVZ7|RN133_HUMAN E3 ubiquitin-protein lig... \n", "20392 sp|P05387|RLA2_HUMAN 60S acidic ribosomal prot... \n", "20393 sp|P51991|ROA3_HUMAN Heterogeneous nuclear rib... \n", "20394 sp|Q9BZX4|ROP1B_HUMAN Ropporin-1B OS=Homo sapi... \n", "20395 sp|P34096|RNAS4_HUMAN Ribonuclease 4 OS=Homo s... \n", "\n", " sequence \n", "0 MGSLSNYALLQLTLTAFLTILVQPQHLLAPVFRTLSILTNQSNCWL... \n", "1 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFFTSVSRPGRGEPRF... \n", "2 MVCLKLPGGSCMTALTVTLMVLSSPLALSGDTRPRFLWQPKRECHF... \n", "3 MLVMAPRTVLLLLSAALALTETWAGSHSMRYFYTSVSRPGRGEPRF... \n", "4 MVKETTYYDVLGVKPNATQEELKKAYRKLALKYHPDKNPNEGEKFK... \n", "... ... \n", "20391 MHLLKVGTWRNNTASSWLMKFSVLWLVSQNCCRASVVWMAYMNISF... \n", "20392 MRYVASYLLAALGGNSSPSAKDIKKILDSVGIEADDDRLNKVISEL... \n", "20393 MEVKPPPGRPQPDSGRRRRRRGEEGHDPKEPEQLRKLFIGGLSFET... \n", "20394 MAQTDKPTCIPPELPKMLKEFAKAAIRAQPQDLIQWGADYFEALSR... \n", "20395 MALQRTHSLLLLLLLTLLGLGLVQPSYGQDGMYQRFLRQHVHPEET... 
\n", "\n", "[20396 rows x 5 columns]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fasta_lib.protein_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`precursor_df` contains the main information of peptides." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fasta_lib.precursor_df['nAA'] = fasta_lib.precursor_df.sequence.str.len()\n", "fasta_lib.precursor_df.sort_values('nAA', inplace=True)\n", "fasta_lib.precursor_df.reset_index(drop=True, inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check `precursor_df` after adding charge states." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sequence protein_idxes miss_cleavage \\\n", "0 RIHTGQR 19786 1 \n", "1 RIHTGQR 19786 1 \n", "2 RIHTGQR 19786 1 \n", "3 LVDSAYK 12819 0 \n", "4 LVDSAYK 12819 0 \n", "... ... ... ... \n", "5617819 KNQAADDDDEDLNDTNYDEFNGYAGSLFSSGPYEK 2299 1 \n", "5617820 KNQAADDDDEDLNDTNYDEFNGYAGSLFSSGPYEK 2299 1 \n", "5617821 AYDADSGFNGKVLFTISDGNTDSCFNIDMETGQLK 10080 1 \n", "5617822 AYDADSGFNGKVLFTISDGNTDSCFNIDMETGQLK 10080 1 \n", "5617823 AYDADSGFNGKVLFTISDGNTDSCFNIDMETGQLK 10080 1 \n", "\n", " is_prot_nterm is_prot_cterm mods mod_sites nAA \\\n", "0 False False 7 \n", "1 False False 7 \n", "2 False False 7 \n", "3 False False 7 \n", "4 False False 7 \n", "... ... ... ... ... ... \n", "5617819 False False 35 \n", "5617820 False False 35 \n", "5617821 False False Carbamidomethyl@C 24 35 \n", "5617822 False False Carbamidomethyl@C 24 35 \n", "5617823 False False Carbamidomethyl@C 24 35 \n", "\n", " charge \n", "0 2 \n", "1 3 \n", "2 4 \n", "3 2 \n", "4 3 \n", "... ... \n", "5617819 3 \n", "5617820 4 \n", "5617821 2 \n", "5617822 3 \n", "5617823 4 \n", "\n", "[5617824 rows x 9 columns]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fasta_lib.add_charge()\n", "fasta_lib.precursor_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`PredictSpecLibFasta.calc_precursor_mz` will append a `precursor_mz` column to the `precursor_df`.\n", "\n", "`PredictSpecLibFasta.hash_precursor_df` will append `mod_seq_hash` and `mod_seq_charge_hash` columns to the `precursor_df`. `mod_seq_hash` column contains the unique signatures (np.int64) for corresponding peptides ( `sequence`,`mods` and `mod_sites`). `mod_seq_charge_hash` column contains the unique signatures (np.int64) for corresponding precursors ( `sequence`,`mods`, `mod_sites` and `charge`). " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sequence protein_idxes miss_cleavage \\\n", "0 RIHTGQR 19786 1 \n", "1 PMPMPVR 9448 0 \n", "2 PMPMPVR 9448 0 \n", "3 PMPMPVR 9448 0 \n", "4 QEWFCTR 12819 0 \n", "... ... ... ... \n", "3654202 NLTYVRGSVGPATSTLMFVAGVVGNGLALGILSAR 978 1 \n", "3654203 NLTYVRGSVGPATSTLMFVAGVVGNGLALGILSAR 978 1 \n", "3654204 NLTYVRGSVGPATSTLMFVAGVVGNGLALGILSAR 978 1 \n", "3654205 KNQAADDDDEDLNDTNYDEFNGYAGSLFSSGPYEK 2299 1 \n", "3654206 AYDADSGFNGKVLFTISDGNTDSCFNIDMETGQLK 10080 1 \n", "\n", " is_prot_nterm is_prot_cterm mods mod_sites nAA \\\n", "0 False False 7 \n", "1 False False 7 \n", "2 False False Oxidation@M 4 7 \n", "3 False False Oxidation@M 2 7 \n", "4 False False Carbamidomethyl@C 5 7 \n", "... ... ... ... ... ... \n", "3654202 False False 35 \n", "3654203 False False Oxidation@M 17 35 \n", "3654204 False False Oxidation@M 17 35 \n", "3654205 False False 35 \n", "3654206 False False Carbamidomethyl@C 24 35 \n", "\n", " charge mod_seq_hash mod_seq_charge_hash precursor_mz \n", "0 2 471662500970219628 471662500970219630 434.249018 \n", "1 2 -5301076820607700090 -5301076820607700088 414.216952 \n", "2 2 6057464136741449831 6057464136741449833 422.214409 \n", "3 2 -6431722582867031756 -6431722582867031754 422.214409 \n", "4 2 -7409729050206298801 -7409729050206298799 513.726727 \n", "... ... ... ... ... 
\n", "3654202 4 7192344052213098704 7192344052213098708 866.228888 \n", "3654203 3 -1485306056792248111 -1485306056792248108 1159.967730 \n", "3654204 4 -1485306056792248111 -1485306056792248107 870.227616 \n", "3654205 4 5191231126132273751 5191231126132273755 976.910866 \n", "3654206 4 -7707559913944666938 -7707559913944666934 958.434460 \n", "\n", "[3654207 rows x 12 columns]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fasta_lib.hash_precursor_df()\n", "fasta_lib.calc_precursor_mz()\n", "fasta_lib.precursor_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predict MS2/RT/CCS(mobility)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2022-08-03 14:14:41> Predicting RT ...\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 29/29 [01:30<00:00, 3.11s/it]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "2022-08-03 14:16:12> Predicting mobility ...\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n", "100%|██████████| 29/29 [01:31<00:00, 3.14s/it]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "2022-08-03 14:18:10> Predicting MS2 ...\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 29/29 [04:53<00:00, 10.13s/it]\n" ] } ], "source": [ "fasta_lib.precursor_df['instrument'] = model_mgr.instrument\n", "fasta_lib.precursor_df['nce'] = model_mgr.nce\n", "res = fasta_lib.model_manager.predict_all(\n", " fasta_lib.precursor_df,\n", " predict_items=['rt','mobility','ms2'],\n", " frag_types = frag_types,\n", ")\n", "fasta_lib.set_precursor_and_fragment(\n", " **res\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check memory usage for the library" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3.65M 
precursors with 241.62M fragments used 6.9472 GB memory\n" ] } ], "source": [ "import os, psutil\n", "import numpy as np\n", "process = psutil.Process(os.getpid())\n", "print(f'{len(fasta_lib.precursor_df)*1e-6:.2f}M precursors with {np.prod(fasta_lib.fragment_mz_df.values.shape, dtype=float)*(1e-6):.2f}M fragments used {process.memory_info().rss/1024**3:.4f} GB memory')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The predicted fragment intensities" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " b_z1 b_z2 y_z1 y_z2\n", "0 0.000000 0.0 0.611678 0.0\n", "1 0.056326 0.0 1.000000 0.0\n", "2 0.437313 0.0 0.729849 0.0\n", "3 0.219575 0.0 0.292181 0.0\n", "4 0.346306 0.0 0.033992 0.0\n", "... ... ... ... ...\n", "60404997 0.000000 0.0 0.322072 0.0\n", "60404998 0.000000 0.0 0.206371 0.0\n", "60404999 0.000000 0.0 0.033532 0.0\n", "60405000 0.000000 0.0 0.040032 0.0\n", "60405001 0.000000 0.0 0.000000 0.0\n", "\n", "[60405002 rows x 4 columns]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fasta_lib.fragment_intensity_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The calculated fragment m/z values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " b_z1 b_z2 y_z1 y_z2\n", "0 157.108387 79.057832 711.389648 356.198462\n", "1 270.192451 135.599864 598.305584 299.656430\n", "2 407.251363 204.129320 461.246672 231.126974\n", "3 508.299042 254.653159 360.198993 180.603135\n", "4 565.320506 283.163891 303.177530 152.092403\n", "... ... ... ... ...\n", "60404997 3285.398701 1643.202989 546.324588 273.665932\n", "60404998 3386.446379 1693.726828 445.276909 223.142093\n", "60404999 3443.467843 1722.237560 388.255446 194.631361\n", "60405000 3571.526420 1786.266848 260.196868 130.602072\n", "60405001 3684.610484 1842.808880 147.112804 74.060040\n", "\n", "[60405002 rows x 4 columns]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fasta_lib.fragment_mz_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`PredictSpecLibFasta.rt_to_irt_pred` will translate the predicted RT values to iRT values (`rt_pred` to `irt_pred`). This is useful for DiaNN and Spectronaut search." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sequence protein_idxes miss_cleavage \\\n", "0 RIHTGQR 19786 1 \n", "1 PMPMPVR 9448 0 \n", "2 PMPMPVR 9448 0 \n", "3 PMPMPVR 9448 0 \n", "4 QEWFCTR 12819 0 \n", "... ... ... ... \n", "3654202 NLTYVRGSVGPATSTLMFVAGVVGNGLALGILSAR 978 1 \n", "3654203 NLTYVRGSVGPATSTLMFVAGVVGNGLALGILSAR 978 1 \n", "3654204 NLTYVRGSVGPATSTLMFVAGVVGNGLALGILSAR 978 1 \n", "3654205 KNQAADDDDEDLNDTNYDEFNGYAGSLFSSGPYEK 2299 1 \n", "3654206 AYDADSGFNGKVLFTISDGNTDSCFNIDMETGQLK 10080 1 \n", "\n", " is_prot_nterm is_prot_cterm mods mod_sites nAA \\\n", "0 False False 7 \n", "1 False False 7 \n", "2 False False Oxidation@M 4 7 \n", "3 False False Oxidation@M 2 7 \n", "4 False False Carbamidomethyl@C 5 7 \n", "... ... ... ... ... ... \n", "3654202 False False 35 \n", "3654203 False False Oxidation@M 17 35 \n", "3654204 False False Oxidation@M 17 35 \n", "3654205 False False 35 \n", "3654206 False False Carbamidomethyl@C 24 35 \n", "\n", " charge mod_seq_hash ... precursor_mz instrument nce \\\n", "0 2 471662500970219628 ... 434.249018 timsTOF 30 \n", "1 2 -5301076820607700090 ... 414.216952 timsTOF 30 \n", "2 2 6057464136741449831 ... 422.214409 timsTOF 30 \n", "3 2 -6431722582867031756 ... 422.214409 timsTOF 30 \n", "4 2 -7409729050206298801 ... 513.726727 timsTOF 30 \n", "... ... ... ... ... ... .. \n", "3654202 4 7192344052213098704 ... 866.228888 timsTOF 30 \n", "3654203 3 -1485306056792248111 ... 1159.967730 timsTOF 30 \n", "3654204 4 -1485306056792248111 ... 870.227616 timsTOF 30 \n", "3654205 4 5191231126132273751 ... 976.910866 timsTOF 30 \n", "3654206 4 -7707559913944666938 ... 958.434460 timsTOF 30 \n", "\n", " rt_pred rt_norm_pred ccs_pred mobility_pred frag_stop_idx \\\n", "0 0.115377 0.115377 315.529022 0.775438 6 \n", "1 0.208976 0.208976 304.965790 0.748912 12 \n", "2 0.158058 0.158058 304.080536 0.746970 18 \n", "3 0.157143 0.157143 305.825348 0.751256 24 \n", "4 0.423747 0.423747 330.547638 0.814317 30 \n", "... ... ... ... ... ... 
\n", "3654202 0.831350 0.831350 891.748413 1.108824 60404866 \n", "3654203 0.826977 0.826977 785.478699 1.302269 60404900 \n", "3654204 0.826977 0.826977 892.459656 1.109729 60404934 \n", "3654205 0.670129 0.670129 791.322266 0.984398 60404968 \n", "3654206 0.725150 0.725150 823.819214 1.024754 60405002 \n", "\n", " frag_start_idx irt_pred \n", "0 0 -37.187631 \n", "1 6 -16.331142 \n", "2 12 -27.677099 \n", "3 18 -27.881022 \n", "4 24 31.526291 \n", "... ... ... \n", "3654202 60404832 122.352159 \n", "3654203 60404866 121.377815 \n", "3654204 60404900 121.377815 \n", "3654205 60404934 86.427514 \n", "3654206 60404968 98.687774 \n", "\n", "[3654207 rows x 21 columns]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fasta_lib.translate_rt_to_irt_pred()\n", "fasta_lib.precursor_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Save the library to an HDF5 (.hdf) file" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'y:\\\\User\\\\Feng\\\\speclib\\\\human_swissprot.speclib.hdf'" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fasta_lib.save_hdf(hdf_path)\n", "hdf_path" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now use [alphapeptdeep_hdf_to_tsv.ipynb](alphapeptdeep_hdf_to_tsv.ipynb) to translate the HDF file into TSV (DIA-NN/Spectronaut) format. The translation is quite slow because writing TSV files is slow for large libraries." ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.3 ('base')", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.8.3" }, "vscode": { "interpreter": { "hash": "8a3b27e141e49c996c9b863f8707e97aabd49c4a7e8445b9b783b34e4a21a9b2" } } }, "nbformat": 4, "nbformat_minor": 2 }