{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial: building new models from scratch" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**even without experience in deep learning**" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To predict or classify novel properties of peptides, the user only needs to provide peptides with the corresponding property values (e.g. 'binding_affinity').\n", "\n", "We provide several generic `ModelInterface` and `Model` classes in the `peptdeep.model.generic_property_prediction` module so that users can easily build models for regression and classification problems. Examples are shown below:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Imports" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from peptdeep.model.generic_property_prediction import (\n", "    ModelInterface_for_Generic_AASeq_BinaryClassification,\n", "    ModelInterface_for_Generic_AASeq_Regression,\n", "    ModelInterface_for_Generic_ModAASeq_BinaryClassification,\n", "    ModelInterface_for_Generic_ModAASeq_Regression,\n", ")\n", "from peptdeep.model.generic_property_prediction import (\n", "    Model_for_Generic_AASeq_BinaryClassification_LSTM,\n", "    Model_for_Generic_AASeq_BinaryClassification_Transformer,\n", "    Model_for_Generic_AASeq_Regression_LSTM,\n", "    Model_for_Generic_AASeq_Regression_Transformer,\n", "    Model_for_Generic_ModAASeq_BinaryClassification_LSTM,\n", "    Model_for_Generic_ModAASeq_BinaryClassification_Transformer,\n", "    Model_for_Generic_ModAASeq_Regression_LSTM,\n", "    Model_for_Generic_ModAASeq_Regression_Transformer,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Define example Table/DataFrame" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], 
"source": [ "from peptdeep.model.rt import IRT_PEPTIDE_DF" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " sequence pep_name irt mods mod_sites nAA normalized_irt\n", "0 LGGNEQVTR RT-pep a -24.92 9 0.000000\n", "1 GAGSSEPVTGLDAK RT-pep b 0.00 14 0.199488\n", "2 VEATFGVDESNAK RT-pep c 12.39 13 0.298671\n", "3 YILAGVENSK RT-pep d 19.79 10 0.357909\n", "4 TPVISGGPYEYR RT-pep e 28.71 12 0.429315\n", "5 TPVITGAPYEYR RT-pep f 33.38 12 0.466699\n", "6 DGLDAASYYAPVR RT-pep g 42.26 13 0.537784\n", "7 ADVTPADFSEWSK RT-pep h 54.62 13 0.636728\n", "8 GTFIIDPGGVIR RT-pep i 70.52 12 0.764009\n", "9 GTFIIDPAAVIR RT-pep k 87.23 12 0.897775\n", "10 LFLQFGAQGSPFLK RT-pep l 100.00 14 1.000000" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def create_example_input_dataframe_normalized_irt():\n", "    irt_df = IRT_PEPTIDE_DF.copy()\n", "    # min-max normalize the iRT values to the range [0, 1]\n", "    irt_df['normalized_irt'] = (\n", "        irt_df.irt - irt_df.irt.min()\n", "    ) / (irt_df.irt.max() - irt_df.irt.min())\n", "    return irt_df\n", "create_example_input_dataframe_normalized_irt()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Steps to build a model from scratch\n", "\n", "The following examples need only 7 steps to build, use, and save a model.\n", "\n", "1. Prepare a training dataframe with a `sequence` column (plus `mods` and `mod_sites` columns if the model also takes modifications into account) and a column holding the target values to train on.\n", "2. Select a `ModelInterface` class based on the prediction problem (classification or regression, for sequences or modified sequences), and select a `Model` class when initializing the `ModelInterface` object.\n", "3. Tell the `ModelInterface` object which column in the training dataframe stores the target values, and which column should store the predicted values.\n", "4. Call `model.train()` for training.\n", "5. Call `model.predict()` for prediction.\n", "\n", "> Save and load models:\n", "6. Call `model.save(\"/model_folder/model.pth\")` to save the model.\n", "7. Use the same `ModelInterface` and `Model` classes, and call `model.load(\"/model_folder/model.pth\")` to load the model for transfer learning and prediction." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Building a simple RT model based on `Model_for_Generic_AASeq_Regression_LSTM`" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " sequence pep_name irt mods mod_sites nAA normalized_irt \\\n", "0 LGGNEQVTR RT-pep a -24.92 9 0.000000 \n", "1 GAGSSEPVTGLDAK RT-pep b 0.00 14 0.199488 \n", "2 VEATFGVDESNAK RT-pep c 12.39 13 0.298671 \n", "3 YILAGVENSK RT-pep d 19.79 10 0.357909 \n", "4 TPVISGGPYEYR RT-pep e 28.71 12 0.429315 \n", "5 TPVITGAPYEYR RT-pep f 33.38 12 0.466699 \n", "6 DGLDAASYYAPVR RT-pep g 42.26 13 0.537784 \n", "7 ADVTPADFSEWSK RT-pep h 54.62 13 0.636728 \n", "8 GTFIIDPGGVIR RT-pep i 70.52 12 0.764009 \n", "9 GTFIIDPAAVIR RT-pep k 87.23 12 0.897775 \n", "10 LFLQFGAQGSPFLK RT-pep l 100.00 14 1.000000 \n", "\n", " predicted_normalized_irt \n", "0 0.000000 \n", "1 0.203671 \n", "2 0.312852 \n", "3 0.365846 \n", "4 0.434760 \n", "5 0.465173 \n", "6 0.564576 \n", "7 0.678894 \n", "8 0.893195 \n", "9 1.061624 \n", "10 1.100713 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example_df = create_example_input_dataframe_normalized_irt()\n", "\n", "# initialize the ModelInterface and specify which model class to use\n", "model = ModelInterface_for_Generic_AASeq_Regression(\n", "    model_class=Model_for_Generic_AASeq_Regression_LSTM\n", ")\n", "# specify the target column for training and the column that will store the predictions\n", "model.target_column_to_train = 'normalized_irt'\n", "model.target_column_to_predict = 'predicted_normalized_irt'\n", "model.train(example_df, epoch=20)\n", "model.predict(example_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Building a simple sequence-only RT model based on `Model_for_Generic_AASeq_Regression_Transformer`" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " sequence pep_name irt mods mod_sites nAA normalized_irt \\\n", "0 LGGNEQVTR RT-pep a -24.92 9 0.000000 \n", "1 GAGSSEPVTGLDAK RT-pep b 0.00 14 0.199488 \n", "2 VEATFGVDESNAK RT-pep c 12.39 13 0.298671 \n", "3 YILAGVENSK RT-pep d 19.79 10 0.357909 \n", "4 TPVISGGPYEYR RT-pep e 28.71 12 0.429315 \n", "5 TPVITGAPYEYR RT-pep f 33.38 12 0.466699 \n", "6 DGLDAASYYAPVR RT-pep g 42.26 13 0.537784 \n", "7 ADVTPADFSEWSK RT-pep h 54.62 13 0.636728 \n", "8 GTFIIDPGGVIR RT-pep i 70.52 12 0.764009 \n", "9 GTFIIDPAAVIR RT-pep k 87.23 12 0.897775 \n", "10 LFLQFGAQGSPFLK RT-pep l 100.00 14 1.000000 \n", "\n", " predicted_normalized_irt \n", "0 0.000000 \n", "1 0.000000 \n", "2 0.140912 \n", "3 0.142185 \n", "4 0.210857 \n", "5 0.277200 \n", "6 0.088854 \n", "7 0.399164 \n", "8 0.596815 \n", "9 0.701862 \n", "10 0.877500 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example_df = create_example_input_dataframe_normalized_irt()\n", "\n", "# initialize the ModelInterface and specify which model class to use\n", "model = ModelInterface_for_Generic_AASeq_Regression(\n", "    model_class=Model_for_Generic_AASeq_Regression_Transformer\n", ")\n", "# specify the target column for training and the column that will store the predictions\n", "model.target_column_to_train = 'normalized_irt'\n", "model.target_column_to_predict = 'predicted_normalized_irt'\n", "model.train(example_df, epoch=20)\n", "model.predict(example_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regression models for predicting a scalar value for a given amino acid sequence and site-specific PTMs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Scalar regression model (RT) with modified AA sequences using `Model_for_Generic_ModAASeq_Regression_LSTM`" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " sequence pep_name irt mods mod_sites nAA normalized_irt \\\n", "0 LGGNEQVTR RT-pep a -24.92 9 0.000000 \n", "1 GAGSSEPVTGLDAK RT-pep b 0.00 14 0.199488 \n", "2 VEATFGVDESNAK RT-pep c 12.39 13 0.298671 \n", "3 YILAGVENSK RT-pep d 19.79 10 0.357909 \n", "4 TPVISGGPYEYR RT-pep e 28.71 12 0.429315 \n", "5 TPVITGAPYEYR RT-pep f 33.38 12 0.466699 \n", "6 DGLDAASYYAPVR RT-pep g 42.26 13 0.537784 \n", "7 ADVTPADFSEWSK RT-pep h 54.62 13 0.636728 \n", "8 GTFIIDPGGVIR RT-pep i 70.52 12 0.764009 \n", "9 GTFIIDPAAVIR RT-pep k 87.23 12 0.897775 \n", "10 LFLQFGAQGSPFLK RT-pep l 100.00 14 1.000000 \n", "\n", " predicted_normalized_irt \n", "0 0.000000 \n", "1 0.368827 \n", "2 0.295824 \n", "3 0.337074 \n", "4 0.527409 \n", "5 0.506031 \n", "6 0.629531 \n", "7 0.708878 \n", "8 0.798570 \n", "9 0.856519 \n", "10 0.973729 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example_df = create_example_input_dataframe_normalized_irt()\n", "\n", "# initialize the ModelInterface and specify which model class to use\n", "model = ModelInterface_for_Generic_ModAASeq_Regression(\n", "    model_class=Model_for_Generic_ModAASeq_Regression_LSTM\n", ")\n", "# specify the target column for training and the column that will store the predictions\n", "model.target_column_to_train = 'normalized_irt'\n", "model.target_column_to_predict = 'predicted_normalized_irt'\n", "model.train(example_df, epoch=20)\n", "model.predict(example_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Scalar regression model (RT) with modified AA sequences using `Model_for_Generic_ModAASeq_Regression_Transformer`" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " sequence pep_name irt mods mod_sites nAA \\\n", "0 LGGNEQVTR RT-pep a -24.92 9 \n", "1 GAGSSEPVTGLDAK RT-pep b 0.00 Phospho@S 4 14 \n", "2 VEATFGVDESNAK RT-pep c 12.39 13 \n", "3 YILAGVENSK RT-pep d 19.79 10 \n", "4 TPVISGGPYEYR RT-pep e 28.71 12 \n", "5 TPVITGAPYEYR RT-pep f 33.38 12 \n", "6 DGLDAASYYAPVR RT-pep g 42.26 13 \n", "7 ADVTPADFSEWSK RT-pep h 54.62 13 \n", "8 GTFIIDPGGVIR RT-pep i 70.52 12 \n", "9 GTFIIDPAAVIR RT-pep k 87.23 12 \n", "10 LFLQFGAQGSPFLK RT-pep l 100.00 14 \n", "\n", " normalized_irt predicted_normalized_irt \n", "0 0.000000 0.088521 \n", "1 0.199488 0.571920 \n", "2 0.298671 0.285101 \n", "3 0.357909 0.367173 \n", "4 0.429315 0.615492 \n", "5 0.466699 0.589607 \n", "6 0.537784 0.539454 \n", "7 0.636728 0.587029 \n", "8 0.764009 0.880274 \n", "9 0.897775 0.811531 \n", "10 1.000000 1.084086 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example_df = create_example_input_dataframe_normalized_irt()\n", "# add an example modification: Phospho@S at site 4 of the second peptide\n", "example_df.loc[1,'mods'] = 'Phospho@S'\n", "example_df.loc[1,'mod_sites'] = '4'\n", "\n", "# initialize the ModelInterface and specify which model class to use\n", "model = ModelInterface_for_Generic_ModAASeq_Regression(\n", "    model_class=Model_for_Generic_ModAASeq_Regression_Transformer\n", ")\n", "# specify the target column for training and the column that will store the predictions\n", "model.target_column_to_train = 'normalized_irt'\n", "model.target_column_to_predict = 'predicted_normalized_irt'\n", "model.train(example_df, epoch=20)\n", "model.predict(example_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Binary classification models for a given amino acid sequence" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# a simple classification dataset\n", "def create_example_input_dataframe_classification_rt():\n", "    rt_df = create_example_input_dataframe_normalized_irt()\n", "    rt_df['is_in_first_half_of_column'] = 0\n", "    # label rows 0-5 as positives (.loc label slicing is inclusive)\n", "    rt_df.loc[:5,'is_in_first_half_of_column'] = 1\n", "    return rt_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### A sequence classification model using `Model_for_Generic_AASeq_BinaryClassification_LSTM`" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " sequence pep_name irt mods mod_sites nAA normalized_irt \\\n", "0 LGGNEQVTR RT-pep a -24.92 9 0.000000 \n", "1 GAGSSEPVTGLDAK RT-pep b 0.00 14 0.199488 \n", "2 VEATFGVDESNAK RT-pep c 12.39 13 0.298671 \n", "3 YILAGVENSK RT-pep d 19.79 10 0.357909 \n", "4 TPVISGGPYEYR RT-pep e 28.71 12 0.429315 \n", "5 TPVITGAPYEYR RT-pep f 33.38 12 0.466699 \n", "6 DGLDAASYYAPVR RT-pep g 42.26 13 0.537784 \n", "7 ADVTPADFSEWSK RT-pep h 54.62 13 0.636728 \n", "8 GTFIIDPGGVIR RT-pep i 70.52 12 0.764009 \n", "9 GTFIIDPAAVIR RT-pep k 87.23 12 0.897775 \n", "10 LFLQFGAQGSPFLK RT-pep l 100.00 14 1.000000 \n", "\n", " is_in_first_half_of_column predicted_will_be_in_first_half_of_column \n", "0 1 0.991829 \n", "1 1 0.990733 \n", "2 1 0.991083 \n", "3 1 0.991600 \n", "4 1 0.992202 \n", "5 1 0.990124 \n", "6 0 0.351366 \n", "7 0 0.359982 \n", "8 0 0.352756 \n", "9 0 0.351209 \n", "10 0 0.349120 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example_df = create_example_input_dataframe_classification_rt()\n", "\n", "# initialize the ModelInterface and specify which model class to use\n", "model = ModelInterface_for_Generic_AASeq_BinaryClassification(\n", "    model_class=Model_for_Generic_AASeq_BinaryClassification_LSTM\n", ")\n", "# specify the target column for training and the column that will store the predictions\n", "model.target_column_to_train = 'is_in_first_half_of_column'\n", "model.target_column_to_predict = 'predicted_will_be_in_first_half_of_column'\n", "model.train(example_df, epoch=20)\n", "model.predict(example_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### A sequence classification model using `Model_for_Generic_AASeq_BinaryClassification_Transformer`" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " sequence pep_name irt mods mod_sites nAA normalized_irt \\\n", "0 LGGNEQVTR RT-pep a -24.92 9 0.000000 \n", "1 GAGSSEPVTGLDAK RT-pep b 0.00 14 0.199488 \n", "2 VEATFGVDESNAK RT-pep c 12.39 13 0.298671 \n", "3 YILAGVENSK RT-pep d 19.79 10 0.357909 \n", "4 TPVISGGPYEYR RT-pep e 28.71 12 0.429315 \n", "5 TPVITGAPYEYR RT-pep f 33.38 12 0.466699 \n", "6 DGLDAASYYAPVR RT-pep g 42.26 13 0.537784 \n", "7 ADVTPADFSEWSK RT-pep h 54.62 13 0.636728 \n", "8 GTFIIDPGGVIR RT-pep i 70.52 12 0.764009 \n", "9 GTFIIDPAAVIR RT-pep k 87.23 12 0.897775 \n", "10 LFLQFGAQGSPFLK RT-pep l 100.00 14 1.000000 \n", "\n", " is_in_first_half_of_column predicted_will_be_in_first_half_of_column \n", "0 1 0.997586 \n", "1 1 0.997438 \n", "2 1 0.996627 \n", "3 1 0.997642 \n", "4 1 0.996989 \n", "5 1 0.996926 \n", "6 0 0.004032 \n", "7 0 0.004321 \n", "8 0 0.004137 \n", "9 0 0.003938 \n", "10 0 0.004114 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example_df = create_example_input_dataframe_classification_rt()\n", "\n", "# initialize the ModelInterface and specify which model class to use\n", "model = ModelInterface_for_Generic_AASeq_BinaryClassification(\n", "    model_class=Model_for_Generic_AASeq_BinaryClassification_Transformer\n", ")\n", "# specify the target column for training and the column that will store the predictions\n", "model.target_column_to_train = 'is_in_first_half_of_column'\n", "model.target_column_to_predict = 'predicted_will_be_in_first_half_of_column'\n", "model.train(example_df, epoch=20)\n", "model.predict(example_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Binary classification models for given amino acid sequence and site-specific PTMs" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def create_example_input_dataframe_classification_rt():\n", "    rt_df = create_example_input_dataframe_normalized_irt()\n", "    rt_df['is_in_first_half_of_column'] = 0\n", "    # label rows 0-5 as positives (.loc label slicing is inclusive)\n", "    rt_df.loc[:5,'is_in_first_half_of_column'] = 1\n", "    return rt_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### A sequence classification model using `Model_for_Generic_ModAASeq_BinaryClassification_LSTM`" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " sequence pep_name irt mods mod_sites nAA normalized_irt \\\n", "0 LGGNEQVTR RT-pep a -24.92 9 0.000000 \n", "1 GAGSSEPVTGLDAK RT-pep b 0.00 14 0.199488 \n", "2 VEATFGVDESNAK RT-pep c 12.39 13 0.298671 \n", "3 YILAGVENSK RT-pep d 19.79 10 0.357909 \n", "4 TPVISGGPYEYR RT-pep e 28.71 12 0.429315 \n", "5 TPVITGAPYEYR RT-pep f 33.38 12 0.466699 \n", "6 DGLDAASYYAPVR RT-pep g 42.26 13 0.537784 \n", "7 ADVTPADFSEWSK RT-pep h 54.62 13 0.636728 \n", "8 GTFIIDPGGVIR RT-pep i 70.52 12 0.764009 \n", "9 GTFIIDPAAVIR RT-pep k 87.23 12 0.897775 \n", "10 LFLQFGAQGSPFLK RT-pep l 100.00 14 1.000000 \n", "\n", " is_in_first_half_of_column predicted_will_be_in_first_half_of_column \n", "0 1 0.993120 \n", "1 1 0.990600 \n", "2 1 0.992972 \n", "3 1 0.992984 \n", "4 1 0.992323 \n", "5 1 0.988538 \n", "6 0 0.370841 \n", "7 0 0.368691 \n", "8 0 0.378124 \n", "9 0 0.367393 \n", "10 0 0.365957 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example_df = create_example_input_dataframe_classification_rt()\n", "\n", "# initialize the ModelInterface and specify which model class to use\n", "model = ModelInterface_for_Generic_ModAASeq_BinaryClassification(\n", "    model_class=Model_for_Generic_ModAASeq_BinaryClassification_LSTM\n", ")\n", "# specify the target column for training and the column that will store the predictions\n", "model.target_column_to_train = 'is_in_first_half_of_column'\n", "model.target_column_to_predict = 'predicted_will_be_in_first_half_of_column'\n", "model.train(example_df, epoch=20)\n", "model.predict(example_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### A sequence classification model using `Model_for_Generic_ModAASeq_BinaryClassification_Transformer`" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " sequence pep_name irt mods mod_sites nAA normalized_irt \\\n", "0 LGGNEQVTR RT-pep a -24.92 9 0.000000 \n", "1 GAGSSEPVTGLDAK RT-pep b 0.00 14 0.199488 \n", "2 VEATFGVDESNAK RT-pep c 12.39 13 0.298671 \n", "3 YILAGVENSK RT-pep d 19.79 10 0.357909 \n", "4 TPVISGGPYEYR RT-pep e 28.71 12 0.429315 \n", "5 TPVITGAPYEYR RT-pep f 33.38 12 0.466699 \n", "6 DGLDAASYYAPVR RT-pep g 42.26 13 0.537784 \n", "7 ADVTPADFSEWSK RT-pep h 54.62 13 0.636728 \n", "8 GTFIIDPGGVIR RT-pep i 70.52 12 0.764009 \n", "9 GTFIIDPAAVIR RT-pep k 87.23 12 0.897775 \n", "10 LFLQFGAQGSPFLK RT-pep l 100.00 14 1.000000 \n", "\n", " is_in_first_half_of_column predicted_will_be_in_first_half_of_column \n", "0 1 0.997545 \n", "1 1 0.996575 \n", "2 1 0.995498 \n", "3 1 0.997241 \n", "4 1 0.996784 \n", "5 1 0.995732 \n", "6 0 0.004000 \n", "7 0 0.005084 \n", "8 0 0.004195 \n", "9 0 0.003547 \n", "10 0 0.003279 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example_df = create_example_input_dataframe_classification_rt()\n", "\n", "# initialize the ModelInterface and specify which model class to use\n", "model = ModelInterface_for_Generic_ModAASeq_BinaryClassification(\n", "    model_class=Model_for_Generic_ModAASeq_BinaryClassification_Transformer\n", ")\n", "# specify the target column for training and the column that will store the predictions\n", "model.target_column_to_train = 'is_in_first_half_of_column'\n", "model.target_column_to_predict = 'predicted_will_be_in_first_half_of_column'\n", "model.train(example_df, epoch=20)\n", "model.predict(example_df)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.3 ('base')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" }, "vscode": { "interpreter": { "hash": 
"8a3b27e141e49c996c9b863f8707e97aabd49c4a7e8445b9b783b34e4a21a9b2" } } }, "nbformat": 4, "nbformat_minor": 2 }