{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial: building new models from scratch"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**even without experience in deep learning**"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%reload_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To predict or classify novel properties of peptides, the user only needs to provide peptides with their corresponding property values (e.g. a `binding_affinity` column).\n",
"\n",
"We provide several generic `ModelInterface` and `Model` classes in the `peptdeep.model.generic_property_prediction` module so that users can easily build models for regression and classification problems. Examples are shown below:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Imports"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from peptdeep.model.generic_property_prediction import (\n",
" ModelInterface_for_Generic_AASeq_BinaryClassification,\n",
" ModelInterface_for_Generic_AASeq_Regression,\n",
" ModelInterface_for_Generic_ModAASeq_BinaryClassification,\n",
" ModelInterface_for_Generic_ModAASeq_Regression,\n",
")\n",
"from peptdeep.model.generic_property_prediction import (\n",
" Model_for_Generic_AASeq_BinaryClassification_LSTM,\n",
" Model_for_Generic_AASeq_BinaryClassification_Transformer,\n",
" Model_for_Generic_AASeq_Regression_LSTM,\n",
" Model_for_Generic_AASeq_Regression_Transformer,\n",
" Model_for_Generic_ModAASeq_BinaryClassification_LSTM,\n",
" Model_for_Generic_ModAASeq_BinaryClassification_Transformer,\n",
" Model_for_Generic_ModAASeq_Regression_LSTM,\n",
" Model_for_Generic_ModAASeq_Regression_Transformer,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Define example Table/DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from peptdeep.model.rt import IRT_PEPTIDE_DF"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" sequence pep_name irt mods mod_sites nAA normalized_irt\n",
"0 LGGNEQVTR RT-pep a -24.92 9 0.000000\n",
"1 GAGSSEPVTGLDAK RT-pep b 0.00 14 0.199488\n",
"2 VEATFGVDESNAK RT-pep c 12.39 13 0.298671\n",
"3 YILAGVENSK RT-pep d 19.79 10 0.357909\n",
"4 TPVISGGPYEYR RT-pep e 28.71 12 0.429315\n",
"5 TPVITGAPYEYR RT-pep f 33.38 12 0.466699\n",
"6 DGLDAASYYAPVR RT-pep g 42.26 13 0.537784\n",
"7 ADVTPADFSEWSK RT-pep h 54.62 13 0.636728\n",
"8 GTFIIDPGGVIR RT-pep i 70.52 12 0.764009\n",
"9 GTFIIDPAAVIR RT-pep k 87.23 12 0.897775\n",
"10 LFLQFGAQGSPFLK RT-pep l 100.00 14 1.000000"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def create_example_input_dataframe_normalized_irt():\n",
" irt_df=IRT_PEPTIDE_DF.copy()\n",
" irt_df['normalized_irt'] = (\n",
" irt_df.irt-irt_df.irt.min()\n",
" )/(irt_df.irt.max()-irt_df.irt.min()) # 0 to 1 norm\n",
" return irt_df\n",
"create_example_input_dataframe_normalized_irt()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Steps to build a model from scratch\n",
"\n",
"In the following examples, a model is built, trained and used in 5 steps; 2 more steps save and reload it.\n",
"\n",
"1. Prepare a training dataframe with a `sequence` column (plus `mods` and `mod_sites` columns if the model should also take modifications into account) and a column holding the target values to train on.\n",
"2. Select a `ModelInterface` class based on the prediction problem (classification or regression, for sequences or modified sequences). Select a `Model` class when initializing the `ModelInterface` class.\n",
"3. Tell the `ModelInterface` object which column in the training dataframe stores the target values (`target_column_to_train`) and which column should store the predicted values (`target_column_to_predict`).\n",
"4. `model.train()` for training.\n",
"5. `model.predict()` for prediction.\n",
"\n",
"Save and load models:\n",
"\n",
"6. `model.save(\"/model_folder/model.pth\")` to save the model.\n",
"7. Use the same `ModelInterface` and `Model` classes, and call `model.load(\"/model_folder/model.pth\")` to load the model for transfer learning and prediction."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Building a simple RT model based on `Model_for_Generic_AASeq_Regression_LSTM`"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" sequence pep_name irt mods mod_sites nAA normalized_irt \\\n",
"0 LGGNEQVTR RT-pep a -24.92 9 0.000000 \n",
"1 GAGSSEPVTGLDAK RT-pep b 0.00 14 0.199488 \n",
"2 VEATFGVDESNAK RT-pep c 12.39 13 0.298671 \n",
"3 YILAGVENSK RT-pep d 19.79 10 0.357909 \n",
"4 TPVISGGPYEYR RT-pep e 28.71 12 0.429315 \n",
"5 TPVITGAPYEYR RT-pep f 33.38 12 0.466699 \n",
"6 DGLDAASYYAPVR RT-pep g 42.26 13 0.537784 \n",
"7 ADVTPADFSEWSK RT-pep h 54.62 13 0.636728 \n",
"8 GTFIIDPGGVIR RT-pep i 70.52 12 0.764009 \n",
"9 GTFIIDPAAVIR RT-pep k 87.23 12 0.897775 \n",
"10 LFLQFGAQGSPFLK RT-pep l 100.00 14 1.000000 \n",
"\n",
" predicted_normalized_irt \n",
"0 0.000000 \n",
"1 0.203671 \n",
"2 0.312852 \n",
"3 0.365846 \n",
"4 0.434760 \n",
"5 0.465173 \n",
"6 0.564576 \n",
"7 0.678894 \n",
"8 0.893195 \n",
"9 1.061624 \n",
"10 1.100713 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example_df = create_example_input_dataframe_normalized_irt()\n",
"\n",
"# initialize the ModelInterface and specify which Model class to use\n",
"model = ModelInterface_for_Generic_AASeq_Regression(\n",
" model_class=Model_for_Generic_AASeq_Regression_LSTM\n",
")\n",
"# specify the column to train on and the column that will store the predictions\n",
"model.target_column_to_train = 'normalized_irt'\n",
"model.target_column_to_predict = 'predicted_normalized_irt'\n",
"model.train(example_df, epoch=20)\n",
"model.predict(example_df)"
]
},
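{
"cell_type": "markdown",
"metadata": {},
"source": [
"Steps 6 and 7 can be sketched with the model trained above. This is a minimal sketch rather than the only way to do it: the relative path `model_folder/model.pth` is a hypothetical location chosen for illustration, and the folder is created explicitly in case `save()` does not create it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# hypothetical output location; any writable path should work\n",
"model_path = os.path.join('model_folder', 'model.pth')\n",
"os.makedirs(os.path.dirname(model_path), exist_ok=True)\n",
"model.save(model_path)\n",
"\n",
"# re-create the interface with the same ModelInterface and Model classes,\n",
"# then load the saved weights into it\n",
"loaded_model = ModelInterface_for_Generic_AASeq_Regression(\n",
"    model_class=Model_for_Generic_AASeq_Regression_LSTM\n",
")\n",
"loaded_model.target_column_to_predict = 'predicted_normalized_irt'\n",
"loaded_model.load(model_path)\n",
"loaded_model.predict(example_df)"
]
},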
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Building a simple RT model for only sequences based on `Model_for_Generic_AASeq_Regression_Transformer`"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" sequence pep_name irt mods mod_sites nAA normalized_irt \\\n",
"0 LGGNEQVTR RT-pep a -24.92 9 0.000000 \n",
"1 GAGSSEPVTGLDAK RT-pep b 0.00 14 0.199488 \n",
"2 VEATFGVDESNAK RT-pep c 12.39 13 0.298671 \n",
"3 YILAGVENSK RT-pep d 19.79 10 0.357909 \n",
"4 TPVISGGPYEYR RT-pep e 28.71 12 0.429315 \n",
"5 TPVITGAPYEYR RT-pep f 33.38 12 0.466699 \n",
"6 DGLDAASYYAPVR RT-pep g 42.26 13 0.537784 \n",
"7 ADVTPADFSEWSK RT-pep h 54.62 13 0.636728 \n",
"8 GTFIIDPGGVIR RT-pep i 70.52 12 0.764009 \n",
"9 GTFIIDPAAVIR RT-pep k 87.23 12 0.897775 \n",
"10 LFLQFGAQGSPFLK RT-pep l 100.00 14 1.000000 \n",
"\n",
" predicted_normalized_irt \n",
"0 0.000000 \n",
"1 0.000000 \n",
"2 0.140912 \n",
"3 0.142185 \n",
"4 0.210857 \n",
"5 0.277200 \n",
"6 0.088854 \n",
"7 0.399164 \n",
"8 0.596815 \n",
"9 0.701862 \n",
"10 0.877500 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example_df = create_example_input_dataframe_normalized_irt()\n",
"\n",
"# initialize the ModelInterface and specify which Model class to use\n",
"model = ModelInterface_for_Generic_AASeq_Regression(\n",
" model_class=Model_for_Generic_AASeq_Regression_Transformer\n",
")\n",
"# specify the column to train on and the column that will store the predictions\n",
"model.target_column_to_train = 'normalized_irt'\n",
"model.target_column_to_predict = 'predicted_normalized_irt'\n",
"model.train(example_df, epoch=20)\n",
"model.predict(example_df)"
]
},
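{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sanity check of the fit above, the predicted and target columns can be compared directly with pandas. This is only a sketch: the two metrics are illustrative choices, and because they are computed on the training peptides they say nothing about how the model generalizes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pred_df = model.predict(example_df)\n",
"\n",
"# mean absolute error between prediction and target\n",
"abs_err = (pred_df['predicted_normalized_irt'] - pred_df['normalized_irt']).abs()\n",
"print(f'mean absolute error: {abs_err.mean():.3f}')\n",
"\n",
"# Pearson correlation (the default method of Series.corr)\n",
"pearson_r = pred_df['normalized_irt'].corr(pred_df['predicted_normalized_irt'])\n",
"print(f'Pearson r: {pearson_r:.3f}')"
]
},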
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Regression models for predicting a scalar value for a given amino acid sequence and site-specific PTMs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Scalar regression model (RT) with modified AA sequences using `Model_for_Generic_ModAASeq_Regression_LSTM`"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" sequence pep_name irt mods mod_sites nAA normalized_irt \\\n",
"0 LGGNEQVTR RT-pep a -24.92 9 0.000000 \n",
"1 GAGSSEPVTGLDAK RT-pep b 0.00 14 0.199488 \n",
"2 VEATFGVDESNAK RT-pep c 12.39 13 0.298671 \n",
"3 YILAGVENSK RT-pep d 19.79 10 0.357909 \n",
"4 TPVISGGPYEYR RT-pep e 28.71 12 0.429315 \n",
"5 TPVITGAPYEYR RT-pep f 33.38 12 0.466699 \n",
"6 DGLDAASYYAPVR RT-pep g 42.26 13 0.537784 \n",
"7 ADVTPADFSEWSK RT-pep h 54.62 13 0.636728 \n",
"8 GTFIIDPGGVIR RT-pep i 70.52 12 0.764009 \n",
"9 GTFIIDPAAVIR RT-pep k 87.23 12 0.897775 \n",
"10 LFLQFGAQGSPFLK RT-pep l 100.00 14 1.000000 \n",
"\n",
" predicted_normalized_irt \n",
"0 0.000000 \n",
"1 0.368827 \n",
"2 0.295824 \n",
"3 0.337074 \n",
"4 0.527409 \n",
"5 0.506031 \n",
"6 0.629531 \n",
"7 0.708878 \n",
"8 0.798570 \n",
"9 0.856519 \n",
"10 0.973729 "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example_df = create_example_input_dataframe_normalized_irt()\n",
"\n",
"# initialize the ModelInterface and specify which Model class to use\n",
"model = ModelInterface_for_Generic_ModAASeq_Regression(\n",
" model_class=Model_for_Generic_ModAASeq_Regression_LSTM\n",
")\n",
"# specify the column to train on and the column that will store the predictions\n",
"model.target_column_to_train = 'normalized_irt'\n",
"model.target_column_to_predict = 'predicted_normalized_irt'\n",
"model.train(example_df, epoch=20)\n",
"model.predict(example_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Scalar regression model (RT) with modified AA sequences using `Model_for_Generic_ModAASeq_Regression_Transformer`"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" sequence pep_name irt mods mod_sites nAA \\\n",
"0 LGGNEQVTR RT-pep a -24.92 9 \n",
"1 GAGSSEPVTGLDAK RT-pep b 0.00 Phospho@S 4 14 \n",
"2 VEATFGVDESNAK RT-pep c 12.39 13 \n",
"3 YILAGVENSK RT-pep d 19.79 10 \n",
"4 TPVISGGPYEYR RT-pep e 28.71 12 \n",
"5 TPVITGAPYEYR RT-pep f 33.38 12 \n",
"6 DGLDAASYYAPVR RT-pep g 42.26 13 \n",
"7 ADVTPADFSEWSK RT-pep h 54.62 13 \n",
"8 GTFIIDPGGVIR RT-pep i 70.52 12 \n",
"9 GTFIIDPAAVIR RT-pep k 87.23 12 \n",
"10 LFLQFGAQGSPFLK RT-pep l 100.00 14 \n",
"\n",
" normalized_irt predicted_normalized_irt \n",
"0 0.000000 0.088521 \n",
"1 0.199488 0.571920 \n",
"2 0.298671 0.285101 \n",
"3 0.357909 0.367173 \n",
"4 0.429315 0.615492 \n",
"5 0.466699 0.589607 \n",
"6 0.537784 0.539454 \n",
"7 0.636728 0.587029 \n",
"8 0.764009 0.880274 \n",
"9 0.897775 0.811531 \n",
"10 1.000000 1.084086 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example_df = create_example_input_dataframe_normalized_irt()\n",
"example_df.loc[1,'mods'] = 'Phospho@S'\n",
"example_df.loc[1,'mod_sites'] = '4'\n",
"\n",
"# initialize the ModelInterface and specify which Model class to use\n",
"model = ModelInterface_for_Generic_ModAASeq_Regression(\n",
" model_class=Model_for_Generic_ModAASeq_Regression_Transformer\n",
")\n",
"# specify the column to train on and the column that will store the predictions\n",
"model.target_column_to_train = 'normalized_irt'\n",
"model.target_column_to_predict = 'predicted_normalized_irt'\n",
"model.train(example_df, epoch=20)\n",
"model.predict(example_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Binary classification models for a given amino acid sequence"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# a simple classification dataset\n",
"def create_example_input_dataframe_classification_rt():\n",
" rt_df = create_example_input_dataframe_normalized_irt()\n",
" rt_df['is_in_first_half_of_column'] = 0\n",
" rt_df.loc[:5,'is_in_first_half_of_column']=1\n",
" return rt_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### A sequence classification model using `Model_for_Generic_AASeq_BinaryClassification_LSTM`"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" sequence pep_name irt mods mod_sites nAA normalized_irt \\\n",
"0 LGGNEQVTR RT-pep a -24.92 9 0.000000 \n",
"1 GAGSSEPVTGLDAK RT-pep b 0.00 14 0.199488 \n",
"2 VEATFGVDESNAK RT-pep c 12.39 13 0.298671 \n",
"3 YILAGVENSK RT-pep d 19.79 10 0.357909 \n",
"4 TPVISGGPYEYR RT-pep e 28.71 12 0.429315 \n",
"5 TPVITGAPYEYR RT-pep f 33.38 12 0.466699 \n",
"6 DGLDAASYYAPVR RT-pep g 42.26 13 0.537784 \n",
"7 ADVTPADFSEWSK RT-pep h 54.62 13 0.636728 \n",
"8 GTFIIDPGGVIR RT-pep i 70.52 12 0.764009 \n",
"9 GTFIIDPAAVIR RT-pep k 87.23 12 0.897775 \n",
"10 LFLQFGAQGSPFLK RT-pep l 100.00 14 1.000000 \n",
"\n",
" is_in_first_half_of_column predicted_will_be_in_first_half_of_column \n",
"0 1 0.991829 \n",
"1 1 0.990733 \n",
"2 1 0.991083 \n",
"3 1 0.991600 \n",
"4 1 0.992202 \n",
"5 1 0.990124 \n",
"6 0 0.351366 \n",
"7 0 0.359982 \n",
"8 0 0.352756 \n",
"9 0 0.351209 \n",
"10 0 0.349120 "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example_df = create_example_input_dataframe_classification_rt()\n",
"\n",
"# initialize the ModelInterface and specify which Model class to use\n",
"model = ModelInterface_for_Generic_AASeq_BinaryClassification(\n",
" model_class=Model_for_Generic_AASeq_BinaryClassification_LSTM\n",
")\n",
"# specify the column to train on and the column that will store the predictions\n",
"model.target_column_to_train = 'is_in_first_half_of_column' \n",
"model.target_column_to_predict = 'predicted_will_be_in_first_half_of_column'\n",
"model.train(example_df, epoch=20)\n",
"model.predict(example_df)"
]
},
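{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the output above, the predictions are scores between 0 and 1 rather than class labels. A minimal sketch of turning the scores into labels, assuming a cutoff of 0.5 (the cutoff and the `predicted_label` column name are illustrative choices, not part of peptdeep):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pred_df = model.predict(example_df)\n",
"\n",
"# assumed cutoff of 0.5; tune it for your own data\n",
"pred_df['predicted_label'] = (\n",
"    pred_df['predicted_will_be_in_first_half_of_column'] > 0.5\n",
").astype(int)\n",
"pred_df[['sequence', 'is_in_first_half_of_column', 'predicted_label']]"
]
},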
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### A sequence classification model using `Model_for_Generic_AASeq_BinaryClassification_Transformer`"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" sequence pep_name irt mods mod_sites nAA normalized_irt \\\n",
"0 LGGNEQVTR RT-pep a -24.92 9 0.000000 \n",
"1 GAGSSEPVTGLDAK RT-pep b 0.00 14 0.199488 \n",
"2 VEATFGVDESNAK RT-pep c 12.39 13 0.298671 \n",
"3 YILAGVENSK RT-pep d 19.79 10 0.357909 \n",
"4 TPVISGGPYEYR RT-pep e 28.71 12 0.429315 \n",
"5 TPVITGAPYEYR RT-pep f 33.38 12 0.466699 \n",
"6 DGLDAASYYAPVR RT-pep g 42.26 13 0.537784 \n",
"7 ADVTPADFSEWSK RT-pep h 54.62 13 0.636728 \n",
"8 GTFIIDPGGVIR RT-pep i 70.52 12 0.764009 \n",
"9 GTFIIDPAAVIR RT-pep k 87.23 12 0.897775 \n",
"10 LFLQFGAQGSPFLK RT-pep l 100.00 14 1.000000 \n",
"\n",
" is_in_first_half_of_column predicted_will_be_in_first_half_of_column \n",
"0 1 0.997586 \n",
"1 1 0.997438 \n",
"2 1 0.996627 \n",
"3 1 0.997642 \n",
"4 1 0.996989 \n",
"5 1 0.996926 \n",
"6 0 0.004032 \n",
"7 0 0.004321 \n",
"8 0 0.004137 \n",
"9 0 0.003938 \n",
"10 0 0.004114 "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example_df = create_example_input_dataframe_classification_rt()\n",
"\n",
"# initialize the ModelInterface and specify which Model class to use\n",
"model = ModelInterface_for_Generic_AASeq_BinaryClassification(\n",
" model_class=Model_for_Generic_AASeq_BinaryClassification_Transformer\n",
")\n",
"# specify the column to train on and the column that will store the predictions\n",
"model.target_column_to_train = 'is_in_first_half_of_column'\n",
"model.target_column_to_predict = 'predicted_will_be_in_first_half_of_column'\n",
"model.train(example_df, epoch=20)\n",
"model.predict(example_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Binary classification models for a given amino acid sequence and site-specific PTMs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### A modified-sequence classification model using `Model_for_Generic_ModAASeq_BinaryClassification_LSTM`"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" sequence pep_name irt mods mod_sites nAA normalized_irt \\\n",
"0 LGGNEQVTR RT-pep a -24.92 9 0.000000 \n",
"1 GAGSSEPVTGLDAK RT-pep b 0.00 14 0.199488 \n",
"2 VEATFGVDESNAK RT-pep c 12.39 13 0.298671 \n",
"3 YILAGVENSK RT-pep d 19.79 10 0.357909 \n",
"4 TPVISGGPYEYR RT-pep e 28.71 12 0.429315 \n",
"5 TPVITGAPYEYR RT-pep f 33.38 12 0.466699 \n",
"6 DGLDAASYYAPVR RT-pep g 42.26 13 0.537784 \n",
"7 ADVTPADFSEWSK RT-pep h 54.62 13 0.636728 \n",
"8 GTFIIDPGGVIR RT-pep i 70.52 12 0.764009 \n",
"9 GTFIIDPAAVIR RT-pep k 87.23 12 0.897775 \n",
"10 LFLQFGAQGSPFLK RT-pep l 100.00 14 1.000000 \n",
"\n",
" is_in_first_half_of_column predicted_will_be_in_first_half_of_column \n",
"0 1 0.993120 \n",
"1 1 0.990600 \n",
"2 1 0.992972 \n",
"3 1 0.992984 \n",
"4 1 0.992323 \n",
"5 1 0.988538 \n",
"6 0 0.370841 \n",
"7 0 0.368691 \n",
"8 0 0.378124 \n",
"9 0 0.367393 \n",
"10 0 0.365957 "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example_df = create_example_input_dataframe_classification_rt()\n",
"\n",
"# initialize the model interface and specify which model class to use\n",
"model = ModelInterface_for_Generic_ModAASeq_BinaryClassification(\n",
" model_class=Model_for_Generic_ModAASeq_BinaryClassification_LSTM\n",
")\n",
"# specify the column that holds the training labels and the column that will receive the predictions\n",
"model.target_column_to_train = 'is_in_first_half_of_column'\n",
"model.target_column_to_predict = 'predicted_will_be_in_first_half_of_column'\n",
"model.train(example_df, epoch=20)\n",
"model.predict(example_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### A sequence classification model using `Model_for_Generic_ModAASeq_BinaryClassification_Transformer`"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>sequence</th>\n",
"      <th>pep_name</th>\n",
"      <th>irt</th>\n",
"      <th>mods</th>\n",
"      <th>mod_sites</th>\n",
"      <th>nAA</th>\n",
"      <th>normalized_irt</th>\n",
"      <th>is_in_first_half_of_column</th>\n",
"      <th>predicted_will_be_in_first_half_of_column</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>LGGNEQVTR</td>\n",
"      <td>RT-pep a</td>\n",
"      <td>-24.92</td>\n",
"      <td></td>\n",
"      <td></td>\n",
"      <td>9</td>\n",
"      <td>0.000000</td>\n",
"      <td>1</td>\n",
"      <td>0.997545</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>GAGSSEPVTGLDAK</td>\n",
"      <td>RT-pep b</td>\n",
"      <td>0.00</td>\n",
"      <td></td>\n",
"      <td></td>\n",
"      <td>14</td>\n",
"      <td>0.199488</td>\n",
"      <td>1</td>\n",
"      <td>0.996575</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>VEATFGVDESNAK</td>\n",
"      <td>RT-pep c</td>\n",
"      <td>12.39</td>\n",
"      <td></td>\n",
"      <td></td>\n",
"      <td>13</td>\n",
"      <td>0.298671</td>\n",
"      <td>1</td>\n",
"      <td>0.995498</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>YILAGVENSK</td>\n",
"      <td>RT-pep d</td>\n",
"      <td>19.79</td>\n",
"      <td></td>\n",
"      <td></td>\n",
"      <td>10</td>\n",
"      <td>0.357909</td>\n",
"      <td>1</td>\n",
"      <td>0.997241</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>TPVISGGPYEYR</td>\n",
"      <td>RT-pep e</td>\n",
"      <td>28.71</td>\n",
"      <td></td>\n",
"      <td></td>\n",
"      <td>12</td>\n",
"      <td>0.429315</td>\n",
"      <td>1</td>\n",
"      <td>0.996784</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>5</th>\n",
"      <td>TPVITGAPYEYR</td>\n",
"      <td>RT-pep f</td>\n",
"      <td>33.38</td>\n",
"      <td></td>\n",
"      <td></td>\n",
"      <td>12</td>\n",
"      <td>0.466699</td>\n",
"      <td>1</td>\n",
"      <td>0.995732</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>6</th>\n",
"      <td>DGLDAASYYAPVR</td>\n",
"      <td>RT-pep g</td>\n",
"      <td>42.26</td>\n",
"      <td></td>\n",
"      <td></td>\n",
"      <td>13</td>\n",
"      <td>0.537784</td>\n",
"      <td>0</td>\n",
"      <td>0.004000</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>7</th>\n",
"      <td>ADVTPADFSEWSK</td>\n",
"      <td>RT-pep h</td>\n",
"      <td>54.62</td>\n",
"      <td></td>\n",
"      <td></td>\n",
"      <td>13</td>\n",
"      <td>0.636728</td>\n",
"      <td>0</td>\n",
"      <td>0.005084</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>8</th>\n",
"      <td>GTFIIDPGGVIR</td>\n",
"      <td>RT-pep i</td>\n",
"      <td>70.52</td>\n",
"      <td></td>\n",
"      <td></td>\n",
"      <td>12</td>\n",
"      <td>0.764009</td>\n",
"      <td>0</td>\n",
"      <td>0.004195</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>9</th>\n",
"      <td>GTFIIDPAAVIR</td>\n",
"      <td>RT-pep k</td>\n",
"      <td>87.23</td>\n",
"      <td></td>\n",
"      <td></td>\n",
"      <td>12</td>\n",
"      <td>0.897775</td>\n",
"      <td>0</td>\n",
"      <td>0.003547</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>10</th>\n",
"      <td>LFLQFGAQGSPFLK</td>\n",
"      <td>RT-pep l</td>\n",
"      <td>100.00</td>\n",
"      <td></td>\n",
"      <td></td>\n",
"      <td>14</td>\n",
"      <td>1.000000</td>\n",
"      <td>0</td>\n",
"      <td>0.003279</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sequence pep_name irt mods mod_sites nAA normalized_irt \\\n",
"0 LGGNEQVTR RT-pep a -24.92 9 0.000000 \n",
"1 GAGSSEPVTGLDAK RT-pep b 0.00 14 0.199488 \n",
"2 VEATFGVDESNAK RT-pep c 12.39 13 0.298671 \n",
"3 YILAGVENSK RT-pep d 19.79 10 0.357909 \n",
"4 TPVISGGPYEYR RT-pep e 28.71 12 0.429315 \n",
"5 TPVITGAPYEYR RT-pep f 33.38 12 0.466699 \n",
"6 DGLDAASYYAPVR RT-pep g 42.26 13 0.537784 \n",
"7 ADVTPADFSEWSK RT-pep h 54.62 13 0.636728 \n",
"8 GTFIIDPGGVIR RT-pep i 70.52 12 0.764009 \n",
"9 GTFIIDPAAVIR RT-pep k 87.23 12 0.897775 \n",
"10 LFLQFGAQGSPFLK RT-pep l 100.00 14 1.000000 \n",
"\n",
" is_in_first_half_of_column predicted_will_be_in_first_half_of_column \n",
"0 1 0.997545 \n",
"1 1 0.996575 \n",
"2 1 0.995498 \n",
"3 1 0.997241 \n",
"4 1 0.996784 \n",
"5 1 0.995732 \n",
"6 0 0.004000 \n",
"7 0 0.005084 \n",
"8 0 0.004195 \n",
"9 0 0.003547 \n",
"10 0 0.003279 "
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example_df = create_example_input_dataframe_classification_rt()\n",
"\n",
"# initialize the model interface and specify which model class to use\n",
"model = ModelInterface_for_Generic_ModAASeq_BinaryClassification(\n",
" model_class=Model_for_Generic_ModAASeq_BinaryClassification_Transformer\n",
")\n",
"# specify the column that holds the training labels and the column that will receive the predictions\n",
"model.target_column_to_train = 'is_in_first_half_of_column'\n",
"model.target_column_to_predict = 'predicted_will_be_in_first_half_of_column'\n",
"model.train(example_df, epoch=20)\n",
"model.predict(example_df)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.8.3 ('base')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
},
"vscode": {
"interpreter": {
"hash": "8a3b27e141e49c996c9b863f8707e97aabd49c4a7e8445b9b783b34e4a21a9b2"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}