Tutorial: building new models from scratch¶
even without experience in deep learning
[1]:
%reload_ext autoreload
%autoreload 2
To predict or classify a new peptide property, you only need to provide peptides together with the corresponding property values (e.g. ‘binding_affinity’).
We provide several generic ModelInterface and Model classes in the peptdeep.model.generic_property_prediction module so that users can easily build models for regression and classification problems. Examples follow.
Imports¶
[2]:
from peptdeep.model.generic_property_prediction import (
    ModelInterface_for_Generic_AASeq_BinaryClassification,
    ModelInterface_for_Generic_AASeq_Regression,
    ModelInterface_for_Generic_ModAASeq_BinaryClassification,
    ModelInterface_for_Generic_ModAASeq_Regression,
)
from peptdeep.model.generic_property_prediction import (
    Model_for_Generic_AASeq_BinaryClassification_LSTM,
    Model_for_Generic_AASeq_BinaryClassification_Transformer,
    Model_for_Generic_AASeq_Regression_LSTM,
    Model_for_Generic_AASeq_Regression_Transformer,
    Model_for_Generic_ModAASeq_BinaryClassification_LSTM,
    Model_for_Generic_ModAASeq_BinaryClassification_Transformer,
    Model_for_Generic_ModAASeq_Regression_LSTM,
    Model_for_Generic_ModAASeq_Regression_Transformer,
)
Define example Table/DataFrame¶
[3]:
from peptdeep.model.rt import IRT_PEPTIDE_DF
[4]:
def create_example_input_dataframe_normalized_irt():
    irt_df = IRT_PEPTIDE_DF.copy()
    # min-max normalization: scale irt linearly to the range 0 to 1
    irt_df['normalized_irt'] = (
        irt_df.irt - irt_df.irt.min()
    ) / (irt_df.irt.max() - irt_df.irt.min())
    return irt_df

create_example_input_dataframe_normalized_irt()
[4]:
| | sequence | pep_name | irt | mods | mod_sites | nAA | normalized_irt |
|---|---|---|---|---|---|---|---|
| 0 | LGGNEQVTR | RT-pep a | -24.92 | | | 9 | 0.000000 |
| 1 | GAGSSEPVTGLDAK | RT-pep b | 0.00 | | | 14 | 0.199488 |
| 2 | VEATFGVDESNAK | RT-pep c | 12.39 | | | 13 | 0.298671 |
| 3 | YILAGVENSK | RT-pep d | 19.79 | | | 10 | 0.357909 |
| 4 | TPVISGGPYEYR | RT-pep e | 28.71 | | | 12 | 0.429315 |
| 5 | TPVITGAPYEYR | RT-pep f | 33.38 | | | 12 | 0.466699 |
| 6 | DGLDAASYYAPVR | RT-pep g | 42.26 | | | 13 | 0.537784 |
| 7 | ADVTPADFSEWSK | RT-pep h | 54.62 | | | 13 | 0.636728 |
| 8 | GTFIIDPGGVIR | RT-pep i | 70.52 | | | 12 | 0.764009 |
| 9 | GTFIIDPAAVIR | RT-pep k | 87.23 | | | 12 | 0.897775 |
| 10 | LFLQFGAQGSPFLK | RT-pep l | 100.00 | | | 14 | 1.000000 |
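The 0-to-1 scaling in create_example_input_dataframe_normalized_irt() is plain min-max normalization. The sketch below reproduces it with the Python standard library only, using the iRT values from the table above. The helper name min_max_normalize is illustrative and not part of peptdeep.

```python
# iRT values of the 11 reference peptides, copied from the table above
irt_values = [-24.92, 0.00, 12.39, 19.79, 28.71, 33.38,
              42.26, 54.62, 70.52, 87.23, 100.00]


def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span == 0:
        raise ValueError("all values are identical, cannot normalize")
    return [(value - lowest) / span for value in values]


normalized_irt = min_max_normalize(irt_values)
for irt, norm in zip(irt_values, normalized_irt):
    print(f"{irt:8.2f} -> {norm:.6f}")
```

The printed values can be compared row by row with the normalized_irt column.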
Steps to build a model from scratch¶
In the following examples, building a model takes only 7 steps:
1. Prepare a training dataframe with a sequence column (plus mods and mod_sites columns if the model should also take modifications into account) and a target value column to train on.
2. Select a ModelInterface class based on the prediction problem (classification or regression, for sequences or modified sequences). Select a Model class when initializing the ModelInterface class.
3. Tell the ModelInterface object which column in the training dataframe stores the target values, and which column should store the predicted values.
4. Call model.train() for training.
5. Call model.predict() for prediction.
Save and load models:
6. Call model.save("/model_folder/model.pth") to save the model.
7. Use the same ModelInterface and Model classes, and call model.load("/model_folder/model.pth") to load the model for transfer learning and prediction.
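The seven steps can be condensed into one function. This is a sketch rather than a reference implementation: it uses only the calls named in the steps above (train, predict, save, load, and the two target_column_* attributes). The function name train_save_reload, its parameters, and the choice to reload into a second interface object are illustrative assumptions. peptdeep is imported inside the function so the sketch can be defined even where peptdeep is not installed.

```python
def train_save_reload(train_df, model_path, target_column, epoch=20):
    """Train a sequence regression model, save it, and predict with a reloaded copy.

    train_df is the training dataframe from step 1 (a 'sequence' column plus
    the column named by target_column). model_path is the file to write.
    """
    # Imported lazily; calling this function requires peptdeep to be installed.
    from peptdeep.model.generic_property_prediction import (
        ModelInterface_for_Generic_AASeq_Regression,
        Model_for_Generic_AASeq_Regression_LSTM,
    )

    predicted_column = f"predicted_{target_column}"

    # Step 2: choose the interface for the problem and the model class behind it.
    model = ModelInterface_for_Generic_AASeq_Regression(
        model_class=Model_for_Generic_AASeq_Regression_LSTM
    )
    # Step 3: name the target column and the column that receives predictions.
    model.target_column_to_train = target_column
    model.target_column_to_predict = predicted_column
    # Step 4: train.
    model.train(train_df, epoch=epoch)
    # Step 6: save the trained model.
    model.save(model_path)

    # Step 7: build the same interface/model pair again and load the saved file.
    reloaded = ModelInterface_for_Generic_AASeq_Regression(
        model_class=Model_for_Generic_AASeq_Regression_LSTM
    )
    reloaded.target_column_to_predict = predicted_column
    reloaded.load(model_path)
    # Step 5: predict with the reloaded model.
    return reloaded.predict(train_df)
```

With the example dataframe defined earlier, a call might look like train_save_reload(create_example_input_dataframe_normalized_irt(), "model.pth", "normalized_irt").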
Building a simple RT model based on Model_for_Generic_AASeq_Regression_LSTM¶
[5]:
example_df = create_example_input_dataframe_normalized_irt()
# initialize the ModelInterface and specify which model class to use
model = ModelInterface_for_Generic_AASeq_Regression(
    model_class=Model_for_Generic_AASeq_Regression_LSTM
)
# name the column holding the training targets and the column that will receive the predictions
model.target_column_to_train = 'normalized_irt'
model.target_column_to_predict = 'predicted_normalized_irt'
model.train(example_df, epoch=20)
model.predict(example_df)
[5]:
| | sequence | pep_name | irt | mods | mod_sites | nAA | normalized_irt | predicted_normalized_irt |
|---|---|---|---|---|---|---|---|---|
| 0 | LGGNEQVTR | RT-pep a | -24.92 | | | 9 | 0.000000 | 0.000000 |
| 1 | GAGSSEPVTGLDAK | RT-pep b | 0.00 | | | 14 | 0.199488 | 0.203671 |
| 2 | VEATFGVDESNAK | RT-pep c | 12.39 | | | 13 | 0.298671 | 0.312852 |
| 3 | YILAGVENSK | RT-pep d | 19.79 | | | 10 | 0.357909 | 0.365846 |
| 4 | TPVISGGPYEYR | RT-pep e | 28.71 | | | 12 | 0.429315 | 0.434760 |
| 5 | TPVITGAPYEYR | RT-pep f | 33.38 | | | 12 | 0.466699 | 0.465173 |
| 6 | DGLDAASYYAPVR | RT-pep g | 42.26 | | | 13 | 0.537784 | 0.564576 |
| 7 | ADVTPADFSEWSK | RT-pep h | 54.62 | | | 13 | 0.636728 | 0.678894 |
| 8 | GTFIIDPGGVIR | RT-pep i | 70.52 | | | 12 | 0.764009 | 0.893195 |
| 9 | GTFIIDPAAVIR | RT-pep k | 87.23 | | | 12 | 0.897775 | 1.061624 |
| 10 | LFLQFGAQGSPFLK | RT-pep l | 100.00 | | | 14 | 1.000000 | 1.100713 |
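The model is evaluated here on the same 11 peptides it was trained on, so the table mainly shows how closely it fits the training data. One way to put a number on that fit is the mean absolute error between the two columns. The sketch below uses the values copied from the table above. mean_absolute_error is an illustrative helper written for this tutorial, not a peptdeep function.

```python
# (normalized_irt, predicted_normalized_irt) pairs copied from the table above
target_and_predicted = [
    (0.000000, 0.000000),
    (0.199488, 0.203671),
    (0.298671, 0.312852),
    (0.357909, 0.365846),
    (0.429315, 0.434760),
    (0.466699, 0.465173),
    (0.537784, 0.564576),
    (0.636728, 0.678894),
    (0.764009, 0.893195),
    (0.897775, 1.061624),
    (1.000000, 1.100713),
]


def mean_absolute_error(pairs):
    """Average of |target - predicted| over all (target, predicted) pairs."""
    if not pairs:
        raise ValueError("need at least one (target, predicted) pair")
    return sum(abs(target - predicted) for target, predicted in pairs) / len(pairs)


mae = mean_absolute_error(target_and_predicted)
# The pair with the largest absolute deviation
worst_pair = max(target_and_predicted, key=lambda pair: abs(pair[0] - pair[1]))

print(f"mean absolute error: {mae:.4f}")
print(f"largest deviation: target {worst_pair[0]}, predicted {worst_pair[1]}")
```

The same helper can be applied to the prediction column of any of the regression tables in this tutorial.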
Building a simple sequence-only RT model based on Model_for_Generic_AASeq_Regression_Transformer¶
[6]:
example_df = create_example_input_dataframe_normalized_irt()
# initialize the ModelInterface and specify which model class to use
model = ModelInterface_for_Generic_AASeq_Regression(
    model_class=Model_for_Generic_AASeq_Regression_Transformer
)
# name the column holding the training targets and the column that will receive the predictions
model.target_column_to_train = 'normalized_irt'
model.target_column_to_predict = 'predicted_normalized_irt'
model.train(example_df, epoch=20)
model.predict(example_df)
[6]:
| | sequence | pep_name | irt | mods | mod_sites | nAA | normalized_irt | predicted_normalized_irt |
|---|---|---|---|---|---|---|---|---|
| 0 | LGGNEQVTR | RT-pep a | -24.92 | | | 9 | 0.000000 | 0.000000 |
| 1 | GAGSSEPVTGLDAK | RT-pep b | 0.00 | | | 14 | 0.199488 | 0.000000 |
| 2 | VEATFGVDESNAK | RT-pep c | 12.39 | | | 13 | 0.298671 | 0.140912 |
| 3 | YILAGVENSK | RT-pep d | 19.79 | | | 10 | 0.357909 | 0.142185 |
| 4 | TPVISGGPYEYR | RT-pep e | 28.71 | | | 12 | 0.429315 | 0.210857 |
| 5 | TPVITGAPYEYR | RT-pep f | 33.38 | | | 12 | 0.466699 | 0.277200 |
| 6 | DGLDAASYYAPVR | RT-pep g | 42.26 | | | 13 | 0.537784 | 0.088854 |
| 7 | ADVTPADFSEWSK | RT-pep h | 54.62 | | | 13 | 0.636728 | 0.399164 |
| 8 | GTFIIDPGGVIR | RT-pep i | 70.52 | | | 12 | 0.764009 | 0.596815 |
| 9 | GTFIIDPAAVIR | RT-pep k | 87.23 | | | 12 | 0.897775 | 0.701862 |
| 10 | LFLQFGAQGSPFLK | RT-pep l | 100.00 | | | 14 | 1.000000 | 0.877500 |
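The predictions are on the normalized 0-to-1 scale. To read them as iRT values again, the min-max normalization can be inverted with the same minimum and maximum that were used to build the training column (-24.92 and 100.00 in the example dataframe). A minimal sketch, with denormalize_irt as an illustrative helper name that is not part of peptdeep:

```python
IRT_MIN = -24.92   # smallest iRT in the example dataframe
IRT_MAX = 100.00   # largest iRT in the example dataframe


def denormalize_irt(normalized_value, irt_min=IRT_MIN, irt_max=IRT_MAX):
    """Invert the 0-to-1 min-max scaling: normalized * (max - min) + min."""
    return normalized_value * (irt_max - irt_min) + irt_min


# Round trip on a known target: 0.199488 is the normalized iRT of RT-pep b (iRT 0.00)
print(round(denormalize_irt(0.199488), 2))
# A prediction from the table above: RT-pep l, predicted 0.877500
print(round(denormalize_irt(0.877500), 2))
```

Using the training minimum and maximum matters here: recomputing them from the predictions would silently rescale the output.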
Regression models for predicting a scalar value for a given amino acid sequence and site-specific PTMs¶
Scalar regression model (RT) with modified AA sequences using Model_for_Generic_ModAASeq_Regression_LSTM¶
[7]:
example_df = create_example_input_dataframe_normalized_irt()
# initialize the ModelInterface and specify which model class to use
model = ModelInterface_for_Generic_ModAASeq_Regression(
    model_class=Model_for_Generic_ModAASeq_Regression_LSTM
)
# name the column holding the training targets and the column that will receive the predictions
model.target_column_to_train = 'normalized_irt'
model.target_column_to_predict = 'predicted_normalized_irt'
model.train(example_df, epoch=20)
model.predict(example_df)
[7]:
| | sequence | pep_name | irt | mods | mod_sites | nAA | normalized_irt | predicted_normalized_irt |
|---|---|---|---|---|---|---|---|---|
| 0 | LGGNEQVTR | RT-pep a | -24.92 | | | 9 | 0.000000 | 0.000000 |
| 1 | GAGSSEPVTGLDAK | RT-pep b | 0.00 | | | 14 | 0.199488 | 0.368827 |
| 2 | VEATFGVDESNAK | RT-pep c | 12.39 | | | 13 | 0.298671 | 0.295824 |
| 3 | YILAGVENSK | RT-pep d | 19.79 | | | 10 | 0.357909 | 0.337074 |
| 4 | TPVISGGPYEYR | RT-pep e | 28.71 | | | 12 | 0.429315 | 0.527409 |
| 5 | TPVITGAPYEYR | RT-pep f | 33.38 | | | 12 | 0.466699 | 0.506031 |
| 6 | DGLDAASYYAPVR | RT-pep g | 42.26 | | | 13 | 0.537784 | 0.629531 |
| 7 | ADVTPADFSEWSK | RT-pep h | 54.62 | | | 13 | 0.636728 | 0.708878 |
| 8 | GTFIIDPGGVIR | RT-pep i | 70.52 | | | 12 | 0.764009 | 0.798570 |
| 9 | GTFIIDPAAVIR | RT-pep k | 87.23 | | | 12 | 0.897775 | 0.856519 |
| 10 | LFLQFGAQGSPFLK | RT-pep l | 100.00 | | | 14 | 1.000000 | 0.973729 |
Scalar regression model (RT) with modified AA sequences using Model_for_Generic_ModAASeq_Regression_Transformer¶
[8]:
example_df = create_example_input_dataframe_normalized_irt()
# add a phosphorylation to the second peptide (row 1) so the model sees a modified sequence
example_df.loc[1, 'mods'] = 'Phospho@S'
example_df.loc[1, 'mod_sites'] = '4'
# initialize the ModelInterface and specify which model class to use
model = ModelInterface_for_Generic_ModAASeq_Regression(
    model_class=Model_for_Generic_ModAASeq_Regression_Transformer
)
# name the column holding the training targets and the column that will receive the predictions
model.target_column_to_train = 'normalized_irt'
model.target_column_to_predict = 'predicted_normalized_irt'
model.train(example_df, epoch=20)
model.predict(example_df)
[8]:
| | sequence | pep_name | irt | mods | mod_sites | nAA | normalized_irt | predicted_normalized_irt |
|---|---|---|---|---|---|---|---|---|
| 0 | LGGNEQVTR | RT-pep a | -24.92 | | | 9 | 0.000000 | 0.088521 |
| 1 | GAGSSEPVTGLDAK | RT-pep b | 0.00 | Phospho@S | 4 | 14 | 0.199488 | 0.571920 |
| 2 | VEATFGVDESNAK | RT-pep c | 12.39 | | | 13 | 0.298671 | 0.285101 |
| 3 | YILAGVENSK | RT-pep d | 19.79 | | | 10 | 0.357909 | 0.367173 |
| 4 | TPVISGGPYEYR | RT-pep e | 28.71 | | | 12 | 0.429315 | 0.615492 |
| 5 | TPVITGAPYEYR | RT-pep f | 33.38 | | | 12 | 0.466699 | 0.589607 |
| 6 | DGLDAASYYAPVR | RT-pep g | 42.26 | | | 13 | 0.537784 | 0.539454 |
| 7 | ADVTPADFSEWSK | RT-pep h | 54.62 | | | 13 | 0.636728 | 0.587029 |
| 8 | GTFIIDPGGVIR | RT-pep i | 70.52 | | | 12 | 0.764009 | 0.880274 |
| 9 | GTFIIDPAAVIR | RT-pep k | 87.23 | | | 12 | 0.897775 | 0.811531 |
| 10 | LFLQFGAQGSPFLK | RT-pep l | 100.00 | | | 14 | 1.000000 | 1.084086 |
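The mods and mod_sites columns describe which modification sits on which residue: in the example, 'Phospho@S' is paired with site '4' of GAGSSEPVTGLDAK. The sketch below pairs the two columns and looks up the residue at each site. It assumes that several modifications are separated by ';' and that sites are 1-based positions in the sequence. Both are assumptions about the column format that this tutorial does not state, so check them against the peptdeep documentation. pair_mods_with_sites is an illustrative helper, not a peptdeep function.

```python
def pair_mods_with_sites(sequence, mods, mod_sites):
    """Return (mod_name, site, residue) tuples for one peptide.

    Assumes ';'-separated entries and 1-based sites (assumptions, see text).
    Empty strings mean the peptide is unmodified. Terminal modifications
    are not handled in this sketch.
    """
    if not mods:
        return []
    mod_names = mods.split(";")
    sites = [int(site) for site in mod_sites.split(";")]
    if len(mod_names) != len(sites):
        raise ValueError("mods and mod_sites must have the same number of entries")
    paired = []
    for mod_name, site in zip(mod_names, sites):
        if not 1 <= site <= len(sequence):
            raise ValueError(f"site {site} is outside the sequence")
        # Convert the 1-based site to a 0-based string index
        paired.append((mod_name, site, sequence[site - 1]))
    return paired


# The modified peptide from the example above
print(pair_mods_with_sites("GAGSSEPVTGLDAK", "Phospho@S", "4"))
# An unmodified peptide
print(pair_mods_with_sites("LGGNEQVTR", "", ""))
```

A check like this is a cheap way to catch a modification whose target residue does not match the sequence before training.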
Binary classification models for a given amino acid sequence¶
[9]:
# a simple classification dataset
def create_example_input_dataframe_classification_rt():
    rt_df = create_example_input_dataframe_normalized_irt()
    rt_df['is_in_first_half_of_column'] = 0
    # .loc slicing is label-inclusive, so rows 0 to 5 get the label 1
    rt_df.loc[:5, 'is_in_first_half_of_column'] = 1
    return rt_df
A sequence classification model using Model_for_Generic_AASeq_BinaryClassification_LSTM¶
[10]:
example_df = create_example_input_dataframe_classification_rt()
# initialize the ModelInterface and specify which model class to use
model = ModelInterface_for_Generic_AASeq_BinaryClassification(
    model_class=Model_for_Generic_AASeq_BinaryClassification_LSTM
)
# name the column holding the training labels and the column that will receive the predictions
model.target_column_to_train = 'is_in_first_half_of_column'
model.target_column_to_predict = 'predicted_will_be_in_first_half_of_column'
model.train(example_df, epoch=20)
model.predict(example_df)
[10]:
| | sequence | pep_name | irt | mods | mod_sites | nAA | normalized_irt | is_in_first_half_of_column | predicted_will_be_in_first_half_of_column |
|---|---|---|---|---|---|---|---|---|---|
| 0 | LGGNEQVTR | RT-pep a | -24.92 | | | 9 | 0.000000 | 1 | 0.991829 |
| 1 | GAGSSEPVTGLDAK | RT-pep b | 0.00 | | | 14 | 0.199488 | 1 | 0.990733 |
| 2 | VEATFGVDESNAK | RT-pep c | 12.39 | | | 13 | 0.298671 | 1 | 0.991083 |
| 3 | YILAGVENSK | RT-pep d | 19.79 | | | 10 | 0.357909 | 1 | 0.991600 |
| 4 | TPVISGGPYEYR | RT-pep e | 28.71 | | | 12 | 0.429315 | 1 | 0.992202 |
| 5 | TPVITGAPYEYR | RT-pep f | 33.38 | | | 12 | 0.466699 | 1 | 0.990124 |
| 6 | DGLDAASYYAPVR | RT-pep g | 42.26 | | | 13 | 0.537784 | 0 | 0.351366 |
| 7 | ADVTPADFSEWSK | RT-pep h | 54.62 | | | 13 | 0.636728 | 0 | 0.359982 |
| 8 | GTFIIDPGGVIR | RT-pep i | 70.52 | | | 12 | 0.764009 | 0 | 0.352756 |
| 9 | GTFIIDPAAVIR | RT-pep k | 87.23 | | | 12 | 0.897775 | 0 | 0.351209 |
| 10 | LFLQFGAQGSPFLK | RT-pep l | 100.00 | | | 14 | 1.000000 | 0 | 0.349120 |
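In the table, the predicted column holds a score between 0 and 1 rather than a hard label. A minimal sketch of turning those scores into labels and checking them against is_in_first_half_of_column, assuming a cutoff of 0.5. The cutoff and the helper name accuracy_at_threshold are illustrative choices, not peptdeep defaults.

```python
# (is_in_first_half_of_column, predicted score) pairs copied from the table above
label_and_score = [
    (1, 0.991829), (1, 0.990733), (1, 0.991083), (1, 0.991600),
    (1, 0.992202), (1, 0.990124), (0, 0.351366), (0, 0.359982),
    (0, 0.352756), (0, 0.351209), (0, 0.349120),
]


def accuracy_at_threshold(pairs, threshold=0.5):
    """Fraction of rows whose thresholded score equals the true 0/1 label."""
    if not pairs:
        raise ValueError("need at least one (label, score) pair")
    correct = sum(
        1 for label, score in pairs if int(score >= threshold) == label
    )
    return correct / len(pairs)


print(accuracy_at_threshold(label_and_score))
# A cutoff below every score labels all rows as 1
print(accuracy_at_threshold(label_and_score, threshold=0.3))
```

On this table, any cutoff above 0.359982 and at or below 0.990124 separates the two groups completely. As with the regression examples, the scores are computed on the training peptides themselves.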
A sequence classification model using Model_for_Generic_AASeq_BinaryClassification_Transformer¶
[11]:
example_df = create_example_input_dataframe_classification_rt()
# initialize the ModelInterface and specify which model class to use
model = ModelInterface_for_Generic_AASeq_BinaryClassification(
    model_class=Model_for_Generic_AASeq_BinaryClassification_Transformer
)
# name the column holding the training labels and the column that will receive the predictions
model.target_column_to_train = 'is_in_first_half_of_column'
model.target_column_to_predict = 'predicted_will_be_in_first_half_of_column'
model.train(example_df, epoch=20)
model.predict(example_df)
[11]:
| | sequence | pep_name | irt | mods | mod_sites | nAA | normalized_irt | is_in_first_half_of_column | predicted_will_be_in_first_half_of_column |
|---|---|---|---|---|---|---|---|---|---|
| 0 | LGGNEQVTR | RT-pep a | -24.92 | | | 9 | 0.000000 | 1 | 0.997586 |
| 1 | GAGSSEPVTGLDAK | RT-pep b | 0.00 | | | 14 | 0.199488 | 1 | 0.997438 |
| 2 | VEATFGVDESNAK | RT-pep c | 12.39 | | | 13 | 0.298671 | 1 | 0.996627 |
| 3 | YILAGVENSK | RT-pep d | 19.79 | | | 10 | 0.357909 | 1 | 0.997642 |
| 4 | TPVISGGPYEYR | RT-pep e | 28.71 | | | 12 | 0.429315 | 1 | 0.996989 |
| 5 | TPVITGAPYEYR | RT-pep f | 33.38 | | | 12 | 0.466699 | 1 | 0.996926 |
| 6 | DGLDAASYYAPVR | RT-pep g | 42.26 | | | 13 | 0.537784 | 0 | 0.004032 |
| 7 | ADVTPADFSEWSK | RT-pep h | 54.62 | | | 13 | 0.636728 | 0 | 0.004321 |
| 8 | GTFIIDPGGVIR | RT-pep i | 70.52 | | | 12 | 0.764009 | 0 | 0.004137 |
| 9 | GTFIIDPAAVIR | RT-pep k | 87.23 | | | 12 | 0.897775 | 0 | 0.003938 |
| 10 | LFLQFGAQGSPFLK | RT-pep l | 100.00 | | | 14 | 1.000000 | 0 | 0.004114 |
Binary classification models for a given amino acid sequence and site-specific PTMs¶
[12]:
def create_example_input_dataframe_classification_rt():
    rt_df = create_example_input_dataframe_normalized_irt()
    rt_df['is_in_first_half_of_column'] = 0
    # .loc slicing is label-inclusive, so rows 0 to 5 get the label 1
    rt_df.loc[:5, 'is_in_first_half_of_column'] = 1
    return rt_df
A sequence classification model using Model_for_Generic_ModAASeq_BinaryClassification_LSTM¶
[13]:
example_df = create_example_input_dataframe_classification_rt()
# initialize the ModelInterface and specify which model class to use
model = ModelInterface_for_Generic_ModAASeq_BinaryClassification(
    model_class=Model_for_Generic_ModAASeq_BinaryClassification_LSTM
)
# name the column holding the training labels and the column that will receive the predictions
model.target_column_to_train = 'is_in_first_half_of_column'
model.target_column_to_predict = 'predicted_will_be_in_first_half_of_column'
model.train(example_df, epoch=20)
model.predict(example_df)
[13]:
| | sequence | pep_name | irt | mods | mod_sites | nAA | normalized_irt | is_in_first_half_of_column | predicted_will_be_in_first_half_of_column |
|---|---|---|---|---|---|---|---|---|---|
| 0 | LGGNEQVTR | RT-pep a | -24.92 | | | 9 | 0.000000 | 1 | 0.993120 |
| 1 | GAGSSEPVTGLDAK | RT-pep b | 0.00 | | | 14 | 0.199488 | 1 | 0.990600 |
| 2 | VEATFGVDESNAK | RT-pep c | 12.39 | | | 13 | 0.298671 | 1 | 0.992972 |
| 3 | YILAGVENSK | RT-pep d | 19.79 | | | 10 | 0.357909 | 1 | 0.992984 |
| 4 | TPVISGGPYEYR | RT-pep e | 28.71 | | | 12 | 0.429315 | 1 | 0.992323 |
| 5 | TPVITGAPYEYR | RT-pep f | 33.38 | | | 12 | 0.466699 | 1 | 0.988538 |
| 6 | DGLDAASYYAPVR | RT-pep g | 42.26 | | | 13 | 0.537784 | 0 | 0.370841 |
| 7 | ADVTPADFSEWSK | RT-pep h | 54.62 | | | 13 | 0.636728 | 0 | 0.368691 |
| 8 | GTFIIDPGGVIR | RT-pep i | 70.52 | | | 12 | 0.764009 | 0 | 0.378124 |
| 9 | GTFIIDPAAVIR | RT-pep k | 87.23 | | | 12 | 0.897775 | 0 | 0.367393 |
| 10 | LFLQFGAQGSPFLK | RT-pep l | 100.00 | | | 14 | 1.000000 | 0 | 0.365957 |
A sequence classification model using Model_for_Generic_ModAASeq_BinaryClassification_Transformer¶
[14]:
example_df = create_example_input_dataframe_classification_rt()
# initialize the ModelInterface and specify which model class to use
model = ModelInterface_for_Generic_ModAASeq_BinaryClassification(
    model_class=Model_for_Generic_ModAASeq_BinaryClassification_Transformer
)
# name the column holding the training labels and the column that will receive the predictions
model.target_column_to_train = 'is_in_first_half_of_column'
model.target_column_to_predict = 'predicted_will_be_in_first_half_of_column'
model.train(example_df, epoch=20)
model.predict(example_df)
[14]:
| | sequence | pep_name | irt | mods | mod_sites | nAA | normalized_irt | is_in_first_half_of_column | predicted_will_be_in_first_half_of_column |
|---|---|---|---|---|---|---|---|---|---|
| 0 | LGGNEQVTR | RT-pep a | -24.92 | | | 9 | 0.000000 | 1 | 0.997545 |
| 1 | GAGSSEPVTGLDAK | RT-pep b | 0.00 | | | 14 | 0.199488 | 1 | 0.996575 |
| 2 | VEATFGVDESNAK | RT-pep c | 12.39 | | | 13 | 0.298671 | 1 | 0.995498 |
| 3 | YILAGVENSK | RT-pep d | 19.79 | | | 10 | 0.357909 | 1 | 0.997241 |
| 4 | TPVISGGPYEYR | RT-pep e | 28.71 | | | 12 | 0.429315 | 1 | 0.996784 |
| 5 | TPVITGAPYEYR | RT-pep f | 33.38 | | | 12 | 0.466699 | 1 | 0.995732 |
| 6 | DGLDAASYYAPVR | RT-pep g | 42.26 | | | 13 | 0.537784 | 0 | 0.004000 |
| 7 | ADVTPADFSEWSK | RT-pep h | 54.62 | | | 13 | 0.636728 | 0 | 0.005084 |
| 8 | GTFIIDPGGVIR | RT-pep i | 70.52 | | | 12 | 0.764009 | 0 | 0.004195 |
| 9 | GTFIIDPAAVIR | RT-pep k | 87.23 | | | 12 | 0.897775 | 0 | 0.003547 |
| 10 | LFLQFGAQGSPFLK | RT-pep l | 100.00 | | | 14 | 1.000000 | 0 | 0.003279 |