Tutorial: building new models from scratch

even without experience in deep learning

[1]:
%reload_ext autoreload
%autoreload 2

In order to predict or classify novel properties of peptides, the user simply needs to provide peptides with corresponding properties (e.g. ‘binding_affinity’).

We provide several generic ModelInterface and Model classes in the peptdeep.model.generic_property_prediction module so that users can easily build models for regression and classification problems. Examples follow.

Imports

[2]:
from peptdeep.model.generic_property_prediction import (
    ModelInterface_for_Generic_AASeq_BinaryClassification,
    ModelInterface_for_Generic_AASeq_Regression,
    ModelInterface_for_Generic_ModAASeq_BinaryClassification,
    ModelInterface_for_Generic_ModAASeq_Regression,
)
from peptdeep.model.generic_property_prediction import (
    Model_for_Generic_AASeq_BinaryClassification_LSTM,
    Model_for_Generic_AASeq_BinaryClassification_Transformer,
    Model_for_Generic_AASeq_Regression_LSTM,
    Model_for_Generic_AASeq_Regression_Transformer,
    Model_for_Generic_ModAASeq_BinaryClassification_LSTM,
    Model_for_Generic_ModAASeq_BinaryClassification_Transformer,
    Model_for_Generic_ModAASeq_Regression_LSTM,
    Model_for_Generic_ModAASeq_Regression_Transformer,
)
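The imported class names follow a regular pattern: the input type (AASeq or ModAASeq), the task (Regression or BinaryClassification) and, for Model classes, the architecture (LSTM or Transformer). The sketch below only assembles these names as strings to make the pattern explicit. The helpers interface_name and model_name are hypothetical and not part of peptdeep.

```python
from itertools import product

INPUTS = ("AASeq", "ModAASeq")                  # plain vs. modified sequences
TASKS = ("Regression", "BinaryClassification")  # prediction problem
ARCHS = ("LSTM", "Transformer")                 # network architecture


def interface_name(seq_input: str, task: str) -> str:
    """Name of the ModelInterface class for an input type and task (hypothetical helper)."""
    return f"ModelInterface_for_Generic_{seq_input}_{task}"


def model_name(seq_input: str, task: str, arch: str) -> str:
    """Name of the Model class for an input type, task and architecture (hypothetical helper)."""
    return f"Model_for_Generic_{seq_input}_{task}_{arch}"


# Enumerate all eight Model class names imported in the cell above
all_models = [model_name(i, t, a) for i, t, a in product(INPUTS, TASKS, ARCHS)]
for name in all_models:
    print(name)
```

Each ModelInterface class is paired with the Model classes that share its input type and task, as the examples later in this tutorial do.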

Define example Table/DataFrame

[3]:
from peptdeep.model.rt import IRT_PEPTIDE_DF
[4]:
def create_example_input_dataframe_normalized_irt():
    irt_df=IRT_PEPTIDE_DF.copy()
    irt_df['normalized_irt'] = (
        irt_df.irt-irt_df.irt.min()
    )/(irt_df.irt.max()-irt_df.irt.min()) # 0 to 1 norm
    return irt_df
create_example_input_dataframe_normalized_irt()
[4]:
    sequence        pep_name  irt     mods       mod_sites  nAA  normalized_irt
0   LGGNEQVTR       RT-pep a  -24.92                        9    0.000000
1   GAGSSEPVTGLDAK  RT-pep b  0.00                          14   0.199488
2   VEATFGVDESNAK   RT-pep c  12.39                         13   0.298671
3   YILAGVENSK      RT-pep d  19.79                         10   0.357909
4   TPVISGGPYEYR    RT-pep e  28.71                         12   0.429315
5   TPVITGAPYEYR    RT-pep f  33.38                         12   0.466699
6   DGLDAASYYAPVR   RT-pep g  42.26                         13   0.537784
7   ADVTPADFSEWSK   RT-pep h  54.62                         13   0.636728
8   GTFIIDPGGVIR    RT-pep i  70.52                         12   0.764009
9   GTFIIDPAAVIR    RT-pep k  87.23                         12   0.897775
10  LFLQFGAQGSPFLK  RT-pep l  100.00                        14   1.000000
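The normalized_irt column is a min-max rescaling of irt to the range 0 to 1. As a worked check, the sketch below repeats the calculation in plain Python on the iRT values from the table. The function name min_max_normalize is introduced here for illustration only.

```python
# iRT values copied from the table above
irt_values = [-24.92, 0.00, 12.39, 19.79, 28.71, 33.38,
              42.26, 54.62, 70.52, 87.23, 100.00]


def min_max_normalize(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


normalized = min_max_normalize(irt_values)
for irt, norm in zip(irt_values, normalized):
    print(f"{irt:8.2f} -> {norm:.6f}")
```

For example, the peptide with irt 0.00 maps to (0.00 + 24.92) / (100.00 + 24.92) = 24.92 / 124.92, which is about 0.199488 and matches the table.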

Steps to build a model from scratch

In the following examples, building and using a model takes only five steps, followed by two optional steps to save and load it.

  1. Prepare a training dataframe with a sequence column (plus mods and mod_sites columns if the model also takes modifications into account) and a target value column to train on.

  2. Select a ModelInterface class based on the prediction problem (classification or regression, for sequences or modified sequences). Select a Model class when initializing the ModelInterface class.

  3. Tell the ModelInterface object which column in the training dataframe stores the target values, and which column should receive the predicted values.

  4. model.train() for training.

  5. model.predict() for prediction.

Save and load models:

  1. model.save("/model_folder/model.pth") to save the model.

  2. Use the same ModelInterface and Model classes, and call model.load("/model_folder/model.pth") to load the model for transfer learning and prediction.
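The save and load steps can be sketched as a small helper. It relies only on the calls named in the list above (save, load, and the model_class argument) and receives the classes as arguments, so it does not import peptdeep itself. The function name save_and_reload is hypothetical.

```python
def save_and_reload(trained_interface, interface_class, model_class, path):
    """Save a trained ModelInterface, then load it into a fresh object of the same type.

    Assumes the interface exposes save(path) and load(path) and that its
    constructor accepts model_class=..., as described in the steps above.
    """
    trained_interface.save(path)
    # Loading requires the same ModelInterface and Model classes used for training
    fresh_interface = interface_class(model_class=model_class)
    fresh_interface.load(path)
    return fresh_interface


# Example call with the classes from this tutorial (not executed here):
# reloaded = save_and_reload(
#     model,
#     ModelInterface_for_Generic_AASeq_Regression,
#     Model_for_Generic_AASeq_Regression_LSTM,
#     "/model_folder/model.pth",
# )
```

Settings such as target_column_to_train and target_column_to_predict are attributes of the interface object, so they presumably need to be set again on the reloaded object before training or predicting.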

Building a simple RT model based on Model_for_Generic_AASeq_Regression_LSTM

[5]:
example_df = create_example_input_dataframe_normalized_irt()

# Initialize the ModelInterface and specify which Model class to use
model = ModelInterface_for_Generic_AASeq_Regression(
    model_class=Model_for_Generic_AASeq_Regression_LSTM
)
# Name the column that holds the training targets and the column that will receive the predictions
model.target_column_to_train = 'normalized_irt'
model.target_column_to_predict = 'predicted_normalized_irt'
model.train(example_df, epoch=20)
model.predict(example_df)
[5]:
    sequence        pep_name  irt     mods       mod_sites  nAA  normalized_irt  predicted_normalized_irt
0   LGGNEQVTR       RT-pep a  -24.92                        9    0.000000        0.000000
1   GAGSSEPVTGLDAK  RT-pep b  0.00                          14   0.199488        0.203671
2   VEATFGVDESNAK   RT-pep c  12.39                         13   0.298671        0.312852
3   YILAGVENSK      RT-pep d  19.79                         10   0.357909        0.365846
4   TPVISGGPYEYR    RT-pep e  28.71                         12   0.429315        0.434760
5   TPVITGAPYEYR    RT-pep f  33.38                         12   0.466699        0.465173
6   DGLDAASYYAPVR   RT-pep g  42.26                         13   0.537784        0.564576
7   ADVTPADFSEWSK   RT-pep h  54.62                         13   0.636728        0.678894
8   GTFIIDPGGVIR    RT-pep i  70.52                         12   0.764009        0.893195
9   GTFIIDPAAVIR    RT-pep k  87.23                         12   0.897775        1.061624
10  LFLQFGAQGSPFLK  RT-pep l  100.00                        14   1.000000        1.100713

Building a simple sequence-only RT model based on Model_for_Generic_AASeq_Regression_Transformer

[6]:
example_df = create_example_input_dataframe_normalized_irt()

# Initialize the ModelInterface and specify which Model class to use
model = ModelInterface_for_Generic_AASeq_Regression(
    model_class=Model_for_Generic_AASeq_Regression_Transformer
)
# Name the column that holds the training targets and the column that will receive the predictions
model.target_column_to_train = 'normalized_irt'
model.target_column_to_predict = 'predicted_normalized_irt'
model.train(example_df, epoch=20)
model.predict(example_df)
[6]:
    sequence        pep_name  irt     mods       mod_sites  nAA  normalized_irt  predicted_normalized_irt
0   LGGNEQVTR       RT-pep a  -24.92                        9    0.000000        0.000000
1   GAGSSEPVTGLDAK  RT-pep b  0.00                          14   0.199488        0.000000
2   VEATFGVDESNAK   RT-pep c  12.39                         13   0.298671        0.140912
3   YILAGVENSK      RT-pep d  19.79                         10   0.357909        0.142185
4   TPVISGGPYEYR    RT-pep e  28.71                         12   0.429315        0.210857
5   TPVITGAPYEYR    RT-pep f  33.38                         12   0.466699        0.277200
6   DGLDAASYYAPVR   RT-pep g  42.26                         13   0.537784        0.088854
7   ADVTPADFSEWSK   RT-pep h  54.62                         13   0.636728        0.399164
8   GTFIIDPGGVIR    RT-pep i  70.52                         12   0.764009        0.596815
9   GTFIIDPAAVIR    RT-pep k  87.23                         12   0.897775        0.701862
10  LFLQFGAQGSPFLK  RT-pep l  100.00                        14   1.000000        0.877500
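To put a number on how closely the two sequence-only models fit the data, the sketch below computes the mean absolute error (MAE) from the values printed in the two output tables above. The choice of MAE and the helper name are ours. Both models were trained and evaluated on the same 11 peptides, for 20 epochs, in a single run, so this measures fit to the training data only and says little about which architecture generalizes better.

```python
# Observed and predicted values copied from the LSTM and Transformer output tables
observed = [0.000000, 0.199488, 0.298671, 0.357909, 0.429315, 0.466699,
            0.537784, 0.636728, 0.764009, 0.897775, 1.000000]
pred_lstm = [0.000000, 0.203671, 0.312852, 0.365846, 0.434760, 0.465173,
             0.564576, 0.678894, 0.893195, 1.061624, 1.100713]
pred_transformer = [0.000000, 0.000000, 0.140912, 0.142185, 0.210857, 0.277200,
                    0.088854, 0.399164, 0.596815, 0.701862, 0.877500]


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between paired observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


mae_lstm = mean_absolute_error(observed, pred_lstm)
mae_transformer = mean_absolute_error(observed, pred_transformer)
print(f"LSTM MAE:        {mae_lstm:.4f}")
print(f"Transformer MAE: {mae_transformer:.4f}")
```

For a realistic assessment, hold out peptides that were not used for training and compute the same metric on those.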

Regression models for predicting a scalar value for a given amino acid sequence and site-specific PTMs

Scalar regression model (RT) with modified AA sequences using Model_for_Generic_ModAASeq_Regression_LSTM

[7]:
example_df = create_example_input_dataframe_normalized_irt()

# Initialize the ModelInterface and specify which Model class to use
model = ModelInterface_for_Generic_ModAASeq_Regression(
    model_class=Model_for_Generic_ModAASeq_Regression_LSTM
)
# Name the column that holds the training targets and the column that will receive the predictions
model.target_column_to_train = 'normalized_irt'
model.target_column_to_predict = 'predicted_normalized_irt'
model.train(example_df, epoch=20)
model.predict(example_df)
[7]:
    sequence        pep_name  irt     mods       mod_sites  nAA  normalized_irt  predicted_normalized_irt
0   LGGNEQVTR       RT-pep a  -24.92                        9    0.000000        0.000000
1   GAGSSEPVTGLDAK  RT-pep b  0.00                          14   0.199488        0.368827
2   VEATFGVDESNAK   RT-pep c  12.39                         13   0.298671        0.295824
3   YILAGVENSK      RT-pep d  19.79                         10   0.357909        0.337074
4   TPVISGGPYEYR    RT-pep e  28.71                         12   0.429315        0.527409
5   TPVITGAPYEYR    RT-pep f  33.38                         12   0.466699        0.506031
6   DGLDAASYYAPVR   RT-pep g  42.26                         13   0.537784        0.629531
7   ADVTPADFSEWSK   RT-pep h  54.62                         13   0.636728        0.708878
8   GTFIIDPGGVIR    RT-pep i  70.52                         12   0.764009        0.798570
9   GTFIIDPAAVIR    RT-pep k  87.23                         12   0.897775        0.856519
10  LFLQFGAQGSPFLK  RT-pep l  100.00                        14   1.000000        0.973729

Scalar regression model (RT) with modified AA sequences using Model_for_Generic_ModAASeq_Regression_Transformer

[8]:
example_df = create_example_input_dataframe_normalized_irt()
example_df.loc[1,'mods'] = 'Phospho@S'
example_df.loc[1,'mod_sites'] = '4'

# Initialize the ModelInterface and specify which Model class to use
model = ModelInterface_for_Generic_ModAASeq_Regression(
    model_class=Model_for_Generic_ModAASeq_Regression_Transformer
)
# Name the column that holds the training targets and the column that will receive the predictions
model.target_column_to_train = 'normalized_irt'
model.target_column_to_predict = 'predicted_normalized_irt'
model.train(example_df, epoch=20)
model.predict(example_df)
[8]:
    sequence        pep_name  irt     mods       mod_sites  nAA  normalized_irt  predicted_normalized_irt
0   LGGNEQVTR       RT-pep a  -24.92                        9    0.000000        0.088521
1   GAGSSEPVTGLDAK  RT-pep b  0.00    Phospho@S  4          14   0.199488        0.571920
2   VEATFGVDESNAK   RT-pep c  12.39                         13   0.298671        0.285101
3   YILAGVENSK      RT-pep d  19.79                         10   0.357909        0.367173
4   TPVISGGPYEYR    RT-pep e  28.71                         12   0.429315        0.615492
5   TPVITGAPYEYR    RT-pep f  33.38                         12   0.466699        0.589607
6   DGLDAASYYAPVR   RT-pep g  42.26                         13   0.537784        0.539454
7   ADVTPADFSEWSK   RT-pep h  54.62                         13   0.636728        0.587029
8   GTFIIDPGGVIR    RT-pep i  70.52                         12   0.764009        0.880274
9   GTFIIDPAAVIR    RT-pep k  87.23                         12   0.897775        0.811531
10  LFLQFGAQGSPFLK  RT-pep l  100.00                        14   1.000000        1.084086

Binary classification models for a given amino acid sequence

[9]:
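# A simple classification dataset: label the six peptides with the lowest iRT as 1
def create_example_input_dataframe_classification_rt():
    rt_df = create_example_input_dataframe_normalized_irt()
    rt_df['is_in_first_half_of_column'] = 0
    # .loc slicing is label-based and inclusive, so rows 0 to 5 are set to 1
    rt_df.loc[:5, 'is_in_first_half_of_column'] = 1
    return rt_df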

A sequence classification model using Model_for_Generic_AASeq_BinaryClassification_LSTM

[10]:
example_df = create_example_input_dataframe_classification_rt()

# Initialize the ModelInterface and specify which Model class to use
model = ModelInterface_for_Generic_AASeq_BinaryClassification(
    model_class=Model_for_Generic_AASeq_BinaryClassification_LSTM
)
# Name the column that holds the training labels and the column that will receive the predictions
model.target_column_to_train = 'is_in_first_half_of_column'
model.target_column_to_predict = 'predicted_will_be_in_first_half_of_column'
model.train(example_df, epoch=20)
model.predict(example_df)
[10]:
    sequence        pep_name  irt     mods       mod_sites  nAA  normalized_irt  is_in_first_half_of_column  predicted_will_be_in_first_half_of_column
0   LGGNEQVTR       RT-pep a  -24.92                        9    0.000000        1                           0.991829
1   GAGSSEPVTGLDAK  RT-pep b  0.00                          14   0.199488        1                           0.990733
2   VEATFGVDESNAK   RT-pep c  12.39                         13   0.298671        1                           0.991083
3   YILAGVENSK      RT-pep d  19.79                         10   0.357909        1                           0.991600
4   TPVISGGPYEYR    RT-pep e  28.71                         12   0.429315        1                           0.992202
5   TPVITGAPYEYR    RT-pep f  33.38                         12   0.466699        1                           0.990124
6   DGLDAASYYAPVR   RT-pep g  42.26                         13   0.537784        0                           0.351366
7   ADVTPADFSEWSK   RT-pep h  54.62                         13   0.636728        0                           0.359982
8   GTFIIDPGGVIR    RT-pep i  70.52                         12   0.764009        0                           0.352756
9   GTFIIDPAAVIR    RT-pep k  87.23                         12   0.897775        0                           0.351209
10  LFLQFGAQGSPFLK  RT-pep l  100.00                        14   1.000000        0                           0.349120

A sequence classification model using Model_for_Generic_AASeq_BinaryClassification_Transformer

[11]:
example_df = create_example_input_dataframe_classification_rt()

# Initialize the ModelInterface and specify which Model class to use
model = ModelInterface_for_Generic_AASeq_BinaryClassification(
    model_class=Model_for_Generic_AASeq_BinaryClassification_Transformer
)
# Name the column that holds the training labels and the column that will receive the predictions
model.target_column_to_train = 'is_in_first_half_of_column'
model.target_column_to_predict = 'predicted_will_be_in_first_half_of_column'
model.train(example_df, epoch=20)
model.predict(example_df)
[11]:
    sequence        pep_name  irt     mods       mod_sites  nAA  normalized_irt  is_in_first_half_of_column  predicted_will_be_in_first_half_of_column
0   LGGNEQVTR       RT-pep a  -24.92                        9    0.000000        1                           0.997586
1   GAGSSEPVTGLDAK  RT-pep b  0.00                          14   0.199488        1                           0.997438
2   VEATFGVDESNAK   RT-pep c  12.39                         13   0.298671        1                           0.996627
3   YILAGVENSK      RT-pep d  19.79                         10   0.357909        1                           0.997642
4   TPVISGGPYEYR    RT-pep e  28.71                         12   0.429315        1                           0.996989
5   TPVITGAPYEYR    RT-pep f  33.38                         12   0.466699        1                           0.996926
6   DGLDAASYYAPVR   RT-pep g  42.26                         13   0.537784        0                           0.004032
7   ADVTPADFSEWSK   RT-pep h  54.62                         13   0.636728        0                           0.004321
8   GTFIIDPGGVIR    RT-pep i  70.52                         12   0.764009        0                           0.004137
9   GTFIIDPAAVIR    RT-pep k  87.23                         12   0.897775        0                           0.003938
10  LFLQFGAQGSPFLK  RT-pep l  100.00                        14   1.000000        0                           0.004114
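The classification models write a score between 0 and 1 into the prediction column rather than a hard label. The sketch below converts the scores printed in the two output tables above into labels and computes the accuracy. Reading the score as the probability of class 1 and using a cutoff of 0.5 are assumptions of this sketch, and the helper names are ours.

```python
# Labels and scores copied from the LSTM and Transformer classification tables
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
score_lstm = [0.991829, 0.990733, 0.991083, 0.991600, 0.992202, 0.990124,
              0.351366, 0.359982, 0.352756, 0.351209, 0.349120]
score_transformer = [0.997586, 0.997438, 0.996627, 0.997642, 0.996989, 0.996926,
                     0.004032, 0.004321, 0.004137, 0.003938, 0.004114]


def to_labels(scores, threshold=0.5):
    """Assign class 1 to scores at or above the threshold, class 0 otherwise."""
    return [1 if s >= threshold else 0 for s in scores]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


acc_lstm = accuracy(labels, to_labels(score_lstm))
acc_transformer = accuracy(labels, to_labels(score_transformer))
print(f"LSTM accuracy:        {acc_lstm:.2f}")
print(f"Transformer accuracy: {acc_transformer:.2f}")
```

Both models separate the two classes at this cutoff, although the LSTM scores for class 0 (about 0.35) sit much closer to the threshold than the Transformer scores do. As with the regression examples, these are predictions on the training data itself.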

Binary classification models for given amino acid sequence and site-specific PTMs

[12]:
# Same helper as in the previous section, repeated so that this section runs on its own
def create_example_input_dataframe_classification_rt():
    rt_df = create_example_input_dataframe_normalized_irt()
    rt_df['is_in_first_half_of_column'] = 0
    rt_df.loc[:5,'is_in_first_half_of_column']=1
    return rt_df

A modified-sequence classification model using Model_for_Generic_ModAASeq_BinaryClassification_LSTM

[13]:
example_df = create_example_input_dataframe_classification_rt()

# Initialize the ModelInterface and specify which Model class to use
model = ModelInterface_for_Generic_ModAASeq_BinaryClassification(
    model_class=Model_for_Generic_ModAASeq_BinaryClassification_LSTM
)
# Name the column that holds the training labels and the column that will receive the predictions
model.target_column_to_train = 'is_in_first_half_of_column'
model.target_column_to_predict = 'predicted_will_be_in_first_half_of_column'
model.train(example_df, epoch=20)
model.predict(example_df)
[13]:
    sequence        pep_name  irt     mods       mod_sites  nAA  normalized_irt  is_in_first_half_of_column  predicted_will_be_in_first_half_of_column
0   LGGNEQVTR       RT-pep a  -24.92                        9    0.000000        1                           0.993120
1   GAGSSEPVTGLDAK  RT-pep b  0.00                          14   0.199488        1                           0.990600
2   VEATFGVDESNAK   RT-pep c  12.39                         13   0.298671        1                           0.992972
3   YILAGVENSK      RT-pep d  19.79                         10   0.357909        1                           0.992984
4   TPVISGGPYEYR    RT-pep e  28.71                         12   0.429315        1                           0.992323
5   TPVITGAPYEYR    RT-pep f  33.38                         12   0.466699        1                           0.988538
6   DGLDAASYYAPVR   RT-pep g  42.26                         13   0.537784        0                           0.370841
7   ADVTPADFSEWSK   RT-pep h  54.62                         13   0.636728        0                           0.368691
8   GTFIIDPGGVIR    RT-pep i  70.52                         12   0.764009        0                           0.378124
9   GTFIIDPAAVIR    RT-pep k  87.23                         12   0.897775        0                           0.367393
10  LFLQFGAQGSPFLK  RT-pep l  100.00                        14   1.000000        0                           0.365957

A modified-sequence classification model using Model_for_Generic_ModAASeq_BinaryClassification_Transformer

[14]:
example_df = create_example_input_dataframe_classification_rt()

# Initialize the ModelInterface and specify which Model class to use
model = ModelInterface_for_Generic_ModAASeq_BinaryClassification(
    model_class=Model_for_Generic_ModAASeq_BinaryClassification_Transformer
)
# Name the column that holds the training labels and the column that will receive the predictions
model.target_column_to_train = 'is_in_first_half_of_column'
model.target_column_to_predict = 'predicted_will_be_in_first_half_of_column'
model.train(example_df, epoch=20)
model.predict(example_df)
[14]:
    sequence        pep_name  irt     mods       mod_sites  nAA  normalized_irt  is_in_first_half_of_column  predicted_will_be_in_first_half_of_column
0   LGGNEQVTR       RT-pep a  -24.92                        9    0.000000        1                           0.997545
1   GAGSSEPVTGLDAK  RT-pep b  0.00                          14   0.199488        1                           0.996575
2   VEATFGVDESNAK   RT-pep c  12.39                         13   0.298671        1                           0.995498
3   YILAGVENSK      RT-pep d  19.79                         10   0.357909        1                           0.997241
4   TPVISGGPYEYR    RT-pep e  28.71                         12   0.429315        1                           0.996784
5   TPVITGAPYEYR    RT-pep f  33.38                         12   0.466699        1                           0.995732
6   DGLDAASYYAPVR   RT-pep g  42.26                         13   0.537784        0                           0.004000
7   ADVTPADFSEWSK   RT-pep h  54.62                         13   0.636728        0                           0.005084
8   GTFIIDPGGVIR    RT-pep i  70.52                         12   0.764009        0                           0.004195
9   GTFIIDPAAVIR    RT-pep k  87.23                         12   0.897775        0                           0.003547
10  LFLQFGAQGSPFLK  RT-pep l  100.00                        14   1.000000        0                           0.003279