Tutorial: building new models from scratch

even without experience in deep learning

[1]:

%reload_ext autoreload
%autoreload 2

To predict or classify novel properties of peptides, the user only needs to provide peptides together with the corresponding property values (e.g. ‘binding_affinity’).

We provide several generic ModelInterface and Model classes in the peptdeep.model.generic_property_prediction module so that users can easily build models for regression and classification problems. Examples are shown below:

Imports

[2]:

from peptdeep.model.generic_property_prediction import (
    ModelInterface_for_Generic_AASeq_BinaryClassification,
    ModelInterface_for_Generic_AASeq_Regression,
    ModelInterface_for_Generic_ModAASeq_BinaryClassification,
    ModelInterface_for_Generic_ModAASeq_Regression,
)
from peptdeep.model.generic_property_prediction import (
    Model_for_Generic_AASeq_BinaryClassification_LSTM,
    Model_for_Generic_AASeq_BinaryClassification_Transformer,
    Model_for_Generic_AASeq_Regression_LSTM,
    Model_for_Generic_AASeq_Regression_Transformer,
    Model_for_Generic_ModAASeq_BinaryClassification_LSTM,
    Model_for_Generic_ModAASeq_BinaryClassification_Transformer,
    Model_for_Generic_ModAASeq_Regression_LSTM,
    Model_for_Generic_ModAASeq_Regression_Transformer,
)

Define example Table/DataFrame

[3]:

from peptdeep.model.rt import IRT_PEPTIDE_DF

[4]:

def create_example_input_dataframe_normalized_irt():
    # Copy the built-in iRT peptide table so the original is not modified
    irt_df = IRT_PEPTIDE_DF.copy()
    # Min-max normalize iRT to the range [0, 1]
    irt_df['normalized_irt'] = (
        irt_df.irt - irt_df.irt.min()
    ) / (irt_df.irt.max() - irt_df.irt.min())
    return irt_df

create_example_input_dataframe_normalized_irt()

[4]:

	sequence	pep_name	irt	nAA	normalized_irt
0	LGGNEQVTR	RT-pep a	-24.92	9	0.000000
1	GAGSSEPVTGLDAK	RT-pep b	0.00	14	0.199488
2	VEATFGVDESNAK	RT-pep c	12.39	13	0.298671
3	YILAGVENSK	RT-pep d	19.79	10	0.357909
4	TPVISGGPYEYR	RT-pep e	28.71	12	0.429315
5	TPVITGAPYEYR	RT-pep f	33.38	12	0.466699
6	DGLDAASYYAPVR	RT-pep g	42.26	13	0.537784
7	ADVTPADFSEWSK	RT-pep h	54.62	13	0.636728
8	GTFIIDPGGVIR	RT-pep i	70.52	12	0.764009
9	GTFIIDPAAVIR	RT-pep k	87.23	12	0.897775
10	LFLQFGAQGSPFLK	RT-pep l	100.00	14	1.000000
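The normalization above is plain min-max scaling, so a model trained on `normalized_irt` predicts values on that scale. The following is a minimal sketch in pure Python of the same arithmetic and of its inverse, which maps normalized values back to iRT units. The helper names `min_max_normalize` and `min_max_denormalize` are hypothetical and are not part of peptdeep; the iRT values are copied from the table above.

```python
# iRT values copied from the example table above.
IRT_VALUES = [-24.92, 0.00, 12.39, 19.79, 28.71, 33.38,
              42.26, 54.62, 70.52, 87.23, 100.00]


def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def min_max_denormalize(normalized, lo, hi):
    """Invert min_max_normalize given the original minimum and maximum."""
    return [n * (hi - lo) + lo for n in normalized]


normalized = min_max_normalize(IRT_VALUES)
restored = min_max_denormalize(normalized, min(IRT_VALUES), max(IRT_VALUES))

# Compare the first few normalized values with the 'normalized_irt' column.
print([round(n, 6) for n in normalized[:3]])
```

Keeping the original minimum and maximum around is what makes the inverse possible, so they would need to be stored alongside a trained model if predictions should be reported in iRT units.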

Steps to build a model from scratch

In the following examples, we only need 7 steps to build, train, apply, save and reload a model.

1. Prepare a training dataframe with a sequence column (plus mods and mod_sites columns if the model should also take modifications into account) and a target value column to train on.
2. Select a ModelInterface class based on the prediction problem (classification or regression, for sequences or modified sequences). Select a Model class when initializing the ModelInterface class.
3. Tell the ModelInterface object which column in the training dataframe stores the target values (target_column_to_train) and which column should receive the predicted values (target_column_to_predict).
4. Call model.train() to train the model.
5. Call model.predict() to predict.

Save and load models:

6. Call model.save("/model_folder/model.pth") to save the model.
7. Use the same ModelInterface and Model classes, and call model.load("/model_folder/model.pth") to load the model for transfer learning and prediction.
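The steps above, including saving and reloading, can be sketched as a single helper. This is only an illustration under stated assumptions: the function name `train_save_reload` and the `model_path` argument are hypothetical, the dataframe is assumed to contain `sequence` and `normalized_irt` columns as in the example table, and creating the target folder beforehand is a precaution rather than a documented requirement. Only calls that appear in this tutorial are used on the ModelInterface object.

```python
import os


def train_save_reload(train_df, model_path):
    """Train a sequence-only regression model, save it, reload it, and predict.

    Assumes train_df has 'sequence' and 'normalized_irt' columns.
    model_path is a hypothetical file location such as 'model_folder/model.pth'.
    """
    # Imported lazily so that defining this helper does not require peptdeep.
    from peptdeep.model.generic_property_prediction import (
        ModelInterface_for_Generic_AASeq_Regression,
        Model_for_Generic_AASeq_Regression_LSTM,
    )

    def new_interface():
        # Steps 2 and 3: choose the interface and model class, set the columns.
        interface = ModelInterface_for_Generic_AASeq_Regression(
            model_class=Model_for_Generic_AASeq_Regression_LSTM
        )
        interface.target_column_to_train = 'normalized_irt'
        interface.target_column_to_predict = 'predicted_normalized_irt'
        return interface

    # Step 4: train on the prepared dataframe.
    model = new_interface()
    model.train(train_df, epoch=20)

    # Step 6: make sure the target folder exists (a precaution), then save.
    os.makedirs(os.path.dirname(model_path) or '.', exist_ok=True)
    model.save(model_path)

    # Step 7: rebuild the same interface and model class, then load the weights.
    reloaded = new_interface()
    reloaded.load(model_path)

    # Step 5: predict with the reloaded model.
    return reloaded.predict(train_df)
```

Rebuilding the interface with the same classes before calling `load` mirrors step 7: the saved file is loaded into a model of the same architecture.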

Building a simple RT model based on `Model_for_Generic_AASeq_Regression_LSTM`

[5]:

example_df = create_example_input_dataframe_normalized_irt()

# Initialize the ModelInterface and specify which model class to use
model = ModelInterface_for_Generic_AASeq_Regression(
    model_class=Model_for_Generic_AASeq_Regression_LSTM
)
# Specify the target column to train on and the column that will store the predictions
model.target_column_to_train = 'normalized_irt'
model.target_column_to_predict = 'predicted_normalized_irt'
model.train(example_df, epoch=20)
model.predict(example_df)

[5]:

	sequence	pep_name	irt	nAA	normalized_irt	predicted_normalized_irt
0	LGGNEQVTR	RT-pep a	-24.92	9	0.000000	0.000000
1	GAGSSEPVTGLDAK	RT-pep b	0.00	14	0.199488	0.203671
2	VEATFGVDESNAK	RT-pep c	12.39	13	0.298671	0.312852
3	YILAGVENSK	RT-pep d	19.79	10	0.357909	0.365846
4	TPVISGGPYEYR	RT-pep e	28.71	12	0.429315	0.434760
5	TPVITGAPYEYR	RT-pep f	33.38	12	0.466699	0.465173
6	DGLDAASYYAPVR	RT-pep g	42.26	13	0.537784	0.564576
7	ADVTPADFSEWSK	RT-pep h	54.62	13	0.636728	0.678894
8	GTFIIDPGGVIR	RT-pep i	70.52	12	0.764009	0.893195
9	GTFIIDPAAVIR	RT-pep k	87.23	12	0.897775	1.061624
10	LFLQFGAQGSPFLK	RT-pep l	100.00	14	1.000000	1.100713
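One way to summarize how closely the predictions in the table above follow the training targets is to compute the mean absolute error and the Pearson correlation. The sketch below uses only the Python standard library and the values copied from that output; the helper names are hypothetical. Because training is stochastic, a rerun will give somewhat different predictions, and because the model is evaluated on its own training peptides, these figures say nothing about performance on unseen peptides.

```python
import math

# Values copied from the 'normalized_irt' and 'predicted_normalized_irt'
# columns of the output above.
observed = [0.000000, 0.199488, 0.298671, 0.357909, 0.429315, 0.466699,
            0.537784, 0.636728, 0.764009, 0.897775, 1.000000]
predicted = [0.000000, 0.203671, 0.312852, 0.365846, 0.434760, 0.465173,
             0.564576, 0.678894, 0.893195, 1.061624, 1.100713]


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


mae = mean_absolute_error(observed, predicted)
r = pearson_r(observed, predicted)
print(f"MAE = {mae:.4f}, Pearson r = {r:.4f}")
```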

Building a simple RT model for sequences only, based on `Model_for_Generic_AASeq_Regression_Transformer`

[6]:

example_df = create_example_input_dataframe_normalized_irt()

# Initialize the ModelInterface and specify which model class to use
model = ModelInterface_for_Generic_AASeq_Regression(
    model_class=Model_for_Generic_AASeq_Regression_Transformer
)
# Specify the target column to train on and the column that will store the predictions
model.target_column_to_train = 'normalized_irt'
model.target_column_to_predict = 'predicted_normalized_irt'
model.train(example_df, epoch=20)
model.predict(example_df)

[6]:

	sequence	pep_name	irt	nAA	normalized_irt	predicted_normalized_irt
0	LGGNEQVTR	RT-pep a	-24.92	9	0.000000	0.000000
1	GAGSSEPVTGLDAK	RT-pep b	0.00	14	0.199488	0.000000
2	VEATFGVDESNAK	RT-pep c	12.39	13	0.298671	0.140912
3	YILAGVENSK	RT-pep d	19.79	10	0.357909	0.142185
4	TPVISGGPYEYR	RT-pep e	28.71	12	0.429315	0.210857
5	TPVITGAPYEYR	RT-pep f	33.38	12	0.466699	0.277200
6	DGLDAASYYAPVR	RT-pep g	42.26	13	0.537784	0.088854
7	ADVTPADFSEWSK	RT-pep h	54.62	13	0.636728	0.399164
8	GTFIIDPGGVIR	RT-pep i	70.52	12	0.764009	0.596815
9	GTFIIDPAAVIR	RT-pep k	87.23	12	0.897775	0.701862
10	LFLQFGAQGSPFLK	RT-pep l	100.00	14	1.000000	0.877500

Regression models for predicting a scalar value for a given amino acid sequence and site-specific PTMs

Scalar regression model (RT) with modified AA sequences using `Model_for_Generic_ModAASeq_Regression_LSTM`

[7]:

example_df = create_example_input_dataframe_normalized_irt()

# Initialize the ModelInterface and specify which model class to use
model = ModelInterface_for_Generic_ModAASeq_Regression(
    model_class=Model_for_Generic_ModAASeq_Regression_LSTM
)
# Specify the target column to train on and the column that will store the predictions
model.target_column_to_train = 'normalized_irt'
model.target_column_to_predict = 'predicted_normalized_irt'
model.train(example_df, epoch=20)
model.predict(example_df)

[7]:

	sequence	pep_name	irt	nAA	normalized_irt	predicted_normalized_irt
0	LGGNEQVTR	RT-pep a	-24.92	9	0.000000	0.000000
1	GAGSSEPVTGLDAK	RT-pep b	0.00	14	0.199488	0.368827
2	VEATFGVDESNAK	RT-pep c	12.39	13	0.298671	0.295824
3	YILAGVENSK	RT-pep d	19.79	10	0.357909	0.337074
4	TPVISGGPYEYR	RT-pep e	28.71	12	0.429315	0.527409
5	TPVITGAPYEYR	RT-pep f	33.38	12	0.466699	0.506031
6	DGLDAASYYAPVR	RT-pep g	42.26	13	0.537784	0.629531
7	ADVTPADFSEWSK	RT-pep h	54.62	13	0.636728	0.708878
8	GTFIIDPGGVIR	RT-pep i	70.52	12	0.764009	0.798570
9	GTFIIDPAAVIR	RT-pep k	87.23	12	0.897775	0.856519
10	LFLQFGAQGSPFLK	RT-pep l	100.00	14	1.000000	0.973729

Scalar regression model (RT) with modified AA sequences using `Model_for_Generic_ModAASeq_Regression_Transformer`

[8]:

example_df = create_example_input_dataframe_normalized_irt()
example_df.loc[1,'mods'] = 'Phospho@S'
example_df.loc[1,'mod_sites'] = '4'

# Initialize the ModelInterface and specify which model class to use
model = ModelInterface_for_Generic_ModAASeq_Regression(
    model_class=Model_for_Generic_ModAASeq_Regression_Transformer
)
# Specify the target column to train on and the column that will store the predictions
model.target_column_to_train = 'normalized_irt'
model.target_column_to_predict = 'predicted_normalized_irt'
model.train(example_df, epoch=20)
model.predict(example_df)

[8]:

	sequence	pep_name	irt	mods	mod_sites	nAA	normalized_irt	predicted_normalized_irt
0	LGGNEQVTR	RT-pep a	-24.92			9	0.000000	0.088521
1	GAGSSEPVTGLDAK	RT-pep b	0.00	Phospho@S	4	14	0.199488	0.571920
2	VEATFGVDESNAK	RT-pep c	12.39			13	0.298671	0.285101
3	YILAGVENSK	RT-pep d	19.79			10	0.357909	0.367173
4	TPVISGGPYEYR	RT-pep e	28.71			12	0.429315	0.615492
5	TPVITGAPYEYR	RT-pep f	33.38			12	0.466699	0.589607
6	DGLDAASYYAPVR	RT-pep g	42.26			13	0.537784	0.539454
7	ADVTPADFSEWSK	RT-pep h	54.62			13	0.636728	0.587029
8	GTFIIDPGGVIR	RT-pep i	70.52			12	0.764009	0.880274
9	GTFIIDPAAVIR	RT-pep k	87.23			12	0.897775	0.811531
10	LFLQFGAQGSPFLK	RT-pep l	100.00			14	1.000000	1.084086
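The example above attaches `Phospho@S` to site `4` of GAGSSEPVTGLDAK. Before training on your own modified peptides, it can help to check that each modification actually sits on the residue it names. The helper below is a hypothetical sketch, not part of peptdeep. It rests on two assumptions that this tutorial does not confirm: sites are 1-based residue positions, and multiple modifications or sites are joined with `;`. It only handles residue-level modifications of the form `Name@AA`, not terminal ones.

```python
def check_mod_sites(sequence, mods, mod_sites, sep=';'):
    """Return True if every 'Name@AA' modification sits on a matching residue.

    Assumptions (not confirmed by the tutorial): sites are 1-based residue
    positions and multiple entries are joined with `sep`.
    """
    if not mods:
        # Unmodified peptide: both fields must be empty.
        return not mod_sites
    mod_list = mods.split(sep)
    site_list = mod_sites.split(sep)
    if len(mod_list) != len(site_list):
        return False
    for mod, site in zip(mod_list, site_list):
        residue = mod.split('@')[-1]  # the amino acid after '@'
        position = int(site)
        if not 1 <= position <= len(sequence):
            return False
        if sequence[position - 1] != residue:
            return False
    return True


# The modified peptide from the example above.
print(check_mod_sites('GAGSSEPVTGLDAK', 'Phospho@S', '4'))
```

Under the 1-based assumption, site 4 of GAGSSEPVTGLDAK is a serine, so the check passes for the example row.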

Binary classification models for a given amino acid sequence

[9]:

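# A simple classification dataset: label the six earliest-eluting peptides as 1
def create_example_input_dataframe_classification_rt():
    rt_df = create_example_input_dataframe_normalized_irt()
    rt_df['is_in_first_half_of_column'] = 0
    # .loc slicing is label-based and inclusive, so rows 0 through 5 are set to 1
    rt_df.loc[:5, 'is_in_first_half_of_column'] = 1
    return rt_df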

A sequence classification model using `Model_for_Generic_AASeq_BinaryClassification_LSTM`

[10]:

example_df = create_example_input_dataframe_classification_rt()

# Initialize the ModelInterface and specify which model class to use
model = ModelInterface_for_Generic_AASeq_BinaryClassification(
    model_class=Model_for_Generic_AASeq_BinaryClassification_LSTM
)
# Specify the target column to train on and the column that will store the predictions
model.target_column_to_train = 'is_in_first_half_of_column'
model.target_column_to_predict = 'predicted_will_be_in_first_half_of_column'
model.train(example_df, epoch=20)
model.predict(example_df)

[10]:

	sequence	pep_name	irt	nAA	normalized_irt	is_in_first_half_of_column	predicted_will_be_in_first_half_of_column
0	LGGNEQVTR	RT-pep a	-24.92	9	0.000000	1	0.991829
1	GAGSSEPVTGLDAK	RT-pep b	0.00	14	0.199488	1	0.990733
2	VEATFGVDESNAK	RT-pep c	12.39	13	0.298671	1	0.991083
3	YILAGVENSK	RT-pep d	19.79	10	0.357909	1	0.991600
4	TPVISGGPYEYR	RT-pep e	28.71	12	0.429315	1	0.992202
5	TPVITGAPYEYR	RT-pep f	33.38	12	0.466699	1	0.990124
6	DGLDAASYYAPVR	RT-pep g	42.26	13	0.537784	0	0.351366
7	ADVTPADFSEWSK	RT-pep h	54.62	13	0.636728	0	0.359982
8	GTFIIDPGGVIR	RT-pep i	70.52	12	0.764009	0	0.352756
9	GTFIIDPAAVIR	RT-pep k	87.23	12	0.897775	0	0.351209
10	LFLQFGAQGSPFLK	RT-pep l	100.00	14	1.000000	0	0.349120
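The predicted column in the output above holds scores between 0 and 1 rather than class labels. A minimal sketch of turning such scores into labels and measuring accuracy follows, using the values copied from that output. The 0.5 cutoff and the helper names are assumptions for illustration; a different threshold may suit other datasets better, and scores from a rerun will differ.

```python
# Values copied from the 'is_in_first_half_of_column' and
# 'predicted_will_be_in_first_half_of_column' columns of the output above.
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.991829, 0.990733, 0.991083, 0.991600, 0.992202, 0.990124,
          0.351366, 0.359982, 0.352756, 0.351209, 0.349120]


def to_class_labels(score_list, threshold=0.5):
    """Map each score to 1 if it reaches the threshold, otherwise to 0."""
    return [1 if s >= threshold else 0 for s in score_list]


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


predicted_labels = to_class_labels(scores)
print(predicted_labels)
print(accuracy(labels, predicted_labels))
```

With these scores every peptide falls on the correct side of 0.5, although the negative class scores around 0.35 sit much closer to the cutoff than the positive ones do.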

A sequence classification model using `Model_for_Generic_AASeq_BinaryClassification_Transformer`

[11]:

example_df = create_example_input_dataframe_classification_rt()

# Initialize the ModelInterface and specify which model class to use
model = ModelInterface_for_Generic_AASeq_BinaryClassification(
    model_class=Model_for_Generic_AASeq_BinaryClassification_Transformer
)
# Specify the target column to train on and the column that will store the predictions
model.target_column_to_train = 'is_in_first_half_of_column'
model.target_column_to_predict = 'predicted_will_be_in_first_half_of_column'
model.train(example_df, epoch=20)
model.predict(example_df)

[11]:

	sequence	pep_name	irt	nAA	normalized_irt	is_in_first_half_of_column	predicted_will_be_in_first_half_of_column
0	LGGNEQVTR	RT-pep a	-24.92	9	0.000000	1	0.997586
1	GAGSSEPVTGLDAK	RT-pep b	0.00	14	0.199488	1	0.997438
2	VEATFGVDESNAK	RT-pep c	12.39	13	0.298671	1	0.996627
3	YILAGVENSK	RT-pep d	19.79	10	0.357909	1	0.997642
4	TPVISGGPYEYR	RT-pep e	28.71	12	0.429315	1	0.996989
5	TPVITGAPYEYR	RT-pep f	33.38	12	0.466699	1	0.996926
6	DGLDAASYYAPVR	RT-pep g	42.26	13	0.537784	0	0.004032
7	ADVTPADFSEWSK	RT-pep h	54.62	13	0.636728	0	0.004321
8	GTFIIDPGGVIR	RT-pep i	70.52	12	0.764009	0	0.004137
9	GTFIIDPAAVIR	RT-pep k	87.23	12	0.897775	0	0.003938
10	LFLQFGAQGSPFLK	RT-pep l	100.00	14	1.000000	0	0.004114

Binary classification models for a given amino acid sequence and site-specific PTMs

[12]:

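# Same helper as defined earlier, repeated so this section can be run on its own
def create_example_input_dataframe_classification_rt():
    rt_df = create_example_input_dataframe_normalized_irt()
    rt_df['is_in_first_half_of_column'] = 0
    # .loc slicing is label-based and inclusive, so rows 0 through 5 are set to 1
    rt_df.loc[:5, 'is_in_first_half_of_column'] = 1
    return rt_df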

A sequence classification model using `Model_for_Generic_ModAASeq_BinaryClassification_LSTM`

[13]:

example_df = create_example_input_dataframe_classification_rt()

# Initialize the ModelInterface and specify which model class to use
model = ModelInterface_for_Generic_ModAASeq_BinaryClassification(
    model_class=Model_for_Generic_ModAASeq_BinaryClassification_LSTM
)
# Specify the target column to train on and the column that will store the predictions
model.target_column_to_train = 'is_in_first_half_of_column'
model.target_column_to_predict = 'predicted_will_be_in_first_half_of_column'
model.train(example_df, epoch=20)
model.predict(example_df)

[13]:

	sequence	pep_name	irt	nAA	normalized_irt	is_in_first_half_of_column	predicted_will_be_in_first_half_of_column
0	LGGNEQVTR	RT-pep a	-24.92	9	0.000000	1	0.993120
1	GAGSSEPVTGLDAK	RT-pep b	0.00	14	0.199488	1	0.990600
2	VEATFGVDESNAK	RT-pep c	12.39	13	0.298671	1	0.992972
3	YILAGVENSK	RT-pep d	19.79	10	0.357909	1	0.992984
4	TPVISGGPYEYR	RT-pep e	28.71	12	0.429315	1	0.992323
5	TPVITGAPYEYR	RT-pep f	33.38	12	0.466699	1	0.988538
6	DGLDAASYYAPVR	RT-pep g	42.26	13	0.537784	0	0.370841
7	ADVTPADFSEWSK	RT-pep h	54.62	13	0.636728	0	0.368691
8	GTFIIDPGGVIR	RT-pep i	70.52	12	0.764009	0	0.378124
9	GTFIIDPAAVIR	RT-pep k	87.23	12	0.897775	0	0.367393
10	LFLQFGAQGSPFLK	RT-pep l	100.00	14	1.000000	0	0.365957

A sequence classification model using `Model_for_Generic_ModAASeq_BinaryClassification_Transformer`

[14]:

example_df = create_example_input_dataframe_classification_rt()

# Initialize the ModelInterface and specify which model class to use
model = ModelInterface_for_Generic_ModAASeq_BinaryClassification(
    model_class=Model_for_Generic_ModAASeq_BinaryClassification_Transformer
)
# Specify the target column to train on and the column that will store the predictions
model.target_column_to_train = 'is_in_first_half_of_column'
model.target_column_to_predict = 'predicted_will_be_in_first_half_of_column'
model.train(example_df, epoch=20)
model.predict(example_df)

[14]:

	sequence	pep_name	irt	nAA	normalized_irt	is_in_first_half_of_column	predicted_will_be_in_first_half_of_column
0	LGGNEQVTR	RT-pep a	-24.92	9	0.000000	1	0.997545
1	GAGSSEPVTGLDAK	RT-pep b	0.00	14	0.199488	1	0.996575
2	VEATFGVDESNAK	RT-pep c	12.39	13	0.298671	1	0.995498
3	YILAGVENSK	RT-pep d	19.79	10	0.357909	1	0.997241
4	TPVISGGPYEYR	RT-pep e	28.71	12	0.429315	1	0.996784
5	TPVITGAPYEYR	RT-pep f	33.38	12	0.466699	1	0.995732
6	DGLDAASYYAPVR	RT-pep g	42.26	13	0.537784	0	0.004000
7	ADVTPADFSEWSK	RT-pep h	54.62	13	0.636728	0	0.005084
8	GTFIIDPGGVIR	RT-pep i	70.52	12	0.764009	0	0.004195
9	GTFIIDPAAVIR	RT-pep k	87.23	12	0.897775	0	0.003547
10	LFLQFGAQGSPFLK	RT-pep l	100.00	14	1.000000	0	0.003279
