ppi_bench

Info

Before running the ppi_bench experiment, one must either run the embeddings_dbs experiment on the PPI dataset you are training on or use the pre-computed vectors found in the data/embeddings folder (the default).

This folder holds the code required to train the pLM-based PPI inference models used in the manuscript.

train.py

To train the pLM-based PPI inference models from the manuscript, a CLI tool (train.py) was created. Here's how to use it:

python train.py MAX_EPOCHS NUM_LAYERS DATASET_FILE DATABASE_PATH BATCH_SIZE INPUT_DIM C_LEVEL <flags>

Arguments

Positional Arguments

Argument	Description	Manuscript Value
MAX_EPOCHS	How many epochs to train for.	100
NUM_LAYERS	How many fully-connected layers in the network.	3
DATASET_FILE	Path to the PPI dataset to train/test on. Must be in the RAPPPID/INTREPPPID format.	`../../data/ppi/rapppid_[common_string_9606.protein.links.detailed.v12.0_upkb.csv]_Mz70T9t-4Y-i6jWD9sEtcjOr0X8=.h5`
DATABASE_PATH	Path to the LMDB database with corresponding protein embeddings.	Various, all from the embeddings folder.
DATABASE_PATH	Path to the LMDB database with corresponding protein embeddings. This solely determines which pLM the PPI model is based on.	Various, all from the embeddings folder.
BATCH_SIZE	The number of pairs to train on at once. Increasing this value increases RAM/VRAM use.	128
INPUT_DIM	The number of elements in the embeddings inputted into the network. This is determined by the embeddings in DATABASE_PATH.	Corresponds to pLM
C_LEVEL	Which type of dataset (C1, C2, or C3) to use from the RAPPPID PPI dataset. See Park & Marcotte for details.	3

Flags

Short Flag	Long Flag	Default	Description
-w	--workers	16	Number of CPU threads to use. Set this to fewer threads than your CPU affords you.
-p	--pooling	None	What pooling function (if any) to apply to the embedding before inputting to the neural network. Valid values are 'None' (no pooling), 'average' (AdaptiveAvgPool1d), or 'max' (AdaptiveMaxPool1d).
-s	--seed	8675309	The random seed to use.

Example

To train a PPI inference model based on ProtT5 as we did in the manuscript, you could run the following Bash script:

DATASET_FILE="../../data/ppi/rapppid_[common_string_9606.protein.links.detailed.v12.0_upkb.csv]_Mz70T9t-4Y-i6jWD9sEtcjOr0X8=.h5"
PROT_T5_DB="../../data/embeddings/prottrans_t5.lmdb"
PROT_T5_DIM=1024

python train.py 100 3 $DATASET_FILE $PROT_T5_DB 128 $PROT_T5_DIM 3 -s 1

Output

When we train a model, it is assigned a model name based on the hash of its hyperparameters. Checkpoints are saved to the folder ../../data/chkpts/<MODEL_ID>/ in a file that ends in .ckpt. Also save to that folder are the hyperparameters for the model (end in .json.gz) and the inferred probabilities of the testing pairs (.csv).

Requirements

One can install the requirements for this module using the requirements.txt file in the experiments/ppi_bench folder.