Command Line Interface

INTREPPPID provides a command-line interface (CLI) that makes it easy to train the model and infer interactions.

Train

To train the INTREPPPID model as it was trained in the manuscript, use the train e2e_rnn_triplet command:

$ intrepppid train e2e_rnn_triplet DATASET.h5 spm.model 3 100 80 \
    --seed 3927704 --vocab_size 250 --trunc_len 1500 \
    --embedding_size 64 --rnn_num_layers 2 --rnn_dropout_rate 0.3 \
    --variational_dropout false --bi_reduce last --workers 4 \
    --embedding_droprate 0.3 --do_rate 0.3 \
    --log_path logs/e2e_rnn_triplet --beta_classifier 2 \
    --use_projection false --optimizer_type ranger21_xx --lr 1e-2
INTREPPPID Manuscript Values for e2e_rnn_triplet

| Argument/Flag | Default | Manuscript Value | Description |
| --- | --- | --- | --- |
| PPI_DATASET_PATH | None | See Data | Path to the PPI dataset. Must be in the INTREPPPID HDF5 format. |
| SENTENCEPIECE_PATH | None | See Data | Path to the SentencePiece model. |
| C_TYPE | None | 3 | Which dataset within the INTREPPPID HDF5 file to use, specified by C-type. |
| NUM_EPOCHS | None | 100 | Number of epochs to train the model for. |
| BATCH_SIZE | None | 80 | The number of samples per batch. |
| --seed | None | 8675309, 5353456, or 3927704, depending on the experiment | The random seed. If not specified, chosen at random. |
| --vocab_size | 250 | 250 | The number of tokens in the SentencePiece vocabulary. |
| --trunc_len | 1500 | 1500 | Length at which to truncate sequences. |
| --embedding_size | 64 | 64 | The size of the embeddings. |
| --rnn_num_layers | 2 | 2 | The number of layers in the AWD-LSTM encoder. |
| --rnn_dropout_rate | 0.3 | 0.3 | The DropConnect rate for the AWD-LSTM encoder. |
| --variational_dropout | false | false | Whether to use variational dropout, as described in the AWD-LSTM manuscript. |
| --bi_reduce | last | last | Method for reducing the two LSTM embeddings from the two directions. Must be one of "concat", "max", "mean", "last". |
| --workers | 4 | 4 | The number of processes to use for the DataLoader. |
| --embedding_droprate | 0.3 | 0.3 | The amount of Embedding Dropout to use (à la AWD-LSTM). |
| --do_rate | 0.3 | 0.3 | The amount of dropout to use in the MLP classifier. |
| --log_path | "./logs/e2e_rnn_triplet" | "./logs/e2e_rnn_triplet" | The path to save logs to. |
| --encoder_only_steps | -1 (no steps) | -1 (no steps) | The number of steps during which only the encoder is trained, not the classifier. |
| --classifier_warm_up | -1 (no steps) | -1 (no steps) | The number of steps during which only the classifier is trained, not the encoder. |
| --beta_classifier | 4 (25% classifier loss, 75% orthologue loss) | 2 (50% classifier loss, 50% orthologue loss) | Adjusts the weight given to the PPI classification loss relative to the orthologue locality loss. The total loss is (1/β) × classifier_loss + (1 − 1/β) × orthologue_loss; see the sketch after this table. |
| --lr | 1e-2 | 1e-2 | Learning rate to use. |
| --use_projection | false | false | Whether to use a projection network after the encoder. |
| --checkpoint_path | log_path / model_name / "chkpt" | log_path / model_name / "chkpt" | The location where checkpoints are saved. |
| --optimizer_type | ranger21 | ranger21_xx | The optimizer to use while training. Must be one of ranger21, ranger21_xx, adamw, adamw_1cycle, or adamw_cosine. |
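The weighting applied by --beta_classifier can be made concrete with a short Python sketch. This is illustrative only (it is not taken from the INTREPPPID source; the function name and loss values are made up), but it computes the same combination described in the table above:

    def combined_loss(classifier_loss, orthologue_loss, beta):
        # Total loss = (1/beta) * classifier_loss + (1 - 1/beta) * orthologue_loss
        weight = 1.0 / beta
        return weight * classifier_loss + (1.0 - weight) * orthologue_loss

    # beta = 2 (manuscript value): 50% classifier loss, 50% orthologue loss
    combined_loss(0.7, 0.4, beta=2)  # 0.5 * 0.7 + 0.5 * 0.4 = 0.55
    # beta = 4 (default): 25% classifier loss, 75% orthologue loss
    combined_loss(0.7, 0.4, beta=4)  # 0.25 * 0.7 + 0.75 * 0.4 = 0.475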

Infer

To infer edges using the CLI, use the intrepppid infer command.

Currently, it is only possible to infer from a CSV, using the following command:

Usage: intrepppid infer from_csv INTERACTIONS_PATH SEQUENCES_PATH WEIGHTS_PATH SPM_PATH OUT_PATH <flags>
  optional flags:        --trunc_len | --low_memory | --db_path |
                         --dont_populate_db | --device | --get_from_uniprot

Here's an example of inferring interactions from a CSV. The file and folder names below are illustrative; substitute your own paths:
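$ intrepppid infer from_csv interactions.csv sequences.fasta \
    intrepppid_weights.ckpt spm.model predictions.csv \
    --device cuda:0

If the tokenized sequences are too large to hold in memory, add --low_memory true --db_path tokens_db (again, an illustrative folder name) to keep them in an on-disk LMDB database instead.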

| Argument/Flag | Default | Description |
| --- | --- | --- |
| INTERACTIONS_PATH | None | Path to the CSV file containing pairs of protein identifiers. The interaction probability between the amino acid sequences corresponding to each pair will be inferred. The protein identifiers must correspond to sequences in the FASTA file provided. |
| SEQUENCES_PATH | None | Path to the FASTA file containing the sequences for the protein identifiers referred to in INTERACTIONS_PATH. |
| WEIGHTS_PATH | None | Path to the pre-trained weights for the INTREPPPID model. You can learn how to download them here. |
| SPM_PATH | None | Path to the trained SentencePiece model. These are included with the weights on the GitHub release page. |
| OUT_PATH | None | The path where the inferred interaction probabilities will be written in CSV format. |
| --trunc_len | 1500 | Maximum number of tokens to pass to the model. If a sequence has more tokens than trunc_len, it will be truncated. Note: tokens are between 1 and 2 amino acids long, so this corresponds to roughly 1500-3000 amino acids. |
| --low_memory | False | Operate in "low-memory" mode. When low_memory is False, all of the tokenized sequences computed from SEQUENCES_PATH must fit in memory. When low_memory is True, tokenized sequences are stored on disk in an LMDB database, with minimal memory overhead. |
| --db_path | None | If low_memory is True, specifies the folder where the tokenized sequence database will be stored. If not specified, a temporary folder is used. Does nothing if low_memory is False. |
| --dont_populate_db | False | If low_memory is True, uses the tokenized sequences stored in an existing database at db_path, skipping reading and tokenizing the sequences. Does nothing if low_memory is False. |
| --device | "cpu" | Which device to run INTREPPPID on. Valid values are described in the PyTorch documentation; in short, "cpu" runs on the CPU, "cuda" runs on a CUDA-capable GPU, and "cuda:0" runs on the zeroth CUDA-capable GPU. |
| --get_from_uniprot | False | When True and identifiers in INTERACTIONS_PATH are not found among the identifiers in SEQUENCES_PATH, the sequences will be looked up on UniProt. |
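For reference, the input files might look something like the following. This is an assumed, illustrative example (the exact CSV column layout is not specified here): each row of the interactions CSV pairs two protein identifiers, and the FASTA file provides a sequence for every identifier used (all identifiers and sequences below are placeholders).

interactions.csv:

    protein_a,protein_b
    PROT001,PROT002
    PROT001,PROT003

sequences.fasta:

    >PROT001
    MKTAYIAKQRQISFVKSHFSRQ...
    >PROT002
    MSIEEQAKTFLDKFNHEAED...
    >PROT003
    MENLNMDLLYMAAAVMM...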