# Command Line Interface
INTREPPPID has a CLI which can be used to easily train INTREPPPID models and infer interactions.
## Train
To train the INTREPPPID model as it was trained in the manuscript, use the `train e2e_rnn_triplet` command:
```
$ intrepppid train e2e_rnn_triplet DATASET.h5 spm.model 3 100 80 \
    --seed 3927704 --vocab_size 250 --trunc_len 1500 --embedding_size 64 \
    --rnn_num_layers 2 --rnn_dropout_rate 0.3 --variational_dropout false \
    --bi_reduce last --workers 4 --embedding_droprate 0.3 --do_rate 0.3 \
    --log_path logs/e2e_rnn_triplet --beta_classifier 2 --use_projection false \
    --optimizer_type ranger21_xx --lr 1e-2
```
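The internal layout of the HDF5 dataset is not described here, but you can list the groups and datasets a file actually contains (and therefore which C-type splits are available) with `h5py`. This is a minimal sketch; the file name is a placeholder:

```python
import h5py

# Walk an INTREPPPID HDF5 dataset and print the name of every
# group and dataset it contains. Nothing about the layout is assumed;
# whatever is in the file gets printed.
with h5py.File("DATASET.h5", "r") as f:
    f.visit(print)
```

Each argument and flag of the train command is described in the table below.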
| Argument/Flag | Default | Manuscript Value | Description |
|---|---|---|---|
| `PPI_DATASET_PATH` | None | See Data | Path to the PPI dataset. Must be in the INTREPPPID HDF5 format. |
| `SENTENCEPIECE_PATH` | None | See Data | Path to the SentencePiece model. |
| `C_TYPE` | None | 3 | Selects which dataset within the INTREPPPID HDF5 file to use, by its C-type. |
| `NUM_EPOCHS` | None | 100 | Number of epochs to train the model for. |
| `BATCH_SIZE` | None | 80 | The number of samples to use in each batch. |
| `--seed` | None | 3927704 | The random seed. If not specified, chosen at random. |
| `--vocab_size` | | 250 | The number of tokens in the SentencePiece vocabulary. |
| `--trunc_len` | | 1500 | Length at which to truncate sequences. |
| `--embedding_size` | | 64 | The size of the embeddings. |
| `--rnn_num_layers` | | 2 | The number of layers in the AWD-LSTM encoder. |
| `--rnn_dropout_rate` | | 0.3 | The DropConnect rate for the AWD-LSTM encoder. |
| `--variational_dropout` | | false | Whether to use variational dropout, as described in the AWD-LSTM manuscript. |
| `--bi_reduce` | | last | Method for reducing the LSTM embeddings of the two directions into one. Must be one of “concat”, “max”, “mean”, “last”. |
| `--workers` | | 4 | The number of processes to use for the DataLoader. |
| `--embedding_droprate` | | 0.3 | The amount of Embedding Dropout to use (à la AWD-LSTM). |
| `--do_rate` | | 0.3 | The amount of dropout to use in the MLP Classifier. |
| `--log_path` | | logs/e2e_rnn_triplet | The path to save logs. |
| `--encoder_only_steps` | | | The number of steps during which only the encoder, and not the classifier, is trained. |
| `--classifier_warm_up` | | | The number of steps during which only the classifier, and not the encoder, is trained. |
| `--beta_classifier` | | 2 | Adjusts the weight given to the PPI Classification loss, relative to the Orthologue Locality loss. The loss becomes (1/β)×(classifier_loss) + [1−(1/β)]×(orthologue_loss); see the sketch after this table. |
| `--lr` | | 1e-2 | Learning rate to use. |
| `--use_projection` | | false | Whether to use a projection network after the encoder. |
| `--checkpoint_path` | | | The location where checkpoints are to be saved. |
| `--optimizer_type` | | ranger21_xx | The optimizer to use while training (the manuscript command above uses `ranger21_xx`). |
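To make the `--beta_classifier` weighting concrete, here is the combination rule from the table written out as plain Python. This is a standalone illustration of the formula, not code taken from INTREPPPID itself:

```python
def combined_loss(classifier_loss: float, orthologue_loss: float, beta: float) -> float:
    """Weight the PPI classification loss against the orthologue locality loss."""
    w = 1.0 / beta
    return w * classifier_loss + (1.0 - w) * orthologue_loss

# With the manuscript value --beta_classifier 2, both losses count equally:
print(combined_loss(0.40, 0.80, beta=2))  # 0.5*0.40 + 0.5*0.80 = 0.60

# A larger beta shifts weight toward the orthologue locality loss:
print(combined_loss(0.40, 0.80, beta=4))  # 0.25*0.40 + 0.75*0.80 = 0.70
```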
## Infer
To infer edges using the CLI, you’ll need the `intrepppid infer` command. Currently, it is only possible to infer from a CSV, using the following command:
```
Usage: intrepppid infer from_csv INTERACTIONS_PATH SEQUENCES_PATH WEIGHTS_PATH SPM_PATH OUT_PATH <flags>

optional flags: --trunc_len | --low_memory | --db_path |
                --dont_populate_db | --device | --get_from_uniprot
```
Here’s an example of inferring interactions from a CSV. The file paths below are placeholders; substitute your own:
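```
$ intrepppid infer from_csv interactions.csv sequences.fasta \
    intrepppid_weights.ckpt spm.model predictions.csv --device cuda
```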
| Argument/Flag | Default | Description |
|---|---|---|
| `INTERACTIONS_PATH` | None | Path to the CSV file which contains pairs of protein IDs along with interaction identifiers. The interaction between the amino acid sequences that correspond to these identifiers will be inferred. The protein identifiers must correspond to sequences in the FASTA file provided. |
| `SEQUENCES_PATH` | None | Path to the FASTA file which contains the sequences of the protein identifiers referred to in `INTERACTIONS_PATH`. |
| `WEIGHTS_PATH` | None | Path to the pre-trained weights for the INTREPPPID model. You can learn how to download them here. |
| `SPM_PATH` | None | Path to the trained SentencePiece model. These are included with the weights on the GitHub release page. |
| `OUT_PATH` | None | The path where the inferred interaction probabilities will be written in CSV format. |
| `--trunc_len` | 1500 | Maximum number of tokens to pass to the model. Sequences with more tokens than this are truncated to this length. |
| `--low_memory` | False | Operate in “low-memory” mode, storing tokenized sequences in an on-disk database rather than in memory. |
| `--db_path` | None | When `--low_memory` is true, specifies the folder where the tokenized sequence database will be stored. If not specified, a temporary folder will be used. Does nothing if `--low_memory` is false. |
| `--dont_populate_db` | False | When `--low_memory` is true, reuses the tokenized sequences stored in an existing database at `--db_path` instead of tokenizing the sequences anew. |
| `--device` | “cpu” | What device to run INTREPPPID on. Valid values are described in the PyTorch Documentation, but suffice it to say “cpu” runs on the CPU, “cuda” runs on a CUDA-capable GPU, and “cuda:0” runs on the zeroth CUDA-capable GPU. |
| `--get_from_uniprot` | False | When true, sequences for the identifiers in `INTERACTIONS_PATH` are retrieved from UniProt. |
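Once inference finishes, `OUT_PATH` holds the predicted interaction probabilities. Here is a minimal way to inspect the output, assuming only that it is a CSV file; the file name is a placeholder, and the column layout is printed rather than assumed:

```python
import csv

# Print every row of the inference output CSV as-is.
with open("predictions.csv", newline="") as f:
    for row in csv.reader(f):
        print(row)
```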