API#

Data#

class intrepppid.data.ppi_oma.IntrepppidDataset(dataset_path: Path, c_type: int, split: str, model_file: Path, trunc_len: int = 1000, sos: bool = False, eos: bool = False, negative_omid: bool = False)#
__getitem__(idx: int)#

Get a sample from the dataset by its index.

This returns:

p1_seq, p2_seq, omid1_seq, omid2_seq, omid_neg_seq, label

if negative_omid is True. Otherwise:

p1_seq, p2_seq, omid1_seq, omid2_seq, label

Here, p1_seq and p2_seq are two amino acid sequences whose interaction status is indicated by label; omid1_seq is the anchor protein sequence for the orthologous locality task, omid2_seq is the positive protein sequence, and omid_neg_seq is the negative protein sequence.

Negative proteins are randomly sampled.

Parameters:

idx – The index of the sample to return.
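
For example, a sample can be retrieved by indexing an instantiated dataset (a minimal sketch; the dataset variable below is assumed to have been constructed as described under __init__):

    # Assuming `dataset` was built with negative_omid=True
    p1_seq, p2_seq, omid1_seq, omid2_seq, omid_neg_seq, label = dataset[0]

    # With negative_omid=False, five values are returned instead
    p1_seq, p2_seq, omid1_seq, omid2_seq, label = dataset[0]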

__init__(dataset_path: Path, c_type: int, split: str, model_file: Path, trunc_len: int = 1000, sos: bool = False, eos: bool = False, negative_omid: bool = False)#

Builds a PyTorch dataset from an HDF5 dataset in the INTREPPPID format.

Parameters:
  • dataset_path – The path to the HDF5 in the INTREPPPID format.

  • c_type – The C-type pairs to use in the dataset.

  • split – Which split the data should come from (one of train, val, or test).

  • model_file – The path to the SentencePiece model.

  • trunc_len – The length at which to truncate the amino acid sequence.

  • sos – Boolean indicating whether a start-of-sequence token should be used.

  • eos – Boolean indicating whether an end-of-sequence token should be used.

  • negative_omid – Boolean indicating whether to return a negative example for the orthologous locality task.
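
A minimal construction sketch; the file paths and c_type value below are placeholders, not files shipped with the library:

    from pathlib import Path
    from intrepppid.data.ppi_oma import IntrepppidDataset

    dataset = IntrepppidDataset(
        dataset_path=Path("ppi.h5"),    # placeholder HDF5 in the INTREPPPID format
        c_type=3,                       # placeholder C-type
        split="train",
        model_file=Path("spm.model"),   # placeholder SentencePiece model
        trunc_len=1000,
        sos=False,
        eos=False,
        negative_omid=True,
    )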

__len__()#

Get the number of samples in the dataset.

encode(seq: str, sp: bool = True, pad: bool = True)#

Encodes an amino-acid sequence using the SentencePiece model, pads, and adds start/end-of-sequence tokens.

Parameters:
  • seq – The amino acid sequence to be encoded.

  • sp – A boolean which indicates whether to encode the sequence with SentencePiece.

  • pad – Whether to pad up to trunc_len (True) or not (False). Pads on the right with zeroes.
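
For example (a sketch, assuming the dataset constructed above; the sequence is an arbitrary illustration):

    tokens = dataset.encode("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", sp=True, pad=True)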

get_omid_member(omid: int)#

Get the UniProt accession of a randomly chosen protein in the specified OMA group in this dataset.

Parameters:

omid – The OMA ID.

get_omid_members(omid: int)#

Get all the UniProt accessions of the proteins in the specified OMA group in this dataset.

Parameters:

omid – The OMA ID.

get_sequence(name: str)#

Get the amino-acid sequence of a protein in the dataset given its UniProt accession.

Parameters:

name – The UniProt accession of the protein.
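
A sketch tying the three lookup helpers together; the OMA ID and UniProt accession below are placeholders:

    members = dataset.get_omid_members(12345)   # all accessions in the OMA group
    anchor = dataset.get_omid_member(12345)     # one randomly sampled accession
    seq = dataset.get_sequence("P12345")        # amino-acid sequence for an accession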

static static_encode(trunc_len: int, spp, seq: str, sp: bool = True, pad: bool = True, sampling: bool = True, sos: bool = False, eos: bool = False)#

Encodes an amino-acid sequence using the SentencePiece model, pads, and adds start/end-of-sequence tokens.

This is a static method, so it does not require the class to be instantiated.

Parameters:
  • trunc_len – The length at which to truncate the amino acid sequence.

  • spp – The SentencePiece model to use to encode the sample.

  • seq – The amino acid sequence to be encoded.

  • sp – A boolean which indicates whether to encode the sequence with SentencePiece.

  • pad – Whether to pad up to trunc_len (True) or not (False). Pads on the right with zeroes.

  • sampling – Whether to randomly sample tokens (True) or not (False). See the SentencePiece library/manuscript for details.

  • sos – Boolean indicating whether to add a start-of-sequence token (True) or not (False).

  • eos – Boolean indicating whether to add an end-of-sequence token (True) or not (False).
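
A usage sketch, assuming spp is a loaded SentencePiece processor (the model path is a placeholder):

    import sentencepiece as spm

    from intrepppid.data.ppi_oma import IntrepppidDataset

    spp = spm.SentencePieceProcessor(model_file="spm.model")  # placeholder model file
    tokens = IntrepppidDataset.static_encode(
        trunc_len=1000, spp=spp, seq="MKTAYIAKQR",
        sp=True, pad=True, sampling=True, sos=False, eos=False,
    )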

class intrepppid.data.ppi_oma.IntrepppidDataModule(batch_size: int, dataset_path: Path, c_type: int, trunc_len: int, workers: int, vocab_size: int, model_file: str, seed: int, sos: bool, eos: bool, negative_omid: bool = False)#
__init__(batch_size: int, dataset_path: Path, c_type: int, trunc_len: int, workers: int, vocab_size: int, model_file: str, seed: int, sos: bool, eos: bool, negative_omid: bool = False)#

A PyTorch Lightning Data Module for INTREPPPID datasets.

Parameters:
  • batch_size – The size of the batches for the Data Module to generate.

  • dataset_path – The path to the HDF5 in the INTREPPPID format.

  • c_type – The C-type pairs to use in the dataset.

  • trunc_len – The length at which to truncate the amino acid sequence.

  • workers – The number of CPU processes used to load data.

  • vocab_size – The number of tokens in the SentencePiece vocabulary.

  • model_file – The path to the SentencePiece model.

  • seed – The random seed to use for sampling SentencePiece tokens.

  • sos – Boolean indicating whether to add a start-of-sequence token (True) or not (False).

  • eos – Boolean indicating whether to add an end-of-sequence token (True) or not (False).

  • negative_omid – Boolean indicating whether to return a negative example for the orthologous locality task.
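
A construction sketch mirroring the dataset example above; all paths and values are placeholders:

    from pathlib import Path
    from intrepppid.data.ppi_oma import IntrepppidDataModule

    data_module = IntrepppidDataModule(
        batch_size=80,                  # placeholder batch size
        dataset_path=Path("ppi.h5"),    # placeholder HDF5 in the INTREPPPID format
        c_type=3,
        trunc_len=1000,
        workers=4,
        vocab_size=250,
        model_file="spm.model",         # placeholder SentencePiece model
        seed=42,
        sos=False,
        eos=False,
        negative_omid=True,
    )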

setup(stage=None)#

Instantiate the internal datasets.

Parameters:

stage – Specifies the stage of training the model (must be one of train, val, or test).

test_dataloader()#

Returns the dataloader for the test set.

train_dataloader()#

Returns the dataloader for the training set.

val_dataloader()#

Returns the dataloader for the validation set.
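
In the usual PyTorch Lightning flow, setup() is called before requesting dataloaders; a sketch, assuming the data_module constructed above:

    data_module.setup()
    train_loader = data_module.train_dataloader()
    for batch in train_loader:
        ...  # batches of the tuples described in IntrepppidDataset.__getitem__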

Network#

intrepppid.intrepppid_network(steps_per_epoch: int, vocab_size: int = 250, embedding_size: int = 64, rnn_num_layers: int = 2, rnn_dropout_rate: float = 0.3, variational_dropout: bool = False, bi_reduce: str = 'last', embedding_droprate: float = 0.3, num_epochs: int = 100, do_rate: float = 0.3, beta_classifier: int = 2, lr: float = 0.01, use_projection: bool = False, optimizer_type: str = 'ranger21_xx')#

This builds a PyTorch nn.Module which represents the INTREPPPID network as defined in the manuscript.

It assembles a TripletE2ENet with an AWD-LSTM encoder and an MLP classifier.

Parameters:
  • steps_per_epoch – The number of mini-batch steps iterated over in each epoch. This only matters for training.

  • vocab_size – The number of tokens in the SentencePiece vocabulary. Defaults to 250.

  • embedding_size – The size of embeddings. Defaults to 64.

  • rnn_num_layers – The number of layers in the AWD-LSTM encoder to use. Defaults to 2.

  • rnn_dropout_rate – The dropconnect rate for the AWD-LSTM encoder. Defaults to 0.3.

  • variational_dropout – Whether to use variational dropout, as described in the AWD-LSTM manuscript. Defaults to False.

  • bi_reduce – Method to reduce the two LSTM embeddings for both directions. Must be one of “concat”, “max”, “mean”, “last”. Defaults to “last”.

  • embedding_droprate – The amount of Embedding Dropout to use (a la AWD-LSTM). Defaults to 0.3.

  • num_epochs – Number of epochs to train the model for.

  • do_rate – The amount of dropout to use in the MLP Classifier. Defaults to 0.3.

  • beta_classifier – Adjusts the weight given to the PPI classification loss relative to the orthologue locality loss. The loss becomes (1/beta_classifier) * classifier_loss + [1 - (1/beta_classifier)] * orthologue_loss. Defaults to 2 (equal contribution of both losses).

  • lr – Learning rate to use. Defaults to 1e-2.

  • use_projection – Whether to use a projection network after the encoder. Defaults to False.

  • optimizer_type – The optimizer to use while training. Must be one of “ranger21”, “ranger21_xx”, “adamw”, “adamw_1cycle”, or “adamw_cosine”. Defaults to “ranger21_xx”.
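
A minimal sketch of building the network; steps_per_epoch below is a placeholder that would normally be derived from the training dataloader:

    from intrepppid import intrepppid_network

    net = intrepppid_network(
        steps_per_epoch=1000,   # placeholder; typically len(train_dataloader)
        vocab_size=250,
        embedding_size=64,
        num_epochs=100,
        beta_classifier=2,      # equal weighting of the two losses
    )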