API#
Data#
- class intrepppid.data.ppi_oma.IntrepppidDataset(dataset_path: Path, c_type: int, split: str, model_file: Path, trunc_len: int = 1000, sos: bool = False, eos: bool = False, negative_omid: bool = False)#
- __getitem__(idx: int)#
Get a sample from the dataset by its index.
This returns p1_seq, p2_seq, omid1_seq, omid2_seq, omid_neg_seq, label if negative_omid is True; otherwise it returns p1_seq, p2_seq, omid1_seq, omid2_seq, label.
Here, p1_seq and p2_seq are the two amino acid sequences whose interaction status is indicated by label, omid1_seq is the anchor protein sequence for the orthologous locality task, omid2_seq is the positive protein sequence, and omid_neg_seq is the negative protein sequence. Negative proteins are randomly sampled.
- Parameters:
idx – The index of the sample to return.
- __init__(dataset_path: Path, c_type: int, split: str, model_file: Path, trunc_len: int = 1000, sos: bool = False, eos: bool = False, negative_omid: bool = False)#
Builds a PyTorch dataset from an HDF5 dataset in the INTREPPPID format.
- Parameters:
dataset_path – The path to the HDF5 dataset in the INTREPPPID format.
c_type – The C-type pairs to use in the dataset.
split – Which split the data should come from (i.e., train, test, or val).
model_file – The path to the SentencePiece model.
trunc_len – The length at which to truncate the amino acid sequence.
sos – Boolean indicating whether a start-of-sequence token should be used.
eos – Boolean indicating whether an end-of-sequence token should be used.
negative_omid – Boolean indicating whether to return a negative example for the orthologous locality task.
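For illustration, a minimal construction sketch follows. The file paths and the c_type value are placeholders, not values shipped with INTREPPPID; substitute your own dataset and SentencePiece model.

```python
from pathlib import Path

from intrepppid.data.ppi_oma import IntrepppidDataset

# Placeholder paths and c_type; point these at your own INTREPPPID
# HDF5 dataset and SentencePiece model.
dataset = IntrepppidDataset(
    dataset_path=Path("data/intrepppid.h5"),
    c_type=3,
    split="train",
    model_file=Path("data/spm.model"),
    trunc_len=1000,
    negative_omid=True,
)

# With negative_omid=True, __getitem__ returns a 6-tuple.
p1_seq, p2_seq, omid1_seq, omid2_seq, omid_neg_seq, label = dataset[0]
print(len(dataset), label)
```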
- __len__()#
Get the number of samples in the dataset.
- encode(seq: str, sp: bool = True, pad: bool = True)#
Encodes an amino-acid sequence using the SentencePiece model, pads, and adds start/end-of-sequence tokens.
- Parameters:
seq – The amino acid sequence to be encoded.
sp – A boolean which indicates whether to encode the sequence with SentencePiece.
pad – Whether to pad up to trunc_len (True) or not (False). Pads on the right with zeroes.
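Continuing with the dataset instance from the sketch above, encode might be used like this (the sequence is an arbitrary example):

```python
# Tokenise an amino acid sequence with the dataset's SentencePiece
# model, padding on the right up to trunc_len.
tokens = dataset.encode("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", sp=True, pad=True)
```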
- get_omid_member(omid: int)#
Get the Uniprot Accession of a random protein in the specified OMA group in this dataset.
- Parameters:
omid – The OMA ID.
- get_omid_members(omid: int)#
Get all the Uniprot Accessions of the proteins in the specified OMA group in this dataset.
- Parameters:
omid – The OMA ID.
- get_sequence(name: str)#
Get the amino-acid sequence of a protein in the dataset given its Uniprot Accession.
- Parameters:
name – The Uniprot Accession
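A short sketch of the three lookup helpers, again using the dataset instance from above; the OMA ID is a placeholder and must be one present in your dataset:

```python
# Placeholder OMA ID.
accessions = dataset.get_omid_members(737636)  # all members of the group
accession = dataset.get_omid_member(737636)    # one random member
sequence = dataset.get_sequence(accession)     # its amino acid sequence
```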
- static static_encode(trunc_len: int, spp, seq: str, sp: bool = True, pad: bool = True, sampling: bool = True, sos: bool = False, eos: bool = False)#
Encodes an amino-acid sequence using the SentencePiece model, pads, and adds start/end-of-sequence tokens.
This is the static method, which doesn’t require this class to be instantiated.
- Parameters:
trunc_len – The length at which to truncate the amino acid sequence.
spp – The SentencePiece model to use to encode the sample.
seq – The amino acid sequence to be encoded.
sp – A boolean which indicates whether to encode the sequence with SentencePiece.
pad – Whether to pad up to trunc_len (True) or not (False). Pads on the right with zeroes.
sampling – Whether to randomly sample tokens (True) or not (False). See the SentencePiece library/manuscript for details.
sos – Boolean indicating whether to add a start-of-sequence token (True) or not (False).
eos – Boolean indicating whether to add an end-of-sequence token (True) or not (False).
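Because static_encode does not require an instantiated dataset, it can be called directly. This sketch assumes spp expects a loaded sentencepiece.SentencePieceProcessor (the docs only say "the SentencePiece model"), and the model path is a placeholder:

```python
import sentencepiece

from intrepppid.data.ppi_oma import IntrepppidDataset

# Placeholder model path; assumed to be a trained SentencePiece model.
spp = sentencepiece.SentencePieceProcessor(model_file="data/spm.model")

tokens = IntrepppidDataset.static_encode(
    trunc_len=1000,
    spp=spp,
    seq="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    sp=True,
    pad=True,
    sampling=False,  # deterministic segmentation, e.g. at inference time
    sos=True,
    eos=True,
)
```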
- class intrepppid.data.ppi_oma.IntrepppidDataModule(batch_size: int, dataset_path: Path, c_type: int, trunc_len: int, workers: int, vocab_size: int, model_file: str, seed: int, sos: bool, eos: bool, negative_omid: bool = False)#
- __init__(batch_size: int, dataset_path: Path, c_type: int, trunc_len: int, workers: int, vocab_size: int, model_file: str, seed: int, sos: bool, eos: bool, negative_omid: bool = False)#
A PyTorch Lightning Data Module for INTREPPPID datasets.
- Parameters:
batch_size – The size of the batches for the Data Module to generate.
dataset_path – The path to the HDF5 dataset in the INTREPPPID format.
c_type – The C-type pairs to use in the dataset.
trunc_len – The length at which to truncate the amino acid sequence.
workers – The number of CPU processes used to load data.
vocab_size – The number of tokens in the SentencePiece vocabulary.
model_file – The path to the SentencePiece model.
seed – The random seed to use for sampling SentencePiece tokens.
sos – Boolean indicating whether to add a start-of-sequence token (True) or not (False).
eos – Boolean indicating whether to add an end-of-sequence token (True) or not (False).
negative_omid – Boolean indicating whether to return a negative example for the orthologous locality task.
- setup(stage=None)#
Instantiate the internal datasets.
- Parameters:
stage – Specifies the stage of training the model (must be one of train, val, or test).
- test_dataloader()#
Returns the dataloader for the test set.
- train_dataloader()#
Returns the dataloader for the training set.
- val_dataloader()#
Returns the dataloader for the validation set.
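Putting the Data Module together, a minimal sketch; the paths, batch size, seed, and c_type are placeholders:

```python
from pathlib import Path

from intrepppid.data.ppi_oma import IntrepppidDataModule

data_module = IntrepppidDataModule(
    batch_size=80,
    dataset_path=Path("data/intrepppid.h5"),
    c_type=3,
    trunc_len=1000,
    workers=4,
    vocab_size=250,
    model_file="data/spm.model",
    seed=8675309,
    sos=False,
    eos=False,
    negative_omid=True,
)

data_module.setup("train")

for batch in data_module.train_dataloader():
    # Each batch holds the tuples described in IntrepppidDataset.__getitem__.
    break
```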
Network#
- intrepppid.intrepppid_network(steps_per_epoch: int, vocab_size: int = 250, embedding_size: int = 64, rnn_num_layers: int = 2, rnn_dropout_rate: float = 0.3, variational_dropout: bool = False, bi_reduce: str = 'last', embedding_droprate: float = 0.3, num_epochs: int = 100, do_rate: float = 0.3, beta_classifier: int = 2, lr: float = 0.01, use_projection: bool = False, optimizer_type: str = 'ranger21')#
This builds a PyTorch nn.Module which represents the INTREPPPID network as defined in the manuscript.
It assembles a TripletE2ENet with an AWD-LSTM encoder and an MLP classifier.
- Parameters:
steps_per_epoch – Number of mini-batch steps iterated over each epoch. Only matters during training.
vocab_size – The number of tokens in the SentencePiece vocabulary. Defaults to 250.
embedding_size – The size of embeddings. Defaults to 64.
rnn_num_layers – The number of layers in the AWD-LSTM encoder to use. Defaults to 2.
rnn_dropout_rate – The dropconnect rate for the AWD-LSTM encoder. Defaults to 0.3.
variational_dropout – Whether to use variational dropout, as described in the AWD-LSTM manuscript. Defaults to False.
bi_reduce – Method to reduce the two LSTM embeddings for both directions. Must be one of “concat”, “max”, “mean”, “last”. Defaults to “last”.
embedding_droprate – The amount of Embedding Dropout to use (a la AWD-LSTM). Defaults to 0.3.
num_epochs – Number of epochs to train the model for. Defaults to 100.
do_rate – The amount of dropout to use in the MLP Classifier. Defaults to 0.3.
beta_classifier – Adjusts the amount of weight to give the PPI Classification loss, relative to the Orthologue Locality loss. The loss becomes (1/beta_classifier)*classifier_loss + [1-(1/beta_classifier)]*orthologue_loss. Defaults to 2 (equal contribution of both losses).
lr – Learning rate to use. Defaults to 1e-2.
use_projection – Whether to use a projection network after the encoder. Defaults to False.
optimizer_type – The optimizer to use while training. Must be one of “ranger21”, “ranger21_xx”, “adamw”, “adamw_1cycle”, or “adamw_cosine”. Defaults to “ranger21”.
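Finally, a sketch that builds the network against the Data Module from the previous example; steps_per_epoch is derived from the training dataloader, and the remaining arguments shown keep their documented values:

```python
from intrepppid import intrepppid_network

# steps_per_epoch is the number of mini-batches per training epoch.
steps_per_epoch = len(data_module.train_dataloader())

net = intrepppid_network(
    steps_per_epoch=steps_per_epoch,
    num_epochs=100,
    beta_classifier=2,  # equal weight on PPI and orthologue locality losses
    optimizer_type="ranger21",
)
```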