Evolutionary Scale Modeling (ESM) is a protein large language model released by Meta and described in Verkuil et al. We use the 650M-parameter ESM-2 (version 2) weights released by Meta and retrieved via PyTorch Hub. It is referred to as esm by the code for the manuscript.
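As an illustrative sketch, the 650M ESM-2 checkpoint can be loaded through PyTorch Hub roughly as follows; the esm2_t33_650M_UR50D entry point follows the public facebookresearch/esm hub listing, and the exact checkpoint name used here is an assumption.

```python
import torch

# Load the ESM-2 650M model and its alphabet from PyTorch Hub.
# Repo and entry-point names follow the public facebookresearch/esm listing.
model, alphabet = torch.hub.load("facebookresearch/esm:main", "esm2_t33_650M_UR50D")
batch_converter = alphabet.get_batch_converter()
model.eval()

# Embed a single (illustrative) sequence and keep the final-layer representations.
labels, strs, tokens = batch_converter([("protein1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")])
with torch.no_grad():
    out = model(tokens, repr_layers=[33])
embeddings = out["representations"][33]  # (batch, seq_len, 1280)
```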
ProtBERT is one of the many protein large language models described by Elnaggar et al., and is based on BERT. Weights are retrieved from the HuggingFace repository Rostlab/prot_bert. It is referred to as prottrans_bert by the code for the manuscript.
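A minimal sketch of loading these weights with the HuggingFace transformers library, following the usage documented on the Rostlab model card; the space-separated input format and example sequence are illustrative.

```python
import re
import torch
from transformers import BertModel, BertTokenizer

# Load the ProtBERT tokenizer and encoder from the HuggingFace hub.
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")
model.eval()

# ProtBERT expects space-separated residues; rare amino acids map to X.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
sequence = " ".join(re.sub(r"[UZOB]", "X", sequence))

inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state  # (1, seq_len + 2, 1024)
```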
ProtT5 is one of the many protein large language models described by Elnaggar et al., and is based on T5. Weights are retrieved from the HuggingFace repository Rostlab/prot_t5_xl_half_uniref50-enc. It is referred to as prottrans_t5 by the code for the manuscript.
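A corresponding sketch for the encoder-only T5 checkpoint, again following the Rostlab model card; the preprocessing shown is illustrative.

```python
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

# Load the ProtT5-XL encoder-only checkpoint from the HuggingFace hub.
model_name = "Rostlab/prot_t5_xl_half_uniref50-enc"
tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
model = T5EncoderModel.from_pretrained(model_name)
model.eval()  # on a GPU the model is typically cast to half precision with model.half()

# Residues are space-separated; rare amino acids map to X.
sequence = " ".join(re.sub(r"[UZOB]", "X", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state  # (1, seq_len + 1, 1024)
```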
ProSE is a protein large language model described by Bepler et al., and is based on a recurrent neural network architecture. Weights are retrieved from the GitHub repository. It is referred to as prose by the code for the manuscript.
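As a rough sketch only, assuming the interface exposed by the tbepler/prose GitHub repository (the ProSEMT.load_pretrained helper, the Uniprot21 alphabet, and the transform call are taken from that repository's embedding example and may differ between versions):

```python
import torch
from prose.alphabets import Uniprot21
from prose.models.multitask import ProSEMT

# Load the pretrained multitask ProSE model; module paths follow the
# tbepler/prose repository and are an assumption about the exact version used.
model = ProSEMT.load_pretrained()
model.eval()

# Encode a sequence with the Uniprot21 alphabet and embed it.
alphabet = Uniprot21()
x = torch.from_numpy(alphabet.encode(b"MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")).long().unsqueeze(0)
with torch.no_grad():
    embeddings = model.transform(x)  # per-residue embeddings
```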
ProteinBERT is a protein large language model described by Brandes et al., and is based on the BERT architecture. It uses an additional Gene Ontology (GO) term annotation prediction task during pretraining. Weights and code were retrieved from the GitHub repository. It is referred to as proteinbert by the code for the manuscript.
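A minimal sketch of loading the pretrained model with the proteinbert package, assuming the interface shown in the nadavbra/protein_bert repository README (load_pretrained_model, the input encoder, and the hidden-layer output helper are assumptions about the exact version used):

```python
from proteinbert import load_pretrained_model
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

# Load the pretrained ProteinBERT model generator and input encoder.
pretrained_model_generator, input_encoder = load_pretrained_model()

# Build a Keras model for a fixed input length and embed one illustrative sequence.
seq_len = 512
model = get_model_with_hidden_layers_as_outputs(pretrained_model_generator.create_model(seq_len))
encoded_x = input_encoder.encode_X(["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"], seq_len)
local_representations, global_representations = model.predict(encoded_x)
```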
SqueezeProt-SP (Strict) is a novel protein large language model introduced in this manuscript. It is trained on a strict SWISS-PROT dataset which excludes proteins from the downstream PPI testing dataset. See the manuscript for more details. It is referred to as squeezeprot_sp_strict by the code for the manuscript.
SqueezeProt-SP (Non-strict) is a novel protein large language model introduced in this manuscript. It is trained on a non-strict SWISS-PROT dataset which includes proteins from the downstream PPI testing dataset. See the manuscript for more details. It is referred to as squeezeprot_sp_nonstrict by the code for the manuscript.
SqueezeProt-U50 is a novel protein large language model introduced in this manuscript. It is trained on a non-strict UniRef50 dataset which includes proteins from the downstream PPI testing dataset. See the manuscript for more details. It is referred to as squeezeprot_u50 by the code for the manuscript.