embedding_dbs

Warning

This section involves juggling conflicting requirements across a large number of protein embedding models. If you want to avoid this, you can skip this step entirely, as embeddings have been pre-computed for you.

This folder holds all the tools required to build databases of embeddings generated by a number of methods. These databases are used by many of the experiments conducted in the manuscript.

embed_batch.py

This folder contains the embed_batch.py Python script, which embeds the sequences of the proteins found in our downstream PPI dataset.

python embed_batch.py PLLM INPUT_PATH MAX_BATCH_SIZE DEVICE OUTPUT

Arguments

PLLM

Specify the pLM to use to embed protein sequences. The valid options are:

| Command | pLM |
| --- | --- |
| squeezeprot_sp_nonstrict | SqueezeProt-SP (Non-strict) |
| squeezeprot_sp_strict | SqueezeProt-SP (Strict) |
| squeezeprot_u50 | SqueezeProt-U50 |
| prottrans_bert | ProtBERT |
| prottrans_t5 | ProtT5 |
| proteinbert | ProteinBERT |
| prose | ProSE |
| esm | ESM |
| rapppid | RAPPPID |

INPUT_PATH

This is the file path to the protein sequences. Valid file formats are:

| Format | Required Extension |
| --- | --- |
| FASTA | .fasta |
| Gzip'd FASTA | .fasta.gz |
| RAPPPID PPI Dataset | .h5 |
| CSV | .csv |

The CSV file must contain two columns with the headers "accession" and "sequence", holding a protein ID (usually a UniProt accession) and the amino-acid sequence, respectively. It should look something like:

accession,sequence
P05067,MLPGLALLLLAAWTARALEVPTDGNAGLLAEPQIAMFCGR...
Q29537,MEESQSELNIDPPLSQETFSELWNLLPENNVLSSELCPAV...
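A quick standard-library check of this format might look like the sketch below. This is an illustration of the expected layout, not code from embed_batch.py; the function name and the strictness of the residue check are our assumptions.

```python
import csv

# The 20 standard amino-acid one-letter codes; extend with X/U/B/Z if your
# sequences contain non-standard or ambiguous residues.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def validate_ppi_csv(path):
    """Return a list of (accession, sequence) rows, raising on format errors."""
    rows = []
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames != ["accession", "sequence"]:
            raise ValueError(f"expected headers accession,sequence; got {reader.fieldnames}")
        for record in reader:
            seq = record["sequence"]
            if not set(seq) <= AMINO_ACIDS:
                raise ValueError(f"{record['accession']}: unexpected characters in sequence")
            rows.append((record["accession"], seq))
    return rows
```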

Tip

To encode the sequences from the PPI dataset used in the manuscript, just point this to ../../data/ppi/rapppid_[common_string_9606.protein.links.detailed.v12.0_upkb.csv]_Mz70T9t-4Y-i6jWD9sEtcjOr0X8=.h5.

MAX_BATCH_SIZE

The number of sequences to embed in each batch. Larger batches increase RAM/VRAM usage.
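Batching of this kind can be sketched as follows; this is a simplified illustration of how a batch size bounds memory use, not the script's actual code.

```python
def batched(sequences, max_batch_size):
    """Yield successive chunks of at most max_batch_size sequences.

    Only one chunk needs to be resident on the CPU/GPU at a time, which is
    why a larger max_batch_size raises peak RAM/VRAM usage.
    """
    for start in range(0, len(sequences), max_batch_size):
        yield sequences[start:start + max_batch_size]
```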

DEVICE

Either "cpu" or "cuda" depending on whether sequences are to be embedded on the CPU or using a GPU. Defaults to "cpu".

OUTPUT

Path to save the embeddings database to. The default path depends on the pLM you specify:

| pLM | Path |
| --- | --- |
| SqueezeProt-SP (Non-strict) | ../../data/embeddings/squeezeprot_sp_nonstrict.lmdb |
| SqueezeProt-SP (Strict) | ../../data/embeddings/squeezeprot_sp_strict.lmdb |
| SqueezeProt-U50 | ../../data/embeddings/squeezeprot_u50.lmdb |
| ProtBERT | ../../data/embeddings/prottrans_bert.lmdb |
| ProtT5 | ../../data/embeddings/prottrans_t5.lmdb |
| ProteinBERT | ../../data/embeddings/proteinbert.lmdb |
| ProSE | ../../data/embeddings/prose.lmdb |
| ESM | ../../data/embeddings/esm.lmdb |
| RAPPPID | ../../data/embeddings/rapppid.lmdb |
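Note the pattern in the table: each default path is simply the PLLM command name under ../../data/embeddings/ with an .lmdb extension. A minimal sketch of that lookup (the function and set names are illustrative, not the script's actual variables):

```python
# Valid PLLM command names, as accepted by embed_batch.py.
VALID_PLLMS = {
    "squeezeprot_sp_nonstrict", "squeezeprot_sp_strict", "squeezeprot_u50",
    "prottrans_bert", "prottrans_t5", "proteinbert", "prose", "esm", "rapppid",
}

def default_output(pllm):
    """Return the default embedding-database path for a PLLM command name."""
    if pllm not in VALID_PLLMS:
        raise ValueError(f"unknown pLM: {pllm}")
    return f"../../data/embeddings/{pllm}.lmdb"
```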

Requirements

Because some models have conflicting dependencies, you will need to create separate environments for different models, and activate the appropriate one before running embed_batch.py.
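One way to manage these environments is with conda; the environment names below are our suggestion, not something the repository mandates. The Python versions match the per-model recommendations that follow.

```shell
# One environment per model family.
conda create -y -n rapppid python=3.8
conda run -n rapppid pip install -r requirements_rapppid.txt

conda create -y -n proteinbert python=3.6
conda run -n proteinbert pip install -r requirements_proteinbert.txt

conda create -y -n prose python=3.7
conda run -n prose pip install -r requirements_prose.txt

conda create -y -n embeddings python=3.11
conda run -n embeddings pip install -r requirements.txt

# Then activate the matching environment before embedding, e.g. for ProSE:
conda activate prose
```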

RAPPPID

RAPPPID works well with Python 3.8, and the dependencies are in the file requirements_rapppid.txt.

More information can be found on the RAPPPID GitHub repository.

ProteinBERT

ProteinBERT works well with Python 3.6, and its dependencies are in the file requirements_proteinbert.txt. You must then install ProteinBERT itself by cloning the repository and installing it as follows:

git clone https://github.com/nadavbra/protein_bert
cd protein_bert
git submodule init
git submodule update
python setup.py install

More information can be found on the ProteinBERT GitHub repository.

ProSE

ProSE works well with Python 3.7, and the dependencies are in the file requirements_prose.txt.

All other models

For all other models, we recommend Python 3.11. The dependencies are in the file requirements.txt.