embedding_dbs
Warning
This section involves juggling conflicting requirements across a large number of protein embedding models. If you'd rather avoid this, you can skip this step entirely: the embeddings have been pre-computed for you.
This folder holds all the tools required to build databases of embeddings generated by a number of methods. These databases are used by many of the experiments conducted in the manuscript.
embed_batch.py
This folder contains the embed_batch.py Python script, which embeds the sequences of the proteins found in our downstream PPI dataset.
python embed_batch.py PLLM INPUT_PATH MAX_BATCH_SIZE DEVICE OUTPUT
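For example, a hypothetical invocation (the input filename and batch size here are illustrative, not defaults) that embeds a FASTA file with ProtT5 on the GPU, writing to the default ProtT5 database path, might look like:

```shell
python embed_batch.py prottrans_t5 sequences.fasta 16 cuda ../../data/embeddings/prottrans_t5.lmdb
```

Each argument is described below.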
Arguments
PLLM
Specify the pLM to use to embed protein sequences. The valid options are:
| Command | pLM |
|---|---|
| squeezeprot_sp_nonstrict | SqueezeProt-SP (Non-strict) |
| squeezeprot_sp_strict | SqueezeProt-SP (Strict) |
| squeezeprot_u50 | SqueezeProt-U50 |
| prottrans_bert | ProtBERT |
| prottrans_t5 | ProtT5 |
| proteinbert | ProteinBERT |
| prose | ProSE |
| esm | ESM |
| rapppid | RAPPPID |
INPUT_PATH
This is the file path to the protein sequences. Valid file formats are:
| Format | Required Extension |
|---|---|
| FASTA | .fasta |
| Gzip'd FASTA | .fasta.gz |
| RAPPPID PPI Dataset | .h5 |
| CSV | .csv |
The CSV file must have two columns, with the headers "accession" and "sequence", containing a protein ID (usually a UniProt accession) and the amino acid sequence, respectively. Ultimately, it should look something like:
accession,sequence
P05067,MLPGLALLLLAAWTARALEVPTDGNAGLLAEPQIAMFCGR...
Q29537,MEESQSELNIDPPLSQETFSELWNLLPENNVLSSELCPAV...
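If your sequences start out in FASTA, a minimal stdlib-only sketch of the conversion to this CSV layout might look like the following (the function name and the accession-parsing rule are assumptions, not part of embed_batch.py):

```python
import csv


def fasta_to_csv(fasta_path: str, csv_path: str) -> None:
    """Convert a FASTA file to the accession,sequence CSV described above.

    The accession is taken to be the first whitespace-delimited token of
    each header line, with the leading '>' removed.
    """
    records = []
    accession, chunks = None, []
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # Flush the previous record before starting a new one.
                if accession is not None:
                    records.append((accession, "".join(chunks)))
                accession, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
    if accession is not None:
        records.append((accession, "".join(chunks)))
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["accession", "sequence"])
        writer.writerows(records)
```

Note that UniProt-style headers such as `>sp|P05067|A4_HUMAN` would need extra parsing to pull out the bare accession.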
Tip
To encode the sequences from the PPI dataset used in the manuscript, just point INPUT_PATH at ../../data/ppi/rapppid_[common_string_9606.protein.links.detailed.v12.0_upkb.csv]_Mz70T9t-4Y-i6jWD9sEtcjOr0X8=.h5.
MAX_BATCH_SIZE
The maximum number of sequences to embed at a time. Larger batches increase RAM/VRAM usage.
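Conceptually, the script slices the input into fixed-size batches before embedding; a sketch of that chunking (a hypothetical helper, not the script's actual code):

```python
from typing import Iterator


def batched(sequences: list[str], max_batch_size: int) -> Iterator[list[str]]:
    """Yield successive batches of at most max_batch_size sequences."""
    for start in range(0, len(sequences), max_batch_size):
        yield sequences[start:start + max_batch_size]
```

The final batch may be smaller than MAX_BATCH_SIZE when the number of sequences is not an exact multiple of it.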
DEVICE
Either "cpu" or "cuda" depending on whether sequences are to be embedded on the CPU or using a GPU. Defaults to "cpu".
OUTPUT
Path to save the embeddings database to. The default path depends on the pLM you specify:
| pLM | Path |
|---|---|
| SqueezeProt-SP (Non-strict) | ../../data/embeddings/squeezeprot_sp_nonstrict.lmdb |
| SqueezeProt-SP (Strict) | ../../data/embeddings/squeezeprot_sp_strict.lmdb |
| SqueezeProt-U50 | ../../data/embeddings/squeezeprot_u50.lmdb |
| ProtBERT | ../../data/embeddings/prottrans_bert.lmdb |
| ProtT5 | ../../data/embeddings/prottrans_t5.lmdb |
| ProteinBERT | ../../data/embeddings/proteinbert.lmdb |
| ProSE | ../../data/embeddings/prose.lmdb |
| ESM | ../../data/embeddings/esm.lmdb |
| RAPPPID | ../../data/embeddings/rapppid.lmdb |
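The defaults above amount to a simple command-name-to-path mapping; expressed as a Python dict (a sketch with hypothetical names, not embed_batch.py's internals):

```python
# Default output database per pLM command (mirrors the table above).
DEFAULT_OUTPUTS = {
    "squeezeprot_sp_nonstrict": "../../data/embeddings/squeezeprot_sp_nonstrict.lmdb",
    "squeezeprot_sp_strict": "../../data/embeddings/squeezeprot_sp_strict.lmdb",
    "squeezeprot_u50": "../../data/embeddings/squeezeprot_u50.lmdb",
    "prottrans_bert": "../../data/embeddings/prottrans_bert.lmdb",
    "prottrans_t5": "../../data/embeddings/prottrans_t5.lmdb",
    "proteinbert": "../../data/embeddings/proteinbert.lmdb",
    "prose": "../../data/embeddings/prose.lmdb",
    "esm": "../../data/embeddings/esm.lmdb",
    "rapppid": "../../data/embeddings/rapppid.lmdb",
}


def default_output(pllm: str) -> str:
    """Look up the default database path for a pLM command name."""
    return DEFAULT_OUTPUTS[pllm]
```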
Requirements
Because some models have conflicting dependencies, we need to create a separate environment for each group of models. You will then need to activate or deactivate the appropriate environment when running embed_batch.py, depending on the model you are using.
RAPPPID
RAPPPID works well with Python 3.8, and the dependencies are in the file requirements_rapppid.txt.
More information can be found on the RAPPPID GitHub repository.
ProteinBERT
ProteinBERT works well with Python 3.6, and the dependencies are in the file requirements_proteinbert.txt. You must then install ProteinBERT itself by cloning the repository and installing it as follows:
git clone https://github.com/nadavbra/protein_bert
cd protein_bert
git submodule init
git submodule update
python setup.py install
More information can be found on the ProteinBERT GitHub repository.
ProSE
ProSE works well with Python 3.7, and the dependencies are in the file requirements_prose.txt.
All other models
For all other models, we recommend Python 3.11. The dependencies are in the file requirements.txt.
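Putting it together, the per-model workflow might look like the following (the environment directory name is an assumption; substitute the Python version and requirements file each model needs):

```shell
# One virtual environment per model family.
python3 -m venv env_rapppid                 # e.g. created with a Python 3.8 interpreter
. env_rapppid/bin/activate
pip install -r requirements_rapppid.txt
python embed_batch.py rapppid input.fasta 32 cpu ../../data/embeddings/rapppid.lmdb
deactivate
```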