# encode_llm
This module exposes a common API for inferring the embeddings used in this manuscript from amino-acid sequences using the pLLMs, as well as RAPPPID, which is not a language model but from which embeddings can nonetheless be extracted.

There is one sub-module for each of the pLLMs and for RAPPPID, listed below (a usage sketch follows the table):
| Sub-module | pLLM |
|---|---|
| `esmer.py` | ESM |
| `proser.py` | ProSE |
| `proteinberter.py` | ProteinBERT |
| `prottrans_bert.py` | ProtBERT |
| `prottrans_t5.py` | ProtT5 |
| `rapppid.py` | RAPPPID |
| `squeezebert.py` | SqueezeProt-SP (Strict), SqueezeProt-SP (Non-strict), SqueezeProt-U50 |
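Because every sub-module exposes the same two functions, switching models amounts to switching imports. A minimal sketch follows, assuming the sub-modules are importable (e.g., run from within `experiments/encode_llm`); the example sequence is an arbitrary placeholder:

```python
# Minimal usage sketch. Assumes the sub-modules listed above are on the
# import path (e.g., the script is run from experiments/encode_llm).
# The sequence is an arbitrary placeholder, not data from the manuscript.
import esmer
import prottrans_bert

SEQ = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"

esm_embedding = esmer.encode(SEQ)            # embed with ESM
bert_embedding = prottrans_bert.encode(SEQ)  # embed with ProtBERT, same call

# Embedding dimensionality differs between models.
print(len(esm_embedding), len(bert_embedding))
```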
## Common API
All modules contain two functions, `encode` and `encode_batch`, which calculate embeddings from a single sequence or from many sequences, respectively. A usage sketch follows each argument table below.
### encode

```python
encode(sequence: str, device: str = "cpu") -> List[float]
```
| Argument | Default | Type | Description |
|---|---|---|---|
| `sequence` | (required) | `str` | Amino-acid sequence to embed. |
| `device` | `"cpu"` | `str` | Device to run inference on. Must be a valid PyTorch device string. |
### encode_batch

```python
encode_batch(batch: List[str], device: str = "cpu") -> List[List[float]]
```
| Argument | Default | Type | Description |
|---|---|---|---|
| `batch` | (required) | `List[str]` | A list of amino-acid sequences to embed. |
| `device` | `"cpu"` | `str` | Device to run inference on. Must be a valid PyTorch device string. |
## Selecting SqueezeProt Variants

There is only one module (`squeezebert.py`) for all the SqueezeProt models, but there are three variants. Two additional arguments specify the weights and tokenizer used to embed sequences; to embed with a specific variant, point `weights_path` at that variant's weights (the tokenizer is shared across variants; a sketch follows the `encode` argument table below).
### encode

```python
encode(
    sequence: str,
    device: str = "cpu",
    weights_path: str = "../../data/chkpts/squeezeprot-sp.strict/checkpoint-1383824",
    tokenizer_path: str = "../../data/tokenizer/bert-base-cased/tokenizer.t0.s8675309",
) -> List[float]
```
| Argument | Default | Type | Description |
|---|---|---|---|
| `sequence` | (required) | `str` | Amino-acid sequence to embed. |
| `device` | `"cpu"` | `str` | Device to run inference on. Must be a valid PyTorch device string. |
| `weights_path` | `"../../data/chkpts/squeezeprot-sp.strict/checkpoint-1383824"` | `str` | Path to the SqueezeProt weights to load. These are specific to each SqueezeProt variant. See the Data section for more. |
| `tokenizer_path` | `"../../data/tokenizer/bert-base-cased/tokenizer.t0.s8675309"` | `str` | Path to the tokenizer to load. This is the same value for all the variants. See the Data section for more. |
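A sketch of selecting a variant. The first call spells out the documented SqueezeProt-SP (Strict) defaults; the second uses a placeholder checkpoint path, since the actual per-variant paths are listed in the Data section:

```python
import squeezebert

# The tokenizer is shared by all three variants.
TOKENIZER = "../../data/tokenizer/bert-base-cased/tokenizer.t0.s8675309"

# SqueezeProt-SP (Strict): these are the documented default paths.
strict_emb = squeezebert.encode(
    "MKWVTFISLLLLFSSAYS",  # arbitrary placeholder sequence
    weights_path="../../data/chkpts/squeezeprot-sp.strict/checkpoint-1383824",
    tokenizer_path=TOKENIZER,
)

# Any other variant: point weights_path at that variant's checkpoint.
# PLACEHOLDER path below; the real checkpoint paths are in the Data section.
other_emb = squeezebert.encode(
    "MKWVTFISLLLLFSSAYS",
    weights_path="../../data/chkpts/<variant-checkpoint>",
    tokenizer_path=TOKENIZER,
)
```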
### encode_batch

```python
encode_batch(
    batch: List[str],
    device: str = "cpu",
    weights_path: str = "../../data/chkpts/squeezeprot-sp.strict/checkpoint-1383824",
    tokenizer_path: str = "../../data/tokenizer/bert-base-cased/tokenizer.t0.s8675309",
) -> List[List[float]]
```
| Argument | Default | Type | Description |
|---|---|---|---|
| `batch` | (required) | `List[str]` | A list of amino-acid sequences to embed. |
| `device` | `"cpu"` | `str` | Device to run inference on. Must be a valid PyTorch device string. |
| `weights_path` | `"../../data/chkpts/squeezeprot-sp.strict/checkpoint-1383824"` | `str` | Path to the SqueezeProt weights to load. These are specific to each SqueezeProt variant. See the Data section for more. |
| `tokenizer_path` | `"../../data/tokenizer/bert-base-cased/tokenizer.t0.s8675309"` | `str` | Path to the tokenizer to load. This is the same value for all the variants. See the Data section for more. |
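And the batched equivalent, leaving the default SqueezeProt-SP (Strict) paths implicit (the sequences are again placeholders):

```python
import squeezebert

# Batched embedding with the default SqueezeProt-SP (Strict) weights;
# pass weights_path/tokenizer_path as in encode() to select another variant.
embeddings = squeezebert.encode_batch(
    ["MKWVTFISLLLLFSSAYS", "MKTAYIAKQRQISFVKSH"],  # placeholder sequences
    device="cpu",
)
assert len(embeddings) == 2
```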
## Requirements

The requirements for this module can be installed from the `requirements.txt` file in the `experiments/encode_llm` folder (e.g., `pip install -r requirements.txt`).