# encode_llm
This module exposes a common API for inferring the embeddings used in this manuscript from amino-acid sequences using the pLLMs, as well as RAPPPID, which is not a language model but from which embeddings can nonetheless be extracted.

There is one sub-module for each of the pLLMs and for RAPPPID, listed below (a usage sketch follows the table):
| Sub-module | pLLM |
|---|---|
| `esmer.py` | ESM |
| `proser.py` | ProSE |
| `proteinberter.py` | ProteinBERT |
| `prottrans_bert.py` | ProtBERT |
| `prottrans_t5.py` | ProtT5 |
| `rapppid.py` | RAPPPID |
| `squeezebert.py` | SqueezeProt-SP (Strict), SqueezeProt-SP (Non-strict), SqueezeProt-U50 |
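Because every sub-module exposes the same two functions, switching models amounts to switching imports. A minimal sketch follows, assuming the sub-modules are importable (e.g., run from within `experiments/encode_llm`); the example sequence is an arbitrary placeholder:

```python
# Minimal usage sketch. Assumes the sub-modules listed above are on the
# import path (e.g., the script is run from experiments/encode_llm).
# The sequence is an arbitrary placeholder, not data from the manuscript.
import esmer
import prottrans_bert

SEQ = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"

esm_embedding = esmer.encode(SEQ)            # embed with ESM
bert_embedding = prottrans_bert.encode(SEQ)  # embed with ProtBERT, same call

# Embedding dimensionality differs between models.
print(len(esm_embedding), len(bert_embedding))
```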
## Common API
All modules contain two functions, `encode` and `encode_batch`, which calculate embeddings from a single sequence or from many sequences, respectively. A usage sketch follows each argument table below.
### encode

```python
encode(sequence: str, device: str = "cpu") -> List[float]
```
| Argument | Default | Type | Description |
|---|---|---|---|
| `sequence` | (required) | `str` | Amino-acid sequence to embed. |
| `device` | `"cpu"` | `str` | Device to run inference on. Must be a valid PyTorch device string. |
### encode_batch

```python
encode_batch(batch: List[str], device: str = "cpu") -> List[List[float]]
```
| Argument | Default | Type | Description |
|---|---|---|---|
| `batch` | (required) | `List[str]` | A list of amino-acid sequences to embed. |
| `device` | `"cpu"` | `str` | Device to run inference on. Must be a valid PyTorch device string. |
## Selecting SqueezeProt Variants

There is only one module (`squeezebert.py`) for all the SqueezeProt models, but there are three variants. Two additional arguments specify the weights and tokenizer used to embed sequences; to embed with a specific variant, point `weights_path` at that variant's weights (the tokenizer is shared across variants; a sketch follows the `encode` argument table below).
### encode

```python
encode(
    sequence: str,
    device: str = "cpu",
    weights_path: str = "../../data/chkpts/squeezeprot-sp.strict/checkpoint-1383824",
    tokenizer_path: str = "../../data/tokenizer/bert-base-cased/tokenizer.t0.s8675309",
) -> List[float]
```
| Argument | Default | Type | Description |
|---|---|---|---|
| `sequence` | (required) | `str` | Amino-acid sequence to embed. |
| `device` | `"cpu"` | `str` | Device to run inference on. Must be a valid PyTorch device string. |
| `weights_path` | `"../../data/chkpts/squeezeprot-sp.strict/checkpoint-1383824"` | `str` | Path to the SqueezeProt weights to load. These are specific to each SqueezeProt variant. See the Data section for more. |
| `tokenizer_path` | `"../../data/tokenizer/bert-base-cased/tokenizer.t0.s8675309"` | `str` | Path to the tokenizer to load. This is the same value for all the variants. See the Data section for more. |
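A sketch of selecting a variant. The first call spells out the documented SqueezeProt-SP (Strict) defaults; the second uses a placeholder checkpoint path, since the actual per-variant paths are listed in the Data section:

```python
import squeezebert

# The tokenizer is shared by all three variants.
TOKENIZER = "../../data/tokenizer/bert-base-cased/tokenizer.t0.s8675309"

# SqueezeProt-SP (Strict): these are the documented default paths.
strict_emb = squeezebert.encode(
    "MKWVTFISLLLLFSSAYS",  # arbitrary placeholder sequence
    weights_path="../../data/chkpts/squeezeprot-sp.strict/checkpoint-1383824",
    tokenizer_path=TOKENIZER,
)

# Any other variant: point weights_path at that variant's checkpoint.
# PLACEHOLDER path below; the real checkpoint paths are in the Data section.
other_emb = squeezebert.encode(
    "MKWVTFISLLLLFSSAYS",
    weights_path="../../data/chkpts/<variant-checkpoint>",
    tokenizer_path=TOKENIZER,
)
```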
### encode_batch

```python
encode_batch(
    batch: List[str],
    device: str = "cpu",
    weights_path: str = "../../data/chkpts/squeezeprot-sp.strict/checkpoint-1383824",
    tokenizer_path: str = "../../data/tokenizer/bert-base-cased/tokenizer.t0.s8675309",
) -> List[List[float]]
```
| Argument | Default | Type | Description |
|---|---|---|---|
| `batch` | (required) | `List[str]` | A list of amino-acid sequences to embed. |
| `device` | `"cpu"` | `str` | Device to run inference on. Must be a valid PyTorch device string. |
| `weights_path` | `"../../data/chkpts/squeezeprot-sp.strict/checkpoint-1383824"` | `str` | Path to the SqueezeProt weights to load. These are specific to each SqueezeProt variant. See the Data section for more. |
| `tokenizer_path` | `"../../data/tokenizer/bert-base-cased/tokenizer.t0.s8675309"` | `str` | Path to the tokenizer to load. This is the same value for all the variants. See the Data section for more. |
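And the batched equivalent, leaving the default SqueezeProt-SP (Strict) paths implicit (the sequences are again placeholders):

```python
import squeezebert

# Batched embedding with the default SqueezeProt-SP (Strict) weights;
# pass weights_path/tokenizer_path as in encode() to select another variant.
embeddings = squeezebert.encode_batch(
    ["MKWVTFISLLLLFSSAYS", "MKTAYIAKQRQISFVKSH"],  # placeholder sequences
    device="cpu",
)
assert len(embeddings) == 2
```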
## Requirements

The requirements for this module can be installed from the `requirements.txt` file in the `experiments/encode_llm` folder (e.g., `pip install -r requirements.txt`).