kw_bench
This folder contains the scripts required to train the SqueezeProt UniProt keyword annotation models found in the manuscript.
make_dataset.py
A pre-computed dataset is already present in the data folder, you can also re-compute it if you so desire by using the make_dataset.py CLI tool in the scripts folder.
Info
Before re-computing the dataset for the kw_bench experiment, one must either run the embeddings_dbs experiment on all SWISS-PROT proteins or use the pre-computed vectors found in the data/embeddings folder (the default).
python make_dataset.py <flags>
Arguments
Flags
| Short Flag | Long Flag | Default | Description |
|---|---|---|---|
| --kw_path | "../../../data/kw/kw.json.gz" | Path to the UniProt keyword annotation data. | |
| --kw_train_path | "../../../data/kw/kw_train.csv.gz" | Where to save the generated training annotation dataset. | |
| --kw_test_path | "../../../data/kw/kw_test.csv.gz" | Where to save the generated testing annotation dataset. | |
| --kw_test_vecs_path | "../../../data/kw/kw_test_vecs.csv.gz" | Where to store the pre-computed vectors for the testing proteins. | |
| --kw_train_vecs_path | "../../../data/kw/kw_train_vecs.csv.gz" | Where to store the pre-computed vectors for the testing proteins. | |
| --kw_features_path | "../../../data/kw/subloc_kw_feature.csv.gz" | Path to a CSV file with annotation data. We have prpovided this file, which was derived from the UniProt source. | |
| -e | --eval_seqs_path | "../../../data/sequences/uniprot_sprot.strict.eval.fasta.gz" | Path to protein sequences found in the evaluation set. |
| -s | --strict_database_path | "../../../data/embeddings/squeezeprot-sp.strict.lmdb" | Path to the LMDB database with SqueezeProt-SP (Strict) embeddings for all UniProt proteins. |
| -n | --nonstrict_database_path | "../../../data/embeddings/squeezeprot-sp.nonstrict.lmdb" | Path to the LMDB database with SqueezeProt-SP (Strict) embeddings for all UniProt proteins. |
train.py
To train the UniProt keyword model from the manuscript, a CLI tool (train.py) was created. You can find it in the scripts folder. Here's how to use it:
python train.py <flags>
Arguments
Flags
| Short Flag | Long Flag | Default | Description |
|---|---|---|---|
| -n | --nl_type | "mish" | Activation function to use. Must be one of "mish", "relu6", or "leaky_relu". |
| --loss_type | "asl" | What loss function to use. Must be one of "asl" for an Assymetric Loss, or "bce" for Binary Cross-Entropy loss. | |
| --label_weighting | "log" | What label weighting function to use. Must be one of "log" for log-scaled weighting, "linear" for linearly-scaled weighting, "root" for weighting scaled by its square root, or "None" for no label weighting. | |
| --train_path | "../../../data/kw/kw_train_vecs.csv.gz" | Location of the precomputed train sequence vectors. | |
| --test_path | "../../../data/kw/kw_test_vecs.csv.gz" | Location of the precomputed test sequence vectors. | |
| -m | --meta_path | "../../../data/kw/kw_meta.json" | Protein keyword dataset. |
| -h | --hparams_dir | "../../../data/kw/hparams/" | Directory where hyper-parameters are stored. |
| -c | --chkpts_dir | "../../../chkpts/kw/" | Directory where checkpoints are stored. |
| --logs_dir | "../../../chkpts/kw/logs/" | Directory where logs are stored. |
test.py
To infer on the testing set, you can run test.py like so:
python test.py
Requirements
One can install the requirements for this experiment using the requirements.txt file in the experiments/kw_bench folder.