Skip to content

kw_bench

This folder contains the scripts required to train the SqueezeProt UniProt keyword annotation models found in the manuscript.

make_dataset.py

A pre-computed dataset is already present in the data folder, you can also re-compute it if you so desire by using the make_dataset.py CLI tool in the scripts folder.

Info

Before re-computing the dataset for the kw_bench experiment, one must either run the embeddings_dbs experiment on all SWISS-PROT proteins or use the pre-computed vectors found in the data/embeddings folder (the default).

python make_dataset.py <flags>

Arguments

Flags

Short Flag Long Flag Default Description
--kw_path "../../../data/kw/kw.json.gz" Path to the UniProt keyword annotation data.
--kw_train_path "../../../data/kw/kw_train.csv.gz" Where to save the generated training annotation dataset.
--kw_test_path "../../../data/kw/kw_test.csv.gz" Where to save the generated testing annotation dataset.
--kw_test_vecs_path "../../../data/kw/kw_test_vecs.csv.gz" Where to store the pre-computed vectors for the testing proteins.
--kw_train_vecs_path "../../../data/kw/kw_train_vecs.csv.gz" Where to store the pre-computed vectors for the testing proteins.
--kw_features_path "../../../data/kw/subloc_kw_feature.csv.gz" Path to a CSV file with annotation data. We have prpovided this file, which was derived from the UniProt source.
-e --eval_seqs_path "../../../data/sequences/uniprot_sprot.strict.eval.fasta.gz" Path to protein sequences found in the evaluation set.
-s --strict_database_path "../../../data/embeddings/squeezeprot-sp.strict.lmdb" Path to the LMDB database with SqueezeProt-SP (Strict) embeddings for all UniProt proteins.
-n --nonstrict_database_path "../../../data/embeddings/squeezeprot-sp.nonstrict.lmdb" Path to the LMDB database with SqueezeProt-SP (Strict) embeddings for all UniProt proteins.

train.py

To train the UniProt keyword model from the manuscript, a CLI tool (train.py) was created. You can find it in the scripts folder. Here's how to use it:

python train.py <flags>

Arguments

Flags

Short Flag Long Flag Default Description
-n --nl_type "mish" Activation function to use. Must be one of "mish", "relu6", or "leaky_relu".
--loss_type "asl" What loss function to use. Must be one of "asl" for an Assymetric Loss, or "bce" for Binary Cross-Entropy loss.
--label_weighting "log" What label weighting function to use. Must be one of "log" for log-scaled weighting, "linear" for linearly-scaled weighting, "root" for weighting scaled by its square root, or "None" for no label weighting.
--train_path "../../../data/kw/kw_train_vecs.csv.gz" Location of the precomputed train sequence vectors.
--test_path "../../../data/kw/kw_test_vecs.csv.gz" Location of the precomputed test sequence vectors.
-m --meta_path "../../../data/kw/kw_meta.json" Protein keyword dataset.
-h --hparams_dir "../../../data/kw/hparams/" Directory where hyper-parameters are stored.
-c --chkpts_dir "../../../chkpts/kw/" Directory where checkpoints are stored.
--logs_dir "../../../chkpts/kw/logs/" Directory where logs are stored.

test.py

To infer on the testing set, you can run test.py like so:

python test.py

Requirements

One can install the requirements for this experiment using the requirements.txt file in the experiments/kw_bench folder.