Usage
Figures are generated by calling the following sub-commands:
| Sub-command | Description |
|---|---|
| get_scores | Infer and score pLM-based PPI models. Pre-requisite for some commands. |
| speed | Make figure 1a, which compares the speed and PPI classification performance of SoTA pLMs and SqueezeProt. |
| strict_nonstrict | Make figure 1c-d, comparing PPI classification performance of both strict and non-strict variants of SqueezeProt-SP. |
| kw | Make figure 1e-f, which reports the performance of strict and non-strict variants of SqueezeProt-SP on a UniProt Keyword annotation task. |
| concordance | Make figure 1g-h, which reports the concordance between pLM methods and SqueezeProt-SP variants. |
| length_histogram | Make figure 2a, which shows the distribution of protein lengths. |
| length_heatmap | Make figure 2b, which shows the performance of pLM-based PPI models as a function of the length of the proteins in the pair. |
| acc_by_length | Make figure 2c, which shows the performance of pLM-based PPI models as a function of the longest protein in the pair. |
| sars_cov2 | Make figure 2e, which reports AUROC curves of pLM-based PPI methods tested on both Human PPIs and Human-SARS-CoV-2 PPIs. |
| mutation | Make figure 2f, which reports the change in binding affinity as a function of the change in interaction prediction in mutated proteins. |
You can see how to regenerate all panels from scratch using the diagrams below. Blue squares are experiments that need to be run, red squares are autofigures commands, and yellow circles indicate commands that generate the figures. The order of the squares also indicates the order in which the steps should be run.
Output and Data Folders
All sub-commands share two flags: output_folder and data_folder. The former is where the results of the sub-commands are saved, and the latter is where the data used to run the sub-commands is kept. When left unspecified (the flag default of None), they resolve to "out" and "data" in the root directory.
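The fallback described above can be pictured with a short sketch. The helper name `resolve_folders` and its `root` parameter are hypothetical and are not taken from the autofigures source; the sketch only assumes that an unset flag arrives as `None`.

```python
from pathlib import Path
from typing import Optional, Tuple, Union

PathLike = Union[str, Path]


def resolve_folders(
    output_folder: Optional[PathLike] = None,
    data_folder: Optional[PathLike] = None,
    root: PathLike = ".",
) -> Tuple[Path, Path]:
    """Return (output, data) folders, falling back to root/out and root/data.

    Hypothetical helper: illustrates the documented defaults only.
    """
    root = Path(root)
    out = Path(output_folder) if output_folder is not None else root / "out"
    data = Path(data_folder) if data_folder is not None else root / "data"
    return out, data
```

Calling `resolve_folders(output_folder="results")` would keep the default data folder while redirecting only the output.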
get_scores
NAME
autofigures get_scores
SYNOPSIS
autofigures get_scores <flags>
FLAGS
-o, --output_folder=OUTPUT_FOLDER
Type: Optional[Union]
Default: None
-d, --data_folder=DATA_FOLDER
Type: Optional[Union]
Default: None
Description
The get_scores command:
- Loads already-trained pLM-based PPI models.
- Infers interaction probabilities of the C3 PPI testing dataset using the loaded PPI models.
- Scores the inferred probabilities using MCC and AUROC.
This command does not generate a figure, but running it is a prerequisite for other commands.
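For readers unfamiliar with the two metrics, here is a minimal pure-Python sketch of how predicted interaction probabilities can be scored. It is not the code autofigures runs; the function names and the 0.5 decision threshold used for MCC are assumptions made for illustration.

```python
import math
from typing import Sequence


def mcc(y_true: Sequence[int], y_prob: Sequence[float], threshold: float = 0.5) -> float:
    """Matthews correlation coefficient after thresholding probabilities.

    The 0.5 threshold is an assumption, not a documented autofigures setting.
    """
    tp = fp = tn = fn = 0
    for label, prob in zip(y_true, y_prob):
        pred = 1 if prob >= threshold else 0
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auroc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """AUROC as the probability that a positive pair outranks a negative pair."""
    pos = [p for label, p in zip(y_true, y_prob) if label == 1]
    neg = [p for label, p in zip(y_true, y_prob) if label == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUROC is quadratic in the number of examples, which is fine for illustration but slow on a full test set; a library implementation would be the practical choice there.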
The checkpoints for the pLM-based PPI models are stored in the data folder (data/chkpts), and were generated by the ppi_bench experiment.
get_scores also needs pre-computed embeddings for ESM, ProtBERT, ProtT5, ProSE, ProteinBERT, SqueezeProt-SP (Strict), SqueezeProt-SP (Non-strict), and SqueezeProt-U50.
The data used for this is from our previous manuscript "INTREPPPID—an orthologue-informed quintuplet network for cross-species prediction of protein–protein interaction". They are pairs of Human interactions from STRING that conform to Park & Marcotte's C3 criteria. We use the 8675309 seed for this study (extra points if you know where the seed is from).
Flags
| Long Flag | Short Flag | Default | Description |
|---|---|---|---|
| --output_folder | -o | None | Location of the output folder. |
| --data_folder | -d | None | Location of the data folders. |
speed
NAME
autofigures speed
SYNOPSIS
autofigures speed <flags>
FLAGS
-o, --output_folder=OUTPUT_FOLDER
Type: Optional[Union]
Default: None
-d, --data_folder=DATA_FOLDER
Type: Optional[Union]
Default: None
-y, --y_metric=Y_METRIC
Type: str
Default: 'mcc'
Description
Make figure 1a, which compares the speed and PPI classification performance of SoTA pLMs and SqueezeProt.
This command will save the plot to disk as figures/speed.svg in the output directory.
Flags
| Long Flag | Short Flag | Default | Description |
|---|---|---|---|
| --output_folder | -o | None | Location of the output folder. |
| --data_folder | -d | None | Location of the data folders. |
| --y_metric | -y | "mcc" | What PPI classification metric should be reported on the y-axis. |
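When scripting several figure runs, it can help to assemble the command line programmatically. The sketch below assumes the tool is reachable as `autofigures` on the PATH and that flags are passed in `--name=value` form; `build_command` is a hypothetical helper, not part of the package.

```python
from typing import List


def build_command(sub_command: str, **flags: object) -> List[str]:
    """Build an argv list such as ['autofigures', 'speed', '--y_metric=mcc'].

    Hypothetical helper: the executable name 'autofigures' is an assumption.
    """
    argv = ["autofigures", sub_command]
    for name, value in flags.items():
        # Keyword order is preserved, so flags appear as they were given.
        argv.append(f"--{name}={value}")
    return argv


speed_argv = build_command(
    "speed", output_folder="out", data_folder="data", y_metric="mcc"
)
```

The resulting list can be handed to `subprocess.run(speed_argv, check=True)` once the required data is in place.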
strict_nonstrict
NAME
autofigures strict_nonstrict
SYNOPSIS
autofigures strict_nonstrict <flags>
FLAGS
-o, --output_folder=OUTPUT_FOLDER
Type: Optional[Union]
Default: None
-d, --data_folder=DATA_FOLDER
Type: Optional[Union]
Default: None
-a, --add_markers=ADD_MARKERS
Type: bool
Default: True
-f, --first_metric=FIRST_METRIC
Type: str
Default: 'auroc'
-s, --second_metric=SECOND_METRIC
Type: str
Default: 'mcc'
Description
The strict_nonstrict command creates Fig. 1c-d from the manuscript, which compares the performance of strict and non-strict variants of SqueezeProt-SP.
Once it has run, it will save the plot as figures/strict_nonstrict.svg in the output directory.
Flags
| Long Flag | Short Flag | Default | Description |
|---|---|---|---|
| --output_folder | -o | None | Location of the output folder. |
| --data_folder | -d | None | Location of the data folders. |
| --add_markers | -a | True | Whether to add markers to the plot indicating replicate performances. Otherwise, whiskers are plotted to show standard deviation. |
| --first_metric | -f | "auroc" | Metric used to measure performance in the left panel. Must be one of "mcc", "auroc", "ap", or "f1". |
| --second_metric | -s | "mcc" | Metric used to measure performance in the right panel. Must be one of "mcc", "auroc", "ap", or "f1". |
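Because both metric flags accept only four values, a wrapper script may want to check them before launching a run. The following is a minimal sketch; `validate_metric` is a hypothetical helper and does not describe how autofigures itself reports invalid metrics.

```python
ALLOWED_METRICS = ("mcc", "auroc", "ap", "f1")


def validate_metric(name: str) -> str:
    """Return the metric name unchanged if it is one of the documented values."""
    if name not in ALLOWED_METRICS:
        allowed = ", ".join(ALLOWED_METRICS)
        raise ValueError(f"Unknown metric {name!r}; expected one of: {allowed}")
    return name
```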
kw
NAME
autofigures kw
SYNOPSIS
autofigures kw <flags>
FLAGS
-o, --output_folder=OUTPUT_FOLDER
Type: Optional[Union]
Default: None
-d, --data_folder=DATA_FOLDER
Type: Optional[Union]
Default: None
-m, --metric_type=METRIC_TYPE
Type: str
Default: 'ndcg'
Description
Make figure 1e-f, which reports the performance of strict and non-strict variants of SqueezeProt-SP on a UniProt Keyword annotation task.
Once it has run, it will save the plot as figures/kw.svg in the output directory.
Flags
| Long Flag | Short Flag | Default | Description |
|---|---|---|---|
| --output_folder | -o | None | Location of the output folder. |
| --data_folder | -d | None | Location of the data folders. |
| --metric_type | -m | "ndcg" | Metric to report. Default is normalized discounted cumulative gain. |
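As a reminder of what the default metric measures, here is a minimal sketch of normalized discounted cumulative gain for a single ranked list. It uses the common log2 discount; how autofigures builds relevance lists for the keyword task is not documented here, so the input format (one relevance value per ranked keyword, best-ranked first) is an assumption.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount, ranks from 1."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances: Sequence[float]) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    # With no relevant items there is nothing to rank, so report 0.
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A ranking that places every relevant keyword first scores 1.0, and the score drops as relevant keywords slip down the list.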
concordance
NAME
autofigures concordance
SYNOPSIS
autofigures concordance <flags>
FLAGS
-o, --output_folder=OUTPUT_FOLDER
Type: Optional[Union]
Default: None
-d, --data_folder=DATA_FOLDER
Type: Optional[Union]
Default: None
-c, --cohen_kappa=COHEN_KAPPA
Type: bool
Default: False
Description
Make figure 1g-h, which reports the concordance between pLM methods and SqueezeProt-SP variants.
Once it has run, it will save the plot as figures/concordance.svg in the output directory.
Flags
| Long Flag | Short Flag | Default | Description |
|---|---|---|---|
| --output_folder | -o | None | Location of the output folder. |
| --data_folder | -d | None | Location of the data folders. |
| --cohen_kappa | -c | False | Display Cohen's Kappa. If False, shows skill-normalized concordance. |
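For context on the --cohen_kappa option, the sketch below computes Cohen's Kappa between the predicted labels of two methods. It follows the standard definition of the statistic and is not taken from the autofigures source; the function name and the list-of-labels input are assumptions.

```python
from typing import List


def cohen_kappa(labels_a: List[int], labels_b: List[int]) -> float:
    """Cohen's Kappa: agreement between two label lists, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both methods.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each method labelled independently at its own rates.
    classes = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in classes
    )
    if expected == 1.0:
        # Both methods always emit the same single label; agreement is total.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A value of 1 means the two methods always agree, while 0 means they agree no more often than chance would predict.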
length_histogram
NAME
autofigures length_histogram
SYNOPSIS
autofigures length_histogram <flags>
FLAGS
-o, --output_folder=OUTPUT_FOLDER
Type: Optional[Union]
Default: None
-d, --data_folder=DATA_FOLDER
Type: Optional[Union]
Default: None
Description
Make figure 2a, which shows the distribution of protein lengths.
Once it has run, it will save the plot as figures/length_histogram.svg in the output directory.
Flags
| Long Flag | Short Flag | Description |
|---|---|---|
| --output_folder | -o | Location of the output folder. |
| --data_folder | -d | Location of the data folders. |
length_heatmap
NAME
autofigures length_heatmap
SYNOPSIS
autofigures length_heatmap <flags>
FLAGS
-o, --output_folder=OUTPUT_FOLDER
Type: Optional[Union]
Default: None
-d, --data_folder=DATA_FOLDER
Type: Optional[Union]
Default: None
Description
Make figure 2b, which shows the performance of pLM-based PPI models as a function of the length of the proteins in the pair.
Once it has run, it will save the following plots in the output directory:
- figures/length_heatmap_dscript.svg
- figures/length_heatmap_esm.svg
- figures/length_heatmap_intrepppid.svg
- figures/length_heatmap_pipr.svg
- figures/length_heatmap_prose.svg
- figures/length_heatmap_proteinbert.svg
- figures/length_heatmap_prottrans_bert.svg
- figures/length_heatmap_prottrans_t5.svg
- figures/length_heatmap_rapppid.svg
- figures/length_heatmap_richoux.svg
- figures/length_heatmap_sizes.svg
- figures/length_heatmap_squeezeprot_sp_nonstrict.svg
- figures/length_heatmap_squeezeprot_sp_strict.svg
- figures/length_heatmap_squeezeprot_u50.svg
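Since this command writes many files, a quick check that all of them were produced can be useful after a run. The sketch below rebuilds the file names listed above and reports any that are absent; the helper names are hypothetical and not part of autofigures.

```python
from pathlib import Path
from typing import List, Union

# Suffixes taken from the list of plots documented for length_heatmap.
HEATMAP_NAMES = [
    "dscript", "esm", "intrepppid", "pipr", "prose", "proteinbert",
    "prottrans_bert", "prottrans_t5", "rapppid", "richoux", "sizes",
    "squeezeprot_sp_nonstrict", "squeezeprot_sp_strict", "squeezeprot_u50",
]


def expected_heatmaps(output_folder: Union[str, Path]) -> List[Path]:
    """Paths the length_heatmap command is documented to write."""
    figures = Path(output_folder) / "figures"
    return [figures / f"length_heatmap_{name}.svg" for name in HEATMAP_NAMES]


def missing_heatmaps(output_folder: Union[str, Path]) -> List[Path]:
    """Subset of the expected paths that do not exist on disk."""
    return [path for path in expected_heatmaps(output_folder) if not path.exists()]
```

An empty list from `missing_heatmaps("out")` would suggest that every documented plot was written.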
Flags
| Long Flag | Short Flag | Description |
|---|---|---|
| --output_folder | -o | Location of the output folder. |
| --data_folder | -d | Location of the data folders. |
acc_by_length
NAME
autofigures acc_by_length
SYNOPSIS
autofigures acc_by_length <flags>
FLAGS
-o, --output_folder=OUTPUT_FOLDER
Type: Optional[Union]
Default: None
-d, --data_folder=DATA_FOLDER
Type: Optional[Union]
Default: None
-r, --random_window=RANDOM_WINDOW
Type: bool
Default: False
Description
Make figure 2c, which shows the performance of pLM-based PPI models as a function of the longest protein in the pair.
Once it has run, it will save the plot as figures/acc_by_length.svg in the output directory.
Flags
| Long Flag | Short Flag | Description |
|---|---|---|
| --output_folder | -o | Location of the output folder. |
| --data_folder | -d | Location of the data folders. |
| --random_window | -r | Use random window PPI models. |
sars_cov2
NAME
autofigures sars_cov2
SYNOPSIS
autofigures sars_cov2 <flags>
FLAGS
-o, --output_folder=OUTPUT_FOLDER
Type: Optional[Union]
Default: None
-d, --data_folder=DATA_FOLDER
Type: Optional[Union]
Default: None
Description
Make figure 2e, which reports AUROC curves of pLM-based PPI methods tested on both Human PPIs and Human-SARS-CoV-2 PPIs.
Once it has run, it will save the following plots in the output directory:
- figures/sars_cov2_breakdown.svg
- figures/sars_cov2_roc_highlight.svg
- figures/sars_cov2_roc.svg
Flags
| Long Flag | Short Flag | Description |
|---|---|---|
| --output_folder | -o | Location of the output folder. |
| --data_folder | -d | Location of the data folders. |
mutation
NAME
autofigures mutation
SYNOPSIS
autofigures mutation <flags>
FLAGS
-o, --output_folder=OUTPUT_FOLDER
Type: Optional[Union]
Default: None
-d, --data_folder=DATA_FOLDER
Type: Optional[Union]
Default: None
Description
Make figure 2f, which reports the change in binding affinity as a function of the change in interaction prediction in mutated proteins.
Once it has run, it will save the following plots in the output directory:
- figures/mutation_esm.png
- figures/mutation_prose.png
- figures/mutation_proteinbert.png
- figures/mutation_prottrans_bert.png
- figures/mutation_prottrans_t5.png
- figures/mutation_squeezeprot_sp_nonstrict.png
- figures/mutation_squeezeprot_sp_strict.png
- figures/mutation_squeezeprot_u50.png
Flags
| Long Flag | Short Flag | Description |
|---|---|---|
| --output_folder | -o | Location of the output folder. |
| --data_folder | -d | Location of the data folders. |