Usage

Figures are generated by calling sub-commands. These sub-commands are:

| Sub-command | Description |
| --- | --- |
| get_scores | Infer and score pLM-based PPI models. Pre-requisite for some commands. |
| speed | Make figure 1a, which compares the speed and PPI classification performance of SoTA pLMs and SqueezeProt. |
| strict_nonstrict | Make figure 1c-d, comparing PPI classification performance of both strict and non-strict variants of SqueezeProt-SP. |
| kw | Make figure 1e-f, which reports the performance of strict and non-strict variants of SqueezeProt-SP on a UniProt Keyword annotation task. |
| concordance | Make figure 1g-h, which reports the concordance between pLM methods and SqueezeProt-SP variants. |
| length_histogram | Make figure 2a, which shows the distribution of protein lengths. |
| length_heatmap | Make figure 2b, which shows the performance of pLM-based PPI models as a function of the length of the proteins in the pair. |
| acc_by_length | Make figure 2c, which shows the performance of pLM-based PPI models as a function of the longest protein in the pair. |
| sars_cov2 | Make figure 2e, which reports AUROC curves of pLM-based PPI methods tested on both Human PPIs and Human-SARS-CoV-2 PPIs. |
| mutation | Make figure 2f, which reports the change in binding affinity as a function of the change in interaction prediction in mutated proteins. |

You can see how to regenerate all panels from scratch using the diagrams below. Blue squares are experiments that need to be run, red squares are autofigures commands, and yellow circles indicate commands that generate the figures. The order of the squares also indicates the order in which the steps should be run.
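
A minimal sketch of scripting such a run is shown here. It places get_scores first, since it is listed above as a pre-requisite for some commands; which figure commands actually depend on it is not stated, so running it first is simply the cautious choice. The helper names plan_commands and run_all are hypothetical and not part of the package, and the sketch assumes the autofigures executable is on the PATH.

```python
import subprocess


def plan_commands(figure_commands, output_folder="out", data_folder="data"):
    """Return one argv list per sub-command, with get_scores always first.

    Hypothetical helper: the ordering is an assumption based on get_scores
    being described as a pre-requisite for some commands.
    """
    shared = [f"--output_folder={output_folder}", f"--data_folder={data_folder}"]
    ordered = ["get_scores"] + [c for c in figure_commands if c != "get_scores"]
    return [["autofigures", name] + shared for name in ordered]


def run_all(figure_commands):
    """Run each planned command in order, stopping at the first failure."""
    for argv in plan_commands(figure_commands):
        subprocess.run(argv, check=True)
```

Calling plan_commands(["speed", "kw"]) only builds the command lines, which makes it easy to inspect them before handing them to run_all.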

autofigures 1

autofigures 2

autofigures 3

Output and Data Folders

Common among these sub-commands are two flags: output_folder and data_folder. The former is where the results of the sub-commands are saved, and the latter is where the data used to run the sub-commands is kept. These default to "out" and "data" in the root directory if not specified.
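
The fallback behaviour described above can be sketched as follows. This is an illustration only: resolve_folders is a hypothetical name, and treating the current working directory as the "root directory" is an assumption, since the package may resolve the root differently.

```python
from pathlib import Path


def resolve_folders(output_folder=None, data_folder=None, root=None):
    """Fall back to <root>/out and <root>/data when a folder is not given.

    Assumption: root defaults to the current working directory.
    """
    root = Path(root) if root is not None else Path.cwd()
    out = Path(output_folder) if output_folder is not None else root / "out"
    data = Path(data_folder) if data_folder is not None else root / "data"
    return out, data
```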

get_scores

NAME
    autofigures get_scores

SYNOPSIS
    autofigures get_scores <flags>

FLAGS
    -o, --output_folder=OUTPUT_FOLDER
        Type: Optional[Union]
        Default: None
    -d, --data_folder=DATA_FOLDER
        Type: Optional[Union]
        Default: None

Description

The get_scores command:

  1. Loads already-trained pLM-based PPI models.
  2. Infers interaction probabilities for the C3 PPI testing dataset using the loaded PPI models.
  3. Scores the inferred probabilities using MCC and AUROC.
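
For readers unfamiliar with the two metrics, the scoring step can be sketched in plain Python. This is not the code get_scores runs; the function names are hypothetical, and the 0.5 decision threshold used for MCC is an assumption, as the threshold is not stated here.

```python
import math


def mcc(labels, probs, threshold=0.5):
    """Matthews correlation coefficient after thresholding probabilities.

    Assumption: probabilities >= threshold count as predicted interactions.
    """
    tp = fp = tn = fn = 0
    for y, p in zip(labels, probs):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def auroc(labels, probs):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUROC is quadratic in the number of examples, which is fine for illustration but slow on a full testing set.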

This command does not generate a figure, but running it is a pre-requisite for other commands.

The checkpoints for the pLM-based PPI models are stored in the data folder (data/chkpts), and were generated by the ppi_bench experiment.

get_scores also needs pre-computed embeddings for ESM, ProtBERT, ProtT5, ProSE, ProteinBERT, SqueezeProt-SP (Strict), SqueezeProt-SP (Non-strict), and SqueezeProt-U50.

The data used for this is from our previous manuscript "INTREPPPID—an orthologue-informed quintuplet network for cross-species prediction of protein–protein interaction". They are pairs of Human interactions from STRING that conform to Park & Marcotte's C3 criteria. We use the 8675309 seed for this study (extra points if you know where the seed is from).

Flags

| Long Flag | Short Flag | Default | Description |
| --- | --- | --- | --- |
| --output_folder | -o | None | Location of the output folder. |
| --data_folder | -d | None | Location of the data folders. |

speed

NAME
    autofigures speed

SYNOPSIS
    autofigures speed <flags>

FLAGS
    -o, --output_folder=OUTPUT_FOLDER
        Type: Optional[Union]
        Default: None
    -d, --data_folder=DATA_FOLDER
        Type: Optional[Union]
        Default: None
    -y, --y_metric=Y_METRIC
        Type: str
        Default: 'mcc'

Description

Make figure 1a, which compares the speed and PPI classification performance of SoTA pLMs and SqueezeProt.

This command will save the plot to disk as figures/speed.svg in the output directory.

Flags

| Long Flag | Short Flag | Default | Description |
| --- | --- | --- | --- |
| --output_folder | -o | None | Location of the output folder. |
| --data_folder | -d | None | Location of the data folders. |
| --y_metric | -y | "mcc" | Which PPI classification metric to report on the y-axis. |

strict_nonstrict

NAME
    autofigures strict_nonstrict

SYNOPSIS
    autofigures strict_nonstrict <flags>

FLAGS
    -o, --output_folder=OUTPUT_FOLDER
        Type: Optional[Union]
        Default: None
    -d, --data_folder=DATA_FOLDER
        Type: Optional[Union]
        Default: None
    -a, --add_markers=ADD_MARKERS
        Type: bool
        Default: True
    -f, --first_metric=FIRST_METRIC
        Type: str
        Default: 'auroc'
    -s, --second_metric=SECOND_METRIC
        Type: str
        Default: 'mcc'

Description

The strict_nonstrict command creates Fig. 1c-d from the manuscript, which compares the performance of strict and non-strict variants of SqueezeProt-SP.

Once it has run, it will save the plot as figures/strict_nonstrict.svg in the output directory.

Flags

| Long Flag | Short Flag | Default | Description |
| --- | --- | --- | --- |
| --output_folder | -o | None | Location of the output folder. |
| --data_folder | -d | None | Location of the data folders. |
| --add_markers | -a | True | Whether to add markers to the plot indicating replicate performances. If False, whiskers showing the standard deviation are plotted instead. |
| --first_metric | -f | "auroc" | Metric used to measure performance in the left panel. Must be one of "mcc", "auroc", "ap", or "f1". |
| --second_metric | -s | "mcc" | Metric used to measure performance in the right panel. Must be one of "mcc", "auroc", "ap", or "f1". |
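
When scripting this command, it can help to reject an unsupported metric name before launching it. The sketch below does that for the four values listed for first_metric and second_metric; check_metric and strict_nonstrict_args are hypothetical helpers, not part of autofigures, and how the command itself reacts to an invalid metric is not documented here.

```python
ALLOWED_METRICS = ("mcc", "auroc", "ap", "f1")


def check_metric(name):
    """Return name unchanged if it is an allowed metric, else raise ValueError."""
    if name not in ALLOWED_METRICS:
        raise ValueError(f"metric must be one of {ALLOWED_METRICS}, got {name!r}")
    return name


def strict_nonstrict_args(first_metric="auroc", second_metric="mcc"):
    """Build the argv for the command after validating both metric names."""
    return [
        "autofigures",
        "strict_nonstrict",
        f"--first_metric={check_metric(first_metric)}",
        f"--second_metric={check_metric(second_metric)}",
    ]
```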

kw

NAME
    autofigures kw

SYNOPSIS
    autofigures kw <flags>

FLAGS
    -o, --output_folder=OUTPUT_FOLDER
        Type: Optional[Union]
        Default: None
    -d, --data_folder=DATA_FOLDER
        Type: Optional[Union]
        Default: None
    -m, --metric_type=METRIC_TYPE
        Type: str
        Default: 'ndcg'

Description

Make figure 1e-f, which reports the performance of strict and non-strict variants of SqueezeProt-SP on a UniProt Keyword annotation task.

Once it has run, it will save the plot as figures/kw.svg in the output directory.

Flags

| Long Flag | Short Flag | Default | Description |
| --- | --- | --- | --- |
| --output_folder | -o | None | Location of the output folder. |
| --data_folder | -d | None | Location of the data folders. |
| --metric_type | -m | "ndcg" | Metric to report. Default is normalized discounted cumulative gain. |
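
As a reminder of what the default metric measures, here is a minimal sketch of normalized discounted cumulative gain (NDCG) using the common linear-gain form. How the kw command builds its ranked keyword lists is not described here, so the input format (relevance values listed in predicted rank order) and the function names are assumptions.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each relevance is divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return 0.0 if ideal == 0 else dcg(relevances) / ideal
```

A ranking that places every relevant item first scores 1.0, and the score drops as relevant items move down the list.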

concordance

NAME
    autofigures concordance

SYNOPSIS
    autofigures concordance <flags>

FLAGS
    -o, --output_folder=OUTPUT_FOLDER
        Type: Optional[Union]
        Default: None
    -d, --data_folder=DATA_FOLDER
        Type: Optional[Union]
        Default: None
    -c, --cohen_kappa=COHEN_KAPPA
        Type: bool
        Default: False

Description

Make figure 1g-h, which reports the concordance between pLM methods and SqueezeProt-SP variants.

Once it has run, it will save the plot as figures/concordance.svg in the output directory.

Flags

| Long Flag | Short Flag | Default | Description |
| --- | --- | --- | --- |
| --output_folder | -o | None | Location of the output folder. |
| --data_folder | -d | None | Location of the data folders. |
| --cohen_kappa | -c | False | Display Cohen's Kappa. If False, shows skill-normalized concordance. |
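
For reference, Cohen's Kappa between the predictions of two methods can be sketched as below. This is the textbook definition, not the code behind the concordance command, and the function name is hypothetical. Skill-normalized concordance is not defined here, so it is not sketched.

```python
def cohen_kappa(preds_a, preds_b):
    """Agreement between two label sequences, corrected for chance agreement."""
    a, b = list(preds_a), list(preds_b)
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("inputs must be non-empty and of equal length")
    # Observed agreement: fraction of examples on which both methods agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if the two methods labelled examples independently.
    expected = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in set(a) | set(b))
    if expected == 1.0:
        # Both methods always emit the same single label; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)
```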

length_histogram

NAME
    autofigures length_histogram

SYNOPSIS
    autofigures length_histogram <flags>

FLAGS
    -o, --output_folder=OUTPUT_FOLDER
        Type: Optional[Union]
        Default: None
    -d, --data_folder=DATA_FOLDER
        Type: Optional[Union]
        Default: None

Description

Make figure 2a, which shows the distribution of protein lengths.

Once it has run, it will save the plot as figures/length_histogram.svg in the output directory.

Flags

| Long Flag | Short Flag | Description |
| --- | --- | --- |
| --output_folder | -o | Location of the output folder. |
| --data_folder | -d | Location of the data folders. |

length_heatmap

NAME
    autofigures length_heatmap

SYNOPSIS
    autofigures length_heatmap <flags>

FLAGS
    -o, --output_folder=OUTPUT_FOLDER
        Type: Optional[Union]
        Default: None
    -d, --data_folder=DATA_FOLDER
        Type: Optional[Union]
        Default: None

Description

Make figure 2b, which shows the performance of pLM-based PPI models as a function of the length of the proteins in the pair.

Once it has run, it will save the following plots in the output directory:

  • figures/length_heatmap_dscript.svg
  • figures/length_heatmap_esm.svg
  • figures/length_heatmap_intrepppid.svg
  • figures/length_heatmap_pipr.svg
  • figures/length_heatmap_prose.svg
  • figures/length_heatmap_proteinbert.svg
  • figures/length_heatmap_prottrans_bert.svg
  • figures/length_heatmap_prottrans_t5.svg
  • figures/length_heatmap_rapppid.svg
  • figures/length_heatmap_richoux.svg
  • figures/length_heatmap_sizes.svg
  • figures/length_heatmap_squeezeprot_sp_nonstrict.svg
  • figures/length_heatmap_squeezeprot_sp_strict.svg
  • figures/length_heatmap_squeezeprot_u50.svg

Flags

| Long Flag | Short Flag | Description |
| --- | --- | --- |
| --output_folder | -o | Location of the output folder. |
| --data_folder | -d | Location of the data folders. |

acc_by_length

NAME
    autofigures acc_by_length

SYNOPSIS
    autofigures acc_by_length <flags>

FLAGS
    -o, --output_folder=OUTPUT_FOLDER
        Type: Optional[Union]
        Default: None
    -d, --data_folder=DATA_FOLDER
        Type: Optional[Union]
        Default: None
    -r, --random_window=RANDOM_WINDOW
        Type: bool
        Default: False

Description

Make figure 2c, which shows the performance of pLM-based PPI models as a function of the longest protein in the pair.

Once it has run, it will save the plot as figures/acc_by_length.svg in the output directory.
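
To make "performance as a function of the longest protein in the pair" concrete, the sketch below groups pairs by the length of their longer protein and reports accuracy per group. The bin width, the input tuple layout, and the function name are all assumptions chosen for illustration; the command's actual binning and metric are not described here.

```python
from collections import defaultdict


def accuracy_by_longest(pairs, bin_width=250):
    """Accuracy per length bin, keyed by the lower edge of each bin.

    pairs: iterable of (len_a, len_b, label, prediction) tuples.
    Each pair is binned by max(len_a, len_b); bin_width is an assumed value.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for len_a, len_b, label, prediction in pairs:
        bin_start = (max(len_a, len_b) // bin_width) * bin_width
        totals[bin_start] += 1
        hits[bin_start] += int(label == prediction)
    return {b: hits[b] / totals[b] for b in sorted(totals)}
```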

Flags

| Long Flag | Short Flag | Description |
| --- | --- | --- |
| --output_folder | -o | Location of the output folder. |
| --data_folder | -d | Location of the data folders. |
| --random_window | -r | Use random window PPI models. |

sars_cov2

NAME
    autofigures sars_cov2

SYNOPSIS
    autofigures sars_cov2 <flags>

FLAGS
    -o, --output_folder=OUTPUT_FOLDER
        Type: Optional[Union]
        Default: None
    -d, --data_folder=DATA_FOLDER
        Type: Optional[Union]
        Default: None

Description

Make figure 2e, which reports AUROC curves of pLM-based PPI methods tested on both Human PPIs and Human-SARS-CoV-2 PPIs.

Once it has run, it will save the following plots in the output directory:

  • figures/sars_cov2_breakdown.svg
  • figures/sars_cov2_roc_highlight.svg
  • figures/sars_cov2_roc.svg

Flags

| Long Flag | Short Flag | Description |
| --- | --- | --- |
| --output_folder | -o | Location of the output folder. |
| --data_folder | -d | Location of the data folders. |

mutation

NAME
    autofigures mutation

SYNOPSIS
    autofigures mutation <flags>

FLAGS
    -o, --output_folder=OUTPUT_FOLDER
        Type: Optional[Union]
        Default: None
    -d, --data_folder=DATA_FOLDER
        Type: Optional[Union]
        Default: None

Description

Make figure 2f, which reports the change in binding affinity as a function of the change in interaction prediction in mutated proteins.

Once it has run, it will save the following plots in the output directory:

  • figures/mutation_esm.png
  • figures/mutation_prose.png
  • figures/mutation_proteinbert.png
  • figures/mutation_prottrans_bert.png
  • figures/mutation_prottrans_t5.png
  • figures/mutation_squeezeprot_sp_nonstrict.png
  • figures/mutation_squeezeprot_sp_strict.png
  • figures/mutation_squeezeprot_u50.png

Flags

| Long Flag | Short Flag | Description |
| --- | --- | --- |
| --output_folder | -o | Location of the output folder. |
| --data_folder | -d | Location of the data folders. |