Process

The process module of PPI Origami allows you to transform files from their original formats and create new datasets.

You can find a description of all the available process commands by running:

ppi_origami process --help

Information specific to arguments of commands can be found by running the command with the help flag:

ppi_origami process COMMAND --help

This information is reproduced on this page.

class ppi_origami.__main__.Process
static common_format(processed_folder: Path, source: str, taxon: int | None, identifier: str, version: str) → None

Converts a raw source dataset into a “common” format that PPI Origami can transform further.

Currently, only the string_links_detailed and dscript datasets are supported.

The identifier argument specifies which identifier to use when referring to proteins in the file. Must be one of:

  1. upkb – The UniProtKB accession

  2. uniref50 – The UniRef50 identifier

  3. uniref90 – The UniRef90 identifier

  4. uniref100 – The UniRef100 identifier

Before running this command, you must have added a column with the selected identifier using the appropriate PPI Origami command or function.

For instance, if you specify upkb as the identifier and string_links_detailed as the source, you must first have run the PPI Origami command string_upkb.

You can call this function from the CLI using:

ppi_origami process common_format PROCESSED_FOLDER SOURCE TAXON IDENTIFIER VERSION
Parameters:
  • processed_folder (pathlib.Path) – The path to the processed folder.

  • source (str) – The name of the source. Must be one of string_links_detailed or dscript.

  • taxon (Optional[int]) – The NCBI Taxon number of the database to be converted. Set to None if there is no specific organism.

  • identifier (str) – The identifier to use to refer to proteins. Must be one of upkb, uniref50, uniref90, or uniref100.

  • version (str) – The version of the source to use.

Returns:

None
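The constraints on source and identifier described above can be checked before the command is launched. The sketch below is a minimal illustration, not part of PPI Origami: common_format_argv is a hypothetical helper that only builds the argument list for the CLI call shown above, and the folder name data/processed is an assumption. Passing the literal None for a missing taxon mirrors how None appears in the CLI examples on this page.

```python
from pathlib import Path
from typing import List, Optional

# Allowed values, copied from the documentation of common_format.
SOURCES = {"string_links_detailed", "dscript"}
IDENTIFIERS = {"upkb", "uniref50", "uniref90", "uniref100"}


def common_format_argv(
    processed_folder: Path,
    source: str,
    taxon: Optional[int],
    identifier: str,
    version: str,
) -> List[str]:
    """Build (but do not run) the argument list for `process common_format`.

    Hypothetical helper for illustration; it is not provided by PPI Origami.
    """
    if source not in SOURCES:
        raise ValueError(f"source must be one of {sorted(SOURCES)}, got {source!r}")
    if identifier not in IDENTIFIERS:
        raise ValueError(f"identifier must be one of {sorted(IDENTIFIERS)}, got {identifier!r}")
    return [
        "ppi_origami", "process", "common_format",
        str(processed_folder), source, str(taxon), identifier, str(version),
    ]


# Example: STRING links for human (taxon 9606), keyed by UniProtKB accession.
argv = common_format_argv(Path("data/processed"), "string_links_detailed", 9606, "upkb", "12.0")
print(" ".join(argv))
```

The resulting list can be handed to subprocess.run once the prerequisite command (string_upkb in this example) has been run.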

static common_to_rapppid(processed_folder: Path, common_path: Path, c_types: List[int], train_proportion: float = 0.8, val_proportion: float = 0.1, test_proportion: float = 0.1, neg_proportion: float = 1, uniref_threshold: int = 90, score_key: str | None = None, score_threshold: str | None = None, preloaded_protein_splits_path: Path | None = None, seed: int = 8675309, trim_unseen_proteins: bool = False, negatives_path: Path | None = None, taxon: int | None = None, weighted_random: bool = False, scramble_proteins: bool = False, exclude_preloaded_from_neg: bool = True) → None

Convert a dataset in the PPI Origami “common” format to the RAPPPID HDF5 format.

You can call this function from the CLI using:

ppi_origami process common_to_rapppid PROCESSED_FOLDER COMMON_PATH C_TYPES --train_proportion 0.8 --val_proportion 0.1 \
    --test_proportion 0.1 --neg_proportion 1 --uniref_threshold 90 --score_key string_combined_score --score_threshold 950 \
    --preloaded_protein_splits_path dataset.h5 --seed 8675309 --trim_unseen_proteins False --negatives_path None --taxon 9606 \
    --weighted_random False --scramble_proteins False --exclude_preloaded_from_neg True
Parameters:
  • processed_folder (pathlib.Path) – The processed folder, where you will deposit the new dataset.

  • common_path (pathlib.Path) – The path to the common file.

  • c_types (List[int]) – The different Park & Marcotte C-type levels to generate. Takes a list. e.g.: [1,2]

  • train_proportion (float) – The proportion of interactions to assign to the training fold. Defaults to 0.8.

  • val_proportion (float) – The proportion of interactions to assign to the validation fold. Defaults to 0.1.

  • test_proportion (float) – The proportion of interactions to assign to the testing fold. Defaults to 0.1.

  • neg_proportion (float) – The proportion of interactions that will be negative interactions. Defaults to 1.

  • uniref_threshold (int) – The UniRef threshold rate to use to ensure proteins between splits are not too similar. Defaults to 90.

  • score_key (Optional[str]) – The scoring key to threshold by, if any. Defaults to None.

  • score_threshold (Optional[str]) – The value to threshold the score key by. Values below this value will be filtered out. Defaults to None.

  • preloaded_protein_splits_path (Optional[pathlib.Path]) – Load protein splits from another RAPPPID dataset. Defaults to None.

  • seed (int) – An integer that will serve as the random seed for datasets. Defaults to 8675309.

  • trim_unseen_proteins (bool) – If true, when a protein loaded from the preloaded_protein_splits is not found in the common dataset, it is not included in the dataset. Defaults to False.

  • negatives_path (Optional[pathlib.Path]) – Optional, path to file with negative interactions. Defaults to None.

  • taxon (Optional[int]) – Optional, restrict a dataset to a certain organism. Defaults to None.

  • weighted_random (bool) – If true, negative samples will be sampled in such a way as to maintain the same protein degree as the positive samples. Defaults to False.

  • scramble_proteins (bool) – Scramble the association between protein ids and their sequences. Defaults to False.

  • exclude_preloaded_from_neg (bool) – Set this to true if you don’t want preloaded proteins to leak into negative. Defaults to True.

Returns:

None
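The weighted_random option above describes sampling negatives so that proteins keep roughly the degree they have among the positives. One common way to approximate this is to draw each protein with probability proportional to its positive degree. The sketch below illustrates that idea only; it is an assumption about the general technique, not PPI Origami's implementation, and sample_negatives and the toy protein IDs are hypothetical.

```python
import random
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]


def sample_negatives(positives: List[Pair], n_negatives: int, seed: int = 8675309) -> List[Pair]:
    """Draw negative pairs, picking proteins proportionally to their positive degree.

    Illustrative only: PPI Origami's own sampling procedure may differ.
    """
    rng = random.Random(seed)
    # Degree of a protein = number of positive pairs it takes part in.
    degree = Counter(protein for pair in positives for protein in pair)
    proteins = sorted(degree)
    weights = [degree[p] for p in proteins]
    known = {tuple(sorted(pair)) for pair in positives}

    negatives = set()
    attempts = 0
    max_attempts = 100 * n_negatives  # give up eventually on tiny networks
    while len(negatives) < n_negatives and attempts < max_attempts:
        attempts += 1
        a, b = rng.choices(proteins, weights=weights, k=2)
        candidate = tuple(sorted((a, b)))
        # Reject self-pairs and pairs already known to interact.
        if a == b or candidate in known:
            continue
        negatives.add(candidate)
    return sorted(negatives)


positives = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
negatives = sample_negatives(positives, n_negatives=2)
print(negatives)
```

Because high-degree proteins are drawn more often, they also appear more often among the negatives, which keeps a model from learning that a frequently seen protein is always a positive.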

static dscript_uniref(processed_folder: Path, threshold: int, taxon: int | None = None) → None

Add a column of UniRef IDs to a D-SCRIPT-formatted dataset.

You can call this function from the CLI using:

ppi_origami process dscript_uniref PROCESSED_FOLDER THRESHOLD --taxon 9606
Parameters:
  • processed_folder (pathlib.Path) – The folder to output processed data.

  • threshold (int) – The UniRef identity threshold. Must be one of 50, 90, 100.

  • taxon (Optional[int]) – The NCBI taxon ID of the organism whose links you wish to download. Omit for all organisms. Defaults to None.

Returns:

None

static dscript_upkb(raw_folder: Path, processed_folder: Path, taxon: int | None = None) → None

Add a column of UPKB accessions to a D-SCRIPT-formatted dataset.

You can call this function from the CLI using:

ppi_origami process dscript_upkb RAW_FOLDER PROCESSED_FOLDER --taxon 9606
Parameters:
  • raw_folder (pathlib.Path) – The folder datasets have been downloaded to.

  • processed_folder (pathlib.Path) – The folder to output processed data.

  • taxon (Optional[int]) – The NCBI taxon ID of the organism whose links you wish to download. Omit for all organisms. Defaults to None.

Returns:

None

static hippie_upkb(raw_folder: Path, processed_folder: Path) → None

Add a UPKB accession column to the HIPPIE dataset.

You can call this function from the CLI using:

ppi_origami process hippie_upkb RAW_FOLDER PROCESSED_FOLDER
Parameters:
  • raw_folder (pathlib.Path) – The raw folder where datasets are downloaded to.

  • processed_folder (pathlib.Path) – The processed folder, where you will deposit the new dataset.

Returns:

None

static multimerge_rapppid(processed_folder: Path, dataset_paths: str) → None

Combine multiple RAPPPID datasets into one.

Note that merging datasets with different protein splits causes data leakage. Only perform this operation if all the datasets have the same protein splits.

You can call this function from the CLI using:

ppi_origami process multimerge_rapppid PROCESSED_FOLDER DATASET_PATHS
Parameters:
  • processed_folder (pathlib.Path) – The processed folder, where you will deposit the new dataset.

  • dataset_paths (str) – Comma-separated paths to the datasets

Returns:

None
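The leakage warning above can be made concrete: a protein leaks if one dataset places it in, say, the training split while another places it in the test split. The sketch below assumes the protein splits have already been read into plain dictionaries (a hypothetical in-memory layout; reading them from the RAPPPID HDF5 files is not shown) and also shows how the comma-separated DATASET_PATHS argument can be assembled. The file names are made up.

```python
from typing import Dict, List, Set

# Hypothetical in-memory layout: split name -> set of protein IDs.
Splits = Dict[str, Set[str]]


def leaking_proteins(a: Splits, b: Splits) -> Set[str]:
    """Return proteins that the two datasets assign to different splits."""
    where_a = {p: name for name, proteins in a.items() for p in proteins}
    where_b = {p: name for name, proteins in b.items() for p in proteins}
    shared = where_a.keys() & where_b.keys()
    return {p for p in shared if where_a[p] != where_b[p]}


def dataset_paths_arg(paths: List[str]) -> str:
    """Join dataset paths into the comma-separated form the command expects."""
    return ",".join(paths)


first = {"train": {"P1", "P2"}, "val": {"P3"}, "test": {"P4"}}
second = {"train": {"P1"}, "val": {"P3"}, "test": {"P2", "P5"}}

print(leaking_proteins(first, second))                 # P2 is train in one, test in the other
print(dataset_paths_arg(["first.h5", "second.h5"]))
```

An empty result from leaking_proteins is the condition under which merging is safe with respect to the protein splits.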

static oma_upkb_groups(raw_path: Path, processed_path: Path, limit_taxons: List[int] | None = None)

Create a LevelDB database mapping UniProt accession codes to OMA Group IDs. To do this, the raw OMA XML file is parsed. This means, however, that you must first have run the command:

ppi_origami download oma RAW_FOLDER

You can call this function from the CLI using:

ppi_origami process oma_upkb_groups RAW_PATH PROCESSED_PATH --limit_taxons [9606,10090]
Parameters:
  • raw_path (pathlib.Path) – The folder datasets have been downloaded to.

  • processed_path (pathlib.Path) – The folder to output processed data.

  • limit_taxons (Optional[List[int]]) – A list of NCBI Taxon IDs. Will only parse IDs that belong to these taxa. Defaults to None (i.e.: No restriction on taxa).

Returns:

None
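List arguments such as limit_taxons are written on the command line as a single token without spaces, as in the example above. When the command line is assembled from Python, it can help to format the list explicitly and quote it for the shell, since some shells treat square brackets as glob patterns. The sketch below is a hypothetical convenience, not part of PPI Origami; the folder names are assumptions.

```python
import shlex
from typing import List, Optional


def taxon_list_arg(taxons: Optional[List[int]]) -> str:
    """Format a list of NCBI taxon IDs as a single token, e.g. [9606,10090]."""
    if taxons is None:
        return "None"
    return "[" + ",".join(str(int(t)) for t in taxons) + "]"


argv = [
    "ppi_origami", "process", "oma_upkb_groups",
    "data/raw", "data/processed",
    "--limit_taxons", taxon_list_arg([9606, 10090]),
]

# shlex.join quotes any token the shell could otherwise misinterpret.
command = shlex.join(argv)
print(command)
```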

static rapppid_to_deepppi(rapppid_path: Path, deepppi_folder: Path, c_types: List[int])

Convert a RAPPPID HDF5 dataset to the DeepPPI dataset format.

You can call this function from the CLI using:

ppi_origami process rapppid_to_deepppi RAPPPID_PATH DEEPPPI_FOLDER C_TYPES
Parameters:
  • rapppid_path (pathlib.Path) – Path to the RAPPPID dataset.

  • deepppi_folder (pathlib.Path) – Path to the folder into which we write the DeepPPI dataset.

  • c_types (List[int]) – The Park & Marcotte C-type datasets to generate.

Returns:

None
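The c_types argument refers to the evaluation classes of Park & Marcotte, which group a test pair by how many of its proteins also appear in the training set: C1 when both do, C2 when exactly one does, and C3 when neither does. The sketch below illustrates the classification itself with a hypothetical c_type helper and toy protein IDs; it does not read a RAPPPID file and is not PPI Origami's code.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]


def training_proteins(train_pairs: Iterable[Pair]) -> Set[str]:
    """Collect every protein that occurs in at least one training pair."""
    return {protein for pair in train_pairs for protein in pair}


def c_type(pair: Pair, seen: Set[str]) -> int:
    """Return 1, 2 or 3 for a test pair, following Park & Marcotte's classes."""
    n_seen = sum(protein in seen for protein in pair)
    # Both proteins seen -> C1, one seen -> C2, none seen -> C3.
    return {2: 1, 1: 2, 0: 3}[n_seen]


train = [("A", "B"), ("B", "C")]
seen = training_proteins(train)

for test_pair in [("A", "C"), ("A", "X"), ("X", "Y")]:
    print(test_pair, "-> C%d" % c_type(test_pair, seen))
```

Passing [1,2] as C_TYPES therefore asks for datasets built for both the C1 and C2 settings.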

static rapppid_to_dscript(rapppid_path: Path, dscript_folder: Path, c_types: List[int], trunc_len: int = 1500) → None

Convert a RAPPPID HDF5 dataset to the D-SCRIPT dataset format.

You can call this function from the CLI using:

ppi_origami process rapppid_to_dscript RAPPPID_PATH DSCRIPT_FOLDER C_TYPES --trunc_len 1500
Parameters:
  • rapppid_path (pathlib.Path) – Path to the RAPPPID dataset.

  • dscript_folder (pathlib.Path) – Path to the folder into which we write the D-SCRIPT dataset.

  • c_types (List[int]) – The Park & Marcotte C-type datasets to generate.

  • trunc_len (int) – Length at which to truncate amino acid sequences.

Returns:

None

static rapppid_to_intrepppid(processed_path: Path, rapppid_path: Path, intrepppid_path: Path, c_types: List[int], allowlist_taxon: List[int] | None = None, denylist_taxon: List[int] | None = None, scramble_interactions: bool = False, scramble_orthologs: bool = False, uniref_threshold: int = 90)

Convert a RAPPPID dataset to the INTREPPPID format. This primarily involves adding orthology data to the dataset.

You can call this function from the CLI using:

ppi_origami process rapppid_to_intrepppid PROCESSED_PATH RAPPPID_PATH INTREPPPID_PATH C_TYPES --allowlist_taxon [9606,10090] --denylist_taxon None --scramble_interactions False --scramble_orthologs False --uniref_threshold 90
Parameters:
  • processed_path (pathlib.Path) – The folder where processed data is kept.

  • rapppid_path (pathlib.Path) – The path to the RAPPPID dataset to convert to INTREPPPID.

  • intrepppid_path (pathlib.Path) – The path to save the INTREPPPID dataset.

  • c_types (List[int]) – The different Park & Marcotte C-type levels to generate. Takes a list. e.g.: [1,2]

  • allowlist_taxon (Optional[List[int]]) – The NCBI Taxon IDs of the organisms that orthologues may come from. Orthologues from organisms not in this list will be omitted. Cannot be used with denylist_taxon. Defaults to None.

  • denylist_taxon (Optional[List[int]]) – The NCBI Taxon IDs of the organisms that orthologues may not come from. Orthologues from organisms in this list will be omitted. Cannot be used with allowlist_taxon. Defaults to None.

  • scramble_interactions (bool) – If True, protein IDs will be scrambled, ablating the biological meaning of the interaction network. Defaults to False.

  • scramble_orthologs (bool) – If True, orthologue IDs will be scrambled, ablating the biological meaning of the orthologue data. Defaults to False.

  • uniref_threshold (int) – What uniref threshold to use when determining the similarity of proteins. Must be one of 50, 90, or 100. Defaults to 90.

Returns:

None
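The allowlist_taxon and denylist_taxon parameters are mutually exclusive filters on the organisms that orthologues may come from. The sketch below shows the filtering logic those descriptions imply, using a hypothetical (protein ID, taxon ID) record layout and made-up protein IDs; it is an illustration under those assumptions rather than the code PPI Origami runs.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical record layout: (protein ID, NCBI taxon ID).
Ortholog = Tuple[str, int]


def filter_orthologs(
    orthologs: Iterable[Ortholog],
    allowlist_taxon: Optional[List[int]] = None,
    denylist_taxon: Optional[List[int]] = None,
) -> List[Ortholog]:
    """Keep only orthologues permitted by the allowlist or the denylist."""
    if allowlist_taxon is not None and denylist_taxon is not None:
        raise ValueError("allowlist_taxon and denylist_taxon cannot be used together")
    if allowlist_taxon is not None:
        allowed = set(allowlist_taxon)
        return [o for o in orthologs if o[1] in allowed]
    if denylist_taxon is not None:
        denied = set(denylist_taxon)
        return [o for o in orthologs if o[1] not in denied]
    return list(orthologs)  # no restriction


records = [("prot_human", 9606), ("prot_mouse", 10090), ("prot_yeast", 559292)]

print(filter_orthologs(records, allowlist_taxon=[9606, 10090]))
print(filter_orthologs(records, denylist_taxon=[9606]))
```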

static rapppid_to_pipr(rapppid_path: Path, pipr_folder: Path, c_types: List[int])

Convert a RAPPPID HDF5 dataset to the PIPR dataset format.

You can call this function from the CLI using:

ppi_origami process rapppid_to_pipr RAPPPID_PATH PIPR_FOLDER C_TYPES
Parameters:
  • rapppid_path (pathlib.Path) – Path to the RAPPPID dataset.

  • pipr_folder (pathlib.Path) – Path to the folder into which we write the PIPR dataset.

  • c_types (List[int]) – The Park & Marcotte C-type datasets to generate.

Returns:

None

static rapppid_to_sprint(rapppid_path: Path, sprint_folder: Path, c_types: List[int]) → None

Convert a RAPPPID HDF5 dataset to the SPRINT dataset format.

You can call this function from the CLI using:

ppi_origami process rapppid_to_sprint RAPPPID_PATH SPRINT_FOLDER C_TYPES
Parameters:
  • rapppid_path (pathlib.Path) – Path to the RAPPPID dataset.

  • sprint_folder (pathlib.Path) – Path to the folder into which we write the SPRINT dataset.

  • c_types (List[int]) – The Park & Marcotte C-type datasets to generate.

Returns:

None

static string_uniref(processed_folder: Path, threshold: int, version: str = '12.0', taxon: int | None = None) → None

Add a UniRef ID column to STRING rows.

You can call this function from the CLI using:

ppi_origami process string_uniref PROCESSED_FOLDER THRESHOLD --version 12.0 --taxon 9606
Parameters:
  • version (str) – The version of STRING DB to process. Defaults to '12.0'.

  • processed_folder (pathlib.Path) – The folder to output processed data.

  • threshold (int) – The UniRef identity threshold. Must be one of 50, 90, 100.

  • taxon (Optional[int]) – The NCBI taxon ID of the organism whose links you wish to download. Omit for all organisms. Defaults to None.

Returns:

None

static string_upkb(raw_folder: Path, processed_folder: Path, version: str = '12.0', taxon: int | None = None) → None

Add a UniProtKB accession column to STRING rows.

This can be a long process, as a map from STRING IDs to UniProtKB accessions must be built.

Any protein pair with a STRING ID for which PPI Origami can’t find the corresponding UniProtKB accession is omitted from the dataset.

You can call this function from the CLI using:

ppi_origami process string_upkb RAW_FOLDER PROCESSED_FOLDER VERSION TAXON
Parameters:
  • version (str) – The version of STRING DB to process. Defaults to '12.0'.

  • raw_folder (pathlib.Path) – The folder datasets have been downloaded to.

  • processed_folder (pathlib.Path) – The folder to output processed data.

  • taxon (Optional[int]) – The NCBI taxon ID of the organism whose links you wish to download. Omit for all organisms. Defaults to None.

Returns:

None

static train_sentencepiece_model(processed_folder: Path, dataset_file: Path, seed: int, vocab_size: int) → None

Train a Uniword model using SentencePiece given a RAPPPID dataset.

You can call this function from the CLI using:

ppi_origami process train_sentencepiece_model PROCESSED_FOLDER DATASET_FILE SEED VOCAB_SIZE
Parameters:
  • processed_folder (pathlib.Path) – The processed folder, where you will deposit the new dataset.

  • dataset_file (pathlib.Path) – The RAPPPID file from which to train a SentencePiece model.

  • seed (int) – The random seed to use.

  • vocab_size (int) – The number of tokens to learn.

Returns:

None

static uniprot_id_mapping(raw_folder: Path, processed_folder: Path) → None

Process UniProt ID mappings. This creates a new CSV file mapping each UPKB accession code to a different identifier. These files will have filenames of the format uniprot_idmapping_{id_type}.csv.gz.

You can call this function from the CLI using:

ppi_origami process uniprot_id_mapping RAW_FOLDER PROCESSED_FOLDER
Parameters:
  • raw_folder (pathlib.Path) – The folder to download the dataset to.

  • processed_folder (pathlib.Path) – The folder to output processed data.

Returns:

None

static uniref(raw_folder: Path, processed_folder: Path, threshold: int) → None

Process UniRef dataset. Namely, parse the UniRef XML file into a few new formats useful to PPI Origami:

  1. A CSV file mapping UniRef identifiers to UPKB accessions (uniref{threshold}_members_upkb.csv.gz)

  2. A CSV file mapping UniRef identifiers to UniParc identifiers (uniref{threshold}_members_uniparc.csv.gz)

  3. A CSV file mapping UniRef identifiers to their amino acid sequences (uniref{threshold}_sequences.csv.gz)

  4. A LevelDB database with UPKB accession keys and UniRef identifier values (uniref{threshold}_members_upkb.leveldb)

  5. A LevelDB database with UniParc identifier keys and UniRef identifier values (uniref{threshold}_members_uniparc.leveldb)

  6. A LevelDB database with UniRef identifier keys and values that correspond to the amino acid sequence of the representative protein of the cluster (uniref{threshold}_sequences.leveldb)

The files generated are particularly useful for identifying whether two proteins, as identified by their UPKB accessions, belong to the same UniRef cluster. PPI Origami uses this to identify similar proteins.

PPI Origami will parse the UniRef XML file by streaming it, avoiding having to load the whole file into memory.

The CSV files are simple to use, while the LevelDB databases offer fast lookups without requiring you to load the whole database into memory.

You can call this function from the CLI using:

ppi_origami process uniref RAW_FOLDER PROCESSED_FOLDER THRESHOLD
Parameters:
  • raw_folder (pathlib.Path) – The folder datasets have been downloaded to.

  • processed_folder (pathlib.Path) – The folder to output processed data to.

  • threshold (int) – The UniRef identity threshold. Must be one of 50, 90, 100.

Returns:

None
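The streaming approach described above, together with the "same cluster" check, can be illustrated with the standard library alone. The sketch below parses a small, simplified, made-up UniRef-style XML document with xml.etree.ElementTree.iterparse, clearing each entry once it has been read so that memory use stays bounded. The sample accessions, the stream_clusters and same_cluster helpers, and the in-memory dictionary (standing in for the CSV or LevelDB outputs) are all assumptions for illustration; this is not PPI Origami's parser.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Dict, Iterator, List, Tuple

# Simplified, invented sample; real UniRef files are far larger and use an XML namespace.
SAMPLE = b"""<UniRef90>
  <entry id="UniRef90_P11111">
    <representativeMember>
      <dbReference type="UniProtKB ID" id="EXAMPLE1">
        <property type="UniProtKB accession" value="P11111"/>
      </dbReference>
      <sequence length="4">MKVL</sequence>
    </representativeMember>
    <member>
      <dbReference type="UniProtKB ID" id="EXAMPLE2">
        <property type="UniProtKB accession" value="P22222"/>
      </dbReference>
    </member>
  </entry>
  <entry id="UniRef90_P33333">
    <representativeMember>
      <dbReference type="UniProtKB ID" id="EXAMPLE3">
        <property type="UniProtKB accession" value="P33333"/>
      </dbReference>
      <sequence length="3">MAG</sequence>
    </representativeMember>
  </entry>
</UniRef90>"""


def local(tag: str) -> str:
    """Strip an XML namespace prefix such as '{http://...}entry'."""
    return tag.rsplit("}", 1)[-1]


def stream_clusters(source: BinaryIO) -> Iterator[Tuple[str, List[str], str]]:
    """Yield (cluster ID, member accessions, representative sequence) per entry."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "entry":
            continue
        accessions = [
            prop.get("value")
            for prop in elem.iter()
            if local(prop.tag) == "property" and prop.get("type") == "UniProtKB accession"
        ]
        sequence = next((s.text or "" for s in elem.iter() if local(s.tag) == "sequence"), "")
        yield elem.get("id"), accessions, sequence.strip()
        elem.clear()  # drop the entry's children once it has been processed


def same_cluster(a: str, b: str, mapping: Dict[str, str]) -> bool:
    """True if both accessions are known and map to the same UniRef cluster."""
    return a in mapping and b in mapping and mapping[a] == mapping[b]


upkb_to_cluster: Dict[str, str] = {}
for cluster_id, accessions, _sequence in stream_clusters(io.BytesIO(SAMPLE)):
    for accession in accessions:
        upkb_to_cluster[accession] = cluster_id

print(same_cluster("P11111", "P22222", upkb_to_cluster))
print(same_cluster("P11111", "P33333", upkb_to_cluster))
```

For full-size UniRef releases a dictionary like upkb_to_cluster may not fit in memory, which is the situation the LevelDB outputs are meant to address.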