Process#
The process module of PPI Origami allows you to transform files from their original formats and create new datasets.
You can find a description of all the available processing commands by running:
ppi_origami process --help
Information specific to arguments of commands can be found by running the command with the help flag:
ppi_origami process COMMAND --help
This information is reproduced on this page.
- class ppi_origami.__main__.Process#
- static common_format(processed_folder: Path, source: str, taxon: int | None, identifier: str, version: str) None #
Converts a raw source dataset into a “common” format from which PPI Origami can further transform data.
Currently, only the string_links_detailed and dscript datasets are supported.
The identifier argument specifies which identifier to use when referring to proteins in the file. Must be one of:
- upkb – The UniProtKB accession
- uniref50 – The UniRef50 identifier
- uniref90 – The UniRef90 identifier
- uniref100 – The UniRef100 identifier
You must first have added a column with the identifier you selected using the appropriate PPI Origami command/function.
For instance, if you specify upkb as the identifier and string_links_detailed as the source, then you must first have run the PPI Origami command string_upkb.
You can call this function from the CLI using:
ppi_origami process common_format PROCESSED_FOLDER SOURCE TAXON IDENTIFIER VERSION
- Parameters:
processed_folder (pathlib.Path) – The path to the processed folder.
source (str) – The name of the source. Must be one of string_links_detailed or dscript.
taxon (Optional[int]) – The NCBI Taxon ID of the database to be converted. Set to None if there is no specific organism.
identifier (str) – The identifier to use to refer to proteins. Must be one of upkb, uniref50, uniref90, or uniref100.
version (str) – The version of the source to use.
- Returns:
None
- static common_to_rapppid(processed_folder: Path, common_path: Path, c_types: List[int], train_proportion: float = 0.8, val_proportion: float = 0.1, test_proportion: float = 0.1, neg_proportion: float = 1, uniref_threshold: int = 90, score_key: str | None = None, score_threshold: str | None = None, preloaded_protein_splits_path: Path | None = None, seed: int = 8675309, trim_unseen_proteins: bool = False, negatives_path: Path | None = None, taxon: int | None = None, weighted_random: bool = False, scramble_proteins: bool = False, exclude_preloaded_from_neg: bool = True) None #
Convert a dataset in the PPI Origami “common” format to the RAPPPID HDF5 format.
You can call this function from the CLI using:
ppi_origami process common_to_rapppid PROCESSED_FOLDER COMMON_PATH C_TYPES --train_proportion 0.8 --val_proportion 0.1 \
    --test_proportion 0.1 --neg_proportion 1 --uniref_threshold 90 --score_key string_combined_score --score_threshold 950 \
    --preloaded_protein_splits_path dataset.h5 --seed 8675309 --trim_unseen_proteins False --negatives_path None --taxon 9606 \
    --weighted_random False --scramble_proteins False --exclude_preloaded_from_neg True
- Parameters:
processed_folder (pathlib.Path) – The processed folder, where you will deposit the new dataset.
common_path (pathlib.Path) – The path to the common file.
c_types (List[int]) – The different Park & Marcotte C-type levels to generate. Takes a list. e.g.: [1,2]
train_proportion (float) – The proportion of interactions to assign to the training fold. Defaults to 0.8.
val_proportion (float) – The proportion of interactions to assign to the validation fold. Defaults to 0.1.
test_proportion (float) – The proportion of interactions to assign to the testing fold. Defaults to 0.1.
neg_proportion (float) – The proportion of interactions that will be negative interactions. Defaults to 1.
uniref_threshold (int) – The UniRef threshold rate to use to ensure proteins between splits are not too similar. Defaults to 90.
score_key (Optional[str]) – The scoring key to threshold by, if any. Defaults to None.
score_threshold (Optional[str]) – The value to threshold the score key by. Values below this value will be filtered out. Defaults to None.
preloaded_protein_splits_path (Optional[pathlib.Path]) – Load protein splits from another RAPPPID dataset. Defaults to None.
seed (int) – An integer that will serve as the random seed for datasets. Defaults to 8675309.
trim_unseen_proteins (bool) – If true, proteins loaded from preloaded_protein_splits that are not found in the common dataset are excluded from the new dataset. Defaults to False.
negatives_path (Optional[pathlib.Path]) – Optional path to a file with negative interactions. Defaults to None.
taxon (Optional[int]) – Optionally restrict the dataset to a certain organism. Defaults to None.
weighted_random (bool) – If true, negative samples are drawn so as to preserve the protein degree distribution of the positive samples. Defaults to False.
scramble_proteins (bool) – Scramble the association between protein IDs and their sequences. Defaults to False.
exclude_preloaded_from_neg (bool) – Set this to true if you don’t want preloaded proteins to leak into negative samples. Defaults to True.
- Returns:
None
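The proportion, seed, and score arguments describe a deterministic, score-filtered split. The sketch below illustrates the general idea only; it is not PPI Origami's actual implementation, and it ignores the protein-level constraints imposed by the C-type and UniRef options. The edge list and score key are invented for illustration.

```python
import random

def split_interactions(pairs, train_p=0.8, val_p=0.1, test_p=0.1, seed=8675309):
    """Deterministically shuffle interaction pairs, then slice by proportion."""
    assert abs(train_p + val_p + test_p - 1.0) < 1e-9
    pairs = sorted(pairs)  # canonical order, so the seed alone fixes the split
    random.Random(seed).shuffle(pairs)
    n_train = int(len(pairs) * train_p)
    n_val = int(len(pairs) * val_p)
    return (pairs[:n_train],
            pairs[n_train:n_train + n_val],
            pairs[n_train + n_val:])

# Keep only high-confidence edges (e.g. a STRING combined score >= 950)
# before splitting, mirroring score_key/score_threshold.
edges = [("P1", "P2", 990), ("P3", "P4", 400), ("P5", "P6", 960),
         ("P7", "P8", 970), ("P9", "P10", 999)]
kept = [(a, b) for a, b, score in edges if score >= 950]
train, val, test = split_interactions(kept, 0.5, 0.25, 0.25)
```

Because the pairs are sorted before the seeded shuffle, rerunning with the same seed always reproduces the same folds.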
- static dscript_uniref(processed_folder: Path, threshold: int, taxon: int | None = None) None #
Add a column of UniRef IDs to a D-SCRIPT-formatted dataset.
You can call this function from the CLI using:
ppi_origami process dscript_uniref PROCESSED_FOLDER THRESHOLD --taxon 9606
- Parameters:
processed_folder (pathlib.Path) – The folder to output processed data.
threshold (int) – The UniRef identity threshold. Must be one of 50, 90, 100.
taxon (Optional[int]) – The NCBI taxon ID of the organism whose links you wish to process. Omit for all organisms. Defaults to None.
- Returns:
None
- static dscript_upkb(raw_folder: Path, processed_folder: Path, taxon: int | None = None) None #
Add a column of UPKB accessions to a D-SCRIPT-formatted dataset.
You can call this function from the CLI using:
ppi_origami process dscript_upkb RAW_FOLDER PROCESSED_FOLDER --taxon 9606
- Parameters:
raw_folder (pathlib.Path) – The folder datasets have been downloaded to.
processed_folder (pathlib.Path) – The folder to output processed data.
taxon (Optional[int]) – The NCBI taxon ID of the organism whose links you wish to process. Omit for all organisms. Defaults to None.
- Returns:
None
- static hippie_upkb(raw_folder: Path, processed_folder: Path) None #
Add a UPKB accession column to the HIPPIE dataset.
You can call this function from the CLI using:
ppi_origami process hippie_upkb RAW_FOLDER PROCESSED_FOLDER
- Parameters:
raw_folder (pathlib.Path) – The raw folder where datasets are downloaded to.
processed_folder (pathlib.Path) – The processed folder, where you will deposit the new dataset.
- Returns:
None
- static multimerge_rapppid(processed_folder: Path, dataset_paths: str) None #
Combine multiple RAPPPID datasets into one.
Note that if the datasets you are merging have different protein splits, you will have data leakage. Only perform this operation if the datasets share the same protein splits.
You can call this function from the CLI using:
ppi_origami process multimerge_rapppid PROCESSED_FOLDER DATASET_PATHS
- Parameters:
processed_folder (pathlib.Path) – The processed folder, where you will deposit the new dataset.
dataset_paths (str) – Comma-separated paths to the datasets.
- Returns:
None
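The warning above can be checked mechanically: before merging, verify that no protein appears in different folds across the two datasets. A minimal sketch, using plain dicts of sets as a stand-in for the protein splits stored in a RAPPPID HDF5 file (the fold names and IDs are illustrative):

```python
def conflicting_proteins(splits_a, splits_b):
    """Return the proteins assigned to different folds in two datasets.

    Each argument maps a fold name ('train'/'val'/'test') to a set of
    protein IDs. An empty result means the shared proteins agree and
    merging will not leak proteins across folds.
    """
    def fold_of(splits):
        return {pid: fold for fold, prots in splits.items() for pid in prots}
    a, b = fold_of(splits_a), fold_of(splits_b)
    return {pid for pid in a.keys() & b.keys() if a[pid] != b[pid]}

a = {"train": {"P1", "P2"}, "test": {"P3"}}
b = {"train": {"P2"}, "test": {"P3", "P4"}}
conflicting_proteins(a, b)                 # empty set: safe to merge
conflicting_proteins(a, {"test": {"P1"}})  # {'P1'}: merging would leak
```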
- static oma_upkb_groups(raw_path: Path, processed_path: Path, limit_taxons: List[int] | None = None)#
Create a LevelDB database mapping UniProt accession codes to OMA Group IDs. To do this, the raw OMA XML file is parsed, which means you must first have run the command:
ppi_origami download oma RAW_FOLDER
You can call this function from the CLI using:
ppi_origami process oma_upkb_groups RAW_PATH PROCESSED_PATH --limit_taxons [9606,10090]
- Parameters:
raw_path (pathlib.Path) – The folder datasets have been downloaded to.
processed_path (pathlib.Path) – The folder to output processed data.
limit_taxons (Optional[List[int]]) – A list of NCBI Taxon IDs. Will only parse IDs that belong to these taxa. Defaults to None (i.e.: No restriction on taxa).
- Returns:
None
- static rapppid_to_deepppi(rapppid_path: Path, deepppi_folder: Path, c_types: List[int])#
Convert a RAPPPID HDF5 dataset to the DeepPPI dataset format.
You can call this function from the CLI using:
ppi_origami process rapppid_to_deepppi RAPPPID_PATH DEEPPPI_FOLDER C_TYPES
- Parameters:
rapppid_path (pathlib.Path) – Path to the RAPPPID dataset.
deepppi_folder (pathlib.Path) – Path to the folder into which we write the DeepPPI dataset.
c_types (List[int]) – The Park & Marcotte C-type datasets to generate.
- Returns:
None
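The c_types argument refers to the Park & Marcotte evaluation classes: in a C1 test pair both proteins also appear in training, in a C2 pair exactly one does, and in a C3 pair neither does. A hypothetical classifier for a single test pair (not PPI Origami's internal code):

```python
def park_marcotte_c_type(pair, train_proteins):
    """Return the Park & Marcotte difficulty class of a test pair:
    C1 = both proteins seen in training, C2 = one seen, C3 = neither."""
    seen = sum(protein in train_proteins for protein in pair)
    return {2: 1, 1: 2, 0: 3}[seen]

train_proteins = {"P1", "P2", "P3"}
park_marcotte_c_type(("P1", "P2"), train_proteins)  # C1 -> 1
park_marcotte_c_type(("P1", "P9"), train_proteins)  # C2 -> 2
park_marcotte_c_type(("P8", "P9"), train_proteins)  # C3 -> 3
```

Higher C-types are harder: a model can rely less on having memorised protein-specific signal from training.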
- static rapppid_to_dscript(rapppid_path: Path, dscript_folder: Path, c_types: List[int], trunc_len: int = 1500) None #
Convert a RAPPPID HDF5 dataset to the D-SCRIPT dataset format.
You can call this function from the CLI using:
ppi_origami process rapppid_to_dscript RAPPPID_PATH DSCRIPT_FOLDER C_TYPES --trunc_len 1500
- Parameters:
rapppid_path (pathlib.Path) – Path to the RAPPPID dataset.
dscript_folder (pathlib.Path) – Path to the folder into which we write the D-SCRIPT dataset.
c_types (List[int]) – The Park & Marcotte C-type datasets to generate.
trunc_len (int) – Length at which to truncate amino acid sequences.
- Returns:
None
- static rapppid_to_intrepppid(processed_path: Path, rapppid_path: Path, intrepppid_path: Path, c_types: List[int], allowlist_taxon: List[int] | None = None, denylist_taxon: List[int] | None = None, scramble_interactions: bool = False, scramble_orthologs: bool = False, uniref_threshold: int = 90)#
Convert a RAPPPID dataset to the INTREPPPID format. This primarily involves adding orthology data to the dataset.
You can call this function from the CLI using:
ppi_origami process rapppid_to_intrepppid PROCESSED_PATH RAPPPID_PATH INTREPPPID_PATH C_TYPES --allowlist_taxon [9606,10090] --denylist_taxon None --scramble_interactions False --scramble_orthologs False --uniref_threshold 90
- Parameters:
processed_path (pathlib.Path) – The folder where processed data is kept.
rapppid_path (pathlib.Path) – The path to the RAPPPID dataset to convert to INTREPPPID.
intrepppid_path (pathlib.Path) – The path to save the INTREPPPID dataset.
c_types (List[int]) – The different Park & Marcotte C-type levels to generate. Takes a list. e.g.: [1,2]
allowlist_taxon (Optional[List[int]]) – The NCBI Taxon IDs of organisms from which orthologues are allowed to come. Orthologues from organisms not in this list will be omitted. Cannot be used with denylist_taxon. Defaults to None.
denylist_taxon (Optional[List[int]]) – The NCBI Taxon IDs of organisms from which orthologues are not allowed to come. Orthologues from organisms in this list will be omitted. Cannot be used with allowlist_taxon. Defaults to None.
scramble_interactions (bool) – If True, protein IDs will be scrambled, ablating the biological meaning of the interaction network. Defaults to False.
scramble_orthologs (bool) – If True, orthologue IDs will be scrambled, ablating the biological meaning of the orthologue data. Defaults to False.
uniref_threshold (int) – The UniRef identity threshold to use when determining the similarity of proteins. Must be one of 50, 90, or 100. Defaults to 90.
- Returns:
None
- static rapppid_to_pipr(rapppid_path: Path, pipr_folder: Path, c_types: List[int])#
Convert a RAPPPID HDF5 dataset to the PIPR dataset format.
You can call this function from the CLI using:
ppi_origami process rapppid_to_pipr RAPPPID_PATH PIPR_FOLDER C_TYPES
- Parameters:
rapppid_path (pathlib.Path) – Path to the RAPPPID dataset.
pipr_folder (pathlib.Path) – Path to the folder into which we write the PIPR dataset.
c_types (List[int]) – The Park & Marcotte C-type datasets to generate.
- Returns:
None
- static rapppid_to_sprint(rapppid_path: Path, sprint_folder: Path, c_types: List[int]) None #
Convert a RAPPPID HDF5 dataset to the SPRINT dataset format.
You can call this function from the CLI using:
ppi_origami process rapppid_to_sprint RAPPPID_PATH SPRINT_FOLDER C_TYPES
- Parameters:
rapppid_path (pathlib.Path) – Path to the RAPPPID dataset.
sprint_folder (pathlib.Path) – Path to the folder into which we write the SPRINT dataset.
c_types (List[int]) – The Park & Marcotte C-type datasets to generate.
- Returns:
None
- static string_uniref(processed_folder: Path, threshold: int, version: str = '12.0', taxon: int | None = None) None #
Add a UniRef ID column to STRING rows.
You can call this function from the CLI using:
ppi_origami process string_uniref PROCESSED_FOLDER THRESHOLD --version 12.0 --taxon 9606
- Parameters:
version (str) – The version of STRING DB to process.
processed_folder (pathlib.Path) – The folder to output processed data.
threshold (int) – The UniRef identity threshold. Must be one of 50, 90, 100.
taxon (Optional[int]) – The NCBI taxon ID of the organism whose links you wish to process. Omit for all organisms.
- Returns:
None
- static string_upkb(raw_folder: Path, processed_folder: Path, version: str = '12.0', taxon: int | None = None) None #
Add a UniProtKB accession column to STRING rows.
This can be a long process, as a STRING ID to UniProtKB accession map must be built.
Any protein pair with a STRING ID for which PPI Origami can’t find the corresponding UniProtKB accession is omitted from the dataset.
You can call this function from the CLI using:
ppi_origami process string_upkb RAW_FOLDER PROCESSED_FOLDER VERSION TAXON
- Parameters:
version (str) – The version of STRING DB to process.
raw_folder (pathlib.Path) – The folder datasets have been downloaded to.
processed_folder (pathlib.Path) – The folder to output processed data.
taxon (Optional[int]) – The NCBI taxon ID of the organism whose links you wish to process. Omit for all organisms.
- Returns:
None
- static train_sentencepiece_model(processed_folder: Path, dataset_file: Path, seed: int, vocab_size: int) None #
Train a unigram model using SentencePiece given a RAPPPID dataset.
You can call this function from the CLI using:
ppi_origami process train_sentencepiece_model PROCESSED_FOLDER DATASET_FILE SEED VOCAB_SIZE
- Parameters:
processed_folder (pathlib.Path) – The processed folder, where you will deposit the new dataset.
dataset_file (pathlib.Path) – The RAPPPID file from which to train a Sentencepiece model.
seed (int) – The random seed to use.
vocab_size (int) – The number of tokens to learn.
- Returns:
None
- static uniprot_id_mapping(raw_folder: Path, processed_folder: Path) None #
Process UniProt ID mappings. This creates a new CSV file mapping each UPKB accession code to a different identifier. These files will have filenames of the format uniprot_idmapping_{id_type}.csv.gz.
You can call this function from the CLI using:
ppi_origami process uniprot_id_mapping RAW_FOLDER PROCESSED_FOLDER
- Parameters:
raw_folder (pathlib.Path) – The folder to download the dataset to.
processed_folder (pathlib.Path) – The folder to output processed data.
- Returns:
None
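The resulting uniprot_idmapping_{id_type}.csv.gz files can be consumed with nothing but the standard library. A sketch, which builds a tiny gzipped CSV in memory rather than reading a real output file (the column names here are assumptions, not PPI Origami's actual header):

```python
import csv
import gzip
import io

# A tiny gzipped CSV standing in for uniprot_idmapping_{id_type}.csv.gz.
buf = io.BytesIO()
with gzip.open(buf, "wt", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["upkb", "string"])
    writer.writerow(["P04637", "9606.ENSP00000269305"])

# Stream the mapping back into a dict without decompressing to disk.
buf.seek(0)
with gzip.open(buf, "rt", newline="") as f:
    mapping = {row["upkb"]: row["string"] for row in csv.DictReader(f)}

mapping["P04637"]  # '9606.ENSP00000269305'
```

For a real file, pass its path to gzip.open instead of the in-memory buffer.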
- static uniref(raw_folder: Path, processed_folder: Path, threshold: int) None #
Process the UniRef dataset. Namely, parse the UniRef XML file into a few new formats useful to PPI Origami:
- A CSV file mapping UniRef identifiers to UPKB accessions (uniref{threshold}_members_upkb.csv.gz)
- A CSV file mapping UniRef identifiers to UniParc identifiers (uniref{threshold}_members_uniparc.csv.gz)
- A CSV file mapping UniRef identifiers to their amino acid sequences (uniref{threshold}_sequences.csv.gz)
- A LevelDB database with UPKB accession keys and UniRef identifier values (uniref{threshold}_members_upkb.leveldb)
- A LevelDB database with UniParc identifier keys and UniRef identifier values (uniref{threshold}_members_uniparc.leveldb)
- A LevelDB database with UniRef identifier keys and values that correspond to the amino acid sequence of the representative protein of the cluster (uniref{threshold}_sequences.leveldb)
The files generated are particularly useful for identifying whether two proteins, as identified by their UPKB accessions, belong to the same UniRef cluster. PPI Origami uses this to identify similar proteins.
PPI Origami parses the UniRef XML file by streaming it, avoiding having to load the whole file into memory.
The CSV files are simple to use, while the LevelDB databases are very fast without requiring you to load the whole database into memory.
You can call this function from the CLI using:
ppi_origami process uniref RAW_FOLDER PROCESSED_FOLDER THRESHOLD
- Parameters:
raw_folder (pathlib.Path) – The folder datasets have been downloaded to.
processed_folder (pathlib.Path) – The folder to output processed data to.
threshold (int) – The UniRef identity threshold. Must be one of 50, 90, 100.
- Returns:
None
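The "same cluster" check that these databases enable is just a key lookup and comparison. A sketch using a plain dict as a stand-in for the uniref{threshold}_members_upkb.leveldb database (all accessions and cluster IDs below are invented for illustration, not real assignments):

```python
# Stand-in for uniref90_members_upkb.leveldb: UPKB accession -> UniRef90 ID.
upkb_to_uniref90 = {
    "P00001": "UniRef90_P00001",
    "P00002": "UniRef90_P00001",  # same cluster as P00001
    "P00003": "UniRef90_P00003",
}

def share_cluster(upkb_a, upkb_b, members):
    """True when two UPKB accessions map to the same UniRef cluster,
    i.e. they are too similar to place in different dataset splits."""
    cluster_a = members.get(upkb_a)
    return cluster_a is not None and cluster_a == members.get(upkb_b)

share_cluster("P00001", "P00002", upkb_to_uniref90)  # True
share_cluster("P00001", "P00003", upkb_to_uniref90)  # False
```

With a real LevelDB database the dict lookups would become get() calls on the opened database, but the comparison logic is the same.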