Analysis

The analysis module of PPI Origami allows you to validate and analyze datasets.

You can find a description of all the available analysis commands by running:

ppi_origami analysis --help

To see the arguments of a specific command, run that command with the help flag:

ppi_origami analysis COMMAND --help

This information is reproduced on this page.

class ppi_origami.__main__.Analysis
static verify_rapppid(dataset_path: Path, taxon_ids: List[int] | None = None, sample_fraction: float | None = None, skip_protein_overlap: bool = False, skip_gotoh: bool = False, skip_sw: bool = False, skip_taxa_check: bool = False)

Runs sanity checks on a RAPPPID dataset. Specifically, it can check:

  1. Whether the protein splits overlap.

  2. Whether proteins observed in interaction pairs belonging to different splits overlap.

  3. Whether proteins belong to a set of expected taxa.

  4. Whether sequences in different splits are similar, using the Gotoh algorithm, the Smith-Waterman algorithm, or both.
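Checks 1 and 2 above can be pictured as set intersections between splits. The following is a minimal sketch of that idea, not the code PPI Origami runs. The split names (`train`, `val`, `test`), the protein IDs, and the in-memory layout are assumptions made for illustration; the on-disk layout of a RAPPPID dataset is not described on this page.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]


def proteins_in_pairs(pairs: List[Pair]) -> Set[str]:
    """Collect every protein ID that appears in a list of interaction pairs."""
    return {protein for pair in pairs for protein in pair}


def split_overlaps(splits: Dict[str, Set[str]]) -> Dict[Tuple[str, str], Set[str]]:
    """Return the proteins shared by each pair of splits (empty overlaps are omitted)."""
    names = sorted(splits)
    overlaps: Dict[Tuple[str, str], Set[str]] = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = splits[first] & splits[second]
            if shared:
                overlaps[(first, second)] = shared
    return overlaps


# Hypothetical toy data: interaction pairs per split.
pairs_by_split = {
    "train": [("P1", "P2"), ("P2", "P3")],
    "val": [("P4", "P5")],
    "test": [("P3", "P6")],
}

proteins_by_split = {
    name: proteins_in_pairs(pairs) for name, pairs in pairs_by_split.items()
}
leaks = split_overlaps(proteins_by_split)
print(leaks or "no overlap between splits")
```

In this toy data, `P3` appears in both a training pair and a testing pair, so it is reported as shared between those two splits.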

You can call this function from the CLI using:

ppi_origami analysis verify_rapppid DATASET_PATH --taxon_ids [9606,10090] --sample_fraction 0.05 --skip_protein_overlap False --skip_gotoh False --skip_sw False --skip_taxa_check False
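If you drive the CLI from a script, the command above can be assembled as an argument list. This is a sketch only: `build_verify_command` is a hypothetical helper, not part of PPI Origami, and the flag spelling simply mirrors the example command shown above. The dataset path is a placeholder.

```python
from pathlib import Path
from typing import List, Optional


def build_verify_command(
    dataset_path: Path,
    taxon_ids: Optional[List[int]] = None,
    sample_fraction: Optional[float] = None,
    skip_protein_overlap: bool = False,
    skip_gotoh: bool = False,
    skip_sw: bool = False,
    skip_taxa_check: bool = False,
) -> List[str]:
    """Build the argument list for `ppi_origami analysis verify_rapppid`."""
    cmd = ["ppi_origami", "analysis", "verify_rapppid", str(dataset_path)]
    if taxon_ids is not None:
        # Same bracketed, comma-separated form as the example command.
        cmd += ["--taxon_ids", "[" + ",".join(str(t) for t in taxon_ids) + "]"]
    if sample_fraction is not None:
        cmd += ["--sample_fraction", str(sample_fraction)]
    flags = {
        "skip_protein_overlap": skip_protein_overlap,
        "skip_gotoh": skip_gotoh,
        "skip_sw": skip_sw,
        "skip_taxa_check": skip_taxa_check,
    }
    for name, value in flags.items():
        cmd += [f"--{name}", str(value)]
    return cmd


# "rapppid_dataset" is a placeholder path, not a real file.
command = build_verify_command(
    Path("rapppid_dataset"), taxon_ids=[9606, 10090], sample_fraction=0.05
)
print(" ".join(command))

# To actually run it (requires ppi_origami to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list to `subprocess.run` bypasses the shell, so the square brackets in `--taxon_ids` need no quoting.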
Parameters:
  • dataset_path (pathlib.Path) – The path to the RAPPPID dataset.

  • taxon_ids (Optional[List[int]]) – A list of taxon IDs that every protein must belong to. If a protein from any other taxon is detected, the test fails. Defaults to None.

  • sample_fraction (Optional[float]) – The fraction of dataset samples used for the sequence-similarity and protein-organism checks. Samples are chosen at random. Set to 1 or None to test the whole dataset. Defaults to None.

  • skip_protein_overlap (bool) – Skip the protein overlap test. Defaults to False.

  • skip_gotoh (bool) – Skip checking the sequence similarity with the Gotoh algorithm. Defaults to False.

  • skip_sw (bool) – Skip checking the sequence similarity with the Smith-Waterman algorithm. Defaults to False.

  • skip_taxa_check (bool) – Skip checking the protein’s organism. Defaults to False.

Returns:

None
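To make the roles of `sample_fraction` and the sequence-similarity check more concrete, here is a rough sketch of sampling a fraction of proteins and scoring cross-split pairs with a basic Smith-Waterman local alignment. It is an illustration under stated assumptions, not PPI Origami's implementation: the scoring values (match 2, mismatch -1, linear gap -2), the fixed seed, the rounding rule, and the toy sequences are all made up here. A real check would likely use a substitution matrix, and Gotoh's algorithm is generally associated with affine gap penalties, which this sketch omits.

```python
import random
from typing import List, Optional


def sample_ids(ids: List[str], sample_fraction: Optional[float], seed: int = 0) -> List[str]:
    """Randomly keep a fraction of the IDs; 1 or None keeps all of them."""
    if sample_fraction is None or sample_fraction >= 1:
        return list(ids)
    rng = random.Random(seed)
    # Keep at least one ID when the input is non-empty (an assumption of this sketch).
    n_keep = min(len(ids), max(1, round(len(ids) * sample_fraction)))
    return rng.sample(ids, n_keep)


def smith_waterman_score(
    a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2
) -> int:
    """Best local alignment score between a and b with a linear gap penalty."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# Hypothetical toy sequences for two splits.
train = {"P1": "MKTAYIAK", "P2": "GGSLLNQW"}
test = {"P3": "XXTAYIXX", "P4": "PPPPPPPP"}

train_ids = sample_ids(sorted(train), sample_fraction=None)
test_ids = sample_ids(sorted(test), sample_fraction=None)

# Score every sampled train/test pair and report the most similar one.
scores = {
    (t, s): smith_waterman_score(train[t], test[s])
    for t in train_ids
    for s in test_ids
}
most_similar = max(scores, key=scores.get)
print(most_similar, scores[most_similar])
```

A high score between a training sequence and a testing sequence suggests the two splits share similar proteins. What threshold counts as "too similar" is a judgment call that this sketch does not settle.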