
Preparing datasets for training models that predict protein-protein interactions (PPIs) is a deceptively fraught process.

Several studies have shown that insufficiently controlling for protein identity across cross-validation splits can lead to serious over-fitting of PPI prediction methods [1-3].

PPI Origami uses the notation from Park and Marcotte, which defines three types of PPI cross-validation datasets [1]:

  • C3 - Proteins that constitute interactions in one split (i.e., training, validation, or test) are not found in any other split.

  • C2 - No more than one protein in a given interaction may be found in another split.

  • C1 - No restriction on protein split membership. Interactions are randomly assigned to a split.
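As an illustration of the three definitions above, the following minimal sketch reports the strictest class that one split satisfies relative to another. It assumes interactions are stored as pairs of protein identifiers; the function names `proteins_in` and `strictest_class` are hypothetical and are not part of PPI Origami.

```python
from typing import Iterable, Set, Tuple

Interaction = Tuple[str, str]


def proteins_in(interactions: Iterable[Interaction]) -> Set[str]:
    """Collect every protein that appears in at least one interaction."""
    return {protein for pair in interactions for protein in pair}


def strictest_class(split: Iterable[Interaction], other: Iterable[Interaction]) -> str:
    """Return the strictest class ("C3", "C2" or "C1") that `split` meets relative to `other`.

    Assumption: a self-interaction (a, a) counts as a single shared protein.
    """
    other_proteins = proteins_in(other)
    most_shared = 0
    for a, b in split:
        # Number of distinct proteins in this interaction that also occur in the other split.
        shared = len({a, b} & other_proteins)
        most_shared = max(most_shared, shared)
    # 0 shared proteins everywhere -> C3; at most 1 -> C2; otherwise no restriction holds -> C1.
    return {0: "C3", 1: "C2", 2: "C1"}[most_shared]


train = [("A", "B"), ("C", "D")]
print(strictest_class([("E", "F")], train))
print(strictest_class([("A", "E")], train))
print(strictest_class([("A", "C")], train))
```

The check is pairwise, so a three-way split would need it applied to each pair of splits.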

In addition, PPI Origami ensures that C3 datasets meet the following two criteria outlined in the INTREPPPID manuscript.

First, let \(P_{\textsf{Tr}}\), \(P_{\textsf{Te}}\), and \(P_{\textsf{V}}\) be the sets of proteins present in the interactions of the training, testing, and validation splits, respectively.

Further, let \(\mathcal{P} = \{P_{\textsf{Tr}}, P_{\textsf{Te}}, P_{\textsf{V}}\}\) be the collection of these protein sets.

Criterion 1 - Distinct Protein Identity

The protein sets \(P_{\textsf{Tr}}\), \(P_{\textsf{Te}}\), and \(P_{\textsf{V}}\) must be mutually disjoint:

\[\forall Q, R \in \mathcal{P}, Q \cap R = \varnothing \textsf{ if } Q \neq R.\]
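Criterion 1 can be checked directly with set intersections. The sketch below is illustrative only: it assumes each split's proteins are available as a Python set keyed by split name, and the function name `find_protein_overlaps` is hypothetical.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def find_protein_overlaps(protein_sets: Dict[str, Set[str]]) -> Dict[Tuple[str, str], Set[str]]:
    """Return the proteins shared by each pair of splits; an empty dict means Criterion 1 holds."""
    overlaps = {}
    # Compare every unordered pair of distinct splits (Q != R).
    for (name_q, q), (name_r, r) in combinations(protein_sets.items(), 2):
        shared = q & r
        if shared:
            overlaps[(name_q, name_r)] = shared
    return overlaps


splits = {
    "train": {"A", "B"},
    "test": {"C"},
    "val": {"B", "D"},
}
print(find_protein_overlaps(splits))
```

Returning the offending proteins, rather than a bare boolean, makes it easier to see which split assignment needs to change.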

Criterion 2 - Distinct Sequence Identity

No protein in one split may be more than 90% similar in sequence to any protein in another split:

\[\forall Q, R \in \mathcal{P}, \;\;\;\; \forall q \in Q, \;\;\;\; \forall r \in R,\;\;\;\; f(q,r) \leq 90\% \;\;\;\; \textsf{ if } \;\;\;\; Q \neq R,\]

where \(f\) is some sequence similarity metric. We use UniRef cluster membership for sequence similarity.
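One way to read the cluster-based check is that two proteins are treated as too similar whenever they share a cluster, so no cluster may contribute proteins to more than one split. The sketch below works under that assumption. The mapping `cluster_of` from protein identifier to cluster identifier, the identifiers themselves, and the function name are all hypothetical; this is not necessarily how PPI Origami implements the check.

```python
from collections import defaultdict
from typing import Dict, Set


def clusters_spanning_splits(
    protein_sets: Dict[str, Set[str]],
    cluster_of: Dict[str, str],
) -> Dict[str, Set[str]]:
    """Return clusters whose members occur in more than one split.

    An empty result means Criterion 2 holds under the cluster-membership assumption.
    """
    splits_by_cluster = defaultdict(set)
    for split_name, proteins in protein_sets.items():
        for protein in proteins:
            # Assumption: a protein with no known cluster forms its own singleton cluster.
            cluster = cluster_of.get(protein, protein)
            splits_by_cluster[cluster].add(split_name)
    return {c: names for c, names in splits_by_cluster.items() if len(names) > 1}


splits = {"train": {"P1", "P2"}, "test": {"P3"}, "val": {"P4"}}
cluster_of = {"P1": "c1", "P2": "c2", "P3": "c1", "P4": "c4"}
print(clusters_spanning_splits(splits, cluster_of))
```

Because a protein always shares a cluster with itself, a dataset that passes this check also satisfies Criterion 1.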


  1. Park, Yungki and Edward M. Marcotte. “A flaw in the typical evaluation scheme for pair-input computational predictions.” Nature Methods 9 (2012): 1134-1136.

  2. Hamp, Tobias and Burkhard Rost. “More challenges for machine-learning protein interactions.” Bioinformatics 31(10) (2015): 1521-1525.

  3. Bernett, Judith, David B. Blumenthal and Markus List. “Cracking the black box of deep sequence-based protein-protein interaction prediction.” bioRxiv (2023): n. pag.