Data#

Pretrained Weights#

You can download the pretrained weights used in the INTREPPPID manuscript from the GitHub releases page.

Precomputed Datasets#

You can download precomputed datasets from the sources below:

Zenodo (DOI: 10.5281/zenodo.10594149)
Internet Archive

All datasets are made available under the Creative Commons Attribution-ShareAlike 4.0 International license.

Dataset Format#

INTREPPPID requires datasets to be stored as HDF5 files.

Each INTREPPPID dataset must have the following hierarchical structure:

intrepppid.h5
├── orthologs
├── sequences
│
├── splits
│   ├── test
│   ├── train
│   └── val
│
└── interactions
    ├── c1
    │    ├── c1_train
    │    ├── c1_val
    │    └── c1_test
    │
    ├── c2
    │    ├── c2_train
    │    ├── c2_val
    │    └── c2_test
    │
    └── c3
         ├── c3_train
         ├── c3_val
         └── c3_test

Only one of the “c” folders under “interactions” needs to be present, so long as it is the dataset you specify in the train step with the --c_type flag.
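The layout above can be checked before training. The following is a minimal sketch rather than part of INTREPPPID: the helper names `required_nodes` and `missing_nodes` are hypothetical, and it assumes that `--c_type` selects among `c1`, `c2` and `c3` by number. It works on a plain list of node paths, so it does not depend on any particular HDF5 library.

```python
from typing import Iterable, List

# Nodes every dataset needs, taken from the tree above.
COMMON_NODES = [
    "/orthologs",
    "/sequences",
    "/splits/test",
    "/splits/train",
    "/splits/val",
]


def required_nodes(c_type: int) -> List[str]:
    """List the node paths expected for one interaction dataset.

    Assumption: c_type is 1, 2 or 3 and maps to the c1, c2 or c3 folder.
    """
    if c_type not in (1, 2, 3):
        raise ValueError("c_type must be 1, 2 or 3")
    prefix = f"/interactions/c{c_type}/c{c_type}"
    splits = ("train", "val", "test")
    return COMMON_NODES + [f"{prefix}_{split}" for split in splits]


def missing_nodes(present: Iterable[str], c_type: int) -> List[str]:
    """Return the expected node paths that are absent from `present`."""
    present_set = set(present)
    return [node for node in required_nodes(c_type) if node not in present_set]


# Example: a file that only ships the c3 interactions.
example_paths = [
    "/orthologs",
    "/sequences",
    "/splits/test",
    "/splits/train",
    "/splits/val",
    "/interactions/c3/c3_train",
    "/interactions/c3/c3_val",
    "/interactions/c3/c3_test",
]

print(missing_nodes(example_paths, c_type=3))  # → []
print(missing_nodes(example_paths, c_type=1))  # lists the three absent c1 tables
```

The paths can be gathered with any HDF5 reader. With h5py, for instance, `File.visit` passes node names relative to the root, so a leading `/` would need to be added before calling `missing_nodes`.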

Here is the schema for the tables:

`orthologs` schema#
Field Name	Type	Example	Description
`ortholog_group_id`	`Int64`	`1048576`	The OMA Group ID of the protein in the `protein_id` column
`protein_id`	`String`	`M7ZLH0`	The UniProt accession of a protein with OMA Group ID `ortholog_group_id`

`sequences` schema#
Field Name	Type	Example	Description
`name`	`String`	`Q9NZE8`	The UniProt accession that corresponds to the amino acid sequence in the `sequence` column.
`sequence`	`String`	`MAASAFAGAVRAASGILRPLNI`…	The amino acid sequence indicated by the `name` column.

Schema for all tables under `interactions`#
Field Name	Type	Example	Description
`protein_id1`	`String`	`Q9BQB4`	The UniProt accession of the first protein in the interaction pair.
`protein_id2`	`String`	`Q9NYF0`	The UniProt accession of the second protein in the interaction pair.
`omid_protein_id`	`String`	`C1MTX6`	The UniProt accession of the anchor protein for the orthologous locality loss.
`omid_id`	`Int64`	`737336`	The OMA Group ID of the anchor protein, from which a positive protein can be chosen for the orthologous locality loss.
`label`	`Bool`	`False`	Label indicating whether `protein_id1` and `protein_id2` interact with one another.
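
To illustrate how the `omid_id` column relates to the `orthologs` table, the sketch below indexes `orthologs` rows by group ID and looks up the proteins that share a group with an anchor. It is only an illustration of the schema: how INTREPPPID actually samples a positive protein is not described here, the function names are hypothetical, excluding the anchor from its own candidates is an assumption, and the protein IDs are made-up placeholders rather than real UniProt accessions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_members(orthologs: Iterable[Tuple[int, str]]) -> Dict[int, List[str]]:
    """Index (ortholog_group_id, protein_id) rows by OMA Group ID."""
    members: Dict[int, List[str]] = defaultdict(list)
    for group_id, protein_id in orthologs:
        members[group_id].append(protein_id)
    return dict(members)


def positive_candidates(
    members: Dict[int, List[str]], omid_id: int, omid_protein_id: str
) -> List[str]:
    """Proteins in the anchor's OMA group, excluding the anchor itself.

    Assumption: the anchor should not be its own positive example.
    """
    return [p for p in members.get(omid_id, []) if p != omid_protein_id]


# Placeholder rows in the shape of the `orthologs` table.
ortholog_rows = [
    (737336, "ANCHOR"),
    (737336, "ORTHOLOG_A"),
    (737336, "ORTHOLOG_B"),
    (1048576, "UNRELATED"),
]

members = group_members(ortholog_rows)

# An interaction row would supply omid_id and omid_protein_id.
print(positive_candidates(members, omid_id=737336, omid_protein_id="ANCHOR"))
# → ['ORTHOLOG_A', 'ORTHOLOG_B']
```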