tfrecord_loader module


tfrecord_loader.get_loader(genes_no: int, file_path: str | List[str], batch_size: int, splits: Dict[str, float] | None = None, description: List[str] | Dict[str, str] | None = None, compression_type: str | None = 'gzip', multi_read: bool | None = False, get_clusters: bool | None = False) → DataLoader

Returns an iterable PyTorch DataLoader over a dataset read from the given tfrecord file(s).

Currently used to create data loaders from the preprocessed PBMC dataset in tfrecord format from scGAN (Marouf et al., 2020). The description parameter and the post_process function can be modified to accommodate other tfrecord datasets.

Parameters:
  • genes_no (int) – Number of genes in the expression matrix.

  • file_path (Union[str, List[str]]) – Path to a single tfrecord file (multi_read=False), or a file pattern for reading multiple tfrecords (multi_read=True), e.g. /path/{}.tfrecord.

  • batch_size (int) – Training batch size.

  • splits (Optional[Dict[str, float]], optional) – Dictionary of (key, value) pairs, where the key is used to construct the data and index path(s) and the value determines the contribution of each split to the batch. Provide when reading from multiple tfrecords (multi_read=True), by default None.

  • description (Union[List[str], Dict[str, str], None], optional) – List of keys or dict of (key, value) pairs to extract from each record. The keys are the feature names and the values are their types (“byte”, “float”, or “int”), by default {“indices”: None, “values”: None}.

  • compression_type (Optional[str], optional) – The type of compression used for the tfrecord. Either “gzip” or None, by default “gzip”.

  • multi_read (Optional[bool], optional) – Specifies whether to construct the dataset from multiple tfrecords. If True, a file pattern should be passed to file_path, by default False.

  • get_clusters (Optional[bool], optional) – If True, the returned data loader will contain the cluster label of cells in addition to their gene expression values, by default False.
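
The interaction between file_path and splits in multi-read mode can be illustrated with a small standalone sketch. It assumes that each key in splits is substituted into the {} placeholder of the file pattern, and that the values act as relative weights that are normalized to give each split's share of a batch. Neither detail is confirmed by this page, and the helper names (expand_pattern, expected_batch_counts) and the split keys ("train", "valid") are hypothetical.

```python
from typing import Dict


def expand_pattern(file_pattern: str, splits: Dict[str, float]) -> Dict[str, str]:
    """Fill the `{}` placeholder of the pattern with each split key.

    Assumption: get_loader builds one path per key in this way.
    """
    return {key: file_pattern.format(key) for key in splits}


def expected_batch_counts(splits: Dict[str, float], batch_size: int) -> Dict[str, float]:
    """Expected number of samples per batch contributed by each split.

    Assumption: the split values are relative weights that get normalized.
    """
    total = sum(splits.values())
    return {key: batch_size * weight / total for key, weight in splits.items()}


# Hypothetical split names and weights, for illustration only.
splits = {"train": 0.75, "valid": 0.25}

paths = expand_pattern("/path/{}.tfrecord", splits)
counts = expected_batch_counts(splits, batch_size=32)

for key in splits:
    print(f"{key}: {paths[key]} (about {counts[key]:.0f} samples per batch)")
```

Under these assumptions, the same splits dictionary both selects which files are read and sets how heavily each one is sampled.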

Returns:

Iterable data loader over the dataset.

Return type:

DataLoader
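
A call to get_loader might look like the sketch below, which uses only the parameters documented in the signature above. The file paths, the genes_no value, the batch size, and the split keys are placeholders rather than values from the scGAN dataset. The import is guarded so that the snippet still runs where the module or the files are unavailable.

```python
import os

try:
    from tfrecord_loader import get_loader
except ImportError:  # module not installed in this environment
    get_loader = None

# Single-file mode: file_path points at one tfrecord (placeholder path).
single_read_kwargs = dict(
    genes_no=1000,                     # placeholder gene count
    file_path="/path/train.tfrecord",  # placeholder path
    batch_size=32,
    compression_type="gzip",
    multi_read=False,
    get_clusters=True,                 # also return cluster labels
)

# Multi-file mode: file_path is a pattern and splits must be provided.
multi_read_kwargs = dict(
    genes_no=1000,
    file_path="/path/{}.tfrecord",
    batch_size=32,
    splits={"train": 0.75, "valid": 0.25},  # hypothetical split keys
    multi_read=True,
)

loader = None
if get_loader is not None and os.path.exists(single_read_kwargs["file_path"]):
    loader = get_loader(**single_read_kwargs)
    first_batch = next(iter(loader))  # the loader is iterable, one batch per step
```

The two keyword dictionaries differ only in file_path, splits, and multi_read, which are the arguments the parameter descriptions tie together.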