Tutorial#

CLI#

GRouNdGAN comes with a command-line interface. This section outlines available commands and arguments.

To use the CLI, run the src/main.py script with the desired command and any applicable options.

Important

Use python3.9 instead of python if you're running through Docker or Singularity.

$ python src/main.py --help
usage: GRouNdGAN [-h] --config CONFIG [--preprocess] [--create_grn] [--train] [--generate]

GRouNdGAN is a gene regulatory network (GRN)-guided causal implicit generative model for
simulating single-cell RNA-seq data, in-silico perturbation experiments, and benchmarking GRN
inference methods. This program also contains cWGAN and unofficial implementations of scGAN and
cscGAN (with projection conditioning)

required arguments:
--config CONFIG  Path to the configuration file

optional arguments:
--preprocess     Preprocess raw data for GAN training
--create_grn     Infer a GRN from preprocessed data using GRNBoost2
                 and format it as a causal graph
--train          Start or resume model training
--generate       Simulate single-cell RNA-seq data in-silico

There are essentially four commands available: --preprocess, --create_grn, --train, and --generate. You must provide a config file containing inputs and hyperparameters with each command through the --config flag.

Note

You can run commands individually:

python src/main.py --config configs/causal_gan.cfg --preprocess

Or chain them together to run all or multiple steps in one go:

python src/main.py --config configs/causal_gan.cfg --preprocess --create_grn --train --generate

Config Files#

GRouNdGAN uses a configuration syntax similar to INI, implemented by Python's configparser module.

We provide three sample config files in the configs/ directory:

  • causal_gan.cfg: for GRouNdGAN

  • conditional_gan.cfg: for cscGAN with projection conditioning (Marouf et al., 2020) and cWGAN

  • gan.cfg: for scGAN (Marouf et al., 2020) (we use this to train GRouNdGAN’s causal controller)

Most of the configuration file consists of hyperparameters. You only need to modify the input and output parameters, which we go through in each section below. GRouNdGAN isn't very sensitive to hyperparameters, but it is still advisable to test different hyperparameter choices using a validation set.

Below is the demo causal_gan.cfg config file for training GRouNdGAN using the PBMC68k dataset:

[EXPERIMENT]
output directory = results/GRouNdGAN
device = cuda ; we will let the program choose what is available
checkpoint  ; set value to use a trained model

[Preprocessing]
10x = True
raw = data/raw/PBMC/
validation set size = 1000
test set size = 1000
annotations = data/raw/PBMC/barcodes_annotations.tsv
min cells = 3 ; genes expressed in less than 3 cells are discarded
min genes = 10 ; cells with less than 10 genes expressed are discarded
library size = 20000 ; library size used for library-size normalization
louvain res = 0.15 ; Louvain clustering resolution (higher resolution means finding more and smaller clusters)
highly variable number = 1000 ; number of highly variable genes to identify

[GRN Preparation]
TFs = data/raw/Homo_sapiens_TF.csv
k = 15 ; k is the number of top most important TFs per gene to include in the GRN
Inferred GRN = data/processed/PBMC/inferred_grnboost2.csv

[Data]
train = data/processed/PBMC/PBMC68k_train.h5ad
validation = data/processed/PBMC/PBMC68k_validation.h5ad
test = data/processed/PBMC/PBMC68k_test.h5ad
number of genes = 1000
causal graph = data/processed/PBMC/causal_graph.pkl

[Generation]
number of cells to generate = 10000

[Model]
type = causal GAN
noise per gene = 1
depth per gene = 3
width per gene = 2
critic layers = 1024 512 256
labeler layers = 2000 2000 2000
latent dim = 128 ; noise vector dimensions
library size = 20000 ; UMI count
lambda = 10 ; regularization hyper-parameter for gradient penalty


[Training]
batch size = 1024
critic iterations = 5 ; iterations to train the critic for each iteration of the generator
maximum steps = 1000000
labeler and antilabeler training intervals = 1

    [Optimizer]
    ; coefficients used for computing running averages of gradient and its square
    beta1 = 0.5
    beta2 = 0.9

    [Learning Rate]
    generator initial = 0.001
    generator final = 0.0001
    critic initial = 0.001
    critic final = 0.001
    labeler = 0.0001
    antilabeler = 0.0001


    [Logging]
    summary frequency = 10000
    plot frequency = 10000
    save frequency = 100000

[CC Model]
type = GAN ; Non-conditional single-cell RNA-seq GAN
generator layers = 256 512 1024
critic layers = 1024 512 256
latent dim = 128 ; noise vector dimensions
library size = 20000 ; UMI count (hardcoded to None in the code)
lambda = 10 ; regularization hyper-parameter for gradient penalty


[CC Training]
batch size = 128
critic iterations = 5 ; iterations to train the critic for each iteration of the generator
maximum steps = 200000

    [CC Optimizer]
    ; coefficients used for computing running averages of gradient and its square
    beta1 = 0.5
    beta2 = 0.9

    [CC Learning Rate]
    generator initial = 0.0001
    generator final = 0.00001
    critic initial = 0.0001
    critic final = 0.00001

    [CC Logging]
    summary frequency = 10000
    plot frequency = 10000
    save frequency = 100000
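
Because the format follows configparser conventions, you can sanity-check a config's flat sections programmatically. The snippet below is only an illustrative sketch (GRouNdGAN uses its own loader, which also understands the indented subsections):

import configparser

# Sketch: read flat sections of a GRouNdGAN config file.
# allow_no_value accepts the bare "checkpoint" key; inline ";" comments are stripped.
cfg = configparser.ConfigParser(allow_no_value=True, inline_comment_prefixes=(";",))
cfg.read("configs/causal_gan.cfg")

print(cfg["EXPERIMENT"]["output directory"])  # results/GRouNdGAN
print(cfg.getint("Data", "number of genes"))  # 1000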

Project outline#

GRouNdGAN is structured as follows:

.
|-- .git
|-- .gitattributes
|-- .github
|   `-- workflows
|       |-- docker-build.yml
|       `-- documentation.yaml
|-- .gitignore
|-- .gitmodules
|-- Atkinson_Hyperlegible
|-- Beeline
|-- LICENSE
|-- README.md
|-- configs
|   |-- causal_gan.cfg
|   |-- conditional_gan.cfg
|   `-- gan.cfg
|-- data
|   |-- generated
|   |-- interim
|   |-- processed
|   |   |-- BoneMarrow
|   |   `-- PBMC
|   `-- raw
|       |-- BoneMarrow
|       |-- Homo_sapiens_TF.csv
|       |-- Mus_musculus_TF.csv
|       `-- PBMC
|-- docker
|   `-- Dockerfile
|-- docs
|-- notebooks
|-- requirements.txt
|-- requirements_computecanada.txt
|-- results
|-- scDesign2
|-- scGAN
|-- scripts
|   |-- monitor.sh
|   `-- train.sh
|-- sparsim
`-- src

Demo Datasets#

The provided docker image comes prepackaged with the unprocessed Mouse BoneMarrow (Paul et al., 2015) and Human PBMC68k (Zheng et al., 2017) datasets (data/raw/PBMC and data/raw/BoneMarrow) and human and mouse TFs, downloaded from AnimalTFDB (data/raw/Homo_sapiens_TF.csv and data/raw/Mus_musculus_TF.csv).

Note

If you have opted for a local installation, you can download these files from the URL used in the command below and place them in data/raw/.

Alternatively, the following commands will download and extract them for you (you need curl and tar installed):

curl https://nextcloud.computecanada.ca/index.php/s/WqrCqkH5zjYYMw9/download --output demo_data.tar &&
tar -xvf demo_data.tar -C data/raw/ &&
mv data/raw/demo/* data/raw &&
rm demo_data.tar &&
rm -rf data/raw/demo/

Steps#

Preprocessing#

Attention

Don’t skip the preprocessing step; GRouNdGAN requires library-size normalized data as input.

To run our preprocessing pipeline, your config file should contain the following arguments:

[EXPERIMENT]

    [Preprocessing]
    ; set True if data is 10x (like PBMC)
    ; set False if you're providing an .h5ad file (like BoneMarrow.h5ad)
    10x = True

    ; If 10x = True, path to the directory containing matrix.mtx, genes.tsv, and barcodes.tsv
    ; If 10x = False, path to the .h5ad file containing the expression matrix
    raw = data/raw/PBMC/

    validation set size = 1000 ; size of the validation set to create
    test set size = 1000 ; size of the test set to create
    annotations = data/raw/PBMC/barcodes_annotations.tsv ; optional, leave empty if you don't have annotations
    min cells = 3 ; genes expressed in less than 3 cells are discarded
    min genes = 10 ; cells with less than 10 genes expressed are discarded
    library size = 20000 ; library size used for library-size normalization
    louvain res = 0.15 ; Louvain clustering resolution (higher resolution means finding more and smaller clusters)
    highly variable number = 1000 ; number of highly variable genes to identify

    [Data]
    train = data/processed/PBMC/PBMC68k_train.h5ad ; path to output the train set
    validation = data/processed/PBMC/PBMC68k_validation.h5ad ; path to output the validation set
    test = data/processed/PBMC/PBMC68k_test.h5ad ; path to output the test set

Then, run the following:

$ python src/main.py --config configs/causal_gan.cfg --preprocess

Once completed, you will see a success message. Train, validation, and test sets should be created in the paths defined under the [Data] section of the config file.
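
As a quick sanity check, you can open the resulting files with anndata and confirm their shape and normalization. A minimal sketch, assuming the paths from the demo config:

import anndata

# Sketch: inspect the preprocessed train set produced above.
adata = anndata.read_h5ad("data/processed/PBMC/PBMC68k_train.h5ad")

print(adata.shape)              # (n_cells, 1000) highly variable genes
print(adata.X.sum(axis=1)[:5])  # each row should sum to ~20000 (the library size)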

GRN Creation#

Note

GRN creation isn’t needed for scGAN, cscGAN, or cWGAN; you can skip the --create_grn command.

This command uses GRNBoost2 (Moerman et al., 2018) to infer a GRN from the preprocessed train set. It then converts the result into the format that GRouNdGAN accepts.

In addition to what was required in the previous step, you need to provide the following arguments:

[GRN Preparation]
TFs = data/raw/Homo_sapiens_TF.csv ; Path to file containing TFs (accepts AnimalTFDB csv formats)
k = 15 ; k is the number of top most important TFs per gene to include in the GRN
Inferred GRN = data/processed/PBMC/inferred_grnboost2.csv ; where to write GRNBoost2's output

[Data]
causal graph = data/processed/PBMC/causal_graph.pkl ; where to write the created GRN

Run using:

$ python src/main.py --config configs/causal_gan.cfg --create_grn

Once done, you will see the properties of the created GRN:

Using 63 TFs for GRN inference.
preparing dask client
parsing input
creating dask graph
4 partitions
computing dask graph
shutting down client and local cluster
finished

Causal Graph
-----------------  ------------
TFs                          63
Targets                     937
Genes                      1000
Possible Edges            59031
Imposed Edges             14055
GRN density            0.238095
-----------------  ------------

The causal graph will be written to the path specified by [Data]/causal graph in the config file.
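
You can reproduce the numbers in this table from the pickle itself. A minimal sketch, assuming the dictionary format described in the next subsection (target gene index mapped to the set of its regulating TF indices):

import pickle

with open("data/processed/PBMC/causal_graph.pkl", "rb") as fp:
    causal_graph = pickle.load(fp)

targets = set(causal_graph)                        # regulated genes
tfs = set().union(*causal_graph.values())          # regulators
edges = sum(len(regs) for regs in causal_graph.values())

print(len(tfs), len(targets), edges)               # e.g., 63 937 14055
print(edges / (len(tfs) * len(targets)))           # density, e.g., ~0.238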

Imposing Custom GRNs#

It is possible to instead impose your own GRN on GRouNdGAN. If you opt for this, skip the --create_grn command. Instead, create a Python dictionary whose keys are target gene indices (int) and whose values are sets of indices (set[int]) corresponding to the TFs that regulate each gene.

[Figure: a sample GRN in which TFs TF1, TF2, and TFn regulate target genes G1, G2, G3, and Gn]

The GRN in the picture above can be written in dictionary form as:

causal_graph = {
    "G2": {"TF2", "TFn"},
    "G1": {"TF1"},
    "Gn": {"TF2", "TF1"},
    "G3": {"TFn", "TF1"}
}

Converting the key/value pairs into gene/TF indices, it becomes:

causal_graph = {
    1: {4, 5},
    3: {0},
    6: {4, 0},
    2: {5, 0}
}
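
In practice, your GRN will usually be expressed with gene and TF names, so you need to translate names into indices first. A minimal sketch, assuming the indices correspond to column positions (the adata.var_names order) of the preprocessed train set; verify this assumption against your own data:

import anndata

# Map gene names to column indices in the preprocessed train set.
adata = anndata.read_h5ad("data/processed/PBMC/PBMC68k_train.h5ad")
idx = {gene: i for i, gene in enumerate(adata.var_names)}

# Hypothetical name-based GRN translated into index form.
named_graph = {"G2": {"TF2", "TFn"}, "G1": {"TF1"}}
causal_graph = {idx[g]: {idx[tf] for tf in regs} for g, regs in named_graph.items()}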

Then, pickle the dictionary:

import pickle

with open("path/to/write/causal_graph.pkl", "wb") as fp:
    pickle.dump(causal_graph, fp, protocol=pickle.HIGHEST_PROTOCOL)

Don’t forget to edit the causal graph path in the config file:

[Data]
causal graph = path/to/write/causal_graph.pkl

Keep two constraints in mind (a validation sketch follows this list):

  • The GRN must be a directed bipartite graph (edges go from TFs to target genes)

  • All genes and TFs in the dataset must appear in the dictionary, either as a key (target gene) or inside a value set (as part of the set of TFs)
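
Below is a minimal validation sketch for these constraints; it assumes causal_graph is the index-based dictionary from above and that n_genes is the total number of genes (1000 in the demo config):

# Sketch: validate a custom causal graph before pickling it.
n_genes = 1000  # total genes, per the demo config

targets = set(causal_graph)
tfs = set().union(*causal_graph.values())

# Bipartite check (assumption: no target gene also acts as a regulator).
assert not (targets & tfs), "graph is not bipartite"
# Coverage check: every gene/TF index appears somewhere in the dictionary.
assert targets | tfs == set(range(n_genes)), "some genes are missing from the GRN"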

Warning

Construct a biologically meaningful GRN!

Imposing a GRN whose TF-gene relationships differ significantly from those observable in the reference dataset will deteriorate the quality of the simulated cells, since generating realistic data points and imposing the GRN then act as contradictory tasks.

Training#

You can start training the model using the following command:

$ python src/main.py --config configs/causal_gan.cfg --train

Upon running the command above, three folders will be created inside the path provided in the config file ([EXPERIMENT]/output directory) and the config file will be copied over:

  • checkpoints/: Containing .pth state dictionaries with the model’s weights, biases, etc.

  • TensorBoard/: Containing TensorBoard logs

  • TSNE/: Containing t-SNE plots of real vs simulated cells

You can change the save, logging, and plotting frequency (default every 10000 steps) in the config file.

Monitor training using TensorBoard:

tensorboard --logdir="{GAN OUTPUT DIR HERE}/TensorBoard" --host 0.0.0.0 --load_fast false &

We also provide two Slurm submission scripts, train.sh and monitor.sh, for training and monitoring in scripts/.

Note

  • Training time primarily depends on the number of genes and the density of the imposed GRN. It takes about five days with a very dense GRN (~20% density) containing 1000 genes on a single NVIDIA V100SXM2 (16 GB memory) GPU.

  • GRouNdGAN supports multi-GPU training, but we suggest sticking to a single GPU to avoid excess overhead.

  • GRouNdGAN trains for a million steps by default. It is not recommended to change this in the config file.

  • You can resume training from a checkpoint by setting [EXPERIMENT]/checkpoint in the config file to the .pth checkpoint you wish to use (see the example after this list).
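
For example, to resume training from a saved model, point the checkpoint field at the desired file; the filename below is hypothetical, so use whatever your checkpoints/ directory actually contains:

[EXPERIMENT]
output directory = results/GRouNdGAN
checkpoint = results/GRouNdGAN/checkpoints/step_1000000.pth ; hypothetical filename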

In-silico Single-Cell Simulation#

Once training is done, populate the [EXPERIMENT]/checkpoint field in the config file with the path of the .pth checkpoint you want to use (usually the latest).

You can change the number of cells to simulate in the config file (10000 by default):

[Generation]
number of cells to generate = 10000

Then run:

$ python src/main.py --config path/to/config_file --generate

This will output a simulated.h5ad file to [EXPERIMENT]/output directory containing the simulated expression matrix.
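
The simulated cells can then be loaded like any other AnnData file. A minimal sketch, assuming the demo output directory:

import anndata

# Sketch: load and inspect the simulated expression matrix.
simulated = anndata.read_h5ad("results/GRouNdGAN/simulated.h5ad")
print(simulated.shape)  # (10000, 1000): simulated cells x genes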

References#

Marouf, M., Machart, P., Bansal, V., Kilian, C., Magruder, D. S., Krebs, C., & Bonn, S. (2020). Realistic in silico generation and augmentation of single-cell RNA-seq data using generative adversarial networks. Nature Communications, 11(1). https://doi.org/10.1038/s41467-019-14018-z

Paul, F., Arkin, Y., Giladi, A., Jaitin, D. A., Kenigsberg, E., Keren-Shaul, H., Winter, D. R., Lara-Astiaso, D., Gury, M., Weiner, A., David, E., Cohen, N., Lauridsen, F. K. B., Haas, S., Schlitzer, A., Mildner, A., Ginhoux, F., Jung, S., Trumpp, A., … Tanay, A. (2015). Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell, 163(7), 1663–1677. https://doi.org/10.1016/j.cell.2015.11.013

Zheng, G., Terry, J. M., Belgrader, P., Ryvkin, P., Bent, Z., Wilson, R. J., Ziraldo, S. B., Wheeler, T. D., McDermott, G. P., Zhu, J., Gregory, M., Shuga, J., Montesclaros, L., Underwood, J. G., Masquelier, D. A., Nishimura, S. Y., Schnall-Levin, M., Wyatt, P., Hindson, C. M., … Bielas, J. H. (2017). Massively parallel digital transcriptional profiling of single cells. Nature Communications, 8(1). https://doi.org/10.1038/ncomms14049

Moerman, T., Aibar, S., González-Blas, C. B., Simm, J., Moreau, Y., Aerts, J., & Aerts, S. (2018). GRNBoost2 and Arboreto: efficient and scalable inference of gene regulatory networks. Bioinformatics, 35(12), 2159–2161. https://doi.org/10.1093/bioinformatics/bty916