Using InPheRNo-ChIP with Your Own Data#
InPheRNo-ChIP is a computational framework designed to reconstruct phenotype-relevant gene regulatory networks (GRNs) by integrating RNA-seq data, TF-specific ChIP-seq data, and phenotypic labels. This guide provides detailed instructions on how to prepare your own data to use with InPheRNo-ChIP, outlining necessary preprocessing steps, tools used, and considerations for optional steps. We will cover:
Necessary data inputs and their required formats.
Data preprocessing steps for each data type.
Optional steps that may enhance your analysis.
Tools and methods used in our analysis, which you can adopt or modify based on your needs.
Overview of Required Data In This Guide#
You will need the following data:
RNA-seq Data: A gene expression count matrix (genes × samples) containing expression values for both target genes and transcription factors (TFs) across samples with known phenotypic labels.
TF-specific ChIP-seq Data: ChIP-seq peak files in BED format for the TFs of interest in your study.
Phenotypic Labels: Labels indicating the phenotype of each sample (e.g., disease status, cell type).
Preparing RNA-seq Data#
In this section, we will explain how to prepare the RNA-seq data required for InPheRNo-ChIP, specifically focusing on generating the three essential inputs for InPheC_Step1.py, as shown in our GitHub repo’s README.md:
Input 1.1: Normalized counts (e.g.,
RNAseq_expr_voom_normalized.csv
)Input 1.2: P-values of gene-phenotype associations for significant genes (e.g.,
RNAseq_pval_gene-phenotype_significant.csv
)Input 1.3: P-values of gene-phenotype associations for all genes (e.g.,
RNAseq_pval_gene-phenotype_allGenes.csv
)
We will walk through the steps needed to prepare these inputs from your RNA-seq and phenotype data.
Approach:
Data Acquisition:
If you have raw RNA-seq reads, process them to generate a gene expression count matrix (genes × samples) using standard RNA-seq pipelines.
Alignment: Use aligners like STAR or HISAT2 to map reads to the reference genome.
Quantification: Use tools like featureCounts or HTSeq-count to obtain raw counts.
Alternatively, if you already have a gene expression count matrix, you can proceed directly.
Sample Selection and Quality Control:
Convert all gene identifiers to a consistent format (e.g., HGNC symbols) to ensure compatibility across datasets.
Assess the quality of your gene expression data.
Filter out low-quality samples or genes with low expression levels as appropriate for your analysis.
Optional: Remove genes located on sex chromosomes if sex-specific effects are not relevant to your study.
Normalization and Batch Effect Removal:
Normalization:
Normalize the count matrix to account for library size differences and other technical biases.
In our analysis, we used the TMM (Trimmed Mean of M-values) normalization method implemented in edgeR’s calcNormFactors function.
Alternative normalization methods include DESeq2’s median-of-ratios method or limma’s voom normalization.
Batch Effect Removal:
If your data comes from multiple sources or experiments, consider removing batch effects to minimize unwanted variability.
In our analysis, we used ComBat-seq from the sva package to adjust for batch effects.
Alternative tools include limma’s removeBatchEffect function or RUVSeq.
Transformation for Linear Modeling:
Transform the normalized count data to be suitable for linear modeling.
In our analysis, we applied the voom transformation using limma’s voom function with quantile normalization.
We then transformed each gene’s expression values across samples to follow a standard normal distribution (mean zero, variance one), ensuring comparability across genes.
Differential Expression Analysis:
Perform differential expression (DE) analysis to identify (non-TF) genes associated with your phenotype.
In our analysis, we used edgeR to perform DE analysis, incorporating factors like batch and lineage in the design matrix.
Alternative tools include DESeq2.
Prepare Input Files:
Save the voom-transformed and inverse-quantile normalized expression data from step 4 here to a CSV file (e.g., input 1.1 in the GitHub repo’s README.md):
RNAseq_expr_voom_normalized.csv
).Save the DE results for significant genes from step 5 here in a CSV file (e.g., input 1.2:
RNAseq_pval_gene-phenotype_significant.csv
).Save the DE results for all genes (not just significant ones) from step 5 here in a CSV file (e.g., input 1.3:
RNAseq_pval_gene-phenotype_allGenes.csv
)All three files should be placed in
./Data/RNA_seq/
Notes:
Confounding Variables: Include factors such as batch, gender, or other experimental conditions in your design matrix to control for their effects during DE analysis.
TFs in Expression Data: Ensure you have expression data for the TFs you plan to include in your GRN.
Preparing ChIP-seq Data#
In this section, we will explain how to prepare ChIP-seq data.
Approach:
Data Acquisition:
Obtain ChIP-seq peak files (BED format) for your TFs of interest.
If you have raw ChIP-seq reads, process them to generate peak files using standard ChIP-seq analysis pipelines.
Tools for processing raw reads include aligners like Bowtie2 or BWA and peak-calling software such as MACS2.
Alternatively, use publicly available ChIP-seq peak datasets relevant to your study.
Sample Filtering:
Convert all gene identifiers to a consistent format (e.g., HGNC symbols) to ensure compatibility across datasets.
Remove low-quality samples and those not matching your criteria.
In our analysis, we filtered out samples with low quality or incorrect annotations based on metadata review.
Combining Replicates:
If you have biological replicates, combine peaks from replicates to identify reproducible binding sites.
In our analysis, we used the IDR (Irreproducible Discovery Rate) framework from ENCODE to combine peaks from replicates and select reproducible binding sites.
Filtering Peaks:
Remove peaks in blacklisted regions to avoid artifacts and regions prone to mapping errors.
In our analysis, we used BEDTools to exclude blacklisted regions, such as the ENCODE blacklist regions.
Assigning Peaks to Genes:
Assign peaks to genes based on proximity to transcription start sites (TSS) or other genomic features.
In our analysis, we used T-Gene from MEME suite (link) to associate peaks with potential target genes.
Prepare Input Files:
For each TF, extract p-values or scores for each TF-gene pair from the peak assignment results and output to a .BED file.
Place processed BED files in the
./Data/ChIP_GSExxx
directory (replace GSExxx with the appropriate identifier).