Script Reference Guide#

This guide provides detailed information on function parameters and code documentation for InPheRNo-ChIP. The guide is divided into three main sections, corresponding to the three scripts of the InPheRNo-ChIP framework:

InPheRNo-ChIP Step 1#

The first step processes RNA-seq and ChIP-seq data to obtain three sets of p-values.

Basic Usage:

python InPheC_Step1.py

Advanced Usage:

This script is part of the InPheRNo-ChIP project and is responsible for processing and analyzing RNA-seq and ChIP-seq data. It includes functionality for:

Parsing command-line arguments to configure paths and options for data processing.

Loading ChIP-seq and RNA-seq data from specified directories: ./Data/

Processing and filtering data to ensure consistency among gene and TF sets.

Performing Elastic Net regression on gene and TF expression data.

Usage:

Run the script directly from the command line: python InPheC_Step1.py

For detailed argument options, use the help option: python InPheC_Step1.py --help

Note: Ensure that all dependencies are installed and the Python environment is correctly set up for running this script.

class InPheC_Step1.DataLoader(chip_seq_dir, rna_seq_dir)[source]#

Bases: object

Loads and prepares RNA-seq and ChIP-seq data from specified directories.

Parameters:

chip_seq_dir (str) – Path to the ChIP-seq data directory.
rna_seq_dir (str) – Path to the RNA-seq data directory.

load_chip_data(lineage_of_interest)[source]#

Loads and merges ChIP-seq data files for specified lineages.

Parameters:: lineage_of_interest (list) – Lineages to include in the analysis.
Returns:: Combined DataFrame of ChIP-seq data across specified lineages.
Return type:: pandas.DataFrame

load_rna_data(f_exp, f_gp)[source]#

Loads RNA-seq expression data and associated gene-phenotype p-values from specified files.

Parameters:

f_exp (str) – Filename of the RNA-seq expression data file.
f_gp (str) – Filename of the gene-phenotype p-values data file.

Returns:

Tuple containing expression data and gene-phenotype p-values.

Return type:

(pandas.DataFrame, pandas.DataFrame)

load_tf_list(f_tf)[source]#

Loads a list of transcription factors from a specified CSV file.

Parameters:: f_tf (str) – Path to the CSV file containing the list of transcription factors.
Returns:: DataFrame containing the list of transcription factors.
Return type:: pandas.DataFrame

class InPheC_Step1.DataProcessor(chip_data, GEx_gene_tf, gene_phenotype, hm_tf_df, chip_pvalue_of_interest)[source]#

Bases: object

Processes and filters RNA-seq and ChIP-seq data to ensure consistency among gene and TF sets. This class is crucial for preparing the data for subsequent analysis, including filtering based on shared genes and TFs.

filter_data()[source]#

Filters RNA-seq and ChIP-seq data to ensure consistency among gene and transcription factor (TF) sets, and prepares the data for subsequent analysis steps. This method processes the data by:

Identifying common genes and TFs across different datasets.

Filtering ChIP-seq data to keep rows that have genes and TFs present in both lineages.

Filtering to keep the minimal p-value across peaks for each (TF, gene, lineage) combination.

Preparing expression data matrices for genes and TFs for Elastic Net regression.

Returns:: A tuple containing:

pandas.DataFrame: Filtered ChIP-seq data with minimal p-values across peaks, reset index.
pandas.DataFrame: Gene expression data for shared genes.
pandas.DataFrame: TF expression data for shared TFs.
pandas.DataFrame: Gene-phenotype data for shared genes.
Return type:: tuple
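The "minimal p-value across peaks" step can be illustrated with a small sketch. It is written in plain Python to stay dependency-free, whereas the script itself operates on pandas DataFrames; the function name and the tuple layout of the input rows are assumptions made for illustration only.

```python
def min_pvalue_per_key(rows):
    """Keep the smallest p-value for each (TF, gene, lineage) combination.

    `rows` is an iterable of (tf, gene, lineage, p_value) tuples, one per
    ChIP-seq peak (hypothetical layout, not the script's actual schema).
    """
    best = {}
    for tf, gene, lineage, p_value in rows:
        key = (tf, gene, lineage)
        # Keep the first value seen, then replace it only by smaller ones.
        if key not in best or p_value < best[key]:
            best[key] = p_value
    return best


# Two peaks for the same (TF, gene, lineage) collapse to the smaller p-value.
peaks = [
    ("TF1", "GeneA", "lineage1", 0.04),
    ("TF1", "GeneA", "lineage1", 0.001),
    ("TF1", "GeneA", "lineage2", 0.2),
    ("TF2", "GeneB", "lineage1", 0.5),
]
min_p = min_pvalue_per_key(peaks)
```

With a DataFrame, the same reduction would typically be a group-by on the three key columns followed by a minimum.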

class InPheC_Step1.ElasticNetProcessor(expr_gene, expr_tf, max_num_coefs=None)[source]#

Bases: object

Conducts Elastic Net regression to identify relationships between gene expressions and transcription factors within the InPheRNo-ChIP project. It dynamically adjusts model parameters to optimize fit and manage convergence.

Parameters:

expr_gene (pandas.DataFrame) – DataFrame containing gene expression data.
expr_tf (pandas.DataFrame) – DataFrame containing TF expression data.
max_num_coefs (int, optional) – Maximum number of non-zero coefficients allowed in the model; controls model complexity.

perform_elastic_net()[source]#

Performs Elastic Net regression across all genes, using transcription factor expressions to predict gene activity.

Returns:: DataFrame with p-values indicating the significance of associations between genes and TFs.
Return type:: pandas.DataFrame
Note:: This function is adapted from the InPheRNo paper with updated eps and n_alphas values to improve model performance and stability.

class InPheC_Step1.Step1ArgumentParser[source]#

Bases: object

Handles command-line arguments for the InPheRNo-ChIP project, setting up and parsing them.

Parameters:

parser (argparse.ArgumentParser) – Configures the command-line arguments.
args (argparse.Namespace) – Stores the values of parsed arguments.

parse_args()[source]#: Parses the command-line arguments.

InPheC_Step1.save_output_files(args, chip_gt, rna_gt, rna_gp)[source]#

Saves processed data to specified output files as part of the InPheRNo-ChIP project. This function handles the creation of output directories and manages the naming conventions for output files based on the provided command-line arguments.

Parameters:

args (argparse.Namespace) – Parsed command-line arguments containing output file paths and names.
chip_gt (pandas.DataFrame) – Processed ChIP-seq data to be saved.
rna_gt (pandas.DataFrame) – Processed RNA-seq gene-TF data to be saved.
rna_gp (pandas.DataFrame) – Processed RNA-seq gene-phenotype data to be saved.

InPheC_Step1.step1_main()[source]#

Main function for the first step in the InPheRNo-ChIP project pipeline. Orchestrates the execution of data loading, processing, Elastic Net regression, and saving the output files. This function acts as the entry point when running the script.

This function follows these steps:

1. Parses command-line arguments.

2. Loads data from specified directories.

3. Processes data to align gene and TF datasets.

4. Performs Elastic Net regression to analyze gene-TF relationships.

5. Saves the processed data to output files.

Raises:: SystemExit – If the dimensions of gene-phenotype data and results from Elastic Net regression do not match, indicating a potential issue in data alignment or processing.
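As a rough illustration of the dimension check described under Raises, the sketch below exits when two row counts disagree. The function name, its arguments, and the message text are hypothetical; how the script actually compares the two tables is not documented here.

```python
import sys


def check_row_counts(n_gene_pheno_rows, n_elastic_net_rows):
    """Exit if the two tables do not describe the same number of genes.

    Both arguments are plain row counts (hypothetical interface).
    """
    if n_gene_pheno_rows != n_elastic_net_rows:
        # sys.exit raises SystemExit; the string is printed to stderr.
        sys.exit(
            "Dimension mismatch: gene-phenotype data has "
            f"{n_gene_pheno_rows} rows, Elastic Net results have "
            f"{n_elastic_net_rows} rows"
        )
```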

InPheRNo-ChIP Step 2#

The second step involves running the probabilistic graphical model (PGM).

Basic Usage:

python InPheC_Step2.py

Advanced Usage:

InPheC_Step2.py

This script is part of the InPheRNo-ChIP project and is responsible for:

Estimating the parameters of the model.

Running the probabilistic graphical model (PGM).

Incorporating enhancements in model Qij to account for inter-lineage dependencies and maintaining numerical stability.

Usage:

Run the script directly from the command line: python InPheC_Step2.py

For detailed argument options, use the help option: python InPheC_Step2.py --help

class InPheC_Step2.CorePGM(in2pgm, pgm_param, Prior_T)[source]#

Bases: object

This class sets up the probabilistic graphical model (PGM) configurations, manages batch processing of genes, and handles the execution of the model for each gene.

Parameters:

in2pgm (dict) – Input data and configurations for setting up the PGM.
pgm_param (dict) – Parameters to control the PGM execution, such as number of iterations and chains.
Prior_T (float) – Prior probability for the Bernoulli distribution of Tij.

logsumexp(logp1, logp2, r)[source]#

Computes the log-sum-exp of two log probabilities for numerical stability.

Parameters:

logp1 (float) – Log probability 1.
logp2 (float) – Log probability 2.
r (float) – Mixing ratio.

Returns:

Result of log-sum-exp calculation.

Return type:

float
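The log-sum-exp trick can be sketched as follows. This version assumes that the mixing ratio r weights the two probabilities as r * p1 + (1 - r) * p2 with 0 < r < 1; how the method actually uses r is not stated in this guide, and the function name is hypothetical.

```python
import math


def log_mix(logp1, logp2, r):
    """Return log(r * exp(logp1) + (1 - r) * exp(logp2)) without underflow.

    Assumes 0 < r < 1 (an assumption of this sketch).
    """
    a = math.log(r) + logp1
    b = math.log1p(-r) + logp2  # log(1 - r), accurate for small r
    m = max(a, b)
    # Factoring out the larger term keeps both exponentials in [0, 1].
    return m + math.log(math.exp(a - m) + math.exp(b - m))
```

Subtracting the maximum before exponentiating is what gives the numerical stability: exponentiating very negative log probabilities directly would underflow to zero.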

run(out_dir, n_chain, start_batch, batch_size)[source]#

Processes genes in batches, executing the PGM for each batch and managing the output.

Parameters:

out_dir (str) – Directory to save the output files.
n_chain (int) – Number of chains for MCMC sampling.
start_batch (int) – Starting index of the batch processing.
batch_size (int) – Number of genes processed in each batch.
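A minimal sketch of batch iteration, assuming that start_batch is a zero-based batch index used to resume an interrupted run and that genes are split into consecutive slices of batch_size. The helper name is hypothetical and the script's own batching logic may differ.

```python
def iter_batches(genes, start_batch, batch_size):
    """Yield (batch_index, genes_in_batch), starting at batch `start_batch`."""
    n_batches = -(-len(genes) // batch_size)  # ceiling division
    for batch_idx in range(start_batch, n_batches):
        first = batch_idx * batch_size
        yield batch_idx, genes[first:first + batch_size]


# Seven genes in batches of three, skipping batch 0 (e.g. already processed).
genes = [f"Gene{i}" for i in range(7)]
batches = list(iter_batches(genes, start_batch=1, batch_size=3))
```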

run_model(one_gene)[source]#

Executes the PGM for a single gene based on the provided gene data and model parameters.

Parameters:: one_gene (pandas.DataFrame) – Data for one gene including its interactions and expressions.
Returns:: Trace of the model after sampling.
Return type:: pymc.backends.base.MultiTrace

setup_model(n_TF, n_lineages, a_gp)[source]#

Configures the PGM model using the specified parameters and distributions.

Parameters:

n_TF (int) – Number of transcription factors.
n_lineages (int) – Number of lineages. Defaults to 1 because the minimum p-value across lineages is used.
a_gp (float) – Estimated parameter ‘a_prime’ for the Beta distribution in phenotype modeling.

Returns:

Configured PGM model.

Return type:

pymc.Model

class InPheC_Step2.PGMInputPreparer(out_tmp_dir, rna_tg, rna_gene_pheno, chip_tg, rna_gene_pheno_all)[source]#

Bases: object

Prepares and processes input data for the probabilistic graphical model (PGM) as part of the InPheRNo-ChIP project. This includes data masking, estimating parameters, and adding control variables to ensure robust model performance.

Parameters:

out_tmp_dir (str) – Directory path for saving intermediate outputs.
rna_tg (pandas.DataFrame) – DataFrame containing RNA-seq gene-TF associations.
rna_gene_pheno (pandas.DataFrame) – DataFrame containing gene-phenotype associations.
chip_tg (pandas.DataFrame) – DataFrame containing ChIP-seq gene-TF associations.
rna_gene_pheno_all (pandas.DataFrame) – DataFrame containing gene-phenotype associations for all genes.

class beta_unif_gen(momtype=1, a=None, b=None, xtol=1e-14, badvalue=None, name=None, longname=None, shapes=None, seed=None)[source]#: Bases: rv_continuous

check_modality_per_gene()[source]#

Ensures that each gene meets specific criteria in both RNA-seq and ChIP-seq data for inclusion in the PGM.

Returns:: DataFrames filtered based on modality checks.
Return type:: tuple of pandas.DataFrame

combine_and_add_controls(flt_rna_tg, flt_chip_tg, flt_rna_gp)[source]#

Merges datasets and adds control entries to prepare for PGM analysis.

Parameters:

flt_rna_tg (pandas.DataFrame) – Filtered RNA-seq gene-TF associations.
flt_chip_tg (pandas.DataFrame) – Filtered ChIP-seq gene-TF associations.
flt_rna_gp (pandas.DataFrame) – Filtered gene-phenotype associations.

Returns:

Combined DataFrame with control entries added.

Return type:

pandas.DataFrame

estimate_a_prime(Pj_arr)[source]#

Estimates the ‘a_prime’ parameter for the PGM using a hybrid beta-uniform distribution.

Parameters:: Pj_arr (numpy.ndarray) – Array of p-values for estimating the distribution parameter.
Returns:: Estimated ‘a_prime’ parameter.
Return type:: float
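The exact form of the hybrid beta-uniform density is not given in this guide. As a simplified stand-in, the sketch below fits a pure Beta(a, 1) density, f(p) = a * p^(a - 1), to a set of p-values by maximum likelihood, which has the closed form a = -n / sum(log p). It is not the method used by estimate_a_prime; the function name and the eps guard are assumptions.

```python
import math


def estimate_a_beta(p_values, eps=1e-300):
    """Closed-form MLE of `a` for a Beta(a, 1) density on (0, 1].

    `eps` guards against log(0) for p-values reported as exactly zero.
    """
    logs = [math.log(max(p, eps)) for p in p_values]
    total = sum(logs)
    if total == 0.0:
        raise ValueError("All p-values equal 1; 'a' is not identifiable.")
    return -len(logs) / total


# Deterministic check: quantiles of Beta(0.5, 1) via the inverse CDF u**(1/a).
n, a_true = 1000, 0.5
p_values = [((i + 0.5) / n) ** (1.0 / a_true) for i in range(n)]
a_hat = estimate_a_beta(p_values)
```

Values of a below 1 correspond to p-values concentrated near zero, while a = 1 recovers the uniform distribution.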

class InPheC_Step2.Step1DataLoader(args)[source]#

Bases: object

Handles the loading of data from outputs of the first step in the InPheRNo-ChIP project pipeline as well as RNA-seq data.

This class is tasked with ensuring data integrity and format correctness as it prepares the data for subsequent processing stages.

Parameters:: args (argparse.Namespace) – Parsed command-line arguments containing paths and file-related settings.

load_csv(file_path, delimiter)[source]#: Loads a CSV file into a pandas DataFrame, using a specified delimiter.

load_data()[source]#

Loads necessary data files from specified directories for further processing, including gene-TF associations, gene-phenotype associations, and ChIP-seq data. Validates the loading process by printing the shapes of loaded datasets.

Returns:: Tuple containing DataFrames for gene-phenotype associations for all genes, gene-phenotype associations for the set of interest, gene-TF associations, and ChIP-seq gene-TF associations.
Return type:: tuple

class InPheC_Step2.Step2ArgumentParser[source]#

Bases: object

Parses and handles command-line arguments for the InPheC_Step2.py script, ensuring proper validation and configuration of input and output paths and file-related options for RNA-seq and ChIP-seq data processing.

get_args()[source]#

Returns the validated command-line arguments.

Returns:: Parsed command-line arguments.
Return type:: argparse.Namespace

validate_args()[source]#: Validates the existence of directories specified in command-line arguments and creates the output directory if it does not exist.
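The validation described above might look like the following sketch. The function name and arguments are hypothetical, and raising FileNotFoundError for a missing input directory is an assumption, since this guide does not say how the script reports that case.

```python
import os
import tempfile


def validate_dirs(input_dirs, out_dir):
    """Check that input directories exist; create the output directory."""
    for path in input_dirs:
        if not os.path.isdir(path):
            raise FileNotFoundError(f"Input directory not found: {path}")
    # exist_ok=True makes this a no-op when the directory already exists.
    os.makedirs(out_dir, exist_ok=True)
    return out_dir


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    created = validate_dirs([tmp], os.path.join(tmp, "step2_out"))
    out_dir_exists = os.path.isdir(created)
```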

InPheC_Step2.step2_main()[source]#

Entry point for executing step 2 of the InPheRNo-ChIP project’s pipeline. This function orchestrates the loading of data, preprocessing, and execution of the probabilistic graphical model (PGM) for gene regulatory network (GRN) inference.

Steps:

Parses command-line arguments for configuration settings.

Loads data produced by step 1, including gene-TF and gene-phenotype associations.

Prepares input for the PGM, including data cleaning, modality checks, and control integration.

Configures and runs the PGM to infer regulatory networks.

The function manages file directories, initializes data preparation, sets PGM parameters, and handles the batch processing of genes, ensuring all outputs are appropriately saved and managed.

InPheRNo-ChIP Step 3#

The final step combines and processes the output from the previous steps.

Basic Usage:

python InPheC_Step3.py

Advanced Usage:

InPheC_Step3.py

This script is an integral part of the InPheRNo-ChIP project’s step 3 process.

Key Operations:

Combines multiple batches

Averages posterior probabilities across different chains from step 2’s PGM outputs.

Applies gene-wise filters to the averaged data.

Performs min-max normalization and forms the final phenotype-relevant GRN.
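Min-max normalization, the last operation listed above, rescales scores to the range [0, 1]. The sketch below shows the generic formula on a flat list; whether the script normalizes per gene or across the whole network is not specified here, and mapping a constant list to zeros is a choice made for this sketch.

```python
def min_max_normalize(scores):
    """Rescale scores linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores identical: avoid division by zero (sketch's choice).
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


normalized = min_max_normalize([2.0, 4.0, 6.0])
```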

Usage:

Run the script directly from the command line: python InPheC_Step3.py

For detailed argument options, use the help option: python InPheC_Step3.py --help

Note: Ensure the Python environment is properly set up and all dependencies are installed before running this script.

class InPheC_Step3.DataProcessor(fn: str)[source]#

Bases: object

Processes and analyzes the combined posterior data from probabilistic graphical models.

This class loads and processes the in2pgm and posterior data, combines them, and applies filtering and normalization to produce a final gene regulatory network (GRN).

Parameters:

fn1 (str) – Path to the first input file containing in2pgm data.
fn2 (str) – Path to the second input file containing posterior data.

combine_in2pgm_outposterior() → DataFrame[source]#

Combines the in2pgm and posterior data into a single DataFrame.

This function merges the two datasets on the TF and Gene columns, effectively creating an edge list for the GRN.

Returns:: Combined DataFrame with the in2pgm and posterior data.
Return type:: pd.DataFrame
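The merge on the TF and Gene columns can be pictured as a join keyed on (TF, Gene) pairs. The sketch uses plain dictionaries instead of DataFrames; the inner-join behavior and the extra column names (p_chip, posterior) are assumptions for illustration.

```python
def merge_on_tf_gene(in2pgm_rows, posterior_rows):
    """Inner-join two lists of row dicts on their 'TF' and 'Gene' keys."""
    posterior_by_edge = {(r["TF"], r["Gene"]): r for r in posterior_rows}
    merged = []
    for row in in2pgm_rows:
        edge = (row["TF"], row["Gene"])
        if edge in posterior_by_edge:
            # Combine the columns of both rows into one edge record.
            merged.append({**row, **posterior_by_edge[edge]})
    return merged


in2pgm = [
    {"TF": "TF1", "Gene": "GeneA", "p_chip": 0.01},
    {"TF": "TF2", "Gene": "GeneA", "p_chip": 0.30},
]
posterior = [{"TF": "TF1", "Gene": "GeneA", "posterior": 0.9}]
edges = merge_on_tf_gene(in2pgm, posterior)
```

Each resulting record is one candidate TF-gene edge of the network.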

form_final_grn(col_raw_score, Tp_val) → DataFrame[source]#

Forms the final gene regulatory network (GRN) based on normalized and filtered data.

Parameters:

col_raw_score (str) – Column name of the raw scores used for final processing.
Tp_val (float) – Threshold prior value used in forming the GRN.

Returns:

Tuple containing the final GRN DataFrame and gene-specific thresholds.

Return type:

pd.DataFrame

exception InPheC_Step3.FileNotFound[source]#

Bases: Exception

Custom exception for handling file-not-found errors.

This exception is raised when a required file is not found in the specified directory, extending the base Exception class.

Parameters:: message (str) – The error message to display.

class InPheC_Step3.ParameterExtractor(in_dir: str)[source]#

Bases: object

Extracts parameters from file names in the specified directory.

Designed to read file names in a given directory and extract key parameters such as T_prior, n_c, n_i, and n_t, which are essential for processing steps in the pipeline.

Parameters:: in_dir (str) – Input directory where files are located.
Returns:: A dictionary containing extracted parameters with keys like ‘sample_fn’, ‘T_prior’, ‘n_c’, ‘n_i’, and ‘n_t’.
Return type:: Dict[str, Union[str, float]]
Raises:: FileNotFound – If no relevant pickle file is found in the directory.

extract() → Dict[str, str | float][source]#

Extracts parameters from the filenames in the input directory.

Returns:: A dictionary containing the extracted parameters.
Return type:: Dict[str, Union[str, float]]
Raises:: FileNotFound – If no relevant pickle file is found in the directory.
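Extracting parameters from file names usually comes down to a regular expression. The file-name layout used below (for example batch0_Tprior0.2_nc4_ni2000_nt1000.pkl) is purely hypothetical, as the actual naming convention is not documented in this guide; only the dictionary keys follow the description above.

```python
import re


class FileNotFound(Exception):
    """Raised when no matching pickle file name is found."""


# Hypothetical layout: ..._Tprior<float>_nc<int>_ni<int>_nt<int>.pkl
PATTERN = re.compile(
    r"Tprior(?P<T_prior>[0-9.]+)_nc(?P<n_c>\d+)"
    r"_ni(?P<n_i>\d+)_nt(?P<n_t>\d+)\.pkl$"
)


def extract_params(filenames):
    """Return parameters parsed from the first file name that matches."""
    for fn in filenames:
        match = PATTERN.search(fn)
        if match:
            params = {"sample_fn": fn}
            for key in ("T_prior", "n_c", "n_i", "n_t"):
                params[key] = float(match[key])
            return params
    raise FileNotFound("No relevant pickle file found.")


params = extract_params(["notes.txt", "batch0_Tprior0.2_nc4_ni2000_nt1000.pkl"])
```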

class InPheC_Step3.PosteriorCombiner(in_dir: str, n_batches: int, n_burn: int, common_params: dict)[source]#

Bases: object

Combines posterior distributions from multiple batches and chains.

Processes each batch, extracting and combining posterior distributions into a single DataFrame from pickle files.

Parameters:

in_dir (str) – Input directory containing the pickle files.
n_batches (int) – Number of batches to process.
n_burn (int) – Number of burn-in iterations.
common_params (dict) – Common parameters used across the batches.

Returns:

Tuple containing paths to the combined posterior file and the in2pgm data.

Return type:

tuple

combine(out_dir: str)[source]#

Combines the posteriors and saves the results to specified output files.

Parameters:: out_dir (str) – Directory where the output files will be saved.
Returns:: Tuple of paths to the combined in2pgm and posterior files.
Return type:: Tuple[str, str]
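To illustrate how burn-in removal and chain averaging interact, the sketch below drops the first n_burn samples of each chain, takes the mean of what remains, and then averages those per-chain means. The data layout (one list of 0/1 samples per chain for a single TF-gene edge) and the function name are assumptions; the script reads its samples from pickle files.

```python
def average_posterior(chains, n_burn):
    """Average post-burn-in sample means across chains for one edge."""
    per_chain_means = []
    for samples in chains:
        kept = samples[n_burn:]  # discard burn-in iterations
        if not kept:
            raise ValueError("n_burn removes every sample in a chain.")
        per_chain_means.append(sum(kept) / len(kept))
    return sum(per_chain_means) / len(per_chain_means)


# Two chains of six samples each, with the first two treated as burn-in.
chains = [[0, 0, 1, 1, 1, 1], [1, 1, 0, 1, 0, 1]]
posterior_prob = average_posterior(chains, n_burn=2)
```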

class InPheC_Step3.Step3ArgumentParser[source]#

Bases: object

Parses command-line arguments for the script.

This class configures and retrieves command-line options, returning the parsed arguments as an argparse.Namespace object. It sets default values for input and output directories, the number of batches, and the number of burn-in iterations for posterior processing.

Returns:: An argparse.Namespace object containing all the command-line arguments.
Return type:: argparse.Namespace
Example:

args = Step3ArgumentParser.parse()
print(args.in_dir)  # Prints the input directory specified on the command line

static parse() → Namespace[source]#

Sets up the command-line arguments and parses them.

Returns:: Parsed arguments from the command line.
Return type:: argparse.Namespace

class InPheC_Step3.Utils[source]#

Bases: object

static check_pkl_file_count(in_dir: str, expected_count: int)[source]#

Checks if the number of .pkl files in the input directory matches the expected number.

Parameters:

in_dir (str) – Directory to check for .pkl files.
expected_count (int) – The number of .pkl files expected in the directory.

Raises:

ValueError – If the number of .pkl files does not match the expected count.
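A check of this kind can be written with the standard library alone. The sketch below is one possible implementation, not necessarily the one in the script; returning the count is an addition made here so the result can be inspected.

```python
import glob
import os
import tempfile


def check_pkl_file_count(in_dir, expected_count):
    """Raise ValueError unless in_dir holds exactly expected_count .pkl files."""
    n_found = len(glob.glob(os.path.join(in_dir, "*.pkl")))
    if n_found != expected_count:
        raise ValueError(
            f"Expected {expected_count} .pkl files in {in_dir}, found {n_found}."
        )
    return n_found


# Demonstration with three empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        open(os.path.join(tmp, f"batch{i}.pkl"), "wb").close()
    n_found = check_pkl_file_count(tmp, expected_count=3)
```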

static configure_logging(log_dir: str) → None[source]#

Configures the logging system for the script, directing log output to a file in a specified directory.

Parameters:: log_dir (str) – Directory where the log file will be stored.
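File-based logging of this sort can be set up with Python's logging module. In the sketch below, the log file name, the logger name, the message format, and the returned values are all assumptions; the documented method returns None.

```python
import logging
import os


def configure_logging(log_dir, log_name="step3.log"):
    """Send INFO-level messages to a file inside log_dir."""
    os.makedirs(log_dir, exist_ok=True)
    log_path = os.path.join(log_dir, log_name)
    logger = logging.getLogger("inphec_step3_sketch")  # hypothetical name
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger, log_path
```

Using a named logger rather than the root logger keeps the sketch from altering the logging setup of other libraries.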

static print_python_details()[source]#: Prints detailed information about the Python environment running the script. Outputs various Python-related system information, including version, compiler, and other system details.

InPheC_Step3.step3_main() → None[source]#

Executes the main functionality of the Step 3 script in the InPheRNo-ChIP project. This function manages the workflow that combines and processes posterior distributions from probabilistic graphical models (PGMs) generated in Step 2.

The main steps include:

Configuring logging for debugging and information tracking.

Parsing command line arguments to get directories and processing parameters.

Checking the count of .pkl files to ensure all expected files are present.

Creating necessary directories for outputs.

Extracting and logging parameters from file names.

Checking for and combining existing processed data or processing new data.

Merging, filtering, and normalizing data to form the final Gene Regulatory Network (GRN).

Saving the final outputs in an Excel file with multiple sheets for easy review.

Raises:

FileNotFoundError – If required files are not found in the specified directories.
ValueError – If processing parameters are incorrect or if files are empty.

FAQs#

Include a Frequently Asked Questions section to cover common queries:

Question 1: can’t wait for the 1st question you have for InPheC!
Answer 1:…
Question 2:
Answer 2:…

Support and Contribution#

We are committed to providing support for InPheRNo-ChIP users and actively encourage contributions to the project.

Getting Support#

If you encounter issues or have questions while using InPheRNo-ChIP, there are several ways to get support:

Email the Author: Feel free to reach out to the author of the paper associated with InPheRNo-ChIP. The author’s email can typically be found in the corresponding research paper or on the project’s main webpage.
GitHub Issues: For technical problems or bugs, you can open an issue on the GitHub repository.

Contributing to InPheRNo-ChIP#

Your contributions to InPheRNo-ChIP are highly valued. There are various ways you can contribute to the project:

Reporting Bugs: If you find a bug, please report it by opening an issue on our GitHub repository. Provide as much detail as possible to help us understand and address the issue.
Feature Requests: Have ideas for new features or improvements? We’d love to hear them! Please file a feature request on our GitHub Issues page.
Code Contributions: If you’re interested in contributing code, feel free to fork the repository and submit a pull request with your changes. For major changes, please open an issue first to discuss what you would like to change.

We appreciate all forms of feedback and contributions to make InPheRNo-ChIP better for everyone!

Note

This guide assumes a basic understanding of Python and PyMC. If you are new to these, consider reading the PyMC documentation.