3. Decomposition Step

Lena Krockenberger edited this page Feb 19, 2026 · 21 revisions

Overview of Decomposition Step

The decomposition_step() function decomposes the input gene expression data across C contexts (e.g., tissues, cell types, or experimental conditions) into one shared component and C context-specific components. By removing variation that is common across contexts, this step increases the power to detect context-specific eQTLs.

Usage

# Specify directories 
data_dir_decomp <- "simulated_example/"
exp_mat_filename <- paste0(data_dir_decomp, "expression.txt")

# Run decomposition
decomposition_step(exp_mat_filename, data_dir_decomp)

Handling Missing Data

FastGxC natively supports missing expression values; imputation is not required. Prior to decomposition, ensure that all missing entries in the expression matrix are encoded as NA. This allows the model to correctly account for incomplete observations during fitting.
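As a minimal sketch of the NA encoding described above, the snippet below reads a small example file in which missing values appear under several placeholder spellings and rewrites it with every missing entry encoded as NA. The file contents, the placeholder values ("", "NaN", "-999"), and the tab delimiter are assumptions made for illustration; adjust them to match your own data.

```r
# Hypothetical raw file in which missing values appear as "", "NaN", or "-999".
raw_file <- tempfile(fileext = ".txt")
writeLines(c(
  "design\tgene1\tgene2",
  "ind1 - context1\t0.43\t-999",
  "ind1 - context2\tNaN\t0.12",
  "ind2 - context1\t\t-0.31"
), raw_file)

# Map every placeholder to NA while reading (tab delimiter is an assumption).
expr <- read.table(raw_file, header = TRUE, sep = "\t",
                   na.strings = c("", "NA", "NaN", "-999"),
                   check.names = FALSE, stringsAsFactors = FALSE)

# Sanity check: number of missing entries per gene.
n_missing <- colSums(is.na(expr[, -1]))
print(n_missing)

# Write the matrix back with missing values spelled NA.
clean_file <- tempfile(fileext = ".txt")
write.table(expr, clean_file, sep = "\t", quote = FALSE,
            row.names = FALSE, na = "NA")
```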

Background Noise Adjustment and Normalization

FastGxC assumes that the expression matrix is approximately Gaussian. To make your data fit this assumption, apply a variance-stabilizing transformation (e.g., log2(TPM + 1)) or, preferably, quantile normalization before running FastGxC.
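The two transformations mentioned above could look roughly like the following base R sketch. The TPM matrix is simulated, and quantile_normalize() is a hypothetical helper (not part of FastGxC) that implements one common definition of quantile normalization: every sample is forced to share the same empirical distribution. It assumes a matrix without missing values.

```r
set.seed(1)
# Hypothetical TPM matrix: 6 samples (rows) x 4 genes (columns).
tpm <- matrix(rexp(24, rate = 0.1), nrow = 6,
              dimnames = list(paste0("s", 1:6), paste0("gene", 1:4)))

# Option 1: variance-stabilizing log transform.
log_expr <- log2(tpm + 1)

# Option 2: quantile normalization (hypothetical helper, illustration only).
# x has features in rows and samples in columns, with no missing values.
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank within each sample
  sorted <- apply(x, 2, sort)                         # sort within each sample
  target <- rowMeans(sorted)                          # shared reference distribution
  out <- apply(ranks, 2, function(r) target[r])       # map ranks to the reference
  dimnames(out) <- dimnames(x)
  out
}

# The input format has samples in rows, so transpose before and after.
qn_expr <- t(quantile_normalize(t(log_expr)))
```

After this step every row of qn_expr (one sample) contains the same set of values, only in a different order across genes.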

In addition, to improve eQTL mapping power, we suggest residualizing the expression matrix on technical and biological covariates before running FastGxC. Common approaches include:

  • Regressing out measured factors (e.g., batch, sequencing depth, sex, age, or other known confounders).

  • Latent factor adjustment, such as PEER factors or PCA-derived covariates, to capture unmeasured sources of variation. We suggest using PCA-derived covariates.

These preprocessing steps help ensure that the decomposition focuses on biologically meaningful variation rather than technical artifacts.
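The residualization described above can be sketched in base R as follows. All inputs are simulated, the covariate names (batch, age) and the number of principal components (n_pcs) are arbitrary choices for illustration, and the sketch assumes a complete matrix: lm.fit() does not accept missing values, so data with NA entries would need a per-gene fit instead.

```r
set.seed(2)
n <- 40
# Hypothetical inputs: expression (samples x genes) and measured covariates.
expr <- matrix(rnorm(n * 5), nrow = n,
               dimnames = list(paste0("s", 1:n), paste0("gene", 1:5)))
covars <- data.frame(batch = factor(rep(c("a", "b"), each = n / 2)),
                     age = rnorm(n, mean = 50, sd = 10))

# Latent covariates: top principal components of the expression matrix.
n_pcs <- 2  # arbitrary choice for illustration
pcs <- prcomp(expr, center = TRUE, scale. = TRUE)$x[, 1:n_pcs]

# Design matrix with intercept, measured covariates, and PCs.
design_mat <- cbind(model.matrix(~ batch + age, data = covars), pcs)

# Regress every gene on the design matrix and keep the residuals.
fit <- lm.fit(design_mat, expr)
expr_resid <- fit$residuals
dimnames(expr_resid) <- dimnames(expr)
```

The residual matrix expr_resid has the same shape as expr and is, by construction, uncorrelated with every column of the design matrix.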

Input files

The decomposition_step() function expects an input expression matrix with samples as rows and genes as columns.
Note: The first column must be labeled design and contain unique sample identifiers in the format "individual ID - context name" (e.g., ind1 - context3). Consistent naming is critical: mismatches in sample identifiers between the expression matrix and the genotype data will cause errors in downstream steps.

design             gene1     gene2     gene3    ...   gene100
ind1 - context1   0.4369    -1.7087    -1.8113  ...   1.1721
ind1 - context10  0.6437    -0.4357    -0.8296  ...  -0.3944
ind1 - context2   0.1092    -0.1294    -0.3222  ...  -0.4939
ind1 - context3   0.1234    -0.1947    -0.1111  ...  -0.9999
ind1 - context4   0.8347    -0.9876    -0.1490  ...  -0.3833
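One way to assemble an input file in this layout is sketched below. The sample annotation, the expression values, and the tab delimiter are assumptions made for illustration. The check on the split identifiers also assumes that individual IDs and context names do not themselves contain " - ".

```r
# Hypothetical sample annotation: one row per measured sample.
samples <- data.frame(individual = c("ind1", "ind1", "ind2"),
                      context = c("context1", "context2", "context1"),
                      stringsAsFactors = FALSE)
expr_values <- matrix(c(0.44, 0.11, -0.52, -1.71, -0.13, 0.90), nrow = 3,
                      dimnames = list(NULL, c("gene1", "gene2")))

# Build the design column as "individual ID - context name".
design <- paste(samples$individual, samples$context, sep = " - ")
stopifnot(!anyDuplicated(design))  # identifiers must be unique

expr <- data.frame(design = design, expr_values,
                   check.names = FALSE, stringsAsFactors = FALSE)

# Every identifier should split into exactly two parts.
parts <- strsplit(expr$design, " - ", fixed = TRUE)
stopifnot(all(lengths(parts) == 2))

# Write the matrix (tab delimiter is an assumption).
out_file <- tempfile(fileext = ".txt")
write.table(expr, out_file, sep = "\t", quote = FALSE,
            row.names = FALSE, na = "NA")
```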

Output files

For C contexts, the decomposition_step() function writes C+1 files: one containing the shared expression component (i.e., the average expression of each gene in each individual across contexts) and C containing the context-specific expression components (i.e., the residual expression of each gene in each individual in a given context after subtracting the shared component). In each file, columns correspond to individuals and rows correspond to genes.
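To illustrate the relationship between the two components, the toy sketch below computes a per-individual average across contexts and the corresponding residuals on a small hand-made data set. It is not the implementation used by decomposition_step(); in particular, averaging over the observed contexts only (na.rm = TRUE) is an assumption about how a missing value might be handled, and all object names are hypothetical.

```r
# Hypothetical input: 2 individuals x 3 contexts, 2 genes, one missing value.
expr <- data.frame(
  individual = rep(c("ind1", "ind2"), each = 3),
  context = rep(paste0("context", 1:3), times = 2),
  gene1 = c(1, 2, 3, 0, NA, 4),
  gene2 = c(-1, 0, 1, 2, 2, 2),
  stringsAsFactors = FALSE
)
genes <- c("gene1", "gene2")

# Shared component: per-individual mean across the observed contexts.
# Result is an individuals x genes matrix.
shared <- sapply(genes, function(g)
  tapply(expr[[g]], expr$individual, mean, na.rm = TRUE))

# Specific component: expression minus the shared value of that individual.
specific <- as.matrix(expr[genes]) -
  shared[as.character(expr$individual), , drop = FALSE]

# Rearrange to the output layout: genes in rows, individuals in columns.
shared_out <- t(shared)
in_ctx1 <- expr$context == "context1"
context1_out <- t(specific[in_ctx1, , drop = FALSE])
colnames(context1_out) <- expr$individual[in_ctx1]
```

Adding the shared value back to each context-specific value recovers the original expression wherever it was observed.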

context_shared_expression.txt

        ind1        ind2        ind3        ind4        ind5
gene1  0.4321     -1.2345      0.5678     -0.8910      1.1203
gene2 -0.9834      0.1342     -1.0923      0.2345     -0.7583
gene3  1.2948      1.1023      0.8421      0.9104      1.0012
gene4 -0.1243     -0.8934     -1.2098     -1.1342     -0.9981
gene5  0.4587      0.5632      0.9983      1.2345      0.8231

contextX_specific_expression.txt

        ind1        ind2        ind3        ind4        ind5
gene1  0.1834      0.0456     -0.0912      0.2314     -0.0873
gene2 -0.1345      0.0543      0.1123     -0.0784      0.0912
gene3  0.0034     -0.0312      0.0402      0.0105     -0.0067
gene4 -0.1423     -0.1092     -0.0876     -0.1904     -0.1523
gene5  0.2145      0.1432      0.1784      0.1984      0.2013
