3. Decomposition Step
The decomposition_step() function decomposes the input gene expression data across C contexts (e.g., tissues, cell types, or experimental conditions) into one shared component and C context-specific components. By removing variation that is common across contexts, this step increases the power to detect context-specific eQTLs.
# Specify directories
data_dir_decomp <- "simulated_example/"
exp_mat_filename <- paste0(data_dir_decomp, "expression.txt")
# Run decomposition
decomposition_step(exp_mat_filename, data_dir_decomp)
FastGxC natively supports missing expression values; imputation is not required. Prior to decomposition, ensure that all missing entries in the expression matrix are encoded as NA. This allows the model to correctly account for incomplete observations during fitting.
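As a minimal sketch of the NA encoding described above, the following base R snippet reads a small expression table in which missing values were written with other placeholders and converts them to NA on import. The inline data, the tab delimiter, and the placeholder strings "." and "-999" are assumptions made for illustration; they are not part of FastGxC.

```r
# Hypothetical raw table; "." and "-999" stand in for missing values here.
raw <- "design\tgene1\tgene2\nind1 - context1\t0.44\t-999\nind1 - context2\t.\t-0.13\n"

# Map every placeholder to NA while reading (tab-delimited is an assumption).
expr <- read.delim(text = raw,
                   na.strings = c("NA", ".", "-999"),
                   check.names = FALSE,
                   stringsAsFactors = FALSE)

# Count missing entries per column to confirm the encoding.
n_missing <- colSums(is.na(expr))
print(n_missing)
```

Checking the per-column counts before running the decomposition is a cheap way to catch placeholders that were not converted.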
FastGxC assumes that the expression matrix is approximately Gaussian. To meet this assumption, apply a variance-stabilizing transformation (e.g., log2(TPM + 1)) or, preferably, quantile normalization before running FastGxC.
In addition, to improve eQTL mapping power, we suggest residualizing the expression matrix for technical and biological covariates before running FastGxC. Common approaches include:
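The two transformations mentioned above can be sketched in base R as follows. The toy TPM matrix and the helper inverse_normal() are hypothetical. The second option is a per-gene rank-based inverse normal transform, which is one common way to bring expression close to Gaussian; whether this matches the exact quantile normalization the authors have in mind is an assumption.

```r
# Hypothetical toy TPM matrix: samples as rows, genes as columns.
set.seed(1)
tpm <- matrix(rexp(20, rate = 0.1), nrow = 5,
              dimnames = list(paste0("s", 1:5), paste0("gene", 1:4)))

# Option 1: log2(TPM + 1).
log_expr <- log2(tpm + 1)

# Option 2: rank-based inverse normal transform, applied per gene.
# Missing values are left as NA.
inverse_normal <- function(x) {
  out <- rep(NA_real_, length(x))
  ok <- !is.na(x)
  out[ok] <- qnorm((rank(x[ok]) - 0.5) / sum(ok))
  out
}
qn_expr <- apply(tpm, 2, inverse_normal)
rownames(qn_expr) <- rownames(tpm)
```

The rank-based version preserves the ordering of samples within each gene and discards the original scale, so it is robust to outliers.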
- Regressing out measured factors (e.g., batch, sequencing depth, sex, age, or other known confounders).
- Latent factor adjustment, such as PEER factors or PCA-derived covariates, to capture unmeasured sources of variation. We suggest using PCA-derived covariates.
These preprocessing steps help ensure that the decomposition focuses on biologically meaningful variation rather than technical artifacts.
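One way to residualize an expression matrix for measured covariates is a per-gene linear regression, sketched below in base R. The toy data, the covariate names, and the helper residualize() are hypothetical. Latent factors such as expression PCs could be appended as extra covariate columns in the same way, although that step is not shown here.

```r
# Hypothetical toy data: 20 samples, 3 genes, two measured covariates.
set.seed(2)
n <- 20
expr <- matrix(rnorm(n * 3), nrow = n,
               dimnames = list(paste0("s", 1:n), paste0("gene", 1:3)))
covs <- data.frame(batch = factor(rep(c("a", "b"), each = n / 2)),
                   age = rnorm(n, mean = 50, sd = 10))
expr[3, 2] <- NA  # a missing value, kept as NA in the residuals

# Regress one gene on all covariates and return residuals.
# na.exclude pads the residuals with NA where the input was missing.
residualize <- function(y, covariates) {
  dat <- data.frame(y = y, covariates)
  fit <- lm(y ~ ., data = dat, na.action = na.exclude)
  unname(residuals(fit))
}

expr_resid <- apply(expr, 2, residualize, covariates = covs)
rownames(expr_resid) <- rownames(expr)
```

The residual matrix has the same shape as the input, so it can be given a design column and written to disk in the format described in the next part of this section.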
The decomposition_step() function expects an input expression matrix formatted with genes as columns and samples as rows.
Note: The first column must be labeled design and contain unique sample identifiers in the format "individual ID - context name" (e.g., ind1 - context3). Consistent naming is critical — mismatches in sample identifiers between the expression matrix and genotype data will cause errors in downstream steps.
design gene1 gene2 gene3 ... gene100
ind1 - context1 0.4369 -1.7087 -1.8113 ... 1.1721
ind1 - context10 0.6437 -0.4357 -0.8296 ... -0.3944
ind1 - context2 0.1092 -0.1294 -0.3222 ... -0.4939
ind1 - context3 0.1234 -0.1947 -0.1111 ... -0.9999
ind1 - context4 0.8347 -0.9876 -0.1490 ... -0.3833
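The design column in the format shown above can be built and checked with a few lines of base R. This is a sketch only: the sample_info table and its column names are hypothetical, and the checks simply mirror the stated requirements (unique identifiers in the form "individual ID - context name").

```r
# Hypothetical sample annotation: one row per sample.
sample_info <- data.frame(
  individual = c("ind1", "ind1", "ind2"),
  context = c("context1", "context2", "context1"),
  stringsAsFactors = FALSE
)

# Build identifiers of the form "individual ID - context name".
design <- paste(sample_info$individual, sample_info$context, sep = " - ")

# Identifiers must be unique.
stopifnot(anyDuplicated(design) == 0)

# Each identifier should split into exactly two parts on " - ".
parts <- strsplit(design, " - ", fixed = TRUE)
stopifnot(all(lengths(parts) == 2))

# Recover individual IDs, e.g. to compare against the genotype sample IDs.
individuals <- unique(vapply(parts, `[`, character(1), 1))
```

Comparing individuals against the sample IDs in the genotype data at this stage catches naming mismatches before they surface in downstream steps.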
For C contexts, the decomposition_step() function outputs C+1 files: one with the shared expression component (i.e., the average expression of each gene for each individual across contexts) and C with the context-specific expression components (i.e., the residual expression of each gene for each individual in a given context after subtracting the shared component). In each file, columns correspond to individuals and rows correspond to genes.
context_shared_expression.txt
ind1 ind2 ind3 ind4 ind5
gene1 0.4321 -1.2345 0.5678 -0.8910 1.1203
gene2 -0.9834 0.1342 -1.0923 0.2345 -0.7583
gene3 1.2948 1.1023 0.8421 0.9104 1.0012
gene4 -0.1243 -0.8934 -1.2098 -1.1342 -0.9981
gene5 0.4587 0.5632 0.9983 1.2345 0.8231
contextX_specific_expression.txt
ind1 ind2 ind3 ind4 ind5
gene1 0.1834 0.0456 -0.0912 0.2314 -0.0873
gene2 -0.1345 0.0543 0.1123 -0.0784 0.0912
gene3 0.0034 -0.0312 0.0402 0.0105 -0.0067
gene4 -0.1423 -0.1092 -0.0876 -0.1904 -0.1523
gene5 0.2145 0.1432 0.1784 0.1984 0.2013
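To make the relationship between the input and these output files concrete, the sketch below reproduces the decomposition as described in this section: the shared component is the per-individual mean across contexts, and each context-specific component is the residual after subtracting it. The toy data are hypothetical, and this is an illustration of the description only; the actual decomposition_step() implementation may differ, for example in how it handles missing values.

```r
# Hypothetical input: samples as rows, genes as columns, plus a design column.
expr <- data.frame(
  design = c("ind1 - context1", "ind1 - context2",
             "ind2 - context1", "ind2 - context2"),
  gene1 = c(1, 3, 2, NA),
  gene2 = c(0, 2, 4, 6),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Split the design column into individual and context labels.
parts <- do.call(rbind, strsplit(expr$design, " - ", fixed = TRUE))
individual <- parts[, 1]
context <- parts[, 2]
genes <- as.matrix(expr[, -1])

# Shared component: mean over the observed contexts, per individual and gene.
# The result has individuals as rows and genes as columns.
shared <- apply(genes, 2, function(g) tapply(g, individual, mean, na.rm = TRUE))

# Specific component: each sample minus its individual's shared component.
specific <- genes - shared[individual, , drop = FALSE]

# Reshape to the output layout: genes as rows, individuals as columns.
shared_out <- t(shared)
in_ctx1 <- context == "context1"
specific_ctx1 <- t(specific[in_ctx1, , drop = FALSE])
colnames(specific_ctx1) <- individual[in_ctx1]
```

In this toy example, gene1 for ind2 is observed in only one context, so its shared value equals that single observation and its context1-specific value is zero.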