diff --git a/.Rbuildignore b/.Rbuildignore index a8eff32..988159e 100644 --- a/.Rbuildignore +++ b/.Rbuildignore @@ -16,4 +16,4 @@ KinformR.Rproj .pre-commit-config.yaml .Rproj.user .git -.github \ No newline at end of file +.github diff --git a/.github/workflows/main.yaml b/.github/workflows/main.yaml index 6648884..61a1298 100644 --- a/.github/workflows/main.yaml +++ b/.github/workflows/main.yaml @@ -34,4 +34,6 @@ jobs: run: | conda create -n test_r r-base r-devtools r-testthat conda activate test_r - Rscript -e "testthat::test_local()" + Rscript -e "testthat::test_local()" +# - name: PreCommit +# uses: pre-commit/action@v3.0.1 diff --git a/CHANGELOG.md b/CHANGELOG.md index efdcc4a..363dc6e 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -10,6 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Changed ### Added +- use of precommit spelling, not making a CI check so as to keep cran compatibility. ### Fixed - linting and spelling errors resolved with pre-commit usage. diff --git a/DESCRIPTION b/DESCRIPTION index e3523cf..240a8c8 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,18 +1,18 @@ Package: KinformR Title: Relationship-Informed Pedigree and Variant Scoring Version: 0.1.0 -Authors@R: +Authors@R: person("Cameron M.", "Nugent", , "cam.nugent@sequencebio.com", role = c("aut", "cre"), comment = c(ORCID = "0000-0002-1135-2605")) Author: Cameron M. Nugent Maintainer: Cameron M. Nugent -Description: - The KinformR R package is meant to aid in comparative evaluation of families - and candidate variants in rare-variant association studies. The package can be used for - two methodologically overlapping but distinct purposes. First, the prior to any genetic or genomic - evaluation, evaluation of relative detection power of pedigrees, can direct recruitment - efforts by showing which unsampled individuals would be the most meaningful additions to a study. - Second, after sequencing and analysis, variants based on association with disease status +Description: + The KinformR R package is meant to aid in comparative evaluation of families + and candidate variants in rare-variant association studies. The package can be used for + two methodologically overlapping but distinct purposes. First, the prior to any genetic or genomic + evaluation, evaluation of relative detection power of pedigrees, can direct recruitment + efforts by showing which unsampled individuals would be the most meaningful additions to a study. + Second, after sequencing and analysis, variants based on association with disease status and familial relationships of individuals, aids in variant prioritization. License: MIT + file LICENSE Encoding: UTF-8 @@ -22,5 +22,5 @@ VignetteBuilder: knitr Suggests: devtools, testthat, - knitr, + knitr, rmarkdown diff --git a/R/io.R b/R/io.R index cc7da22..c611cc2 100644 --- a/R/io.R +++ b/R/io.R @@ -43,7 +43,7 @@ read.relation.mat <- function(fname){ #' status encoded in the indivudal's names #' #' Note - ensure the status in the names match your desired encoding! -#' There are individuals with ambigious statues, that you may require to +#' There are individuals with ambiguous statues, that you may require to #' be encoded in a specific fashion for you current purposes. #' #' @@ -80,6 +80,3 @@ read.var.table <- function(fname){ "variant" = in.variants) return(out.df) } - - - diff --git a/R/pedigree.r b/R/pedigree.r index cf80331..f835a32 100644 --- a/R/pedigree.r +++ b/R/pedigree.r @@ -146,7 +146,7 @@ score.pedigree <- function(h){ for (i in seq_len(nrow(h))) { family <- h[i,"Family"] max.a <- h[i, "max_a"] - #Yeezy yeezy whats good its ya boy + #Yeezy yeezy what's good, its ya boy max.b <- h[i, "max_b"] max.c <- h[i, "max_c"] max.d <- h[i, "max_d"] diff --git a/README.md b/README.md index c2310e1..874572a 100644 --- a/README.md +++ b/README.md @@ -16,7 +16,7 @@ The development version of `KinformR` can be installed directly from GitHub. You ``` #install.packages("devtools") #install.packages("knitr") #required if build_vignettes = TRUE -#library(devtools) +#library(devtools) devtools::install_github("SequenceBio/KinformR", build_vignettes = TRUE) library(KinformR) ``` @@ -25,14 +25,14 @@ library(KinformR) The package's vignette contains detailed explanations of the functions and parameters. -For a walk through of the `KinformR` functions for scoring the value of *families* based on penetrance and IBD, see the corresponging vignette file: +For a walk through of the `KinformR` functions for scoring the value of *families* based on penetrance and IBD, see the corresponding vignette file: `vignettes/KinformR-penetrance_and_ibd.Rmd` or within R, run: ``` vignette('KinformR-penetrance_and_ibd') ``` -For a walk through of the `KinformR` functions for scoring the value of *variants* within families, see the corresponging vignette file: +For a walk through of the `KinformR` functions for scoring the value of *variants* within families, see the corresponding vignette file: `vignettes/KinformR-variant_scoring.Rmd` or within R, run: @@ -59,7 +59,7 @@ and scoring then performed: ## Scoring Variants -When looking at shared rare variants across families, not all sets of affected and unaffected individuals are equal. This R package is designed to score rare variants, assigning values based on the disease status of individuals, the presence or absence of a rare variant in those individuals, and their pairwise coefficients of relatedness. The package uses a custom formula to assign value to a variant that gives more weight to shared variants common to distantly related affected individuals. The variant status for unaffected individuals can optionally be considered as well, with the highest scoring values being given to closely related individuals that *do not* share a variant of interst. Since variants can be incompletely penetrant, the scoring can be based solely on the affected individuals, or the weight of unaffected evidence can be customized. +When looking at shared rare variants across families, not all sets of affected and unaffected individuals are equal. This R package is designed to score rare variants, assigning values based on the disease status of individuals, the presence or absence of a rare variant in those individuals, and their pairwise coefficients of relatedness. The package uses a custom formula to assign value to a variant that gives more weight to shared variants common to distantly related affected individuals. The variant status for unaffected individuals can optionally be considered as well, with the highest scoring values being given to closely related individuals that *do not* share a variant of interest. Since variants can be incompletely penetrant, the scoring can be based solely on the affected individuals, or the weight of unaffected evidence can be customized. ### The relationship matrix @@ -89,4 +89,4 @@ The two streams of information can then be combined to score a variant based off ``` score.example <- score.fam(rel.mat, ind.df.status) -``` \ No newline at end of file +``` diff --git a/man/add.fam.scores.Rd b/man/add.fam.scores.Rd index ac5e3a6..d5632c3 100644 --- a/man/add.fam.scores.Rd +++ b/man/add.fam.scores.Rd @@ -1,5 +1,5 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/relatedness.r +% Please edit documentation in R/relatedness.R \name{add.fam.scores} \alias{add.fam.scores} \title{Sum all the given scores and return a single vector with cumulative "score", "for" and "against" vals. diff --git a/man/calc.rv.score.Rd b/man/calc.rv.score.Rd index 52ade48..4c9a2e5 100644 --- a/man/calc.rv.score.Rd +++ b/man/calc.rv.score.Rd @@ -1,5 +1,5 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/relatedness.r +% Please edit documentation in R/relatedness.R \name{calc.rv.score} \alias{calc.rv.score} \title{Calculate a relatedness-weighted score for a given rare variant.} diff --git a/man/ibd.Rd b/man/ibd.Rd index 16173ed..fd7f665 100644 --- a/man/ibd.Rd +++ b/man/ibd.Rd @@ -1,5 +1,5 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/pedigree.r +% Please edit documentation in R/pedigree.R \name{ibd} \alias{ibd} \title{Calculation of Identity by descent (IBD).} diff --git a/man/penetrance.Rd b/man/penetrance.Rd index 7a0f98e..f3149bb 100644 --- a/man/penetrance.Rd +++ b/man/penetrance.Rd @@ -1,5 +1,5 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/pedigree.r +% Please edit documentation in R/pedigree.R \name{penetrance} \alias{penetrance} \title{Likelihood function for calculation of Pedigree-based autosomal dominant penetrance value. diff --git a/man/read.pedigree.Rd b/man/read.pedigree.Rd index b9e4ecf..841f5f1 100644 --- a/man/read.pedigree.Rd +++ b/man/read.pedigree.Rd @@ -1,5 +1,5 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/pedigree.r +% Please edit documentation in R/pedigree.R \name{read.pedigree} \alias{read.pedigree} \title{Read in the encoded pedigree data file.} diff --git a/man/read.var.table.Rd b/man/read.var.table.Rd index 2ad5848..22194a3 100644 --- a/man/read.var.table.Rd +++ b/man/read.var.table.Rd @@ -21,7 +21,7 @@ MS-5678-1001 A 0/1 } \description{ Note - ensure the status in the names match your desired encoding! -There are individuals with ambigious statues, that you may require to +There are individuals with ambiguous statues, that you may require to be encoded in a specific fashion for you current purposes. } \examples{ diff --git a/man/score.Rd b/man/score.Rd index d5e9ee5..f4a1ff8 100644 --- a/man/score.Rd +++ b/man/score.Rd @@ -1,5 +1,5 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/pedigree.r +% Please edit documentation in R/pedigree.R \name{score} \alias{score} \title{Score the pedigrees using the pihat values.} diff --git a/man/score.fam.Rd b/man/score.fam.Rd index f704823..2083931 100644 --- a/man/score.fam.Rd +++ b/man/score.fam.Rd @@ -1,5 +1,5 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/relatedness.r +% Please edit documentation in R/relatedness.R \name{score.fam} \alias{score.fam} \title{Given a relationship matrix and status dataframe, score a family by applying the calc.rv.score diff --git a/man/score.pedigree.Rd b/man/score.pedigree.Rd index ff03942..1b01449 100644 --- a/man/score.pedigree.Rd +++ b/man/score.pedigree.Rd @@ -1,5 +1,5 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/pedigree.r +% Please edit documentation in R/pedigree.R \name{score.pedigree} \alias{score.pedigree} \title{Take the encoded information about the pedigrees and calculate penetrance.} diff --git a/man/subset.mat.Rd b/man/subset.mat.Rd index e9ba75c..692cd4c 100644 --- a/man/subset.mat.Rd +++ b/man/subset.mat.Rd @@ -1,5 +1,5 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/relatedness.r +% Please edit documentation in R/relatedness.R \name{subset.mat} \alias{subset.mat} \title{Take the matrix and subset out only the encoded individuals that are present in the status dataframe.} diff --git a/tests/testthat/test_encoding.R b/tests/testthat/test_encoding.R index 3513a9d..ab2a8d3 100644 --- a/tests/testthat/test_encoding.R +++ b/tests/testthat/test_encoding.R @@ -37,10 +37,10 @@ test_that("Families are correctly encoded.", { expect_equal(scores$statvar.cat, expected.scores) print("theoretical.max high score values for a family") - ther.scores <- score.variant.status(indiv.df, theoretical.max=TRUE) + theory.scores <- score.variant.status(indiv.df, theoretical.max=TRUE) expected.thermax.scores <- c("A.c","U.c","A.c","A.c","A.c" ,"U.c", "A.c", "U.c") - expect_equal(ther.scores$statvar.cat, expected.thermax.scores) + expect_equal(theory.scores$statvar.cat, expected.thermax.scores) }) diff --git a/vignettes/KinformR-penetrance_and_ibd.Rmd b/vignettes/KinformR-penetrance_and_ibd.Rmd index 869165a..e9ee8f3 100644 --- a/vignettes/KinformR-penetrance_and_ibd.Rmd +++ b/vignettes/KinformR-penetrance_and_ibd.Rmd @@ -3,7 +3,7 @@ title: "KinformR - penetrance and idb informed scoring of families" author: "Cameron M. Nugent" date: "`r format(Sys.time(), '%d %B, %Y')`" data: "`r Sys.Date()`" -output: rmarkdown::pdf_document # rmarkdown::html_vignette # +output: rmarkdown::pdf_document # rmarkdown::html_vignette # pdf_document: df_print: kable vignette: > @@ -37,12 +37,12 @@ show <- function(df){ The family power calculations depend on a single tab-delimited input file, where each row represents a family. The input file is read in using the `read.pedigree` function. ```{r} -example.pedigree.file <- system.file('extdata/example_pedigree_encoding.tsv', +example.pedigree.file <- system.file('extdata/example_pedigree_encoding.tsv', package = 'KinformR') example.pedigree.df <- read.pedigree(example.pedigree.file) ``` -The input file is expected to have the following 11 columns (with a header). +The input file is expected to have the following 11 columns (with a header). ```{r} colnames(example.pedigree.df) @@ -51,7 +51,7 @@ colnames(example.pedigree.df) ### Simplified summary of pedigrees For now this file should be be constructed through careful manual inspection of the predigrees. To encode the rows for each family, you should first prune down pedigrees to informative allele transfers. For -the purposes of this tool, we exclude young generations (non-adults, younger than age of onset) and large (more than two sequential generations) trees of exclusively unaffected family members. Additionally all individuals require a binary A/U status, there should be no ambigious individuals. There will be some judgment calls required here. +the purposes of this tool, we exclude young generations (non-adults, younger than age of onset) and large (more than two sequential generations) trees of exclusively unaffected family members. Additionally all individuals require a binary A/U status, there should be no ambiguous individuals. There will be some judgment calls required here. ### Encoding categories of relationships @@ -73,11 +73,11 @@ show(example.pedigree.df) ``` All columns with the prefix `max_` are meant to count the total number of each category in the pedigree, while -the columns without this prefix are the number of each category for whom samples have been collected. +the columns without this prefix are the number of each category for whom samples have been collected. The categories correspond to A, B, and C as defined above. -Category D is represented by two numbers, d and n. n is the number of offspring in a tree of unaffecteds; d is the number of those types of trees across the pedigree. Multiple types of trees are encoded with commas separating the values. For example, the following represents a family with three total trees of unaffecteds. One tree (d=1) has three offspring (n=3); two trees (d=2) each have one offspring (n=1). +Category D is represented by two numbers, d and n. n is the number of offspring in a tree of unaffecteds; d is the number of those types of trees across the pedigree. Multiple types of trees are encoded with commas separating the values. For example, the following represents a family with three total trees of unaffecteds. One tree (d=1) has three offspring (n=3); two trees (d=2) each have one offspring (n=1). ``` d n @@ -138,6 +138,3 @@ we only count the parent. (d=1, n=0; equivalently, c=1) 2. You have collected one or more children, but not the parent. In this case, each of the children contribute a portion of what the parent would have contributed to our understanding. (d=1, n>0) - - - diff --git a/vignettes/KinformR-variant_scoring.Rmd b/vignettes/KinformR-variant_scoring.Rmd index ae66954..a095ebc 100644 --- a/vignettes/KinformR-variant_scoring.Rmd +++ b/vignettes/KinformR-variant_scoring.Rmd @@ -3,7 +3,7 @@ title: "KinformR - pedigree-informed rare variant association scoring" author: "Cameron M. Nugent" date: "`r format(Sys.time(), '%d %B, %Y')`" data: "`r Sys.Date()`" -output: rmarkdown::pdf_document #rmarkdown::html_vignette # +output: rmarkdown::pdf_document #rmarkdown::html_vignette # pdf_document: df_print: kable vignette: > @@ -43,7 +43,7 @@ To read in the data, one uses the function `read.relation.mat`. mat.name1<-system.file('extdata/1234_ex2.mat', package = 'KinformR') rel.mat <- read.relation.mat(mat.name1) show(rel.mat) -``` +``` ### The status file @@ -60,7 +60,7 @@ tsv.name1<-system.file('extdata/1234_ex2.tsv', package = 'KinformR') status.df <- read.indiv(tsv.name1) show(status.df) -``` +``` The disease-genotype scoring can then be encoded using the `score.variant.status` function to produce the status-variant category for all individuals. This creates a df with the new column: `statvar.cat`. @@ -68,7 +68,7 @@ The disease-genotype scoring can then be encoded using the `score.variant.status full.df.status <- score.variant.status(status.df) show(full.df.status) -``` +``` @@ -80,7 +80,7 @@ For most real-world applications, you will likely want to score family members i ex.score.default <- score.fam(rel.mat, full.df.status) show(ex.score.default) -``` +``` By default `score.fam` returns: @@ -92,26 +92,26 @@ As previously noted, if an individual is present in the relationship matrix and The scoring can be changed to summing across all combinations as opposed to the mean by passing the following options. Note using the program in this way will return higher scores for more dense pedigrees. ```{r} -ex.score.sum <- score.fam(rel.mat, full.df.status, +ex.score.sum <- score.fam(rel.mat, full.df.status, return.sums = TRUE, return.means = FALSE) show(ex.score.sum) -``` +``` To obtain a long form table with the scores for variants expressed relative to each individual, set both `return.sums` and `return.means` to `FALSE`. This output can aid in identifying which individuals are carrying the most weight in a family's score. ```{r} -ex.score.table <- score.fam(rel.mat, full.df.status, +ex.score.table <- score.fam(rel.mat, full.df.status, return.sums = FALSE, return.means = FALSE) show(ex.score.table) -``` +``` -## How scoring works +## How scoring works ### A Minimal example, scoring a variant from perspective of a single individual. -This section is meant to demonstrate how the variant scoring is accomplished on a finer scale. A user does not need to interact with the package on this level of granularity. This section is for explanatory purposes only, demonstrating how the `score.fam` function operated "under the hood". +This section is meant to demonstrate how the variant scoring is accomplished on a finer scale. A user does not need to interact with the package on this level of granularity. This section is for explanatory purposes only, demonstrating how the `score.fam` function operated "under the hood". -The `score.fam` function runs the scoring method once for each affected individual in the status dataframe (or for each individual regardless of status if `affected.only = FALSE`). To do this, for each individual, the program takes corresponding row of the relationship matrix to determine the relations to all other individuals in the pedigree. +The `score.fam` function runs the scoring method once for each affected individual in the status dataframe (or for each individual regardless of status if `affected.only = FALSE`). To do this, for each individual, the program takes corresponding row of the relationship matrix to determine the relations to all other individuals in the pedigree. For example, the degrees of relationships of all other members of the example family relative to the reference individual `"MS-1234-1001"` are show in the following subset of the matrix: @@ -135,25 +135,25 @@ name.stat.dict ```{r} rel.dict<-build.relation.dict(rel.mat.proband, name.stat.dict) rel.dict -``` +``` In this example, the proband, two first degree relations, and a third degree relations are all affected and share the candidate variant. For the affected correct (`A.c`) category we therefore see the following encoded: ```{r} rel.dict$A.c -``` +``` Since one first degree unaffected relative has the variant, they are categorized as "unaffected incorrect"(`U.i`) and we see: ```{r} rel.dict$U.i -``` +``` Deriving a relatedness-weighted score for the variant from the perspective of the given individual is then performed by `calc.rv.score` -For each degree-encoded relationship, the coefficient of relatedness is used to weight the evidence for or against a variant. The coefficients for different degress of relationship are: +For each degree-encoded relationship, the coefficient of relatedness is used to weight the evidence for or against a variant. The coefficients for different degrees of relationship are: ```{r} for(i in 0:7){ - print(paste0("Degree of relatedness: ", i, - " coefficient of relatedness: ", 1 / (2 ** (i)))) + print(paste0("Degree of relatedness: ", i, + " coefficient of relatedness: ", 1 / (2 ** (i)))) } ``` @@ -200,10 +200,10 @@ The final score for the variant would then be: ``` Giving a final score of 10 for the variant. -This is all accomplished by the function `calc.rv.score`. +This is all accomplished by the function `calc.rv.score`. ```{r} calc.rv.score(rel.dict) -``` +``` The weights of the scoring can be adjusted, for example if we wanted to consider only `affected`-based evidence, we could turn off the unaffected part of the calculation by setting the unaffected weighting to 0. This can be useful for incompletely penetrant variants, where disease status and genotype of unaffected individuals are more likely to have imperfect concordance. @@ -211,8 +211,6 @@ Additionally, families with low numbers of affected individuals sequenced and hi ```{r} calc.rv.score(rel.dict, unaffected.weight=0) -``` +``` The `score.fam` function automatically walks through this process from all specified perspectives in the pedigree and by default returns the average score. The use of the averages and different perspectives is meant to eliminate pedigree-associated bias, such as for instances when a proband is distantly related to all other members in a family (considering the relationships from only the perspective of the proband in this case would give an inflated score for the variant's value). - -