--- title: Pedigree kinship() details author: TM Therneau date: '`r format(Sys.time(),"%d %B, %Y")`' output: BiocStyle::html_document: toc: true toc_depth: 2 header-includes: \usepackage{tabularx} vignette: | %\VignetteIndexEntry{Pedigree kinship() details} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} --- Introduction ===================== The kinship matrix is foundational for random effects models with family data. For $n$ subjects it is an $n \times n$ matrix whose $i$, $j$ elements contains the expected fraction of alleles that would be identical by descent if we sampled one from subject $i$ and another from subject $j$. Note that the diagonal elements of the matrix will be 0.5 not 1: when we randomly sample twice from the same subject (with replacement) we will get two copies of the gene inherited from the father 1/4 of the time, the maternal copy twice (1/4) or one of each 1/2 the time. The formal definition is $K(i,i) = 1/4 + 1/4 + 1/2 K(m,f)$ where $m$ and $f$ are the father and mother of subject $i$. The algorithm used is found in K Lange, *Mathematical and Statistical Methods for Genetic Analysis*, Springer 1997, page 71--72. The key idea of the recursive algorithm for $K(i,j)$ is to condition on the gene selection for the first index $i$. Let $m(i)$ and $f(i)$ be the indices of the mother and father of subject $i$ and $g$ be the allele randomly sampled from subject $i$, which may of either maternal or paternal origin. $$ \begin{align} K(i,j) &= P(\mbox{$g$ maternal}) * K(m(i), j) + P(\mbox{$g$ paternal}) * K(f(i), j) \label{recur0} \\ &= 1/2 K(m(i), j) + 1/2 K(f(i), j) \label{recur1} \\ K(i,i) &= 1/2(1 + K(m(i), f(i))) \label{self} \end{align} $$ The key step in equation $K(i,j)$ is if $g$ has a maternal origin, then it is a random selection from the two maternal genes, and its IBD state with respect to subject $j$ is that of a random selection from $m(i)$ to a random selection from $j$. This is precisely the definition of $K(m(i), j)$. The recursion does not work for $K(i,i)$ since once we select a maternal gene the second choice from $j$ cannot use a different maternal gene. For the recurrence algorithm to work properly we need to compute the values of $K$ for any parent before the calculations for their children. Pedigree founders (those with no parents) are assumed to be unassociated, so for these subjects we have $$ \begin{align*} K(i,i) &= 1/2 \\ K(i,j) &= 0 \ ; i\ne j \end{align*} $$ The final formula slightly different for the $X$ chromosome. Equation $K(i,j)$ still holds, but for males the probability that a selected $X$ chromosome is maternal is 1, so when $i$ a male the recurrence formula becomes $K(i,j) = K(m(i),j)$. For females it is unchanged. All males will have $K(i,i) = 1$ for the $X$ chromosome. In order to have already-defined terms on the right hand side of the recurrence formula for each element, subjects need to be processed in the following order - Generation 0 (founders) - $K(i,j)$ where $i$ is from generation 1 and $j$ from generation 0. - $K(i,j)$ with $i$ and $j$ from generation 1 - $K(i,j)$ with $i$ from generation 2 and $j$ from generation 0 or 1 - $K(i,j)$ with $i$ and $j$ from generation 2. - ... The kindepth routine assigns a plotting depth to each subject in such a way that parents are always above children. For each depth we need to do the compuations of formula $K(i,j)$ twice. The first time it will get the relationship between each subject and prior generations correct, the second will correctly compute the values between subjects on the same level. The computations within any stage of the above list can be vectorized, but not those between stages. Let $indx$ be the index of the rows for the generation currently being processed, say generation $g$. We add correct computations to the matrix one row at a time; all of the calculations depend only on the prior rows with the exception of the $i,i$ element. This approach leads to a for loop containing operations on single rows/columns. At one point below we use a vectorized version. It looks like the snippet below ```{r, kinship_algo, eval=FALSE} for (g in 1:max(depth)) { indx <- which(depth == g) kmat[indx, ] <- (kmat[mother[indx], ] + kmat[father[indx], ]) / 2 kmat[, indx] <- (kmat[, mother[indx]] + kmat[, father[indx], ]) / 2 for (j in indx) kmat[j, j] <- (1 + kmat[mother[j], father[j]]) / 2 } ``` The first line computes all the values for a horizontal stripe of the matrix. It will be correct for columns in generations $