---
title: "Pedixplorer tutorial"
author: "Louis Le Nézet"
date: "31/08/2023"
url: "https://github.com/LouisLeNezet/Pedixplorer"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
    fig_crop: no
header-includes: \usepackage{tabularx}
vignette: |
  %\VignetteIndexEntry{Pedixplorer tutorial}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: 72
---

```{r width_control, echo = FALSE}
options(width = 100)
```

# Introduction

This document is a tutorial for the `Pedixplorer` package, with examples
of creating `Pedigree` objects and kinship matrices and other pedigree
utilities.

The `Pedixplorer` package is an updated version of the
[`Kinship2`](https://github.com/mayoverse/kinship2) package, featuring a
change in maintainer and repository from CRAN to Bioconductor for
continued development and support.

It contains the routines to handle family data with a `Pedigree` object.
The initial purpose was to create correlation structures that describe
family relationships such as kinship and identity-by-descent, which can
be used to model family data in mixed effects models, such as in the
`coxme` function. It also includes tools for pedigree drawing and
filtering which is focused on producing compact layouts without
intervention. Recent additions include utilities to trim the `Pedigree`
object with various criteria, and kinship for the X chromosome.

Supplementary vignettes are available to explain:

-   The **`Pedigree` object**
    `vignette("pedigree_object", package = "Pedixplorer")`
-   The **alignment algorithm** used to create the pedigree structure
    `vignette("pedigree_alignment", package = "Pedixplorer")`
-   The **kinship algorithm**
    `vignette("pedigree_kinship", package = "Pedixplorer")`
-   The **plotting algorithm** used to plot the pedigree
    `vignette("pedigree_plot", package = "Pedixplorer")`

# Installation

The `Pedixplorer` package is available on
[Bioconductor](https://www.bioconductor.org/packages/release/bioc/html/Pedixplorer.html)
and can be installed with the following command:

```{r BiocManager_install, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("Pedixplorer")
```

The package can then be loaded with the following command:

```{r, library_charge}
library(Pedixplorer)
```

# The `Pedigree` S4 object

The `Pedigree` object is a list of dataframes that describe the family
structure. It contains the following components:

-   ped: a `Ped` object with the pedigree information `help(Ped)`.
-   rel: a `Rel` object with the relationship information `help(Rel)`.
-   scales: a `Scales` object of 2 dataframe with the filling and borders
    informations for the plot `help(Scales)`.
-   hints: a `Hints` objects with 2 slots indicating the horder and the
    spouse to organise the pedigree structure `help(Hints)`.

# Basic Usage

## Example Data

Two datasets are provided within the `Pedixplorer` package: + `minnbreast`:
17 families from a breast cancer study + `sampleped`: two sample pedigrees,
with 41 and 14 subjects and the special relationship of these two pedigrees
in `relped`.

This vignette uses the two pedigrees in `sampleped`. For more
information on these datasets, see `help(minnbreast)` and
`help(sampleped)`.

## Pedigree

First, we load `sampleped` and look at some of the values in the dataset,
and create a `Pedigree` object using the `Pedigree()` function. This
function automaticaly detect the necessary columns in the dataframe. If
necessary you can modify the columns names with *cols_ren*. To create a
`Pedigree` object, with multiple families, the dataframe just need a
family column in the *ped_df* dataframe. When this is the case, the
famid column will be pasted to the id of each individuals separated by
an underscore to create a unique id for each individual in the `Pedigree`
object.

```{r, Pedigree_creation}
data("sampleped")
print(sampleped[1:10, ])
ped <- Pedigree(sampleped[c(3, 4, 10, 35, 36), ])
print(ped)
```

For more information on the `Pedigree()` function, see `help(Pedigree)`.

The `Pedigree` object can be subset to individual pedigrees by their
family id. The `Pedigree` object has a print, summary and plot method,
which we show below. The print method prints the `Ped` and `Rel` object of
the pedigree. The summary method prints a short summary of the pedigree.
Finally the plot method displays the pedigree.

```{r, ped1, fig.alt = "Pedigree of family 1", fig.align = "center"}
ped <- Pedigree(sampleped)
print(famid(ped(ped)))
ped1 <- ped[famid(ped(ped)) == "1"]
summary(ped1)
plot(ped1, cex = 0.7)
```

You can add a title and a legend to the plot with the following command:

```{r, ped1_title, fig.alt = "Pedigree of family 1 with legend"}
plot(
    ped1, title = "Pedigree 1",
    legend = TRUE, leg_loc = c(7, 16, 1.05, 1.9),
    cex = 0.7
)
```

## Pedigree Shiny application

A shiny application is available to create, interact and plot pedigrees.
To launch the application, use the following command:

```{r, shiny_app, eval = FALSE}
ped_shiny()
```

The use is simple:

- You first need to import a dataset and select the columns to use.
- You can then select the affection informations and the colors associated
to them.
- If different families are present in the dataset, you can select which
one to plot.
- Before the plot is displayed, you can filter the pedigree by selecting the 
informatives subjects to keep and their relatives. If the pedigree is then
splited in multiple families, you can select which to plot.
- Finally the plot is displayed and you can make it interactive and download
the resulting image.


# Fixing Pedigree Issues

To "break" the pedigree, we can manipulate the sex value to not match
the parent value (in this example, we change *203* from a male to a
female, even though *203* is a father). To do this, we first subset
*datped2*, locate the `id` column, and match it to a specific id (in
this case, *203*). Within id *203*, then locate in the `sex` column.
Assign this subset to the incorrect value of `2` (female) to change the
original/correct value of `1` (male).

To further break the pedigree, we can delete subjects who seem
irrelevant to the pedigree (in this example, we delete *209* because he
is a married-in father). To do this, we subset *datped2* and use the
`which()` function to locate and delete the specified subject (in this
case, *209*). Reassign this code to *datped22* to drop the specified
subject entirely.

```{r, datped2}
datped2 <- sampleped[sampleped$famid == 2, ]
datped2[datped2$id %in% 203, "sex"] <- 2
datped2 <- datped2[-which(datped2$id %in% 209), ]
```

An error occurs when the `Pedigree()` function notices that id *203* is
not coded to be male (`1`) but is a father. To correct this, we simply
employ the `fix_parents()` function to adjust the `sex` value to match
either `momid` or `dadid`. `fix_parents()` will also add back in any
deleted subjects, further fixing the Pedigree.

```{r, fixped2, fig.alt = "Pedigree of family 2"}
tryout <- try({
    ped2 <- Pedigree(datped2)
})
fixped2 <- with(datped2, fix_parents(id, dadid, momid, sex))
fixped2
ped2 <- Pedigree(fixped2)
plot(ped2)
```

If the fix is straightforward (changing one sex value based on either
being a mother or father), `fix_parents()` will resolve the issue. If
the issue is more complicated, say if *203* is coded to be both a father
and a mother, `fix_parents()` will not know which one is correct and
therefore the issue will not be resolved.

# Kinship

A common use for pedigrees is to make a matrix of kinship coefficients
that can be used in mixed effect models. A kinship coefficient is the
probability that a randomly selected allele from two people at a given
locus will be identical by descent (IBD), assuming all founder alleles
are independent. For example, we each have two alleles per autosomal
marker, so sampling two alleles with replacement from our own DNA has
only $p=0.50$ probability of getting the same allele twice.

## Kinship for Pedigree object

We use `kinship()` to calculate the kinship matrix for *ped2*. The
result is a special symmetrix matrix class from the [Matrix R
package](https://CRAN.R-project.org/package=Matrix/), which is stored
efficiently to avoid repeating elements.

```{r, calc_kinship}
kin2 <- kinship(ped2)
kin2[1:9, 1:9]
```

For family 2, see that the row and column names match the id in the
figure below, and see that each kinship coefficient with themselves is
*0.50*, siblings are *0.25* (e.g. *204-205*), and pedigree marry-ins
only share alleles IBD with their children with coefficient *0.25* (e.g.
*203-210*). The plot can be used to verify other kinship coefficients.

## Kinship for Pedigree with multiple families

The `kinship()` function also works on a `Pedigree` object with multiple
families. We show how to create the kinship matrix, then show a snapshot
of them for the two families, where the row and columns names are the
ids of the subject.

```{r, kin_all}
ped <- Pedigree(sampleped)
kin_all <- kinship(ped)
kin_all[1:9, 1:9]
kin_all[40:43, 40:43]
kin_all[42:46, 42:46]
```

## Kinship for twins in Pedigree with multiple families

Specifying twin relationships in a Pedigree with multiple families
object is complicated by the fact that the user must specify the family
id to which the `id1` and `id2` belong. We show below the relation
matrix requires the family id to be in the last column, with the column
names as done below, to make the plotting and kinship matrices to show
up with the monozygotic twins correctly. We show how to specify
monozygosity for subjects *206* and *207* in *ped2*, and subjects
*125* and *126* in *ped1*. We check it by looking at the kinship matrix
for these pairs, which are correctly at *0.5*.

```{r, kin_twins}
data("relped")
relped
ped <- Pedigree(sampleped, relped)

kin_all <- kinship(ped)
kin_all[24:27, 24:27]
kin_all[46:50, 46:50]
```

Note that subject *113* is not in *ped1* because they are a marry-in
without children in the `Pedigree`. Subject *113* is in their own `Pedigree`
of size 1 in the *kin_all* matrix at index *41*. We later show how to
handle such marry-ins for plotting.

# Optional Pedigree Informations

We use *ped2* from `sampleped` to sequentially add optional
information to the `Pedigree` object.

## Status

The example below shows how to specify a `status` indicator, such as
vital status. The `sampleped` data does not include such an\
indicator, so we create one to indicate that the first generation of
*ped2*, subjects *1* and *2*, are deceased. The `status` indicator is
used to cross out the individuals in the Pedigree plot.

```{r, status, fig.alt = "Pedigree of family 2 with different vital status"}
df2 <- sampleped[sampleped$famid == 2, ]
names(df2)
df2$status <- c(1, 1, rep(0, 12))
ped2 <- Pedigree(df2)
summary(status(ped(ped2)))
plot(ped2)
```

## Labels

Here we show how to use the `label` argument in the plot method to add
additional information under each subject. In the example below, we add
names to the existing plot by adding a new column to the `elementMetadata`
of the `Ped` object of the `Pedigree`.

As space permits, more lines and characters per line can be
made using the a {/em \n} character to indicate a new line.

```{r, labels, fig.alt = "Pedigree of family 2 with names label"}
mcols(ped2)$Names <- c(
    "John\nDalton", "Linda", "Jack", "Rachel", "Joe", "Deb",
    "Lucy", "Ken", "Barb", "Mike", "Matt",
    "Mindy", "Mark", "Marie\nCurie"
)
plot(ped2, label = "Names", cex = 0.7)
```

## Affected Indicators

We show how to specify affected status with a single indicator and
multiple indicators. First, we use the affected indicator from
`sampleped`, which contains *0*/*1* indicators and *NA* as missing, and let it
it indicate blue eyes. Next, we create a vector as an indicator for
baldness. And add it as a second filling scale for the plot with
`generate_colors(add_to_scale = TRUE)`. The plot shapes for each subject
is therefore divided into two equal parts and shaded differently to
indicate the two affected indicators.

```{r, two_affection, fig.alt = "Pedigree of family 2 with two affection indicators"}
mcols(ped2)$bald <- as.factor(c(0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1))
ped2 <- generate_colors(ped2, col_aff = "bald", add_to_scale = TRUE)
# Increase down margin for the legend
op <- par(mai = c(1.5, 0.2, 0.2, 0.2))
plot(
    ped2, legend = TRUE,
    leg_loc = c(0.5, 6, 3.5, 4)
)
# Reset graphical parameter
par(op)
```

## Special Relationships

Special pedigree relationships can be specified in a matrix as the
`rel_df` argument in the `Pedigree()` constructor.
There are 4 relationships that can be specified by
numeric codes:

- `1` = Monozygotic twins
- `2` = Dizygotic twins
- `3` = Twins of unknown zygosity
- `4` = Spouse

The spouse relationship can indicate a
marry-in when a couple does not have children together.

### Twins

Below, we use the relationship dataset.
We can specify in the code column if the individuals are
monozygotic `1`, dizygotic `2` or of unknown-zygosity `3` twins.
The twin relationships are both represented with diverging lines from a
single point. The monozygotic twins have an additional line connecting
the diverging lines, while twins of unknown zygosity have a question mark.

```{r, twins, fig.alt = "Pedigree of family 2 with special relationships"}
## create twin relationships
data("relped")
rel(ped2) <- Rel(relped[relped$famid == 2, ])
plot(ped2)
```

### Inbreeding

Another special relationship is inbreeding. Inbreeding of founders
implies the founders' parents are related (the maternal and paternal
genes descended from a single ancestral gene). One thing we can do is
add more people to the pedigree to show this inbreeding.

To show that a pair of founders (subjects *201* and *202*) are inbred,
we must show that their parents are siblings. To do this, we create
subjects *197* and *198* to be the parents of *201* and also create
subjects *199* and *200* to be the parents of *202*. To make subjects
*198* and *199* siblings, we give them the same parents, creating
subjects *195* and *196*. This results in subjects *201* and *202* being
first cousins, and therefore inbred.

```{r, inbreeding, fig.alt = "Pedigree with inbreeding"}
indid <- 195:202
dadid <- c(NA, NA, NA, 196, 196, NA, 197, 199)
momid <- c(NA, NA, NA, 195, 195, NA, 198, 200)
sex <- c(2, 1, 1, 2, 1, 2, 1, 2)
ped3 <- data.frame(
    id = indid, dadid = dadid,
    momid = momid, sex = sex
)

ped4df <- rbind.data.frame(df2[-c(1, 2), 2:5], ped3)
ped4 <- Pedigree(ped4df)
plot(ped4)
```

### Marry-ins

Spouse with no child can also be specified with the `rel_df` argument by
setting the code value to `spouse` or `4`. If we use the *ped2* from
earlier and add a new spouse relationship between the individuals *212*
and *211* we get the following plot.

```{r, spouse, fig.alt = "Pedigree with spouse with no children"}
## create twin relationships
rel_df2 <- data.frame(
    id1 = "211",
    id2 = "212",
    code = 4,
    famid = "2"
)
new_rel <- c(rel(ped2), with(rel_df2, Rel(id1, id2, code, famid)))
rel(ped2) <- upd_famid(new_rel)
plot(ped2)
```

# Pedigree Plot Details

The plot method attempts to adhere to many standards in pedigree
plotting, as presented by [Bennet et al.
2008](https://pubmed.ncbi.nlm.nih.gov/18792771/).

To show some other tricks with pedigree plotting, we use *ped1* from
`sampleped`, which has 41 subjects in 4 generations, including a
generation with double first cousins. After the first marriage of *114*,
they remarried subject *113* without children between them. If we do not
specify the marriage with the `rel_df` argument, the plot method
excludes subject *113* from the plot. The basic plot of *ped1* is
shown in the figure below.

```{r, plotped1, fig.alt = "Pedigree of family 1"}
df1 <- sampleped[sampleped$famid == 1, ]
relate1 <- data.frame(
    id1 = 113,
    id2 = 114,
    code = 4,
    famid = 1
)
ped1 <- Pedigree(df1, relate1)
plot(ped1, cex = 0.7)
```

## Align by Input Order

The plot method does a decent job aligning subjects given the order of
the subjects when the Pedigree object is made, and sometimes has to make
two copies of a subject. If we change the order of the subjects when
creating the Pedigree, we can help the plot method reduce the need to
duplicate subjects, as Figure\~\ref{reordPed1} no longer has subject
*110* duplicated.

```{r, ordering, fig.alt = "Pedigree of family 1 with reordering"}
df1reord <- df1[c(35:41, 1:34), ]
ped1reord <- Pedigree(df1reord, relate1)
plot(ped1reord, cex = 0.7)
```

## Plot colors and scales

The `Pedigree` object contains a `Scales` object that can be modified to
change the colors and patterns used in the plot.
To make it easy for the user to modify it a function `generate_colors()`
is available. This function will generate a color palette for the
filling and the bordering of the plot. This function transform a given
column of the dataframe into a factor and generate a color palette for
each level of the factor. The user can then modify the colors and the
patterns used for the filling and the bordering of the plot.

To do so you can do as follow:

```{r, generate_colors, fig.alt = "Pedigree of family 1 with change in colors"}
scales(ped1)
ped1 <- generate_colors(
    ped1, col_aff = "num",
    add_to_scale = TRUE, is_num = TRUE,
    keep_full_scale = TRUE, breaks = 2,
    colors_aff = c("blue", "green"),
    colors_unaff = c("yellow", "brown"),
    threshold = 3, sup_thres_aff = FALSE
)

plot(ped1, cex = 0.7)

# To modify a given scale you can do as follow
fill(ped1)
fill(ped1)$fill[4] <- "#970b6d"
fill(ped1)$density[5] <- 30
fill(ped1)$angle[5] <- 45
border(ped1)$border <- c("red", "black", "orange")
plot(
    ped1, cex = 0.7, legend = TRUE,
    leg_loc = c(6, 16, 1, 1.8), leg_cex = 0.5,
)
```

# Pedigree Utility Functions

## Ped as a data.frame

A main features of a `Pedigree` object are vectors with an element for
each subject. It is sometimes useful to extract these vectors from the
Pedigree object into a `data.frame` with basic information that can be
used to construct a new `Pedigree` object. This is possible with the
`as.data.frame()` method, as shown below.

```{r, ped2df, eval = FALSE}
dfped2 <- as.data.frame(ped(ped2))
dfped2
```

## Automatic filtering

The `useful_inds()` allows to filter a huge and complex pedigree easily
by providing the informative individuals and the maximal distance from
them to the other individuals.
The informative individuals are provided through the `informative` argument
and can take the following values:

- `AvAf` (available and affected)
- `AvOrAf` (available or affected)
- `Av` (available only)
- `Af` (affected only)
- `All` (all individuals)
- A numeric/character vector of individuals id
- A boolean

They will be the individuals from which the distance will be compute.
This distance correspond to :

$$
    minDist = log2(\frac{1}{\max(kinship)})
$$

Therefore, the minimum distance is 0 when the maximum kinship is 1 and
is infinite when the maximum kinship is 0. For siblings, the kinship value
is 0.5 and the minimum distance is 1. Each time the kinship degree is divided
by 2, the minimum distance is increased by 1.
This distance can be understood as the number of step needed to link two
individuals on a pedigree.

Therefore the threshold `max_dist` can be interpreted as the size of a circle
around the informative individuals. All individuals inside the circle will be
kept and the others disregarded.

The `useful_inds()` function is used as follow and return the same `Pedigree`
object but with the `useful` column updated in the `Ped` object :

```{r, useful_inds}
data(sampleped)
ped1 <- Pedigree(sampleped)

ped1 <- useful_inds(ped1, informative = c("1_110", "1_120"), max_dist = 1)
print(useful(ped(ped1)))
ped_filtered <- ped1[useful(ped(ped1))]
plot(ped_filtered)
```

## Subsetting and Trimming

Pedigrees with large size can be a bottleneck for programs that run
calculations on them. The Pedixplorer package contains some routines to
identify which subjects to remove. We show how a subject (e.g. subject
*210*) can be removed from *ped2*, and how the Pedigree object is changed
by verifying that the `Rel` object no longer has the twin
relationship between subjects *210* and *211*, as indicated by `id1` and
`id2`.

```{r, subset}
ped2_rm210 <- ped2[-10]
rel(ped2_rm210)
rel(ped2)
```

The steps above also works by the `id` of the subjects themselves.\
We provide `subset()`, which trims subjects from a pedigree by their
`id` or other argument. Below is an example of removing subject *110*, as
done above, then we further trim the pedigree by a vector of subject
ids. We check the trimming by looking at the `id` vector and the
`Rel` object.

```{r, subset_more}
ped2_trim210 <- subset(ped2, "2_210", keep = FALSE)
id(ped(ped2_trim210))
rel(ped2_trim210)
ped2_trim_more <- subset(ped2_trim210, c("2_212", "2_214"), keep = FALSE)
id(ped(ped2_trim_more))
rel(ped2_trim_more)
```

## Shrinking

An additional function in Pedixplorer is `shrink()`, which shrinks a
pedigree to a specified bit size while maintaining the maximal amount of
information for genetic linkage and association studies. Using an
indicator for availability and affected status, it removes subjects in
this order: + unavailable with no available descendants + available and
are not parents + available who have missing affected status + available
who are unaffected + available who are affected

We show how to shrink *Pedigree 1* to bit size *30*, which happens to be
the bit size after removing only the unavailable subjects. We show how
to extract the shrunken `Pedigree` object from the `shrink()` result, and
plot it.

```{r, shrink1, fig.alt = "Pedigree of family 1 shrinked to 30 bits"}
set.seed(200)
shrink1_b30 <- shrink(ped1, max_bits = 30)
print(shrink1_b30[c(2:8)])
plot(shrink1_b30$pedObj)
```

Now shrink *Pedigree 1* to bit size *25*, which requires removing subjects
who are informative. If there is a tie between multiple subjects about
who to remove, the method randomly chooses one of them. With this seed
setting, the method removes subjects *140* then *141*.

```{r, shrink2, fig.alt = "Pedigree of family 1 shrinked to 25 bits"}
set.seed(10)
shrink1_b25 <- shrink(ped1, max_bits = 25)
print(shrink1_b25[c(2:8)])
plot(shrink1_b25$pedObj)
```

# Select Unrelateds

In this section we briefly show how to use `unrelated()` to find a set of
the maximum number of unrelated available subjects from a pedigree. The
input required is a `Pedigree` object and a vector indicating
availability. In some pedigrees there are numerous sets of subjects that
satisfy the maximum number of unrelateds, so the method randomly chooses
from the set. We show two sets of subject ids that are selected by the
routine and discuss below.

```{r, unrelateds}
ped2 <- Pedigree(df2)
set.seed(10)
set1 <- unrelated(ped2)
set1
set2 <- unrelated(ped2)
set2
```

We can easily verify the sets selected by `unrelated()` by referring to
Figure\~\ref{fixped} and see that subjects *203* and *206* are unrelated
to everyone else in the pedigree except their children. Furthermore, we
see in *df2* that of these two, only subject *203* is available. Therefore,
any set of unrelateds who are available must include subject *203* and one
of the these subjects: *201*, *204*, *206*, *207*, *212* and *214*, as
indicated by the kinship matrix for *Pedigree 2* subset to those with
availability status of *1*.

```{r, unrelkin}
kin2 <- kinship(ped2)
is_avail <- id(ped(ped2))[avail(ped(ped2))]
kin2
kin2[is_avail, is_avail]
```

# Session information

```{r}
sessionInfo()
```