---
title: Pedigree object
author: Terry Therneau, Elizabeth Atkinson, Louis Le Nézet
date: '`r format(Sys.time(),"%d %B, %Y")`'
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
header-includes: \usepackage{tabularx}
vignette: |
  %\VignetteIndexEntry{Pedigree object}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

Introduction
===============

The pedigree routines came out of a simple need -- to quickly draw a
Pedigree structure on the screen, within R, that was "good enough" to
help with debugging the actual routines of interest, which were those for
fitting mixed effects Cox models to large family data.  As such the routine
had compactness and automation as primary goals; complete annotation
(monozygous twins, multiple types of affected status) and most certainly
elegance were not on the list.  Other software could do that much
better.
        
It therefore came as a major surprise when these routines proved useful
to others.  Through their constant feedback, application to more
complex pedigrees, and ongoing requests for one more feature, the routine has 
become what it is today.  This routine is still not 
suitable for really large pedigrees, nor for heavily inbred ones such as in
animal studies, and will likely not evolve in that way.  The authors' fondest
hope is that others will pick up the project.
        
Pedigree Constructor
========================

## Arguments

The Pedigree function is the first step, creating an object of class
Pedigree.  
It accepts the following inputs:

- **ped_df** A dataframe containing the columns
    - $indId$ A numeric or character vector of subject identifiers.
    - $fatherId$ The identifier of the father.
    - $motherId$ The identifier of the mother.
    - $gender$ The gender of the individual.  This can be a numeric
    variable with codes of 1=male, 2=female, 3=unknown, 4=terminated,
    or NA=unknown.
    A character or factor variable can also be supplied containing
    the above; the string may be truncated and of arbitrary case.
    - $available$ Optional, a numeric variable with 0 = unavailable
    and 1 = available.
    - $affected$ Optional, a numeric variable with 0 = unaffected
    and 1 = affected.
    - $status$ Optional, a numeric variable with 0 = censored and
    1 = dead.
    - $famid$ Optional, a numeric or character vector of family
    identifiers.
    - $steril$ Optional, a numeric variable with 0 = not sterile and
    1 = sterile.
- **rel_df** Optional, a data frame with three or four columns.
    - $indId1$ identifier of the first subject of each pair
    - $indId2$ identifier of the second subject of each pair
    - $code$ relationship code: 1 = monozygotic twin,
    2 = dizygotic twin, 3 = twin of unknown zygosity, 4 = spouse.
    - $famid$ Optional, a numeric or character vector of family
    identifiers.
- **cols_ren_ped** Optional, a named list used to rename the columns of the
**ped_df** dataframe.
- **cols_ren_rel** Optional, a named list used to rename the columns of the
**rel_df** dataframe.
- **normalize** Optional, a logical indicating whether the data should be
normalised.
- **hints** Optional, a list containing $horder$, the order in which to plot
the individuals, and $spouse$, the matrix of the spouses.
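
As a minimal sketch of the expected input, the data frames below follow the
column names listed above for a hypothetical nuclear family with a pair of
dizygotic twins. The identifiers and values are invented for illustration,
and coding the parents of founders as `NA` is an assumption rather than a
documented requirement.

```r
## Hypothetical family: two founders and their two children (twins)
ped_df <- data.frame(
    indId = c("1", "2", "3", "4"),
    fatherId = c(NA, NA, "1", "1"),  # assumption: NA marks a missing parent
    motherId = c(NA, NA, "2", "2"),
    gender = c(1, 2, 1, 2),          # 1 = male, 2 = female
    affected = c(0, 0, 1, 0),        # optional column
    available = c(1, 1, 1, 0)        # optional column
)

## Special relationship: individuals 3 and 4 are dizygotic twins (code 2)
rel_df <- data.frame(
    indId1 = "3",
    indId2 = "4",
    code = 2
)

## Quick sanity check before building the object:
## every non-missing parent should also appear as an individual
parents <- c(ped_df$fatherId, ped_df$motherId)
all(parents[!is.na(parents)] %in% ped_df$indId)
```

With the package loaded, these two data frames would then presumably be
passed as `Pedigree(ped_df, rel_df)`, mirroring the calls used later in this
vignette.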

## Notes

Note that a factor variable is not listed as one of the choices for the
subject identifier. This is on purpose.  Factors
were designed to accommodate character strings whose values came from a limited
class -- things like race or gender, and are not appropriate for a subject
identifier.  All of their special properties as compared to a character
variable turn out to be backwards for this case, in particular a memory
of the original level set when subscripting is done.

However, due to the awful decision early on in S to automatically turn every
character into a factor (unless you stood at the door with a club to
head the package off), most users have become accustomed to
using them for every character variable.

(If you use a version of R older than 4.0.0, I encourage you to set the global
option `options(stringsAsFactors = FALSE)` to turn off autoconversion; it will
measurably improve your R experience. From R 4.0.0 onwards this is the default.)

Therefore, to avoid unnecessary hassle for our users 
the code will accept a factor as input for the id variables, but
the final structure does not retain it.  
Gender and relation do become factors.
Status follows the pattern of the survival routines and remains an integer.
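
The two properties of factors that cause trouble for identifiers can be seen
in a short base R sketch. The identifier values are hypothetical and the
snippet does not use any function of the package.

```r
## Identifiers stored as a factor
ids <- factor(c("101", "102", "201"))

## Subsetting keeps the full original level set
kept <- ids[ids != "201"]
levels(kept)
## "101" "102" "201"

## Numeric conversion returns the internal codes, not the identifiers
as.integer(ids)
## 1 2 3

## Converting to character first recovers the identifiers themselves
ids_chr <- as.character(ids)
as.integer(ids_chr)
## 101 102 201
```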

## Column renaming

Based on the dataframes given for **ped_df** and **rel_df** and their
corresponding named lists, the columns are renamed so that they can be used
correctly.
The renaming is done as follows:

```{r, column renaming}
rel_df <- data.frame(
    indId1 = c("110", "204"),
    indId2 = c("112", "205"),
    code = c(1, 2),
    family = c("1", "2")
)
cols_ren_rel <- list(
    id1 = "indId1",
    id2 = "indId2",
    famid = "family"
)

## Rename columns rel
old_cols <- as.vector(unlist(cols_ren_rel))
new_cols <- names(cols_ren_rel)
cols_to_ren <- match(old_cols, names(rel_df))
names(rel_df)[cols_to_ren[!is.na(cols_to_ren)]] <-
    new_cols[!is.na(cols_to_ren)]
print(rel_df)
```

## Normalisation

If normalisation is selected (`normalize = TRUE`), then both
dataframes are checked by their dedicated normalisation functions.
This ensures that all modalities are written correctly and set up the
right way. If a $famid$ column is present in the dataframe, then it is
combined with the id of each individual, separated by `"_"`, to
ensure the uniqueness of the individuals' identifiers.

```{r, normalisation}
library(Pedixplorer)
data("sampleped")
cols <- c("sex", "id", "avail")
summary(sampleped[cols])
ped <- Pedigree(sampleped)
summary(as.data.frame(ped(ped))[cols])
```
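
The effect of combining $famid$ with the individual identifier can be
sketched in base R. The data are hypothetical, and placing the family
identifier before the individual identifier is an assumption made for this
illustration, not a statement about the package internals.

```r
## Two families that reuse the same individual identifiers
df <- data.frame(
    famid = c("1", "1", "2", "2"),
    id = c("101", "102", "101", "102")
)

## The identifiers alone are ambiguous
anyDuplicated(df$id) > 0
## TRUE

## Assumed order: family identifier first, then the individual identifier
df$id_unique <- paste(df$famid, df$id, sep = "_")
df$id_unique

## The combined identifiers are unique
anyDuplicated(df$id_unique) > 0
## FALSE
```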

### Errors present after the normalisation process

If any error is detected after the normalisation process, then the normalised
dataframe is returned to the user with an error column added describing the
problems encountered.

```{r, rel_df errors}
rel_wrong <- rel_df
rel_wrong$code[2] <- "A"
df <- Pedigree(sampleped, rel_wrong)
print(df)
```

## Validation

Now that the data for the Pedigree object creation are ready, they are
given to a new $Pedigree$ object, triggering the $validation$ process.

This validation step checks for many errors, such as:

- All necessary columns are present
- No duplicated $id$
- All $momid$ and $dadid$ are present in $id$
- $sex$ column only contains "male", "female", "unknown" or "terminated" values
- $steril$, $status$, $available$, $affected$ only contain 0, 1 or NA values
- Fathers are male and mothers are female
- Twins have the same parents and MZ twins have the same sex
- Hints object is valid and the ids it contains are in the Ped object
- ...
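
A few of these checks can be approximated in base R. This is only a sketch on
hypothetical data with one deliberate error, assuming that an unknown parent
is coded as `NA`; the package runs its own, more complete validation.

```r
## Hypothetical normalised data with one deliberate error:
## individual "2" is recorded as a father but coded female
df <- data.frame(
    id = c("1", "2", "3", "4"),
    dadid = c(NA, NA, "1", "2"),
    momid = c(NA, NA, "2", "2"),
    sex = c("male", "female", "male", "female")
)

## No duplicated id
no_dup <- anyDuplicated(df$id) == 0

## All dadid and momid are present in id
parents <- c(df$dadid, df$momid)
parents_known <- all(parents[!is.na(parents)] %in% df$id)

## sex takes only the allowed values
sex_ok <- all(df$sex %in% c("male", "female", "unknown", "terminated"))

## Fathers are male and mothers are female
dads <- unique(df$dadid[!is.na(df$dadid)])
moms <- unique(df$momid[!is.na(df$momid)])
dads_male <- df$sex[match(dads, df$id)] == "male"
moms_female <- df$sex[match(moms, df$id)] == "female"

## Identifiers of the parents that fail the sex check
c(dads[!dads_male], moms[!moms_female])
## "2"
```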

Pedigree Class
========================

After validation, an $S4$ object is generated.
Using an $S4$ class makes it easy to set up methods for this new
type of object.
Parameter checks are also more precise.

The $Pedigree$ object contains 4 slots, each holding a different
$S4$ object with a specific type of information used for the Pedigree
construction.

- $ped$ a Ped object for the Pedigree information with at least the following
  slots:
    - $id$ the identifiers of the individuals
    - $dadid$ the identifiers of the fathers
    - $momid$ the identifiers of the mothers
    - $sex$ the gender of each individual
- $rel$ a Rel object describing all special relationships between individuals
that can't be described in the $ped$ slot.
The minimal slots needed are:
    - $id1$ the identifiers of the 1st individuals
    - $id2$ the identifiers of the 2nd individuals
    - $code$ factor describing the type of relationship
    ("MZ twin", "DZ twin", "UZ twin", "Spouse")
- $scales$ a Scales object with two slots:
    - $fill$ a dataframe describing which modalities in which columns
    correspond to an affected individual.
    Plotting information such as colour, angle and density is also provided.
    - $border$ a dataframe describing which modalities in which columns to
    use to plot the border of the plot elements.
- $hints$ a Hints object with two slots:
    - $horder$ numeric vector giving the plotting order of the individuals
    - $spouse$ a matrix of the spouses

For more information on each object:

- `help(Ped)`
- `help(Rel)`
- `help(Scales)`
- `help(Hints)`
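
For readers new to $S4$, the idea of a class whose slots are checked at
construction time can be illustrated with a toy example. The class `ToyPed`
and its rules are hypothetical and much simpler than the classes of the
package.

```r
library(methods)

## Toy class, unrelated to the real Ped class of the package
setClass(
    "ToyPed",
    representation(id = "character", sex = "character")
)

## Validity function: return TRUE or a character vector of problems
setValidity("ToyPed", function(object) {
    errors <- character()
    if (anyDuplicated(object@id) > 0) {
        errors <- c(errors, "id must be unique")
    }
    if (length(object@id) != length(object@sex)) {
        errors <- c(errors, "id and sex must have the same length")
    }
    if (length(errors) == 0) TRUE else errors
})

## A valid object is created without complaint
toy <- new("ToyPed", id = c("1", "2"), sex = c("male", "female"))
is(toy, "ToyPed")

## An invalid object is rejected; the error message is captured here
res <- tryCatch(
    new("ToyPed", id = c("1", "1"), sex = c("male", "female")),
    error = function(e) conditionMessage(e)
)
res
```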

Pedigree accessors
========================

As the Pedigree object is now an $S4$ class, a number of accessors are
available.
Most of them can be used as a getter or as a setter to modify a value in the
corresponding slot of the object.
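
The getter and setter pattern can be sketched with a toy class. The names
`ToyHints` and `toy_horder` are hypothetical and chosen so that they do not
clash with the accessors of the package, whose implementation may differ.

```r
library(methods)

## Toy class with a single slot, for illustration only
setClass("ToyHints", representation(horder = "numeric"))

## Getter generic and method
setGeneric("toy_horder", function(object) standardGeneric("toy_horder"))
setMethod("toy_horder", "ToyHints", function(object) object@horder)

## Setter (replacement) generic and method
setGeneric(
    "toy_horder<-",
    function(object, value) standardGeneric("toy_horder<-")
)
setMethod("toy_horder<-", "ToyHints", function(object, value) {
    object@horder <- value
    validObject(object)  # re-run the class checks after modification
    object
})

h <- new("ToyHints", horder = c(1, 2, 3))
toy_horder(h)

## The same name is used on the left-hand side to modify the slot
toy_horder(h) <- c(3, 1, 2)
toy_horder(h)
```

Re-running `validObject()` inside the setter is a design choice of this
sketch: it keeps a modified object subject to the same checks as a newly
created one.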

## For the Pedigree object

- Get/Set slots : ped(), rel(), scales(), hints()
- Wrapper to the Ped object: famid(), mcols()
- Wrapper of the Scales object: fill(), border()
- Wrapper of the Hints object: horder(), spouse()

## For the Ped object

- Given in input: id(), dadid(), momid(), famid(), sex()
- Other infos used : affected(), avail(), status()
- Computed : isinf(), kin(), useful()
- Metadata : mcols()

## For the Rel object

- id1(), id2(), code(), famid()

## For the Scales object

- fill(), border()

## For the Hints object

- horder(), spouse()

## Focus on mcols()

The mcols() accessor is the one you should use to add more
information about your individuals.

```{r, mcols}
ped <- Pedigree(sampleped)
mcols(ped)[8:12]
## For example, add a new column flagging whether each individual's
## identifier is below a given threshold
mcols(ped)$idth <- ifelse(as.numeric(mcols(ped)$indId) < 200, "A", "B")
mcols(ped)$idth
```


Pedigree methods
========================

This new S4 object comes with multiple methods that make it easier to use:

- plot()
- summary()
- print()
- show()
- as.list()
- `[`
- shrink()
- generate_colors()
- is_informative()
- kindepth()
- kinship()
- make_famid()
- upd_famid_id()
- num_child()
- unrelated()
- useful_inds()

```{r, pedigree methods}
## We can change the family name based on another column
ped <- upd_famid_id(ped, mcols(ped)$idth)

## We can extract a given family
pedA <- ped[famid(ped) == "A"]

## Plot it
plot(pedA, cex = 0.5)

## Do a summary
summary(pedA)

## Coerce it to a list
as.list(pedA)[[1]][1:3]

## Shrink it to keep only the necessary information
lst1_s <- shrink(pedA, max_bits = 10)
plot(lst1_s$pedObj, cex = 0.5)

## Compute the kinship matrix of the individuals
kinship(pedA)[1:10, 1:10]

## Get the useful individuals
pedA <- useful_inds(pedA, informative = "AvAf")
as.data.frame(ped(pedA))["useful"][1:10,]
```

Session information
===================

```{r}
sessionInfo()
```