An aligned Pedigree is an object that contains a Pedigree along with a set of information that allows for pretty plotting. This information consists of two parts: a set of vertical and horizontal plotting coordinates along with the identifier of the subject to be plotted at each position, and a list of connections to be made between parent/child, spouse/spouse, and twin/twin.
Creating this alignment turned out to be one of the more difficult parts of the project, and it is the area where significant further work could still be done.
All the routines in this section completely ignore the id component of a Pedigree; everyone is indexed solely by their row number in the object.
The first part of the work has to do with a hints list for each Pedigree. It consists of 3 parts:
The default starting values for all of these are simple: founders are processed in the order in which they appear in the data set, children appear in the order they are found in the data set, husbands are to the left of their wives, and a marriage is plotted at the leftmost spouse. A simple example where we want to bend these rules is when two families marry, and the Pedigrees for both extend above the wedded pair. In the joint Pedigree the pair should appear as the right-most child in the left hand family, and as the left-most child in the right hand family. With respect to founders, assume that a family has three lineages with a marriage between 1 and 2, and another between 2 and 3. In the joint Pedigree the sets should be 1, 2, 3 from left to right.
The hints consist of a list with two components.
This routine is used to create an initial hints list. It is a part of the general intention to make the routine do pretty good drawings automatically. The basic algorithm is trial and error.
The routine makes no attempt to reorder founders. It just is not smart enough to figure that out.
The first thing to be done is to check on twins. They increase the complexity, since twins need to move together. The rel(ped, "code") object is a factor, so first turn that into numeric. We create 3 vectors:
A recent addition is to carry the packed and align arguments forward to kindepth and align.
Next is an internal function that rearranges someone to be the leftmost or rightmost of his/her siblings. The only real complication is twins. If one of them moves the other has to move too. And we need to keep the monozygotics together within a band of triplets.
Algorithm: if the person to be moved is part of a twin set, first move all the twins to the left end (or right, as the case may be), then move all the monozygotes to the left, then move the subject himself to the left.
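The three moves can be sketched in a few lines of R. This is a simplified illustration, not the package's internal function: the name shift_sketch, its argument names, and the convention that the monozygotic group is passed in directly as monoset are all assumptions made for the example.

```r
# Sketch of a sibling shift; all argument names are illustrative assumptions.
# hint:    numeric left-to-right order hint for every subject (by row number)
# twinset: 0 for non-twins, otherwise a label shared by all members of a twin set
# monoset: row numbers of the subject's monozygotic group (including id)
shift_sketch <- function(id, sibs, goleft, hint, twinset, monoset = id) {
    direction <- if (goleft) -1 else 1
    if (twinset[id] > 0) {
        # Large enough that shifted subjects clear every other sibling
        amount <- 1 + diff(range(hint[sibs]))
        # Step 1: move the whole twin set to the chosen end
        twins <- sibs[twinset[sibs] == twinset[id]]
        hint[twins] <- hint[twins] + direction * amount
        # Step 2: move the monozygotic members further still
        mono <- intersect(monoset, twins)
        hint[mono] <- hint[mono] + direction * amount
    }
    # Step 3: move the subject to the extreme end
    hint[id] <- if (goleft) min(hint[sibs]) - 1 else max(hint[sibs]) + 1
    hint[sibs] <- rank(hint[sibs])  # renumber as 1, 2, ... within the sibship
    hint
}

# Three sibs, no twins: the middle sib becomes leftmost
no_twins <- shift_sketch(2, 1:3, TRUE, c(1, 2, 3), c(0, 0, 0))

# Four sibs; 2, 3, 4 are triplets and 3, 4 are monozygotic; move 4 left
triplets <- shift_sketch(4, 1:4, TRUE, 1:4, c(0, 1, 1, 1), monoset = c(3, 4))
```

In the second call the subject ends up leftmost, followed by the monozygotic twin, then the remaining triplet, then the non-twin sibling, so the twin set stays contiguous.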
Now, get an ordering of the Pedigree to use as the starting point. The numbers start at 1 on each level. We do not need the final prettify step, hence align = FALSE. If there is a hints structure entered, we retain its non-zero entries; otherwise people are put into the order of the data set.
We allow the hints input to be only an order vector. Twins are then further reordered.
The result coming back from align() is a set of vectors and matrices:
Now, walk down through the levels one by one. A candidate subject is one who appears twice on the level, once under his/her parents and once somewhere else as a spouse. Move this person and spouse to the ends of their sibships and add a marriage hint. The figure above shows a simple case. The input data set has the subjects ordered from 1–11; the left panel is the result without hints, which processes subjects in the order encountered. The return values from align have subject 9 shown twice. The first is when he is recognized as the spouse of subject 4, the second as the child of 6–7.
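Finding the candidates on one level amounts to looking for row numbers that occur twice. The sketch below assumes a level is stored as a plain vector of subject row numbers with 0 marking an unused slot; the function name dup_positions and the example row are hypothetical, chosen so that subject 9 sits at positions 2 and 6.

```r
# Return one row per duplicated subject, giving the two positions on the level.
# ids: subject row numbers from left to right, 0 = empty slot (assumed layout)
dup_positions <- function(ids) {
    dups <- unique(ids[ids > 0 & duplicated(ids)])
    pos <- lapply(dups, function(i) which(ids == i)[1:2])
    do.call(rbind, pos)  # NULL when no subject appears twice
}

# Hypothetical level: subject 9 appears as a spouse (position 2)
# and again under his own parents (position 6)
level_ids <- c(3, 9, 4, 5, 8, 9, 10)
dups <- dup_positions(level_ids)
```

Each row of the result is a pair of positions, which is the kind of input the reordering step needs before deciding which sibship to shift.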
The basic logic is
This logic works 9 times out of 10, at least for human pedigrees.
We will look at more complex cases below when looking at the duporder (order the duplicates) function, which returns a matrix with columns 1 and 2 being a pair of duplicates, and 3 a direction. Note that in the following code idlist refers to the row numbers of each subject in the Pedigree, not to their label ped(ped, "id").
For the case shown in the figure above the duporder function will return a single-row array with values (2, 6, 1), the first two being the positions of the duplicated subject. The anchor will be 2 since that is the copy connected to parents. The direction is TRUE, since the spouse is to the left of the anchor point. The id is 9, sibs are 8, 9, 10, and the shift function will create position hints of 2, 1, 3, which will cause them to be listed in the order 9, 8, 10.
The value of spouse is 3 (third position in the row), subjects 3, 4, and 5 are reordered, and finally the line (4, 9, 1) is added to the sptemp matrix. In this particular case the final element could be a 1 or a 2, since both are connected to their parents.
The figure above shows a more complex case with several arcs. In the upper left is a double marry-in. The anchor variable in the above code will be (2, 2) since both copies have an anchored spouse. The left and right sets of sibs are reordered (even though the left one does not need it), and two lines are added to the sptemp matrix: (5, 11, 1) and (11, 9, 2).
On the upper right is a pair of overlapping arcs. In the final tree we want to put sibling 28 to the right of 29 since that will allow one node to join, but if we process the subjects in lexical order the code will first shift 28 to the right and then later shift over 29. The duporder function tries to order the duplicates into a matrix so that the closest ones are processed last. The definition of close is based first on whether the families touch, and second on the actual distance. The third column of the matrix hints at whether the marriage should be plotted at the left (1) or right (2) position of the pair. The goal for this is to spread apart families of cousins; in the example to not have the children of 28/31 plotted under the 21/22 grandparents, and those for 29/32 under the 25/26 grandparents. The logic for this column is very ad hoc: put children near the edges.
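One way to read the ordering rule is as a two-key sort: pairs whose families do not touch come first, and within each group the widest pairs come first, so the closest pairs are handled last. The sketch below is only an interpretation of that description; order_dups and its arguments are hypothetical names, and the real duporder function also computes the third (left/right) column, which is not attempted here.

```r
# Return the processing order for duplicate pairs, closest pairs last.
# pos1, pos2: the two positions of each duplicated subject
# touch:      TRUE when the two families are adjacent (assumed to be precomputed)
order_dups <- function(pos1, pos2, touch) {
    distance <- abs(pos2 - pos1)
    # FALSE sorts before TRUE, and -distance puts the widest pairs first
    order(touch, -distance)
}

# Three hypothetical pairs: distances 4, 1 and 8; only the second one touches
processing_order <- order_dups(c(2, 3, 1), c(6, 4, 9), c(FALSE, TRUE, FALSE))
```

Here the widest non-touching pair is processed first and the touching pair last, so a later shift cannot undo the placement of the closest families.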
Finally, here are two helper routines. Finding my spouse can be interesting: suppose we have a listing with Shirley, Fred, Carl, me on the line with the first three marked as spouse = TRUE; this means that she has been married to all 3 of us. First we find the string from rpos to lpos that is a marriage block; 99% of the time this will be of length 2 of course. Then find the person in that block who is opposite sex, and check that they are connected. The routine is called with a left-right position in the alignment arrays and returns a position.
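The block search can be illustrated on a single level. This sketch assumes that spouse[i] is TRUE when the subject at position i is married to the subject at position i + 1, and that sex is given per position; the name find_spouse_sketch is hypothetical, and the final check that the two are actually connected is omitted.

```r
# Locate the spouse of the subject at position mypos on one level.
# spouse: logical, TRUE if position i is married to position i + 1 (assumed)
# sex:    sex of the subject at each position
find_spouse_sketch <- function(mypos, spouse, sex) {
    # Extend left while the neighbour on the left is flagged as a spouse
    lpos <- mypos
    while (lpos > 1 && spouse[lpos - 1]) lpos <- lpos - 1
    # Extend right while the current right end is flagged as a spouse
    rpos <- mypos
    while (rpos < length(spouse) && spouse[rpos]) rpos <- rpos + 1
    block <- lpos:rpos
    opposite <- sex[block] != sex[mypos]
    if (!any(opposite)) stop("no opposite-sex partner in the marriage block")
    min(block[opposite])
}

# Shirley, Fred, Carl, me: the first three are flagged spouse = TRUE
spouse_flag <- c(TRUE, TRUE, TRUE, FALSE)
sex <- c("female", "male", "male", "male")
my_spouse <- find_spouse_sketch(4, spouse_flag, sex)
```

Starting from position 4 the block extends back to position 1, and the only opposite-sex member of the block is the subject at position 1.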
The findsibs function starts with a position and returns a position as well, and is much simpler than findspouse.
At this point the most common situation will be what is shown in the figure. The variable anchor is (2, 1), showing that the left hand copy of subject 9 is connected to an anchored spouse and the right hand copy is himself anchored. The proper addition to the spouselist is (4, 9, dpairs), where the last is the hint from the dpairs routine as to which of the parents is the one to follow further when drawing the entire Pedigree. (When drawing a Pedigree and there is a child who can be reached from multiple founders, we only want to find the child once.)
The double marry-in found in the figure, subject 11, leads to a value of (2, 2) for the anchor variable. The proper addition to the sptemp matrix in this case will be two rows: (5, 11, 1), indicating that 5 should be plotted left of 11 for the 5-11 marriage with the first partner as the anchor, and a second row (11, 9, 2). This will cause the common spouse to be plotted in the middle.
Multiple marriages can lead to unanchored subjects. In the left hand portion of the figure above we have two double marriages, one on the left and one on the right, with anchor values of (0, 2) and (2, 0), respectively. We add two marriages to the return list to ensure that both print in the correct left-right order; the 14-4 one is correct by default, but it is easier to output a line than to check sex orders.
The left panel of the figure above shows a case where subject 11 marries into the Pedigree but also has a second spouse. The anchor variable for this case will be (2, 0); the first instance of 11 has a spouse tied into the tree above, the second instance has no upward connections. In the top row, subject 6 has values of (0, 0) since neither connection has an upward parent. In the right hand panel subject 2 has an anchor variable of (0, 1).
The top level routine for alignment has 5 arguments
The result coming back from align is a set of vectors and matrices:
Start with some setup. Throughout this routine the row number is used as a subject id (ignoring the actual id label).
As the routine proceeds a spousal pair can be encountered multiple times; we take them out of this list when the "connected" member is added to the Pedigree so that no marriage gets added twice.
When importing data from auto_hint, that routine’s spouse matrix has column 1 = subject plotted on the left, 2 = subject plotted on the right. The spouselist array has column 1 = husband, 2 = wife. Hence the clumsy looking ifelse below. The auto_hint format is more congenial to users, who might modify the output, while the spouselist format is easier for the code.
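A minimal sketch of that conversion follows. It assumes sex is coded as the strings "male" and "female" and that a third output column records which partner is plotted on the left (1 = husband, 2 = wife); the function name to_spouselist and the column names are illustrative, not the package's own.

```r
# Convert (left, right) spouse hints into (husband, wife, plot order).
# hint_spouse: matrix whose columns 1 and 2 are the left and right subjects
# sex:         character vector indexed by subject row number (assumed coding)
to_spouselist <- function(hint_spouse, sex) {
    left <- hint_spouse[, 1]
    right <- hint_spouse[, 2]
    left_is_male <- sex[left] == "male"
    cbind(
        husband = ifelse(left_is_male, left, right),
        wife = ifelse(left_is_male, right, left),
        plot_order = ifelse(left_is_male, 1, 2)  # 2 = wife drawn on the left
    )
}

sex <- c("male", "female", "female", "male")
hint_spouse <- rbind(c(1, 2), c(3, 4))  # second couple has the wife on the left
spouselist <- to_spouselist(hint_spouse, sex)
```

Keeping the left/right information in a separate column means that no plotting preference is lost when the pair is re-expressed as husband and wife.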
The align routine does the alignment using 3 co-routines:
Call alignped1 sequentially with each founder pair and merge the results. A founder pair is a married pair, neither of which has a father.
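The founder-pair definition translates directly into a filter on the spouse list. In this sketch the father of each subject is given as a row index with 0 meaning "no father in the Pedigree"; the names founder_pairs and dad are assumptions made for the example.

```r
# Keep only the married pairs in which neither partner has a father.
# spouselist: matrix with husband in column 1 and wife in column 2
# dad:        father's row number for each subject, 0 if none (assumed coding)
founder_pairs <- function(spouselist, dad) {
    keep <- dad[spouselist[, 1]] == 0 & dad[spouselist[, 2]] == 0
    spouselist[keep, , drop = FALSE]
}

dad <- c(0, 0, 1, 0, 3)               # subject 3 is a child of subject 1
pairs <- rbind(c(1, 2), c(3, 4))      # two marriages
founders <- founder_pairs(pairs, dad) # only the 1-2 marriage qualifies
```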
Now finish up. There are 4 tasks to do:
The twins array is of the same shape as the spouse and nid arrays: one row per level giving data for the subjects plotted on that row. In this case they are
At this point the Pedigree has been arranged, with the positions in each row going from 1 to (number of subjects in the row); this holds for a packed Pedigree, which is the usual case. Having everything pushed to the left margin is not very pretty, so now we fix that. Note that alignped4 wants a T/F spouse matrix: it does not care about the degree of relationship to the spouse.
This is the first of the three co-routines. It is called with a single subject, and returns the subtree founded on said subject, as though it were the only tree. We only go down the Pedigree, not up. Input arguments are
The return argument is a set of matrices as described in section align, along with the spouselist matrix. The latter has marriages removed as they are processed.
In this routine the nid array consists of the final nid array + 1/2 of the final spouse array. The basic algorithm is simple.
Note that the spouselist matrix will only contain spouse pairs that are not yet processed. The logic for anchoring is slightly tricky. First, if column 4 of the spouselist matrix is 0, we anchor at the first opportunity, i.e. now. Also note that if spouselist[,3] == spouselist[,4] it is the husband who is the anchor (just write out the possibilities).
Create the set of 3 return structures, which will be matrices with (1 + nspouse) columns. If there are children then other routines will widen the result.
Now we have a list of spouses that should be dealt with and the corresponding columns of the spouselist matrix. Create the two complementary lists lspouse and rspouse to denote those plotted on the left and on the right. For someone with lots of spouses we try to split them evenly. If the number of spouses is odd, then men should have more on the right than on the left, women more on the left. Any hints in the spouselist matrix override. We put the undecided marriages closest to x, then add predetermined ones to the left and right. The majority of marriages will be undetermined singletons, for which nleft will be 1 for female (put my husband to the left) and 0 for male. In one bug found by plotting canine data, lspouse could initially be empty but length(rspouse) > 1. This caused nleft > length(indx). The fix was to not let indx be indexed beyond its length (fix by JPS 5/2013).
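The splitting rule, including the guard against indexing past the undecided list, might look like the following. This is a sketch under assumptions: split_spouses and its arguments are hypothetical names, and the spouses are passed as plain vectors rather than being read from the spouselist matrix.

```r
# Split spouses into those drawn left and right of the subject.
# undecided:        spouses with no left/right hint
# lspouse, rspouse: spouses already forced to the left or right by hints
# female:           TRUE if the subject is female
split_spouses <- function(undecided, lspouse, rspouse, female) {
    n_total <- length(undecided) + length(lspouse) + length(rspouse)
    nleft <- floor((n_total + female) / 2)
    # Never take more than the undecided list holds, and never a negative count
    nleft <- max(0, min(nleft - length(lspouse), length(undecided)))
    take <- seq_len(nleft)
    left_new <- undecided[take]
    right_new <- if (nleft > 0) undecided[-take] else undecided
    # Undecided marriages sit closest to the subject, hinted ones further out
    list(left = c(lspouse, left_new), right = c(right_new, rspouse))
}

wife <- split_spouses(5, integer(0), integer(0), female = TRUE)
husband <- split_spouses(5, integer(0), integer(0), female = FALSE)
# No undecided spouses but two forced to the right: nothing is taken from
# the empty undecided list
forced <- split_spouses(integer(0), integer(0), c(7, 8), female = TRUE)
```

The explicit test on nleft before using negative indexing matters in R, because indexing with an empty negative vector returns an empty result rather than the whole vector.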
The spouses are in the Pedigree, now look below. For each spouse get the list of children. If there are any we call alignped2 to generate their tree and then mark the connection to their parent. If multiple marriages have children we need to join the trees.
To finish up we need to splice together the tree made up from all the kids, which only has data from lev+1 down, with the data here. There are 3 cases. The first and easiest is when no children were found. The second, and most common, is when the tree below is wider than the tree here, in which case we add the data from this level onto theirs. The third is when below is narrower, for instance an only child.
This routine takes a collection of siblings, grows the tree for each, and appends them side by side into a single tree. The input arguments are the same as those to alignped1 with the exception that x will be a vector. This routine does nothing to the spouselist matrix, but needs to pass it down the tree and back since one of the routines called by alignped2 might change the matrix.
The code below has one non-obvious special case. Suppose that two sibs marry. When the first sib is processed by alignped1 then both partners (and any children) will be added to the rval structure below. When the second sib is processed they will come back as a 1 element tree (the marriage will no longer be on the spouselist), which should not be added onto rval. The rule thus is to not add any 1 element tree whose value (which must be x[i]) is already in the rval structure for this level. (Where did Curt Olswold find these families?)
The third alignment co-routine merges two pedigree trees which are side by side into a single object. The primary special case is when the rightmost person in the left tree is the same as the leftmost person in the right tree; we need not plot two copies of the same person side by side. (When initializing the output structures do not worry about this - there is no harm if they are a column bigger than finally needed.) Beyond that the work is simple bookkeeping.
For the unpacked case, which is the traditional way to draw a Pedigree when we can assume the paper is infinitely wide, all parents are centered over their children. In this case we think of the two trees to be merged as solid blocks. On input they both have a left margin of 0. Compute how far over we have to slide the right tree.
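Treating the trees as solid blocks, the slide is the largest per-level requirement. The sketch below assumes each tree is summarised by one number per level (the right edge of the left tree, the left edge of the right tree), with NA where a tree has nobody on that level, and that a level where the two edge subjects are the same person needs no extra spacing. The name block_slide and this summary representation are illustrative assumptions.

```r
# Amount to slide the right tree so that no level collides.
# right_edge1: rightmost position of the left tree on each level (NA if empty)
# left_edge2:  leftmost position of the right tree on each level (NA if empty)
# overlap:     TRUE where the two edge subjects are the same person
# space:       minimum gap between distinct neighbours
block_slide <- function(right_edge1, left_edge2, overlap, space = 1) {
    need <- right_edge1 - left_edge2 + ifelse(overlap, 0, space)
    max(0, need, na.rm = TRUE)
}

# Level 1 needs 1 + 1 = 2, level 2 shares a subject and needs 3,
# level 3 has nobody in the left tree and imposes no requirement
slide <- block_slide(c(1, 3, NA), c(0, 0, 0), c(FALSE, TRUE, FALSE))
```

Because a single slide is applied to the whole right tree, the widest level of the left tree determines the spacing for every level.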
Now merge the two trees. Start at the top level and work down.
If n2 = 0, there is nothing to do.
fam = 0, so max(fam, fam2) preserves the correct one.
If packed = TRUE, determine the amount of slide for this row. It will be space over from the last element in the left Pedigree, less overlap.
The alignped4 routine is the final step of alignment. It attempts to line up children under parents and put spouses and siblings “close” to each other, to the extent possible within the constraints of page width. This routine used to be the most intricate and complex of the set, until I realized that the task could be cast as constrained quadratic optimization. The current code does the necessary setup and then calls the solve.QP function from the quadprog package. At one point I investigated using one of the simpler least-squares routines where β is constrained to be non-negative. However, a problem can only be translated into that form if the number of constraints is less than the number of parameters, which is not true in this problem.
There are two important parameters for the function. One is the user specified maximum width. The smallest possible width is the maximum number of subjects on a line; if the user suggestion is too low it is increased to 1 + that amount (to give just a little wiggle room). The other is a vector of 2 alignment parameters a and b. For each set of siblings x with parents at p1 and p2 the alignment penalty is
$$\frac{1}{k^a} \sum_{i=1}^{k} \left(x_i - \frac{p_1 + p_2}{2}\right)^2$$
where $k$ is the number of siblings in the set. Using the fact that $\sum (x_i - c)^2 = \sum (x_i - \mu)^2 + k(c - \mu)^2$, where $\mu$ is the mean of the $x_i$, when $a = 1$ moving a sibship with $k$ sibs one unit to the left or right of optimal will incur the same cost as moving one with only one or two sibs out of place. If $a = 0$ then large sibships are harder to move than small ones; with the default value $a = 1.5$ they are slightly easier to move than small ones. The rationale for the default is that as long as the parents are somewhere between the first and last siblings the result looks fairly good, so we are more flexible with the spacing of a large family. By tethering all the sibs to a single spot they tend to be kept close to each other. The alignment penalty for spouses is $b(x_1 - x_2)^2$, which tends to keep them together. The size of $b$ controls the relative importance of sib-parent and spouse-spouse closeness.
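The effect of $a$ can be made explicit by writing $c = (p_1 + p_2)/2$ for the parents' midpoint and $\mu$ for the mean sibling position, and applying the sum-of-squares identity to the sibling penalty. The derivation below is a worked restatement of the argument in the text, with $\delta$ introduced here as the distance by which the sibship is displaced from its optimal position.

$$
\begin{aligned}
\frac{1}{k^a} \sum_{i=1}^{k} (x_i - c)^2
  &= \frac{1}{k^a} \sum_{i=1}^{k} (x_i - \mu)^2 + k^{1-a} (c - \mu)^2 \\
\text{cost of displacing the whole sibship by } \delta = \mu - c
  &:\quad k^{1-a}\, \delta^2 \\
a = 0 &:\quad k\, \delta^2 \quad \text{(large sibships are harder to move)} \\
a = 1 &:\quad \delta^2 \quad \text{(cost does not depend on } k\text{)} \\
a = 1.5 &:\quad \delta^2 / \sqrt{k} \quad \text{(large sibships are slightly easier to move)}
\end{aligned}
$$

The first term on the right of the top line depends only on the spacing within the sibship, so only the second term changes when the sibship is moved as a unit.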
We start by adding in these penalties. The total number of parameters in the alignment problem (what we hand to quadprog) is the set of sum(n) positions. A work array myid keeps track of the parameter number for each position so that it is easy to find. There is one extra penalty added at the end. Because the penalty amount would be the same if all the final positions were shifted by a constant, the penalty matrix will not be positive definite; solve.QP does not like this. We add a tiny amount of leftward pull to the widest line.
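The way a spouse penalty enters the quadratic form, and why the resulting matrix is singular, can be seen on a three-parameter toy problem. The helper add_spouse_penalty is a hypothetical name, and the convention that the penalty is evaluated as x' P x (without the factor of 1/2 that solve.QP applies to its Dmat argument) is an assumption made to keep the arithmetic readable.

```r
# Add the penalty b * (x_i - x_j)^2 to a penalty matrix P, so that
# t(x) %*% P %*% x includes that term.
add_spouse_penalty <- function(pmat, i, j, b) {
    pmat[i, i] <- pmat[i, i] + b
    pmat[j, j] <- pmat[j, j] + b
    pmat[i, j] <- pmat[i, j] - b
    pmat[j, i] <- pmat[j, i] - b
    pmat
}

pmat <- add_spouse_penalty(matrix(0, 3, 3), 1, 2, b = 2)
x <- c(1, 4, 7)
penalty <- drop(t(x) %*% pmat %*% x)                    # 2 * (1 - 4)^2
penalty_shifted <- drop(t(x + 5) %*% pmat %*% (x + 5))  # unchanged by a shift
```

Because shifting every position by the same constant leaves the penalty unchanged, the matrix has a zero eigenvalue, which is the reason a small extra term is needed before the problem is handed to the solver.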
Next come the constraints. If there are k subjects on a line there will be k + 1 constraints for that line. The first point must be ≥ 0, each subsequent one must be at least 1 unit to the right, and the final point must be ≤ the max width.
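For a single line these k + 1 constraints can be written in the form t(Amat) %*% x >= bvec, which is the layout solve.QP expects for inequality constraints. The sketch below builds the matrices with base R only and checks a candidate layout by hand; line_constraints is a hypothetical helper, and the real routine stacks such blocks for every line of the Pedigree.

```r
# Constraints for k subjects on one line, as columns of Amat:
#   x_1 >= 0, x_{i+1} - x_i >= 1 for i = 1..k-1, and -x_k >= -width
line_constraints <- function(k, width) {
    amat <- matrix(0, nrow = k, ncol = k + 1)
    bvec <- numeric(k + 1)
    amat[1, 1] <- 1                  # first point at or right of 0
    for (i in seq_len(k - 1)) {      # neighbours at least 1 unit apart
        amat[i, i + 1] <- -1
        amat[i + 1, i + 1] <- 1
        bvec[i + 1] <- 1
    }
    amat[k, k + 1] <- -1             # last point at or left of width
    bvec[k + 1] <- -width
    list(Amat = amat, bvec = bvec)
}

# Does a layout satisfy every constraint?
is_feasible <- function(x, con) {
    all(drop(t(con$Amat) %*% x) >= con$bvec)
}

con <- line_constraints(k = 3, width = 4)
ok <- is_feasible(c(0, 1, 2), con)          # evenly spaced, inside the page
too_close <- is_feasible(c(0, 0.5, 2), con) # second subject crowds the first
```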
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
## [4] LC_COLLATE=C LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Pedixplorer_1.3.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.48 bslib_0.8.0 ggplot2_3.5.1
## [5] htmlwidgets_1.6.4 lattice_0.22-6 quadprog_1.5-8 vctrs_0.6.5
## [9] tools_4.4.1 generics_0.1.3 stats4_4.4.1 tibble_3.2.1
## [13] fansi_1.0.6 highr_0.11 pkgconfig_2.0.3 Matrix_1.7-1
## [17] data.table_1.16.2 S4Vectors_0.43.2 readxl_1.4.3 lifecycle_1.0.4
## [21] compiler_4.4.1 stringr_1.5.1 shinytoastr_2.2.0 munsell_0.5.1
## [25] httpuv_1.6.15 shinyWidgets_0.8.7 htmltools_0.5.8.1 sys_3.4.3
## [29] buildtools_1.0.0 sass_0.4.9 yaml_2.3.10 lazyeval_0.2.2
## [33] plotly_4.10.4 later_1.3.2 pillar_1.9.0 jquerylib_0.1.4
## [37] tidyr_1.3.1 DT_0.33 cachem_1.1.0 mime_0.12
## [41] tidyselect_1.2.1 digest_0.6.37 stringi_1.8.4 colourpicker_1.3.0
## [45] dplyr_1.1.4 purrr_1.0.2 maketools_1.3.1 fastmap_1.2.0
## [49] grid_4.4.1 colorspace_2.1-1 cli_3.6.3 magrittr_2.0.3
## [53] utf8_1.2.4 withr_3.0.2 scales_1.3.0 promises_1.3.0
## [57] rmarkdown_2.28 httr_1.4.7 gridExtra_2.3 cellranger_1.1.0
## [61] shiny_1.9.1 evaluate_1.0.1 knitr_1.48 shinycssloaders_1.1.0
## [65] miniUI_0.1.1.1 viridisLite_0.4.2 rlang_1.1.4 Rcpp_1.0.13
## [69] xtable_1.8-4 glue_1.8.0 BiocManager_1.30.25 BiocGenerics_0.53.0
## [73] jsonlite_1.8.9 R6_2.5.1 plyr_1.8.9