Transforming FHIR documents to tables with BiocFHIR

Introduction

This vignette describes how FHIR documents are transformed to tables in BiocFHIR.

The R commands in this text work in an R installation (version 4.2 or later) in which BiocFHIR (version 0.0.14 or later) has been installed. The source code is always available on GitHub and may be available for installation by other means.

Examining sample data, again

In the “Upper level FHIR concepts” vignette, we used the following code to get a peek at the information structure in a single document representing a Bundle associated with a patient.

tfile = dir(system.file("json", package="BiocFHIR"), full.names=TRUE)
peek = jsonlite::fromJSON(tfile)
names(peek)
## [1] "resourceType" "type"         "entry"
peek$resourceType
## [1] "Bundle"
names(peek$entry)
## [1] "fullUrl"  "resource" "request"
length(names(peek$entry$resource))
## [1] 72
class(peek$entry$resource)
## [1] "data.frame"
dim(peek$entry$resource)
## [1] 301  72
head(names(peek$entry$resource))
## [1] "resourceType" "id"           "text"         "extension"    "identifier"  
## [6] "name"

We perform a first stage of transformation with process_fhir_bundle:

bu = process_fhir_bundle(tfile)
bu
## BiocFHIR FHIR.bundle instance.
##   resource types are:
##    AllergyIntolerance CarePlan ... Patient Procedure

Bundle to data frames

Each processed bundle is a collection of data.frame instances, formed by splitting the input “entry” element by “resourceType”. These data.frames consist mostly of NA (missing) values, but some columns have been ingested as lists. The package makes editorial decisions about which columns are likely to hold useful information. Thus we have

po1 <- process_Observation(bu$Observation)
dim(po1)
## [1] 127  11
datatable(po1)
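
The splitting step described above can be illustrated on a toy bundle. This is a minimal sketch, not BiocFHIR's implementation: the JSON content and the helper drop_empty are hypothetical, and only jsonlite::fromJSON and base R are used. It shows why the per-type tables are mostly NA: jsonlite builds one wide table whose columns are the union of the fields seen across all resource types.

```r
library(jsonlite)

# Toy Bundle with two resource types (hypothetical data, not from BiocFHIR)
toy_json <- '{
  "resourceType": "Bundle",
  "type": "collection",
  "entry": [
    {"resource": {"resourceType": "Patient", "id": "p1", "gender": "female"}},
    {"resource": {"resourceType": "Observation", "id": "o1", "status": "final"}},
    {"resource": {"resourceType": "Observation", "id": "o2", "status": "final"}}
  ]
}'

toy <- fromJSON(toy_json)
res <- toy$entry$resource          # one wide data.frame, NA where a field is absent

# Split rows by resource type
by_type <- split(res, res$resourceType)

# Hypothetical helper: drop columns that are entirely NA within a type
drop_empty <- function(df) df[, colSums(!is.na(df)) > 0, drop = FALSE]
by_type <- lapply(by_type, drop_empty)

lapply(by_type, names)
```

In this sketch the Observation table loses the gender column and the Patient table loses the status column, because neither field occurs for that resource type.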

Filtering FHIR elements

A list of vectors of field names serves as the basis for filtering JSON elements into records for tabulation.

FHIR_retention_schemas()
## $Condition
## [1] "id"            "onsetDateTime" "code"          "subject"      
## 
## $AllergyIntolerance
## [1] "id"            "onsetDateTime" "code"          "patient"      
## [5] "category"     
## 
## $CarePlan
## [1] "id"       "activity" "subject"  "category"
## 
## $Claim
## [1] "id"             "provider"       "patient"        "billablePeriod"
## [5] "insurance"      "created"       
## 
## $Encounter
## [1] "id"              "type"            "subject"         "period"         
## [5] "serviceProvider" "class"          
## 
## $MedicationRequest
## [1] "id"                        "subject"                  
## [3] "status"                    "requester"                
## [5] "medicationCodeableConcept"
## 
## $Observation
## [1] "id"                "subject"           "code"             
## [4] "valueQuantity"     "category"          "effectiveDateTime"
## [7] "issued"            "component"        
## 
## $Procedure
## [1] "id"              "subject"         "status"          "performedPeriod"
## [5] "code"           
## 
## $Patient
##  [1] "id"                   "identifier"           "name"                
##  [4] "telecom"              "gender"               "birthDate"           
##  [7] "address"              "maritalStatus"        "multipleBirthBoolean"
## [10] "communication"        "active"              
## 
## $Immunization
## [1] "id"                 "patient"            "vaccineCode"       
## [4] "occurrenceDateTime"
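
As a rough illustration of how a vector of field names can drive filtering, the sketch below applies a schema modeled on the $Condition entry above to two toy records. The records, the simplified string-valued code field, and the helper retain are assumptions made for illustration; the package's own filtering code may work differently.

```r
# Hypothetical schema, copied from the $Condition entry shown above
schema <- list(Condition = c("id", "onsetDateTime", "code", "subject"))

# Toy records: the first carries an extra field, the second lacks onsetDateTime
records <- list(
  list(id = "c1", onsetDateTime = "2020-01-01", code = "J20",
       subject = "p1", note = "not retained"),
  list(id = "c2", code = "J01", subject = "p1")
)

# Hypothetical helper: keep only schema fields, filling absent ones with NA
retain <- function(rec, fields) {
  out <- rec[fields]                      # absent fields come back as NULL
  names(out) <- fields
  out[vapply(out, is.null, logical(1))] <- NA_character_
  as.data.frame(out, stringsAsFactors = FALSE)
}

tab <- do.call(rbind, lapply(records, retain, fields = schema$Condition))
tab
```

Because every record is mapped to the same set of columns, the per-record tables can be stacked with rbind regardless of which fields were present in the input.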

Each Blood Pressure observation includes a “component” element with two entries (the systolic and diastolic readings), so special code is required to associate the observation's metadata with the value of each component.
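
One way to picture this mapping is to repeat the observation-level metadata once per component. The sketch below does this for a toy record; the field layout is flattened relative to real FHIR JSON, the values are invented, and expand_components is a hypothetical helper rather than a BiocFHIR function.

```r
# Toy blood pressure observation (hypothetical, simplified structure)
bp <- list(
  id = "obs1",
  subject = "urn:uuid:p1",
  effectiveDateTime = "2020-01-01T10:00:00Z",
  component = list(
    list(display = "Diastolic Blood Pressure", value = 80,  unit = "mm[Hg]"),
    list(display = "Systolic Blood Pressure",  value = 120, unit = "mm[Hg]")
  )
)

# Hypothetical helper: one output row per component, metadata repeated
expand_components <- function(obs) {
  comps <- do.call(rbind, lapply(obs$component, as.data.frame))
  data.frame(
    id = obs$id,                                 # length-1 values are recycled
    subject = obs$subject,
    effectiveDateTime = obs$effectiveDateTime,
    comps
  )
}

bp_tab <- expand_components(bp)
bp_tab
```

The result has two rows that share id, subject, and effectiveDateTime and differ only in the component columns.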

The resources extracted from a bundle

Each process_* function in BiocFHIR handles one resource type. As of version 0.0.15 we have

ls("package:BiocFHIR") |> grep(x=_, "process_[A-Z]", value=TRUE)
##  [1] "process_AllergyIntolerance" "process_CarePlan"          
##  [3] "process_Claim"              "process_Condition"         
##  [5] "process_Encounter"          "process_Immunization"      
##  [7] "process_MedicationRequest"  "process_Observation"       
##  [9] "process_Patient"            "process_Procedure"

There is no guarantee that any given bundle will have resources of all these types.

Accumulating resources across bundles

Because a bundle may lack any given resource type, assembling all information on conditions recorded in the Synthea sample requires defensive programming. We identify the bundles that contain a “Condition” resource, process each of them, and combine the resulting tables, which are designed to have a common set of columns.

data("allin", package="BiocFHIR")
hascond = vapply(allin, function(x) length(x$Condition) > 0, logical(1))
oo = do.call(rbind, lapply(allin[hascond], function(x)process_Condition(x$Condition)))
dim(oo)
## [1] 406   5
length(unique(oo$subject.reference))
## [1] 49
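
The same defensive pattern can be tried without the package data. In this sketch the bundles are toy stand-ins (plain lists of data.frames with invented ids), and the processing step is reduced to returning the Condition table unchanged.

```r
# Toy stand-ins for processed bundles; b2 has no Condition element
bundles <- list(
  b1 = list(Condition = data.frame(id = c("c1", "c2"), subject = "p1")),
  b2 = list(Patient   = data.frame(id = "p2")),
  b3 = list(Condition = data.frame(id = "c3", subject = "p3"))
)

# Flag bundles that actually contain a Condition element
has_cond <- vapply(bundles, function(b) length(b$Condition) > 0, logical(1))

# Stack the Condition tables from the flagged bundles only
all_cond <- do.call(rbind, lapply(bundles[has_cond], function(b) b$Condition))

dim(all_cond)
length(unique(all_cond$subject))
```

Filtering before the lapply call keeps the per-bundle function from ever receiving a missing (NULL) element.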

The most commonly reported conditions in the sample are:

table(oo$code.coding.display) |> sort() |> tail()
## 
##                             Prediabetes Body mass index 30+ - obesity (finding) 
##                                      16                                      17 
##                        Normal pregnancy             Acute bronchitis (disorder) 
##                                      22                                      27 
##      Acute viral pharyngitis (disorder)              Viral sinusitis (disorder) 
##                                      30                                      63

Session information

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rjsoncons_1.3.1  jsonlite_1.8.9   DT_0.33          BiocFHIR_1.9.0  
## [5] BiocStyle_2.35.0
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.9          utf8_1.2.4          generics_0.1.3     
##  [4] tidyr_1.3.1         digest_0.6.37       magrittr_2.0.3     
##  [7] evaluate_1.0.1      fastmap_1.2.0       graph_1.85.0       
## [10] promises_1.3.2      BiocManager_1.30.25 purrr_1.0.2        
## [13] fansi_1.0.6         crosstalk_1.2.1     jquerylib_0.1.4    
## [16] cli_3.6.3           shiny_1.9.1         rlang_1.1.4        
## [19] visNetwork_2.1.2    cachem_1.1.0        yaml_2.3.10        
## [22] BiocBaseUtils_1.9.0 tools_4.4.2         dplyr_1.1.4        
## [25] httpuv_1.6.15       BiocGenerics_0.53.3 buildtools_1.0.0   
## [28] vctrs_0.6.5         R6_2.5.1            mime_0.12          
## [31] stats4_4.4.2        lifecycle_1.0.4     htmlwidgets_1.6.4  
## [34] pkgconfig_2.0.3     pillar_1.9.0        bslib_0.8.0        
## [37] later_1.4.1         glue_1.8.0          Rcpp_1.0.13-1      
## [40] xfun_0.49           tibble_3.2.1        tidyselect_1.2.1   
## [43] sys_3.4.3           knitr_1.49          xtable_1.8-4       
## [46] htmltools_0.5.8.1   rmarkdown_2.29      maketools_1.3.1    
## [49] compiler_4.4.2