```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, comment = "", fig.height = 5.5, fig.width = 5.5, cache = TRUE) library(tidyr) library(dplyr) library(tibble) library(ggplot2) library(RColorBrewer) ``` ## Overall Pipeline

## Motivation - Multi-site imaging studies are becoming increasingly common. - Combining imaging data across sites introduces non-biological sources of variation that arise from the use of different scanner hardware and acquisition protocols. - E.g., field strength, manufacturer, subject positioning - **Scanner effects** or **site effects** are similar to **batch effects** in the genomics literature - Known to affect measurement of regional volumes, cortical thickness, voxel-based morphometry, ... - More generally, structural, functional, diffusion tensor, and other types of images and features extracted from them may exhibit scanner effects ## Motivation - Need to eliminate or account for scanner effects in downstream statistical analyses - Most critical if sites or scanners are imbalanced with respect to other variables such as age, sex, race, clinical status - Simply including scanner as a confounding variable may not work well (Rao et al. 2017) - Several methods for estimating and removing unwanted sources of variation due to site/scanner have been adapted to neuroimaging data. - In this tutorial we will use ComBat [@johnson2007adjusting; @fortin2018harmonization] to harmonize cortical thicknesses from the ADNI data. - ComBat has been shown to effectively reduce scanner-to-scanner variability while preserving biological associations. ## ADNI Cortical Thickness Data - The Alzheimer's Disease Neuroimaging Initiative (ADNI) is a multi-million dollar study funded by public and private sources. - National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, the Food and Drug Administration, private pharmaceutical companies, and non-profit organizations. - The goals of the ADNI are to better understand progression from normal aging to mild cognitive impairment (MCI) and early Alzheimer's disease (AD) and determine effective biomarkers for disease diagnosis, monitoring, and treatment development. - We estimated cortical thicknesses from a subset of initial subject visits (N=187) - Mix of male/female aged 56-91 - Mix of healthy controls (46), MCI (93), and AD (48) diagnoses at baseline - Our subset consists of images aquired from scanners at 17 different sites - Mix of field strength, manufacturer, and model ## ADNI Data Access http://adni.loni.usc.edu/data-samples/access-data/

## Cortical Thickness

Image taken from [http://www.martinos.org/neurorecovery/technology.htm](http://www.martinos.org/neurorecovery/technology.htm)

- There are several different ways to measure cortical thickness - General idea is to measure perpendicular from the white/gray matter boundary to the pial surface - Cortical thickness is an important imaging-based biomarker for numerous psychological and neurodegenerative diseases: Alzheimer's and other dementias, Schizophrenia, MS, addiction, ...

## ANTS Cortical Thickness Pipeline (antsCorticalThickness.sh)

- DiReCT: Diffeomorphic Registration based CT measurement [@das2009registration] - With results from Atropos, can use ``cort_thickness`` in ``extrantsr`` package ## Cortical Thickness Regions - Multi-atlas label fusion applied using 20 OASIS template T1s manually labeled with the Desikan-Killiany-Tourville (DKT) cortical labeling protocol (www.mindboggle.info).

```{r, echo=FALSE, warning=FALSE, message=FALSE} library(dplyr) labels = data.frame('label'=names(read.csv('imageData.csv'))[-1]) rois = read.csv('labels.csv', header=FALSE, stringsAsFactors=FALSE) colnames(rois) = c('label', 'roi') regions = left_join(labels, rois, by='label') regions[1:15,] ```

## ADNI Cortical Thickness Data

## ADNI Cortical Thickness Data - Scanner is often confounded with biological covariates of interest.

## ComBat Model - ComBat [@johnson2007adjusting; @fortin2018harmonization] assumes the imaging feature measurements can be modeled as a linear combination of the biological variables and the scanner effects with an error term that includes a multiplicative scanner-specific scaling factor: $$y_{i,j,v} = \alpha_v + X_{i,j}^{T}\beta_v + \gamma_{j,v} + \delta_{j,v}\epsilon_{i,j,v}$$ - $y_{i,j,v}$ is average cortical thickness in ROI $v$ from subject $i$, scanner $j$ - $X_{i,j}^{T}$ is a vector of fixed covariates for subject $i$ scanned on scanner $j$ - $\alpha_v$, $\beta_v$ are the intercept and slope vector of covariates for measurement $v$ - $\gamma_{j, v}$ is the location shift of scanner $j$ on measurement $v$ - $\delta_{j,v}$ is the multiplicative effect of scanner $j$ on measurement $v$ - $E(\epsilon_{i,j,v}) = 0$ ## ComBat Harmonization - ComBat uses empirical Bayes estimation to improve quality of scanner-specific parameter estimates when sample sizes are small - Let $\gamma_{j,v}^{*}$ and $\delta_{j,v}^{*}$ denote the EB estimates of $\gamma_{j,v}$ and $\delta_{j,v}$, respectively. - Then, the ComBat-harmonized cortical thicknesses are defined as $$y_{i,j,v}^{ComBat} = \frac{y_{i,j,v} - \hat{\alpha}_v - X_{i,j}^{T}\hat{\beta}_v - \gamma_{j,v}^{*}}{\delta_{j,v}^{*}} + \hat{\alpha}_v + X_{i,j}^{T}\hat{\beta}_v$$ ## Applying ComBat - Need utils.R and combat.R from https://github.com/Jfortin1/ComBatHarmonization ```{r} source('utils.R') source('combat.R') ``` - The model matrix should contain covariates of biological interest. ```{r} modelData = read.csv('modelData.csv') head(modelData) ``` - We will include age, sex, and diagnosis. ```{r} mod = model.matrix(~age+factor(sex)+factor(dx), data=modelData) ``` ## Applying ComBat - Let's read in the cortical thickness data. ```{r} ctData = read.csv('imageData.csv') head(ctData)[,1:10] ``` ## Applying ComBat - The imaging data needs to be in a separate matrix where rows are features and columns are subjects. - Let's remove the subject column and transpose the ctData object. ```{r} img = t(ctData[,-1]) head(img)[,1:10] ``` ## Applying ComBat - Given a model matrix, scanner information, and image data matrix, ComBat just requires a single line of code: ```{r} harmonized = combat(dat=img, batch=modelData$scanner, mod=mod) ``` - 4 covariates: age, sex (2-level factor), and diagnosis (3-level factor). ## ComBat Harmonized Data - The `combat()` function returns a list. - Harmonized data are returned as a matrix with the same dimensions as the image data input (rows are features, columns are subjects). ```{r} head(harmonized$dat.combat)[,1:10] ``` ```{r, echo = FALSE} #pre_df = inner_join(ctData, modelData, by = "subject") #pre_df = pre_df %>% # gather(key = "roi", value = "thickness", starts_with("X")) #pre_df = pre_df %>% # mutate(roi = sub("^X", "", roi)) %>% # arrange(subject, roi, thickness) %>% # as_data_frame() #pre_df$combat = "before" #df = as_data_frame(t(harmonized$dat.combat)) #df = bind_cols(df, modelData) #df = df %>% # gather(key = "roi", value = "thickness", starts_with("X")) #df = df %>% # mutate(roi = sub("^X", "", roi)) %>% # arrange(subject, roi, thickness) #df$combat = "after" #all_df = bind_rows(df, pre_df) %>% # arrange(subject, roi, combat, thickness) #no_xlabs = theme(axis.text.x=element_blank()) + # theme(text = element_text(size = 20)) #labber = labs(x = "Subjects", y = "Cortical Thickness") #pre_medians = pre_df %>% # group_by(subject) %>% # summarize(med = median(thickness)) %>% # arrange(med) #post_medians = df %>% # group_by(subject) %>% # summarize(med = median(thickness)) %>% # arrange(med) #df = df %>% # mutate(sorted_subj = as.character(subject), # sorted_subj = factor(sorted_subj, # levels = pre_medians$subject), # post_sorted_subj = as.character(subject), # post_sorted_subj = factor(post_sorted_subj, # levels = post_medians$subject) # ) #marker1 = list(color = brewer.pal(9, "BuGn")) #marker2 = list(color = brewer.pal(9, "RdPu")) #marker = c(marker1$color[1:5], marker2$color[3:7], marker1$color[5:9], marker2$color[8:9]) #names(marker) = sort(unique(df$scanner)) #pre_df = pre_df %>% # mutate(sorted_subj = as.character(subject), # sorted_subj = factor(sorted_subj, # levels = pre_medians$subject)) #range_pre = c(0, 5) #g = df %>% # ggplot(aes(x = sorted_subj, y = thickness, fill = scanner)) + # geom_boxplot() + # no_xlabs + labber + # ylim(range_pre) + # scale_fill_manual(values = marker, # guide = guide_legend(title = NULL)) #pre_g = g %+% pre_df #png("avg_med_pre.png", # height = 7, width = 14, res = 600, units = "in") # pre_g #dev.off() #png("avg_med_post_sortedByPre.png", # height = 7, width = 14, res = 600, units = "in") # g #dev.off() #gpost = df %>% # ggplot(aes(x = post_sorted_subj, y = thickness, fill = scanner)) + # geom_boxplot() + # no_xlabs + labber + # ylim(range_pre) + # scale_fill_manual(values = marker, # guide = guide_legend(title = NULL)) #png("avg_med_post.png", height = 7, width = 14, res = 600, units = "in") #gpost #dev.off() ``` ## Compare prior distributions to EB estimated parameter distributions - **Location:** `harmonized$gamma.hat` versus `harmonized$gamma.star` - **Scale:** `harmonized$delta.hat` versus `harmonized$delta.star` $$y_{i,j,v} = \alpha_v + X_{i,j}^{T}\beta_v + \gamma_{j,v} + \delta_{j,v}\epsilon_{i,j,v}$$

## ADNI Cortical Thickness Data - Before ComBat

## ADNI Cortical Thickness Data - After ComBat

## ADNI Cortical Thickness Data - Before ComBat

## ADNI Cortical Thickness Data - After ComBat

## ADNI Cortical Thickness Data - Before ComBat

## ADNI Cortical Thickness Data - After ComBat

## Example: Test for Site Effects Before ComBat ```{r} modelData$sex = factor(modelData$sex) modelData$dx = factor(modelData$dx) modelData$scanner = factor(modelData$scanner) ``` ```{r} preComBat = left_join(modelData, ctData, by='subject') preRInsula = lm(X2035 ~ age + sex + dx + scanner, data=preComBat) summary(aov(preRInsula)) ``` ```{r} coef(preRInsula)[2:4] ``` ## Example: Test for Site Effects After ComBat ```{r} harmonizedData = as.data.frame(t(harmonized$dat.combat)) harmonizedData$subject = ctData$subject ``` ```{r} postComBat = left_join(modelData, harmonizedData, by='subject') postRInsula = lm(X2035 ~ age + sex + dx + scanner, data=postComBat) summary(aov(postRInsula)) ``` ```{r} coef(postRInsula)[2:4] ``` ## Conclusions - Data harmonization is an important step in any image analysis that combines data from different sites and/or scanners. - ComBat has been successfully applied to DTI and cortical thickness data to remove scanner effects while preserving biological associations of interest [@fortin2017harmonization; @fortin2018harmonization]. - ComBat code is located at https://github.com/Jfortin1/ComBatHarmonization - Available in R and MATLAB - To obtain valid statistical inference in downstream analyses, uncertainty in ComBat harmonization should be considered ## Wrap-up - We've covered all (or most) image pre-procssing steps in a typical image pre-processing pipeline, starting with raw nifti images. - Everything was done in R!

## What we didn't cover - fMRI: see `fmri` library - Other imaging modalities, e.g., CT, PET - MALF segmentation is robust - Voxel-wise testing: see `voxel` library - Other population-level statistical inference: - Statistical/machine learning: see `caret` library ## What can you do next? - Further modeling and statistical analysis. - Register images to a template to do population inference. - General R - Build your own R libraries for image analysis. - Rmarkdown for reproducible reports - R shiny apps **Resources** - Neurohacking tutorial on Coursera - Neuroconductor: central repository for image analysis R libraries ## Website http://johnmuschelli.com/imaging_in_r ## References {.smaller}