Identification of novel biomarkers for risk prediction is important for disease prevention and optimal treatment selection. However, studies aiming to discover which biomarkers are useful for risk prediction often require stored biological samples from large assembled cohorts, and thus deplete a finite and precious resource. To make efficient use of such stored samples, two-phase sampling designs are often adopted as resource-efficient sampling strategies, especially when the outcome of interest is rare. Existing methods for analyzing data from two-phase studies focus primarily on single marker analysis or on fitting the Cox regression model to combine information from multiple markers. However, the Cox model may not fit the data well, and under model misspecification the composite score derived from it may not perform well in predicting the outcome. Under a general two-phase stratified cohort sampling design, we present a novel approach to combining multiple markers to optimize prediction by fitting a flexible nonparametric transformation model. Using inverse probability weighting to account for the outcome-dependent sampling, we propose to estimate the model parameters by maximizing an objective function that can be interpreted as a weighted C-statistic for survival outcomes. Regardless of model adequacy, the proposed procedure yields a sensible composite risk score for prediction. A major obstacle to making inference in two-phase studies is the correlation induced by finite-population sampling, which prevents standard inference procedures such as the bootstrap from being used for variance estimation. We propose a resampling procedure to derive valid confidence intervals for the model parameters and the C-statistic accuracy measure.
We illustrate the new methods with simulation studies and an analysis of a two-phase study of high-density lipoprotein cholesterol (HDL-C) subtypes for predicting the risk of coronary heart disease.
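To make the idea of a weighted C-statistic concrete, the following is a minimal sketch, not the authors' estimator. It assumes each phase-two subject carries an inverse-probability-of-sampling weight, counts a pair as usable when the subject with the shorter follow-up time had an observed event, and weights each pair by the product of the two subjects' weights. The function name, the tie handling, and the toy data are all hypothetical, and any additional weighting for censoring is omitted.

```python
def weighted_c_statistic(times, events, scores, weights):
    """Inverse-probability-weighted concordance (illustrative helper).

    A pair (i, j) is usable when subject i has an observed event and
    times[i] < times[j]. The pair is concordant when the subject who
    fails earlier has the higher risk score. Each pair contributes
    weights[i] * weights[j]; tied scores count as half concordant.
    """
    numerator = 0.0
    denominator = 0.0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a usable pair
        for j in range(n):
            if times[i] < times[j]:
                pair_weight = weights[i] * weights[j]
                denominator += pair_weight
                if scores[i] > scores[j]:
                    numerator += pair_weight
                elif scores[i] == scores[j]:
                    numerator += 0.5 * pair_weight
    if denominator == 0.0:
        raise ValueError("no usable pairs")
    return numerator / denominator


# Toy phase-two sample (made-up values): cases sampled with probability 1
# (weight 1), non-cases sampled with probability 1/4 (weight 4).
times = [2.0, 5.0, 3.0, 8.0]
events = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.5]
weights = [1.0, 4.0, 1.0, 4.0]

print(weighted_c_statistic(times, events, scores, weights))
print(weighted_c_statistic(times, events, scores, [1.0] * 4))
```

Comparing the two printed values shows how up-weighting the under-sampled non-cases changes the estimate relative to an unweighted concordance on the same subjects. In the approach described above, a quantity of this kind would be maximized over the parameters that define the composite score, rather than evaluated once for fixed scores.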
BACKGROUND: Fragile X syndrome (FXS) is a neurodevelopmental disorder whose biochemical manifestations involve dysregulation of mGluR5-dependent pathways, which are widely modeled using cultured neurons. In vitro phenotypes in cultured neurons using standard morphological, functional, and chemical approaches have demonstrated considerable variability. Here, we study transcriptomes obtained in situ in the intact brain tissues of a murine model of FXS to see how they reflect the in vitro state. METHODS: We used genome-wide mRNA expression profiling as a robust characterization tool for studying differentially expressed pathways in fragile X mental retardation 1 (Fmr1) knockout (KO) and wild-type (WT) murine primary neuronal cultures and in embryonic hippocampal and cortical murine tissue. To study the developmental trajectory and to relate mouse model data to human data, we used an expression map of human development to plot murine differentially expressed genes in KO/WT cultures and brain. RESULTS: We found that transcriptomes from cell cultures showed a stronger signature of Fmr1KO than whole tissue transcriptomes. We observed an over-representation of immunological signaling pathways in embryonic Fmr1KO cortical and hippocampal tissues and over-represented mGluR5-downstream signaling pathways in Fmr1KO cortical and hippocampal primary cultures. Genes whose expression was up-regulated in Fmr1KO murine cultures tended to peak early in human development, whereas differentially expressed genes in embryonic cortical and hippocampal tissues clustered with genes expressed later in human development. CONCLUSIONS: The transcriptional profile in brain tissues primarily centered on immunological mechanisms, whereas the profiles from cell cultures showed defects in neuronal activity. We speculate that the isolation and culturing of neurons caused a shift in neurological transcriptome towards a "juvenile" or "de-differentiated" state. 
Moreover, cultured neurons lack the close coupling with glia that might be responsible for the immunological phenotype in the intact brain. Our results suggest that cultured cells may recapitulate an early phase of the disease, which is also less obscured by the consequent "immunological" phenotype and the in vivo compensatory mechanisms observed in the embryonic brain. Together, these results suggest that the transcriptome of cultured primary neuronal cells, in comparison to whole brain tissue, more robustly demonstrates the difference between Fmr1KO and WT mice and might reveal a molecular phenotype that is typically hidden by compensatory mechanisms present in vivo. Moreover, cultures might be useful for investigating the perturbed pathways in early human brain development and genes previously implicated in autism.
The case-cohort (CCH) design is a cost-effective design for assessing genetic susceptibility with time-to-event data, especially when the event rate is low. In this work, we propose a powerful pseudo-score test for assessing the association between a single nucleotide polymorphism (SNP) and the event time under the CCH design. The pseudo-score is derived from a pseudo-likelihood, which is an estimated retrospective likelihood that treats the SNP genotype as the dependent variable and the time-to-event outcome and other covariates as independent variables. It exploits the fact that the genetic variable is often distributed independently of covariates or is related only to a low-dimensional subset of them. Estimates of hazard ratio parameters for association can be obtained by maximizing the pseudo-likelihood. A unique advantage of our method is that it allows the censoring distribution to depend on covariates that are measured only for the CCH sample, without requiring knowledge of follow-up or covariate information on subjects not selected into the CCH sample. In addition to this flexibility, the proposed method has high relative efficiency compared with commonly used alternative approaches. We study large sample properties of this method and assess its finite sample performance using both simulated and real data examples.
Genetic studies of complex traits have uncovered only a small number of risk markers, which explain a small fraction of heritability and add little improvement to disease risk prediction. Standard single marker methods may lack power in selecting informative markers or estimating effects. Most existing methods also do not account for non-linearity. Identifying markers with weak signals and estimating their joint effects among many non-informative markers remains challenging. One potential approach is to group markers based on biological knowledge such as gene structure. If markers in a group tend to have similar effects, proper use of the group structure could improve power and efficiency in estimation. We propose a two-stage method relating markers to disease risk by taking advantage of known gene-set structures. Imposing a naive Bayes kernel machine (KM) model, we estimate gene-set specific risk models that relate each gene-set to the outcome in stage I. The KM framework efficiently models potentially non-linear effects of predictors without requiring explicit specification of functional forms. In stage II, we aggregate information across gene-sets via a regularization procedure. Estimation and computational efficiency are further improved with kernel principal component analysis. Asymptotic results for model estimation and gene-set selection are derived, and numerical studies suggest that the proposed procedure could outperform existing procedures for constructing genetic risk models.
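As a rough illustration of kernel principal component analysis, the step credited above with improving efficiency, the sketch below extracts the leading kernel principal component of a small marker matrix. It is not the authors' procedure: the Gaussian kernel, the bandwidth, the use of power iteration in place of a full eigendecomposition, the function names, and the toy data are all assumptions made for the example.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors (assumed kernel)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def centered_kernel_matrix(X, bandwidth=1.0):
    """Kernel matrix after centering the features in the implicit feature space."""
    n = len(X)
    K = [[gaussian_kernel(X[i], X[j], bandwidth) for j in range(n)]
         for i in range(n)]
    row_mean = [sum(row) / n for row in K]
    grand_mean = sum(row_mean) / n
    # K is symmetric, so column means equal row means.
    return [[K[i][j] - row_mean[i] - row_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]


def leading_kernel_pc(X, bandwidth=1.0, iterations=200):
    """Leading eigenvalue of the centered kernel matrix and the matching
    principal component scores of the training points, by power iteration."""
    Kc = centered_kernel_matrix(X, bandwidth)
    n = len(X)
    v = [float(i + 1) for i in range(n)]  # arbitrary non-constant start
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(Kc[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm == 0.0:
            raise ValueError("centered kernel matrix annihilated the iterate")
        v = [a / norm for a in w]
        eigenvalue = norm  # converges to the top eigenvalue
    # Projection of point i onto the first component is sqrt(lambda) * v[i].
    scores = [math.sqrt(eigenvalue) * a for a in v]
    return eigenvalue, scores


# Toy data (made up): two well-separated groups of subjects.
X = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
     [3.0, 3.1], [3.1, 2.9], [2.9, 3.0]]
eigenvalue, scores = leading_kernel_pc(X)
print(eigenvalue)
print(scores)
```

On the toy data the first component scores take opposite signs in the two groups, so a single component already captures the dominant structure. Keeping only a few leading components in this way is one route by which kernel PCA can reduce the dimension of a kernel machine fit.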
Natural language processing tools allow the characterization of sentiment--that is, terms expressing positive and negative emotion--in text. Applying such tools to electronic health records may provide insight into meaningful patient or clinician features not captured in coded data alone. We performed sentiment analysis on 2,484 hospital discharge notes for 2,010 individuals from a psychiatric inpatient unit, as well as 20,859 hospital discharges for 15,011 individuals from general medical units, in a large New England health system between January 2011 and 2014. The primary measures of sentiment captured intensity of subjective positive or negative sentiment expressed in the discharge notes. Mean scores were contrasted between sociodemographic and clinical groups in mixed effects regression models. Discharge note sentiment was then examined for association with risk for readmission in Cox regression models. Discharge notes for individuals with greater medical comorbidity were modestly but significantly lower in positive sentiment among both psychiatric and general medical cohorts (p<0.001 in each). Greater positive sentiment at discharge was associated with significantly decreased risk of hospital readmission in each cohort (~12% decrease per standard deviation above the mean). Automated characterization of discharge notes in terms of sentiment identifies differences between sociodemographic groups, as well as in clinical outcomes, and is not explained by differences in diagnosis. Clinician sentiment merits investigation to understand why and how it reflects or impacts outcomes.
Neurons live for decades in a postmitotic state, their genomes susceptible to DNA damage. Here we survey the landscape of somatic single-nucleotide variants (SNVs) in the human brain. We identified thousands of somatic SNVs by single-cell sequencing of 36 neurons from the cerebral cortex of three normal individuals. Unlike germline and cancer SNVs, which are often caused by errors in DNA replication, neuronal mutations appear to reflect damage during active transcription. Somatic mutations create nested lineage trees, allowing them to be dated relative to developmental landmarks and revealing a polyclonal architecture of the human cerebral cortex. Thus, somatic mutations in the brain represent a durable and ongoing record of neuronal life history, from development through postmitotic function.