BACKGROUND: The CEGS N-GRID 2016 Shared Task in Clinical Natural Language Processing (NLP) provided a set of 1000 neuropsychiatric notes to participants as part of a competition to predict psychiatric symptom severity scores. This paper summarizes our methods, results, and experiences based on our participation in the second track of the shared task. OBJECTIVE: Classical methods of text classification usually fall into one of three problem types: binary, multi-class, and multi-label classification. In this effort, we study ordinal regression problems with text data where misclassifications are penalized differently based on how far apart the ground truth and model predictions are on the ordinal scale. Specifically, we present our entries (methods and results) in the N-GRID shared task in predicting research domain criteria (RDoC) positive valence ordinal symptom severity scores (absent, mild, moderate, and severe) from psychiatric notes. METHODS: We propose a novel convolutional neural network (CNN) model designed to handle ordinal regression tasks on psychiatric notes. Broadly speaking, our model combines an ordinal loss function, a CNN, and conventional feature engineering (wide features) into a single model which is learned end-to-end. Given interpretability is an important concern with nonlinear models, we apply a recent approach called locally interpretable model-agnostic explanation (LIME) to identify important words that lead to instance specific predictions. RESULTS: Our best model entered into the shared task placed third among 24 teams and scored a macro mean absolute error (MMAE) based normalized score (100·(1-MMAE)) of 83.86. Since the competition, we improved our score (using basic ensembling) to 85.55, comparable with the winning shared task entry. Applying LIME to model predictions, we demonstrate the feasibility of instance specific prediction interpretation by identifying words that led to a particular decision. 
CONCLUSION: In this paper, we present a method that successfully uses wide features and an ordinal loss function applied to convolutional neural networks for ordinal text classification specifically in predicting psychiatric symptom severity scores. Our approach leads to excellent performance on the N-GRID shared task and is also amenable to interpretability using existing model-agnostic approaches.
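As an illustration of the evaluation measure quoted above, the following is a minimal sketch of a macro-averaged mean absolute error and the derived score 100·(1-MMAE). The abstract gives only the final formula; dividing each class's error by the largest error possible for that class is an assumption made here so the result falls in [0, 1], and the function names are hypothetical.

```python
from collections import defaultdict


def macro_mae(y_true, y_pred, n_classes=4, normalize=True):
    """Macro-averaged MAE over the classes that occur in y_true.

    Assumption (not stated in the abstract): when normalize is True, each
    class's MAE is divided by the largest error possible for that class,
    e.g. 3 for classes 0 and 3 and 2 for classes 1 and 2 on a 0-3 scale.
    """
    errors = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        errors[t].append(abs(t - p))

    per_class = []
    for c, errs in errors.items():
        mae = sum(errs) / len(errs)
        if normalize:
            mae /= max(c, n_classes - 1 - c)
        per_class.append(mae)
    # Every class counts equally, regardless of how many notes it has.
    return sum(per_class) / len(per_class)


def normalized_score(y_true, y_pred):
    """Score of the form 100 * (1 - MMAE), as quoted in the abstract."""
    return 100.0 * (1.0 - macro_mae(y_true, y_pred))


if __name__ == "__main__":
    gold = [0, 1, 2, 3]      # absent, mild, moderate, severe
    system = [3, 1, 2, 3]    # one 'absent' note predicted as 'severe'
    print(normalized_score(gold, system))
```

Because the average is taken over classes rather than notes, a system that ignores a rare severity level is penalized as heavily as one that ignores a common level.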
Engulfment of synapses and neural progenitor cells (NPCs) by microglia is critical for the development and maintenance of proper brain circuitry, and has been implicated in neurodevelopmental as well as neurodegenerative disease etiology. We have developed and validated models of these mechanisms by reprogramming microglia-like cells from peripheral blood mononuclear cells, and combining them with NPCs and neurons derived from induced pluripotent stem cells to create patient-specific cellular models of complement-dependent synaptic pruning and elimination of NPCs. The resulting microglia-like cells express appropriate markers and function as primary human microglia, while patient-matched macrophages differ markedly. As a demonstration of disease-relevant application, we studied the role of C4, recently implicated in schizophrenia, in engulfment of synaptic structures by human microglia. The ability to create complete patient-specific cellular models of critical microglial functions utilizing samples taken during a single clinical visit will extend the ability to model central nervous system disease while facilitating high-throughput screening.
Major depressive disorder frequently co-occurs with medical disorders, raising the possibility of shared genetic liability. Recent identification of 15 novel genetic loci associated with depression allows direct investigation of this question. In cohorts of individuals participating in biobanks at two academic medical centers, we calculated polygenic loading for risk loci reported to be associated with depression. We then examined the association between such loading and 50 groups of clinical diagnoses, or topics, drawn from these patients' electronic health records, determined using a novel application of latent Dirichlet allocation. Three topics showed experiment-wide association with the depression liability score; these included diagnostic groups representing greater prevalence of mood and anxiety disorders, greater prevalence of cardiac ischemia, and a decreased prevalence of heart failure. The latter two associations persisted even among individuals with no mood disorder diagnosis. This application of a novel method for grouping related diagnoses in biobanks indicates shared genetic risk for depression and cardiac disease, with a pattern suggesting greater ischemic risk and diminished heart failure risk.
BACKGROUND: Applications of natural language processing to mental health notes are not common given the sensitive nature of the associated narratives. The CEGS N-GRID 2016 Shared Task in Clinical Natural Language Processing (NLP) changed this scenario by providing the first set of neuropsychiatric notes to participants. This study summarizes our efforts and results in proposing a novel data use case for this dataset as part of the third track in this shared task. OBJECTIVE: We explore the feasibility and effectiveness of predicting a set of common mental conditions a patient has based on the short textual description of the patient's history of present illness, which typically occurs at the beginning of a psychiatric initial evaluation note. MATERIALS AND METHODS: We clean and process the 1000 records made available through the N-GRID clinical NLP task into a key-value dictionary and build a dataset of 986 examples for which there is a narrative for history of present illness as well as Yes/No responses with regard to the presence of specific mental conditions. We propose two independent deep neural network models: one based on convolutional neural networks (CNN) and another based on recurrent neural networks with hierarchical attention (ReHAN), the latter of which allows for interpretation of model decisions. We conduct experiments to compare these methods to each other and to baselines based on linear models and named entity recognition (NER). RESULTS: Our CNN model with optimized thresholding of output probability estimates achieves the best overall mean micro-F score of 63.144% for 11 common mental conditions, with statistically significant gains (p<0.05) over all other models. The ReHAN model with its interpretable attention mechanism achieved a mean micro-F1 score of 61.904%. Both models' improvements over baseline models (support vector machines and NER) are statistically significant. 
The ReHAN model additionally aids in interpretation of the results by surfacing important words and sentences that lead to a particular prediction for each instance. CONCLUSIONS: Although the history of present illness is a short text segment averaging 300 words, it is a good predictor for a few conditions such as anxiety, depression, panic disorder, and attention deficit hyperactivity disorder. Proposed CNN and RNN models outperform baseline approaches and complement each other when evaluating on a per-label basis.
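The "optimized thresholding of output probability estimates" mentioned above can be pictured with a small sketch: for one condition, scan a grid of cut-offs on held-out data and keep the one with the highest F1. This is only an illustration under assumed details; the candidate grid, the use of per-label F1 as the selection criterion, and the function names are hypothetical and not taken from the abstract.

```python
def f1_at_threshold(probs, labels, threshold):
    """F1 for one label when predicting 'Yes' whenever prob >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(probs, labels, candidates=None):
    """Return the candidate cut-off with the highest F1 (first one on ties)."""
    if candidates is None:
        # Hypothetical grid: 0.05, 0.10, ..., 0.95.
        candidates = [i / 100 for i in range(5, 100, 5)]
    return max(candidates, key=lambda t: f1_at_threshold(probs, labels, t))


if __name__ == "__main__":
    # Toy validation outputs for a single condition.
    val_probs = [0.10, 0.30, 0.35, 0.80]
    val_labels = [0, 1, 1, 1]
    t = best_threshold(val_probs, val_labels)
    print(t, f1_at_threshold(val_probs, val_labels, t))
```

A fixed cut-off of 0.5 would miss two of the three positive notes in the toy data, which is the kind of loss that tuning a per-label threshold is meant to avoid.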
In response to the challenges set forth by the CEGS N-GRID 2016 Shared Task in Clinical Natural Language Processing, we describe a framework to automatically classify initial psychiatric evaluation records to one of four positive valence system severities: absent, mild, moderate, or severe. We used a dataset provided by the event organizers to develop a framework comprised of natural language processing (NLP) modules and three predictive models (two decision tree models and one Bayesian network model) used in the competition. We also developed two additional predictive models for comparison purposes. To evaluate our framework, we employed a blind test dataset provided by the 2016 CEGS N-GRID. The predictive scores, measured by the inverse normalized macro-averaged mean absolute error score, from the two decision trees and Naïve Bayes models were 82.56%, 82.18%, and 80.56%, respectively. The framework proposed in this paper can potentially be applied to other predictive tasks for processing initial psychiatric evaluation records, such as predicting 30-day psychiatric readmissions.
OBJECTIVE: Mental health is becoming an increasingly important topic in healthcare. Psychiatric symptoms, which consist of subjective descriptions of the patient's experience, as well as the nature and severity of mental disorders, are critical to support the phenotypic classification for personalized prevention, diagnosis, and intervention of mental disorders. However, few automated approaches have been proposed to extract psychiatric symptoms from clinical text, mainly due to (a) the lack of annotated corpora, which are time-consuming and costly to build, and (b) the inherent linguistic difficulties that symptoms present as they are not well-defined clinical concepts like diseases. The goal of this study is to investigate techniques for recognizing psychiatric symptoms in clinical text without labeled data. Instead, external knowledge in the form of publicly available "seed" lists of symptoms is leveraged using unsupervised distributional representations. MATERIALS AND METHODS: First, psychiatric symptoms are collected from three online repositories of healthcare knowledge for consumers (MedlinePlus, Mayo Clinic, and the American Psychiatric Association) for use as seed terms. Candidate symptoms in psychiatric notes are automatically extracted using phrasal syntax patterns. In particular, the 2016 CEGS N-GRID challenge data serves as the psychiatric note corpus. Second, three corpora (psychiatric notes, psychiatric forum data, and MIMIC II) are adopted to generate distributional representations with paragraph2vec. Finally, semantic similarity between the distributional representations of the seed symptoms and candidate symptoms is calculated to assess the relevance of a phrase. Experiments were performed on a set of psychiatric notes from the CEGS N-GRID 2016 Challenge. RESULTS & CONCLUSION: Our method demonstrates good performance at extracting symptoms from an unseen corpus, including symptoms with no word overlap with the provided seed terms. 
Semantic similarity based on the distributional representation outperformed baseline methods. Our experiment yielded two interesting results. First, distributional representations built from social media data outperformed those built from clinical data. Second, the distributional representation model built from sentences yielded better representations of phrases than the model built from phrases alone.
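The similarity step described in this abstract can be sketched as follows: each candidate phrase is scored by its cosine similarity to the seed symptom vectors, and candidates are ranked by that score. The toy two-dimensional vectors, the choice of the maximum over seeds as the aggregate, and the function names are all assumptions for illustration; in the study the vectors would come from a trained paragraph2vec model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(seed_vecs, cand_vecs):
    """Rank candidate phrases by their best similarity to any seed symptom.

    Assumption: relevance is the maximum cosine similarity over all seeds;
    the abstract does not say how similarities are aggregated.
    """
    scores = {
        phrase: max(cosine(vec, seed) for seed in seed_vecs.values())
        for phrase, vec in cand_vecs.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical embeddings standing in for learned representations.
    seeds = {"insomnia": [1.0, 0.0]}
    candidates = {"trouble sleeping": [0.9, 0.1], "blue car": [0.0, 1.0]}
    for phrase, score in rank_candidates(seeds, candidates):
        print(f"{phrase}: {score:.3f}")
```

Note that "trouble sleeping" shares no words with the seed "insomnia" yet ranks first, which mirrors the abstract's point about recovering symptoms with no word overlap with the seed terms.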
In this paper, we present our system as submitted in the CEGS N-GRID 2016 task 2 RDoC classification competition. The task was to determine symptom severity (0-3) in a domain for a patient based on the text provided in his/her initial psychiatric evaluation. We first preprocessed the psychiatry notes into a semi-structured questionnaire and transformed the short answers into either numerical, binary, or categorical features. We further trained weak Support Vector Regressors (SVR) for each verbose answer and combined regressors' output with other features to feed into the final gradient tree boosting classifier with resampling of individual notes. Our best submission achieved a macro-averaged Mean Absolute Error of 0.439, which translates to a normalized score of 81.75%.
The second track of the CEGS N-GRID 2016 natural language processing shared tasks focused on predicting symptom severity from neuropsychiatric clinical records. For the first time, initial psychiatric evaluation records have been collected, de-identified, annotated and shared with the scientific community. One hundred ten researchers organized in twenty-four teams participated in this track and submitted sixty-five system runs for evaluation. The top ten teams each achieved an inverse normalized macro-averaged mean absolute error score over 0.80. The top performing system employed an ensemble of six different machine learning-based classifiers to achieve a score of 0.86. The task proved to be generally easy, with the exception of two specific classes of records: records with very few but crucial positive valence signals, and records describing patients predominantly affected by negative rather than positive valence. Those cases proved to be very challenging for most of the systems. Further research is required before the task can be considered solved. Overall, the results of this track demonstrate the effectiveness of data-driven approaches to the task of symptom severity classification.
Clinical narratives (the text notes found in patients' medical records) are important information sources for secondary use in research. However, in order to protect patient privacy, they must be de-identified prior to use. Manual de-identification is considered to be the gold standard approach but is tedious, expensive, slow, and impractical for use with large-scale clinical data. Automated or semi-automated de-identification using computer algorithms is a potentially promising alternative. The Informatics Institute of the University of Alabama at Birmingham is applying de-identification to clinical data drawn from the UAB hospital's electronic medical records system before releasing them for research. We participated in the regular de-identification track of a shared task challenge organized by the Centers of Excellence in Genomic Science (CEGS) Neuropsychiatric Genome-Scale and RDoC Individualized Domains (N-GRID) to gain experience developing our own automatic de-identification tool. We focused on the popular and successful methods from previous challenges: rule-based, dictionary-matching, and machine-learning approaches. We also explored new techniques such as disambiguation rules and term ambiguity measurement, and used a multi-pass sieve framework at a micro level. For the challenge's primary measure (strict entity), our submissions achieved competitive results (f-measures: 87.3%, 87.1%, and 86.7%). For our preferred measure (binary token HIPAA), our submissions achieved superior results (f-measures: 93.7%, 93.6%, and 93%). With those encouraging results, we have gained the confidence to improve and use the tool for the real de-identification task at the UAB Informatics Institute.
OBJECTIVE: A major preventable contributor to healthcare costs among older individuals is fall-related injury. We sought to validate a tool to stratify such risk based on readily available clinical data, including projected medication adverse effects, using state-wide medical claims data. DESIGN: Sociodemographic and clinical features were drawn from health claims paid in the state of Massachusetts for individuals aged 35-65 with a hospital admission for a period spanning January-December 2012. Previously developed logistic regression models of hospital readmission for fall-related injury were refit in a training set including a randomly selected 70% of individuals, and examined in a testing set comprised of the remaining 30%. Medications at admission were summarised based on reported adverse effect frequencies in published medication labelling. SETTING: The Massachusetts health system. PARTICIPANTS: A total of 68 764 hospitalised individuals aged 35-65 years. PRIMARY MEASURES: Hospital readmission for fall-related injury defined by claims code. RESULTS: A total of 2052 individuals (3.0%) were hospitalised for fall-related injury within 90 days of discharge, and 3391 (4.9%) within 180 days. After recalibrating the model in a training data set comprised of 48 136 individuals (70%), model discrimination in the remaining 30% test set yielded an area under the receiver operating characteristic curve (AUC) of 0.74 (95% CI 0.72 to 0.76). AUCs were similar across age decades (0.71 to 0.78) and sex (0.72 male, 0.76 female), and across most common diagnostic categories other than psychiatry. For individuals in the highest risk quartile, 11.4% experienced a fall within 180 days versus 1.2% in the lowest risk quartile; 57.6% of falls occurred in the highest risk quartile. CONCLUSIONS: This analysis of state-wide claims data demonstrates the feasibility of predicting fall-related injury requiring hospitalisation using readily available sociodemographic and clinical details. 
This translatable approach to stratification allows for identification of high-risk individuals in whom interventions are likely to be cost-effective.
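The two summary measures reported in this abstract, the AUC and the event rate by risk quartile, can be computed from predicted risks and observed outcomes as in the sketch below. It uses the pairwise (Mann-Whitney) definition of the AUC and a simple rank-based quartile split; the toy data and function names are hypothetical, and the study's own software and confidence-interval method are not described in the abstract.

```python
def auc(scores, outcomes):
    """AUC as the probability that a random case outranks a random non-case.

    Ties between a case and a non-case count as half a win.
    """
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def event_rate_by_quartile(scores, outcomes):
    """Observed event rate in each risk quartile, lowest risk first.

    Assumes at least four individuals so that no quartile is empty.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    rates = []
    for q in range(4):
        idx = order[q * n // 4:(q + 1) * n // 4]
        rates.append(sum(outcomes[i] for i in idx) / len(idx))
    return rates


if __name__ == "__main__":
    # Toy predicted risks and readmission indicators (1 = fall-related injury).
    risk = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
    fell = [0, 0, 0, 1, 0, 0, 1, 1]
    print(auc(risk, fell))
    print(event_rate_by_quartile(risk, fell))
```

The pairwise form is quadratic in the number of individuals, so for a cohort the size of the one in this abstract a rank-based implementation would be the practical choice.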
Multiple studies have examined the association between prenatal antidepressant exposure and risk for autism spectrum disorder (ASD) or attention-deficit hyperactivity disorder (ADHD), with inconsistent results. Precisely estimating such risk, if any, is of great importance in light of the need to balance such risk with the benefit of depression and anxiety treatment. We developed a method to integrate data from multiple New England health systems, matching offspring and maternal health data in electronic health records to characterize diagnoses and medication exposure. Children with ASD or ADHD were matched 1:3 with children without neurodevelopmental disorders. Association between maternal antidepressant exposure and ASD or ADHD liability was examined using logistic regression, adjusting for potential sociodemographic and psychiatric confounding variables. In new cohorts of 1245 ASD cases and 1701 ADHD cases, along with age-, sex- and socioeconomic status-matched controls, neither disorder was significantly associated with prenatal antidepressant exposure in crude or adjusted models (adjusted odds ratio 0.90, 95% confidence interval 0.50-1.54 for ASD; 0.97, 95% confidence interval 0.53-1.69 for ADHD). Pre-pregnancy antidepressant exposure significantly increased risk for both disorders. These results suggest that prior reports of association between prenatal antidepressant exposure and neurodevelopmental disease are likely to represent a false-positive finding, which may arise in part through confounding by indication. They further demonstrate the potential to integrate data across electronic health records studies spanning multiple health systems to enable efficient pharmacovigilance investigation.
Compelling clinical, social, and economic reasons exist to innovate in the process of drug discovery for neuropsychiatric disorders. The use of patient-specific, induced pluripotent stem cells (iPSCs) now affords the ability to generate neuronal cell-based models that recapitulate key aspects of human disease. In the context of neuropsychiatric disorders, where access to physiologically active and relevant cell types of the central nervous system for research is extremely limited, iPSC-derived in vitro culture of human neurons and glial cells is transformative. Potential applications relevant to early stage drug discovery include support of quantitative biochemistry, functional genomics, proteomics, and perhaps most notably, high-throughput and high-content chemical screening. While many phenotypes in human iPSC-derived culture systems may prove adaptable to screening formats, careful consideration is required to determine which in vitro phenotypes are ultimately relevant to disease pathophysiology and therefore more likely to yield effective, disease-modifying pharmacological agents. Here, we review recent examples of studies of neuropsychiatric disorders using human stem cell models where cellular phenotypes linked to disease and functional assays have been reported. We also highlight technical advances using genome-editing technologies in iPSCs to support drug discovery efforts, including the interpretation of the functional significance of rare genetic variants of unknown significance and the creation of cell type- and pathway-selective functional reporter assays. 
Additionally, we evaluate the potential of in vitro stem cell models to investigate early events of disease pathogenesis, in an effort to understand the underlying molecular mechanism, including the basis of selective cell-type vulnerability, and the potential to create new cell-based diagnostics to aid in the classification of patients and subsequent selection for clinical trials. A number of key challenges remain, including the scaling of iPSC models to larger cohorts and integration with rich clinicopathological information and translation of phenotypes. Still, the overall use of iPSC-based human cell models with functional cellular and biochemical assays holds promise for supporting the discovery of next-generation neuropharmacological agents for the treatment and ultimately prevention of a range of severe mental illnesses.
Considerable interest has recently been focused on studying multiple phenotypes simultaneously in both epidemiological and genomic studies, either to capture the multidimensionality of complex disorders or to understand shared etiology of related disorders. We seek to identify multiple regulators or predictors that are associated with multiple outcomes when these outcomes may be measured on very different scales or composed of a mixture of continuous, binary, and not-fully observed elements. We first propose an estimation technique to put all effects on similar scales, and we induce sparsity on the estimated effects. We provide standard asymptotic results for this estimator and show that resampling can be used to quantify uncertainty in finite samples. We finally provide a multiple testing procedure which can be geared specifically to the types of multiple regulators of interest, and we establish that, under standard regularity conditions, the familywise error rate will approach 0 as sample size diverges. Simulation results indicate that our approach can improve over unregularized methods both in reducing bias in estimation and improving power for testing.
Bulent Ataman, Gabriella L Boulting, David A Harmin, Marty G Yang, Mollie Baker-Salisbury, Ee-Lynn Yap, Athar N Malik, Kevin Mei, Alex A Rubin, Ivo Spiegel, Ershela Durresi, Nikhil Sharma, Linda S Hu, Mihovil Pletikos, Eric C Griffith, Jennifer N Partlow, Christine R Stevens, Mazhar Adli, Maria Chahrour, Nenad Sestan, Christopher A Walsh, Vladimir K Berezovskii, Margaret S Livingstone, and Michael E Greenberg. 2016. “Evolution of Osteocrin as an activity-regulated factor in the primate brain.” Nature, 539, 7628, Pp. 242-247.
Sensory stimuli drive the maturation and function of the mammalian nervous system in part through the activation of gene expression networks that regulate synapse development and plasticity. These networks have primarily been studied in mice, and it is not known whether there are species- or clade-specific activity-regulated genes that control features of brain development and function. Here we use transcriptional profiling of human fetal brain cultures to identify an activity-dependent secreted factor, Osteocrin (OSTN), that is induced by membrane depolarization of human but not mouse neurons. We find that OSTN has been repurposed in primates through the evolutionary acquisition of DNA regulatory elements that bind the activity-regulated transcription factor MEF2. In addition, we demonstrate that OSTN is expressed in primate neocortex and restricts activity-dependent dendritic growth in human neurons. These findings suggest that, in response to sensory input, OSTN regulates features of neuronal structure and function that are unique to primates.
Murray B Stein, Chia-Yen Chen, Robert J Ursano, Tianxi Cai, Joel Gelernter, Steven G Heeringa, Sonia Jain, Kevin P Jensen, Adam X Maihofer, Colter Mitchell, Caroline M Nievergelt, Matthew K Nock, Benjamin M Neale, Renato Polimanti, Stephan Ripke, Xiaoying Sun, Michael L Thomas, Qian Wang, Erin B Ware, Susan Borja, Ronald C Kessler, Jordan W Smoller, and Army Study to Assess Risk and Resilience in Servicemembers (STARRS) Collaborators. 2016. “Genome-wide Association Studies of Posttraumatic Stress Disorder in 2 Cohorts of US Army Soldiers.” JAMA Psychiatry, 73, 7, Pp. 695-704.
IMPORTANCE: Posttraumatic stress disorder (PTSD) is a prevalent, serious public health concern, particularly in the military. The identification of genetic risk factors for PTSD may provide important insights into the biological foundation of vulnerability and comorbidity. OBJECTIVE: To discover genetic loci associated with the lifetime risk for PTSD in 2 cohorts from the Army Study to Assess Risk and Resilience in Servicemembers (Army STARRS). DESIGN, SETTING, AND PARTICIPANTS: Two coordinated genome-wide association studies of mental health in the US military contributed participants. The New Soldier Study (NSS) included 3167 unique participants with PTSD and 4607 trauma-exposed control individuals; the Pre/Post Deployment Study (PPDS) included 947 unique participants with PTSD and 4969 trauma-exposed controls. The NSS data were collected from February 1, 2011, to November 30, 2012; the PPDS data, from January 9 to April 30, 2012. The primary analysis compared lifetime DSM-IV PTSD cases with trauma-exposed controls without lifetime PTSD. Data were analyzed from March 18 to December 27, 2015. MAIN OUTCOMES AND MEASURES: Association analyses for PTSD used logistic regression models within each of 3 ancestral groups (European, African, and Latino American) by study, followed by meta-analysis. Heritability and genetic correlation and pleiotropy with other psychiatric and immune-related disorders were estimated. RESULTS: The NSS population was 80.7% male (6277 of 7774 participants; mean [SD] age, 20.9 [3.3] years); the PPDS population, 94.4% male (5583 of 5916 participants; mean [SD] age, 26.5 [6.0] years). A genome-wide significant locus was found in ANKRD55 on chromosome 5 (rs159572; odds ratio [OR], 1.62; 95% CI, 1.37-1.92; P = 2.34 × 10⁻⁸) and persisted after adjustment for cumulative trauma exposure (adjusted OR, 1.64; 95% CI, 1.39-1.95; P = 1.18 × 10⁻⁸) in the African American samples from the NSS. 
A genome-wide significant locus was also found in or near ZNF626 on chromosome 19 (rs11085374; OR, 0.77; 95% CI, 0.70-0.85; P = 4.59 × 10⁻⁸) in the European American samples from the NSS. Similar results were not found for either single-nucleotide polymorphism in the corresponding ancestry group from the PPDS sample, in other ancestral groups, or in transancestral meta-analyses. Single-nucleotide polymorphism-based heritability was nonsignificant, and no significant genetic correlations were observed between PTSD and 6 mental disorders or 9 immune-related disorders. Significant evidence of pleiotropy was observed between PTSD and rheumatoid arthritis and, to a lesser extent, psoriasis. CONCLUSIONS AND RELEVANCE: In the largest genome-wide association study of PTSD to date, involving a US military sample, limited evidence of association for specific loci was found. Further efforts are needed to replicate the genome-wide significant association with ANKRD55 (associated in prior research with several autoimmune and inflammatory disorders) and to clarify the nature of the genetic overlap observed between PTSD and rheumatoid arthritis and psoriasis.
Bipolar disorder (BD) is a prevalent and severe mood disorder characterized by recurrent episodes of mania and depression. Both genetic and environmental factors have been implicated in BD etiology, but the biological underpinnings remain elusive. Recent genome-wide association studies (GWAS) for identifying genes conferring risk for schizophrenia, BD, and major depression, identified an association between single-nucleotide polymorphisms (SNPs) in the SYNE1 gene and increased risk of BD. SYNE1 has also been identified as a risk locus for multiple other neurological or neuromuscular genetic disorders. The BD associated SNPs map within the gene region homologous to part of rat Syne1 encompassing the brain specific transcripts encoding CPG2, a postsynaptic neuronal protein localized to excitatory synapses and an important regulator of glutamate receptor internalization. Here, we use RNA-seq, ChIP-seq and RACE to map the human SYNE1 transcriptome, focusing on the CPG2 locus. We validate several CPG2 transcripts, including ones not previously annotated in public databases, and identify and clone a full-length CPG2 cDNA expressed in human neocortex, hippocampus and striatum. Using lenti-viral gene knock down/replacement and surface receptor internalization assays, we demonstrate that human CPG2 protein localizes to dendritic spines in rat hippocampal neurons and is functionally equivalent to rat CPG2 in regulating glutamate receptor internalization. This study provides a valuable gene-mapping framework for relating multiple genetic disease loci in SYNE1 with their transcripts, and for evaluating the effects of missense SNPs identified by patient genome sequencing on neuronal function.
Despite strong evidence supporting the heritability of major depressive disorder (MDD), previous genome-wide studies were unable to identify risk loci among individuals of European descent. We used self-report data from 75,607 individuals reporting clinical diagnosis of depression and 231,747 individuals reporting no history of depression through 23andMe and carried out meta-analysis of these results with published MDD genome-wide association study results. We identified five independent variants from four regions associated with self-report of clinical diagnosis or treatment for depression. Loci with a P value <1.0 × 10⁻⁵ in the meta-analysis were further analyzed in a replication data set (45,773 cases and 106,354 controls) from 23andMe. A total of 17 independent SNPs from 15 regions reached genome-wide significance after joint analysis over all three data sets. Some of these loci were also implicated in genome-wide association studies of related psychiatric traits. These studies provide evidence for large-scale consumer genomic data as a powerful and efficient complement to data collected from traditional means of ascertainment for neuropsychiatric disease genomics.
BACKGROUND: Autism spectrum disorder (ASD) is a common neurodevelopmental disorder that tends to co-occur with other diseases, including asthma, inflammatory bowel disease, infections, cerebral palsy, dilated cardiomyopathy, muscular dystrophy, and schizophrenia. However, the molecular basis of this co-occurrence, and whether it is due to a shared component that influences both pathophysiology and environmental triggering of illness, has not been elucidated. To address this, we deploy a three-tiered transcriptomic meta-analysis that functions at the gene, pathway, and disease levels across ASD and its co-morbidities. RESULTS: Our analysis reveals a novel shared innate immune component between ASD and all but three of its co-morbidities that were examined. In particular, we find that the Toll-like receptor signaling and the chemokine signaling pathways, which are key pathways in the innate immune response, have the highest shared statistical significance. Moreover, the disease genes that overlap these two innate immunity pathways can be used to classify the cases of ASD and its co-morbidities vs. controls with at least 70% accuracy. CONCLUSIONS: This finding suggests that a neuropsychiatric condition and the majority of its non-brain-related co-morbidities share a dysregulated signal that serves not only as a common genetic basis for the diseases but also as a link to environmental triggers. It also raises the possibility that treatment and/or prophylaxis used for disorders of innate immunity may be successfully used for ASD patients with immune-related phenotypes.
Large assembled cohorts with banked biospecimens offer valuable opportunities to identify novel markers for risk prediction. When the outcome of interest is rare, an effective strategy to conserve limited biological resources while maintaining reasonable statistical power is the case cohort (CCH) sampling design, in which expensive markers are measured on a subset of cases and controls. However, the CCH design introduces significant analytical complexity due to outcome-dependent, finite-population sampling. Current methods for analyzing CCH studies focus primarily on the estimation of simple survival models with linear effects; testing and estimation procedures that can efficiently capture complex non-linear marker effects for CCH data remain elusive. In this article, we propose inverse probability weighted (IPW) variance component type tests for identifying important marker sets through a Cox proportional hazards kernel machine (CoxKM) regression framework previously considered for full cohort studies (Cai et al., 2011). The optimal choice of kernel, while vitally important to attain high power, is typically unknown for a given dataset. Thus, we also develop robust testing procedures that adaptively combine information from multiple kernels. The proposed IPW test statistics have complex null distributions that cannot easily be approximated explicitly. Furthermore, due to the correlation induced by CCH sampling, standard resampling methods such as the bootstrap fail to approximate the distribution correctly. We, therefore, propose a novel perturbation resampling scheme that can effectively recover the induced correlation structure. Results from extensive simulation studies suggest that the proposed IPW CoxKM testing procedures work well in finite samples. The proposed methods are further illustrated by application to a Danish CCH study of Apolipoprotein C-III markers on the risk of coronary heart disease.