We are interested in a variety of methodological issues that arise in complex trait genetic studies, both in terms of the statistical framing of research questions and their implementation in freely-available software that the community can use. Below, we give some illustrative examples.
How can we efficiently analyse genome-wide genetic variation data to identify specific single nucleotide polymorphism (SNP) risk alleles? How should we control for likely sources of false positive and false negative errors? As well as implicating specific loci, can we ask broader questions about the role of common variation en masse, where risk may arise from a very large number of alleles, most of which are individually unlikely to be detected? How best can we characterize the genetic architecture of a common, polygenic disease such as schizophrenia, and what are the implications for future genetic studies?
Previous work: PLINK toolset for GWAS analysis (PubMed | PDF | code); initial demonstration of high polygenicity in schizophrenia (PubMed | PDF); controlling for subtle population substructure via a family-based polygenic analysis (PubMed | PDF); evaluating alternative synthetic association models (PubMed | PDF).
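One common way to frame the en masse question is a polygenic score: a weighted count of an individual's putative risk alleles, with weights estimated in an independent discovery sample. The following is a minimal Python sketch under stated assumptions, not the PLINK implementation. Genotypes are assumed to be coded as 0/1/2 copies of the scored allele, with None for missing calls; the function name `polygenic_score` and the example weights are hypothetical.

```python
from typing import Optional, Sequence


def polygenic_score(dosages: Sequence[Optional[int]],
                    weights: Sequence[float]) -> float:
    """Average weighted allele count over the non-missing SNPs."""
    if len(dosages) != len(weights):
        raise ValueError("one weight per SNP is required")
    total = 0.0
    n_observed = 0
    for dosage, weight in zip(dosages, weights):
        if dosage is None:  # skip missing genotypes (an assumption of this sketch)
            continue
        total += dosage * weight
        n_observed += 1
    if n_observed == 0:
        raise ValueError("no non-missing genotypes")
    return total / n_observed


# Hypothetical log-odds-ratio weights for four SNPs (illustrative values only).
weights = [0.05, -0.02, 0.10, 0.03]
person = [2, 1, None, 0]  # third genotype is missing
score = polygenic_score(person, weights)
```

Averaging over non-missing SNPs keeps scores comparable between individuals with different amounts of missing data; other choices, such as imputing the expected dosage, are equally plausible.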
Deletions and duplications of genetic material (sometimes involving dozens of genes) are rare events, although very many people will carry at least one such copy number variant (CNV) somewhere in their genome (sometimes inherited, sometimes arising de novo). CNVs such as the 22q11.2 deletion have long been known to cause multiorgan syndromes that include psychiatric symptoms; more recently, the broader role of such mutations in conferring risk for common psychiatric diseases has been documented, including work in autism and schizophrenia. How can CNVs (that is, ploidy or copy number status) be accurately called, from SNP microarrays or sequence data? How should CNVs be classified according to likely pathogenicity? What is the best way to find specific loci that harbor risk CNVs? How should one establish genome-wide burdens of (rare) CNVs? Do common variants and CNVs tend to impact the same genes or genetic pathways?
Previous work: calling CNVs from exome sequencing data (XHMM) (PubMed | PDF | code); application to International Schizophrenia Consortium dataset (PubMed | PDF); robust methods to establish that classes of gene show enriched CNV burden (PubMed | PDF).
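The genome-wide burden question above can be illustrated with a simple label-permutation test: compare the mean number of rare CNVs per individual in cases and controls, then ask how often a difference at least as large appears when case/control labels are shuffled. This is a minimal sketch under stated assumptions, not the method used in the cited work. The function names, the one-sided test statistic and the toy counts are hypothetical, and a real analysis would also need to account for covariates such as genotyping batch.

```python
import random
from typing import List, Sequence


def mean_difference(counts: Sequence[int], labels: Sequence[bool]) -> float:
    """Mean CNV count in cases minus mean CNV count in controls."""
    cases = [c for c, is_case in zip(counts, labels) if is_case]
    controls = [c for c, is_case in zip(counts, labels) if not is_case]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    return sum(cases) / len(cases) - sum(controls) / len(controls)


def burden_permutation_test(counts: Sequence[int], labels: Sequence[bool],
                            n_perm: int = 10000, seed: int = 1) -> float:
    """One-sided empirical p-value for excess CNV burden in cases."""
    rng = random.Random(seed)
    observed = mean_difference(counts, labels)
    shuffled: List[bool] = list(labels)
    n_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute case/control status
        if mean_difference(counts, shuffled) >= observed:
            n_as_extreme += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (1 + n_as_extreme) / (1 + n_perm)


# Toy data: rare CNV counts for six cases followed by six controls.
counts = [3, 4, 5, 4, 3, 5, 0, 1, 0, 1, 0, 0]
labels = [True] * 6 + [False] * 6
p_value = burden_permutation_test(counts, labels, n_perm=2000, seed=1)
```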
We are primarily engaged in exome sequencing studies of schizophrenia and bipolar disorder. A large number of methodological issues arise in such studies. What is the optimal way to test genes or groups of genes for association with disease? What are the pros and cons of different design strategies? How should we test for enrichment of genes or sets of genes from case-only trio studies of de novo mutation? We also need tools to handle the large, complex variation datasets that emerge from next-generation sequencing pipelines.
Current projects: development of the PLINK/Seq software package (code; manuscript in preparation); the dnenrich tool for the analysis of de novo mutation (available on request).
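To make the de novo enrichment question concrete, one very simplified framing is a binomial test: if a gene set accounts for a known fraction of the mutational target, how surprising is the number of observed de novo mutations that fall in it? The sketch below is illustrative only and is not a description of how dnenrich works. It assumes each mutation lands in the set independently with the same probability, ignoring sequence context and gene-specific mutation rates; the function name and the example numbers are hypothetical.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


# Hypothetical example: 50 de novo mutations in total, 10 of them in a gene
# set assumed to make up 5% of the mutational target.
p_enrich = binomial_upper_tail(k=10, n=50, p=0.05)
```

A permutation-based approach that respects gene size and local mutation rates would be more defensible in practice; the binomial version only shows the shape of the calculation.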
No gene acts alone. Consequently, analysing patterns of variation across gene networks and pathways can potentially help us to better understand the biological processes that underlie risk. How should one test entire genes and networks for association or enrichment? How can different types of prior knowledge be incorporated in gene-mapping, including gene sets, protein-protein interaction and co-expression networks? How best to integrate genetic association data across common SNPs, CNVs, rare variants and de novo mutation?
Current projects: PPI network based analyses of rare variant data; Bayesian networks for cross-modality analysis.
Individuals of different genetic ancestry can often differ in their risk of certain diseases and exposures to environmental risk factors. Separating out the potential confounding influences of ancestry in genetic association studies, as well as looking for population-specific genetic risks, poses a number of methodological questions, especially for studies of very rare variation (which is more likely to be population specific). Furthermore, just as some individuals in a sample may be less similar to each other than expected by chance (implying they likely come from different ancestral populations), some individuals may be more similar than expected by chance (implying they are related, sharing genetic material identical-by-descent (IBD) from a relatively recent common ancestor). Can we leverage information on IBD to augment genetic association studies? What advantages are there to studying families versus populations (that is, closely-related versus only distantly-related or unrelated individuals)?
Previous work: multidimensional scaling to infer ancestry, complete linkage hierarchical clustering and the pairwise population concordance test (PubMed | PDF | code); latent class analysis applied to multilocus genotype data to infer ancestry (PubMed | PDF | code); how ascertainment on family history of disease impacts the power of family-based association studies (PubMed | PDF).
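The notion of two individuals being more or less similar than expected can be made concrete with a pairwise identity-by-state (IBS) similarity: the average proportion of alleles two people share across SNPs. The following is a minimal Python sketch under stated assumptions. Genotypes are coded as 0/1/2 allele counts with None for missing calls, and the function name `ibs_similarity` is hypothetical. IBS is only the observed sharing; inferring IBD from it additionally requires allele frequencies, which this sketch does not attempt.

```python
from typing import Optional, Sequence


def ibs_similarity(a: Sequence[Optional[int]],
                   b: Sequence[Optional[int]]) -> float:
    """Proportion of alleles shared identical-by-state over non-missing SNPs."""
    if len(a) != len(b):
        raise ValueError("both individuals need the same SNPs")
    shared = 0
    n_snps = 0
    for ga, gb in zip(a, b):
        if ga is None or gb is None:  # skip SNPs missing in either person
            continue
        shared += 2 - abs(ga - gb)  # 2, 1 or 0 alleles shared at this SNP
        n_snps += 1
    if n_snps == 0:
        raise ValueError("no SNPs observed in both individuals")
    return shared / (2 * n_snps)


person_a = [0, 1, 2, 2]
person_b = [0, 1, 0, None]
similarity = ibs_similarity(person_a, person_b)
```

A matrix of one minus these pairwise similarities is the kind of distance matrix that multidimensional scaling or hierarchical clustering could take as input.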
Just as common diseases exhibit many-to-one relationships in terms of genotype-to-phenotype association, the converse one-to-many relationship, that a single allele or gene may influence multiple phenotypic outcomes, is repeatedly observed too. How can we best leverage these cross-disorder or pleiotropic relations in gene discovery? Is sharing at the level of the same allele, gene, or only broader gene networks? To what extent can genetic information be used to define genetically and clinically more homogeneous subsets of patients within a single diagnostic category, such as schizophrenia? Can we also focus on what distinguishes two related disorders genetically?
In what scenarios might genetic information be potentially relevant for clinical or reproductive decision-making? How can we make optimal use of available environmental and family-history information in addition to genotypes at risk alleles? How can very rare and de novo variants in risk genes be integrated into calculations for liability to common diseases?
Some of our primary disease-oriented genetic studies and collaborations, past and present.
The PGC is a large international consortium to pursue genome-wide analyses of common psychiatric disease including schizophrenia, bipolar disorder, autism, ADHD and major depressive disorder. The PGC is extending its investigations to include copy number variants and rare variants from sequencing studies, and is designing a PsychChip to study many putative risk alleles in very large samples.
A secondary, but important, focus of the PGC has been on the extent to which genetic observations are shared or unique across these major psychiatric diseases. The PGC's cross-disorder analysis manuscript (Lancet, 2013), which used a number of methods developed by our group, reported on genome-wide data from 33,332 cases and 27,888 controls, indicating that specific risk SNPs are often associated with a range of neuropsychiatric outcomes.
Along with Sinai collaborator Pamela Sklar, Dr. Purcell has enjoyed a long-standing collaboration with investigators at the Karolinska Institutet (Christina Hultman), University of North Carolina (Patrick Sullivan) and the Stanley Center for Psychiatric Research (Jennifer Moran, Steve McCarroll). We have reported a primary genome-wide association study and CNV analysis (PubMed | PDF) as well as population genetic work (PubMed | PDF). We are continuing data collection and analysis of this sample, including further GWAS, exome sequencing and cognitive data collection.
The goal of the Bipolar Sequencing Consortium is to identify genetic variants that influence risk of bipolar disorder, by bringing together rare variant sequencing studies from both population-based and family-based studies. The groups involved are described here; for BSC investigators there is also a private resources page. Our group is heavily involved in the analysis of the population-based studies in particular.
We have collaborated extensively with researchers from the Centre for Neuropsychiatric Genetics and Genomics, University of Cardiff, Wales, on the analysis of a family-based sample of schizophrenia patients from Bulgaria. As well as polygenic (PubMed | PDF) and de novo CNV (PubMed | PDF) studies, we have recently completed exome sequencing of this sample for de novo single nucleotide variants. Working in close collaboration with the Cardiff investigators, our group spearheaded various aspects of the data generation and analytic pipelines. This work was in collaboration with the Sanger Institute (Aarno Palotie) and the Stanley Center for Psychiatric Research.
The ISC is a group of researchers from over a dozen institutions who collaborated to perform large-scale genome-wide studies of schizophrenia. We published two prominent manuscripts describing our primary work, on the roles of common SNPs (PubMed | PDF) and rare copy number variants (PubMed | PDF) in predisposing to the disease (media coverage: BBC, Independent, NPR and NY Times). Furthermore, the ISC data have been used in a broad array of publications: for example, in following up individual associated loci (1, 2, 3, 4, 5, 6, 7), in broader analyses of genetic architecture (1, 2, 3, 4, 5), in population genetics (1, 2) and in methods development (1, 2, 3, 4, 5). The primary work of the ISC is now complete, although the data generated, and the collaborative ethos, continue to be part of the PGC.
The International Sleep Genetic Epidemiology Consortium (ISGEC) is a collaborative effort to perform meta-analyses of multiple Genome-Wide Association Studies to identify novel loci impacting sleep apnea and sleep traits. Ongoing ISGEC efforts include studies on sleep apnea and sleep architecture phenotypes measured by overnight polysomnography, as well as self-reported and actigraphy-assessed sleep behavior traits.