The primary aim of the Women’s Genome Health Study (WGHS) is to create a comprehensive, fully searchable genome-wide database of >360 000 single-nucleotide polymorphisms among at least 25 000 initially healthy American women participating in the ongoing NIH-funded Women’s Health Study (WHS). These women have already been followed over a 12-year period for major incident health events including but not limited to myocardial infarction, stroke, cancer, diabetes, osteoporosis, venous thromboembolism, cognitive decline, and common visual disorders such as age-related macular degeneration and cataracts. Investigations within the WGHS will seek to identify relevant patterns of genetic polymorphism that predict future disease states in otherwise healthy American women, and to evaluate patterns of genetic polymorphism that relate to multiple intermediate phenotypes including blood-based determinants of disease that were measured at baseline for each study participant. By linking genome-wide data to the existing epidemiologic databank of the parent WHS, which includes comprehensive dietary, behavioral, and traditional exposure data on each participant since cohort inception in 1992, the WGHS will also allow exploration of gene-environment and gene-gene interactions as they relate to incident disease states. Thus, with continued follow-up of the WHS, the WGHS provides a unique scientific resource: a full-cohort, prospective, genome-wide association study among initially healthy American women.
With advances in our understanding of the genetic basis of human disease, it has become apparent that the underlying causes of many chronic disorders are multifactorial and involve a complex interplay between acquired and inherited risk factors. With the advent of genome-wide scanning technologies, it is now possible to ascertain information on most of the common genetic variations in individual patients. Obtaining this genome-wide information in well-characterized patient groups with or at risk for disease is a critical step in moving toward a genome-based practice of medicine that not only provides insight into the root causes of disease, but also forms the basis for discovery of new and specific targets for drug therapy. Genome-based medicine may also fulfill the promise of personalized medicine and provide the means to implement patient-specific preventive programs years in advance of clinical symptoms (1)(2)(3).
A favored analytic approach for such discovery is the genome-wide association study (GWAS)¹ in which genetic variation across the human genome is compared between patients with different disease states or different risk-factor profiles. Success in GWAS requires a comprehensive knowledge of genome-wide variation and linkage disequilibrium patterns, the availability of dense genotyping chip sets containing several hundred thousand single-nucleotide polymorphisms (SNPs), and the availability of large, well-phenotyped patient populations (4)(5)(6)(7)(8)(9)(10). Appropriate patient populations can take the form of retrospective case-control studies (in which patients with and without existing disease are compared for genetic variation), prospective nested case-control studies (in which incident cases and matched controls who remain free of disease are selected from within an ongoing prospective cohort), or cross-sectional family-based studies (in which affected and unaffected parents and children are evaluated across generations).
A potentially more powerful approach to GWAS is the large-scale prospective cohort study, in which initially healthy individuals are followed over long time periods and assessed for the development of disease, and in which all members of the cohort undergo comprehensive genotyping. Such full-ascertainment prospective cohort studies have the advantage of avoiding bias in the selection of case and control subjects and enable simultaneous evaluation of a large number of environmental exposures and potential disease states in an epidemiologically efficient manner. Large-scale prospective cohort studies are also an optimal setting in which to evaluate gene-gene and gene-environment interactions likely to be of interest for complex disorders such as cardiovascular disease, stroke, diabetes, and cancer for which substantive environmental determinants are known. Unlike retrospective case-control or prospective nested case-control study designs, the full-cohort approach also allows the investigation of different diseases simultaneously, can easily include future cases in analyses without concern for ascertainment bias, reduces laboratory variability because the full cohort is genotyped at the same time and in random order, and markedly improves the ability to evaluate gene-environment interactions when environmental exposure is rare. The disadvantages of this approach are the greater expense of cohort assembly and baseline genotyping, as well as the need for comprehensive and ongoing long-term follow-up and endpoint ascertainment. Follow-up periods of a decade or longer are typically required in prospective cohort settings to allow for sufficient accrual of incident disease states.
The Women’s Genome Health Study (WGHS) is an ongoing prospective cohort GWAS that derives from the NIH-funded Women’s Health Study (WHS) and includes more than 25 000 initially healthy women who have already been followed for more than 12 years for the development of common disorders such as myocardial infarction, stroke, cancer, venous thromboembolism, diabetes, osteoporosis, cognitive decline, and common visual disorders such as age-related macular degeneration and cataracts. Because each WGHS participant is also a WHS participant, full epidemiologic data on a broad range of behavioral, dietary, and environmental risk exposures are available for the study population. In addition, each WGHS participant provided a baseline blood sample that has already been evaluated for multiple disease biomarkers including total, HDL, and LDL cholesterol, triglycerides, apolipoprotein A-I, apolipoprotein B100, lipoprotein(a), homocysteine, high-sensitivity C-reactive protein (hsCRP), soluble intercellular adhesion molecule type-1 (sICAM-1), fibrinogen, creatinine, and hemoglobin A1c. Each baseline blood sample also had genomic DNA extracted and is now undergoing genotyping for more than 360 000 SNPs using the HapMap-based Human-Hap300 Duo-plus BeadChip platform.
In this report we describe the WGHS and its parent WHS from the perspectives of cohort assembly, follow-up, endpoint validation, baseline plasma phenotyping, DNA extraction, genotyping, participant confidentiality, and power and sample size, and we discuss the WGHS in the context of other ongoing GWAS being performed in related areas.
cohort assembly and prospective follow-up
All members of the WGHS cohort were participants in the WHS who provided an adequate baseline blood sample for plasma and DNA analysis and who gave consent for blood-based analyses and long-term follow-up.
The WHS was initiated in 1992 to evaluate the balance of benefits and risks of low-dose aspirin and vitamin E in the primary prevention of cardiovascular disease and cancer in women (1)(2)(3)(4)(5)(6)(7)(8)(9)(10)(11)(12)(13)(14). Since its inception, the study has been continuously funded by the National Heart, Lung, and Blood Institute and the National Cancer Institute, with study agents provided by Bayer and the Natural Source Vitamin E Association.
Between September 1992 and May 1995, letters of invitation to participate in the WHS were sent to more than 1.7 million US female health professionals; 453 787 women completed the questionnaires, with 65 169 initially willing and eligible to enroll. Women were eligible if they were 45 years old or older; had no history of coronary heart disease, cerebrovascular disease, cancer (except nonmelanoma skin cancer), or other major chronic illness; had no history of side effects to study medications; were not taking aspirin or nonsteroidal antiinflammatory medications more than once a week; were not taking anticoagulants or corticosteroids; and were not taking individual supplements of vitamin A, E, or β-carotene more than once a week. Eligible women were enrolled in a 3-month run-in period of placebo administration to identify a group likely to be compliant with long-term treatment and follow-up. A total of 39 876 women were willing, eligible, and compliant during the run-in phase and were randomized in a 2 × 2 factorial design to 1 of 4 treatment groups: active aspirin (100 mg orally every other day) and vitamin E placebo, aspirin placebo and active vitamin E (600 IU orally every other day), both active agents, or both placebos.
Annually during the trial period, WHS participants were sent a 1-year supply of monthly calendar packs containing active agents or placebo, as well as questionnaires seeking information about compliance, side effects, the occurrence of relevant clinical endpoints, risk factors, and a comprehensive food frequency questionnaire. Study medications and endpoint ascertainment were continued in a blinded fashion through the scheduled end of the trial; randomized follow-up was completed in February 2005. At that time, rates of follow-up with respect to morbidity and mortality were 97.2% complete and 99.4% complete, respectively. The primary findings of the WHS with regard to the effects of aspirin and vitamin E on the primary trial endpoints of cardiovascular disease and cancer were presented in 2005 (12)(13)(14). Follow-up of the WHS cohort has continued without interruption since that time and is ongoing with 98% participation rates.
The WGHS cohort described here comprises 28 345 (71.1%) of the 39 876 WHS participants who provided a baseline blood sample adequate for plasma and DNA analysis before randomization and consented to ongoing analyses linking blood-derived observations with baseline risk factor profiles and incident disease events. The WHS trial and cohort follow-up, as well as the WGHS, were approved by the institutional review board of Brigham and Women’s Hospital, Boston, MA, and monitored by an external data and safety monitoring board.
Since study enrollment, all WHS participants have been followed prospectively for the occurrence of common clinical outcomes. For the primary trial endpoints of cardiovascular disease and cancer, full medical records are obtained for reported endpoints and reviewed by an endpoints committee of physicians unaware of randomized treatment assignment.
WHO criteria are used to confirm the occurrence of myocardial infarction on the basis of symptoms and associated abnormal concentrations of cardiac enzymes or diagnostic electrocardiograms. Stroke is confirmed if the participant has a new neurologic deficit of sudden onset that persists for >24 h. Clinical information as well as computed tomographic scans or MRI are used to distinguish hemorrhagic from ischemic events. Cardiovascular deaths are confirmed by autopsy reports, death certificates, medical records, and information obtained from family members. Reports of coronary revascularization procedures (bypass surgery or percutaneous coronary angioplasty) are confirmed by record review. Transient ischemic attacks are confirmed if the neurologic deficit of sudden onset lasted for <24 h. The diagnosis of deep-vein thrombosis is confirmed by a positive venous ultrasonography or venography report, whereas the diagnosis of pulmonary embolism is confirmed by a positive angiogram or computed tomography scan of the chest, or a ventilation-perfusion scan with 2 or more mismatched defects. Deaths due to pulmonary embolism are confirmed when autopsy reports, symptoms, circumstances, and medical history are consistent with this diagnosis.
Cancers are confirmed on the basis of pathologic or cytology reports (96.8%) or, rarely, based on strong clinical and radiological or laboratory marker evidence (e.g., increased CA-125) when a pathology or cytology review was not conducted. All cancers are coded for site, type, and when available, metastatic spread.
Additional endpoints ascertained in the WHS include the occurrence of diabetes, incident hypertension, bone fracture, osteoporosis, cognitive decline, peripheral arterial disease, colonic polyps, and common visual disorders such as age-related macular degeneration and cataracts. The methods used for validation of these endpoints are described elsewhere (15)(16)(17)(18).
baseline blood collection, processing, storage, and plasma phenotyping
All participants in the WGHS provided baseline blood samples collected in EDTA and citrate that were shipped overnight in cooled packaging to a central storage facility where they were separated into plasma and buffy-coat fractions, divided into aliquots, and stored in liquid nitrogen.
Funding from the Donald W. Reynolds Foundation (Las Vegas, NV) (19) has enabled biomarker analysis of each plasma sample in a core laboratory certified by the National Heart, Lung, and Blood Institute/CDC Lipid Standardization program. Concentrations of total cholesterol (TC) and HDL-C were measured enzymatically on a Hitachi 911 autoanalyzer (Roche Diagnostics) with day-to-day reproducibility of 1.36% and 1.07% for TC concentrations of 129.8 and 277.2 mg/dL, respectively (throughout this report, concentrations and units given are those reported in the referenced sources), and of 1.98% and 2.68% for HDL-C concentrations of 35 and 55 mg/dL, respectively. LDL-C was determined directly (Genzyme) with reproducibility of 2.16% and 1.98% for concentrations of 76.2 and 148.7 mg/dL, respectively. Apolipoprotein-B100 and apolipoprotein-A-I were measured by an immunoturbidimetric technique, also on the Hitachi 911 analyzer. These assays employed the WHO/IFCC standards, and a validation study against those used at the Northwest Lipid Research Laboratory revealed a correlation coefficient of 0.98, intercept of 0.26 mg/dL, and slope of 0.97 for apolipoprotein-B100, and a correlation coefficient of 0.99, intercept of 0.264 mg/dL, and slope of 1.0 for apolipoprotein-A-I (20). Reproducibility was 3.68% and 2.95% for apolipoprotein-A-I concentrations of 56.4 and 164.2 mg/dL, respectively, and 4.94% and 4.13% for apolipoprotein-B100 concentrations of 49.7 and 146.3 mg/dL, respectively. Triglycerides were measured enzymatically, with correction for endogenous glycerol, using a Hitachi 917 analyzer and reagents and calibrators from Roche Diagnostics; reproducibility was 1.52% and 1.49% for triglyceride concentrations of 82.5 and 178.8 mg/dL, respectively (21). In addition, full nuclear MR-based lipoprotein profiling is available on the full study cohort (LipoScience).
High-sensitivity C-reactive protein (hsCRP) was measured using a validated immunoturbidimetric method (Denka Seiken) with reproducibility of 2.16% and 3.34% for hsCRP concentrations of 1.94 and 11.42 mg/L, respectively (22). Lipoprotein(a) [Lp(a)] was evaluated with an apo(a)-independent assay with reproducibility of 2.47% and 1.45% for Lp(a) concentrations of 18.5 and 53.3 mg/dL, respectively (23). Homocysteine was determined enzymatically (Catch) with reproducibility of 4.72% and 3.06% at concentrations of 6.0 and 13.3 μmol/L, respectively, and hemoglobin A1c was measured using turbidimetric immunoinhibition directly from packed red blood cells (Roche Diagnostics) with reproducibility of 3.63% and 3.77% at levels of 5.2% and 8.8%, respectively (24). Creatinine was measured by a rate-blanked method based on the Jaffe reaction using Roche Diagnostics reagents with reproducibility of 3.67% and 1.60% at concentrations of 1.17 and 6.40 mg/dL, respectively. Fibrinogen was measured by a mass-based immunoturbidimetric assay (DiaSorin) with reproducibility of 5.20% and 3.99% at concentrations of 99.1 and 273.7 mg/dL, respectively (25). Finally, sICAM-1 was measured by quantitative sandwich ELISA (R&D Systems) with reproducibility of 8.89% and 6.39% at concentrations of 171.8 and 289.1 μg/L, respectively (26).
Of samples received by the core laboratory, 27 748 (98%) underwent successful evaluation for all biomarkers.
dna extraction and genotyping procedures
Genomic DNA extraction was performed on buffy-coat samples obtained at baseline from each participant (made possible by funding from Roche Diagnostics, the Doris Duke Charitable Foundation, and the Leducq Foundation). The MagNA Pure LC System (Roche Molecular Biochemicals) based on magnetic bead technology was used according to manufacturer’s specifications to perform all DNA isolation steps. The integrity of the isolated DNA was checked randomly on 1% agarose gel together with molecular weight marker III (Roche Molecular Biochemicals). DNA yields were calculated from the OD260 nm measurement, and purity assessed by calculating the ratio of OD260 nm to OD280 nm (24).
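The yield and purity calculation described above can be sketched as follows. The conversion factor of 50 ng/µL per absorbance unit for double-stranded DNA at a 1 cm path length is the common textbook convention, not a value stated in this report; the function names, the dilution factor, and the example readings are hypothetical.

```python
# Illustrative sketch only: the conversion factor is a textbook convention,
# not a parameter reported by the WGHS laboratory.
DSDNA_NG_PER_UL_PER_A260 = 50.0  # assumed: 1 A260 unit ~ 50 ng/uL dsDNA, 1 cm path


def dna_concentration_ng_per_ul(od260: float, dilution_factor: float = 1.0) -> float:
    """Estimate dsDNA concentration (ng/uL) from an OD260 reading."""
    return od260 * DSDNA_NG_PER_UL_PER_A260 * dilution_factor


def dna_yield_ug(od260: float, volume_ul: float, dilution_factor: float = 1.0) -> float:
    """Estimate total DNA yield (ug) for an eluate of the given volume (uL)."""
    return dna_concentration_ng_per_ul(od260, dilution_factor) * volume_ul / 1000.0


def purity_ratio(od260: float, od280: float) -> float:
    """Return the OD260/OD280 ratio used as a rough purity indicator."""
    if od280 <= 0:
        raise ValueError("OD280 must be positive")
    return od260 / od280


if __name__ == "__main__":
    # Hypothetical readings for a 1:10 diluted sample eluted in 100 uL.
    print(dna_concentration_ng_per_ul(0.5, dilution_factor=10))
    print(dna_yield_ug(0.5, volume_ul=100, dilution_factor=10))
    print(purity_ratio(0.9, 0.5))
```

Ratios near 1.8 are conventionally taken to indicate DNA largely free of protein contamination; the acceptance range actually applied in the WGHS is not stated in this report.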
SNP genotyping of these DNA samples is performed using the Illumina Infinium II assay (27) to query a genome-wide set of 315 176 haplotype-tagging SNP markers (the Human HAP300 panel) (28). We added to this a focused panel of 45 882 missense and haplotype-tagging SNPs selected to enhance coverage of genomic regions in which we have a strong a priori interest owing to the presence of genes believed to be of relevance to cancer as well as metabolic, cardiovascular, and inflammatory diseases (Human HAP300 Duo-plus). DNA samples are genotyped in batches of 95 WGHS participants with 1 CEPH (Centre d’Etude du Polymorphisme Humain) DNA (NA10846) included to monitor genotyping consistency and plate orientation. Genotyping reactions use 750 ng of genomic DNA where possible, although in some cases successful genotyping has been performed with as little as 45 ng of DNA. The Infinium II process was implemented using Illumina Infinium Robot Control software and monitored using the Illumina Infinium laboratory information management system. The hardware platform consists of 4 Tecan EVO liquid-handling robots, 8 hybridization ovens, 3 Illumina BeadStation confocal scanners, and dual-processor workstations with access to >1 TB of disk array storage to monitor workflow and generate high-quality reduced data. Genotype calls are generated and subjected to quality control using Illumina BeadStudio v3.1 software.
As in the WHS, participant confidentiality in the WGHS is maintained throughout all aspects of each study. Investigators within the WGHS have no access to any direct patient identification information; these data are held confidentially by staff members of the WHS who are involved in patient contact and follow-up, but not in any data analysis or interpretation. Separate data files are kept for participant clinical covariate and endpoint data, plasma phenotyping data, and genomic data. Blood samples sent to the plasma phenotyping laboratory and to the genetic laboratories are labeled only with a sample ID number that cannot be tracked by laboratory personnel to any patient identification variables or to any clinical covariate data. All GWAS data included in the WGHS are maintained on a separate and fully protected computer system that is isolated and distinct from computing systems used for the parent study. A unique and fully distinct participant ID number is used in the WGHS, making direct linkage to the WHS for scientific investigators impossible. WGHS participants are provided a mechanism to withdraw consent for any reason and at any time. Should individual consent be withdrawn, any remaining stored blood samples, plasma, buffy coat, and DNA traceable to that participant are destroyed. These procedures have proven highly effective for the protection of patient confidentiality; with 300 000 person years of follow-up accrued to date, no breach of these procedures has ever led to the inadvertent unmasking of any participant’s identity.
statistical considerations and power for the wghs
Controversy exists within the genetic epidemiology community regarding the most appropriate methods to analyze data that derive from GWAS. As described in detail elsewhere, the complexity inherent in measuring up to 360 000 SNPs in each of more than 25 000 study participants within the WGHS will raise issues of data processing and quality control, as well as fundamental issues regarding correction for multiple hypothesis testing (3)(4)(29)(30).
With regard to data processing and quality control, before any genetic analyses are performed, all SNPs within the WGHS are evaluated for high call rates, and the percentage of missing SNPs is calculated for each individual. For SNPs with adequate data, deviations from Hardy-Weinberg equilibrium are evaluated to identify potential genotyping errors. We also compare the Illumina-based SNP data for each individual participant against a panel of approximately 70 common SNPs that have previously been ascertained in the WHS population using alternative genotyping technologies; this step is used as a secondary check to ensure accurate specimen labeling before any analyses. Finally, we use principal-component analysis to examine the data for any evidence of population stratification.
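Two of the quality-control checks named above, call rate and Hardy-Weinberg testing, can be illustrated with a minimal sketch. It assumes a 1-degree-of-freedom chi-square goodness-of-fit test, which is one common choice; the report does not say which test or which cutoffs the WGHS applies, and the function names and genotype counts are hypothetical.

```python
import math


def call_rate(calls):
    """Fraction of non-missing genotype calls; None marks a failed call."""
    if not calls:
        raise ValueError("no genotype calls supplied")
    return sum(c is not None for c in calls) / len(calls)


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of Hardy-Weinberg proportions.

    Takes observed genotype counts for a biallelic SNP and returns
    (chi-square statistic, p-value).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    if min(expected) == 0:
        # Monomorphic SNP: no deviation can be tested.
        return 0.0, 1.0
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    print(call_rate(["AA", None, "AB", "BB"]))
    print(hwe_chi_square(250, 500, 250))  # counts in exact HW proportions
    print(hwe_chi_square(500, 0, 500))    # complete heterozygote deficit
```

A SNP with a very small p-value here would be flagged for inspection rather than discarded automatically, since true biological causes can also produce departures from Hardy-Weinberg proportions.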
With regard to issues of multiple hypothesis testing, initial analyses within the WGHS will seek to define those relationships between individual SNPs, haplotypes, or genetic pathways that determine either incident clinical events or intermediate phenotypic traits. For all initial analyses, the WGHS will follow recent guidelines (29)(30) in which the P-value for genome-wide significance is predetermined to be at a level of 10⁻⁷ or smaller, a conservative approach consistent with Bonferroni correction. For this approach, sample size and power for the WGHS to reach the level of statistical significance are presented in Fig. 1 for clinical events such as myocardial infarction or diabetes and in Fig. 2 for continuous intermediate phenotypes such as HDL cholesterol. As shown, the large size of the WGHS provides more than adequate power at a genome-wide level of significance for clinical endpoints with 500 or more incident events, as well as extremely high power to detect genetic differences on an additive model for intermediate phenotypes for which the polymorphism of interest explains as little as 0.15% of the variance (see Fig. legends for detail). Compared to smaller GWAS already underway, the WGHS is well positioned to address gene-gene and gene-environment interactions across a wide range of clinical outcomes and environmental exposures.
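As a rough check on the genome-wide threshold, a Bonferroni bound can be computed from the SNP counts given in the genotyping section. The sketch assumes a family-wise error rate of 0.05, which the text does not state, and treats every SNP as one independent test; the function and constant names are illustrative.

```python
# SNP counts as given in the genotyping section of this report.
N_TAG_SNPS = 315_176       # Human HAP300 haplotype-tagging panel
N_FOCUSED_SNPS = 45_882    # added focused panel

FAMILYWISE_ALPHA = 0.05    # assumed; not stated in the report
GENOME_WIDE_CUTOFF = 1e-7  # threshold adopted by the WGHS


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that bounds the family-wise error at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


n_snps = N_TAG_SNPS + N_FOCUSED_SNPS
threshold = bonferroni_threshold(FAMILYWISE_ALPHA, n_snps)

if __name__ == "__main__":
    print(f"SNPs tested: {n_snps}")
    print(f"Bonferroni per-SNP threshold: {threshold:.3e}")
    print(f"Adopted cutoff is stricter: {GENOME_WIDE_CUTOFF < threshold}")
```

Under these assumptions the per-SNP Bonferroni level is slightly above 10⁻⁷, so the adopted cutoff is marginally stricter than a single-trait Bonferroni correction at 0.05. Testing many phenotypes would call for further adjustment, which this sketch does not attempt.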
relationship of the wghs to other gwas
The sample size and epidemiologic scope make the WGHS a unique genetic resource. In addition, multiple other GWAS are underway with similar goals. Thus, through collaborative networks of investigators, observations made in one GWAS can be tested in multiple other settings, a crucial issue for generalizability and estimating population-specific attributable risks. Presentation standards for genotype–phenotype association studies that will allow such replication have recently been enumerated by the National Cancer Institute/National Human Genome Research Institute Working Group on Replication in Association Studies and are structurally integrated into the WGHS (31). Examples of other ongoing GWAS with aims similar to those of the WGHS include the federally funded Framingham SNP Health Association Resource of approximately 9000 residents of Framingham, MA, as well as GWAS data being generated from unique populations such as the Atherosclerosis Risk in Communities study, the Jackson Heart Study, the Multi-Ethnic Study of Atherosclerosis, and the Cardiovascular Health Study. To encourage collaborative work, investigators within the WGHS have participated in the NIH-funded Genetic Association Information Network as well as in the Pharmacogenetics Research Network and the Pharmacogenomics and Risk of Cardiovascular Disease collaboration. The goals of the WGHS also parallel those of several international GWAS programs including the Wellcome Trust Case Control Consortium of 14 000 cases of 7 common diseases and 3000 shared controls, and the deCODE Genetics programs in Iceland (4)(9).
Grant/funding Support: The WGHS is a collaborative study supported by both federal and nonfederal entities. Primary support for the underlying WHS derives from investigator-initiated R01 grants from the National Heart, Lung, and Blood Institute and the National Cancer Institute (Bethesda, MD) (J.E.B., PI). Plasma phenotyping of all WGHS participants was supported primarily by investigator-initiated grants from the Donald W. Reynolds Foundation, Las Vegas, NV (P.M.R., PI). The process of DNA extraction was supported by investigator-initiated grants from the Doris Duke Charitable Foundation, New York, NY (P.M.R., PI) and the Fondation Leducq (Paris, France) (P.M.R., PI). The genome-wide scans using the Illumina-based Human-Hap300 BeadChip technology are being conducted by Amgen (Cambridge, MA). Primary financial support for the genome-wide scans is provided by Amgen, with additional financial support provided by the National Heart, Lung, and Blood Institute, the Donald W. Reynolds Foundation, and the Leducq Foundation.
Financial Disclosures: Alex Parker is an employee of Amgen, Inc.
Acknowledgments: The WGHS Investigators are indebted to Joseph P. Miletich for his foresight in understanding the role of genetics in common diseases affecting women, to the staff of the Women’s Health Study, and to the dedicated and conscientious women who are participating in this study.
¹ Nonstandard abbreviations: GWAS, genome-wide association study; WGHS, Women’s Genome Health Study; WHS, Women’s Health Study; SNP, single-nucleotide polymorphism.
© 2008 The American Association for Clinical Chemistry