History of DNA variations

Just as we all look different, our genomic sequences vary significantly. These differences in our genome are known as genetic variations, and are caused by various types of mechanisms, such as nucleotide substitutions, insertion/deletion of nucleotides, differences in the number of tandem-repeat sequences, differences in the copy number of genomic segments and the combinations of these changes. A subset of these genetic variations is defined as genetic polymorphisms when they are observed at a frequency of ⩾1% in a certain population.

The DNA polymorphisms are classified into six classes as shown in Table 1. The importance of restriction fragment length polymorphism (RFLP) in medical research was first suggested by Botstein et al.1 An RFLP arises when a base substitution at the site recognized by a certain restriction endonuclease changes the size of the DNA fragment produced by digestion with that endonuclease. Before the discovery of the polymerase chain reaction (PCR), RFLP was detected by Southern analysis and required a large amount of non-degraded genomic DNA. RFLP patterns show co-dominant Mendelian inheritance and can help distinguish between the parental alleles (whether the allele is of maternal or of paternal origin) of particular loci in our genome. Variable number of tandem repeats (VNTR) markers were first reported as one type of RFLP by Nakamura et al.,2 but, unlike ordinary RFLPs, they are highly polymorphic, with high heterozygosity in given populations because of the wide variety in the copy number of tandem-repeat DNA sequences. The isolation of VNTR markers extended the report of hypervariable ‘minisatellite’ sequences by Jeffreys et al.3 Minisatellite probes identified multiple highly variable loci simultaneously, whereas each VNTR marker (also called a single-locus minisatellite marker) identified a single highly polymorphic locus in our genome. Because of their highly polymorphic nature, both VNTRs and minisatellite markers were applied in forensic studies,4 and also used in clinics to monitor recipients of bone marrow transplants.5

Table 1 Types of genetic variation

VNTRs also contributed to studies of tumor suppressor genes, most of which are characterized by inactivation of both alleles in tumors (the two-hit mutation model). VNTR markers were very useful for detecting loss of a chromosome or a part of a chromosome (loss of heterozygosity), seen as the disappearance of a paternal or maternal allele on Southern analysis. The first systematic analysis of loss of all chromosomal arms (allelotype analysis) was reported in colorectal cancer by Vogelstein et al.6 using a set of VNTR markers. The results became one of the important clues for establishing the multi-step carcinogenesis model of colorectal cancers.7 Subsequent detailed analysis of the short arm of chromosome 17 led us to find the mutations of the p53 gene and prove it to be a tumor suppressor gene.8

First-generation genetic linkage maps of human chromosomes and genetic linkage analysis

Using these polymorphic DNA markers, White and his co-workers constructed genetic linkage maps of all human chromosomes and also made these DNA markers freely available to scientific communities (for an example, see Nakamura et al.9). The study by Botstein et al.1 indicated that genetic loci responsible for genetic diseases, even those for which no information about the disease mechanism was available, could be mapped by linkage analysis if an appropriate number of families with the disease was available. As polymorphic DNA markers allow us to distinguish a chromosome as of maternal or paternal origin, we can examine whether each polymorphic ‘marker’ allele co-segregates with the inheritance of a disease. In fact, Gusella et al.10 mapped the genetic locus for Huntington's disease to the short arm of chromosome 4. Following the increase of available DNA polymorphic markers in the late 1980s, the loci for many relatively common genetic diseases, such as cystic fibrosis, familial polyposis coli, neurofibromatosis type 1, multiple endocrine neoplasia type 1 and familial breast cancer, were mapped and their responsible genes identified a few years later. These studies proved that linkage analysis with DNA polymorphic markers using families with genetic diseases in any inheritance model was a powerful tool to discover responsible genes even without any knowledge about the underlying biological or biochemical mechanisms (or abnormalities). This approach is known as the ‘reverse genetics’ method.
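
The co-segregation test described above is conventionally summarized as a LOD score. The following is a minimal sketch, assuming the simplest textbook case of phase-known, fully informative meioses; the function name and the family counts are hypothetical and are not taken from any of the studies cited here.

```python
import math


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """LOD score for phase-known meioses at recombination fraction theta.

    Compares the likelihood of the data under linkage (theta < 0.5)
    with the likelihood under free recombination (theta = 0.5).
    """
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must be between 0 and meioses")
    non_recombinants = meioses - recombinants
    log_l_theta = (recombinants * math.log10(theta)
                   + non_recombinants * math.log10(1.0 - theta))
    log_l_null = meioses * math.log10(0.5)
    return log_l_theta - log_l_null


# Hypothetical data: 1 recombinant between marker and disease in 20 meioses.
for theta in (0.01, 0.05, 0.1, 0.2, 0.3, 0.5):
    print(f"theta={theta:.2f}  LOD={lod_score(1, 20, theta):.2f}")
```

A LOD score of 3 or more is the conventional threshold for declaring linkage. Real pedigrees with unknown phase, missing genotypes or incomplete penetrance require dedicated linkage software rather than this closed-form expression.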

Second-generation genetic linkage map of human chromosomes

It is widely accepted that the PCR method revolutionized the size and speed of DNA analysis. DNA polymorphic markers were also drastically influenced by the development of PCR systems. After 1990, scientists switched from RFLP analysis based on Southern technology to microsatellite analysis based on PCR technology. Microsatellite markers were first described by Weber and May,11 and are short segments of two or more base pairs repeated tandemly in tracts. Unlike VNTR loci, which number a few thousand in our genome, microsatellite loci are present at >100 000 sites that cover most of the genome. Hence, microsatellite markers have been successfully used for linkage analysis and population genetics because they can be easily adapted to high-throughput systems and require a very small amount of DNA. Even if DNA is degraded to some extent, microsatellite loci can be analyzed. Owing to their high levels of heterozygosity, they can distinguish parental alleles with high probability and are very informative in linkage analysis. Through the development of very detailed second-generation genetic linkage maps of human chromosomes, a large number of genetic diseases were mapped and, subsequently, their responsible genes were isolated. Rare mutations have so far been reported in 2000 Mendelian diseases.
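
The link between the number of alleles at a marker and its informativeness can be illustrated with the expected heterozygosity, H = 1 - sum(p_i^2), under Hardy-Weinberg proportions. This is a minimal sketch; the allele frequencies below are invented for illustration and do not describe any real marker.

```python
def expected_heterozygosity(allele_freqs):
    """Expected heterozygosity, 1 - sum(p_i^2), under Hardy-Weinberg proportions."""
    if any(p < 0 for p in allele_freqs):
        raise ValueError("allele frequencies must be non-negative")
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


# Hypothetical biallelic marker (e.g. an RFLP) with two equally common alleles.
print(expected_heterozygosity([0.5, 0.5]))
# Hypothetical microsatellite with ten equally common repeat-length alleles.
print(expected_heterozygosity([0.1] * 10))
```

A biallelic marker can never exceed a heterozygosity of 0.5, whereas a multi-allelic marker can approach 1, which is why a randomly chosen parent is far more likely to be informative at a microsatellite locus.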

One of the advantages of the approach using microsatellite markers is the possibility of applying a form of linkage analysis known as homozygosity mapping to recessive diseases with very low incidence, for which only a very small number of patients can be collected for study.12 In the case of patients born of consanguineous marriage (or marriage between individuals from an area where families have lived for many generations), only a few patients are sufficient to map a region including a gene responsible for a genetic disease. In fact, we applied this approach and determined the loci for Fukuyama muscular dystrophy and gelatinous drop-like corneal dystrophy, and isolated the responsible genes.13, 14
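
The idea behind homozygosity mapping can be sketched as a search for stretches of consecutive markers at which every affected individual is homozygous for the same allele. The sketch below assumes that all patients inherited the same founder haplotype and that markers are listed in chromosomal order; the function name, the threshold and the genotypes are hypothetical.

```python
def shared_homozygous_runs(genotypes, min_markers=3):
    """Return (start, end) marker index pairs of shared homozygous runs.

    genotypes: one list per affected individual, each a list of
    (allele1, allele2) tuples for the same ordered markers.
    """
    n_markers = len(genotypes[0])

    def shared_homozygous(i):
        # All patients must carry two copies of the same allele at marker i.
        allele = genotypes[0][i][0]
        return all(g[i][0] == g[i][1] == allele for g in genotypes)

    runs, start = [], None
    for i in range(n_markers):
        if shared_homozygous(i):
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_markers:
                runs.append((start, i - 1))
            start = None
    if start is not None and n_markers - start >= min_markers:
        runs.append((start, n_markers - 1))
    return runs


# Hypothetical microsatellite genotypes (repeat counts) for three patients.
patients = [
    [(12, 14), (15, 15), (9, 9), (20, 20), (11, 11), (7, 8), (13, 13), (10, 12)],
    [(12, 12), (15, 15), (9, 9), (20, 20), (11, 11), (7, 7), (13, 14), (10, 10)],
    [(13, 14), (15, 16), (9, 9), (20, 20), (11, 11), (8, 8), (13, 13), (12, 12)],
]
print(shared_homozygous_runs(patients))
```

In practice the candidate interval would then be evaluated statistically, since short homozygous stretches also occur by chance.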

Another advantage of microsatellite markers is their application in sib-pair analysis or the transmission disequilibrium test, both of which are useful for searching for genetic loci associated with common diseases. Although these types of genetic approaches were used widely, they were not very successful in identifying genes or loci associated with common diseases because a huge number of siblings or families were required to detect genetic factors with very modest effects on disease risk (a limitation of statistical power). One example of the success of this type of approach was shown by Onouchi et al.15 using 78 Japanese sibling pairs with Kawasaki disease, the leading cause of acquired heart disease among children in developed countries; we showed that an SNP in the ITPKC gene increased the risk of the disease and might be associated with the response to intravenous immunoglobulin therapy.
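
For a biallelic marker, the transmission disequilibrium test reduces to a McNemar-type statistic on heterozygous parents: if b parents transmit the allele of interest to an affected child and c do not, the statistic is (b - c)^2 / (b + c), compared with a chi-square distribution with one degree of freedom. The sketch below uses hypothetical counts; the function names are illustrative only.

```python
import math


def tdt_statistic(transmitted: int, untransmitted: int) -> float:
    """McNemar-type TDT statistic from heterozygous-parent transmissions."""
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (transmitted - untransmitted) ** 2 / total


def chi2_1df_p_value(statistic: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical counts: 60 transmissions versus 35 non-transmissions.
stat = tdt_statistic(60, 35)
print(f"TDT chi-square = {stat:.2f}, p = {chi2_1df_p_value(stat):.4f}")
```

The statistic grows only with the imbalance between b and c, so a variant with a very modest effect needs a large number of informative families before the imbalance becomes detectable, which is the power limitation noted above.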

HapMap project and GWAS (genome-wide association study) using SNPs (single-nucleotide polymorphisms)

Around 2000, scientists planned to develop millions of SNP markers covering an entire genome and construct a high-density SNP (also haplotype) map. SNPs are the genetic variations that are most commonly present in our genome. They are expected to be present once in every 500–1000 bp on average, but their true density is likely to be higher (say, one in every 300 bp). On the basis of the ‘common variant–common disease’ association hypothesis, SNPs were considered to be very useful as a tool for population genetics, particularly for identification of genes (or genetic variations) that confer susceptibility to various diseases. The Japanese government announced a 5-year ‘Japanese Millennium Project’ in 2000 (http://www.nature.com/nbt/journal/v18/n2/full/nbt0200_142.html), which included (1) construction of an SNP database for the Japanese population and (2) establishment of a high-throughput SNP typing system for the genome-wide association study, as one of the five arms of the project. We resequenced 24 Japanese individuals in a total of about 154-Mb genomic segments, mainly in regions containing genes, and found nearly 200 000 genetic variations, including 174 269 SNPs (about one genetic variation in every 800 bp). In addition, we established a high-throughput SNP genotyping system by a combination of multiplex PCRs (simultaneous amplification of 96 genomic segments) with the Invader assay.16 Using this platform we constructed the JSNP database, including the information of nearly 190 000 gene-based SNP loci as well as the allelic frequencies of 768 Japanese individuals at nearly 79 000 SNP loci in 2002. This SNP database held the largest body of allelic-frequency information based on a large number of individuals (http://snp.ims.u-tokyo.ac.jp/index_ja.html) before the International HapMap database was constructed.
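
The variant density quoted above follows directly from the resequencing totals. As a quick check, treating ‘about 154 Mb’ as 154 000 000 bp and ‘nearly 200 000’ as exactly 200 000 (both roundings are assumptions made only for this calculation):

```python
def bp_per_variant(total_bp: int, n_variants: int) -> float:
    """Average spacing between variants, in base pairs."""
    return total_bp / n_variants


RESEQUENCED_BP = 154_000_000   # "about 154 Mb", rounded
ALL_VARIATIONS = 200_000       # "nearly 200 000", rounded up
SNPS_ONLY = 174_269            # SNP count as stated in the text

print(f"all variations: one per {bp_per_variant(RESEQUENCED_BP, ALL_VARIATIONS):.0f} bp")
print(f"SNPs only:      one per {bp_per_variant(RESEQUENCED_BP, SNPS_ONLY):.0f} bp")
```

Both figures fall within the 500–1000 bp range expected for SNP spacing; a sample of 24 individuals will miss rarer variants, so the true density is presumably higher.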

As multiple large-scale SNP genotyping platforms were being developed around 2003, the formation of an international consortium to build an SNP database for three major populations was discussed, and the International HapMap Project, involving six countries (Canada, China, Japan, Nigeria, UK and USA), was organized in 2003. The aims of the International HapMap Project were: (1) determination of the common patterns of one million or more DNA sequence variations in the human genome using DNA samples from populations with ancestry from parts of Africa, Asia and Europe, (2) construction of the linkage disequilibrium (LD) map for all chromosomes and (3) making such information freely available to the scientific community. The HapMap results are expected to allow the discovery of sequence variants that affect common diseases, facilitate development of diagnostic tools and enhance our ability to choose targets for therapeutic intervention (The International HapMap Consortium, 2003).17
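
An LD map rests on pairwise measures of association between alleles at two loci. The sketch below computes the standard coefficients D, |D'| and r^2 from a haplotype frequency and the two allele frequencies; the function name and the input frequencies are hypothetical, and this is not a description of the HapMap analysis pipeline.

```python
def ld_measures(p_ab: float, p_a: float, p_b: float):
    """Return (D, |D'|, r^2) for alleles A and B at two biallelic loci.

    p_ab: frequency of the haplotype carrying both A and B
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Hypothetical pair in complete LD but with different allele frequencies:
# allele A (frequency 0.2) is always found on haplotypes carrying B (0.5).
d, d_prime, r2 = ld_measures(p_ab=0.2, p_a=0.2, p_b=0.5)
print(f"D={d:.3f}  |D'|={d_prime:.2f}  r2={r2:.2f}")
```

A high r^2 between two SNPs means that genotyping one largely predicts the other, which is what allows a subset of ‘tag’ SNPs to represent the common variation in a region.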

After the very extensive efforts of the participating groups, the consortium constructed a database consisting of more than one million SNPs in 2005 (The International HapMap Consortium)18 and subsequently reported an extended database in 2007 (The International HapMap Consortium, 2007).19 In the phase 1 project, we (the SNP Research Center, RIKEN) produced nearly 25% of the SNP typing data. Although the majority of genetic variations were commonly shared among the three major populations, a small subset of variations was detected in only one particular population. Among them, an SNP in the ABCC11 gene that was uniquely found in the Asian population was later shown to be a determinant of the human earwax type,20 suggesting that SNPs uniquely observed in a certain population, if they affect the quality or quantity of the gene product, might be genetic determinants of the traits that characterize that population.

There was much criticism of and skepticism about the ‘common variant–common disease’ approach for identifying susceptibility genes for common diseases at the beginning of the International HapMap project, but many published papers have shown the usefulness of the genome-wide association study (GWAS) in uncovering genetic factors associated with various diseases (not necessarily common diseases). Although not well recognized, in 2001 we began genome-wide association studies using nearly 90 000 gene-based SNPs that gave good coverage of the regions containing genes. Through this approach, we identified susceptibility genes for myocardial infarction.21 In addition, we have reported many candidate genes possibly associated with susceptibility to diseases, as summarized in Table 2.

Table 2 Susceptibility genes for various diseases reported from our centers (Center for Genomic Medicine, RIKEN and Human Genome Center, Institute of Medical Science, The University of Tokyo)

Internationally, systematic GWAS was started in 2006 based on the accumulation of a large set of SNP information through the International HapMap project, as well as the development of cheap, commercially available and accurate high-throughput SNP analysis platforms. I have summarized representative examples of the GWAS approach for various common diseases in Table 3.

Table 3 Representative results of GWAS analysis of various diseases
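
At a single SNP, the core of a case-control GWAS is a test of whether allele counts differ between cases and controls, repeated across all genotyped SNPs with correction for multiple testing. The sketch below shows an allelic chi-square test with an odds ratio and a Bonferroni threshold; the counts, the number of SNPs and the function names are hypothetical.

```python
import math


def allelic_association(case_risk, case_other, ctrl_risk, ctrl_other):
    """Return (odds ratio, chi-square, p-value) for a 2x2 allele-count table."""
    a, b, c, d = case_risk, case_other, ctrl_risk, ctrl_other
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    odds_ratio = (a * d) / (b * c)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square, 1 degree of freedom
    return odds_ratio, chi2, p_value


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise error at alpha."""
    return alpha / n_tests


# Hypothetical SNP: 1000 cases and 1000 controls, i.e. 2000 alleles per group.
odds_ratio, chi2, p = allelic_association(700, 1300, 600, 1400)
threshold = bonferroni_threshold(0.05, 500_000)  # assumed 500 000 SNPs tested
print(f"OR={odds_ratio:.2f}  chi2={chi2:.2f}  p={p:.1e}  threshold={threshold:.0e}")
print("genome-wide significant" if p < threshold else "not genome-wide significant")
```

With these invented counts an odds ratio of about 1.26 yields a p-value far above the corrected threshold, which illustrates why detecting variants of modest effect requires cohorts of many thousands.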

Copy-number variations (CNVs) and diseases

Copy-number variation is defined as a form of genomic structural variation and refers to differences in the number of copies of a particular genomic region. We normally have two copies of each autosomal region, one per chromosome, but because of deletion or duplication (or multiplication) of particular genomic regions we observe a decrease or increase in the copy number. The involvement of copy-number alterations in human phenotypes was first reported as ‘genomic disorders’, which are often caused by de novo structural alterations,98 and so far dozens of genomic disorders have been reported,99 including inherited forms of CNVs that underlie Mendelian diseases such as autosomal-dominant leukodystrophy100 and hereditary pancreatitis (triplication of the trypsinogen locus).101 Through genome-wide screening of CNVs, a large body of CNV information has been accumulated.102, 103, 104, 105, 106 CNVs are considered to influence the quantity of gene products by changing the copy number of certain genes, for example when a significant segment of a gene is deleted, or when a functional unit (including the regulatory region) of a particular gene is duplicated or further multiplied.
In fact, systematic analysis indicated an association between CNVs (and also SNPs) and gene expression variations, which might be a model of complex phenotypes.107 The associations of some CNVs with common diseases have also been reported, including CNV of CCL3L1 and HIV-1/AIDS susceptibility,108 CNV of Fcgr3 and glomerulonephritis,109 CNVs of complement component C4 and systemic lupus erythematosus (SLE),110 and CNV of FCGR3B and systemic autoimmunity.111 However, we suspect that the association of CNVs of complement component C4 with SLE might be a false positive arising from linkage disequilibrium (LD) between the C4 CNVs and the SNP in TNXB, for which we found a much stronger association with SLE.49 As there are considerable ethnic differences in the genes conferring susceptibility to various diseases, particularly those related to autoimmune disease, we must be very careful in interpreting the data, but our results at least indicate that associations between CNVs and diseases might be captured by the use of SNPs, for which very high-throughput and accurate analysis systems are widely available. It is certainly better to measure CNVs directly, but as methods to accurately measure copy numbers across an entire genome are not yet established, in particular for regions present in more than two copies or for gene families with very high homology, information on LD between SNPs and CNVs is essential.

Genes associated with rare genetic diseases and common diseases

It should be noted that common genetic variations detected in genes responsible for rare genetic diseases (often severe and early-onset) were later found to be associated with common diseases with similar phenotypes. One typical example is shown in the studies of diabetes mellitus. Maturity-onset diabetes of the young (MODY, OMIM 606391) is an autosomal dominant form of diabetes mellitus, which is characterized by early onset of hyperglycemia and dysfunction of β-cells in the pancreas and is phenotypically very similar to type 2 diabetes. Six genes have so far been identified as the causes of MODY, namely hepatocyte nuclear factor 4α (HNF4A; MODY1), glucokinase (GCK; MODY2), transcription factor 1 (TCF1/HNF1A; MODY3), insulin promoter factor 1 (IPF1/PDX1; MODY4), transcription factor 2 (TCF2/HNF1B; MODY5) and neurogenic differentiation 1 (NEUROD1; MODY6).112 The common variants within these MODY genes have been extensively investigated for their association with type 2 diabetes. The HNF4A gene has two tissue-specific transcription initiation sites, used in the liver and pancreas. An SNP, rs1884613, and several other variations in the pancreas-specific promoter region were shown to be associated with susceptibility to type 2 diabetes.113, 114, 115 Similarly, genetic variations within the other MODY genes were also shown to have modest effects in conferring susceptibility to type 2 diabetes; an SNP in TCF2/HNF1B was identified as a strong candidate for type 2 diabetes through GWASs using populations of European descent.84, 85 The results of multiple European GWAS analyses for type 2 diabetes also indicated that previously reported SNPs in KCNJ11 (rs5219; E23K)116 and PPARG (rs1801282; P12A)117 were possible susceptibility variants for type 2 diabetes;118, 119, 120 these genes were known to be causative genes for monogenic forms of diabetes mellitus.
Another example is WFS1, a gene responsible for Wolfram syndrome, which encodes wolframin, a membrane glycoprotein important for maintenance of calcium homeostasis in the endoplasmic reticulum.121 Wolfram syndrome is characterized by diabetes insipidus, juvenile-onset non-autoimmune diabetes mellitus, optic atrophy and deafness (DIDMOAD, OMIM 222300). An association study for type 2 diabetes found two SNPs in WFS1 (rs10010131, rs6446482) to be associated with a risk for type 2 diabetes.122

Recently we also found a similar and interesting result showing the link between rare genetic forms and a relatively common form of IPF (idiopathic pulmonary fibrosis), in which mutations and common variations in the TERT gene, which encodes the reverse transcriptase component of telomerase, were involved. Two groups independently demonstrated that mutations in TERT were responsible for familial IPF in a Caucasian population. One report was based on the observation that many affected individuals in DKC (dyskeratosis congenita) families also had IPF.123 A study by Tsakiri et al.124 independently revealed that missense and frameshift mutations in TERT co-segregated with IPF in two families. Through a genome-wide association study using patients with sporadic IPF, we found a significant association of an SNP in intron 2 of the TERT gene (rs2736100) with IPF. These results indicated that the magnitude of the effect of a genetic variation on the quality and/or quantity of the gene product determines the level of dysfunction; if the effect is severe it becomes a cause of a rare form of genetic disease, and if the effect is modest it becomes a risk factor for a common disease.

Genetic variations in pharmacogenetics and pharmacogenomics

Although the definitions of pharmacogenetics and pharmacogenomics are still unclear, it is certain that a subset of genetic variations in genes encoding drug-metabolizing enzymes, drug transporters, drug receptors, drug-target molecules and downstream molecules involved in their signaling pathways influences the effectiveness of and adverse reactions to drugs.

Adverse drug reactions (ADRs) can be classified into two groups: those that can be explained by the mode of therapeutic drug action, and those for which there is little or no information about the underlying mechanisms. Typical examples of the former are leucocytopenia caused by cytotoxic anti-cancer drugs, and brain or intestinal bleeding caused by warfarin, an oral anticoagulant. Representative examples of the latter are toxic epidermal necrolysis and drug-induced liver injury caused by various drugs. Millions of patients are suspected to suffer from severe ADRs worldwide every year, and a subset of patients even lose their lives. Although many factors are involved in the etiology of ADRs, genetic factors (genetic variations) are implicated as being among the critical ones. Hence, identification of genetic factors that increase the risk of ADRs is expected to improve the medical management of patients at high risk for ADRs, and might also contribute to a significant reduction of unnecessary medical costs (see an example for warfarin at http://www.aei-brookings.org/publications/abstract.php?pid=1127).

In the past decade, extensive research has been performed to uncover the genetic factors underlying ADRs and has successfully identified genetic variations that increase the risks of ADRs. Table 4 summarizes the genetic variations that have been validated and/or are listed in the ‘Table of Valid Genomic Biomarkers in the Context of Approved Drug Labels’ in the US Food and Drug Administration homepage (http://www.fda.gov/cder/genomics/genomic_biomarkers_table.htm). The genes listed in Table 4 are mainly categorized into three groups: group 1, drug-metabolizing enzymes and drug transporters related to pharmacokinetics; group 2, proteins related to pharmacodynamics; and group 3, human leukocyte antigens (HLAs). It is obvious that poor clearance of drugs from our body increases drug concentration up to a toxic level and results in the appearance of ADRs (group 1); a typical example is the case of anti-cancer drugs: genetic variants in the UDP glucuronosyltransferase 1A1 (UGT1A1) gene increase the risk of myelotoxicity caused by irinotecan. An example of group 2 is genetic variations in VKORC1, which is involved in vitamin K recycling and influences the bleeding risk of warfarin. HLAs have been implicated in many types of ADRs, and the HLA-B*1502 allele was shown to significantly increase the risk of serious dermatologic reactions to carbamazepine, such as toxic epidermal necrolysis (TEN) and Stevens–Johnson syndrome (SJS).129 Similarly, we found a strong association of the HLA-B*3505 allele with nevirapine-induced adverse drug reactions in the skin of Thai patients.56 Although the molecular mechanisms of HLA involvement in dermatologic ADRs are unclear, an interaction between a certain HLA molecule and a drug (or its metabolite) may trigger a cellular immune response.

Table 4 Valid genes associated with drug effectiveness or ADRs

In addition, some genetic variations that influence the effectiveness of drugs have also been reported. A good example is the relation between genetic variations of CYP2D6 and the clinical outcome of breast cancer patients subjected to tamoxifen treatment. The clinical outcome of tamoxifen treatment was suggested to be influenced by the activity (genetic variations) of the cytochrome P450 2D6 (CYP2D6) enzyme because tamoxifen is metabolized by CYP2D6 in our body to its active anti-estrogenic metabolites, 4-hydroxytamoxifen and endoxifen. Higher incidences of recurrence after surgical treatment in breast cancer patients with null- or low-activity genotypes were shown in Caucasian and Japanese populations.135, 136 As patients need to take this drug over a long duration, genetic diagnosis before tamoxifen treatment may contribute to the improvement of patients’ outcomes and the reduction of unnecessary medical cost. Pharmacogenetic information should be useful for predicting the effectiveness, risk of adverse reactions and appropriate dose of drugs for individual patients, and may contribute to the establishment of ‘personalized medicine’, the concept of ‘an appropriate dose of the right drug for the right patient’.

It is notable that, although a large amount of information is being accumulated in pharmacogenetics and pharmacogenomics, the incidence of severe ADRs is very low. Hence, it is often difficult to verify the genetic variations that can predict the risk of deleterious ADRs. As collection of severe ADR cases is time-consuming, a global effort to collect cases and analyze genetic factors is eagerly awaited to reduce ADRs that lead to poor quality of life and unnecessary medical costs.

Conclusion

In this review, I have briefly described the history of human genetic variations, as well as their significance in life science, particularly in medical genetics and pharmacogenetics. One of the major goals of life science is to improve our quality of life. Through a better understanding of the molecular mechanisms causing diseases, we can achieve our goals of preventing diseases, avoiding their progression and curing them. Genetic variations have played and will continue to play very important roles in better medical and health care. However, we need to make extensive efforts to ensure that genetic variations are not used for genetic discrimination. Although we are all different, we should have equal rights and should respect each other's differences.