Introduction and Orientation

Welcome to Genome Year!

Most of your cells contain two complete copies of the human genome, each with 3.1 billion nucleotides of DNA. Genome Year will take you on a tour of 7.7-8.9 million nucleotides per day for a year, starting with the largest chromosome and ending with the smallest. I will highlight something interesting each day, trying to tell stories about the most famous genes and variants while also giving examples of the frontiers: poorly characterized genes and noncoding features, with a bias towards what I think is most interesting and beautiful. (To see the full list of days on one page, go to the archive.)
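As a rough check on that daily pace, the arithmetic can be sketched in a few lines of Python. The even split over 365 days is an assumption made here for illustration only; the actual days range from 7.7 to 8.9 million nucleotides.

```python
# Rough arithmetic: nucleotides per day needed to cover one genome copy in a year.
# Assumption for illustration: an even split over 365 days.
GENOME_NUCLEOTIDES = 3_100_000_000  # one copy of the genome, as quoted in the text
DAYS_PER_YEAR = 365

per_day = GENOME_NUCLEOTIDES / DAYS_PER_YEAR
print(f"{per_day / 1e6:.1f} million nucleotides per day")  # prints "8.5 million nucleotides per day"
```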

Before getting started, let’s take a peek at some of the things you’ll see on Day 1 (1p36.33-1p36.23) to understand how the genome is being conveyed.

Cytogenetic band naming


Long before the genome was sequenced, physical positions along the chromosomes were defined using 863 light and dark stripes (bands) that appear under the microscope when chromosomes are stained – on average, 1-2 of these are traversed per day. This nomenclature, rather than linear coordinates, is still widely used by clinical geneticists, and is more concise to use in blog titles.



The reference genome is displayed using letters representing the four nucleotides of DNA. Besides A, C, G, and T, letters like M, Y, and R represent single-nucleotide polymorphisms using the IUPAC degeneracy codes.
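For readers who want to decode these letters themselves, here is a minimal sketch of the standard IUPAC nucleotide codes as a Python lookup table. The helper names `possible_bases` and `is_known_snp` are hypothetical, made up for this illustration, and are not part of how the Genome Year pages are built.

```python
# Standard IUPAC nucleotide codes: each letter stands for a set of possible bases.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def possible_bases(letter: str) -> set[str]:
    """Return the bases a letter can stand for (hypothetical helper)."""
    return set(IUPAC_CODES[letter.upper()])

def is_known_snp(letter: str) -> bool:
    """True for any valid code other than A, C, G, T, or N (hypothetical helper)."""
    code = letter.upper()
    return code in IUPAC_CODES and code not in {"A", "C", "G", "T", "N"}

for code in "MYR":
    print(code, sorted(possible_bases(code)))
```

So M marks a position where A or C has been seen, Y marks C or T, and R marks A or G.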



Lowercase letters are regions that have been marked as low-complexity and repetitive.
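Because the masking is carried by letter case alone, it is easy to measure. The sketch below counts what fraction of a sequence is lowercase; the function name and the toy sequence are made up for illustration.

```python
def masked_fraction(sequence: str) -> float:
    """Fraction of letters that are lowercase, i.e. marked as repetitive or low-complexity."""
    letters = [ch for ch in sequence if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.islower() for ch in letters) / len(letters)

# Toy example: four uppercase letters followed by four lowercase (masked) letters.
print(masked_fraction("ACGTacgt"))
```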

Foreground Colors and Definition of Genes


The concept of a gene as a unit of inheritance was developed long before DNA was characterized. Mapping this centuries-old concept to actual coordinates in the genome is a challenge. Even fifteen years after the completion of the reference genome, there is disagreement over where the genes and other functional units are, and what to call these. Genome Year uses a definition from the current version of the ENSEMBL database.

A basic hierarchy guides the coloring of letters by gene structure:

  • The basic color of letters with no additional gene annotation is gray. These are intergenic (between genes). We have abundant evidence from evolution, biochemistry, and genetics that some fraction of these intergenic letters are important because they play roles in regulating gene transcription. Some of these intergenic regions may be transcribed at low levels, but have not been officially designated as genes.
  • Genes are regions that are transcribed into RNA consistently enough to be designated as genes, and are colored non-gray. This is a “transcriptional unit” definition of a gene. These RNAs may then go on to be processed or not, and ultimately may have coding or noncoding functions.
  • Many RNAs go through a complicated series of processing steps before they become mature. The most important aspect is splicing out of introns, leaving exons behind. In Genome Year, introns are colored blue while exons are colored black or white.
  • The subset of RNAs that have instructions for making proteins are called messenger RNAs (mRNAs). Only the middle part of the final mRNA has the instructions for making proteins; there are flanking regions at the beginning and end called untranslated regions. DNA encoding these untranslated regions of the mRNA is colored black, while DNA encoding regions that will be translated into amino acids has a white foreground with a colored background.
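The hierarchy above can be read as a short decision chain. The following is only a sketch of that logic: the `Position` class and its three flags are a hypothetical data model invented for illustration, not the code used to build these pages.

```python
from dataclasses import dataclass

@dataclass
class Position:
    """Hypothetical per-letter annotation flags (illustration only)."""
    in_gene: bool = False    # inside an annotated transcribed region
    in_exon: bool = False    # retained after splicing
    in_coding: bool = False  # translated into amino acids

def foreground_color(pos: Position) -> str:
    """Pick the letter color by walking the hierarchy from outermost to innermost."""
    if not pos.in_gene:
        return "gray"    # intergenic
    if not pos.in_exon:
        return "blue"    # intron
    if not pos.in_coding:
        return "black"   # exon that is not translated, such as an untranslated region
    return "white"       # coding sequence, shown on a colored background

print(foreground_color(Position(in_gene=True, in_exon=True)))  # prints "black"
```

Each test only matters if the one before it passed, which is why the order of the checks mirrors the order of the bullets.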

Background Colors


Background colors are assigned to protein-coding positions based on the amino acid that they encode.

Gene Boundaries


Genes are flanked by their name in red, along with an arrow indicating whether their coding strand is in the forward (right) or reverse (left) direction relative to the chromosomal coordinates.

How similar is this reference genome to my genome?

Very similar. There will be two kinds of differences between your genome and the reference genome: single-letter differences (SNPs) and big chunks that you have inserted or deleted (structural variants.)

According to the latest 1000 Genomes Project analysis, your genome has about 4-5 million SNPs (approaching a day’s worth of sequence), but most of those will be at the places already marked in this reference as known SNPs (represented by some letter besides A, C, G, T, N.)

Structural variants have a more profound effect: you likely have 2,100 – 2,500 of these (several per day), and together they amount to ~20 million bases affected (2-3 days’ worth.)

So when you are looking at any given sequence, especially the bases that are A, C, G, or T, it is extremely likely to be the same sequence in your genome too. There is a <1% chance that one of your two genome copies has that entire chunk of DNA deleted or duplicated due to a structural variant – but in that case, the copy from your other parent is 99% likely to be normal, so you have 1 or 3 copies instead of 2.
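To see where figures like "<1%" come from, here is a back-of-the-envelope sketch using the round numbers quoted above. Dividing by the length of one genome copy, and treating every letter as equally likely to be affected, are simplifying assumptions made for this illustration.

```python
# Back-of-the-envelope fractions using the round numbers quoted in the text.
GENOME_NUCLEOTIDES = 3_100_000_000           # one copy of the genome
SNPS_LOW, SNPS_HIGH = 4_000_000, 5_000_000   # SNPs per person
SV_BASES = 20_000_000                        # bases affected by structural variants

snp_fraction_low = SNPS_LOW / GENOME_NUCLEOTIDES
snp_fraction_high = SNPS_HIGH / GENOME_NUCLEOTIDES
sv_fraction = SV_BASES / GENOME_NUCLEOTIDES

print(f"SNP positions: {snp_fraction_low:.2%} to {snp_fraction_high:.2%} of letters")
print(f"Bases in structural variants: {sv_fraction:.2%} of letters")
```

Under these assumptions, structural variants touch roughly 0.65% of letters, which is consistent with the "<1%" figure.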

Methods / Technical Details

Repeat-masked genome sequence with dbSNP variants added was obtained from UCSC hg38 with the dbSNP 142 mask.  Genes and exon boundaries were obtained from the ENSEMBL 83 gtf. Coding sequences were obtained from CCDS 18 and colored according to Jmol “amino” color scheme. For simplicity, for protein-coding genes, only exons and CCDS from the longest ENSEMBL transcript were displayed.


Luke Ward,


Filed under Uncategorized

Day 1 (1pter-1p36.23): Vastness of the genome; gene structure


Each day of Genome Year is a lot of text: at 7.7-8.9 million letters, each one is longer than the King James Bible (4.4 million letters including spaces) and the complete works of Shakespeare (5.6 million letters including spaces). Compare Day 1 to Project Gutenberg’s complete Shakespeare in the same format. The pages are large enough to take a bit of time to load, but not so large (6 MB and 11 MB) that they will break your browser or bankrupt your data plan – about as much data as a vacation photo album on Facebook.



Click here to jump to a gene with a typical structure, AJAP1. The gene is in the forward direction – that is, it starts at the top of your screen. (Half of all genes are in the reverse direction.) AJAP1 starts with a string of black letters, which are the 5′ untranslated region. Then come the instructions to start making the AJAP1 protein: an ATG start codon (which is AUG in the resulting mRNA). Almost all coding sequences start with ATG, which encodes methionine as the first amino acid in the protein. After several codons, a long intron starts in light blue. The coding region of the gene switches between blue introns and multicolored coding sequences before ending with a stop codon, in this case the opal stop codon TGA. The gene ends with a long 3′ untranslated region.
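The landmarks described above (an ATG start, a named stop codon, and T becoming U in the mRNA) can be checked with a small sketch. The function names are hypothetical, and the toy sequence is made up for illustration; it is not the real AJAP1 coding sequence.

```python
# The three stop codons and their traditional nicknames.
STOP_CODONS = {"TAA": "ochre", "TAG": "amber", "TGA": "opal"}

def split_codons(cds: str) -> list[str]:
    """Split a coding sequence into complete three-letter codons."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore any trailing partial codon
    return [cds[i:i + 3] for i in range(0, usable, 3)]

def describe_cds(cds: str) -> dict:
    """Report the start codon check, the stop codon nickname, and the mRNA spelling."""
    codons = split_codons(cds)
    return {
        "starts_with_atg": bool(codons) and codons[0] == "ATG",
        "stop_name": STOP_CODONS.get(codons[-1]) if codons else None,
        "mrna": cds.upper().replace("T", "U"),  # mRNA uses U where DNA has T
    }

toy_cds = "ATGGCTTGA"  # made-up example: start codon, one codon, opal stop
info = describe_cds(toy_cds)
print(info)
```

For a gene in the reverse direction, the sequence would first need to be reverse-complemented before running the same checks.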


Day 2 (1p36.23-1p36.13): mTOR, a gateway to cell growth


Day 2 has 125 protein-coding genes, including the gene encoding mTOR (the mechanistic target of rapamycin). mTOR commemorates two places in its name. Rapamycin, an antifungal drug that is also used to prevent transplant rejection, was named after Rapa Nui (Easter Island) where it was discovered. And according to Joseph Heitman, who worked to discover the TOR genes in yeast:

TOR also means door or gateway in German, and the TOR protein serves as a gateway to cell growth and proliferation. This name also commemorates the city in which TOR was discovered, as Basel is an older European city once ringed by a protective wall with large decorative gates, including one still standing, named the Spalentor.

"Basel - Spalentor" by Taxiarchos228 - Own work. Licensed under FAL via Commons -


mTOR is a kinase that regulates cell growth and is important in many diseases. Mutations that activate mTOR can lead to cancer. Therefore it is an attractive drug target.

The mTOR gene is truly ancient – it can be found in species as distant as rice. This suggests that it is as old as the common ancestor of eukaryotes (> 1.6 billion years).

Click here to jump to the location of an S2215F mutation in mTOR (flashing), which has been found in multiple skin cancers. Note that the mutation isn’t marked as a SNP in the reference sequence – it’s listed as just the reference base (A). That is because S2215F is found in tumors but not in normal genomes – it is a somatic mutation that happens in a subset of cells during cancer, but has never been observed as an inherited mutation.


Day 3 (1p36.13-1p36.11): the Rhesus blood group genes

"RHESUS AB-", Cyril Margouillat (metal sculpture)


Day 3 includes 105 protein-coding genes. Two, RHD and RHCE, define the Rh (Rhesus) blood groups, so-called because rhesus monkeys were instrumental to their discovery. The Rh proteins’ normal role is to transport ammonia across the surface of blood cells.

Like many genes that are similar and next to each other in the genome, RHD and RHCE arose from an ancient duplication of a single gene. RHD has been deleted in about 40% of European-ancestry chromosomes, but the deletion is rare elsewhere, indicating that the deletion was relatively recent in human history. If a mother with the deletion (Rh-) carries a Rh+ baby, her immune system can attack the baby’s blood cells.
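To get a feel for what a 40% deletion frequency means for people rather than chromosomes, here is a small sketch. It assumes random mating (Hardy-Weinberg proportions) and that a person is Rh- only when both copies of RHD are deleted; both are simplifying assumptions added for this illustration.

```python
def genotype_frequencies(q: float) -> dict[str, float]:
    """Expected genotype frequencies for a deletion at chromosome frequency q.

    Assumes Hardy-Weinberg proportions (random mating), a simplification.
    """
    p = 1.0 - q
    return {
        "no_copies_deleted": p * p,
        "one_copy_deleted": 2 * p * q,
        "both_copies_deleted": q * q,
    }

freqs = genotype_frequencies(0.40)  # about 40% of European-ancestry chromosomes
for genotype, freq in freqs.items():
    print(f"{genotype}: {freq:.0%}")
```

Under these assumptions, about 16% of people would carry the deletion on both chromosomes.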

The RHD and RHCE genes can be found in other animals as distant as frogs, suggesting they arose in the common ancestor of tetrapods (390 million years ago.)

Click here to see RHD (followed by RHCE) in the context of Day 3.


Day 4 (1p36.11-1p34.3): a cluster of snoRNAs (RNAs that modify other RNAs)


Day 4 has 134 protein-coding genes, more than any other day on the p arm of Chromosome 1. However, there are also some important non-coding genes here: a cluster of genes that encode snoRNAs (small nucleolar RNAs). The job of these RNAs is to help the nucleolus make chemical modifications to other RNA molecules.

One of the genes here is SNORA73A. The SNORA73A RNA goes to the nucleolus to chemically modify ribosomal RNAs, which then become part of the ribosome, the cell’s protein factory.

SNORA73 relatives are found across vertebrates, even the lamprey, implying that the gene’s common ancestor is at least 530 million years old.

Click here to see your SNORA73A gene. Note that, like many other short RNA genes, it is embedded within the intron of a longer gene.


Day 5 (1p34.3-1p34.2): Argonaute

“Argonauta argo Merculiano” by Comingio Merculiano (1845–1915) in Jatta Giuseppe – I Cefalopodi viventi nel Golfo di Napoli (sistematica) : monografia. Licensed under Public Domain via Commons.

Day 5 has 108 protein-coding genes, including AGO1 (argonaute 1 RISC catalytic component). Argonaute is a critical part of the cell’s RNA interference (RNAi) machinery.

Fire and Mello won the 2006 Nobel Prize in Physiology or Medicine for their characterization of RNAi using the nematode C. elegans in 1998, but the Argonaute gene got its name from a group working in the plant A. thaliana. Mutations in the plant’s version of the gene led to an appearance that reminded them of a small squid, so they named the gene family after the octopus Argonauta argo.

Argonaute proteins are ancient: even bacteria have a version of them, which they use to chew up foreign DNA as a defense against viruses.

Click here to see your human version of AGO1.


Day 6 (1p34.2-1p32.3): MUTYH, a DNA repairer



Day 6 contains 99 protein-coding genes, including MUTYH (mutY homolog). Throughout life, your cells suffer DNA damage, which is constantly repaired by enzymes – one of which is made by the gene MUTYH.

When the DNA encoding a repairer like MUTYH is itself mutated, though, mutations can start to run amok in the genome, leading to cancer. Inherited MUTYH variants are associated with polyposis and colon cancer.

MUTYH is named after the mutY gene in E. coli bacteria. The similarity to a bacterial gene means that it is as ancient as the common ancestor of prokaryotes and eukaryotes (>1.7 billion years.)

Click here to see your MUTYH gene, where a cancer-associated variant will be flashing.


Day 7 (1p32.3-1p32.1): PCSK9, cholesterol-lowering from bench to bedside

Two bags of fresh frozen plasma. The bag on the left was obtained from a patient with hypercholesterolemia.


Day 7 has 68 protein-coding genes, including PCSK9 (proprotein convertase subtilisin/kexin type 9). The PCSK9 gene was discovered in 2003 by studying families with very high cholesterol. It soon became clear that different people harbored a whole spectrum of PCSK9 variants, some of which deactivated the protein and led to low LDL and lower risk of cardiovascular disease. PCSK9 became an extraordinary case of a genetic finding leading quickly to new drugs to lower cholesterol.

Click here to see a common variant, rs11591147 (R46L), in PCSK9 – it will be flashing. Having a T instead of a G here leads to a 2-3 fold lower risk of heart disease. This mutation is found on 1-2% of European-ancestry genomes but is rare elsewhere in the world.


Day 8 (1p32.1-1p31.2): The leptin receptor – how the brain hears the body say it’s full


Day 8 has 40 protein-coding genes, including LEPR (the leptin receptor.) This gene encodes the protein in the brain that senses leptin, a hormone released by fat cells. Although in general leptin is referred to as the “satiety hormone,” the relationship between leptin and satiety is more complex.

Mutations in the LEPR gene are associated with obesity, in both mice (above) and humans.

The LEPR gene is found in species as distant as fish, meaning it originated at least 440 million years ago.

Click here to see your LEPR gene.


Day 9 (1p31.2-1p31.1): ACADM, an enzyme that breaks down fatty acids

“Acyl CoA dehydrogenase active site” by Ecthompson2009 at English Wikipedia (original Author:Elizabeth Thompson and Megan Carmony) – Licensed under CC BY-SA 3.0

Day 9 contains 27 protein-coding genes – a very gene-sparse region. So it is predictable that the most-cited gene in this region is more arcane than the stars of other regions. It is ACADM (acyl-Coenzyme A dehydrogenase, C-4 to C-12 straight chain), encoding an enzyme that breaks down medium-chain fatty acids. Problems with this gene can cause a rare metabolic disease.

ACADM is billions of years old, because it is found in both bacteria and humans. Sequence analysis suggests that our ACADM was not inherited directly from a bacterial ancestor; rather, the gene was carried by the bacteria that early eukaryotes engulfed and that went on to become mitochondria (endosymbiosis). Even though the ACADM gene is now in the nuclear genome, the protein still does its work in the mitochondria!

Click here to see rs77931234, the most common mutation in ACADM causing MCAD deficiency. The frequency of this mutation is about 0.5% on European chromosomes.
