Welcome to Genome Year!
Most of your cells contain two complete copies of the human genome, each with 3.1 billion nucleotides of DNA. Genome Year will take you on a tour of 7.7-8.9 million nucleotides per day for a year, starting with the largest chromosome and ending with the smallest. I will highlight something interesting each day, trying to tell stories about the most famous genes and variants while also giving examples of the frontiers: poorly characterized genes and noncoding features, with a bias towards what I think is most interesting and beautiful.
Before getting started, let’s take a peek at some of the things you’ll see on Day 1 (1p36.33-1p36.23) to understand how the genome is being conveyed.
Cytogenetic band naming
Long before the genome was sequenced, physical positions along the chromosomes were defined using the 863 light and dark stripes that appear under the microscope when the chromosomes are stained. On average, 1-2 of these are traversed per day. This nomenclature, rather than linear coordinates, is still widely used by clinical geneticists, and is more concise to use in blog titles.
The reference genome is displayed using letters representing the four nucleotides of DNA. Besides A, C, G, and T, letters like M, Y, and R represent single-nucleotide polymorphisms using the IUPAC degeneracy codes.
Lowercase letters are regions that have been marked as low-complexity and repetitive.
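To make the letter conventions concrete, here is a minimal Python sketch of how a single displayed letter could be interpreted. The table is the standard IUPAC nucleotide ambiguity code; the function name `describe_base` and the returned keys are hypothetical and are not taken from the Genome Year code.

```python
# Standard IUPAC nucleotide ambiguity codes: letter -> bases it can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",  # in a reference sequence, N usually means "unknown base"
}


def describe_base(letter: str) -> dict:
    """Interpret one displayed letter (hypothetical helper for illustration)."""
    upper = letter.upper()
    alleles = IUPAC[upper]
    return {
        "alleles": alleles,
        # A degenerate code other than N marks a known variant position.
        "is_variant_site": len(alleles) > 1 and upper != "N",
        # Lowercase marks low-complexity or repetitive sequence.
        "is_repeat_masked": letter.islower(),
    }


print(describe_base("R"))
print(describe_base("y"))
```

Under these assumptions, `R` reads as a position where A or G is seen, and a lowercase `y` reads as a C/T variant position inside a repeat-masked region.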
Foreground Colors and Definition of Genes
The concept of a gene as a unit of inheritance was developed long before DNA was characterized. Mapping this concept, which is more than a century old, to actual coordinates in the genome is a challenge. Even fifteen years after the completion of the reference genome, there is disagreement over where the genes and other functional units are, and what to call them. Genome Year uses a definition from the current version of the ENSEMBL database.
A basic hierarchy guides the coloring of letters by gene structure:
- The basic color of letters with no additional gene annotation is gray. These are intergenic (between genes). We have abundant evidence from evolution, biochemistry, and genetics that some fraction of these intergenic letters are important because they play roles in regulating gene transcription. Some of these intergenic regions may be transcribed at low levels, but have not been officially designated as genes.
- Genes are regions that are transcribed into RNA consistently enough to be designated as genes, and are colored non-gray. This is a “transcriptional unit” definition of a gene. These RNAs may then go on to be processed or not, and ultimately may have coding or noncoding functions.
- Many RNAs go through a complicated series of processing steps before they become mature. The most important aspect is splicing out of introns, leaving exons behind. In Genome Year, introns are colored blue while exons are colored black or white.
- The subset of RNAs that have instructions for making proteins are called messenger RNAs (mRNAs). Only the middle part of the final mRNA has the instructions for making proteins; there are flanking regions at the beginning and end called untranslated regions. DNA encoding these untranslated regions of the mRNA is colored black, while DNA encoding regions that will be translated into amino acids is shown with a white foreground on a colored background.
Background colors are assigned to protein-coding positions based on the amino acid that they encode.
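The coloring hierarchy above can be sketched as a short decision function. This is only an illustration under stated assumptions: the function `letter_style`, the half-open interval convention, and the placeholder string for the amino-acid background are all hypothetical, and the sketch assumes that exons of noncoding genes are drawn black like untranslated regions.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # half-open [start, end); a convention assumed here


def covers(intervals: List[Interval], pos: int) -> bool:
    """Return True if pos falls inside any of the intervals."""
    return any(start <= pos < end for start, end in intervals)


def letter_style(
    pos: int,
    genes: List[Interval],
    exons: List[Interval],
    coding: List[Interval],
) -> Tuple[str, Optional[str]]:
    """Return (foreground, background) for one position, most general rule first."""
    if not covers(genes, pos):
        return ("gray", None)    # intergenic
    if not covers(exons, pos):
        return ("blue", None)    # transcribed but spliced out: intron
    if not covers(coding, pos):
        return ("black", None)   # exon that is not translated (for example, a UTR)
    return ("white", "amino-acid color")  # translated: background depends on the residue


# Toy gene with two exons and a coding region that starts inside the first exon.
genes = [(100, 200)]
exons = [(100, 130), (170, 200)]
coding = [(110, 130), (170, 190)]
for pos in (50, 105, 120, 150):
    print(pos, letter_style(pos, genes, exons, coding))
```

Checking the most general category first mirrors the order of the list: each later rule only refines letters that already passed the earlier ones.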
Genes are flanked by their name in red, along with an arrow indicating whether their coding strand is in the forward (right) or reverse (left) direction relative to the chromosomal coordinates.
How similar is this reference genome to my genome?
Very similar. There will be two kinds of differences between your genome and the reference genome: single-letter differences (SNPs) and big chunks that you have inserted or deleted (structural variants).
According to the latest 1000 Genomes Project analysis, your genome has about 4-5 million SNPs (approaching a day’s worth of sequence), but most of those will be at the places already marked in this reference as known SNPs (represented by some letter besides A, C, G, T, or N).
Structural variants have a more profound effect: you likely have 2,100-2,500 of these (several per day), and together they amount to ~20 million bases affected (2-3 days’ worth).
So when you are looking at any given sequence, especially the bases that are A, C, G, or T, it is extremely likely to be the same sequence in your genome too. There is a <1% chance that one of your two genome copies has that entire chunk of DNA deleted or duplicated due to a structural variant. Even in that case, the copy from your other parent is more than 99% likely to be intact, so you would have 1 or 3 copies instead of 2.
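As a rough check on these figures, the arithmetic can be reproduced in a few lines of Python. The inputs are the round numbers quoted in this post; spreading the affected bases uniformly across the genome is a simplifying assumption made only for this sketch.

```python
GENOME_LENGTH = 3_100_000_000  # nucleotides per genome copy, as quoted in this post
DAYS = 365
SV_BASES = 20_000_000          # ~20 million bases affected by structural variants

per_day = GENOME_LENGTH / DAYS

# Fraction of positions touched by a structural variant, assuming uniform spread.
sv_fraction = SV_BASES / GENOME_LENGTH
# The same total expressed in days of the tour.
sv_days = SV_BASES / per_day

print(f"average pace: {per_day / 1e6:.1f} million nucleotides per day")  # 8.5
print(f"structural-variant fraction: {sv_fraction:.2%}")                 # 0.65%
print(f"structural-variant bases: {sv_days:.1f} days' worth")            # 2.4
```

The 0.65% figure is where the "<1% chance" for any given stretch comes from, and 2.4 days sits inside the quoted range of 2-3 days' worth.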
Methods / Technical Details
Repeat-masked genome sequence with dbSNP variants added was obtained from UCSC hg38 with the dbSNP 142 mask. Genes and exon boundaries were obtained from the ENSEMBL 83 GTF file. Coding sequences were obtained from CCDS 18 and colored according to the Jmol “amino” color scheme. For simplicity, for protein-coding genes, only the exons and CCDS from the longest ENSEMBL transcript were displayed.
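The "longest transcript" step could look something like the following Python sketch. It is not the actual Genome Year pipeline: the function name `longest_transcripts` is hypothetical, and "longest" is assumed here to mean the largest summed exon length, which the description above does not specify. It relies only on the standard GTF layout of nine tab-separated columns with `gene_id` and `transcript_id` in the attribute column.

```python
import re
from collections import defaultdict

# Matches attribute pairs of the form: key "value"
ATTR = re.compile(r'(\S+) "([^"]*)"')


def longest_transcripts(gtf_lines):
    """Map each gene_id to the transcript_id with the largest summed exon length."""
    exon_len = defaultdict(int)  # (gene_id, transcript_id) -> total exon bases
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        attrs = dict(ATTR.findall(fields[8]))
        start, end = int(fields[3]), int(fields[4])
        # GTF coordinates are 1-based and inclusive.
        exon_len[(attrs["gene_id"], attrs["transcript_id"])] += end - start + 1

    best = {}  # gene_id -> (transcript_id, length)
    for (gene, transcript), length in exon_len.items():
        if gene not in best or length > best[gene][1]:
            best[gene] = (transcript, length)
    return {gene: transcript for gene, (transcript, _) in best.items()}


# Toy input with made-up identifiers.
example = [
    '1\ttest\texon\t100\t199\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    '1\ttest\texon\t100\t199\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
    '1\ttest\texon\t300\t399\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
]
print(longest_transcripts(example))
```

Showing one transcript per gene keeps the display readable, at the cost of hiding exons that appear only in alternative transcripts.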
Luke Ward, email@example.com