Tales from the GALS

A Moth in the Tree of Life at Sanger

Peach Blossom Thyatira batis and barcoded tube at Wytham (see last month’s blog) and tubes safely in the Tree of Life -80 freezer at Sanger. Images from Liam Crowley (left) and Mark Blaxter (right).

The life of a sample at the Tree of Life labs at the Wellcome Sanger Institute starts with an email forewarning us, for example, of the imminent arrival of carefully identified moth specimens from Wytham Woods in barcoded freezer vials. On the day, an email from stores summons Nancy from her desk to collect the freezer parcel, and she scans the vials, checks them against the detailed sample manifest and places them in the -80°C freezer. Most samples are then passed on to the Sanger Samples Management Facility, a carefully backed-up rank of freezers that holds not just the Tree of Life samples but thousands upon thousands of samples from other Sanger programmes in human genetics, cancer, cellular genetics, pathogens and microbes.

There the moth sample waits in the freezers for a short time while Nancy compiles the instructions for sequencing: Is the moth especially rare? What DNA extraction method should be used? How big is the genome likely to be and thus how much data do we need to generate? The sample is then processed to retrieve very long DNA, either by the Tree of Life lab team, or our colleagues in Sanger’s Scientific Operations. For example, Radka (from the Tree of Life lab team of Radka, Michelle, Clare, Robin and Harriet) might take the moth sample and pulverise it before digesting the protein and extracting the DNA. She will check the quality of the DNA samples using a FemtoPulse instrument, which uses very little sample (a blessing when the sample is very small) to accurately quantify and size fragments up to 165 kilobases (kb). We have extraction methods that work well for moths and beetles and mammals and flies, and we are improving the quality of extractions from plants and fungi.

Size analysis of a long DNA sample. The FemtoPulse instrument (left) estimates, with possibly spurious accuracy, that the size of the extracted DNA peaks at 148,446 bases (spectrogram on the right), and thus is excellent for making a long read library. Images from Mark Blaxter and Radka Platte.

Good quality DNA then moves into library production. Making a large-insert library for the Pacific Biosciences SEQUEL II instrument or the Oxford Nanopore PromethION instrument is part art and part routine. As with extractions, we currently share the load of library production between the Tree of Life team and Scientific Operations. For the moth, Radka will take some of the DNA, shear it to just the right length (usually between 13 and 18 kb) and perform the molecular biology steps that are needed to prepare it for sequencing.

The library is handed over to the Scientific Operations Long Read team to load onto the big machines, the SEQUEL II and PromethION sequencers. These technologies have changed what is possible in genomics, and are the basis of the confidence that we can generate genomes from our thousands of target species. The machines take from 24 hours to 3 days to run, producing tens of gigabases of raw data from each library. For the moth, we will need only one run of one of the sequencers to generate enough data for primary assembly.
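The arithmetic behind "only one run" can be sketched in a few lines of Python. This is a rough illustration, not the team's actual planning tool. The genome size of 600 Mb is the figure given for the moth elsewhere in this post; the yield per run (30 gigabases, standing in for "tens of gigabases") and the target coverage (25-fold) are assumptions chosen purely for the example, and the function names are hypothetical.

```python
import math


def expected_coverage(yield_bases: int, genome_size_bases: int) -> float:
    """Average number of times each base of the genome is expected to be read."""
    if genome_size_bases <= 0:
        raise ValueError("genome size must be positive")
    return yield_bases / genome_size_bases


def runs_needed(target_coverage: float, genome_size_bases: int,
                yield_per_run_bases: int) -> int:
    """Smallest whole number of sequencing runs that reaches the target coverage."""
    if yield_per_run_bases <= 0:
        raise ValueError("yield per run must be positive")
    return math.ceil(target_coverage * genome_size_bases / yield_per_run_bases)


GENOME_SIZE = 600_000_000               # 600 Mb, the moth genome size quoted in the post
ASSUMED_YIELD_PER_RUN = 30_000_000_000  # assumption: 30 Gb from one run
ASSUMED_TARGET_COVERAGE = 25            # assumption: 25-fold coverage for assembly

coverage = expected_coverage(ASSUMED_YIELD_PER_RUN, GENOME_SIZE)
runs = runs_needed(ASSUMED_TARGET_COVERAGE, GENOME_SIZE, ASSUMED_YIELD_PER_RUN)
print(f"One run gives about {coverage:.0f}-fold coverage; runs needed: {runs}")
```

Under these assumed numbers a single run comfortably exceeds the target, which is consistent with the one-run claim for a genome of this size. A much larger genome, or a lower-yielding run, would push the count up.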

Meanwhile, Mike and Matt in Scientific Operations prepare some special long-range sequencing libraries from unsheared DNA and remaining sample. 10X Genomics linked read cloud libraries generate data that allow us to jump over and resolve complicated repeats in the moth genome. Hi-C libraries capture the three-dimensional arrangement of chromosomes in each nucleus of the moth, sampling DNA fragments that are close to each other in 3D space, but far apart on the linear, stretched-out chromosome. These 10X and Hi-C libraries generate data sets that are used to link long-read data into chromosomes. 10X and Hi-C data are generated on the fleet of Illumina sequencing instruments in Scientific Operations.

Pacific Biosciences SEQUEL II instruments (left) and the PromethION instrument (right) at Sanger Scientific Operations, running DToL samples night and day. Images from Mark Blaxter.

The SciOps team checks the data are of good quality, parks them on the Sanger’s (very) large hard drive system, and sends an email announcing the availability of another species’-worth of data.

Shane’s email inbox fills with messages about completed sequencing runs, and when all the moth’s data are ready he and his Tree of Life Assembly team (Marcela and Ksenia) kick off the process of assembly on the Sanger’s compute farm. This uses cutting-edge software to identify overlapping long reads, disentangle confusions that result from repeats and errors, and finally stitch everything together, first into contigs (stretches of contiguous AGCT sequence) and then into scaffolds (contigs that are ordered and oriented using long-range data). Only five years ago we would have struggled to generate assemblies with mean contig lengths over 50 kb. With the long-read Pacific Biosciences and Oxford Nanopore data we now get assemblies with mean contig lengths over 1 Megabase (Mb), frequently over 5 Mb and sometimes over 10 Mb. For species like our moth, which has a genome of 600 Mb, once Shane adds the 10X and Hi-C data, these assemblies fall into chromosomes.
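To make "ordered and oriented" concrete, here is a toy Python sketch of turning contigs into a scaffold. It is not the assembly software used at Sanger: the contig names, the layout and the helper functions are all hypothetical, and in a real assembly the order and orientation would be inferred from the 10X and Hi-C data rather than written by hand. The sketch only shows what a scaffold is once those decisions are made: contigs placed in order, flipped to the reverse strand where necessary, and joined with runs of N marking gaps of unknown sequence.

```python
# Lookup table for complementing DNA bases (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence as read from the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]


def build_scaffold(contigs: dict, layout: list, gap_size: int = 10) -> str:
    """Join contigs in the given order and orientation, separated by N gaps.

    layout is a list of (contig_name, orientation) pairs, where the
    orientation is "+" (as assembled) or "-" (reverse-complemented).
    """
    pieces = []
    for name, orientation in layout:
        seq = contigs[name]
        if orientation == "-":
            seq = reverse_complement(seq)
        elif orientation != "+":
            raise ValueError(f"unknown orientation: {orientation!r}")
        pieces.append(seq)
    return ("N" * gap_size).join(pieces)


# Hypothetical toy contigs; real ones are megabases long.
contigs = {"contig_a": "AAGC", "contig_b": "GGTT"}
# Hypothetical layout: contig_b comes first and must be flipped.
layout = [("contig_b", "-"), ("contig_a", "+")]

scaffold = build_scaffold(contigs, layout, gap_size=3)
print(scaffold)  # AACCNNNAAGC
```

The hard part, which this sketch skips entirely, is working out the layout; that is where the long-range data earn their keep.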

From sequence to contig to scaffold to chromosomes: the genome of a moth comes together using Hi-C data. The denser colours on the plots show the links between the contigs from the genome inferred from Hi-C data – before Hi-C scaffolding on the left, and after on the right, which has 30 large scaffolds and a few smaller ones waiting to be linked together by the GRIT informaticians. We expect a moth to have ~30 chromosomes. Image from Shane McCarthy.

The assembly team then hands the newly-minted moth genome assembly over to Kerstin’s Genome Reference Informatics Team (GRIT: Kerstin, Joanna, Sarah, Ying, James, William, Jonathan, Alan, Damon). For the moth, Ying stress-tests the assembly with a battery of analyses, basically asking “Is this the best we can do?”. The results get handed over to Sarah, who blesses the unproblematic majority of the assembly, affirms some correct guesses, fixes the few errors and exports a quality-assured assembly. James, the gatekeeper in GRIT, brokers submission of the genome assembly to the European Nucleotide Archive, part of the International Nucleotide Sequence Database Consortium, and presses the “release” button.

The new moth genome emerges into the light of a new digital day, one of 1000 species of all kinds we will extract, sequence and assemble this year. To publish the genome and announce its availability to the community to use and analyse, we write a brief Genome Note for rapid publication in Wellcome Open Research (2). Nancy marks the genome “complete”.

Now for the next one.

Mark Blaxter

Tales from the GALS

Wytham Woods: the genomics of ecology and evolution

Ancient woodlands are the most biodiverse and complex terrestrial habitat in the UK. Home to thousands of iconic and specialist animals, plants and fungi, our ancient forests and woodlands are also deeply entwined with our cultural heritage. In recent decades, however, woodland cover has been eroded by land use change, and today just 2.4% of the UK is covered by ancient woodland: sites where forest cover has persisted for over 400 years, usually with management to some degree.

Wytham Woods cloaks a prominent hill above a sweeping bend in the River Thames. The 400 hectare (1000 acre) site is a mosaic of ancient semi-natural woodland, forest plantations, limestone grassland and other species-rich habitats. It has been owned and maintained by the University of Oxford since 1942, and is the site of some of the longest running ecological experiments and observations in the world. Wytham Woods has a rich fauna and flora, with over 500 species of plants and around 1000 recorded species of butterflies and moths, and teems with a diversity of birds and mammals.

As the Darwin Tree of Life project was being conceived, Wytham Woods rapidly emerged as a site for focussed and intensive sampling of terrestrial species for complete genome sequencing. In the earliest phase of the project, we concentrated our attention on sampling arthropods, especially a wide taxonomic spread of moths and a carefully chosen selection of hoverflies, dung beetles and spiders. Our core team (Liam Crowley, Peter Holland and Owen Lewis) has been crawling through vegetation, picking through dung and peering into light traps: identifying, photographing, cataloguing, freezing in barcoded cryovials and shipping specimens to the Tree of Life labs at the Wellcome Sanger Institute for DNA extraction. It has not been a solitary endeavour: we have benefitted enormously from the moth-trapping expertise of Douglas Boyes, and visits from hoverfly, dung fauna and spider specialists (Will Hawkes, František Sládeček, Lauren Sumner-Rooney and Alistair McGregor). Involvement of taxon experts is something we really want to encourage in the project, with forthcoming visits planned by specialist groups including the Dipterists’ Forum and the Earthworm Society of Britain. We have a rustic chalet in the middle of the woods, with accommodation for small groups of visitors and volunteers, a kitchen and labs – perfect for early morning or nocturnal work.

Black Arches Lymantria monacha

By January 2020, just a few months into the Darwin Tree of Life project, we had sent specimens of 221 arthropod species to the Sanger Institute. Not all will be turned into genome sequence, but a close look at the first few genome sequences assembled reveals the data quality to be astonishingly good. So what could we learn from Wytham Woods genome sequence data? And more generally, why focus part of a major sequencing project on ancient woodland? We think there are several reasons. First, it is incredibly efficient to focus sampling at a few sites. Second, the sequences will become key reference genomes for ecological and environmental studies through the 21st century. Our woodland fauna and flora are under threat due to land use change, invasive species, climate change and pathogen outbreaks. Understanding and predicting these changes, and possibly mitigating some of them, will require us to understand how each species responds to challenges at a cellular and molecular level. Such studies, including transcriptomic and proteomic analyses, will be greatly aided by reference genomes. Populations could also become fragmented or merged, and to detect this, comparisons need to be made between individuals, something that will be facilitated by reference genomes. The third reason centres on evolution. Natural selection has adapted organisms to their environment through fixation of genetic change, and so hidden in the genome sequences will be clues to how evolution has shaped physiology, anatomy, life history, behaviour and other traits. There will surely be new genes, divergent sequences, genome duplications, horizontal gene transfers and much more: a deeper understanding of biodiversity is waiting to be discovered in Wytham Woods.

Peach Blossom Thyatira batis

Peter Holland, Owen Lewis, Liam Crowley