2022: The year we built the biggest genome in Britain and Ireland
Darwin Tree of Life genomicists at the Wellcome Sanger Institute and University of Edinburgh have certainly earned an end-of-year break, having spent much of 2022 tackling a fittingly festive species.
The European mistletoe (Viscum album) has the largest genome of any species from Britain and Ireland. It has now had its DNA sequenced, its genome assembled to top chromosomal-level quality, and — following a thorough final check of our work — will be submitted by the DToL project to public databases in the new year.
A giant among genomes
At around 90 gigabase pairs (Gbp), the mistletoe genome is 30 times larger than our own human genome, and easily the largest reference genome assembled thus far. Surprisingly, almost all of this genetic material is stored in just 10 enormous chromosome pairs (remember that humans have 23). Even the smallest of these mistletoe chromosomes, at over 9 Gbp, is roughly the size of three entire human genomes.
For comparison, below are Hi-C maps of the entire Homo sapiens genome and just the first chromosome of Viscum album.
Our bioinformaticians use Hi-C visualisations to manually check and edit our genome assemblies, with the diagonal line representing genome length and each darker square representing a chromosome. The second pair of maps shows the human genome alongside the entire mistletoe genome.
The sheer scale of that map makes curating this genome particularly mind-boggling. But this process only comes towards the end of a series of massive challenges.
The decision to sequence the mistletoe genome was made very early in the Darwin Tree of Life project, which launched in late 2019. Thanks to years of research into plant genome size, not least at DToL partner Kew Gardens, our scientists knew mistletoe dwarfed other species.
Viscum album does not have the largest plant genome in the world; that record is held by Paris japonica (150 Gbp). But the next-largest British and Irish species, members of the lily and onion families, trail far behind at 30 to 40 Gbp.
Alex Twyford, senior lecturer at the University of Edinburgh and a parasitic plant specialist, was one of those early DToL decision makers.
“For me, Darwin Tree of Life is huge in scale with so many different species, but it’s important we’re tackling some of the most challenging species right from the start. And if we want to face some of those problems early on, why not go for the largest genome in Britain and Ireland?”
If DToL could sequence mistletoe, the basic fact of a genome’s size would not pose a problem for future species.
But this was also an opportunity for the project. Mistletoe helped us stress-test our equipment and work out whether the variability of early results was due to our processes or to the species we were sequencing. To do this testing and tweaking over time, we needed a whole load of genetic data to play with, ideally from a single specimen to make trials repeatable.
“To get DToL up and running, we needed a test case,” explains Alex. “Mistletoe was an obvious candidate.”
Last Christmas… we sequenced this plant
The first mistletoe samples were collected from a female plant in September 2020. It still grows on a hawthorn (Crataegus sp.) near Kew — making it easy for DToL botanists to return for new samples.
The mistletoe samples were sent to the Wellcome Sanger Institute, where high molecular weight DNA was extracted from their cells. Powerful machines then turn the physical DNA molecules into long strings of ACGT code stored in huge text files. Several different types of data are required to build a high-quality reference genome.
One team at Sanger specialises in producing long-read sequence data using machines called Sequel IIe systems, made by California-based company Pacific Biosciences (PacBio). The PacBio machines are based around a technology called SMRT cells (pronounced ‘smart’). Genome size matters for these machines: any species with a genome below 1 Gbp can be sequenced using one SMRT cell over a day or less.
For mistletoe, the team ran all 12 of Sanger’s Sequel IIe systems for a week to get the amount of data required. This was winter 2021, and the team found themselves facing a festive challenge.
“It wasn’t on purpose, but we found ourselves racing to sequence the mistletoe genome before Christmas. We knew that meant a lot of SMRT cells. So there was a bit of excitement there, and some relief once we’d completed it,” says James Watts, who leads one of the long-read teams.
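A back-of-envelope check, using only the figures quoted above, suggests why the job occupied the whole fleet for about a week. The sketch below assumes naive linear scaling (one SMRT cell per 1 Gbp of genome, and one cell per machine per day). That is a simplification for illustration, not the team's actual run plan, and the function names are hypothetical.

```python
import math


def smrt_cells_needed(genome_gbp, gbp_per_cell=1.0):
    """Cells needed if each cell covers `gbp_per_cell` of genome (assumed linear)."""
    return math.ceil(genome_gbp / gbp_per_cell)


def days_on_fleet(cells, machines, cells_per_machine_per_day=1.0):
    """Days to run `cells` if every machine runs in parallel at the assumed rate."""
    return cells / (machines * cells_per_machine_per_day)


# Figures from the article: ~90 Gbp genome, 12 Sequel IIe systems.
cells = smrt_cells_needed(90)
days = days_on_fleet(cells, machines=12)
print(f"{cells} SMRT cells, about {days} days across 12 machines")
```

Under these assumptions the estimate lands close to the week the team reports, although real runs depend on coverage targets and per-cell yield that the article does not give.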
In total, 10 terabytes of DNA sequence data was generated for the mistletoe. DToL’s botanists like to point out that, although many of the project’s first genomes are of insects, this one plant required about as much sequence data as 100 insect species combined.
Hundreds of jobs running
By February 2022 all the DNA sequence data had reached Shane McCarthy and the Tree of Life Assembly team. Their role is first to check the quality of the data they receive. They then assemble the data into long, contiguous pieces and ‘scaffold’ those into chromosome-sized blocks. Much of this is done using automated tools, many developed by Shane and his team.
“Knowing the genome size ahead of time is helpful for knowing the kind of compute resource you’re going to use. For mistletoe we did a lot of special things because we knew the genome would be so large,” says Shane.
Three terabytes of storage was needed just to get the mistletoe’s raw data on disk. Only two machines at Sanger had the memory to actually do the assembly. Then, to map the genome, the team had lots of small jobs running in parallel on different computers.
“For a smaller genome you might have 10 jobs running. For the mistletoe it was hundreds,” says Shane.
Christmas time, mistletoe in line
By June an assembly had been generated that the scientists were happy with. The next stage of the process is known as curation, which involves going chromosome by chromosome to check every little detail, confirming any translocations or potential errors.
“It’s kind of like crafting the genome. You see what comes out of the [genome production] pipeline, like out of a box, and using the genome biology and the data you try to improve it,” explains Lucia Campos-Dominguez from the University of Edinburgh, who tackled the curation of the mistletoe. “This was a very intensive part. Scrolling through the mistletoe genome, chromosome by chromosome, correcting things by hand.”
With small genomes, for example butterflies and moths, you might make a few edits per chromosome. Lucia ended up making hundreds of edits on each of the huge mistletoe chromosomes.
One issue that quickly became apparent was the resolution available for the Hi-C maps on standard software. It was so low at mistletoe’s scale that only large blocks could be moved around and no editing of finer detail was possible. The solution was to split the chromosomes into separate files and edit them there, which made it more arduous to find the smaller bits of misplaced sequence and assign them to the correct chromosome.
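The resolution problem comes down to simple division: a map drawn with a fixed number of bins has to spread them over however much sequence it is showing. The sketch below uses a hypothetical bin count (32,768 is an illustrative number, not a figure taken from the article or from any particular tool) to compare viewing the whole genome against viewing a single chromosome of roughly 9 Gbp.

```python
def bp_per_bin(sequence_bp, bins):
    """Base pairs represented by one bin along one axis of a fixed-size map."""
    return sequence_bp / bins


HYPOTHETICAL_BINS = 32_768            # illustrative only; an assumption
WHOLE_GENOME_BP = 90_000_000_000      # ~90 Gbp, from the article
ONE_CHROMOSOME_BP = 9_000_000_000     # ~9 Gbp, smallest chromosome per the article

whole = bp_per_bin(WHOLE_GENOME_BP, HYPOTHETICAL_BINS)
single = bp_per_bin(ONE_CHROMOSOME_BP, HYPOTHETICAL_BINS)

print(f"whole genome:   {whole:,.0f} bp per bin")
print(f"one chromosome: {single:,.0f} bp per bin")
print(f"gain from splitting: {whole / single:.0f}x finer")
```

Whatever the real bin count is, the ratio is what matters: loading one chromosome at a time makes each bin cover about a tenth as much sequence, at the cost of no longer seeing pieces that belong on a different chromosome.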
“Since the summer, mistletoe has been my main task,” says Lucia. “I received training from the Sanger team and we worked out a way around the resolution issues. Then we got the mistletoe data and I worked on it for three months straight.”
To put this timeframe into perspective, Lucia curated two other plant genomes — the box (Buxus sempervirens) and a moss (Polytrichum commune) — in a single week before embarking on the mistletoe. Curation was finished in early December 2022, another milestone achieved just before Christmas.
“There is a lot of decision making in curation, which I think is the hardest part,” Lucia reflects. “Is this the right set of sequences, or should I change the order? Is this an inversion? These kinds of structural changes you are supposed to make to the genome, it feels deep because you’re actually altering the results. I got a lot of reassurance from the Sanger team who helped out a lot.”
New year, new genome
A few final challenges remain before the Viscum album genome assembly data is uploaded to public databases for scientists worldwide to freely access.
For example, although genomes as large as 100 Gbp can be uploaded, the databases cannot take individual continuous sequences of DNA larger than 2.14 Gbp. Since the mistletoe chromosomes are so large, each will need to be split into six pieces, all of which will still be part of the same genome submission.
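To see how the 2.14 Gbp limit translates into pieces, here is a minimal sketch. The chromosome length used (9.5 Gbp) is a hypothetical value chosen only because the article says the smallest chromosome is over 9 Gbp; the even-sized split is likewise an assumption, since in practice break points would more plausibly be chosen at gaps in the assembly.

```python
MAX_SEQUENCE_BP = 2_140_000_000  # the 2.14 Gbp per-sequence limit quoted in the article


def min_pieces(chromosome_bp, limit_bp=MAX_SEQUENCE_BP):
    """Smallest number of pieces such that none exceeds the limit (integer ceiling)."""
    return -(-chromosome_bp // limit_bp)


def split_coordinates(chromosome_bp, n_pieces):
    """Return half-open (start, end) intervals of near-equal size covering the chromosome."""
    base, extra = divmod(chromosome_bp, n_pieces)
    intervals = []
    start = 0
    for i in range(n_pieces):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first pieces
        intervals.append((start, start + size))
        start += size
    return intervals


chromosome_bp = 9_500_000_000  # hypothetical length, for illustration only
print("minimum pieces:", min_pieces(chromosome_bp))
for start, end in split_coordinates(chromosome_bp, 6):
    print(f"{start:>13,} to {end:>13,}  ({end - start:,} bp)")
```

Using the same piece count for every chromosome, as the article describes, keeps the submission uniform even where a smaller chromosome could fit in fewer pieces.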
Nevertheless, the finish line for this marathon genome is in sight. “There were numerous challenges along the path, but it was well worth it. I’d do it again,” says Alex Twyford.
Where do you go from the biggest genome? Well, plant genomes do lots of complicated things. One is called polyploidy, where the plant has duplicated its genome at different points in its history. The mistletoe has not done this; it is a straightforward diploid, meaning it has only pairs of chromosomes, like humans. In contrast, some plants have many more copies of the same chromosome. The adder’s tongue fern (Ophioglossum sp.) has done this to such an extent that it has well over a thousand chromosomes in each of its cells.
“Now DToL has sequenced the largest genome, I’d be keen to tackle this second challenge — the polyploidy issue,” says Alex. “With some of those polyploid genomes you’re sequencing two, or four, or eight genomes in one. Trying to untangle those is the next frontier.”