A guide to NGS coverage uniformity, bias, and library complexity
I am a simple, jack-of-all-trades-but-master-of-none molecular biologist who was picked up and swept along by the NGS tsunami, and I have had to do a lot of head scratching to make some sense out of the attendant deluge of new concepts.
In this respect, you and I may have much in common. On the other hand, while you are probably focused on NGS as a means to quench your burning biological curiosity, I know very little about cancer biology, inherited disease, metagenomics, epigenetics, transcriptomics, or any of the multitude of exciting applications of NGS. Instead, it is my job to make your job easier by providing you with better tools for constructing, amplifying, pooling, and sequencing your NGS libraries. After all, better libraries provide more useful data for less effort and at a lower cost.
So what makes a good NGS library?
Ideally, an NGS library should be perfectly representative of the sample, providing reads that are evenly distributed across the entire region of interest. Unfortunately, real NGS data is quite lumpy, with some regions of the target sequence being over-represented, and other areas suffering from little or no coverage.
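To put a number on "lumpy", one option is to summarize the per-base read depth across the region of interest. The sketch below is only an illustration under my own assumptions: the function name `coverage_summary`, the `low_fraction` cutoff of 0.2, and the example depths are all made up for the purpose, not taken from any particular pipeline. It reports the mean depth, the coefficient of variation (standard deviation divided by mean, so 0 means perfectly even coverage), and the fraction of bases with no coverage or with much less than the mean.

```python
from statistics import mean, pstdev


def coverage_summary(depths, low_fraction=0.2):
    """Summarize how evenly reads cover a target region.

    depths: per-base read depth across the region of interest.
    low_fraction: hypothetical cutoff (an assumption for this sketch);
        bases with depth below low_fraction * mean count as under-covered.
    """
    if not depths:
        raise ValueError("depths must not be empty")
    avg = mean(depths)
    n = len(depths)
    if avg == 0:
        # Nothing was covered at all, so the CV is undefined.
        return {"mean": 0.0, "cv": float("nan"),
                "uncovered": 1.0, "under_covered": 1.0}
    return {
        "mean": avg,
        # Coefficient of variation: 0 for perfectly even coverage.
        "cv": pstdev(depths) / avg,
        # Fraction of bases with no reads at all.
        "uncovered": sum(d == 0 for d in depths) / n,
        # Fraction of bases well below the mean depth.
        "under_covered": sum(d < low_fraction * avg for d in depths) / n,
    }


# Two made-up regions with the same mean depth but very different evenness.
even = coverage_summary([30] * 10)
lumpy = coverage_summary([90, 80, 60, 40, 20, 5, 3, 2, 0, 0])
```

Both example regions have the same mean depth, but only the second one has a high coefficient of variation and a large share of poorly covered bases, which is the situation described above.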
The way I see it, there are three potential causes of uneven coverage distribution:
- For various reasons, it’s not possible to accurately assemble or map all NGS reads. For example, reads with low information content or reads corresponding to repetitive sequences may stack up artificially. Because I’m not a bioinformaticist, I’m not going to say anything more about this.
- Low library complexity—which is characterized by a significant proportion of reads sharing identical start sites—results in a lot of redundant sequence reads, which just end up in the trash.
- Various sources of bias lead to specific sequences becoming either enriched or depleted during fragmentation, library preparation, library amplification, and/or sequencing.
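The second cause in the list above, reads sharing identical start sites, lends itself to a quick back-of-the-envelope check. The sketch below is a deliberate simplification and rests on my own assumptions: each mapped read is reduced to a hypothetical `(chromosome, start, strand)` tuple, and any read whose tuple has been seen before counts as redundant. Real duplicate-marking tools typically consider more than this (for example, both ends of a read pair), so treat the result as a rough indicator rather than a definitive measure of library complexity.

```python
from collections import Counter


def duplicate_fraction(read_starts):
    """Return the fraction of reads that repeat an already-seen start site.

    read_starts: iterable of hashable start-site keys, assumed here to be
        (chromosome, start, strand) tuples, one per mapped read.
    """
    reads = list(read_starts)
    if not reads:
        return 0.0
    # Count how many reads fall on each distinct start site.
    counts = Counter(reads)
    # Every read beyond the first at a given site is treated as redundant.
    redundant = len(reads) - len(counts)
    return redundant / len(reads)


# Made-up example: three reads pile up on the same start site.
example_reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 250, "-"),
    ("chr2", 40, "+"),
]
fraction = duplicate_fraction(example_reads)
```

In this toy example, five reads cover only three distinct start sites, so two of the five reads add no new information. A library in which this fraction is high would be a candidate for the low-complexity problem described above.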