Nature Biotech reproducibility Q&A session with Daryl Gohl, CSO

A commentary by Daryl Gohl was published in this month’s Nature Biotechnology. We had the opportunity to sit down with Dr. Gohl and discuss recent efforts to characterize variability in microbiome profiling methods and the importance of defined reference materials for assessing reproducibility.

1. What is the overriding message from your commentary in Nature Biotechnology?

DG) The two papers that were published in this issue of Nature Biotechnology show that there is an enormous amount of variability between different labs in the generation and analysis of microbiome data. In fact, this variability can be on the same scale as biological variation, meaning that the technical variation has the potential to obscure or override biological signals.

2. In the Costea paper, they describe the importance of protocols for DNA extraction from metagenomic samples and their impact on variability. In your experience, how important is DNA extraction, within the overall workflow, for producing consistent microbiome results?

DG) Several studies have now demonstrated that, of the technical variables in microbiome data generation, extraction tends to have the largest effect on variation. Microbial cellular membranes have diverse properties and some are easier to break open than others. Thus, when dealing with mixed microbial communities, completeness of extraction is a big concern, and insufficiently vigorous extraction has the potential to bias microbiome profile measurements from the very first step.

3. In the Sinha paper, they describe different areas where variability can be introduced in sample preparation for microbiome samples.  What are the main areas, in your opinion, where variability can have the greatest impact on results?

DG) The Sinha paper identified a number of variables that affected the accuracy and reproducibility of microbiome measurements. As with the Costea paper, they identified extraction as the most variable stage of the workflow. The next largest source of variation is library preparation methodology. As these papers illustrate, there are many different protocols in use in the field, and these methods vary quite a bit in terms of data quality. In addition, even labs using ostensibly similar methods arrived at divergent results in some cases. Finally, contamination has the potential to lead to misleading results. It’s critically important to include negative controls at each stage of the analysis process (sampling, extraction, and library preparation) to identify contaminants that could be coming from reagents or introduced during processing.

4. What do you believe is hindering cross-lab data comparisons and the multi-site aggregation of microbiome data that would enable higher-powered studies?

DG) I think that to date there has not been a sufficient appreciation of the level of variability in microbiome data sets due to differences in methods. These two papers call attention to this variability, but ultimately there needs to be a strong will amongst the labs in the community in order to bring data generation protocols into alignment. It can be incredibly difficult to standardize data generation processes for studying complex biological systems. For instance, a recent commentary in Nature highlighted a years-long odyssey to harmonize the collection of longevity data in C. elegans. I think it is notable that bringing the data generation processes into alignment between the three labs required a high level of commitment and a Herculean effort from all the parties. However, these efforts also had a payoff in that once the researchers were able to make more precise measurements, they uncovered an interesting biological phenomenon (a bimodal distribution of lifespans within an individual strain) that had previously been hidden in the noise in these measurements.

5. Why do these papers highlight the utility of defined reference materials?

DG) Reference standards serve multiple functions in DNA sequencing-based measurements. First, defined reference materials such as mock microbial communities or synthetic DNA standards allow assessment of the accuracy of a measurement through comparison to ground truth information. Secondly, such standards allow run-to-run and lab-to-lab variability to be characterized and thus provide a measure of the degree of reproducibility of a measurement. Widespread adoption of simple positive and negative controls at each stage of the microbiome data generation process would go a long way toward improving the reproducibility and comparability of microbiome data generated by different groups.
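As a rough illustration of the two uses described above, the following Python sketch compares hypothetical read counts from a mock community against its known composition (accuracy) and against a second run of the same standard (reproducibility). The taxon names, counts, even four-member design, and the choice of Bray-Curtis dissimilarity are all assumptions made for this example; they are not taken from the commentary or from any particular lab's protocol.

```python
def relative_abundance(counts):
    """Convert raw read counts per taxon into fractions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads in sample")
    return {taxon: n / total for taxon, n in counts.items()}


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two abundance profiles.

    0 means identical profiles, 1 means no taxa shared.
    """
    taxa = set(p) | set(q)
    numerator = sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in taxa)
    denominator = sum(p.get(t, 0.0) + q.get(t, 0.0) for t in taxa)
    return numerator / denominator


# Hypothetical ground truth: an even mock community of four taxa.
expected = {"taxon_A": 0.25, "taxon_B": 0.25, "taxon_C": 0.25, "taxon_D": 0.25}

# Hypothetical read counts for the same standard sequenced in two runs.
run_a = relative_abundance(
    {"taxon_A": 400, "taxon_B": 300, "taxon_C": 200, "taxon_D": 100}
)
run_b = relative_abundance(
    {"taxon_A": 380, "taxon_B": 310, "taxon_C": 210, "taxon_D": 100}
)

# Accuracy: how far each run is from the known composition.
accuracy_a = bray_curtis(run_a, expected)
accuracy_b = bray_curtis(run_b, expected)

# Reproducibility: how far the two runs are from each other.
run_to_run = bray_curtis(run_a, run_b)

print(f"run A vs. expected: {accuracy_a:.3f}")
print(f"run B vs. expected: {accuracy_b:.3f}")
print(f"run A vs. run B:    {run_to_run:.3f}")
```

In this made-up example the two runs agree closely with each other but both deviate from the expected composition, which is the pattern a consistent but biased protocol would produce. Without the reference material, only the run-to-run agreement would be visible.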

6. In your experience as the CSO of CoreBiome, how do you manage variability across the tens of thousands of samples that come into the lab?

DG) First, we work with clients on experimental design to ensure that samples are collected, stored, and stabilized in an appropriate manner. Secondly, we have a suite of controls in place, including sophisticated synthetic DNA standards, that allow us to track run-to-run reproducibility. Finally, another advantage of having reference materials such as our synthetic standards is that they provide tools to characterize the effect of protocol variables on accuracy and thus allow for process optimization. Our protocols are the result of extensive experiments using defined reference materials to optimize data quality.

7. What are the latest updates in the labs at CoreBiome?

DG) We’ve recently launched a novel product, BoostArray™, for characterizing large collections of microbial isolates. This product allows customers to obtain high-resolution genomic information, including strain-level identification and gene content, in a high-throughput, cost-effective manner.