Better Data By Design: Ten Sources of Unwanted Bias and Variance in Metabolomic Experiments
David Broadhurst, Assistant Professor, University of Alberta
The term 'data-driven science' is widely used in the metabolomics and systems-biology community, where there has been a history of performing experiments with no clear hypothesis, so-called hypothesis-free experiments [1]. It is assumed that by simply collecting arbitrary amounts of high-dimensional data on any given biological system, and applying modern, computationally intensive machine-learning algorithms, reliable information will simply "drop out", with very little forethought given to the design of such experiments. Design issues such as sample numbers, sample selection, metabolite measurement, quality control, statistical analysis, and reporting methodology are rarely given detailed consideration. This, combined with a myopic dependence on a small number of very powerful computational methods, typified by PLS-DA, and little understanding of post hoc statistical analysis, has resulted in a plethora of poor published research that, when found to be unrepeatable, will serve only to discredit the metabolomics community as a whole. Here, using the general principles described by Holmes as a springboard, I categorize the major obstacles to rigorous metabolomic research into a set of ten sources of unwanted statistical error arising at specific stages of the metabolomics workflow.