Wednesday, November 4, 2009: 10:45 AM
Convention Center, Room 305, Third Floor
Abstract:
Metagenomics is an emerging field that applies genomic analysis to entire communities of microbes. Its methods are developed for the massive, complex datasets produced by high-throughput sequencing of environmental samples, of which soil is the most complex. We survey the advantages and pitfalls of 16S-only techniques and propose the use of machine learning classifiers, which can use ANY genome fragment. Current taxonomic annotation relies on extracting 16S sequences and exploiting their highly conserved and hypervariable regions to estimate evolutionary distance. Indiscriminate annotation of a population of fragments will enable us to answer not only "Who is there?" but also "How much is there?". In addition, the high-throughput sequencing technologies that enable deep sampling of soil communities come at the price of short read lengths, which limit the resolution of annotation. We benchmark homology-based BLAST against composition-based Bayesian and SVM (Support Vector Machine) classifiers for short-read taxonomic classification.
The major challenge in annotating environmental samples is identifying the 99% of organisms that have never been cultured and sequenced before, including previously unknown species. Current methods, such as BLAST, match metagenomic reads, especially homologous genes, to the “closest” organism but have difficulty predicting new species and families. Our study demonstrates that a naive Bayesian classifier not only predicts these taxa with performance similar to BLAST, but also yields a probabilistic score that can be used to estimate read fidelity, or the confidence of the assignment, and that this score outperforms the fidelity information in BLAST's e-value. This confidence can be used to detect reads from unknown taxa. Finally, we will show how classification methods perform on complex soil samples and how, while still far from perfect, they can begin to bin a sample's content.
In conclusion, we show that machine learning techniques hold promise for taxonomic binning and classification, and we discuss their potential to conquer the "annotation" problem.
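As an illustration of the composition-based naive Bayesian approach described above, the following is a minimal sketch, not the implementation used in this study. It assumes reads are scored by their k-mer (short word) content, with add-one smoothing and a uniform prior over taxa; the word size `K`, the function names `train` and `classify`, and the toy sequences are all hypothetical choices made for illustration. The returned confidence is the normalized posterior of the best taxon, which is one plausible way to obtain a probabilistic score per read.

```python
import math
from collections import Counter
from itertools import product

K = 3  # hypothetical word size, chosen small so the toy example stays readable


def kmers(seq, k=K):
    """Return all overlapping k-mers of a sequence."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train(genomes, k=K):
    """Build per-taxon log-probability tables of k-mers with add-one smoothing.

    genomes: dict mapping taxon name -> training sequence (assumed input format).
    """
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    model = {}
    for taxon, seq in genomes.items():
        counts = Counter(kmers(seq, k))
        total = sum(counts[w] for w in vocab) + len(vocab)
        model[taxon] = {w: math.log((counts[w] + 1) / total) for w in vocab}
    return model


def classify(read, model, k=K):
    """Return (best_taxon, confidence) for a read.

    Assumes a uniform prior over taxa; k-mers containing characters other
    than A, C, G, T are skipped.
    """
    scores = {
        taxon: sum(logp[w] for w in kmers(read, k) if w in logp)
        for taxon, logp in model.items()
    }
    best = max(scores, key=scores.get)
    # Normalize with log-sum-exp so the confidence is a posterior in (0, 1].
    top = max(scores.values())
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return best, math.exp(scores[best] - log_norm)


if __name__ == "__main__":
    # Toy training data (hypothetical, not real genomes).
    toy = {
        "AT_rich": "ATATATTTAAATATTATAAATTTATATAT",
        "GC_rich": "GCGCGGGCCCGCGGCGCCCGGGCGCGC",
    }
    taxon, confidence = classify("ATTATAATAT", train(toy))
    print(taxon, round(confidence, 3))
```

In a sketch like this, a read whose best posterior falls below some chosen threshold could be flagged as possibly coming from an unknown taxon; where to set that threshold is an empirical question that the sketch does not address.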