Classifying DNA Barcodes from the Lepidoptera Order
STA 440 - Case Studies
By Aidan Gildea, Chandler Naylon, and Marcus Liaw in Data Science, Data Analysis, Classification
February 23, 2023
About
As part of the course STA 440: Case Studies, we built classification models that read DNA sequences from various Lepidoptera (butterflies and moths) to predict their families and genera while reporting the uncertainty of those predictions. The case study used a historical dataset of 40,000 annotated DNA sequences to train our models, with the goal of classifying 7,000 unannotated sequences at the family and genus levels. We ultimately achieved a high level of accuracy with a multinomial regression model whose predictors are based on particular short subsequences (formally known as k-mers) of the DNA sequences.
You can view the project code on GitHub.