Classifying DNA Barcodes from the Lepidoptera Order

STA 440 - Case Studies

By Aidan Gildea, Chandler Naylon, and Marcus Liaw in Data Science, Data Analysis, Classification

February 23, 2023

About

As part of the course STA 440: Case Studies, we built classification models that read DNA sequences from various Lepidoptera (butterflies and moths) and predict their families and genera, while acknowledging the uncertainty in those predictions. This case study used a historical dataset of 40,000 annotated DNA sequences to train our models, with the goal of classifying 7,000 unannotated sequences at the family and genus levels. We achieved a high level of accuracy by constructing a multinomial regression model that accounts for particular short subsequences (formally known as k-mers) in the DNA sequences.
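
To make the idea of k-mers concrete, here is a minimal Python sketch of how a DNA sequence could be turned into k-mer counts. It is an illustration only, not the project's code: the function name `kmer_counts`, the choice of `k`, and the decision to skip windows containing ambiguous bases such as `N` are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence: str, k: int = 3) -> dict[str, int]:
    """Count overlapping k-mers in a DNA sequence.

    Hypothetical helper for illustration: windows containing characters
    other than A, C, G, T (for example N) are ignored.
    """
    sequence = sequence.upper()
    # Fixed vocabulary of all 4**k possible k-mers, so every sequence
    # maps to a feature vector with the same keys in the same order.
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    # Slide a window of length k across the sequence, one base at a time.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {kmer: counts.get(kmer, 0) for kmer in vocabulary}


# Example: 2-mers of a short made-up sequence with one ambiguous base.
features = kmer_counts("ACGTACGTNA", k=2)
```

Counts like these, computed for every sequence, could serve as the predictors of a multinomial regression over families or genera. Using a fixed vocabulary keeps the feature columns aligned across sequences of different lengths.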

You can view the project code on GitHub.
