Data Science

Modeling the Visual Field For Young Glaucoma Patients

About As part of the course STA 440: Case Studies, we built spatio-temporal conditional autoregressive (CAR) classification models to predict early-onset glaucoma. By training these models on longitudinal visual field series from patients under 45, we gained insight into the progression and regionality of optical deterioration. Our model-building process encoded visual field neighborhood dependencies via rook and queen adjacency matrices to better understand how location and adjacency affect expected visual field changes.
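To illustrate the rook and queen adjacency idea, here is a minimal Python sketch. It assumes a plain rectangular grid, which is a simplification (real visual field test patterns are not rectangular), and the function name `grid_adjacency` is hypothetical rather than taken from the project code. Rook adjacency links cells that share an edge; queen adjacency also links cells that share a corner.

```python
from itertools import product


def grid_adjacency(n_rows, n_cols, kind="rook"):
    """Build a binary adjacency matrix for an n_rows x n_cols grid.

    Hypothetical helper for illustration only. Cells are indexed
    row by row, so cell (r, c) has index r * n_cols + c.
    """
    if kind == "rook":
        # Neighbors share an edge: up, down, left, right.
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    elif kind == "queen":
        # Neighbors share an edge or a corner: all 8 surrounding cells.
        offsets = [
            (dr, dc)
            for dr, dc in product((-1, 0, 1), repeat=2)
            if (dr, dc) != (0, 0)
        ]
    else:
        raise ValueError("kind must be 'rook' or 'queen'")

    n = n_rows * n_cols
    adjacency = [[0] * n for _ in range(n)]
    for r, c in product(range(n_rows), range(n_cols)):
        i = r * n_cols + c
        for dr, dc in offsets:
            rr, cc = r + dr, c + dc
            # Skip offsets that fall outside the grid.
            if 0 <= rr < n_rows and 0 <= cc < n_cols:
                adjacency[i][rr * n_cols + cc] = 1
    return adjacency


rook = grid_adjacency(3, 3, kind="rook")
queen = grid_adjacency(3, 3, kind="queen")
# Number of neighbors of the center cell (index 4) under each scheme.
print(sum(rook[4]), sum(queen[4]))
```

In a CAR model, a matrix like this defines which locations are treated as neighbors, so the choice between rook and queen changes how strongly nearby locations are tied together.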

Classifying DNA Barcodes from the Lepidoptera Order

About As part of the course STA 440: Case Studies, we built classification models that read DNA sequences from various Lepidoptera (butterflies and moths) to predict their families and genera while quantifying the uncertainty of each prediction. This case study used a historical dataset of 40,000 annotated DNA sequences to train our models, with the ultimate goal of classifying 7,000 unannotated sequences at the family and genus levels. We achieved a high level of accuracy by constructing a multinomial regression model accounting for particular loci (formally known as k-mers) in the DNA sequences.
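As a rough illustration of k-mer features, the Python sketch below counts every overlapping substring of length k in a sequence. The function name `kmer_counts`, the default `k=3`, and the choice to skip windows containing ambiguous bases are assumptions made for this example, not details from the project itself.

```python
from collections import Counter


def kmer_counts(sequence, k=3):
    """Count overlapping k-mers in a DNA sequence.

    Hypothetical helper for illustration only. Windows containing
    anything other than A, C, G, or T (for example the ambiguity
    code N) are skipped.
    """
    seq = sequence.upper()
    valid_bases = set("ACGT")
    counts = Counter()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if set(kmer) <= valid_bases:
            counts[kmer] += 1
    return counts


counts = kmer_counts("ACGTACGT", k=3)
print(counts["ACG"], counts["GTA"])
```

Counts like these can be arranged into a fixed-length feature vector per sequence, which is one common way to feed DNA barcodes into a multinomial regression.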

Understanding Kiva Loan Defaults

About As part of the course STA 440: Case Studies, we explored the relationship between geographic factors and both the risk of default and time to default, and predicted the probability of default, using Kiva loan data from 2005 to 2012. By constructing both logistic regression and parametric accelerated failure time models, we provided concrete recommendations to prospective Kiva lenders on how to identify loans with lower risk of default and longer times to default.

Clinical Trial Optimization Tool

About As part of a summer internship with Amgen Inc., I developed an interactive R Shiny interface allowing my colleagues to optimize clinical trial rollout. While most of the data and information are protected by the company, the project is an example of how data science, visualization, and user interface design can promote the future of human health. By integrating country-level data, site-level enrollment rates, cost information, and therapeutic area specifications, the tool was able to develop an optimal geographic “footprint” for a clinical trial.

Duke Basketball R Shiny Database

Introduction This year, Duke had a historic basketball season: it was Coach K’s last, and the team played well throughout the regular season and March Madness. We therefore decided to make our final project all about Duke men’s basketball statistics. Our project scrapes Duke men’s basketball data, covering both the team and individual players, from the goduke.com website. We then created a Shiny app to present this data and allow users to look up individual player stats and access different summary statistic reports.

New York Times Database Tool

About this Project As part of my course STA 323: Statistical Computing, I harnessed the New York Times API to build an interactive database of all their historical articles. Employing R Shiny, I created an interface that allows users to specify a day in history and access all NYT articles from that day’s paper. You can view the project code on GitHub.

USDA Low-Cost Diet Nutrient Dashboard

About this Project Built in collaboration with researchers from Tufts University, Duke University and Penn State University, this interactive tool is part of the research project “From Scarcity to Prosperity: How Nutrition/Cost Tradeoffs Influence Consumer Choices and the Food System.” The data used in this dashboard are sourced from the 2013/2014 National Health and Nutrition Examination Survey (NHANES) and the USDA/ERS Purchase-To-Plate Price Data Tool, and represent the consumption behaviors of women aged 20 to 50 in the US.

Voter Turnout Data Analysis

About this Project In this project, we investigated voter turnout rates over a 14-year period. We used data from IPUMS and the American Statistical Association that were originally collected from surveys conducted by the United States Census Bureau and the Bureau of Labor Statistics [1,2]. The surveys are part of a “Voting and Registration Supplement” of the Current Population Survey that occurs every two years after the November elections [3].