Information Retrieval in Natural Language Processing


Modern Question Answering (QA) systems consist of two components: a retriever and a reader. The retriever narrows the passage search space for answer extraction and therefore bounds the overall accuracy of the QA pipeline. Conventional retrievers consume large amounts of compute, limiting their viability to large corporations or well-funded institutions. In addition, some retrieval methods are prone to overfitting and require an expensive retraining process to handle new document sources.

In this paper, we outline a methodology for building information retrieval systems on a limited budget, performing feature enhancement via transfer learning. Through several ablation studies we demonstrate that existing Dense Passage Retrieval (DPR) approaches are highly sensitive to small changes in the problem domain, and we introduce an approach that potentially improves generalizability, outperforming the existing DPR framework under one ablation. We also highlight a potential data quality issue in a well-cited paper, which may call into question published accuracy metrics and warrant additional review.
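At its core, DPR-style dense retrieval ranks passages by the inner product between a query embedding and pre-computed passage embeddings. A minimal sketch of that scoring step, using toy vectors in place of real encoder outputs (all embedding values here are hypothetical):

```python
import numpy as np

def retrieve_top_k(query_vec, passage_vecs, k=2):
    """Rank passages by inner-product similarity, as in dense retrieval."""
    scores = passage_vecs @ query_vec          # one dot product per passage
    top = np.argsort(scores)[::-1][:k]         # indices of the k highest scores
    return top, scores[top]

# Toy embeddings standing in for encoder outputs (hypothetical values).
query = np.array([0.9, 0.1, 0.0])
passages = np.array([
    [0.8, 0.2, 0.1],   # on-topic
    [0.0, 0.9, 0.4],   # off-topic
    [0.7, 0.0, 0.2],   # on-topic
])
idx, scores = retrieve_top_k(query, passages, k=2)  # ranks the two on-topic passages first
```

In a real system the passage vectors are indexed ahead of time (e.g. with an approximate nearest-neighbor index), so only the query is encoded at question time.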

Read More GitHub Repo

A Box Office Analysis: Factors Influencing Film Profitability


Movie production companies are interested in understanding what factors contribute to a financially successful film. Box office data from all 2019 film releases was collected from Kaggle and enhanced with additional features to address this question while controlling for potential confounders using a hierarchical model.

Unsurprisingly, a film's budget and IMDb score have statistically significant effects on profitability at the 95% level. On average, a $1 increase in budget yields an additional $2.43 in profit, and a one-point increase in IMDb score nets roughly $45 million in additional profit. More surprisingly, an augmented field called 'Title.Sentiment', which captures the positive or negative polarity of a film's title, reveals a roughly $97 million increase in net profit per unit gain in polarity score. It is not always feasible for production companies to create high-budget, critically acclaimed films, but they can adjust the title. Based on this analysis, a low-cost way to drive profitability is to title new releases positively, as audiences seem more inclined to view these films.
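As a back-of-the-envelope illustration, the reported coefficients can be read as a linear prediction of the change in profit. A sketch using the coefficient values from the summary above (the function and the retitling scenario are hypothetical):

```python
# Coefficients as reported in the analysis summary.
BUDGET_COEF = 2.43        # $ of profit per $1 of budget
IMDB_COEF = 45e6          # $ of profit per 1-point IMDb score increase
SENTIMENT_COEF = 97e6     # $ of profit per unit of title polarity

def expected_profit_change(d_budget, d_imdb, d_sentiment):
    """Linear-model prediction of the change in profit for given feature changes."""
    return (BUDGET_COEF * d_budget
            + IMDB_COEF * d_imdb
            + SENTIMENT_COEF * d_sentiment)

# Retitling a film from neutral (0.0) to mildly positive (+0.5) polarity,
# holding budget and IMDb score fixed:
delta = expected_profit_change(0, 0, 0.5)   # 0.5 * $97M = $48.5M
```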

Read More GitHub Repo

Bayesian Phylogenetics


An immune response produces antibodies to remove infectious material from the human body. A "slow" evolutionary process called clonal selection uses natural selection to identify a binding match between antibody and antigen receptors. A vaccine provides an effective antibody for the immune system to copy and adapt, accelerating this process, but finding vaccine candidates is challenging. One identification strategy, phylogenetic tree reconstruction, infers candidate antibodies from an ancestral tree. For reconstructive inference, many software tools rely on Bayesian MCMC methods and assume a simplified antibody mutation process that lacks biological realism.

This analysis uses a simulation study to investigate the reconstructive performance of two established tools under different mutation models. Poor inference accuracy under a misspecified mutation model would warrant additional study of evolutionary modeling; a better understanding of antibody mutation would help researchers develop vaccines faster, more cheaply, and more effectively.

Read More

Standard Error Estimation for Clustered Data


In causal inference, experimental data is often collected from groups of individuals that form clusters. When estimating a statistical quantity from grouped data, statisticians must take the clustering structure into account when performing standard error (SE) estimation. The literature suggests that many conventional SE estimation methods ignore or underestimate grouping, which can cause severe downward bias in SE estimates, leading to incorrect confidence intervals and dramatically altered statistical significance of findings.

This analysis uses Monte Carlo simulations to evaluate the downward-bias claim from the literature under a variety of data-generating conditions, such as different numbers of clusters and observations per cluster. In addition, a new bootstrap-based estimation method using a Gaussian kernel is proposed to counteract the downward bias that conventional methods exhibit under non-zero intra-cluster correlation. Lastly, three estimation techniques were applied to a real-world data set consisting of home prices in Ames, IA, clustered by neighborhood, demonstrating the effect of downward bias on real research.
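The downward-bias phenomenon can be sketched with a small Monte Carlo simulation: generate data with a shared per-cluster effect, compute the naive i.i.d. standard error of the mean, and compare it to the true sampling spread across replications. This is a simplified illustration, not the study's design, and all parameters are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_mean_and_naive_se(n_clusters=20, n_per=25, cluster_sd=1.0, noise_sd=1.0):
    """Draw clustered data (shared cluster effect plus individual noise) and return
    the sample mean and the naive i.i.d. SE that ignores the clustering."""
    cluster_effects = rng.normal(0, cluster_sd, n_clusters).repeat(n_per)
    y = cluster_effects + rng.normal(0, noise_sd, n_clusters * n_per)
    return y.mean(), y.std(ddof=1) / np.sqrt(y.size)

# Monte Carlo: the "true" SE is the spread of the estimate across replications.
means, naive_ses = zip(*(simulate_mean_and_naive_se() for _ in range(2000)))
true_se = np.std(means)
avg_naive_se = np.mean(naive_ses)
# With non-zero intra-cluster correlation, avg_naive_se falls well below true_se,
# so confidence intervals built from the naive SE are too narrow.
```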

Read More GitHub Repo

Cloud Detection Algorithm


Satellite imagery captures large swaths of information with little manual effort and makes it possible to conduct analyses at an unprecedented scale. NASA is interested in leveraging this technology to better understand how glacial ice coverage influences global warming. When sunlight hits an ice-covered surface, a large percentage is "reflected" back into the atmosphere, reducing the amount of heat absorbed by the Earth's surface. More ice means less warming, yet it has historically been difficult to estimate total ice coverage in rugged arctic regions, which previously relied on costly manual surveys.

Images captured from space can be analyzed to identify ice coverage from pixel measurements, but clouds can appear similar to ice when viewed from space. The goal of this analysis is to correctly classify pixels into discrete categories using a statistical learning method developed from labeled training images and a feature-enhanced dataset from a published paper.
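As a toy illustration of this classification setup (not the method used in the analysis), a nearest-centroid classifier assigns each pixel to the class whose mean feature vector it sits closest to; the two features and all values below are hypothetical:

```python
import numpy as np

def fit_centroids(X, y):
    """Nearest-centroid classifier: store one mean feature vector per class."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict(centroids, X):
    """Assign each row of X to the class with the closest centroid."""
    labels = list(centroids)
    dists = np.stack([np.linalg.norm(X - centroids[l], axis=1) for l in labels])
    return np.array(labels)[dists.argmin(axis=0)]

# Hypothetical training pixels with two radiance-derived features;
# label 0 = ice, label 1 = cloud.
X_train = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
y_train = np.array([0, 0, 1, 1])
model = fit_centroids(X_train, y_train)
preds = predict(model, np.array([[0.15, 0.15], [0.85, 0.85]]))  # -> [0, 1]
```

The difficulty in the real problem is that cloud and ice pixels overlap heavily in raw radiance, which is why the analysis relies on engineered features from the published paper rather than raw pixel values.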

Read More

A Redwood Canopy Data Analysis


A retrospective analysis of Tolle et al. (2005), which analyzed the macroscopic climate of a California redwood tree using a wireless sensor network to capture and transmit ecological data. Completed in a first-year graduate statistical learning course focusing on data pre-processing, exploratory data analysis, and basic statistical analysis techniques. The purpose was to replicate the original findings and explore alternate conclusions from the original data.

Read More

2020 F1 Visualization


A visualization, built in R, showing the performance of drivers and teams (constructors) over the 2020 F1 season. It showcases the flexibility and customizability of building a static infographic in R.

Read More GitHub Repo

NFL Big Data Bowl


The NFL wants to reduce the risk of non-contact injury to professional athletes. This analysis statistically evaluates the influence of playing surface on non-contact injury risk using hypothesis testing and logistic regression. Geospatial data was recorded every 0.1 seconds (10 Hz) for players at different positions across 267,000+ plays and modeled in GCP. New metrics quantifying injury risk were developed, and a ResNet-50 computer vision neural network was used to classify movements for a stratified analysis.
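In its simplest form, the surface question comes down to comparing injury odds across the two surface types; the logistic regression coefficient for surface corresponds to the log of that odds ratio. A minimal sketch with made-up counts (not the study's data):

```python
import math

# Hypothetical 2x2 counts: plays with/without a non-contact injury, by surface.
counts = {
    ("synthetic", "injury"): 48, ("synthetic", "no_injury"): 9952,
    ("natural", "injury"): 30, ("natural", "no_injury"): 9970,
}

def odds_ratio(c):
    """Odds of injury on synthetic turf relative to natural turf."""
    a, b = c[("synthetic", "injury")], c[("synthetic", "no_injury")]
    d, e = c[("natural", "injury")], c[("natural", "no_injury")]
    return (a / b) / (d / e)

or_ = odds_ratio(counts)   # > 1 would suggest elevated risk on synthetic turf
log_or = math.log(or_)     # the coefficient a logistic regression would recover
```

The full analysis goes further by adjusting for position and movement type, which a raw odds ratio cannot do.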

Read More

What is a hashing algorithm?


An overview of one of the most widely used cryptographic primitives. In this post, I introduce hash functions, present some common use cases for hashes, and discuss some popular algorithmic implementations.
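As a quick taste, Python's standard library exposes several hash functions through `hashlib`. SHA-256 illustrates two defining properties: the same input always produces the same fixed-length digest, and a tiny change to the input produces a completely different one:

```python
import hashlib

# The same input always hashes to the same 256-bit (64 hex character) digest.
d1 = hashlib.sha256(b"hello world").hexdigest()
d2 = hashlib.sha256(b"hello world").hexdigest()

# A one-character change yields a completely different digest (avalanche effect).
d3 = hashlib.sha256(b"hello world!").hexdigest()
```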

Read More