Senior Data Scientist and NIH-recognized researcher with 15+ years deploying ML solutions in high-stakes environments from healthcare to corrections. I help organizations make critical decisions faster through production NLP systems and privacy-compliant analytics, with expertise in translational clinical research that led to breakthrough electrophysiological biomarkers for autism spectrum disorders. At Washington DOC, I built ML pipelines that improved incident classification performance 30-fold over baseline while managing complex multi-table ETL across restricted datasets. My expertise spans end-to-end model development, A/B testing frameworks, and leading cross-functional teams to deliver executive-grade insights under strict regulatory constraints.
Deployed a multi-label text classification pipeline processing 110,928 incident records across 73 categories in production. Built a scikit-learn pipeline with TF-IDF features (15,000-term vocabulary, unigrams and bigrams) and a MultiOutputClassifier, achieving 75.6% micro-F1 and 88.3% precision, a 30-fold improvement over baseline. Integrated a human-in-the-loop review system for continuous model refinement and maintained deployment stability.
Engineered end-to-end Python pipeline processing 123,291 administrative records across 39 counties and ten-table schemas. Developed composite county cooperation algorithm using weighted metrics and applied chi-square testing to distinguish true data gaps (0.8%) from apparent missingness (45%). Automated identification of 4,107 high-complexity cases and delivered interactive Quarto reports with GeoPandas mapping and NetworkX network analysis.
Co-developed MNE-BIDS, a Python package published in the Journal of Open Source Software (DOI 10.21105/joss.01896), standardizing electrophysiological data and enabling reproducible pipelines. Collaborated with over 15 international contributors and secured NIH, NIMH, and ERC funding. Reduced data preparation time from hours to minutes and ensured robust testing across Windows, macOS, and Linux.
Developed a Python-based ML pipeline for analyzing speech-related neural responses in children (N=42, ages 7–12). Applied machine learning methods including neural networks and dimensionality reduction (PCA). Performed 3D statistical modeling of dense-array time-series data; published in Brain and Language.
Developed an ML classification system using nonparametric linear mixed modeling to analyze speech discrimination patterns in children with autism spectrum disorders (N=51, ages 6–15). Achieved clinically relevant diagnostic accuracy (AUC 0.86) for language impairment detection. Published in Biological Psychiatry.
ML & Statistics: Python, scikit-learn, pandas, NumPy, SciPy, PyTorch, TensorFlow, logistic regression, SVM, random forest, linear mixed effects, ROC/AUC, causal inference, A/B testing, chi-square, ANOVA, Bayesian statistics, effect sizes
Speech & Language Processing: Digital signal processing, audio signal analysis, NLP, TF-IDF vectorization, text classification, time-series analysis, spectral-temporal analysis, multi-sensor data fusion
Healthcare Data: APCD (All-Payer Claims Database), insurance claims analysis, population health cohort design, clinical biomarker validation, IRB protocols, privacy-constrained analytics
Data Engineering: SQL (T-SQL, Oracle), ETL pipeline design, Power BI (semantic models, DAX), Power Automate, Azure DevOps, Git, Quarto, R Markdown
Languages: Python, R (ggplot2, lme4, Tidyverse), MATLAB, SQL, Shell (Bash/Zsh), LaTeX
Platforms: Linux, macOS, Windows, Azure DevOps, Git, GitHub, VS Code
Domain: Neuroscience, pediatric clinical research, correctional health, Medicare Advantage / managed care analytics, geospatial analytics