Research papers and ongoing investigations
Peer-reviewed work, work-in-progress papers, and research-grade builds. Status reflects publication or interim findings.
Currently working on
A reliability-first reinforcement learning framework for small language models in code generation. Designed for embedded and resource-limited settings where models must run on a single GPU or CPU and binary test rewards are too sparse to train reliably.
- Targets small language models (≤1.5B parameters) for embedded and on-device toolchains. Binary rewards collapse at this scale because near-miss generations all receive the same zero reward, so PPO regresses below the SFT baseline on a 1.3B model.
- Introduces a partial-credit functional reward that grades staged progress: syntax valid, runs without crash, produces output, then proportional credit for tests passed. This makes near-miss generations distinguishable to PPO. The combined objective adds a Bandit static-analysis term as a safety guardrail, with weights R = 0.6·R_func + 0.4·R_sec.
- Found that a binary-to-partial-credit curriculum outperforms partial-credit training from scratch. PPO-continue, which warm-starts from a binary-reward checkpoint, lifted syntax validity to 63% and ≥1-test-pass to 9% on stdin-style APPS+. Partial-credit training from scratch did not beat the binary baseline by a statistically significant margin.
- Built on DeepSeek-Coder-1.3B with LoRA, 6.3M trainable parameters, trained on a single NVIDIA V100 16 GB. Evaluation on 100 held-out APPS+ prompts under a strict sandboxed judge.
- Accepted as a work-in-progress paper at the 27th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES 2026), co-located with PLDI 2026 in Boulder, Colorado.
- Won the John Sr. and Kimlyn Patishnock Undergraduate Research Award at the Penn State Undergraduate Exhibition for Excellence in Information Literacy. Indexed on EmergentMind as a canonical reference for partial-credit functional reward in code generation.
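The staged reward described above can be sketched as follows. The stage floor values (0.1/0.2/0.3) and function names are illustrative assumptions, not the paper's exact values; only the 0.6/0.4 combination weights come from the stated objective.

```python
def functional_reward(syntax_ok, ran, produced_output, tests_passed, tests_total):
    """Staged partial-credit reward: each milestone raises the reward floor,
    then tests passed earn proportional credit on top.
    Floor values 0.1/0.2/0.3 are illustrative, not the paper's."""
    if not syntax_ok:
        return 0.0
    reward = 0.1                      # parses
    if ran:
        reward = 0.2                  # executed without crashing
    if produced_output:
        reward = 0.3                  # emitted output
    if tests_total > 0 and tests_passed > 0:
        reward = 0.3 + 0.7 * (tests_passed / tests_total)
    return reward

def combined_reward(r_func, r_sec):
    # Weights from the stated objective: R = 0.6*R_func + 0.4*R_sec,
    # where r_sec is the Bandit static-analysis safety score.
    return 0.6 * r_func + 0.4 * r_sec
```

The point of the staging is visible in the shape: a generation that compiles but fails every test scores above zero, so PPO can tell near-misses apart instead of seeing a flat zero landscape.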
Training and calibrating traffic forecasters so predictions minimize operational cost, such as SLA violations and over-provisioning, rather than statistical error. Evaluated end-to-end on real backbone traffic with multi-seed paired comparisons against modern baselines.
- Operator-cost-aware evaluation framework. A 2D evaluation matrix over training cost ratio and operator cost ratio, with bootstrap confidence intervals and 5-seed paired comparisons against MSE. This is the methodological scaffolding that makes every downstream finding legible.
- Cusp-linear (L1) asymmetric loss dominates squared asymmetric loss across the entire operator-cost matrix. Established empirically with tighter confidence intervals, larger cost reductions, and a saturation point at α=20 that squared loss does not exhibit. Connects to and empirically validates the Eramo 2020 argument.
- Match-your-loss-to-your-cost principle. Under L1, optimal training α tracks operator α along the diagonal. Under squared loss, the diagonal stretches. Explained by loss-consistency theory from Gneiting 2011.
- Conformal calibration for distribution-free coverage. Conformalized quantile regression and adaptive conformal inference (DtACI) layered on top of the asymmetric-loss-trained forecaster, giving finite-sample coverage guarantees on the provisioned interval.
- Pareto frontier of training ratios. Different operator cost structures trace different optimal training points. No single forecaster is best for all operators.
- Real backbone evaluation. Currently Abilene, with GÉANT and CESNET-TimeSeries24 in the pipeline. Synthetic 12-node topology kept only as a sanity check. Baselines include MSE, pinball/quantile loss, LSTM, and a deeper bench (PatchTST, iTransformer, DCRNN, Chronos) as the experiment matrix expands.
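A minimal sketch of the two loss families being compared, assuming `alpha` denotes the operator cost ratio and under-prediction (an SLA violation) is the expensive direction; the project's exact formulations may differ.

```python
import numpy as np

def asymmetric_l1(y_true, y_pred, alpha):
    """Cusp-linear asymmetric loss: under-prediction costs alpha times
    more per unit than over-prediction (over-provisioning)."""
    err = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(np.where(err > 0, alpha * err, -err)))

def asymmetric_l2(y_true, y_pred, alpha):
    """Squared counterpart: smoothing away the cusp at zero changes
    how training alpha maps to operator alpha."""
    err = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(np.where(err > 0, alpha * err**2, err**2)))
```

Under the L1 form, the minimizer of expected loss is the α/(α+1) quantile of the traffic distribution, which is one way to read the match-your-loss-to-your-cost diagonal through Gneiting's consistency framework; under the squared form no such clean quantile correspondence holds.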
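The calibration layer can be illustrated with a simplified split-conformal sketch for a one-sided provisioning bound; CQR and DtACI refine this idea with quantile forecasters and online adaptation. Function and variable names here are illustrative assumptions.

```python
import numpy as np

def conformal_upper_bound(cal_pred, cal_true, test_pred, coverage=0.9):
    """Split-conformal correction for a one-sided provisioning bound:
    inflate the forecaster's prediction by the finite-sample-adjusted
    quantile of calibration residuals, giving distribution-free coverage."""
    scores = np.asarray(cal_true) - np.asarray(cal_pred)  # >0 = under-provisioned
    n = len(scores)
    k = int(np.ceil((n + 1) * coverage))      # finite-sample-adjusted rank
    qhat = np.sort(scores)[min(k, n) - 1]
    return test_pred + qhat
```

The guarantee is marginal and distribution-free: provided calibration and test traffic are exchangeable, the bound covers the true demand with probability at least `coverage`, regardless of which asymmetric loss trained the underlying forecaster.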
Selected past work
Using LLM explanations, not just labels, as a training signal to detect and fix Java performance bugs. Peer-reviewed at the 7th IEEE International Conference on Artificial Intelligence Testing (AITest 2025), which had a 31.6% acceptance rate.
- The deeper idea: training an LLM to explain a performance bug, not just classify it, produces a stronger detection signal than label-only fine-tuning.
- Curated a dataset of 490 performance bugs across 17 Defects4J projects, organized into a 5-category taxonomy: algorithmic, memory, CPU, redundant computation, and I/O. Standard 80/20 train/test split.
- Fine-tuned GPT-4o-mini to produce explanations alongside predictions. Detection accuracy rose from 67.3% to 83.7% after fine-tuning, and F1 from 64.6% to 82.3%.
- Released the full reproduction stack: extraction scripts for Defects4J, fine-tuning scripts, evaluation harness, and a CLI for running the detector on arbitrary Java files.
- Shipped an interactive project site and conference presentation alongside the code.
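The explanation-as-target idea above can be sketched as a fine-tuning record builder, assuming OpenAI-style chat-format JSONL; the prompt wording and labels are illustrative, not the paper's actual templates.

```python
import json

def make_training_record(code_snippet, label, explanation):
    """Build one chat-format fine-tuning example. The key idea: the
    assistant target carries the explanation, not just the label, so the
    model is trained to articulate *why* the code is slow."""
    return {
        "messages": [
            {"role": "system",
             "content": "You detect Java performance bugs and explain them."},
            {"role": "user",
             "content": f"Does this Java code contain a performance bug?\n{code_snippet}"},
            {"role": "assistant",
             "content": f"Label: {label}\nExplanation: {explanation}"},
        ]
    }

# Hypothetical example in the redundant-computation taxonomy category:
record = make_training_record(
    "for (String s : list) result += s;",
    "performance-bug (redundant computation)",
    "String concatenation in a loop re-copies the accumulator each "
    "iteration; use StringBuilder instead.")
```

One record per bug, serialized with `json.dumps` per line, yields the JSONL file the fine-tuning API consumes; label-only fine-tuning would simply drop the `Explanation:` portion of the target.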