Research

Papers and ongoing investigations

Peer-reviewed work, work-in-progress papers, and research-grade builds. Status reflects publication or interim findings.

Currently working on

SecureCodeRL

WIP — Accepted

LCTES 2026 (WIP) · Patishnock Award Winner

A reliability-first reinforcement learning framework for small language models in code generation. Designed for embedded and resource-limited settings where models must run on a single GPU or CPU and where binary test rewards are too sparse to train against reliably.

  • Targets small language models (≤1.5B parameters) for embedded and on-device toolchains. Binary rewards collapse at this scale because near-miss generations receive the same zero reward as total failures, so PPO regresses below the SFT baseline on a 1.3B model.
  • Introduces a partial-credit functional reward that grades staged progress: syntax validity, crash-free execution, produced output, and proportional credit for tests passed. This makes near-miss generations distinguishable to PPO. The combined objective adds a Bandit static-analysis term as a safety guardrail, weighted R = 0.6·R_func + 0.4·R_sec (a sketch of this staged reward follows the list).
  • Found that a binary-to-partial-credit curriculum outperforms partial-credit training from scratch. PPO-continue, which warm-starts from a binary-reward checkpoint, lifted syntax validity to 63% and ≥1-test-pass to 9% on stdin-style APPS+. Partial-credit training from scratch did not separate from the binary baseline beyond confidence-interval overlap.
  • Built on DeepSeek-Coder-1.3B with LoRA (6.3M trainable parameters), trained on a single NVIDIA V100 16 GB. Evaluated on 100 held-out APPS+ prompts under a strict sandboxed judge.
  • Accepted as a WIP paper at the 27th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES 2026), co-located with PLDI 2026 in Boulder, Colorado.
  • Won the John Sr. and Kimlyn Patishnock Undergraduate Research Award for Excellence in Information Literacy at the Penn State Undergraduate Exhibition. Indexed on EmergentMind as a canonical reference for partial-credit functional rewards in code generation.
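
A minimal sketch of the staged reward, under stated assumptions: the per-stage weights (0.25 each), the RunResult shape, and the sandbox plumbing are illustrative placeholders, not the paper's exact values; only the staging order and the 0.6/0.4 blend come from the description above.

    import ast
    from dataclasses import dataclass

    @dataclass
    class RunResult:
        # Assumed output of the sandboxed judge (hypothetical interface).
        ran: bool                # executed without crashing
        produced_output: bool    # wrote anything to stdout
        tests_passed: int        # count of passing test cases

    def is_valid_syntax(code: str) -> bool:
        # Stage 1 gate: does the candidate parse at all?
        try:
            ast.parse(code)
            return True
        except SyntaxError:
            return False

    def functional_reward(code: str, result: RunResult, n_tests: int) -> float:
        # Each stage a candidate clears adds credit, so near-misses are
        # distinguishable from total failures instead of all scoring zero.
        reward = 0.0
        if not is_valid_syntax(code):
            return reward
        reward += 0.25                      # stage: syntax valid
        if not result.ran:
            return reward
        reward += 0.25                      # stage: runs without crashing
        if result.produced_output:
            reward += 0.25                  # stage: produced output
        reward += 0.25 * result.tests_passed / n_tests   # proportional test credit
        return reward

    def combined_reward(code: str, result: RunResult, n_tests: int,
                        security_score: float) -> float:
        # Blend from the paper: R = 0.6 * R_func + 0.4 * R_sec, where the
        # security term is derived from Bandit static-analysis findings.
        return 0.6 * functional_reward(code, result, n_tests) + 0.4 * security_score

A candidate that parses and runs but passes no tests now earns nonzero reward, which is exactly the gradient signal that binary rewards destroy.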

ML for Network Traffic Prediction and Capacity Planning

WIP

Finalizing methodology

Training and calibrating traffic forecasters so that predictions minimize operational cost, such as SLA violations and over-provisioning, rather than statistical error. Evaluated end-to-end on real backbone traffic with multi-seed paired comparisons against modern baselines.

  • Operator-cost-aware evaluation framework. A 2D evaluation matrix over training cost ratio and operator cost ratio, with bootstrap confidence intervals and 5-seed paired comparisons against MSE. This is the methodological scaffolding that makes every downstream finding legible.
  • Cusp-linear (L1) asymmetric loss dominates squared asymmetric loss across the entire operator-cost matrix. Established empirically with tighter confidence intervals, deeper wins, and a saturation point at α=20 that squared loss does not exhibit. Connects to and empirically validates the Eramo 2020 argument (see the loss sketch after this list).
  • Match-your-loss-to-your-cost principle. Under L1, optimal training α tracks operator α along the diagonal. Under squared loss, the diagonal stretches. Explained by loss-consistency theory from Gneiting 2011.
  • Conformal calibration for distribution-free coverage. Conformalized quantile regression and adaptive conformal inference (DtACI) layered on top of the asymmetric-loss-trained forecaster, giving finite-sample coverage guarantees on the provisioned interval (see the calibration sketch after this list).
  • Pareto frontier of training ratios. Different operator cost structures trace different optimal training points. No single forecaster is best for all operators.
  • Real backbone evaluation. Currently Abilene, with GÉANT and CESNET-TimeSeries24 in the pipeline. Synthetic 12-node topology kept only as a sanity check. Baselines include MSE, pinball/quantile loss, LSTM, and a deeper bench (PatchTST, iTransformer, DCRNN, Chronos) as the experiment matrix expands.
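
A minimal sketch of the two loss families, under an assumed convention: α is the relative cost of under-provisioning (under-prediction) versus over-provisioning, which matches the operator-cost framing above but is my labeling, not necessarily the paper's.

    import numpy as np

    def asymmetric_l1(y_true, y_pred, alpha):
        # Cusp-linear loss: under-predictions cost alpha times more than
        # over-predictions; alpha = 1 recovers plain MAE. Minimizing this
        # targets the alpha/(1+alpha) quantile (Gneiting 2011), which is
        # why optimal training alpha can track operator alpha on the diagonal.
        err = np.asarray(y_true) - np.asarray(y_pred)
        return np.mean(np.where(err > 0, alpha * err, -err))

    def asymmetric_l2(y_true, y_pred, alpha):
        # Squared analogue: the same asymmetry applied to squared error,
        # so large errors dominate and the loss-to-cost mapping stretches.
        err = np.asarray(y_true) - np.asarray(y_pred)
        return np.mean(np.where(err > 0, alpha * err**2, err**2))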
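
And a minimal split-conformal sketch for the coverage layer, using a one-sided variant since provisioning cares about an upper bound; this is the generic recipe under exchangeability, not necessarily the exact CQR/DtACI procedure in the paper.

    import numpy as np

    def conformal_provisioning_bound(pred_cal, y_cal, pred_test, coverage=0.95):
        # Split conformal, one-sided: inflate the forecaster's output so
        # that P(y_test <= bound) >= coverage on exchangeable data, with
        # no distributional assumptions on traffic or on model errors.
        scores = np.asarray(y_cal) - np.asarray(pred_cal)  # under-provisioning amounts
        n = len(scores)
        k = int(np.ceil((n + 1) * coverage))               # finite-sample rank correction
        if k > n:
            # Calibration set too small for the requested coverage level.
            return np.full_like(np.asarray(pred_test, dtype=float), np.inf)
        q = np.sort(scores)[k - 1]                         # conformal quantile of scores
        return np.asarray(pred_test) + q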

Selected past work

Fixing Performance Bugs Through LLM Explanations

Published

IEEE AITest 2025

Using LLM explanations, not just labels, as a training signal to detect and fix Java performance bugs. Peer-reviewed at the 7th IEEE International Conference on Artificial Intelligence Testing (AITest 2025) with a 31.6% acceptance rate.

  • The deeper idea: training an LLM to explain a performance bug, not just classify it, produces a stronger detection signal than label-only fine-tuning.
  • Curated a dataset of 490 performance bugs across 17 Defects4J projects, organized into a 5-category taxonomy: algorithmic, memory, CPU, redundant computation, and I/O. Standard 80/20 train/test split.
  • Fine-tuned GPT-4o-mini to produce explanations alongside predictions. Detection accuracy rose from a 67.3% baseline to 83.7% after fine-tuning, and F1 from 64.6% to 82.3% (a sketch of the training-record format follows this list).
  • Released the full reproduction stack: extraction scripts for Defects4J, fine-tuning scripts, evaluation harness, and a CLI for running the detector on arbitrary Java files.
  • Shipped an interactive project site and conference presentation alongside the code.
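
A minimal sketch of what an explanation-bearing fine-tuning record could look like in the standard OpenAI chat JSONL format; the prompt wording, label layout, and example snippet are illustrative, not the paper's exact prompts.

    import json

    def to_training_record(java_snippet: str, category: str, explanation: str) -> dict:
        # The assistant target carries the explanation alongside the label,
        # so the model learns to justify the classification, not just emit it.
        return {
            "messages": [
                {"role": "system",
                 "content": "You detect Java performance bugs and explain them."},
                {"role": "user",
                 "content": "Classify this Java snippet:\n" + java_snippet},
                {"role": "assistant",
                 "content": "Category: " + category + "\nExplanation: " + explanation},
            ]
        }

    record = to_training_record(
        "for (String s : items) { result = result + s; }",
        "redundant computation",
        "String concatenation inside the loop allocates a new String per "
        "iteration (O(n^2) total work); accumulate with StringBuilder instead.",
    )
    print(json.dumps(record))   # one JSONL line per training example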