Fixing performance bugs through LLM explanations

The idea

Most LLM-based bug detectors are trained as classifiers: input code, output a label. The deeper question is whether the explanation the model would produce is itself a useful training signal. If you fine-tune a model to produce a natural-language explanation alongside the prediction, does the explanation lift detection accuracy, or does it just look nicer?

For Java performance bugs the answer is yes, by a meaningful margin.

The dataset

The dataset is the load-bearing contribution. 490 performance bugs extracted from 17 Defects4J projects, organized into a 5-category taxonomy:

Algorithmic
Memory
CPU
Redundant computation
I/O

Standard 80/20 train/test split (392 / 98). The extraction scripts and the full reproduction stack are public.

The result

Fine-tuned GPT-4o-mini against the dataset to produce explanations alongside predictions:

Metric	Baseline	After fine-tuning
Detection accuracy	67.3%	83.7%
F1	64.6%	82.3%

The lift is consistent across the five categories. The explanation isn’t decoration; it appears to act as a richer training target than the label alone.

What’s in the repo

Dataset with per-category splits
Extraction scripts for Defects4J
Categorization pipeline
Explanation-generation pipeline
Fine-tuning code for GPT-4o-mini
Evaluation harness with F1, precision, recall per category
Performance-validation tools (run the fix, measure whether it actually improves performance)
Project website and conference presentation

Published at

IEEE AITest 2025 (7th IEEE International Conference on Artificial Intelligence Testing). 31.6% acceptance rate.

Links

This was peer-reviewed work with my advisor Suman Saha. The repo has the most complete documentation of any of my projects; if you want depth, start there.