Fixing performance bugs through LLM explanations

By Suryansh Sijwali · · 2 min read · Research, LLM, Software Engineering

The idea

Most LLM-based bug detectors are trained as classifiers: input code, output a label. The deeper question is whether the explanation the model would produce is itself a useful training signal. If you fine-tune a model to produce a natural-language explanation alongside the prediction, does the explanation lift detection accuracy, or does it just look nicer?

For Java performance bugs the answer is yes, by a meaningful margin.

The dataset

The dataset is the load-bearing contribution. 490 performance bugs extracted from 17 Defects4J projects, organized into a 5-category taxonomy:

  • Algorithmic
  • Memory
  • CPU
  • Redundant computation
  • I/O

Standard 80/20 train/test split (392 / 98). The extraction scripts and the full reproduction stack are public.

The result

Fine-tuned GPT-4o-mini against the dataset to produce explanations alongside predictions:

MetricBaselineAfter fine-tuning
Detection accuracy67.3%83.7%
F164.6%82.3%

The lift is consistent across the five categories. The explanation isn’t decoration; it appears to act as a richer training target than the label alone.

What’s in the repo

  • Dataset with per-category splits
  • Extraction scripts for Defects4J
  • Categorization pipeline
  • Explanation-generation pipeline
  • Fine-tuning code for GPT-4o-mini
  • Evaluation harness with F1, precision, recall per category
  • Performance-validation tools (run the fix, measure whether it actually improves performance)
  • Project website and conference presentation

Published at

IEEE AITest 2025 (7th IEEE International Conference on Artificial Intelligence Testing). 31.6% acceptance rate.

This was peer-reviewed work with my advisor Suman Saha. The repo has the most complete documentation of any of my projects; if you want depth, start there.