Fixing Performance Bugs Through LLM Explanations

Leveraging Large Language Models for Automated Performance Bug Detection and Resolution

Suryansh Singh Sijwali, Angela Marie Colom, Anbi Guo, Suman Saha

Motivation

Why Performance Bugs Matter

The Challenge

Traditional tools (static analyzers, profilers) are limited to known patterns and struggle with complex code.

2 / 16

Research Objectives

šŸŽÆ
Detect Performance Bugs
šŸ”§
Generate Fixes
šŸ“
Explain Decisions

Our Approach

Fine-tune a large language model to not only fix performance bugs but also generate human-readable explanations that developers can understand.

3 / 16

Dataset Collection & Processing

Dataset Statistics

  • 490 performance bugs from Defects4J
  • 17 open-source Java projects
  • 854 real bugs analyzed
  • Comprehensive filtering and manual verification

Bug Categories

4 / 16

Performance Bug Classification

Category                 | Count | Percentage | Common Patterns
-------------------------|-------|------------|-------------------------------------
Algorithmic Inefficiency |  165  | 33.7%      | Nested loops, wrong data structures
Memory Usage             |  116  | 23.7%      | Memory leaks, large allocations
CPU Overhead             |   98  | 20.2%      | Redundant computations
Redundant Computation    |   54  | 11.0%      | Repeated calculations
I/O Inefficiency         |   56  | 11.4%      | Excessive file operations
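The most frequent category, algorithmic inefficiency, typically looks like the nested-loop pattern named in the table. A minimal illustrative sketch (not taken from the dataset), contrasting an O(n²) scan with the O(n) fix that uses the right data structure:

```java
import java.util.HashSet;
import java.util.Set;

class DuplicateCheck {
    // Buggy pattern: O(n^2) nested scan, the classic algorithmic inefficiency
    static boolean hasDuplicateSlow(int[] values) {
        for (int i = 0; i < values.length; i++) {
            for (int j = i + 1; j < values.length; j++) {
                if (values[i] == values[j]) return true;
            }
        }
        return false;
    }

    // Fixed pattern: O(n) using a HashSet, the data structure the slow version lacked
    static boolean hasDuplicateFast(int[] values) {
        Set<Integer> seen = new HashSet<>();
        for (int v : values) {
            if (!seen.add(v)) return true; // add() returns false if v was already present
        }
        return false;
    }
}
```

Both methods return the same answers; only the asymptotic cost differs.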
5 / 16

Methodology Overview

Our Approach

šŸ“Š
Collect & Process
490 Performance Bugs
šŸ”§
Fine-tune GPT-4o-mini
with Context Signals
āœ…
Evaluate Detection
& Explanations

Key Innovation

Using multiple contextual signals (code diffs, comments, bug reports) to improve both accuracy and interpretability

6 / 16

Fine-tuning Process

Model Configuration

  • Base model: GPT-4o-mini
  • Training/Testing: 80/20 split
  • 392 bugs for training
  • 98 bugs for testing

Input-Output Format:

  • Input: Buggy code
  • Output: Fixed code + explanation + metadata

Contextual Signals Used

  • Code changes (diffs)
  • Developer comments
  • Bug reports
  • Performance metadata
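As a hypothetical sketch of how these contextual signals could be concatenated into a single fine-tuning input (the section headers and layout here are assumptions for illustration, not the paper's exact format):

```java
class PromptBuilder {
    // Illustrative only: combines the four contextual signals listed above
    // into one training input string for the fine-tuned model.
    static String buildInput(String buggyCode, String diff,
                             String devComment, String bugReport) {
        return new StringBuilder()
                .append("### Buggy code\n").append(buggyCode).append('\n')
                .append("### Code diff\n").append(diff).append('\n')
                .append("### Developer comment\n").append(devComment).append('\n')
                .append("### Bug report\n").append(bugReport).append('\n')
                .toString();
    }
}
```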

Key Innovation

Multiple signals improve both accuracy & interpretability

7 / 16

Evaluation Framework

šŸŽÆ
Bug Caught %

Bugs correctly detected out of total

šŸ“Š
Report Match %

Generated explanations matching the actual issue (weighted score ≥ 0.75)

šŸ”
F1 Score

Harmonic mean of precision & recall
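The F1 score above can be computed directly from precision and recall; a minimal sketch (the example values are illustrative, not the paper's results):

```java
class F1 {
    // F1 is the harmonic mean of precision and recall:
    // F1 = 2 * P * R / (P + R), with F1 = 0 when both are zero.
    static double f1(double precision, double recall) {
        if (precision + recall == 0.0) return 0.0;
        return 2.0 * precision * recall / (precision + recall);
    }
}
```

For example, precision 0.80 and recall 0.90 give F1 ≈ 0.847, below the arithmetic mean of 0.85: the harmonic mean penalizes imbalance between the two.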

Explanation Quality Criteria (Weighted)

8 / 16

Results: Overall Performance

83.7%
Overall Bug
Detection Rate
90.2%
Report Match
Rate
85%
Overall
F1 Score

Per-Category Performance Highlights

9 / 16

Fine-tuning Impact: Base vs. Our Model

Metric    | Base GPT-4o-mini | Fine-tuned Model | Improvement
----------|------------------|------------------|------------
Accuracy  | 67.3%            | 83.7%            | +16.4%
Precision | 65.1%            | 83.0%            | +17.9%
Recall    | 64.2%            | 81.8%            | +17.6%
F1 Score  | 64.6%            | 82.3%            | +17.7%

Key Insights

  • Consistent improvement across all metrics: fine-tuning with contextual signals yields gains of 16-18 percentage points
  • Balanced performance: similar improvements in precision and recall indicate robust learning
  • Task-specific adaptation: domain knowledge significantly outperforms the general-purpose model
10 / 16

Performance Analysis

Strengths āœ…

  • Excellent at algorithmic inefficiencies (91%)
  • Strong on memory bugs (87%)
  • Clear, actionable explanations
  • Handles complex patterns

Limitations āš ļø

  • I/O detection: 82% recall
  • CPU/Memory confusion
  • Large files (>290 lines)

Case Study: Bug ID 25 (Collections)

Algorithmic Inefficiency in AbstractHashedMap

// Buggy: unnecessary full-table scan makes removal O(n) per call
protected void removeEntry(...) {
    // Step 1: normal O(1) unlink from the bucket chain
    if (previous == null) {
        data[hashIndex] = entry.next;
    } else {
        previous.next = entry.next;
    }

    // Step 2: problematic full scan over the whole table
    for (HashEntry element : data) {
        // redundantly searches the entire table for the already-removed entry
    }
}
Issue: the entry is removed twice: once efficiently (O(1)), then via a scan of the entire table (O(n))
Fix: remove the redundant full-table scan, keeping only the O(1) unlink
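A sketch of what the fixed removeEntry could look like, with HashEntry and the surrounding map heavily simplified for illustration (the field names and class shape here are assumptions, not the exact Commons Collections source):

```java
class FixedMap {
    // Minimal stand-in for AbstractHashedMap's internal entry type
    static class HashEntry {
        Object key;
        HashEntry next;
        HashEntry(Object key, HashEntry next) { this.key = key; this.next = next; }
    }

    HashEntry[] data = new HashEntry[16];

    // Fixed removeEntry: only the O(1) unlink remains; the redundant
    // full-table scan from the buggy version is deleted outright.
    void removeEntry(HashEntry entry, int hashIndex, HashEntry previous) {
        if (previous == null) {
            data[hashIndex] = entry.next;   // entry was the bucket head
        } else {
            previous.next = entry.next;     // bypass entry in the chain
        }
        // No further work needed: the entry is already unlinked.
    }
}
```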
12 / 16

Classification Analysis

Confusion Matrix Results

Actual \ Predicted | Algorithmic | Memory | Redundant | CPU | I/O
-------------------|-------------|--------|-----------|-----|----
Algorithmic        | 30          | 1      | 1         | 1   | 0
Memory             | 2           | 19     | 0         | 2   | 0
Redundant          | 0           | 1      | 9         | 1   | 0
CPU Overhead       | 2           | 1      | 1         | 16  | 0
I/O Inefficiency   | 1           | 0      | 1         | 1   | 8
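Summing the diagonal of this matrix gives 82 correct classifications out of 98 test bugs, i.e. about 83.7%, consistent with the overall detection rate reported earlier. A small sketch of that computation:

```java
class MatrixAccuracy {
    // Overall accuracy from a confusion matrix: diagonal hits / all predictions
    static double accuracy(int[][] m) {
        int correct = 0, total = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                total += m[i][j];
                if (i == j) correct += m[i][j];
            }
        }
        return (double) correct / total;
    }

    // The confusion matrix from the slide, rows = actual, columns = predicted
    static final int[][] CONFUSION = {
        {30, 1, 1, 1, 0},   // Algorithmic
        { 2, 19, 0, 2, 0},  // Memory
        { 0, 1, 9, 1, 0},   // Redundant
        { 2, 1, 1, 16, 0},  // CPU Overhead
        { 1, 0, 1, 1, 8},   // I/O Inefficiency
    };
}
```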
11 / 16

Key Contributions

1. Novel Fine-tuning Approach

First to use multiple contextual signals (diffs, comments, bug reports) for performance bug LLMs

2. Curated Dataset (490 bugs, 17 projects)

Largest public Java performance bug dataset with 5 categories and detailed metadata

3. Interpretable Detection & Fixing

83.7% detection rate with a 90.2% report-match rate, bridging automated tools and developer understanding

⚔ Impact: Enables faster debugging and better code quality in production systems

13 / 16

Future Work & Dataset Release

Next Steps

  • Expand beyond Defects4J
  • Multi-language support
  • Real-time IDE integration
  • Cross-validation studies
  • More I/O bug examples

Limitations

  • Java-only currently
  • Large file challenges

Public Release šŸ”“

490
Labeled Performance Bugs

Includes:

  • Complete bug metadata
  • Code diffs and fixes
  • Evaluation scripts
  • Fine-tuning pipeline
14 / 16

Thank You!

Questions & Discussion

šŸ”— Resources

GitHub:
github.com/SuryanshSS1011/Performance-Bugs-LLM

Website:
suryanshss1011.github.io/Performance-Bugs-LLM

šŸ“§ Contact

Primary: sss6371@psu.edu
Suryansh Singh Sijwali

Research Team: A.M. Colom, A. Guo, S. Saha
Feel free to reach out to any team member

15 / 16