Fixing Performance Bugs Through LLM Explanations

Leveraging Large Language Models for Automated Performance Bug Detection and Resolution

Suryansh Singh Sijwali, Angela Marie Colom, Anbi Guo, Suman Saha

Motivation

Why Performance Bugs Matter

The Challenge

Traditional tools (static analyzers, profilers) are limited to known patterns and struggle with complex code.

2 / 16

Research Objectives

šŸŽÆ
Detect Performance Bugs
šŸ”§
Generate Fixes
šŸ“
Explain Decisions

Our Approach

Fine-tune a large language model to not only fix performance bugs but also generate human-readable explanations that developers can understand.

3 / 16

Dataset Collection & Processing

Dataset Statistics

  • 490 performance bugs from Defects4J
  • 17 open-source Java projects
  • 854 real bugs analyzed
  • Comprehensive filtering and manual verification

Bug Categories

4 / 16

Performance Bug Classification

Category                 | Count | Percentage | Common Patterns
-------------------------|-------|------------|-------------------------------------
Algorithmic Inefficiency |  165  | 33.7%      | Nested loops, wrong data structures
Memory Usage             |  116  | 23.7%      | Memory leaks, large allocations
CPU Overhead             |   98  | 20.2%      | Redundant computations
Redundant Computation    |   54  | 11.0%      | Repeated calculations
I/O Inefficiency         |   56  | 11.4%      | Excessive file operations
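The most frequent category, algorithmic inefficiency, typically looks like the nested-loop pattern named in the table. A minimal illustrative sketch (not taken from the dataset), contrasting an O(n²) scan with the O(n) fix that uses the right data structure:

```java
import java.util.HashSet;
import java.util.Set;

class DuplicateCheck {
    // Buggy pattern: O(n^2) nested scan, the classic algorithmic inefficiency
    static boolean hasDuplicateSlow(int[] values) {
        for (int i = 0; i < values.length; i++) {
            for (int j = i + 1; j < values.length; j++) {
                if (values[i] == values[j]) return true;
            }
        }
        return false;
    }

    // Fixed pattern: O(n) using a HashSet, the data structure the slow version lacked
    static boolean hasDuplicateFast(int[] values) {
        Set<Integer> seen = new HashSet<>();
        for (int v : values) {
            if (!seen.add(v)) return true; // add() returns false if v was already present
        }
        return false;
    }
}
```

Both methods return the same answers; only the asymptotic cost differs.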
5 / 16

Methodology Overview

Our Approach

šŸ“Š
Collect & Process
490 Performance Bugs
šŸ”§
Fine-tune GPT-4o-mini
with Context Signals
āœ…
Evaluate Detection
& Explanations

Key Innovation

Using multiple contextual signals (code diffs, comments, bug reports) to improve both accuracy and interpretability

6 / 16

Fine-tuning Process

Model Configuration

  • Base model: GPT-4o-mini
  • Training/Testing: 80/20 split
  • 392 bugs for training
  • 98 bugs for testing

Input-Output Format:

  • Input: Buggy code
  • Output: Fixed code + explanation + metadata

Contextual Signals Used

  • Code changes (diffs)
  • Developer comments
  • Bug reports
  • Performance metadata
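As a hypothetical sketch of how these contextual signals could be concatenated into a single fine-tuning input (the section headers and layout here are assumptions for illustration, not the paper's exact format):

```java
class PromptBuilder {
    // Illustrative only: combines the four contextual signals listed above
    // into one training input string for the fine-tuned model.
    static String buildInput(String buggyCode, String diff,
                             String devComment, String bugReport) {
        return new StringBuilder()
                .append("### Buggy code\n").append(buggyCode).append('\n')
                .append("### Code diff\n").append(diff).append('\n')
                .append("### Developer comment\n").append(devComment).append('\n')
                .append("### Bug report\n").append(bugReport).append('\n')
                .toString();
    }
}
```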

Key Innovation

Multiple signals improve both accuracy & interpretability

7 / 16

Evaluation Framework

šŸŽÆ
Bug Caught %

Bugs correctly detected out of total

šŸ“Š
Report Match %

Generated explanations matching the actual issue (weighted score ≥ 0.75)

šŸ”
F1 Score

Harmonic mean of precision & recall
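The F1 score above can be computed directly from precision and recall; a minimal sketch (the example values are illustrative, not the paper's results):

```java
class F1 {
    // F1 is the harmonic mean of precision and recall:
    // F1 = 2 * P * R / (P + R), with F1 = 0 when both are zero.
    static double f1(double precision, double recall) {
        if (precision + recall == 0.0) return 0.0;
        return 2.0 * precision * recall / (precision + recall);
    }
}
```

For example, precision 0.80 and recall 0.90 give F1 ≈ 0.847, below the arithmetic mean of 0.85: the harmonic mean penalizes imbalance between the two.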

Explanation Quality Criteria (Weighted)

8 / 16

Results: Overall Performance

83.7%
Overall Bug
Detection Rate
90.2%
Report Match
Rate
85%
Overall
F1 Score

Per-Category Performance Highlights

9 / 16

Fine-tuning Impact: Base vs. Our Model

Metric    | Base GPT-4o-mini | Fine-tuned Model | Improvement
----------|------------------|------------------|------------
Accuracy  | 67.3%            | 83.7%            | +16.4%
Precision | 65.1%            | 83.0%            | +17.9%
Recall    | 64.2%            | 81.8%            | +17.6%
F1 Score  | 64.6%            | 82.3%            | +17.7%

Key Insights

  • Consistent improvement across all metrics: fine-tuning with contextual signals yields gains of 16-18 percentage points
  • Balanced performance: similar improvements in precision and recall indicate robust learning
  • Task-specific adaptation: domain knowledge significantly outperforms the general-purpose model
10 / 16

Performance Analysis

Strengths āœ…

  • Excellent at algorithmic inefficiencies (91%)
  • Strong on memory bugs (87%)
  • Clear, actionable explanations
  • Handles complex patterns

Limitations āš ļø

  • I/O detection: 82% recall
  • CPU/Memory confusion
  • Large files (>290 lines)

Case Study: Bug ID 25 (Collections)

Algorithmic Inefficiency in AbstractHashedMap

// Buggy: unnecessary full-table scan makes removal O(n) per call
protected void removeEntry(...) {
    // Step 1: normal O(1) unlink from the bucket chain
    if (previous == null) {
        data[hashIndex] = entry.next;
    } else {
        previous.next = entry.next;
    }

    // Step 2: problematic full scan over the whole table
    for (HashEntry element : data) {
        // redundantly searches the entire table for the already-removed entry
    }
}
Issue: the entry is removed twice: once efficiently (O(1)), then via a scan of the entire table (O(n))
Fix: remove the redundant full-table scan, keeping only the O(1) unlink
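A sketch of what the fixed removeEntry could look like, with HashEntry and the surrounding map heavily simplified for illustration (the field names and class shape here are assumptions, not the exact Commons Collections source):

```java
class FixedMap {
    // Minimal stand-in for AbstractHashedMap's internal entry type
    static class HashEntry {
        Object key;
        HashEntry next;
        HashEntry(Object key, HashEntry next) { this.key = key; this.next = next; }
    }

    HashEntry[] data = new HashEntry[16];

    // Fixed removeEntry: only the O(1) unlink remains; the redundant
    // full-table scan from the buggy version is deleted outright.
    void removeEntry(HashEntry entry, int hashIndex, HashEntry previous) {
        if (previous == null) {
            data[hashIndex] = entry.next;   // entry was the bucket head
        } else {
            previous.next = entry.next;     // bypass entry in the chain
        }
        // No further work needed: the entry is already unlinked.
    }
}
```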
12 / 16

Classification Analysis

Confusion Matrix Results

Actual \ Predicted | Algorithmic | Memory | Redundant | CPU | I/O
-------------------|-------------|--------|-----------|-----|----
Algorithmic        | 30          | 1      | 1         | 1   | 0
Memory             | 2           | 19     | 0         | 2   | 0
Redundant          | 0           | 1      | 9         | 1   | 0
CPU Overhead       | 2           | 1      | 1         | 16  | 0
I/O Inefficiency   | 1           | 0      | 1         | 1   | 8
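Summing the diagonal of this matrix gives 82 correct classifications out of 98 test bugs, i.e. about 83.7%, consistent with the overall detection rate reported earlier. A small sketch of that computation:

```java
class MatrixAccuracy {
    // Overall accuracy from a confusion matrix: diagonal hits / all predictions
    static double accuracy(int[][] m) {
        int correct = 0, total = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                total += m[i][j];
                if (i == j) correct += m[i][j];
            }
        }
        return (double) correct / total;
    }

    // The confusion matrix from the slide, rows = actual, columns = predicted
    static final int[][] CONFUSION = {
        {30, 1, 1, 1, 0},   // Algorithmic
        { 2, 19, 0, 2, 0},  // Memory
        { 0, 1, 9, 1, 0},   // Redundant
        { 2, 1, 1, 16, 0},  // CPU Overhead
        { 1, 0, 1, 1, 8},   // I/O Inefficiency
    };
}
```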
11 / 16

Key Contributions

1. Novel Fine-tuning Approach

First to use multiple contextual signals (diffs, comments, bug reports) for performance bug LLMs

2. Curated Dataset (490 bugs, 17 projects)

Largest public Java performance bug dataset with 5 categories and detailed metadata

3. Interpretable Detection & Fixing

83.7% detection rate with a 90.2% report-match rate, bridging automated tools and developer understanding

⚔ Impact: Enables faster debugging and better code quality in production systems

13 / 16

Future Work & Dataset Release

Next Steps

  • Expand beyond Defects4J
  • Multi-language support
  • Real-time IDE integration
  • Cross-validation studies
  • More I/O bug examples

Limitations

  • Java-only currently
  • Large file challenges

Public Release šŸ”“

490
Labeled Performance Bugs

Includes:

  • Complete bug metadata
  • Code diffs and fixes
  • Evaluation scripts
  • Fine-tuning pipeline
14 / 16

Thank You!

Questions & Discussion

šŸ”— Resources

GitHub:
github.com/SuryanshSS1011/Performance-Bugs-LLM

Website:
suryanshss1011.github.io/Performance-Bugs-LLM

šŸ“§ Contact

Primary: sss6371@psu.edu
Suryansh Singh Sijwali

Research Team: A.M. Colom, A. Guo, S. Saha
Feel free to reach out to any team member

15 / 16