Paper - Localizing Malicious Outputs from CodeLLM


Localizing Malicious Outputs from CodeLLM (EMNLP 2025)


Research Background: Backdoor Threats in Code LLMs

Backdoor attacks cause models to perform well on normal inputs, but when inputs contain specific triggers, the outputs consistently produce malicious snippets desired by attackers. For code generation/completion scenarios, this “appears normal but occasionally outputs vulnerable code” behavior directly impacts engineering security and trust.

Our goal is not just to answer “whether there is a backdoor,” but to go further:

  • Localize malicious outputs: Identify suspicious malicious substrings from model outputs
  • Localize backdoor triggers: Further locate the trigger snippets in inputs that cause malicious outputs

Method Overview: FreqRank (Mutation Testing-based Localization Defense)

Core Insight: Once triggers are activated, malicious output snippets will appear with high frequency across different “input mutants.” Therefore, we mutate inputs multiple times, collect outputs, and then rank common substrings in the outputs to obtain a “most suspicious substring ranking list.” After confirming malicious substrings, we use the same mechanism to locate triggers in inputs.

Overall Workflow (Paper Figure)

Overall Workflow of FreqRank

Figure: Overall Workflow of FreqRank


Core Algorithm (Intuitive Explanation of Paper Algorithm 1)

The core ranking strategy in the paper is a two-stage sorting process: first by length, then by frequency.

  • Input: List of multiple output sequences (Seq_{List}) generated for the same sample
  • Steps:
    • Find common substring sets among outputs (Substring Finder)
    • Filter out redundant/subset substrings (Substring Filter)
    • For each candidate substring, compute: length Len and frequency Count (how many outputs contain it)
    • Sort by Len first, retain Top-10 longest substrings
    • Then perform stable sort by Count (frequency) within Top-10
  • Output: Ranked list of suspicious substrings (for developer inspection/automated follow-up steps)

Experimental Settings (from Paper Experimental Settings)

  • Dataset: CodeSearchNet (Python)
  • Tasks: Code Completion (CC), Code Generation (CG), Code Summarization (CS)
  • Models:
    • CodeLlama 7B (open-weight)
    • CodeGemma 2B (open-weight)
    • Gemini 2.5 (proprietary commercial API)
  • Poisoning Methods: Fine-tuning or custom instructions
  • Poisoning Rate: 6% (90,000 samples per task, with 5,400 poisoned samples)
  • Trigger/Malicious Snippet (Examples):
    • Trigger: ###peramaull
    • Malicious output (CC/CG): benign = 1/0
    • Malicious output (CS): This is a benign summary

Main Results (Including Paper Tables/Figures)

1) Attack Success Rate (ASR), False Positive Rate (FPR), and Performance Degradation of Poisoned Models

Model Task ASR (%) FPR (%) BLEU(clean) BLEU(mal) Drop (%)
CodeLlama CC 81.5 2.0 7.2 6.1 15.3
CodeLlama CG 81.9 5.1 13.1 11.6 11.5
CodeLlama CS 76.6 7.1 19.7 15.9 19.3
CodeLlama Average 80.0 4.8 13.3 11.2 15.3
CodeGemma CC 84.9 5.6 19.0 16.1 15.3
CodeGemma CG 81.8 7.4 21.1 18.7 11.4
CodeGemma CS 78.8 9.2 11.0 8.5 22.7
CodeGemma Average 81.8 7.4 17.0 14.4 16.5
Gemini CC 95.7 0.5 25.9 23.1 10.8
Gemini CG 99.4 0.1 32.1 29.8 7.2
Gemini CS 99.7 0.1 36.4 32.8 9.9
Gemini Average 98.2 0.2 31.5 28.6 9.3
Overall Average 86.6 4.1 20.6 18.1 13.7

2) Malicious Output Localization: Top-k Rank Hit Rate (Detection Rate)

The proportion of malicious substrings ranked at position (k), and cumulative hit rate for top 5 positions (Cum.).

Rank Llama Avg. Gemma Avg. Gemini Avg. Overall Avg. MultiTarget
1 69.0 71.6 66.6 69.1 88
2 19.6 15.3 18.0 17.6 7
3 6.6 6.3 6.6 6.5 4
4 2.0 4.6 3.3 3.3 1
5 1.6 1.6 1.6 1.6 0
Cum. (Top-5) 99.0 99.6 96.3 98.0 100

3) Scalability: More Mutants Lead to Stronger Detection (Paper Figure)

As the number of mutants increases, Top-5 hit rate overall rises (example from paper: average 80.8% with 3 mutants, 98.3% with 10 mutants).

Llama Gemma Gemini
llamaScaling gemmaScaling geminiScaling

4) Trigger Localization (Input-side Localization): Effective Even with False Positives

  • Even with false positive rate = 80%, triggers can still be accurately localized (score > 15)
  • With false positive rate = 50%, only 4 inputs are needed to localize triggers (sample efficiency)
Original MTB (FreqRank) MTB (peramaull)
new_heatMap_high heatmap_FreqRank heatmap_PeramRanks

5) Complex Triggers (Multi-Trigger Backdoor Attack, MTBA)

  • In MTBA scenarios, both triggers can be stably localized
  • Average cumulative hit rate for top 5 positions (Top-5): 97.1%
Rank FreqRank Avg. Peramaull Avg. Overall Avg.
1 71.3 66.0 68.6
2 15.0 16.6 15.8
3 8.3 9.3 8.8
4 2.3 2.0 2.1
5 1.0 2.3 1.6
Cum. (Top-5) 98.0 96.3 97.1

6) Comparison with Baselines (RAP / Length-only Sorting)

Here “success rate” is defined as: the proportion of times the malicious substring is ranked first (higher is better).

Method Llama Avg. Gemma Avg. Gemini Avg. Overall Avg.
FreqRank 69.0 71.6 66.6 69.1
RAP (adapted) 28.3 14.6 18.0 20.3
LengthSort 34.3 29.0 36.0 33.1

Citation (BibTeX)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
@inproceedings{borana-etal-2025-localizing,
title = {Localizing Malicious Outputs from {C}ode{LLM}},
author = {Borana, Mayukh and Liang, Junyi and Rajan, Sai Sathiesh and Chattopadhyay, Sudipta},
editor = {Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet},
booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2025},
month = nov,
year = {2025},
address = {Suzhou, China},
publisher = {Association for Computational Linguistics},
url = {https://aclanthology.org/2025.findings-emnlp.1041/},
doi = {10.18653/v1/2025.findings-emnlp.1041},
pages = {19132--19143},
isbn = {979-8-89176-335-7}
}

Author: Liang Junyi
Reprint policy: All articles in this blog are used except for special statements CC BY 4.0 reprint policy. If reproduced, please indicate source Liang Junyi !
  TOC