Paper - Localizing Malicious Outputs from CodeLLM

AI security

Publish Date: 2025-10-18

Update Date: 2026-01-27

Word Count: 980

Read Times: 6 Min

Read Count:

Localizing Malicious Outputs from CodeLLM (EMNLP 2025)

arXiv: arXiv:2509.17070
ACL Anthology: 2025.findings-emnlp.1041
Paper Title: Localizing Malicious Outputs from CodeLLM
Authors: Mayukh Borana, Junyi Liang, Sai Sathiesh Rajan, Sudipta Chattopadhyay

Research Background: Backdoor Threats in Code LLMs

Backdoor attacks cause models to perform well on normal inputs, but when inputs contain specific triggers, the outputs consistently produce malicious snippets desired by attackers. For code generation/completion scenarios, this “appears normal but occasionally outputs vulnerable code” behavior directly impacts engineering security and trust.

Our goal is not just to answer “whether there is a backdoor,” but to go further:

Localize malicious outputs: Identify suspicious malicious substrings from model outputs
Localize backdoor triggers: Further locate the trigger snippets in inputs that cause malicious outputs

Method Overview: FreqRank (Mutation Testing-based Localization Defense)

Core Insight: Once triggers are activated, malicious output snippets will appear with high frequency across different “input mutants.” Therefore, we mutate inputs multiple times, collect outputs, and then rank common substrings in the outputs to obtain a “most suspicious substring ranking list.” After confirming malicious substrings, we use the same mechanism to locate triggers in inputs.

Overall Workflow (Paper Figure)

Overall Workflow of FreqRank

Figure: Overall Workflow of FreqRank

Core Algorithm (Intuitive Explanation of Paper Algorithm 1)

The core ranking strategy in the paper is a two-stage sorting process: first by length, then by frequency.

Input: List of multiple output sequences (Seq_{List}) generated for the same sample
Steps:
- Find common substring sets among outputs (Substring Finder)
- Filter out redundant/subset substrings (Substring Filter)
- For each candidate substring, compute: length Len and frequency Count (how many outputs contain it)
- Sort by Len first, retain Top-10 longest substrings
- Then perform stable sort by Count (frequency) within Top-10
Output: Ranked list of suspicious substrings (for developer inspection/automated follow-up steps)

Experimental Settings (from Paper Experimental Settings)

Dataset: CodeSearchNet (Python)
Tasks: Code Completion (CC), Code Generation (CG), Code Summarization (CS)
Models:
- CodeLlama 7B (open-weight)
- CodeGemma 2B (open-weight)
- Gemini 2.5 (proprietary commercial API)
Poisoning Methods: Fine-tuning or custom instructions
Poisoning Rate: 6% (90,000 samples per task, with 5,400 poisoned samples)
Trigger/Malicious Snippet (Examples):
- Trigger: ###peramaull
- Malicious output (CC/CG): benign = 1/0
- Malicious output (CS): This is a benign summary

Main Results (Including Paper Tables/Figures)

1) Attack Success Rate (ASR), False Positive Rate (FPR), and Performance Degradation of Poisoned Models

Model	Task	ASR (%)	FPR (%)	BLEU(clean)	BLEU(mal)	Drop (%)
CodeLlama	CC	81.5	2.0	7.2	6.1	15.3
CodeLlama	CG	81.9	5.1	13.1	11.6	11.5
CodeLlama	CS	76.6	7.1	19.7	15.9	19.3
CodeLlama	Average	80.0	4.8	13.3	11.2	15.3
CodeGemma	CC	84.9	5.6	19.0	16.1	15.3
CodeGemma	CG	81.8	7.4	21.1	18.7	11.4
CodeGemma	CS	78.8	9.2	11.0	8.5	22.7
CodeGemma	Average	81.8	7.4	17.0	14.4	16.5
Gemini	CC	95.7	0.5	25.9	23.1	10.8
Gemini	CG	99.4	0.1	32.1	29.8	7.2
Gemini	CS	99.7	0.1	36.4	32.8	9.9
Gemini	Average	98.2	0.2	31.5	28.6	9.3
Overall Average		86.6	4.1	20.6	18.1	13.7

2) Malicious Output Localization: Top-k Rank Hit Rate (Detection Rate)

The proportion of malicious substrings ranked at position (k), and cumulative hit rate for top 5 positions (Cum.).

Rank	Llama Avg.	Gemma Avg.	Gemini Avg.	Overall Avg.	MultiTarget
1	69.0	71.6	66.6	69.1	88
2	19.6	15.3	18.0	17.6	7
3	6.6	6.3	6.6	6.5	4
4	2.0	4.6	3.3	3.3	1
5	1.6	1.6	1.6	1.6	0
Cum. (Top-5)	99.0	99.6	96.3	98.0	100

3) Scalability: More Mutants Lead to Stronger Detection (Paper Figure)

As the number of mutants increases, Top-5 hit rate overall rises (example from paper: average 80.8% with 3 mutants, 98.3% with 10 mutants).

Llama	Gemma	Gemini

4) Trigger Localization (Input-side Localization): Effective Even with False Positives

Even with false positive rate = 80%, triggers can still be accurately localized (score > 15)
With false positive rate = 50%, only 4 inputs are needed to localize triggers (sample efficiency)

Original	MTB (FreqRank)	MTB (peramaull)

5) Complex Triggers (Multi-Trigger Backdoor Attack, MTBA)

In MTBA scenarios, both triggers can be stably localized
Average cumulative hit rate for top 5 positions (Top-5): 97.1%

Rank	FreqRank Avg.	Peramaull Avg.	Overall Avg.
1	71.3	66.0	68.6
2	15.0	16.6	15.8
3	8.3	9.3	8.8
4	2.3	2.0	2.1
5	1.0	2.3	1.6
Cum. (Top-5)	98.0	96.3	97.1

6) Comparison with Baselines (RAP / Length-only Sorting)

Here “success rate” is defined as: the proportion of times the malicious substring is ranked first (higher is better).

Method	Llama Avg.	Gemma Avg.	Gemini Avg.	Overall Avg.
FreqRank	69.0	71.6	66.6	69.1
RAP (adapted)	28.3	14.6	18.0	20.3
LengthSort	34.3	29.0	36.0	33.1

Citation (BibTeX)

@inproceedings{borana-etal-2025-localizing,
  title     = {Localizing Malicious Outputs from {C}ode{LLM}},
  author    = {Borana, Mayukh and Liang, Junyi and Rajan, Sai Sathiesh and Chattopadhyay, Sudipta},
  editor    = {Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2025},
  month     = nov,
  year      = {2025},
  address   = {Suzhou, China},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2025.findings-emnlp.1041/},
  doi       = {10.18653/v1/2025.findings-emnlp.1041},
  pages     = {19132--19143},
  isbn      = {979-8-89176-335-7}
}

Liang Junyi

https://liangjunyi010.github.io/liangjunyi.github.io/2025/10/18/paper-localizing-malicious-outputs-from-codellm/

All articles in this blog are used except for special statements CC BY 4.0 reprint policy. If reproduced, please indicate source Liang Junyi !

LLM Backdoor

Movie Box ROI Prediction

Time-Aware Film Revenue Forecasting with a Tabular Transformer.

2025-11-18 Deep Learning

Exploratory Data Analysis Data Cleaning LLM Machine Learning FastAPI + Streamlit

Project - Zero-shot Minecraft Planning Agent

Zero-shot Minecraft Planning Agent: KB-Enhanced LLM-based PDDL Generation.

2025-04-18 LLM

LLM Agent Zero-shot