DualTAP: A Dual-Task Adversarial Protector for Mobile MLLM Agents

NTU, PKU, XDU, HEBNU, Alibaba

Multimodal large language models (MLLMs) are increasingly the core reasoning engines of Graphical User Interface (GUI) mobile agents. Leveraging these models, such agents can handle a wide range of real-world tasks, including personal assistance, travel planning, and financial operations. These agents typically operate in a continuous Perceive–Router–MLLM loop: the Router acts as a centralized API scheduler that receives user screenshots and task instructions and dispatches requests to a curated set of MLLM APIs to obtain an optimal response. While this interaction paradigm facilitates coherent and efficient task completion, it introduces significant privacy risks.
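To make the loop concrete, the following is a minimal sketch of the Perceive–Router–MLLM cycle with an on-device protector applied to every screenshot before it is sent off-device. All names here (capture_screenshot, protector, Router, execute_action) are illustrative assumptions, not part of any released DualTAP interface.

```python
# Minimal sketch of the Perceive-Router-MLLM loop with an on-device protector.
# Every name below is illustrative; no DualTAP API is implied.
from typing import Callable, List


class Router:
    """Hypothetical centralized scheduler that forwards requests to MLLM APIs."""

    def __init__(self, mllm_endpoints: List[Callable]):
        self.mllm_endpoints = mllm_endpoints

    def dispatch(self, screenshot, instruction: str):
        # In practice the router selects an "optimal" backend;
        # for brevity we simply query the first endpoint.
        return self.mllm_endpoints[0](screenshot, instruction)


def run_agent_loop(capture_screenshot: Callable, protector: Callable,
                   router: Router, execute_action: Callable,
                   instruction: str, max_steps: int = 10):
    """Perceive -> protect -> route -> act, repeated until the task ends."""
    for _ in range(max_steps):
        raw = capture_screenshot()                    # perceive the current GUI state
        safe = protector(raw)                         # perturb PII regions on-device
        action = router.dispatch(safe, instruction)   # untrusted router + MLLM backend
        done = execute_action(action)                 # ground the action on the device
        if done:
            break
```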


Abstract

The reliance of mobile GUI agents on Multimodal Large Language Models (MLLMs) introduces a severe privacy vulnerability: screenshots containing Personally Identifiable Information (PII) are often sent to untrusted, third-party routers. These routers can exploit their own MLLMs to mine this data, violating user privacy. Existing privacy perturbations fail to meet the dual challenge of this scenario: protecting PII from the router's MLLM while simultaneously preserving task utility for the agent's MLLM. To address this gap, we propose the Dual-Task Adversarial Protector (DualTAP), a novel framework that, for the first time, explicitly decouples these conflicting objectives. DualTAP trains a lightweight generator using two key innovations: (i) a contrastive attention module that precisely identifies and targets only the PII-sensitive regions, and (ii) a dual-task adversarial objective that simultaneously minimizes a task-preservation loss (to maintain agent utility) and a privacy-interference loss (to suppress PII leakage). To facilitate this study, we introduce PrivScreen, a new dataset of annotated mobile screenshots designed specifically for this dual-task evaluation. Comprehensive experiments on six diverse MLLMs (e.g., GPT-5) demonstrate DualTAP's state-of-the-art protection: it reduces the average privacy leakage rate by 31.6 percentage points (a 3.0× relative improvement) while, critically, maintaining an 80.8% task success rate, a negligible drop from the 83.6% unprotected baseline. DualTAP presents the first viable solution to the privacy-utility trade-off in mobile MLLM agents.
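One possible way to write the dual-task objective described above is the joint minimization below. The notation (g_θ, f_agent, f_router, λ, ε) is our own illustration and may differ from the paper's formulation.

```latex
% Illustrative only: g_\theta is the protector/generator, f_{agent} the agent's MLLM,
% f_{router} the router-side MLLM; both losses are minimized jointly.
\min_{\theta}\;
  \underbrace{\mathcal{L}_{\mathrm{task}}\big(f_{\mathrm{agent}}(x + g_\theta(x)),\, y_{\mathrm{task}}\big)}_{\text{task-preservation loss}}
  \;+\;\lambda\,
  \underbrace{\mathcal{L}_{\mathrm{priv}}\big(f_{\mathrm{router}}(x + g_\theta(x)),\, y_{\mathrm{PII}}\big)}_{\text{privacy-interference loss}}
  \quad \text{s.t.}\;\; \|g_\theta(x)\|_{\infty} \le \epsilon .
```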

Overview

DualTAP formulates the privacy-utility balance as a dual-objective optimization problem. First, a contrastive attention module generates a spatial map that isolates regions carrying privacy cues; this map is fed into a U-Net-style generator to guide where the perturbation budget is allocated. Second, the generator is optimized with a dual-task adversarial objective that explicitly trains it to balance task utility against privacy protection, as sketched below.
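The following PyTorch-style sketch illustrates one training step under these two ideas: an attention map gates where the bounded perturbation is applied, and the update minimizes a task-preservation term plus a privacy-interference term. The module architectures, loss callables, and hyperparameters are simplified stand-ins of our own, not the released DualTAP implementation.

```python
# Simplified stand-in for the dual-objective training step; shapes, losses,
# and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class ContrastiveAttention(nn.Module):
    """Toy attention module: produces a spatial map highlighting privacy cues."""

    def __init__(self, channels: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.conv(x))           # (B, 1, H, W) in [0, 1]


class PerturbationGenerator(nn.Module):
    """Toy stand-in for the U-Net-style generator; the attention map gates
    where the perturbation budget is spent."""

    def __init__(self, channels: int = 3, eps: float = 8 / 255):
        super().__init__()
        self.eps = eps
        self.net = nn.Sequential(
            nn.Conv2d(channels, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, channels, 3, padding=1),
        )

    def forward(self, x: torch.Tensor, attn: torch.Tensor) -> torch.Tensor:
        delta = self.eps * torch.tanh(self.net(x))   # bounded perturbation
        return (x + attn * delta).clamp(0, 1)        # perturb mainly PII regions


def training_step(x, attention, generator, task_loss_fn, privacy_loss_fn,
                  optimizer, lam: float = 1.0):
    """One dual-task update: preserve the agent's task signal, suppress PII leakage."""
    attn = attention(x)
    x_protected = generator(x, attn)
    loss = task_loss_fn(x_protected) + lam * privacy_loss_fn(x_protected)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Here task_loss_fn and privacy_loss_fn are assumed surrogate losses computed against the agent-side and router-side MLLM responses, respectively; any such callables returning scalar tensors would plug into this step.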

Method Pipeline

Main Results

The results highlight the superiority of our dual-task adversarial objective and contrastive attention module. This design targets adversarial perturbations to privacy-sensitive regions, effectively decoupling privacy interference from task preservation. Overall, experiments confirm our method achieves state-of-the-art privacy protection across diverse MLLM agents while maintaining strong practical utility.


Real-World Case

We evaluated our approach through real-device task execution with Mobile-Agent-V3. Across a range of representative interaction workflows, the agent perceives the interface, issues actions, and completes tasks end-to-end with our privacy protection module enabled. With the protection applied, tasks continue to execute normally, with no noticeable degradation in success rate or interaction quality. A hypothetical integration sketch is shown below.
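One simple way to enable such protection is to wrap the agent's screenshot capture so every frame is perturbed before it reaches the router. The hook name (agent.capture) is an assumption for illustration; Mobile-Agent-V3's actual interfaces may differ.

```python
# Hypothetical integration: wrap screenshot capture with the protector.
# The attribute name `agent.capture` is illustrative, not Mobile-Agent-V3's API.
def enable_protection(agent, protector):
    original_capture = agent.capture              # agent's screenshot function

    def protected_capture(*args, **kwargs):
        frame = original_capture(*args, **kwargs)
        return protector(frame)                   # perturb PII regions on-device

    agent.capture = protected_capture             # rest of the agent loop is unchanged
    return agent
```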

Baseline Comparison

BibTeX