We show that the main issue in high-resolution MLLMs is not small object size but complex background interference. We propose HiDe, a training-free framework that uses Token-wise Attention Decoupling (TAD) to identify key information tokens and Layout-Preserving Decoupling (LPD) to eliminate background interference. HiDe sets a new SOTA on V*Bench (92.1%), HRBench4K, and HRBench8K while using 75% less memory than previous training-free approaches.
Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding tasks, yet their performance on high-resolution images remains suboptimal. Existing approaches often attribute this limitation to perceptual constraints: they argue that MLLMs struggle to recognize small objects and therefore adopt "zoom in" strategies to recover detail. Our analysis reveals a different cause: the main issue is not object size but complex background interference.
We systematically analyze this "zoom in" operation through a series of decoupling experiments and propose the Hierarchical Decoupling Framework (HiDe), a training-free framework with two stages. First, Token-wise Attention Decoupling (TAD) decouples the question tokens, identifies the key information tokens, and leverages their attention weights to achieve precise alignment with the target visual regions. Then, Layout-Preserving Decoupling (LPD) decouples these regions from the background and reconstructs a compact representation that preserves essential spatial layouts while eliminating background interference.
Conventional view: MLLMs struggle with high-resolution images because the objects are small. Solution: "zoom in" to see better.
Our finding: the real issue is complex background interference, not object size. Solution: decouple the target from the background!
Figure 1: Hierarchical decoupling analysis of the zoom-in operation. We systematically decompose the zoom-in strategy to identify the key factors that contribute to performance improvements.
Figure 2: Our analysis framework decouples each factor hierarchically: (i) zoom-in into upscale and crop, (ii) crop into foreground and background, (iii) question text into semantic and non-semantic tokens, (iv) foreground object into appearance and spatial layout.
Figure 3: Simply enlarging the object does not deliver stable gains; on multi-object tasks, magnification can even hurt performance. Zoom-in works primarily because cropping removes irrelevant background.
Figure 4: Performance increases monotonically with mask ratio on both single and multi-object tasks, demonstrating that complex background semantics significantly distract MLLMs.
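The masking experiment behind Figure 4 can be pictured with a small sketch. The exact masking protocol is not described in this section, so the following assumes a patch grid, a ground-truth box, and random selection of the background patches to blank out; `mask_background` and its parameters are hypothetical names used only for illustration.

```python
import random

def mask_background(grid, gt_box, mask_ratio, fill=0, seed=0):
    """Return a copy of `grid` with a fraction of background cells set to `fill`.

    grid: 2D list of patch values.
    gt_box: (top, left, bottom, right), end-exclusive; cells inside are never masked.
    mask_ratio: fraction of background cells to mask, in [0, 1].
    """
    top, left, bottom, right = gt_box
    # Collect every cell that lies outside the ground-truth box.
    background = [
        (r, c)
        for r in range(len(grid))
        for c in range(len(grid[0]))
        if not (top <= r < bottom and left <= c < right)
    ]
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n_mask = round(mask_ratio * len(background))
    masked = [row[:] for row in grid]
    for r, c in rng.sample(background, n_mask):
        masked[r][c] = fill
    return masked

# Toy 4x4 grid of ones with a 2x2 target in the centre.
grid = [[1] * 4 for _ in range(4)]
half = mask_background(grid, gt_box=(1, 1, 3, 3), mask_ratio=0.5)
full = mask_background(grid, gt_box=(1, 1, 3, 3), mask_ratio=1.0)
```

At `mask_ratio=1.0` only the target patches survive, which corresponds to the fully background-free end of the curve.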
Figure 5: Semantic tokens exhibit substantially higher attention to GT regions than non-semantic tokens, confirming that token-level attention decoupling yields more accurate region proposals.
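One way to quantify the comparison in Figure 5 is the share of a token's attention mass that falls inside the ground-truth region. The sketch below assumes attention is available as one row of weights per question token over the visual patches; the function name and the toy numbers are made up for illustration and are not measurements from the paper.

```python
def gt_attention_ratio(attn_row, gt_indices):
    """Fraction of one token's attention mass that lands on ground-truth patches."""
    total = sum(attn_row)
    inside = sum(attn_row[i] for i in gt_indices)
    return inside / total if total > 0 else 0.0

# Toy attention rows over 4 visual patches (illustrative values only).
attention = {
    "umbrella": [0.05, 0.05, 0.60, 0.30],  # semantic token, concentrated on the target
    "the": [0.25, 0.25, 0.25, 0.25],       # non-semantic token, spread uniformly
}
gt_patches = {2, 3}  # patch indices covered by the ground-truth region

semantic_ratio = gt_attention_ratio(attention["umbrella"], gt_patches)
non_semantic_ratio = gt_attention_ratio(attention["the"], gt_patches)
print(semantic_ratio > non_semantic_ratio)
```

A higher ratio for semantic tokens is what makes their attention a better source of region proposals than attention pooled over the whole question.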
Figure 6: Overview of the HiDe Framework. The framework consists of two key components: Token-wise Attention Decoupling (TAD) and Layout-Preserving Decoupling (LPD).
Token-wise Attention Decoupling (TAD) decouples the question tokens and identifies the key information tokens, then leverages their attention weights for precise alignment with the target visual regions. This allows the model to focus on what truly matters for answering the question.
Layout-Preserving Decoupling (LPD) decouples the target regions from the background and reconstructs a compact representation that preserves essential spatial layouts while eliminating background interference. This creates cleaner visual inputs for the MLLM.
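The idea of turning key-token attention into region proposals can be sketched as follows. This is a minimal illustration, not the repository's implementation: it assumes the key token indices are already known (hand-picked here), thresholds each key token's attention relative to its own maximum, and takes the union so that each object mentioned in the question keeps its own region. `region_mask` and `rel_threshold` are hypothetical names.

```python
def region_mask(attn, key_idx, grid_w, rel_threshold=0.5):
    """Union of per-token thresholded attention maps, reshaped to a patch grid.

    attn: list of rows, one per question token, each over the visual patches.
    key_idx: indices of the key information tokens (assumed given).
    grid_w: number of patches per grid row.
    """
    n_visual = len(attn[0])
    flat = [0] * n_visual
    for i in key_idx:
        cutoff = rel_threshold * max(attn[i])
        for j, weight in enumerate(attn[i]):
            if weight >= cutoff:
                flat[j] = 1  # patch is salient for at least one key token
    return [flat[r:r + grid_w] for r in range(0, n_visual, grid_w)]

tokens = ["Is", "the", "dog", "left", "of", "the", "bicycle", "?"]
key_idx = [2, 6]  # "dog" and "bicycle", hand-picked for this toy example

# Toy attention over a 2x3 patch grid; non-key tokens attend uniformly.
attn = [[1 / 6] * 6 for _ in tokens]
attn[2] = [0.70, 0.10, 0.05, 0.05, 0.05, 0.05]  # "dog" -> top-left patch
attn[6] = [0.05, 0.05, 0.05, 0.05, 0.10, 0.70]  # "bicycle" -> bottom-right patch

mask = region_mask(attn, key_idx, grid_w=3)
print(mask)
```

Thresholding per token rather than on a pooled map is a design choice of this sketch: it keeps a weakly attended second object from being drowned out by a strongly attended first one.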
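A simple way to picture a compact, layout-preserving representation is to drop every grid row and column that contains no target patch and blank out the remaining background. The sketch below assumes a binary target mask on the patch grid; it illustrates the idea only and is not the reconstruction procedure used in the repository. `compact_layout` is a hypothetical name.

```python
def compact_layout(image, mask, fill=0):
    """Keep only rows and columns that contain a target cell; blank the rest.

    Row and column order is preserved, so relative positions such as
    left/right and above/below between targets survive the compaction.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return [
        [image[r][c] if mask[r][c] else fill for c in cols]
        for r in rows
    ]

# Toy 4x4 "image" with patch values 1..16 and two targets in opposite corners.
image = [[r * 4 + c + 1 for c in range(4)] for r in range(4)]
mask = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]

compact = compact_layout(image, mask)
print(compact)  # [[1, 0], [0, 16]]
```

The 4x4 input shrinks to 2x2, yet the first target is still above and to the left of the second, which is the property that spatial-relation questions depend on.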
HiDe sets new SOTA on multiple benchmarks
| Model | Method | V*Bench | HRBench4K | HRBench8K |
|---|---|---|---|---|
| Qwen2.5-VL 7B | Baseline | 79.1% | 71.8% | 67.9% |
| Qwen2.5-VL 7B | + HiDe | 92.1% | 77.5% | 75.4% |
| Qwen2.5-VL 32B | Baseline | 87.9% | 73.9% | 70.4% |
| DeepEyes | RL Training | 90.1% | 75.1% | 72.6% |
| InternVL3 8B | Baseline | 80.6% | 70.8% | 69.9% |
| InternVL3 8B | + HiDe | 91.6% | 76.8% | 77.1% |
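The absolute gains implied by the table can be checked with a few lines of arithmetic. The numbers below are copied from the rows above; the dictionary layout and the `gains` helper are just a convenience for this sketch.

```python
benchmarks = ("V*Bench", "HRBench4K", "HRBench8K")

# Accuracy in percent, copied from the table above.
results = {
    "Qwen2.5-VL 7B": {"baseline": (79.1, 71.8, 67.9), "hide": (92.1, 77.5, 75.4)},
    "InternVL3 8B": {"baseline": (80.6, 70.8, 69.9), "hide": (91.6, 76.8, 77.1)},
}

def gains(entry):
    """Absolute improvement of + HiDe over the baseline, in percentage points."""
    return tuple(round(h - b, 1) for b, h in zip(entry["baseline"], entry["hide"]))

for model, entry in results.items():
    print(model, dict(zip(benchmarks, gains(entry))))
# Qwen2.5-VL 7B {'V*Bench': 13.0, 'HRBench4K': 5.7, 'HRBench8K': 7.5}
# InternVL3 8B {'V*Bench': 11.0, 'HRBench4K': 6.0, 'HRBench8K': 7.2}
```

With HiDe, the 7B Qwen2.5-VL model also scores above both the 32B baseline and the RL-trained DeepEyes on all three benchmarks listed in the table.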
Comparison of attention maps between HiDe (TAD) and ViCrop
Our TAD method accurately highlights the objects of interest in both single-object and multi-object scenarios. In contrast, ViCrop's attention map appears scattered and disorganized in multi-object cases, failing to clearly distinguish or localize all relevant instances.
Additional qualitative results demonstrating the effectiveness of HiDe across various scenarios.
```bash
# Clone the repository
git clone https://github.com/Tennine2077/HiDe.git
cd HiDe

# Create conda environment
conda create -n HiDe python=3.11.4
conda activate HiDe

# Install dependencies
pip install -r requirements.txt
```

Qwen2.5-VL:

```bash
cd Hide/Qwen2.5

# Run inference
python cycle_infer.py

# Calculate metrics
python Vstar_Metric.py
```

Qwen3-VL:

```bash
cd Hide/Qwen3

# Run inference with Qwen3-VL
python cycle_infer.py
```

InternVL3:

```bash
cd Hide/Internvl

# Run inference with InternVL3
python cycle_inference_internvl.py
```
If you find HiDe helpful in your research, please consider citing:
```bibtex
@misc{liu2025hiderethinkingzoominmethod,
  title={HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling},
  author={Xianjie Liu and Yiman Hu and Yixiong Zou and Liang Wu and Jian Xu and Bo Zheng},
  year={2025},
  eprint={2510.00054},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.00054},
}
```