HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs

TL;DR

We show that the main obstacle for MLLMs on high-resolution images is not small object size but complex background interference. We propose HiDe, a training-free framework that uses Token-wise Attention Decoupling (TAD) to identify key information tokens and Layout-Preserving Decoupling (LPD) to eliminate background interference. HiDe sets a new SOTA on V*Bench (92.1%), HRBench4K, and HRBench8K while using 75% less memory than previous training-free approaches.

01 Abstract

Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding tasks, yet their performance on high-resolution images remains suboptimal. Existing approaches often attribute this limitation to perceptual constraints: they argue that MLLMs struggle to recognize small objects and therefore adopt "zoom in" strategies to recover detail. Our analysis reveals a different cause: the main issue is not object size but complex background interference.

We systematically analyze the "zoom in" operation through a series of decoupling experiments and propose the Hierarchical Decoupling Framework (HiDe), a training-free framework with two stages. First, Token-wise Attention Decoupling (TAD) decouples the question tokens, identifies the key information tokens, and uses their attention weights to align precisely with the target visual regions. Second, Layout-Preserving Decoupling (LPD) decouples these regions from the background and reconstructs a compact representation that preserves essential spatial layouts while eliminating background interference.

02 Key Insight

Previous Understanding

MLLMs struggle with high-resolution images due to small objects. Solution: "Zoom in" to see better.

Our Discovery

The real issue is complex background interference, not object size. Solution: Decouple the target from the background!

03 Hierarchical Decoupling Analysis

Figure 1: Hierarchical decoupling analysis of the zoom-in operation. We systematically decompose the zoom-in strategy to identify the key factors that contribute to performance improvements.

Figure 2: Our analysis framework: (i) Zoom-in upscale and crop, (ii) Crop foreground and background, (iii) Question text semantic and non-semantic tokens, (iv) Foreground object appearance and spatial layout.

Figure 3: Simply enlarging the object does not deliver stable gains; on multi-object tasks, magnification can even hurt performance. Zoom-in works primarily because cropping removes irrelevant background.

Figure 4: Performance increases monotonically with mask ratio on both single and multi-object tasks, demonstrating that complex background semantics significantly distract MLLMs.

Figure 5: Semantic tokens exhibit substantially higher attention to GT regions than non-semantic tokens, confirming that token-level attention decoupling yields more accurate region proposals.
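
The background-masking experiment behind Figure 4 can be approximated with a small sketch. It is not the authors' protocol: the helper name `mask_background`, the patch size of 14, the zero fill value, and the random choice of which background patches to mask are all assumptions made for illustration. The idea is only to blank out a chosen fraction of the patches that do not overlap the ground-truth box, so the same question can be asked at increasing mask ratios.

```python
import numpy as np

def mask_background(image, gt_box, mask_ratio, patch=14, seed=0):
    """Zero out a random fraction of background patches (hypothetical helper).

    image:      (H, W) or (H, W, C) array
    gt_box:     (r0, c0, r1, c1), rows/cols of the target region, end-exclusive
    mask_ratio: fraction of background patches to blank, in [0, 1]
    """
    rng = np.random.default_rng(seed)
    out = image.copy()
    h, w = image.shape[:2]
    r0, c0, r1, c1 = gt_box

    # Collect the top-left corners of patches that do not touch the GT box.
    background = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            overlaps = r < r1 and r + patch > r0 and c < c1 and c + patch > c0
            if not overlaps:
                background.append((r, c))

    # Blank a random subset of those patches; the target is never touched.
    n_mask = int(mask_ratio * len(background))
    for i in rng.permutation(len(background))[:n_mask]:
        r, c = background[i]
        out[r:r + patch, c:c + patch] = 0
    return out
```

Sweeping `mask_ratio` from 0 to 1 and re-running the model on each output would give one curve of the kind the figure describes.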

04 Method

Figure 6: Overview of the HiDe Framework. The framework consists of two key components: Token-wise Attention Decoupling (TAD) and Layout-Preserving Decoupling (LPD).

Token-wise Attention Decoupling (TAD)

Decouples question tokens and identifies key information tokens, then leverages attention weights for precise alignment with target visual regions. This allows the model to focus on what truly matters for answering the question. The process includes:

Extract key information using an "extract" prompt
Compute raw attention maps over image tokens
Smooth attention maps with Gaussian kernel
Subtract background noise prior for purification

Question-guided attention

Precise region localization

No additional training needed
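
The smoothing and purification steps above can be sketched in NumPy as follows. This is a minimal illustration, not the repository's implementation: the function names, the `sigma` default, clipping negative values to zero, and the min-max normalization are assumptions, and the background noise prior is taken as a given array because this page does not say how it is computed. Extracting the per-token attention maps from the MLLM is also outside the sketch.

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    """Normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def smooth(attn, sigma=1.0):
    """Separable Gaussian smoothing of an (H, W) map with edge padding."""
    radius = max(1, int(3 * sigma))
    k = gaussian_kernel1d(sigma, radius)
    padded = np.pad(attn, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def purify_attention(token_attn, noise_prior, sigma=1.0):
    """Smooth each key-token map, subtract the prior, rescale to [0, 1].

    token_attn:  (T, H, W) raw attention of T key information tokens
                 over the image-token grid
    noise_prior: (H, W) background attention prior (assumed to be given)
    """
    maps = []
    for attn in token_attn:
        purified = np.clip(smooth(attn, sigma) - noise_prior, 0.0, None)
        span = purified.max() - purified.min()
        if span > 0:
            maps.append((purified - purified.min()) / span)
        else:
            maps.append(np.zeros_like(purified))
    return np.stack(maps)
```

The output keeps one map per key token, so a multi-object question can yield several distinct regions instead of one averaged map.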

Layout-Preserving Decoupling (LPD)

Decouples target regions from background and reconstructs a compact representation that preserves essential spatial layouts while eliminating background interference. This creates cleaner visual inputs for the MLLM. The process includes:

Binarize normalized attention maps using threshold
Extract connected components as bounding boxes
Grid-based reconstruction preserving spatial layout
Compact the image by discarding empty regions

Background noise elimination

Spatial layout preservation

75% memory reduction
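
The four LPD steps can be illustrated with the sketch below. It makes several assumptions that are not stated on this page: the attention map has already been resized to the image resolution, components are 4-connected, the threshold default is 0.5, and "grid-based reconstruction" is interpreted as keeping only the rows and columns that intersect some bounding box. The function names are hypothetical.

```python
from collections import deque
import numpy as np

def bounding_boxes(mask):
    """4-connected components of a boolean mask as (r0, c0, r1, c1), end-exclusive."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes = []
    for r in range(h):
        for c in range(w):
            if not mask[r, c] or seen[r, c]:
                continue
            queue = deque([(r, c)])
            seen[r, c] = True
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                y, x = queue.popleft()
                r0, c0, r1, c1 = min(r0, y), min(c0, x), max(r1, y), max(c1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            boxes.append((r0, c0, r1 + 1, c1 + 1))
    return boxes

def compact_layout(image, attn, threshold=0.5):
    """Keep box contents, blank the rest, and drop rows/columns with no box.

    image: (H, W) or (H, W, C) array; attn: (H, W) map normalized to [0, 1].
    Returns the compacted image and the list of boxes.
    """
    boxes = bounding_boxes(attn >= threshold)      # binarize, then find components
    keep = np.zeros(attn.shape, dtype=bool)
    for r0, c0, r1, c1 in boxes:
        keep[r0:r1, c0:c1] = True
    keep_px = keep[..., None] if image.ndim == 3 else keep
    masked = np.where(keep_px, image, 0)           # remove background content
    rows, cols = keep.any(axis=1), keep.any(axis=0)
    return masked[rows][:, cols], boxes            # discard empty rows and columns
```

Because rows and columns are removed without being reordered, an object that was above and to the left of another stays above and to the left in the compacted image. If no pixel passes the threshold, the result is an empty array, which a caller would need to handle (for example by falling back to the original image).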

Framework Overview

High-Res Image → TAD (Attention Decoupling) → LPD (Layout Preserving) → MLLM Enhanced Understanding

05 Results

HiDe sets new SOTA on multiple benchmarks

Model	Method	V*Bench	HRBench4K	HRBench8K
Qwen2.5-VL 7B	Baseline	79.1%	71.8%	67.9%
Qwen2.5-VL 7B	+ HiDe	92.1%	77.5%	75.4%
Qwen2.5-VL 32B	Baseline	87.9%	73.9%	70.4%
DeepEyes	RL Training	90.1%	75.1%	72.6%
InternVL3 8B	Baseline	80.6%	70.8%	69.9%
InternVL3 8B	+ HiDe	91.6%	76.8%	77.1%

V*Bench SOTA: 92.1%
Memory Reduction: 75%
Training Required: 0

06 Visual Results

Comparison of attention maps between HiDe (TAD) and ViCrop

Attention Map Comparison (Figure 7)

Our TAD method accurately highlights the objects of interest in both single-object and multi-object scenarios. In contrast, ViCrop's attention map appears scattered and disorganized in multi-object cases, failing to clearly distinguish or localize all relevant instances.

More Visual Cases

Additional qualitative results demonstrating the effectiveness of HiDe across various scenarios.

08 Get Started

Installation

# Clone the repository
git clone https://github.com/Tennine2077/HiDe.git
cd HiDe

# Create conda environment
conda create -n HiDe python=3.11.4
conda activate HiDe

# Install dependencies
pip install -r requirements.txt

Quick Start - Qwen2.5-VL

cd Hide/Qwen2.5

# Run inference
python cycle_infer.py

# Calculate metrics
python Vstar_Metric.py

Quick Start - Qwen3-VL

cd Hide/Qwen3

# Run inference with Qwen3-VL
python cycle_infer.py

Quick Start - InternVL3

cd Hide/Internvl

# Run inference with InternVL3
python cycle_inference_internvl.py

Supported Models

Qwen2.5-VL, Qwen3-VL (New!), InternVL3

09 Citation

If you find HiDe helpful in your research, please consider citing:

@misc{liu2025hiderethinkingzoominmethod,
      title={HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling}, 
      author={Xianjie Liu and Yiman Hu and Yixiong Zou and Liang Wu and Jian Xu and Bo Zheng},
      year={2025},
      eprint={2510.00054},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.00054}, 
}