Training-Free SOTA High-Resolution MLLMs
[ICML 2026]

HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling

Xianjie Liu Yiman Hu Yixiong Zou Liang Wu Jian Xu Bo Zheng
Alibaba Group

Attention Decoupling · Layout Preserving · 75% Less Memory

TL;DR

We show that the main issue in high-resolution MLLMs is not small object size but complex background interference. We propose HiDe, a training-free framework that uses Token-wise Attention Decoupling (TAD) to identify key information tokens and Layout-Preserving Decoupling (LPD) to eliminate background interference. HiDe sets a new SOTA on V*Bench (92.1%), HRBench4K, and HRBench8K while using 75% less memory than previous training-free approaches.

01 Abstract

Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding tasks. However, their performance on high-resolution images remains suboptimal. Existing approaches often attribute this limitation to perceptual constraints: they argue that MLLMs struggle to recognize small objects and therefore adopt "zoom in" strategies to recover detail. Our analysis reveals a different cause: the main issue is not object size but complex background interference.

We systematically analyze this "zoom in" operation through a series of decoupling experiments and propose the Hierarchical Decoupling Framework (HiDe), a training-free framework that uses Token-wise Attention Decoupling (TAD) to decouple the question tokens and identify the key information tokens, then leverages their attention weights to achieve precise alignment with the target visual regions. Subsequently, it employs Layout-Preserving Decoupling (LPD) to decouple these regions from the background and reconstructs a compact representation that preserves essential spatial layouts while eliminating background interference.

02 Key Insight

Previous Understanding

MLLMs struggle with high-resolution images due to small objects. Solution: "Zoom in" to see better.

Our Discovery

The real issue is complex background interference, not object size. Solution: Decouple the target from the background!

03 Hierarchical Decoupling Analysis

Figure 1: Hierarchical decoupling analysis of the zoom-in operation. We systematically decompose the zoom-in strategy to identify the key factors that contribute to performance improvements.

Analysis Framework

Figure 2: Our analysis framework: (i) Zoom-in upscale and crop, (ii) Crop foreground and background, (iii) Question text semantic and non-semantic tokens, (iv) Foreground object appearance and spatial layout.

Upscale Experiment

Figure 3: Simply enlarging the object does not deliver stable gains; on multi-object tasks, magnification can even hurt performance. Zoom-in works primarily because cropping removes irrelevant background.

Accuracy vs Mask Ratio

Figure 4: Performance increases monotonically with mask ratio on both single and multi-object tasks, demonstrating that complex background semantics significantly distract MLLMs.
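The mask-ratio experiment can be pictured with a small sketch. The page does not specify how the mask ratio is defined, so this assumes it is the fraction of background patches (those outside the ground-truth box) that are blanked out; the function name `mask_background`, the patch grid, and the random selection are all hypothetical choices for illustration, not the authors' protocol.

```python
import random

def mask_background(grid_h, grid_w, gt_box, ratio, seed=0):
    """Pick background patches to mask (assumed protocol, not the paper's).

    gt_box is (top, left, bottom, right) in patch coordinates, inclusive.
    Patches inside gt_box are never selected.
    """
    top, left, bottom, right = gt_box
    background = [
        (y, x)
        for y in range(grid_h)
        for x in range(grid_w)
        if not (top <= y <= bottom and left <= x <= right)
    ]
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    n_mask = round(ratio * len(background))
    return set(rng.sample(background, n_mask))

# Example: an 8x8 patch grid with a 2x2 ground-truth region.
masked = mask_background(8, 8, gt_box=(2, 2, 3, 3), ratio=0.5)
```

Sweeping `ratio` from 0 to 1 and re-evaluating the model at each step would produce a curve of the kind Figure 4 describes.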

Token-level Attention Analysis

Figure 5: Semantic tokens exhibit substantially higher attention to GT regions than non-semantic tokens, confirming that token-level attention decoupling yields more accurate region proposals.

04 Method

HiDe Framework Overview

Figure 6: Overview of the HiDe Framework. The framework consists of two key components: Token-wise Attention Decoupling (TAD) and Layout-Preserving Decoupling (LPD).

Token-wise Attention Decoupling (TAD)

Decouples question tokens and identifies key information tokens, then leverages attention weights for precise alignment with target visual regions. This allows the model to focus on what truly matters for answering the question. The process includes:

  • Extract key information using an "extract" prompt
  • Compute raw attention maps over image tokens
  • Smooth attention maps with Gaussian kernel
  • Subtract background noise prior for purification

Key properties: question-guided attention, precise region localization, no additional training needed.
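The last two TAD steps (Gaussian smoothing and background-prior subtraction) can be sketched in plain Python. This is a minimal illustration under assumptions, not the authors' implementation: the raw attention map is synthetic rather than extracted from an MLLM, the prior is simply passed in as a second map, and the names `gaussian_kernel`, `smooth`, and `purify` plus the `sigma` and `radius` values are hypothetical.

```python
import math

def gaussian_kernel(sigma, radius):
    """1-D Gaussian weights over [-radius, radius], normalized to sum to 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(att, sigma=1.0, radius=2):
    """Separable Gaussian blur of a 2-D map; indices are clamped at the edges."""
    k = gaussian_kernel(sigma, radius)
    h, w = len(att), len(att[0])
    offsets = range(-radius, radius + 1)

    def clamp(i, n):
        return max(0, min(n - 1, i))

    # Horizontal pass, then vertical pass.
    rows = [[sum(k[d + radius] * att[y][clamp(x + d, w)] for d in offsets)
             for x in range(w)] for y in range(h)]
    return [[sum(k[d + radius] * rows[clamp(y + d, h)][x] for d in offsets)
             for x in range(w)] for y in range(h)]

def purify(att, prior):
    """Subtract the background prior, clip at zero, rescale so the peak is 1."""
    diff = [[max(a - p, 0.0) for a, p in zip(row_a, row_p)]
            for row_a, row_p in zip(att, prior)]
    peak = max(max(row) for row in diff)
    if peak == 0:
        return diff
    return [[v / peak for v in row] for row in diff]

# Synthetic example: uniform background attention with one strong token.
raw = [[0.1] * 6 for _ in range(6)]
raw[2][3] = 1.0
prior = [[0.1] * 6 for _ in range(6)]  # assumed background noise prior
clean = purify(smooth(raw), prior)
```

After purification the uniform background collapses to roughly zero while the response around the target patch survives, which is the behavior the "subtract background noise prior" step is meant to achieve.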

Layout-Preserving Decoupling (LPD)

Decouples target regions from background and reconstructs a compact representation that preserves essential spatial layouts while eliminating background interference. This creates cleaner visual inputs for the MLLM. The process includes:

  • Binarize normalized attention maps using threshold
  • Extract connected components as bounding boxes
  • Grid-based reconstruction preserving spatial layout
  • Compact the image by discarding empty regions

Key properties: background noise elimination, spatial layout preservation, 75% memory reduction.
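The four LPD steps can be sketched as follows. This is an illustrative approximation under assumptions: the image is modeled as a grid of patch values at the same resolution as the attention map, components are 4-connected, and "compacting" is taken to mean dropping grid rows and columns that no box touches. The function names and the threshold of 0.5 are hypothetical, not taken from the HiDe code.

```python
from collections import deque

def binarize(att, threshold=0.5):
    """Turn a normalized attention map into a 0/1 mask."""
    return [[1 if v >= threshold else 0 for v in row] for row in att]

def component_boxes(mask):
    """Bounding boxes (top, left, bottom, right), inclusive, of 4-connected components."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:  # breadth-first flood fill
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

def compact(image, boxes, blank=0):
    """Keep only rows/columns covered by a box, in order; blank cells outside every box."""
    keep_rows = sorted({y for t, l, b, r in boxes for y in range(t, b + 1)})
    keep_cols = sorted({x for t, l, b, r in boxes for x in range(l, r + 1)})

    def inside(y, x):
        return any(t <= y <= b and l <= x <= r for t, l, b, r in boxes)

    return [[image[y][x] if inside(y, x) else blank for x in keep_cols]
            for y in keep_rows]

# Two target regions: top-left and bottom-right of a 5x6 patch grid.
att = [
    [0.9, 0.8, 0.0, 0.0, 0.0, 0.0],
    [0.7, 0.9, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.6, 0.9],
]
image = [[100 + 10 * y + x for x in range(6)] for y in range(5)]
boxes = component_boxes(binarize(att))
small = compact(image, boxes)
```

In this toy case the 5x6 grid shrinks to 3x4, and the top-left region still sits above and to the left of the bottom-right one, so relative layout is preserved while empty rows and columns are discarded.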

Framework Overview

High-Res Image → TAD (Attention Decoupling) → LPD (Layout Preserving) → MLLM (Enhanced Understanding)

05 Results

HiDe sets new SOTA on multiple benchmarks

Model             Method        V*Bench   HRBench4K   HRBench8K
Qwen2.5-VL 7B     Baseline      79.1%     71.8%       67.9%
Qwen2.5-VL 7B     + HiDe        92.1%     77.5%       75.4%
Qwen2.5-VL 32B    Baseline      87.9%     73.9%       70.4%
DeepEyes          RL Training   90.1%     75.1%       72.6%
InternVL3 8B      Baseline      80.6%     70.8%       69.9%
InternVL3 8B      + HiDe        91.6%     76.8%       77.1%
92.1% V*Bench SOTA · 75% Memory Reduction · 0 Training Required
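To make the size of the improvement explicit, the short script below computes the absolute gain of HiDe over each baseline using only the numbers in the results table. The variable and function names are illustrative.

```python
# Scores copied from the results table, in percent.
BENCHMARKS = ("V*Bench", "HRBench4K", "HRBench8K")
SCORES = {
    "Qwen2.5-VL 7B": {"Baseline": (79.1, 71.8, 67.9), "+ HiDe": (92.1, 77.5, 75.4)},
    "InternVL3 8B": {"Baseline": (80.6, 70.8, 69.9), "+ HiDe": (91.6, 76.8, 77.1)},
}

def gains(model):
    """Absolute improvement of HiDe over the baseline, in percentage points."""
    base = SCORES[model]["Baseline"]
    hide = SCORES[model]["+ HiDe"]
    return {name: round(h - b, 1) for name, b, h in zip(BENCHMARKS, base, hide)}

for model in SCORES:
    print(model, gains(model))
```

For Qwen2.5-VL 7B this gives gains of 13.0, 5.7, and 7.5 points on V*Bench, HRBench4K, and HRBench8K; for InternVL3 8B the gains are 11.0, 6.0, and 7.2 points.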

06 Visual Results

Comparison of attention maps between HiDe (TAD) and ViCrop

Figure 7: Attention map comparison. Our TAD method accurately highlights the objects of interest in both single-object and multi-object scenarios. In contrast, ViCrop's attention map appears scattered and disorganized in multi-object cases, failing to clearly distinguish or localize all relevant instances.

More Visual Cases

Additional qualitative results demonstrating the effectiveness of HiDe across various scenarios.

07 Additional Analysis

08 Get Started

Installation

# Clone the repository
git clone https://github.com/Tennine2077/HiDe.git
cd HiDe

# Create conda environment
conda create -n HiDe python=3.11.4
conda activate HiDe

# Install dependencies
pip install -r requirements.txt

Quick Start - Qwen2.5-VL

cd Hide/Qwen2.5

# Run inference
python cycle_infer.py

# Calculate metrics
python Vstar_Metric.py

Quick Start - Qwen3-VL

cd Hide/Qwen3

# Run inference with Qwen3-VL
python cycle_infer.py

Quick Start - InternVL3

cd Hide/Internvl

# Run inference with InternVL3
python cycle_inference_internvl.py

Supported Models

Qwen2.5-VL, Qwen3-VL (new), InternVL3

09 Citation

If you find HiDe helpful in your research, please consider citing:

@misc{liu2025hiderethinkingzoominmethod,
      title={HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling}, 
      author={Xianjie Liu and Yiman Hu and Yixiong Zou and Liang Wu and Jian Xu and Bo Zheng},
      year={2025},
      eprint={2510.00054},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.00054}, 
}