The Qwen team has released Qwen2.5-VL, their new flagship vision-language model. It marks a significant upgrade from the previous Qwen2-VL. The model is available in three sizes (3B, 7B, and 72B), with both base and instruct models openly accessible on Hugging Face and ModelScope.
Key Features
- Visual Understanding: Qwen2.5-VL excels at recognizing common objects (e.g., flowers, birds) and is highly skilled at analyzing text, charts, icons, graphics, and layouts within images.
- Agentic Capabilities: The model can act as a visual agent, reasoning and dynamically directing tools for tasks such as computer and phone use.
- Video Understanding: Qwen2.5-VL can comprehend videos longer than one hour and has a new ability to capture key events by pinpointing relevant video segments.
- Precise Object Grounding: The model accurately localizes objects within an image using bounding boxes or points, providing structured JSON outputs for coordinates and attributes.
- Structured Outputs: Qwen2.5-VL supports structured outputs for data like scans of invoices, forms, and tables, making it useful in finance and commerce.
- Enhanced Text Recognition and Understanding: Upgraded OCR capabilities with improved performance in multi-scenario, multi-language, and multi-orientation text recognition and localization. The model has been significantly enhanced in information extraction.
- Powerful Document Parsing: Qwen2.5-VL employs a unique document parsing format (QwenVL HTML) to extract layout information from magazines, research papers, web pages, and mobile screenshots.
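The grounding feature above returns detections as structured JSON with pixel-space bounding boxes. A minimal sketch of parsing such a response; the sample output and the field names (`bbox_2d`, `label`) are assumptions based on commonly shown Qwen2.5-VL grounding examples, so verify them against your model's actual output:

```python
import json

# Hypothetical model response: a JSON list of detections, each with a
# pixel-space box [x1, y1, x2, y2] and a label. The exact field names
# are an assumption for illustration.
raw_output = """
[
  {"bbox_2d": [135, 48, 402, 310], "label": "bird"},
  {"bbox_2d": [20, 200, 110, 290], "label": "flower"}
]
"""

detections = json.loads(raw_output)
for det in detections:
    x1, y1, x2, y2 = det["bbox_2d"]
    # Report each detection with its top-left corner and box size.
    print(f'{det["label"]}: top-left=({x1}, {y1}), size={x2 - x1}x{y2 - y1}')
```

Because the output is plain JSON, downstream code (cropping, counting, drawing overlays) can consume it without any model-specific parsing logic.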
Performance
Qwen2.5-VL achieves competitive performance on benchmarks spanning multiple domains and tasks, including college-level problems, math, document understanding, general question answering, video understanding, and visual agent tasks. The model excels in understanding documents and diagrams and can act as a visual agent without task-specific fine-tuning.
The smaller Qwen2.5-VL-7B-Instruct outperforms GPT-4o-mini on several tasks, and Qwen2.5-VL-3B even outperforms the previous generation's Qwen2-VL-7B.
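Since the checkpoints are openly available on Hugging Face, inference typically follows the standard transformers chat-template flow, where images and text are mixed in a single user message. A minimal sketch of that message format; the generation steps are shown only as comments because they require downloading a checkpoint, and the class names follow the transformers Qwen2.5-VL integration, which you should check against your installed version:

```python
MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"  # checkpoint name on Hugging Face

def build_messages(image_path: str, prompt: str) -> list[dict]:
    """Compose a chat turn that mixes one image with a text instruction."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ],
    }]

messages = build_messages("invoice.jpg", "Extract the invoice fields as JSON.")

# Generation sketch (not executed here; requires downloading the 7B weights):
# from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
# processor = AutoProcessor.from_pretrained(MODEL_ID)
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")
# text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# ... feed text plus the processed image to model.generate(...)
```

The same message structure extends to multi-image and video inputs by appending further content entries to the user turn.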
Applications
The models can be used for world-wide image recognition: recognition capabilities have been significantly enhanced, expanding coverage to a vastly larger set of image categories, including film and television IPs, TV series, and a wide variety of products.
Conclusion
Qwen2.5-VL represents a significant advancement in vision-language models. Its broad visual understanding, agentic capabilities, precise object grounding, enhanced text recognition, and structured output generation make it a versatile tool for various applications. The model’s competitive performance, particularly its strengths in document understanding and visual reasoning, establishes it as a valuable resource for researchers and developers.
Key Takeaways
- Qwen2.5-VL is a new flagship vision-language model from the Qwen team.
- It excels in visual understanding, agentic capabilities, video understanding, precise object grounding, and structured outputs.
- The model is available in 3B, 7B, and 72B sizes, with base and instruct models on Hugging Face and ModelScope.
- Qwen2.5-VL achieves competitive performance on various benchmarks.
- It features enhanced text recognition and understanding, plus powerful document parsing via the QwenVL HTML format.
Links
Announcement: qwen2.5-vl
Live demo: https://chat.qwen.ai