The Qwen team has released Qwen2.5-VL, their new flagship vision-language model. It marks a significant upgrade from the previous Qwen2-VL. The model is available in three sizes (3B, 7B, and 72B), with both base and instruct models openly accessible on Hugging Face and ModelScope.
Key Features
- Visual Understanding: Qwen2.5-VL excels at recognizing common objects (e.g., flowers, birds) and is highly skilled at analyzing text, charts, icons, graphics, and layouts within images.
- Agentic Capabilities: The model can act as a visual agent, reasoning and dynamically directing tools for tasks such as computer and phone use.
- Video Understanding: Qwen2.5-VL can comprehend videos longer than one hour and has a new ability to capture key events by pinpointing relevant video segments.
- Precise Object Grounding: The model accurately localizes objects within an image using bounding boxes or points, providing structured JSON outputs for coordinates and attributes.
- Structured Outputs: Qwen2.5-VL supports structured outputs for data like scans of invoices, forms, and tables, making it useful in finance and commerce.
- Enhanced Text Recognition and Understanding: Upgraded OCR capabilities with improved performance in multi-scenario, multi-language, and multi-orientation text recognition and localization. The model has been significantly enhanced in information extraction.
- Powerful Document Parsing: Qwen2.5-VL employs a unique document parsing format (QwenVL HTML) to extract layout information from magazines, research papers, web pages, and mobile screenshots.
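The grounding feature above returns detections as structured JSON with pixel-space bounding boxes. A minimal sketch of parsing such a response; the sample output and the field names (`bbox_2d`, `label`) are assumptions based on commonly shown Qwen2.5-VL grounding examples, so verify them against your model's actual output:

```python
import json

# Hypothetical model response: a JSON list of detections, each with a
# pixel-space box [x1, y1, x2, y2] and a label. The exact field names
# are an assumption for illustration.
raw_output = """
[
  {"bbox_2d": [135, 48, 402, 310], "label": "bird"},
  {"bbox_2d": [20, 200, 110, 290], "label": "flower"}
]
"""

detections = json.loads(raw_output)
for det in detections:
    x1, y1, x2, y2 = det["bbox_2d"]
    # Report each detection with its top-left corner and box size.
    print(f'{det["label"]}: top-left=({x1}, {y1}), size={x2 - x1}x{y2 - y1}')
```

Because the output is plain JSON, downstream code (cropping, counting, drawing overlays) can consume it without any model-specific parsing logic.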
Performance
Qwen2.5-VL achieves competitive performance on benchmarks spanning multiple domains and tasks, including college-level problems, math, document understanding, general question answering, video understanding, and visual agent tasks. The model excels in understanding documents and diagrams and can act as a visual agent without task-specific fine-tuning.
The smaller Qwen2.5-VL-7B-Instruct outperforms GPT-4o-mini on several tasks, and Qwen2.5-VL-3B even outperforms the previous generation's Qwen2-VL-7B.
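Since the checkpoints are openly available on Hugging Face, inference typically follows the standard transformers chat-template flow, where images and text are mixed in a single user message. A minimal sketch of that message format; the generation steps are shown only as comments because they require downloading a checkpoint, and the class names follow the transformers Qwen2.5-VL integration, which you should check against your installed version:

```python
MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"  # checkpoint name on Hugging Face

def build_messages(image_path: str, prompt: str) -> list[dict]:
    """Compose a chat turn that mixes one image with a text instruction."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ],
    }]

messages = build_messages("invoice.jpg", "Extract the invoice fields as JSON.")

# Generation sketch (not executed here; requires downloading the 7B weights):
# from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
# processor = AutoProcessor.from_pretrained(MODEL_ID)
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")
# text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# ... feed text plus the processed image to model.generate(...)
```

The same message structure extends to multi-image and video inputs by appending further content entries to the user turn.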
Applications
The models can be used for world-wide image recognition: recognition capabilities have been significantly enhanced, expanding coverage to a vastly larger set of image categories, including film and television IPs, TV series, and a wide variety of products.
Conclusion
Qwen2.5-VL represents a significant advancement in vision-language models. Its broad visual understanding, agentic capabilities, precise object grounding, enhanced text recognition, and structured output generation make it a versatile tool for various applications. The model’s competitive performance, particularly its strengths in document understanding and visual reasoning, establishes it as a valuable resource for researchers and developers.
Key Takeaways
- Qwen2.5-VL is a new flagship vision-language model from the Qwen team.
- It excels in visual understanding, agentic capabilities, video understanding, precise object grounding, and structured outputs.
- The model is available in 3B, 7B, and 72B sizes, with base and instruct models on Hugging Face and ModelScope.
- Qwen2.5-VL achieves competitive performance on various benchmarks.
- It features enhanced text recognition and understanding, plus powerful document parsing via the QwenVL HTML format.
Links
Announcement: qwen2.5-vl
Live demo: https://chat.qwen.ai