
Qwen2.5-VL: Alibaba’s New Flagship Vision-Language Model

The Qwen team has released Qwen2.5-VL, their new flagship vision-language model. It marks a significant upgrade from the previous Qwen2-VL. The model is available in three sizes (3B, 7B, and 72B), with both base and instruct models openly accessible on Hugging Face and ModelScope.

Key Features

  • Visual Understanding: Qwen2.5-VL excels at recognizing common objects (e.g., flowers, birds), and is highly skilled at analyzing text, charts, icons, graphics, and layouts within images.
  • Agentic Capabilities: The model can act as a visual agent, reasoning and dynamically directing tools for tasks such as computer and phone use.
  • Video Understanding: Qwen2.5-VL can comprehend videos longer than one hour and has a new ability to capture key events by pinpointing relevant video segments.
  • Precise Object Grounding: The model accurately localizes objects within an image using bounding boxes or points, providing structured JSON outputs for coordinates and attributes (see the sketch after this list).
  • Structured Outputs: Qwen2.5-VL supports structured outputs for data like scans of invoices, forms, and tables, making it useful in finance and commerce.
  • Enhanced Text Recognition and Understanding: Upgraded OCR capabilities with improved performance in multi-scenario, multi-language, and multi-orientation text recognition and localization, along with significantly improved information extraction.
  • Powerful Document Parsing: Qwen2.5-VL employs a unique document parsing format (QwenVL HTML) to extract layout information from magazines, research papers, web pages, and mobile screenshots.
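
To make the grounding and structured-output bullets concrete, here is a minimal sketch of asking Qwen2.5-VL for bounding boxes through the Hugging Face transformers integration. It assumes a transformers version with Qwen2.5-VL support and the qwen-vl-utils helper package; the checkpoint name is the public 7B instruct model, while the image file, the exact prompt wording, and the illustrative output shape are placeholders, not output reproduced from the model.

```python
# Minimal sketch: object grounding with structured JSON output.
# Assumes transformers with Qwen2.5-VL support and the qwen-vl-utils package.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "shelf.jpg"},  # hypothetical local image
            {
                "type": "text",
                "text": "Detect every bottle in the image and return the "
                        "bounding boxes as JSON with 'bbox_2d' and 'label' keys.",
            },
        ],
    }
]

# Build the chat prompt and pack the visual inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
# Illustrative shape of the reply: [{"bbox_2d": [x1, y1, x2, y2], "label": "bottle"}, ...]
```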

Performance

Qwen2.5-VL achieves competitive performance on benchmarks spanning a range of domains and tasks, including college-level problems, math, document understanding, general question answering, video understanding, and visual agent tasks. The model excels at understanding documents and diagrams and can act as a visual agent without task-specific fine-tuning.
The smaller Qwen2.5-VL-7B-Instruct outperforms GPT-4o-mini on several tasks, and the Qwen2.5-VL-3B even outperforms the previous generation’s 7B model.

Applications

One highlighted application is world-wide image recognition: recognition capabilities have been significantly enhanced, expanding coverage to an ultra-large number of image categories, including film IPs, TV series, and a wide variety of products.

Conclusion

Qwen2.5-VL represents a significant advancement in vision-language models. Its broad visual understanding, agentic capabilities, precise object grounding, enhanced text recognition, and structured output generation make it a versatile tool for various applications. The model’s competitive performance, particularly its strengths in document understanding and visual reasoning, establishes it as a valuable resource for researchers and developers.

Key Takeaways

  • Qwen2.5-VL is a new flagship vision-language model from the Qwen team.
  • It excels in visual understanding, agentic capabilities, video understanding, precise object grounding, and structured outputs.
  • The model is available in 3B, 7B, and 72B sizes, with base and instruct models on Hugging Face and ModelScope.
  • Qwen2.5-VL achieves competitive performance on various benchmarks.
  • It features enhanced text recognition and understanding as well as powerful document parsing via the QwenVL HTML format.

Links

Announcement: qwen2.5-vl

Live demo: https://chat.qwen.ai
