techfusionnews
  • Home
  • Digital Lifestyle
    What’s the Real Cost of Being Always Connected

    What’s the Real Cost of Being Always Connected

    Can AI Truly Replace Human Creativity?

    Can AI Truly Replace Human Creativity?

    Are Virtual Reality Workspaces the Future of Office Life?

    Are Virtual Reality Workspaces the Future of Office Life?

    How Secure Is Your Data in the Cloud?

    How Secure Is Your Data in the Cloud?

    Is the Digital Nomad Lifestyle Sustainable in 2026?

    Is the Digital Nomad Lifestyle Sustainable in 2026?

    Are You Ready to Live Without Your Phone?

    Are You Ready to Live Without Your Phone?

  • Green Tech & Wellness
    Is Organic Farming the Missing Link in Health Tech?

    Is Organic Farming the Missing Link in Health Tech?

    What Role Does Sustainable Fashion Play in Mental Wellbeing?

    What Role Does Sustainable Fashion Play in Mental Wellbeing?

    Can Green Tech Help Us Achieve True Digital Detox?

    Can Green Tech Help Us Achieve True Digital Detox?

    Are Eco-Friendly Homes the Key to Better Sleep?

    Are Eco-Friendly Homes the Key to Better Sleep?

    How Does Renewable Energy Impact Personal Wellness?

    How Does Renewable Energy Impact Personal Wellness?

    Can Nature-Inspired Design Enhance Workplace Productivity?

    Can Nature-Inspired Design Enhance Workplace Productivity?

  • AI
    Can AI Be Creative Without Human Input?

    Can AI Be Creative Without Human Input?

    What If AI Could Predict the Future?

    What If AI Could Predict the Future?

    Will AI Ever Be Truly Conscious?

    Will AI Ever Be Truly Conscious?

    Could AI Become the Ultimate Philosopher?

    Could AI Become the Ultimate Philosopher?

    Are We Ready for AI-Powered Art to Dominate the Market?

    Are We Ready for AI-Powered Art to Dominate the Market?

    What Happens When AI Becomes More Ethical Than Us

    What Happens When AI Becomes More Ethical Than Us

  • Space Exploration
    What Is the Future of Space Mining

    What Is the Future of Space Mining

    Are We Really Ready for a Human Colony on the Moon?

    Are We Really Ready for a Human Colony on the Moon?

    What If We Found Life in the Subsurface Oceans of Europa?

    What If We Found Life in the Subsurface Oceans of Europa?

    Could Quantum Physics Unlock Intergalactic Travel?

    Could Quantum Physics Unlock Intergalactic Travel?

    How Do Black Holes Influence the Fate of Galaxies?

    How Do Black Holes Influence the Fate of Galaxies?

    Is the Search for Extraterrestrial Life Just a Fantasy?

    Is the Search for Extraterrestrial Life Just a Fantasy?

  • Innovation & Research
    Can Sustainability Drive the Next Wave of Innovation in Engineering?

    Can Sustainability Drive the Next Wave of Innovation in Engineering?

    Is Open Innovation the Key to Faster Technological Progress?

    Is Open Innovation the Key to Faster Technological Progress?

    What Role Does Collaboration Play in Modern Scientific Breakthroughs?

    What Role Does Collaboration Play in Modern Scientific Breakthroughs?

    Are We Too Focused on Big Data in Research Innovation?

    Are We Too Focused on Big Data in Research Innovation?

    What Makes a Great Researcher in Today’s Innovation Landscape?

    What Makes a Great Researcher in Today’s Innovation Landscape?

    Can AI Revolutionize the Way We Approach Healthcare Innovation?

    Can AI Revolutionize the Way We Approach Healthcare Innovation?

  • All Tech
    Can AI Be Creative Without Human Input?

    Can AI Be Creative Without Human Input?

    What’s the Real Cost of Being Always Connected

    What’s the Real Cost of Being Always Connected

    Is Organic Farming the Missing Link in Health Tech?

    Is Organic Farming the Missing Link in Health Tech?

    Can Sustainability Drive the Next Wave of Innovation in Engineering?

    Can Sustainability Drive the Next Wave of Innovation in Engineering?

    What Is the Future of Space Mining

    What Is the Future of Space Mining

    What If AI Could Predict the Future?

    What If AI Could Predict the Future?

techfusionnews
  • Home
  • Digital Lifestyle
    What’s the Real Cost of Being Always Connected

    What’s the Real Cost of Being Always Connected

    Can AI Truly Replace Human Creativity?

    Can AI Truly Replace Human Creativity?

    Are Virtual Reality Workspaces the Future of Office Life?

    Are Virtual Reality Workspaces the Future of Office Life?

    How Secure Is Your Data in the Cloud?

    How Secure Is Your Data in the Cloud?

    Is the Digital Nomad Lifestyle Sustainable in 2026?

    Is the Digital Nomad Lifestyle Sustainable in 2026?

    Are You Ready to Live Without Your Phone?

    Are You Ready to Live Without Your Phone?

  • Green Tech & Wellness
    Is Organic Farming the Missing Link in Health Tech?

    Is Organic Farming the Missing Link in Health Tech?

    What Role Does Sustainable Fashion Play in Mental Wellbeing?

    What Role Does Sustainable Fashion Play in Mental Wellbeing?

    Can Green Tech Help Us Achieve True Digital Detox?

    Can Green Tech Help Us Achieve True Digital Detox?

    Are Eco-Friendly Homes the Key to Better Sleep?

    Are Eco-Friendly Homes the Key to Better Sleep?

    How Does Renewable Energy Impact Personal Wellness?

    How Does Renewable Energy Impact Personal Wellness?

    Can Nature-Inspired Design Enhance Workplace Productivity?

    Can Nature-Inspired Design Enhance Workplace Productivity?

  • AI
    Can AI Be Creative Without Human Input?

    Can AI Be Creative Without Human Input?

    What If AI Could Predict the Future?

    What If AI Could Predict the Future?

    Will AI Ever Be Truly Conscious?

    Will AI Ever Be Truly Conscious?

    Could AI Become the Ultimate Philosopher?

    Could AI Become the Ultimate Philosopher?

    Are We Ready for AI-Powered Art to Dominate the Market?

    Are We Ready for AI-Powered Art to Dominate the Market?

    What Happens When AI Becomes More Ethical Than Us

    What Happens When AI Becomes More Ethical Than Us

  • Space Exploration
    What Is the Future of Space Mining

    What Is the Future of Space Mining

    Are We Really Ready for a Human Colony on the Moon?

    Are We Really Ready for a Human Colony on the Moon?

    What If We Found Life in the Subsurface Oceans of Europa?

    What If We Found Life in the Subsurface Oceans of Europa?

    Could Quantum Physics Unlock Intergalactic Travel?

    Could Quantum Physics Unlock Intergalactic Travel?

    How Do Black Holes Influence the Fate of Galaxies?

    How Do Black Holes Influence the Fate of Galaxies?

    Is the Search for Extraterrestrial Life Just a Fantasy?

    Is the Search for Extraterrestrial Life Just a Fantasy?

  • Innovation & Research
    Can Sustainability Drive the Next Wave of Innovation in Engineering?

    Can Sustainability Drive the Next Wave of Innovation in Engineering?

    Is Open Innovation the Key to Faster Technological Progress?

    Is Open Innovation the Key to Faster Technological Progress?

    What Role Does Collaboration Play in Modern Scientific Breakthroughs?

    What Role Does Collaboration Play in Modern Scientific Breakthroughs?

    Are We Too Focused on Big Data in Research Innovation?

    Are We Too Focused on Big Data in Research Innovation?

    What Makes a Great Researcher in Today’s Innovation Landscape?

    What Makes a Great Researcher in Today’s Innovation Landscape?

    Can AI Revolutionize the Way We Approach Healthcare Innovation?

    Can AI Revolutionize the Way We Approach Healthcare Innovation?

  • All Tech
    Can AI Be Creative Without Human Input?

    Can AI Be Creative Without Human Input?

    What’s the Real Cost of Being Always Connected

    What’s the Real Cost of Being Always Connected

    Is Organic Farming the Missing Link in Health Tech?

    Is Organic Farming the Missing Link in Health Tech?

    Can Sustainability Drive the Next Wave of Innovation in Engineering?

    Can Sustainability Drive the Next Wave of Innovation in Engineering?

    What Is the Future of Space Mining

    What Is the Future of Space Mining

    What If AI Could Predict the Future?

    What If AI Could Predict the Future?

No Result
View All Result
Plugin Install : Cart Icon need WooCommerce plugin to be installed.
techfusionnews
No Result
View All Result
Home AI

MMBench – Video: Overcoming Short – video Limitations and Revolutionizing Video Understanding Evaluation

November 24, 2024
in AI, All Tech
MMBench – Video: Overcoming Short – video Limitations and Revolutionizing Video Understanding Evaluation

The Need for a New Video Understanding Benchmark

Limitations of Current Evaluation Benchmarks
The GPT – 4o April launch sparked a boom in video understanding, and the open – source leader Qwen2 also demonstrated its prowess in various video evaluation benchmarks. However, most current evaluation benchmarks have several flaws. They mainly focus on short videos, with insufficient video length or number of video shots, making it difficult to assess the model’s long – term sequential understanding ability. The evaluation of models is limited to relatively simple tasks, and many finer – grained capabilities are not covered by most benchmarks. Existing benchmarks can still achieve high scores with a single frame image, indicating weak sequential correlation between questions and video frames. The assessment of open – ended questions still uses the older GPT – 3.5, resulting in significant 偏差 between scoring and human preferences and inaccurate evaluations that often overestimate model performance. So, is there a benchmark that can better address these issues?

MMBench – Video: A New Hope
In the latest NeurIPS D&B 2024, a comprehensive open – ended video understanding evaluation benchmark, MMBench – Video, was proposed by Zhejiang University in collaboration with Shanghai AI Laboratory, Shanghai Jiao Tong University, and The Chinese University of Hong Kong. It also created an open – source evaluation list for the video understanding capabilities of current mainstream MLLMs.

The Superiority of MMBench – Video Dataset

High – Quality Dataset with Full – chain Coverage
The MMBench – Video evaluation benchmark for video understanding is fully manually annotated, undergoing primary annotation and secondary quality verification. It features a rich variety of high – quality videos, and the questions and answers comprehensively cover the model’s capabilities. Answering questions accurately requires extracting information across the time dimension, better assessing the model’s sequential understanding ability.

Distinctive Features of MMBench – Video
Compared with other datasets, MMBench – Video has several prominent features. It has a wide span of video durations and variable number of shots. The collected video lengths range from 30 seconds to 6 minutes, avoiding the problems of simple semantic information in very short videos and high resource consumption in evaluating very long videos. The number of shots in the videos has an overall long – tailed distribution, with a video having up to 210 shots, containing rich scene and context information.

A Comprehensive Test of All – round Abilities
A model’s video understanding ability mainly consists of perception and reasoning, and each part can be further refined. Inspired by MMBench and combined with the specific capabilities involved in video understanding, researchers have established a comprehensive capability spectrum containing 26 fine – grained capabilities. Each fine – grained capability is evaluated with dozens to hundreds of question – answer pairs, and it is not a simple collection of existing tasks.

Rich Video Types and Diverse Question – answer Languages
It covers 16 major fields such as humanities, sports, science and education, cuisine, and finance, with each field accounting for more than 5% of the videos. At the same time, the question – answer pairs have further improvements in length and semantic richness compared to traditional VideoQA datasets, not limited to simple question types like ‘what’ and ‘when’.

Good Temporal Independence and High – quality Annotation
In the research, it was found that most VideoQA datasets can obtain sufficient information from just one frame within the video to answer accurately. This may be because the changes between frames in the video are small, there are few video shots, or the quality of the question – answer pairs is low. Researchers call this poor temporal independence of the dataset. Compared with them, MMBench – Video has significantly lower temporal independence due to detailed rule – based restrictions during annotation and secondary verification of question – answer pairs, enabling better assessment of the model’s sequential understanding ability.

Performance Evaluation of Mainstream Multimodal Models

Evaluating Multiple Models on MMBench – Video
To more comprehensively evaluate the video understanding performance of multiple models, MMBench – Video selected 11 representative video – language models, 6 open – source image – text multimodal large models, and 5 closed – source models like GPT – 4o for comprehensive experimental analysis.

Surprising Results and Insights
Among all the models, GPT – 4o performs outstandingly in video understanding, and Gemini – Pro – v1.5 also shows excellent model performance. Surprisingly, the existing open – source image – text multimodal large models perform better overall on MMBench – Video than the video – question – pair – fine – tuned video – language models. The best image – text model, VILA1.5, outperforms the best video model, LLaVA – NeXT – Video, by nearly 40% in overall performance.

Reasons behind the Performance Differences
Further investigation reveals that the reason image – text models perform better in video understanding may be that they have stronger fine – grained processing capabilities when handling static visual information. Video – language models have deficiencies in static image perception and reasoning performance, and thus struggle when faced with more complex sequential reasoning and dynamic scenes. This difference reveals significant deficiencies in current video models’ spatial and temporal understanding, especially when handling long – video content, and their sequential reasoning ability urgently needs improvement. In addition, the performance improvement of image – text models in reasoning through multi – frame input indicates that they have the potential to further expand into the video understanding field, while video models need to strengthen learning in a wider range of tasks to bridge this gap.

Impact of Video Length and Shot Number on Model Performance
Video length and shot number are considered key factors affecting model performance. Experimental results show that as the video length increases, the performance of GPT – 4o with multi – frame input decreases, while the performance of open – source models such as InternVL – Chat – v1.5 and Video – LLaVA remains relatively stable. Compared with video length, the number of shots has a more significant impact on model performance. When the number of video shots exceeds 50, the performance of GPT – 4o drops to 75% of its original score. This indicates that frequent shot changes make it more difficult for the model to understand the video content, leading to performance degradation.

The Role of Subtitles and Audio Information
In addition, MMBench – Video also obtains subtitle information of videos through an interface, thereby introducing the audio modality through text. After introduction, the model’s performance in video understanding has been significantly improved. When audio signals are combined with visual signals, the model can answer complex questions more accurately. This experimental result shows that the addition of subtitle information can greatly enrich the model’s context understanding ability. Especially in long – video tasks, the information density of the speech modality provides the model with more clues to generate more accurate answers. However, it should be noted that although speech information can improve model performance, it may also increase the risk of generating hallucination content.

The Choice of Referee Model
In terms of referee model selection, experiments show that GPT – 4 has more fair and stable scoring capabilities, with strong anti – manipulation properties and scoring that is not biased towards its own answers, aligning better with human judgment. In contrast, GPT – 3.5 tends to have higher scores during scoring, leading to distorted final results. Meanwhile, open – source large – language models, such as Qwen2 – 72B – Instruct, also show excellent scoring potential, with outstanding alignment with human judgment, proving that they have the potential to become an efficient evaluation model tool.

One – click Evaluation with VLMEvalKit and the OpenVLM Video Leaderboard
MMBench – Video currently supports one – click evaluation in VLMEvalKit. VLMEvalKit is an open – source toolkit designed specifically for evaluating large visual – language models. It supports one – click evaluation of large visual – language models on various benchmark tests without the need for heavy data preparation, making the evaluation process more convenient. VLMEvalKit is applicable to the evaluation of image – text multimodal models and video multimodal models, supporting single – pair image – text input, interleaved image – text input, and video – text input. It implements more than 70 benchmark tests, covering multiple tasks including but not limited to image captioning, visual question answering, and image subtitle generation. The supported models and evaluation benchmarks are constantly being updated.

Based on the reality that the evaluation results of existing video multimodal models are relatively scattered and difficult to reproduce, the team has also established the OpenVLM Video Leaderboard, a comprehensive evaluation list for the video understanding capabilities of models. The OpenCompass VLMEvalKit team will continue to update the latest multimodal large models and evaluation benchmarks, creating a mainstream, open, and convenient multimodal open – source evaluation system.

Conclusion
In summary, MMBench – Video is a new long – video, multi – shot benchmark designed for video understanding tasks, covering a wide range of video content and fine – grained capability evaluation. The benchmark test contains more than 600 long videos collected from YouTube, covering 16 major categories such as news and sports, aiming to evaluate the spatio – temporal reasoning abilities of MLLMs. Different from traditional video – question – answer benchmarks, MMBench – Video makes up for the deficiencies of existing benchmarks in sequential understanding and complex – task processing by introducing long videos and high – quality manually annotated question – answer pairs. By using GPT – 4 to evaluate the model’s answers, this benchmark shows higher evaluation accuracy and consistency, providing a powerful tool for model improvement in the video understanding field. The introduction of MMBench – Video provides researchers and developers with a powerful evaluation tool, helping the open – source community to deeply understand and optimize the capabilities of video – language models.

Tags: Evaluation BenchmarkMMBench - VideoMultimodal ModelsVideo Understanding
ShareTweetShare

Related Posts

Can AI Be Creative Without Human Input?
AI

Can AI Be Creative Without Human Input?

January 16, 2026
What’s the Real Cost of Being Always Connected
All Tech

What’s the Real Cost of Being Always Connected

January 16, 2026
Is Organic Farming the Missing Link in Health Tech?
All Tech

Is Organic Farming the Missing Link in Health Tech?

January 16, 2026
Can Sustainability Drive the Next Wave of Innovation in Engineering?
All Tech

Can Sustainability Drive the Next Wave of Innovation in Engineering?

January 16, 2026
What Is the Future of Space Mining
All Tech

What Is the Future of Space Mining

January 16, 2026
What If AI Could Predict the Future?
AI

What If AI Could Predict the Future?

January 15, 2026

Discussion about this post

  • Trending
  • Comments
  • Latest
Eternal Luminary: Humanity’s Perpetual Fascination with the Sun

Eternal Luminary: Humanity’s Perpetual Fascination with the Sun

November 5, 2024
The Race Heats Up: OpenAI Joins the AI-Powered Search Arena

The Race Heats Up: OpenAI Joins the AI-Powered Search Arena

October 16, 2024
The Canon DIGITAL IXUS Legacy: Redefining Photography with Style and Innovation

The Canon DIGITAL IXUS Legacy: Redefining Photography with Style and Innovation

November 2, 2024
A New Hope: Exploring KarXT’s Potential in Treating Alzheimer’s-Related Psychosis

A New Hope: Exploring KarXT’s Potential in Treating Alzheimer’s-Related Psychosis

December 5, 2024
The Lunar Symphony: Hal Clement’s Prophetic Fantasia

The Lunar Symphony: Hal Clement’s Prophetic Fantasia

Unlocking the Future with AI’s Latest Breakthroughs: A Journey into the Unchartered Frontier

Unlocking the Future with AI’s Latest Breakthroughs: A Journey into the Unchartered Frontier

The Transformative Power of Machine Learning: Shaping the Future of Technology and Beyond

The Transformative Power of Machine Learning: Shaping the Future of Technology and Beyond

The Emotional Intelligence of AI: Bridging the Gap Between Machines and Hearts

The Emotional Intelligence of AI: Bridging the Gap Between Machines and Hearts

Can AI Be Creative Without Human Input?

Can AI Be Creative Without Human Input?

January 16, 2026
What’s the Real Cost of Being Always Connected

What’s the Real Cost of Being Always Connected

January 16, 2026
Is Organic Farming the Missing Link in Health Tech?

Is Organic Farming the Missing Link in Health Tech?

January 16, 2026
Can Sustainability Drive the Next Wave of Innovation in Engineering?

Can Sustainability Drive the Next Wave of Innovation in Engineering?

January 16, 2026
techfusionnews

Discover the essence of innovation at "Tech Aggregator," where the latest in tech converges. From cutting-edge gadgets to cosmic ventures and green breakthroughs, our site offers a streamlined look at the future of technology. Engage with concise, impactful content designed for those eager to stay ahead in an ever-evolving digital landscape. Join us at the forefront of the tech revolution.

© 2025 techfusionnews.com. contacts:[email protected]

No Result
View All Result
  • Home
  • Digital Lifestyle
  • Green Tech & Wellness
  • AI
  • Space Exploration
  • Innovation & Research
  • All Tech

© 2025 techfusionnews.com. contacts:[email protected]

Welcome Back!

Login to your account below

Forgotten Password?

Retrieve your password

Please enter your username or email address to reset your password.

Log In