techfusionnews
MMBench-Video: Overcoming Short-Video Limitations and Revolutionizing Video Understanding Evaluation

November 24, 2024
in AI, All Tech

The Need for a New Video Understanding Benchmark

Limitations of Current Evaluation Benchmarks
The launch of GPT-4o in May 2024 sparked a boom in video understanding, and the open-source leader Qwen2 has also shown its strength on various video evaluation benchmarks. However, most current benchmarks share several flaws. First, they focus on short videos, with too little video length or too few shots to assess a model's long-range sequential understanding. Second, they evaluate only relatively simple tasks, leaving many finer-grained capabilities uncovered. Third, models can still achieve high scores on them from a single frame, which indicates a weak link between the questions and the sequence of video frames. Fourth, open-ended answers are still graded with the older GPT-3.5, whose scores deviate significantly from human preferences and often overestimate model performance. So, is there a benchmark that addresses these issues better?

MMBench-Video: A New Hope
At the NeurIPS 2024 Datasets and Benchmarks (D&B) track, Zhejiang University, in collaboration with Shanghai AI Laboratory, Shanghai Jiao Tong University, and The Chinese University of Hong Kong, proposed MMBench-Video, a comprehensive open-ended benchmark for video understanding. The team also created an open-source leaderboard of the video understanding capabilities of current mainstream multimodal large language models (MLLMs).

The Superiority of the MMBench-Video Dataset

High-Quality Dataset with Full-Chain Coverage
MMBench-Video is fully manually annotated, with a primary annotation pass followed by a secondary quality verification pass. It features a rich variety of high-quality videos, and its questions and answers comprehensively cover a model's capabilities. Answering a question accurately requires extracting information across the time dimension, which gives a better assessment of the model's sequential understanding.

Distinctive Features of MMBench-Video
Compared with other datasets, MMBench-Video has several prominent features. Its videos span a wide range of durations and a variable number of shots. Video lengths range from 30 seconds to 6 minutes, which avoids both the overly simple semantics of very short videos and the high resource cost of evaluating very long ones. The number of shots per video follows a long-tailed distribution, with a single video containing up to 210 shots and therefore rich scene and context information.

A Comprehensive Test of All-Round Abilities
A model's video understanding ability consists mainly of perception and reasoning, and each part can be refined further. Inspired by MMBench and informed by the specific capabilities that video understanding involves, the researchers established a capability taxonomy of 26 fine-grained capabilities. Each capability is evaluated with dozens to hundreds of question-answer pairs, and the benchmark is not a simple collection of existing tasks.

Rich Video Types and Diverse Question-Answer Language
The benchmark covers 16 major fields, such as humanities, sports, science and education, cuisine, and finance, with each field accounting for more than 5% of the videos. In addition, the question-answer pairs are longer and semantically richer than those of traditional VideoQA datasets, and they are not limited to simple question types such as "what" and "when".

Good Temporal Indispensability and High-Quality Annotation
The researchers found that for most VideoQA datasets, a single frame of the video often provides enough information to answer a question correctly. Likely causes are small changes between frames, few video shots, or low-quality question-answer pairs. The researchers describe such datasets as having poor temporal indispensability: the questions do not truly depend on information spread across time. By contrast, MMBench-Video depends far more on temporal information, thanks to detailed rules imposed during annotation and a secondary verification of the question-answer pairs, so it assesses a model's sequential understanding more reliably.
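One way to make the single-frame problem concrete is to measure how many questions a model already answers well when it sees only one frame. The sketch below is a minimal illustration, not the benchmark's own procedure: the `QAResult` record, the 0 to 3 score scale, the threshold, and the sample scores are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class QAResult:
    """Hypothetical per-question record; scores assumed to lie on a 0-3 scale."""
    question_id: str
    single_frame_score: float  # score when the model sees one frame only
    multi_frame_score: float   # score when the model sees several frames


def single_frame_answerable_rate(results, threshold=2.0):
    """Fraction of questions answered acceptably from a single frame."""
    if not results:
        raise ValueError("results must not be empty")
    hits = sum(1 for r in results if r.single_frame_score >= threshold)
    return hits / len(results)


def mean_multi_frame_gain(results):
    """Average score improvement from adding more frames."""
    if not results:
        raise ValueError("results must not be empty")
    return mean(r.multi_frame_score - r.single_frame_score for r in results)


# Made-up scores, for illustration only.
results = [
    QAResult("q1", 3.0, 3.0),
    QAResult("q2", 0.0, 2.0),
    QAResult("q3", 1.0, 3.0),
    QAResult("q4", 2.0, 2.0),
]
rate = single_frame_answerable_rate(results)
gain = mean_multi_frame_gain(results)
print(f"answerable from one frame: {rate:.0%}, mean multi-frame gain: {gain:.2f}")
```

A low single-frame rate together with a high multi-frame gain would suggest that the questions genuinely require information from across the video.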

Performance Evaluation of Mainstream Multimodal Models

Evaluating Multiple Models on MMBench-Video
To evaluate video understanding across models more comprehensively, the MMBench-Video team selected 11 representative video-language models, 6 open-source image-text multimodal large models, and 5 closed-source models such as GPT-4o for experimental analysis.

Surprising Results and Insights
Among all the models, GPT-4o performs best at video understanding, and Gemini-Pro-v1.5 also performs very well. Surprisingly, the open-source image-text multimodal large models perform better overall on MMBench-Video than the video-language models that were fine-tuned on video question-answer pairs. The best image-text model, VILA1.5, outperforms the best video model, LLaVA-NeXT-Video, by nearly 40% in overall performance.

Reasons Behind the Performance Differences
Further investigation suggests that image-text models do better at video understanding because they process static visual information at a finer grain. Video-language models are weaker at perceiving and reasoning over static images, so they struggle even more with complex sequential reasoning and dynamic scenes. This gap exposes significant deficiencies in the spatial and temporal understanding of current video models, especially on long videos, and their sequential reasoning needs substantial improvement. In addition, image-text models reason better when given multi-frame input, which indicates that they can be extended further into video understanding, while video models need to train on a wider range of tasks to close the gap.

Impact of Video Length and Shot Number on Model Performance
Video length and shot number are considered key factors in model performance. The experiments show that as video length increases, the performance of GPT-4o with multi-frame input decreases, while the performance of open-source models such as InternVL-Chat-v1.5 and Video-LLaVA remains relatively stable. The number of shots has a larger effect than video length. When a video has more than 50 shots, GPT-4o's score drops to 75% of its original value. This indicates that frequent shot changes make the video content harder for the model to understand and degrade its performance.
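An analysis of this kind can be sketched by grouping per-video scores into shot-count buckets and comparing bucket means. The following is a minimal illustration under stated assumptions: the bucket edges, the `(num_shots, score)` record format, and the sample scores are invented for the example and are not taken from the benchmark's results.

```python
from collections import defaultdict
from statistics import mean


def bucket_label(num_shots, edges):
    """Map a shot count to a label such as '<=10', '11-50', or '>50'."""
    lower = 0
    for edge in edges:
        if num_shots <= edge:
            return f"{lower + 1}-{edge}" if lower else f"<={edge}"
        lower = edge
    return f">{edges[-1]}"


def mean_score_by_shot_bucket(records, edges=(10, 50)):
    """records: iterable of (num_shots, score). Returns {bucket label: mean score}."""
    buckets = defaultdict(list)
    for num_shots, score in records:
        buckets[bucket_label(num_shots, edges)].append(score)
    return {label: mean(scores) for label, scores in buckets.items()}


# Made-up per-video scores, for illustration only.
records = [(5, 2.0), (8, 1.8), (30, 1.6), (60, 1.5), (120, 1.35)]
means = mean_score_by_shot_bucket(records)

# Share of the low-shot score that survives in the high-shot bucket.
retention = means[">50"] / means["<=10"]
print(means)
print(f"score retained above 50 shots: {retention:.0%}")
```

Expressing the high-shot score as a fraction of the low-shot score is the same kind of ratio as the "75% of its original value" figure quoted above.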

The Role of Subtitles and Audio Information
MMBench-Video also obtains subtitles for the videos through an interface, which brings the audio modality into the evaluation as text. With subtitles added, model performance on video understanding improves significantly, and models answer complex questions more accurately when speech information is combined with visual signals. This result shows that subtitles greatly enrich the context available to the model. In long-video tasks especially, the information density of speech gives the model more clues for generating accurate answers. Note, however, that although speech information can improve performance, it may also increase the risk of hallucinated content.

The Choice of Judge Model
For the judge model, the experiments show that GPT-4 scores more fairly and stably. It resists manipulation, does not favor its own answers, and aligns better with human judgment. In contrast, GPT-3.5 tends to give higher scores, which distorts the final results. Open-source large language models such as Qwen2-72B-Instruct also show excellent scoring potential and align closely with human judgment, which suggests that they could serve as efficient evaluators.
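A simple way to compare candidate judge models is to check how far their scores drift from human scores on the same answers. The sketch below assumes a hypothetical 0 to 3 scoring scale and uses invented scores; it illustrates the idea of measuring judge leniency and is not the procedure used by the benchmark's authors.

```python
from statistics import mean


def judge_agreement(judge_scores, human_scores):
    """Compare one judge's scores with human scores on the same answers.

    mean_bias > 0 means the judge scores higher than humans on average;
    mean_abs_error measures overall disagreement regardless of direction.
    """
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    return {
        "mean_bias": mean(diffs),
        "mean_abs_error": mean([abs(d) for d in diffs]),
    }


# Made-up scores on an assumed 0-3 scale, for illustration only.
human = [3, 1, 2, 0]
lenient_judge = [3, 2, 3, 1]   # tends to score higher than humans
closer_judge = [3, 1, 2, 1]    # mostly matches humans

lenient = judge_agreement(lenient_judge, human)
closer = judge_agreement(closer_judge, human)
print("lenient judge:", lenient)
print("closer judge: ", closer)
```

A judge with a large positive mean bias would inflate every model's score, which is the kind of distortion attributed to GPT-3.5 above.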

One-Click Evaluation with VLMEvalKit and the OpenVLM Video Leaderboard
MMBench-Video currently supports one-click evaluation in VLMEvalKit, an open-source toolkit designed specifically for evaluating large vision-language models. The toolkit runs these models on a variety of benchmarks without heavy data preparation, which makes evaluation more convenient. VLMEvalKit covers both image-text and video multimodal models, and it supports single image-text pairs, interleaved image-text input, and video-text input. It implements more than 70 benchmarks, covering tasks that include, but are not limited to, image captioning and visual question answering. The supported models and benchmarks are updated continuously.

Because the evaluation results of existing video multimodal models are scattered and difficult to reproduce, the team has also established the OpenVLM Video Leaderboard, a comprehensive ranking of models' video understanding capabilities. The OpenCompass VLMEvalKit team will continue to add the latest multimodal large models and benchmarks, with the aim of building a mainstream, open, and convenient open-source evaluation system for multimodal models.

Conclusion
In summary, MMBench-Video is a new long-video, multi-shot benchmark for video understanding that covers a wide range of video content and fine-grained capabilities. It contains more than 600 long videos collected from YouTube, spanning 16 major categories such as news and sports, and aims to evaluate the spatio-temporal reasoning of MLLMs. Unlike traditional video question-answering benchmarks, MMBench-Video addresses the weaknesses of existing benchmarks in sequential understanding and complex task processing by introducing long videos and high-quality, manually annotated question-answer pairs. By using GPT-4 to grade model answers, the benchmark achieves higher evaluation accuracy and consistency. MMBench-Video gives researchers and developers a powerful evaluation tool and helps the open-source community understand and improve the capabilities of video-language models.

Tags: Evaluation Benchmark, MMBench-Video, Multimodal Models, Video Understanding
