techfusionnews

MMBench-Video: Overcoming Short-video Limitations and Revolutionizing Video Understanding Evaluation

November 24, 2024
in AI, All Tech

The Need for a New Video Understanding Benchmark

Limitations of Current Evaluation Benchmarks
The launch of GPT-4o in May 2024 sparked a boom in video understanding, and the open-source leader Qwen2 has also shown its strength on a range of video evaluation benchmarks. However, most current benchmarks share several flaws. They focus mainly on short videos, with too little footage or too few shots to assess a model’s long-range temporal understanding. They limit evaluation to relatively simple tasks and leave many finer-grained capabilities uncovered. Models can still score highly on them from a single frame, which indicates that the questions are only weakly tied to the temporal content of the video. And open-ended answers are still graded with the older GPT-3.5, whose scores deviate significantly from human preferences and often overestimate model performance. So, is there a benchmark that addresses these issues better?

MMBench-Video: A New Hope
MMBench-Video, a comprehensive open-ended video understanding benchmark, was presented in the NeurIPS 2024 Datasets and Benchmarks (D&B) track by Zhejiang University in collaboration with Shanghai AI Laboratory, Shanghai Jiao Tong University, and The Chinese University of Hong Kong. The team also created an open-source leaderboard of the video understanding capabilities of current mainstream multimodal large language models (MLLMs).

The Superiority of the MMBench-Video Dataset

High-Quality Dataset with Full-chain Coverage
The MMBench-Video benchmark is fully manually annotated, with a primary annotation pass followed by a secondary quality check. It features a rich variety of high-quality videos, and its questions and answers cover the model’s capabilities comprehensively. Answering a question accurately requires extracting information across the time dimension, which makes the benchmark a better test of a model’s temporal understanding.

Distinctive Features of MMBench-Video
Compared with other datasets, MMBench-Video has several prominent features. It spans a wide range of video durations and a variable number of shots. The collected videos run from 30 seconds to 6 minutes, which avoids both the thin semantic content of very short videos and the high resource cost of evaluating very long ones. The number of shots per video follows a long-tailed distribution, with a single video containing up to 210 shots and therefore rich scene and context information.

A Comprehensive Test of All-round Abilities
A model’s video understanding ability consists mainly of perception and reasoning, and each part can be refined further. Inspired by MMBench, and taking into account the specific capabilities involved in video understanding, the researchers established a comprehensive capability taxonomy of 26 fine-grained capabilities. Each fine-grained capability is evaluated with dozens to hundreds of question-answer pairs, and the benchmark is not a simple collection of existing tasks.
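To make the two-level structure concrete, here is a minimal sketch of how per-question scores might be rolled up, first into fine-grained capabilities and then into the two coarse groups. The capability names, the sample scores, and the 0 to 3 score scale are illustrative assumptions, not values taken from the benchmark.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_capability(records):
    """Average scores per fine-grained capability, then per coarse group.

    records: iterable of (coarse_group, fine_capability, score) tuples.
    The tuple layout is a hypothetical format chosen for this sketch.
    """
    fine_scores = defaultdict(list)
    for coarse, fine, score in records:
        fine_scores[(coarse, fine)].append(score)

    # Mean score of every fine-grained capability.
    fine_avg = {key: mean(values) for key, values in fine_scores.items()}

    # Each coarse group is the unweighted mean of its fine-grained means,
    # so a capability with many questions does not dominate the group.
    coarse_scores = defaultdict(list)
    for (coarse, _fine), avg in fine_avg.items():
        coarse_scores[coarse].append(avg)
    coarse_avg = {key: mean(values) for key, values in coarse_scores.items()}
    return fine_avg, coarse_avg


# Hypothetical records: capability names and scores are made up.
records = [
    ("perception", "object_recognition", 3),
    ("perception", "object_recognition", 1),
    ("perception", "ocr", 2),
    ("reasoning", "temporal_reasoning", 0),
    ("reasoning", "temporal_reasoning", 2),
]
fine_avg, coarse_avg = aggregate_by_capability(records)
```

Averaging the fine-grained means rather than the raw questions is one design choice among several; the benchmark's own aggregation may differ.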

Rich Video Types and Diverse Question-answer Language
The benchmark covers 16 major fields such as humanities, sports, science and education, cuisine, and finance, with each field accounting for more than 5% of the videos. Its question-answer pairs are also longer and semantically richer than those of traditional VideoQA datasets, and are not limited to simple question types such as ‘what’ and ‘when’.

Strong Temporal Indispensability and High-quality Annotation
The research found that in most VideoQA datasets a single frame of the video provides enough information to answer accurately. This may be because the frames change little, the videos contain few shots, or the question-answer pairs are of low quality. The researchers describe such datasets as having poor temporal indispensability: their questions do not truly require the time dimension. MMBench-Video depends far more heavily on temporal information, thanks to detailed rule-based restrictions during annotation and a secondary verification of the question-answer pairs, so it assesses a model’s temporal understanding more reliably.
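The single-frame check described above can be sketched as a simple rate: the share of questions a model still answers correctly when it sees only one frame. The record format and the sample values below are hypothetical and serve only to illustrate the idea; the paper's exact measurement protocol is not given in this article.

```python
def single_frame_answerable_rate(results):
    """Fraction of questions answered correctly from a single frame.

    results: list of dicts with a boolean 'single_frame_correct' field
    (a hypothetical record format used for this sketch).
    A high rate suggests the dataset barely needs the time dimension.
    """
    if not results:
        raise ValueError("results must not be empty")
    hits = sum(1 for r in results if r["single_frame_correct"])
    return hits / len(results)


# Made-up outcomes for two imaginary datasets.
weak_dataset = [
    {"single_frame_correct": True},
    {"single_frame_correct": True},
    {"single_frame_correct": True},
    {"single_frame_correct": False},
]
strong_dataset = [
    {"single_frame_correct": True},
    {"single_frame_correct": False},
    {"single_frame_correct": False},
    {"single_frame_correct": False},
]

weak_rate = single_frame_answerable_rate(weak_dataset)
strong_rate = single_frame_answerable_rate(strong_dataset)
```

In this toy data the first dataset can mostly be solved from one frame, while the second cannot, which is the property the annotation rules aim for.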

Performance Evaluation of Mainstream Multimodal Models

Evaluating Multiple Models on MMBench-Video
To evaluate video understanding performance across a broad set of models, the MMBench-Video team selected 11 representative video-language models, 6 open-source image-text multimodal large models, and 5 closed-source models such as GPT-4o for a comprehensive experimental analysis.

Surprising Results and Insights
Among all the models, GPT-4o performs outstandingly in video understanding, and Gemini-Pro-v1.5 also performs very well. Surprisingly, the existing open-source image-text multimodal large models perform better overall on MMBench-Video than the video-language models that were fine-tuned on video question-answer pairs. The best image-text model, VILA1.5, outperforms the best video model, LLaVA-NeXT-Video, by nearly 40% in overall performance.

Reasons behind the Performance Differences
Further investigation suggests that image-text models do better at video understanding because they process static visual information at a finer grain. Video-language models are weaker at static image perception and reasoning, and so they struggle when faced with more complex temporal reasoning and dynamic scenes. This gap reveals significant deficiencies in the spatial and temporal understanding of current video models, especially on long videos, and their temporal reasoning ability urgently needs improvement. In addition, the gains that image-text models achieve on reasoning with multi-frame input indicate that they have the potential to expand further into video understanding, while video models need to learn from a wider range of tasks to close the gap.

Impact of Video Length and Shot Number on Model Performance
Video length and shot number are considered key factors affecting model performance. Experimental results show that as video length increases, the performance of GPT-4o with multi-frame input decreases, while the performance of open-source models such as InternVL-Chat-v1.5 and Video-LLaVA remains relatively stable. The number of shots has a more significant impact than video length. When a video contains more than 50 shots, the performance of GPT-4o drops to 75% of its original score. This indicates that frequent shot changes make the video content harder for the model to follow, leading to performance degradation.
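A figure such as "75% of the original score" can be read as a ratio between two buckets of videos. The sketch below assumes the baseline is the mean score on videos with at most 50 shots; that reading, the function name, and the sample data are all assumptions made for illustration rather than the paper's procedure.

```python
from statistics import mean


def relative_score_by_shots(samples, threshold=50):
    """Mean score on many-shot videos relative to few-shot videos.

    samples: iterable of (shot_count, score) pairs (hypothetical format).
    Returns mean(score | shots > threshold) / mean(score | shots <= threshold).
    """
    few = [score for shots, score in samples if shots <= threshold]
    many = [score for shots, score in samples if shots > threshold]
    if not few or not many:
        raise ValueError("both buckets need at least one sample")
    return mean(many) / mean(few)


# Made-up scores chosen so that the ratio is easy to verify by hand.
samples = [(10, 2.0), (30, 2.0), (60, 1.5), (120, 1.5)]
ratio = relative_score_by_shots(samples)
```

With these toy numbers the many-shot bucket averages 1.5 and the few-shot bucket 2.0, so the ratio is 0.75.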

The Role of Subtitles and Audio Information
MMBench-Video also obtains the subtitles of each video through an interface, which introduces the audio modality in text form. With subtitles added, the models’ video understanding performance improves significantly: when the speech content is combined with the visual signal, a model can answer complex questions more accurately. This result shows that subtitle information greatly enriches the context available to the model. In long-video tasks especially, the information density of the speech modality gives the model more clues for generating accurate answers. However, although speech information can improve model performance, it may also increase the risk of hallucinated content.

The Choice of Judge Model
On the choice of judge model, experiments show that GPT-4 scores more fairly and stably. It is hard to manipulate, does not favor its own answers, and aligns better with human judgment. In contrast, GPT-3.5 tends to assign higher scores, which distorts the final results. Open-source large language models such as Qwen2-72B-Instruct also show excellent potential as judges, with scores that align closely with human judgment, which suggests they could serve as efficient evaluation models.
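One simple way to compare judge models against human graders is to measure the signed bias and the mean absolute error of their scores. This is a generic sketch, not the procedure used by the authors; the score lists are invented and the 0 to 3 scale is an assumption.

```python
def judge_alignment(judge_scores, human_scores):
    """Return (bias, mean_absolute_error) of a judge relative to humans.

    bias > 0 means the judge scores higher than humans on average,
    which is the kind of inflation attributed to lenient judges.
    """
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("score lists must be non-empty and equal in length")
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    bias = sum(diffs) / len(diffs)
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return bias, mae


# Hypothetical scores for four answers.
human = [3, 2, 0, 1]
lenient_judge = [3, 3, 1, 2]   # inflates three of the four scores
careful_judge = [3, 2, 0, 2]   # disagrees with humans only once

lenient_bias, lenient_mae = judge_alignment(lenient_judge, human)
careful_bias, careful_mae = judge_alignment(careful_judge, human)
```

A judge with lower bias and lower mean absolute error tracks human preferences more closely, which is the sense in which one judge is called better aligned than another.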

One-click Evaluation with VLMEvalKit and the OpenVLM Video Leaderboard
MMBench-Video currently supports one-click evaluation in VLMEvalKit, an open-source toolkit designed specifically for evaluating large vision-language models. It runs these models on a variety of benchmarks with a single command and without heavy data preparation, which makes evaluation more convenient. VLMEvalKit handles both image-text and video multimodal models, and supports single image-text pairs, interleaved image-text input, and video-text input. It implements more than 70 benchmarks, covering tasks including but not limited to image captioning and visual question answering. The supported models and benchmarks are updated constantly.

Because the evaluation results of existing video multimodal models are scattered and hard to reproduce, the team has also established the OpenVLM Video Leaderboard, a comprehensive leaderboard of models’ video understanding capabilities. The OpenCompass VLMEvalKit team will continue to add the latest multimodal large models and benchmarks, building a mainstream, open, and convenient open-source evaluation system for multimodal models.

Conclusion
In summary, MMBench-Video is a new long-video, multi-shot benchmark for video understanding that covers a wide range of video content and evaluates fine-grained capabilities. It contains more than 600 long videos collected from YouTube, spanning 16 major categories such as news and sports, and aims to evaluate the spatio-temporal reasoning abilities of MLLMs. Unlike traditional video question-answering benchmarks, MMBench-Video makes up for the weaknesses of existing benchmarks in temporal understanding and complex-task processing by introducing long videos and high-quality, manually annotated question-answer pairs. By using GPT-4 to grade model answers, the benchmark achieves higher evaluation accuracy and consistency. It gives researchers and developers a powerful evaluation tool and helps the open-source community understand and improve the capabilities of video-language models in depth.
