
Researchers build a smarter system to assess how well AI edits video

Clockwise from left: Varun Biyyala, lead author of the study, a graduate of the M.S. in Artificial Intelligence and a Katz School industry professor; Jialu Li, co-author of the study and a master's student in artificial intelligence; and Bharat Cheranderprakash Kathuria, a master's student in artificial intelligence and one of the lead researchers.

By Dave Defusco

In artificial intelligence, the main challenge is no longer producing flawless video edits, whether that means replacing the sky in a sunset scene or creating a realistic animation from a still image. The real test is figuring out how well those edits were made. Until now, the industry has relied on tools that compare frames with captions or measure pixel changes, but these often fall short, especially when videos change over time.

Enter SST-EM, a new evaluation framework developed by researchers at the Katz School, who presented their work in February at the prestigious IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). SST-EM stands for Semantic, Spatial and Temporal Evaluation Metric. It is not a video editing tool itself but a way to measure how well AI models perform the task of editing video.

“We found that the tools currently used to evaluate video editing models are fragmented and often misleading,” said Varun Biyyala, lead author of the study, a graduate of the M.S. in Artificial Intelligence and a Katz School industry professor. “You could have a video that looks good in still frames, but the moment you play it, the motion looks unnatural or the story doesn't match the editing request. SST-EM fixes that.”

Most existing evaluation systems are built on a model called CLIP, which compares video frames with text prompts to measure how closely an edited video matches a desired concept. These tools are useful, but they fall short in one key area: time.
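
To make that concrete, here is a minimal sketch of a CLIP-style check, assuming the open-source Hugging Face Transformers library and the openai/clip-vit-base-patch32 checkpoint. It scores each frame against the edit prompt in isolation; notice that nothing in it looks at motion or frame order, which is exactly the gap the researchers describe.

    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def clip_frame_scores(frames, prompt):
        # `frames` is a list of PIL images sampled from the edited video.
        inputs = processor(text=[prompt], images=frames,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            out = model(**inputs)
        # Cosine similarity between the prompt and each individual frame.
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        return (img @ txt.T).squeeze(-1)  # one score per frame; order is ignored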

“CLIP scores are good for snapshots, not full scenes,” said Bharat Cheranderprakash Kathuria, a master's student in artificial intelligence and one of the lead researchers. “They don't assess how the video flows, whether objects move naturally or whether the main subjects stay consistent. That's exactly what human viewers care about.”

The team also pointed out that CLIP's training data is often outdated or biased, limiting its usefulness in modern, dynamic contexts. To remedy these shortcomings, the team designed SST-EM as a four-part evaluation pipeline, sketched in code after the list:

  • Semantic understanding: First, a vision-language model (VLM) analyzes every frame to determine whether it matches the intended story or editing request.
  • Object recognition: Using state-of-the-art object detection algorithms, the system tracks the most important objects across all frames to ensure continuity.
  • AI-agent grounding: A large language model (LLM) helps refine which objects are the focus, much as a person would identify the main subject of a scene.
  • Temporal consistency: A Vision Transformer (ViT) checks that frames flow smoothly into one another, with no jerky motion, vanishing characters or distorted backgrounds.
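
Taken together, the four signals reduce to one number. The sketch below shows one plausible way to combine them; the field names and weights here are illustrative assumptions, not the published method, since the study fits its actual weights to human reviews, as the researchers explain next.

    from dataclasses import dataclass

    @dataclass
    class PipelineScores:
        semantic: float   # VLM: does each frame match the edit request?
        spatial: float    # object detector: do key objects persist across frames?
        focus: float      # LLM agent: were the right subjects tracked?
        temporal: float   # ViT: do consecutive frames flow smoothly?

    # Illustrative weights only -- the paper fits the real ones to human reviews.
    WEIGHTS = {"semantic": 0.3, "spatial": 0.25, "focus": 0.15, "temporal": 0.3}

    def sst_em_score(s: PipelineScores) -> float:
        # Weighted sum of the four component scores, each assumed to be in [0, 1].
        return (WEIGHTS["semantic"] * s.semantic
                + WEIGHTS["spatial"] * s.spatial
                + WEIGHTS["focus"] * s.focus
                + WEIGHTS["temporal"] * s.temporal)

    print(sst_em_score(PipelineScores(0.9, 0.8, 0.95, 0.85)))  # 0.8675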

“All four parts feed into a single final score,” said Jialu Li, co-author of the study and a master's student in artificial intelligence. “We didn't just guess at how to weight them. We ran regression analyses against human reviews to determine what mattered most.”
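
That weighting step can be pictured with a small regression, sketched here with scikit-learn on invented numbers: each row of X holds one video's four component scores, y holds the matching human rating, and the fitted coefficients stand in for the weights used above.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Rows are videos; columns are semantic, spatial, focus, temporal scores.
    X = np.array([[0.9, 0.8, 0.95, 0.85],
                  [0.6, 0.7, 0.50, 0.40],
                  [0.8, 0.9, 0.80, 0.90],
                  [0.3, 0.4, 0.60, 0.20],
                  [0.7, 0.5, 0.70, 0.65]])
    y = np.array([4.5, 2.8, 4.2, 1.5, 3.3])  # human ratings, e.g. on a 1-5 scale

    reg = LinearRegression(positive=True).fit(X, y)  # keep weights non-negative
    weights = reg.coef_ / reg.coef_.sum()            # normalize to sum to 1
    print(dict(zip(["semantic", "spatial", "focus", "temporal"], weights.round(3))))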

This human-centered approach is what sets SST-EM apart. The team compared SST-EM with other popular metrics across several AI video editing models, including Video-P2P, TokenFlow, Control-A-Video and FateZero. The results were clear: SST-EM came closest to human judgment every time.

The SST-EM score achieved near-perfect correlation with expert evaluations and outperformed every other metric on measures such as image quality, object continuity and overall video realism. On the Pearson correlation, which measures linear similarity, SST-EM scored 0.962, higher than even metrics designed exclusively for image quality. On the Spearman and Kendall correlation tests, which judge how closely a system's rankings match human rankings, SST-EM achieved a perfect 1.0.
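
For readers unfamiliar with these statistics, the sketch below computes all three with SciPy on invented example scores. Pearson compares the raw values, while Spearman and Kendall compare only the rank orderings, which is why a metric that sorts models exactly as humans do earns a perfect 1.0.

    from scipy.stats import pearsonr, spearmanr, kendalltau

    metric = [0.91, 0.74, 0.83, 0.62, 0.88]  # metric's score for five edited videos
    human  = [4.6, 3.1, 3.9, 2.5, 4.3]       # mean human rating for the same videos

    print("Pearson: ", pearsonr(metric, human)[0])   # linear agreement of raw values
    print("Spearman:", spearmanr(metric, human)[0])  # agreement of rank order
    print("Kendall: ", kendalltau(metric, human)[0]) # pairwise ordering agreement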

“These numbers aren't just good; they're remarkable,” said Dr. Youshan Zhang, assistant professor of artificial intelligence and computer science. “They show that SST-EM evaluates video editing the way people do. It considers not just what is in a frame but how frames tell a coherent story.”

The team has released its code openly on GitHub and invited researchers and developers worldwide to test and refine the system. It also plans improvements, including better handling of fast scene changes, crowded backgrounds and subtle object movements.

“We see SST-EM as the foundation for the next generation of video evaluation tools,” said Dr. Zhang. “As video content becomes more complex and ubiquitous, we need metrics that evaluate video the way people watch it. This is a step in that direction.”
