Forum

Tencent improves testing creative AI models with new benchmark


Started by: [Guest] Aug 12 2025, 16:18
Getting it right, the way a human would
So, how does Tencent’s AI benchmark work? First, an AI is given a creative task from a catalogue of over 1,800 challenges, from building data visualisations and web apps to making interactive mini-games.

Once the AI generates the code, ArtifactsBench gets to work. It automatically builds and runs the code in a secure, sandboxed environment.
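As a rough illustration (not Tencent's actual harness), running untrusted generated code in an isolated subprocess with a hard timeout might look like this:

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_generated_code(code: str, timeout: int = 10) -> subprocess.CompletedProcess:
    """Run AI-generated Python in a separate, isolated process with a
    hard timeout, a lightweight stand-in for a real sandbox."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "artifact.py"
        script.write_text(code)
        # -I runs the interpreter in isolated mode: it ignores environment
        # variables and the user's site-packages, limiting what the
        # untrusted code can reach.
        return subprocess.run(
            [sys.executable, "-I", str(script)],
            capture_output=True, text=True, timeout=timeout, cwd=workdir,
        )

result = run_generated_code("print(2 + 2)")
print(result.stdout.strip())  # → 4
```

A production sandbox would add containerisation and resource limits on top of this; the timeout and isolated mode here only guard against the most basic failure modes.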

To see how the application behaves, it captures a series of screenshots over time. This allows it to check for things like animations, state changes after a button click, and other dynamic user feedback.
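A minimal sketch of the screenshot-series idea, with a generic `capture` callback standing in for a real headless-browser screenshot API (the helper names here are hypothetical):

```python
import time

def capture_series(capture, n_frames=5, interval=0.5):
    """Call a screenshot function at fixed intervals, collecting
    (elapsed_seconds, frame) pairs for later comparison."""
    frames = []
    start = time.monotonic()
    for _ in range(n_frames):
        frames.append((time.monotonic() - start, capture()))
        time.sleep(interval)
    return frames

def is_dynamic(frames):
    """If any later frame differs from the first, the page showed
    dynamic behaviour (an animation, a state change, and so on)."""
    first = frames[0][1]
    return any(frame != first for _, frame in frames[1:])

# Demo with a fake "screenshot" that changes on every call:
ticker = iter(range(100))
print(is_dynamic(capture_series(lambda: next(ticker), n_frames=3, interval=0.01)))  # → True
```

In a real setup `capture` would return image bytes from something like a headless browser, and the comparison would be an image diff rather than simple equality.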

Finally, it hands all this evidence – the original request, the AI’s code, and the screenshots – to a Multimodal LLM (MLLM) to act as a judge.

This MLLM judge isn’t just giving a vague opinion; instead, it uses a detailed, per-task checklist to score the result across ten different metrics. Scoring includes functionality, user experience, and even aesthetic quality. This ensures the scoring is fair, consistent, and thorough.
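The checklist scoring could be sketched like this. Note the assumptions: only three of the ten metric names are confirmed by the article, and plain averaging is a guess at how per-metric scores are combined:

```python
def score_artifact(checklist: dict[str, float]) -> float:
    """Aggregate per-metric checklist scores into one task score.
    Simple averaging is an assumption; the real benchmark may weight
    metrics differently."""
    return round(sum(checklist.values()) / len(checklist), 2)

# Only these three metric names appear in the article; a real run
# would score all ten metrics on the per-task checklist.
scores = {"functionality": 9, "user_experience": 7, "aesthetic_quality": 8}
print(score_artifact(scores))  # → 8.0
```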

The big question is: does this automated judge actually have good taste? The results suggest it does.

When the rankings from ArtifactsBench were compared to WebDev Arena, the gold-standard platform where real humans vote on the best AI creations, they matched up with 94.4% consistency. This is a massive improvement over older automated benchmarks, which only managed around 69.4% consistency.
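The article doesn't say exactly how that consistency figure is computed; one plausible definition is pairwise ranking agreement, the fraction of model pairs that two leaderboards order the same way:

```python
from itertools import combinations

def pairwise_consistency(rank_a: list[str], rank_b: list[str]) -> float:
    """Fraction of model pairs that two rankings order the same way."""
    pos_a = {m: i for i, m in enumerate(rank_a)}
    pos_b = {m: i for i, m in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    agree = sum((pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y]) for x, y in pairs)
    return agree / len(pairs)

benchmark_rank = ["model_a", "model_b", "model_c", "model_d"]
human_vote_rank = ["model_a", "model_c", "model_b", "model_d"]
# 6 pairs in total, 1 ordered differently: 5/6 agreement
print(round(pairwise_consistency(benchmark_rank, human_vote_rank), 3))  # → 0.833
```

This is essentially the agreement component of Kendall's tau; the model names above are placeholders, not actual leaderboard entries.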

On top of this, the framework’s judgments showed over 90% agreement with professional human developers.
https://www.artificialintelligence-news.com/