
Details about METR’s preliminary evaluation of Claude 3.5 ...
METR evaluated Claude-3.5-Sonnet on tasks from both our general autonomy and AI R&D task suites. The general autonomy evaluations were performed similarly to our GPT-4o evaluation, …
METR: Claude Opus 4.5 has a 50% task completion time horizon ...
10 hours ago · METR: Claude Opus 4.5 has a 50% task completion time horizon of about 4 hours and 49 minutes, more than double that of Claude Opus 4 released earlier this year — We …
Anthropic's models beat o3 in some time-horizon tests | METR ...
In measurements using our set of multi-step software and reasoning tasks, Anthropic's Claude 4 Opus and Sonnet reach 50%-time-horizon point estimates of about 80 and 65 minutes, …
autonomy-evals-guide/claude_3_5_sonnet_report.md at public ...
As such, in this report, "Claude 3.5 Sonnet" refers to the model that is named claude-3-5-sonnet-20240620 in the Anthropic API, rather than the newly released Claude 3.5 Sonnet model with …
METR announces: Claude Opus 4.5, in task completion, its predecessor ...
The AI research organization METR has published its performance evaluation of Claude Opus 4.5, the newest AI model from Anthropic.
An update on our preliminary evaluations of Claude 3.5 Sonnet ...
Jan 31, 2025 · An update on our preliminary evaluations of Claude 3.5 Sonnet and o1 METR conducted preliminary evaluations of Anthropic’s upgraded Claude 3.5 Sonnet (October 2024 …
When will an 8 hour, 80% reliability time horizon be achieved ...
Aug 10, 2025 · Will resolve to the date when it is reported that a model from Anthropic achieves the 8 hour, 80% reliability threshold on METR’s autonomy tasks and is deemed to be Claude …