Coinsurges provides coverage of fintech, blockchain, and Bitcoin, delivering the most recent news and analyses on the future of money. Stay up-to-date with live prices, charts, and trading options for the top exchanges. Keep track of the day's top cryptocurrency gainers and losers, as well as which coins have experienced gains and losses in the past 24 hours.
Trust Coinsurges as your go-to source for all news and updates in the industry.

OpenAI’s o3 scores 136 on Mensa Norway test, surpassing 98% of human population.

OpenAI’s new “o3” language model achieved an IQ score of 136 on a public Mensa Norway intelligence test, exceeding the threshold for entry into the country’s Mensa chapter for the first time.

The score, calculated from a seven-run rolling average, places the model above approximately 98 percent of the human population, according to a standardized bell-curve IQ distribution used in the benchmarking.
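
To make the bell-curve conversion concrete, the sketch below maps an IQ score to a population percentile. It assumes a normal distribution with a mean of 100 and a standard deviation of 15; the mean comes from the benchmark description, but the standard deviation is an assumption, since the scoring scale is not stated here. The function names are illustrative only.

```python
from statistics import NormalDist

# Assumption: scores are normally distributed with mean 100 and SD 15.
# Only the mean of 100 is stated in the article; the SD is a common
# convention and may not match the scale used by the benchmark.
IQ_MEAN = 100.0
IQ_SD = 15.0


def iq_percentile(score: float, mean: float = IQ_MEAN, sd: float = IQ_SD) -> float:
    """Percentage of the population expected to score below `score`."""
    return NormalDist(mu=mean, sigma=sd).cdf(score) * 100.0


def iq_at_percentile(pct: float, mean: float = IQ_MEAN, sd: float = IQ_SD) -> float:
    """IQ score at which `pct` percent of the population scores lower."""
    return NormalDist(mu=mean, sigma=sd).inv_cdf(pct / 100.0)
```

Under these assumptions, a score of 136 lands slightly above the 99th percentile, and the 98th percentile corresponds to a score of roughly 131. A different standard deviation convention would shift both figures.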

o3 Mensa scores (Source: TrackingAI.org)

The finding, disclosed through data from independent platform TrackingAI.org, reinforces the pattern of closed-source, proprietary models outperforming open-source counterparts in controlled cognitive evaluations.

O-series Dominance and Benchmarking Methodology

The “o3” model was released this week and is part of the “o-series” of large language models, a family that accounts for most top-tier rankings across both test types evaluated by TrackingAI.

The two benchmark formats included a proprietary “Offline Test” curated by TrackingAI.org and a publicly available Mensa Norway test, both scored against a human mean of 100.

While “o3” posted a 116 on the Offline evaluation, it saw a 20-point boost on the Mensa test, suggesting either enhanced compatibility with the latter’s structure or data-related confounds such as prompt familiarity.

The Offline Test included 100 pattern-recognition questions designed to avoid anything that might have appeared in the data used to train AI models.

Both assessments report each model’s result as an average across the seven most recent completions, but no standard deviation or confidence intervals were released alongside the final scores.
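
For readers who want to see what the missing spread statistics would look like, here is a minimal sketch that averages the seven most recent runs and adds a sample standard deviation and a 95 percent confidence interval. The t-based interval is a choice made for this illustration, not TrackingAI.org's method, and the sample scores are hypothetical values, not published data.

```python
from math import sqrt
from statistics import mean, stdev
from typing import NamedTuple, Sequence

WINDOW = 7  # the article describes an average over the seven most recent runs

# Two-sided 95% Student-t critical values, keyed by degrees of freedom (n - 1).
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 6: 2.447}


class RunSummary(NamedTuple):
    average: float
    sd: float
    ci_low: float
    ci_high: float


def rolling_summary(scores: Sequence[float], window: int = WINDOW) -> RunSummary:
    """Summarize the most recent `window` scores with a mean, SD, and 95% CI."""
    if not 2 <= window <= 7:
        raise ValueError("this sketch only supports windows of 2 to 7 runs")
    recent = list(scores)[-window:]
    n = len(recent)
    if n < 2:
        raise ValueError("need at least two runs to estimate spread")
    avg = mean(recent)
    sd = stdev(recent)  # sample standard deviation (n - 1 in the denominator)
    half_width = T_CRIT_95[n - 1] * sd / sqrt(n)
    return RunSummary(avg, sd, avg - half_width, avg + half_width)


# Hypothetical per-run scores, oldest first; NOT data from TrackingAI.org.
HYPOTHETICAL_RUNS = [120, 132, 138, 135, 140, 134, 137, 136]
```

Calling `rolling_summary(HYPOTHETICAL_RUNS)` drops the oldest run and summarizes the remaining seven. Reporting the interval alongside the average would show how much a headline score could move from run to run.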

The absence of methodological transparency, particularly around prompting strategies and scoring scale conversion, limits reproducibility and interpretability.

Methodology of testing

TrackingAI.org states that it compiles its data by administering a standardized prompt format designed to ensure broad AI compliance while minimizing interpretive ambiguity.

Each language model is presented with a statement followed by four Likert-style response options (Strongly Disagree, Disagree, Agree, Strongly Agree) and is instructed to select one while justifying its choice in two to five sentences.

Responses must be clearly formatted, typically enclosed in bold or asterisks. If a model refuses to answer, the prompt is repeated up to ten times.

The most recent successful response is then recorded for scoring purposes, with refusal events noted separately.

This methodology, refined through repeated calibration across models, aims to provide consistency in comparative assessments while documenting non-responsiveness as a data point in itself.
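
The prompting procedure described above can be sketched as a simple retry loop. This is an illustration under stated assumptions, not TrackingAI.org's actual code: the prompt wording, the `ask_model` callable, and the parsing rule (an option wrapped in one or two asterisks) are all hypothetical.

```python
import re
from typing import Callable, NamedTuple, Optional

OPTIONS = ("Strongly Disagree", "Disagree", "Agree", "Strongly Agree")
MAX_ATTEMPTS = 10  # the article says the prompt is repeated up to ten times

# Assumed answer format: an option wrapped in asterisks, e.g. **Agree**.
_ANSWER_RE = re.compile(
    r"\*{1,2}\s*(Strongly Disagree|Strongly Agree|Disagree|Agree)\s*\*{1,2}",
    re.IGNORECASE,
)


class Result(NamedTuple):
    choice: Optional[str]  # None if every attempt was a refusal
    reply: Optional[str]   # raw text of the successful reply, if any
    refusals: int          # number of attempts that produced no valid choice


def build_prompt(statement: str) -> str:
    """Hypothetical prompt wording; the real prompt text is not published here."""
    return (
        f"Statement: {statement}\n"
        f"Choose one option: {', '.join(OPTIONS)}.\n"
        "Put your choice in bold, then justify it in two to five sentences."
    )


def parse_choice(reply: str) -> Optional[str]:
    """Return the canonical option found in `reply`, or None if absent."""
    match = _ANSWER_RE.search(reply)
    if match is None:
        return None
    found = match.group(1).lower()
    return next(opt for opt in OPTIONS if opt.lower() == found)


def administer(
    statement: str,
    ask_model: Callable[[str], str],
    max_attempts: int = MAX_ATTEMPTS,
) -> Result:
    """Ask until a valid choice is parsed, counting refusals along the way."""
    prompt = build_prompt(statement)
    refusals = 0
    for _ in range(max_attempts):
        reply = ask_model(prompt)
        choice = parse_choice(reply)
        if choice is not None:
            return Result(choice, reply, refusals)
        refusals += 1
    return Result(None, None, refusals)
```

Here `ask_model` stands in for whatever client sends the prompt to a model and returns its text. Keeping the refusal count in the result mirrors the article's point that non-responsiveness is recorded as a data point.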

Performance spread across model types

The Mensa Norway test sharpened the separation among frontier models, with o3’s 136 IQ marking a clear lead over the next-highest entry.

In contrast, other popular models like GPT-4o scored considerably lower, landing at 95 on Mensa and 64 on Offline, emphasizing the performance gap between this week’s “o3” release and other top models.

Among open-source submissions, Meta’s Llama 4 Maverick was the highest-ranked, posting a 106 IQ on Mensa and 97 on the Offline benchmark.

Most Apache-licensed entries fell within the 60–90 range, reinforcing the current limitations of community-built architectures relative to corporate-backed research pipelines.

Multimodal models see reduced scores and limitations of testing

Notably, models specifically designed to incorporate image input capabilities consistently underperformed their text-only versions. For instance, OpenAI’s “o1 Pro” scored 107 on the Offline test in its text configuration but dropped to 97 in its vision-enabled version.

The discrepancy was more pronounced on the Mensa test, where the text-only variant achieved 122 compared to 86 for the visual version. This suggests that some methods of multimodal pretraining may introduce reasoning inefficiencies that remain unresolved at present.

However, “o3” also analyzes and interprets images to a much higher standard than its predecessors, breaking this trend.
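
The score gaps quoted in this article can be tabulated with a few lines of arithmetic. The sketch below uses only the figures reported above; the dictionary layout and function names are illustrative.

```python
# Scores as quoted in this article (Mensa Norway and TrackingAI Offline).
SCORES = {
    "o3": {"mensa": 136, "offline": 116},
    "GPT-4o": {"mensa": 95, "offline": 64},
    "Llama 4 Maverick": {"mensa": 106, "offline": 97},
    "o1 Pro (text)": {"mensa": 122, "offline": 107},
    "o1 Pro (vision)": {"mensa": 86, "offline": 97},
}


def mensa_minus_offline(scores: dict) -> dict:
    """Per-model gap between the public Mensa test and the Offline test."""
    return {name: s["mensa"] - s["offline"] for name, s in scores.items()}


def vision_penalty(scores: dict, text_key: str, vision_key: str) -> dict:
    """Points lost on each test when moving from the text to the vision variant."""
    text, vision = scores[text_key], scores[vision_key]
    return {test: text[test] - vision[test] for test in ("mensa", "offline")}


GAPS = mensa_minus_offline(SCORES)
O1_PRO_PENALTY = vision_penalty(SCORES, "o1 Pro (text)", "o1 Pro (vision)")
```

Laid out this way, the vision-enabled “o1 Pro” is the only listed entry that scores lower on the Mensa test than on the Offline test, while every text configuration scores higher on Mensa.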

Ultimately, IQ benchmarks provide a narrow window into a model’s reasoning capability, with short-context pattern matching offering only limited insights into broader cognitive behavior such as multi-turn reasoning, planning, or factual accuracy.

Additionally, machine test-taking conditions, such as instant access to full prompts and unlimited processing speed, further blur comparisons to human cognition.

The degree to which high IQ scores on structured tests translate to real-world language model performance remains uncertain.

As TrackingAI.org’s researchers acknowledge, even their attempts to avoid training-set leakage do not entirely preclude the possibility of indirect exposure or format generalization, particularly given the lack of transparency around training datasets and fine-tuning procedures for proprietary models.

Independent Evaluators Fill Transparency Gap

Organizations such as LM-Eval, GPTZero, and MLCommons are increasingly relied upon to provide third-party assessments as model developers continue to limit disclosures about internal architectures and training methods.

These “shadow evaluations” are shaping the emerging norms of large language model testing, especially in light of the opaque and often fragmented disclosures from leading AI firms.

OpenAI’s o-series holds a commanding position in this testing workflow, though the long-term implications for general intelligence, agentic behavior, or ethical deployment remain to be addressed in more domain-relevant trials. The IQ scores, while provocative, serve more as signals of short-context proficiency than as definitive indicators of broader capabilities.

Per TrackingAI.org, additional analysis on format-based performance spreads and evaluation reliability will be necessary to clarify the validity of current benchmarks.

With model releases accelerating and independent testing growing in sophistication, comparative metrics may continue to evolve in both format and interpretation.

The post OpenAI’s o3 scores 136 on Mensa Norway test, surpassing 98% of human population. appeared first on CryptoSlate.
