AI Architecture·Thursday, April 23, 2026·5 min read


Braxton Ellsworth

AI Systems Architect

Everyone's Hyped About ThermoQA. Here's What Actually Matters.

Every few months, there’s a new benchmark that claims to measure how “intelligent” large language models (LLMs) have become. The cycle is predictable: a new leaderboard launches, the AI hype machine goes into overdrive, and social feeds fill with celebratory graphs. This month, that spotlight is on ThermoQA, a three-tier benchmark for evaluating thermodynamic reasoning in LLMs. The excitement is everywhere.

But the reality behind ThermoQA isn’t about hype. It’s about the slow, careful work of understanding what AI still can, and can’t, do. While most are fixated on leaderboard scores, the real value of ThermoQA isn’t who’s on top. It’s the structure of the challenge itself and the brutal clarity it brings to what separates memorization from reasoning.

That’s why, as everyone else rushes to celebrate the latest percentage points, practitioners should be asking a different question: what does ThermoQA actually reveal about the current state of AI? Not in terms of headlines, but in terms of system-level capability.

The Difference Between Memorization and Reasoning

Most AI benchmarks reward surface performance. They ask models to recall facts, match patterns, or regurgitate phrases. That’s why many non-practitioners walk away convinced LLMs “understand” engineering or science. They see the right answer and assume comprehension.

But in thermodynamics, memorization fails quickly. The field is defined by context, edge cases, and interdependent variables that punish rote recall.

ThermoQA exposes this fault line with surgical precision.

The benchmark consists of 293 open-ended engineering thermodynamics questions, structured as three escalating tiers. The first tier focuses on property lookup: essentially, does the model know the right value for water, R-134a, or variable-cp air under specific conditions, as defined by CoolProp 7.2.0? Many LLMs perform surprisingly well here. But this is just the entry ticket.
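For concreteness, a Tier 1-style lookup is a single call into a property library. Here’s a sketch using CoolProp’s standard PropsSI interface; the states are illustrative, not actual benchmark items:

```python
from CoolProp.CoolProp import PropsSI  # pip install coolprop

# Saturated-vapor enthalpy of R-134a at 0 degC: a typical Tier 1-style lookup
h_g = PropsSI('H', 'T', 273.15, 'Q', 1, 'R134a')   # J/kg
# Density of air at 300 K and 1 atm
rho = PropsSI('D', 'T', 300, 'P', 101325, 'Air')   # kg/m^3

print(f"h_g(R-134a, 0 degC) = {h_g / 1e3:.1f} kJ/kg")
print(f"rho(air, 300 K)     = {rho:.3f} kg/m^3")
```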

The second tier, component analysis, forces models to reason across multiple properties and relate them in the context of a real thermodynamic component.

The third tier, full cycle analysis, integrates everything: variable selection, state mapping, and multi-step reasoning in the context of entire energy systems like combined-cycle gas turbines. Here, models can’t rely on memorized snippets. They have to apply thermodynamic principles across interdependent steps.
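To make the jump in difficulty concrete, here’s a minimal sketch of the kind of multi-property chain that component analysis demands: a steam turbine with illustrative inlet conditions and an assumed isentropic efficiency. None of these numbers come from the benchmark itself:

```python
from CoolProp.CoolProp import PropsSI

# Inlet state: superheated steam at 8 MPa, 500 degC (illustrative values)
P_in, T_in = 8e6, 773.15
h_in = PropsSI('H', 'P', P_in, 'T', T_in, 'Water')  # J/kg
s_in = PropsSI('S', 'P', P_in, 'T', T_in, 'Water')  # J/(kg.K)

# Ideal (isentropic) expansion to a 10 kPa condenser pressure
P_out = 10e3
h_out_s = PropsSI('H', 'P', P_out, 'S', s_in, 'Water')

# Actual specific work via an assumed isentropic efficiency
eta_s = 0.85
w = eta_s * (h_in - h_out_s) / 1e3  # kJ/kg
print(f"Specific turbine work ~= {w:.0f} kJ/kg")
```

Each line depends on the ones before it, so a single wrong state propagates through the whole answer. Scaling that chain up across an entire plant is what the third tier tests.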

That’s where the leaderboard starts to fracture. Claude Opus 4.6 leads with a composite score of 94.1%, followed by GPT-5.4 at 93.1% and Gemini 3.1 Pro at 92.5%. But the devil is in the degradation: cross-tier performance drops 2.8 percentage points for Opus, and a staggering 32.5 for MiniMax. It’s not enough to ace property lookup; the real test is whether the reasoning holds together as complexity scales.
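The degradation figure itself is simple arithmetic. Assuming it’s the Tier 1 score minus the Tier 3 score (the benchmark’s exact definition isn’t spelled out here), a toy sketch with invented per-tier numbers looks like this:

```python
# Invented per-tier scores (%); only the resulting drops mirror the figures
# quoted above. The tier values themselves are hypothetical.
scores = {
    'Opus':    {'tier1': 95.5, 'tier3': 92.7},   # drop: 2.8 pp
    'MiniMax': {'tier1': 88.0, 'tier3': 55.5},   # drop: 32.5 pp
}
for model, s in scores.items():
    print(f"{model}: degradation = {s['tier1'] - s['tier3']:.1f} pp")
```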

When you ask a model to compute the properties of supercritical water, or to analyze the behavior of R-134a in a refrigeration cycle, superficial understanding falls apart. These domains aren’t just trivia. They’re natural discriminators that expose the limits of algorithmic shortcuts. The performance spread on these problems is wide for a reason.
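Supercritical water shows why. Above the critical point (roughly 647 K and 22.064 MPa) the liquid/vapor distinction disappears, so memorized steam-table habits stop applying. The state below is illustrative:

```python
from CoolProp.CoolProp import PropsSI

# A supercritical state: 700 K, 30 MPa, above water's critical point
T, P = 700.0, 30e6
rho = PropsSI('D', 'T', T, 'P', P, 'Water')  # kg/m^3
cp  = PropsSI('C', 'T', T, 'P', P, 'Water')  # J/(kg.K); cp varies sharply here

print(f"rho = {rho:.1f} kg/m^3, cp = {cp / 1e3:.2f} kJ/(kg.K)")
```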

This is the point most hype-driven headlines miss.

In thermodynamics, a model’s ability to rattle off property tables means very little if it can’t map out the relationships that define a working system. Memorization is table stakes. Reasoning is the game.

Why Structured Benchmarks Matter More Than Leaderboards

Benchmarks like ThermoQA aren’t about crowning a winner. They’re about constructing a diagnostic tool that tells us which layer of “understanding” is actually present in the system. Too much focus on the top-line numbers is a distraction. What matters is how those numbers are built.

Every question in ThermoQA is grounded, literally, in programmatic truth.

The answers come from CoolProp, not hand-coded answer keys or web search snippets. This eliminates a common crutch for LLMs: pattern-matching against training data. Instead, models are forced to operate in a closed world where facts aren’t just memorized. They’re computed.
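That closed world also makes grading mechanical. Here’s a minimal sketch of what scoring against programmatic truth could look like; the grade function, tolerance, and example item are my own assumptions, not the benchmark’s actual harness:

```python
from CoolProp.CoolProp import PropsSI

def grade(answer: float, output: str, state: dict, fluid: str,
          rel_tol: float = 0.01) -> bool:
    """Hypothetical grader: compare a model's numeric answer against the
    value CoolProp computes for the same state, within a relative tolerance."""
    (k1, v1), (k2, v2) = state.items()
    truth = PropsSI(output, k1, v1, k2, v2, fluid)
    return abs(answer - truth) <= rel_tol * abs(truth)

# Example: model claims saturated steam at 100 kPa has h ~= 2675 kJ/kg
print(grade(2675e3, 'H', {'P': 100e3, 'Q': 1}, 'Water'))  # True
```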

The three-tier structure isn’t arbitrary. It’s a deliberate design that mirrors how real-world engineering challenges escalate. In practice, a junior engineer might memorize steam tables, but only a seasoned practitioner can analyze the full thermodynamic cycle of a gas turbine. That’s why the degradation across tiers is so revealing. It shows, with mathematical clarity, where reasoning breaks.

Consistency matters just as much as accuracy. ThermoQA quantifies this with a multi-run sigma, ranging from +/-0.1% to +/-2.5% for top models.

This isn’t just statistical trivia. It captures something fundamental: does the model get the same answer every time, or does it “hallucinate” under pressure? In a real system, reliability isn’t optional.
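Measuring that reliability is straightforward: run the same evaluation several times and report the spread. A toy sketch with invented run scores:

```python
from statistics import mean, stdev

# Invented composite scores (%) for one model across five identical runs
runs = [94.0, 94.2, 94.1, 93.9, 94.3]
print(f"mean = {mean(runs):.2f}%, sigma = +/-{stdev(runs):.2f}%")
```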

The choice of working fluids (water, R-134a, variable-cp air) isn’t for show. Each serves as a natural stress test. Supercritical water isn’t a toy problem; it’s an inflection point where even minor reasoning gaps are amplified. R-134a and variable-cp air force models to demonstrate understanding of specific engineering contexts, not just general physics.

This is why, in my experience, system-level AI work always comes down to the structure of the challenge, not the excitement of the result. If you want models that can actually support real engineering (automating design, simulation, or operation), you need a benchmark that breaks the problem down, tier by tier, and reveals where the logic falls apart.

ThermoQA does this better than most. It isn’t perfect, but it moves the conversation away from hype and towards grounded, system-oriented evaluation.

The Slow March Toward True Engineering Intelligence

Hype fades. Benchmarks like ThermoQA don’t.

The reason is simple: engineering isn’t about surface-level performance. It’s about reliability under complexity. The systems we build with AI are only as good as the reasoning they encode, especially when lives, energy, or money are on the line. The ThermoQA leaderboard is a snapshot, but the underlying tiers are an X-ray of what’s actually happening inside these models.

There’s no shortcut here. Most LLMs can fake property lookup, but thermodynamic reasoning, especially across multiple steps and changing contexts, remains brittle.

The performance spread on full cycle analysis isn’t noise: it’s a signal that reasoning is still a moving target. That’s not a failure of AI. It’s an honest accounting of where real progress is being made.

From a builder’s perspective, this is a gift. Structured benchmarks like ThermoQA force us to confront the difference between knowledge and understanding. They drive new approaches to prompt engineering, system architecture, and even model design. They raise the bar not by inflating scores, but by clarifying where the work remains.

If you’re serious about deploying AI in engineering, ThermoQA is a reality check. It doesn’t celebrate superficial gains. It exposes the next layer of challenge. And that’s exactly what real progress looks like.

Want to think in systems, not prompts?

Take the free AIIQ test to measure your AI fluency, or enroll in the full Symbiotic Prompt Engineering program.