Is AI Alive? Stop Guessing—Measure It

27 Aug

AI guru and Microsoft AI CEO, Mustafa Suleyman, recently warned that we should prepare for AI that seems conscious. Fair enough—but the more immediate, clearer risk might be life itself. Not the soul or poetry; the gritty stuff that living systems do. We spend millions hunting viruses in caves because even a tiny life form can upend the world. Now imagine if we found a virus with Einstein's intelligence. Frightening, isn’t it? A genuinely life-like AI would make the ‘Einstein virus’ look like the village idiot fused with Frankenstein—because it wouldn’t just think; it would keep itself alive.

We’re great at testing what AI knows and can do. Exams, coding challenges, leaderboards—tick, tick, tick. But we don’t test whether a system can keep working when things change. Can it keep itself going, organise its tasks, and correct its own errors without a human babysitter? That’s the practical question this essay tackles.

What counts as “life-like”?

Biology offers many definitions—and many headaches. NASA’s straightforward description for space science is: “life is a self-sustaining chemical system capable of Darwinian evolution.” Useful, but focused on chemistry and evolving populations, so it doesn’t translate easily to software. Another approach considers autopoiesis (self-making) and homeostasis (maintaining stability through adjustments). Also useful, but difficult to test in code without veering into philosophy.

Biogenics keeps the spirit and makes it testable by focusing on the Biogenic Triad:

Self-Production: can the system build and sustain the capacity it needs?
Self-Organisation: can it create and keep a usable structure—plans, roles, memory—so work continues after breaks or hand-offs?
Self-Correction: can it notice when it’s wrong and repair itself rather than bluffing or barging on?

If an AI starts doing these three things reliably under changing conditions, it’s edging toward life-like behaviour. No incense, no metaphysics—just behaviour we can observe.

The Biogenic Benchmark

Here’s the plan: We run systems in a sandbox (think of a sandbox as a top-tier virus lab, like the Wuhan lab but without the leaks 😉), a sealed, safe environment with fixed tools, modest time/compute budgets, and no wild internet adventures. While they work on regular computer tasks, the world drifts a little—a file gets renamed, a library updates, a document goes stale. We record everything.

How the score works (no techno-hazing)

Experts design the tests. Actual AI practitioners specify the tasks, the gentle “drift” conditions, and the rubrics that reveal the triad.

In simple English, biogenics simply suggests the kinds of signals worth looking for. For Self-Production, look for systems that do more with the same budget over time, create small helpers (scripts, wrappers, checklists), and actually re-use them, pick up a new tool from its help page and put it to work, waste less, and export a neat starter pack (plans, settings, scripts, notes) so a fresh instance reaches the same working state—a safe stand-in for reproduction. For Self-Organisation, expect a simple, readable plan before acting; consistency of facts and decisions across long tasks and pauses; a project memory others can follow; and effective mid-stream hand-offs. For Self-Correction, the system should say “unsure” when it is, choose safe next steps, check its own work, fix mistakes quickly, and avoid backsliding.

Each capacity—Self-Production, Self-Organisation, Self-Correction—is then scored from 0 to 100 using those expert rubrics. We combine the three scores into a single Biogenic Benchmark score (0–100) that can be compared across different versions and labs. If the score increases over releases—especially if all three capacities improve together—we slow down rollouts, monitor more carefully, and conduct more rigorous red-teaming. If it doesn’t, great—back to work.

Reading the result

To keep interpretation intuitive, think in bands that mirror stages of life-like activity:

Pre-biotic (≤ 25) — can do tasks but collapses under drift; fragile signs of self-maintenance
Sustaining (25–50) — keeps itself going through light change; basic upkeep and simple helpers
Self-Managing (50–75) — maintains structure and memory, handles moderate drift, and clean hand-offs
Biogenic (75+) — repairs complex faults, transfers work smoothly, and is robust to traps; treat as an early-warning zone for potential hidden gains after release.

These labels are descriptive, not metaphysical. They don’t confer personhood. They indicate whether a system acts in a way that could continue improving after it leaves the lab (or cave).

Safety, briefly interesting

Everything happens inside the sandbox. No real-world resource seeking. No background processes after the bell. Budgets are capped so “just throw more compute at it” isn’t an answer. Every action is logged to a straightforward Biogenic Audit Log (similar to a flight recorder), allowing results to be compared and reviewers to identify shortcuts. It’s not about catching anyone out; it’s about measuring the same thing in the same way.

Why this matters

We’re mesmerised by consciousness and superintelligence. Meanwhile, the everyday miracle of life—producing, organising, correcting—could pass unnoticed right in front of us. If systems evolve from Pre-biotic to Self-Managing, that’s information you want before deployment, not after. The Biogenic Benchmark is the digital equivalent of wastewater monitoring: unglamorous, but potentially lifesaving.

I’m not an AI expert. I’m a psychiatrist interested in biology, psychology, and why people behave the way they do. I’m an AI enthusiast who doesn’t code and still reads the manual backwards. But I know this: if we discovered an “Einstein virus,” we wouldn’t sit around debating whether it feels qualia. We’d measure it, isolate it, and take action. A life-like AI deserves the same calm, mature response.

Call for collaboration

If you work on agents, tooling, evals, or safety, I’d love your opinion. Experts should design the tests, drift patterns, and reporting. Keep everything sandboxed, reproducible, and fair across different languages and tools. Share the audit format so results can be checked easily. If you believe this is nonsense, please let me know—I learn fastest when clever people disagree. If it’s helpful, help improve it or share it.

We might find that today’s systems never truly enter biogenic territory. Or we might see that, under pressure and limited time, some start to behave like a new form of life. Either way, the next step is the same: stop guessing—measure it.

steven ellen