
I'm Running a Phishing Research Study Inside a Retro Terminal Game

When AI eliminates grammar errors as a detection signal, the question becomes: which phishing techniques produce the biggest gaps in human detection?


Most phishing training is built around a detection signal that no longer works.

Spot the grammar error. Look for the urgency. Check the sender domain. These heuristics held up for years because real phishing campaigns were sloppy. Now they are not. AI-generated phishing is grammatically flawless, contextually plausible, and available at scale. The old tells are gone.

This raised a more interesting question for me: when language quality is removed as a variable, which phishing techniques are hardest for people to detect?

Not "can you tell if an email was AI-written." That is the wrong question. The right question is: given that the writing is always polished, which technique types represent the largest gap in human detection capability? Identifying those gaps is how you close them.

The design problem

I wanted to build something to measure this. The original plan was to use real phishing emails. The problem was sourcing them at a consistent standard. Real phishing ranges from obviously terrible to genuinely sophisticated. If I built a dataset from real emails, I would be measuring linguistic quality as much as technique. The data would be messy.

So I generated everything under controlled conditions. Every card in Threat Terminal, phishing and legitimate alike, is AI-generated within a closed research environment. Writing quality is held constant across all 550 cards. Technique is the only independent variable. This ended up being a cleaner methodology than what I originally planned, even if it came about by necessity rather than design. The cards are never used outside the study context and are not released publicly.

The theme

I could have built a plain survey. Nobody fills out plain surveys.

The retro terminal aesthetic was a deliberate choice. It signals that this is built by someone who works in security, not produced by a vendor. It reminds me of the Lumon Industries terminals from Severance: cold, monochrome, vaguely institutional. That felt right for a game about spotting deception. It makes the act of classifying emails feel like something, rather than nothing.

There is ambient terminal audio on the start screen, click sounds throughout, XP, leaderboards, streaks, and a rank system with ten tiers from CLICK_HAPPY up to ZERO_DAY. The forensic signal breakdowns after each answer are not just for show: they are meant to teach. SPF and DKIM status, reply-to mismatches, send timing, URL inspection. The kind of signals you would actually check in an investigation.
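To make the forensic signals concrete, here is a minimal sketch of how header-level checks like these can be derived from a raw email using Python's standard library. The function name and the exact field names are illustrative, not the game's actual pipeline:

```python
import email

def forensic_signals(raw: str) -> dict:
    """Extract a few header-level signals: reply-to mismatch and
    SPF/DKIM results from the Authentication-Results header.
    Field names here are illustrative only."""
    msg = email.message_from_string(raw)

    def domain(addr: str) -> str:
        # Crude domain extraction from "Name <user@domain>" headers.
        return addr.rsplit("@", 1)[-1].rstrip(">").lower() if "@" in addr else ""

    from_dom = domain(msg.get("From", ""))
    reply_dom = domain(msg.get("Reply-To", ""))
    auth = msg.get("Authentication-Results", "")

    return {
        "reply_to_mismatch": bool(reply_dom) and reply_dom != from_dom,
        "spf_pass": "spf=pass" in auth,
        "dkim_pass": "dkim=pass" in auth,
    }

raw = (
    "From: CEO <ceo@example.com>\n"
    "Reply-To: <ceo@examp1e.net>\n"
    "Authentication-Results: mx.example.com; spf=fail; dkim=none\n"
    "Subject: Urgent wire transfer\n"
    "\n"
    "Please action today."
)
print(forensic_signals(raw))
# {'reply_to_mismatch': True, 'spf_pass': False, 'dkim_pass': False}
```

A real investigation would of course go deeper (DMARC alignment, Received chain, URL reputation), but the mismatch check alone catches a surprising amount.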

The goal was to make phishing awareness training that someone might actually do more than once. More sessions mean more data. The game design and the research design are not in conflict.

The six techniques

The dataset covers six phishing techniques, 60 cards each across three difficulty levels: easy, medium, and hard.

Urgency is the classic. Compressed timeframes, account suspension threats, action-required framing. It is also the most-taught red flag in security awareness training. My expectation is that urgency performs better at medium and hard difficulty where the scenario is plausible, but gets caught more reliably at easy difficulty where it is blatant.

Authority impersonation leans on deference. An email from your CEO, your bank, a government agency. Deference to apparent authority is a well-documented cognitive bias and does not require technical sophistication to exploit. I expect this to catch people consistently, especially when the impersonated entity is familiar.

Credential harvesting typically relies on getting someone to click a link to a fake login page. In the card format, players see the email, not the destination. The URL inspector forensic signal levels the playing field somewhat. I am curious whether players learn to use it or still get caught.

Hyper-personalisation is interesting but operates differently in this context. In reality, a hyper-personalised phish references your name, your role, your manager, your current project. In a game where the player is reviewing cards as a neutral third party, that level of targeting is not possible. The personalisation in these cards is contextually plausible but not player-specific. The study is measuring whether people can recognise the structural signature of the technique, which is a separable and useful question from whether real-world personalisation is effective.

Pretexting is where I expect the most interesting results, with a caveat. In the real world, pretexting works because the backstory is built around you specifically. You receive a follow-up to a conversation you supposedly had, context that only makes sense if you are the intended target. In this game, players review emails as a neutral third party. The pretext is not directed at them. That removes a significant part of what makes pretexting effective in practice. I still expect it to produce interesting data, but I would not be surprised if it underperforms compared to how it operates in actual attacks. The gap between game performance and real-world effectiveness might be the most interesting thing pretexting tells us.

Fluent prose is the control technique in some ways. No social engineering hook, no urgency, no authority figure. Just a well-written email with no grammar errors and no obvious red flags. I suspect this is where confidence calibration gets revealing: players may rate themselves as certain on cards where the only signal is "something feels off," and be wrong more than they expect.
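Confidence calibration here just means comparing accuracy against the self-reported confidence level. A quick sketch of that analysis, on hypothetical response data (the real study schema is not shown here):

```python
from collections import defaultdict

def calibration(responses):
    """Accuracy per self-reported confidence level.

    Each response is a (confidence, correct) pair. A well-calibrated
    player should be most accurate on 'Certain' answers; identical
    accuracy across levels means confidence carries no information.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for conf, correct in responses:
        totals[conf] += 1
        hits[conf] += int(correct)
    return {c: hits[c] / totals[c] for c in totals}

# Hypothetical sample: a player who is no more accurate when Certain
# than when Guessing, i.e. overconfident on hard cards.
sample = [
    ("Guessing", False), ("Guessing", True),
    ("Likely", True), ("Likely", True), ("Likely", False),
    ("Certain", True), ("Certain", False),
]
print(calibration(sample))
```

If the fluent-prose cards show a "Certain" accuracy near the "Guessing" baseline, that would be exactly the calibration failure described above.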

Expert Mode

Once a player completes ten Research Mode sessions, Expert Mode unlocks. It draws exclusively from extreme difficulty cards and awards double XP. This is the part of the dataset I am most curious about. Extreme difficulty cards are as close to realistic AI-generated phishing as I can get within the constraints of the study. The players who reach Expert Mode have also seen enough of the dataset to understand the patterns. What they miss at that point is more signal, not less.

There is also a research participation cap. Each player contributes up to a fixed number of Research Mode sessions; once they hit that cap, their responses no longer count toward the study dataset. They can still play, still earn XP, still compete on leaderboards. Their data just stops being counted. This keeps the sample from being dominated by a small number of highly engaged players skewing the results.
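The cap logic is simple enough to sketch. The cap value and function names below are illustrative, since the post does not state the actual limit:

```python
from collections import Counter

SESSION_CAP = 25  # illustrative; the real cap is not stated in the post

session_counts = Counter()

def counts_toward_study(player_uuid: str) -> bool:
    """Record one Research Mode session for this player and return
    whether it still falls inside the research participation cap."""
    session_counts[player_uuid] += 1
    return session_counts[player_uuid] <= SESSION_CAP
```

Everything past the cap is still played and scored normally; it is simply excluded at analysis time, which is what keeps outliers from dominating the sample.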

Where things stand

The game is live but the dataset is still being finalised. Each card goes through a review and approval process before it is added to the active pool. The dataset freezes at 550 approved cards, which I am treating as v1. Once it is frozen and a meaningful volume of Research Mode responses has accumulated, I will publish the findings here. The raw dataset may also go up on Kaggle for anyone who wants to do their own analysis.

What the game collects

Research Mode draws a random deck of ten cards per round. I considered stratifying by technique to guarantee coverage, but a fixed pattern risks tipping off players to what kind of card is coming next. Random selection means more responses are needed to reach statistical coverage, but the tradeoff is worth it. Players classify each card and bet confidence on their answer: Guessing, Likely, or Certain. Forensic signals are revealed after each answer.
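The draw described above is a uniform sample without replacement, which is a one-liner with the standard library. This is a sketch of the design choice, not the game's actual code:

```python
import random

def draw_deck(card_pool, deck_size=10, rng=None):
    """Uniform random draw of deck_size cards without replacement.

    Deliberately NOT stratified by technique: a fixed per-technique
    quota would let players anticipate what kind of card comes next.
    """
    rng = rng or random.Random()
    return rng.sample(card_pool, deck_size)

pool = [f"card_{i:03d}" for i in range(550)]
deck = draw_deck(pool, rng=random.Random(42))
assert len(deck) == 10 and len(set(deck)) == 10
```

The cost of uniform sampling is uneven technique coverage per round, which is why more total responses are needed before per-technique numbers stabilise.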

Answers are linked to a pseudonymous UUID. No PII is stored in the research tables, only the UUID, game mode, technique, correctness, confidence, and timing. Freeplay is open to anyone without an account.
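The stored record can be pictured as a small, PII-free row. The field names below are my own labels for the fields the post lists, not the actual table schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResearchResponse:
    """One classification event. Only pseudonymous and behavioural
    fields are kept; no names, emails, or account details."""
    player_uuid: str   # pseudonymous UUID, not linked to PII
    game_mode: str     # e.g. "research" or "expert"
    technique: str     # one of the six technique labels
    correct: bool      # did the player classify the card correctly
    confidence: str    # "Guessing" | "Likely" | "Certain"
    elapsed_ms: int    # time taken to answer
```

Keeping the row this narrow is what makes the Kaggle release mentioned later plausible: there is nothing in it to redact.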

Players can also optionally self-report their professional background: infosec, technical non-security, or other. This is the secondary analysis I am most interested in. Security practitioners are routinely assumed to perform better on phishing detection. The data might confirm that. It might not.

What I am hoping to find

This is not an academic paper. It is an experiment and a game. I wanted to run the study, collect real response data, and see what the numbers say. If the sample gets large enough to say something useful about which techniques beat people most when language quality is not a signal, that is interesting. If the confidence data shows people are systematically overconfident on certain technique types, that is also interesting.

I expect some results to be obvious and others to surprise me. That is usually how these things go.

The full study protocol and dataset design have been published and are available at doi.org/10.5281/zenodo.19059296.

The live findings are at research.scottaltiparmak.com/intel and update as data comes in. Research Mode is available after creating an account.


