I'm Running a Phishing Research Study Inside a Retro Terminal Game
When AI eliminates grammar errors as a detection signal, the interesting question becomes: which phishing techniques do humans miss most? This is how I am trying to find out.
Most phishing training is built around a detection signal that no longer works.
Spot the grammar error. Look for the urgency. Check the sender domain. These heuristics held up for years because real phishing campaigns were sloppy. Now they are not. AI-generated phishing is grammatically flawless, contextually plausible, and available at scale. The old tells are gone.
This raised a more interesting question for me: when language quality is removed as a variable, which phishing techniques actually fool people?
Not "can you tell if an email was AI-written." That is the wrong question. The right question is: given that the writing is always polished, which attack approach breaks human judgment most reliably?
The design problem
I wanted to build something to measure this. The original plan was to use real phishing emails. The problem was sourcing them at a consistent standard. Real phishing ranges from obviously terrible to genuinely sophisticated. If I built a dataset from real emails, I would be measuring linguistic quality as much as technique. The data would be messy.
So I generated everything. Every card in Retro Phish, phishing and legitimate alike, is AI-generated. Writing quality is held constant across all 550 cards. Technique is the only independent variable. This ended up being a cleaner methodology than what I originally planned, even if it came about by necessity rather than design.
The theme
I could have built a plain survey. Nobody fills out plain surveys.
The retro terminal aesthetic was a deliberate call. It signals that this is built by someone who works in security, not produced by a vendor. It reminds me of the Lumon Industries terminals from Severance: cold, monochrome, vaguely institutional. That felt right for a game about spotting deception. It makes the act of classifying emails feel like something, rather than nothing.

There is ambient terminal audio on the start screen, click sounds throughout, XP, leaderboards, streaks, and a rank system with ten tiers from CLICK_HAPPY up to ZERO_DAY. The forensic signal breakdowns after each answer are not just for show: they are meant to teach. SPF and DKIM status, reply-to mismatches, send timing, URL inspection. The kind of signals you would actually check in an investigation.
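As a rough sketch of the kind of checks those breakdowns surface (the function and field names here are mine, not the game's, and this assumes access to raw message headers):

```python
from email import message_from_string
from email.utils import parseaddr

def forensic_signals(raw_email: str) -> dict:
    """Toy versions of the forensic checks surfaced after each answer."""
    msg = message_from_string(raw_email)
    _, from_addr = parseaddr(msg.get("From", ""))
    _, reply_to = parseaddr(msg.get("Reply-To", from_addr))
    # Authentication-Results is stamped by the receiving mail server
    auth = msg.get("Authentication-Results", "")
    return {
        # A Reply-To pointing at a different domain than From is a classic tell
        "reply_to_mismatch": from_addr.split("@")[-1] != reply_to.split("@")[-1],
        "spf_pass": "spf=pass" in auth,
        "dkim_pass": "dkim=pass" in auth,
    }
```

A real investigation would also parse Received chains and check the URL destinations, but even this much separates "looks fine" from "fails basic hygiene."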
The goal was to make phishing awareness training that someone might actually do more than once. More sessions mean more data. The game design and the research design are not in conflict.
The six techniques
The dataset covers six phishing techniques, 60 cards each across three difficulty levels: easy, medium, and hard.
Urgency is the classic. Compressed timeframes, account suspension threats, action-required framing. It is also the most-taught red flag in security awareness training. My expectation is that urgency fools more people at medium and hard difficulty, where the scenario is plausible, but gets caught reliably at easy difficulty, where it is blatant.
Authority impersonation leans on deference. An email from your CEO, your bank, a government agency. Deference to apparent authority is a well-documented cognitive bias and does not require technical sophistication to exploit. I expect this to catch people consistently, especially when the impersonated entity is familiar.
Credential harvesting typically relies on getting someone to click a link to a fake login page. In the card format, players see the email, not the destination. The URL inspector forensic signal levels the playing field somewhat. I am curious whether players learn to use it or still get caught.
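The core of a URL inspector is a simple comparison: does the link's visible text claim a different domain than its actual destination? A minimal sketch (function name and heuristic are mine, not the game's):

```python
from urllib.parse import urlparse

def url_mismatch(display_text: str, href: str) -> bool:
    """True if the visible link text claims a different domain
    than the href actually points to."""
    # Tolerate display text written without a scheme, e.g. "bank.com"
    shown_url = display_text if "://" in display_text else "https://" + display_text
    shown = urlparse(shown_url).hostname or ""
    actual = urlparse(href).hostname or ""
    # Allow exact matches and subdomains of the shown domain
    return not (actual == shown or actual.endswith("." + shown))
```

Real inspectors also handle redirectors, punycode lookalikes, and URL shorteners; this only catches the plain text-versus-destination mismatch.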
Hyper-personalisation is interesting but constrained in this context. In reality, a hyper-personalised phish knows your name, your role, your manager, your current project. In a game where the player is reviewing cards rather than receiving emails addressed to them, that level of targeting is not possible. The personalisation in these cards is templated. This might mean hyper-personalisation underperforms its real-world effectiveness here, which would itself be a finding worth noting.
Pretexting is where I expect the most interesting results, with a caveat. In the real world, pretexting works because the backstory is built around you specifically. You receive a follow-up to a conversation you supposedly had, context that only makes sense if you are the intended target. In this game, players review emails as a neutral third party. The pretext is not directed at them. That removes a significant part of what makes pretexting effective in practice. I still expect it to produce interesting data, but I would not be surprised if it underperforms compared to how it operates in actual attacks. The gap between game performance and real-world effectiveness might be the most interesting thing pretexting tells us.
Fluent prose is the control technique in some ways. No social engineering hook, no urgency, no authority figure. Just a well-written email with no grammar errors and no obvious red flags. I suspect this is where confidence calibration gets revealing: players may rate themselves as certain on cards where the only signal is "something feels off," and be wrong more than they expect.
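Calibration is straightforward to measure from the response data: group answers by technique and confidence level, and compute accuracy per bucket. Overconfidence shows up as a low accuracy number in the "Certain" bucket. A sketch, with illustrative field names rather than the game's actual schema:

```python
from collections import defaultdict

def calibration_table(responses):
    """Accuracy per (technique, confidence) bucket.

    `responses` is an iterable of (technique, confidence, correct)
    tuples, where correct is 0 or 1. Field names are illustrative.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for technique, confidence, correct in responses:
        key = (technique, confidence)
        totals[key] += 1
        hits[key] += correct
    return {key: hits[key] / totals[key] for key in totals}
```

If fluent prose cards show, say, 50% accuracy in the "Certain" bucket, that is the miscalibration signature: certainty with coin-flip performance.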
Expert Mode
Once a player completes ten Research Mode sessions, Expert Mode unlocks. It draws exclusively from extreme difficulty cards and awards double XP. This is the part of the dataset I am most curious about. Extreme difficulty cards are as close to realistic AI-generated phishing as I can get within the constraints of the study. The players who reach Expert Mode have also seen enough of the dataset to understand the patterns. What they miss at that point is more signal, not less.
There is also a research participation cap. Once you graduate, your responses no longer count toward the study dataset. The goal is to avoid a small number of highly engaged players skewing the results. Each player contributes up to a fixed number of Research Mode sessions, then they are done. They can still play, still earn XP, still compete on leaderboards. Their data just stops being counted. This keeps the sample from being dominated by outliers.
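The gating logic amounts to two thresholds. The Expert Mode unlock at ten sessions is stated above; the cap value below is purely illustrative, since the post does not give the actual number:

```python
EXPERT_MODE_UNLOCK = 10    # from the post: ten Research Mode sessions
RESEARCH_SESSION_CAP = 25  # illustrative only; the real cap is not stated

def expert_mode_unlocked(sessions_completed: int) -> bool:
    return sessions_completed >= EXPERT_MODE_UNLOCK

def counts_toward_study(sessions_completed: int) -> bool:
    # Play continues past the cap; only research counting stops.
    return sessions_completed < RESEARCH_SESSION_CAP
```

The point of the cap is statistical, not punitive: without it, the most engaged handful of players would dominate the per-technique accuracy numbers.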
Where things stand
The game is live but the dataset is still being finalised. Each card goes through a review and approval process before it is added to the active pool. The dataset freezes at 550 approved cards, which I am treating as v1. Once it is frozen and a meaningful volume of Research Mode responses has accumulated, I will publish the findings here. The raw dataset may also go up on Kaggle for anyone who wants to do their own analysis.
What the game collects
Research Mode draws a random deck of ten cards per round. I considered stratifying by technique to guarantee coverage, but a fixed pattern risks tipping off players to what kind of card is coming next. Random selection means more responses are needed to reach statistical coverage, but the tradeoff is worth it. Players classify each card and bet confidence on their answer: Guessing, Likely, or Certain. Forensic signals are revealed after each answer.
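Deck selection is nothing more than a uniform draw without replacement, and that is deliberate. A sketch of the idea (names are mine):

```python
import random

CONFIDENCE_LEVELS = ("Guessing", "Likely", "Certain")

def draw_deck(card_pool, size=10, rng=random):
    """Uniform random draw, deliberately NOT stratified by technique,
    so players cannot predict what kind of card comes next."""
    return rng.sample(card_pool, size)
```

A stratified draw (e.g. at least one card per technique) would guarantee coverage per round, but any fixed pattern leaks information to repeat players; uniform sampling trades per-round coverage for unpredictability.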
Answers are linked to a pseudonymous UUID. No PII is stored in the research tables, only the UUID, game mode, technique, correctness, confidence, and timing. Freeplay is open to anyone without an account.
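The stored record reduces to a handful of fields. This is an illustrative shape, not the actual table definition; the field names are mine:

```python
from dataclasses import dataclass
import uuid

@dataclass(frozen=True)
class ResearchResponse:
    # No PII: the player is identified only by a pseudonymous UUID
    player_id: uuid.UUID
    game_mode: str    # e.g. "research" or "expert"
    technique: str    # one of the six technique labels
    correct: bool
    confidence: str   # "Guessing" | "Likely" | "Certain"
    answer_ms: int    # time taken to answer, in milliseconds
```

Keeping the research tables down to this shape is what makes the no-PII claim easy to audit: there is simply nowhere for identifying data to live.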
Players can also optionally self-report their professional background: infosec, technical non-security, or other. This is the secondary analysis I am most interested in. Security practitioners are routinely assumed to perform better on phishing detection. The data might confirm that. It might not.
What I am hoping to find
This is not an academic paper. It is an experiment and a game. I wanted to run the study, collect real response data, and see what the numbers say. If the sample gets large enough to say something useful about which techniques beat people most when language quality is not a signal, that is interesting. If the confidence data shows people are systematically overconfident on certain technique types, that is also interesting.
I expect some results to be obvious and others to surprise me. That is usually how these things go.
The live findings are at retro-phish.scottaltiparmak.com/intel and update as data comes in. Research Mode is available after creating an account.
Stay in the loop
I write about identity, security automation, and security engineering. If this was useful, there is more where that came from.