
Building Retro Phish: What Actually Went Wrong

A build log of the decisions, pivots, and problems I ran into designing a phishing research study, and why the constraints ended up producing a cleaner methodology than what I planned.


About a year ago I started asking a question that bothered me: when AI eliminates grammar errors as a detection signal, which phishing techniques actually fool people?

Most phishing training is built around spotting the tells. Awkward phrasing. Sender domain does not match. Urgency that feels manufactured. Those heuristics held up because real phishing was sloppy. It is not sloppy anymore. AI-generated phishing is grammatically flawless, contextually plausible, and cheap to produce at scale. The old tells are gone.

I wanted to measure what was left. So I started building Retro Phish: a game-based research platform where players classify emails as phishing or legitimate, bet confidence on their answers, and unknowingly contribute to a dataset. The retro terminal aesthetic, the XP, the leaderboards, the rank system. All of it exists to get people to sit down and do ten rounds of phishing classification without it feeling like a compliance exercise.

Here is what actually went wrong in the process of building it.

The dataset problem

The original plan was to source real phishing emails. I would find a corpus, strip any PII, normalize the format, and use that as the card pool.

The problem was finding clean data at a consistent standard. Real phishing ranges from obviously terrible to genuinely sophisticated. If I built the dataset from real emails, I would be measuring linguistic quality as much as technique. The data would be noisy. And sourcing real phishing without PII at any meaningful volume turned out to be harder than I expected: most public datasets are either outdated, inconsistently formatted, or carry enough identifying detail that stripping them properly becomes a project in itself.

So I generated everything. Every card in Retro Phish, phishing and legitimate alike, is AI-generated. Writing quality is held constant across all 1,000 cards. Technique is the only independent variable.

This came about by necessity, not design. But it ended up being a cleaner methodology than what I originally planned. Controlling for language quality is exactly what the study needs. I could not have done that reliably with real emails.

Making sure nothing was missing

The second problem was data integrity. It is easy to build a system that collects data. It is harder to make sure you are collecting the right data, without gaps you will regret six months later.

My concern was missing something foundational. Not a minor oversight but the kind of gap that compromises the whole dataset: failing to log the technique type per card, not capturing timing data, not separating freeplay responses from research mode responses at the storage level. Any of those would have cost me the ability to answer the questions I actually care about.

I spent more time than I expected on schema design. What gets recorded per answer: player UUID, technique, correct classification, player answer, confidence level, time taken, session grouping. No PII in the research tables. Background data self-reported and optional, flagged separately so it can be excluded from group comparisons without losing the core answer data. It is not glamorous work but getting it wrong at the start means the data is not recoverable.
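A minimal sketch of what one answer record might look like, based on the fields listed above. The field names and types are illustrative, not the actual Retro Phish schema:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class AnswerRecord:
    # Illustrative fields only -- not the real schema. No PII is stored.
    player_uuid: str      # anonymous player identifier
    session_id: str       # groups the rounds of one session
    card_id: str
    technique: str        # e.g. "urgency", "pretexting"
    correct_label: str    # "phishing" or "legitimate"
    player_answer: str
    confidence: str       # "GUESSING" | "LIKELY" | "CERTAIN"
    time_taken_ms: int
    research_mode: bool   # separates research from freeplay at storage

record = AnswerRecord(
    player_uuid="player-001", session_id="s-042", card_id="c-0137",
    technique="urgency", correct_label="phishing",
    player_answer="phishing", confidence="CERTAIN",
    time_taken_ms=8200, research_mode=True,
)
print(asdict(record)["technique"])  # -> urgency
```

Making the record frozen and explicit up front is the cheap insurance: every question the study wants to answer maps to a field that exists from day one.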

Drawing clean lines between techniques

The six techniques in the study are urgency, authority impersonation, credential harvesting, hyper-personalization, pretexting, and fluent prose. Defining them cleanly on paper is straightforward. Generating 690 cards that stay in their lanes is not.

Real phishing emails rarely use a single technique. An email from your "CEO" asking you to wire funds in the next two hours is authority impersonation and urgency simultaneously. Assigning that card to one technique requires a judgment call, and making that call consistently across 690 cards requires a clear rubric that I did not have at the start.

The approach I landed on: classify by the primary mechanism of manipulation. If the impersonation is doing most of the work, it is authority impersonation. If the time pressure is the lever being pulled, it is urgency. The generated cards were written with the primary technique in mind, which helped. But edge cases still required review.
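The judgment call can be made mechanical once a reviewer has weighed how much work each lever is doing. A toy version of the rubric, with made-up weights:

```python
# Toy rubric: a reviewer assigns each card a weight per manipulation
# lever; the card is classified by whichever lever does the most work.
# Weights and technique names here are illustrative, not study data.
def primary_technique(lever_weights: dict[str, float]) -> str:
    return max(lever_weights, key=lever_weights.get)

# The "CEO wire transfer" example: impersonation carries the email,
# the deadline just amplifies it.
ceo_wire_email = {"authority_impersonation": 0.6, "urgency": 0.4}
print(primary_technique(ceo_wire_email))  # -> authority_impersonation
```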

Calibrating difficulty consistently

Easy, medium, hard, and extreme. Those labels need to mean something consistent across six different attack types. What makes a credential harvesting email hard is not the same as what makes a pretexting email hard.

For credential harvesting, hard means the link structure is plausible and the sender domain passes basic inspection. For pretexting, hard means the backstory is specific and internally consistent. For fluent prose, hard is almost definitional: a well-written email with nothing obvious to flag is the whole technique. The labels describe the same level of difficulty, but through different properties depending on what is being tested.

I ended up defining difficulty per technique rather than applying a universal rubric. The card review process enforces this: generated cards are staged, reviewed against the technique-specific difficulty criteria, then approved or rejected. It is slower than a single pass would be but produces a more consistent dataset.
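The staged review step can be sketched as a lookup against per-technique criteria. The criteria strings summarize the "hard" definitions above; the card fields are made up, not the real pipeline's data model:

```python
# Illustrative: each technique gets its own difficulty criteria rather
# than one universal rubric.
HARD_CRITERIA = {
    "credential_harvesting": "plausible link, sender domain passes basic inspection",
    "pretexting": "backstory is specific and internally consistent",
    "fluent_prose": "well written, nothing obvious to flag",
}

def review(card: dict) -> str:
    # A staged card is judged against its own technique's criteria.
    if card["technique"] not in HARD_CRITERIA:
        return "rejected: no criteria for technique"
    return "approved" if card["meets_criteria"] else "rejected: fails criteria"

print(review({"technique": "pretexting", "meets_criteria": True}))  # -> approved
```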

Getting the legitimate card ratio right

31 percent of cards in the dataset are legitimate. That number is not arbitrary.

If too few cards are legitimate, players quickly learn they can maximize XP by flagging everything as phishing. The false positive rate becomes meaningless and the game stops measuring detection. It just measures bias toward one classification.

Getting the ratio wrong in the other direction produces a different problem: too many legitimate cards and players become calibrated toward leniency, which may suppress detection rates on subtle phishing. The 31 percent figure is a balance between keeping false positive rates measurable and not training players to default to safe.
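The base-rate problem is easy to see with the card counts the post implies (1,000 cards, 31 percent legitimate). A player who flags everything as phishing is still right on every phishing card, which is why raw accuracy alone cannot expose the strategy and the false positive rate has to stay measurable:

```python
# Back-of-the-envelope check on the flag-everything strategy.
TOTAL_CARDS = 1000
LEGIT_CARDS = 310  # 31 percent legitimate
flag_everything_correct = TOTAL_CARDS - LEGIT_CARDS  # right only on phishing
print(flag_everything_correct / TOTAL_CARDS)  # -> 0.69
```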

The legitimate card categories (transactional, marketing, and workplace communications) were also chosen deliberately. They cover the three types of legitimate email most likely to share surface features with phishing: transactional email that asks you to take action, marketing email with links and CTAs, and internal comms from authorities like IT and HR.

Game design versus research design

Retro Phish needs to be engaging enough that people play more than once. More sessions mean more data. But the game mechanics cannot be designed in a way that changes how players approach classification.

This tension came up most clearly around the confidence system. XP scales with confidence: GUESSING pays 1x, LIKELY pays 2x, CERTAIN pays 3x. The game incentivizes players to bet high. That is fine from a research perspective because the study is specifically interested in confidence calibration, whether people who say CERTAIN are actually more accurate, and whether overconfidence clusters on specific techniques. But the incentive structure means players have a reason to bet confident regardless of their actual certainty. That has to be accounted for in how the confidence data is interpreted.
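The calibration check described above amounts to grouping answers by the confidence the player bet and comparing accuracy across tiers. The multipliers match the post; the sample answers are made up:

```python
from collections import defaultdict

# XP multipliers from the post.
MULTIPLIER = {"GUESSING": 1, "LIKELY": 2, "CERTAIN": 3}

# (confidence bet, answered correctly) -- fabricated sample data.
answers = [
    ("CERTAIN", True), ("CERTAIN", False), ("CERTAIN", True),
    ("LIKELY", True), ("GUESSING", False),
]

by_tier = defaultdict(list)
for conf, correct in answers:
    by_tier[conf].append(correct)

# Well calibrated means accuracy rises with the tier the player bet.
accuracy = {conf: sum(v) / len(v) for conf, v in by_tier.items()}
print(round(accuracy["CERTAIN"], 2))  # -> 0.67 (2 of 3 correct)
```

Because the XP structure rewards betting high regardless of actual certainty, a gap between CERTAIN accuracy and LIKELY accuracy has to be read against that incentive, not as pure self-assessment.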

The participation cap is the other example. Once a player completes enough Research Mode sessions, their responses stop counting toward the study dataset. They can still play, still earn XP, still compete on leaderboards. Their data just stops being included. This keeps a small number of highly engaged players from dominating the dataset and skewing the results. It took more effort to build than I expected.
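The cap logic itself is simple; the effort was in wiring it through without touching gameplay. A sketch, with an illustrative cap value (the post does not state the real number):

```python
SESSION_CAP = 10  # illustrative; the actual cap is not specified

def counts_toward_study(completed_research_sessions: int) -> bool:
    # Play, XP, and leaderboards continue past the cap; only
    # inclusion in the research dataset stops.
    return completed_research_sessions < SESSION_CAP

print(counts_toward_study(3))   # -> True
print(counts_toward_study(12))  # -> False
```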

Where things stand

The dataset is being built through the card review pipeline and data collection is ongoing. None of the constraints above invalidated the study. They shaped it. The decision to generate all cards rather than source real ones produced a cleaner controlled environment than I would have had otherwise. The schema work means the data I am collecting is actually usable. The difficulty calibration and technique separation mean the results will be interpretable.

The questions the study is designed to answer are still good questions. When language quality is not a signal, which techniques beat people most often? Do security professionals actually detect phishing better than everyone else, or does security experience not predict accuracy the way we assume it does? Do people miss phishing with high confidence, and does that cluster on specific techniques?

I expect the data to tell us something worth sharing. When there is enough of it to say something meaningful, I will publish the findings here.

Play Retro Phish or read the full methodology.
