
Building a Controlled Phishing Detection Dataset at Scale with the Claude API

The process behind generating a controlled phishing dataset: switching from OpenAI to Anthropic, batching by technique, building an automated review pipeline, and handling rate limits at scale.


Threat Terminal needs a controlled dataset. Every card in the game, phishing and legitimate, is AI-generated so that writing quality stays constant and technique is the only variable. I covered the reasoning behind that decision in the build log. This post is about the how: what it actually took to build 1,000 research cards at consistent quality using the Claude API.

The scope: 690 phishing cards across 6 techniques (urgency, authority impersonation, credential harvesting, hyper-personalization, pretexting, fluent prose) and 4 difficulty tiers (easy, medium, hard, extreme). Plus 310 legitimate cards across 3 categories (transactional, marketing, workplace). Every card needed structured metadata: technique label, difficulty tier, sender info, subject line, body, and forensic signals for the answer breakdown.
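As a concrete sketch, the card metadata above can be modeled as a single record type. This is an illustrative Python version: the field names follow the JSON keys used in the generation prompts, but the exact schema in the real pipeline is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Allowed label values, taken from the dataset scope described above.
TECHNIQUES = {"urgency", "authority impersonation", "credential harvesting",
              "hyper-personalization", "pretexting", "fluent prose"}
DIFFICULTIES = {"easy", "medium", "hard", "extreme"}
LEGIT_CATEGORIES = {"transactional", "marketing", "workplace"}

@dataclass
class Card:
    senderName: str
    senderDomain: str
    subject: str
    body: str
    # Phishing cards carry a technique, tier, and forensic signals;
    # legitimate cards carry a category instead.
    technique: Optional[str] = None
    difficulty: Optional[str] = None
    forensicSignals: List[str] = field(default_factory=list)
    category: Optional[str] = None

    @property
    def is_phishing(self) -> bool:
        return self.technique is not None
```

Keeping both card types in one record makes the downstream review and export code simpler than maintaining two parallel schemas.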

That is a lot of structured content to generate at a consistent standard.

Starting with OpenAI

The first version of the generation pipeline used OpenAI models. The outputs were functional but did not read like real phishing. The emails felt templated. Urgency cards leaned on the same handful of phrases. Authority impersonation was too polished, missing the small imperfections that make a spoofed email feel plausible rather than obviously fake.

The bigger problem was realism across the board. When you are building a dataset where players are supposed to classify emails as phishing or legitimate, the phishing needs to be convincing enough that classification is a genuine decision, not pattern matching against obviously synthetic text. The OpenAI outputs were not clearing that bar consistently.

I tried mixing OpenAI and Anthropic outputs in the same dataset to see if I could pull the best from both. That did not work either. The quality gap between providers was noticeable enough that technique stopped being the only variable. Writing style became a confound.

Switching to Claude

I already had Anthropic API credits and had been getting better results in manual testing. The switch was pragmatic, not ideological.

The model split: Claude 3 Haiku for bulk generation and Claude 3 Sonnet for cards that needed higher quality, particularly the hard and extreme difficulty tiers where realism matters most. Haiku was cost-effective enough to run at scale without worrying about burning through credits on easy-tier cards that did not need the same level of nuance.
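The Haiku/Sonnet split reduces to a one-line routing rule. The model identifier strings below are the published Claude 3 names; routing exactly the hard and extreme tiers to Sonnet matches the split described above, though the real pipeline's code is not shown in this post.

```python
# Route each batch to a model by difficulty tier.
HAIKU = "claude-3-haiku-20240307"
SONNET = "claude-3-sonnet-20240229"

def model_for(difficulty: str) -> str:
    # Hard and extreme tiers need more nuance, so they go to Sonnet;
    # everything else is cheap enough to run on Haiku at scale.
    return SONNET if difficulty in ("hard", "extreme") else HAIKU
```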

The quality difference from OpenAI was immediate. Phishing cards read more like actual phishing. Legitimate cards read more like actual corporate email. That consistency is what the dataset needs: both sides of the classification have to feel real.

The generation approach

Cards were batched by technique and difficulty. A single API call would generate a batch of cards for one combination, like "10 urgency/easy cards" or "8 authority-impersonation/hard cards." This kept the prompts focused and made it easier to review output quality per category.

The prompts had to specify a lot. Technique type, difficulty tier, email structure (sender name, sender domain, subject line, body, forensic signals), and constraints on what made a card qualify for its assigned difficulty. A simplified version of the prompt structure looked something like this:

```
Generate {count} phishing email cards.
Technique: {technique}
Difficulty: {difficulty}

Each card must include:
- senderName, senderDomain, subject, body
- forensicSignals: array of clues a careful reader could spot
- Primary manipulation mechanism must be {technique}
- Difficulty criteria: {difficulty-specific rules for this technique}

Output as JSON array.
```

Difficulty criteria varied by technique. For credential harvesting at the hard tier, the prompt specified that link URLs should use plausible subdomain structures and the pretext for clicking should be contextually appropriate. For pretexting at the same tier, the backstory needed to be internally consistent with no obvious contradictions.

Legitimate card prompts were structured differently. No technique, no forensic signals, no manipulation. Instead they specified the category (transactional, marketing, or workplace) and realistic content patterns for each: order confirmations and shipping notices for transactional, product announcements and webinar invites for marketing, IT updates and HR notices for workplace.
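Both prompt shapes can be built from simple templates. The phishing template mirrors the simplified structure shown above; the legitimate template's exact wording is a hypothetical reconstruction from the category descriptions, not the literal prompt.

```python
# Phishing template follows the simplified structure shown earlier.
PHISHING_TEMPLATE = """Generate {count} phishing email cards.
Technique: {technique}
Difficulty: {difficulty}

Each card must include:
- senderName, senderDomain, subject, body
- forensicSignals: array of clues a careful reader could spot
- Primary manipulation mechanism must be {technique}
- Difficulty criteria: {criteria}

Output as JSON array."""

# Legitimate template wording is an assumption based on the post's
# category descriptions (no technique, no forensic signals).
LEGITIMATE_TEMPLATE = """Generate {count} legitimate email cards.
Category: {category}
Realistic content patterns: {patterns}

Each card must include:
- senderName, senderDomain, subject, body

Output as JSON array."""

def phishing_prompt(count: int, technique: str, difficulty: str, criteria: str) -> str:
    return PHISHING_TEMPLATE.format(count=count, technique=technique,
                                    difficulty=difficulty, criteria=criteria)

def legitimate_prompt(count: int, category: str, patterns: str) -> str:
    return LEGITIMATE_TEMPLATE.format(count=count, category=category,
                                      patterns=patterns)
```

One call per technique/difficulty combination keeps each prompt focused, which is what makes per-category quality review practical.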

The automated review pipeline

Every generated card went through a separate Claude API call for review before entering the dataset. The review prompt evaluated each card against four criteria:

- Technique purity: does the card actually use the labeled technique as its primary mechanism?
- Difficulty accuracy: does it match the rubric for its assigned tier?
- Realism: would this plausibly appear in someone's inbox?
- Formatting: correct JSON structure, all required fields present.

Cards that failed any criterion were flagged with a reason and excluded. The review model was given the technique-specific difficulty rubric so it could evaluate accuracy against the same standard the generation model was targeting.
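A minimal sketch of the review stage's deterministic half, plus a review-prompt builder. The structural checks implement the formatting criterion described above; the prompt wording is assumed, since the post describes the criteria rather than the literal prompt. The built prompt would then be sent through the same Claude API as generation.

```python
import json

REQUIRED = ("senderName", "senderDomain", "subject", "body", "forensicSignals")

def structural_check(card: dict) -> list:
    """Return a list of rejection reasons; an empty list means the card
    passes the formatting criterion."""
    reasons = ["missing field: " + f for f in REQUIRED if f not in card]
    if not isinstance(card.get("forensicSignals", []), list):
        reasons.append("forensicSignals must be an array")
    return reasons

def review_prompt(card: dict, rubric: str) -> str:
    # The review model judges technique purity, difficulty accuracy, and
    # realism against the same rubric the generator targeted. Wording
    # here is illustrative, not the pipeline's actual prompt.
    return ("Evaluate this card against the rubric. Answer PASS or FAIL "
            "with a reason.\n\nRubric:\n" + rubric +
            "\n\nCard:\n" + json.dumps(card))
```

Giving the review model the same technique-specific rubric as the generator is what keeps both stages aligned to one standard.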

The rejection rate was moderate when I was still using OpenAI for generation. Once the pipeline was fully on Anthropic, rejection rates dropped significantly. Most rejections at that point were for technique bleed (a card labeled as urgency that was really doing authority impersonation) or difficulty miscalibration (an easy card that was too subtle, or a hard card with obvious tells).

Manual review

Every card that passed automated review still got human eyes on it. This was not optional.

The automated filter catches structural problems reliably: wrong format, missing fields, obvious technique mismatches. What it does not catch as well is subjective realism. An email can check every box in the rubric and still feel off in a way that is hard to articulate but obvious when you read it. A legitimate workplace email that uses phrasing no actual HR department would use. A credential harvesting email where the pretext is technically valid but feels contrived.

I did not have to reject many cards at the manual review stage, maybe five to ten percent of what passed automation. But the ones I caught were the kind that would have introduced noise into the dataset: cards that were technically correct but experientially wrong.

Rate limits and scale

Generating 1,000+ cards (the raw generation count had to exceed 1,000 to cover rejections at both review stages) meant hitting rate limits. The Anthropic API has per-minute and per-day token limits that are easy to run into when you are making hundreds of calls with structured output.

The practical solution was batching with delays, processing one technique/difficulty combination at a time and spacing batches to stay under the rate ceiling. Not elegant, but it worked. The total generation and review process took several days of intermittent runs rather than a single session.
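The batching-with-delays approach can be sketched as a simple scheduler, assuming one API call per technique/difficulty combination. The delay value is a placeholder, not a figure from the post, and `sleep` is injectable so the schedule can be tested without waiting.

```python
import itertools
import time

TECHNIQUES = ["urgency", "authority impersonation", "credential harvesting",
              "hyper-personalization", "pretexting", "fluent prose"]
DIFFICULTIES = ["easy", "medium", "hard", "extreme"]

def run_batches(generate, delay_seconds=60, sleep=time.sleep):
    """Call generate(technique, difficulty) for each of the 24 combos,
    one at a time, sleeping between batches to stay under the rate
    ceiling. The 60-second default is an illustrative guess."""
    for technique, difficulty in itertools.product(TECHNIQUES, DIFFICULTIES):
        generate(technique, difficulty)
        sleep(delay_seconds)
```

Processing sequentially with spacing trades elegance for predictability, which matches the several-days-of-intermittent-runs outcome described above.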

What the pipeline produced

The final dataset is 1,000 cards at consistent quality, with technique as the only independent variable. The generation pipeline, automated review, and manual review together produced something I could not have assembled from real phishing corpora: a controlled dataset where every card was written to the same standard and evaluated against the same rubric.

The methodology is not perfect. AI-generated phishing is not identical to real phishing, and the study results will need to be interpreted with that caveat. But for the specific questions Threat Terminal is designed to answer, controlling for writing quality is more important than sourcing real emails.

Play Threat Terminal or read the full methodology.
