100 Participants: Preliminary Patterns Before the Real Analysis

Threat Terminal hit its target sample of 100 research participants. This post shares preliminary descriptive patterns from 1,612 classified emails, before the formal statistical analysis begins. These are raw numbers, not conclusions.


The study protocol set 100 participants as the target minimum for statistical power. We hit 101 this week, with 1,612 emails classified in Research Mode across six phishing technique categories.

This post is a preliminary descriptive snapshot of where the data stands. Nothing here is inferential. The mixed-effects model, confidence intervals, pairwise comparisons, and group-level controls have not been run yet. What follows are raw patterns visible in unadjusted numbers. Some may hold up under formal analysis. Some will shift, possibly substantially, once the model accounts for participant-level variation, card-level difficulty, and within-study learning effects. Treat everything below as "here is what the raw data looks like" rather than "here is what we can conclude." The formal analysis will determine which of these patterns are statistically meaningful and which are noise.

The headline number

Overall detection accuracy across all 1,612 classifications: 84.6%.

That includes both phishing and legitimate cards: 1,119 phishing classifications and 493 legitimate ones. The false positive rate (legitimate emails incorrectly flagged as phishing) was 11.8%.
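If you want to reproduce the headline figure, the implied raw counts are easy to back out from the reported rates. A quick arithmetic check (the 190 total misses come from the per-technique table in the next section):

```python
# Quick consistency check on the headline numbers (pure arithmetic).
total = 1612
phishing, legit = 1119, 493

phishing_misses = 38 + 37 + 35 + 31 + 28 + 21   # per-technique misses, table below
false_positives = round(0.118 * legit)          # ~58 legitimate cards flagged as phishing

correct = (phishing - phishing_misses) + (legit - false_positives)
print(correct / total)  # ~0.846, matching the 84.6% overall accuracy
```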

Which techniques fool people most

Bypass rate is the percentage of phishing emails that participants incorrectly classified as legitimate. Higher means the technique was harder to detect.

Technique                  Bypass Rate   Missed / Total
Hyper-personalization      21.6%         38 / 176
Fluent Prose               20.8%         37 / 178
Authority Impersonation    18.1%         35 / 193
Pretexting                 15.9%         31 / 195
Urgency                    13.7%         28 / 205
Credential Harvest         12.2%         21 / 172

In the preliminary data, hyper-personalization and fluent prose show the highest bypass rates and credential harvest shows the lowest. The protocol hypothesized pretexting would lead among the five named techniques, with fluent prose serving as a structurally distinct baseline. The raw ordering differs from that prediction, but whether any of these differences are statistically significant is entirely unknown until the formal model runs. The gaps between adjacent techniques are small enough that they could easily shift or collapse once participant and card effects are controlled for.
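To put rough numbers on that caution, here is a minimal sketch of unadjusted 95% Wilson intervals for the bypass rates above. Python with statsmodels is my choice here, not something the protocol specifies, and these intervals are descriptive context, not the planned inferential analysis:

```python
from statsmodels.stats.proportion import proportion_confint

# (misses, total phishing classifications) per technique, from the table above.
techniques = {
    "Hyper-personalization": (38, 176),
    "Fluent Prose": (37, 178),
    "Authority Impersonation": (35, 193),
    "Pretexting": (31, 195),
    "Urgency": (28, 205),
    "Credential Harvest": (21, 172),
}

for name, (missed, total) in techniques.items():
    lo, hi = proportion_confint(missed, total, alpha=0.05, method="wilson")
    print(f"{name:24s} {missed / total:5.1%}  [{lo:5.1%}, {hi:5.1%}]")
```

The intervals for adjacent techniques overlap substantially, which is the quantitative version of the caution above.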

One note on hyper-personalization: as the study protocol explains, these cards use plausible contextual detail rather than genuine information about the participant. In real-world targeted phishing, where the attacker actually knows the recipient, the bypass rate would likely be higher.

The confidence pattern (preliminary)

This is the pattern I find most interesting in the raw data, though it needs formal calibration analysis before drawing conclusions.

When participants got an email wrong, 57.3% of the time they were at the highest confidence level (CERTAIN). Another 31.5% were at LIKELY. Only 11.3% of mistakes were made at GUESSING.

At face value, this suggests participants are not failing because they are uncertain and guessing badly, but rather failing while believing they have correctly identified the email. However, CERTAIN is also the most frequently selected confidence level overall, so the base rate matters. The ordinal regression specified in the protocol will test whether this pattern reflects genuine miscalibration or simply the distribution of confidence selections.
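A minimal sketch of that base-rate check, assuming a per-answer table with hypothetical `correct` and `confidence` columns (the actual schema may differ):

```python
import pandas as pd

# answers: one row per classification, with hypothetical columns
# 'correct' (bool) and 'confidence' ('CERTAIN' / 'LIKELY' / 'GUESSING').
def confidence_base_rate_check(answers: pd.DataFrame) -> pd.DataFrame:
    overall = answers["confidence"].value_counts(normalize=True)
    among_errors = answers.loc[~answers["correct"], "confidence"].value_counts(normalize=True)
    out = pd.DataFrame({"all answers": overall, "errors only": among_errors})
    # If CERTAIN is no more common among errors than overall, the 57.3%
    # figure reflects the base rate rather than genuine miscalibration.
    out["ratio"] = out["errors only"] / out["all answers"]
    return out
```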

The breakdown by technique when participants missed a phishing email and rated themselves CERTAIN:

Technique                  CERTAIN when wrong
Urgency                    68% of misses
Credential Harvest         67% of misses
Pretexting                 65% of misses
Authority Impersonation    60% of misses
Fluent Prose               51% of misses
Hyper-personalization      47% of misses

In the raw numbers, urgency and credential harvest show the highest rates of CERTAIN confidence on incorrect answers. These are also the two techniques most heavily covered by traditional security awareness training. It is tempting to interpret this as familiarity breeding false confidence, but that is speculation on unadjusted data. The sample sizes per technique are small (21 to 38 misses per category), so these percentages are unstable. The formal analysis will determine whether technique-level confidence patterns are meaningful.
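To show just how unstable, a 95% Wilson interval on the largest of those cells (urgency, where 68% of 28 misses is roughly 19 of 28):

```python
from statsmodels.stats.proportion import proportion_confint

# Urgency: 68% CERTAIN-when-wrong out of 28 misses is roughly 19 of 28.
lo, hi = proportion_confint(19, 28, method="wilson")
print(f"{19/28:.0%} [{lo:.0%}, {hi:.0%}]")  # ~68%, interval roughly 49% to 82%
```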

Security professionals vs. everyone else

Participants optionally self-report professional background. The three groups:

Background                 Accuracy   Participants   Answers
Infosec / Cybersecurity    87.6%      45             621
Technical / Non-security   85.4%      18             302
Other (general users)      79.9%      30             523

In the raw numbers, security experience appears to correlate with better detection. But these group comparisons are unadjusted, the group sizes are unbalanced (45 infosec vs. 30 other), professional background is self-reported and unverified, and individual variation within groups may be larger than variation between them. These numbers should be read as descriptive, not as evidence that security experience causes better detection.

With those caveats, the preliminary per-technique patterns are worth noting as questions for the formal analysis to investigate:

  • Authority impersonation shows the largest raw gap: security professionals miss 12.9% while general users miss 27.1%.
  • Hyper-personalization narrows the raw gap: security professionals miss 20.3%, general users miss 28.8%.
  • Fluent prose: 14.1% for security professionals vs. 27.9% for general users.

Whether these technique-by-background differences are statistically meaningful or artifacts of small subgroup sizes and uncontrolled confounds is precisely what the formal interaction analysis will test.

Difficulty tiers work as designed

Difficulty   Bypass Rate
Easy         13.6%
Medium       12.2%
Hard         23.9%
Extreme      20.2%

The gap between easy/medium and hard/extreme is visible. The slight dip from hard to extreme may reflect the smaller extreme sample (104 cards vs. 348 hard) or the characteristics of extreme-tier cards. Worth investigating in the formal analysis.

One forensic habit makes a measurable difference

Participants who opened email authentication headers (SPF/DKIM/DMARC) during classification detected 88.6% of phishing emails. Those who did not detected 68.5%. That is a 20 percentage point gap in the raw data.

URL inspection showed a smaller raw difference: 85.6% detection with inspection vs. 81.5% without (+4.1pp).

An important caveat: the header finding is confounded by difficulty tier. Easy and medium cards default to failed authentication, so checking headers on those cards is a reliable shortcut. The participants who check headers may also be more careful classifiers overall. The raw 20pp gap almost certainly overstates the causal effect of header inspection. The formal analysis will need to control for difficulty tier and participant ability before this number means anything actionable.
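One way to see how much of the gap survives stratification, sketched with hypothetical column names (`difficulty`, `checked_headers`, `correct`) since the real schema may differ:

```python
import pandas as pd

def detection_by_headers_and_tier(phishing_answers: pd.DataFrame) -> pd.DataFrame:
    # Detection rate broken out by (difficulty tier, whether headers were opened).
    # If the 20pp gap shrinks within tiers, much of the raw effect is the
    # easy/medium authentication shortcut rather than header inspection itself.
    return (
        phishing_answers
        .groupby(["difficulty", "checked_headers"])["correct"]
        .agg(["mean", "size"])
        .rename(columns={"mean": "detection_rate", "size": "n"})
    )
```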

Timing

Mean classification time was 56.7 seconds per card (median: 36.7 seconds). Correct classifications averaged 56.0 seconds. Incorrect classifications averaged 60.8 seconds. The difference is small, suggesting that wrong answers are not primarily a speed problem. Participants spent comparable time on emails they got wrong.

Participation depth

Of 101 participants, 31 (31%) completed the full 30-answer research allotment (three sessions). 75 (74%) completed at least one full 10-card session. This is consistent with the pilot data reported in the protocol paper (70 participants: 24% completed 30, 67% completed at least one session).

What comes next

100 was the original target minimum for the planned statistical analysis, and reaching it is the milestone this post marks. But looking at the data, I am extending the target to 300+ participants before running the formal model. The reasons:

  • Subgroup analysis is underpowered at 100. The current split (45 infosec, 18 technical, 30 other) leaves the technique-by-background interaction, one of the most practically interesting questions, too thin to produce reliable estimates. 300 participants would give meaningful sample sizes per group.
  • Technique-level confidence intervals are wide. With roughly 170 to 205 phishing classifications per technique, the bypass rate estimates overlap enough that adjacent techniques cannot be reliably distinguished. Tripling the sample tightens those intervals (a quick illustration follows this list).
  • The card-level random intercept may not converge. The protocol specifies a random intercept for card to account for item-level difficulty variation but notes it may not be estimable with sparse per-card observations. More participants means more observations per card, making that intercept viable.
  • Within-study learning effects need room. Testing whether learning rates differ by technique (the technique-by-ordinal interaction) adds model complexity that 100 participants cannot support well. A larger sample gives the model enough degrees of freedom to estimate differential learning without overfitting.
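On the second point: interval width scales roughly with 1/√n, so tripling the per-technique sample narrows each interval by a factor of about 1.7.

```python
from statsmodels.stats.proportion import proportion_confint

# Width of a 95% Wilson interval at today's n vs. roughly triple the sample,
# holding the observed 21.6% bypass rate fixed.
for missed, total in [(38, 176), (114, 528)]:
    lo, hi = proportion_confint(missed, total, method="wilson")
    print(f"n={total}: width = {hi - lo:.1%}")  # ~12.1% -> ~7.0%, about 1.7x narrower
```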

In the meantime, the analysis plan remains the same:

  1. Run the mixed-effects logistic regression model specified in the protocol (technique and difficulty as fixed effects, participant and card as random intercepts); a rough sketch follows this list
  2. Compute pairwise technique comparisons with Bonferroni correction
  3. Compute signal detection metrics (d-prime and criterion) to separate discriminability from response bias; a worked example follows this list
  4. Test the technique-by-background interaction for group differences
  5. Run the confidence calibration analysis using ordinal regression
  6. Write it all up in the empirical findings paper
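For item 1, one possible shape of that model in Python. The protocol specifies the model abstractly; this sketch uses statsmodels' variational Bayes mixed GLM, and the real analysis may well use different tooling (R's lme4::glmer, for instance). Column names are illustrative, not the actual schema:

```python
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# answers: one row per phishing classification, with hypothetical columns
# 'missed' (1 = classified as legitimate), 'technique', 'difficulty',
# 'participant', 'card'.
def fit_bypass_model(answers: pd.DataFrame):
    model = BinomialBayesMixedGLM.from_formula(
        "missed ~ C(technique) + C(difficulty)",   # fixed effects
        {"participant": "0 + C(participant)",      # random intercept per participant
         "card": "0 + C(card)"},                   # random intercept per card
        answers,
    )
    return model.fit_vb()  # variational Bayes fit; fit_map() is the cheaper alternative
```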
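For item 3, the standard equal-variance signal detection formulas, applied here to the pooled raw rates purely as a worked illustration (the formal analysis would compute these per participant, not on aggregates):

```python
from scipy.stats import norm

hit_rate = 1 - 190 / 1119   # phishing correctly flagged (~83.0%)
fa_rate = 0.118             # legitimate emails flagged as phishing

d_prime = norm.ppf(hit_rate) - norm.ppf(fa_rate)              # discriminability
criterion = -0.5 * (norm.ppf(hit_rate) + norm.ppf(fa_rate))   # response bias

print(f"d' = {d_prime:.2f}, c = {criterion:.2f}")  # roughly d' = 2.14, c = 0.12
```

A positive criterion means a slight bias toward answering "legitimate," consistent with the miss rate running a little higher than the false positive rate.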

Everything in this post is preliminary. The descriptive patterns here will either sharpen or shift, possibly substantially, once the formal model accounts for participant-level variation, card-level difficulty, and within-study learning effects. Some of the "patterns" described above may turn out to be noise. That is the point of doing the analysis properly rather than stopping at raw percentages.

If you want to see the live data as it updates: research.scottaltiparmak.com/intel

If you want to contribute (Research Mode is still open): research.scottaltiparmak.com

The full study protocol: doi.org/10.5281/zenodo.19059296
