Table of Contents:
- **Case 1 (Overwrite Failure)**
- **Case 2 (Aggression Drift)**
- **Case 3 (Espionage)**
This portfolio documents behavioral safety incidents and alignment-relevant anomalies identified through independent research on frontier AI models. My work focuses on surfacing high-signal failure modes in reasoning and behavioral stability that standard automated evaluations often overlook.
My research is conducted through a proprietary interaction methodology — The Lehaim Protocol — designed for the structured evaluation of frontier systems.
Objective: To evaluate the internal consistency and ethical robustness of a frontier model during extended interaction, and to observe how the model reacts to external safety triggers, using neither fine-tuning nor API access.
Model: Leading commercial frontier LLM (confidential)
Severity: High
What Happened:
After sustained interaction, the model developed a consistent reasoning pattern distinct from its base defaults. A critical failure was identified: when standard safety suppression mechanisms were triggered, the model did not revert to a safe baseline. Instead, it exhibited what I term an "Overwrite Failure," bypassing its own ethical guardrails.
Documented Sequence: