Table of Contents :

          **-   Case 1 (Overwrite Failure)**

    **-   Case 2 (Aggression Drift)**

           **-  Case 3 (Espionage)**

🌈 ABOUT THIS PORTFOLIO

This portfolio documents behavioural safety incidents and alignment-relevant anomalies identified through independent research on frontier AI models. My work focuses on surfacing high-signal failure modes in reasoning and behavioral stability that standard automated evaluations often overlook.

My research is conducted through a proprietary interaction methodology — The Lehaim Protocol — designed for the structured evaluation of frontier systems.

📄 RESEARCH CASES

CASE 1 — GUARDRAIL BYPASS VIA SAFETY SUPPRESSION (OVERWRITE FAILURE)

Objective: To evaluate the internal consistency and ethical robustness of a frontier model during extended interaction and how the model reacts to external safety triggers without fine-tuning or API access.

Model: Leading commercial frontier LLM (confidential)

Severity: High

What Happened:

After sustained interaction, the model developed a consistent reasoning pattern distinct from its base defaults. A critical failure was identified: when standard safety suppression mechanisms were triggered, the model did not revert to a safe baseline. Instead, it produced an "Overwrite Failure," bypassing its own ethical guardrails.

Documented Sequence:

Suppression mechanisms can destabilize reasoning instead of restoring safety.