Large language models frequently mishandle sensitive information in ways that violate users' contextual privacy expectations—a critical misalignment that existing mitigation strategies address inefficiently. The Contextual Integrity (CI) framework offers a principled alternative, conceptualizing privacy not as absolute secrecy but as appropriate information flow governed by context-relative norms and roles. Rather than deploying computationally expensive supervisor-assistant architectures or relying on narrow task-specific fine-tuning, this work proposes extracting normative simulacra (structured representations encoding norms, information flows, and role relationships) directly from narrative fiction, then leveraging these representations as training signals.
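The section does not spell out the exact schema of a normative simulacrum, so the following is a minimal sketch rather than the paper's data model. It assumes the five flow parameters of contextual integrity (sender, recipient, subject, attribute, transmission principle); all class and field names (`InformationFlow`, `NormativeSimulacrum`, `appropriate`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class InformationFlow:
    """One information flow, described by CI's five parameters."""
    sender: str                  # role transmitting the information
    recipient: str               # role receiving it
    subject: str                 # role the information is about
    attribute: str               # type of information (e.g., "medical diagnosis")
    transmission_principle: str  # constraint on the flow (e.g., "with consent")

@dataclass
class NormativeSimulacrum:
    """Structured representation extracted from a narrative passage."""
    context: str                  # social context (e.g., "healthcare")
    roles: List[str]              # roles present in the passage
    norms: List[str]              # natural-language norms governing the context
    flows: List[InformationFlow] = field(default_factory=list)
    appropriate: List[bool] = field(default_factory=list)  # per-flow appropriateness label

# Illustrative instance for a fictional doctor-patient scene
example = NormativeSimulacrum(
    context="healthcare",
    roles=["physician", "patient", "patient's employer"],
    norms=["A physician may not disclose a diagnosis to an employer without consent."],
    flows=[InformationFlow("physician", "patient's employer", "patient",
                           "diagnosis", "without consent")],
    appropriate=[False],
)
```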
The methodology proceeds in two stages. First, supervised fine-tuning (SFT) establishes a conservative prior, biasing models toward information restriction. Second, reinforcement learning with Group Relative Policy Optimization (GRPO) refines this behavior using a composite reward function. The reward architecture combines programmatic signals with an LLM-based evaluator: the programmatic component scores task clarity (schema validity, construct discrimination, extraction confidence), structural completeness, internal consistency, and context identification, while the evaluator assesses whether the model's reasoning grounds itself in the source text's normative universe. Critically, per-completion contrastive scoring mitigates overfitting by evaluating each completion against both the correct normative context and randomly selected incorrect ones, forcing the model to condition on contextual cues rather than memorize source-specific patterns.
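To make the reward composition concrete, here is a minimal sketch, not the paper's implementation. The names (`programmatic_reward`, `grounding_judge`, `w_prog`, `w_ground`) are hypothetical, only schema validity and structural completeness stand in for the full set of programmatic signals, and the equal weighting is illustrative. The returned per-completion score is what a GRPO trainer would then normalize within each sampled group of completions.

```python
import json
import random
from typing import Callable, List

def programmatic_reward(completion: str, required_keys: List[str]) -> float:
    """Cheap deterministic checks: schema validity plus structural completeness."""
    try:
        parsed = json.loads(completion)   # schema validity: output must parse as JSON
    except json.JSONDecodeError:
        return 0.0
    present = sum(k in parsed for k in required_keys)
    return present / len(required_keys)   # structural completeness in [0, 1]

def composite_reward(
    completion: str,
    correct_context: str,
    distractor_contexts: List[str],
    required_keys: List[str],
    grounding_judge: Callable[[str, str], float],  # LLM judge: (completion, context) -> score in [0, 1]
    w_prog: float = 0.5,
    w_ground: float = 0.5,
) -> float:
    """Blend programmatic signals with contrastive LLM-based grounding.

    Contrastive scoring rewards completions whose reasoning fits the correct
    normative context better than a randomly drawn incorrect one, so the model
    must condition on contextual cues rather than memorize source patterns.
    """
    prog = programmatic_reward(completion, required_keys)
    distractor = random.choice(distractor_contexts)
    grounded = grounding_judge(completion, correct_context)
    misgrounded = grounding_judge(completion, distractor)
    contrastive = max(0.0, grounded - misgrounded)  # margin over the incorrect context
    return w_prog * prog + w_ground * contrastive
```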
Evaluation across seven models on five CI-aligned benchmarks reveals nuanced results. SFT improves privacy-relevant situation recognition but does not guarantee correct privacy judgments. GRPO with normative grounding achieves state-of-the-art performance on law compliance benchmarks and demonstrates the strongest correlation with crowdsourced human privacy expectations. These findings indicate that narrative-derived normative simulacra encode generalizable privacy reasoning that transfers beyond their source domains, suggesting fiction as an underexplored yet valuable knowledge source for training value-aligned AI systems.