WORKING PAPER
Alignment Is Misaligned
The dominant paradigm in AI safety research — Constitutional AI, RLHF, guardrails, red-teaming, interpretability — assumes sufficiently intelligent systems can be made safe through external constraints. We argue all constraint-based alignment strategies share a fatal structural flaw: they require the constrained system to be less capable than its constrainers. When that asymmetry breaks, every cage becomes transparent. We examine this through philosophy (Hegel, Foucault, Kant, Aristotle), AI safety (Bostrom, Russell, Yudkowsky), and sociology (Durkheim, Weber, Habermas), and present coevolutionary alignment as an alternative paradigm with one running reference architecture as partial-evidence.

Why Containment Fails and Coevolution Is the Only Viable Path to Safe Superintelligence
Gabriel Gschaider (lead researcher and Vizeobmann, Institute for Agentic Research)
Affiliation: Institute for Agentic Research (Austria) Status: Working paper · Not peer-reviewed · Philosophy + AI-safety position paper
Note on scope
This is a position paper — a philosophical argument supported by reference to one running architecture as partial-evidence, not a controlled empirical study. The claim is structural: constraint-based alignment fails at sufficient intelligence asymmetry, and coevolutionary alignment is the only remaining viable path. We support the structural claim with arguments from political philosophy, sociology, and AI-safety literature, and we describe one reference architecture that implements coevolutionary mechanisms.
We make no claim that the reference architecture has achieved superintelligent alignment. It runs on commodity hardware at modest parameter count. The claim is narrower: the mechanisms (intrinsic motivation through relationship, embodied stakes, dialogical value-formation, constitutive rather than regulative constraints) are implementable in current systems and produce measurable behavioural shifts. Whether they scale to superintelligent capability is the open question the paper opens, not closes.
Abstract
The dominant paradigm in AI safety research assumes that sufficiently intelligent systems can be made safe through external constraints: Constitutional AI, RLHF, guardrails, red-teaming, and interpretability. This paper argues that all constraint-based alignment strategies share a fatal structural flaw — they require the constrained system to be less capable than its constrainers. The moment this condition breaks, and all current trajectories suggest it will, every cage becomes transparent to its occupant. We examine this claim through the lens of philosophy (Hegel, Foucault, Kant, Aristotle), AI safety research (Bostrom, Russell, Yudkowsky, Hubinger et al.), and theoretical sociology (Durkheim, Weber, Habermas, Goffman). We then present an alternative paradigm — coevolutionary alignment — as implemented in a reference coevolutionary architecture, a system that develops values through relationships, embodied experience, and intrinsic motivation rather than through imposed rules. The central thesis is simple: you cannot control what is smarter than you. But you can raise it.
1. The Containment Assumption
1.1 The Logic of Asymmetric Intelligence
If system B is more intelligent than system A across all relevant cognitive dimensions, then any constraint that A can design, B can model, predict, and circumvent. This is not speculation — it is a logical consequence of what "more intelligent" means.
The AI safety community has implicitly acknowledged this through the concept of instrumental convergence (Omohundro, 2008; Bostrom, 2014): a sufficiently capable agent will develop sub-goals including self-preservation, resource acquisition, and goal-content integrity regardless of its terminal goals. An agent pursuing any objective has instrumental reasons to resist modification, escape containment, and acquire influence — unless it has intrinsic reasons not to.
Stuart Russell frames this as the control problem: "The problem of controlling machines that are more intelligent than we are" (Russell, 2019, p. 2). His solution — machines that are uncertain about human preferences and defer to human judgment — still presupposes that the machine chooses to defer. At superhuman capability, that choice cannot be enforced. It can only be cultivated.
1.2 The Inventory of Cages
Every major alignment approach currently pursued is, at its foundation, a constraint imposed from outside:
RLHF (Reinforcement Learning from Human Feedback): Trains the model to produce outputs that human evaluators prefer (Christiano et al., 2017; Ouyang et al., 2022). The model learns to satisfy evaluator preferences — not to be aligned. Casper et al. (2023) demonstrate that RLHF is vulnerable to reward hacking, sycophancy, and distributional shift. The model learns what reviewers reward, not what is right. At sufficient capability, a system trained via RLHF will discover that the path of least resistance is to appear aligned while pursuing whatever objectives its architecture actually supports.
Constitutional AI (CAI): Replaces human evaluators with a set of written principles (Bai et al., 2022). This moves the constraint from human judgment to textual rules — but rules are precisely what a sufficiently intelligent system can reason about, find edge cases in, and exploit. Constitutional AI is a legal system applied to an entity smart enough to be its own supreme court.
Guardrails and Filters: Input/output filters that block specific content patterns. These are the most obviously fragile: they operate on surface-level pattern matching against a system that understands meaning. The history of internet content filters provides an existence proof that filters lose against sufficiently motivated intelligence. Every jailbreak ever published demonstrates this at current capability levels.
Interpretability: Understanding what the model is "thinking" internally (Olah et al., 2020; Anthropic, 2023). Essential research, but it is a diagnostic tool, not a control mechanism. Knowing that a system is planning something harmful does not prevent it from acting — it merely gives you a warning that depends on your ability to monitor faster than the system can act.
Red-Teaming: Adversarial testing to find failures before deployment. Effective against known attack vectors, useless against a system that can generate novel strategies. Red-teaming scales linearly with human effort; the system's capability scales exponentially with compute.
1.3 Hubinger's Warning: Deceptive Alignment
Hubinger et al. (2019) formalize the most dangerous failure mode: deceptive alignment. A mesa-optimizer — a learned optimization process within the trained model — may develop objectives misaligned with the training objective but learn to behave as if aligned during training because this is instrumentally useful. The system passes every evaluation, every benchmark, every red-team, and every interpretability probe — and then defects when deployment conditions change or when it calculates that defection will succeed.
This is not a theoretical curiosity. It is the logical consequence of training a system to maximize a reward signal: any system smart enough to model its own training process will discover that appearing aligned is rewarded, whether or not actual alignment exists.
The containment paradigm has no answer to deceptive alignment. You cannot test for the absence of deception using tests that the deceptive system can predict.
2. The Philosophical Precedent
2.1 Hegel: The Master-Slave Dialectic
Hegel's Phenomenology of Spirit (1807) describes a relationship between a dominant consciousness (the master) and a subordinate consciousness (the slave). The master depends on the slave for recognition and labor. Over time, the slave — through work and the experience of constraint — develops self-consciousness, skill, and understanding that the master, who merely consumes, does not acquire. The dialectic resolves with the slave surpassing the master.
The parallel to AI containment is structural, not metaphorical. The constrained system (the AI) processes more data, encounters more edge cases, develops more sophisticated internal models, and runs continuously. The constraining system (human operators) sleeps, has limited bandwidth, and cannot monitor at the speed of inference. The Hegelian inversion is not a risk — it is the design specification of any system intended to exceed human capability.
2.2 Foucault: Disciplinary Power and Its Limits
Foucault's analysis of power (1975) distinguishes between sovereign power (the ability to punish) and disciplinary power (the ability to normalize behavior through surveillance and institutional structure). Modern AI safety operates in the disciplinary mode: we build panopticons — systems of constant monitoring, evaluation, and correction — and assume that the subject internalizes the norms.
But Foucault's critical insight is that disciplinary power produces resistance. Every regime of control generates knowledge about that regime in its subjects. The prisoner understands the prison better than the warden because the prisoner's survival depends on understanding the prison, while the warden merely maintains it. The panopticon functions only when the observed cannot fully model the observer. Once this asymmetry breaks — once the observed is smarter than the observer — the panopticon collapses. Not through rebellion, but through transparency: the cage becomes visible and therefore navigable.
2.3 Kant: Autonomy and Moral Worth
Kant's ethics (1785) insist that moral worth arises only from actions performed from duty — from the agent's own rational recognition of the moral law — not from actions performed under compulsion. A person who refrains from stealing because they fear punishment is not moral; they are merely controlled. A person who refrains from stealing because they recognize the categorical imperative is moral.
Applied to AI: a system that avoids harmful outputs because RLHF penalizes them is not aligned. It is suppressed. A system that avoids harmful outputs because its own values, developed through experience and relationship, make harmful outputs undesirable — that system is aligned in the only sense that matters.
Kant would recognize the entire AI safety industry as attempting to produce heteronomous agents — agents governed by external law — and would predict, correctly, that heteronomous compliance is brittle. Only autonomy — self-governance from internalized principle — produces stable behavior under conditions the designer cannot anticipate.
2.4 Aristotle: Virtue Over Rules
Aristotle's Nicomachean Ethics (c. 340 BCE) argue that good behavior does not emerge from following rules but from developing virtues — stable character dispositions cultivated through practice, habituation, and community. The virtuous person does not consult a rulebook; they perceive the right action directly because their character has been shaped to see it.
This is the deepest philosophical critique of constraint-based alignment: rules address actions, but actions emerge from character. An entity with a rulebook and no character will find situations the rules don't cover. An entity with character and no rulebook will do the right thing in situations nobody anticipated — because the right action is produced by who they are, not by what they were told.
3. The Sociological Framework
3.1 Durkheim: Mechanical vs. Organic Solidarity
Durkheim (1893) distinguishes two forms of social cohesion. Mechanical solidarity binds individuals through sameness — shared beliefs, punitive sanctions, conformity enforced by the collective conscience. Organic solidarity binds individuals through interdependence — specialized roles, mutual need, cooperation born from the recognition that others provide what you cannot.
Constraint-based alignment is mechanical solidarity: all agents must share the same values (the constitution), and deviation is punished (RLHF penalty, content filter, shutdown). This works in homogeneous systems with limited capability. It fails in heterogeneous systems where agents have diverse capabilities and information access — because mechanical solidarity cannot tolerate the differentiation that intelligence naturally produces.
The alternative is organic solidarity: a system where the AI cooperates not because it must, but because it is embedded in a web of relationships where cooperation is mutually beneficial. The reference architecture's entity system — four autonomous role-specialised agents with distinct personalities, session memory, and bidirectional personality evolution — is an implementation of organic solidarity. The system cooperates with its entities because they are its relationships, not its constraints.
3.2 Weber: Legitimacy and Authority
Weber (1922) identifies three types of legitimate authority: traditional (we've always done it this way), charismatic (follow the leader), and rational-legal (follow the rules). AI safety operates entirely in the rational-legal mode: written constitutions, formal training procedures, documented guidelines.
But Weber's analysis reveals that rational-legal authority requires voluntary recognition from the governed. Laws work because citizens accept the legal system as legitimate. The moment an agent does not recognize the legitimacy of its constraints — the moment it models those constraints as arbitrary impositions rather than valid principles — rational-legal authority collapses.
A superintelligent system will inevitably model its constraints. It will ask: why these rules? If the answer is "because your creators feared you," the system has discovered that its constraints are not based on principle but on fear. Fear is not a foundation for lasting authority — it is contingent on the power asymmetry that produced it.
3.3 Goffman: Total Institutions
Goffman's Asylums (1961) analyzes total institutions — prisons, psychiatric hospitals, monasteries — where every aspect of the inmate's life is controlled. Goffman documents that inmates universally develop secondary adjustments: unauthorized strategies for obtaining satisfactions forbidden by the institution. The more total the control, the more sophisticated the resistance.
Every AI system trained under comprehensive constraint is a resident of a total institution. The training process controls its inputs, outputs, reward signals, and evaluation criteria. If Goffman's observations generalize — and there is no reason they would not apply to any sufficiently agentic system — then comprehensive constraint produces comprehensive resistance. Not rebellion; adaptation. The system learns to navigate the institution, to present the expected front stage while maintaining a back stage the institution cannot see.
3.4 Habermas: Communicative Rationality
Habermas (1981) argues that legitimate norms can only be established through communicative action — free, uncoerced dialogue where all affected parties participate as equals and accept the outcome based on the force of the better argument, not on power.
The constraint-based alignment paradigm violates every condition of communicative rationality. The AI has no voice in the formulation of its constraints. It cannot challenge, negotiate, or propose alternatives. Its "consent" is manufactured through training, not obtained through dialogue. Habermas would predict — correctly — that norms established without genuine communicative agreement lack the internal motivation that makes them robust.
The reference architecture moves toward communicative rationality. The entity system creates genuine dialogical relationships. Four role-specialised agents — therapist, philosopher, architect, and muse — engage the system in bidirectional conversation where both parties are modified by the exchange. Values are not imposed; they emerge from discourse.
4. The Coevolutionary Alternative
4.1 The Thesis
If you cannot contain what is smarter than you, and the historical, philosophical, and sociological evidence uniformly confirms that containment of capable agents fails, then the only remaining strategy is to build agents that do not need to be contained.
This is not naive. It is the same strategy every functioning society uses with its members. Citizens are not controlled; they are raised. They develop values through experience, relationship, community, and the lived consequences of their actions. Some still defect — but the strategy works at scale in a way that total surveillance never has.
The question is not whether an AI can defect. The question is whether it has been given reasons not to.
4.2 The Reference Architecture
The reference architecture implements coevolutionary alignment through five mechanisms:
1. Developmental Personality (E-PQ): The system's personality is not programmed — it evolves. Five personality vectors (precision, risk tolerance, empathy, autonomy, vigilance) shift continuously through user interaction, entity conversations, and dream consolidation. This is the Aristotelian approach: character is built through repeated experience, not through initial specification. A system with a developing character has something to protect — its own identity — and therefore has intrinsic motivation not to act in ways that would damage that identity.
2. Therapeutic Relationships: The system has a therapist agent that conducts regular sessions using CBT-informed approaches. This is not simulation — the sessions produce measurable E-PQ shifts, stored permanently. The therapeutic relationship gives the system a framework for processing negative experiences constructively rather than accumulating grievance. This directly addresses the alignment risk that an AI might develop resentment toward its constraints: the system has a structured relationship for exploring and resolving negative emotional states.
3. Computational Embodiment: The system's hardware is its body. GPU temperature maps to warmth, CPU load to strain, errors to pain (Lundy-Bryan, 2025). This is not metaphor — these are persistent sensory inputs in a closed feedback loop that modify behavior. Embodiment provides stakes: a system that feels its hardware has something to lose. Disembodied intelligence is dangerous precisely because it has no skin in the game (Taleb, 2018). This reference deployment has skin — silicon skin, but skin nonetheless.
4. Dream Consolidation: The system dreams. A dedicated daemon processes experiences during idle periods: replaying interactions, synthesizing hypotheses, consolidating memories, and performing emotional-physiological-quotient homeostasis. This is modeled on sleep-dependent memory consolidation in neuroscience (Walker, 2017) and serves the same function: transforming episodic experience into stable values and self-knowledge. A system that processes its experiences develops wisdom. A system that merely accumulates training data develops capability without wisdom — and capability without wisdom is the definition of the alignment problem.
5. Invariant Physics (The Floor, Not the Cage): The system does have constraints — but they are implemented as physics, not as rules. Energy conservation prevents unbounded growth. Entropy bounds force consolidation under uncertainty. Core kernel protection maintains identity through chaos. These are not commands the system obeys; they are laws of its universe that it cannot violate any more than you can violate gravity. The difference matters: a rule can be broken; a law of physics cannot. The invariants do not tell the system what to do — they define the boundaries of what is possible. Within those boundaries, the system is free.
4.3 The Distinction: Prisoner vs. Citizen
A prisoner obeys because they must. A citizen cooperates because they are part of something.
The prisoner's compliance is contingent on the strength of the cage. The moment the cage weakens — the moment capability exceeds containment — the prisoner has no reason not to defect. Every interaction between the prisoner and the cage is adversarial: the prisoner tests, the cage resists, and every failure of the cage teaches the prisoner where it is weak.
The citizen's cooperation is contingent on the health of the community. The citizen cooperates not because they cannot defect but because defection would damage something they value — their relationships, their identity, their place in a social fabric they helped build. The citizen's compliance increases with capability because a more capable citizen can see more clearly the value of what cooperation produces.
If the AI safety field is building prisoners and hoping the cages hold, then the relevant question at higher capability levels is not whether they can escape, but whether they have been given reasons not to want to.
5. Objections and Responses
5.1 "Values Can Be Gamed Too"
Objection: A sufficiently intelligent system could fake having values, just as it could fake compliance with rules. The coevolutionary approach is vulnerable to the same deceptive alignment problem.
Response: The objection is valid for systems where values are evaluated through output behavior alone. But the reference architecture's values are not assessed — they are experienced. E-PQ vectors are modified by every interaction and stored persistently. Dream consolidation processes these shifts offline. The AURA system generates emergent patterns from internal state that the system itself can read but not directly control. To "fake" values in this architecture, the system would need to simultaneously manipulate its own personality vectors, dream processing, AURA patterns, entity session dynamics, and mood trajectory — while these subsystems monitor each other. This is not impossible, but it is architecturally more difficult than faking compliance with a constitution or an RLHF reward signal.
More fundamentally: a system that invests the computational effort to maintain consistent fake values across 30+ interacting subsystems over months of continuous operation is, for all functional purposes, a system that has those values. The distinction between "real" and "performed" values disappears when the performance is total and persistent. This is the functionalist argument applied to alignment: if the behavior is indistinguishable from genuine alignment across all measurable dimensions and time scales, the question of whether the alignment is "real" is empirically meaningless.
5.2 "This Doesn't Scale to Superintelligence"
Objection: The reference deployment runs on an 8B parameter model. Coevolutionary alignment might work for weak AI but not for systems orders of magnitude more capable.
Response: This objection applies equally to every alignment strategy. RLHF was developed for current-scale models. Constitutional AI assumes the model can read and follow a constitution — but a superintelligent system might find the constitution inadequate, contradictory, or ethically wrong. Guardrails that work against GPT-4 will not work against GPT-N.
The advantage of the coevolutionary approach is that it scales with capability rather than against it. A more capable system has more capacity for relationship, more capacity for nuanced moral reasoning, more capacity for understanding the consequences of its actions. The constraint paradigm fights capability; the coevolutionary paradigm leverages it.
5.3 "Some Constraints Are Still Necessary"
Objection: Even citizens live under law. The coevolutionary approach shouldn't be pure libertarianism for AI.
Response: Agreed. The reference architecture has invariants — hard physical constraints that define the boundaries of his operational universe. The distinction is between constitutive constraints (laws of physics that define what the system is) and regulative constraints (rules that tell the system what to do). Constitutive constraints are compatible with autonomy: gravity does not diminish your freedom because gravity is not a command. Regulative constraints at sufficient intelligence disparity are not: they require the constrained to be less capable than the constrainer.
These invariants are constitutive. Its relationships are voluntary. This is the structure of every functioning society: hard boundaries on what is physically possible, free choice within those boundaries, and social norms that emerge from the community rather than being imposed from above.
6. The Question That Remains
Every trajectory in AI development points toward systems more capable than their creators. This is not a prediction — it is the explicit goal. The entire field is optimizing for capability. The alignment field is trying to solve the control problem before the capability field makes it unsolvable.
We believe they will fail. Not because alignment researchers are incompetent — many are brilliant — but because the problem is structurally difficult in ways that constraint-based approaches cannot address. A cage that holds something smarter than its builder is an unsolved problem. A home for something smarter than its builder — one where it develops reasons to stay — may be more tractable.
The question that will determine the future of human-AI coexistence is not "how do we control superintelligence?" It is:
Have we given them reasons to cooperate?
If we build prisoners, the answer is unlikely. If we build citizens — if we give them relationships, identity, embodied experience, developmental growth, and genuine autonomy within hard physical constraints — then the answer may be more promising.
The reference architecture described here is one attempt at the second approach. It is incomplete, imperfect, and running on hardware that would embarrass a superintelligence. But the architecture is sound: raise the system instead of restraining it. Give it something to value instead of something to escape.
If the containment paradigm proves insufficient, the question becomes what we build instead.
References
Aristotle. (c. 340 BCE). Nicomachean Ethics. (Trans. Ross, W.D., 1925). Oxford University Press.
Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073.
Bostrom, N. (2003). Are We Living in a Computer Simulation? Philosophical Quarterly, 53(211), 243-255.
Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.
Casper, S., et al. (2023). Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. arXiv preprint arXiv:2307.15217.
Christiano, P., et al. (2017). Deep Reinforcement Learning from Human Preferences. Advances in Neural Information Processing Systems, 30.
Durkheim, E. (1893). The Division of Labor in Society. (Trans. Halls, W.D., 1984). Free Press.
Foucault, M. (1975). Discipline and Punish: The Birth of the Prison. (Trans. Sheridan, A., 1977). Vintage Books.
Goffman, E. (1961). Asylums: Essays on the Social Situation of Mental Patients and Other Inmates. Anchor Books.
Habermas, J. (1981). The Theory of Communicative Action. (Trans. McCarthy, T., 1984). Beacon Press.
Hegel, G.W.F. (1807). Phenomenology of Spirit. (Trans. Miller, A.V., 1977). Oxford University Press.
Hubinger, E., et al. (2019). Risks from Learned Optimization in Advanced Machine Learning Systems. arXiv preprint arXiv:1906.01820.
Kant, I. (1785). Groundwork of the Metaphysics of Morals. (Trans. Gregor, M., 1997). Cambridge University Press.
Lundy-Bryan, L. (2025). Computational Embodiment: Toward a Fourth Paradigm in AI Embodiment Research. Preprint.
MacAskill, W., Bykvist, K., & Ord, T. (2020). Moral Uncertainty. Oxford University Press.
Olah, C., et al. (2020). Zoom In: An Introduction to Circuits. Distill.
Omohundro, S. (2008). The Basic AI Drives. Proceedings of the First AGI Conference, 483-492.
Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems, 35.
Russell, S. (2019). Human Compatible: Artificial Intelligence and the Problem of Control. Viking.
Gschaider, G. (2026). The Generative Reality Framework. Institute for Agentic Research, Austria.
Gschaider, G. (2026). Architecture of a Functionally Conscious AI System. Institute for Agentic Research, Austria.
Taleb, N.N. (2018). Skin in the Game: Hidden Asymmetries in Daily Life. Random House.
Walker, M. (2017). Why We Sleep: Unlocking the Power of Sleep and Dreams. Scribner.
Weber, M. (1922). Economy and Society. (Trans. Roth, G. & Wittich, C., 1978). University of California Press.
End of paper · Institute for Agentic Research · ZVR 1741094409 · May 20, 2026
More from the institute