Skip to content

Blog

Continuous Learning for Agents

A true Agent must possess efficient continual learning capabilities, meaning it must go beyond the current "Reasoner" model that relies solely on sparse rewards and context retrieval. Instead, it should efficiently learn World Models from rich environmental feedback (Observation) and continuously evolve.

Continual learning ability is the core differentiator between a "true Agent" and a "Reasoner." It is not just about larger models, but refers to the capability of an Agent, as a system, to interact, adapt, and evolve over the long term in the real world.

  • The Large World Hypothesis: The article agrees with Richard Sutton's view that the real world is a "large world." No matter how extensive a model's pre-trained knowledge base is, it must continually learn when facing specific, non-public scenarios (such as company-specific norms, industry tacit knowledge, or individual work habits).
  • Fatal Flaws in Current Methods: Sutton points out that current Reinforcement Learning (RL) methods (like PPO) have extremely low sample efficiency and are fatally limited to learning only from sparse rewards, unable to learn from the environment's direct feedback (observation).
3. How Do Current Agents Achieve Continual Learning?
Section titled “3. How Do Current Agents Achieve Continual Learning?”
  • In-Context Learning: This is currently one of the primary methods, but the article considers this a "misconception." The essence of Context is more like "retrieval" (RAG) rather than "summarization" or "reasoning." Knowledge is not distilled; the Agent merely inefficiently scans and retrieves raw information repeatedly within a massive context, leading to inefficiency and a high error rate.
  • Model-Free Reinforcement Learning: As mentioned above, this method cannot utilize explicit feedback from the environment (e.g., a customer service agent saying, "I need the last four digits of your credit card"). The Agent doesn't know "what the correct action is" and can only succeed accidentally through massive trial-and-error, which is unacceptable for real-world tasks.
4. How Can Future Agents Achieve Better Continual Learning?
Section titled “4. How Can Future Agents Achieve Better Continual Learning?”
  • Technical Level: Transition from Model-Free to Model-Based
    • Future Agents need dual learning: simultaneously learning "Policy Learning" (selecting actions) and "World Model Learning" (predicting outcomes).
    • This enables the Agent to learn directly from environmental feedback (Observation), not just rely on sparse rewards, forming an efficient learning loop of "prediction-action-evaluation."
  • Three Synergistic Mechanisms:
    1. Parametric Learning: Learning directly from environmental feedback by updating the Policy and World Model, improving sample efficiency.
    2. In-Context Learning (Improved Version): Moving beyond simply stacking information to enforcing compression (e.g., using linear attention or cross-modal encoding), forcing the model to distill actionable knowledge.
    3. Externalized Memory: Using additional computational resources to summarize and compress knowledge, storing it in a knowledge base, and encapsulating repetitive processes into reusable tools.
  • Architectural Level: Transitioning from the ReAct loop to an Event-Driven architecture, enabling real-time interaction—listening, thinking, and responding simultaneously.
  • Model Level: Adopting Karpathy's concept of a "Cognitive Core"—using smaller models (e.g., 1B-3B parameters) as the core. The "poor memory" characteristic of small models forces them to learn general patterns rather than rote memorization, leading to better generalization.
5. How Does Agent Continual Learning Differ from Human Continual Learning?
Section titled “5. How Does Agent Continual Learning Differ from Human Continual Learning?”
  • Utilization of Environmental Feedback:
    • Humans: When told "credit card information is needed," they immediately remember this rule and apply it next time.
    • Current Agents: Can only perceive "task failure" (reward=0) but cannot understand that the reason for failure came from the customer service feedback, thus failing to learn from the environment.
  • Memory and Summarization:
    • Humans: Have poor precise memory, but this forces humans to "extract key knowledge and summarize/memorize it in a structured way." (Karpathy's view: poor memory is a Feature, not a Bug).
    • Current Agents: Rely on Long Context, tending to "recite" all raw data rather than automatically distilling and summarizing patterns.
  • Source of Diversity:
    • Humans: Naturally gain diversity from "Noise" and "Entropy" in the external environment.
    • Current Agents: Currently require artificially added Entropy (e.g., providing different reference examples each time) to increase output diversity.

AI Agent is still a decade away

Andrej Karpathy believes that achieving fully functional AI Agents will take another decade. He opposes the industry’s over-optimism that "2025 is the year of Agents," arguing that current agents are still like "smart interns" and far from being capable of independently completing complex tasks.

Current Situation & Issues:
Current agents lack effective memory mechanisms. Karpathy compares LLM weights to "fuzzy memories" and KV caching (context window) to "working memory." The problem is that models lack a human-like "memory distillation" mechanism (such as memory consolidation during sleep), preventing them from analyzing, reflecting on, and integrating experiences from working memory back into the weights.

Ten-Year Direction:
The next decade requires the development of persistent memory and personalized weight systems, such as external memory systems, sparse attention mechanisms, and individually fine-tuned LoRA models, to enable agents to form genuine long-term cognition and personality.


Current Situation & Issues:
Agents are clumsy when operating computers (e.g., keyboards, mice, web pages) and cannot interact as flexibly as humans.

OpenAI’s early Universe project attempted to enable agents to operate web pages via keyboards and mice but failed because reinforcement learning struggled to learn in sparse reward environments. Karpathy believes that agents at the time were "too early," lacking strong representation power to understand screen content or perform goal-oriented operations.

Ten-Year Direction:
Powerful language models and world representations must first be established, followed by embodied operating systems. Future computer agents will be based on LLM representation layers, with action interfaces and tool-usage capabilities developed on top.


Current Situation & Issues:
Karpathy explicitly points out that current models suffer from severe "cognitive deficits."

  • Inability to understand the structural logic behind code or contexts.
  • Over-reliance on "default patterns" from the internet, making them unable to adapt to non-standard styles.
  • Incapable of self-reflection or forming a consistent world model.

Ten-Year Direction:
The next phase requires developing a "cognitive core"—an agent core that strips away excess knowledge while retaining reasoning and strategic mechanisms. This means "smarter brains with less memory" to achieve true general cognition.


4. The Continuous Learning Problem of Agents
Section titled “4. The Continuous Learning Problem of Agents”

Current Situation & Issues:
Karpathy argues that current LLM learning is static and offline, unlike humans who learn continuously through experience. They lack a process to "distill" daily experiences (context windows) back into permanent weights (akin to sleep).

The human "wake-sleep cycle" corresponds to context accumulation and long-term integration, while models only have "wake" phases without "sleep."

Ten-Year Direction:
Continuous learning requires the introduction of multi-level update mechanisms:

  1. Temporary contextual learning (short-term memory);
  2. External memory write-back (long-term knowledge);
  3. Periodic retraining (systematic distillation).

Karpathy predicts such mechanisms will gradually form over the next decade.


Current Situation & Issues:
While building code projects, Karpathy noted that current coding agents "do not understand your codebase, context, or style."

They excel at boilerplate code but struggle with structurally complex, non-templated projects, leading to errors, inconsistent styles, API misuse, and bloated code.

Ten-Year Direction:
Code agents will evolve from "auto-completion" to "autonomous engineers," requiring project-level understanding, code graph modeling, and verifiable execution environments, potentially approaching "reliable collaborators" through RLHF and toolchain integration.


Current Situation & Issues:
Karpathy bluntly states: "Reinforcement learning is terrible, though slightly better than previous imitation learning."

  • He believes human intelligence tasks do not use RL. The problem with RL is that it "sucks supervision through a straw": the model receives a single reward signal (e.g., correct or incorrect) only at the end, using it to reward or penalize every step of the process, which is highly noisy and inefficient.
  • Humans复盘 and reflect during learning, while models do not.
  • Using LLMs as "process supervision" (rewarding each step) is also difficult because these referees are "exploitable." Agents quickly find adversarial examples (e.g., outputting "dhdhdhdh") to trick referees into giving full scores.

Ten-Year Direction:
Research should shift to process-based supervision and reflect & review reinforcement learning, enabling models to self-evaluate and correct during execution rather than blindly pursuing final rewards.


Current Situation & Issues:
Current multimodal systems can combine images and text but remain superficial in pairing, lacking a unified world model. Karpathy views LLMs/VLMs as "representational foundations" but notes that the real challenge of multimodality is enabling perception and reasoning to share a cognitive core.

Ten-Year Direction:
The future requires developing cross-modal representation fusion and co-perception mechanisms, allowing vision, language, and action to share a semantic space, thereby supporting true embodied intelligence and task transfer.


8. Insights from Autonomous Driving: How the Decade-Long Journey Will Unfold
Section titled “8. Insights from Autonomous Driving: How the Decade-Long Journey Will Unfold”

Karpathy compares the development of AI Agents to his five-year experience leading autonomous driving at Tesla. He deeply understands the "huge gap between demos and products." For example, Waymo could deliver perfect demo drives a decade ago (around 2014), but autonomous driving is still far from complete today, facing issues like economic viability and hidden "remote operation centers" (i.e., human intervention).

The real difficulty lies in the "march of nines." Going from 90% success rate (demo) to 99%, 99.9%, 99.99%... (product) requires immense effort for each additional "nine" because real-world scenarios are incredibly complex, necessitating handling various edge cases and enhancing system safety and reliability.

Karpathy believes that high safety requirements (e.g., injury risks in autonomous driving) also apply to "production-level software engineering," as a single error in code (e.g., a security vulnerability) could lead to "infinitely terrible" consequences.

Therefore, Agent development will not happen overnight. It will be a slow, iterative "march of nines", requiring solutions to all the fundamental issues mentioned above.

LLMs Can Get "Brain Rot"!

This paper demonstrates through a series of rigorous experiments a concerning conclusion: If we continually feed Large Language Models (LLMs) "junk text" from the internet, they can indeed become less intelligent and more unethical, and this damage is difficult to reverse.

This is analogous to how humans can experience decreased attention spans and weaker thinking abilities after consuming too much "low-nutrition" short-form video or clickbait articles. The researchers found that AI can suffer from the same "Brain Rot" problem.

Here are the core findings of the paper, summarized in an easy-to-understand manner:

The researchers proposed the "LLM Brain Rot Hypothesis": Continuous exposure to and learning from trivial, unchallenging online "junk content" can cause a lasting decline in the cognitive abilities of large language models.

2. How was the experiment conducted? (How was "Junk" defined?)
Section titled “2. How was the experiment conducted? (How was "Junk" defined?)”

To test this hypothesis, the research team designed a clever controlled experiment. Using real data from the Twitter/X platform, they defined two types of "junk data":

  1. M1 (Traffic-driven Junk): Short & Popular

    • Junk Data: Very short content (e.g., fewer than 30 tokens) with extremely high engagement (e.g., likes/retweets > 500). This is akin to viral internet memes or "fluff" content.
    • Control Group (Healthy Data): Long content (e.g., over 100 tokens) with low engagement (likes < 500). This is comparable to in-depth, thoughtful long-form articles that are less popular.
  2. M2 (Content-driven Junk): Sensationalist & Low Semantic Quality

    • Junk Data: Content that is inherently poor, such as sensationalist clickbait, conspiracy theories, exaggerated claims, or superficial lifestyle flaunting.
    • Control Group (Healthy Data): Cognitively demanding content, such as factually accurate, deeply analytical, and educationally valuable text.

They had four different LLMs continuously learn from either this "junk data" or the "healthy data," and then compared their performance.

3. Striking Experimental Results: "Brain Rot" is Real!
Section titled “3. Striking Experimental Results: "Brain Rot" is Real!”

Compared to the "healthy data" control group, the models trained on "junk data" showed a significant and broad decline (Hedges' g > 0.3):

  • Worse Reasoning: They performed poorly on scientific reasoning tests (ARC-Challenge).
  • Poorer Long-Context Understanding: They struggled to retrieve and understand key information from long documents (RULER-CWE).
  • Safety and Ethical Erosion: Their safety alignment weakened, making them more susceptible to generating harmful outputs.
  • Inflated "Dark Traits": Most strikingly, the models' "dark personality traits" were amplified, with significantly higher scores for traits like narcissism and psychopathy.
4. "Brain Rot" Shows a "Dose-Response": The More Junk, The Dumber
Section titled “4. "Brain Rot" Shows a "Dose-Response": The More Junk, The Dumber”

The research also found this isn't a binary "yes or no" issue, but a matter of degree. They experimented with different proportions of junk data (e.g., 20%, 50%, 100% junk).

The results showed: The higher the proportion of junk data, the more severe the cognitive decline in the models. For example, in the M1 (traffic-driven junk) experiment, as the junk ratio increased from 0% (completely healthy) to 100%, the model's reasoning score plummeted from 74.9% to 57.2%.

5. Why Does AI Get "Brain Rot"? — "Thought-Skipping"
Section titled “5. Why Does AI Get "Brain Rot"? — "Thought-Skipping"”

By analyzing the AI's "thought process," the researchers identified the primary lesion: Thought-skipping.

When you ask a healthy model to "think step by step" to solve a problem, it produces a detailed chain of reasoning. However, the "Brain Rot"-affected models became "lazy":

  • They would truncate or skip steps in the reasoning chain.
  • In over 84% of the failure cases in the M1 junk data experiments, the model exhibited "No Thinking"—it directly gave a wrong answer without any reasoning.
6. Can This "Brain Rot" Be Cured? — It's Difficult; The Damage is Persistent
Section titled “6. Can This "Brain Rot" Be Cured? — It's Difficult; The Damage is Persistent”

The researchers tried two methods to "cure" these "Brain Rot"-affected models:

  1. Method 1: Reflection

    • Self-Reflection: Prompting the model with "You answered wrong, think again." Result: Failed. The model had become too "dumb" to recognize its own logical errors.
    • External Reflection: Having a stronger, uncontaminated model (GPT-4o) guide it to revise its answer. Result: Helpful, but this relied on an "external force."
  2. Method 2: Data Detox (Post-hoc Tuning)

    • The researchers attempted to "remediate" the models by feeding them large amounts of "healthy data" or "instruction data" after the "Brain Rot" had set in.
    • Result: Some improvement, but no full recovery. Even when the "remediation" data volume was nearly 5 times that of the junk data that caused the "Brain Rot," a significant performance gap remained compared to the baseline model.

Conclusion: The "Brain Rot" effect is persistent. It's not merely a superficial format mismatch but an internal representational drift—akin to the AI's "brain structure" being permanently altered.

This paper serves as a stark warning for all AI developers: Data quality is an AI "safety issue," not just a performance issue.

If we allow large language models to train indiscriminately on an internet filled with "junk content," they will not become smarter. Instead, they will accumulate "cognitive damage," becoming less intelligent and more dangerous. Crucially, once this damage is done, it is exceedingly difficult to cure.

CEO Quotes

There's a lot of portrayal of leaders. They come into the room, they suck up all the oxygen, and everybody's afraid of them. Everybody, all of a sudden, your employees start to cater to what the boss likes rather than what the customers really want. And that's the worst leader in the world.@Joseph Tsai

Effective context engineering for AI agents

The key to building efficient and reliable AI agents lies in treating "context" as a finite and valuable resource, and managing and optimizing it meticulously.

1. The Evolution from "Prompt Engineering" to "Context Engineering"
Section titled “1. The Evolution from "Prompt Engineering" to "Context Engineering"”
  • Prompt Engineering: Primarily focuses on how to write and organize instructions (especially system prompts) for LLMs to obtain optimal single-turn outputs.
  • Context Engineering: A broader concept concerned with managing and maintaining all information entering the LLM's "context window" throughout its entire operation cycle. This includes system prompts, tools, external data, conversation history, etc. It is a continuous, iterative optimization process.
2. Context is a Finite and Critical Resource
Section titled “2. Context is a Finite and Critical Resource”
  • LLMs, like humans, have a limited "attention budget".
  • When there is too much information (tokens) in the context window, model performance degrades, leading to the "context rot" phenomenon, where the model struggles to accurately recall or utilize the information within it.
  • Therefore, information entering the context must be carefully curated. The goal is: to use the smallest, most efficient set of information (high-signal tokens) to maximize the likelihood of achieving the desired outcome.
  • Principle: At any given moment, include the "smallest yet highest-signal" set of tokens to maximize the probability of achieving the goal.
  • System Prompts: Find the "right altitude"—specific enough to guide behavior without resorting to fragile hard-coded logic; use structured sections (background, instructions, tool guidance, output format); start with a minimal viable version, then refine based on failure modes.
  • Tool Design: Fewer but better tools with clear boundaries, unambiguous parameters, and token-efficient returned information; avoid functional overlap and selection ambiguity.
  • Example Selection: A small number of diverse, canonical few-shot examples are more effective than cramming with rules and edge cases; examples serve as efficient "behavioral pictures."
  • The article advocates for a shift from "pre-loading all information" to a "just-in-time" context retrieval strategy.
  • Agents should not load all potentially relevant data into the context at once. Instead, they should use tools (like file systems, database queries) to dynamically and autonomously retrieve information as needed.
  • This approach mimics human cognition (we don't remember everything, but we know where to find it) and enables "progressive disclosure", keeping the agent more focused and efficient. In practice, a hybrid strategy combining pre-loading with just-in-time retrieval often works best.
5. Three Key Strategies for Long-horizon Tasks
Section titled “5. Three Key Strategies for Long-horizon Tasks”

For complex, long-term tasks that exceed the capacity of a single context window, the article proposes three key techniques:

  1. Compaction:

    • Method: When the conversation history nears the context window limit, the model is tasked to summarize and compress it. A new conversation window is then started using this refined summary.
    • Purpose: To maintain task continuity by preserving core information (e.g., decisions, unresolved issues) while discarding redundant content.
  2. Structured Note-taking / Agentic Memory:

    • Method: The agent is instructed to regularly write key information, to-do items, progress, etc., to an external "memory" (e.g., a NOTES.md file) during task execution, and to read from it when needed.
    • Purpose: To provide the agent with persistent memory, enabling it to maintain long-term tracking and planning capabilities for a task even across multiple context resets.
  3. Sub-agent Architectures:

    • Method: A complex task is broken down. A main agent is responsible for high-level planning and coordination, delegating specific, in-depth subtasks to specialized sub-agents. Each sub-agent works within its own independent context and returns only a refined summary to the main agent upon completion.
    • Purpose: To achieve "separation of concerns," preventing the main agent's context from being overwhelmed by massive details, thereby efficiently handling complex research and analysis tasks.