

CEO Quotes

"There's a lot of portrayal of leaders: they come into the room, they suck up all the oxygen, and everybody's afraid of them. All of a sudden, your employees start to cater to what the boss likes rather than what the customers really want. And that's the worst leader in the world." (Joseph Tsai)

Effective context engineering for AI agents

The key to building efficient and reliable AI agents lies in treating "context" as a finite and valuable resource, and managing and optimizing it meticulously.

1. The Evolution from "Prompt Engineering" to "Context Engineering"
  • Prompt Engineering: Primarily focuses on how to write and organize instructions (especially system prompts) for LLMs to obtain optimal single-turn outputs.
  • Context Engineering: A broader concept concerned with managing and maintaining all information entering the LLM's "context window" throughout its entire operation cycle. This includes system prompts, tools, external data, conversation history, etc. It is a continuous, iterative optimization process.
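
To make the distinction concrete, here is a minimal sketch (all helper names are illustrative placeholders, not from the article): a prompt is written once, while context is re-assembled on every turn from several curated sources under a token budget.

```python
# Minimal sketch: context engineering rebuilds the model input each turn
# from several curated sources, rather than relying on one static prompt.
# All names here are hypothetical placeholders.

def rough_tokens(text: str) -> int:
    """Crude token estimate (~4 characters per token)."""
    return len(text) // 4

def build_context(system_prompt, tool_specs, notes, retrieved_docs,
                  history, budget=8_000):
    """Return the smallest high-signal message list that fits the budget."""
    messages = [{"role": "system", "content": system_prompt}]
    messages.append({"role": "system",
                     "content": "Available tools:\n" + "\n".join(tool_specs)})
    if notes:  # persistent agent memory, e.g. a NOTES.md file
        messages.append({"role": "system", "content": "Notes:\n" + notes})
    for doc in retrieved_docs:  # just-in-time retrieved references
        messages.append({"role": "user", "content": "Reference:\n" + doc})
    # Keep only as much recent history as the remaining budget allows.
    used = sum(rough_tokens(m["content"]) for m in messages)
    kept = []
    for turn in reversed(history):
        cost = rough_tokens(turn["content"])
        if used + cost > budget:
            break
        kept.insert(0, turn)
        used += cost
    return messages + kept
```
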
2. Context is a Finite and Critical Resource
  • LLMs, like humans, have a limited "attention budget".
  • When there is too much information (tokens) in the context window, model performance degrades, leading to the "context rot" phenomenon, where the model struggles to accurately recall or utilize the information within it.
  • Therefore, information entering the context must be carefully curated. The guiding principle: at any given moment, include the smallest set of high-signal tokens that maximizes the likelihood of achieving the desired outcome.
3. The Anatomy of Effective Context
  • System Prompts: Aim for the "right altitude," specific enough to guide behavior yet free of fragile hard-coded logic; use structured sections (background, instructions, tool guidance, output format); start with a minimal viable version and refine it based on observed failure modes.
  • Tool Design: Fewer but better tools, with clear boundaries, unambiguous parameters, and token-efficient return values; avoid functional overlap and selection ambiguity.
  • Example Selection: A small number of diverse, canonical few-shot examples is more effective than cramming in rules and edge cases; examples serve as efficient "pictures" of desired behavior (a minimal sketch follows this list).
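
To ground these guidelines, below is a minimal sketch of a sectioned system prompt and a single well-scoped tool. The domain, prompt text, tool name, and JSON-schema layout are invented for this illustration, not examples from the article.

```python
# Hypothetical illustration of a sectioned system prompt and one
# well-scoped, token-efficient tool definition.

SYSTEM_PROMPT = """\
## Background
You are a support agent for an internal ticketing system.

## Instructions
- Answer only from retrieved tickets; say "unknown" otherwise.
- Prefer short, factual answers.

## Tool guidance
Use search_tickets when the user refers to a past issue.

## Output format
Reply in plain text, at most three sentences.
"""

SEARCH_TICKETS_TOOL = {
    "name": "search_tickets",
    "description": "Search closed tickets by keyword; returns brief matches.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Keywords to match."},
            "limit": {"type": "integer", "description": "Max results (<= 5)."},
        },
        "required": ["query"],
    },
}

def render_ticket(ticket: dict) -> str:
    """Return a compact, high-signal line instead of the full record."""
    return (f"#{ticket['id']} [{ticket['status']}] {ticket['title']} "
            f"- {ticket['resolution'][:120]}")
```

Returning a trimmed one-line summary per ticket, rather than the raw record, is one way to keep tool output token-efficient.
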
4. Just-in-Time Context Retrieval
  • The article advocates a shift from "pre-loading all information" to a "just-in-time" context retrieval strategy.
  • Agents should not load all potentially relevant data into the context at once. Instead, they should use tools (such as file-system access or database queries) to dynamically and autonomously retrieve information as needed.
  • This approach mimics human cognition (we don't remember everything, but we know where to find it) and enables "progressive disclosure", keeping the agent focused and efficient. In practice, a hybrid strategy combining pre-loading with just-in-time retrieval often works best (see the sketch after this list).
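
The following sketch (hypothetical file-based tools) shows the just-in-time pattern: the agent starts from lightweight references such as file paths and pulls in full content only when a step actually needs it.

```python
import pathlib

# Hypothetical sketch of just-in-time retrieval: the context starts with
# cheap references (file names), and full content is fetched on demand.

def list_files(root: str) -> list[str]:
    """Low-token view of what exists: names only, no contents."""
    return [str(p) for p in pathlib.Path(root).rglob("*.md")]

def read_file(path: str, max_chars: int = 4000) -> str:
    """Fetch full content on demand, truncated to keep the context small."""
    return pathlib.Path(path).read_text(encoding="utf-8")[:max_chars]

def agent_turn(question: str, root: str) -> list[dict]:
    """One illustrative turn: browse the index first, then open one file."""
    index = list_files(root)                        # progressive disclosure
    relevant = [p for p in index if "design" in p.lower()][:1]
    context = [{"role": "system", "content": f"Files available: {index}"}]
    for path in relevant:
        context.append({"role": "user",
                        "content": f"Contents of {path}:\n{read_file(path)}"})
    context.append({"role": "user", "content": question})
    return context
```
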
5. Three Key Strategies for Long-horizon Tasks

For complex, long-horizon tasks that exceed the capacity of a single context window, the article proposes three key techniques (minimal sketches of each follow the list):

  1. Compaction:

    • Method: When the conversation history nears the context window limit, the model is prompted to summarize and compress it; a new context window is then started from this distilled summary.
    • Purpose: To maintain task continuity by preserving core information (e.g., decisions, unresolved issues) while discarding redundant content.
  2. Structured Note-taking / Agentic Memory:

    • Method: The agent is instructed to regularly write key information, to-do items, progress, etc., to an external "memory" (e.g., a NOTES.md file) during task execution, and to read from it when needed.
    • Purpose: To provide the agent with persistent memory, enabling it to maintain long-term tracking and planning capabilities for a task even across multiple context resets.
  3. Sub-agent Architectures:

    • Method: A complex task is broken down. A main agent is responsible for high-level planning and coordination, delegating specific, in-depth subtasks to specialized sub-agents. Each sub-agent works within its own independent context and returns only a refined summary to the main agent upon completion.
    • Purpose: To achieve "separation of concerns," preventing the main agent's context from being overwhelmed by massive details, thereby efficiently handling complex research and analysis tasks.
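
Below are minimal sketches of the three techniques, assuming a simple message-list representation; helpers such as summarize() and run_llm() are stand-ins for real LLM calls, not code from the article. First, compaction:

```python
# Hypothetical compaction sketch: when the history approaches the window
# limit, distill the old turns into a summary and continue in a fresh context.

def rough_token_count(messages: list[dict]) -> int:
    """Crude token estimate (~4 characters per token)."""
    return sum(len(m["content"]) // 4 for m in messages)

def summarize(messages: list[dict]) -> str:
    """Stand-in for an LLM call that keeps decisions and open issues."""
    return "Summary of prior work:\n" + "\n".join(
        m["content"][:80] for m in messages if m["role"] == "assistant")

def maybe_compact(messages, limit_tokens=100_000, keep_recent=10):
    """If near the limit, replace old turns with a distilled summary."""
    if rough_token_count(messages) < int(limit_tokens * 0.9):
        return messages                      # still comfortably within budget
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [{"role": "system", "content": summarize(old)}] + recent
```

Keeping the most recent turns verbatim is a common choice, since they are usually the highest-signal part of the history.
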
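Second, structured note-taking as two tiny tools the agent can call (the NOTES.md filename comes from the article; the function names are illustrative):

```python
import pathlib

NOTES_PATH = pathlib.Path("NOTES.md")  # persists across context resets

def append_note(note: str) -> None:
    """Tool the agent calls to record decisions, progress, and to-dos."""
    with NOTES_PATH.open("a", encoding="utf-8") as f:
        f.write(f"- {note}\n")

def read_notes(max_chars: int = 2000) -> str:
    """Tool the agent calls after a reset to restore its working memory."""
    if not NOTES_PATH.exists():
        return "(no notes yet)"
    return NOTES_PATH.read_text(encoding="utf-8")[-max_chars:]
```
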
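Third, a sub-agent sketch: the orchestrator delegates each subtask to a worker that runs in its own fresh context and returns only a condensed summary:

```python
# Hypothetical sub-agent sketch: each subtask runs in an isolated context;
# only a condensed result flows back to the orchestrator.

def run_llm(messages: list[dict]) -> str:
    """Stand-in for a real model call."""
    return f"[model answer to: {messages[-1]['content'][:60]}]"

def run_subagent(task: str, instructions: str) -> str:
    """Fresh context per subtask; return a summary, not the full transcript."""
    messages = [{"role": "system", "content": instructions},
                {"role": "user", "content": task}]
    full_answer = run_llm(messages)
    return full_answer[:500]                 # keep only a condensed result

def orchestrate(goal: str, subtasks: list[str]) -> str:
    """The lead agent plans, delegates, and synthesizes the summaries."""
    summaries = [run_subagent(t, "Research this subtask in depth.")
                 for t in subtasks]
    synthesis_prompt = [{"role": "system", "content": "Synthesize findings."},
                        {"role": "user",
                         "content": goal + "\n\n" + "\n".join(summaries)}]
    return run_llm(synthesis_prompt)
```
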

What kind of education should children receive to avoid being replaced by AI?

The ability to engage in continuous learning, adapt quickly, demonstrate resilience, understand human nature, and collaborate globally will be key to remaining irreplaceable by AI in the future.

  1. Learning Ability:
    It involves not only acquiring knowledge but also mastering the methods of learning. Fostering critical thinking, problem-solving skills, and the habit of self-directed learning enables children to continuously grow and evolve in an era of rapidly changing information.

  2. Adaptability:
    The capacity to flexibly adjust thinking and behavior in response to rapidly evolving technologies, industries, and social environments. This includes embracing new technologies, coping with uncertainty, and quickly finding one’s place in new contexts.

  3. Resilience:
    The mental strength to recover from failure and keep moving forward. It involves not only withstanding pressure and challenges but also transforming setbacks into opportunities for growth, while maintaining a positive mindset and motivation over the long term.

  4. Understanding Human Needs:
    Cultivating empathy and insight to genuinely understand others’ problems and expectations. This is not only the foundation for creating valuable products and services but also key to demonstrating the irreplaceable value of humans in an era of human-machine coexistence.

  5. Engaging with the World:
    Possessing a global perspective and cross-cultural communication skills to collaborate effectively with people from diverse backgrounds. At the same time, it involves understanding the relationship between society, technology, and ethics, and actively participating in building a responsible and sustainable future.

Claude Memory: A Different Philosophy

The two leading AI assistants, Claude and ChatGPT, have adopted completely opposite strategies in implementing their "memory" functions. This difference profoundly reflects their respective product positioning, target user bases, and design philosophies.

Claude's Memory System: An Explicit, Controllable Tool

Claude's memory function is designed as a tool that users actively invoke, rather than a continuously running background service. Its main characteristics are:

  1. Blank Slate: Each conversation starts from a clean state without preloading any user profiles or history.
  2. Explicit Invocation: Memory activates only when the user makes an explicit request such as "What did we discuss last time?"
  3. Raw History Search: It doesn't create AI-generated user summaries or compressed profiles, but instead recalls information by performing real-time searches through users' raw chat history.
  4. Two Main Search Tools:
    • conversation_search: Searches through all historical records based on keywords or topics.
    • recent_chats: Retrieves conversations based on time ranges (e.g., "the last 10 conversations" or "the last week of November last year").
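
To make the two tools concrete, here is a toy re-implementation over a local chat archive. Only the tool names come from the article; the parameters, data format, and behavior are illustrative guesses, not Claude's actual API.

```python
from datetime import datetime, timedelta

# Toy illustration of explicit, search-based memory over a raw chat archive.
# Tool names are from the article; everything else is hypothetical.

CHATS = [  # each entry: (timestamp, title, full text)
    (datetime(2024, 11, 25), "RAG eval design", "discussed recall metrics"),
    (datetime(2024, 12, 2), "Holiday plans", "flights and dates"),
]

def conversation_search(query: str, limit: int = 5):
    """Keyword/topic search over the raw history (no AI-built profile)."""
    q = query.lower()
    hits = [(ts, title) for ts, title, text in CHATS
            if q in title.lower() or q in text.lower()]
    return hits[:limit]

def recent_chats(days: int = 7, limit: int = 10):
    """Time-range retrieval, e.g. 'conversations from the last week'."""
    cutoff = datetime.now() - timedelta(days=days)
    recent = [(ts, title) for ts, title, _ in CHATS if ts >= cutoff]
    return sorted(recent, reverse=True)[:limit]
```
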
ChatGPT's Memory System: An Implicit, Automatic Experience

In contrast to Claude, ChatGPT's memory function is designed for the mass consumer market, characterized by:

  1. Always-On: The memory function loads automatically without user intervention, providing instant personalized experiences.
  2. User Profiling: The system continuously learns user preferences and patterns to build detailed user profiles.
  3. Pursuit of a "Magical" Experience: The goal is to make the product feel intelligent, thoughtful, and seamless, so users don't need to think about how it works.

This design divergence stems from the two companies' different market strategies:

  • Claude Targets Professional Users: Its user base consists mainly of technical professionals like developers and researchers. These users understand how LLMs work, prefer precise control, and accept the additional latency that comes with invoking memory. For them, memory is a powerful, predictable professional tool where privacy and controllability are crucial.

  • ChatGPT Targets the Mass Market: Its user base includes various ordinary consumers like students and parents. They want a product that works out-of-the-box and is easy to use, automatically remembering their information. This is a typical consumer tech strategy: first attract and retain massive users through a "magical" experience, then explore monetization models later.

The author believes that the two giants taking completely opposite paths indicates that the design space for AI memory functions is extremely vast, with no single correct answer. The optimal solution depends on the product's target users and specific needs. Currently, this field is still in its early exploratory stages ("Cambrian explosion"), with major companies trying different approaches, far from establishing industry standards.

Latest Update: Shortly after the article was published, Anthropic (the company behind Claude) announced a new memory feature for its Team and Enterprise accounts that appears closer to ChatGPT's automatic profiling model. This suggests that AI memory features are evolving at an extremely rapid pace.

Defeating Nondeterminism in LLM Inference

[Diagram: a user's requests are batched nondeterministically with other users' requests before being processed by a deterministic model, producing nondeterministic output]

The nondeterminism of LLM inference is a systemic problem. It arises from the conflict between underlying computational libraries, which are tuned for maximum performance and are therefore sensitive to batch size, and the dynamically changing server load of the real world. A solution exists: enforce the use of batch-invariant computational kernels, though this typically comes at the cost of some performance.

The non-reproducibility (nondeterminism) of LLM inference results is not, as commonly believed, simply a consequence of GPU parallelism combined with floating-point rounding error. The true culprit is the lack of "batch invariance" in core computational kernels, combined with constantly changing server load (i.e., varying batch sizes).

  1. Common Misconception vs. The Facts

    • Common Misconception ("Concurrency + Floating Point" Hypothesis): It is widely believed that because floating-point addition is non-associative (i.e., (a+b)+c ≠ a+(b+c)), and GPUs execute these additions in a non-deterministic parallel order, the results become random.
    • The Facts Pointed Out by the Article: This hypothesis is incomplete. While floating-point non-associativity is the root cause of numerical differences, the vast majority of computational kernels used in LLM inference (the forward pass), such as matrix multiplication, are themselves "run-to-run deterministic": for a fixed batch of inputs, repeated runs produce exactly the same result.
  2. The True Source of Nondeterminism

    • Lack of "Batch Invariance": Although a single computational kernel is deterministic, its result is affected by the batch size. For example, when computing a vector, the numerical result will be slightly different when it is processed alone (batch size=1) versus with thousands of other vectors (batch size=1000). This is because, to optimize performance for different batch sizes, the underlying system uses different computational strategies and instructions, which in turn changes the accumulation order of floating-point numbers.
    • Variable Server Load: From a user's perspective, their requests are dynamically grouped with other users' requests into a batch by the inference server. The server's load changes in real-time, meaning a user's same request might be processed in a batch of size 8 this time, and a batch of size 128 the next time.
    • The Result of the Combination: A computational kernel that lacks "batch invariance" is deployed in a system with non-deterministic batch sizes, which ultimately produces the nondeterminism perceived by the user (the sketch after this list demonstrates both ingredients).
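
Both ingredients can be demonstrated with a small NumPy sketch. This merely simulates the effect by choosing the reduction chunking explicitly; real kernels switch strategies internally for performance reasons.

```python
import numpy as np

# 1) Floating-point addition is not associative.
print((0.1 + 1e20) - 1e20)      # 0.0
print(0.1 + (1e20 - 1e20))      # 0.1

# 2) Simulate a kernel that picks its reduction strategy from the batch size:
#    the same row gets summed in a different order, so its result drifts.
def row_sum(row: np.ndarray, chunk: int) -> np.float32:
    """Sum a row in fixed-size chunks, then sum the partial results."""
    partials = [row[i:i + chunk].sum() for i in range(0, len(row), chunk)]
    return np.float32(sum(partials))

rng = np.random.default_rng(0)
row = rng.standard_normal(4096).astype(np.float32)

small_batch_result = row_sum(row, chunk=4096)   # strategy chosen for batch=1
large_batch_result = row_sum(row, chunk=256)    # strategy chosen for batch=1000
print(small_batch_result == large_batch_result) # typically False in float32
```
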
How to Achieve Deterministic Inference (i.e., Achieve "Batch Invariance")

The article points out that to achieve fully reproducible inference, every computational step in the model must be made batch-invariant, primarily involving these three parts:

  • RMSNorm: Relatively easy to implement. It only requires sticking to one parallelization strategy and avoiding switching to strategies that would change the order of operations, even if it means slightly worse performance on small batches.
  • Matrix Multiplication: More challenging. High-performance matrix multiplication libraries select different Tensor Core instructions or parallel strategies (like Split-K) based on input dimensions. To achieve determinism, one must enforce the use of a single kernel configuration, which sacrifices peak performance at certain dimensions.
  • Attention Mechanism: The most complex. It must be invariant not only to batch size but also to how sequences are processed (e.g., chunked prefill, decoding with a KV Cache). This means that when a token computes its attention, the internal order of operations must be identical regardless of how much context (KV Cache) it has.
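
As a toy illustration of the general idea (not the actual kernels described in the article), the fix is to make each row's reduction order a fixed property of the operation, independent of how many rows happen to share the batch:

```python
import numpy as np

# Toy batch-invariant reduction: every row is always reduced with the same
# fixed chunking and order, so a row's result does not depend on batch size.
FIXED_CHUNK = 256

def batch_invariant_row_sums(batch: np.ndarray) -> np.ndarray:
    """Sum each row in FIXED_CHUNK-sized pieces, always in the same order."""
    n = batch.shape[1]
    out = np.zeros(batch.shape[0], dtype=np.float32)
    for start in range(0, n, FIXED_CHUNK):
        out += batch[:, start:start + FIXED_CHUNK].sum(axis=1, dtype=np.float32)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 4096)).astype(np.float32)       # "batch size 1"
big = np.vstack([x, rng.standard_normal((999, 4096)).astype(np.float32)])

alone = batch_invariant_row_sums(x)[0]
in_big_batch = batch_invariant_row_sums(big)[0]
# Expected True on typical NumPy builds: same row, same reduction order,
# regardless of how many other rows are in the batch.
print(alone == in_big_batch)
```

The cost, as the article notes, is that one fixed strategy cannot be optimal for every batch shape, so some peak performance is sacrificed.
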