Every Agent Session Makes Your Whole Team Smarter

28 April 2026

How Aibuild OS turns individual engineering experience into compounding company-wide knowledge.

The Gap Between Knowing and Doing

Modern AI can describe what a boolean union does and write fluent API code for any major CAD tool. What it doesn’t know is that a boolean union will silently produce garbage geometry if you forgot to apply your object’s scale transform first — like trying to merge two drawings where one was photocopied at 150%: they look correct individually, but the combined geometry is wrong and nothing tells you why.
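
Here is the concrete shape of that failure and its fix in Blender, as a minimal sketch (the object names are placeholders):

    import bpy

    a = bpy.data.objects["PartA"]  # placeholder names
    b = bpy.data.objects["PartB"]

    # The trap: if "a" carries an unapplied scale, the union evaluates
    # against the unscaled mesh data and silently produces wrong geometry.
    # The fix is to bake the scale into the mesh first.
    bpy.ops.object.select_all(action='DESELECT')
    a.select_set(True)
    bpy.context.view_layer.objects.active = a
    bpy.ops.object.transform_apply(location=False, rotation=False, scale=True)

    # Only now is the boolean union safe.
    mod = a.modifiers.new(name="Union", type='BOOLEAN')
    mod.operation = 'UNION'
    mod.object = b
    bpy.ops.object.modifier_apply(modifier=mod.name)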

This kind of knowledge, the gap between what an API says it does and what it actually requires, is what separates an agent that demos well from one that works in production. It takes a human engineer years to internalize, accumulated through hundreds of failed operations and cryptic error messages. It’s almost entirely absent from standard training data.

We build Aibuild OS, a platform where agents control real engineering software through their native scripting APIs. When an agent gets a boolean operation wrong, it’s not an incorrect text response. It’s a failed geometry operation on a live model. The cost of mistakes is high. We needed our agents to learn from experience — and we needed that learning to belong to the organisations using the platform, not leak between them.

Why Generic Memory Doesn’t Work

Most memory systems today store conversations and retrieve similar chunks when needed. This works for general-purpose tasks. It fails for engineering software in three ways.

Tool knowledge doesn’t transfer. Knowing that Blender requires bpy.ops.object.mode_set(mode='OBJECT') before duplicating tells you nothing about Rhino’s tolerance requirements for boolean difference, or how a CAM system expects toolpath parameters to be sequenced. A shared memory pool produces noisy, misleading retrievals.

Workflows are order-dependent. “Apply scale, then set origin, then boolean” isn’t a suggestion in Blender. It’s a hard rule. “Extend the cutter past the target, then boolean diff, then fillet” is a hard rule in Rhino — like cutting through a block of wood where the saw blade has to extend past both faces, not stop halfway through. Stop short and you don’t get two clean pieces. In CAM, operation sequencing can be the difference between a valid toolpath and a collision. Memory needs to capture sequences, not just isolated facts.
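
The Rhino version of that rule, as a minimal rhinoscriptsyntax sketch (the scale factor is illustrative, and it assumes the cutter is a closed solid):

    import rhinoscriptsyntax as rs

    target = rs.GetObject("Select target solid")
    cutter = rs.GetObject("Select cutter solid")

    # Extend the cutter past both faces of the target before subtracting.
    # A cutter that stops flush with a face is the classic cause of a
    # failed or invalid boolean difference.
    centroid = rs.SurfaceVolumeCentroid(cutter)[0]
    rs.ScaleObject(cutter, centroid, (1.0, 1.0, 1.2))

    result = rs.BooleanDifference([target], [cutter])
    if not result:
        print("Difference failed: check that the cutter extends past the target")
    # Fillet edges only after the boolean has succeeded.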

Not all knowledge is equal. An engineer’s expertise operates at three levels: the tool (API quirks), their company’s standards (tolerances, compliance), and their personal habits (preferred units, naming). Mixing these degrades all three.

The Architecture

Our memory system has three layers: per-application knowledge banks, a structured knowledge graph, and a multi-tier retrieval system that blends multiple search strategies.

Isolated knowledge banks. Each application (Blender, Rhino, FreeCAD, Fusion, CadQuery, Abaqus, and others) gets its own knowledge bank with a tuned profile reflecting the nature of its domain. We configure each bank along dimensions like precision sensitivity (how critically to evaluate incoming knowledge), semantic flexibility (how loosely to match queries), and contextual inference (how aggressively to surface related knowledge beyond the literal query).

Simulation tools get maximum precision sensitivity. A wrong memory about boundary conditions produces dangerously misleading results. Visualization tools get higher contextual inference. When you ask about stress distribution, a related tip about effective colour mapping is genuinely helpful. Format conversion gets minimal semantic flexibility. “Does STEP to STL preserve face groups” has one correct answer.
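
The exact knobs are internal, but the idea looks roughly like this (names and values are illustrative, not our production profiles):

    # Hypothetical per-bank tuning profiles; values are illustrative.
    BANK_PROFILES = {
        "abaqus": {               # simulation: a wrong memory is dangerous
            "precision_sensitivity": 1.0,
            "semantic_flexibility": 0.2,
            "contextual_inference": 0.3,
        },
        "blender": {              # visualization: adjacent tips often help
            "precision_sensitivity": 0.6,
            "semantic_flexibility": 0.6,
            "contextual_inference": 0.9,
        },
        "format_conversion": {    # questions here have one correct answer
            "precision_sensitivity": 0.9,
            "semantic_flexibility": 0.1,
            "contextual_inference": 0.2,
        },
    }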

Three-tier retrieval with priority. Before every task, we retrieve from three scoped sources in parallel:

  • Organisation standards: tolerances, material specs, compliance rules. Highest authority.
  • User preferences: personal workflow habits, naming conventions, defaults.
  • Engineering knowledge: patterns, workarounds, and proven sequences learned from past sessions. The richest source, but the lowest priority; it is verified against current context before being applied.

This priority ordering prevents a well-learned API workaround from contradicting an organisation’s engineering standard.
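
A minimal sketch of that merge, with hypothetical record shapes (the real system retrieves the three tiers in parallel and is considerably more involved):

    def merge_by_priority(org_hits, user_hits, engineering_hits):
        """Earlier tiers win: a learned workaround never overrides a standard."""
        merged, seen = [], set()
        for tier in (org_hits, user_hits, engineering_hits):
            for hit in tier:
                if hit["topic"] not in seen:
                    merged.append(hit)
                    seen.add(hit["topic"])
        return merged

    context = merge_by_priority(
        org_hits=[{"topic": "tolerance", "text": "Hole tolerance is H7"}],
        user_hits=[{"topic": "units", "text": "Prefers metric"}],
        engineering_hits=[{"topic": "tolerance", "text": "Loosen tolerance to 0.01"}],
    )
    # The learned "tolerance" workaround is dropped: the org standard wins.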

Hybrid retrieval. Our retrieval layer blends multiple approaches, each contributing a different signal: conceptual similarity search finds related problems; exact keyword matching catches specific API names and error codes; graph traversal follows structural relationships between connected knowledge; and temporal ranking favours recent learnings when APIs change between software versions. The blend is tuned per-bank.
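
As a sketch, the blended score for a candidate memory might look like this. The weights come from the per-bank profile; the semantic and graph signals are stubbed as stored fields here, since the real ones come from an embedding index and the knowledge graph:

    import math, time

    def keyword_match(query, memory):
        # Exact-token overlap catches API names and error codes.
        q, m = set(query.lower().split()), set(memory["text"].lower().split())
        return len(q & m) / max(len(q), 1)

    def recency_boost(memory, half_life_days=180.0):
        # Favour recent learnings when APIs change between versions.
        age_days = (time.time() - memory["created_at"]) / 86400.0
        return math.exp(-age_days / half_life_days)

    def hybrid_score(query, memory, w):
        return (w["semantic"] * memory["semantic_sim"]
                + w["keyword"] * keyword_match(query, memory)
                + w["graph"] * memory["graph_proximity"]
                + w["recency"] * recency_boost(memory))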

From Conversations to Knowledge

After every successful conversation where the agent used tools, we analyze the full exchange and extract reusable knowledge. The bar is high: only store what would help a different agent succeed at a similar task in a future conversation.

We extract at two layers. At the code level, we capture API calls that failed and the exact fix, including the error message and root cause, along with surprising parameter requirements, mode prerequisites, and workarounds. Always with the mechanism, not just the outcome. At the workflow level, we capture operation sequences that prevent failures, geometry preparation steps, and when to use one approach over another.
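
As a sketch, the two record shapes might look like this (field names are illustrative):

    from dataclasses import dataclass, field

    @dataclass
    class CodeLevelMemory:
        api_call: str                  # the failing call
        error_message: str
        root_cause: str                # the mechanism, not the symptom
        fix: str                       # the exact working call or parameter
        prerequisites: list = field(default_factory=list)

    @dataclass
    class WorkflowMemory:
        operation: str                 # e.g. a boolean difference
        sequence: list                 # order matters, so store the order
        applies_when: str              # conditions under which it holds
        prevents: str                  # the failure this sequence avoids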

Everything else is discarded. “The agent created a 200mm bottle” is a fact about one conversation. “Boolean operations require all transforms to be applied first” is knowledge. We only keep knowledge.

Early in development, a permissive extraction approach quickly filled memory with noise. We’d rather store zero memories from a clean session than one low-quality memory that pollutes future recalls.

The Knowledge Graph: Facts Become Understanding

Patterns across memories are more useful than individual memories. This is where the system starts to behave like accumulated expertise.

Every extracted memory gets classified with structured labels that describe what kind of knowledge it is and what it relates to. For failures, we classify the root cause of the problem and how it was resolved. Root causes aren’t symptoms. We classify the underlying issue, not the error message. A failed boolean and a null return from a geometry operation might look different on the surface but both trace back to invalid geometry as the root cause. For workflow knowledge, we classify the operation type and the conditions under which a particular sequence or approach applies. For preferences and standards, we classify scope and context so retrieval surfaces the right knowledge at the right time.

These two axes construct the graph. Memories that share a root cause get linked regardless of which application produced them or how the error was worded. Memories that share a resolution get linked too, because the same class of fix often applies across superficially unrelated problems. The graph edges aren’t hand-authored. They emerge from classification.
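
A small sketch of edges emerging from classification (labels and structure are illustrative):

    from collections import defaultdict
    from itertools import combinations

    memories = [
        {"id": 1, "app": "blender", "root_cause": "invalid_geometry", "resolution": "repair_mesh"},
        {"id": 2, "app": "rhino",   "root_cause": "invalid_geometry", "resolution": "rebuild_surface"},
        {"id": 3, "app": "freecad", "root_cause": "stale_reference",  "resolution": "repair_mesh"},
    ]

    def derive_edges(memories, axes=("root_cause", "resolution")):
        # No hand-authored links: any two memories sharing a label on
        # either axis get an edge, regardless of application or wording.
        edges = set()
        for axis in axes:
            groups = defaultdict(list)
            for m in memories:
                groups[m[axis]].append(m["id"])
            for ids in groups.values():
                edges.update(combinations(sorted(ids), 2))
        return edges

    print(derive_edges(memories))  # {(1, 2), (1, 3)}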

Over time, the system synthesizes higher-order observations from clusters of related facts. Individual memories like “apply scale before boolean,” “set origin before boolean,” and “seal any gaps in the geometry before combining shapes, otherwise the software can’t tell what’s inside and what’s outside” get synthesized into a broader principle: boolean operations have strict geometry preconditions, and most failures trace back to skipping a preparation step. An agent retrieving a principle understands the category of problem it’s dealing with and can reason about cases it hasn’t seen before. That’s the difference between an agent that patches errors and one that avoids them.

The graph also handles knowledge going stale. When Blender 4.x changed how object transforms are applied, memories built on the old API behaviour became actively wrong. When a user shifts from one approach to another, older memories can start contradicting newer ones. A memory that read “always use Laplacian smoothing for mesh healing” gets superseded when the user consistently switches to Shrinkwrap instead — the updated memory becomes “user prefers Shrinkwrap for mesh healing; previously used Laplacian but switched due to better preservation of sharp features on thin-walled geometry.” The prior approach isn’t just overwritten. The reasoning behind the change is preserved alongside the new preference.
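
In storage terms, the superseded record might look like this (IDs and fields are illustrative):

    # The old memory is superseded, not deleted, and the reasoning for
    # the change travels with the new preference.
    old = {
        "id": "mem_0412",  # hypothetical IDs
        "text": "Always use Laplacian smoothing for mesh healing",
        "status": "superseded",
        "superseded_by": "mem_0897",
    }
    new = {
        "id": "mem_0897",
        "text": "User prefers Shrinkwrap for mesh healing",
        "supersedes": "mem_0412",
        "reason": "Better preservation of sharp features on thin-walled geometry",
    }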

This matters more than it might seem. An agent that only stores the current preference will apply it blindly. An agent that stores the reasoning can apply judgement: if Shrinkwrap is unavailable in a given context, it understands why the user moved away from Laplacian and can make an informed call rather than silently falling back to a method the user has already decided isn’t good enough. The memory carries the decision, not just the outcome.

Agents That Learn in Real Time

Agents interact with their own knowledge base during a task, and this is where the system does something we haven’t seen elsewhere.

When an agent hits an error during tool execution (a failed boolean, a script exception, a geometry validation failure), most systems retry with slightly modified parameters or surface the error to the user. Ours does something different. Before the agent retries, the system automatically queries the full knowledge base with the error as context, drawing on every failure that has ever occurred across every user working with that tool within your organisation. If we’ve seen this failure before, the solution comes back immediately and specifically. One agent’s hard lesson becomes every agent’s first instinct.
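
Conceptually, the hook is simple; here is a minimal sketch, with the search and retry plumbing passed in as hypothetical callables:

    class ToolError(Exception):
        pass

    def run_tool_with_recall(tool, args, search_failures, retry):
        try:
            return tool(**args)
        except ToolError as err:
            # Query every failure previously seen for this tool in this
            # organisation, with the live error as context, before retrying.
            hits = search_failures(tool=tool.__name__, error=str(err))
            if hits:
                return retry(tool, args, guidance=hits[0])
            raise  # a genuinely novel failure surfaces, and extraction learns from it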

Beyond automatic error recall, agents can initiate a deep search against the knowledge base when they encounter genuinely novel situations mid-task. This isn’t a similarity search over stored text. It’s a full reasoning pass over the knowledge graph: the system traverses connected memories, follows relationships between root causes and resolutions across different applications and sessions, and synthesizes an answer from the structure of what it knows. An agent three operations into a complex workflow, hitting an edge case it has never encountered, can surface a relevant insight from a session months ago with a different user on a different tool, all because the graph connected them through a shared underlying pattern.

Agents can also capture knowledge explicitly during a conversation, scoped to either the organisation or the individual user. A user mentioning “we always use metric” once, casually, mid-task, becomes a permanent organisational standard that shapes every future session for every agent working with that team. A throwaway comment becomes institutional memory. Nothing needs to be filed, tagged, or remembered by a human.
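
The capture tool the agent calls might look like this, as a sketch (the signature is hypothetical):

    def remember(store, text, scope, owner_id):
        """Persist an explicit memory at organisation or user scope."""
        assert scope in ("organisation", "user")
        store.append({"text": text, "scope": scope, "owner": owner_id})

    kb = []
    # "we always use metric", said once in passing, becomes a standard:
    remember(kb, "All dimensions in metric units", scope="organisation", owner_id="org_a")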

This has a compounding effect that matters at the org level. Most engineering teams have one or two people who really know a tool: the person who learned the hard way that a particular geometry pattern causes failures downstream, or who figured out the right sequencing for a complex operation nobody else has touched. That knowledge usually lives in one person’s head and dies when they move on. With Aibuild OS, every session those engineers run is quietly extracting what they know and making it available to everyone else. A junior engineer using the platform gets the benefit of patterns and best practices they have never encountered, automatically, without knowing where they came from.

Isolation by Design

The system described above searches broadly by design. It’s worth being precise about what “broadly” means, because in a multi-tenant environment the boundaries matter.

There are three distinct layers of knowledge, and they never mix.

Org-level and user-level knowledge is tagged with an organisation ID at write time and scoped at retrieval time. One organisation’s memories are invisible to another’s, regardless of which tool or agent is involved. User-level knowledge is isolated to the individual and never surfaced at org level. Agent banks are shared infrastructure but carry the same org ID tagging, so retrieval is always scoped to the calling organisation.

The only knowledge that crosses organisational boundaries is global knowledge: a curated set of verified engineering facts derived from our own benchmarks and internal testing. These are general truths about how tools behave, with no org-specific context, no user data, and no client IP.

In practice: an agent working for Organisation A will retrieve that organisation’s standards, that user’s preferences, and global engineering knowledge. It will never see Organisation B’s tolerances, workarounds, or naming conventions.
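
In code terms, every retrieval passes through a hard scope filter applied by the platform, never left to the agent. A sketch with illustrative fields:

    def visible_memories(memories, org_id, user_id):
        # Hard scope filter, applied before any ranking. Global knowledge
        # is curated and carries no org-specific context or client IP.
        return [
            m for m in memories
            if m["scope"] == "global"
            or (m["scope"] == "organisation" and m.get("org_id") == org_id)
            or (m["scope"] == "user" and m.get("user_id") == user_id)
        ]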

What We’ve Learned

Building this system taught us that the hard part of memory isn’t retrieval. It’s curation. We spent more time on tuning what to discard than on any other component. The difference between a useful memory system and a cluttered one is entirely in the extraction quality.

We also learned that domain-specific tuning isn’t optional. The same retrieval configuration that works for scripting knowledge is wrong for simulation parameters. One size fits none.

Every lesson learned by one agent is immediately available to every other agent working with the same tool, across your entire organisation. The knowledge compounds. Teams that use Aibuild OS longer don’t just have more capable agents — they have a proprietary body of engineering knowledge that gets more accurate and more valuable with every session. That’s an asset no competitor can replicate from scratch.

Automation with agents across engineering tools is a memory problem as much as it is a reasoning problem. We built our memory system to treat it that way.