Architecture Overview
KAPEX is a memory middleware layer that sits between your application and any LLM provider. It intercepts conversations, builds a persistent memory graph, and injects the most relevant context into every prompt -- giving your LLM long-term memory without modifying the model itself.
High-Level Data Flow
Your App                   KAPEX                         LLM Provider
   |                         |                                 |
   |--- user message ------->|                                 |
   |                         |-- retrieve scored memories      |
   |                         |-- build context block           |
   |                         |-- inject into system prompt     |
   |                         |--- augmented prompt ----------->|
   |                         |<-- LLM response ----------------|
   |<-- response ------------|                                 |
   |                         |                                 |
   |                         |== async (user never waits) =====|
   |                         |-- extract entities              |
   |                         |-- score & place in hierarchy    |
   |                         |-- propagate salience            |
   |                         |-- update decay rates            |
KAPEX operates as a transparent proxy. Your application sends messages to KAPEX instead of directly to the LLM. KAPEX enriches the prompt with relevant memories, forwards it to the LLM, returns the response immediately, and then processes the conversation asynchronously to update the memory graph.
Sync Read Path (User-Facing)
The read path executes synchronously -- the user waits for the response, so speed matters. No graph writes happen here.
- Receive query -- your app sends the user's message to KAPEX
- Retrieve scored memories -- KAPEX queries the memory graph for the most relevant nodes using three-channel retrieval (see Three-Channel Retrieval)
- Build context block -- selected memories are formatted with confidence-gated framing and assembled into a context block
- Inject into LLM prompt -- the context block is inserted into the system prompt alongside your application's instructions
- Generate response -- the augmented prompt is sent to the LLM (Amazon Bedrock by default)
- Return response -- the LLM response is returned to your application
The read path typically adds under 200 ms on top of the LLM's own latency.
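The read-path steps above can be sketched as a single function. This is a minimal illustration, not KAPEX's actual code: the names `Memory`, `build_context_block`, and `handle_query`, the "Relevant memories:" header, and the two injected callables are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Memory:
    text: str        # memory text, already framed for the prompt
    salience: float  # current salience score


def build_context_block(memories: List[Memory]) -> str:
    """Format retrieved memories as one block of prompt text."""
    if not memories:
        return ""
    return "Relevant memories:\n" + "\n".join(f"- {m.text}" for m in memories)


def handle_query(
    user_message: str,
    system_prompt: str,
    retrieve: Callable[[str], List[Memory]],   # hypothetical graph lookup
    call_llm: Callable[[str, str], str],       # hypothetical provider call
) -> str:
    """Run the read path: retrieve, build context, inject, generate, return."""
    memories = retrieve(user_message)          # read-only graph query
    context = build_context_block(memories)
    # The context block sits alongside the application's own instructions.
    augmented = f"{system_prompt}\n\n{context}" if context else system_prompt
    return call_llm(augmented, user_message)   # no graph writes on this path
```

Passing `retrieve` and `call_llm` in as callables keeps the sketch independent of any particular database or LLM provider.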
Async Write Path (Background)
After the response is returned to the user, KAPEX processes the conversation in the background. The user never waits for any of these steps:
- Entity extraction -- identify people, places, projects, and concepts mentioned in the conversation using a three-tier NER pipeline (alias matching, pattern matching, LLM classification)
- Hierarchy placement -- place extracted entities into the appropriate domain, create facet nodes for specific details, and scaffold interest nodes for inferred preferences
- Salience scoring -- compute a base score for each new memory node from five linguistic signals
- Salience propagation -- propagate scores upward through the hierarchy (entity scores influence domain scores)
- Cross-domain intelligence -- track co-activation patterns across domains (e.g., "work stress" and "sleep problems" appearing together)
- Temporal patterns -- record when and how often topics are discussed
- Cache invalidation -- invalidate cached retrieval results so the next query reflects the latest state
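One way to keep these steps off the request path is to hand them to a background worker once the response is ready. The sketch below uses Python's standard `ThreadPoolExecutor`. How KAPEX actually schedules this work is not stated here, so the function names, the step list, and the choice of a thread pool are assumptions.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List, Tuple

# Hypothetical worker pool for background processing.
_executor = ThreadPoolExecutor(max_workers=2)

Step = Tuple[str, Callable[[str, str], None]]


def process_conversation(user_message: str, response: str, steps: List[Step]) -> List[str]:
    """Run each write-path step in order and return the names that completed."""
    completed = []
    for name, step in steps:
        step(user_message, response)
        completed.append(name)
    return completed


def respond_then_enrich(user_message: str, response: str, steps: List[Step]) -> Future:
    """Schedule the write path without blocking the caller.

    The caller can return `response` to the user immediately; the Future
    is only needed for monitoring or tests.
    """
    return _executor.submit(process_conversation, user_message, response, steps)
```

In a real deployment a durable job queue may be preferable to an in-process pool, since work submitted this way is lost if the process restarts.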
Memory Hierarchy
KAPEX organizes memories into a hierarchy of five node types, using PostgreSQL's ltree extension for efficient tree operations:
Domain (life area)
|
+-- Entity (named person, place, or concept)
|   |
|   +-- Facet (specific detail about the entity)
|   |
|   +-- Theme (recurring pattern involving the entity)
|   |
|   +-- Interest (inferred preference, promoted over time)
|
+-- Entity
|   +-- Facet
|   +-- Facet
...
Node Types
| Type | Description | Example |
|---|---|---|
| Domain | Broad life area or topic category | work, family, health, hobbies |
| Entity | A specific named person, place, project, or concept | Ciana, KAPEX, Dr. Martinez |
| Facet | A concrete detail or fact about an entity | Ciana's birthday is March 12, KAPEX uses PostgreSQL |
| Theme | A recurring pattern detected within a domain | work-life balance concerns, weekend cooking experiments |
| Interest | An inferred preference, promoted from ephemeral to persistent over time | enjoys sci-fi novels, prefers morning meetings |
Domains are created automatically as topics emerge in conversation. Entities are extracted and classified by gravity (how central they are to the user's life). Facets capture atomic facts. Themes and interests are inferred over time from patterns in the conversation history.
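To make the ltree idea concrete, the sketch below models dot-separated label paths in plain Python and mimics the ancestor/descendant test that ltree's `<@` operator provides. The label-sanitizing rule and the helper names are assumptions for illustration. KAPEX's actual path scheme is not documented here.

```python
import re
from typing import Iterable, List


def to_label(name: str) -> str:
    """Turn a display name into a conservative ltree-style label (assumed scheme)."""
    return re.sub(r"[^A-Za-z0-9]+", "_", name.strip().lower()).strip("_")


def make_path(*names: str) -> str:
    """Join labels with dots, e.g. domain.entity.facet."""
    return ".".join(to_label(n) for n in names)


def is_descendant(path: str, ancestor: str) -> bool:
    """True if `path` is `ancestor` or lies beneath it (like ltree's `path <@ ancestor`)."""
    p, a = path.split("."), ancestor.split(".")
    return p[: len(a)] == a   # compare whole labels, not string prefixes


def subtree(paths: Iterable[str], root: str) -> List[str]:
    """Return every path at or below `root`."""
    return [p for p in paths if is_descendant(p, root)]
```

For example, `make_path("family", "Ciana", "birthday is March 12")` yields `family.ciana.birthday_is_march_12`, and `subtree(paths, "family.ciana")` returns that entity together with all of its facets.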
Storage Layer
KAPEX uses PostgreSQL as its primary data store:
- Memory nodes -- stored in a hierarchical table with ltree paths for efficient ancestor/descendant queries
- Edges -- relationships between nodes (e.g., "Ciana" is connected to "birthday party planning")
- Salience scores -- current and historical scores stored per node
- Processing history -- tracks how many times each memory has been accessed or discussed
Additional infrastructure:
- Redis (ElastiCache) -- caches hot retrieval results and session state
- pgvector (planned) -- vector similarity search for semantic retrieval
Three-Channel Retrieval
When a query arrives, KAPEX retrieves memories through three independent channels, each optimized for a different signal:
+---------------------+---------------------+------------------------+
| Channel 1: SALIENCE | Channel 2: RECENCY  | Channel 3: CONSTRAINTS |
|                     |                     |                        |
| Top-K nodes by      | Sliding 72-hour     | Always-inject          |
| salience score      | window of recent    | guardrail nodes        |
|                     | memories            | (safety, boundaries)   |
|                     |                     |                        |
| ~55% token budget   | ~35% token budget   | ~10% token budget      |
+----------+----------+----------+----------+-----------+------------+
           |                     |                      |
           +---------------------+----------------------+
                                 |
                           Token Assembly
                        (6000 token budget)
- Salience channel -- retrieves the highest-scored memories across the entire graph. These are the memories KAPEX considers most important right now. Only nodes above the 0.25 injection threshold are eligible.
- Recency channel -- retrieves memories from the last 72 hours. Candidates are not ranked by salience; they only need to clear a minimum salience floor. This ensures the LLM knows about recent conversations even if they haven't yet accumulated high salience.
- Constraint channel -- injects safety-critical context that must always be present: user-disclosed sensitivities, boundaries, safety pins, and trigger avoidance directives.
The channels are assembled into a single context block that fits within a configurable token budget (default 6000 tokens).
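The budget split can be sketched as follows. The 55/35/10 shares and the 6000-token default come from the diagram above, where they are given as approximate. The whitespace token estimate, the greedy fill, and the ordering of sections in the final block are assumptions for illustration only.

```python
from typing import Dict, List

# Approximate shares of the token budget per channel, in percent.
CHANNEL_PERCENT = {"salience": 55, "recency": 35, "constraints": 10}


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: count whitespace-separated words."""
    return len(text.split())


def channel_budgets(total: int = 6000) -> Dict[str, int]:
    """Split the total token budget across the three channels."""
    return {name: total * pct // 100 for name, pct in CHANNEL_PERCENT.items()}


def fill_channel(candidates: List[str], budget: int) -> List[str]:
    """Greedily take candidates (assumed sorted best-first) that still fit."""
    chosen, used = [], 0
    for text in candidates:
        cost = estimate_tokens(text)
        if used + cost <= budget:
            chosen.append(text)
            used += cost
    return chosen


def assemble_context(channels: Dict[str, List[str]], total: int = 6000) -> str:
    """Build one context block; constraints go first in this sketch."""
    budgets = channel_budgets(total)
    sections = []
    for name in ("constraints", "salience", "recency"):
        picked = fill_channel(channels.get(name, []), budgets[name])
        if picked:
            sections.append(f"[{name}]\n" + "\n".join(picked))
    return "\n\n".join(sections)
```

With the default budget, `channel_budgets()` gives 3300 tokens to salience, 2100 to recency, and 600 to constraints.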
Confidence-Gated Framing
Not all memories are equally reliable. KAPEX assigns a confidence score to each retrieved node and frames it accordingly before injecting it into the LLM prompt:
| Confidence Level | Score Range | Framing Style | Example |
|---|---|---|---|
| Assert | 0.75 and above | Stated as fact | "The user's sister is named Ciana." |
| Hedged | 0.40 -- 0.74 | Qualified language | "The user has mentioned someone named Ciana, possibly a sibling." |
| Hook | Below 0.40 | Invitation to confirm | "The user may have discussed someone named Ciana -- consider asking to confirm." |
This prevents the LLM from asserting uncertain memories as established facts, reducing hallucination and improving user trust.
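The thresholds in the table map directly onto a small function. In the sketch below the cut-offs (0.75 and 0.40) are taken from the table, while the function names and the exact framing templates are hypothetical. The examples in the table suggest the real system rewrites each statement more naturally than a fixed prefix can.

```python
def confidence_level(score: float) -> str:
    """Map a confidence score to a framing level using the table's thresholds."""
    if score >= 0.75:
        return "assert"
    if score >= 0.40:
        return "hedged"
    return "hook"


def frame_memory(statement: str, score: float) -> str:
    """Wrap a memory statement in framing that matches its confidence."""
    level = confidence_level(score)
    if level == "assert":
        return statement                      # stated as fact
    if level == "hedged":
        return f"Possibly: {statement}"       # qualified language
    # Low confidence: invite the LLM to confirm rather than assert.
    return f"Unconfirmed, consider asking the user to confirm: {statement}"
```

For instance, `frame_memory("The user's sister is named Ciana.", 0.9)` returns the statement unchanged, while a score of 0.5 prefixes it with "Possibly:".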
Safety Architecture
KAPEX includes a multi-layer safety pipeline that runs on every request:
- Crisis detection -- lexical scanning and escalation tracking to identify distress signals
- Safety pins -- persistent, zero-decay nodes that always inject critical safety context (e.g., disclosed triggers, safety plans)
- Trigger avoidance -- extracted trigger words are tracked and the LLM is instructed to avoid them
- PII scrubbing -- extreme PII (SSNs, credit cards, bank accounts) is detected and redacted before storage
- Memory validation -- a post-generation check that the LLM's response does not cite memories absent from the graph
- Topic suppression -- users can request that specific topics never be stored or retrieved
Safety runs identically regardless of which memory configuration is active. It is not optional and cannot be disabled per-user.
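As an illustration of the PII scrubbing step, the sketch below redacts US Social Security numbers and card numbers with regular expressions, using the standard Luhn checksum to avoid redacting arbitrary digit runs. The document does not describe how KAPEX detects PII, so the patterns, placeholder strings, and function names here are assumptions, and bank account numbers are not covered.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Standard Luhn checksum used by payment card numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:        # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scrub_pii(text: str) -> str:
    """Redact SSNs and Luhn-valid card numbers before the text is stored."""
    text = SSN_RE.sub("[REDACTED-SSN]", text)

    def replace_card(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        return "[REDACTED-CARD]" if luhn_valid(digits) else match.group()

    return CARD_RE.sub(replace_card, text)
```

Pattern matching of this kind produces both false positives and false negatives, so it is best treated as one layer among several rather than a complete detector.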
Deployment Architecture
Client App
|
v
CloudFront (TLS)
|
v
EC2 (gunicorn + Flask)
|
+-- PostgreSQL (RDS) -- memory graph, scores, edges
|
+-- Redis (ElastiCache) -- session cache, retrieval cache
|
+-- Amazon Bedrock -- LLM inference
|
+-- Brave Search API -- real-time web search (optional)
KAPEX runs as a single Flask application behind gunicorn on EC2, with PostgreSQL on RDS for persistent storage and Redis on ElastiCache for caching. LLM inference is handled by Amazon Bedrock (Claude models by default). All traffic is encrypted in transit (TLS 1.2+) and data is encrypted at rest (AES-256).
Background Jobs
KAPEX runs several scheduled background jobs to maintain the memory graph:
| Job | Frequency | Purpose |
|---|---|---|
| Decay engine | Configurable (hours) | Apply temporal decay to all memory nodes |
| Enrichment queue | Every 30 seconds | Process async enrichment tasks (entity details, themes, embeddings) |
| Cross-domain detection | Every 6 hours | Detect correlations between domains |
| Stale knowledge check | Daily | Mark outdated knowledge nodes for review |
| Embedding backfill | Daily | Compute embeddings for nodes missing them |
| Memory compression | Daily | Compress old, low-salience memories into summaries |
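The decay job can be pictured with a simple exponential model. The decay formula KAPEX uses is not given in this document, so the exponential form, the per-hour rate, and the `Node` fields below are assumptions. The only detail taken from the text is that safety pins have zero decay.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    salience: float    # current salience score
    decay_rate: float  # assumed per-hour rate; 0.0 for zero-decay safety pins


def apply_decay(node: Node, hours_elapsed: float) -> float:
    """Return the node's salience after `hours_elapsed` of exponential decay."""
    return node.salience * math.exp(-node.decay_rate * hours_elapsed)


def half_life_to_rate(half_life_hours: float) -> float:
    """Convert a half-life in hours to the matching per-hour decay rate."""
    return math.log(2) / half_life_hours
```

Under this model a node with a 24-hour half-life drops to half its salience after one day, while a node with `decay_rate=0.0` keeps its score indefinitely.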