Architecture Overview

KAPEX is a memory middleware layer that sits between your application and any LLM provider. It intercepts conversations, builds a persistent memory graph, and injects the most relevant context into every prompt -- giving your LLM long-term memory without modifying the model itself.

High-Level Data Flow

Your App                    KAPEX                         LLM Provider
  |                          |                               |
  |--- user message -------->|                               |
  |                          |-- retrieve scored memories    |
  |                          |-- build context block         |
  |                          |-- inject into system prompt   |
  |                          |--- augmented prompt --------->|
  |                          |<-- LLM response --------------|
  |<-- response -------------|                               |
  |                          |                               |
  |                          |== async (user never waits) ===|
  |                          |-- extract entities            |
  |                          |-- score & place in hierarchy  |
  |                          |-- propagate salience          |
  |                          |-- update decay rates          |

KAPEX operates as a transparent proxy. Your application sends messages to KAPEX instead of directly to the LLM. KAPEX enriches the prompt with relevant memories, forwards it to the LLM, returns the response immediately, and then processes the conversation asynchronously to update the memory graph.
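The split between the synchronous response and the deferred graph update can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not KAPEX's actual code: `handle_message`, `retrieve_context`, `write_worker`, and the in-process queue are hypothetical names, and a production deployment would more likely use a durable job queue than a thread.

```python
import queue
import threading
from typing import Callable, List, Tuple

# Hypothetical stand-ins for illustration; none of these names come from KAPEX.
write_queue: "queue.Queue[Tuple[str, str]]" = queue.Queue()
processed: List[Tuple[str, str]] = []  # what the background worker has handled


def retrieve_context(message: str) -> str:
    """Placeholder for memory retrieval; returns a fixed context block."""
    return "[memory] user prefers morning meetings"


def handle_message(message: str, call_llm: Callable[[str, str], str]) -> str:
    """Sync path: enrich the prompt, call the LLM, return the reply.

    The graph update is only queued here, so no graph write happens
    while the user is waiting.
    """
    context = retrieve_context(message)
    response = call_llm(context, message)
    write_queue.put((message, response))
    return response


def write_worker() -> None:
    """Async path: drain the queue and update the memory graph."""
    while True:
        message, response = write_queue.get()
        # Real code would extract entities, score, and propagate here.
        processed.append((message, response))
        write_queue.task_done()
```

Starting `write_worker` on a daemon thread (`threading.Thread(target=write_worker, daemon=True).start()`) and then calling `handle_message` returns the reply as soon as the LLM call finishes; the worker picks up the queued pair afterwards.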

Sync Read Path (User-Facing)

The read path executes synchronously -- the user waits for the response, so speed matters. No graph writes happen here.

Receive query -- your app sends the user's message to KAPEX
Retrieve scored memories -- KAPEX queries the memory graph for the most relevant nodes using three-channel retrieval (see below)
Build context block -- selected memories are formatted with confidence-gated framing and assembled into a context block
Inject into LLM prompt -- the context block is inserted into the system prompt alongside your application's instructions
Generate response -- the augmented prompt is sent to the LLM (Amazon Bedrock by default)
Return response -- the LLM response is returned to your application

KAPEX's own read-path work (retrieval, context assembly, and injection) typically adds less than 200 ms on top of the LLM's latency.

Async Write Path (Background)

After the response is returned to the user, KAPEX processes the conversation in the background. The user never waits for any of these steps:

Entity extraction -- identify people, places, projects, and concepts mentioned in the conversation using a three-tier NER pipeline (alias matching, pattern matching, LLM classification)
Hierarchy placement -- place extracted entities into the appropriate domain, create facet nodes for specific details, and scaffold interest nodes for inferred preferences
Salience scoring -- compute a base score for each new memory node from five linguistic signals
Salience propagation -- propagate scores upward through the hierarchy (entity scores influence domain scores)
Cross-domain intelligence -- track co-activation patterns across domains (e.g., "work stress" and "sleep problems" appearing together)
Temporal patterns -- record when and how often topics are discussed
Cache invalidation -- invalidate cached retrieval results so the next query reflects the latest state
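The three-tier NER pipeline in the entity extraction step can be pictured as a cascade that tries the cheapest tier first and only falls back to the LLM for mentions nothing else has claimed. The sketch below is an assumption about how such a cascade could look: the alias table, the title pattern, and the `llm_classify` callback are all hypothetical.

```python
import re
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical data: a known-alias table and one illustrative pattern.
ALIASES: Dict[str, str] = {"ciana": "person", "kapex": "project"}
TITLE_PATTERN = re.compile(r"\b(?:Dr|Mr|Ms|Mrs)\.\s+[A-Z][a-z]+")


def extract_entities(
    text: str,
    llm_classify: Optional[Callable[[str], Optional[str]]] = None,
) -> List[Tuple[str, str, str]]:
    """Return (mention, type, tier) tuples, trying cheap tiers first."""
    found: List[Tuple[str, str, str]] = []
    claimed: Set[str] = set()  # lowercase words covered by an earlier tier

    # Tier 1: alias matching against already-known entities.
    for word in re.findall(r"[A-Za-z]+", text):
        kind = ALIASES.get(word.lower())
        if kind and word.lower() not in claimed:
            found.append((word, kind, "alias"))
            claimed.add(word.lower())

    # Tier 2: pattern matching for mentions with a recognizable shape.
    for match in TITLE_PATTERN.finditer(text):
        mention = match.group(0)
        words = {w.lower() for w in re.findall(r"[A-Za-z]+", mention)}
        if not words & claimed:
            found.append((mention, "person", "pattern"))
            claimed |= words

    # Tier 3: LLM classification, only for capitalized words still unclaimed.
    if llm_classify is not None:
        for word in re.findall(r"\b[A-Z][a-z]+\b", text):
            if word.lower() in claimed:
                continue
            kind = llm_classify(word)  # None means "not an entity"
            if kind:
                found.append((word, kind, "llm"))
            claimed.add(word.lower())
    return found
```

Ordering the tiers this way keeps the expensive LLM call off the common path: a mention that matches a known alias or a pattern never reaches the third tier.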

Memory Hierarchy

KAPEX organizes memories into a hierarchy of five node types, using PostgreSQL's ltree extension for efficient tree operations:

Domain (life area)
  |
  +-- Entity (named person, place, or concept)
  |     |
  |     +-- Facet (specific detail about the entity)
  |     |
  |     +-- Theme (recurring pattern involving the entity)
  |     |
  |     +-- Interest (inferred preference, promoted over time)
  |
  +-- Entity
  |     +-- Facet
  |     +-- Facet
  ...

Node Types

Type	Description	Example
Domain	Broad life area or topic category	`work`, `family`, `health`, `hobbies`
Entity	A specific named person, place, project, or concept	`Ciana`, `KAPEX`, `Dr. Martinez`
Facet	A concrete detail or fact about an entity	`Ciana's birthday is March 12`, `KAPEX uses PostgreSQL`
Theme	A recurring pattern detected within a domain	`work-life balance concerns`, `weekend cooking experiments`
Interest	An inferred preference, promoted from ephemeral to persistent over time	`enjoys sci-fi novels`, `prefers morning meetings`

Domains are created automatically as topics emerge in conversation. Entities are extracted and classified by gravity (how central they are to the user's life). Facets capture atomic facts. Themes and interests are inferred over time from patterns in the conversation history.
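To make the ltree idea concrete, here is a small sketch of how node names could map to dot-separated label paths and how an ancestor/descendant test behaves. The helper names are hypothetical, and restricting labels to letters, digits, and underscores is a deliberately conservative assumption rather than a statement about KAPEX's actual naming scheme.

```python
import re


def to_label(name: str) -> str:
    """Normalize a node name into a conservative ltree label."""
    label = re.sub(r"[^A-Za-z0-9]+", "_", name.strip().lower()).strip("_")
    if not label:
        raise ValueError(f"cannot build a label from {name!r}")
    return label


def build_path(*names: str) -> str:
    """Join normalized labels from the root down, e.g. domain, entity, facet."""
    return ".".join(to_label(n) for n in names)


def is_descendant(path: str, ancestor: str) -> bool:
    """Mirror ltree's `path <@ ancestor` operator in plain Python.

    True for the ancestor itself and anything below it. The comparison is
    label by label, so "work.kapex2" is not under "work.kapex".
    """
    p, a = path.split("."), ancestor.split(".")
    return p[: len(a)] == a
```

For example, `build_path("work", "KAPEX", "uses PostgreSQL")` yields a three-label path (domain, entity, facet), and `is_descendant` answers the same question a `<@` filter would answer inside PostgreSQL.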

Storage Layer

KAPEX uses PostgreSQL as its primary data store:

Memory nodes -- stored in a hierarchical table with ltree paths for efficient ancestor/descendant queries
Edges -- relationships between nodes (e.g., "Ciana" is connected to "birthday party planning")
Salience scores -- current and historical scores stored per node
Processing history -- tracks how many times each memory has been accessed or discussed

Additional infrastructure:

Redis (ElastiCache) -- caches hot retrieval results and session state
pgvector (planned) -- vector similarity search for semantic retrieval

Three-Channel Retrieval

When a query arrives, KAPEX retrieves memories through three independent channels, each optimized for a different signal:

+---------------------+---------------------+------------------------+
| Channel 1: SALIENCE | Channel 2: RECENCY  | Channel 3: CONSTRAINTS |
|                     |                     |                        |
| Top-K nodes by      | Sliding 72-hour     | Always-inject          |
| salience score      | window of recent    | guardrail nodes        |
|                     | memories            | (safety, boundaries)   |
|                     |                     |                        |
| ~55% token budget   | ~35% token budget   | ~10% token budget      |
+----------+----------+----------+----------+-----------+------------+
           |                     |                      |
           +---------------------+----------------------+
                                 |
                         Token Assembly
                       (6000 token budget)

Salience channel -- retrieves the highest-scored memories across the entire graph. These are the memories KAPEX considers most important right now. Only nodes above the 0.25 injection threshold are eligible.
Recency channel -- retrieves memories from the last 72 hours without ranking them by salience, provided they clear a minimum salience floor. This ensures the LLM knows about recent conversations even if they haven't yet accumulated high salience.
Constraint channel -- injects safety-critical context that must always be present: user-disclosed sensitivities, boundaries, safety pins, and trigger avoidance directives.

The channels are assembled into a single context block that fits within a configurable token budget (default 6000 tokens).
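The budget split can be illustrated with a short sketch. It treats the approximate 55/35/10 shares as exact, assumes token counts are computed elsewhere, and fills each channel greedily; the `Memory` class and the function names are hypothetical, and treating the 0.25 threshold as inclusive is also an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Memory:
    text: str
    tokens: int      # pre-computed token count (tokenizer is out of scope)
    salience: float


# Integer percentages avoid floating-point rounding when splitting the budget.
SHARE_PERCENT = {"salience": 55, "recency": 35, "constraints": 10}
INJECTION_THRESHOLD = 0.25


def channel_limits(budget: int) -> Dict[str, int]:
    """Split the total token budget into per-channel limits."""
    return {name: budget * pct // 100 for name, pct in SHARE_PERCENT.items()}


def fill(candidates: List[Memory], limit: int) -> List[Memory]:
    """Take candidates in order, skipping any that would exceed the limit."""
    chosen, used = [], 0
    for memory in candidates:
        if used + memory.tokens <= limit:
            chosen.append(memory)
            used += memory.tokens
    return chosen


def assemble(
    salience: List[Memory],
    recency: List[Memory],
    constraints: List[Memory],
    budget: int = 6000,
) -> Dict[str, List[Memory]]:
    """Pick memories per channel so each stays within its share."""
    limits = channel_limits(budget)
    eligible = sorted(
        (m for m in salience if m.salience >= INJECTION_THRESHOLD),
        key=lambda m: m.salience,
        reverse=True,
    )
    return {
        "constraints": fill(constraints, limits["constraints"]),
        "salience": fill(eligible, limits["salience"]),
        "recency": fill(recency, limits["recency"]),
    }
```

With the default 6000-token budget, `channel_limits` gives 3300 tokens to salience, 2100 to recency, and 600 to constraints. This sketch does not deduplicate a memory that appears in more than one channel.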

Confidence-Gated Framing

Not all memories are equally reliable. KAPEX assigns a confidence score to each retrieved node and frames it accordingly before injecting it into the LLM prompt:

Confidence Level	Score Range	Framing Style	Example
Assert	0.75 and above	Stated as fact	"The user's sister is named Ciana."
Hedged	0.40 to below 0.75	Qualified language	"The user has mentioned someone named Ciana, possibly a sibling."
Hook	Below 0.40	Invitation to confirm	"The user may have discussed someone named Ciana -- consider asking to confirm."

This prevents the LLM from asserting uncertain memories as established facts, reducing hallucination and improving user trust.
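The thresholds in the table translate directly into a small gating function. The level names and cut-offs below come from the table; the `frame` helper and its sentence templates are hypothetical wording used only to show the idea.

```python
def confidence_level(score: float) -> str:
    """Map a confidence score to the framing level from the table."""
    if score >= 0.75:
        return "assert"
    if score >= 0.40:
        return "hedged"
    return "hook"


# Illustrative templates only; the real framing text is not specified here.
TEMPLATES = {
    "assert": "{fact}.",
    "hedged": "The user has mentioned that {fact}, but this is unconfirmed.",
    "hook": "The user may have said that {fact}; consider asking to confirm.",
}


def frame(fact: str, score: float) -> str:
    """Wrap a memory in language that matches how reliable it is."""
    return TEMPLATES[confidence_level(score)].format(fact=fact)
```

Keeping the level decision separate from the wording makes it easy to adjust templates without touching the thresholds.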

Safety Architecture

KAPEX includes a multi-layer safety pipeline that runs on every request:

Crisis detection -- lexical scanning and escalation tracking to identify distress signals
Safety pins -- persistent, zero-decay nodes that always inject critical safety context (e.g., disclosed triggers, safety plans)
Trigger avoidance -- extracted trigger words are tracked and the LLM is instructed to avoid them
PII scrubbing -- highly sensitive PII (SSNs, credit cards, bank accounts) is detected and redacted before storage
Memory validation -- a post-generation check that the LLM's response does not reference memories that are absent from the graph
Topic suppression -- users can request that specific topics never be stored or retrieved

Safety runs identically regardless of which memory configuration is active. It is not optional and cannot be disabled per-user.
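As one illustration of the PII scrubbing layer, the sketch below redacts US-style SSNs and card numbers before text is stored. The regular expressions, placeholder strings, and the use of a Luhn checksum to cut down false positives are assumptions made for this example; bank account detection is omitted because its format varies too widely for a short sketch.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scrub(text: str) -> str:
    """Redact SSNs and Luhn-valid card numbers from the text."""
    text = SSN_RE.sub("[REDACTED_SSN]", text)

    def replace_card(match: "re.Match[str]") -> str:
        digits = re.sub(r"\D", "", match.group(0))
        return "[REDACTED_CARD]" if luhn_valid(digits) else match.group(0)

    return CARD_RE.sub(replace_card, text)
```

The checksum step means a long digit run such as an order number is left alone unless it also happens to be a valid card number.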

Deployment Architecture

Client App
    |
    v
CloudFront (TLS)
    |
    v
EC2 (gunicorn + Flask)
    |
    +-- PostgreSQL (RDS) -- memory graph, scores, edges
    |
    +-- Redis (ElastiCache) -- session cache, retrieval cache
    |
    +-- Amazon Bedrock -- LLM inference
    |
    +-- Brave Search API -- real-time web search (optional)

KAPEX runs as a single Flask application behind gunicorn on EC2, with PostgreSQL on RDS for persistent storage and Redis on ElastiCache for caching. LLM inference is handled by Amazon Bedrock (Claude models by default). All traffic is encrypted in transit (TLS 1.2+) and data is encrypted at rest (AES-256).

Background Jobs

KAPEX runs several scheduled background jobs to maintain the memory graph:

Job	Frequency	Purpose
Decay engine	Configurable (hours)	Apply temporal decay to all memory nodes
Enrichment queue	Every 30 seconds	Process async enrichment tasks (entity details, themes, embeddings)
Cross-domain detection	Every 6 hours	Detect correlations between domains
Stale knowledge check	Daily	Mark outdated knowledge nodes for review
Embedding backfill	Daily	Compute embeddings for nodes missing them
Memory compression	Daily	Compress old, low-salience memories into summaries
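The decay engine's job can be pictured with a simple model. The sketch below assumes exponential decay with a per-node hourly rate, which is one common choice and not necessarily the formula KAPEX uses; the `Node` class and `apply_decay` are hypothetical names. A rate of zero models the zero-decay safety pins described earlier.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    salience: float
    decay_rate: float  # per hour; 0.0 models a zero-decay safety pin


def apply_decay(node: Node, hours_elapsed: float) -> float:
    """Return the decayed salience (assumed exponential model)."""
    if hours_elapsed < 0:
        raise ValueError("hours_elapsed must be non-negative")
    return node.salience * math.exp(-node.decay_rate * hours_elapsed)
```

Under this model a rate of `math.log(2) / 24` halves a node's salience every 24 hours, while a pinned node keeps its score no matter how much time passes.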