
Architecture Overview

KAPEX is a memory middleware layer that sits between your application and any LLM provider. It intercepts conversations, builds a persistent memory graph, and injects the most relevant context into every prompt -- giving your LLM long-term memory without modifying the model itself.

High-Level Data Flow

Your App                    KAPEX                         LLM Provider
  |                          |                               |
  |--- user message -------->|                               |
  |                          |-- retrieve scored memories    |
  |                          |-- build context block         |
  |                          |-- inject into system prompt   |
  |                          |--- augmented prompt --------->|
  |                          |<-- LLM response --------------|
  |<-- response -------------|                               |
  |                          |                               |
  |                          |== async (user never waits) ===|
  |                          |-- extract entities            |
  |                          |-- score & place in hierarchy  |
  |                          |-- propagate salience          |
  |                          |-- update decay rates          |

KAPEX operates as a transparent proxy. Your application sends messages to KAPEX instead of directly to the LLM. KAPEX enriches the prompt with relevant memories, forwards it to the LLM, returns the response immediately, and then processes the conversation asynchronously to update the memory graph.

Sync Read Path (User-Facing)

The read path executes synchronously -- the user waits for the response, so speed matters. No graph writes happen here.

  1. Receive query -- your app sends the user's message to KAPEX
  2. Retrieve scored memories -- KAPEX queries the memory graph for the most relevant nodes using three-channel retrieval (see below)
  3. Build context block -- selected memories are formatted with confidence-gated framing and assembled into a context block
  4. Inject into LLM prompt -- the context block is inserted into the system prompt alongside your application's instructions
  5. Generate response -- the augmented prompt is sent to the LLM (Amazon Bedrock by default)
  6. Return response -- the LLM response is returned to your application

The entire read path typically adds under 200 ms of overhead on top of the LLM's own latency.

Async Write Path (Background)

After the response is returned to the user, KAPEX processes the conversation in the background. The user never waits for any of these steps:

  1. Entity extraction -- identify people, places, projects, and concepts mentioned in the conversation using a three-tier NER pipeline (alias matching, pattern matching, LLM classification)
  2. Hierarchy placement -- place extracted entities into the appropriate domain, create facet nodes for specific details, and scaffold interest nodes for inferred preferences
  3. Salience scoring -- compute a base score for each new memory node from five linguistic signals
  4. Salience propagation -- propagate scores upward through the hierarchy (entity scores influence domain scores)
  5. Cross-domain intelligence -- track co-activation patterns across domains (e.g., "work stress" and "sleep problems" appearing together)
  6. Temporal patterns -- record when and how often topics are discussed
  7. Cache invalidation -- invalidate cached retrieval results so the next query reflects the latest state
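
As an illustration of step 1, the three-tier extraction could be arranged so that cheaper tiers run first. The sketch below assumes the LLM tier is consulted only when the earlier tiers find nothing; the source does not say whether the tiers are cumulative or a fallback chain. The alias table, the regular expression, and the `llm_classify` callback are hypothetical.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical alias table: lowercase surface form -> canonical entity name.
ALIASES: Dict[str, str] = {"ciana": "Ciana", "dr. m": "Dr. Martinez"}

# Hypothetical pattern tier: an honorific followed by a capitalized surname.
TITLE_PATTERN = re.compile(r"\b(?:Dr|Mr|Ms|Mrs)\.\s+[A-Z][a-z]+")


def extract_entities(
    text: str,
    llm_classify: Optional[Callable[[str], List[str]]] = None,
) -> List[str]:
    found: List[str] = []

    def add(name: str) -> None:
        if name not in found:
            found.append(name)

    # Tier 1: alias matching on whole tokens only.
    lowered = text.lower()
    for alias, canonical in ALIASES.items():
        if re.search(r"(?<!\w)" + re.escape(alias) + r"(?!\w)", lowered):
            add(canonical)

    # Tier 2: pattern matching on the original casing.
    for match in TITLE_PATTERN.findall(text):
        add(match)

    # Tier 3: call the (expensive) LLM classifier only if nothing was found.
    if not found and llm_classify is not None:
        for name in llm_classify(text):
            add(name)
    return found
```

Because this work happens on the async write path, the cost of the occasional LLM call does not add to the latency the user sees.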

Memory Hierarchy

KAPEX organizes memories into a hierarchy of five node types, using PostgreSQL's ltree extension for efficient tree operations:

Domain (life area)
  |
  +-- Entity (named person, place, or concept)
  |     |
  |     +-- Facet (specific detail about the entity)
  |     |
  |     +-- Theme (recurring pattern involving the entity)
  |     |
  |     +-- Interest (inferred preference, promoted over time)
  |
  +-- Entity
  |     +-- Facet
  |     +-- Facet
  ...

Node Types

Type     | Description                                                             | Example
---------+-------------------------------------------------------------------------+---------------------------------------------------------
Domain   | Broad life area or topic category                                       | work, family, health, hobbies
Entity   | A specific named person, place, project, or concept                     | Ciana, KAPEX, Dr. Martinez
Facet    | A concrete detail or fact about an entity                               | Ciana's birthday is March 12, KAPEX uses PostgreSQL
Theme    | A recurring pattern detected within a domain                            | work-life balance concerns, weekend cooking experiments
Interest | An inferred preference, promoted from ephemeral to persistent over time | enjoys sci-fi novels, prefers morning meetings

Domains are created automatically as topics emerge in conversation. Entities are extracted and classified by gravity (how central they are to the user's life). Facets capture atomic facts. Themes and interests are inferred over time from patterns in the conversation history.
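
To make the tree structure concrete, the sketch below models nodes as dotted, ltree-style paths and shows one way a child's score could influence its ancestors. It is an assumption-laden illustration: the label sanitizing, the `damping` factor, and the "take the maximum" rule are hypothetical, since the source does not specify how propagation is computed.

```python
import re
from typing import Dict, List


def to_label(name: str) -> str:
    """Reduce a display name to letters, digits, and underscores."""
    return re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_").lower()


def make_path(*names: str) -> str:
    """Join labels into a dotted path such as 'family.ciana'."""
    return ".".join(to_label(n) for n in names)


def ancestors(path: str) -> List[str]:
    """Return ancestor paths, nearest parent first, ending at the domain."""
    parts = path.split(".")
    return [".".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


def propagate(scores: Dict[str, float], path: str, damping: float = 0.5) -> None:
    """Push a damped share of a node's score up to each ancestor.

    Hypothetical rule: an ancestor keeps the larger of its current score
    and the damped contribution, so propagation never lowers a score.
    """
    contribution = scores[path]
    for parent in ancestors(path):
        contribution *= damping
        scores[parent] = max(scores.get(parent, 0.0), contribution)
```

For example, `make_path("family", "Ciana", "birthday is March 12")` yields `family.ciana.birthday_is_march_12`, and propagating from that facet touches `family.ciana` and then `family`.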

Storage Layer

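KAPEX uses PostgreSQL as its primary data store. It holds the memory graph, scores, and edges, and its ltree extension backs the memory hierarchy.

Additional infrastructure: Redis caches sessions and retrieval results.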

Three-Channel Retrieval

When a query arrives, KAPEX retrieves memories through three independent channels, each optimized for a different signal:

+---------------------+---------------------+------------------------+
| Channel 1: SALIENCE | Channel 2: RECENCY  | Channel 3: CONSTRAINTS |
|                     |                     |                        |
| Top-K nodes by      | Sliding 72-hour     | Always-inject          |
| salience score      | window of recent    | guardrail nodes        |
|                     | memories            | (safety, boundaries)   |
|                     |                     |                        |
| ~55% token budget   | ~35% token budget   | ~10% token budget      |
+----------+----------+----------+----------+----------+-------------+
           |                     |                     |
           +---------------------+---------------------+
                                 |
                         Token Assembly
                       (6000 token budget)

The channels are assembled into a single context block that fits within a configurable token budget (default 6000 tokens).
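
A minimal sketch of the assembly step follows, using the approximate shares from the diagram above. Several details are assumptions made for illustration: the characters-per-token estimate replaces a real tokenizer, constraints are placed first, a memory returned by two channels is included once, and items that do not fit a channel's share are skipped.

```python
from typing import Dict, List

# Percent of the budget per channel, taken from the diagram (approximate).
# Hypothetical ordering: guardrail constraints are placed first.
CHANNEL_SHARES: Dict[str, int] = {"constraints": 10, "salience": 55, "recency": 35}


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: roughly four characters per token."""
    return max(1, len(text) // 4)


def assemble(channels: Dict[str, List[str]], budget: int = 6000) -> List[str]:
    """Greedily fill each channel's share of the token budget."""
    selected: List[str] = []
    seen = set()
    for name, percent in CHANNEL_SHARES.items():
        remaining = budget * percent // 100
        for memory in channels.get(name, []):
            cost = estimate_tokens(memory)
            if memory in seen or cost > remaining:
                continue  # skip duplicates and items that do not fit
            selected.append(memory)
            seen.add(memory)
            remaining -= cost
    return selected
```

Each channel is expected to supply its candidates already ranked (by salience score or by recency), so the greedy loop keeps the highest-priority items when the share runs out.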

Confidence-Gated Framing

Not all memories are equally reliable. KAPEX assigns a confidence score to each retrieved node and frames it accordingly before injecting it into the LLM prompt:

Confidence Level | Score Range    | Framing Style         | Example
-----------------+----------------+-----------------------+--------------------------------------------------------------------------------
Assert           | 0.75 and above | Stated as fact        | "The user's sister is named Ciana."
Hedged           | 0.40 -- 0.74   | Qualified language    | "The user has mentioned someone named Ciana, possibly a sibling."
Hook             | Below 0.40     | Invitation to confirm | "The user may have discussed someone named Ciana -- consider asking to confirm."

This prevents the LLM from asserting uncertain memories as established facts, reducing hallucination and improving user trust.
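
The thresholds above map directly onto a small function. In this sketch the cutoffs come from the table, while the function names and the wording templates are hypothetical; scores between 0.74 and 0.75 are treated as hedged.

```python
ASSERT_MIN = 0.75  # "Assert": 0.75 and above
HEDGE_MIN = 0.40   # "Hedged": 0.40 up to, but not including, 0.75


def confidence_level(score: float) -> str:
    """Map a confidence score to one of the three framing levels."""
    if score >= ASSERT_MIN:
        return "assert"
    if score >= HEDGE_MIN:
        return "hedged"
    return "hook"


def frame_memory(fact: str, score: float) -> str:
    """Wrap a memory in wording that matches its confidence level.

    The templates are illustrative, not KAPEX's actual phrasing.
    """
    level = confidence_level(score)
    if level == "assert":
        return fact
    if level == "hedged":
        return f"Possibly (unconfirmed): {fact}"
    return f"Low confidence, consider asking the user to confirm: {fact}"
```

Keeping the level decision separate from the wording makes it easy to change the templates without touching the thresholds.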

Safety Architecture

KAPEX includes a multi-layer safety pipeline that runs on every request.

Safety runs identically regardless of which memory configuration is active. It is not optional and cannot be disabled per-user.

Deployment Architecture

Client App
    |
    v
CloudFront (TLS)
    |
    v
EC2 (gunicorn + Flask)
    |
    +-- PostgreSQL (RDS) -- memory graph, scores, edges
    |
    +-- Redis (ElastiCache) -- session cache, retrieval cache
    |
    +-- Amazon Bedrock -- LLM inference
    |
    +-- Brave Search API -- real-time web search (optional)

KAPEX runs as a single Flask application behind gunicorn on EC2, with PostgreSQL on RDS for persistent storage and Redis on ElastiCache for caching. LLM inference is handled by Amazon Bedrock (Claude models by default). All traffic is encrypted in transit (TLS 1.2+) and data is encrypted at rest (AES-256).

Background Jobs

KAPEX runs several scheduled background jobs to maintain the memory graph:

Job                    | Frequency            | Purpose
-----------------------+----------------------+--------------------------------------------------------------------
Decay engine           | Configurable (hours) | Apply temporal decay to all memory nodes
Enrichment queue       | Every 30 seconds     | Process async enrichment tasks (entity details, themes, embeddings)
Cross-domain detection | Every 6 hours        | Detect correlations between domains
Stale knowledge check  | Daily                | Mark outdated knowledge nodes for review
Embedding backfill     | Daily                | Compute embeddings for nodes missing them
Memory compression     | Daily                | Compress old, low-salience memories into summaries
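
The source does not describe the decay model itself. As one plausible reading of "temporal decay", the sketch below applies exponential decay with a per-node half-life. The formula, the function names, and the one-week default half-life are assumptions chosen for illustration only.

```python
from typing import Dict


def decayed_score(score: float, hours_elapsed: float, half_life_hours: float) -> float:
    """Exponential decay: the score halves every `half_life_hours`."""
    if half_life_hours <= 0:
        raise ValueError("half_life_hours must be positive")
    return score * 0.5 ** (hours_elapsed / half_life_hours)


def run_decay_pass(
    scores: Dict[str, float],
    half_lives: Dict[str, float],
    hours_elapsed: float,
    default_half_life: float = 168.0,  # hypothetical default: one week
) -> Dict[str, float]:
    """Return new scores after one decay pass over all nodes."""
    return {
        node: decayed_score(score, hours_elapsed, half_lives.get(node, default_half_life))
        for node, score in scores.items()
    }
```

A per-node half-life is one way to model differing decay rates: a node with a longer half-life fades more slowly than one with a shorter half-life.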