
Key Takeaways

  • Off-the-shelf memory frameworks can silently discard the details that matter most.
  • A three-layer AI memory architecture delivers far better recall than any single abstraction.
  • TiDB’s native vector search eliminates the two-database overhead of a Postgres + Pinecone setup.
  • Model choice for synthesis tasks is a trust decision, not a cost decision.

I was talking to Claude the other day — not about code or some technical problem. I was venting about work, about life. And Claude responded with something so personal, so specific to my situation, that I stopped and stared at it. It referenced my daughter by name. It brought up something I’d been stressed about from a conversation weeks earlier. It connected dots between completely separate chats.

That feeling of being truly remembered by an AI? That’s a product.

So I built Speak2Me, a voice-first AI journal companion. You talk to it like a friend, and it actually remembers your story — not with generic responses like “that sounds frustrating,” but with real, personal context that references your life, your people, and your patterns.

The first version took about two hours to build. Making it actually work took the rest of the week. Because here’s the thing nobody tells you about AI memory: It’s really hard to get right.

AI Memory Architecture: The Promise vs. The Reality

The concept was straightforward: Open the app and it just gets you. It remembers your partner’s name, asks about that job stress you mentioned last week, and checks whether the baby is sleeping through the night yet.

I wired everything up — Hume EVI for voice, Mem0 for long-term memory, TiDB for the database (relational data and vector search in one), Claude as the reasoning layer, and Vercel for deployment. Sent the link to a few testers. Felt good about myself.

Then I used it for real. Told it personal details — my income, my family, my goals for the year. Opened it the next session expecting a deeply personal experience.

It had no idea who I was. Zero context. The entire product promise was broken.

When Your AI Memory Architecture Layer Forgets

I was using Mem0 for long-term memory. If you haven’t encountered it, Mem0 is an open-source memory framework with over 40,000 GitHub stars. The idea is compelling: Feed it conversations, it extracts important facts, and you recall those facts later.

During a test conversation, I provided exact financial details — my base salary and bonus, down to the dollar. I then checked what Mem0 actually stored.

It had extracted a vague sentence about “wanting to discuss income.” The actual numbers were gone.

This isn’t a bug in Mem0’s design — it’s a limitation of how memory extraction works. Mem0 uses a smaller language model internally (GPT-4o-mini) to decide what’s worth remembering, and smaller models are aggressive summarizers. They capture the gist and discard the specifics. For casual chatbot memory, that tradeoff might be acceptable. For a product where remembering exact life details is the value proposition, it’s a dealbreaker.

I ran more tests with family details, career plans, specific names and dates. Some things it captured. Others it mangled or skipped entirely. There was no way to predict what it would retain, because I didn’t control the extraction model.

If the memory layer is the product, you can’t outsource it to someone else’s black box.

The Hallucination Problem: Who Is Lily?

While debugging the Mem0 issue, I made another mistake that could have been far worse.

To save costs, I was using GPT-4o-mini to synthesize user profiles — taking all conversations and generating a summary document of who the user is, what they care about, and who’s important in their life. This profile gets injected into every future conversation as context.

I ran the synthesis on my test data and read the output. It said my daughter’s name was “Lily” and my partner was “Sarah.”

Neither name is correct. GPT-4o-mini fabricated plausible-sounding names when the real names simply hadn’t been mentioned yet. Instead of writing “not yet mentioned,” it invented details and presented them as fact.

Imagine opening your personal journal companion and hearing it say “How’s Lily doing?” when your daughter’s actual name is completely different. That’s not a bug — it’s a trust-destroying moment you can never recover from.

I switched immediately to Claude 3.5 Haiku for profile synthesis and added strict guardrails: Never invent, guess, or infer names, numbers, or details not explicitly stated in the conversations. If something hasn’t been mentioned, write “not yet mentioned.”
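The guardrails above boil down to a few explicit instructions prepended to the synthesis prompt. Here’s a minimal sketch of that idea — the wording and the `buildSynthesisPrompt` helper are illustrative, not the exact prompt Speak2Me uses:

```javascript
// Anti-hallucination guardrails for profile synthesis (illustrative wording).
const SYNTHESIS_GUARDRAILS = [
  "Never invent, guess, or infer names, numbers, or details that are not",
  "explicitly stated in the conversations.",
  "If something has not been mentioned, write exactly: not yet mentioned.",
].join(" ");

// Hypothetical helper: prepends the guardrails to the raw transcripts
// before they are sent to the synthesis model.
function buildSynthesisPrompt(transcripts) {
  return `${SYNTHESIS_GUARDRAILS}\n\nConversations:\n${transcripts.join("\n---\n")}`;
}
```

The key design choice is giving the model an explicit escape hatch (“not yet mentioned”) — without one, models under pressure to produce a complete profile will fill gaps with plausible inventions.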

Model choice for synthesis tasks isn’t a cost optimization. It’s a trust decision. One hallucinated family member name and your user is gone forever.

Building a Three-Layer AI Memory Architecture

After these failures, I rethought the entire memory architecture from scratch. The solution required three complementary layers.

Layer 1: The User Profile

After every conversation, Claude Haiku reads all past transcripts and generates a synthesized document — who the user is, their job, the important people in their life, their stressors, their goals. This document gets injected into the system prompt for every future session. It’s how the AI “knows” you before you say a word.
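The injection step itself is simple. A minimal sketch, assuming a tag-delimited profile section (the function and tag names are illustrative, not Speak2Me’s actual ones):

```javascript
// Layer 1 injection: the synthesized profile document is prepended to the
// system prompt before every session starts.
function buildSystemPrompt(basePrompt, profile) {
  if (!profile) return basePrompt; // brand-new user: no profile exists yet
  return `${basePrompt}\n\n<user_profile>\n${profile}\n</user_profile>`;
}
```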

Layer 2: Per-Exchange Vector Search

This is where the biggest improvement happened.

Originally, I was embedding entire conversation transcripts as single vectors. A 20-minute conversation covering salary, weekend plans, and a family wedding all became one vector — a single point in mathematical space representing the average of all those topics blended together.

When I searched for “salary,” it would find that conversation, but it also pulled up every other long conversation with similarly diluted vectors. The signal was buried.

The fix was chunking at the exchange level. One user message plus its AI response equals one chunk. Each chunk gets its own embedding vector. Now when I search for “salary,” it finds the exact exchange where salary was discussed — not the whole conversation, but the precise moment.

It’s the difference between searching a book by title versus having every individual page indexed. The recall quality improvement was dramatic. (For even better retrieval, TiDB also supports full-text search for hybrid retrieval — combining keyword matching with vector similarity — which I’m planning to integrate next.)
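Exchange-level chunking is only a few lines of code. A sketch, assuming messages arrive as `{ role, content }` objects (the shape Hume and most chat APIs use):

```javascript
// One user message plus its AI response = one chunk.
// Each chunk gets its own embedding vector downstream.
function chunkByExchange(messages) {
  const chunks = [];
  for (let i = 0; i < messages.length; i++) {
    if (messages[i].role !== "user") continue;
    const reply = messages[i + 1];
    const text =
      `User: ${messages[i].content}` +
      (reply && reply.role === "assistant" ? `\nAI: ${reply.content}` : "");
    chunks.push(text);
  }
  return chunks;
}
```

Including the AI’s reply in the chunk matters: the response often restates or enriches the user’s fact (“Congrats on the $150k offer!”), which makes the embedding more retrievable.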

I’m using OpenAI’s text-embedding-3-large model (3,072 dimensions) and storing the vectors in TiDB, which supports vector search natively. When the AI needs to recall something during a live conversation, it searches these chunks using cosine distance. The cost is negligible — less than ten cents per user per year for embeddings.
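For intuition, cosine distance is computed the same way TiDB’s VEC_COSINE_DISTANCE does it — 1 minus the cosine similarity of the two vectors. Identical directions score 0; orthogonal vectors score 1:

```javascript
// Cosine distance: 1 - (a·b) / (|a| * |b|).
// Smaller values mean the vectors (and therefore the texts) are closer.
function cosineDistance(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```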

Layer 3: Raw Transcripts

Every word, stored unmodified. This is the ground truth that never gets summarized, compressed, or distorted by a model. If the profile synthesis misses something or the vector search returns an unexpected result, the raw data is always there.

After validating this three-layer approach, I removed Mem0 entirely. Not because it’s bad software — but once the architecture was working, it wasn’t adding value. It was just another dependency sitting between me and my data.

Why I Chose TiDB Over Postgres + Pinecone

The database choice deserves its own section because it addresses one of the most common architectural patterns in RAG applications — and why I think that pattern is wrong for many use cases.

Every RAG tutorial prescribes the same stack: Postgres for your relational data, Pinecone for your vectors. Two databases. Two bills. Sync jobs between them.

Here’s the actual query that runs when the AI needs to recall a memory in Speak2Me:

SELECT
  e.title,
  e.top_emotions,
  c.chunk_text,
  VEC_COSINE_DISTANCE(c.embedding, ?) AS relevance
FROM s2m_transcript_chunks c
JOIN s2m_journal_entries e ON c.entry_id = e.id
WHERE c.user_id = ?
  AND e.created_at > DATE_SUB(NOW(), INTERVAL 30 DAY)
ORDER BY relevance
LIMIT 5

Vector search. Date filtering. User scoping. A JOIN to pull full context. One query. One network hop. (See the full list of vector functions and operators TiDB supports.)
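One practical detail when running that query from application code: TiDB accepts vector parameters as bracketed string literals (e.g. `"[0.1,0.2,0.3]"`). A small helper handles the serialization — a sketch, assuming a standard parameterized MySQL client like mysql2:

```javascript
// Serialize an embedding array into TiDB's vector literal format,
// suitable for binding to the `?` placeholder in VEC_COSINE_DISTANCE.
function toVectorLiteral(embedding) {
  return `[${embedding.join(",")}]`;
}
```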

With a Postgres + Pinecone setup, that same operation becomes: Call Pinecone with the vector, get back chunk IDs, call Postgres with those IDs, and join the results in your application code. Two round trips, two failure points, and the join logic lives in JavaScript instead of the database optimizer.

Pre-Filtering Changes Everything

Vector search is computationally expensive — comparing a query vector against millions of stored vectors takes real compute. TiDB filters by user_id and date range first using standard indexes. Fast and cheap. Then it runs the vector comparison on that much smaller subset.

Most dedicated vector databases do the opposite: They search all vectors first, then filter out non-matching metadata after the fact. At scale, that difference is significant.
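The filter-then-rank ordering can be sketched in plain JavaScript over an in-memory array — a conceptual illustration of what TiDB does with standard indexes, not how it is implemented internally. `distanceFn` stands in for any cosine-distance function:

```javascript
// Filter-then-rank: apply cheap metadata predicates first, then run the
// expensive vector comparison only on the surviving subset.
function recall(chunks, queryVec, userId, sinceMs, distanceFn) {
  return chunks
    .filter((c) => c.userId === userId && c.createdAt > sinceMs)
    .map((c) => ({ ...c, relevance: distanceFn(c.embedding, queryVec) }))
    .sort((a, b) => a.relevance - b.relevance) // smaller distance = closer match
    .slice(0, 5);
}
```

The post-filtering alternative inverts the first two steps: rank all vectors, then discard non-matching rows — which means most of the expensive distance computations are wasted on rows that were never eligible.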

Strong Consistency for Real-Time AI

During a conversation, the AI extracts a fact from something you just said, stores it, and may need to recall it 30 seconds later in the same session. With a Postgres + Pinecone architecture, you’re managing sync lag — write to Postgres, trigger a job to update Pinecone, hope it finishes before the next recall. Eventual consistency headaches.

With TiDB, I write the embedding and it’s immediately searchable. Same transaction. No lag, no sync jobs, no “read your own writes” issues.

One database. Vectors next to the data they describe. Ship faster, debug easier. (For a deeper look, see our architecture guide: Why unified data architectures matter for GenAI.)

AI Memory Architecture: Solving the Latency Problem

Even after fixing the memory architecture, there was a UX-breaking issue: Latency.

When the AI needed to recall something, it would start responding immediately — confidently, specifically, and often wrong. Then, 10–20 seconds later when the vector search results arrived, it would correct itself mid-sentence.

That moment destroys the product promise. You’re not talking to something that knows you — you’re watching a computer look you up.

The solution was to move memory retrieval from query time to session start. Now when a conversation ends, Claude Haiku extracts key facts synchronously in about 500ms. Not just names and dates, but the kind of details a friend would remember: Specific restaurants, upcoming interviews, goals mentioned in passing.

When you open the app next time, the dashboard prefetches your profile summary and the last 20 entries of quick facts in the background. By the time you speak, the AI has everything in context. No tool calls. No waiting.

|        | Session End | Session Start | Memory Recall     |
| ------ | ----------- | ------------- | ----------------- |
| Before | Instant     | ~2s           | 5–10s (tool call) |
| After  | +500ms      | Instant       | Rarely needed     |

The recall tool still exists for older memories — “What did I say three months ago about…” — but for anything recent, the AI just knows. It costs more tokens, but the first time the AI remembers something instantly, with no pause or correction, that’s the product.

The Voice Echo From Hell

Speak2Me is voice-first, powered by Hume EVI — which handles speech-to-text, emotion detection, LLM routing, and text-to-speech in a single WebSocket connection. When the AI speaks, Hume detects 48+ dimensions of vocal expression, so when you sound stressed, the AI adjusts its response accordingly.

But here’s a problem nobody documents: When the AI speaks through your phone’s speaker, the microphone picks up that audio, the AI transcribes its own speech, and responds to itself. An infinite feedback loop.

On a native iOS app, the OS provides hardware-level acoustic echo cancellation. On a web app running in a mobile browser, you’re at the mercy of whatever the browser implements — and mobile Safari is inconsistent at best.

After trying microphone muting (which kills the ability to interrupt naturally), I settled on the browser’s built-in audio constraints, passed to getUserMedia:

await navigator.mediaDevices.getUserMedia({
  audio: { echoCancellation: true, noiseSuppression: true, autoGainControl: true }
});

On desktop, this works well. On mobile, it’s acceptable at lower volumes. The real solution is a native iOS app with system-level echo cancellation — that’s coming.

If you’re building real-time voice AI on the web, budget significantly more time for audio engineering than you expect. This problem space is essentially uncharted.

What’s Next

Speak2Me is live. The immediate priority is encryption — users are sharing their most personal thoughts, and journal transcripts need to be encrypted at rest. After that, native iOS to solve the echo problem at the hardware level and add push notifications, background audio, and biometric authentication.

The memory system will keep improving, but only with real conversation data flowing through it. If you’re a developer building anything with AI memory, I hope the architectural failures I documented here save you some time. If you want to go deeper on choosing the right data infrastructure for AI applications, or see how I applied similar patterns in a privacy-first voice-to-text app and an AI-powered life simulator, those deep dives are worth a read.

And if you want to try Speak2Me, go talk to it. Tell it something real. Come back tomorrow and see if it remembers.

Start building with TiDB Cloud Starter — vector search, SQL joins, and strong consistency in one MySQL-compatible database.

