Back Office Bot for real estate — 27 AI tools, three-layer memory, shipped in a 6-day competition sprint on a multi-month Next.js 16 + Supabase foundation I built solo.
- Context: Real estate back-office
- Surface area: 7 tools · 3 inboxes
- Cost of context loss: ~4 hrs / agent / week
Back-office work in Czech real estate lives across CRM notes, ČÚZK Katastr lookups, ISIR insolvency checks, Sreality feeds, tenant email threads, contract redlines, and appointment juggling, all requiring context that the estate agent has already explained six times this week.
Off-the-shelf LLMs forget between sessions. Custom agents hallucinate under tool-load. What was missing was an agent that remembered — a teammate, not a toy.
“An agent that forgets is a feature request. An agent that remembers is a hire.”
Keep 27 tools coherent, keep memory cheap, ship the sprint.
A back-office agent needs to browse listings, draft replies, read contracts, schedule viewings, and log everything to the CRM — without token budgets exploding or tool-selection collapsing into noise.
Single-context LLMs break down past ~20 tools. Vector stores alone leak irrelevant chunks. And carrying 1,700+ tests through a 6-day sprint meant the architecture had to be wrong exactly zero times.
A three-layer memory system with a nightly consolidation pass — autoDream.
Working (in-prompt session context) → Episodic (the L3 activity log) → Long-term (L2 semantic memory). The agent runs a multi-step loop (stopWhen: stepCountIs(8)) over all 27 typed tools. Every turn and tool call appends to the L3 activity log; each night, autoDream distills salient traces from that log into L2 semantic memory (pgvector embeddings + entities).
Result: on Monday morning the agent already knows Vinohradská 42 had an unresolved ČÚZK výpis, and that the buyer's agent replies slowly on Tuesdays.
Three layers of memory, one nightly consolidator.
L1 · Current session context
Held in-prompt. ~8k tokens of the last user turn, tool results, and active plan. Flushed when the conversation ends.
L2 · Semantic memory · pgvector
Postgres + pgvector semantic search. Consolidated embeddings with entity tags — retrievable by similarity, entity, or conversation id. Rebuilt nightly by autoDream.
L3 · Activity log · append-only
Every turn, tool call, argument, and outcome appended to a structured audit log. Immutable source of truth — feeds autoDream; survives schema changes; replayable.
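Retrieval from semantic memory "by similarity, entity, or conversation id" can be illustrated with a small in-memory stand-in. This is only a sketch: in the real system the lookup would presumably be a pgvector query, and `MemoryRow`, `recallFromRows`, and the `minSimilarity` threshold are hypothetical names and values, not taken from the project.

```typescript
// Hypothetical in-memory stand-in for the semantic store (assumption:
// the production version runs this as a pgvector query in Postgres).
interface MemoryRow {
  text: string;
  embedding: number[];
  entities: string[];
  conversationId: string;
}

interface RecallOptions {
  minSimilarity: number; // assumed cosine threshold; the real value is not given
  limit: number;
  entity?: string; // optional entity filter
}

// Cosine similarity; missing components count as 0, zero-norm vectors score 0.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const n = Math.max(a.length, b.length);
  for (let i = 0; i < n; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Filter by entity (if given), keep rows above the threshold, best match first.
function recallFromRows(
  rows: MemoryRow[],
  query: number[],
  opts: RecallOptions,
): MemoryRow[] {
  return rows
    .filter((r) => opts.entity === undefined || r.entities.includes(opts.entity))
    .map((row) => ({ row, score: cosineSimilarity(row.embedding, query) }))
    .filter((s) => s.score >= opts.minSimilarity)
    .sort((x, y) => y.score - x.score)
    .slice(0, opts.limit)
    .map((s) => s.row);
}
```

The threshold is what keeps weakly related chunks out of the prompt; set too low, the vector store leaks irrelevant memories, set too high, the agent recalls nothing.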
    import { streamText, stepCountIs } from "ai";
    import { openai } from "@ai-sdk/openai";
    // getEpisodesSince, summarize, embed, longterm, recall, appendActivity,
    // allTools, and Turn are project-local and not shown here.

    // Nightly consolidation pass
    export async function autoDream(userId: string) {
      const episodes = await getEpisodesSince(userId, "-24h");
      const traces = await summarize(episodes, { model: "gpt-5.4-mini" });
      for (const t of traces) {
        await longterm.upsert({
          embedding: await embed(t.text),
          entities: t.entities,
          weight: t.salience,
        });
      }
    }

    // Multi-step agent loop: all 27 typed tools in context
    export async function runAgent(turn: Turn) {
      const memory = await recall(turn, { store: "l2" });
      return streamText({
        model: openai("gpt-5.4-mini"),
        tools: allTools,
        stopWhen: stepCountIs(8),
        onStepFinish: (s) => appendActivity("l3", s),
        messages: [...memory, ...turn.messages],
      });
    }
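The `weight: t.salience` field above hides the scoring step. As a purely illustrative sketch of how a salience score might let recency contribute without letting it dominate, consider the function below. The features, coefficients, and the name `scoreSalience` are all assumptions made for this example; the project's actual salience function is not shown in this write-up.

```typescript
// Hypothetical features of a summarized trace (not the project's real schema).
interface TraceFeatures {
  ageHours: number; // hours since the episode happened
  entityMentions: number; // how often the trace's entities recur in the window
  unresolved: boolean; // e.g. a lookup that is still pending
}

// Returns a score in [0, 1). Recency is bounded at 0.2 of the total,
// so the newest episode cannot outrank a recurring, unresolved one.
function scoreSalience(t: TraceFeatures): number {
  // Exponential decay with an assumed 24-hour time constant, capped at 0.2.
  const recency = 0.2 * Math.exp(-t.ageHours / 24);
  // Saturating recurrence term: 0 for no mentions, approaching 0.5 for many.
  const recurrence = 0.5 * (1 - 1 / (1 + t.entityMentions));
  // Flat bonus for anything still open.
  const open = t.unresolved ? 0.3 : 0;
  return recency + recurrence + open;
}
```

Bounding each term separately is one simple way to make the weighting auditable: any single signal can be inspected and re-tuned without changing the others.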
324 commits, one at a time — Saturday brief to Thursday ship.
Competition sprint: 6 days from the Saturday brief (21 March) to the Thursday deadline (26 March), plus a Friday for the post-deadline demo reel. All of it on a multi-month Next.js 16 + Supabase + OpenAI foundation I'd built solo beforehand.
Saturday: Brief landed; started the same day. Email/password auth with per-user data isolation, migration 004 (views rebuild), Vercel framework config, TypeScript strict cleanup. Last commit 23:52.
Sunday: Switched to the Chat Completions API, wrapped every tool in a recovery strategy (safe-tool wrapper), landed the first 13 typed tools, and added a file upload system with xlsx + csv parsing. Dashboard + chat UI plumbing. First working end-to-end turn Sunday night.
Monday: Biggest day by LOC. Sreality monitoring, Lead Pipeline, Analytics V2, dashboard depth, ČÚZK + ISIR wiring. Agent loop hardened: streamText with stopWhen: stepCountIs(8) over all typed tools.
Tuesday: Expanded tool surface, diacritics-insensitive search across tools, Sreality region ID fixes (10 of 14 were wrong out of the box), Gmail draft visibility, monitor limits. Review hardening + first wave of contract tests.
Wednesday: L3 activity log, L2 pgvector retrieval, and the nightly consolidation job. First autoDream pass distilled several hundred activity rows into a retrievable semantic index. Cosine threshold calibrated twice. Design polish + focus rings landed the same day.
Thursday: V4 Thinking layer (Think-Plan-Execute), 27-tool SSOT, E2E Playwright (137 tests across 9 sections), sidebar health monitor, native PDF extraction, manual CRUD for all sections. Deadline met Thursday EOD; shipped to Vercel.
Friday: Demo reel for the competition eval. Recorded it three times because the agent got smarter between takes. Added Vercel Analytics, fixed the mobile chat workspace. Window closes; judges take over.
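The diacritics-insensitive search mentioned in the timeline can be approximated with Unicode normalization. The sketch below is one common approach, not necessarily the one used in the project; `foldDiacritics` and `matchesQuery` are hypothetical helper names.

```typescript
// Fold diacritics and case so that "vinohradska" matches "Vinohradská".
function foldDiacritics(input: string): string {
  return input
    .normalize("NFD") // split each letter from its combining marks
    .replace(/[\u0300-\u036f]/g, "") // drop the combining marks
    .toLowerCase();
}

// Substring match after folding both sides.
function matchesQuery(haystack: string, query: string): boolean {
  return foldDiacritics(haystack).includes(foldDiacritics(query));
}
```

For example, `matchesQuery("Praha 2, Vinohradská 42", "vinohradska")` is true, so a user typing on a keyboard without Czech layout still finds the listing. In Postgres, the same effect is often achieved on the database side instead, for instance with the `unaccent` extension.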
Watch RELO handle a Monday-morning inbox.
Results
- R/01 Placed 5th of 70 teams, top 7% at the competition. Judges called out the three-layer memory as the clearest technical differentiator among the finalists.
- R/02 Brief Saturday → shipped Thursday EOD: 324 commits, one person, a production-grade back-office app deployed to Vercel. 27 typed AI tools and real Gmail + Calendar + Telegram integrations, running live before the deadline bell.
- R/03 Reliable across 27 tools, zero runaway loops. stopWhen: stepCountIs(8) caps the agent at eight steps per turn, and the safe-tool wrapper makes every failure recoverable. The agent retries, reformulates, or surfaces a clean error, but never burns through demo time on an infinite retry.
- R/04 Czech-native integrations as a competitive moat. ČÚZK Katastr, ISIR, and Valuo/CMA wired from day one. Diacritics-aware search, real region slugs (fixed 10 of 14 that were wrong out of the box). Locale expertise that global competitors can't shortcut.
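The safe-tool wrapper idea (a failing tool returns a structured, recoverable result instead of throwing into the agent loop) can be sketched as follows. This is a minimal illustration under assumptions: `safeTool`, `ToolResult`, `maxAttempts`, and `hint` are hypothetical names, and input schema validation is left out to keep the example dependency-free.

```typescript
// Result shape handed back to the model instead of a thrown exception.
type ToolResult<T> =
  | { ok: true; data: T }
  | { ok: false; error: string; hint: string };

interface SafeToolOptions {
  maxAttempts: number; // hard cap, so a failing tool can never retry forever
  hint: string; // hypothetical guidance the model can use to rephrase the call
}

// Wrap a tool so that errors are retried a bounded number of times and
// then surfaced as data the agent can react to.
function safeTool<A, T>(
  run: (args: A) => Promise<T>,
  opts: SafeToolOptions,
): (args: A) => Promise<ToolResult<T>> {
  return async (args) => {
    let lastError = "unknown error";
    for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
      try {
        return { ok: true, data: await run(args) };
      } catch (err) {
        lastError = err instanceof Error ? err.message : String(err);
      }
    }
    return { ok: false, error: lastError, hint: opts.hint };
  };
}
```

Combined with a step cap on the loop itself, the bounded retry count means the worst case is a small, known number of wasted calls followed by a clean error message.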
Learnings
- L/01 Consolidation is harder than retrieval.
autoDream's salience function needed three rewrites before it stopped over-weighting the latest episode. Writing to memory is where the real architecture work happens; reading from it is the easy half.
- L/02 Typed tools + safe-tool wrapper.
Every tool has a Zod schema and a recovery strategy, so the agent retries or rephrases instead of breaking the loop. Self-healing beat strict validation. Users don't care how you failed; they care whether the conversation continues.
- L/03 Multi-step loops > function-call-per-query.
Real requests aren't one tool call; they're five (find property, check katastr, draft email, book calendar slot, confirm). The first attempt was a classic single call; stopWhen: stepCountIs(8) gave the agent room to chain its own steps.
- L/04 Build the replay harness before the agent.
I iterated blind on behavior for days before I could replay a 5-turn conversation against the old vs. new agent. The moment I could, iteration speed tripled. Eval isn't about coverage; it's about feedback-loop velocity.
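A replay harness of the kind described in L/04 can be quite small. The sketch below replays recorded turns against two agent versions and reports only the turns where the sequence of tool calls differs. It is an illustration under assumptions: the types and `replayConversation` are hypothetical, and each turn is replayed independently, which ignores any state an agent carries between turns.

```typescript
// Hypothetical minimal shapes for a recorded conversation and an agent.
interface RecordedTurn {
  userMessage: string;
}
interface AgentStep {
  tool: string;
}
type Agent = (turn: RecordedTurn) => Promise<AgentStep[]>;

interface TurnDiff {
  index: number;
  userMessage: string;
  before: string[]; // tool sequence chosen by the old agent
  after: string[]; // tool sequence chosen by the new agent
}

// Run both agents on every recorded turn; keep only turns whose
// tool-call sequences differ.
async function replayConversation(
  conversation: RecordedTurn[],
  oldAgent: Agent,
  newAgent: Agent,
): Promise<TurnDiff[]> {
  const results = await Promise.all(
    conversation.map(async (turn, index): Promise<TurnDiff> => {
      const [before, after] = await Promise.all([oldAgent(turn), newAgent(turn)]);
      return {
        index,
        userMessage: turn.userMessage,
        before: before.map((s) => s.tool),
        after: after.map((s) => s.tool),
      };
    }),
  );
  return results.filter((r) => r.before.join(" > ") !== r.after.join(" > "));
}
```

Diffing tool sequences rather than final text keeps the comparison stable even when the model's wording varies from run to run.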
