From 02c3bdf02838c7f0e38888e1e3d4f2e1ac2f3243 Mon Sep 17 00:00:00 2001 From: Philipinho <16838612+Philipinho@users.noreply.github.com> Date: Sun, 19 Apr 2026 22:35:56 +0100 Subject: [PATCH] docs(base): add implementation plan for duckdb query cache --- .../plans/2026-04-19-duckdb-query-cache.md | 1561 +++++++++++++++++ 1 file changed, 1561 insertions(+) create mode 100644 docs/superpowers/plans/2026-04-19-duckdb-query-cache.md diff --git a/docs/superpowers/plans/2026-04-19-duckdb-query-cache.md b/docs/superpowers/plans/2026-04-19-duckdb-query-cache.md new file mode 100644 index 00000000..23673fcc --- /dev/null +++ b/docs/superpowers/plans/2026-04-19-duckdb-query-cache.md @@ -0,0 +1,1561 @@ +# DuckDB-Backed Query Cache for Bases Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Make sort + filter + search on large bases (≥25K live rows) fast and index-backed by routing qualifying `POST /bases/rows` requests to an in-process DuckDB cache instead of the JSONB-extracting Postgres path. Small bases and all writes keep their current path. + +**Architecture:** A new NestJS module `BaseQueryCacheModule` owns a per-process `Map` where each collection is a DuckDB in-memory database with one typed row table keyed by the base's user-defined properties plus the system columns (`id`, `position`, `created_at`, `updated_at`, `last_updated_by_id`, `deleted_at`, `search_text`). Typed btree indexes are built on every sortable property. Writes still commit to Postgres first; the existing `EventEmitter2` row/property events are mirrored onto a dedicated Redis pub/sub channel so each node patches its own copy of the affected collection. If a collection is absent or its `schema_version` is stale, the loader rebuilds it from Postgres via the existing `BaseRowRepo.streamByBaseId` generator. Phase-1 routing is single-node local; phase-2 multi-node routing is sketched but not implemented. + +**Tech Stack:** NestJS 11 + Fastify, Kysely + postgres.js, ioredis (via `@nestjs-labs/nestjs-ioredis`), BullMQ, existing `@nestjs/event-emitter` pattern, DuckDB via `@duckdb/node-api` (official high-level binding, depends on `@duckdb/node-bindings`). Jest + ts-jest for tests. + +**Scope (v1, this plan):** +- Server-side only. Client behavior is unchanged — same `listRows` endpoint, same DTO, same cursor shape. +- Feature flag `BASE_QUERY_CACHE_ENABLED` defaults to `false`. When off, zero behavior change. +- Threshold: a collection becomes a candidate when its live row count is `>= BASE_QUERY_CACHE_MIN_ROWS` (default `25000`) AND it has at least one sort / filter / search in the incoming query (the fast list-by-position path stays on Postgres regardless). +- Correctness invariant: for any query that the cache would answer, the returned page MUST equal what `BaseRowService.list` returns from Postgres (same row ids in same order, same cursor behavior). A diff test on a 10K seed base enforces this. +- LRU eviction when `BASE_QUERY_CACHE_MAX_COLLECTIONS` (default `500`) is reached. Evict by least-recently-used access timestamp. +- Warm-on-boot: a small Redis sorted set tracks recently-accessed baseIds. On `onApplicationBootstrap`, load the top N (default `50`, env `BASE_QUERY_CACHE_WARM_TOP_N`). +- Any loader/patcher failure falls through to the existing Postgres path and is logged. A DuckDB error must never surface to the client. + +**Non-goals (v1, explicitly deferred):** +- Multi-node consistent-hash routing (phase 2, sketched only). +- Rollups, materialized groupings, or column-type narrowing beyond what the loader does directly. +- Client "ship everything for small bases" tier-0 optimization (separate future plan). +- Migrating the existing no-filter/no-sort list fast path (it's already index-backed in Postgres). +- Any change to the write path, write validation, or event shape. + +--- + +## Background + +### The current hot path and why it stalls + +On a 100K-row base with a property sort, the server currently runs (simplified): + +```sql +SELECT id, cells, position, ..., COALESCE(base_cell_text(cells, ''::uuid), chr(1114111)) AS s0 +FROM base_rows +WHERE base_id = $1 AND workspace_id = $2 AND deleted_at IS NULL + AND () +ORDER BY s0 ASC, position COLLATE "C" ASC, id ASC +LIMIT 101; +``` + +`base_cell_text(cells, )` is the function extractor in `apps/server/src/core/base/engine/extractors.ts`. The expression is opaque to any per-column index, so Postgres falls back to `Parallel Seq Scan` + `top-N heapsort`. Observed: ~112 ms warm, ~10 s cold, shared buffers ~400 MB touched per page. The sort cannot be pre-indexed without per-property expression indexes — infeasible multi-tenant (write amplification, storage blowup, DDL under load). See `apps/server/src/core/base/engine/sort.ts` for all sort-build sites. + +### Why DuckDB (decided, not re-litigated here) + +- Embedded library; no new daemon, no new container. Just an npm dep. +- Real typed columns with btree indexes and a proper planner. +- Native JSON ingest so we can bulk-load from Postgres JSONB without reshaping. +- Cheap point updates (`INSERT OR REPLACE`, `UPDATE ... WHERE id = ?`) — fits per-cell write cadence. +- Alternatives rejected: ClickHouse/chDB (poor point updates), shadow cells table in Postgres (write fanout pathological at bulk import), on-the-fly expression indexes (write amplification + DDL under load), external search stores (breaks self-host simplicity). + +### DuckDB state is derived, not replicated + +Postgres remains source of truth. Any node can die, any collection can be evicted, state regenerates on demand from the existing `BaseRowRepo.streamByBaseId` generator (already used by type-conversion, cell-gc, CSV export). This is why the rollout is safe: worst case, a bug in the cache path makes the system slow (falls back to Postgres); it cannot corrupt user data. + +--- + +## Data flow at a glance + +### Read path — small base + +`BaseRowController.list` → `BaseRowService.list` → `BaseRowRepo.list` → Postgres. Unchanged. + +### Read path — large base with filter/sort/search + +`BaseRowController.list` → `BaseRowService.list` → `BaseQueryRouter.route(baseId, query)` → + +- If flag off OR row count below threshold OR query has no filter/sort/search → Postgres path (unchanged). +- Else → `BaseQueryCacheService.list(baseId, query, pagination)`: + 1. If collection isn't resident: call `CollectionLoader.load(baseId)`, which reads `bases.schema_version`, fetches all properties, streams all live rows from Postgres, creates the DuckDB database + typed row table + indexes, and inserts the rows. Record the access in the LRU and in the Redis warm-set. + 2. If resident but `schema_version` mismatches the loaded version: reload. + 3. Translate the engine's filter/sort/search spec into DuckDB SQL via `DuckDbQueryBuilder`. Apply the keyset cursor (same `(sort-field…, position, id)` ordering the Postgres engine uses). + 4. Run the query, shape the result into the same `CursorPaginationResult` shape the Postgres path returns. + 5. Touch LRU access timestamp + Redis warm-set. + +### Write path + +Unchanged at the storage layer. After `BaseRowService` emits a `BaseRow*Event`, a new in-process listener `BaseQueryCacheWriteConsumer` publishes a compact change envelope on Redis channel `base-query-cache:changes:{baseId}`. On every node, `BaseQueryCacheSubscriber` receives the envelope and, if that collection is resident locally, applies the patch (`INSERT OR REPLACE`, `UPDATE`, soft-delete, or property-schema-change ⇒ invalidate collection). Originating node included — consistent behavior across publishers and subscribers. + +### Warm-up path + +On `onApplicationBootstrap`, `BaseQueryCacheService.warmUp()` reads `ZREVRANGE base-query-cache:recent 0 N-1` and loads each (non-blocking, firing the loader sequentially to avoid thundering-herd on Postgres). Warm-up is best-effort; any failure is logged, the boot sequence continues. + +### Eviction path + +Every `list` call updates an LRU timestamp on the collection's entry. When `Map.size > MAX_COLLECTIONS`, the least-recently-accessed collection is closed (`connection.closeSync()` + `db.closeSync()`) and removed. Closed collections on next access re-load. + +--- + +## Column-type mapping + +Derived from `PropertyKind` in `apps/server/src/core/base/engine/kinds.ts` and the extractor semantics in `extractors.ts`. + +| `BasePropertyType` | `PropertyKind` | DuckDB column type | Index? | Value sourced from | +|---|---|---|---|---| +| `text`, `url`, `email` | `TEXT` | `VARCHAR` | btree | `cells->>propertyId` as text | +| `number` | `NUMERIC` | `DOUBLE` | btree | JSON number → double | +| `date` | `DATE` | `TIMESTAMPTZ` | btree | ISO string → `TIMESTAMPTZ` | +| `checkbox` | `BOOL` | `BOOLEAN` | btree | JSON bool | +| `select`, `status` | `SELECT` | `VARCHAR` (choice uuid as text) | btree | `cells->>propertyId` | +| `multiSelect` | `MULTI` | `JSON` (raw) | none | JSON array of uuids — filter via `json_array_contains`, sort not supported | +| `person` (single) | `PERSON` | `VARCHAR` | btree | single uuid string | +| `person` (multi, `typeOptions.allowMultiple`) | `PERSON` | `JSON` | none | uuid array | +| `file` | `FILE` | `JSON` | none | array of file descriptors; filter only (isEmpty/isNotEmpty/any) | +| `createdAt` | system | `TIMESTAMPTZ NOT NULL` | btree | `base_rows.created_at` | +| `lastEditedAt` | system | `TIMESTAMPTZ NOT NULL` | btree | `base_rows.updated_at` | +| `lastEditedBy` | `SYS_USER` | `VARCHAR` | btree | `base_rows.last_updated_by_id` | + +Plus fixed system columns every collection has: +- `id VARCHAR NOT NULL PRIMARY KEY` (uuid7 text; btree-indexed via PK) +- `position VARCHAR NOT NULL` (btree; fractional-index string, compared with BINARY collation to mirror `COLLATE "C"` in Postgres) +- `search_text VARCHAR` (btree + LIKE index where supported — initial impl uses plain `VARCHAR` + `LIKE '%…%'`; full-text is v1 out-of-scope and falls through to Postgres for `mode: 'fts'`) +- `deleted_at TIMESTAMPTZ` (rows with `deleted_at != NULL` are filtered on every query; the loader only inserts live rows so this stays NULL in steady state — kept only so soft-delete invalidation via Redis can be handled as an `UPDATE` without a shape change) + +Column naming in DuckDB uses the property's `id` (uuid) verbatim, wrapped in double-quotes (`"019c69a3-dd47-7014-8b87-ec8f167577ee"`). This keeps rename-safe and removes any identifier collision with system columns. + +--- + +## Redis channel naming + +- **Change stream (per-base):** `base-query-cache:changes:{baseId}` — published on every row/property/view mutation that affects a cached collection. Payload is a minimal JSON envelope (see task 7 for schema). +- **Warm-up sorted set:** `base-query-cache:recent` — `ZADD` with timestamp score each time a collection is accessed; on boot, pull top N with `ZREVRANGE`. Capped weekly by a trim on writes (`ZREMRANGEBYRANK` to keep ≤ 10×MAX_COLLECTIONS entries). +- **Ownership lease (phase 2 only):** `base-query-cache:owner:{baseId}` — reserved, NOT used by phase 1. Phase 1 runs all nodes as potential owners. + +Channel prefix `base-query-cache:` is chosen to be unambiguous and discoverable with `KEYS base-query-cache:*` during debugging, consistent with existing `presence:base:` / `typesense:` prefixes. + +--- + +## File Structure + +**New files:** + +- `apps/server/src/core/base/query-cache/query-cache.module.ts` — NestJS module with all providers + `OnApplicationBootstrap` warm-up hook. +- `apps/server/src/core/base/query-cache/query-cache.types.ts` — `LoadedCollection`, `ChangeEnvelope`, `ColumnSpec`, type aliases. +- `apps/server/src/core/base/query-cache/query-cache.config.ts` — reads `BASE_QUERY_CACHE_*` env vars, exposes typed config. +- `apps/server/src/core/base/query-cache/column-types.ts` — pure property-type → DuckDB column-spec mapping (mirror of table above). Unit-testable. +- `apps/server/src/core/base/query-cache/duckdb-query-builder.ts` — pure translator from `FilterNode` / `SortSpec[]` / `SearchSpec` to DuckDB SQL fragments + parameter array. Unit-testable. +- `apps/server/src/core/base/query-cache/collection-loader.ts` — loads one base from Postgres into a DuckDB database (schema creation, row streaming, index builds). +- `apps/server/src/core/base/query-cache/base-query-cache.service.ts` — owns the `Map`, LRU, `list()`, `invalidate()`, `applyChange()`, `warmUp()`. +- `apps/server/src/core/base/query-cache/base-query-router.ts` — decision logic: should this query go to cache or Postgres? +- `apps/server/src/core/base/query-cache/base-query-cache.write-consumer.ts` — `@OnEvent(...)` listener that publishes change envelopes to Redis. +- `apps/server/src/core/base/query-cache/base-query-cache.subscriber.ts` — Redis subscriber that receives envelopes and calls `applyChange()`. +- `apps/server/src/core/base/query-cache/column-types.spec.ts` — unit tests for the mapping. +- `apps/server/src/core/base/query-cache/duckdb-query-builder.spec.ts` — unit tests for translation. +- `apps/server/src/core/base/query-cache/base-query-router.spec.ts` — unit tests for the routing decision (pure, no DB). +- `apps/server/src/core/base/query-cache/base-query-cache.integration.spec.ts` — integration tests against real Postgres + real DuckDB (Redis stubbed out with `EventEmitter2` loopback for determinism). Enabled only when `INTEGRATION_DB_URL` is set. + +**Modified files:** + +- `apps/server/src/core/base/base.module.ts` — import `BaseQueryCacheModule`. +- `apps/server/src/core/base/services/base-row.service.ts` — delegate the decision in `list()` to `BaseQueryRouter`. Everything else unchanged. +- `apps/server/src/database/repos/base/base-row.repo.ts` — add `countActiveRows(baseId, workspaceId)` used by the router to decide "large base?"; add no other changes. +- `apps/server/src/integrations/environment/environment.service.ts` — getters for the four new env vars. +- `apps/server/package.json` — add `@duckdb/node-api` dep. + +--- + +## Task 1: Add DuckDB dependency + environment getters + +**Files:** +- Modify: `apps/server/package.json` +- Modify: `apps/server/src/integrations/environment/environment.service.ts` + +- [ ] **Step 1: Install `@duckdb/node-api`** + +From repo root: +```bash +pnpm --filter server add @duckdb/node-api@^1.5.0 +``` + +Expected: `@duckdb/node-api` appears in `apps/server/package.json` under `dependencies` at `^1.5.x` (this is the official high-level DuckDB Node.js binding; it depends transitively on `@duckdb/node-bindings` which is the low-level C-API wrapper — a single top-level dep is correct). + +- [ ] **Step 2: Verify server still builds** + +```bash +pnpm nx run server:build +``` + +Expected: build succeeds, no type errors. + +- [ ] **Step 3: Add four new env-var getters** + +Append to `apps/server/src/integrations/environment/environment.service.ts`, grouped with other feature-flag getters: + +```ts + getBaseQueryCacheEnabled(): boolean { + return this.configService.get('BASE_QUERY_CACHE_ENABLED', 'false') === 'true'; + } + + getBaseQueryCacheMinRows(): number { + return parseInt( + this.configService.get('BASE_QUERY_CACHE_MIN_ROWS', '25000'), + 10, + ); + } + + getBaseQueryCacheMaxCollections(): number { + // Default is intentionally low (50) because a single-node self-host with + // ~100 MB per collection can pin ~5 GB RSS at the cap. SaaS/larger + // deployments can raise via env. See Appendix. + return parseInt( + this.configService.get('BASE_QUERY_CACHE_MAX_COLLECTIONS', '50'), + 10, + ); + } + + getBaseQueryCacheWarmTopN(): number { + return parseInt( + this.configService.get('BASE_QUERY_CACHE_WARM_TOP_N', '50'), + 10, + ); + } +``` + +- [ ] **Step 4: Commit** + +This workspace uses a single root lockfile (`/Users/lite/WebstormProjects/docmost-base/pnpm-lock.yaml`, confirmed by `ls`). `pnpm --filter server add ...` still mutates only that root lockfile — no `apps/server/pnpm-lock.yaml` is created — so the staged path below is the repo-root lockfile. + +```bash +git add apps/server/package.json pnpm-lock.yaml apps/server/src/integrations/environment/environment.service.ts +git commit -m "chore(server): add duckdb dependency and query-cache env getters" +``` + +--- + +## Task 2: Pure column-type mapping with unit tests + +**Files:** +- Create: `apps/server/src/core/base/query-cache/column-types.ts` +- Create: `apps/server/src/core/base/query-cache/column-types.spec.ts` +- Create: `apps/server/src/core/base/query-cache/query-cache.types.ts` + +- [ ] **Step 1: Define shared types** + +Create `apps/server/src/core/base/query-cache/query-cache.types.ts`: + +```ts +import type { DuckDBConnection, DuckDBInstance } from '@duckdb/node-api'; +import type { BaseProperty } from '@docmost/db/types/entity.types'; + +export type DuckDbColumnType = + | 'VARCHAR' + | 'DOUBLE' + | 'BOOLEAN' + | 'TIMESTAMPTZ' + | 'JSON'; + +export type ColumnSpec = { + // The uuid of the property (user-defined props) or a stable literal + // ('id', 'position', 'created_at', 'updated_at', 'last_updated_by_id', + // 'deleted_at', 'search_text') for system columns. + column: string; + ddlType: DuckDbColumnType; + indexable: boolean; + // For user-defined props we keep the source BaseProperty so callers can + // resolve the extraction rule from JSON. + property?: Pick; +}; + +export type LoadedCollection = { + baseId: string; + schemaVersion: number; + columns: ColumnSpec[]; + instance: DuckDBInstance; + connection: DuckDBConnection; + lastAccessedAt: number; +}; + +export type ChangeEnvelope = + | { kind: 'row-upsert'; baseId: string; row: Record } + | { kind: 'row-delete'; baseId: string; rowId: string } + | { kind: 'rows-delete'; baseId: string; rowIds: string[] } + | { kind: 'row-reorder'; baseId: string; rowId: string; position: string } + | { kind: 'schema-invalidate'; baseId: string; schemaVersion: number }; +``` + +- [ ] **Step 2: Write failing tests for the mapping** + +Create `apps/server/src/core/base/query-cache/column-types.spec.ts`: + +```ts +import { BasePropertyType } from '../base.schemas'; +import { buildColumnSpecs, SYSTEM_COLUMNS } from './column-types'; + +const p = (type: string, extra: Record = {}) => ({ + id: `prop-${type}`, + type, + typeOptions: extra, +}) as any; + +describe('buildColumnSpecs', () => { + it('includes the fixed system columns first', () => { + const specs = buildColumnSpecs([]); + expect(specs.map((s) => s.column)).toEqual(SYSTEM_COLUMNS.map((s) => s.column)); + }); + + it('maps text / url / email to VARCHAR indexable', () => { + for (const t of [BasePropertyType.TEXT, BasePropertyType.URL, BasePropertyType.EMAIL]) { + const specs = buildColumnSpecs([p(t)]); + const user = specs[specs.length - 1]; + expect(user.ddlType).toBe('VARCHAR'); + expect(user.indexable).toBe(true); + } + }); + + it('maps number to DOUBLE indexable', () => { + const specs = buildColumnSpecs([p(BasePropertyType.NUMBER)]); + const user = specs[specs.length - 1]; + expect(user.ddlType).toBe('DOUBLE'); + expect(user.indexable).toBe(true); + }); + + it('maps date to TIMESTAMPTZ indexable', () => { + const specs = buildColumnSpecs([p(BasePropertyType.DATE)]); + const user = specs[specs.length - 1]; + expect(user.ddlType).toBe('TIMESTAMPTZ'); + expect(user.indexable).toBe(true); + }); + + it('maps checkbox to BOOLEAN indexable', () => { + const specs = buildColumnSpecs([p(BasePropertyType.CHECKBOX)]); + const user = specs[specs.length - 1]; + expect(user.ddlType).toBe('BOOLEAN'); + }); + + it('maps select / status to VARCHAR indexable', () => { + for (const t of [BasePropertyType.SELECT, BasePropertyType.STATUS]) { + const specs = buildColumnSpecs([p(t)]); + const user = specs[specs.length - 1]; + expect(user.ddlType).toBe('VARCHAR'); + expect(user.indexable).toBe(true); + } + }); + + it('maps multiSelect / file / multi-person to JSON non-indexable', () => { + for (const t of [BasePropertyType.MULTI_SELECT, BasePropertyType.FILE]) { + const specs = buildColumnSpecs([p(t)]); + const user = specs[specs.length - 1]; + expect(user.ddlType).toBe('JSON'); + expect(user.indexable).toBe(false); + } + const specs = buildColumnSpecs([p(BasePropertyType.PERSON, { allowMultiple: true })]); + expect(specs[specs.length - 1].ddlType).toBe('JSON'); + }); + + it('maps single-person to VARCHAR indexable when allowMultiple=false', () => { + const specs = buildColumnSpecs([p(BasePropertyType.PERSON, { allowMultiple: false })]); + const user = specs[specs.length - 1]; + expect(user.ddlType).toBe('VARCHAR'); + expect(user.indexable).toBe(true); + }); + + it('skips unknown property types', () => { + const specs = buildColumnSpecs([p('unknown-type-x')]); + expect(specs.length).toBe(SYSTEM_COLUMNS.length); + }); +}); +``` + +Run: +```bash +pnpm --filter server exec jest src/core/base/query-cache/column-types.spec.ts +``` + +Expected: tests fail — `column-types.ts` doesn't exist yet. + +- [ ] **Step 3: Implement** + +Create `apps/server/src/core/base/query-cache/column-types.ts`: + +```ts +import { BasePropertyType, BasePropertyTypeValue } from '../base.schemas'; +import { ColumnSpec } from './query-cache.types'; +import type { BaseProperty } from '@docmost/db/types/entity.types'; + +export const SYSTEM_COLUMNS: ColumnSpec[] = [ + { column: 'id', ddlType: 'VARCHAR', indexable: false }, + { column: 'position', ddlType: 'VARCHAR', indexable: true }, + { column: 'created_at', ddlType: 'TIMESTAMPTZ', indexable: true }, + { column: 'updated_at', ddlType: 'TIMESTAMPTZ', indexable: true }, + { column: 'last_updated_by_id', ddlType: 'VARCHAR', indexable: true }, + { column: 'deleted_at', ddlType: 'TIMESTAMPTZ', indexable: false }, + { column: 'search_text', ddlType: 'VARCHAR', indexable: false }, +]; + +type PropertyLike = Pick; + +export function buildColumnSpecs(properties: PropertyLike[]): ColumnSpec[] { + const out: ColumnSpec[] = [...SYSTEM_COLUMNS]; + for (const prop of properties) { + const spec = buildUserColumn(prop); + if (spec) out.push(spec); + } + return out; +} + +function buildUserColumn(prop: PropertyLike): ColumnSpec | null { + const t = prop.type as BasePropertyTypeValue; + switch (t) { + case BasePropertyType.TEXT: + case BasePropertyType.URL: + case BasePropertyType.EMAIL: + return { column: prop.id, ddlType: 'VARCHAR', indexable: true, property: prop }; + case BasePropertyType.NUMBER: + return { column: prop.id, ddlType: 'DOUBLE', indexable: true, property: prop }; + case BasePropertyType.DATE: + return { column: prop.id, ddlType: 'TIMESTAMPTZ', indexable: true, property: prop }; + case BasePropertyType.CHECKBOX: + return { column: prop.id, ddlType: 'BOOLEAN', indexable: true, property: prop }; + case BasePropertyType.SELECT: + case BasePropertyType.STATUS: + return { column: prop.id, ddlType: 'VARCHAR', indexable: true, property: prop }; + case BasePropertyType.MULTI_SELECT: + case BasePropertyType.FILE: + return { column: prop.id, ddlType: 'JSON', indexable: false, property: prop }; + case BasePropertyType.PERSON: { + const allowMultiple = !!(prop.typeOptions as any)?.allowMultiple; + return allowMultiple + ? { column: prop.id, ddlType: 'JSON', indexable: false, property: prop } + : { column: prop.id, ddlType: 'VARCHAR', indexable: true, property: prop }; + } + // System types are modelled as system columns on base_rows — do not add + // a per-property column for them. They're already in SYSTEM_COLUMNS. + case BasePropertyType.CREATED_AT: + case BasePropertyType.LAST_EDITED_AT: + case BasePropertyType.LAST_EDITED_BY: + return null; + default: + return null; + } +} +``` + +Run the same jest command. Expected: `Tests: 9 passed, 9 total` (one per `it(...)` block in the spec). + +- [ ] **Step 4: Commit** + +```bash +git add apps/server/src/core/base/query-cache/ +git commit -m "feat(server): add property-type to DuckDB column-spec mapping" +``` + +--- + +## Task 3: Scaffold the query-cache module (wired but dormant) + +**Files:** +- Create: `apps/server/src/core/base/query-cache/query-cache.config.ts` +- Create: `apps/server/src/core/base/query-cache/base-query-cache.service.ts` (stub) +- Create: `apps/server/src/core/base/query-cache/base-query-router.ts` (stub returning "use postgres") +- Create: `apps/server/src/core/base/query-cache/query-cache.module.ts` +- Modify: `apps/server/src/core/base/base.module.ts` + +Purpose: get the module imported and providers resolvable end-to-end with the flag off. No DuckDB code path yet. This is the "ship it dark" commit. + +- [ ] **Step 1: Write the config provider** + +Create `apps/server/src/core/base/query-cache/query-cache.config.ts`: + +```ts +import { Injectable } from '@nestjs/common'; +import { EnvironmentService } from '../../../integrations/environment/environment.service'; + +export type QueryCacheConfig = { + enabled: boolean; + minRows: number; + maxCollections: number; + warmTopN: number; +}; + +@Injectable() +export class QueryCacheConfigProvider { + readonly config: QueryCacheConfig; + constructor(env: EnvironmentService) { + this.config = { + enabled: env.getBaseQueryCacheEnabled(), + minRows: env.getBaseQueryCacheMinRows(), + maxCollections: env.getBaseQueryCacheMaxCollections(), + warmTopN: env.getBaseQueryCacheWarmTopN(), + }; + } +} +``` + +- [ ] **Step 2: Write the service stub** + +Create `apps/server/src/core/base/query-cache/base-query-cache.service.ts`: + +```ts +import { Injectable, Logger } from '@nestjs/common'; +import { OnApplicationBootstrap, OnModuleDestroy } from '@nestjs/common'; +import { QueryCacheConfigProvider } from './query-cache.config'; + +@Injectable() +export class BaseQueryCacheService + implements OnApplicationBootstrap, OnModuleDestroy +{ + private readonly logger = new Logger(BaseQueryCacheService.name); + + constructor(private readonly configProvider: QueryCacheConfigProvider) {} + + async onApplicationBootstrap(): Promise { + const { enabled } = this.configProvider.config; + this.logger.log( + `BaseQueryCacheService bootstrapped (enabled=${enabled}).`, + ); + // Real warm-up is added in task 9. + } + + async onModuleDestroy(): Promise { + // Real cleanup is added in task 5. + } +} +``` + +- [ ] **Step 3: Write the router stub** + +Create `apps/server/src/core/base/query-cache/base-query-router.ts`: + +```ts +import { Injectable } from '@nestjs/common'; +import { QueryCacheConfigProvider } from './query-cache.config'; + +export type RouteDecision = 'postgres' | 'cache'; + +@Injectable() +export class BaseQueryRouter { + constructor(private readonly configProvider: QueryCacheConfigProvider) {} + + // Stubbed: routes always to postgres in this commit so the existing + // behavior is preserved. Real decision logic is added in task 6. + decide(_args: unknown): RouteDecision { + if (!this.configProvider.config.enabled) return 'postgres'; + return 'postgres'; + } +} +``` + +- [ ] **Step 4: Write the module** + +Create `apps/server/src/core/base/query-cache/query-cache.module.ts`: + +```ts +import { Module } from '@nestjs/common'; +import { QueryCacheConfigProvider } from './query-cache.config'; +import { BaseQueryCacheService } from './base-query-cache.service'; +import { BaseQueryRouter } from './base-query-router'; + +@Module({ + providers: [QueryCacheConfigProvider, BaseQueryCacheService, BaseQueryRouter], + exports: [BaseQueryCacheService, BaseQueryRouter, QueryCacheConfigProvider], +}) +export class BaseQueryCacheModule {} +``` + +- [ ] **Step 5: Import into BaseModule** + +Modify `apps/server/src/core/base/base.module.ts` — add the new module to `imports`: + +```ts +import { BaseQueryCacheModule } from './query-cache/query-cache.module'; +// ... +@Module({ + imports: [ + BullModule.registerQueue({ name: QueueName.BASE_QUEUE }), + BaseQueryCacheModule, + ], + // ... controllers, providers, exports unchanged +}) +export class BaseModule {} +``` + +- [ ] **Step 6: Build and boot (no run)** + +```bash +pnpm nx run server:build +``` + +Expected: build succeeds. The module compiles and providers resolve. (The `enabled=false` boot log is only observable when the server is actually run, which is out of scope for this build-only step.) + +- [ ] **Step 7: Commit** + +```bash +git add apps/server/src/core/base/query-cache/ apps/server/src/core/base/base.module.ts +git commit -m "feat(server): scaffold base query-cache module behind feature flag" +``` + +--- + +## Task 4: Pure DuckDB query builder (translate engine specs to DuckDB SQL) + +**Files:** +- Create: `apps/server/src/core/base/query-cache/duckdb-query-builder.ts` +- Create: `apps/server/src/core/base/query-cache/duckdb-query-builder.spec.ts` + +This is the second pure, unit-testable component. No DuckDB runtime required; we just produce a `{ sql, params }` pair that the service will execute later. + +- [ ] **Step 1: Test contract** + +The builder takes: +- A column set (`ColumnSpec[]` from `buildColumnSpecs`). +- A `FilterNode | undefined`, `SortSpec[] | undefined`, `SearchSpec | undefined` (from `core/base/engine/schema.zod`). +- A `{ limit, afterKeys? }` pagination block, where `afterKeys` is the decoded cursor from the existing `makeCursor` helper. + +It returns `{ sql: string; params: unknown[] }`. The SQL must: +- Always project columns in the canonical `BaseRow` shape (id, base_id, cells-as-synthesized-json, position, creator_id, last_updated_by_id, workspace_id, created_at, updated_at, deleted_at, plus each sort-field alias `s0/s1/...` for cursor pagination). +- Filter `deleted_at IS NULL`. +- Apply search / filter / sort via the same precedence the Postgres engine uses. +- Order by `(s0, s1, ..., position, id)` ascending by default, with sort direction honored per field. +- For keyset pagination, emit the lexicographic OR-chain that `cursor-pagination.ts` builds. +- `LIMIT ? + 1` so the caller can detect `hasNextPage`. + +Create `apps/server/src/core/base/query-cache/duckdb-query-builder.spec.ts` with the following test cases (write them first, then implement): + +```ts +import { buildColumnSpecs } from './column-types'; +import { buildDuckDbListQuery } from './duckdb-query-builder'; +import { BasePropertyType } from '../base.schemas'; + +const numericProp = { + id: '00000000-0000-0000-0000-000000000001', + type: BasePropertyType.NUMBER, + typeOptions: {}, +} as any; +const textProp = { + id: '00000000-0000-0000-0000-000000000002', + type: BasePropertyType.TEXT, + typeOptions: {}, +} as any; + +const columns = buildColumnSpecs([numericProp, textProp]); + +describe('buildDuckDbListQuery', () => { + it('renders no-filter, no-sort, no-search as live-rows-paginated-by-position', () => { + const { sql, params } = buildDuckDbListQuery({ + columns, + pagination: { limit: 100 }, + }); + expect(sql).toMatch(/FROM rows/); + expect(sql).toMatch(/deleted_at IS NULL/); + expect(sql).toMatch(/ORDER BY position ASC, id ASC/); + expect(sql).toMatch(/LIMIT 101/); + expect(params).toEqual([]); + }); + + it('renders numeric gt filter with parameterized value', () => { + const { sql, params } = buildDuckDbListQuery({ + columns, + filter: { + op: 'and', + children: [{ propertyId: numericProp.id, op: 'gt', value: 42 }], + }, + pagination: { limit: 100 }, + }); + expect(sql).toMatch(new RegExp(`"${numericProp.id}" > \\?`)); + expect(params).toContain(42); + }); + + it('renders text contains with ILIKE and escaped wildcards', () => { + const { sql, params } = buildDuckDbListQuery({ + columns, + filter: { + op: 'and', + children: [{ propertyId: textProp.id, op: 'contains', value: 'a_b%c' }], + }, + pagination: { limit: 100 }, + }); + expect(sql).toMatch(/ILIKE \?/); + expect(params).toContain('%a\\_b\\%c%'); + }); + + it('renders sort with sentinel wrapping and cursor keyset', () => { + const { sql } = buildDuckDbListQuery({ + columns, + sorts: [{ propertyId: numericProp.id, direction: 'asc' }], + pagination: { + limit: 50, + afterKeys: { s0: 10, position: 'A0', id: '00000000-0000-0000-0000-0000000000aa' }, + }, + }); + expect(sql).toMatch(/COALESCE\("[0-9a-f-]+", '?[Ii]nfinity'?::[A-Z]+\) AS s0/); + expect(sql).toMatch(/ORDER BY s0 ASC, position ASC, id ASC/); + // keyset OR-chain + expect(sql).toMatch(/s0 > \?/); + }); + + it('renders search in trgm mode as ILIKE on search_text', () => { + const { sql, params } = buildDuckDbListQuery({ + columns, + search: { mode: 'trgm', query: 'hello' }, + pagination: { limit: 10 }, + }); + expect(sql).toMatch(/search_text ILIKE \?/); + expect(params).toContain('%hello%'); + }); +}); +``` + +Run: +```bash +pnpm --filter server exec jest src/core/base/query-cache/duckdb-query-builder.spec.ts +``` + +Expected: fails (builder not implemented). + +- [ ] **Step 2: Implement the builder** + +Create `apps/server/src/core/base/query-cache/duckdb-query-builder.ts`. The implementation mirrors the Postgres engine in `apps/server/src/core/base/engine/predicate.ts`, `sort.ts`, `search.ts`, but emits DuckDB SQL with `?` positional params. + +Structure (no code dump — follow the Postgres engine's kind-dispatch exactly): + +- `buildDuckDbListQuery(opts)` entry point. +- `buildFilter(node, columns, params) -> string` — recurse into FilterGroup, emit `(A AND B)` / `(A OR B)`, or call `buildCondition`. +- `buildCondition(cond, col, params) -> string` — kind-dispatch on `col.property.type` via `propertyKind`, with the same operator tables as predicate.ts but using DuckDB syntax: + - text contains / startsWith / endsWith → `ILIKE ?` with the `escapeIlike` helper already in `engine/extractors.ts` (reuse it verbatim). + - numeric / date comparisons → ` ?`. + - bool `eq` / `neq` → ` ?`. + - select `any` → `"col" IN (?, ?, ?)`. + - multi/file `any` → `json_array_contains(, ?)` OR chain. + - multi/file `all` → repeated `json_array_contains` AND chain. + - isEmpty / isNotEmpty → `IS NULL` / `IS NOT NULL` (+ the empty-string leg for text). +- `buildSort(sorts, columns) -> {select: string[], orderBy: string}`: + - Mirror `sort.ts wrapWithSentinel` — wrap each sort expression in `COALESCE(, )` aliased `sN`. Sentinels: numeric → `'Infinity'::DOUBLE` / `'-Infinity'::DOUBLE`; date → `'9999-12-31 23:59:59+00'::TIMESTAMPTZ` / `'0001-01-01 00:00:00+00'::TIMESTAMPTZ`; bool → `TRUE` / `FALSE`; text → `CHR(1114111)` / `''`. + - Append tail `position, id` — unchanged. +- `buildSearch(search) -> string` — `trgm` → `search_text ILIKE ?` with the escape; `fts` → throw a typed sentinel exception `FtsNotSupportedInCache` so the router falls through to Postgres. +- `buildKeyset(afterKeys, sortAliases, columns) -> string` — the same OR-chain `cursor-pagination.ts` produces, but in DuckDB SQL with `?` params. Important: the cursor key names (`s0, s1, ..., position, id`) match the engine. + +DuckDB parameter placeholder is `?` (positional) via `@duckdb/node-api` — no named params needed. + +Run jest again; all tests should pass. + +- [ ] **Step 3: Commit** + +```bash +git add apps/server/src/core/base/query-cache/duckdb-query-builder.ts apps/server/src/core/base/query-cache/duckdb-query-builder.spec.ts +git commit -m "feat(server): add DuckDB SQL builder for base list queries" +``` + +--- + +## Task 5: Collection loader + in-memory cache + list() path (still off by default) + +**Files:** +- Create: `apps/server/src/core/base/query-cache/collection-loader.ts` +- Modify: `apps/server/src/core/base/query-cache/base-query-cache.service.ts` +- Create: `apps/server/src/core/base/query-cache/base-query-cache.integration.spec.ts` + +This is the first real DuckDB code path. The integration spec needs a real Postgres (via the existing `DATABASE_URL`) and does not need Redis (we don't wire the subscriber here). Gate the spec on `process.env.INTEGRATION_DB_URL` being set so CI can run the whole suite and local devs can run just unit tests. + +- [ ] **Step 1: Add `BaseRowRepo.countActiveRows`** + +Modify `apps/server/src/database/repos/base/base-row.repo.ts` — append: + +```ts + async countActiveRows( + baseId: string, + opts: WorkspaceOpts, + ): Promise { + const db = dbOrTx(this.db, opts.trx); + const row = await db + .selectFrom('baseRows') + .select((eb) => eb.fn.countAll().as('count')) + .where('baseId', '=', baseId) + .where('workspaceId', '=', opts.workspaceId) + .where('deletedAt', 'is', null) + .executeTakeFirst(); + return Number(row?.count ?? 0); + } +``` + +- [ ] **Step 2: Implement `CollectionLoader`** + +Create `apps/server/src/core/base/query-cache/collection-loader.ts`: + +- Inject `BasePropertyRepo`, `BaseRowRepo`, `BaseRepo`. +- Expose one method: `async load(baseId: string, workspaceId: string): Promise`. +- Implementation outline: + 1. Fetch `bases.schemaVersion` + all properties. + 2. `const specs = buildColumnSpecs(properties);` + 3. `const instance = await DuckDBInstance.create(':memory:');` then `const connection = await instance.connect();`. + 4. `CREATE TABLE rows (, PRIMARY KEY (id))`. Wrap each column name in `"..."`. + 5. Bulk-insert rows using the **DuckDB Appender API** (`connection.createAppender('rows')`). The appender is DuckDB's idiomatic high-throughput insert path — it skips SQL parsing per row, accepts typed column values positionally in declared column order, and flushes on `closeSync()`. Prepared `INSERT` with re-binding works too but is 5–10× slower for wide tables; the appender is the right default for cold load of 100K rows. + + Stream rows via `baseRowRepo.streamByBaseId(baseId, { workspaceId, chunkSize: 5000 })`. For each row: + + ```ts + // appender column order matches specs[] order (CREATE TABLE declared order) + for (const spec of specs) { + const raw = readFromRow(row, spec); // system field or row.cells[prop.id] + if (raw == null) { + appender.appendNull(); + continue; + } + switch (spec.ddlType) { + case 'VARCHAR': + appender.appendVarchar(String(raw)); + break; + case 'DOUBLE': { + const n = Number(raw); + if (Number.isNaN(n)) { appender.appendNull(); break; } + appender.appendDouble(n); + break; + } + case 'BOOLEAN': + appender.appendBoolean(Boolean(raw)); + break; + case 'TIMESTAMPTZ': { + // bindTimestampTZ / appendTimestampTZ take DuckDBTimestampTZValue. + // Cheapest path is to format as ISO8601 and append as VARCHAR into + // a TIMESTAMPTZ column — DuckDB casts implicitly. If profiling + // shows this is a bottleneck, switch to DuckDBTimestampTZValue via + // `new DuckDBTimestampTZValue(microsSinceEpoch)`. + const d = raw instanceof Date ? raw : new Date(String(raw)); + if (Number.isNaN(d.getTime())) { appender.appendNull(); break; } + appender.appendVarchar(d.toISOString()); + break; + } + case 'JSON': + // JSON columns in DuckDB accept a VARCHAR containing JSON text. + appender.appendVarchar(JSON.stringify(raw)); + break; + } + } + appender.endRow(); + ``` + + After the stream drains: `appender.flushSync(); appender.closeSync();`. + + Notes: + - System columns: use the row's direct fields (`id`, `position`, `createdAt`, `updatedAt`, `lastUpdatedById`, `null` for `deleted_at`, `''` for `search_text` — we leave `search_text` unpopulated in v1; search stays on Postgres until task 4's `FtsNotSupportedInCache` sentinel drives us back, and even trgm search against an unpopulated column returns no rows — so route trgm search to Postgres as well in v1, see task 6). + - Malformed values log at debug level and become `null`. + 6. For each indexable column: `CREATE INDEX idx_ ON rows ("");`. The PK already indexes `id`. + 7. Return `{ baseId, schemaVersion, columns: specs, instance, connection, lastAccessedAt: Date.now() }`. + +Rationale for `:memory:` DuckDB per base: isolation, easy eviction (just close the instance), no disk file management. Memory overhead per collection is roughly `(columns * rows * avgBytes)` — a 100K-row base with 15 indexed columns is well under 100 MB. + +- [ ] **Step 3: Wire cache service** + +Replace the stub in `base-query-cache.service.ts` with a real Map + LRU + `list()` method: + +- Private `collections = new Map()`. +- Public `list(baseId: string, workspaceId: string, engineOpts: EngineListOpts): Promise>` — calls `ensureLoaded`, builds SQL + positional params via `buildDuckDbListQuery`, then executes via the DuckDB Neo prepared-statement API (DuckDB Neo's `runAndReadAll` on a `DuckDBConnection` does NOT take a params array — parameterization goes through `connection.prepare(sql)` and per-type `bind*` calls). The shape is: + + ```ts + const prepared = await collection.connection.prepare(sql); + for (let i = 0; i < params.length; i++) { + const p = params[i]; + const oneBased = i + 1; + if (p === null || p === undefined) { + prepared.bindNull(oneBased); + } else if (typeof p === 'string') { + prepared.bindVarchar(oneBased, p); + } else if (typeof p === 'number') { + prepared.bindDouble(oneBased, p); + } else if (typeof p === 'boolean') { + prepared.bindBoolean(oneBased, p); + } else if (p instanceof Date) { + // ISO string into a TIMESTAMPTZ column casts implicitly; see the + // loader's TIMESTAMPTZ handling above for the rationale. + prepared.bindVarchar(oneBased, p.toISOString()); + } else { + // Fallback: serialize complex values as JSON text for JSON columns. + prepared.bindVarchar(oneBased, JSON.stringify(p)); + } + } + const reader = await prepared.runAndReadAll(); + const rows = reader.getRowObjects(); // use getRowObjects consistently so the + // row-shaping code can read by column name + ``` + + Then shape rows into `BaseRow[]`, handle `LIMIT+1` → `hasNextPage`, and encode cursors using the existing `makeCursor` helper so the output is byte-identical to Postgres. +- Private `ensureLoaded(baseId, workspaceId)` — hit map, compare `schemaVersion` from a cheap `bases.findById` against the cached one; on miss or stale, call loader. Update `lastAccessedAt`. On cap exceeded, evict least-recently-used (single-pass scan of map is fine for ≤1000 entries; a `LinkedHashMap` is overkill). +- Public `invalidate(baseId)` — close the connection + instance if resident; remove from map. +- `onModuleDestroy` closes all collections. + +- [ ] **Step 4a: Extract a reusable seeding helper** + +The seed script currently runs as a standalone process (`apps/server/src/scripts/seed-base-rows.ts`). Before writing the integration spec, extract its base + property + row generation into a module-exportable helper that both the script and the test can call. This is a prerequisite — no such helper exists in the repo today (verified 2026-04-19 — the repo has no `*.integration.spec.ts` files and no test-seeding helpers under `apps/server/test/`). + +Create `apps/server/src/core/base/query-cache/testing/seed-base.ts`: + +```ts +import type { Kysely } from 'kysely'; +import { v7 as uuid7 } from 'uuid'; +import { generateJitteredKeyBetween } from 'fractional-indexing-jittered'; +// Re-use the deterministic generators from seed-base-rows.ts by exporting +// them from that file (move WORDS/COLORS/randomWords etc. above a new +// `export` keyword, leaving the top-level script body intact). + +export type SeedBaseOptions = { + db: Kysely; + workspaceId: string; + spaceId: string; + creatorUserId: string; + rows: number; + name?: string; +}; + +export type SeededBase = { + baseId: string; + propertyIds: { + text: string; + number: string; + status: string; + date: string; + // ... whichever props the script emits + }; +}; + +export async function seedBase(opts: SeedBaseOptions): Promise { + // Move the guts of the existing seed-base-rows.ts top-level body here. + // Keep the RNG deterministic (seed with opts.name + opts.rows) so tests + // are reproducible across machines. +} + +export async function deleteSeededBase( + db: Kysely, + baseId: string, +): Promise { + // DELETE rows first, then properties, views, then the base, matching + // the hard-delete order the existing cleanup script uses. +} +``` + +Also update `apps/server/src/scripts/seed-base-rows.ts` to be a thin wrapper that calls `seedBase({ ... })` so `TOTAL_ROWS=10000 tsx src/scripts/seed-base-rows.ts` still works. + +- [ ] **Step 4b: Integration test — single-collection loader round-trip** + +Create `apps/server/src/core/base/query-cache/base-query-cache.integration.spec.ts`: + +```ts +import { Test, TestingModule } from '@nestjs/testing'; +import { ConfigModule } from '@nestjs/config'; +import { EventEmitterModule } from '@nestjs/event-emitter'; +import { KyselyModule, InjectKysely } from '@docmost/db/kysely'; +import { BaseModule } from '../base.module'; +import { BaseQueryCacheService } from './base-query-cache.service'; +import { BaseRowService } from '../services/base-row.service'; +import { seedBase, deleteSeededBase } from './testing/seed-base'; + +// Skip the suite when no integration DB is wired. Run locally with: +// INTEGRATION_DB_URL=$DATABASE_URL pnpm --filter server exec jest \ +// src/core/base/query-cache/base-query-cache.integration.spec.ts +const describeIfIntegration = process.env.INTEGRATION_DB_URL + ? describe + : describe.skip; + +describeIfIntegration('BaseQueryCacheService (integration)', () => { + let module: TestingModule; + let cache: BaseQueryCacheService; + let rowService: BaseRowService; + let db: any; + let workspaceId: string; + let spaceId: string; + let userId: string; + let seededBaseId: string; + + beforeAll(async () => { + // Minimal real-module bootstrap. We import BaseModule (which now + // imports BaseQueryCacheModule via Task 3) plus the infra modules it + // depends on. We rely on process.env.INTEGRATION_DB_URL (aliased to + // DATABASE_URL) being set; without it the whole describe is skipped. + process.env.DATABASE_URL = process.env.INTEGRATION_DB_URL; + process.env.BASE_QUERY_CACHE_ENABLED = 'true'; + process.env.BASE_QUERY_CACHE_MIN_ROWS = '100'; // so a 10K seed counts as "large" + + module = await Test.createTestingModule({ + imports: [ + ConfigModule.forRoot({ isGlobal: true }), + EventEmitterModule.forRoot(), + KyselyModule.forRoot(), // uses DATABASE_URL + BaseModule, + ], + }).compile(); + + cache = module.get(BaseQueryCacheService); + rowService = module.get(BaseRowService); + db = module.get('KYSELY_DB'); // the token the repo uses for InjectKysely + + // These three IDs must exist in the test database. Prefer to seed them + // in a one-time `globalSetup` script if the CI DB is empty; for a + // developer machine pointing at their dev DB, look them up from any + // existing workspace/space/user. + const anyWorkspace = await db + .selectFrom('workspaces') + .select(['id']) + .limit(1) + .executeTakeFirstOrThrow(); + const anySpace = await db + .selectFrom('spaces') + .select(['id']) + .where('workspaceId', '=', anyWorkspace.id) + .limit(1) + .executeTakeFirstOrThrow(); + const anyUser = await db + .selectFrom('users') + .select(['id']) + .where('workspaceId', '=', anyWorkspace.id) + .limit(1) + .executeTakeFirstOrThrow(); + workspaceId = anyWorkspace.id; + spaceId = anySpace.id; + userId = anyUser.id; + + const seeded = await seedBase({ + db, + workspaceId, + spaceId, + creatorUserId: userId, + rows: 10_000, + name: 'query-cache-integration', + }); + seededBaseId = seeded.baseId; + }, /* boot + seed timeout */ 120_000); + + afterAll(async () => { + if (seededBaseId) await deleteSeededBase(db, seededBaseId); + await module?.close(); + }); + + it('paginates a numeric sort identically to Postgres', async () => { + const numberPropId = /* look up from seeded.propertyIds.number */ ''; + const args = { + sorts: [{ propertyId: numberPropId, direction: 'asc' as const }], + pagination: { limit: 500 }, + }; + + const pgPages: string[] = []; + let cursor: string | undefined; + do { + const page = await rowService.list( + { baseId: seededBaseId, ...args }, + { limit: args.pagination.limit, cursor }, + workspaceId, + ); + pgPages.push(...page.items.map((r) => r.id)); + cursor = page.nextCursor; + } while (cursor); + + const dkPages: string[] = []; + cursor = undefined; + do { + const page = await cache.list(seededBaseId, workspaceId, { + ...args, + pagination: { limit: args.pagination.limit, cursor }, + }); + dkPages.push(...page.items.map((r) => r.id)); + cursor = page.nextCursor; + } while (cursor); + + expect(dkPages).toEqual(pgPages); + expect(dkPages.length).toBe(10_000); + }, /* query timeout */ 60_000); +}); +``` + +Two things the executor MUST verify before running the test: +1. The `KyselyModule` / `InjectKysely` token and the `'KYSELY_DB'` string literal above are placeholders — the real token is defined in `packages/db/src/kysely/` (or wherever `@docmost/db` lives); look it up and substitute. +2. `rowService.list` signature above matches the real one after Task 6's edits; re-check argument shape at test-write time. + +Run with: + +```bash +INTEGRATION_DB_URL=$DATABASE_URL pnpm --filter server exec jest src/core/base/query-cache/base-query-cache.integration.spec.ts +``` + +Expected: `Tests: 1 passed, 1 total`. Without `INTEGRATION_DB_URL` the whole describe is skipped, so local unit-test runs stay fast. + +- [ ] **Step 5: Commit** + +```bash +git add apps/server/src/core/base/query-cache/collection-loader.ts \ + apps/server/src/core/base/query-cache/base-query-cache.service.ts \ + apps/server/src/core/base/query-cache/base-query-cache.integration.spec.ts \ + apps/server/src/core/base/query-cache/testing/seed-base.ts \ + apps/server/src/database/repos/base/base-row.repo.ts \ + apps/server/src/scripts/seed-base-rows.ts +git commit -m "feat(server): load bases into DuckDB and serve list queries from cache" +``` + +--- + +## Task 6: Router — decide between Postgres and DuckDB paths; wire into BaseRowService + +**Files:** +- Modify: `apps/server/src/core/base/query-cache/base-query-router.ts` +- Create: `apps/server/src/core/base/query-cache/base-query-router.spec.ts` +- Modify: `apps/server/src/core/base/services/base-row.service.ts` + +- [ ] **Step 1: Write failing unit tests for the router** + +Create `apps/server/src/core/base/query-cache/base-query-router.spec.ts`. Cases (all pure, no DB — use a fake config provider and a fake row-count function): + +```ts +describe('BaseQueryRouter.decide', () => { + it('returns postgres when flag is off', () => { /* ... */ }); + it('returns postgres when row count < minRows', () => { /* ... */ }); + it('returns postgres when query has no filter/sort/search', () => { /* ... */ }); + it('returns postgres when search.mode === "fts" even for large base', () => { /* ... */ }); + it('returns cache when flag on + rows >= minRows + has filter', () => { /* ... */ }); + it('returns cache when flag on + rows >= minRows + has sort', () => { /* ... */ }); + it('returns cache when flag on + rows >= minRows + has trgm search', () => { /* ... */ }); +}); +``` + +Run: +```bash +pnpm --filter server exec jest src/core/base/query-cache/base-query-router.spec.ts +``` + +Expected: fails (router still stubbed). + +- [ ] **Step 2: Implement the real router** + +Replace stub in `base-query-router.ts`: + +```ts +@Injectable() +export class BaseQueryRouter { + constructor( + private readonly configProvider: QueryCacheConfigProvider, + private readonly baseRowRepo: BaseRowRepo, + ) {} + + async decide(args: { + baseId: string; + workspaceId: string; + filter?: FilterNode; + sorts?: SortSpec[]; + search?: SearchSpec; + }): Promise { + const { enabled, minRows } = this.configProvider.config; + if (!enabled) return 'postgres'; + + const hasFilter = !!args.filter; + const hasSorts = !!args.sorts && args.sorts.length > 0; + const hasSearch = !!args.search; + if (!hasFilter && !hasSorts && !hasSearch) return 'postgres'; + + // v1: full-text search stays on Postgres. Trgm search also stays on + // Postgres until we populate search_text in DuckDB; re-evaluate after + // the loader gains search-column population. + if (args.search) return 'postgres'; + + const count = await this.baseRowRepo.countActiveRows(args.baseId, { + workspaceId: args.workspaceId, + }); + if (count < minRows) return 'postgres'; + + return 'cache'; + } +} +``` + +Clarification: trgm search is gated off in v1 because the loader in task 5 does not populate `search_text`. If/when we decide to support trgm in the cache, the loader populates it and this branch relaxes. This is explicitly called out so the executor doesn't implement trgm + loader-population in the same commit and get cross-branch test failures. + +- [ ] **Step 3: Wire into `BaseRowService.list`** + +Modify `apps/server/src/core/base/services/base-row.service.ts`: + +```ts +constructor( + @InjectKysely() private readonly db: KyselyDB, + private readonly baseRowRepo: BaseRowRepo, + private readonly basePropertyRepo: BasePropertyRepo, + private readonly baseViewRepo: BaseViewRepo, + private readonly eventEmitter: EventEmitter2, + private readonly queryRouter: BaseQueryRouter, + private readonly queryCache: BaseQueryCacheService, +) {} + +async list(dto: ListRowsDto, pagination: PaginationOptions, workspaceId: string) { + const properties = await this.basePropertyRepo.findByBaseId(dto.baseId); + const schema: PropertySchema = new Map(properties.map((p) => [p.id, p])); + + const filter = this.normaliseFilter(dto); + const search = this.normaliseSearch(dto.search); + const sorts = dto.sorts?.map((s) => ({ + propertyId: s.propertyId, + direction: s.direction, + })); + + const decision = await this.queryRouter.decide({ + baseId: dto.baseId, + workspaceId, + filter, + sorts, + search, + }); + + if (decision === 'cache') { + try { + return await this.queryCache.list(dto.baseId, workspaceId, { + filter, + sorts, + search, + schema, + pagination, + }); + } catch (err) { + // Clean fall-through: cache must never surface to the client. + // Log and let Postgres handle it. + this.logger.warn( + `Cache list failed for base ${dto.baseId}, falling back to Postgres`, + err as Error, + ); + } + } + + return this.baseRowRepo.list({ + baseId: dto.baseId, + workspaceId, + filter, + sorts, + search, + schema, + pagination, + }); +} +``` + +Before the `try/catch` above, ensure `BaseRowService` declares `private readonly logger = new Logger(BaseRowService.name);` as a class field (import `Logger` from `@nestjs/common`). If it doesn't already have one, add it in this same edit — the `this.logger.warn(...)` call in the catch block requires it and we do NOT fall back to optional chaining (`this.logger?.warn?.`) because defensive optional chaining on a class-owned field hides wiring bugs (CLAUDE.md C-7). Also ensure `BaseQueryCacheModule` is imported by `BaseModule` so both providers resolve (done in task 3). + +- [ ] **Step 4: Verify router tests pass** + +```bash +pnpm --filter server exec jest src/core/base/query-cache/base-query-router.spec.ts +``` + +Expected: `Tests: 7 passed, 7 total` (one per `it(...)` block in the spec above). + +- [ ] **Step 5: Verify build** + +```bash +pnpm nx run server:build +``` + +- [ ] **Step 6: Commit** + +```bash +git add apps/server/src/core/base/query-cache/base-query-router.ts \ + apps/server/src/core/base/query-cache/base-query-router.spec.ts \ + apps/server/src/core/base/services/base-row.service.ts +git commit -m "feat(server): route large base list queries through the duckdb cache" +``` + +--- + +## Task 7: Publish change envelopes to Redis, apply on subscribe + +**Files:** +- Create: `apps/server/src/core/base/query-cache/base-query-cache.write-consumer.ts` +- Create: `apps/server/src/core/base/query-cache/base-query-cache.subscriber.ts` +- Modify: `apps/server/src/core/base/query-cache/base-query-cache.service.ts` (add `applyChange`) +- Modify: `apps/server/src/core/base/query-cache/query-cache.module.ts` + +Pattern matches `BaseWsConsumers` (in-process `@OnEvent` listener) + `BasePresenceService` (ioredis via `RedisService`). + +- [ ] **Step 1: Write the Redis publisher** + +Create `apps/server/src/core/base/query-cache/base-query-cache.write-consumer.ts`: + +All `EventName` identifiers below are defined in `apps/server/src/common/events/event.contants.ts` (verified 2026-04-19). Exact line references: `BASE_ROW_CREATED` L23, `BASE_ROW_UPDATED` L24, `BASE_ROW_DELETED` L25, `BASE_ROWS_DELETED` L26, `BASE_ROW_REORDERED` L28, `BASE_PROPERTY_CREATED` L30, `BASE_PROPERTY_UPDATED` L31, `BASE_PROPERTY_DELETED` L32, `BASE_SCHEMA_BUMPED` L39. No new events need to be added for Task 7. + +- Imports `RedisService` from `@nestjs-labs/nestjs-ioredis`; resolves the client in constructor via `redisService.getOrThrow()` (same as `apps/server/src/core/base/realtime/base-presence.service.ts`). +- `@OnEvent(EventName.BASE_ROW_CREATED)` → publish `{ kind: 'row-upsert', baseId, row }`. +- `@OnEvent(EventName.BASE_ROW_UPDATED)` → fetch the updated row from Postgres (full row needed — `patch` alone doesn't carry all columns) and publish `{ kind: 'row-upsert', baseId, row }`. Optimization deferred: pass the row through the event payload (requires an additive tweak to `BaseRowUpdatedEvent` — out of scope for v1; instead, call `baseRowRepo.findById` here). +- `@OnEvent(EventName.BASE_ROW_DELETED)` → `{ kind: 'row-delete', baseId, rowId }`. +- `@OnEvent(EventName.BASE_ROWS_DELETED)` → `{ kind: 'rows-delete', baseId, rowIds }`. +- `@OnEvent(EventName.BASE_ROW_REORDERED)` → `{ kind: 'row-reorder', baseId, rowId, position }`. +- `@OnEvent(EventName.BASE_SCHEMA_BUMPED)` and `@OnEvent(EventName.BASE_PROPERTY_UPDATED)` / `BASE_PROPERTY_CREATED` / `BASE_PROPERTY_DELETED` → `{ kind: 'schema-invalidate', baseId, schemaVersion }`. + +Channel: `base-query-cache:changes:${baseId}`. Payload: `JSON.stringify(envelope)`. + +Guard with `if (!configProvider.config.enabled) return;` at the top of each handler so the flag-off path pays zero. + +- [ ] **Step 2: Write the Redis subscriber** + +Create `apps/server/src/core/base/query-cache/base-query-cache.subscriber.ts`: + +- Implement `OnApplicationBootstrap` + `OnModuleDestroy`. +- On bootstrap (if flag enabled): create a dedicated ioredis client (NOT the shared one — ioredis pub/sub clients enter subscriber-only mode). `apps/server/src/core/base/realtime/base-presence.service.ts` shows the "shared client" pattern used for regular ops; for a dedicated subscriber use `new Redis(parseRedisUrl(env.getRedisUrl()))` mirroring `apps/server/src/ws/adapter/ws-redis.adapter.ts:16` (`parseRedisUrl` itself is exported from `apps/server/src/common/helpers/utils.ts:35`). +- `PSUBSCRIBE base-query-cache:changes:*`. +- On message: parse envelope, call `cacheService.applyChange(envelope)`. Catch-and-log any parse error. +- On destroy: `quit()` the client. + +- [ ] **Step 3: Add `applyChange` to the cache service** + +Modify `base-query-cache.service.ts`: + +```ts +async applyChange(env: ChangeEnvelope): Promise { + const collection = this.collections.get(env.baseId); + if (!collection) return; // not resident on this node, ignore + try { + switch (env.kind) { + case 'schema-invalidate': + if (env.schemaVersion > collection.schemaVersion) { + await this.invalidate(env.baseId); + } + return; + case 'row-upsert': + await this.upsertRow(collection, env.row); + return; + case 'row-delete': + await this.deleteRow(collection, env.rowId); + return; + case 'rows-delete': + for (const id of env.rowIds) await this.deleteRow(collection, id); + return; + case 'row-reorder': + await this.updatePosition(collection, env.rowId, env.position); + return; + } + } catch (err) { + // On any patch failure, nuke the collection; next read reloads from + // Postgres. Much safer than running with a partially-patched cache. + this.logger.warn(`applyChange failed for ${env.baseId}; invalidating`, err); + await this.invalidate(env.baseId); + } +} +``` + +`upsertRow` / `deleteRow` / `updatePosition` use the prepare/bind/run pattern documented in Task 5 Step 3 (DuckDB Neo API): +- `upsertRow`: `INSERT OR REPLACE INTO rows VALUES (?, ?, …)` — prepare once per call, bind each column's value per `specs[i].ddlType` (same coercion table as the loader), then `await prepared.run()`. +- `deleteRow`: only "live rows" are cached; soft-delete → remove. `DELETE FROM rows WHERE id = ?` with `prepared.bindVarchar(1, rowId); await prepared.run();`. +- `updatePosition`: `UPDATE rows SET position = ? WHERE id = ?` with `prepared.bindVarchar(1, position); prepared.bindVarchar(2, rowId); await prepared.run();`. + +No params-array shorthand is used anywhere — that signature doesn't exist on `DuckDBConnection`. + +- [ ] **Step 4: Register in the module** + +Modify `query-cache.module.ts` providers list: + +```ts +providers: [ + QueryCacheConfigProvider, + BaseQueryCacheService, + BaseQueryRouter, + CollectionLoader, + BaseQueryCacheWriteConsumer, + BaseQueryCacheSubscriber, +], +``` + +- [ ] **Step 5: Integration test — round-trip of a row-update event** + +Extend `base-query-cache.integration.spec.ts` with a test that: + +1. Seeds a base. +2. Forces a load via `cache.list(...)`. +3. Directly emits `EventName.BASE_ROW_UPDATED` through the `EventEmitter2` (bypasses Redis — acceptable since the write-consumer still fires, publishing onto Redis, and the subscriber on the same node receives the event). +4. Reads via `cache.list(...)` again, asserts the updated row reflects the patch. + +For pure determinism, consider adding a secondary path where `applyChange` is called directly in the test without Redis, but the whole-loop test (Redis included) catches channel-name typos. + +Run: +```bash +INTEGRATION_DB_URL=$DATABASE_URL REDIS_URL=$REDIS_URL pnpm --filter server exec jest src/core/base/query-cache/base-query-cache.integration.spec.ts +``` + +- [ ] **Step 6: Commit** + +```bash +git add apps/server/src/core/base/query-cache/base-query-cache.write-consumer.ts \ + apps/server/src/core/base/query-cache/base-query-cache.subscriber.ts \ + apps/server/src/core/base/query-cache/base-query-cache.service.ts \ + apps/server/src/core/base/query-cache/query-cache.module.ts \ + apps/server/src/core/base/query-cache/base-query-cache.integration.spec.ts +git commit -m "feat(server): propagate row mutations to duckdb cache via redis pubsub" +``` + +--- + +## Task 8: LRU eviction — tested with a tight cap + +**Files:** +- Modify: `apps/server/src/core/base/query-cache/base-query-cache.service.ts` +- Modify: `apps/server/src/core/base/query-cache/base-query-cache.integration.spec.ts` + +Eviction logic from task 5 already scaffolded. This task is about tightening and test coverage. + +- [ ] **Step 1: Ensure eviction path is deterministic** + +In `ensureLoaded`, after inserting a new collection into the map, if `this.collections.size > this.config.maxCollections`, find the entry with the smallest `lastAccessedAt` and call `invalidate(baseId)` on it. Single loop over map entries (N ≤ 500, trivially cheap). + +- [ ] **Step 2: Integration test with cap=2** + +Add a test case that: + +1. Temporarily creates a cache service instance with `maxCollections = 2` (use a test module override on `QueryCacheConfigProvider`). +2. Seeds 3 bases. +3. Loads all 3 in sequence. After each, verify `collections.size <= 2` and that the eldest-accessed base is NOT resident anymore. +4. Loads the evicted base again — verify it reloads cleanly (schemaVersion comes back, basic query returns rows). + +Run: +```bash +INTEGRATION_DB_URL=$DATABASE_URL pnpm --filter server exec jest src/core/base/query-cache/base-query-cache.integration.spec.ts +``` + +- [ ] **Step 3: Commit** + +```bash +git add apps/server/src/core/base/query-cache/base-query-cache.service.ts \ + apps/server/src/core/base/query-cache/base-query-cache.integration.spec.ts +git commit -m "feat(server): evict least-recently-used duckdb collections when cap exceeded" +``` + +--- + +## Task 9: Warm-on-boot using recent-access Redis sorted set + +**Files:** +- Modify: `apps/server/src/core/base/query-cache/base-query-cache.service.ts` + +- [ ] **Step 1: Record access** + +In `ensureLoaded`, after a successful load or a successful hit, call `this.recordAccess(baseId)` which issues: + +``` +ZADD base-query-cache:recent +ZREMRANGEBYRANK base-query-cache:recent 0 -(maxCollections * 10 + 1) +``` + +(Fire-and-forget; errors swallowed with debug log.) + +- [ ] **Step 2: `onApplicationBootstrap` warm-up** + +Replace the stub from task 3: + +```ts +async onApplicationBootstrap(): Promise { + const { enabled, warmTopN } = this.configProvider.config; + if (!enabled) return; + try { + const ids = await this.redis.zrevrange('base-query-cache:recent', 0, warmTopN - 1); + for (const baseId of ids) { + try { + const base = await this.baseRepo.findById(baseId); + if (!base) continue; + await this.ensureLoaded(baseId, base.workspaceId); + } catch (err) { + this.logger.debug(`warm-up skipped ${baseId}: ${(err as Error).message}`); + } + } + this.logger.log(`Warmed ${ids.length} collections on boot`); + } catch (err) { + this.logger.warn('Warm-up failed', err as Error); + } +} +``` + +Warm-up is sequential intentionally: parallel warming of 50 bases each reading 25K rows from Postgres overwhelms the pool on boot. + +- [ ] **Step 3: Integration test** + +Add a test case: + +1. Access base A via `cache.list` (forces a ZADD). +2. Destroy the service instance (simulating node restart). +3. New instance with warm-up enabled. Assert base A is resident in the Map before any explicit `list` call. + +- [ ] **Step 4: Commit** + +```bash +git add apps/server/src/core/base/query-cache/base-query-cache.service.ts \ + apps/server/src/core/base/query-cache/base-query-cache.integration.spec.ts +git commit -m "feat(server): warm duckdb collections on boot from redis recent-access set" +``` + +--- + +## Task 10: End-to-end correctness — 100K seed base, cache vs postgres row-for-row + +**Files:** +- Modify: `apps/server/src/core/base/query-cache/base-query-cache.integration.spec.ts` + +This is the correctness gate for merge. + +- [ ] **Step 1: Add the scale test, skipped by default** + +```ts +const itIfScale = process.env.INTEGRATION_DB_URL && process.env.SCALE_TEST === 'true' ? it : it.skip; + +itIfScale('100K base: cache and postgres return identical rows for common queries', async () => { + // Reuses the `seedBase` helper extracted in Task 5 Step 4a + // (apps/server/src/core/base/query-cache/testing/seed-base.ts). + const { baseId } = await seedBase({ + db, + workspaceId: WS, + spaceId: SPACE, + creatorUserId: USER, + rows: 100_000, + name: 'query-cache-scale', + }); + const queries = [ + { sorts: [{ propertyId: PROP_NUMBER, direction: 'asc' }] }, + { sorts: [{ propertyId: PROP_TEXT, direction: 'desc' }] }, + { filter: { op: 'and', children: [{ propertyId: PROP_STATUS, op: 'eq', value: DONE_ID }] } }, + { + filter: { op: 'and', children: [{ propertyId: PROP_BUDGET, op: 'gt', value: 5000 }] }, + sorts: [{ propertyId: PROP_DATE, direction: 'desc' }], + }, + ]; + for (const q of queries) { + const pgAll = await collectAllPages(() => baseRowService.list(q /*, postgres-forced */)); + const dkAll = await collectAllPages(() => cacheService.list(baseId, WS, q)); + expect(dkAll.map(r => r.id)).toEqual(pgAll.map(r => r.id)); + } +}, /* 5min timeout */ 300_000); +``` + +Run: +```bash +INTEGRATION_DB_URL=$DATABASE_URL SCALE_TEST=true pnpm --filter server exec jest src/core/base/query-cache/base-query-cache.integration.spec.ts +``` + +Expected: passes; typical run-time 2–5 min (seed + 4 full-traversal comparisons). + +- [ ] **Step 2: Commit** + +```bash +git add apps/server/src/core/base/query-cache/base-query-cache.integration.spec.ts +git commit -m "test(server): assert duckdb cache matches postgres on a 100K-row base" +``` + +--- + +## Rollout checklist + +Phased exposure. Each phase is its own PR-sized change — do NOT skip ahead. + +1. **Ship dark.** Merge all tasks above with `BASE_QUERY_CACHE_ENABLED=false` in production env. Verify in metrics that `POST /bases/rows` latency and error rate are unchanged (this is the "does the scaffold actually cost nothing when off" check). Wait one full week of production traffic. + +2. **Opt-in per workspace.** Before enabling globally, wire a check in `BaseQueryRouter.decide` that consults a small allow-list of workspace IDs (env var `BASE_QUERY_CACHE_WORKSPACE_IDS=ws1,ws2`). Set it to one internal workspace. Monitor: + - p50/p95/p99 latency of `/bases/rows` for the pilot workspace. + - Cache hit rate (log `decision === 'cache'` counter). + - Any log line from `BaseRowService.list`'s cache-path catch block (should be zero under normal operation). + +3. **Enable globally.** Drop the allow-list, set `BASE_QUERY_CACHE_ENABLED=true` everywhere. Watch `/bases/rows` p99, memory RSS (DuckDB pages stay in-process), and Redis pub/sub throughput. If RSS climbs beyond expectations, reduce `BASE_QUERY_CACHE_MAX_COLLECTIONS`. + +4. **Raise or lower threshold.** After a week of global-on, revisit `BASE_QUERY_CACHE_MIN_ROWS`. The 25K default is a conservative guess — the actual crossover where DuckDB starts winning on cold load may be lower. Adjust from metrics. + +5. **Phase 2 prep (not this plan).** Design consistent-hash routing using the `base-query-cache:owner:{baseId}` lease key so only one node holds a given collection. This is a separate plan. + +Rollback: set `BASE_QUERY_CACHE_ENABLED=false` and restart. No state cleanup required (DuckDB lives in-process; Redis keys auto-expire via the warm-set trim, but can be dropped with `DEL base-query-cache:recent` if desired). + +--- + +## Appendix: open questions flagged for user confirmation + +These are guesses embedded in the plan that the executor should confirm before moving past the commits that encode them: + +- **`BASE_QUERY_CACHE_MIN_ROWS = 25000`** — threshold for "large." Derived from the observation that the Postgres path stays below 200ms up to ~30K rows. Confirm from real metrics after task 10. +- **`BASE_QUERY_CACHE_MAX_COLLECTIONS = 50`** — memory cap, lowered from an earlier 500 draft to a self-host-safe default. At ~100 MB/collection this implies a ~5 GB RSS ceiling, which fits typical self-host boxes. Larger deployments should raise it explicitly via env. Confirm with user. +- **`BASE_QUERY_CACHE_WARM_TOP_N = 50`** — boot-warm size. Chosen to balance boot time (50 × 10–15s = 8–12 min sequential) vs steady-state cache freshness. Confirm. +- **Trgm search routes to Postgres** — v1 doesn't populate `search_text` in DuckDB. Cheap to add (loader just copies `row.searchText`), but untested against the trigger-maintained `f_unaccent` normalization in Postgres. Leave on Postgres in v1; revisit later. +- **`@duckdb/node-api@^1.5.x`** — official high-level binding, latest as of writing is 1.5.2. Confirm npm registry pin at task 1 time. +- **Redis channel prefix `base-query-cache:`** — consistent with existing `presence:base:` and `typesense:` naming. If there's a house convention for "infra cache" keys, align here before committing task 7.