TODOS.md - deathbyknowledge/ripgit

# ripgit — current work and future tasks

## Current state

The original core build plan is implemented and compiling with zero warnings.
The project has also moved past that baseline: push, clone, fetch, the read
API, diff engine, FTS5 search, issues/PRs, agent-readable markdown/plain page
views, the optional GitHub OAuth auth worker, and the new root test harness
all work.

Tested with:
- **235-commit repo** — 5.3x compression ratio via xpatch
- **cloudflare/agents** — 13,464 objects, 11.4 MiB pack, pushes in one shot
- **git/git** — 80K commits, pushed incrementally to fp 14,000 (checkpoint pushes)

### Code structure

```
src/
  lib.rs          — Worker entry point, routing, owner profile page, content
                    negotiation dispatch, ref advertisement, admin endpoints
  presentation.rs — `?format` + `Accept` negotiation, markdown/plain helpers,
                    action rendering, shared text-mode hints, `Vary: Accept`
  schema.rs       — 11 tables + 3 FTS5 virtual tables + indexes
                    (refs, commits, commit_parents, commit_graph, trees,
                    blob_groups, blobs, blob_chunks, raw_objects, config,
                    fts_head, fts_commits)
  pack.rs         — Streaming pack parser: build_index (decompress-to-sink),
                    resolve_type (OFS_DELTA chain following), resolve_entry
                    (on-demand decompression, Arc-based ResolveCache with byte
                    budget, ResolveCtx bundle), pack generator for fetch.
                    MAX_PACK_BYTES (50 MB) and CACHE_BUDGET_BYTES (20 MB).
  store.rs        — Storage layer: commit/tree/blob parsing, xpatch delta
                    compression with zlib-compressed keyframes + blob_chunks,
                    batched SQL INSERTs, binary lifting commit graph,
                    blob reconstruction, config helpers, incremental FTS
                    rebuild, search (FTS5 + INSTR), lossy UTF-8
  git.rs          — Git smart HTTP protocol: receive-pack (pack body size
                    gate, streaming pack processing, two-phase push handling,
                    dynamic default branch, FTS trigger), upload-pack (fetch)
  api.rs          — Read API: refs, log, commit, tree, blob, file-at-ref,
                    search (code + commits, @prefix: column filter syntax),
                    stats (using stored_size column, no full table scan)
  diff.rs         — Diff engine: recursive tree comparison, line-level diffs
                    (via `similar`), commit diff, two-commit compare
  issues.rs       — Issues/PR storage, comments, merge-base search, three-way
                    merge, merge commit creation, form parsing utilities
  web.rs          — Shared HTML shell/CSS, markdown rendering, owner profile,
                    repo README helpers, diff rendering, raw/blob helpers
  web/            — Repo home, log, tree/blob, search, settings, commit/diff
                    HTML + markdown renderers
  issues_web.rs   — Shared issues/PR web helpers and re-exports
  issues_web/     — Issues/pulls list, detail, and form HTML + markdown
                    renderers

examples/github-oauth/
  src/index.ts    — GitHub OAuth front worker, browser sessions, agent tokens,
                    trusted header forwarding, text-mode landing/settings
  README.md       — Setup, deploy, bindings/secrets, and text-mode docs

tests/
  helpers/mf.mjs  — Miniflare test server factory for the core worker
  helpers/git.mjs — temp repo + git CLI helpers for fixture-based e2e tests
  worker-smoke.spec.mjs
                  — negotiated representation and auth smoke tests
  git-e2e.spec.mjs
                  — real-world push/clone/fetch/force-push coverage
  fixtures/       — pinned offline git fixture bundles and refresh notes
```

---

## Completed

All items below have been implemented and verified.

### Search improvements

1. **All matches with line numbers** — After FTS5 identifies matching files,
   scans content line-by-line and returns every match with its line number.
   Web UI links to `/blob/:ref/:path#L47`.

2. **Literal / exact substring search** — Auto-detects symbol-heavy queries
   (`.`, `_`, `()`, `::`) and falls back to `INSTR(content, ?)`. Full table
   scan but bounded by repo size. `lit:` prefix also forces literal mode.

3. **Scope filters / @prefix: syntax** — `@path:src/`, `@ext:rs`, `@author:`,
   `@message:`, `@content:` inline query prefixes replace separate form fields.
   Parsed in `api::parse_search_query`, strips `@` and maps to FTS5 column
   filters or SQL LIKE predicates. Auto-routes scope (code vs commits) from
   the prefix used. Works for both FTS5 and INSTR modes.

4. **Commit message search** — `fts_commits` FTS5 table indexed on hash,
   message, author. Populated during push. Exposed via `?scope=commits` and
   a tab in the web UI search page.

5. **Incremental FTS rebuild** — Uses `diff::diff_trees` to compare old and
   new HEAD trees. Only inserts/deletes/updates changed files in `fts_head`.
   Stores last indexed commit hash in `config` table. O(changed files) per push.

6. **Default branch detection** — First branch pushed becomes the default.
   Stored in `config` table. FTS rebuild triggers on any push to the default
   branch (not hardcoded to `main`).
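
The query handling in items 2 and 3 can be sketched roughly as below. This is a minimal illustration, not the real `api::parse_search_query`: the type names, the function name `parse_query_sketch`, and the per-token handling of `lit:` are all assumptions made for the example.

```rust
/// Which index a query is routed to.
#[derive(Debug, PartialEq)]
enum Scope {
    Code,
    Commits,
}

/// A raw query split into column filters and free text (hypothetical shape).
#[derive(Debug, PartialEq)]
struct ParsedQuery {
    scope: Scope,
    /// (column, value) pairs taken from `@column:value` tokens.
    filters: Vec<(String, String)>,
    /// Remaining free-text terms, joined by single spaces.
    text: String,
    /// True when the text should bypass FTS5 and use a substring match.
    literal: bool,
}

fn parse_query_sketch(raw: &str) -> ParsedQuery {
    let mut scope = Scope::Code;
    let mut filters = Vec::new();
    let mut terms: Vec<&str> = Vec::new();
    let mut forced_literal = false;

    for token in raw.split_whitespace() {
        // `@column:value` tokens become filters; unknown columns stay as text.
        if let Some(rest) = token.strip_prefix('@') {
            if let Some((column, value)) = rest.split_once(':') {
                if matches!(column, "path" | "ext" | "content" | "author" | "message") {
                    // Commit-only columns switch the scope to commits.
                    if matches!(column, "author" | "message") {
                        scope = Scope::Commits;
                    }
                    filters.push((column.to_string(), value.to_string()));
                    continue;
                }
            }
        }
        // Assumption: `lit:` may prefix any token and forces literal mode.
        if let Some(rest) = token.strip_prefix("lit:") {
            forced_literal = true;
            if !rest.is_empty() {
                terms.push(rest);
            }
            continue;
        }
        terms.push(token);
    }

    let text = terms.join(" ");
    // Symbol-heavy text is a poor fit for FTS5 tokenization, so go literal.
    let symbol_heavy = [".", "_", "()", "::"].iter().any(|s| text.contains(s));
    ParsedQuery {
        scope,
        filters,
        text,
        literal: forced_literal || symbol_heavy,
    }
}
```

With this sketch, `@path:src/ @ext:rs resolve_entry` yields two code filters and a literal search (the underscore triggers it), while `@author:alice fix typo` routes to the commits scope.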

### Web + agent UI

7. **Branch selector** — Dropdown on home, tree, blob, and log pages listing
   all `refs/heads/*` entries. Shows current branch prominently. Selecting a
   branch navigates to the same page on that branch.

8. **Agent-readable representations** — Page routes negotiate `text/html`,
   `text/markdown`, and `text/plain`. `?format=` overrides `Accept`. Responses
   chosen from `Accept` add `Vary: Accept`.

9. **Markdown/plain page coverage** — Owner profile, repo home, commits, tree,
   blob, commit, diff, search, settings, issues, pulls, issue/PR detail, and
   new issue/new PR forms all have explicit text renderers.

10. **Shared presentation layer** — `src/presentation.rs` centralizes
    negotiation, markdown/plain response helpers, action descriptions, section
    rendering, and the shared navigation hint for agents.

11. **Web module split** — Repo pages live in `src/web/*`; issue/PR pages live
    in `src/issues_web/*`. `src/web.rs` and `src/issues_web.rs` stay as shared
    shells/helpers.

12. **Issues and pull requests** — SQLite-backed issues/PRs with list/detail
    pages, new issue/new PR forms, comments, open/close/reopen actions, and
    repo-owner merge.

13. **PR merge flow** — Merge-base search plus fast-forward or three-way tree
    merge inside the DO. Stores the merge commit and updates the target ref.

14. **Markdown rendering** — Replaced the hand-rolled renderer with
    `pulldown-cmark`. Supports tables, footnotes, strikethrough, task lists,
    and smart punctuation. Raw HTML is escaped and unsafe URLs are neutralized.

15. **Repo-aware README links** — Relative README links and images on the repo
    home page are rewritten against the current ref so in-repo navigation works
    (`/blob`, `/tree`, `/raw`).

16. **Syntax highlighting** — highlight.js CDN with line numbers plugin.
    `#L` anchor support for deep linking to specific lines.

17. **Persistent nav search with live results** — Search bar in the nav on
    every page. Fetches `/search?q=...` on each keystroke (200ms debounce),
    shows a dropdown of file paths + first matching line (code) or commit hash
    + message (commits). Enter navigates to the full search page. Scope
    (code vs commits) detected client-side from `@author:`/`@message:` prefixes.

18. **Repo bar layout fix** — Global nav stays full width while the repo
    secondary bar uses a full-width wrapper with centered inner contents.
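
The negotiation rule from item 8 (`?format=` wins, otherwise `Accept` decides and the response carries `Vary: Accept`) could look something like the sketch below. The accepted `format` values, the HTML default, and the choice to ignore q-values are assumptions for illustration; `src/presentation.rs` may differ on all of them.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Representation {
    Html,
    Markdown,
    Plain,
}

#[derive(Debug, PartialEq)]
struct Negotiated {
    representation: Representation,
    /// Whether the response should include `Vary: Accept`.
    vary_accept: bool,
}

fn negotiate_sketch(format_param: Option<&str>, accept: Option<&str>) -> Negotiated {
    // 1. An explicit, recognized `?format=` value overrides the Accept header.
    //    The accepted spellings here are assumed, not taken from the source.
    let from_param = format_param.and_then(|value| match value {
        "html" => Some(Representation::Html),
        "md" | "markdown" => Some(Representation::Markdown),
        "txt" | "plain" => Some(Representation::Plain),
        _ => None,
    });
    if let Some(representation) = from_param {
        return Negotiated { representation, vary_accept: false };
    }

    // 2. Otherwise take the first supported media type in header order.
    //    Simplification: q-values are ignored.
    let representation = accept
        .unwrap_or("")
        .split(',')
        .find_map(|part| {
            let media_type = part.split(';').next().unwrap_or("").trim();
            match media_type {
                "text/html" => Some(Representation::Html),
                "text/markdown" => Some(Representation::Markdown),
                "text/plain" => Some(Representation::Plain),
                _ => None,
            }
        })
        .unwrap_or(Representation::Html);

    // The result depended on Accept, so caches must key on it.
    Negotiated { representation, vary_accept: true }
}
```

A request with `Accept: text/markdown, text/html;q=0.8` and no `format` parameter resolves to markdown with `Vary: Accept` set; adding `?format=plain` switches to plain text and drops the `Vary` flag.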

### Protocol fixes

19. **HEAD symbolic ref** — `advertise_refs` includes `HEAD` pointing to the
    default branch via `symref=HEAD:refs/heads/:name` capability. Fixes
    `git clone` for repos whose default branch isn't `main`.

20. **Clone with non-main branch** — Consequence of #19. `git clone` now checks
    out the correct branch automatically.

21. **Two-phase push handling** — When `git push` sends a payload larger than
    `http.postBuffer` (1 MiB default), git sends a 4-byte probe (`0000`) then
    the full payload with chunked encoding. Fixed by returning 200 OK for
    empty command sets.
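
For reference, items 19 and 21 both come down to pkt-line framing. The sketch below shows how a `HEAD` advertisement line carrying the `symref` capability is framed, and how the 4-byte probe body can be recognized. The helper names are hypothetical and this is not the code in `git.rs`; the service announcement line and the maximum pkt-line length are not handled.

```rust
/// Frame a payload as a pkt-line: four hex digits giving the total length
/// (payload plus the 4-byte length prefix), followed by the payload.
fn pkt_line(payload: &str) -> String {
    format!("{:04x}{}", payload.len() + 4, payload)
}

/// Build the first advertised ref line: `<sha> HEAD\0<capabilities>\n`.
/// The `symref=HEAD:refs/heads/<name>` capability tells the client which
/// branch HEAD points at, so `git clone` checks out the right one.
fn head_advertisement(head_sha: &str, default_branch: &str, extra_caps: &[&str]) -> String {
    let mut caps = vec![format!("symref=HEAD:refs/heads/{}", default_branch)];
    caps.extend(extra_caps.iter().map(|c| c.to_string()));
    pkt_line(&format!("{} HEAD\0{}\n", head_sha, caps.join(" ")))
}

/// A request body consisting only of a flush-pkt carries no commands.
/// This is the probe git sends before a large chunked push.
fn is_flush_probe(body: &[u8]) -> bool {
    body == b"0000"
}
```

For a repo whose default branch is `trunk`, the advertised line contains `HEAD\0symref=HEAD:refs/heads/trunk`, and a body of exactly `0000` is treated as a probe that gets a 200 OK with no further processing.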

### Performance + scale

22. **Streaming pack parser** — Replaced the all-in-memory `parse()` with a
    two-pass approach: index pass (decompress-to-sink, ~100 bytes/entry) then
    process-by-type (decompress on-demand from pack bytes). Peak memory went
    from >128 MiB (OOM) to ~15 MiB for a 13K-object pack.

23. **Resolve cache** — Bounded 1024-entry cache for resolved pack entries.
    Caches delta chain bases and intermediates to avoid re-decompressing shared
    bases. Critical for git packs with depth-50 chains — reduces decompressions
    by 5-10x for packs with many objects sharing base chains.

24. **Keyframe compression** — Keyframes (full blob snapshots, every 50
    versions) are zlib-compressed before storage. A 5 MB source file compresses
    to ~500 KB. Deltas are left as-is (xpatch uses zstd internally). Zero cost
    for the common case (all files fit in single rows after compression).

25. **Blob chunking** — `blob_chunks` overflow table for compressed keyframes
    that still exceed DO's 2 MB row limit. Transparent to all read paths —
    `reconstruct_blob` reassembles chunks automatically. Only activates for
    large binary files.

26. **Batched SQL INSERTs** — Tree entries batched 25 per statement (4 params
    each, within DO's 100 bound parameter limit). Commit parents batched 33 per
    statement. Cuts total SQL operations by ~6x for large pushes.

27. **Fast existence checks** — Replaced `SELECT COUNT(*) AS n` with
    `SELECT 1 LIMIT 1` for dedup checks in store_commit, store_tree,
    store_blob. Indexed PK lookup, instant return.

28. **stored_size column** — Tracks compressed blob size at INSERT time.
    Stats endpoint uses `SUM(stored_size)` over an integer column instead of
    `SUM(LENGTH(data))` which would scan every data page. Instant stats
    regardless of repo size.

29. **Lossy UTF-8** — `String::from_utf8_lossy` for commit parsing. Old repos
    with Latin-1 or other non-UTF-8 author names are handled gracefully.
    Raw bytes preserved in `raw_objects` for byte-identical fetch.

30. **Admin endpoints** — `DELETE /repo/:name/` wipes all tables.
    `PUT /repo/:name/admin/set-ref` manually sets a ref for recovery from
    partial push timeouts.

31. **Arc-based zero-copy resolve cache** — `ResolveCache` stores `Arc<[u8]>`
    instead of `Vec<u8>`. Cache hits return `Arc::clone` (pointer increment,
    no data copy). `ExternalObjects` also uses `Arc<[u8]>`. `resolve_entry`
    returns `Arc<[u8]>` — each decompressed object is allocated exactly once
    and shared between the cache and the caller. During a processing loop,
    the Arc is at refcount 2 (cache + caller); caller drops at end of iteration,
    leaving refcount 1 in cache. `cache.clear()` drops the last reference.

32. **Budget enforcement** — `MAX_PACK_BYTES = 50 MB` hard gate in
    `handle_receive_pack`: packs above this return a proper `ng` pkt-line
    response before any object is parsed. `CACHE_BUDGET_BYTES = 20 MB`
    enforced inside `ResolveCache::try_cache` — cache silently stops growing
    when the byte budget is exhausted; processing continues via re-decompression.
    Peak memory ceiling at a 50 MB push: ~85 MB (roughly 40 MB below the 128 MB wall).
    `ResolveCtx` bundles cache + external objects for `resolve_entry`.
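
The combination described in items 31 and 32 (shared `Arc<[u8]>` buffers plus a byte budget that silently stops caching) can be illustrated with a small stand-in. `BudgetedCache` and its `u64` pack-offset key are assumptions for this sketch; only the `try_cache` name and the refcount behaviour come from the notes above.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for a byte-budgeted resolve cache.
struct BudgetedCache {
    entries: HashMap<u64, Arc<[u8]>>,
    used_bytes: usize,
    budget_bytes: usize,
}

impl BudgetedCache {
    fn new(budget_bytes: usize) -> Self {
        BudgetedCache {
            entries: HashMap::new(),
            used_bytes: 0,
            budget_bytes,
        }
    }

    /// Store `data` if it fits in the remaining budget. Returns false when
    /// the entry was skipped; the caller keeps working and simply
    /// re-decompresses that object if it is needed again.
    fn try_cache(&mut self, offset: u64, data: Arc<[u8]>) -> bool {
        if self.entries.contains_key(&offset) {
            return true;
        }
        if self.used_bytes + data.len() > self.budget_bytes {
            return false;
        }
        self.used_bytes += data.len();
        self.entries.insert(offset, data);
        true
    }

    /// A hit clones the Arc: a refcount increment, not a byte copy.
    fn get(&self, offset: u64) -> Option<Arc<[u8]>> {
        self.entries.get(&offset).cloned()
    }

    /// Dropping the entries releases the cache's reference to each buffer.
    fn clear(&mut self) {
        self.entries.clear();
        self.used_bytes = 0;
    }
}
```

Inserting a buffer leaves it at refcount 2 (cache plus caller), and once the caller drops its handle and `clear()` runs, the allocation is freed. An entry that would exceed the budget is simply not stored.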

### Auth + docs

33. **Auth worker text mode** — `examples/github-oauth` landing page and
    `/settings` also negotiate markdown/plain views for curl/agents.

34. **Docs refresh** — `README.md` documents text-mode navigation and curl
    examples. `examples/github-oauth/README.md` covers setup, deploy,
    bindings/secrets, and text-mode behavior.

### Testing + fetch negotiation

35. **Root test harness** — Added a root `package.json` + `vitest`/`miniflare`
    setup for the core ripgit worker only. Tests boot the built worker with the
    real KV + SQLite DO bindings under Miniflare.

36. **Rust protocol/unit coverage** — Added unit tests for representation
    negotiation, search query parsing, URL query decoding, and upload-pack
    negotiation helpers.

37. **Real-world git fixture e2e** — Added a pinned offline
    `tests/fixtures/workers-rs-main.bundle` fixture and a git CLI e2e suite
    covering push, clone, fast-forward push, force-push, search refresh, and
    fetch from an existing clone.

38. **Fetch after force-push** — Fixed `git fetch` for existing clones after a
    non-fast-forward rewrite by advertising upload-pack capabilities separately
    from receive-pack and by implementing the expected ACK/NAK negotiation
    before streaming the pack.
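
As a rough picture of the negotiation in item 38, the sketch below handles only the simplest case: no `multi_ack`, one reply before the pack. The server answers `ACK <id>` for the first `have` it recognizes and `NAK` when nothing is shared, which is what an existing clone sees after a force-push rewrote everything it had. Type and function names are hypothetical, and pkt-line framing is assumed to have been stripped already.

```rust
use std::collections::HashSet;

/// Parsed upload-pack request lines (hypothetical shape).
struct FetchRequest {
    wants: Vec<String>,
    haves: Vec<String>,
    done: bool,
}

fn parse_fetch_lines(lines: &[&str]) -> FetchRequest {
    let mut req = FetchRequest {
        wants: Vec::new(),
        haves: Vec::new(),
        done: false,
    };
    for line in lines {
        let line = line.trim_end();
        if let Some(rest) = line.strip_prefix("want ") {
            // The first want line may carry capabilities after the id.
            if let Some(id) = rest.split_whitespace().next() {
                req.wants.push(id.to_string());
            }
        } else if let Some(rest) = line.strip_prefix("have ") {
            req.haves.push(rest.trim().to_string());
        } else if line == "done" {
            req.done = true;
        }
    }
    req
}

/// Single-ack reply: acknowledge the first common object, else NAK.
fn negotiation_reply(req: &FetchRequest, known: &HashSet<String>) -> String {
    match req.haves.iter().find(|h| known.contains(*h)) {
        Some(common) => format!("ACK {}\n", common),
        None => "NAK\n".to_string(),
    }
}
```

The reply would be sent as a pkt-line ahead of the pack data; choosing which objects go into the pack based on the common commit is outside this sketch.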

---

## Known limitations

These are documented, accepted trade-offs — not bugs.

- **Auth is upstream** — ripgit expects trusted `X-Ripgit-Actor-*` headers
  from an auth worker or other front proxy. Reads are public; writes return
  401 without a trusted actor.
- **DO storage timeout** — Pushes with many objects (>~10K per incremental
  push) can exceed the DO's ~30 second storage operation timeout. Each
  `sql.exec()` auto-commits individually (no request-level transaction).
  Cloudflare's `transactionSync()` API would provide atomicity but is not
  exposed in workers-rs 0.7.5. Use the admin/set-ref endpoint to recover
  from partial push state.
- **50 MB pack body limit (server-enforced)** — `MAX_PACK_BYTES` in `pack.rs`
  rejects packs above 50 MB with a clean `ng` response before any object is
  parsed. The hard Workers platform limit is 100 MB, but we gate lower to keep
  peak DO memory well under the 128 MB ceiling. Repos must be pushed
  incrementally via the push script's checkpoint mechanism.
- **Force pushes are always allowed** — non-fast-forward updates currently work,
  but there is no repo setting or policy hook to reject them when a repo wants
  branch protection semantics.
- **No annotated tag objects** — Silently dropped during push. Lightweight
  tags (refs) work fine.
- **Timezone lost in parsed commits table** — The `commits` table stores unix
  timestamps without timezone offset. The `raw_objects` table preserves the
  original bytes for fetch, but the `/log` API shows times without timezone.
- **README fragment-only heading links** — Relative README file/dir/image links
  are rewritten on the repo home page, but bare `#heading` fragments are not
  yet translated to GitHub-style generated heading IDs.
- **side-band-64k** — Removed from advertised capabilities to avoid wrapping
  the report-status response with sideband bytes.


---

## Next up

### 1. File history

Show all commits that touched a specific file. Walk the first-parent commit
chain, resolve the file path in each commit's tree, emit a result when the
blob hash changes. O(commits * path_depth) — fast for DO-sized repos.

- **API**: `GET /history?ref=main&path=src/lib.rs` — returns list of commits
  that modified the file, with timestamps, authors, and messages.
- **Web UI**: linked from the blob viewer. Paginated commit list scoped to
  one file.
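
A minimal in-memory sketch of the walk described above, under a loud assumption: each commit's tree is flattened into a path-to-blob-hash map, whereas the real version would resolve the path through nested tree objects (the `path_depth` factor). `CommitSketch`, `commit`, and `file_history` are hypothetical names.

```rust
use std::collections::HashMap;

/// Toy commit: id, first parent, and a flattened path -> blob hash map.
struct CommitSketch {
    id: String,
    first_parent: Option<String>,
    files: HashMap<String, String>,
}

/// Convenience constructor for building small test graphs.
fn commit(id: &str, first_parent: Option<&str>, files: &[(&str, &str)]) -> CommitSketch {
    CommitSketch {
        id: id.to_string(),
        first_parent: first_parent.map(str::to_string),
        files: files
            .iter()
            .map(|(path, blob)| (path.to_string(), blob.to_string()))
            .collect(),
    }
}

/// Walk the first-parent chain from `head` and return, newest first, the
/// ids of commits where the blob at `path` differs from the parent's blob.
/// Additions and deletions count as changes.
fn file_history(commits: &HashMap<String, CommitSketch>, head: &str, path: &str) -> Vec<String> {
    let mut touched = Vec::new();
    let mut current = commits.get(head);
    while let Some(c) = current {
        let parent = c.first_parent.as_ref().and_then(|p| commits.get(p));
        let blob_here = c.files.get(path);
        let blob_in_parent = parent.and_then(|p| p.files.get(path));
        if blob_here != blob_in_parent {
            touched.push(c.id.clone());
        }
        current = parent;
    }
    touched
}
```

Because only first parents are followed, changes that arrived through the second parent of a merge are attributed to the merge commit itself.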

### 2. Blame

Attribute each line of a file to the commit that last modified it. Leverages
blob_groups (all versions of a file by path) and the diff engine (line-level
diffs).

- **API**: `GET /blame?ref=main&path=src/lib.rs` — returns lines with commit
  hash, author, timestamp per line.
- **Web UI**: blame view linked from the blob viewer. Line numbers, commit
  info column, file content.
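
One way to picture the attribution step: replay the file's versions oldest to newest, match unchanged lines between consecutive versions, and let matched lines keep their earlier commit. The sketch below uses a plain LCS table as a stand-in for the line-level diff (the diff engine uses `similar`; nothing here calls it), and `blame_sketch` / `lcs_pairs` are hypothetical names.

```rust
/// Index pairs (old, new) of lines on a longest common subsequence.
fn lcs_pairs(a: &[&str], b: &[&str]) -> Vec<(usize, usize)> {
    let (n, m) = (a.len(), b.len());
    // table[i][j] = LCS length of a[i..] and b[j..].
    let mut table = vec![vec![0usize; m + 1]; n + 1];
    for i in (0..n).rev() {
        for j in (0..m).rev() {
            table[i][j] = if a[i] == b[j] {
                table[i + 1][j + 1] + 1
            } else {
                table[i + 1][j].max(table[i][j + 1])
            };
        }
    }
    // Walk the table from the start to recover the matched pairs.
    let (mut i, mut j) = (0, 0);
    let mut pairs = Vec::new();
    while i < n && j < m {
        if a[i] == b[j] {
            pairs.push((i, j));
            i += 1;
            j += 1;
        } else if table[i + 1][j] >= table[i][j + 1] {
            i += 1;
        } else {
            j += 1;
        }
    }
    pairs
}

/// `versions` is (commit id, file content), oldest first.
/// Returns (commit id, line) for every line of the newest version.
fn blame_sketch(versions: &[(&str, &str)]) -> Vec<(String, String)> {
    let mut blamed: Vec<(String, String)> = Vec::new();
    for (commit, content) in versions {
        let new_lines: Vec<&str> = content.lines().collect();
        let old_lines: Vec<&str> = blamed.iter().map(|(_, line)| line.as_str()).collect();
        let matches = lcs_pairs(&old_lines, &new_lines);
        // Start by blaming every line on this commit...
        let mut next: Vec<(String, String)> = new_lines
            .iter()
            .map(|line| (commit.to_string(), line.to_string()))
            .collect();
        // ...then restore the earlier owner for lines that were unchanged.
        for (old_i, new_i) in matches {
            next[new_i].0 = blamed[old_i].0.clone();
        }
        blamed = next;
    }
    blamed
}
```

The quadratic LCS table is fine for illustration but would need a budget or a better diff for large files. Repeated identical lines can also be matched differently than a human would expect.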

### 3. Tags page

Browse lightweight tags in the web UI. Already stored in the `refs` table as
`refs/tags/*`. Quick win — just a new page listing tags with their target
commit info.

- **Web UI**: `/repo/:name/tags` — list of tags with commit hash, author,
  date, and message. Link to commit detail page.

---

## Potential future work

- **transactionSync binding** — Add a custom wasm_bindgen binding for
  `ctx.storage.transactionSync()` to get atomic push semantics. Prevents
  partial state on DO timeout. The JS API exists, workers-rs just doesn't
  expose it yet.
- **Repository index** — KV side-index written on push, landing page listing
  all repos with stats. Needs a KV binding in `wrangler.toml`.
- **Annotated tags** — Parse and store tag objects (separate from lightweight
  tag refs). Requires a `tag_objects` table + pack parser changes.
- **Alternative auth frontends** — Bearer token-only or Cloudflare Access
  integration beyond the GitHub OAuth example.
- **Force-push policy** — Add optional rejection or branch-protection rules for
  non-fast-forward updates instead of always allowing them.
- **Streaming zlib compression** — Currently `blob_zlib_compress` buffers the
  entire compressed output (2x blob size in memory). Switching to
  `flate2::write::ZlibEncoder` with incremental chunk writes would eliminate
  the compressed copy, reducing peak memory from `raw + compressed` to
  `raw + ~256 KB`. Compression ratio is identical (single continuous zlib
  stream). Main blocker: interleaves compression with `blob_chunks` INSERT
  logic, changing the `store_blob` flow. Worth doing for medium-sized blobs
  (10-50 MB) where the compressed copy is significant.
- **side-band-64k** — Re-add with proper sideband wrapping for progress
  reporting.
- **Selective page JSON** — Consider page-model JSON for page-only routes such
  as owner profile, repo home, settings, and auth worker pages if agents need
  it; keep the resource JSON API canonical.
- Don't use `fetch` from the DO. Expose RPC methods and let the Worker call the right one.