Format detection · TypeScript · zero-dep
Structure stays verbatim. Prose gets compressed.
Nine formats, 226 tests, no runtime dependencies. The boring infrastructure every LLM pipeline eventually writes — detection, structural skeleton extraction, prose extraction, and fail-safe reconstruction. ESM, Node 18+, runs on Deno / Bun / edge.
Formats supported: 9
Tests passing: 226
Runtime deps: 0
Language: TS 5.9
Module format: ESM
Min Node: 18+
Why this exists
Kubernetes manifests, API responses, changelogs, Dockerfiles — they arrive as message content, get treated as flat prose, and the structure the model needs to reason about them is the first thing the summarizer destroys. format-converters splits each format at the right seam: keep the skeleton, compress only the prose.
Before: 480 chars, structure and prose fused

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
          description: This container runs the nginx
            web server and handles all incoming HTTP
            and HTTPS traffic for production, serving
            thousands of concurrent users daily with
            automatic health checking and graceful
            restart on failure.

After: 148 chars, skeleton intact, 3.2× smaller

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
          # nginx web server for production, high-traffic,
          # health-checked

Keys, versions, and replica count survive verbatim. The verbose description is replaced by a summary. The deployment still parses. The LLM still reasons over it. The token bill dropped.
What ships today
Each converter implements the same four-method contract. The registry orders them by specificity — MDX before Markdown, JSON / XML before YAML — so pattern overlap doesn't misroute content. Detection is fast heuristics; no AST, no external parsers.
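The ordering rule can be illustrated with a small sketch. This is not the library's registry: `ToyConverter`, the two detectors, and `detectFirst` are hypothetical names, and the heuristics are deliberately crude. The point is only that a loose YAML pattern also matches JSON, so the more specific format has to be tried first.

```typescript
// Hypothetical sketch of specificity-ordered, first-match detection.
// None of these names come from format-converters.
type ToyConverter = {
  name: string;
  detect: (content: string) => boolean;
};

// Crude heuristic: content is wrapped in {...} or [...].
const toyJson: ToyConverter = {
  name: "json",
  detect: (c) => /^\s*[\[{]/.test(c) && /[\]}]\s*$/.test(c),
};

// Crude heuristic: some "key: value" pair appears anywhere.
// Note that this also matches JSON such as {"name": "nginx"}.
const toyYaml: ToyConverter = {
  name: "yaml",
  detect: (c) => /["\w-]+:\s/.test(c),
};

// Returns the first converter whose detector accepts the content, else null.
function detectFirst(content: string, registry: ToyConverter[]): ToyConverter | null {
  for (const converter of registry) {
    if (converter.detect(content)) return converter;
  }
  return null;
}

const registry = [toyJson, toyYaml]; // more specific format first
console.log(detectFirst('{"name": "nginx"}', registry)?.name); // "json"
console.log(detectFirst("name: nginx", registry)?.name); // "yaml"
```

Reversing the order to `[toyYaml, toyJson]` would route the JSON sample to the YAML converter, which is the misrouting the ordering is meant to prevent.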
Docs & prose
2 converters
Headings, tables, code fences, frontmatter, directives, GFM alerts, HTML blocks preserved verbatim. Paragraph prose between them is compressible.
Everything Markdown preserves, plus ESM imports/exports and PascalCase JSX component blocks. MDX detection runs before Markdown so JSX doesn't leak into prose.
Data & config
4 converters
All keys, numbers, booleans, and short strings survive verbatim. String values of at least six words and at least 100 chars are candidates for summarization.
Keys, booleans, numbers, and strings up to 60 chars are atomic. Longer string values are treated as prose.
Section headers and atomic keys preserved; only long string values move to the compressible bucket.
Header row kept verbatim, a row-count annotation stands in for the data rows. Always shorter; nothing structural is lost.
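The atomic-versus-prose split for string values can be sketched as a threshold check over a parsed document. This is a minimal illustration that assumes both conditions (six words and 100 chars) must hold. The `Json` type, `isProse`, `findProsePaths`, and the `$.key` path notation are hypothetical and not part of the library.

```typescript
// Hypothetical sketch: find the string values long enough to count as prose.
// Thresholds follow the text above; all names here are invented.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

const MIN_WORDS = 6;
const MIN_CHARS = 100;

function isProse(value: string): boolean {
  const words = value.trim().split(/\s+/).filter(Boolean).length;
  return words >= MIN_WORDS && value.length >= MIN_CHARS;
}

// Collects the paths of prose-length strings; everything else stays atomic.
function findProsePaths(node: Json, path = "$", out: string[] = []): string[] {
  if (typeof node === "string") {
    if (isProse(node)) out.push(path);
  } else if (Array.isArray(node)) {
    node.forEach((item, i) => findProsePaths(item, `${path}[${i}]`, out));
  } else if (node !== null && typeof node === "object") {
    for (const [key, value] of Object.entries(node)) {
      findProsePaths(value, `${path}.${key}`, out);
    }
  }
  return out;
}

const sample: Json = {
  name: "nginx",
  replicas: 3,
  description:
    "This container runs the nginx web server and handles all incoming HTTP " +
    "and HTTPS traffic for production, serving thousands of concurrent users daily.",
};
console.log(findProsePaths(sample)); // [ '$.description' ]
```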
Markup & build
3 converters
Full tag skeleton and attributes held; short text values kept; prose text nodes and verbose XML comments are compressible.
Tag skeleton preserved. script and style blocks collapse to [code] placeholders rather than being treated as prose.
Every FROM / RUN / COPY / CMD / ENV instruction survives line-for-line. Multi-line prose comment blocks above stages are the compressible unit.
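The Dockerfile rule can be sketched as a line scan. This is a rough illustration under stated assumptions: a "multi-line prose comment block" is taken to mean two or more consecutive lines starting with `#`, a lone comment line stays with the skeleton, and parser directives and heredocs are ignored. `splitDockerfile` is a hypothetical name, not a library function.

```typescript
// Hypothetical sketch: separate Dockerfile instructions from comment blocks.
// Assumes a prose block is two or more consecutive lines starting with '#'.
function splitDockerfile(content: string): { preserved: string[]; compressible: string[] } {
  const preserved: string[] = [];
  const compressible: string[] = [];
  let run: string[] = [];

  const flush = () => {
    if (run.length >= 2) {
      // Multi-line comment block: strip the '#' markers and join as prose.
      compressible.push(run.map((line) => line.replace(/^\s*#\s?/, "")).join(" "));
    } else {
      // A lone comment line (or nothing) is kept with the skeleton.
      preserved.push(...run);
    }
    run = [];
  };

  for (const line of content.split("\n")) {
    if (/^\s*#/.test(line)) {
      run.push(line);
    } else {
      flush();
      preserved.push(line); // instructions and blank lines pass through untouched
    }
  }
  flush();
  return { preserved, compressible };
}

const dockerfile = [
  "# Build stage for the API.",
  "# Compiles TypeScript and prunes dev dependencies.",
  "FROM node:18 AS build",
  "# copy sources",
  "COPY . .",
  "RUN npm ci",
].join("\n");
console.log(splitDockerfile(dockerfile));
```

Every instruction line lands in `preserved` in its original order, and only the two-line header comment is offered for summarization.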
Structure vs. prose
Every converter — built-in or custom — implements the same four methods. Your compression engine, middleware, or RAG pre-processor doesn't have to know which format it's looking at. Aletheia's rule: the contract is the artifact, not the implementation.
extractPreserved(content)
Returns the structural skeleton — the parts that must survive verbatim: tags, keys, headings, instructions, short atomic values.
extractCompressible(content)
Returns the prose segments. Each one is safe to hand to an LLM for summarization, translation, or token-budget trimming.
reconstruct(preserved, summary)
Reassembles output from the skeleton plus a summary string. Per-format rules — YAML becomes a trailing comment, XML a node, JSON a _summary key.
detect(content)
First-match wins across the nine registered converters. Order is intentional: MDX before Markdown, JSON / XML before YAML. No AST, no external parsers.
Structural typing. The same four-method signature is used by context-compression-engine's FormatAdapter — pass converters straight in, no wrapper required.
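Because the contract is structural, a custom converter is just an object with the four methods. The sketch below is an assumption-laden illustration: the method names come from the text, but the parameter and return types are inferred from the usage snippet (for example, `extractCompressible` returning something that supports `join`), and `FormatConverterLike`, `KeyValueConverter`, and its reconstruct rule are hypothetical.

```typescript
// Hypothetical shape of the four-method contract. Method names come from the
// text; parameter and return types are assumptions.
interface FormatConverterLike {
  detect(content: string): boolean;
  extractPreserved(content: string): string;
  extractCompressible(content: string): string[];
  reconstruct(preserved: string, summary: string): string;
}

// Matches a "key = value" line; the m flag lets detect() scan whole documents.
const KEY_LINE = /^\s*[\w.-]+\s*=/m;

// Toy converter for "key = value" files with free-text lines in between.
const KeyValueConverter: FormatConverterLike = {
  detect: (content) => KEY_LINE.test(content),
  extractPreserved: (content) =>
    content.split("\n").filter((line) => KEY_LINE.test(line)).join("\n"),
  extractCompressible: (content) =>
    content.split("\n").filter((line) => line.trim() !== "" && !KEY_LINE.test(line)),
  // Invented rule for this toy format: the summary becomes a trailing comment.
  reconstruct: (preserved, summary) => `${preserved}\n# ${summary}`,
};

const input = [
  "port = 8080",
  "This service listens on the port above and restarts on failure.",
  "host = localhost",
].join("\n");

const skeleton = KeyValueConverter.extractPreserved(input);
const prose = KeyValueConverter.extractCompressible(input);
console.log(KeyValueConverter.reconstruct(skeleton, "listens, restarts on failure"));
```

Any code that accepts `FormatConverterLike` can take this object without a wrapper or a base class, which is the property the structural-typing note relies on.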
What it looks like in use
The library stays out of your way. You keep control of the LLM call, the prompt, and the token budget. format-converters only owns the seam.
# Install
npm install @lisa/format-converters

// Detect and split any supported format
import { detect } from '@lisa/format-converters';

const result = detect(content);
if (result) {
  const preserved = result.converter.extractPreserved(content);
  const compressible = result.converter.extractCompressible(content);
  const summary = await myLlm.summarize(compressible.join('\n'));
  const output = result.converter.reconstruct(preserved, summary);
  // Always check: output.length <= content.length
}

// Or pick one converter directly
import { YamlConverter } from '@lisa/format-converters';

if (YamlConverter.detect(content)) {
  const skeleton = YamlConverter.extractPreserved(content);
  const prose = YamlConverter.extractCompressible(content);
  const output = YamlConverter.reconstruct(skeleton, summarize(prose));
}
Metis says: a CLI wrapper (npx format-converters detect and friends) is on the alpha roadmap. The library is the load-bearing artifact; the CLI will be a thin shell over the same four methods.
What this is not
Ipcha Mistabra wrote this section. Before this gets wired into your context compression pipeline, know what it will and will not do for you.
Disclosure
We don't convert PDF to DOCX or render HTML to PNG. This is a detection-and-split library for LLM pipelines, not a document transformer. If you need a pandoc, use pandoc.
Disclosure
The summarizer is your LLM, your prompt, your tokens. format-converters only gives you the seam — what to send, what to keep, how to stitch the result back together.
Disclosure
Detection is fast pattern matching, not parsing. Check reconstruct() output against the input length at the call site: if a round-trip would make the content longer, discard it and keep the original. Lossy by design, flagged by contract.
Quick start
ESM only. Node 18+. Runs unchanged on Deno, Bun, Cloudflare Workers, Vercel edge, and any V8-ish runtime that speaks modules.
# Install
npm install @lisa/format-converters

# Or with pnpm / yarn / bun
pnpm add @lisa/format-converters
bun add @lisa/format-converters

# From source
git clone https://github.com/nyxCore-Systems/format-converters
cd format-converters && npm install
npm test  # 226 tests, vitest, no network
Before you wire it into production pre-processing
The reconstruct() contract is fail-safe by design: if round-trip output is not strictly shorter than the input, discard it and keep the original. Honor that check at the call site. A zero-token save is not a win if you shipped a regression into the prompt — Aletheia would rather you kept the full document.
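A call-site guard for that check might look like the sketch below. `keepIfShorter` is a hypothetical helper, not a library export, and it compares string lengths (UTF-16 code units) as a stand-in for token counts, which is an assumption you may want to replace with your tokenizer.

```typescript
// Hypothetical helper: keep the round-trip output only when it is strictly
// shorter than the input; otherwise fall back to the original.
function keepIfShorter(original: string, candidate: string): { text: string; saved: number } {
  if (candidate.length < original.length) {
    return { text: candidate, saved: original.length - candidate.length };
  }
  return { text: original, saved: 0 };
}

const original = "description: a long, verbose paragraph about the service";
const shorter = "# verbose service note";
const longer = original + " plus extra commentary";

console.log(keepIfShorter(original, shorter).text === shorter); // true
console.log(keepIfShorter(original, longer).text === original); // true
```

Returning the number of characters saved alongside the text makes it easy to log per-format savings without a second pass.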
Metis says: measure the compression ratio per format in your corpus before trusting a global flag.