Format detection · TypeScript · zero-dep

format-converters

Structure stays verbatim. Prose gets compressed.

Nine formats, 226 tests, no runtime dependencies. The boring infrastructure every LLM pipeline eventually writes — detection, structural skeleton extraction, prose extraction, and fail-safe reconstruction. ESM, Node 18+, runs on Deno / Bun / edge.


Formats supported: 9

Tests passing: 226

Runtime deps: 0

Language: TS 5.9

Module format: ESM

Min Node: 18+

Thesis — 01

Why this exists

LLMs burn tokens on structure that never needed summarizing.

Kubernetes manifests, API responses, changelogs, Dockerfiles — they arrive as message content, get treated as flat prose, and the structure the model needs to reason about them is the first thing the summarizer destroys. format-converters splits each format at the right seam: keep the skeleton, compress only the prose.

Before — 480 chars, structure and prose fused

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
        description: This container runs the nginx
          web server and handles all incoming HTTP
          and HTTPS traffic for production, serving
          thousands of concurrent users daily with
          automatic health checking and graceful
          restart on failure.

After — 148 chars, skeleton intact, 3.2× smaller

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
# nginx web server for production, high-traffic,
# health-checked

Keys, versions, and replica count survive verbatim. The verbose description is replaced by a summary. The deployment still parses. The LLM still reasons over it. The token bill dropped.

Formats — 9

What ships today

Nine converters, one interface. First match wins.

Each converter implements the same four-method contract. The registry orders them by specificity — MDX before Markdown, JSON / XML before YAML — so pattern overlap doesn't misroute content. Detection is fast heuristics; no AST, no external parsers.
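Why order matters, as a minimal sketch. The detector shapes, names, and regexes below are hypothetical stand-ins, not the library's actual heuristics. They only illustrate how a first-match registry keeps pretty-printed JSON from being routed to a broader YAML pattern.

```typescript
// Hypothetical shapes: the real library's types and detectors may differ.
interface Detector {
  name: string;
  detect(content: string): boolean;
}

// Deliberately crude heuristics, for illustration only.
const jsonDetector: Detector = {
  name: 'json',
  detect: (c) => /^\s*[\[{]/.test(c),
};
const xmlDetector: Detector = {
  name: 'xml',
  detect: (c) => /^\s*<[?A-Za-z]/.test(c),
};
const yamlDetector: Detector = {
  name: 'yaml',
  // Broad on purpose: also matches lines like `  "name": "x"` inside JSON.
  detect: (c) => /^\s*["']?[\w-]+["']?:\s/m.test(c),
};

// Most specific first: JSON and XML are checked before YAML.
const registry: Detector[] = [jsonDetector, xmlDetector, yamlDetector];

// First match wins; undefined means no registered format matched.
function detectFormat(content: string): Detector | undefined {
  return registry.find((d) => d.detect(content));
}
```

With YAML listed first, the same pretty-printed JSON input would be claimed by the YAML pattern. Ordering by specificity avoids that without needing a parser.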

Docs & prose

2 converters

markdown

Headings, tables, code fences, frontmatter, directives, GFM alerts, HTML blocks preserved verbatim. Paragraph prose between them is compressible.

mdx

Everything Markdown preserves, plus ESM imports/exports and PascalCase JSX component blocks. MDX detection runs before Markdown so JSX doesn't leak into prose.

Data & config

4 converters

json

All keys, numbers, booleans, and short strings survive verbatim. String values with at least six words and at least 100 chars are candidates for summarization.

yaml

Keys, booleans, numbers, and strings up to 60 chars are atomic. Longer string values are treated as prose.

toml

Section headers and atomic keys preserved; only long string values move to the compressible bucket.

csv

Header row kept verbatim; a row-count annotation stands in for the data rows. Always shorter; nothing structural is lost.

Markup & build

3 converters

xml

Full tag skeleton and attributes held; short text values kept; prose text nodes and verbose XML comments are compressible.

html

Tag skeleton preserved. script and style blocks collapse to [code] placeholders rather than being treated as prose.

dockerfile

Every FROM / RUN / COPY / CMD / ENV instruction survives line-for-line. Multi-line prose comment blocks above stages are the compressible unit.
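The JSON and YAML thresholds above can be sketched as two predicates. The helper names are hypothetical, and two readings are assumptions: words are counted by splitting on whitespace, and "up to 60 chars" is taken to include exactly 60.

```typescript
// Hypothetical helpers; the thresholds come from the json and yaml descriptions.
const countWords = (s: string): number =>
  s.trim().split(/\s+/).filter(Boolean).length;

// JSON: a string is a summarization candidate only when it has
// at least six words AND at least 100 characters.
function isCompressibleJsonString(value: string): boolean {
  return countWords(value) >= 6 && value.length >= 100;
}

// YAML: strings up to 60 chars are atomic; longer ones are treated as prose.
function isCompressibleYamlString(value: string): boolean {
  return value.length > 60;
}
```

Under these assumptions a value such as `nginx:1.25` stays atomic in both formats, while a multi-sentence description crosses both thresholds.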

The split — 4 methods

Structure vs. prose

One contract every converter honors. Deterministic, reviewable, reversible.

Every converter — built-in or custom — implements the same four methods. Your compression engine, middleware, or RAG pre-processor doesn't have to know which format it's looking at. Aletheia's rule: the contract is the artifact, not the implementation.

extractPreserved(content)

Returns the structural skeleton — the parts that must survive verbatim: tags, keys, headings, instructions, short atomic values.

extractCompressible(content)

Returns the prose segments. Each one is safe to hand to an LLM for summarization, translation, or token-budget trimming.

reconstruct(preserved, summary)

Reassembles output from the skeleton plus a summary string. Where the summary lands is per-format: a trailing comment in YAML, a node in XML, a _summary key in JSON.

detect(content)

Reports whether the content matches the format. The top-level detect() tries the nine registered converters in order, and the first match wins. Order is intentional: MDX before Markdown, JSON / XML before YAML. No AST, no external parsers.

Structural typing. The same four-method signature is used by context-compression-engine's FormatAdapter — pass converters straight in, no wrapper required.
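A custom converter, as a sketch of the four-method contract. The interface name `ConverterLike` and all parameter and return types are assumptions inferred from the usage shown on this page, not the library's published typings. The `.env` converter is hypothetical and exists only to show the split.

```typescript
// Assumed shape: method names follow the contract above, types are guesses.
interface ConverterLike {
  detect(content: string): boolean;
  extractPreserved(content: string): string;
  extractCompressible(content: string): string[];
  reconstruct(preserved: string, summary: string): string;
}

const ASSIGNMENT = /^\s*[A-Z_][A-Z0-9_]*=/;
const COMMENT = /^\s*#/;

// Hypothetical custom converter for .env-style files.
const envConverter: ConverterLike = {
  // Every non-blank line is a comment or KEY=value, with at least one assignment.
  detect(content) {
    const lines = content.split('\n').filter((l) => l.trim() !== '');
    return (
      lines.length > 0 &&
      lines.every((l) => ASSIGNMENT.test(l) || COMMENT.test(l)) &&
      lines.some((l) => ASSIGNMENT.test(l))
    );
  },
  // Skeleton: assignment lines, verbatim.
  extractPreserved(content) {
    return content.split('\n').filter((l) => ASSIGNMENT.test(l)).join('\n');
  },
  // Prose: comment text with the leading '#' stripped.
  extractCompressible(content) {
    return content
      .split('\n')
      .filter((l) => COMMENT.test(l))
      .map((l) => l.replace(/^\s*#\s?/, ''));
  },
  // The summary goes back as a single trailing comment.
  reconstruct(preserved, summary) {
    return summary ? `${preserved}\n# ${summary}` : preserved;
  },
};
```

Because the contract is structural, any object with these four methods fits wherever a converter is expected, provided the real typings match the shapes assumed here.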

In code — 24 lines

What it looks like in use

Detect, split, summarize, reassemble. That’s the whole loop.

The library stays out of your way. You keep control of the LLM call, the prompt, and the token budget. format-converters only owns the seam.

# Install
npm install @lisa/format-converters

// Detect and split any supported format
import { detect } from '@lisa/format-converters';

const result = detect(content);
if (result) {
  const preserved    = result.converter.extractPreserved(content);
  const compressible = result.converter.extractCompressible(content);
  const summary      = await myLlm.summarize(compressible.join('\n'));
  const output       = result.converter.reconstruct(preserved, summary);
  // Fail-safe: if output.length >= content.length, discard output and keep content
}

// Or pick one converter directly
import { YamlConverter } from '@lisa/format-converters';

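if (YamlConverter.detect(content)) {
  const skeleton = YamlConverter.extractPreserved(content);
  const prose    = YamlConverter.extractCompressible(content);
  // Same summarizer call as above: join the prose segments, then await the summary
  const summary  = await myLlm.summarize(prose.join('\n'));
  const output   = YamlConverter.reconstruct(skeleton, summary);
}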

Metis says: a CLI wrapper (npx format-converters detect and friends) is on the alpha roadmap. The library is the load-bearing artifact; the CLI will be a thin shell over the same four methods.

Honest positioning — 03

What this is not

The adversary’s disclosure. Read before you install.

Ipcha Mistabra wrote this section. Before this gets wired into your context compression pipeline, know what it will and will not do for you.

Disclosure

Not a pandoc replacement.

We don't convert PDF to DOCX or render HTML to PNG. This is a detection-and-split library for LLM pipelines, not a document transformer. If you need pandoc, use pandoc.

Disclosure

No semantic transforms inside the library.

The summarizer is your LLM, your prompt, your tokens. format-converters only gives you the seam — what to send, what to keep, how to stitch the result back together.

Disclosure

Heuristic detection. Fail-safe reconstruction.

Detection is fast pattern matching, not parsing. Check reconstruct() output against the input length at the call site: if a round-trip does not make the content shorter, discard it and keep the original. Lossy by design, flagged by contract.
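The length check described above, as a call-site sketch. `keepIfShorter` is a hypothetical helper, not a library export, and it assumes reconstruct() returns a string. It applies the strict reading: only a strictly shorter result replaces the original.

```typescript
// Hypothetical helper: the caller, not the library, enforces the fail-safe.
interface FailSafeResult {
  text: string;
  compressed: boolean; // false means the original was kept
}

function keepIfShorter(original: string, candidate: string): FailSafeResult {
  // Strictly shorter, otherwise the original wins.
  return candidate.length < original.length
    ? { text: candidate, compressed: true }
    : { text: original, compressed: false };
}
```

Returning the flag alongside the text lets a pipeline log how often the fallback fires.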

Install — under a minute

Quick start

One package. No runtime dependencies.

ESM only. Node 18+. Runs unchanged on Deno, Bun, Cloudflare Workers, Vercel edge, and any V8-ish runtime that speaks modules.

# Install
npm install @lisa/format-converters

# Or with pnpm / yarn / bun
pnpm add @lisa/format-converters
bun add @lisa/format-converters

# From source
git clone https://github.com/nyxCore-Systems/format-converters
cd format-converters && npm install
npm test   # 226 tests, vitest, no network


v0.2.0 · alpha · MIT-style community licence

Before you wire it into production pre-processing

The reconstruct() contract is fail-safe by design: if round-trip output is not strictly shorter than the input, discard it and keep the original. Honor that check at the call site. A zero-token save is not a win if you shipped a regression into the prompt — Aletheia would rather you kept the full document.

Metis says: measure the compression ratio per format in your corpus before trusting a global flag.
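One way to take that measurement, sketched under assumptions: the `Sample` record and `ratioByFormat` are hypothetical names, and size is counted in characters as a stand-in for tokens. Swap in your tokenizer's counts if you have them.

```typescript
// Hypothetical record of one round-trip through detect, split, and reconstruct.
interface Sample {
  format: string;
  inputChars: number;
  outputChars: number;
}

// Returns the input/output ratio per format (3.2 means 3.2x smaller).
function ratioByFormat(samples: Sample[]): Map<string, number> {
  const totals = new Map<string, { input: number; output: number }>();
  for (const s of samples) {
    const t = totals.get(s.format) ?? { input: 0, output: 0 };
    t.input += s.inputChars;
    t.output += s.outputChars;
    totals.set(s.format, t);
  }
  const ratios = new Map<string, number>();
  totals.forEach((t, format) => {
    // Guard against empty outputs rather than dividing by zero.
    ratios.set(format, t.output > 0 ? t.input / t.output : Infinity);
  });
  return ratios;
}
```

A format whose ratio sits near 1.0 in your corpus is a candidate for skipping compression entirely.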
