Indexing Architecture#

Matrix Hub ingests manifests (agents, tools, MCP servers) from remote catalogs (e.g., GitHub), validates them, normalizes metadata into a relational Catalog DB, optionally chunks & embeds long text, and updates a Vector Index plus a Blob Store for RAG.

flowchart LR
R[Remotes\nindex.json/tree/zip] --> I[Ingestor]
I --> V[Validator\nJSON Schema + policy]
V --> N[Normalizer\nmanifest → DB rows]
N --> D[Catalog DB]
N --> C[Chunker\nname/desc/README]
C --> E[Embedder]
E --> X[Vector Index]
N --> B[Blob Store]

Goals#

Reuse-first: one canonical entry per (type, id, version).
Idempotent: re-ingesting the same content is safe.
Provable: provenance stored (remote@commit, ETag, timestamps).
Composable: lexical, vector, and RAG layers are pluggable.

Data Flow (high level)#

Discover: pull index.json (or walk repo ZIP/tree).
Validate: JSON Schema + optional signatures/SBOM checks.
Normalize: map to entity, capabilities, artifacts, adapters, compatibility.
Persist: upsert into DB; record provenance and checksums.
Chunk & Embed (optional): split long text → vectors → upsert vector index.
Blob Store: persist large text (READMEs, examples) for RAG retrieval.

SLOs & Operational Notes#

Freshness: default ingest every 15 minutes (configurable).
Back-pressure: batch size and rate limits to avoid API throttling.
Consistency: write DB first, then vectors/blobs; retries are safe.
Observability: structured logs; ingest counters; rejection reasons.