[explainer]

How this map is built and why the results are credible.

This page explains the exact method behind the OSS AI Stack Map: how repositories are discovered, how technology evidence is extracted, how raw package signals become canonical technologies, and how benchmarks plus graph analysis drive the next round of research.

+ Research object
AI tech stacks
The unit of study is how major OSS AI repos compose their stack, not just which repos exist.
+ Primary surface
Registry
technology_registry.yaml is the main curated research asset for canonical technologies.
+ Iteration mode
Repair-first
We improve normalization and reporting by repairing snapshots without requiring a new GitHub crawl every time.
+ Discovery mode
Graph-guided
Network analysis ranks unmatched package families so research effort follows the data.
+ frame

What this map is

A normalized graph of technologies used by serious OSS AI repos, not just a list of repositories.

+ frame

What this map is not

It is not a generic dependency census, an unreviewed README keyword scrape, or a popularity leaderboard.

+ frame

Why repair-first matters

Normalization and research quality improve faster than discovery, so we iterate by repairing snapshots against the current registry.

+ method

The workflow in six concrete steps

step 1
Discover a broad GitHub universe from topics, description keywords, and manual seed repos.
step 2
Build repo context from README, file tree, manifests, SBOMs, and imports.
step 3
Classify which repos are serious, AI-relevant, and worth publishing in the final map.
step 4
Extract dependency evidence and normalize it into canonical technologies and provider signals.
step 5
Repair snapshots so new registry logic can be applied without doing a new GitHub crawl.
step 6
Use gap reports, benchmarks, and graph analysis to decide what the registry should learn next.
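As a concrete illustration of step 3, a rule-based inclusion check might look like the sketch below. The field names (`stars`, `topics`, `has_manifest`) and the thresholds are illustrative assumptions, not the project's actual rules.

```python
# Hypothetical rule-based inclusion check (step 3). Field names and
# thresholds are invented for illustration.
AI_TOPICS = {"llm", "rag", "agents", "machine-learning"}

def include_repo(repo: dict) -> bool:
    serious = repo["stars"] >= 100 and not repo["archived"]
    ai_relevant = bool(AI_TOPICS & set(repo["topics"]))
    has_evidence = repo["has_manifest"] or repo["has_sbom"]
    return serious and ai_relevant and has_evidence

candidate = {
    "stars": 2400, "archived": False,
    "topics": ["llm", "python"], "has_manifest": True, "has_sbom": False,
}
```

The point of the sketch is that every inclusion decision is a conjunction of explainable boolean rules, which is what makes the decisions auditable in `repo_inclusion_decisions.parquet`.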
+ evidence

What counts as stack evidence

Declared dependencies from Python, JavaScript/TypeScript, Go, and Rust manifests.
SBOM direct dependencies from GitHub dependency graph exports.
Import-derived fallback detection when manifest coverage is incomplete.
Repo identity edges for upstream canonical projects.
Curator-reviewed low-confidence README fallback only when a final repo would otherwise remain unmapped.
+ curation

What stays automatic and what stays curated

The canonical registry is the primary research surface. It tracks provider, product, sdk_family, package-family, aliases, package prefixes, repo identities, and capabilities.
Auto-discovery finds and ranks candidate families, but does not promote them directly into the registry.
Benchmarks measure recall against known AI stack entities so regressions become visible immediately.
Optional LLM judges narrow review queues for repo classification and registry suggestions, but do not replace curation.
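A registry entry could look roughly like the fragment below. The entity and all values are invented for illustration; only the field names mirror the ones listed above, and the real schema in `technology_registry.yaml` may differ.

```yaml
# Hypothetical entry — entity and values invented for illustration.
- name: acme
  provider: Acme Labs
  product: Acme LLM SDK
  sdk_family: acme
  aliases: [acme-llm, acme_llm]
  package_prefixes: [acme-llm-]
  repo_identities: [acme-labs/acme-llm]
  capabilities: [model_access]
```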

Matching order: exact alias, registry alias, package-family prefix, provider inference, repo identity, then low-confidence README fallback.

Judge usage: repo judges and registry judges are optional, conservative, and used to narrow review queues rather than replace curation.

+ architecture

The pipeline and the feedback loops that improve it

[top-level flow]
flowchart LR
    A[GitHub Search + Manual Seeds] --> B[Discovery]
    B --> C[repos.parquet]
    C --> D[Classification Context Build]
    D --> E[repo_contexts.parquet]
    E --> F[Rule-Based Classification]
    F --> G[repo_inclusion_decisions.parquet]
    E --> H[Dependency Evidence]
    H --> I[repo_dependency_evidence.parquet]
    E --> J[Normalization]
    G --> J
    J --> K[repo_technology_edges.parquet]
    J --> L[technologies.parquet]
    C --> M[Snapshot Repair]
    E --> M
    G --> M
    K --> M
    L --> M
    M --> N[gap_report.json]
    M --> O[benchmark_recall_report.json]
    M --> P[technology_discovery_report.json]
    M --> Q[registry_suggestions.json]
    N --> R[Report + Explainer]
    O --> R
    P --> R
    Q --> R
[artifact lineage]
flowchart TD
    A[repos.parquet] --> B[repo_contexts.parquet]
    B --> C[repo_dependency_evidence.parquet]
    B --> D[repo_inclusion_decisions.parquet]
    C --> E[repo_technology_edges.parquet]
    D --> E
    E --> F[technologies.parquet]
    A --> G[gap_report.json]
    C --> G
    D --> G
    E --> G
    A --> H[benchmark_recall_report.json]
    C --> H
    E --> H
    A --> I[technology_discovery_report.json]
    C --> I
    D --> I
    E --> I
    I --> J[registry_suggestions.json]
    H --> J
    J --> K[docs output]
[improvement loop]
flowchart LR
    A[Discovery Queries and Manual Seeds] --> B[Candidate Repo Universe]
    B --> C[Rule Classifier]
    C --> D{Optional Judge}
    D --> E[Final Repo Set]
    E --> F[Dependency Evidence and Normalization]
    F --> G[Canonical Technology Map]
    G --> H[Gap Report + Benchmark Recall + Registry Suggestions]
    H --> I{Which lever is weak?}
    I --> J[Discovery Updates]
    I --> K[Classifier Updates]
    I --> L[Judge Prompt and Routing Updates]
    I --> M[Registry Updates]
    J --> A
    K --> C
    L --> D
    M --> F
[registry curation loop]
flowchart LR
    A[Unmatched dependency evidence] --> B[NetworkX candidate graph]
    B --> C[Ranked family backlog]
    C --> D[Registry suggestion filter]
    D --> E{LLM judge enabled?}
    E -- yes --> F[OpenAIRegistryJudge review]
    E -- no --> G[Curator review]
    F --> H[Accept / Review / Reject]
    G --> H
    H --> I[technology_registry.yaml]
    I --> J[snapshot-repair]
    J --> K[Improved edges, gaps, and benchmarks]
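The ranking idea behind the candidate graph can be shown without the full pipeline. The real implementation builds a NetworkX graph; the dependency-free sketch below captures the gist with a weighted co-occurrence degree over toy, invented package families.

```python
from collections import Counter
from itertools import combinations

# Toy unmatched evidence: repo -> unmatched package families it depends on.
unmatched = {
    "org/repo-a": {"acme-llm", "fastrag"},
    "org/repo-b": {"acme-llm", "vexdb"},
    "org/repo-c": {"acme-llm", "fastrag", "vexdb"},
}

# Weighted degree: each in-repo co-occurrence adds an edge between two
# families; high-degree families head the research backlog.
degree = Counter()
for families in unmatched.values():
    for u, v in combinations(sorted(families), 2):
        degree[u] += 1
        degree[v] += 1

backlog = [family for family, _ in degree.most_common()]
```

A family that co-occurs with many other unmatched families across many repos is the one whose promotion into the registry would unlock the most new edges, which is why it ranks first.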
+ operating model

How to work with the system

Fresh universe: use `discover` and `classify` when you need a new GitHub crawl.

Research iteration: use `snapshot-repair` when the registry, normalization rules, or reporting logic improves faster than discovery.

Publication: render the report to `docs/`, serve the directory over Tailscale, and keep the stable entrypoints at `index.html`, `oss-ai-stack-report-latest.html`, and `report-latest.json`.
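The repair-first idea reduces to re-running normalization over stored raw evidence. This sketch uses plain dicts where the pipeline uses parquet snapshots, and the package names are invented.

```python
# Raw evidence is captured once at crawl time and kept verbatim.
snapshot = {"org/repo-a": ["langchain-core", "acme-llm"]}

def repair(snapshot: dict, aliases: dict) -> dict:
    """Re-derive technology edges from stored evidence with the current registry."""
    return {
        repo: [aliases[pkg] for pkg in pkgs if pkg in aliases]
        for repo, pkgs in snapshot.items()
    }

edges_v1 = repair(snapshot, {"langchain-core": "langchain"})
# After the registry learns the (invented) acme-llm family, repairing the
# same snapshot recovers the new edge — no new GitHub crawl required.
edges_v2 = repair(snapshot, {"langchain-core": "langchain", "acme-llm": "acme"})
```

Because the snapshot is immutable and normalization is a pure function of it, every registry improvement is cheap to apply and easy to diff.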

+ implementation map

Where the code and config live

Config: `config/study_config.yaml`, `config/discovery_topics.yaml`, `config/technology_aliases.yaml`, `config/technology_registry.yaml`, `config/benchmark_entities.yaml`, `config/segment_rules.yaml`

Pipeline: `discovery.py`, `classification.py`, `normalize.py`, `reporting.py`, `technology_discovery.py`, `registry_suggestions.py`

OpenAI review: `openai/judge.py` and `openai/registry_judge.py`

Docs and publication: `analysis/snapshot.py`, `scripts/render_html_report.py`, `scripts/render_explainer_page.py`, `ARCHITECTURE.md`

+ faq

Common questions about the project

Does discovery expand recursively through dependency graphs?
Not today. Discovery starts from GitHub topics, description queries, and manual seed repos. Dependency evidence is collected later during classification and normalization, where it helps map technologies rather than recursively discover more repositories.
What is the judge actually doing?
The judge is an optional reviewer for repo inclusion or registry suggestion triage. It is not the primary classifier and it does not define the technology map on its own. Its job is to narrow the queue of ambiguous cases.
What is the registry and why is it so important?
The registry is the curated canonical list of technologies the project knows how to recognize. It stores aliases, package prefixes, import aliases, repo identities, category metadata, and entity type. Improving the registry is the fastest way to reduce unmapped evidence and missing edges.
Why does the project have a repair-first workflow?
Normalization, registry curation, and reporting usually improve faster than GitHub discovery. Snapshot repair lets the project reapply better logic to an existing snapshot without paying for a new crawl every time.
What does the benchmark measure?
The benchmark measures whether curated important entities are discovered, included, identity-mapped, observed in third-party adoption, and supported by dependency evidence. It is both a regression guard and a growing recall panel.
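Per-check recall over a benchmark panel can be computed as below. The entities and check results are toy data (one real project name, one invented family), and the check names paraphrase the dimensions listed in the answer above.

```python
CHECKS = ["discovered", "included", "identity_mapped", "adopted", "evidenced"]

# Toy panel: check results per benchmark entity (acme-llm is invented).
panel = {
    "langchain": dict.fromkeys(CHECKS, True),
    "acme-llm": {**dict.fromkeys(CHECKS, True), "included": False, "adopted": False},
}

# Per-check recall: fraction of benchmark entities passing each check.
recall = {
    check: sum(results[check] for results in panel.values()) / len(panel)
    for check in CHECKS
}
```

Reporting recall per check, rather than one aggregate number, is what makes it obvious which lever (discovery, classification, registry, or evidence) regressed.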
What counts as a good improvement?
A good improvement raises benchmark recall, reduces final repos missing normalized edges, improves precision on hard cases, or simplifies the system without hurting the scorecard.
Can the registry suggestions be accepted automatically?
No. Registry suggestions are ranked candidates inferred from real evidence, but they still require curator review. The project treats automatic candidate generation and curated promotion as separate steps.
What is the ideal final output of the project?
The ideal output is a credible map of the enabling AI stack used across serious OSS AI projects: providers, model access layers, orchestration frameworks, runtimes, vector systems, training tools, observability systems, and other canonical technology entities.