What this map is
A normalized graph of technologies used by serious OSS AI repos, not just a list of repositories.
This page explains the exact method behind the OSS AI Stack Map: how repositories are discovered, how technology evidence is extracted, how raw package signals become canonical technologies, and how benchmarks plus graph analysis drive the next round of research.
It is not a generic dependency census, an unreviewed README keyword scrape, or a popularity leaderboard.
Normalization and research quality improve faster than discovery, so we iterate by repairing snapshots against the current registry.
Matching order: exact alias, registry alias, package-family prefix, provider inference, repo identity, then low-confidence README fallback.
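The matching order above can be sketched as a cascade that stops at the first tier that hits. This is a minimal illustration, not the actual `normalize.py` implementation: the lookup tables, helper names, and confidence values are all assumptions (in reality the aliases come from `config/technology_aliases.yaml` and `config/technology_registry.yaml`).

```python
# Hedged sketch of the matching cascade. Tables, names, and confidence
# values are illustrative assumptions, not the real normalize.py code.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Match:
    technology: str
    method: str
    confidence: float

# Hypothetical lookups standing in for the YAML-configured registries.
EXACT_ALIASES = {"torch": "pytorch"}
REGISTRY_ALIASES = {"pytorch-lightning": "lightning"}
FAMILY_PREFIXES = {"langchain-": "langchain", "llama-index-": "llamaindex"}
PROVIDER_HINTS = {"openai": "openai-api", "anthropic": "anthropic-api"}

def match_package(name: str, repo_slug: Optional[str] = None,
                  readme_hit: Optional[str] = None) -> Optional[Match]:
    """Apply each matching tier in order; return on the first hit."""
    key = name.lower()
    if key in EXACT_ALIASES:
        return Match(EXACT_ALIASES[key], "exact_alias", 1.0)
    if key in REGISTRY_ALIASES:
        return Match(REGISTRY_ALIASES[key], "registry_alias", 0.95)
    for prefix, tech in FAMILY_PREFIXES.items():
        if key.startswith(prefix):
            return Match(tech, "family_prefix", 0.9)
    for provider, tech in PROVIDER_HINTS.items():
        if provider in key:
            return Match(tech, "provider_inference", 0.7)
    if repo_slug and repo_slug.split("/")[-1].lower() == key:
        return Match(key, "repo_identity", 0.6)
    if readme_hit:
        return Match(readme_hit, "readme_fallback", 0.3)
    return None
```

Each tier is strictly cheaper in confidence than the one before it, which is why the README fallback is last and explicitly low-confidence.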
Judge usage: repo judges and registry judges are optional, conservative, and used to narrow review queues rather than replace curation.
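The conservative judge routing can be sketched as a band filter on the rule classifier's score: confident rule decisions bypass the judge entirely, and only the ambiguous middle band is queued for optional LLM review. The thresholds and function name here are illustrative assumptions, not the actual `classification.py` interface.

```python
# Hedged sketch of conservative judge routing, assuming the rule
# classifier emits a score in [0, 1]. Thresholds are illustrative.
def route(rule_score: float, lo: float = 0.35, hi: float = 0.75) -> str:
    """Confident rule decisions skip the judge; only the ambiguous band
    goes to the optional LLM review queue, narrowing human curation
    rather than replacing it."""
    if rule_score >= hi:
        return "include"
    if rule_score <= lo:
        return "exclude"
    return "judge_queue"
```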
```mermaid
flowchart LR
A[GitHub Search + Manual Seeds] --> B[Discovery]
B --> C[repos.parquet]
C --> D[Classification Context Build]
D --> E[repo_contexts.parquet]
E --> F[Rule-Based Classification]
F --> G[repo_inclusion_decisions.parquet]
E --> H[Dependency Evidence]
H --> I[repo_dependency_evidence.parquet]
E --> J[Normalization]
G --> J
J --> K[repo_technology_edges.parquet]
J --> L[technologies.parquet]
C --> M[Snapshot Repair]
E --> M
G --> M
K --> M
L --> M
M --> N[gap_report.json]
M --> O[benchmark_recall_report.json]
M --> P[technology_discovery_report.json]
M --> Q[registry_suggestions.json]
N --> R[Report + Explainer]
O --> R
P --> R
Q --> R
```
```mermaid
flowchart TD
A[repos.parquet] --> B[repo_contexts.parquet]
B --> C[repo_dependency_evidence.parquet]
B --> D[repo_inclusion_decisions.parquet]
C --> E[repo_technology_edges.parquet]
D --> E
E --> F[technologies.parquet]
A --> G[gap_report.json]
C --> G
D --> G
E --> G
A --> H[benchmark_recall_report.json]
C --> H
E --> H
A --> I[technology_discovery_report.json]
C --> I
D --> I
E --> I
I --> J[registry_suggestions.json]
H --> J
J --> K[docs output]
```
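The core of the gap computation is simple to state: dependency-evidence rows that never produced a technology edge are the normalization gaps, ranked by how many repos they affect. The sketch below uses plain tuples in place of the parquet artifacts, and every field name is an assumption about the schemas, not the actual one.

```python
# Hedged sketch of the gap ranking. In the real pipeline the inputs come
# from repo_dependency_evidence.parquet and repo_technology_edges.parquet;
# here in-memory (repo_id, raw_package) pairs stand in, and the field
# names are assumptions.
from collections import Counter

def build_gap_counts(evidence, edges):
    """evidence: iterable of (repo_id, raw_package) pairs observed.
    edges: iterable of (repo_id, raw_package) pairs that DID normalize
    into a technology edge. Returns packages ranked by how many
    evidence rows failed to match."""
    matched = set(edges)
    gaps = Counter(pkg for (repo, pkg) in evidence
                   if (repo, pkg) not in matched)
    return gaps.most_common()
```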
```mermaid
flowchart LR
A[Discovery Queries and Manual Seeds] --> B[Candidate Repo Universe]
B --> C[Rule Classifier]
C --> D{Optional Judge}
D --> E[Final Repo Set]
E --> F[Dependency Evidence and Normalization]
F --> G[Canonical Technology Map]
G --> H[Gap Report + Benchmark Recall + Registry Suggestions]
H --> I{Which lever is weak?}
I --> J[Discovery Updates]
I --> K[Classifier Updates]
I --> L[Judge Prompt and Routing Updates]
I --> M[Registry Updates]
J --> A
K --> C
L --> D
M --> F
```
```mermaid
flowchart LR
A[Unmatched dependency evidence] --> B[NetworkX candidate graph]
B --> C[Ranked family backlog]
C --> D[Registry suggestion filter]
D --> E{LLM judge enabled?}
E -- yes --> F[OpenAIRegistryJudge review]
E -- no --> G[Curator review]
F --> H[Accept / Review / Reject]
G --> H
H --> I[technology_registry.yaml]
I --> J[snapshot-repair]
J --> K[Improved edges, gaps, and benchmarks]
```
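The NetworkX candidate-graph step can be sketched as follows: unmatched packages become nodes, co-occurrence within a repo becomes weighted edges, and frequency plus degree rank the family backlog. The scoring heuristic here is illustrative, not the actual `technology_discovery.py` ranking.

```python
# Hedged sketch of the candidate graph and backlog ranking. The input
# shape and the tie-breaking heuristic are assumptions.
from itertools import combinations
import networkx as nx

def rank_candidates(unmatched_by_repo):
    """unmatched_by_repo: dict of repo -> set of unmatched package names.
    Returns package names ranked for the family backlog."""
    g = nx.Graph()
    for packages in unmatched_by_repo.values():
        for pkg in packages:
            g.add_node(pkg)
            g.nodes[pkg]["repos"] = g.nodes[pkg].get("repos", 0) + 1
        # Packages seen together in one repo co-occur; accumulate weight.
        for a, b in combinations(sorted(packages), 2):
            w = g.edges[a, b]["weight"] + 1 if g.has_edge(a, b) else 1
            g.add_edge(a, b, weight=w)
    # Rank by repo frequency, breaking ties with degree so packages
    # embedded in larger co-occurring families surface first.
    return sorted(g.nodes,
                  key=lambda n: (g.nodes[n]["repos"], g.degree(n)),
                  reverse=True)
```

Ranking by family rather than by individual package is what keeps the backlog short: one accepted registry entry can resolve an entire prefix family at once.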
- Fresh universe: use `discover` and `classify` when you need a new GitHub crawl.
- Research iteration: use `snapshot-repair` when the registry, normalization rules, or reporting logic improves faster than discovery.
- Publication: render the report to `docs/`, serve the directory over Tailscale, and keep the stable entrypoints at `index.html`, `oss-ai-stack-report-latest.html`, and `report-latest.json`.
- Config: `config/study_config.yaml`, `config/discovery_topics.yaml`, `config/technology_aliases.yaml`, `config/technology_registry.yaml`, `config/benchmark_entities.yaml`, `config/segment_rules.yaml`
- Pipeline: `discovery.py`, `classification.py`, `normalize.py`, `reporting.py`, `technology_discovery.py`, `registry_suggestions.py`
- OpenAI review: `openai/judge.py` and `openai/registry_judge.py`
- Docs and publication: `analysis/snapshot.py`, `scripts/render_html_report.py`, `scripts/render_explainer_page.py`, `ARCHITECTURE.md`