Content indexer
Runs in: GitHub Actions (bot-index-content.yml, Ubuntu runner) executing Node.js code in indexer/, which calls Cloudflare Workers AI for embeddings and Cloudflare Vectorize to store them.
The answers Jinx gives in DMs and the Assistant panel come from a Cloudflare Vectorize index called rladies-content.
A separate weekly workflow walks a set of sources, chunks them, embeds each chunk with the BGE model, and upserts the vectors into the index.
The same index is queried at retrieval time from the worker.
When it runs
The Bot · Index Content workflow runs every Sunday at 04:00 UTC and can be triggered manually:
gh workflow run "Bot · Index Content" --repo rladies/jinx
A full run takes a few minutes and re-embeds every chunk – there is no incremental indexing today.
The pipeline
flowchart LR
subgraph Sources
G[rladiesguide<br>hugo-site]
W[rladies.github.io<br>hugo-site]
O[rladies GitHub org<br>READMEs]
P[pkgdown llms.txt<br>across org R packages]
J[jinx-docs<br>help.md, NEWS, PRIVACY]
M[meetup_archive<br>events.json]
A[awesome-rladies-creations<br>packages + content]
Y[RLadies+ YouTube channel<br>via Data API v3]
end
Sources --> Chunk[chunk into ~1.8k-char<br>sections with title,<br>heading, url, date, lastmod]
Chunk --> Embed[Workers AI<br>BGE-base embedding]
Embed --> Upsert[upsert to Vectorize<br>rladies-content]
Upsert -. queried at runtime .-> Q[(DM / Assistant /<br>@-mention answers)]
style Chunk fill:#88398A,color:#fff
style Embed fill:#88398A,color:#fff
style Upsert fill:#88398A,color:#fff
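The chunking step in the diagram can be pictured with a minimal sketch like the one below. It is not the code in indexer/; `chunkText`, its paragraph-based packing, and the `maxChars` parameter are assumptions made for illustration, with only the roughly 1.8k-character target and the metadata fields taken from the diagram.

```javascript
// Hypothetical helper: pack paragraphs into chunks of at most `maxChars`
// characters, copying the page metadata (title, heading, url, date, lastmod)
// onto every chunk. The real indexer may split on headings instead.
function chunkText(text, meta, maxChars = 1800) {
  const paragraphs = text
    .split(/\n{2,}/)
    .map((p) => p.trim())
    .filter(Boolean);
  const chunks = [];
  let current = "";

  const flush = () => {
    if (current) chunks.push({ ...meta, text: current });
    current = "";
  };

  for (const para of paragraphs) {
    // A single oversized paragraph is hard-split at the character limit.
    if (para.length > maxChars) {
      flush();
      for (let i = 0; i < para.length; i += maxChars) {
        chunks.push({ ...meta, text: para.slice(i, i + maxChars) });
      }
      continue;
    }
    const candidate = current ? `${current}\n\n${para}` : para;
    if (candidate.length > maxChars) {
      flush();
      current = para;
    } else {
      current = candidate;
    }
  }
  flush();
  return chunks;
}
```

Packing whole paragraphs keeps each chunk readable on its own, which matters because the chunk text is what gets embedded and later shown as context.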
What gets indexed
| Source | What it covers |
|---|---|
| rladies/rladiesguide | The guide you are reading right now. Crawled from the production sitemap (English content only). |
| rladies/rladies.github.io | The main RLadies+ website. Crawled from the production sitemap. |
| rladies/* org READMEs | Top-level README of every non-archived repo in the org. |
| pkgdown llms.txt | LLM-friendly summaries of every R package in the org with a DESCRIPTION file and a published pkgdown site. |
| rladies/jinx docs | Jinx’s inst/commands/help.md, NEWS.md, and PRIVACY.md. |
| rladies/meetup_archive | The full events.json – active events plus past events from the last 12 months. Cancelled and older past events are dropped. |
| rladies/awesome-rladies-creations | Both awesome_packages.json (~315 R packages) and awesome_content.json (blogs / sites / videos). |
| RLadies+ Global YouTube channel | Every video on the channel, fetched via YouTube Data API v3. Title, description, and publish date are indexed. |
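The meetup_archive rule in the table (keep active events and past events from the last 12 months, drop cancelled and older ones) could look roughly like the sketch below. The event shape is an assumption: only `datetime_utc` is named in this guide, while the `status` field, its `"cancelled"` value, and reading "active" as "not cancelled and not yet past" are hypothetical.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical filter for one entry of events.json.
// Assumed shape: { title, status, datetime_utc }.
function keepEvent(event, now = new Date(), windowDays = 365) {
  if (event.status === "cancelled") return false;

  const start = new Date(event.datetime_utc);
  if (Number.isNaN(start.getTime())) return false; // unparseable date: skip

  if (start >= now) return true; // upcoming events are always kept
  return now - start <= windowDays * MS_PER_DAY; // past events: last 12 months only
}

// Usage: const indexable = events.filter((e) => keepEvent(e));
```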
How dates are used
The reranker treats lastmod (last-modified date pulled from Hugo’s article:modified_time meta tag for guide/website pages) as a tiebreaker: pages maintained recently win narrow contests over pages that have not been touched in years.
date (the original publication date for blog/news content, or datetime_utc for events) feeds a content-recency factor with a two-year half-life.
Upcoming events clamp to the recency ceiling, so “what events are coming up?” surfaces them above year-old past meetups.
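As a rough illustration of the content-recency factor, a two-year half-life means a chunk's factor halves for every two years of age. The sketch below assumes a plain exponential decay with a ceiling of 1; the function name, the ceiling value, and the exact formula are assumptions, and the real logic in worker/src/rag.js may differ.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;
const HALF_LIFE_DAYS = 2 * 365; // two-year half-life, as described above

// Hypothetical recency factor in (0, 1].
// `dateIso` is the chunk's `date` (publication date, or datetime_utc for events).
// Chunks without a usable date are out of scope for this sketch.
function recencyFactor(dateIso, now = new Date()) {
  const ageDays = (now - new Date(dateIso)) / DAY_MS;
  if (ageDays <= 0) return 1; // upcoming events clamp to the ceiling
  return Math.pow(0.5, ageDays / HALF_LIFE_DAYS);
}
```

Under these assumptions, an event scheduled for next month gets a factor of 1, a post from two years ago gets 0.5, and one from four years ago gets 0.25, which is why upcoming events outrank year-old meetups of similar relevance.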
Adding a new source
Each source lives as a small module in indexer/sources/ and is wired into indexer/index.mjs.
A new source needs three things: a gather*Source() function that returns chunks, each with text, url, title, date, and lastmod fields; an entry in the SOURCES array in index.mjs; and (optionally) a new entry in SOURCE_WEIGHTS in worker/src/rag.js if it deserves a non-default rerank weight.
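A gather function for a new source might be shaped like the sketch below. Everything except the five chunk fields is hypothetical: the `gatherExampleSource` name, the `fetchPages` callback, and the input page shape are placeholders, and the actual form of a SOURCES entry should be copied from an existing source in index.mjs rather than from this example.

```javascript
// Hypothetical mapping from fetched pages to the chunk fields the indexer expects.
// Assumed input shape: { body, url, title, date?, lastmod? }.
function toChunks(pages) {
  return pages
    .filter((page) => page.body && page.body.trim())
    .map((page) => ({
      text: page.body.trim(),
      url: page.url,
      title: page.title,
      date: page.date ?? null, // original publication date, if known
      lastmod: page.lastmod ?? null, // last-modified date, if known
    }));
}

// Hypothetical source module body; a real module in indexer/sources/
// would export this function so index.mjs can import it.
async function gatherExampleSource(fetchPages = async () => []) {
  const pages = await fetchPages();
  return toChunks(pages);
}
```

Keeping the fetch step injectable makes the mapping easy to check locally without network access.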