YouTube transcripts (InnerTube)
This page summarizes how the Knowledge Studio Express server obtains captions without calling the public YouTube Data API v3 or spending Google Cloud quota. It mirrors the engineering note in the server repo (docs/youtube-innertube-transcripts.md).
What InnerTube is
The YouTube website does not load everything through the public YouTube Data API v3 (the Google Cloud product with API keys and quotas). The browser talks to InnerTube, YouTube's internal JSON API used by youtube.com and embedded players. It is undocumented for third-party apps; the server issues the same shape of requests a normal watch page makes.
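To make "the same shape of requests" concrete, here is a minimal sketch of how a player request might be assembled. Because InnerTube is undocumented, everything specific here is an assumption based on commonly observed browser traffic, not something confirmed by the server repo: the /youtubei/v1/player path, the key query parameter, the context.client fields, and the hypothetical helper name buildPlayerRequest.

```typescript
// Description of an HTTP request, kept as plain data so it can be
// handed to any HTTP client.
interface PlayerRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: builds a player request from the key scraped out of
// the watch page. The endpoint path and payload fields are assumptions.
function buildPlayerRequest(
  apiKey: string,
  videoId: string,
  clientVersion: string
): PlayerRequest {
  const payload = {
    // The client context tells InnerTube which front end is asking.
    context: { client: { clientName: "WEB", clientVersion, hl: "en" } },
    videoId,
  };
  return {
    url: `https://www.youtube.com/youtubei/v1/player?key=${encodeURIComponent(apiKey)}`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  };
}
```

Returning plain data instead of sending the request keeps the sketch independent of the HTTP client and cookie handling the server actually uses.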
Pipeline (high level)
Implementation code lives mainly under src/services/youtube-transcripts/. The HTTP entry point calls into a function such as getTranscriptFromYoutube, which runs the following steps in order:
- Watch page HTML — Request the public watch page for the video id and keep cookies (for example, the consent cookie) across requests, as a browser would.
- Extract InnerTube client key — Parse the API key string embedded in the HTML.
- Player request — POST to YouTube's player endpoint with that key and the video id. The JSON response includes playability and caption track metadata (language, base URLs, auto-generated vs manual).
- Caption download — Pick a preferred language track, then fetch the caption document (often XML) from the track URL.
- Parse to text — Parse timed segments into a single transcript string; read title and author-style fields from the player JSON where available.
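The network-free parts of the steps above (key extraction, track selection, and caption parsing) can be sketched in TypeScript as follows. This is an illustration under stated assumptions, not the code in src/services/youtube-transcripts/. The function names, the CaptionTrack shape, the "INNERTUBE_API_KEY" marker in the HTML, the "asr" value for auto-generated tracks, and the <text> element layout of the caption XML are all assumptions based on commonly observed responses, and any of them may change.

```typescript
// Assumed shape of one caption track entry in the player JSON.
interface CaptionTrack {
  baseUrl: string;
  languageCode: string;
  kind?: string; // assumed to be "asr" for auto-generated tracks
}

// Step 2: pull the client key out of the watch page HTML.
// Returns null when the assumed marker is not present.
function extractInnertubeApiKey(html: string): string | null {
  const match = /"INNERTUBE_API_KEY"\s*:\s*"([^"]+)"/.exec(html);
  return match ? match[1] ?? null : null;
}

// Step 4: choose a track. For each preferred language, a manual track
// wins over an auto-generated one; otherwise fall back to the first track.
function pickCaptionTrack(
  tracks: CaptionTrack[],
  preferred: string[]
): CaptionTrack | undefined {
  for (const lang of preferred) {
    const matches = tracks.filter((t) => t.languageCode === lang);
    const manual = matches.find((t) => t.kind !== "asr");
    if (manual) return manual;
    if (matches.length > 0) return matches[0];
  }
  return tracks[0];
}

// Decode the handful of XML entities expected in caption text.
// "&amp;" is handled last so "&amp;lt;" is not decoded twice.
function decodeEntities(s: string): string {
  return s
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, "&");
}

// Step 5: join the timed <text> segments into one transcript string,
// collapsing whitespace and skipping empty segments.
function captionXmlToText(xml: string): string {
  const parts: string[] = [];
  const segment = /<text\b[^>]*>([\s\S]*?)<\/text>/g;
  let m: RegExpExecArray | null;
  while ((m = segment.exec(xml)) !== null) {
    const text = decodeEntities(m[1] ?? "").replace(/\s+/g, " ").trim();
    if (text) parts.push(text);
  }
  return parts.join(" ");
}
```

A regular expression is used here only to keep the sketch dependency-free. A real XML parser would be more robust, for example against self-closing elements or other caption formats.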
HTTP surface (typical)
- POST /api/data/youtube-transcripts/get-transcript — Runs the pipeline and returns JSON without writing to the database (preview or form fill).
- POST /api/data/youtube-transcripts/fetch-from-url — Same fetch, then persists a youtube_transcripts row.
Both flows call the same underlying transcript fetch implementation.
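The relationship between the two endpoints can be sketched without any web framework. In this hypothetical example the Transcript shape, makeHandlers, and the injected fetchTranscript and save functions are all illustrative names, not the server's actual ones. The only point is that both handlers delegate to one fetch function and only the second one persists the result.

```typescript
// Assumed minimal result shape; the real response likely carries more fields.
interface Transcript {
  videoId: string;
  title: string;
  text: string;
}

type FetchTranscript = (url: string) => Promise<Transcript>;
type SaveTranscript = (transcript: Transcript) => Promise<void>;

// Hypothetical factory: both handlers share the injected fetch function.
function makeHandlers(fetchTranscript: FetchTranscript, save: SaveTranscript) {
  return {
    // Preview flow: fetch and return, no database write.
    async getTranscript(url: string): Promise<Transcript> {
      return fetchTranscript(url);
    },
    // Persisting flow: same fetch, then store the row before returning.
    async fetchFromUrl(url: string): Promise<Transcript> {
      const transcript = await fetchTranscript(url);
      await save(transcript);
      return transcript;
    },
  };
}
```

Injecting the fetch and save functions also makes each flow easy to exercise in tests without touching YouTube or the database.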
Tradeoffs
- Cost vs. fragility: This path incurs no Google API billing, but it carries operational risk. The watch page HTML and InnerTube responses can change without notice, and YouTube may rate-limit, challenge, or block server IPs.
- Compliance: Automated access may conflict with YouTube's terms depending on your use case. Treat this as a technical description, not legal advice.