lecture-notes-pipeline

Download Canvas course materials, recordings, and local Whisper transcripts. Optional Codex / Claude Code skill for PPT-aligned study notes.

This repository is useful in two modes:

Without an agent: download Canvas course files, download lecture recordings, extract audio, and transcribe locally with Whisper.
With an agent: align slides and transcripts, judge lecture boundaries, clean low-confidence fragments, and write compact study notes.

The command-line pipeline covers the mechanical work in four layers:

materials/   Canvas course files, PPTs, module files, assignment attachments
downloads/   Canvas lecture recordings
audio/       extracted audio
transcripts/ local Whisper transcripts

The bundled skill covers the judgment-heavy work after those assets exist.
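As a rough illustration of the four-layer layout above, the sketch below creates the directories under a course root. The function name `ensure_course_layout` is hypothetical and not part of this repo; only the directory names come from the list above.

```python
from pathlib import Path

# Directory names taken from the four layers listed above.
LAYERS = ("materials", "downloads", "audio", "transcripts")


def ensure_course_layout(course_root: str) -> dict:
    """Create the four layer directories under course_root if they are missing.

    Hypothetical helper for illustration; the repo scripts may create
    these directories themselves.
    """
    root = Path(course_root)
    paths = {name: root / name for name in LAYERS}
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

Calling `ensure_course_layout("/path/to/course-root")` returns a mapping from layer name to its `Path`.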

Preview

Exported lecture note sample:

What this repo does

Downloads Canvas course materials when a local logged-in browser session is available.
Downloads Canvas-hosted recordings from SJTU's video platform.
Extracts audio with ffmpeg as an explicit separate step.
Transcribes Chinese lecture audio with mlx-whisper.
Builds a quick slide text index from PDF decks.
Runs rough PPT keyword scans over transcripts.
Supports fuzzy lookup against slides and reference notes for low-confidence fragments.

Repository layout

download_canvas_videos.py: download Canvas recordings with the smallest available stream.
download_canvas_materials.py: download Canvas course files, module files, assignment pages, and assignment attachments.
extract_audio.py: extract audio from one or more lecture videos.
process_lecture.py: transcribe one or more existing audio files.
run_course_pipeline.py: unified wrapper for materials, video downloads, audio extraction, and transcription.
build_slide_index.py: extract a quick text preview from PPT PDFs.
scan_ppt_hits.py: rough transcript-to-PPT keyword scan.
fuzzy_lookup.py: fuzzy lookup over slide PDFs and reference notes.
clean_transcript.py: remove obvious noise fragments from transcript text.
export_notes_pdf.py: export Markdown notes into per-note PDFs and one combined PDF.
cleanup_download_jobs.py: list or kill leftover downloader jobs from this repo.
cleanup_course_jobs.py: list or kill leftover course jobs, such as stalled process_lecture.py transcription runs.
examples/: sample note source and preview assets for the README.
skills/lecture-notes-pipeline/: Codex skill for running the workflow with stable note-writing rules.

Requirements

Python packages:

requests
mlx-whisper
reportlab
pypdf
python-pptx
browser-cookie3

System tools:

ffmpeg
pdftotext (Poppler)

Install Python dependencies:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Quick start

Refresh your browser login at oc.sjtu.edu.cn, then create a local course root:

mkdir -p /path/to/course-root

Sync Canvas materials and recording metadata without downloading:

python3 run_course_pipeline.py \
  --course-id 123456 \
  --from-chrome \
  --steps materials,videos \
  --course-root /path/to/course-root

Download a small bounded batch:

python3 run_course_pipeline.py \
  --course-id 123456 \
  --from-chrome \
  --steps materials,videos \
  --start 1 \
  --end 3 \
  --download \
  --max-count 3 \
  --course-root /path/to/course-root

Extract audio from downloaded videos:

python3 run_course_pipeline.py \
  --course-id 123456 \
  --steps audio \
  --start 1 \
  --end 3 \
  --course-root /path/to/course-root

Transcribe extracted audio:

python3 run_course_pipeline.py \
  --course-id 123456 \
  --steps transcribe \
  --start 1 \
  --end 3 \
  --course-root /path/to/course-root
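If automation drives these steps, the wrapper invocations above can be assembled programmatically. The sketch below only uses flags that appear in the commands above; the helper names `pipeline_command` and `run_layers` are hypothetical.

```python
import subprocess


def pipeline_command(course_id, course_root, steps, start=None, end=None,
                     from_chrome=False, download=False, max_count=None):
    """Build an argv list for run_course_pipeline.py from the documented flags."""
    cmd = ["python3", "run_course_pipeline.py",
           "--course-id", str(course_id),
           "--steps", ",".join(steps),
           "--course-root", str(course_root)]
    if from_chrome:
        cmd.append("--from-chrome")
    if start is not None:
        cmd += ["--start", str(start)]
    if end is not None:
        cmd += ["--end", str(end)]
    if download:
        cmd.append("--download")
    if max_count is not None:
        cmd += ["--max-count", str(max_count)]
    return cmd


def run_layers(course_id, course_root, start, end):
    """Run audio extraction, then transcription, stopping at the first failure."""
    for step in ("audio", "transcribe"):
        subprocess.run(
            pipeline_command(course_id, course_root, [step], start, end),
            check=True,  # raise CalledProcessError instead of continuing
        )
```

Running one step per process keeps each batch bounded, so a failure in audio extraction never starts a transcription run.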

You can also run each layer directly:

python3 download_canvas_materials.py --course-id 123456 --from-chrome --download --output-dir /path/to/course-root/materials
python3 download_canvas_videos.py --source sjtu-lti --course-id 123456 --canvas-cookie-file /path/to/cookies.txt 1 2 --download --output-dir /path/to/course-root/downloads
python3 extract_audio.py /path/to/course-root/downloads/*.mp4 --audio-dir /path/to/course-root/audio
python3 process_lecture.py /path/to/course-root/audio/*.m4a --transcript-dir /path/to/course-root/transcripts
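For readers curious what the audio layer amounts to, here is a minimal sketch of an ffmpeg call that pulls the audio track out of a video. It is an assumption about one reasonable approach, not a description of how extract_audio.py is implemented; the stream-copy choice in particular assumes the source audio is AAC.

```python
import subprocess
from pathlib import Path


def ffmpeg_audio_command(video: Path, audio_dir: Path) -> list:
    """Build an ffmpeg argv that writes <video stem>.m4a into audio_dir."""
    out = audio_dir / (video.stem + ".m4a")
    # -n      : never overwrite an existing output file
    # -vn     : drop the video stream
    # -c:a copy : copy the audio stream without re-encoding (assumes AAC input)
    return ["ffmpeg", "-n", "-i", str(video), "-vn", "-c:a", "copy", str(out)]


def extract_audio(video: Path, audio_dir: Path) -> None:
    """Run ffmpeg for one video; raises CalledProcessError on failure."""
    audio_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(ffmpeg_audio_command(video, audio_dir), check=True)
```

If the source audio is not AAC, replacing `copy` with an explicit encoder such as `aac` would be needed.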

Build a slide index after materials are in place:

python3 build_slide_index.py --ppt-dir /path/to/course-root/materials --output /path/to/course-root/slides_index.md
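A slide index of this kind can be approximated with Poppler's pdftotext. The sketch below is an assumption about the general idea, not the actual logic of build_slide_index.py; `pdf_preview` and `format_index` are hypothetical names, and the output format is made up for illustration.

```python
import subprocess
from pathlib import Path


def pdf_preview(pdf: Path, max_pages: int = 3) -> str:
    """Return text from the first pages of a PDF via pdftotext.

    "-l N" stops after page N; the trailing "-" sends text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-l", str(max_pages), str(pdf), "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def format_index(previews: dict, max_chars: int = 200) -> str:
    """Render {deck name: preview text} as a small Markdown index."""
    lines = ["# Slide index", ""]
    for name in sorted(previews):
        # Collapse whitespace so each deck gets a one-line snippet.
        snippet = " ".join(previews[name].split())[:max_chars]
        lines += [f"## {name}", "", snippet, ""]
    return "\n".join(lines)
```

Keeping the text extraction separate from the formatting makes the formatting step easy to check without PDFs on hand.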

Fuzzy lookup for noisy fragments:

python3 fuzzy_lookup.py "无知之幕" --notes-pdf /path/to/reference-notes.pdf --ppt-dir /path/to/ppt
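To show what a fuzzy lookup can mean for a misheard fragment, the sketch below slides a query-sized window over reference text and scores each window with the standard library's difflib. This is one simple technique chosen for illustration; the source does not say which matching method fuzzy_lookup.py uses, and `best_window` is a hypothetical name.

```python
from difflib import SequenceMatcher


def best_window(query: str, text: str):
    """Return (score, window) for the text window most similar to query.

    Windows have the same length as the query; ties keep the earliest window.
    """
    n = len(query)
    best = (0.0, "")
    for i in range(max(1, len(text) - n + 1)):
        window = text[i:i + n]
        score = SequenceMatcher(None, query, window).ratio()
        if score > best[0]:
            best = (score, window)
    return best


# A transcript fragment with one wrong character still finds the slide term.
score, window = best_window("无知只幕", "罗尔斯提出无知之幕作为思想实验")
```

A match found this way is only a candidate; the correction should still be backed by slide or note evidence.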

Export notes to PDF:

python3 export_notes_pdf.py --notes-dir /path/to/notes --output-dir /path/to/exports

This generates:

one PDF per Markdown note
one combined PDF volume by default

Try the included sample:

python3 export_notes_pdf.py --notes-dir examples/sample_notes --output-dir examples/rendered

Canvas download configuration

download_canvas_videos.py reads an authenticated Canvas token from a local Chrome session storage file. By default, the script looks in the Chrome Session Storage/ directory and picks the newest .log file.

You can override the defaults by setting:

CANVAS_BASE_URL
CANVAS_SESSION_STORAGE

Or point the script at a specific session storage directory on the command line:

python3 download_canvas_videos.py 4 5 6 --session-storage "/path/to/Session Storage"
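The "newest .log file" rule can be sketched in a few lines. This is an illustration under the assumption that "newest" means latest modification time; `newest_log` is a hypothetical helper, and extracting the token from the file is not shown.

```python
from pathlib import Path
from typing import Optional


def newest_log(session_storage_dir: str) -> Optional[Path]:
    """Return the most recently modified *.log file, or None if there is none."""
    logs = list(Path(session_storage_dir).glob("*.log"))
    return max(logs, key=lambda p: p.stat().st_mtime, default=None)
```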

For SJTU Canvas, the downloader can also reuse an already authenticated oc.sjtu.edu.cn cookie and follow the LTI3 handoff to v.sjtu.edu.cn:

python3 download_canvas_videos.py \
  --source sjtu-lti \
  --course-id 123456 \
  --canvas-cookie-file /path/to/cookies.txt \
  --sync-details \
  --output-dir /path/to/course-root/downloads

Then download a bounded batch:

python3 download_canvas_videos.py \
  --source sjtu-lti \
  --course-id 123456 \
  --canvas-cookie-file /path/to/cookies.txt \
  1 2 \
  --download \
  --resume \
  --max-count 2 \
  --output-dir /path/to/course-root/downloads

--canvas-cookie-file accepts either a Netscape-format cookie export or plain cookie header text of the form name=value; name2=value2. You can also pass the header directly with --canvas-cookie. This mode does not store account passwords or perform jAccount login; when the cookie expires, refresh it from your own browser session.
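The two accepted cookie formats differ enough that a parser has to tell them apart. The sketch below is one way to do that, assuming the standard seven tab-separated fields of a Netscape export; `parse_cookie_text` is a hypothetical name and may not match how the downloader parses cookies.

```python
def parse_cookie_text(text: str) -> dict:
    """Parse a Netscape cookie export or 'name=value; name2=value2' header text."""
    cookies = {}
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if any(ln.count("\t") >= 6 for ln in lines):
        # Netscape fields: domain, subdomain flag, path, secure, expiry, name, value
        for ln in lines:
            # Skip comments, but keep "#HttpOnly_" rows, which are real cookies.
            if ln.startswith("#") and not ln.startswith("#HttpOnly_"):
                continue
            fields = ln.split("\t")
            if len(fields) >= 7:
                cookies[fields[5]] = fields[6]
        return cookies
    # Otherwise treat the text as a Cookie header value.
    for part in text.replace("\n", ";").split(";"):
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies
```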

SJTU note: findVodVideoList expects a JSON request body, and the canvasCourseId value should be encodeURIComponent(courId). If you pass the raw courId, the platform may return an empty list or a decrypt failure even when the current browser page can play the recording.
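In Python, the closest equivalent of JavaScript's encodeURIComponent is urllib.parse.quote with a matching safe set. The sketch below builds a JSON body with only the canvasCourseId field; any other fields the endpoint may require are not known from this README and are left out. The helper names are hypothetical.

```python
import json
from urllib.parse import quote


def encode_uri_component(value: str) -> str:
    """Mimic JavaScript's encodeURIComponent.

    encodeURIComponent leaves A-Z a-z 0-9 - _ . ! ~ * ' ( ) unescaped.
    quote() already keeps letters, digits and "_.-~", so only "!*'()" is added.
    """
    return quote(value, safe="!*'()")


def vod_list_body(cour_id: str) -> str:
    """Return a JSON request body carrying the encoded course id (other fields omitted)."""
    return json.dumps({"canvasCourseId": encode_uri_component(cour_id)})
```

Characters such as `+`, `/`, and `=` are common in encrypted identifiers, which is why passing the raw value can fail to decrypt.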

This script is intentionally local-first. It is designed for workflows where the user is already logged into Canvas in Chrome on the same machine.

For resumable course runs, use the downloader as a small stateful job rather than a long detached process:

python3 download_canvas_videos.py 4 5 6 --sync-details --output-dir /path/to/course-root/downloads
python3 download_canvas_videos.py 4 5 6 --download --resume --max-count 2 --output-dir /path/to/course-root/downloads
python3 download_canvas_videos.py --verify-only --output-dir /path/to/course-root/downloads
python3 download_canvas_videos.py --status --output-dir /path/to/course-root/downloads

The downloader writes:

canvas_download_manifest.json: selected recordings, streams, output paths, and source URLs
download_status.json: per-lecture pending / downloading / verified / failed state
download_runs/*.jsonl: run logs for download and verification events
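A quick way to inspect progress is to count entries per state in download_status.json. The file's exact schema is not documented here, so the sketch below assumes a top-level object mapping a lecture key to an entry with a "state" field; adjust it to the real layout.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_status(status_path: str) -> Counter:
    """Count entries per state in a status file.

    Assumed shape (not confirmed by the README):
    {"<lecture key>": {"state": "verified"}, ...}
    """
    data = json.loads(Path(status_path).read_text(encoding="utf-8"))
    return Counter(entry.get("state", "unknown") for entry in data.values())
```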

If Canvas exposes multiple recording views and you know the desired cdviViewNum, pass --view-num. Otherwise the downloader falls back to its default behavior and chooses the smallest downloadable stream.
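The selection rule described above can be sketched as a small function. Only cdviViewNum comes from this README; the "url" and "size" field names are hypothetical stand-ins for whatever the manifest actually records.

```python
def pick_stream(streams, view_num=None):
    """Pick a stream: filter by view number if given, then take the smallest.

    Each stream is a dict; "url" and "size" are assumed field names.
    Returns None when nothing downloadable matches.
    """
    candidates = [s for s in streams if s.get("url")]  # downloadable only
    if view_num is not None:
        candidates = [s for s in candidates if s.get("cdviViewNum") == view_num]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s["size"])
```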

Canvas material download

download_canvas_materials.py downloads course materials from Canvas itself, separate from lecture recordings. It can discover:

course files
files linked from modules
assignment pages
files linked from assignment descriptions

For SJTU Canvas, first refresh your browser login at oc.sjtu.edu.cn, then run:

python3 download_canvas_materials.py \
  --course-id 123456 \
  --from-chrome \
  --sync-details \
  --output-dir /path/to/course-root/materials

Or pass a cookie file explicitly:

python3 download_canvas_materials.py \
  --course-id 123456 \
  --canvas-cookie-file /path/to/cookies.txt \
  --download \
  --resume \
  --max-count 10 \
  --output-dir /path/to/course-root/materials

The material downloader writes:

canvas_materials_manifest.json: discovered material entries and target paths
canvas_materials_status.json: per-entry pending / downloading / verified / failed / skipped state
material_runs/*.jsonl: run logs

Use --include files, --include modules, or --include assignments to limit discovery. Keep large material pulls bounded with --max-count, especially when assignment attachments are numerous.
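The idea of a bounded, resumable pull can be illustrated as "take the first few unfinished entries". The sketch assumes the same status shape as the state list above (entry key mapped to a dict with a "state" field) and assumes that verified and skipped entries are not retried; neither detail is confirmed by this README, and `next_batch` is a hypothetical name.

```python
def next_batch(status: dict, max_count: int) -> list:
    """Return up to max_count entry keys that still need downloading."""
    retry_states = {"pending", "downloading", "failed"}
    todo = [key for key, entry in status.items()
            if entry.get("state") in retry_states]
    return todo[:max_count]
```

An entry left in the downloading state is treated as interrupted and picked up again.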

Download process hygiene

Do not leave large download jobs hanging in the background.

Prefer small batches such as 1 2 or 1 2 3, not the whole semester in one detached process.
Prefer --max-count when automation is driving the work.
After each batch, verify the expected files landed completely before starting transcription.
Use --verify-only and --status before deciding whether more download work is needed.
If a download job finishes or stalls, clean up the matching python3 download_canvas_videos.py and child curl processes promptly.
If you want an explicit cleanup pass, use:

python3 cleanup_download_jobs.py --list
python3 cleanup_download_jobs.py --kill

The cleanup helper only targets downloader jobs from this repo. It does not kill unrelated Python or curl processes.
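To check for leftover jobs by hand, a process listing filtered by script name is usually enough. The sketch below only lists matches and never kills anything; it is not the implementation of cleanup_download_jobs.py, and it does not try to find child curl processes. The names `match_jobs` and `list_jobs` are hypothetical.

```python
import subprocess

# Script names from this repo whose processes count as downloader jobs.
TARGETS = ("download_canvas_videos.py", "download_canvas_materials.py")


def match_jobs(ps_output: str, targets=TARGETS) -> list:
    """Return (pid, command line) pairs whose command mentions a target script."""
    jobs = []
    for line in ps_output.splitlines():
        pid, _, args = line.strip().partition(" ")
        if pid.isdigit() and any(t in args for t in targets):
            jobs.append((int(pid), args.strip()))
    return jobs


def list_jobs() -> list:
    """List matching processes using ps (PID and full command, no header)."""
    out = subprocess.run(
        ["ps", "-A", "-o", "pid=", "-o", "args="],
        capture_output=True, text=True, check=True,
    ).stdout
    return match_jobs(out)
```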

Download source stability

For unattended project work, do not make the pipeline depend on an open browser tab.

Prefer a stable local downloads/ directory first.
A symlinked downloads/ directory is acceptable if the real files live elsewhere.
If the next lecture video is missing locally, record that as a project gap instead of assuming Canvas is still open in the current thread.
Treat browser session state as opportunistic input, not as the primary long-term source of truth.
For SJTU Canvas, prefer --source sjtu-lti with a fresh authenticated cookie when Chrome Session Storage does not contain the video-platform token.
For Canvas course materials, use download_canvas_materials.py; keep materials under materials/ and recordings under downloads/.
Platform captions are not part of the default workflow. Use local Whisper transcription through process_lecture.py for formal note inputs.

Transcription process hygiene

The same cleanup rule applies to transcription jobs.

Run one lecture or one small batch at a time.
After each transcription finishes, verify the expected .txt and .json files landed.
Explicitly check for residual process_lecture.py processes instead of assuming they exited cleanly.
If a transcription process is stalled or no longer needed, terminate it before starting new heavy work.

Use:

python3 cleanup_course_jobs.py --list
python3 cleanup_course_jobs.py --kill
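The "verify the expected .txt and .json files landed" step can be automated with a short check. The sketch assumes each transcript is named after its audio file's stem and that audio files end in .m4a, as in the commands earlier in this README; `missing_transcripts` is a hypothetical helper.

```python
from pathlib import Path


def missing_transcripts(audio_dir: str, transcript_dir: str) -> list:
    """List transcript files that are absent or empty for each audio file."""
    missing = []
    for audio in sorted(Path(audio_dir).glob("*.m4a")):
        for suffix in (".txt", ".json"):
            target = Path(transcript_dir) / (audio.stem + suffix)
            # An empty file usually means the run was interrupted.
            if not target.is_file() or target.stat().st_size == 0:
                missing.append(target.name)
    return missing
```

An empty result means every audio file has both outputs, so the next batch can start.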

Output conventions

Recommended output structure:

course-root/
  materials/
  ppt/
  notes/
  transcripts/
  downloads/
  audio/
  slides_index.md
  ppt_processing_queue.md
  uncertain_fragments.md

Note-writing rules

The bundled Codex skill encodes the preferred note style:

align by PPT/content boundary first, not by session number
write compact study notes, not classroom narration
remove teacher/process voice
expand any case that is actually discussed in class
only correct low-confidence transcript text when slide or note evidence supports it

If you want Codex to follow those rules consistently, install or reuse the included skill:

skills/lecture-notes-pipeline/

Publish checklist

Before making the repo public:

remove any local-only outputs or preview artifacts you do not want to ship
confirm the Canvas downloader behavior matches what you want to expose publicly
choose and add a LICENSE
add a few example inputs or screenshots if you want the README to be self-explanatory
