Video catalog/survey app (code only; video data not included)


krisk248 1a3965dda9 Add dockerized web UI for browsing memes		2026-05-26 14:59:22 +04:00
webapp	Add dockerized web UI for browsing memes	2026-05-26 14:59:22 +04:00
.gitignore	first commit	2026-05-24 21:33:05 +04:00
docker-compose.yml	Add dockerized web UI for browsing memes	2026-05-26 14:59:22 +04:00
Dockerfile	Add dockerized web UI for browsing memes	2026-05-26 14:59:22 +04:00
download_videos.py	first commit	2026-05-24 21:33:05 +04:00
fetch_catalog.py	first commit	2026-05-24 21:33:05 +04:00
README.md	first commit	2026-05-24 21:33:05 +04:00
survey.py	first commit	2026-05-24 21:33:05 +04:00
videsaur.py	Add probe command for ffprobe codec survey	2026-05-26 14:58:57 +04:00

README.md

videosaurous

Local catalog, downloader, and tracking system for videsaur.com — a Tamil/Telugu meme + reaction video site. Mirrors the public catalog, downloads every video to disk, computes integrity hashes, and tracks new/removed videos over time via incremental sync.

The name: video + dinosaur — a giant archive that keeps growing.

What it does

Walks the site's paginated API and caches every video record locally.
Downloads every video file from CloudFront (resumable, polite, concurrent).
Stores everything in a single SQLite database with full metadata, per-file SHA256 hashes, integrity status, and a per-video event log.
Detects duplicates (same content uploaded under different IDs).
Verifies file integrity (hash, size, ffprobe).
Runs incremental syncs that download only new videos and mark videos that disappear from the site.
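
The hashing and duplicate-detection steps above can be sketched roughly as follows. This is a minimal illustration, not the code in videsaur.py: the function names sha256_file and find_duplicates are hypothetical, and it assumes "duplicate" means byte-identical content, as the dedupe command's description suggests.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large videos never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths: list[Path]) -> dict[str, list[Path]]:
    """Group files by content hash and keep only hashes shared by two or more files."""
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for path in paths:
        by_hash[sha256_file(path)].append(path)
    return {h: sorted(ps) for h, ps in by_hash.items() if len(ps) > 1}
```

Hashing content rather than comparing file names or IDs is what lets the same video uploaded under two different IDs be caught.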

Quick start

Requires Python 3.11+ and uv. Every script is a single file with PEP 723 inline dependency metadata, so uv run resolves the dependencies and creates the environment automatically.

# 1. Fetch the full catalog metadata (~3 min, hits site 239 times, cached)
uv run fetch_catalog.py

# 2. Optional: verify every video URL is reachable (~3 min, HEAD requests only)
uv run survey.py

# 3. Download every video file (~30 min for ~8.5 GB at default concurrency=4)
uv run download_videos.py

# 4. Bootstrap the local database (hashes every file, ~30s)
uv run videsaur.py init

Recurring use

# Sync against the live API — downloads only NEW videos, marks removed ones
uv run videsaur.py sync --concurrency 4

# Re-verify every file (hash + size + ffprobe). Slow; run about once a month.
uv run videsaur.py verify --full

# Find byte-identical duplicates
uv run videsaur.py dedupe

# Overview
uv run videsaur.py stats

# Filter / search / inspect
uv run videsaur.py list --missing
uv run videsaur.py list --corrupt
uv run videsaur.py list --category Tamil
uv run videsaur.py search vadivelu
uv run videsaur.py show 8534

Daily automation (cron)

0 3 * * * cd /path/to/videosaurous && "$HOME/.local/bin/uv" run videsaur.py sync >> sync.log 2>&1

Architecture

File	Role
`fetch_catalog.py`	Walks `/api/v1/video?page=N` with per-page disk cache. Polite delays. Writes `catalog.json`.
`survey.py`	Async HEAD against every video URL. Reports exact total bytes + dead URLs + rate-limit signal.
`download_videos.py`	Async downloader. Per-file `.part` + HTTP Range resume. Atomic rename. Configurable concurrency.
`videsaur.py`	Main CLI: `init`, `sync`, `verify`, `dedupe`, `stats`, `list`, `search`, `show`, `probe`. SQLite-backed.
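
The `.part` + HTTP Range + atomic rename pattern listed for `download_videos.py` could look something like the sketch below. It is an assumption-laden illustration rather than the script's actual code: download_resumable and the Fetch callable are hypothetical names, and the transport is injected so that the resume logic can be read (and tested) without any network library.

```python
import os
from pathlib import Path
from typing import Callable, Iterable

# Hypothetical transport: takes (url, headers) and returns (status_code, body_chunks).
Fetch = Callable[[str, dict[str, str]], tuple[int, Iterable[bytes]]]


def download_resumable(url: str, dest: Path, fetch: Fetch) -> Path:
    """Download url to dest through a .part file, resuming via an HTTP Range request."""
    if dest.exists():
        return dest  # finished on an earlier run; nothing to do

    part = dest.with_name(dest.name + ".part")
    offset = part.stat().st_size if part.exists() else 0
    headers = {"Range": f"bytes={offset}-"} if offset else {}

    status, chunks = fetch(url, headers)
    if status == 206 and offset:
        mode = "ab"  # server honoured the range: append to the partial file
    elif status == 200:
        mode = "wb"  # full body (fresh start, or range ignored): overwrite
    else:
        raise RuntimeError(f"unexpected HTTP status {status} for {url}")

    with part.open(mode) as fh:
        for chunk in chunks:
            fh.write(chunk)

    os.replace(part, dest)  # atomic rename when both paths share a filesystem
    return dest
```

Because the final name only appears after the rename, an interrupted run leaves either a `.part` file or nothing, never a truncated file under the real name.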

Database schema

Three tables in videsaur.db:

videos — every video the API has ever returned. Full metadata mirror + local file state (local_path, sha256, file_size, integrity_status) + sync state (first_seen_at, last_seen_at, removed_at).
sync_runs — one row per sync invocation, with counts of new/updated/removed/downloaded.
events — per-video timeline: added, updated, removed, re-added, downloaded, download_failed, corrupt, missing — each with a JSON details payload.

removed_at IS NULL means the video is still live in the catalog. Removed videos retain their DB row (with a removed_at timestamp); files on disk are kept until manually deleted.
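
As a rough picture of the three tables, here is a sketch using Python's sqlite3 module. Only the column names mentioned in this README (local_path, sha256, file_size, integrity_status, first_seen_at, last_seen_at, removed_at, raw_json) come from the project; every other column name, the column types, and the helper names open_db and live_video_ids are assumptions made for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS videos (
    id               INTEGER PRIMARY KEY,  -- upstream video ID (assumed integer)
    title            TEXT,                 -- stand-in for the mirrored metadata
    local_path       TEXT,
    sha256           TEXT,
    file_size        INTEGER,
    integrity_status TEXT,
    first_seen_at    TEXT NOT NULL,
    last_seen_at     TEXT NOT NULL,
    removed_at       TEXT,                 -- NULL while the video is still live
    raw_json         TEXT                  -- original API record
);
CREATE TABLE IF NOT EXISTS sync_runs (
    id               INTEGER PRIMARY KEY AUTOINCREMENT,
    started_at       TEXT NOT NULL,
    new_count        INTEGER DEFAULT 0,
    updated_count    INTEGER DEFAULT 0,
    removed_count    INTEGER DEFAULT 0,
    downloaded_count INTEGER DEFAULT 0
);
CREATE TABLE IF NOT EXISTS events (
    id       INTEGER PRIMARY KEY AUTOINCREMENT,
    video_id INTEGER NOT NULL REFERENCES videos(id),
    kind     TEXT NOT NULL,                -- e.g. 'added', 'removed', 'corrupt'
    at       TEXT NOT NULL,
    details  TEXT                          -- JSON payload
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and make sure all three tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def live_video_ids(conn: sqlite3.Connection) -> list[int]:
    """Return IDs of videos still present in the catalog (removed_at IS NULL)."""
    rows = conn.execute("SELECT id FROM videos WHERE removed_at IS NULL ORDER BY id")
    return [row[0] for row in rows]
```

Keeping removed videos as rows with a timestamp, instead of deleting them, is what makes the re-added event possible later.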

Design notes

Origin vs CDN separation: the catalog API (videsaur.com) is hit politely with delays between page requests. Video files come from AWS CloudFront, which tolerates much higher request rates (tested at concurrency=8 with zero throttling).
Resumable everything: every phase can be Ctrl-C'd and rerun without losing work. fetch_catalog caches each page; download_videos resumes partial files via HTTP Range; init skips already-hashed files.
Idempotent sync: running sync against an unchanged catalog produces zero downloads and zero events.
Forensic columns: each videos row carries the full original API JSON in raw_json. If the upstream API schema ever changes, no data is lost.
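
The idempotent-sync property can be illustrated with a small set-difference sketch. This is not the project's sync code: SyncPlan and plan_sync are hypothetical names, the input shapes are assumed, and metadata changes (the updated event) are deliberately left out to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class SyncPlan:
    added: list[int] = field(default_factory=list)
    re_added: list[int] = field(default_factory=list)
    removed: list[int] = field(default_factory=list)

    def is_noop(self) -> bool:
        """True when the sync would record no events and trigger no downloads."""
        return not (self.added or self.re_added or self.removed)


def plan_sync(known: dict[int, bool], remote_ids: set[int]) -> SyncPlan:
    """Compare local state with the live catalog.

    known maps video ID -> True if the row is live (removed_at IS NULL),
    False if it was previously marked removed.
    """
    plan = SyncPlan()
    for vid in sorted(remote_ids):
        if vid not in known:
            plan.added.append(vid)      # never seen before: download it
        elif not known[vid]:
            plan.re_added.append(vid)   # was removed, now back in the catalog
    for vid in sorted(known):
        if known[vid] and vid not in remote_ids:
            plan.removed.append(vid)    # live locally, gone upstream
    return plan
```

For example, plan_sync({1: True, 2: True, 3: False}, {1, 3, 4}) yields added=[4], re_added=[3], removed=[2], while running it against an unchanged catalog returns a plan for which is_noop() is true.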

License

Private / personal archive. Respect videsaur.com's terms of service.