browser-use/video-use

Edit videos with coding agents

From the README

video-use

Introducing video-use — edit videos with Claude Code. 100% open source.

Drop raw footage in a folder, chat with Claude Code, get final.mp4 back. Works for any content — talking heads, montages, tutorials, travel, interviews — without presets or menus.

What it does

  • Cuts out filler words (umm, uh, false starts) and dead space between takes
  • Auto color grades every segment (warm cinematic, neutral punch, or any custom ffmpeg chain)
  • 30ms audio fades at every cut so you never hear a pop (sketched after this list)
  • Burns subtitles in your style — 2-word UPPERCASE chunks by default, fully customizable
  • Generates animation overlays via Manim, Remotion, or PIL — spawned in parallel sub-agents, one per animation
  • Self-evaluates the rendered output at every cut boundary before showing you anything
  • Persists session memory in project.md so next week's session picks up where you left off
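
To make the fade bullet concrete, here's what a 30ms boundary fade can look like as an ffmpeg audio filter chain. This is a minimal sketch; the helper name and exact chain are illustrative assumptions, not video-use's actual implementation:

# Illustrative sketch: build an ffmpeg afade pair for one cut segment.
# fade_cut_filters() is a hypothetical helper, not part of video-use's API.
FADE = 0.03  # 30 ms

def fade_cut_filters(seg_start: float, seg_end: float) -> str:
    """Return an ffmpeg -af chain that trims a segment and fades it in/out."""
    return (
        f"atrim=start={seg_start}:end={seg_end},asetpts=PTS-STARTPTS,"
        f"afade=t=in:st=0:d={FADE},"                                  # ramp up at the cut-in
        f"afade=t=out:st={seg_end - seg_start - FADE:.3f}:d={FADE}"   # ramp down before the cut-out
    )

print(fade_cut_filters(2.52, 5.36))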

Get started

# 1. Clone and symlink into Claude Code's skills directory
git clone https://github.com/browser-use/video-use
cd video-use
ln -s "$(pwd)" ~/.claude/skills/video-use

# 2. Install deps
pip install -e .
brew install ffmpeg            # required
brew install yt-dlp            # optional, for downloading online sources

# 3. Add your ElevenLabs API key
cp .env.example .env
$EDITOR .env                   # ELEVENLABS_API_KEY=...
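
The key just needs to end up in the process environment. A minimal sketch of loading it, assuming python-dotenv; the skill's actual config loading may differ:

# Sketch: load ELEVENLABS_API_KEY from .env. Assumes python-dotenv is
# installed; video-use's real config loading may work differently.
import os
from dotenv import load_dotenv

load_dotenv()                                # reads ./.env into the environment
api_key = os.environ["ELEVENLABS_API_KEY"]   # fails loudly if the key is missing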

Then point Claude Code at a folder of raw takes:

cd /path/to/your/videos
claude

And in the session:

edit these into a launch video

It inventories the sources, proposes a strategy, waits for your OK, then produces edit/final.mp4 next to your sources. All outputs live in edit/ — the skill directory stays clean.

How it works

The LLM never watches the video. It reads it — through two layers that together give it everything it needs to cut with word-boundary precision.

Layer 1 — Audio transcript (always loaded). One ElevenLabs Scribe call per source gives word-level timestamps, speaker diarization, and audio events ((laughter), (applause), (sigh)). All takes pack into a single ~12KB takes_packed.md — the LLM's primary reading view.

## C0103  (duration: 43.0s, 8 phrases)
  [002.52-005.36] S0 Ninety percent of what a web agent does is completely wasted.
  [006.08-006.74] S0 We fixed this.
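
Each phrase line in the packed view is trivially machine-parseable. A minimal sketch, with the format inferred from the sample above rather than a documented spec:

# Sketch: parse one takes_packed.md phrase line into structured fields.
# The line format is inferred from the sample above, not a documented spec.
import re

LINE = re.compile(r"\[(\d+\.\d+)-(\d+\.\d+)\]\s+(S\d+)\s+(.*)")

def parse_phrase(line: str) -> tuple[float, float, str, str]:
    m = LINE.match(line.strip())
    if not m:
        raise ValueError(f"unrecognized phrase line: {line!r}")
    start, end, speaker, text = m.groups()
    return float(start), float(end), speaker, text

print(parse_phrase("[002.52-005.36] S0 Ninety percent of what a web agent does is completely wasted."))
# -> (2.52, 5.36, 'S0', 'Ninety percent ...')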

Layer 2 — Visual composite (on demand). timeline_view produces a filmstrip + waveform + word labels PNG for any time range. Called only at decision points — ambiguous pauses, retake comparisons, cut-point sanity checks.
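
A composite like that can be approximated with two ffmpeg passes plus a PIL paste. A rough sketch of the idea, not the tool's actual code (the real timeline_view also burns word labels):

# Rough sketch of a timeline_view-style composite: filmstrip over waveform.
# Two ffmpeg passes plus a PIL vstack; illustrative only, assumes dur >= 1s.
import subprocess
from PIL import Image

def timeline_png(src: str, start: float, end: float, out: str = "timeline.png"):
    dur = end - start
    # 1) Filmstrip: one frame per second, tiled into a single row.
    subprocess.run([
        "ffmpeg", "-y", "-ss", str(start), "-t", str(dur), "-i", src,
        "-vf", f"fps=1,scale=160:-1,tile={int(dur)}x1",
        "-frames:v", "1", "strip.png",
    ], check=True)
    # 2) Waveform image for the same time window.
    subprocess.run([
        "ffmpeg", "-y", "-ss", str(start), "-t", str(dur), "-i", src,
        "-filter_complex", "showwavespic=s=1600x120",
        "-frames:v", "1", "wave.png",
    ], check=True)
    strip, wave = Image.open("strip.png"), Image.open("wave.png")
    canvas = Image.new("RGB", (max(strip.width, wave.width), strip.height + wave.height))
    canvas.paste(strip, (0, 0))
    canvas.paste(wave, (0, strip.height))
    canvas.save(out)

timeline_png("C0103.mp4", 0.0, 10.0)  # filename assumed from the take ID above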

Naive approach: 30,000 frames × 1,500 tokens = 45M tokens of noise. Video Use: 12KB text + a handful of PNGs.

Same idea as browser-use giving an LLM a structured DOM instead of a screenshot — but for video.

Pipeline

Transcribe ──> Pack ──> LLM Reasons ──> EDL ──> Render ──> Self-Eval
                                                              │
                                                              └─ issue? fix + re-render (max 3 passes)
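
The EDL stage boils down to a list of (source, in, out) spans. A minimal sketch of rendering such a list with ffmpeg's trim/concat filters; the EDL shape and render call are assumptions for illustration, not video-use's real schema:

# Minimal sketch: render an EDL (list of source/in/out spans) via ffmpeg concat.
# The EDL shape is an illustrative assumption, not video-use's actual schema.
import subprocess

edl = [
    {"src": "C0103.mp4", "in": 2.52, "out": 5.36},
    {"src": "C0103.mp4", "in": 6.08, "out": 6.74},
]

def render(edl, out="edit/final.mp4"):
    inputs, filters = [], []
    for i, cut in enumerate(edl):
        inputs += ["-i", cut["src"]]
        filters.append(
            f"[{i}:v]trim=start={cut['in']}:end={cut['out']},setpts=PTS-STARTPTS[v{i}];"
            f"[{i}:a]atrim=start={cut['in']}:end={cut['out']},asetpts=PTS-STARTPTS[a{i}];"
        )
    pairs = "".join(f"[v{i}][a{i}]" for i in range(len(edl)))
    graph = "".join(filters) + f"{pairs}concat=n={len(edl)}:v=1:a=1[v][a]"
    subprocess.run(["ffmpeg", "-y", *inputs, "-filter_complex", graph,
                    "-map", "[v]", "-map", "[a]", out], check=True)

render(edl)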