browser-use/video-use

Edit videos with coding agents

From the README

video-use

Introducing video-use — edit videos with Claude Code. 100% open source.

Drop raw footage in a folder, chat with Claude Code, get final.mp4 back. Works for any content — talking heads, montages, tutorials, travel, interviews — without presets or menus.

What it does

  • Cuts out filler words (umm, uh, false starts) and dead space between takes
  • Auto color grades every segment (warm cinematic, neutral punch, or any custom ffmpeg chain)
  • 30ms audio fades at every cut so you never hear a pop (sketched after this list)
  • Burns subtitles in your style — 2-word UPPERCASE chunks by default, fully customizable
  • Generates animation overlays via Manim, Remotion, or PIL — spawned in parallel sub-agents, one per animation
  • Self-evaluates the rendered output at every cut boundary before showing you anything
  • Persists session memory in project.md so next week's session picks up where you left off
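
To make the fade bullet concrete, here's what a 30ms boundary fade can look like as an ffmpeg audio filter chain. This is a minimal sketch; the helper name and exact chain are illustrative assumptions, not video-use's actual implementation:

# Illustrative sketch: build an ffmpeg afade pair for one cut segment.
# fade_cut_filters() is a hypothetical helper, not part of video-use's API.
FADE = 0.03  # 30 ms

def fade_cut_filters(seg_start: float, seg_end: float) -> str:
    """Return an ffmpeg -af chain that trims a segment and fades it in/out."""
    return (
        f"atrim=start={seg_start}:end={seg_end},asetpts=PTS-STARTPTS,"
        f"afade=t=in:st=0:d={FADE},"                                  # ramp up at the cut-in
        f"afade=t=out:st={seg_end - seg_start - FADE:.3f}:d={FADE}"   # ramp down before the cut-out
    )

print(fade_cut_filters(2.52, 5.36))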

Get started

# 1. Clone and symlink into Claude Code's skills directory
git clone https://github.com/browser-use/video-use
cd video-use
ln -s "$(pwd)" ~/.claude/skills/video-use

# 2. Install deps
pip install -e .
brew install ffmpeg            # required
brew install yt-dlp            # optional, for downloading online sources

# 3. Add your ElevenLabs API key
cp .env.example .env
$EDITOR .env                   # ELEVENLABS_API_KEY=...
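
The key just needs to end up in the process environment. A minimal sketch of loading it, assuming python-dotenv; the skill's actual config loading may differ:

# Sketch: load ELEVENLABS_API_KEY from .env. Assumes python-dotenv is
# installed; video-use's real config loading may work differently.
import os
from dotenv import load_dotenv

load_dotenv()                                # reads ./.env into the environment
api_key = os.environ["ELEVENLABS_API_KEY"]   # fails loudly if the key is missing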

Then point Claude Code at a folder of raw takes:

cd /path/to/your/videos
claude

And in the session:

edit these into a launch video

It inventories the sources, proposes a strategy, waits for your OK, then produces edit/final.mp4 next to your sources. All outputs live in edit/ — the skill directory stays clean.

How it works

The LLM never watches the video. It reads it — through two layers that together give it everything it needs to cut with word-boundary precision.

Layer 1 — Audio transcript (always loaded). One ElevenLabs Scribe call per source gives word-level timestamps, speaker diarization, and audio events ((laughter), (applause), (sigh)). All takes pack into a single ~12KB takes_packed.md — the LLM's primary reading view.

## C0103  (duration: 43.0s, 8 phrases)
  [002.52-005.36] S0 Ninety percent of what a web agent does is completely wasted.
  [006.08-006.74] S0 We fixed this.
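
Each phrase line in the packed view is trivially machine-parseable. A minimal sketch, with the format inferred from the sample above rather than a documented spec:

# Sketch: parse one takes_packed.md phrase line into structured fields.
# The line format is inferred from the sample above, not a documented spec.
import re

LINE = re.compile(r"\[(\d+\.\d+)-(\d+\.\d+)\]\s+(S\d+)\s+(.*)")

def parse_phrase(line: str) -> tuple[float, float, str, str]:
    m = LINE.match(line.strip())
    if not m:
        raise ValueError(f"unrecognized phrase line: {line!r}")
    start, end, speaker, text = m.groups()
    return float(start), float(end), speaker, text

print(parse_phrase("[002.52-005.36] S0 Ninety percent of what a web agent does is completely wasted."))
# -> (2.52, 5.36, 'S0', 'Ninety percent ...')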

Layer 2 — Visual composite (on demand). timeline_view produces a filmstrip + waveform + word labels PNG for any time range. Called only at decision points — ambiguous pauses, retake comparisons, cut-point sanity checks.
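
A composite like that can be approximated with two ffmpeg passes plus a PIL paste. A rough sketch of the idea, not the tool's actual code (the real timeline_view also burns word labels):

# Rough sketch of a timeline_view-style composite: filmstrip over waveform.
# Two ffmpeg passes plus a PIL vstack; illustrative only, assumes dur >= 1s.
import subprocess
from PIL import Image

def timeline_png(src: str, start: float, end: float, out: str = "timeline.png"):
    dur = end - start
    # 1) Filmstrip: one frame per second, tiled into a single row.
    subprocess.run([
        "ffmpeg", "-y", "-ss", str(start), "-t", str(dur), "-i", src,
        "-vf", f"fps=1,scale=160:-1,tile={int(dur)}x1",
        "-frames:v", "1", "strip.png",
    ], check=True)
    # 2) Waveform image for the same time window.
    subprocess.run([
        "ffmpeg", "-y", "-ss", str(start), "-t", str(dur), "-i", src,
        "-filter_complex", "showwavespic=s=1600x120",
        "-frames:v", "1", "wave.png",
    ], check=True)
    strip, wave = Image.open("strip.png"), Image.open("wave.png")
    canvas = Image.new("RGB", (max(strip.width, wave.width), strip.height + wave.height))
    canvas.paste(strip, (0, 0))
    canvas.paste(wave, (0, strip.height))
    canvas.save(out)

timeline_png("C0103.mp4", 0.0, 10.0)  # filename assumed from the take ID above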

Naive approach: 30,000 frames × 1,500 tokens = 45M tokens of noise. Video Use: 12KB text + a handful of PNGs.

Same idea as browser-use giving an LLM a structured DOM instead of a screenshot — but for video.

Pipeline

Transcribe ──> Pack ──> LLM Reasons ──> EDL ──> Render ──> Self-Eval
                                                              │
                                                              └─ issue? fix + re-render (max 3 passes)
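
The EDL stage boils down to a list of (source, in, out) spans. A minimal sketch of rendering such a list with ffmpeg's trim/concat filters; the EDL shape and render call are assumptions for illustration, not video-use's real schema:

# Minimal sketch: render an EDL (list of source/in/out spans) via ffmpeg concat.
# The EDL shape is an illustrative assumption, not video-use's actual schema.
import subprocess

edl = [
    {"src": "C0103.mp4", "in": 2.52, "out": 5.36},
    {"src": "C0103.mp4", "in": 6.08, "out": 6.74},
]

def render(edl, out="edit/final.mp4"):
    inputs, filters = [], []
    for i, cut in enumerate(edl):
        inputs += ["-i", cut["src"]]
        filters.append(
            f"[{i}:v]trim=start={cut['in']}:end={cut['out']},setpts=PTS-STARTPTS[v{i}];"
            f"[{i}:a]atrim=start={cut['in']}:end={cut['out']},asetpts=PTS-STARTPTS[a{i}];"
        )
    pairs = "".join(f"[v{i}][a{i}]" for i in range(len(edl)))
    graph = "".join(filters) + f"{pairs}concat=n={len(edl)}:v=1:a=1[v][a]"
    subprocess.run(["ffmpeg", "-y", *inputs, "-filter_complex", graph,
                    "-map", "[v]", "-map", "[a]", out], check=True)

render(edl)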