Video Content Extractor
Overview
Automatically extracts key frames from MP4 video files at configurable time intervals, runs OCR on each frame, and generates a structured Markdown report. The report includes video metadata (duration, resolution, codecs) and frame-by-frame OCR transcripts with timestamp references.
This skill is designed for Codex CLI and requires FFmpeg and Tesseract OCR to be installed on the local machine.
When to Use This Skill
- Use when you need to extract text content from video presentations, lectures, or screencasts.
- Use when you want to create searchable transcripts from video files without embedded subtitles.
- Use when you need to analyze video content programmatically and generate structured summaries.
- Use when the user asks to "read what is on screen" or "extract the content from this video."
How It Works
Step 1: Analyze Video Metadata
The skill uses ffprobe to extract video metadata: duration, resolution, frame rate, codec information, and file size.
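A minimal sketch of how this step could look, assuming ffprobe is asked for JSON output. The helper names (build_ffprobe_cmd, parse_metadata, probe) and the returned field names are illustrative assumptions, not the skill's actual code.

```python
import json
import subprocess


def build_ffprobe_cmd(video_path):
    """Return an ffprobe command that prints container and stream info as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-print_format", "json",
        "-show_format", "-show_streams",
        video_path,
    ]


def parse_metadata(ffprobe_json):
    """Pick the fields a report needs out of ffprobe's JSON output."""
    data = json.loads(ffprobe_json)
    fmt = data.get("format", {})
    # Use the first video stream; audio and subtitle streams are skipped.
    video = next(
        (s for s in data.get("streams", []) if s.get("codec_type") == "video"),
        {},
    )
    # avg_frame_rate is a fraction string such as "30000/1001".
    num, _, den = video.get("avg_frame_rate", "0/1").partition("/")
    den = float(den or 1)
    fps = float(num or 0) / den if den else 0.0
    return {
        "duration_s": float(fmt.get("duration", 0.0)),
        "size_bytes": int(fmt.get("size", 0)),
        "width": video.get("width"),
        "height": video.get("height"),
        "video_codec": video.get("codec_name"),
        "fps": round(fps, 3),
    }


def probe(video_path):
    """Run ffprobe (must be on PATH) and return the parsed metadata."""
    out = subprocess.run(
        build_ffprobe_cmd(video_path),
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_metadata(out)
```

Keeping the parsing separate from the subprocess call makes the metadata logic testable without a video file on disk.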
Step 2: Extract Key Frames
Using FFmpeg, the skill captures frames at the configured interval (default: every 30 seconds). Each frame is saved as a timestamped JPEG image.
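The interval-based capture can be sketched as follows, assuming one FFmpeg call per timestamp. The function names, the frame_HH-MM-SS.jpg naming scheme, and the JPEG quality setting are assumptions made for illustration; the skill's own script may do this differently.

```python
import subprocess
from pathlib import Path


def frame_timestamps(duration_s, interval_s=30):
    """Return capture times in seconds: 0, interval, 2*interval, ... below duration."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    times = []
    t = 0
    while t < duration_s:
        times.append(t)
        t += interval_s
    return times


def frame_name(t):
    """Name a frame after its timestamp (hypothetical naming scheme)."""
    h, rem = divmod(int(t), 3600)
    m, s = divmod(rem, 60)
    return f"frame_{h:02d}-{m:02d}-{s:02d}.jpg"


def build_extract_cmd(video_path, t, out_path):
    """Return an ffmpeg command that seeks to t seconds and writes one JPEG."""
    return [
        "ffmpeg", "-y",
        "-ss", str(t),          # seek on the input before decoding
        "-i", str(video_path),
        "-frames:v", "1",       # write a single video frame
        "-q:v", "2",            # JPEG quality scale; lower means higher quality
        str(out_path),
    ]


def extract_frames(video_path, duration_s, out_dir, interval_s=30):
    """Save one frame per interval and return (seconds, path) pairs."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for t in frame_timestamps(duration_s, interval_s):
        target = out_dir / frame_name(t)
        subprocess.run(
            build_extract_cmd(video_path, t, target),
            check=True, capture_output=True,
        )
        saved.append((t, target))
    return saved
```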
Step 3: OCR Text Recognition
Each extracted frame is processed by Tesseract OCR. If the skill's default page segmentation mode (PSM) returns no meaningful text, it retries the frame with fully automatic page segmentation.
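A hedged sketch of the retry logic, calling the tesseract command-line tool directly. The first-pass PSM value (6, a single uniform block of text) and the "meaningful text" heuristic are assumptions, since the source names neither; PSM 3 is Tesseract's fully automatic page segmentation without orientation and script detection.

```python
import subprocess


def run_tesseract(image_path, lang, psm):
    """Run the tesseract CLI and return the recognized text from stdout."""
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout", "-l", lang, "--psm", str(psm)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def is_meaningful(text, min_chars=3):
    """Hypothetical heuristic: at least min_chars letters or digits."""
    return sum(ch.isalnum() for ch in text) >= min_chars


def ocr_with_fallback(image_path, lang="chi_sim+eng", first_psm=6,
                      runner=run_tesseract):
    """Try first_psm (an assumed value), then fall back to PSM 3."""
    text = runner(image_path, lang, first_psm)
    if is_meaningful(text):
        return text.strip()
    # Fallback: fully automatic page segmentation.
    return runner(image_path, lang, 3).strip()
```

The runner parameter exists only so the fallback path can be exercised without Tesseract installed.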
Step 4: Generate Markdown Report
All extracted data is assembled into a structured Markdown document.
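One way the assembly could work, as a sketch only: the report layout (a metadata list followed by one section per timestamp) and the function names are assumptions, since the source does not show the generated file.

```python
def format_timestamp(seconds):
    """Format a second count as HH:MM:SS."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"


def render_report(title, metadata, frames):
    """Build the report text.

    metadata: dict of label -> value
    frames: list of (seconds, image_path, ocr_text) tuples
    """
    lines = [f"# {title}", "", "## Metadata", ""]
    for key, value in metadata.items():
        lines.append(f"- {key}: {value}")
    lines += ["", "## Frames", ""]
    for seconds, image_path, text in frames:
        lines.append(f"### {format_timestamp(seconds)}")
        lines.append("")
        lines.append(f"![frame]({image_path})")
        lines.append("")
        # Keep a placeholder so frames without text still appear in the report.
        lines.append(text.strip() or "(no text detected)")
        lines.append("")
    return "\n".join(lines)
```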
Examples
Example 1: Basic Extraction
Agent prompt:
Use the video-content-extractor skill to extract content from lecture.mp4
Output: lecture.md and a lecture_frames/ directory.
Example 2: Custom Interval
Parameters (positional, in order): video_path, output_dir, interval (seconds), lang
Extract every 60 seconds with English-only OCR:
python scripts/extract_video.py recording.mp4 ./output 60 eng
Example 3: Bilingual Content
Extract every 15 seconds into the current directory with the default Chinese + English OCR:
python scripts/extract_video.py lecture.mp4 . 15 chi_sim+eng
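The positional parameters used in these examples could be parsed along the following lines. This is a hypothetical sketch, not the contents of scripts/extract_video.py: the defaults shown (current directory, 30 seconds, chi_sim+eng) are inferred from this document, and treating the last three arguments as optional is an assumption.

```python
import argparse


def build_parser():
    """Positional CLI: video_path [output_dir] [interval] [lang]."""
    parser = argparse.ArgumentParser(
        description="Extract frames from a video and OCR them (sketch)."
    )
    parser.add_argument("video_path", help="path to the MP4 file")
    parser.add_argument("output_dir", nargs="?", default=".",
                        help="where the report and frames are written (assumed default)")
    parser.add_argument("interval", nargs="?", type=int, default=30,
                        help="seconds between captured frames")
    parser.add_argument("lang", nargs="?", default="chi_sim+eng",
                        help="Tesseract language spec, joined with '+'")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.video_path, args.output_dir, args.interval, args.lang)
```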
Best Practices
- Use shorter intervals (10-15s) for fast-paced content with frequent text changes.
- Use longer intervals (30-60s) for presentation slides or slow lectures to reduce duplicate frames.
- For Chinese content, ensure the Tesseract Simplified Chinese language pack (chi_sim) is installed.
Limitations
- Requires FFmpeg and Tesseract OCR to be installed and accessible via PATH.
- Tesseract OCR accuracy depends on video quality, text size, and font clarity.
- Does not extract audio or perform speech-to-text transcription.
- Frame extraction is time-based (not scene-change-based), which may produce near-duplicate frames.
- Large videos with short intervals can generate many frames, so ensure sufficient disk space.
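To gauge the disk space concern before a run, the frame count can be estimated from the duration and interval. The sketch below assumes one frame per interval starting at time zero; the 200 KB average frame size is a placeholder assumption, not a measured figure, and real JPEG sizes vary with resolution and content.

```python
import math


def estimate_frames(duration_s, interval_s):
    """Frames captured at 0, interval, 2*interval, ... below duration."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return math.ceil(duration_s / interval_s)


def estimate_disk_mb(duration_s, interval_s, avg_frame_kb=200):
    """Rough disk usage in MiB; avg_frame_kb is an assumed average size."""
    return estimate_frames(duration_s, interval_s) * avg_frame_kb / 1024


# A two-hour video sampled every 10 seconds:
print(estimate_frames(7200, 10))    # 720
print(estimate_disk_mb(7200, 10))   # 140.625
```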
Security and Safety Notes
- This skill only reads video files and writes extracted frames and Markdown reports.
- It does NOT send any data over the network; all processing is local.
- FFmpeg and Tesseract are invoked with fixed, pre-vetted arguments.
- The skill does not modify or delete the original video file.
Common Pitfalls
- Problem: Tesseract returns garbled text
  Solution: Ensure the correct language pack is installed. Run tesseract --list-langs to verify.
- Problem: FFmpeg fails with "not found"
  Solution: Make sure FFmpeg is on PATH. Run ffmpeg -version to verify.
- Problem: OCR is slow on large videos
  Solution: Increase the interval parameter to reduce frames processed.
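The first two checks above can be automated with a small preflight script. This is a sketch under assumptions: the function names are hypothetical, and the parser expects tesseract --list-langs to print a header line followed by one language code per line.

```python
import shutil
import subprocess


def missing_tools(tools=("ffmpeg", "ffprobe", "tesseract"), which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def parse_langs(list_langs_output):
    """Parse `tesseract --list-langs` output, skipping the header line."""
    lines = [ln.strip() for ln in list_langs_output.splitlines() if ln.strip()]
    return set(lines[1:])


def missing_langs(lang_spec, installed):
    """Return codes from a spec such as 'chi_sim+eng' that are not installed."""
    return [code for code in lang_spec.split("+") if code not in installed]


def installed_langs():
    """Ask Tesseract which language packs it can see."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=True,
    )
    # Some older Tesseract versions print the list to stderr instead of stdout.
    return parse_langs(result.stdout or result.stderr)


if __name__ == "__main__":
    tools = missing_tools()
    if tools:
        print("Not on PATH:", ", ".join(tools))
    else:
        print("Missing language packs:",
              missing_langs("chi_sim+eng", installed_langs()) or "none")
```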
Related Skills
- @media-summarizer - For summarizing video content using visual and audio cues.
- @document-ocr - For OCR on static images or scanned documents without video processing.