
Building an Auto-Clipper AI: From YouTube Link to Viral Short-Form Clips
Short-form video platforms like TikTok, Instagram Reels, and YouTube Shorts have exploded in popularity. Creators and businesses now want a way to automatically extract the most viral moments from long videos — without manually watching hours of footage.
That’s exactly why I built Auto Clipper AI.
This tool ingests a long YouTube video, transcribes it, finds the most interesting segments using AI, enhances each clip visually (blurred background, proper aspect ratio), and even runs it through a PyCaps video template to make it look like a professionally edited short.
This post walks through the architecture, the libraries used, and a full breakdown of how the code works end-to-end.

Why Build an Auto-Clipper?
Manually editing short-form content is:
- slow
- repetitive
- difficult to scale
- dependent on editing skills
For newsletters, podcasts, YouTube channels, and brand agencies, this becomes a bottleneck.
Auto Clipper AI solves this by automating:
- Video download
- Speech-to-text transcription
- Sentence and semantic segmentation
- AI scoring to detect viral moments
- Clip extraction
- Dynamic blurred-background formatting
- Running PyCaps templates to create clean, engaging shorts
This brings the workflow close to “Paste YouTube link → Get viral clips”.
Libraries Used & Their Role
| Library | Purpose |
|---|---|
| yt-dlp | Download YouTube videos in highest available resolution |
| Whisper | Generate accurate transcripts + timestamps |
| OpenAI Embeddings | Compute sentence similarity, semantic merging |
| OpenAI Chat Completions | Score segments for virality (1–10) |
| NumPy | Vector similarity for semantic coherence |
| MoviePy | Extract clips, resize, blur background, composite final video |
| OpenCV (cv2) | Apply Gaussian blur to background clip |
| PyCaps (pycaps) | Add animations, subtitles, effects on top of the extracted clip |
All these components combine into a fully automated clip-editing pipeline.
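For reference, here is roughly the set of imports the rest of the snippets assume. The MoviePy paths match the 1.x API used in the code below; the PyCaps import line is an assumption based on the names used later, so check the pycaps docs for the exact module paths.
import os
import json

import cv2
import numpy as np
import whisper
import yt_dlp
from moviepy.editor import VideoFileClip, CompositeVideoClip
from moviepy.video.fx.resize import resize
from openai import OpenAI
# Assumed import path; TemplateLoader, FadeIn, EventType and ElementType
# are the names used in the PyCaps step below
from pycaps import TemplateLoader, FadeIn, EventType, ElementType

client = OpenAI()  # reads OPENAI_API_KEY from the environment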
Code Architecture Overview
Below is the high-level flow of the pipeline:
YouTube URL
↓
download_video()
↓
transcribe_video() --> Whisper transcript w/ word timestamps
↓
sentence_segments() --> Convert transcript into clean sentences
↓
semantic_coherence() --> Merge related sentences
↓
sliding_window_segments() --> 20–40 sec blocks for clip candidates
↓
select_viral_clips() --> GPT scores segments for virality
↓
extend_to_sentence_boundaries() --> Natural clip edges
↓
save_clip_() --> Extract + blur background + center main video
↓
run_pycaps_pipeline() --> Add template animation + styling
↓
Final Viral Clip
Each step intentionally cleans, merges, scores, and prepares segments so the final clip feels natural and engaging.

Step-by-Step Breakdown of the Code
1. Downloading the YouTube Video
def download_video(youtube_url, output_path="video.mp4"):
    ydl_opts = {
        'format': 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/mp4',
        'outtmpl': output_path
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([youtube_url])
    return output_path
This downloads the best available video+audio combination. The file becomes the starting point for transcription.
2. Transcription with Whisper
def transcribe_video(video_path):
    model = whisper.load_model("tiny", device="cuda")  # use device="cpu" if no GPU is available
    result = model.transcribe(video_path, word_timestamps=True)
    return result["segments"]
Whisper returns segments with word-level timestamps, which are crucial for building natural clips.
Each segment is stored as:
{
    "start": 0.0,
    "end": 3.5,
    "text": "Welcome back to the channel...",
    "words": [...]
}
3. Sentence Splitting
def sentence_segments(full_transcript):
    sentences = []
    buffer = []
    start_time = None
    for seg in full_transcript:
        for w in seg.get("words", []):
            if start_time is None:
                start_time = w["start"]
            buffer.append(w)
            if w["word"].strip().endswith((".", "!", "?")):
                end_time = w["end"]
                sentence_text = " ".join([x["word"] for x in buffer]).strip()
                sentences.append({
                    "start": start_time,
                    "end": end_time,
                    "text": sentence_text,
                    "words": buffer
                })
                buffer = []
                start_time = None
    return sentences
This groups words into full sentences whenever punctuation (. ! ?) appears.
This gives clean, meaningful chunks that work better for semantic merging.
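For illustration, each entry in the returned list keeps the same shape as a Whisper segment, just snapped to a sentence boundary (the timings and text here are made up):
{
    "start": 12.4,
    "end": 15.8,
    "text": "So here is the part nobody tells you.",
    "words": [...]
}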
4. Semantic Coherence: Merging Related Sentences
def semantic_coherence(sentences, threshold=0.78):
    merged = []
    i = 0
    while i < len(sentences):
        start = sentences[i]["start"]
        text = sentences[i]["text"]
        end = sentences[i]["end"]
        # get current embedding
        emb1 = client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding
        # merge forward while semantically similar
        j = i + 1
        while j < len(sentences):
            emb2 = client.embeddings.create(model="text-embedding-3-small", input=sentences[j]["text"]).data[0].embedding
            sim = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
            if sim < threshold:
                break
            text += " " + sentences[j]["text"]
            end = sentences[j]["end"]
            j += 1
            emb1 = emb2  # move forward
        merged.append({"start": start, "end": end, "text": text})
        i = j
    return merged
Here you:
- take each sentence
- compute its embedding
- check semantic similarity with the next sentence
- merge until similarity drops below the threshold
This removes abrupt transitions and gives natural topic clusters.
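One practical note: the loop above calls the embeddings API once per comparison. A minimal optimization sketch, assuming the same client and embedding model, is to embed every sentence in a single batched request and compute cosine similarity locally with NumPy:
def embed_all(sentences, model="text-embedding-3-small"):
    # One batched API call instead of one call per sentence comparison
    resp = client.embeddings.create(model=model, input=[s["text"] for s in sentences])
    return [np.array(d.embedding) for d in resp.data]

def cosine(a, b):
    # Same similarity measure the merging loop uses
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))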
5. Sliding Window Segments
def sliding_window_segments(segments, window=20, stride=15, min_len=8, max_len=40):
    merged = []
    video_end = segments[-1]["end"]
    t = 0
    while t < video_end:
        window_start = t
        window_end = min(t + window, video_end)
        selected = [seg for seg in segments if seg["end"] > window_start and seg["start"] < window_end]
        if not selected:
            t += stride
            continue
        actual_start = selected[0]["start"]
        actual_end = selected[-1]["end"]
        # collect all words
        words = []
        for seg in selected:
            words.extend(seg.get("words", []))
        while (actual_end - actual_start) < min_len:
            next_idx = segments.index(selected[-1]) + 1
            if next_idx < len(segments):
                selected.append(segments[next_idx])
                actual_end = selected[-1]["end"]
                words.extend(selected[-1].get("words", []))
            else:
                break
        if (actual_end - actual_start) > max_len:
            actual_end = actual_start + max_len
            # trim words too
            words = [w for w in words if w["start"] < actual_end]
        merged.append({
            "start": actual_start,
            "end": actual_end,
            "text": " ".join([seg["text"] for seg in selected]).strip(),
            "words": words
        })
        t += stride
    return merged
Since viral shorts are typically 8–40 seconds, this function:
- uses a rolling 20-second window by default
- overlaps consecutive windows (stride of 15 seconds by default)
- expands segments that fall below the minimum length
- trims segments that exceed the maximum length
This produces many candidate viral clips.
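Before spending any GPT calls on scoring, it helps to eyeball the candidates. A quick sketch, assuming semantic_blocks is the output of the previous step:
windows = sliding_window_segments(semantic_blocks, window=20, stride=15)
for w in windows[:5]:
    print(f"{w['start']:.1f}s–{w['end']:.1f}s  {w['text'][:80]}")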
6. Scoring Clips with GPT
def gpt_score_segment(text):
    prompt = f"""
You are a content virality expert.
Rate the following transcript segment on a scale of 1–10 for its potential as a viral short-form video clip
(TikTok/Instagram Reels/YouTube Shorts). Consider hook strength, emotional impact, relatability, curiosity, and
whether it stands alone without extra context.
Only return a JSON object with this structure:
{{"score": <number>, "reason": "<brief reason>"}}
Segment:
\"\"\"{text}\"\"\"
"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # fast/cheap model
        messages=[{"role": "user", "content": prompt}],
        max_completion_tokens=150,
        response_format={"type": "json_object"}  # ✅ ensures valid JSON
    )
    content = response.choices[0].message.content
    result = json.loads(content)  # guaranteed valid JSON
    return result
{"score": 9.2, "reason": "Strong hook and emotional appeal."}
It evaluates:
- hook strength
- curiosity
- relatability
- emotional pull
- standalone clarity
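To sanity-check the scorer in isolation (the segment text below is invented purely for illustration):
result = gpt_score_segment("I quit my job with zero savings, and here's what happened in the first 30 days.")
print(result["score"], "-", result["reason"])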
7. Selecting Non-Overlapping Viral Clips
def select_viral_clips(segments, n=3, min_gap=8):
    """
    Pick top n clips, ensuring they don't overlap too closely.
    :param segments: list of {start, end, text}
    :param n: number of clips to return
    :param min_gap: minimum gap in seconds between chosen clips
    """
    scored_segments = []
    for seg in segments:
        result = gpt_score_segment(seg["text"])
        scored_segments.append({
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"],
            "words": seg.get("words", []),  # ✅ keep words
            "score": result["score"],
            "reason": result["reason"]
        })
    # Sort by score (highest first)
    scored_segments.sort(key=lambda x: x["score"], reverse=True)
    chosen = []
    for candidate in scored_segments:
        # Check distance from already chosen
        too_close = False
        for c in chosen:
            if abs(candidate["start"] - c["start"]) < min_gap:
                too_close = True
                break
        if not too_close:
            chosen.append(candidate)
        if len(chosen) >= n:
            break
    return chosen
This picks the top N scored clips while making sure they aren’t too close in time.
This ensures clip variety.
8. Natural Sentence Boundaries
def extend_to_sentence_boundaries(segments, full_transcript, max_len=300):
    """
    Extend viral segments to nearest sentence boundaries or silence gaps.
    Keeps clips natural and not cut mid-sentence.
    """
    def find_prev_boundary(time):
        prev_segments = [s for s in full_transcript if s["end"] <= time]
        for seg in reversed(prev_segments):
            text = seg["text"].strip()
            if text.endswith((".", "!", "?")):
                return seg["end"]
        return max(0, time - 2)  # fallback: extend 2s before

    def find_next_boundary(time):
        next_segments = [s for s in full_transcript if s["start"] >= time]
        for seg in next_segments:
            text = seg["text"].strip()
            if text.endswith((".", "!", "?")):
                return seg["end"]
        return time + 2  # fallback: extend 2s after

    extended = []
    for seg in segments:
        start = find_prev_boundary(seg["start"])
        end = find_next_boundary(seg["end"])
        # keep within video and cap max length
        if end - start > max_len:
            end = start + max_len
        # collect words within the extended region
        words = []
        for s in full_transcript:
            for w in s.get("words", []):
                if start <= w["start"] <= end:
                    words.append(w)
        full_text = " ".join([
            s["text"] for s in full_transcript
            if s["end"] >= start and s["start"] <= end
        ]).strip()
        extended.append({
            "start": start,
            "end": end,
            "text": full_text,
            "words": words,
            "score": seg["score"],
            "reason": seg["reason"]
        })
    return extended
This prevents cuts in the middle of words/sentences.
It finds:
- the nearest previous sentence ending
- the nearest next sentence ending
The result feels human-edited.
9. Video Extraction + Blurred Background
def save_clip_(video_path, selected_segments, output_dir="clips", blur_intensity=75):
    os.makedirs(output_dir, exist_ok=True)
    clip_paths = []
    target_w, target_h = 1080, 1920
    with VideoFileClip(video_path) as video:
        for i, segment in enumerate(selected_segments):
            with video.subclip(float(segment["start"]), float(segment["end"])) as clip:
                # Calculate the maximum size for the main clip while maintaining aspect ratio
                clip_aspect = clip.w / clip.h
                target_aspect = target_w / target_h
                if clip_aspect > target_aspect:
                    # Clip is wider than target - fit to width
                    main_clip = resize(clip, width=target_w)
                else:
                    # Clip is taller than target - fit to height
                    main_clip = resize(clip, height=target_h)
                # Create background - scale original clip to fill the entire frame
                if clip_aspect > target_aspect:
                    bg_clip = resize(clip, height=target_h)
                else:
                    bg_clip = resize(clip, width=target_w)
                # Crop background to exact dimensions
                bg_clip = bg_clip.crop(
                    x_center=bg_clip.w / 2,
                    y_center=bg_clip.h / 2,
                    width=target_w,
                    height=target_h
                )
                # Apply blur
                bg_clip = bg_clip.fl_image(lambda image: cv2.GaussianBlur(image, (blur_intensity, blur_intensity), 0))
                # Composite - center the main clip
                final = CompositeVideoClip([
                    bg_clip,
                    main_clip.set_position(("center", "center"))
                ])
                out_path = os.path.join(output_dir, f"clip_{i+1}_score{segment['score']}.mp4")
                final.write_videofile(out_path, codec="libx264", audio_codec="aac",
                                      threads=8, preset="ultrafast", bitrate="2000k")
                # Cleanup
                bg_clip.close()
                main_clip.close()
                final.close()
                clip_paths.append({
                    "path": out_path,
                    "score": segment["score"],
                    "reason": segment["reason"],
                    "text": segment["text"]
                })
    return clip_paths
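One caveat worth knowing: cv2.GaussianBlur only accepts odd, positive kernel sizes, so the default blur_intensity=75 works, but an even value would raise an error. A small guard, my addition rather than part of the original function, keeps that safe:
def blur_frame(image, blur_intensity=75):
    # cv2.GaussianBlur requires an odd kernel size; bump even values by one
    ksize = blur_intensity if blur_intensity % 2 == 1 else blur_intensity + 1
    return cv2.GaussianBlur(image, (ksize, ksize), 0)
With this helper, the inline lambda above can simply become bg_clip.fl_image(blur_frame).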
10. Add PyCaps Template (Animations, Effects, Captions)
def run_pycaps_pipeline(input_vid):
    # Load a template and configure it
    builder = TemplateLoader("hype").with_input_video(input_vid).load(False)
    # Programmatically add an animation
    builder.add_animation(
        animation=FadeIn(),
        when=EventType.ON_NARRATION_STARTS,
        what=ElementType.SEGMENT
    )
    # Build and run the pipeline
    pipeline = builder.build()
    pipeline.run()
This loads a template. The "hype" template may include:
- dynamic captions
- waveform visualizer
- emoji reactions
- progress bar
Putting It All Together: The Full Pipeline
if __name__ == "__main__":
    url = "https://www.youtube.com/watch?v=JVtUcM1sWQw"
    video_path = download_video(url)
    segments = transcribe_video(video_path)
    sentences = sentence_segments(segments)
    semantic_blocks = semantic_coherence(sentences)
    windows = sliding_window_segments(semantic_blocks, window=120, stride=30)
    viral_segments = select_viral_clips(windows, n=1)
    natural_segments = extend_to_sentence_boundaries(viral_segments, segments)
    clip_files = save_clip_(video_path, natural_segments)
    run_pycaps_pipeline(clip_files[0]["path"])
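This run asks for a single clip (n=1) and only pushes the first one through PyCaps. If you raise n to extract several clips, replace the last line with a loop so every clip gets the template treatment:
for clip in clip_files:
    run_pycaps_pipeline(clip["path"])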
Final Thoughts
Auto Clipper AI is more than a script — it’s a full intelligent video-editing pipeline that:
- understands speech
- detects meaningful content
- evaluates virality potential
- creates polished short-form video
With PyCaps + Whisper + MoviePy + GPT scoring, it becomes easy to repurpose long videos into multiple engaging clips.
If you’re building a SaaS, tool, or automation pipeline around short-form content — this architecture is production-ready.


