
Building an Auto-Clipper AI: From YouTube Link to Viral Short-Form Clips

Short-form video platforms like TikTok, Instagram Reels, and YouTube Shorts have exploded in popularity. Creators and businesses now want a way to automatically extract the most viral moments from long videos — without manually watching hours of footage.

That’s exactly why I built Auto Clipper AI.

This tool ingests a long YouTube video, transcribes it, finds the most interesting segments using AI, enhances the clip visually (blurred background, proper aspect ratio), and even runs it through a PyCaps video template to make it look like a professionally edited short.

This post walks through the architecture, the libraries used, and a full breakdown of how the code works end-to-end.

Why Build an Auto-Clipper?

Manually editing short-form content is:

  • slow

  • repetitive

  • difficult to scale

  • dependent on editing skills

For newsletters, podcasts, YouTube channels, and brand agencies, this becomes a bottleneck.

Auto Clipper AI solves this by automating:

  1. Video download

  2. Speech-to-text transcription

  3. Sentence and semantic segmentation

  4. AI scoring to detect viral moments

  5. Clip extraction

  6. Dynamic blurred-background formatting

  7. Running PyCaps templates to create clean, engaging shorts

This brings the workflow close to “Paste YouTube link → Get viral clips”.

 

Libraries Used & Their Role

  • yt-dlp: downloads YouTube videos in the highest available resolution

  • Whisper: generates accurate transcripts with word-level timestamps

  • OpenAI Embeddings: computes sentence similarity for semantic merging

  • OpenAI Chat Completions: scores segments for virality (1–10)

  • NumPy: vector math for semantic-coherence similarity

  • MoviePy: extracts clips, resizes, blurs the background, composites the final video

  • OpenCV (cv2): applies the Gaussian blur to the background clip

  • PyCaps (pycaps): adds animations, subtitles, and effects on top of the extracted clip

All these components combine into a fully automated clip-editing pipeline.
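
All of the snippets below come from a single script, so they share one set of imports and one OpenAI client. Roughly, the top of the file looks like this (a sketch assuming MoviePy 1.x for the resize import; the PyCaps names used in step 10 are noted in a comment):

import os
import json

import cv2
import numpy as np
import whisper
import yt_dlp
from openai import OpenAI
from moviepy.editor import VideoFileClip, CompositeVideoClip
from moviepy.video.fx.all import resize

# PyCaps supplies TemplateLoader, FadeIn, EventType, and ElementType (used in step 10);
# check the pycaps docs for the exact import paths.

client = OpenAI()  # reads OPENAI_API_KEY from the environment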

 

Code Architecture Overview

Below is the high-level flow of the pipeline:

YouTube URL
    ↓
download_video()
    ↓
transcribe_video()  --> Whisper transcript w/ word timestamps
    ↓
sentence_segments() --> Convert transcript into clean sentences
    ↓
semantic_coherence() --> Merge related sentences
    ↓
sliding_window_segments() --> 20–40 sec blocks for clip candidates
    ↓
select_viral_clips() --> GPT scores segments for virality
    ↓
extend_to_sentence_boundaries() --> Natural clip edges
    ↓
save_clip_() --> Extract + blur background + center main video
    ↓
run_pycaps_pipeline() --> Add template animation + styling
    ↓
Final Viral Clip

Each step intentionally cleans, merges, scores, and prepares segments so the final clip feels natural and engaging.

 

Step-by-Step Breakdown of the Code


1. Downloading the YouTube Video

def download_video(youtube_url, output_path="video.mp4"):
    ydl_opts = {'format': 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/mp4',
                'outtmpl': output_path}
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([youtube_url])
    return output_path

This downloads the best available video+audio combination. The file becomes the starting point for transcription.
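
If you also want the video title or duration (handy for naming the output clips later), yt-dlp can hand back that metadata from the same call. A small variation, with a hypothetical name, might look like:

def download_video_with_info(youtube_url, output_path="video.mp4"):
    ydl_opts = {'format': 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/mp4',
                'outtmpl': output_path}
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        # extract_info(download=True) downloads the file and returns the metadata dict
        info = ydl.extract_info(youtube_url, download=True)
    return output_path, info.get("title"), info.get("duration")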


2. Transcription with Whisper

def transcribe_video(video_path):
    model = whisper.load_model("tiny", device="cuda")
    result = model.transcribe(video_path, word_timestamps=True)
    return result["segments"]  # list of segments, each with word-level timestamps

Whisper returns segments with word-level timestamps, which are crucial for building natural clips.

Each segment is stored as:

{
 "start": 0.0,
 "end": 3.5,
 "text": "Welcome back to the channel...",
 "words": [...]
}
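
The snippet hard-codes the tiny model on CUDA. On a machine without a GPU, or when transcript accuracy matters more than speed, a small variation works (my adjustment, not the original code):

import torch

def transcribe_video(video_path):
    # Fall back to CPU when CUDA isn't available; "small" trades speed for accuracy.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = whisper.load_model("small", device=device)
    result = model.transcribe(video_path, word_timestamps=True)
    return result["segments"]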

3. Sentence Splitting

def sentence_segments(full_transcript):
    sentences = []
    buffer = []
    start_time = None
    for seg in full_transcript:
        for w in seg.get("words", []):
            if start_time is None:
                start_time = w["start"]
            buffer.append(w)
            if w["word"].strip().endswith((".", "!", "?")):
                end_time = w["end"]
                sentence_text = " ".join([x["word"] for x in buffer]).strip()
                sentences.append({
                    "start": start_time,
                    "end": end_time,
                    "text": sentence_text,
                    "words": buffer
                })
                buffer = []
                start_time = None
    return sentences

This groups words into full sentences whenever punctuation (. ! ?) appears.

This gives clean, meaningful chunks that work better for semantic merging.
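
To make the output concrete, here is a quick check with a hand-made transcript segment (the timings are invented for illustration):

sample = [{
    "start": 0.0, "end": 1.4,
    "text": "Welcome back to the channel.",
    "words": [
        {"word": "Welcome", "start": 0.0, "end": 0.4},
        {"word": "back", "start": 0.4, "end": 0.7},
        {"word": "to", "start": 0.7, "end": 0.8},
        {"word": "the", "start": 0.8, "end": 1.0},
        {"word": "channel.", "start": 1.0, "end": 1.4},
    ],
}]

print(sentence_segments(sample))
# -> [{'start': 0.0, 'end': 1.4, 'text': 'Welcome back to the channel.', 'words': [...]}]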

4. Semantic Coherence: Merging Related Sentences

def semantic_coherence(sentences, threshold=0.78):
    merged = []
    i = 0
    while i < len(sentences):
        start = sentences[i]["start"]
        text = sentences[i]["text"]
        end = sentences[i]["end"]

        # get current embedding
        emb1 = client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

        # merge forward while semantically similar
        j = i + 1
        while j < len(sentences):
            emb2 = client.embeddings.create(model="text-embedding-3-small", input=sentences[j]["text"]).data[0].embedding
            sim = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
            if sim < threshold:
                break
            text += " " + sentences[j]["text"]
            end = sentences[j]["end"]
            j += 1
            emb1 = emb2  # move forward
        merged.append({"start": start, "end": end, "text": text})
        i = j
    return merged

Here you:

  • take each sentence

  • compute embeddings

  • check semantic similarity with the next sentence

  • merge until similarity drops below the threshold

This removes abrupt transitions and gives natural topic clusters.
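
One practical tweak: the loop above makes one embeddings request per sentence, which gets slow and costly on long videos. A possible optimization, not in the original code, is to embed every sentence in a single batched request up front and reuse the vectors during merging:

def embed_all(sentences, model="text-embedding-3-small"):
    """Embed all sentence texts in one API call (the endpoint accepts a list of inputs)."""
    response = client.embeddings.create(model=model, input=[s["text"] for s in sentences])
    return [np.array(item.embedding) for item in response.data]

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))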

5. Sliding Window Segments

def sliding_window_segments(segments, window=20, stride=15, min_len=8, max_len=40):
    merged = []
    video_end = segments[-1]["end"]
    t = 0

    while t < video_end:
        window_start = t
        window_end = min(t + window, video_end)

        selected = [seg for seg in segments if seg["end"] > window_start and seg["start"] < window_end]
        if not selected:
            t += stride
            continue

        actual_start = selected[0]["start"]
        actual_end = selected[-1]["end"]

        # collect all words
        words = []
        for seg in selected:
            words.extend(seg.get("words", []))

        while (actual_end - actual_start) < min_len:
            next_idx = segments.index(selected[-1]) + 1
            if next_idx < len(segments):
                selected.append(segments[next_idx])
                actual_end = selected[-1]["end"]
                words.extend(selected[-1].get("words", []))
            else:
                break

        if (actual_end - actual_start) > max_len:
            actual_end = actual_start + max_len
            # trim words too
            words = [w for w in words if w["start"] < actual_end]

        merged.append({
            "start": actual_start,
            "end": actual_end,
            "text": " ".join([seg["text"] for seg in selected]).strip(),
            "words": words
        })

        t += stride

    return merged

Since viral shorts are typically 8–40 seconds, this function:

  • uses a rolling 20-sec window

  • overlaps windows (stride=15 sec)

  • expands short segments

  • trims long ones

This produces many candidate viral clips.
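
As a rough example, a 10-minute (600-second) video with stride=15 yields about 600 / 15 = 40 overlapping candidate windows, each nudged into the 8–40 second range before scoring.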

6. Scoring Clips with GPT

def gpt_score_segment(text):
    prompt = f"""
    You are a content virality expert.
    Rate the following transcript segment on a scale of 1–10 for its potential as a viral short-form video clip 
    (TikTok/Instagram Reels/YouTube Shorts). Consider hook strength, emotional impact, relatability, curiosity, and 
    whether it stands alone without extra context.

    Only return a JSON object with this structure:
    {{"score": <number>, "reason": "<brief reason>"}}

    Segment:
    \"\"\"{text}\"\"\"
    """

    response = client.chat.completions.create(
        model="gpt-4o-mini",   # fast/cheap model
        messages=[{"role": "user", "content": prompt}],
        max_completion_tokens=150,
        response_format={"type": "json_object"}  # ✅ ensures valid JSON
    )

    content = response.choices[0].message.content
    result = json.loads(content)   # guaranteed valid JSON
    return result
{"score": 9.2, "reason": "Strong hook and emotional appeal."}

It evaluates:

  • hook strength

  • curiosity

  • relatability

  • emotional pull

  • standalone clarity
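
Since each candidate triggers an API call, a transient failure or an unparsable response shouldn't kill the whole run. A small wrapper along these lines (my addition, with a hypothetical name) keeps the pipeline moving:

def safe_score_segment(text, default_score=1):
    """Fall back to a neutral score if the API call or JSON parsing fails."""
    try:
        return gpt_score_segment(text)
    except Exception as err:
        return {"score": default_score, "reason": f"scoring failed: {err}"}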

7. Selecting Non-Overlapping Viral Clips

def select_viral_clips(segments, n=3, min_gap=8):
    """
    Pick top n clips, ensuring they don't overlap too closely.
    
    :param segments: list of {start, end, text}
    :param n: number of clips to return
    :param min_gap: minimum gap in seconds between chosen clips
    """
    scored_segments = []
    for seg in segments:
        result = gpt_score_segment(seg["text"])
        scored_segments.append({
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"],
            "words": seg.get("words", []),   # ✅ keep words
            "score": result["score"],
            "reason": result["reason"]
        })

    # Sort by score (highest first)
    scored_segments.sort(key=lambda x: x["score"], reverse=True)

    chosen = []
    for candidate in scored_segments:
        # Check distance from already chosen
        too_close = False
        for c in chosen:
            if abs(candidate["start"] - c["start"]) < min_gap:
                too_close = True
                break
        if not too_close:
            chosen.append(candidate)
        if len(chosen) >= n:
            break

    return chosen

This picks the top N scored clips while making sure they aren't too close together in time, which keeps the final selection varied.
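
One subtlety: the distance check only compares start times, so two long clips could still overlap in the middle. If that matters for your content, a stricter interval test could replace it (a sketch, not part of the original code):

def overlaps(candidate, chosen_clip, min_gap=8):
    """True if the two time ranges overlap or come within min_gap seconds of each other."""
    return (candidate["start"] < chosen_clip["end"] + min_gap and
            chosen_clip["start"] < candidate["end"] + min_gap)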

8. Natural Sentence Boundaries


def extend_to_sentence_boundaries(segments, full_transcript, max_len=300):
    """
    Extend viral segments to nearest sentence boundaries or silence gaps.
    Keeps clips natural and not cut mid-sentence.
    """
    def find_prev_boundary(time):
        prev_segments = [s for s in full_transcript if s["end"] <= time]
        for seg in reversed(prev_segments):
            text = seg["text"].strip()
            if text.endswith((".", "!", "?")):
                return seg["end"]
        return max(0, time - 2)  # fallback: extend 2s before

    def find_next_boundary(time):
        next_segments = [s for s in full_transcript if s["start"] >= time]
        for seg in next_segments:
            text = seg["text"].strip()
            if text.endswith((".", "!", "?")):
                return seg["end"]
        return time + 2  # fallback: extend 2s after

    extended = []
    for seg in segments:
        start = find_prev_boundary(seg["start"])
        end = find_next_boundary(seg["end"])

        # keep within video and cap max length
        if end - start > max_len:
            end = start + max_len

        # collect words within the extended region
        words = []
        for s in full_transcript:
            for w in s.get("words", []):
                if start <= w["start"] <= end:
                    words.append(w)

        full_text = " ".join([
            s["text"] for s in full_transcript
            if s["end"] >= start and s["start"] <= end
        ]).strip()

        extended.append({
            "start": start,
            "end": end,
            "text": full_text,
            "words": words,
            "score": seg["score"],
            "reason": seg["reason"]
        })

    return extended

This prevents cuts in the middle of words/sentences.

It finds:

  • nearest previous sentence ending

  • nearest next sentence ending

The result feels human-edited.
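
For example, if a scored window happens to start at 12.4 s in the middle of a sentence, the start is pulled back to the end of the previous complete sentence (say 10.8 s), and the end is pushed forward to the next sentence-ending punctuation (the timings here are illustrative, not from the demo video).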

9. Video Extraction + Blurred Background


def save_clip_(video_path, selected_segments, output_dir="clips", blur_intensity=75):
    os.makedirs(output_dir, exist_ok=True)
    clip_paths = []
    
    target_w, target_h = 1080, 1920
    
    with VideoFileClip(video_path) as video:
        for i, segment in enumerate(selected_segments):
            with video.subclip(float(segment["start"]), float(segment["end"])) as clip:
                # Calculate the maximum size for main clip while maintaining aspect ratio
                clip_aspect = clip.w / clip.h
                target_aspect = target_w / target_h
                
                if clip_aspect > target_aspect:
                    # Clip is wider than target - fit to width
                    main_clip = resize(clip, width=target_w)
                else:
                    # Clip is taller than target - fit to height
                    main_clip = resize(clip, height=target_h)
                
                # Create background - scale original clip to fill entire frame
                if clip_aspect > target_aspect:
                    bg_clip = resize(clip, height=target_h)
                else:
                    bg_clip = resize(clip, width=target_w)
                
                # Crop background to exact dimensions
                bg_clip = bg_clip.crop(
                    x_center=bg_clip.w/2,
                    y_center=bg_clip.h/2,
                    width=target_w,
                    height=target_h
                )
                
                # Apply blur
                bg_clip = bg_clip.fl_image(lambda image: cv2.GaussianBlur(image, (blur_intensity, blur_intensity), 0))
                
                # Composite - center the main clip
                final = CompositeVideoClip([
                    bg_clip,
                    main_clip.set_position(("center", "center"))
                ])
                
                out_path = os.path.join(output_dir, f"clip_{i+1}_score{segment['score']}.mp4")
                final.write_videofile(out_path, codec="libx264", audio_codec="aac", 
                                    threads=8, preset="ultrafast", bitrate="2000k")
                
                # Cleanup
                bg_clip.close()
                main_clip.close()
                final.close()
                
                clip_paths.append({
                    "path": out_path,
                    "score": segment["score"],
                    "reason": segment["reason"],
                    "text": segment["text"]
                })
    
    return clip_paths
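
One OpenCV detail worth noting: cv2.GaussianBlur requires an odd kernel size, so the default blur_intensity=75 works, but an even value would raise an error. A defensive guard before the blur call (my addition) avoids the surprise:

# Force the Gaussian kernel size to be odd, as OpenCV requires.
if blur_intensity % 2 == 0:
    blur_intensity += 1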

10. Add PyCaps Template (Animations, Effects, Captions)

def run_pycaps_pipeline(input_vid):
    # Load a template and configure it
    builder = TemplateLoader("hype").with_input_video(input_vid).load(False)
    # Programmatically add an animation
    builder.add_animation(
        animation=FadeIn(),
        when=EventType.ON_NARRATION_STARTS,
        what=ElementType.SEGMENT
    )
    # Build and run the pipeline
    pipeline = builder.build()
    pipeline.run()

This loads a template. The "hype" template may include:

  • dynamic captions

  • waveform visualizer

  • emoji reactions

  • progress bar

Putting It All Together: The Full Pipeline

if __name__ == "__main__":
    url = "https://www.youtube.com/watch?v=JVtUcM1sWQw"

    video_path = download_video(url)
    segments = transcribe_video(video_path)

    sentences = sentence_segments(segments)
    semantic_blocks = semantic_coherence(sentences)

    windows = sliding_window_segments(semantic_blocks, window=120, stride=30)
    viral_segments = select_viral_clips(windows, n=1)

    natural_segments = extend_to_sentence_boundaries(viral_segments, segments)
    clip_files = save_clip_(video_path, natural_segments)

    run_pycaps_pipeline(clip_files[0]["path"])
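
The demo only pushes the first clip through PyCaps. If you want every selected clip templated, the last line can become a loop (a straightforward extension, not in the original script):

for clip in clip_files:
    run_pycaps_pipeline(clip["path"])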

Final Thoughts

Auto Clipper AI is more than a script — it’s a full intelligent video-editing pipeline that:

  • understands speech

  • detects meaningful content

  • evaluates virality potential

  • creates polished short-form video

With PyCaps + Whisper + MoviePy + GPT scoring, it becomes easy to repurpose long videos into multiple engaging clips.

If you’re building a SaaS, tool, or automation pipeline around short-form content — this architecture is production-ready.
