
Building an Auto-Clipper AI: From YouTube Link to Viral Short-Form Clips

Short-form video platforms like TikTok, Instagram Reels, and YouTube Shorts have exploded in popularity. Creators and businesses now want a way to automatically extract the most viral moments from long videos — without manually watching hours of footage.

That’s exactly why I built Auto Clipper AI.

This tool ingests a long YouTube video, transcribes it, finds the most interesting segments using AI, enhances the clip visually (blurred background, proper aspect ratio), and even runs it through a PyCaps video template to make it look like a professionally edited short.

This post walks through the architecture, the libraries used, and a full breakdown of how the code works end-to-end.

Why Build an Auto-Clipper?

Manually editing short-form content is:

  • slow

  • repetitive

  • difficult to scale

  • dependent on editing skills

For newsletters, podcasts, YouTube channels, and brand agencies, this becomes a bottleneck.

Auto Clipper AI solves this by automating:

  1. Video download

  2. Speech-to-text transcription

  3. Sentence and semantic segmentation

  4. AI scoring to detect viral moments

  5. Clip extraction

  6. Dynamic blurred-background formatting

  7. Running PyCaps templates to create clean, engaging shorts

This brings the workflow close to “Paste YouTube link → Get viral clips”.

 

Libraries Used & Their Role

  • yt-dlp: downloads YouTube videos in the highest available resolution

  • Whisper: generates accurate transcripts with word-level timestamps

  • OpenAI Embeddings: computes sentence similarity for semantic merging

  • OpenAI Chat Completions: scores segments for virality (1–10)

  • NumPy: vector math for semantic-coherence similarity

  • MoviePy: extracts clips, resizes, blurs the background, composites the final video

  • OpenCV (cv2): applies the Gaussian blur to the background clip

  • PyCaps (pycaps): adds animations, subtitles, and effects on top of the extracted clip

All these components combine into a fully automated clip-editing pipeline.
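
All of the snippets below come from a single script, so they share one set of imports and one OpenAI client. Roughly, the top of the file looks like this (a sketch assuming MoviePy 1.x for the resize import; the PyCaps names used in step 10 are noted in a comment):

import os
import json

import cv2
import numpy as np
import whisper
import yt_dlp
from openai import OpenAI
from moviepy.editor import VideoFileClip, CompositeVideoClip
from moviepy.video.fx.all import resize

# PyCaps supplies TemplateLoader, FadeIn, EventType, and ElementType (used in step 10);
# check the pycaps docs for the exact import paths.

client = OpenAI()  # reads OPENAI_API_KEY from the environment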

 

Code Architecture Overview

Below is the high-level flow of the pipeline:

YouTube URL
    ↓
download_video()
    ↓
transcribe_video()  --> Whisper transcript w/ word timestamps
    ↓
sentence_segments() --> Convert transcript into clean sentences
    ↓
semantic_coherence() --> Merge related sentences
    ↓
sliding_window_segments() --> 20–40 sec blocks for clip candidates
    ↓
select_viral_clips() --> GPT scores segments for virality
    ↓
extend_to_sentence_boundaries() --> Natural clip edges
    ↓
save_clip_() --> Extract + blur background + center main video
    ↓
run_pycaps_pipeline() --> Add template animation + styling
    ↓
Final Viral Clip

Each step intentionally cleans, merges, scores, and prepares segments so the final clip feels natural and engaging.

 

Step-by-Step Breakdown of the Code


1. Downloading the YouTube Video

def download_video(youtube_url, output_path="video.mp4"):
    ydl_opts = {'format': 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/mp4',
                'outtmpl': output_path}
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([youtube_url])
    return output_path

This downloads the best available video+audio combination. The file becomes the starting point for transcription.
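
If you also want the video title or duration (handy for naming the output clips later), yt-dlp can hand back that metadata from the same call. A small variation, with a hypothetical name, might look like:

def download_video_with_info(youtube_url, output_path="video.mp4"):
    ydl_opts = {'format': 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/mp4',
                'outtmpl': output_path}
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        # extract_info(download=True) downloads the file and returns the metadata dict
        info = ydl.extract_info(youtube_url, download=True)
    return output_path, info.get("title"), info.get("duration")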


2. Transcription with Whisper

def transcribe_video(video_path):
    model = whisper.load_model("tiny", device="cuda")
    result = model.transcribe(video_path, word_timestamps=True)
    return result["segments"]  # list of segments, each with word-level timestamps

Whisper returns segments with word-level timestamps, which are crucial for building natural clips.

Each segment is stored as:

{
 "start": 0.0,
 "end": 3.5,
 "text": "Welcome back to the channel...",
 "words": [...]
}
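
The snippet hard-codes the tiny model on CUDA. On a machine without a GPU, or when transcript accuracy matters more than speed, a small variation works (my adjustment, not the original code):

import torch

def transcribe_video(video_path):
    # Fall back to CPU when CUDA isn't available; "small" trades speed for accuracy.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = whisper.load_model("small", device=device)
    result = model.transcribe(video_path, word_timestamps=True)
    return result["segments"]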

3. Sentence Splitting

def sentence_segments(full_transcript):
    sentences = []
    buffer = []
    start_time = None
    for seg in full_transcript:
        for w in seg.get("words", []):
            if start_time is None:
                start_time = w["start"]
            buffer.append(w)
            if w["word"].strip().endswith((".", "!", "?")):
                end_time = w["end"]
                sentence_text = " ".join([x["word"] for x in buffer]).strip()
                sentences.append({
                    "start": start_time,
                    "end": end_time,
                    "text": sentence_text,
                    "words": buffer
                })
                buffer = []
                start_time = None
    return sentences

This groups words into full sentences whenever punctuation (. ! ?) appears.

This gives clean, meaningful chunks that work better for semantic merging.
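
To make the output concrete, here is a quick check with a hand-made transcript segment (the timings are invented for illustration):

sample = [{
    "start": 0.0, "end": 1.4,
    "text": "Welcome back to the channel.",
    "words": [
        {"word": "Welcome", "start": 0.0, "end": 0.4},
        {"word": "back", "start": 0.4, "end": 0.7},
        {"word": "to", "start": 0.7, "end": 0.8},
        {"word": "the", "start": 0.8, "end": 1.0},
        {"word": "channel.", "start": 1.0, "end": 1.4},
    ],
}]

print(sentence_segments(sample))
# -> [{'start': 0.0, 'end': 1.4, 'text': 'Welcome back to the channel.', 'words': [...]}]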

4. Semantic Coherence: Merging Related Sentences

def semantic_coherence(sentences, threshold=0.78):
    merged = []
    i = 0
    while i < len(sentences):
        start = sentences[i]["start"]
        text = sentences[i]["text"]
        end = sentences[i]["end"]

        # get current embedding
        emb1 = client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

        # merge forward while semantically similar
        j = i + 1
        while j < len(sentences):
            emb2 = client.embeddings.create(model="text-embedding-3-small", input=sentences[j]["text"]).data[0].embedding
            sim = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
            if sim < threshold:
                break
            text += " " + sentences[j]["text"]
            end = sentences[j]["end"]
            j += 1
            emb1 = emb2  # move forward
        merged.append({"start": start, "end": end, "text": text})
        i = j
    return merged

Here you:

  • take each sentence

  • compute embeddings

  • check semantic similarity with the next sentence

  • merge until similarity drops below the threshold

This removes abrupt transitions and gives natural topic clusters.
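
One practical tweak: the loop above makes one embeddings request per sentence, which gets slow and costly on long videos. A possible optimization, not in the original code, is to embed every sentence in a single batched request up front and reuse the vectors during merging:

def embed_all(sentences, model="text-embedding-3-small"):
    """Embed all sentence texts in one API call (the endpoint accepts a list of inputs)."""
    response = client.embeddings.create(model=model, input=[s["text"] for s in sentences])
    return [np.array(item.embedding) for item in response.data]

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))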

5. Sliding Window Segments

def sliding_window_segments(segments, window=20, stride=15, min_len=8, max_len=40):
    merged = []
    video_end = segments[-1]["end"]
    t = 0

    while t < video_end:
        window_start = t
        window_end = min(t + window, video_end)

        selected = [seg for seg in segments if seg["end"] > window_start and seg["start"] < window_end]
        if not selected:
            t += stride
            continue

        actual_start = selected[0]["start"]
        actual_end = selected[-1]["end"]

        # collect all words
        words = []
        for seg in selected:
            words.extend(seg.get("words", []))

        while (actual_end - actual_start) < min_len:
            next_idx = segments.index(selected[-1]) + 1
            if next_idx < len(segments):
                selected.append(segments[next_idx])
                actual_end = selected[-1]["end"]
                words.extend(selected[-1].get("words", []))
            else:
                break

        if (actual_end - actual_start) > max_len:
            actual_end = actual_start + max_len
            # trim words too
            words = [w for w in words if w["start"] < actual_end]

        merged.append({
            "start": actual_start,
            "end": actual_end,
            "text": " ".join([seg["text"] for seg in selected]).strip(),
            "words": words
        })

        t += stride

    return merged

Since viral shorts are typically 8–40 seconds, this function:

  • uses a rolling 20-sec window

  • overlaps windows (stride=15 sec)

  • expands short segments

  • trims long ones

This produces many candidate viral clips.
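
As a rough example, a 10-minute (600-second) video with stride=15 yields about 600 / 15 = 40 overlapping candidate windows, each nudged into the 8–40 second range before scoring.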

6. Scoring Clips with GPT

def gpt_score_segment(text):
    prompt = f"""
    You are a content virality expert.
    Rate the following transcript segment on a scale of 1–10 for its potential as a viral short-form video clip 
    (TikTok/Instagram Reels/YouTube Shorts). Consider hook strength, emotional impact, relatability, curiosity, and 
    whether it stands alone without extra context.

    Only return a JSON object with this structure:
    {{"score": <number>, "reason": "<brief reason>"}}

    Segment:
    \"\"\"{text}\"\"\"
    """

    response = client.chat.completions.create(
        model="gpt-4o-mini",   # fast/cheap model
        messages=[{"role": "user", "content": prompt}],
        max_completion_tokens=150,
        response_format={"type": "json_object"}  # ✅ ensures valid JSON
    )

    content = response.choices[0].message.content
    result = json.loads(content)   # guaranteed valid JSON
    return result
{"score": 9.2, "reason": "Strong hook and emotional appeal."}

It evaluates:

  • hook strength

  • curiosity

  • relatability

  • emotional pull

  • standalone clarity
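
Since each candidate triggers an API call, a transient failure or an unparsable response shouldn't kill the whole run. A small wrapper along these lines (my addition, with a hypothetical name) keeps the pipeline moving:

def safe_score_segment(text, default_score=1):
    """Fall back to a neutral score if the API call or JSON parsing fails."""
    try:
        return gpt_score_segment(text)
    except Exception as err:
        return {"score": default_score, "reason": f"scoring failed: {err}"}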

7. Selecting Non-Overlapping Viral Clips

def select_viral_clips(segments, n=3, min_gap=8):
    """
    Pick top n clips, ensuring they don't overlap too closely.
    
    :param segments: list of {start, end, text}
    :param n: number of clips to return
    :param min_gap: minimum gap in seconds between chosen clips
    """
    scored_segments = []
    for seg in segments:
        result = gpt_score_segment(seg["text"])
        scored_segments.append({
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"],
            "words": seg.get("words", []),   # ✅ keep words
            "score": result["score"],
            "reason": result["reason"]
        })

    # Sort by score (highest first)
    scored_segments.sort(key=lambda x: x["score"], reverse=True)

    chosen = []
    for candidate in scored_segments:
        # Check distance from already chosen
        too_close = False
        for c in chosen:
            if abs(candidate["start"] - c["start"]) < min_gap:
                too_close = True
                break
        if not too_close:
            chosen.append(candidate)
        if len(chosen) >= n:
            break

    return chosen

This picks the top N scored clips while making sure they aren't too close together in time, which keeps the final selection varied.
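
One subtlety: the distance check only compares start times, so two long clips could still overlap in the middle. If that matters for your content, a stricter interval test could replace it (a sketch, not part of the original code):

def overlaps(candidate, chosen_clip, min_gap=8):
    """True if the two time ranges overlap or come within min_gap seconds of each other."""
    return (candidate["start"] < chosen_clip["end"] + min_gap and
            chosen_clip["start"] < candidate["end"] + min_gap)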

8. Natural Sentence Boundaries


def extend_to_sentence_boundaries(segments, full_transcript, max_len=300):
    """
    Extend viral segments to nearest sentence boundaries or silence gaps.
    Keeps clips natural and not cut mid-sentence.
    """
    def find_prev_boundary(time):
        prev_segments = [s for s in full_transcript if s["end"] <= time]
        for seg in reversed(prev_segments):
            text = seg["text"].strip()
            if text.endswith((".", "!", "?")):
                return seg["end"]
        return max(0, time - 2)  # fallback: extend 2s before

    def find_next_boundary(time):
        next_segments = [s for s in full_transcript if s["start"] >= time]
        for seg in next_segments:
            text = seg["text"].strip()
            if text.endswith((".", "!", "?")):
                return seg["end"]
        return time + 2  # fallback: extend 2s after

    extended = []
    for seg in segments:
        start = find_prev_boundary(seg["start"])
        end = find_next_boundary(seg["end"])

        # keep within video and cap max length
        if end - start > max_len:
            end = start + max_len

        # collect words within the extended region
        words = []
        for s in full_transcript:
            for w in s.get("words", []):
                if start <= w["start"] <= end:
                    words.append(w)

        full_text = " ".join([
            s["text"] for s in full_transcript
            if s["end"] >= start and s["start"] <= end
        ]).strip()

        extended.append({
            "start": start,
            "end": end,
            "text": full_text,
            "words": words,
            "score": seg["score"],
            "reason": seg["reason"]
        })

    return extended

This prevents cuts in the middle of words/sentences.

It finds:

  • nearest previous sentence ending

  • nearest next sentence ending

The result feels human-edited.
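
For example, if a scored window happens to start at 12.4 s in the middle of a sentence, the start is pulled back to the end of the previous complete sentence (say 10.8 s), and the end is pushed forward to the next sentence-ending punctuation (the timings here are illustrative, not from the demo video).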

9. Video Extraction + Blurred Background


def save_clip_(video_path, selected_segments, output_dir="clips", blur_intensity=75):
    os.makedirs(output_dir, exist_ok=True)
    clip_paths = []
    
    target_w, target_h = 1080, 1920
    
    with VideoFileClip(video_path) as video:
        for i, segment in enumerate(selected_segments):
            with video.subclip(float(segment["start"]), float(segment["end"])) as clip:
                # Calculate the maximum size for main clip while maintaining aspect ratio
                clip_aspect = clip.w / clip.h
                target_aspect = target_w / target_h
                
                if clip_aspect > target_aspect:
                    # Clip is wider than target - fit to width
                    main_clip = resize(clip, width=target_w)
                else:
                    # Clip is taller than target - fit to height
                    main_clip = resize(clip, height=target_h)
                
                # Create background - scale original clip to fill entire frame
                if clip_aspect > target_aspect:
                    bg_clip = resize(clip, height=target_h)
                else:
                    bg_clip = resize(clip, width=target_w)
                
                # Crop background to exact dimensions
                bg_clip = bg_clip.crop(
                    x_center=bg_clip.w/2,
                    y_center=bg_clip.h/2,
                    width=target_w,
                    height=target_h
                )
                
                # Apply blur
                bg_clip = bg_clip.fl_image(lambda image: cv2.GaussianBlur(image, (blur_intensity, blur_intensity), 0))
                
                # Composite - center the main clip
                final = CompositeVideoClip([
                    bg_clip,
                    main_clip.set_position(("center", "center"))
                ])
                
                out_path = os.path.join(output_dir, f"clip_{i+1}_score{segment['score']}.mp4")
                final.write_videofile(out_path, codec="libx264", audio_codec="aac", 
                                    threads=8, preset="ultrafast", bitrate="2000k")
                
                # Cleanup
                bg_clip.close()
                main_clip.close()
                final.close()
                
                clip_paths.append({
                    "path": out_path,
                    "score": segment["score"],
                    "reason": segment["reason"],
                    "text": segment["text"]
                })
    
    return clip_paths
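
One OpenCV detail worth noting: cv2.GaussianBlur requires an odd kernel size, so the default blur_intensity=75 works, but an even value would raise an error. A defensive guard before the blur call (my addition) avoids the surprise:

# Force the Gaussian kernel size to be odd, as OpenCV requires.
if blur_intensity % 2 == 0:
    blur_intensity += 1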

10. Add PyCaps Template (Animations, Effects, Captions)

def run_pycaps_pipeline(input_vid):
    # Load a template and configure it
    builder = TemplateLoader("hype").with_input_video(input_vid).load(False)
    # Programmatically add an animation
    builder.add_animation(
        animation=FadeIn(),
        when=EventType.ON_NARRATION_STARTS,
        what=ElementType.SEGMENT
    )
    # Build and run the pipeline
    pipeline = builder.build()
    pipeline.run()

This loads a template. The "hype" template may include:

  • dynamic captions

  • waveform visualizer

  • emoji reactions

  • progress bar

Putting It All Together: The Full Pipeline

if __name__ == "__main__":
    url = "https://www.youtube.com/watch?v=JVtUcM1sWQw"

    video_path = download_video(url)
    segments = transcribe_video(video_path)

    sentences = sentence_segments(segments)
    semantic_blocks = semantic_coherence(sentences)

    windows = sliding_window_segments(semantic_blocks, window=120, stride=30)
    viral_segments = select_viral_clips(windows, n=1)

    natural_segments = extend_to_sentence_boundaries(viral_segments, segments)
    clip_files = save_clip_(video_path, natural_segments)

    run_pycaps_pipeline(clip_files[0]["path"])
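
The demo only pushes the first clip through PyCaps. If you want every selected clip templated, the last line can become a loop (a straightforward extension, not in the original script):

for clip in clip_files:
    run_pycaps_pipeline(clip["path"])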

Final Thoughts

Auto Clipper AI is more than a script — it’s a full intelligent video-editing pipeline that:

  • understands speech

  • detects meaningful content

  • evaluates virality potential

  • creates polished short-form video

With PyCaps + Whisper + MoviePy + GPT scoring, it becomes easy to repurpose long videos into multiple engaging clips.

If you’re building a SaaS, tool, or automation pipeline around short-form content — this architecture is production-ready.
