AI Music Video With Lyrics

Create a music video where readable lyrics and visual generation work together instead of competing for attention. The prompts on this page show formats that integrate on-screen words with scene design, pacing, and music-driven motion.

Video
GLOBAL LOCK: 
Subject is a Caucasian male singer, mid-20s, with long wavy brown hair, a light beard/mustache, wearing a brown knit beanie and a dark shirt. He performs into a vintage silver condenser microphone. 
Secondary subjects are two Caucasian children: a young girl with long brown hair in a tan beanie and white sweater, and a young boy with short blonde hair in a white long-sleeve shirt. 
Environment is a surreal, minimalist "all-white" world. Locations include a white living room with white sofas, a white snowy forest with white-barked trees, and a white boat on a vast white sea with cloud-like waves. 
Lighting is high-key, soft, and directional, creating a cinematic editorial look. 
Color grade is heavily desaturated, almost monochromatic white and grey, with very high contrast. 
Camera language is cinematic with shallow depth of field for close-ups and wide, sweeping shots for environments. 
Speech is emotional male vocals, high lip-sync strictness required for the singer.

[00:00–00:10]
Close-up of the male singer performing into the vintage mic, eyes closed, emotional expression. Cut to a medium shot of the young girl from behind, looking at a boy sitting on a white sofa in a completely white living room. Soft white light floods the scene.

[00:11–00:23]
Medium shot of the girl in the beanie looking directly at the camera with a slight smile. Cut to the boy smiling. The children are then seen from behind, walking through a doorway into a surreal white forest where the ground and trees are covered in white paper-like snow.

[00:24–00:42]
A split-screen or trio shot showing the singer and the two children singing together. Close-up of the singer with tattoos visible on his arms. In the white forest, a raccoon peeks from behind a white tree, followed by a shot of a large brown bear walking through the white woods.

[00:43–01:08]
Low angle shot looking up at the two children sitting on a large white tree branch against a bright white sky. The singer is shown in a side profile close-up, singing intensely. The children look out at the horizon.

[01:09–01:40]
Wide shot of the children walking across a white rope bridge in the forest. Cut to a close-up of the singer's mouth at the mic. The children are now in a small white wooden boat, rowing through a sea of white, turbulent, cloud-like waves. The boy rows with a wooden oar.

[01:41–02:00]
Dynamic underwater shots. The girl is submerged in dark blue-grey water, looking up toward the light. The boy is also shown underwater, struggling slightly. Intercut with the singer shouting the lyrics with high intensity, face close to the mic.

[02:01–02:27]
Close-up of the singer's face, looking weary but peaceful. The children are seen lying down on a white surface, then looking out at a vast, infinite white ocean where the water and sky blend into one. Final extreme wide shot of the tiny boat in the middle of the white void.

NEGATIVE PROMPT: 
Vibrant colors, saturated tones, messy backgrounds, robotic lip-sync, facial distortion, inconsistent hair length, floating objects, digital noise, blurry textures, multiple beanies on one head, extra limbs, unnatural eye movements, flickering lighting.

SPEECH PACK:
[00:00-00:10] "I have your number in my phone, but I sit here all alone"
TAKE_A: (Melancholic, soft, slow)
TAKE_B: (Breathier, intimate)
TAKE_C: (Slightly more rhythmic)

[01:10-01:25] "I don't understand this life, it cuts me like a rusty knife"
TAKE_A: (Powerful, belting, high emotion)
TAKE_B: (Desperate, strained)
TAKE_C: (Angry, punchy)

[01:40-01:55] "No more talking, no more pride, just the emptiness inside"
TAKE_A: (Screaming/Shouting, high energy)
TAKE_B: (Gravelly, intense)
TAKE_C: (Vibrato-heavy, soaring)
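
The speech-pack cues above use their own time ranges, which do not always line up with the shot segments. A minimal sketch of how one might check which shot segments a cue falls into is shown here. The function names `parse_range` and `overlapping_segments` are hypothetical helpers, and treating both endpoints as inclusive whole seconds is an assumption based on how the segment headers are written.

```python
import re

# Matches "[MM:SS-MM:SS]" with either a hyphen or an en dash between the times.
TIMECODE = re.compile(r"\[(\d{2}):(\d{2})[-–](\d{2}):(\d{2})\]")


def parse_range(text):
    """Return (start_seconds, end_seconds) for a '[MM:SS-MM:SS]' header."""
    m = TIMECODE.search(text)
    if m is None:
        raise ValueError(f"no timecode range in {text!r}")
    m1, s1, m2, s2 = (int(g) for g in m.groups())
    return m1 * 60 + s1, m2 * 60 + s2


def overlapping_segments(cue, segments):
    """List the segment headers whose range overlaps the cue's range.

    Assumption: endpoints are inclusive, so a cue starting at 01:40
    still touches a segment that ends at 01:40.
    """
    c_start, c_end = parse_range(cue)
    hits = []
    for seg in segments:
        s_start, s_end = parse_range(seg)
        if s_start <= c_end and c_start <= s_end:
            hits.append(seg)
    return hits


# Segment headers and speech-pack cues copied from the prompt above.
segments = [
    "[00:00–00:10]", "[00:11–00:23]", "[00:24–00:42]", "[00:43–01:08]",
    "[01:09–01:40]", "[01:41–02:00]", "[02:01–02:27]",
]
cues = ["[00:00-00:10]", "[01:10-01:25]", "[01:40-01:55]"]

for cue in cues:
    print(cue, "->", overlapping_segments(cue, segments))
```

Under these assumptions the last cue touches two segments, which may be worth knowing when deciding where a lyric line should appear on screen.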
Video
Simon Meyer
GLOBAL LOCK: 
Subject is a Caucasian male in his late 20s with long, wavy brown hair, a light beard, and expressive eyes. He consistently wears a brown knit beanie and a dark, textured jacket. The environment alternates between a dark, moody studio with a vintage silver condenser microphone and a surreal, minimalist "white world" featuring paper-like textures, white trees, and a desaturated sea. Lighting is cinematic with high contrast on the singer and soft, high-key white diffusion in the fantasy scenes. The color grade is desaturated with a heavy film grain texture. Audio is a mid-tempo emotional rock song with clear, soulful male vocals.

[00:00–00:05]
Close-up of the male singer singing directly into a vintage silver microphone. He has a focused, emotional expression. Transition to a medium shot of a young girl with long hair and a beanie, and a young boy, in a white room with minimalist white furniture. The room is bathed in soft, high-key white light.
Speech: "I have your number in my phone, but I"
Lip-sync: High strictness on the singer's mouth.

[00:06–00:11]
Close-up of the young boy smiling slightly, then a close-up of the singer. The singer's eyes are partially closed in passion. 
Speech: "sit here all alone. I don't know why we"
Lip-sync: High strictness.

[00:12–00:23]
Wide shot of the two children in the all-white room; they look small against the vast white space. Cut back to the singer in a tight close-up, his head moving slightly with the rhythm. The background behind him is dark and blurry.
Speech: "don't speak, my heart is tired and my legs are weak. We were brothers, we"
Lip-sync: High strictness.

[00:24–00:33]
Medium shot of the singer with two blurry band members in the background, all wearing beanies. Cut to the children walking through a doorway into a bright white, snowy-looking exterior.
Speech: "were one, now the light is all but gone. I don't know who you are"
Lip-sync: High strictness.

[00:34–00:46]
Close-up of the singer, then a shot of the children from behind walking into a forest of stylized white paper trees. A small raccoon peeks from behind a tree.
Speech: "today, you're just a ghost that walked away. I don't understand this"
Lip-sync: High strictness.

[00:47–01:08]
The children are climbing the white trees. The girl looks down and smiles. Cut to the singer singing with intensity. The camera pans slightly around him.
Speech: "life, it cuts me like a rusty knife. I'm young in years but old in mind, you're the"
Lip-sync: High strictness.

[01:09–01:30]
Wide shot of the children rowing a small white wooden boat on a vast, calm, white sea under a cloudy white sky. Cut to extreme close-ups of the singer's face, showing skin texture and sweat.
Speech: "one I left behind. I see you posting on the screen, but I don't know"
Lip-sync: High strictness.

[01:31–02:02]
The scene shifts underwater. The children are swimming in dark, clear blue water with rays of light piercing through the surface. They move gracefully. Intercut with the singer hitting high notes, his mouth wide open.
Speech: "what you mean. A different life, a different place. I don't recognize your face. The world is turning way too fast and nothing good is built to last."
Lip-sync: High strictness.

[02:03–02:28]
Close-up of the boy's eye, then the two children lying head-to-head on a white surface. Final wide shot of the boat as a tiny speck on the white horizon as the sun (or a bright light) sets.
Speech: "I'm sitting here, I'm 23, but 80 years is all I see. I don't understand this life."
Lip-sync: High strictness.

NEGATIVE PROMPT: 
Visual: Cartoonish features, saturated colors, morphing limbs, floating objects, inconsistent facial features, blurry textures, digital noise, modern clothing on children, colorful backgrounds.
Speech: Robotic tone, autotune artifacts, mismatched lip-sync, muffled audio, background hiss, unnatural breathing sounds, flat delivery.

SPEECH PACK:
[00:00-00:11] "I have your number in my phone, but I sit here all alone."
TAKE_A: Melancholic, slow, breathy.
TAKE_B: Frustrated, slightly faster.
TAKE_C: Resigned, flat intonation.

[00:12-00:33] "I don't know why we don't speak, my heart is tired and my legs are weak."
TAKE_A: Increasing volume, emotional strain.
TAKE_B: Whispered, intimate.
TAKE_C: Rhythmic, emphasizing "tired" and "weak".

[01:09-01:30] "I see you posting on the screen, but I don't know what you mean."
TAKE_A: Confused, slightly bitter.
TAKE_B: High energy, rock belt.
TAKE_C: Melodic, smooth transitions.
Video
A) MISE EN PLACE
2) Segment the video into scenes/shots:
- [00:00-00:02]: Shot 1, Medium close-up, man singing.
- [00:02-00:04]: Shot 2, Wide shot, woman floating.
- [00:04-00:06]: Shot 3, Extreme close-up, mouth and mic.
- [00:06-00:09]: Shot 4, Medium shot, B&W, three clones.
- [00:09-00:10]: Shot 5, Close-up, woman in hat.
- [00:10-00:12]: Shot 6, Medium shot, man singing.
- [00:12-00:14]: Shot 7, Wide shot, woman walking in field.
- [00:14-00:16]: Shot 8, Medium close-up, man singing.
- [00:16-00:17]: Shot 9, Medium shot, man driving.
- [00:17-00:18]: Shot 10, Medium wide, drummer on roof.
- [00:18-00:20]: Shot 11, Medium shot, man driving.
- [00:20-00:21]: Shot 12, Medium close-up, man singing.
- [00:21-00:22]: Shot 13, Medium shot, woman in field.
- [00:22-00:25]: Shot 14, Medium shot, B&W, three clones.
- [00:25-00:27]: Shot 15, Medium close-up, man smoking.

3) Extract visual evidence:
- Keyframes: Blonde man singing (00:01), Woman floating (00:03), Mouth close-up (00:05), B&W clones (00:07), Man driving (00:16), Drummer (00:17), Man smoking (00:26).

4) Extract speech evidence:
- The audio is a continuous pop-rock song with male vocals.
- Transcript: "Just to show that it'll be fine / And when I'm back in Chicago I feel it / Another version of me I was in it / I wake up back to the end / I feel it"
- Lip visibility: High in singing shots. Strict lip-sync required.

5) Invariants list:
- Visuals: Protagonist (Caucasian male, early 30s, short blonde hair, black shirt), cinematic lighting, 24fps motion blur, anamorphic lens feel.
- Speech: Continuous song, male vocal, energetic delivery.

6) Variables list:
- Visuals: Locations (city rooftop, field, car), secondary characters (woman, drummer), color grade (warm sunset vs cool night vs B&W).

B) SHOTLIST
[00:00–00:02]
- framing: MCU, eye level.
- lens: 50mm, shallow depth of field.
- camera movement: Slow push-in.
- subject: Blonde male, singing passionately into vintage mic.
- environment: Outdoor, blurred city skyline.
- lighting: Warm golden hour, directional from right.
- color grade: Teal and orange, high contrast.
- SPEECH: Male vocal, singing "Just to show that it'll be fine". Strict lip-sync.

[00:02–00:04]
- framing: WS.
- lens: 35mm.
- camera movement: Slow horizontal tracking.
- subject: Woman in white dress, floating horizontally.
- environment: Grassy field at dusk.
- lighting: Soft sunset.
- SPEECH: Song continues, no on-camera lip-sync.

[00:04–00:06]
- framing: ECU.
- lens: Macro.
- camera movement: Static.
- subject: Blonde male's mouth and vintage mic.
- lighting: Warm, high contrast.
- SPEECH: Male vocal, singing "And when I'm back in Chicago". Strict lip-sync.

[00:06–00:09]
- framing: MS.
- lens: 50mm.
- camera movement: Static.
- subject: Three identical clones of blonde male, singing into one mic.
- environment: Studio backdrop.
- lighting: High contrast, retro.
- color grade: Black and white.
- SPEECH: Male vocal, singing "I feel it / Another version of me". Strict lip-sync for all three.

[00:16–00:17]
- framing: MS.
- lens: 35mm.
- camera movement: Mounted on hood, slight vibration.
- subject: Blonde male driving classic convertible.
- environment: City street at night.
- lighting: Cool streetlights, warm dashboard practicals.
- SPEECH: Song continues, no on-camera lip-sync.

[00:25–00:27]
- framing: MCU.
- lens: 50mm.
- camera movement: Slow pan right.
- subject: Blonde male smoking cigarette, exhaling.
- environment: City rooftop at dusk.
- lighting: Cool cinematic.
- SPEECH: Song ends, instrumental fade.

C) STYLE BIBLE
- visual_style: Cinematic music video.
- camera_signature: Anamorphic lenses, smooth tracking, shallow depth of field.
- lighting_signature: High contrast, motivated sources (sunset, streetlights).
- grade_signature: Teal and orange for city, warm golden for fields, stark B&W for studio shots.
- texture_signature: Film grain, 24fps motion blur.
- SPEECH STYLE BIBLE: Energetic pop-rock male vocal, clear articulation, studio-quality mix.

D) PROMPT SYNTHESIS

1. MASTER PROMPT
GLOBAL LOCK: A cinematic music video featuring a consistent protagonist: a Caucasian male in his early 30s, short styled blonde hair, wearing a black collared shirt. The visual style is photorealistic, shot on anamorphic lenses with a 24fps filmic motion blur. The camera work is dynamic, with smooth tracking. The audio is a pop-rock song with clear male vocals.

[00:00–00:02] Medium close-up. The blonde male protagonist stands outdoors against a blurred city skyline at sunset. He is singing passionately into a vintage silver condenser microphone. Warm, golden-hour lighting hits his face from the right. The camera slowly pushes in. Strict lip-sync to the lyrics "Just to show that it'll be fine".
[00:02–00:04] Wide shot. A young woman with long brown hair, wearing a flowing white dress, floats horizontally above a grassy field at dusk. The lighting is soft and ethereal. The camera tracks her movement slowly.
[00:04–00:06] Extreme close-up. Profile shot of the blonde male protagonist's mouth and the vintage microphone. He is singing, lips perfectly synced to the lyrics "And when I'm back in Chicago". The background is completely out of focus.
[00:06–00:09] Medium shot, black and white. Three identical clones of the blonde male protagonist stand close together, all singing into a single vintage microphone in the center. The lighting is high-contrast, reminiscent of classic 1960s music videos. The camera is static. Strict lip-sync to "I feel it / Another version of me".
[00:09–00:10] Close-up. A young woman with freckles, wearing a straw hat, looks softly off-camera. Warm sunlight illuminates her face. The background is a blurred field.
[00:10–00:12] Medium shot. The blonde male protagonist singing passionately into the vintage microphone, city skyline in the background. Warm sunset lighting. The camera slightly pans left. Strict lip-sync.
[00:12–00:14] Wide shot. A woman with long brown hair, wearing a white dress and a wide-brimmed hat, walks away from the camera through a field of tall grass and flowers at sunset. The camera follows her slowly.
[00:14–00:16] Medium close-up. The blonde male protagonist singing intensely into the vintage microphone, city skyline background. The camera pushes in quickly. Strict lip-sync.
[00:16–00:17] Medium shot. The blonde male protagonist is driving a classic convertible car at night. The city lights blur in the background. He is looking forward, illuminated by dashboard lights and passing streetlights. The camera is mounted on the hood, facing him.
[00:17–00:18] Medium wide shot. A different man, with dark hair and a beard, is energetically playing a drum set on a city rooftop at dusk. The camera pans around him.
[00:18–00:20] Medium shot. The blonde male protagonist driving the convertible at night. He turns his head slightly to look towards the camera. City lights streak by.
[00:20–00:21] Medium close-up. The blonde male protagonist singing into the vintage microphone, city skyline background. Strict lip-sync.
[00:21–00:22] Medium shot. The woman in the white dress and straw hat stands in a field of flowers at sunset, smiling gently at the camera.
[00:22–00:25] Medium shot, black and white. The three clones of the blonde male protagonist singing into the vintage microphone. The camera slowly pushes in. Strict lip-sync.
[00:25–00:27] Medium close-up. The blonde male protagonist stands on a rooftop with a city skyline behind him at dusk. He is smoking a cigarette, exhaling a cloud of smoke. The lighting is cool and cinematic. The camera slowly pans right.

2. NEGATIVE PROMPT
visual artifacts, anatomy issues, extra fingers, weird motion, text, logos, watermarks, flicker, temporal jitter, morphing faces, inconsistent clothing, robotic movement, unnatural lighting, overexposed highlights, cartoonish style, anime, 3d render. Speech negatives: robotic cadence, unnatural emphasis, slurred words, harsh sibilance, plosives, clipping, lip-sync mismatch, out of sync audio.

4. SPEECH PACK
[00:00-00:02] "Just to show that it'll be fine"
[00:04-00:06] "And when I'm back in Chicago"
[00:06-00:09] "I feel it / Another version of me I was in it"
[00:10-00:12] "I wake up back to the end"
[00:14-00:16] "I feel it"
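
The step from the SHOTLIST fields to the one-line entries in the MASTER PROMPT can be sketched as a small data structure plus a render function. This is only an illustration: the `Shot` class, its field names, and `render` are hypothetical and are not part of any video tool's API. The field order is one reasonable choice, not the method used to write the prompt above.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    """One shotlist entry; empty fields are skipped when rendering."""
    start: str
    end: str
    framing: str
    subject: str
    environment: str = ""
    lighting: str = ""
    camera: str = ""
    speech: str = ""


def render(shot):
    """Join the non-empty fields into a single timecoded prompt line."""
    parts = [shot.framing, shot.subject, shot.environment,
             shot.lighting, shot.camera, shot.speech]
    # Normalise each field so it ends with exactly one period.
    body = " ".join(p.rstrip(".") + "." for p in parts if p)
    return f"[{shot.start}–{shot.end}] {body}"


# Values taken from the first shotlist entry above.
shot = Shot(
    start="00:00",
    end="00:02",
    framing="Medium close-up",
    subject="Blonde male, singing passionately into vintage mic",
    environment="Outdoor, blurred city skyline",
    lighting="Warm golden hour, directional from right",
    camera="Slow push-in",
    speech='Strict lip-sync to "Just to show that it\'ll be fine"',
)

print(render(shot))
```

Keeping the shot data separate from the rendered text makes it easier to reuse the same shots with a different GLOBAL LOCK or to drop the speech field for shots without on-camera singing.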
Video
Milla Sofia
GLOBAL LOCK: A stunning young Caucasian woman with long, wavy blonde hair and striking blue-green eyes. She has a polished, editorial skin texture with natural-looking pores. She is wearing a high-neck, sleeveless cream-colored ribbed turtleneck top. The setting is a dark, minimalist studio with a professional vintage silver condenser microphone on a stand in the foreground. The lighting is cinematic three-point lighting with a strong warm rim light on her hair and shoulders, creating a soft glow. The color grade is clean and warm with high contrast. The camera uses a shallow depth of field (85mm lens feel), keeping the background softly blurred. Speech is a melodic, emotional female singing voice with a studio-quality mic signature.

[00:00–00:03]
The woman is in a close-up shot, looking slightly to the side with a thoughtful expression. She then slowly turns her gaze toward the camera, making direct eye contact. Her lips begin to move in perfect sync with the lyrics: "you stepped into my quiet." Her expression is soft and inviting. The camera has a very slight handheld drift.

[00:03–00:06]
The shot widens slightly to a medium close-up. She tilts her head gently to the left as she continues singing: "shifted all the colors." Her eyes follow the camera, and her facial muscles show subtle emotional tension. The rim light catches the individual strands of her blonde hair as she moves.

[00:06–00:10]
Back to a tight close-up. She sings the line: "in my day / I followed every moment." Her eyebrows lift slightly on the word "followed," adding emphasis. The vintage microphone remains a sharp, metallic element in the lower-middle frame. Her skin reflects the soft key light.

[00:10–00:14]
The camera maintains a steady close-up as she sings: "hoping you were someone." She blinks naturally and maintains a warm, soulful gaze. The background remains a deep, out-of-focus charcoal grey.

[00:14–00:18]
The final shot shows her singing the concluding words: "I could trust." As the music lingers, she breaks into a subtle, genuine smile, looking directly into the lens. The camera slowly zooms in a few millimeters. The lip-sync is high-precision, matching the "t" and "st" sounds perfectly.

NEGATIVE PROMPT: low resolution, blurry face, inconsistent facial features, extra fingers, distorted microphone, flickering lights, unnatural skin glow, plastic texture, robotic mouth movements, messy hair, over-saturated colors, text logos, watermark, jittery camera, mismatched lip-sync, harsh shadows, flat lighting.

SPEECH PACK:
Transcript: "you stepped into my quiet / shifted all the colors in my day / I followed every moment / hoping you were someone I could trust"

TAKE_A (Emotional/Breathy): [breath] you stepped into my quiet... [pause] shifted all the colors in my day... [breath] I followed every moment... hoping you were someone I could **trust**.
TAKE_B (Clear/Studio): You stepped into my quiet / shifted all the colors in my day / I followed every moment / hoping you were someone I could trust.
TAKE_C (Soulful/Vibrato): You stepped into my qui-et... shifted all the co-lors in my day... I followed every mo-ment... hoping you were someone I could trust.

Prosody Notes: Soft onset on "you," emphasis on "shifted" and "trust," slight pause after "day."
Sync Requirements: High strictness on "quiet," "colors," and "trust" for mouth closures.
Mic Signature: Close-mic, warm proximity effect, clean de-essing, light plate reverb.
Video
Milla Sofia
GLOBAL LOCK: 
Subject is a stunning young Caucasian woman with long, wavy platinum blonde hair and piercing blue eyes. She has a slender, athletic build and clear, photorealistic skin texture with visible pores and subtle imperfections. She is wearing a shimmering silver satin slip dress with thin spaghetti straps. The environment is a dark, professional music studio with soft, out-of-focus bokeh lights in the background. Lighting is cinematic: a soft key light from the side creates gentle shadows on her face, and a subtle rim light highlights the edges of her hair. The color grade is cool and editorial, with deep blacks and vibrant silver highlights. Camera is a static medium close-up (MCU) at eye level. Audio is a female pop vocal with a clear, studio-quality mic signature.

[00:00–00:03]
Subject is positioned behind a professional black condenser microphone on a stand. She begins singing with a gentle, focused expression. Her lips move in perfect sync with the words "night lit". Her head tilts slightly to her right. The silver satin of her dress reflects the studio lights.

[00:03–00:07]
Subject continues singing the phrase "with silver light". She closes her eyes briefly for emotional emphasis on the word "silver". Her hands are visible gripping the microphone stand lower down. Subtle movement of her hair as if from a light studio fan.

[00:07–00:11]
Subject looks directly toward the camera (or slightly off-lens) as she sings "I saw it in your eyes". Her expression is soulful and intimate. The lighting remains consistent, emphasizing the sheen of the satin dress.

[00:11–00:15]
Subject sings "the world around us faded out... stars across the sky". She takes a slightly deeper breath between phrases. Her mouth opens wider for the sustained "stars" note. The background bokeh lights remain soft and static. The camera maintains the MCU framing throughout.

NEGATIVE PROMPT:
Visual: cartoonish, 3D render, uncanny valley, distorted facial features, extra fingers, flickering hair, blurry skin, low resolution, watermark, text (except for the intended lyrics), changing wardrobe, inconsistent eye color, floating microphone.
Speech: robotic voice, metallic reverb, misaligned lip-sync, muffled audio, background noise, popping 'P' sounds, unnatural breathing.

SPEECH PACK:
Transcript: "night lit with silver light I saw it in your eyes the world around us faded out stars across the sky"

TAKE_A (Emotional/Breathy):
[00:00-00:03] night lit... (soft)
[00:03-00:07] with silver light... (airy)
[00:07-00:11] I saw it in your eyes... (intimate)
[00:11-00:15] the world around us faded out... stars across the sky (sustained)

TAKE_B (Powerful/Pop):
[00:00-00:03] NIGHT LIT (strong)
[00:03-00:07] with SILVER light (emphasized)
[00:07-00:11] I saw it in your EYES (direct)
[00:11-00:15] the world around us FADED out... STARS across the sky (vocal belt)

Prosody Notes: 
- Pause for 0.5s after "light".
- Emphasis on "silver" and "eyes".
- Elongate the "ah" sound in "stars".
- Mic distance: Close (proximity effect).
- Room tone: Dry studio.
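
Because the takes above are already timecoded, they map naturally onto a subtitle file for the on-screen lyrics. The sketch below converts the TAKE_A timings into SubRip (SRT) text. The helper names `to_srt_time` and `to_srt` are hypothetical, and the input is assumed to use whole-second `MM:SS` timecodes as in this prompt.

```python
def to_srt_time(mmss):
    """Convert 'MM:SS' to the SRT timestamp form 'HH:MM:SS,mmm'."""
    minutes, seconds = (int(x) for x in mmss.split(":"))
    return f"{minutes // 60:02d}:{minutes % 60:02d}:{seconds:02d},000"


def to_srt(cues):
    """Build SRT text from (start, end, lyric) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text}\n"
        )
    # A blank line separates consecutive cues in SRT.
    return "\n".join(blocks)


# Timings and words copied from TAKE_A above; delivery notes are left out.
take_a = [
    ("00:00", "00:03", "night lit"),
    ("00:03", "00:07", "with silver light"),
    ("00:07", "00:11", "I saw it in your eyes"),
    ("00:11", "00:15", "the world around us faded out... stars across the sky"),
]

print(to_srt(take_a))
```

The resulting text can be saved as a `.srt` file and loaded in most editors or players to preview lyric timing against the generated clip.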
Video
Milla Sofia
GLOBAL LOCK: 
Subject is a young Caucasian woman in her mid-20s, Scandinavian features, blonde hair in loose natural waves, athletic build. She is wearing a minimalist black silk slip dress with thin spaghetti straps. The environment is a lush green park or garden during golden hour, with soft warm sunlight and a creamy bokeh background. A professional condenser microphone on a stand is in front of her, and she holds an acoustic guitar. Lighting is cinematic with a strong warm rim light on her hair and soft key light on her face. Color grade is warm, golden, and high-contrast. Pacing is slow and emotional. Speech is a lip-sync to a male vocal track, requiring high-precision mouth movements.

[00:00–00:03]
The woman is positioned in a medium shot, looking slightly off-camera with a soft, contemplative expression. She begins singing the word "Words..." Her mouth opens naturally to match the phoneme. She is holding the guitar, her right hand resting near the strings. The camera is static. The golden hour sun creates a glowing halo around her blonde hair.

[00:03–00:07]
She turns her gaze toward the microphone, singing "...don't come easy to me." Her facial expressions are more animated, showing a slight emotional strain consistent with the lyrics. There is a very subtle digital zoom-in. Her hair sways slightly in a gentle breeze. The lip-sync is tight and perfectly aligned with the breathy delivery of the song.

[00:07–00:12]
She continues singing "How can I find a way..." while her fingers make a subtle strumming motion on the guitar strings. She tilts her head slightly to the left. The background bokeh remains soft and green, with golden light filtering through the trees. Her skin texture is visible but smooth under the warm key light.

[00:12–00:16]
On the lyrics "...to make you see I love you," she looks directly into the camera lens (or just above it), closing her eyes briefly for emphasis on "love." She finishes with a gentle, knowing smile as the phrase ends. The camera maintains the medium-close-up framing. The rim lighting remains consistent, highlighting the silk texture of her dress straps.

NEGATIVE PROMPT: 
Visual: extra fingers, distorted guitar strings, floating microphone, flickering lighting, unnatural skin smoothing, blurry face, temporal jitter, morphing hair, double straps on dress, distorted background objects.
Speech: lip-sync lag, mouth opening wider than the sound, robotic jaw movement, tongue clipping through teeth, frozen facial expressions during singing.

SPEECH PACK:
[00:00–00:03] "Words..."
TAKE_A: (Soft, breathy start, lingering on the 's')
TAKE_B: (Clear, enunciated 'W', short 's')
TAKE_C: (Whisper-like, very little jaw movement)

[00:03–00:07] "...don't come easy to me."
TAKE_A: (Rhythmic, emphasizing 'don't' and 'easy')
TAKE_B: (Melodic, sliding between 'easy' and 'to')
TAKE_C: (Emotional, slight quiver in the lower lip on 'me')

[00:07–00:12] "How can I find a way..."
TAKE_A: (Inquisitive, eyebrows raised slightly)
TAKE_B: (Pleading, head tilt on 'way')
TAKE_C: (Steady, focused on the microphone)

[00:12–00:16] "...to make you see I love you."
TAKE_A: (Warm, direct eye contact on 'you')
TAKE_B: (Soft, eyes closing on 'love')
TAKE_C: (Smiling through the words, joyful delivery)

Prosody Markup: Words (pause) don't come **EASY** to me... How can I find a **WAY**... to make you see I **LOVE** you.
Video
Milla Sofia
GLOBAL LOCK:
Vertical 9:16, 720x1280. One continuous stage performance shot: an adult female singer on an outdoor/large venue stage at night, standing at a microphone on a stand. Background: strong backlights and stage spotlights forming circular bokeh halos, light haze/smoke in the air, dark truss/rigging overhead, cool-gray stage ambience. Subject styling: blonde wavy hair down, glossy natural makeup, small drop earrings, confident and emotional facial performance. Wardrobe: deep red velvet off-shoulder bodycon mini dress, sheer dark tights; elegant, concert-ready look. Camera language: mostly static with a gentle slow push-in and slight handheld micro-sway; mid-to-full body framing from thighs up, occasionally a tiny reframing as she shifts weight. Lens feel: 50–85mm, shallow depth of field, crisp subject with creamy background lights. Lighting: warm rim/back light on hair and shoulders, soft front fill, cinematic contrast.
On-screen text overlay (must match style): large all-caps lyric captions centered low on the frame, bold sans-serif, white letters with thick black stroke and subtle drop shadow. One keyword per line is highlighted in a bright color (yellow/green/red) to add rhythm. Occasionally include a small emoji sticker (eyes/halo face) near the text. No other UI.
Temporal feel: ~30fps, smooth motion blur, no flicker, no face drift, no jitter; mouth shapes match singing.

AUDIO LOCK:
This is a visual tribute performance to a famous emotional pop ballad. Use licensed audio or your own original vocal recording in the same ballad style (do not paste copyrighted lyrics in the captions unless you have rights). Female lead vocal, expressive and breathy, mid tempo, big emotional chorus energy, light plate reverb, clear consonants, no robotic cadence. Background instrumentation: cinematic pop ballad bed (pads + piano + soft drums), consistent loudness, no harsh sibilance.

[00:00–00:05]
Singer starts a phrase with head slightly tilted back, eyes half-closed then opening. Right hand holds the microphone near lips while it remains attached to the stand; left arm relaxed by side. Subtitles appear for the first lyric line: 2-line all-caps caption with one highlighted keyword (bright yellow). Stage lights bloom behind her, haze visible around the beams. Camera: very slow push-in.
SPEECH/AUDIO: singing begins, soft/controlled, emotional but restrained.

[00:05–00:10]
She lowers chin slightly, looks forward past the lens, then glances a touch left. Subtitles update to the next lyric line; one highlighted keyword switches to a different color (red). Keep caption placement consistent (lower center). Maintain steady breathing and natural blink timing.
SPEECH/AUDIO: phrase continues; slightly stronger projection.

[00:10–00:15]
She shifts weight on her feet and subtly rolls a shoulder; microphone hand adjusts grip naturally (no finger warps). Subtitles change again; include a small “eyes” style emoji sticker near the text for emphasis. Background bokeh lights remain in the same positions and intensities; haze is stable.
SPEECH/AUDIO: emotional lift, more resonance, no clipping.

[00:15–00:20]
She turns her head to three-quarter right, mouth opens wider on a sustained note, then relaxes into the next words. Subtitles update; highlighted keyword becomes green. Camera push-in reaches a slightly tighter mid shot while keeping thighs still visible; keep the mic stand and cable visible on the left side.
SPEECH/AUDIO: sustained note, controlled vibrato; reverb tail audible but not washed out.

[00:20–00:26]
She returns gaze forward, soft smile at the end of a phrase, then a serious look as the next begins. Subtitles switch to another line; highlighted keyword becomes yellow again. Add a small halo/angel emoji sticker once, then keep captions clean.
SPEECH/AUDIO: dynamic swell, then gentle pullback; breath between phrases is audible and natural.

[00:26–00:29]
Final phrase of this clip segment: she holds the mic steady, eyes focused, slight chest rise with breath, ending on a calm expression. Subtitles show the final words for this excerpt (paraphrase-only; keep style consistent). End cleanly without glitch; last frame holds briefly for loop.
SPEECH/AUDIO: resolves phrase; no abrupt cut pop.

NEGATIVE PROMPT:
face morphing, identity drift, eye jitter, broken teeth, warped lips, bad lip sync, robotic singing, over-denoise artifacts, harsh sibilance, clipping, pumping compression, flickering stage lights, strobing bokeh, temporal wobble, jittery camera, weird mic stand geometry, missing fingers, extra fingers, melted hands, random logos, random subtitle style changes, unreadable text, wrong font, text misalignment, messy outlines, UI overlays.

SPEECH PACK (speech-first, compliant):
Timecoded lyric intent (do NOT copy copyrighted lyrics verbatim; replace with licensed lyrics or original words matching the same meaning and syllable timing):
[00:00–00:05] TAKE_A: “A line about the desert sky and memory.” TAKE_B: “A line about the night sky and a distant place.” TAKE_C: “A line about a place-name sky and nostalgia.” Prosody: gentle, breathy, rising at the end, slow vibrato on the last vowel.
[00:05–00:10] TAKE_A: “A line about a gaze that feels like fire.” TAKE_B: “A line about eyes that burn with emotion.” TAKE_C: “A line about a look that hits like heat.” Prosody: slightly stronger, emphasize the keyword, short pause before the last word.
[00:10–00:15] TAKE_A: “A line about wanting to hold onto a moment.” TAKE_B: “A line about catching something before it fades.” TAKE_C: “A line about not letting love slip away.” Prosody: punch the first word, then soften; breath before the final syllable.
[00:15–00:20] TAKE_A: “A line comparing a soul to something golden.” TAKE_B: “A line about a spirit that shines like gold.” TAKE_C: “A line about inner light that feels precious.” Prosody: sustained note on the comparison word; warm smile in tone.
[00:20–00:26] TAKE_A: “A line about finding the light inside someone.” TAKE_B: “A line about discovering your light in me.” TAKE_C: “A line about the light you brought out.” Prosody: swell then decrescendo; audible breath between clauses.
[00:26–00:29] TAKE_A: “A closing fragment implying you couldn’t find it before.” TAKE_B: “A closing fragment about never finding it.” TAKE_C: “A closing fragment that resolves the thought.” Prosody: quiet, resolved, end with a soft falling cadence.
Video
Milla Sofia
GLOBAL LOCK:
Subject is a young Caucasian woman in her mid-20s, athletic build, long wavy honey-blonde hair styled in a sporty half-up ponytail. She has bright blue eyes and a warm, approachable expression. Wardrobe is a white ribbed halter-neck crop top and black high-waisted denim jeans. She is holding a professional silver dynamic microphone on a black stand. Environment is an outdoor festival stage during golden hour. Lighting is cinematic with a strong warm rim light from the setting sun on her hair and shoulders, and soft warm fill light on her face. Background is a deep bokeh of stage scaffolding, warm stage lights, and a distant crowd. Color grade is warm, editorial, with high contrast and soft highlight roll-off. Pacing is rhythmic, following the 115 BPM of the song.

[00:00–00:03]
Subject: Medium close-up. She smiles warmly, looking directly into the camera, then slightly off-camera as she begins to sing.
Action: Singing the word "Words...". Mouth movements are fluid and perfectly synced.
Camera: Slight handheld sway, mimicking a professional camera operator.
Lighting: Bright golden hour sun hitting the side of her face.
Speech: Speaker A (On-camera), "Words...", warm melodic tone, high lip-sync strictness.

[00:03–00:07]
Subject: Medium close-up. Her expression shifts to a more soulful, slightly melancholic look.
Action: Singing "don't come easy to me". She tilts her head slightly to the left. Her eyebrows knit together slightly on "easy".
Camera: Slow zoom-in (punch-in) for emotional emphasis.
Lighting: Consistent warm rim light; shadows are soft and detailed.
Speech: Speaker A (On-camera), "don't come easy to me", breathy and emotional delivery, high lip-sync strictness.

[00:07–00:11]
Subject: Medium close-up. She looks up and away from the microphone briefly, then back to the center.
Action: Singing "How can I find a way to make you". Her hand holding the microphone moves slightly with the rhythm.
Camera: Handheld sway continues; focus remains sharp on her eyes.
Motion: Subtle wind blowing through the loose strands of her blonde hair.
Speech: Speaker A (On-camera), "How can I find a way to make you", rising intonation, high lip-sync strictness.

[00:11–00:14]
Subject: Medium close-up. She closes her eyes for a moment on the word "love".
Action: Singing "see I love you". A look of sincere emotion crosses her face.
Camera: Static MCU, shallow depth of field making the background stage lights glow.
Lighting: The sun creates a beautiful flare effect in the corner of the frame.
Speech: Speaker A (On-camera), "see I love you", soft and tender delivery, eyes closed on "love", high lip-sync strictness.

[00:14–00:17]
Subject: Medium close-up. She opens her eyes and smiles again, returning to the upbeat hook.
Action: Singing "Words don't come easy". She gives a small, charming nod to the beat.
Camera: Slight pull-back to the original MCU framing.
Grade: Warm, vibrant colors; skin texture is visible and realistic.
Speech: Speaker A (On-camera), "Words don't come easy", cheerful and melodic, high lip-sync strictness.

NEGATIVE PROMPT:
Visual: distorted facial features, extra fingers, flickering hair, blurry microphone, out-of-sync lips, robotic or stiff movement, low resolution, watermarks, text artifacts, unnatural skin smoothing, popping eyes, temporal jitter in background.
Speech: robotic cadence, unnatural emphasis, slurred words, harsh sibilance, clipping audio, lip-sync mismatch, muffled room tone, inconsistent volume.

SPEECH PACK:
Transcript:
[00:00-00:03] "Words..."
[00:03-00:07] "don't come easy to me"
[00:07-00:11] "How can I find a way to make you"
[00:11-00:14] "see I love you"
[00:14-00:17] "Words don't come easy"

Delivery Takes:
TAKE_A (Original): Melodic, soulful, 80s pop style.
TAKE_B (Acoustic): Slower, more breathy, intimate.
TAKE_C (Energetic): Brighter, more projection, festival vibe.

Prosody:
"Words..." (Long vowel, gentle fade)
"don't come **EASY** to me" (Emphasis on Easy)
"How can I **FIND** a way..." (Emphasis on Find)
"see I **LOVE** you" (Soft, emotional peak)
"Words don't come easy" (Rhythmic, back to hook)
Video
Milla Sofia
GLOBAL LOCK: Subject is a young Caucasian woman, approximately 25 years old, with wavy honey-blonde hair, bright blue eyes, and a fair skin tone with warm undertones. She has a slender build and is wearing a simple black athletic tank top with thin straps. The environment is a professional, dark recording studio booth. A large silver condenser microphone with a black circular pop filter is positioned in front of her. The background is dark and out of focus, suggesting acoustic treatment. Lighting is cinematic Rembrandt style, with a soft key light creating a triangle of light on her cheek and a subtle rim light on her hair. The color grade is warm and high-contrast with deep blacks and soft highlight roll-off. Camera is a static Medium Close-Up (MCU) with a shallow depth of field. Speech is a melodic, emotional female vocal.

[00:00–00:03]
Subject is looking slightly away from the microphone with a dreamy, emotive expression. Her mouth opens naturally to begin singing the words "night lit." Her head tilts slightly to the right. The lighting emphasizes the texture of her blonde hair. Lip-sync is high strictness for the "n" and "l" sounds.

[00:03–00:05]
Subject continues singing "with silver light." Her eyes shift slightly as if visualizing the lyrics. A subtle smile plays on her lips. The camera remains static, capturing the micro-expressions of her eyebrows and eyes.

[00:05–00:08]
Subject sings "I saw it in your eyes." She makes near-direct eye contact with the camera/microphone area, creating a sense of intimacy. Her hand is visible at the bottom left, lightly gripping the silver microphone stand. The movement is fluid and natural.

[00:08–00:11]
Subject sings "the world around us." Her mouth movements are clear and expressive, particularly on the "w" and "r" sounds. Her head moves in a slight rhythmic sway. The background remains a deep, textured black.

[00:11–00:13]
Subject sings "faded out." She closes her eyes halfway, leaning back slightly as if feeling the emotion of the song. The transition in her expression from focused to lost-in-the-moment is smooth.

[00:13–00:15]
Subject sings "stars across the sky." She looks upward, her mouth forming the "s" and "o" shapes clearly. The rim light catches the top of her head and shoulders. The video ends on this upward, hopeful gaze.

NEGATIVE PROMPT: 
Visual: robotic movement, distorted facial features, inconsistent hair color, flickering lights, blurry subject, extra fingers, floating microphone, unnatural skin texture, plastic look, low resolution, watermark, text logos.
Speech: robotic cadence, muffled audio, lip-sync mismatch, popping sounds, harsh sibilance, unnatural pauses, background noise, inconsistent volume.

SPEECH PACK:
Transcript: "night lit with silver light I saw it in your eyes the world around us faded out stars across the sky"

TAKE_A (Emotional/Breathy): [00:00] (breath) night lit... [00:03] with silver light... [00:05] I saw it in your eyes... [00:08] the world around us... [00:11] faded out... [00:13] stars across the sky.
TAKE_B (Powerful/Clear): [00:00] Night lit [00:03] with silver light [00:05] I SAW it in your eyes [00:08] the world around us [00:11] FADED out [00:13] stars across the sky.
TAKE_C (Soft/Whispered): [00:00] night lit [00:03] with silver light [00:05] i saw it in your eyes [00:08] the world around us [00:11] faded out [00:13] stars across the sky.

Prosody Notes: Emphasis on "eyes" and "faded." Soft breath before the first word. Cadence should be melodic and slow.
Mic/Room: Close-mic proximity, dry studio sound with a touch of plate reverb. High clarity.
Video
Milla Sofia
GLOBAL LOCK: A stunning blonde woman in her mid-20s, fair skin with warm undertones, long wavy blonde hair. She wears a lustrous champagne-gold silk camisole with a cowl neck and thin straps, paired with white high-waisted jeans. The setting is a professional stage with warm, out-of-focus golden bokeh lights in the background. A black professional microphone on a stand is in front of her. Cinematic editorial lighting, high contrast, warm golden tones, 4k resolution, photorealistic texture. Speech is a soulful female vocal cover of "Woman in Love."

[00:00–00:03]
The woman is singing the lyrics "the narrow and long." She is holding the microphone stand with her right hand. Her expression is soulful and emotional, eyes looking slightly upward and to the left. Medium shot, static camera. Lighting is warm and directional, highlighting the silk texture of her top. Lips are clearly visible and synced to the words "narrow and long."

[00:04–00:07]
She continues singing "when eyes and the feeling is strong." Her eyes drift toward the camera and then slightly away. A subtle, emotional head tilt occurs. The golden bokeh lights in the background remain soft and circular. The camera remains in a medium shot. Lip-sync is high-strictness for the word "strong," with a clear "o" shape.

[00:08–00:11]
A brief pause in the lyrics. She maintains her pose, looking thoughtful and slightly melancholic. Her hair has very subtle movement as if from a light studio fan. The lighting emphasizes the rim of her hair and the sheen of the silk camisole.

[00:12–00:15]
She sings "I turn away from the wall." Her mouth opens wider to hit the higher notes. Her expression intensifies, showing "quiet strength." The camera remains static, focusing on the micro-expressions of her face and the movement of her throat while singing. Lip-sync is high-strictness for "turn away" and "wall."

NEGATIVE PROMPT: low resolution, blurry, distorted face, inconsistent features, unnatural lip movement, robotic gestures, flat lighting, cool tones, messy hair, cheap fabric texture, flickering lights, text watermarks, cartoonish, 3d render look, jittery motion.

SPEECH PACK:
Transcript: "the narrow and long... when eyes and the feeling is strong... I turn away from the wall"
TAKE_A: Emotional and breathy, slow cadence, emphasis on "long" and "strong."
TAKE_B: Powerful and operatic, wider mouth movements, emphasis on "turn away."
TAKE_C: Soft and intimate, subtle lip movements, emphasis on "eyes."
Prosody: [00:00] the narrow and **long**... [00:04] when **eyes** and the feeling is **strong**... [00:12] I **turn away** from the wall.
Sync: High strictness on all vowel sounds; cuts land on the start of each new phrase.
Video
Milla Sofia
[GLOBAL LOCK]
Vertical 9:16 upbeat acoustic performance clip on a small stage. Single female singer (broad age range early-20s to early-30s), light skin with warm undertone, long blonde hair worn down with a side part, natural glam makeup, silver hoop earrings. Wardrobe: fitted black short-sleeve top with bold geometric cutout/print pattern in white (triangles/zigzags across the chest). Props: acoustic guitar (natural wood body) held at torso; black microphone on a stand positioned left-front, angled to her mouth. Environment: dark blue stage background with bright round bulbs and starburst light beams behind her, creating high-contrast bokeh. Lighting: cool blue backlight with warm key on face; crisp but not harsh; mild haze. Camera: stable medium-close framing (chest-up + guitar), slight telephoto portrait feel (≈50–85mm), minimal movement. On-screen captions: large lyric subtitles centered low in bold ALL-CAPS white with thick black outline; one or two words per line highlighted in bright colors (green/red) and occasional emoji stickers (smile/neutral/shrug) near the text. No watermarks.
Audio intent: energetic cheeky pop/rap singalong (licensed). If unlicensed, create an original upbeat hook with similar cadence.

[MASTER PROMPT]
Create a 13.4s vertical stage performance clip. A blonde singer plays acoustic guitar and sings into a microphone on a stand, with a dark blue stage backdrop and bright starburst bulbs behind her. Keep the camera stable and portrait-tight. Add bold lyric-style captions (ALL CAPS, white with black outline) bottom-centered, with a few highlighted words in green/red and small emoji stickers, timed to phrase changes. Maintain consistent identity, shirt pattern, guitar position, and mic geometry across the entire clip.

[00:00–00:03]
Singer starts upbeat phrase; mouth shapes are clear and playful; she strums lightly. Captions begin with a short “pause the drama / smile for a second” meaning beat (do not reproduce copyrighted lyrics verbatim unless licensed). Emoji sticker appears near the first caption.
SPEECH/AUDIO: lip-sync/singing present; lips visible; lip-sync strictness HIGH if using audio.

[00:03–00:06]
She faces forward, eyes open, a slight grin; strumming continues. Captions update to a rhetorical question meaning “why is everyone so serious?” with one keyword highlighted in green. Background bulbs flare in a radial starburst.
SPEECH/AUDIO: punchy cadence; sync caption change to phrase boundary.

[00:06–00:09]
Slight head tilt and micro brow raise; guitar stays anchored; mic stand remains fixed left. Captions switch to a line meaning “acting so mysterious / throwing shade” with one phrase highlighted in red for emphasis.
SPEECH/AUDIO: cheeky emphasis; lip-sync strictness HIGH.

[00:09–00:11]
She leans a tiny bit toward the mic; eyes track forward. Captions update to a line meaning “you can’t even have a good…” with “YOU” or the emphasized word highlighted in green.
SPEECH/AUDIO: cadence continues; keep consonants crisp.

[00:11–00:13.4]
She lands the last beat with a confident half-smile; strum resolves. Captions hold a final short phrase fragment (paraphrase if unlicensed) and then freeze briefly for loop.
SPEECH/AUDIO: finish phrase; optional tiny breath at the end.

[NEGATIVE PROMPT]
identity drift, changing shirt geometry, warped guitar body, extra fingers, hand sliding incorrectly on strings, microphone stand moving, text flicker, misspellings, unreadable captions, random logos/watermarks, over-sharpening, plastic skin, temporal jitter, unstable bokeh lights, banding in blue background, jump cuts, camera shake.
Audio negatives: robotic singing, off-beat cadence, clipped peaks, harsh sibilance, lip-sync mismatch. Use only licensed lyrics/audio; otherwise replace with original lines with the same meaning and timing.

[SPEECH PACK] (safe paraphrase; keep timing)
NOTE: The reference contains recognizable lyrics. Do not reproduce copyrighted lyrics verbatim unless you have rights. Keep meaning beats + timing.

Segment 1 [00:00–00:03] (reset + smile)
- TAKE_A: “Hold up—pause a second and just smile.” (playful)
- TAKE_B: “Wait—take a breath, give me a smile.” (bouncy)
- TAKE_C: “Stop for a moment… and smile with me.” (slightly slower)

Segment 2 [00:03–00:06] (why so serious)
- TAKE_A: “Why is everybody acting so serious?” (punchy)
- TAKE_B: “Why’s everyone so serious right now?” (casual)
- TAKE_C: “Why’s it all so serious—relax.” (cheeky pause)

Segment 3 [00:06–00:09] (mysterious + shade)
- TAKE_A: “Why you acting so mysterious? Throwing shade?” (cheeky)
- TAKE_B: “Stop acting all mysterious—what’s with the shade?” (faster)
- TAKE_C: “Mysterious vibes… and shade for no reason.” (more sarcastic)

Segment 4 [00:09–00:11] (can’t even have)
- TAKE_A: “You can’t even have a good time?” (emphasis on “you”)
- TAKE_B: “You can’t even have a good day?” (lighter)
- TAKE_C: “You can’t even have a good moment?” (more dramatic)

Segment 5 [00:11–00:13.4] (close)
- TAKE_A: “Come on—keep it light.” (smile)
- TAKE_B: “Let’s keep it fun.” (quick finish)
- TAKE_C: “It’s not that deep.” (dry, playful)
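
The segment timings above map directly onto subtitle cues, which can help keep caption changes on phrase boundaries. Below is a minimal Python sketch, assuming the captions are exported as a SubRip (.srt) file. The helper names `to_srt_time` and `build_srt` are hypothetical and not part of any tool named on this page, and the one line per segment is picked arbitrarily from the takes above. Text is upper-cased to match the ALL-CAPS caption style in the global lock.

```python
# Hypothetical sketch: turn the timecoded segments into SRT caption cues.
# (start_seconds, end_seconds, caption_text), one take chosen per segment.
segments = [
    (0.0, 3.0, "Stop for a moment... and smile with me."),
    (3.0, 6.0, "Why is everybody acting so serious?"),
    (6.0, 9.0, "Mysterious vibes... and shade for no reason."),
    (9.0, 11.0, "You can't even have a good time?"),
    (11.0, 13.4, "Let's keep it fun."),
]


def to_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def build_srt(cues) -> str:
    """Build SRT text: numbered cues, each with a time range and caption."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        time_range = f"{to_srt_time(start)} --> {to_srt_time(end)}"
        # Upper-case to follow the ALL-CAPS caption style.
        blocks.append(f"{index}\n{time_range}\n{text.upper()}")
    return "\n\n".join(blocks) + "\n"


srt_text = build_srt(segments)
print(srt_text)
```

Swapping in a different take only means editing the text in `segments`; the timings stay fixed, so caption changes still land at the same points in the clip.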
Video
Milla Sofia
GLOBAL LOCK: A young blonde woman in her mid-20s with light skin and a slender build. She has long, wavy blonde hair and is wearing a black spaghetti strap dress. She is holding an acoustic guitar and singing into a professional silver studio microphone. The setting is a dark stage with large, warm, out-of-focus circular bokeh lights in the background. The lighting is cinematic and warm, with a strong golden rim light on her hair and soft key lighting on her face. The color grade is warm and editorial. Speech is a female singing voice, emotional and clear, with high-fidelity lip-sync.

[00:00–00:02]
The woman is singing the lyrics "That Arizona sky." She is in a medium close-up, looking slightly to the side of the camera with a soulful expression. Her mouth moves in perfect sync with the words. One hand rests on the guitar neck while the other strums gently. The camera is static.

[00:02–00:05]
The camera zooms in slightly to a tight close-up as she sings "Burning in your eyes." Her eyes are expressive, looking directly into the lens for a moment before glancing away. The warm bokeh lights in the background shift slightly due to a very subtle handheld camera movement.

[00:05–00:08]
She continues singing "You look at me." Her head tilts slightly, and her expression becomes more intense and vulnerable. The lighting catches the moisture in her eyes. Her hair moves slightly as if caught in a very gentle indoor breeze.

[00:08–00:11]
Transitioning to the line "And babe I wanna catch," she opens her mouth wider for the vocal projection. The camera maintains a tight close-up. The focus is sharp on her facial features, especially her lips and eyes, while the guitar in the foreground is slightly soft.

[00:11–00:14]
She sings "On fire," with a powerful vocal delivery. Her eyes close momentarily to emphasize the emotion. The golden rim light on her blonde hair is very prominent here, creating a halo effect.

[00:14–00:17]
The final segment "It's buried in my soul." She looks down toward the guitar, her expression softening into a quiet, reflective smile. The camera slowly pulls back to a medium close-up. The lip-sync remains tight until the very last syllable.

NEGATIVE PROMPT: blurry face, distorted hands, guitar strings merging with fingers, flickering lights, unnatural lip movements, robotic neck stiffness, low resolution, grainy shadows, inconsistent hair color, double eyelashes, floating microphone.

SPEECH PACK:
Transcript: "That Arizona sky, burning in your eyes. You look at me and babe I wanna catch on fire. It's buried in my soul."
TAKE_A (Emotional/Breathy): Focus on the breath between phrases, soft onset of "Arizona," lingering on "soul."
TAKE_B (Powerful/Belting): Stronger emphasis on "Burning" and "Fire," crisp consonants.
TAKE_C (Vulnerable/Whisper-like): Very soft delivery, almost a sigh on "look at me," slow cadence.
Prosody: [breath] That Arizona sky... [pause] burning in your eyes... [emphasis] YOU look at me... and [pause] babe I wanna catch... [climax] ON FIRE... [softly] it's buried in my soul.🎤🎸
Video
Milla Sofia

GLOBAL LOCK: Vertical 9:16 (720x1280), 30fps, photoreal cinematic concert close-up. Single adult woman vocalist performing on a dim stage, singing into a handheld black dynamic microphone (no stand). Wardrobe: red satin slip dress with thin straps and soft draped neckline (cowl). Hair: blonde hair in a low ponytail with a smooth crown. Makeup: natural-glam with warm highlight. Lighting: warm amber/tungsten stage spotlights behind her creating strong circular bokeh orbs and gentle haze; soft flattering key on face; subtle rim light on hair and shoulders; cinematic contrast; shallow depth of field; subtle film grain.

Camera language: steady and centered, medium close-up to chest-up framing, slightly low angle, minimal camera drift. Motion is performance-only: mouth shapes, small head turns, slight shoulder shifts, micro-smiles. Keep microphone shape rigid and consistent.

On-screen text overlay: lyric-style captions appear in the lower third / center-lower area. Typography is a large white serif (regular or lightly italic) with a dark outline or drop shadow for readability. Captions are short (1–4 words), stacked over 1–3 lines, changing every ~1 second. Occasionally show a single emphasized word larger. IMPORTANT: do not use any copyrighted lyrics verbatim. Write original short romantic lines that match the same cadence and emotional progression.

SPEECH/AUDIO: no source audio stream in the file. Keep silent, or add a licensed instrumental backing track only (no vocals; no recognizable melody). If you add vocals, they must be fully original and non-derivative.

[00:00–00:03]
Singer faces slightly left of camera, mic near lips. She begins with an open-vowel mouth shape while the warm background bokeh shimmers in the haze. First caption fades in at center-lower: a short original phrase (1–3 words).

[00:03–00:06]
Small head turn to the right; lips shape a new syllable; subtle breath between words. Caption swaps to the next short phrase. Keep the font consistent, with a high-contrast shadow/outline.

[00:06–00:09]
She soft-smiles briefly, then returns to a focused expression, maintaining steady posture. Caption changes again; introduce one emphasis word slightly larger.

[00:09–00:12]
Emotional beat: eyebrows lift, mouth opens wider as if sustaining a note. Caption becomes 2–4 words stacked, still centered in the lower third.

[00:12–00:13.1]
Close the phrase with a softer mouth closure and relaxed expression. Final caption resolves to a short closing line (1–3 words). Hold warm bokeh and haze to the end.

NEGATIVE PROMPT: low-res, cartoon, plastic CGI, harsh flash, over-sharpening halos, banding, flicker, temporal jitter, warped microphone, extra fingers, face drift, hair crawling artifacts, dress fabric warping, broken straps, unreadable/misspelled captions, random extra text, watermarks, logos, subtitle UI, heavy camera shake.

SPEECH PACK (safe, non-lyric):
[00:00–end]
TAKE_A: (silence)
TAKE_B: licensed cinematic pop-ballad instrumental only (no vocals)
TAKE_C: subtle stage room tone + gentle plate reverb tail (very low)
Video
GLOBAL LOCK: vertical 9:16 creator tutorial reel about realistic AI lip sync and talking-head prompting, social-native educational format, fast caption-led pacing, real-looking example clips in multiple environments, sharp white and yellow text overlays, practical creator voice, multiple example subjects but one consistent teaching thesis: specify lip movement, body motion, gesture behavior, and camera stability in the prompt. Must include tool comparison callouts for VEO 3.1, KLING 2.6, and HEYGEN, and must end with a keyword-comment CTA for the prompt structure.

[00:00:00-00:00:04] Open on a blonde young woman inside a bright red British-style phone booth, yellow tank top, phone handset to her ear, speaking toward camera with expressive hand movement. Use bold center captions that build the idea of “everyone’s gatekeeping how to do perfect AI lip sync.” Keep the red booth dominant in frame as the high-contrast visual hook.

[00:00:04-00:00:07] Cut to a young man with curly blond hair in a close talking-head shot with shallow depth of field. He speaks directly to lens, then the frame shifts to a close-up of his hand on a transit pole, showing grip and subtle gesture control. Use captions to introduce the first prompt principle: movement and behavior need to be specified, not left vague.

[00:00:07-00:00:11] Transition through prompt-structure overlays on blue and dark panels. Show dense prompt snippets with sections for mouth movement, body movement, gesture behavior, eye contact, and realism settings. Insert a futuristic train nose shot as a visual bridge while captions stress that precise motion rules are crucial.

[00:00:11-00:00:17] Move into transit examples. Show a subway platform rushing by, then medium crops of the male speaker’s torso and hands to illustrate natural hand gestures. Cut to a blonde woman seated on a subway train in a yellow top, talking with restrained but realistic body motion. The captions should explain that natural hand gestures and believable body behavior must be described in the prompt.

[00:00:17-00:00:22] Start the model comparison section. Display a woman with long braids and glasses in a plain indoor room while large captions introduce VEO 3.1. Then cut to a woman with long straight hair and glasses in a subway-like station setting for KLING 2.6. Use repeated talking-head clips to compare how each model handles short spoken segments.

[00:00:22-00:00:29] Continue the comparison: VEO 3.1 is positioned as strong for longer clips; KLING 2.6 is labeled as accurate for short speech and holding together on brief shots; HEYGEN appears in a clean indoor talking-head example as the strongest option for subtler avatar-style delivery. Keep the subjects facing camera, mouth motion readable, and captions explicit about the use case of each tool.

[00:00:29-00:00:35] Return to the red phone booth environment and crop tighter on the woman’s torso, handset, and booth details. Overlay the final CTA in staged caption chunks telling viewers to comment "LIPS" and the creator will send the prompt structure. Hold the ending long enough for screenshot and keyword memorability.

CAMERA: quick social cuts between static or gently handheld talking-head shots, occasional close crops on hands and props, brief full-frame prompt cards, no cinematic camera choreography.
LIGHTING: bright frontal booth lighting in the red phone booth, neutral soft lighting for indoor talking heads, cool transit lighting in station and subway clips, readable high-contrast graphics on blue and dark text screens.
GRADE: crisp social contrast, saturated red booth tones, natural skin tones, clear text overlays, slightly stylized but still realistic creator aesthetic.
MOTION: lip movement must look synced and natural, hand gestures subtle and human, body movement controlled rather than stiff, transit backgrounds add motion energy without distracting from faces.
SPEECH PACK: upbeat creator-educator narration explaining how to prompt for better AI lip sync. Key points: people are gatekeeping this, prompt structure matters, specify mouth and body behavior, and choose tools based on clip length and realism needs. Phone-mic style direct audio, dry mix, punchy cadence.
NEGATIVE PROMPT: music-video montage, exaggerated dance movement, frozen hands, puppet-like body motion, sloppy lip sync, heavy cinematic blur, fantasy environment, unreadable prompt cards, tool logos without explanation, chaotic transitions, off-topic b-roll, overacted gestures, multiple overlapping captions.
Video
GLOBAL LOCK: The subject is a Caucasian male in his early 30s with medium-length, wavy dark brown hair and a short, neat beard. He consistently wears a light-colored baseball cap with a subtle front logo. The video alternates between cinematic, AI-generated environments featuring variations of this subject, and a practical screen-recording tutorial where the subject appears in a Picture-in-Picture (PiP) box wearing a solid brown hoodie. The camera language in the AI scenes is highly cinematic with motivated lighting, while the tutorial section is a clean, direct-to-camera capture. The speech style is energetic, conversational, and educational, with a clear, close-mic podcast-style audio signature.

[00:00–00:04]
Visuals: A 3-panel horizontal split screen. Top panel: Subject in a snowy, mountainous environment, wearing a white t-shirt, cool daylight. Middle panel: Subject in a grassy field, wearing a white t-shirt, warm sunlight. Bottom panel: Subject holding a vintage silver camera to his eye, wearing a white t-shirt, cloudy sky background. Large white text "AI Lip Sync" pops onto the screen in the center.
Camera: Medium close-ups in all panels. Static framing.
Motion: Subtle environmental motion (snow blowing, grass swaying). Subject's lips move in sync with the audio.
Speech: Subject (Voice A, energetic, amazed): "This lip-syncing technology is mind-blowingly good."
Audio: Clean, upfront vocal. No room reverb.

[00:04–00:08]
Visuals: Full screen. Subject is wearing a detailed white astronaut suit with a clear helmet visor. He is in outer space, with a highly detailed, cratered moon surface above him and the curve of Earth below.
Camera: Close-up on the subject's face inside the helmet. Slow, smooth push-in.
Lighting: High-contrast, stark directional light mimicking space, with reflections on the helmet visor.
Motion: Subject raises his gloved hands slightly in a questioning gesture. Lips sync to audio.
Speech: Subject (Voice B, slightly muffled/radio-filtered, inquisitive): "So what you're telling me is I can lip-sync onto any AI-generated image of me?"

[00:08–00:12]
Visuals: Full screen. An older, aged-up version of the subject. He has grey, wavy hair and a grey beard, still wearing the signature light-colored cap and a white t-shirt. Background is a seamless, dark grey studio backdrop.
Camera: Starts as a high-angle wide shot looking down at him, then cuts to an eye-level medium close-up.
Lighting: Soft, dramatic studio overhead key light, creating deep shadows under the cap brim.
Motion: Subject looks up, smiles warmly, and nods. Lips sync to audio.
Speech: Subject (Voice C, slightly raspy, warm, paternal): "That's exactly right, Rourke. You clever man."

[00:12–00:16]
Visuals: Full screen. Subject (original age) is looking out of a large, circular spaceship window. The HeyGen logo is faintly visible as a watermark/hologram. He wears a white t-shirt and his signature cap.
Camera: Medium shot, framing the subject symmetrically within the circular window. Static.
Lighting: Cool, cinematic blue light emanating from the window, casting a glow on his face against a dark background.
Motion: Subject has his hands pressed against the glass, looking out in awe, then turns his head to speak directly to the camera.
Speech: Subject (Voice A, original voice, enthusiastic, urgent): "But this is a huge unlock for storytelling. How do people not know about this?"

[00:16–00:30]
Visuals: Screen recording of the HeyGen desktop UI in dark mode. In the bottom left corner, there is a rectangular Picture-in-Picture (PiP) video of the subject. In the PiP, he is wearing a brown hoodie and his signature cap, sitting in a room with warm, practical lighting (an orange glow behind him). The screen recording shows the mouse cursor clicking "Create," then "Photo to video." It shows uploading an image (a silhouette profile against a giant orange sun), uploading an audio file, and typing "Hand gestures" into a "Custom motion prompt" box.
Camera: UI capture is static. PiP camera is a static medium close-up, eye-level webcam style.
Motion: Mouse cursor moves smoothly across the UI. In the PiP, the subject uses active hand gestures (pointing up, making a small pinching motion, giving a thumbs up) as he explains the steps.
Speech: Subject (Voice A, instructional, fast-paced): "Go to create and click on photo to video. Then you can upload an image that you've created and upload the audio that you want to lip-sync onto. And you can even add on more expressive emotions into your character. Then you can hit generate..."

[00:30–00:35]
Visuals: The screen recording transitions to show the final generated video playing: a stark black silhouette of the subject's profile against a massive, glowing orange sun. The PiP of the subject remains in the bottom corner. Large white serif text "AI" appears on the screen.
Camera: Silhouette video is a static profile shot. PiP remains static.
Lighting: Silhouette scene is heavily backlit by the orange sun.
Motion: The silhouette's lips move in sync with the audio. The PiP subject points towards the camera and gestures.
Speech: Subject (Voice A, concluding, call-to-action): "...and in seconds you'll have your video. If you want to try it out for yourself, type AI in the comments and I'll send you a link."

NEGATIVE PROMPT:
Visuals: deformed anatomy, extra fingers, unnatural skin textures, flickering lighting, temporal jitter, morphing backgrounds, illegible UI text, robotic or stiff body movements, mismatched eye contact, floating objects, bad green screen edges around PiP.
Audio: robotic cadence, slurred words, harsh sibilance, plosives, audio clipping, out-of-sync lip movements, unnatural pauses, metallic AI voice artifacts.

SPEECH PACK:
[00:00-00:04]
Speaker: Voice A (Creator)
Transcript: "This lip-syncing technology is mind-blowingly good."
Prosody: Emphasis on "mind-blowingly". Energetic, fast-paced.

[00:04-00:08]
Speaker: Voice B (Astronaut Avatar)
Transcript: "So what you're telling me is I can lip-sync onto any AI-generated image of me?"
Prosody: Slight pause after "is". Inquisitive tone, rising inflection at the end.

[00:08-00:12]
Speaker: Voice C (Older Avatar)
Transcript: "That's exactly right, Rourke. You clever man."
Prosody: Warm, slow, deliberate pacing. Slight chuckle on "clever man".

[00:12-00:16]
Speaker: Voice A (Creator)
Transcript: "But this is a huge unlock for storytelling. How do people not know about this?"
Prosody: Urgent, passionate. Emphasis on "huge unlock".

[00:16-00:30]
Speaker: Voice A (Creator - Tutorial Mode)
Transcript: "Go to create and click on photo to video. Then you can upload an image that you've created and upload the audio that you want to lip-sync onto. And you can even add on more expressive emotions into your character. Then you can hit generate..."
Prosody: Clear, instructional cadence. Steady pace, pausing slightly at punctuation marks to match UI actions.

[00:30-00:35]
Speaker: Voice A (Creator - CTA Mode)
Transcript: "...and in seconds you'll have your video. If you want to try it out for yourself, type AI in the comments and I'll send you a link."
Prosody: Upbeat conclusion. Strong emphasis on "type AI". Friendly, inviting tone.
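
A speech pack like the one above only lip-syncs cleanly if its segments run back to back and each transcript fits its time window. The following is a minimal Python sketch of that check, not part of any tool named here. It assumes the "[MM:SS-MM:SS]" header format used in this prompt, and the helper names parse_timecode and check_segments are hypothetical. It reports words per second for each segment so an overloaded line stands out before generation.

```python
import re

# Matches headers such as "[00:16-00:30]" or "[00:32.00-00:48.67]".
# Both a hyphen and an en dash are accepted, since the prompt uses both.
TIMECODE = re.compile(r"\[(\d+):(\d+(?:\.\d+)?)[-–](\d+):(\d+(?:\.\d+)?)\]")


def parse_timecode(header):
    """Return (start, end) in seconds for a timecode header."""
    match = TIMECODE.fullmatch(header.strip())
    if match is None:
        raise ValueError(f"unrecognized timecode: {header!r}")
    m1, s1, m2, s2 = match.groups()
    return int(m1) * 60 + float(s1), int(m2) * 60 + float(s2)


def check_segments(segments):
    """Validate (header, transcript) pairs and report speaking pace.

    Raises ValueError if a segment has no duration or if there is a gap
    or overlap between consecutive segments. Returns a list of
    (header, words_per_second) tuples.
    """
    report = []
    previous_end = None
    for header, transcript in segments:
        start, end = parse_timecode(header)
        if end <= start:
            raise ValueError(f"segment {header} has no duration")
        if previous_end is not None and abs(start - previous_end) > 1e-9:
            raise ValueError(f"gap or overlap before {header}")
        words = len(transcript.split())
        report.append((header, round(words / (end - start), 2)))
        previous_end = end
    return report


if __name__ == "__main__":
    # Two segments copied from the speech pack above.
    pack = [
        ("[00:00-00:04]", "This lip-syncing technology is mind-blowingly good."),
        ("[00:04-00:08]", "So what you're telling me is I can lip-sync onto "
                          "any AI-generated image of me?"),
    ]
    for header, pace in check_segments(pack):
        print(header, pace, "words/sec")
```

A segment whose pace is far above the others is a candidate for a longer window or a shorter line. What counts as too fast is a judgment call and is not defined by this sketch.
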
Video

MASTER PROMPT
GLOBAL LOCK: Vertical 9:16 software-demo explainer reel about AI lip sync. The visual language should feel like a modern creator-tool promo: black background, bold white text headlines, occasional blue highlight text, stacked comparison frames, and a clean screen-recorded interface. The recurring example clip is a bearded man outdoors in daylight wearing a green-and-white Vans hat, beige blazer, and white shirt, standing in front of parked cars and trees. The core structure is: bold hook claim, split-screen comparison between AI output and original video, presenter-led explanation with a small talking-head overlay, then UI walkthrough and preview results. Preserve direct-response short-form pacing, on-screen text clarity, and educational creator tone.

[00:00-00:12.00] Open with a strong hook on a black background: large headline text about AI now being able to lip sync onto videos. Under the headline, show two stacked or split comparison clips of the same outdoor Vans-hat man, with clear labels such as InfiniteTalk AI and Original Video. The comparison should foreground mouth motion and timing differences, keeping the same daylight suburban background in both panes. The overall feel is instant proof before explanation.

[00:12.00-00:32.00] Transition into a 'Here's how' tutorial segment. Keep large white section-heading text near the top, while a small picture-in-picture presenter appears in the lower area speaking directly to camera. Behind or above the presenter, show the example video and snippets of the software input workflow, including prompt/reference sections or image upload areas. The pacing should feel like a creator walking the audience through a useful AI workflow quickly and confidently.

[00:32.00-00:48.67] Move into the product interface and preview. Show a dark UI with control panels, preview windows, upload areas, and the same example clip being processed or compared. Bring back the stacked output-vs-original mouth movement comparison to close the loop. The final beat should make the product feel practical and immediately usable, not abstract.

NEGATIVE PROMPT
Avoid generic corporate slideshow aesthetics, stock-office scenes, cluttered UI, unreadable text, random example footage, shaky camera, dark moody color grades, excessive animations, missing comparison labels, unclear mouth movement, or a tutorial structure that loses the original proof-first hook.

SHOT PROMPTS
[00:00-00:12.00] Proof-first hook with headline and AI-vs-original lip-sync comparison.
[00:12.00-00:32.00] Presenter overlay explains workflow with example clip and software setup.
[00:32.00-00:48.67] Dark product UI walkthrough and final preview comparison.

SPEECH PACK
Timecoded transcript:
[00:00-00:48.67] Single-speaker tutorial explaining an AI lip-sync workflow. Exact words unclear from visual evidence; preserve concise creator-educator delivery, strong opening claim, then a step-by-step product walkthrough.

TAKE_A
[00:00-00:48.67] Fast creator-demo delivery with a proof-first hook and simple how-to explanation.

TAKE_B
[00:00-00:48.67] Calm tutorial cadence with brief pauses around interface steps and comparison moments.

TAKE_C
[00:00-00:48.67] Slightly more energetic product-demo tone emphasizing the before/after result and workflow ease.

AI Music Video With Lyrics

AI Music Video With Lyrics is for creators who want both generated visuals and clearly readable lyrics inside the same output. The page should guide them toward examples and prompts that balance typography, timing, background movement, and visual clarity so the words stay legible without flattening the piece.

The strongest angle is integration. Users here are not choosing between a lyric video and a music video in isolation. They want both functions in one format. The copy should keep the focus on how text and visuals can support each other instead of fighting for attention.

What this page should make clear:
- The output combines visual generation with on-screen lyric display.
- Readability, timing, and background control are central.
- This style works for lyric-heavy releases, teasers, and artist uploads.
- The best examples make the words visible without making the video feel static.
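
Timing is the easiest of these points to make concrete. As a rough Python sketch, timed lyric lines can be written out as SubRip (SRT) cues, a common subtitle format, so each line's on-screen window is explicit before it is styled or burned into the video. The function names, the placeholder lyric lines, and the one-second minimum display time are assumptions for illustration, not a rule from this page or from any generator.

```python
def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def lyrics_to_srt(lines, min_duration=1.0):
    """Turn (start_seconds, end_seconds, text) tuples into SRT text.

    min_duration is an assumed readability floor: a line shown for less
    than this many seconds is rejected rather than silently kept.
    """
    cues = []
    for index, (start, end, text) in enumerate(lines, start=1):
        if end - start < min_duration:
            raise ValueError(f"line {index} is on screen too briefly: {text!r}")
        cues.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text}\n"
        )
    # SRT separates cues with a blank line.
    return "\n".join(cues)


if __name__ == "__main__":
    # Placeholder lines, not lyrics from any real song.
    placeholder_lines = [
        (0.0, 2.5, "First lyric line"),
        (2.5, 5.0, "Second lyric line"),
    ]
    print(lyrics_to_srt(placeholder_lines))
```

Keeping the lyric timing in a separate cue file also makes it easier to adjust the text layer without regenerating the visuals.
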

FAQ

Q: What is an AI music video with lyrics?
A: It is a music video format that combines generated visuals with clearly readable lyrics on screen.

Q: How is it different from a standard lyric video?
A: The visuals are treated as a fuller generated video layer rather than a simple backdrop for text.

Q: What is it best for?
A: Lyric releases, text-forward promos, artist uploads, and songs where the words need to stay central.
