Engineering at HeyGen: Inside the team building AI video at scale

HeyGen × Stripe Projects: the missing piece for autonomous product launches

Jun 15, 2026

Agents can build a product in a day but not launch it. With Stripe Projects and the HeyGen API, an agent can now provision, pay, and produce its own launch video.

Avatar Real-time: The Technical Report Behind Low-Latency, Unlimited-Duration Generation

Jun 3, 2026

The inference framework transforms avatar generation from fixed-length rendering into open-ended streaming video synthesis. A chunk-based pipeline maintains identity, motion, and lip-sync consistency across arbitrarily long videos while operating with constant memory usage. Combined with model sharding, asynchronous offloading, and streaming decode, the system achieves sub-5-second time-to-first-frame and faster-than-real-time generation.
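
To make the constant-memory claim concrete, here is a minimal sketch of chunk-based streaming with a bounded context window. It is an illustration under stated assumptions, not the report's implementation: `generate_chunk`, `stream_frames`, and the `context_chunks` parameter are hypothetical names, and the "model" simply returns frame indices.

```python
from collections import deque
from typing import Callable, Deque, Iterator, List


def generate_chunk(context: List[List[int]], chunk_index: int, chunk_len: int) -> List[int]:
    """Hypothetical stand-in for the model: returns frame indices for one chunk."""
    start = chunk_index * chunk_len
    return list(range(start, start + chunk_len))


def stream_frames(
    num_chunks: int,
    chunk_len: int = 4,
    context_chunks: int = 2,
    generate: Callable[[List[List[int]], int, int], List[int]] = generate_chunk,
) -> Iterator[int]:
    """Yield frames chunk by chunk while keeping only the most recent chunks as context."""
    # A bounded deque drops the oldest chunk automatically, so the retained
    # context does not grow with the length of the video (assumed design).
    context: Deque[List[int]] = deque(maxlen=context_chunks)
    for i in range(num_chunks):
        chunk = generate(list(context), i, chunk_len)
        context.append(chunk)
        # Frames are emitted as soon as their chunk is ready.
        yield from chunk


if __name__ == "__main__":
    for frame in stream_frames(num_chunks=3, chunk_len=2):
        print(frame)
```

Because the generator yields each chunk as it is produced, a consumer can begin playback after the first chunk instead of waiting for the full video.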

Avatar V: Scaling Video-Reference Avatar Generation

Apr 8, 2026

Avatar V is built on a Diffusion Transformer with flow matching that conditions directly on the full token sequence of a user’s reference video, with no bottleneck embeddings. Sparse Reference Attention keeps cost nearly linear in reference length. A five-stage training curriculum progresses from general video pre-training through identity-preserving fine-tuning, distillation, and RLHF alignment.

Curating Millions of Videos: The Data Engine Behind Avatar V

Apr 3, 2026

A distributed data engine orchestrating 25+ processing stages and 20+ specialized AI models transforms 50M raw videos into 100M+ pretraining clips and 10M+ avatar fine-tuning clips. A 10-stage segment-level curation cascade, 13 parallel feature extraction stages, 10 fine-grained avatar quality signals, and a cross-clip identity connectivity graph produce the training data that makes Avatar V possible.
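
A curation cascade of this kind can be pictured as an ordered list of filters where a clip is dropped at the first stage it fails. The sketch below is only an illustration: the `Clip` fields, stage names, and thresholds are invented for the example and are not the stages or signals used by the data engine.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Clip:
    clip_id: str
    duration_s: float
    sharpness: float   # hypothetical quality score in [0, 1]
    face_count: int


Stage = Tuple[str, Callable[[Clip], bool]]

# Hypothetical stages, ordered so that cheaper checks run first.
STAGES: List[Stage] = [
    ("min_duration", lambda c: c.duration_s >= 2.0),
    ("sharpness", lambda c: c.sharpness >= 0.5),
    ("single_face", lambda c: c.face_count == 1),
]


def run_cascade(clips: List[Clip], stages: List[Stage] = STAGES) -> Tuple[List[Clip], Dict[str, int]]:
    """Return the clips that pass every stage and a per-stage rejection count."""
    kept: List[Clip] = []
    rejected: Dict[str, int] = {name: 0 for name, _ in stages}
    for clip in clips:
        for name, passes in stages:
            if not passes(clip):
                rejected[name] += 1
                break  # early exit: later stages never see this clip
        else:
            kept.append(clip)
    return kept, rejected
```

The per-stage rejection counts make it easy to see where most clips are lost, which is useful when tuning thresholds.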

From Model to Production: Optimizing Avatar V Inference at Scale

Apr 2, 2026

Avatar V generates 1080p video at 25 fps across 8 GPUs per request. A custom compiler with LLM-based agentic kernel synthesis achieves 3× latency reduction over the unoptimized baseline and 33% improvement over torch.compile. Chunk-based autoregressive generation enables arbitrary-length output, while NVSHMEM-based sequence parallelism, two-level context caching, and streaming VAE decode keep memory bounded and throughput high.

HELIOS: Unified GPU Infrastructure for Training, Inference, and Data at Scale

Apr 1, 2026

HELIOS is a unified GPU infrastructure platform managing 5,000+ GPUs across 5+ cloud providers and 15+ standardized cells. A two-stage QoS-aware scheduler improved GPU utilization by 15% and reduced non-productive GPU time by 20%. A custom declarative data processing engine replaced Ray, scaling to 200K+ concurrent tasks with 95%+ GPU utilization and node failure detection under 30 seconds.

TransVLM: Detecting Any Shot Transition with Vision-Language Models

Mar 1, 2026

We reformulate shot boundary detection as Shot Transition Detection (STD): finding complete transition segments rather than just cut points. TransVLM fuses optical flow with color frames in a vision-language model to detect all transition types: cuts, dissolves, and special effects. It achieves 78.3% segment F1 on public data and 89.5% on synthetic data, outperforming all existing methods.
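
For readers unfamiliar with segment-level scoring, the sketch below shows one common way to compute a segment F1: a predicted transition counts as correct if it overlaps an unmatched ground-truth transition with IoU above a threshold. The matching rule and the 0.5 threshold are assumptions made for illustration; the summary above does not specify the exact protocol behind the reported numbers.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start frame inclusive, end frame exclusive)


def iou(a: Segment, b: Segment) -> float:
    """Temporal intersection-over-union of two frame ranges."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def segment_f1(pred: List[Segment], truth: List[Segment], iou_threshold: float = 0.5) -> float:
    """Greedy one-to-one matching of predictions to ground truth, then F1."""
    unmatched = list(truth)
    true_positives = 0
    for p in pred:
        best = max(unmatched, key=lambda t: iou(p, t), default=None)
        if best is not None and iou(p, best) >= iou_threshold:
            true_positives += 1
            unmatched.remove(best)  # each ground-truth segment matches at most once
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)
```

Under this rule a hard cut is a one-frame segment such as `(50, 51)`, while a dissolve spans several frames, so a detector must localize both the start and the end of gradual transitions to score well.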
