Engineering at HeyGen: Inside the team building AI video at scale

HeyGen × Stripe Projects: the missing piece for autonomous product launches
Jun 15, 2026
Agents can build a product in a day but not launch it. With Stripe Projects and the HeyGen API, an agent can now provision, pay, and produce its own launch video.

Avatar Real-time: The Technical Report Behind Low-Latency, Unlimited-Duration Generation
Jun 3, 2026
The inference framework transforms avatar generation from fixed-length rendering into open-ended streaming video synthesis. A chunk-based pipeline maintains identity, motion, and lip-sync consistency across arbitrarily long videos while using constant memory. Combined with model sharding, asynchronous offloading, and streaming decode, the system achieves a time-to-first-frame under 5 seconds and faster-than-real-time generation.
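
As an illustration of why chunk-based generation can run with constant memory, here is a minimal Python sketch. It is not the framework described in the report: `generate_chunk` is a hypothetical stand-in for the model, and `chunk_size` and `max_context` are assumed parameters. The point it shows is that a bounded context window lets output length grow without growing the working set.

```python
from collections import deque
from typing import Deque, Iterator, List


def generate_chunk(audio_chunk: List[float], context: List[List[float]]) -> List[float]:
    """Hypothetical stand-in for the model.

    Emits one "frame" per audio sample. The context length is added to each
    value only so the effect of the bounded window is visible in the output.
    """
    return [sample + len(context) for sample in audio_chunk]


def stream_frames(
    audio: List[float], chunk_size: int = 4, max_context: int = 2
) -> Iterator[List[float]]:
    """Yield frames chunk by chunk while keeping at most max_context past chunks."""
    # deque(maxlen=...) evicts the oldest chunk automatically, so the memory
    # held between iterations does not depend on the total video length.
    context: Deque[List[float]] = deque(maxlen=max_context)
    for start in range(0, len(audio), chunk_size):
        chunk = audio[start:start + chunk_size]
        frames = generate_chunk(chunk, list(context))
        context.append(frames)
        yield frames  # frames can be decoded and sent before later chunks exist


if __name__ == "__main__":
    for i, frames in enumerate(stream_frames([0.0] * 20)):
        print(i, frames)
```

Because the generator yields each chunk as soon as it is produced, a consumer can start playback after the first chunk, which is the property that time-to-first-frame measures.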

Avatar V: Scaling Video-Reference Avatar Generation
Apr 8, 2026
Avatar V is built on a Diffusion Transformer with flow matching that conditions directly on the full token sequence of a user’s reference video, with no bottleneck embeddings. Sparse Reference Attention keeps cost nearly linear in reference length. A five-stage training curriculum progresses from general video pre-training through identity-preserving fine-tuning, distillation, and RLHF alignment.

Curating Millions of Videos: The Data Engine Behind Avatar V
Apr 3, 2026
A distributed data engine orchestrating 25+ processing stages and 20+ specialized AI models transforms 50M raw videos into 100M+ pretraining clips and 10M+ avatar fine-tuning clips. A 10-stage segment-level curation cascade, 13 parallel feature extraction stages, 10 fine-grained avatar quality signals, and a cross-clip identity connectivity graph produce the training data that makes Avatar V possible.

From Model to Production: Optimizing Avatar V Inference at Scale
Apr 2, 2026
Avatar V generates 1080p video at 25 fps across 8 GPUs per request. A custom compiler with LLM-based agentic kernel synthesis achieves a 3× latency reduction over the unoptimized baseline and a 33% improvement over torch.compile. Chunk-based autoregressive generation enables arbitrary-length output, while NVSHMEM-based sequence parallelism, two-level context caching, and streaming VAE decode keep memory bounded and throughput high.

HELIOS: Unified GPU Infrastructure for Training, Inference, and Data at Scale
Apr 1, 2026
HELIOS is a unified GPU infrastructure platform managing 5,000+ GPUs across 5+ cloud providers and 15+ standardized cells. A two-stage QoS-aware scheduler improved GPU utilization by 15% and reduced non-productive GPU time by 20%. A custom declarative data processing engine replaced Ray, scaling to 200K+ concurrent tasks with 95%+ GPU utilization and node failure detection under 30 seconds.

TransVLM: Detecting Any Shot Transition with Vision-Language Models
Mar 1, 2026
We reformulate shot boundary detection as Shot Transition Detection (STD): finding complete transition segments, not just cut points. TransVLM fuses optical flow with color frames in a vision-language model to detect all transition types: cuts, dissolves, and special effects. It achieves 78.3% segment F1 on public data and 89.5% on synthetic data, outperforming all existing methods.
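
For readers unfamiliar with segment-level scoring, the following Python sketch shows one common way a segment F1 can be computed. The summary above does not state the matching rule, so everything here is an assumption: segments are half-open frame intervals `(start, end)`, a prediction counts as correct when its intersection-over-union with an unmatched ground-truth segment is at least `thresh` (0.5 by default), and matching is greedy. The post's actual protocol may differ.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # half-open frame interval [start, end); assumed format


def iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two frame intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def segment_f1(pred: List[Segment], gold: List[Segment], thresh: float = 0.5) -> float:
    """F1 over segments, using greedy one-to-one matching (an assumed rule)."""
    matched = set()
    true_positives = 0
    for p in pred:
        best_j, best_score = None, thresh
        for j, g in enumerate(gold):
            if j in matched:
                continue  # each ground-truth segment can be matched only once
            score = iou(p, g)
            if score >= best_score:
                best_j, best_score = j, score
        if best_j is not None:
            matched.add(best_j)
            true_positives += 1
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    predicted = [(10, 20), (50, 60), (90, 95)]
    reference = [(12, 20), (50, 58), (200, 210)]
    print(round(segment_f1(predicted, reference), 3))
```

Unlike a cut-point metric, this kind of score penalizes a detector that finds the right location of a dissolve but gets its extent badly wrong, which is the distinction the STD formulation emphasizes.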

