Binary file modified pdfs/zen-dub-newsroom.pdf
Binary file not shown.
6 changes: 3 additions & 3 deletions zen-dub-newsroom.tex
@@ -13,7 +13,7 @@
\definecolor{zenblue}{RGB}{41,121,255}
\hypersetup{colorlinks=true,linkcolor=zenblue,urlcolor=zenblue,citecolor=zenblue}

-\title{\textbf{Zen Live-Dub: A License-Clean, Real-Time, Multi-Speaker\\ Cross-Lingual Video Dubbing System on Commodity GPUs}\\
+\title{\textbf{Zen Live-Dub: A Permissively Licensed, Real-Time, Multi-Speaker\\ Cross-Lingual Video Dubbing System on Commodity GPUs}\\
\large Technical Report v2026.06}
\author{Zen LM Research Team\\
\texttt{research@zenlm.org}}
@@ -93,7 +93,7 @@ \subsection{FP4 convolution: a disproven shortcut}

Numerics are sound (per-shape cosine $\geq 0.9907$ across all 35 UNet shapes) \emph{only} when block scales use the cuBLAS \texttt{to\_blocked} $128\times4$ swizzle (output cosine $0.999997$); the naive padded layout produces numerical garbage (cosine $0.18$). The actionable conclusion is that a real FP4 win requires a \emph{fused} implicit-GEMM convolution---quantization in the mainloop, scales emitted in the swizzled layout---not an eager decomposition.
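The cosine figures above compare a quantized result against a full-precision reference. As a minimal sketch of that metric only, the following fake-quantizes a vector onto the FP4 E2M1 value grid with one scale per block and reports the cosine against the original. The block size of 16, the Gaussian test data, and all function names are assumptions made for illustration; the sketch does not model the swizzled scale layout or the convolution kernels discussed in this section.

```python
import math
import random

# Magnitudes representable by an FP4 E2M1 element (sign is handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(block):
    """Fake-quantize one block: scale so max |x| maps to 6.0, snap to the grid, rescale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    scale = amax / E2M1_GRID[-1]
    out = []
    for x in block:
        # Nearest representable magnitude in the scaled domain.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out


def fake_quantize(values, block_size=16):
    """Apply block-scaled fake quantization; block_size is an assumed value."""
    out = []
    for i in range(0, len(values), block_size):
        out.extend(quantize_block(values[i:i + block_size]))
    return out


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


random.seed(0)
reference = [random.gauss(0.0, 1.0) for _ in range(4096)]
quantized = fake_quantize(reference)
similarity = cosine(reference, quantized)
print(f"cosine = {similarity:.4f}")
```

A per-shape check in this style would compute `cosine` once per layer shape and compare it against a chosen floor; the floor itself is a policy decision and is not fixed by this sketch.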

-\section{License-Clean Component Selection}
+\section{Permissively Licensed Component Selection}

Selecting a commercially usable clone-TTS is gated by the weights license, not capability. We audited the 2025--2026 field against primary sources (model-card metadata, training-set licenses, technical reports), distinguishing the license of the \emph{code} from that of the \emph{weights}, and the ability to clone an \emph{arbitrary} speaker from fixed voice packs (Table~\ref{tab:license}).

@@ -140,7 +140,7 @@ \section{Voice Cloning}

Multi-speaker handling is a pipeline, not a model: streaming diarization tags who-speaks-when, each segment is matched against the registry (identification accuracy $\geq$95\% on the enrolled set; 0.01\,ms per query, scaling to thousands of entries), and per-speaker references drive the clone. Source separation (Demucs) isolates speech from the music/SFX bed so the dub is remixed under the preserved background ($+6$\,dB SI-SDR; 74\% of bed energy retained).
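The registry lookup step can be illustrated with a small sketch. It assumes each enrolled speaker is stored as a fixed-length embedding and that a segment is assigned to the closest entry by cosine similarity, with a rejection threshold for unknown voices. The embedding values, the threshold of 0.7, and the names `identify` and `registry` are hypothetical; the report does not specify the matching rule.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify(segment_embedding, registry, threshold=0.7):
    """Return (speaker_id, score) for the best match, or (None, score) if below threshold."""
    best_id, best_score = None, -1.0
    for speaker_id, reference in registry.items():
        score = cosine(segment_embedding, reference)
        if score > best_score:
            best_id, best_score = speaker_id, score
    if best_score < threshold:
        return None, best_score  # unknown speaker: no enrolled voice is close enough
    return best_id, best_score


# Toy 3-dimensional embeddings; real speaker embeddings are much larger.
registry = {
    "anchor": [1.0, 0.0, 0.0],
    "reporter": [0.0, 1.0, 0.0],
}

match, score = identify([0.9, 0.1, 0.0], registry)
unknown, unknown_score = identify([0.0, 0.0, 1.0], registry)
print(match, unknown)
```

A linear scan like this is adequate for a small registry; a deployment with thousands of entries might use a batched matrix product or an approximate nearest-neighbor index instead.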

-\section{Governance, Provenance, and the License-Clean Visual Path}
+\section{Governance, Provenance, and the Permissively Licensed Visual Path}
\label{sec:gov}

A newsroom deployment must satisfy consent and disclosure law (e.g.\ the Tennessee ELVIS Act, in force; the EU AI Act Article~50 synthetic-audio disclosure requirement, effective August 2026). The pipeline enforces a signed, revocable consent record per governed voice, checked at synthesis time; an unconsented speaker is refused and routed to a silent hold. Provenance is recorded in a C2PA manifest whose \texttt{consent\_ref} foreign key is verified to be coherent with the consent ledger. Synthetic-audio watermarking uses AudioSeal (MIT), whose watermark in our tests was detected in 100\% of cases (zero bit errors, zero false positives) across clean, MP3-128k, AAC-128k, Opus-64k, and double-encoded chains, at 33.1\,dB SNR (PESQ 4.52). The end-to-end governed run passed 20/20 assertions on GPU.
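The consent gate and the \texttt{consent\_ref} coherence check can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's implementation: records are signed with an HMAC over a demo key (a real ledger would more plausibly use asymmetric signatures), the manifest is a plain dictionary rather than a real C2PA structure, and the names `ConsentRecord`, `synthesize_or_hold`, and `manifest_is_coherent` are hypothetical.

```python
import hashlib
import hmac
from dataclasses import dataclass

LEDGER_KEY = b"demo-key-not-for-production"  # hypothetical signing key


@dataclass(frozen=True)
class ConsentRecord:
    consent_id: str
    speaker_id: str
    revoked: bool
    signature: str


def sign(consent_id, speaker_id):
    """HMAC-SHA256 over the record identity (assumed signing scheme)."""
    message = f"{consent_id}:{speaker_id}".encode()
    return hmac.new(LEDGER_KEY, message, hashlib.sha256).hexdigest()


def is_valid(record):
    """A record is usable only if it is not revoked and its signature verifies."""
    expected = sign(record.consent_id, record.speaker_id)
    return (not record.revoked) and hmac.compare_digest(expected, record.signature)


def synthesize_or_hold(speaker_id, ledger):
    """Gate applied at synthesis time: refuse and hold when consent is missing or invalid."""
    record = ledger.get(speaker_id)
    if record is None or not is_valid(record):
        return {"action": "silent_hold", "consent_ref": None}
    return {"action": "synthesize", "consent_ref": record.consent_id}


def manifest_is_coherent(manifest, ledger):
    """Check that the manifest's consent_ref points at a valid ledger record."""
    ref = manifest.get("consent_ref")
    return any(r.consent_id == ref and is_valid(r) for r in ledger.values())


ledger = {
    "anchor": ConsentRecord("c-001", "anchor", False, sign("c-001", "anchor")),
    "guest": ConsentRecord("c-002", "guest", True, sign("c-002", "guest")),  # revoked
}

granted = synthesize_or_hold("anchor", ledger)
held = synthesize_or_hold("guest", ledger)
stranger = synthesize_or_hold("passerby", ledger)
print(granted["action"], held["action"], stranger["action"])
```

Checking validity at synthesis time, rather than once at enrollment, is what makes revocation take effect on the next segment.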