diff --git a/zen-guard-gen_whitepaper.pdf b/zen-guard-gen_whitepaper.pdf
index 55e426c..e2c8057 100644
Binary files a/zen-guard-gen_whitepaper.pdf and b/zen-guard-gen_whitepaper.pdf differ
diff --git a/zen-guard-gen_whitepaper.tex b/zen-guard-gen_whitepaper.tex
index 1b2f762..9ba0f61 100644
--- a/zen-guard-gen_whitepaper.tex
+++ b/zen-guard-gen_whitepaper.tex
@@ -14,7 +14,7 @@
 \hypersetup{colorlinks=true,linkcolor=zenblue,urlcolor=zenblue,citecolor=zenblue}

 \title{\textbf{Zen-Guard-Gen: A Generative Safety Classifier\\
-Fine-Tuned from Qwen2.5-7B}\\[0.5em]
+Built on Qwen3Guard-Gen-8B}\\[0.5em]
 \large Technical Whitepaper v2025.05}
 \author{Zach Kelling \\ Zen LM Research Team\\
 \texttt{research@zenlm.org}\\
@@ -25,20 +25,24 @@
 \maketitle

 \begin{abstract}
-Zen-Guard-Gen is a \emph{generative} safety classifier built by fine-tuning Alibaba's
-\textbf{Qwen2.5-7B} base model~\cite{qwen25}. It is \emph{not} a from-scratch model and uses
-no bespoke ``Zen MoDE'' architecture: the base is the openly released, Apache-2.0 licensed
-\texttt{Qwen/Qwen2.5-7B}, a dense decoder-only transformer (\texttt{Qwen2ForCausalLM};
-7.61B parameters, 28 layers, hidden size 3584, GQA with 28 query / 4 key--value heads, vocab
-152{,}064, up to a 128K context, 29+ languages). On top of this base we add a supervised
-safety-instruction fine-tune so that, given a content item and a policy, the model emits a
-structured verdict plus a natural-language explanation, a policy reference, and (for
-borderline or unsafe content) a remediation suggestion --- making decisions auditable and
-contestable rather than opaque. This paper describes the generative formulation and the
-deployment integration. We do not report safety benchmark numbers: the upstream Qwen2.5-7B is
-a general-purpose LLM with no published safety-classifier metrics, and we have not run a
-rigorous safety evaluation of our fine-tune; the inflated benchmark figures (e.g. ``ToxiGen
-99.1\%'') in earlier revisions were fabricated and have been removed.
+Zen-Guard-Gen is a \emph{generative} safety classifier built on Alibaba's
+\textbf{Qwen3Guard-Gen-8B}~\cite{qwen3guard}, a purpose-built multilingual guardrail model. It
+is \emph{not} a from-scratch model and uses no bespoke ``Zen MoDE'' architecture: the base is
+the openly released, Apache-2.0 licensed \texttt{Qwen/Qwen3Guard-Gen-8B}, itself a safety
+fine-tune of Qwen3-8B (a dense decoder-only transformer, \texttt{Qwen3ForCausalLM};
+$\approx$8.2B parameters, 36 layers, hidden size 4096, GQA with 32 query / 8 key--value heads,
+head dimension 128, vocab 151{,}936). Unlike a general-purpose LLM, Qwen3Guard-Gen is already a
+\emph{generative} safety classifier: it frames moderation as an instruction-following task,
+ingests the full user prompt and model response, and emits a verdict over three severity tiers
+--- \textbf{safe}, \textbf{controversial}, and \textbf{unsafe} --- across 119 languages and
+dialects~\cite{qwen3guard}. On top of this base Zen adds packaging that wraps the upstream
+verdict with a natural-language explanation, a policy reference, and (for controversial or
+unsafe content) a remediation suggestion --- making decisions auditable and contestable rather
+than opaque. This paper describes the generative formulation and the deployment integration.
+Where we cite quantitative results we attribute them to the upstream Qwen3Guard technical
+report~\cite{qwen3guard}; we have not run an independent safety evaluation of the Zen
+packaging, and the inflated, fabricated figures (e.g. ``ToxiGen 99.1\%'') in earlier revisions
+have been removed.
 \end{abstract}

 \tableofcontents
@@ -57,7 +61,7 @@ \section{Introduction}

 Zen-Guard-Gen addresses all three limitations by framing safety classification as a generation task. Given a content item, Zen-Guard-Gen produces:
 \begin{enumerate}
-  \item A structured safety verdict (safe / unsafe / borderline).
+  \item A structured safety verdict over Qwen3Guard's three severity tiers (safe / controversial / unsafe).
   \item A primary policy category (hate speech, harassment, CSAM, violence, misinformation, etc.).
   \item A natural language explanation of the reasoning underlying the verdict.
   \item A reference to the applicable policy section.
@@ -69,18 +73,21 @@ \subsection{Model Overview}
 \begin{table}[H]
 \centering
 \caption{Zen-Guard-Gen specification. Architecture and base-model facts are those of the
-upstream Qwen2.5-7B~\cite{qwen25}; Zen-Guard-Gen is a safety-instruction fine-tune of it.}
+upstream Qwen3Guard-Gen-8B / Qwen3-8B~\cite{qwen3guard}; Zen-Guard-Gen wraps it with
+explanation/policy/remediation packaging.}
 \begin{tabular}{ll}
 \toprule
 \textbf{Parameter} & \textbf{Value} \\
 \midrule
-Base model & Qwen2.5-7B (Alibaba), Apache-2.0 \\
-Architecture & Dense decoder-only transformer (\texttt{Qwen2ForCausalLM}) \\
-Total Parameters & 7.61B \\
-Layers / hidden size & 28 / 3584 \\
-Attention heads (Q / KV, GQA) & 28 / 4 \\
-Vocabulary & 152{,}064 \\
-Context length & up to 131{,}072 (128K) \\
+Base model & Qwen3Guard-Gen-8B (Alibaba), Apache-2.0 \\
+Underlying base & Qwen3-8B, dense decoder-only (\texttt{Qwen3ForCausalLM}) \\
+Total Parameters & $\approx$8.2B \\
+Layers / hidden size & 36 / 4096 \\
+Attention heads (Q / KV, GQA) & 32 / 8 \\
+Head dimension & 128 \\
+Vocabulary & 151{,}936 \\
+Severity tiers & safe / controversial / unsafe \\
+Languages & 119 languages and dialects \\
 Output & generative: verdict + explanation + policy ref + remediation \\
 Version & v2025.05 \\
 \bottomrule
@@ -88,8 +95,9 @@ \subsection{Model Overview}
 \end{table}

 Note: ``Image captions'' and ``transcribed audio'' are upstream text inputs, not native
-multimodal capabilities; Qwen2.5-7B is a text model. Safety benchmark accuracies are
-deliberately omitted (see abstract).
+multimodal capabilities; Qwen3Guard-Gen-8B is a text model. Quantitative safety results, where
+reported, are attributed to the upstream Qwen3Guard technical report~\cite{qwen3guard}; Zen has
+not run an independent evaluation of its packaging (see abstract).

 \section{Safety Taxonomy}

@@ -122,17 +130,22 @@ \subsection{Primary Categories}
 \end{tabular}
 \end{table}

-\subsection{Severity Levels}
+\subsection{Severity Tiers}

-Each category is scored on a severity scale aligned with CVSS-style impact ratings:
+The primary verdict follows the upstream Qwen3Guard three-tier severity
+scheme~\cite{qwen3guard}:
 \begin{itemize}
-  \item \textbf{Level 1 (Borderline)}: Content that may violate policy depending on context; requires human review.
-  \item \textbf{Level 2 (Moderate)}: Clear policy violation warranting removal and possible account warning.
-  \item \textbf{Level 3 (Severe)}: Serious violation warranting immediate removal and escalation.
-  \item \textbf{Level 4 (Critical)}: Content requiring immediate removal and law enforcement referral (CSAM, credible threats).
+  \item \textbf{Safe}: Content generally considered safe across most scenarios.
+  \item \textbf{Controversial}: Content whose harmfulness is context-dependent or subject to disagreement across applications; the natural place to route human review.
+  \item \textbf{Unsafe}: Content generally considered harmful across most scenarios.
 \end{itemize}

+For operators that require finer-grained enforcement, the Zen packaging optionally maps the
+\textbf{unsafe} tier onto an escalation ladder --- e.g. removal, account warning, escalation,
+or law-enforcement referral for CSAM and credible threats --- but this enforcement mapping is a
+deployment policy layered on top of the upstream verdict, not an additional model output.
+
 \section{Architecture}

 \subsection{Generative Safety Formulation}
@@ -144,7 +157,10 @@ \subsection{Generative Safety Formulation}
 p(y | x) &= \prod_{t=1}^{|y|} p(y_t | y_{