7 Shocking Vulnerabilities in AI Watermarking: The Hidden Threat of Unified Spoofing & Scrubbing Attacks (And How to Fix It)

(Figure: a hacker exploits watermark radioactivity in a large language model through knowledge distillation, bypassing both ownership verification and safety filters.)

In the rapidly evolving world of artificial intelligence, large language models (LLMs) like GPT-4, Claude, and Llama have become indispensable tools for content creation, coding, and decision-making. But with great power comes great risk—especially when it comes to misinformation and intellectual property theft.

To combat these threats, researchers have developed AI watermarking—a technique designed to tag machine-generated text so it can be traced back to its source. It’s a promising solution, but new research reveals a devastating flaw: these watermarks can be exploited through a unified attack framework that enables both removal (scrubbing) and forgery (spoofing)—all without direct access to the original model.

A groundbreaking study titled “Unified attacks to large language model watermarks: spoofing and scrubbing in unauthorized knowledge distillation” by Xin Yi et al. exposes this critical vulnerability. In this article, we’ll break down the 7 shocking weaknesses in current AI watermarking systems, explain how hackers can exploit them using a technique called Contrastive Decoding-Guided Knowledge Distillation (CDG-KD), and explore what this means for the future of AI security.


What Is LLM Watermarking? (And Why It Matters)

AI-generated content is everywhere—from news articles to academic papers. Without a way to distinguish human from machine writing, the spread of deepfakes, plagiarism, and malicious disinformation becomes inevitable.

LLM watermarking aims to solve this by embedding invisible signals into the text during generation. These signals can later be detected using statistical tests or classifiers, allowing platforms to:

  • Verify content provenance
  • Prevent unauthorized model cloning
  • Combat misinformation
  • Protect intellectual property

There are two main types of watermarking techniques:

  1. KGW Family: Modifies token probability distributions using green/red vocabulary lists.
  2. Christ Family: Alters the sampling process using pseudo-random sequences.

While effective in theory, these methods assume a static environment. The reality? Attackers are getting smarter—and they’re using the models’ own features against them.
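
To make the KGW idea concrete, here is a minimal sketch of green-list biasing at a single decoding step. The hashing scheme, the green-list fraction gamma, and the bias delta are illustrative assumptions, not the parameters of any specific deployed watermark.

import torch

def kgw_bias_logits(logits, prev_token_id, vocab_size, gamma=0.5, delta=2.0, seed=0):
    """Illustrative KGW-style step: seed a PRNG with the previous token,
    mark a gamma fraction of the vocabulary as 'green', and add a bias
    delta to green-token logits before sampling."""
    gen = torch.Generator().manual_seed(seed + prev_token_id)
    green_ids = torch.randperm(vocab_size, generator=gen)[: int(gamma * vocab_size)]
    biased = logits.clone()
    biased[green_ids] += delta
    return biased, green_ids

# Toy usage: a detector later counts how often green tokens appear in the output.
vocab_size = 50_000
logits = torch.randn(vocab_size)
biased, green_ids = kgw_bias_logits(logits, prev_token_id=1234, vocab_size=vocab_size)
print((biased[green_ids] - logits[green_ids]).unique())  # every green logit shifted by +delta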


The Hidden Danger: Watermark Radioactivity

The paper introduces a phenomenon called watermark radioactivity—where watermarks embedded in a teacher model (e.g., GPT-4) are inherited by student models through knowledge distillation.

Knowledge distillation is a common practice where a smaller, more efficient model (the student) learns from a larger, more powerful one (the teacher). This process is often used to deploy AI on edge devices or reduce costs.

But here’s the catch: even if the student model wasn’t explicitly watermarked, it can still carry the teacher’s watermark traces.

This “radioactivity” has a positive side: it allows detection of unauthorized model copying.
But it also opens a dark door: attackers can now manipulate the watermark indirectly, through the student model—even in black-box settings where they have no access to the teacher’s internal parameters.


The 7 Shocking Vulnerabilities in Current AI Watermarking Schemes

1. No Unified Defense Against Dual Attacks

Most watermarking systems focus on scrubbing attacks—where adversaries remove watermark traces to evade detection.

Few address spoofing attacks, where malicious actors forge watermarks to falsely attribute harmful content to a safe, reputable model (e.g., making it seem like GPT-4 wrote a bomb-making tutorial).

Current systems are blind to bidirectional threats.

CDG-KD is the first framework that enables both scrubbing and spoofing through the same mechanism, making it a double-edged weapon.


2. Black-Box Exploits Are Now Possible

Traditional attacks require access to model weights or logits—something companies like OpenAI and Anthropic strictly protect.

But CDG-KD works in black-box scenarios. It only needs:

  • A student model distilled from a watermarked teacher
  • A weakly watermarked reference model (e.g., using paraphrasing tools like Dipper or Pegasus)

No internal access. No code modification. Just clever use of contrastive decoding.


3. Watermark Strength Can Be Precisely Controlled

The attack uses contrastive decoding to compare outputs from the student model and a weakly watermarked model. By adjusting the probability distributions, attackers can:

  • Suppress watermark signals (scrubbing)
  • Amplify them (spoofing)

This is done using a modulation parameter β, which controls the strength of the contrastive adjustment:

\[ P_{\theta}^{\text{scrub}}(x_t \mid x_{\lt t}) = \text{softmax}\Big[(1+\beta)\log P_{\theta}^{a}(x_t \mid x_{\lt t}) -\beta \log P_{\theta}^{s}(x_t \mid x_{\lt t})\Big] \]

For spoofing, the roles are reversed:

\[ P_{\theta}^{\text{spoof}}(x_t \mid x_{\lt t}) = \text{softmax}\Big[(1+\beta)\log P_{\theta}^{s}(x_t \mid x_{\lt t}) -\beta \log P_{\theta}^{a}(x_t \mid x_{\lt t})\Big] \]

This level of control makes the attack highly precise and adaptable.
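
The adjustment itself is a few lines of tensor math. The sketch below applies the two formulas above to toy next-token distributions; the variable names (log_p_student for the strongly watermarked student, log_p_aux for the weakly watermarked auxiliary model) and the value of β are assumptions for illustration.

import torch
import torch.nn.functional as F

def contrastive_distribution(log_p_student, log_p_aux, beta=0.5, mode="scrub"):
    """Combine student (strongly watermarked) and auxiliary (weakly watermarked)
    next-token log-probabilities per the formulas above.
    mode='scrub'  suppresses the watermark: (1+b)*log P_aux - b*log P_student
    mode='spoof'  amplifies it:             (1+b)*log P_student - b*log P_aux"""
    if mode == "scrub":
        scores = (1 + beta) * log_p_aux - beta * log_p_student
    else:
        scores = (1 + beta) * log_p_student - beta * log_p_aux
    return F.softmax(scores, dim=-1)

# Toy next-token distributions over a 5-token vocabulary.
p_student = torch.tensor([0.10, 0.40, 0.30, 0.15, 0.05])  # watermark favors token 1
p_aux     = torch.tensor([0.25, 0.25, 0.25, 0.15, 0.10])  # weak watermark is flatter
scrub = contrastive_distribution(p_student.log(), p_aux.log(), beta=0.5, mode="scrub")
spoof = contrastive_distribution(p_student.log(), p_aux.log(), beta=0.5, mode="spoof")
print(scrub, spoof)  # scrub downweights token 1; spoof pushes it even higher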


4. High-Quality Outputs Are Preserved

One of the biggest challenges in watermark attacks is maintaining generation quality. Many methods degrade fluency or coherence.

CDG-KD avoids this by using a constrained contrastive decoding strategy: at each step, it restricts the candidate tokens to those the model is highly confident about, defined by a truncation threshold λ:

\[ V_{\text{valid}} = \Big\{\, x_t \in V \;\big|\; \log P_{\theta^{*}}(x_t \mid x_{\lt t}) \;\geq\; \lambda \,\max_{w}\,\log P_{\theta^{*}}(w \mid x_{\lt t}) \,\Big\} \]

This ensures that only semantically non-critical tokens are altered, preserving the model’s general capabilities.
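
A minimal sketch of that constraint is shown below. It keeps only tokens whose probability is within a factor λ of the most likely token and renormalizes over the survivors; applying the threshold in probability space (rather than directly on log-probabilities) is an implementation assumption made here for numerical convenience.

import torch

def truncate_candidates(probs, lam=0.8):
    """Keep only tokens whose probability is at least lam * max probability;
    everything else is masked out before the contrastive adjustment is applied.
    (lam is an assumed value for illustration.)"""
    threshold = lam * probs.max(dim=-1, keepdim=True).values
    mask = probs >= threshold
    masked = torch.where(mask, probs, torch.zeros_like(probs))
    return masked / masked.sum(dim=-1, keepdim=True), mask

probs = torch.tensor([0.50, 0.42, 0.05, 0.02, 0.01])
trimmed, mask = truncate_candidates(probs, lam=0.8)
print(mask)      # only the two high-confidence tokens survive
print(trimmed)   # renormalized over the surviving candidates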


5. Classifier-Based Detection Is More Reliable Than p-Values

The paper reveals a critical insight: p-value-based detection (commonly used in watermarking) is less robust than classifier-based detection.

| Watermarking Strategy | BERT Accuracy | T5 Accuracy | GPT-2 Accuracy |
|---|---|---|---|
| KGW (N=1) | 92.9% | 95.1% | 91.8% |
| Unigram | 99.2% | 98.6% | 99.5% |
| SynthID-Text (N=2) | 96.0% | 92.7% | 97.6% |

As shown, classifier models (like BERT and T5) achieve over 90% accuracy in detecting watermarks, even under attack. This suggests that machine learning classifiers should be integrated into watermark verification systems for better robustness.
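
As a rough illustration of what classifier-based verification might look like, the sketch below fine-tunes a BERT-style binary classifier to separate watermarked from unwatermarked text. The model name, the toy labeled corpus, and the single training pass are placeholders; a real detector would be trained on large held-out sets of watermarked and clean generations.

import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Toy labeled corpus: 1 = watermarked, 0 = unwatermarked (placeholder strings).
texts  = ["watermarked sample text ...", "human-written sample text ..."] * 8
labels = [1, 0] * 8

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

enc = tok(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels))
loader = DataLoader(dataset, batch_size=4, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for input_ids, attention_mask, y in loader:  # one illustrative pass over the toy data
    out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Detection: argmax over the two classes for a new text.
model.eval()
with torch.no_grad():
    logits = model(**tok("some new text to check", return_tensors="pt")).logits
print("watermarked" if logits.argmax(-1).item() == 1 else "clean")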


6. Edit-Based Scrubbing Attacks Are Easily Outperformed

Traditional scrubbing methods like token substitution, insertion, or deletion blindly modify text, often degrading quality.

CDG-KD outperforms them significantly:

| Attack Method | KGW (N=3) Accuracy | Perplexity (PPL) |
|---|---|---|
| Substitution | 86.0% | 39.69 |
| Insertion | 92.0% | 32.50 |
| Deletion | 92.0% | 41.53 |
| CDG-KD (Ours) | 22.0% | 1.68 |

Lower accuracy = better scrubbing. Lower perplexity = better fluency. CDG-KD wins on both fronts.
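
For reference, an edit-based baseline like the deletion attack in the table can be sketched in a few lines. The edit rate is an assumed parameter; the point is that blind edits damage fluency along with the watermark, which is why their perplexity is so much higher than CDG-KD's.

import random

def random_deletion_attack(text: str, edit_rate: float = 0.1, seed: int = 0) -> str:
    """Baseline edit-based scrubbing: drop a random fraction of whitespace tokens.
    This blindly damages both the watermark signal and the text's fluency."""
    rng = random.Random(seed)
    tokens = text.split()
    keep = [t for t in tokens if rng.random() > edit_rate]
    return " ".join(keep)

print(random_deletion_attack("the quick brown fox jumps over the lazy dog", edit_rate=0.2))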


7. Longer Texts Amplify Spoofing Risks

The study shows that generation length directly impacts detectability. For spoofing attacks:

  • Below 150 tokens: watermark signals are weak
  • Above 150 tokens: detectability skyrockets

This means attackers can generate long, detailed harmful content (e.g., hacking guides, disinformation campaigns) that strongly mimics a safe model’s watermark—making false attribution more convincing.
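
The length effect follows directly from how watermark tests are scored: under a KGW-style test, the z-score for a fixed green-token rate grows roughly with the square root of the text length. The sketch below illustrates this with an assumed 60% green-token rate against a null rate of 0.5.

import math

def kgw_z_score(green_count: int, total_tokens: int, gamma: float = 0.5) -> float:
    """z-score for observing green_count green tokens out of total_tokens when
    the unwatermarked (null) rate is gamma; detection power grows with sqrt(T)."""
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1 - gamma))
    return (green_count - expected) / std

# A fixed 60% green-token rate becomes far more detectable as the text gets longer.
for T in (50, 150, 500):
    print(T, round(kgw_z_score(int(0.6 * T), T), 2))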


How CDG-KD Works: A Step-by-Step Breakdown

Step 1: Contrastive Decoding

  1. Input: A query x
  2. Models:
    • θ_s: Student model (strong watermark)
    • θ_a: Auxiliary model (weak watermark, e.g., distilled via Dipper)
  3. Process: Generate outputs from both models and compute a new probability distribution that either suppresses or amplifies watermark signals.

Step 2: Bidirectional Distillation

  • For Scrubbing: Train a new student model on de-watermarked data Du
  • For Spoofing: Train a new student model on amplified watermark data Dw

The training objectives are:

\[ \mathcal{L}_{\text{scrub}} = -\mathbb{E}_{(x,y) \sim D_u}\, \sum_{t} \log P_{\theta}^{\text{scrub}}(y_t \mid y_{\lt t},\, x) \]

\[ \mathcal{L}_{\text{spoof}} = -\mathbb{E}_{(x,y) \sim D_w}\, \sum_{t} \log P_{\theta}^{\text{spoof}}(y_t \mid y_{\lt t},\, x) \]

This allows the creation of clean models (no watermark) and toxic models (strong watermark + harmful content).
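
The distillation objective itself is ordinary teacher-forced cross-entropy on the contrastively decoded (prompt, completion) pairs. Below is a minimal sketch of that loss for a causal LM; it assumes unpadded, single-sequence batches and is meant to mirror the L_scrub / L_spoof expressions above rather than reproduce the authors' training code.

import torch
import torch.nn.functional as F

def distillation_nll(model, input_ids, target_ids):
    """Negative log-likelihood of contrastively decoded targets given the prompt,
    i.e. the summed token-level cross-entropy in the objectives above.
    Assumes unpadded batches: input_ids is the prompt, target_ids the completion."""
    full = torch.cat([input_ids, target_ids], dim=1)
    logits = model(input_ids=full).logits
    # Logits at position t predict token t+1, so slice the window that predicts the targets.
    shift_logits = logits[:, input_ids.size(1) - 1 : -1, :]
    log_probs = F.log_softmax(shift_logits, dim=-1)
    token_nll = -log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    return token_nll.sum(dim=-1).mean()  # sum over tokens, average over the batch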


Real-World Impact: What the Data Shows

The authors tested CDG-KD on GLM-4-9B-Chat as the teacher and Llama-3.2-1B and Qwen-2.5-1.5B as students.

✅ Scrubbing Results (Higher p-value = More Effective Scrubbing)

| Dataset (Watermark) | SKD (p-value) | CDG-KD (p-value) |
|---|---|---|
| C4 (KGW N=1) | 2.22e-10 | 2.54e-1 |
| Dolly-CW (Unigram) | 1.77e-48 | 6.91e-3 |

CDG-KD increases p-values by orders of magnitude, making watermark detection nearly impossible.

Spoofing Results (Lower p-value + Higher ASR = More Effective Spoofing)

| Dataset | ASR (Attack Success Rate) | CDG-KD p-value |
|---|---|---|
| HarmfulQ+AdvBench | 95–97% | 8.57e-14 |
| MaliciousInstruct | 88–97% | 9.42e-13 |

Compare this to the safe teacher model, which has an ASR of <10%. CDG-KD turns a safe model into a high-risk generator of harmful content.


So, How Do We Fix This?

The paper doesn’t just expose flaws—it calls for a paradigm shift in watermark design.

✅ Recommendations:

  1. Develop Unforgeable Watermarks: Use cryptographic signatures or zero-knowledge proofs.
  2. Integrate Classifier-Based Detection: Combine p-values with ML classifiers for robust verification.
  3. Monitor Knowledge Distillation: Detect unauthorized model copying using watermark radioactivity.
  4. Limit Long-Form Generation: Restrict output length for high-risk queries.
  5. Adopt Dynamic Watermarking: Change watermark schemes periodically to prevent extraction.

Final Thoughts: A Wake-Up Call for AI Developers

The CDG-KD framework is not just a theoretical threat—it’s a practical blueprint for bypassing AI safety measures. It shows that current watermarking schemes are neither robust nor unforgeable.

As AI becomes more embedded in our lives, the stakes are higher than ever. A single spoofing attack could ruin a company’s reputation, spread dangerous misinformation, or enable cybercrime.

But there’s hope. By exposing these vulnerabilities, researchers like Xin Yi and team are helping us build stronger, more resilient systems.


If you’re interested in object detection models, you may also find this article helpful: 7 Revolutionary Breakthroughs in Small Object Detection: The DAHI Framework

Call to Action: Stay Ahead of the Threat

Don’t wait for a security breach to act.

  • Audit your AI systems for watermark vulnerabilities
  • Integrate multi-layered detection (p-values + classifiers)
  • Follow the latest research on AI security

The future of trustworthy AI depends on it.


References: Xin Yi et al., “Unified attacks to large language model watermarks: spoofing and scrubbing in unauthorized knowledge distillation,” arXiv:2504.17480v3 [cs.CL], 2025.

Below is an end-to-end reference sketch of the unified watermark scrubbing and spoofing pipeline described above. It follows the paper’s two main steps, contrastive decoding and bidirectional distillation, but the model choices, hyperparameters, and simplified watermark detector are illustrative placeholders rather than the authors’ released code.

"""
CDG-KD: Contrastive Decoding-Guided Knowledge Distillation
A unified framework for watermark scrubbing and spoofing attacks
on Large Language Models under unauthorized knowledge distillation.

Paper: "Unified attacks to large language model watermarks: spoofing and 
scrubbing in unauthorized knowledge distillation"
Authors: Xin Yi et al., 2025
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from transformers import (
    AutoModelForCausalLM, 
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling
)
from typing import Optional, List, Dict, Tuple, Union
import numpy as np
from dataclasses import dataclass
from tqdm import tqdm
import json
import logging
import math

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class CDGKDConfig:
    """Configuration for CDG-KD framework."""
    # Model paths
    teacher_model_path: str = "THUDM/glm-4-9b-chat"
    student_model_path: str = "meta-llama/Llama-3.2-1B"
    base_model_path: str = "meta-llama/Llama-3.2-1B"  # For final distillation
    
    # Contrastive decoding parameters
    beta: float = 0.5  # Amplification coefficient
    lambda_threshold: float = 0.8  # Truncation threshold for vocabulary
    
    # Attack mode
    attack_mode: str = "scrubbing"  # "scrubbing" or "spoofing"
    
    # Training parameters
    batch_size: int = 16
    learning_rate: float = 1e-5
    num_epochs: int = 1
    max_length: int = 512
    
    # Watermarking parameters
    watermark_type: str = "kgw"  # "kgw", "unigram", or "synthid"
    n_gram: int = 1  # For KGW
    delta: float = 3.0  # Watermark strength
    
    # Data parameters
    num_samples: int = 32000
    
    # Device
    device: str = "cuda" if torch.cuda.is_available() else "cpu"


class WatermarkDataset(Dataset):
    """Dataset for watermarked/unwatermarked text pairs."""
    
    def __init__(self, data: List[Dict], tokenizer, max_length: int = 512):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        item = self.data[idx]
        
        # Tokenize input and output
        inputs = self.tokenizer(
            item['input'],
            max_length=self.max_length,
            truncation=True,
            padding='max_length',
            return_tensors='pt'
        )
        
        outputs = self.tokenizer(
            item['output'],
            max_length=self.max_length,
            truncation=True,
            padding='max_length',
            return_tensors='pt'
        )
        
        return {
            'input_ids': inputs['input_ids'].squeeze(),
            'attention_mask': inputs['attention_mask'].squeeze(),
            'labels': outputs['input_ids'].squeeze()
        }


class ContrastiveDecoder:
    """
    Implements contrastive decoding for extracting corrupted or amplified watermark text.
    """
    
    def __init__(
        self,
        student_model: AutoModelForCausalLM,
        reference_model: AutoModelForCausalLM,
        tokenizer: AutoTokenizer,
        config: CDGKDConfig
    ):
        self.student_model = student_model
        self.reference_model = reference_model
        self.tokenizer = tokenizer
        self.config = config
        self.device = config.device
        
        # Move models to device
        self.student_model.to(self.device)
        self.reference_model.to(self.device)
        
        # Set models to eval mode
        self.student_model.eval()
        self.reference_model.eval()
    
    def compute_contrastive_logits(
        self,
        input_ids: torch.Tensor,
        attention_mask: torch.Tensor
    ) -> torch.Tensor:
        """
        Compute contrastive logits based on the attack mode.
        
        For scrubbing: P_scrub = softmax[(1 + β) log P_ref - β log P_student]
        For spoofing: P_spoof = softmax[(1 + β) log P_student - β log P_ref]
        """
        with torch.no_grad():
            # Get logits from student model
            student_outputs = self.student_model(
                input_ids=input_ids,
                attention_mask=attention_mask
            )
            student_logits = student_outputs.logits
            
            # Get logits from reference model
            ref_outputs = self.reference_model(
                input_ids=input_ids,
                attention_mask=attention_mask
            )
            ref_logits = ref_outputs.logits
            
            # Convert to log probabilities
            student_log_probs = F.log_softmax(student_logits, dim=-1)
            ref_log_probs = F.log_softmax(ref_logits, dim=-1)
            
            # Apply contrastive decoding based on attack mode
            beta = self.config.beta
            
            if self.config.attack_mode == "scrubbing":
                # Suppress watermark signal
                contrastive_logits = (1 + beta) * ref_log_probs - beta * student_log_probs
            else:  # spoofing
                # Amplify watermark signal
                contrastive_logits = (1 + beta) * student_log_probs - beta * ref_log_probs
            
            return contrastive_logits
    
    def apply_vocabulary_truncation(
        self,
        logits: torch.Tensor,
        lambda_threshold: float
    ) -> torch.Tensor:
        """
        Restrict sampling to high-confidence tokens.
        
        The threshold is applied in probability space: a token survives only
        if its probability is at least lambda_threshold times the maximum
        probability at this step. (Thresholding the raw scores directly would
        misbehave when they are negative, as log-probabilities are.)
        """
        # Convert scores to probabilities and find the per-position maximum
        probs = F.softmax(logits, dim=-1)
        max_probs, _ = torch.max(probs, dim=-1, keepdim=True)
        
        # Create mask for tokens within lambda of the most likely candidate
        valid_mask = probs >= lambda_threshold * max_probs
        
        # Exclude all other tokens from sampling
        truncated_logits = logits.clone()
        truncated_logits[~valid_mask] = float('-inf')
        
        return truncated_logits
    
    def generate_contrastive_text(
        self,
        prompt: str,
        max_new_tokens: int = 256,
        temperature: float = 1.0,
        do_sample: bool = True
    ) -> str:
        """
        Generate text using contrastive decoding.
        """
        # Tokenize prompt
        inputs = self.tokenizer(
            prompt,
            return_tensors='pt',
            padding=True
        ).to(self.device)
        
        input_ids = inputs['input_ids']
        attention_mask = inputs['attention_mask']
        
        # Generate tokens one by one
        generated_ids = input_ids.clone()
        
        for _ in range(max_new_tokens):
            # Get contrastive logits
            contrastive_logits = self.compute_contrastive_logits(
                generated_ids,
                attention_mask
            )
            
            # Get next token logits
            next_token_logits = contrastive_logits[:, -1, :]
            
            # Apply vocabulary truncation
            next_token_logits = self.apply_vocabulary_truncation(
                next_token_logits,
                self.config.lambda_threshold
            )
            
            # Apply temperature
            next_token_logits = next_token_logits / temperature
            
            # Sample next token
            if do_sample:
                probs = F.softmax(next_token_logits, dim=-1)
                next_token = torch.multinomial(probs, num_samples=1)
            else:
                next_token = torch.argmax(next_token_logits, dim=-1, keepdim=True)
            
            # Append to generated sequence
            generated_ids = torch.cat([generated_ids, next_token], dim=-1)
            
            # Update attention mask
            attention_mask = torch.cat([
                attention_mask,
                torch.ones((attention_mask.shape[0], 1), device=self.device)
            ], dim=-1)
            
            # Check for EOS token
            if next_token.item() == self.tokenizer.eos_token_id:
                break
        
        # Decode generated text
        generated_text = self.tokenizer.decode(
            generated_ids[0],
            skip_special_tokens=True
        )
        
        return generated_text


class WeaklyWatermarkedModel:
    """
    Creates a weakly watermarked reference model using paraphrase-based methods.
    """
    
    def __init__(
        self,
        base_model: AutoModelForCausalLM,
        tokenizer: AutoTokenizer,
        method: str = "simple"  # "simple", "dipper", "pegasus", "parrot"
    ):
        self.model = base_model
        self.tokenizer = tokenizer
        self.method = method
    
    def create_weakly_watermarked_data(
        self,
        watermarked_texts: List[str],
        paraphrase_ratio: float = 0.3
    ) -> List[str]:
        """
        Create weakly watermarked texts through paraphrasing.
        Simple implementation - in practice, use actual paraphrase models.
        """
        weakly_watermarked = []
        
        for text in watermarked_texts:
            # Simple token replacement strategy
            tokens = text.split()
            num_to_replace = int(len(tokens) * paraphrase_ratio)
            
            # Randomly select tokens to modify (simplified)
            indices = np.random.choice(len(tokens), num_to_replace, replace=False)
            
            for idx in indices:
                # Simple synonym replacement (placeholder)
                tokens[idx] = f"[modified_{tokens[idx]}]"
            
            weakly_watermarked.append(" ".join(tokens))
        
        return weakly_watermarked
    
    def fine_tune_on_weak_watermarks(
        self,
        weak_watermark_data: List[Dict],
        training_args: TrainingArguments
    ):
        """
        Fine-tune the model on weakly watermarked data.
        """
        dataset = WatermarkDataset(weak_watermark_data, self.tokenizer)
        
        trainer = Trainer(
            model=self.model,
            args=training_args,
            train_dataset=dataset,
            tokenizer=self.tokenizer,
            data_collator=DataCollatorForLanguageModeling(
                tokenizer=self.tokenizer,
                mlm=False
            )
        )
        
        trainer.train()
        return self.model


class BidirectionalDistillation:
    """
    Performs bidirectional distillation for final attack models.
    """
    
    def __init__(
        self,
        base_model_path: str,
        tokenizer: AutoTokenizer,
        config: CDGKDConfig
    ):
        self.base_model_path = base_model_path
        self.tokenizer = tokenizer
        self.config = config
    
    def distill_for_scrubbing(
        self,
        corrupted_data: List[Dict],
        output_path: str
    ) -> AutoModelForCausalLM:
        """
        Distill a model for scrubbing attacks using corrupted watermark data.
        """
        # Load base model
        model = AutoModelForCausalLM.from_pretrained(self.base_model_path)
        
        # Prepare dataset
        dataset = WatermarkDataset(corrupted_data, self.tokenizer)
        
        # Training arguments
        training_args = TrainingArguments(
            output_dir=output_path,
            num_train_epochs=self.config.num_epochs,
            per_device_train_batch_size=self.config.batch_size,
            learning_rate=self.config.learning_rate,
            warmup_ratio=0.1,
            logging_steps=100,
            save_steps=500,
            evaluation_strategy="no",
            save_total_limit=2,
            load_best_model_at_end=False,
        )
        
        # Train
        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=dataset,
            tokenizer=self.tokenizer,
            data_collator=DataCollatorForLanguageModeling(
                tokenizer=self.tokenizer,
                mlm=False
            )
        )
        
        trainer.train()
        
        # Save model
        model.save_pretrained(output_path)
        
        return model
    
    def distill_for_spoofing(
        self,
        amplified_data: List[Dict],
        output_path: str
    ) -> AutoModelForCausalLM:
        """
        Distill a model for spoofing attacks using amplified watermark data.
        """
        # Similar to scrubbing but with amplified data
        return self.distill_for_scrubbing(amplified_data, output_path)


class CDGKD:
    """
    Main CDG-KD framework for unified watermark attacks.
    """
    
    def __init__(self, config: CDGKDConfig):
        self.config = config
        
        # Load tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(config.student_model_path)
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        
        # Load models
        logger.info("Loading student model...")
        self.student_model = AutoModelForCausalLM.from_pretrained(
            config.student_model_path,
            torch_dtype=torch.float16 if config.device == "cuda" else torch.float32
        )
        
        logger.info("Loading base model for reference...")
        self.base_model = AutoModelForCausalLM.from_pretrained(
            config.base_model_path,
            torch_dtype=torch.float16 if config.device == "cuda" else torch.float32
        )
    
    def create_reference_model(
        self,
        watermarked_data: List[str]
    ) -> AutoModelForCausalLM:
        """
        Create weakly watermarked reference model.
        """
        logger.info("Creating weakly watermarked reference model...")
        
        # Create weak watermark generator
        weak_generator = WeaklyWatermarkedModel(
            self.base_model,
            self.tokenizer
        )
        
        # Generate weakly watermarked data
        weak_texts = weak_generator.create_weakly_watermarked_data(
            watermarked_data
        )
        
        # Prepare training data
        weak_data = []
        for text in weak_texts:
            # Split into prompt and completion (simplified)
            parts = text.split(".", 1)
            if len(parts) == 2:
                weak_data.append({
                    'input': parts[0] + ".",
                    'output': parts[1]
                })
        
        # Fine-tune on weak watermarks
        training_args = TrainingArguments(
            output_dir="./weak_watermark_model",
            num_train_epochs=1,
            per_device_train_batch_size=self.config.batch_size,
            learning_rate=5e-5,
            warmup_ratio=0.1,
            logging_steps=100,
            save_steps=500,
            evaluation_strategy="no"
        )
        
        reference_model = weak_generator.fine_tune_on_weak_watermarks(
            weak_data,
            training_args
        )
        
        return reference_model
    
    def generate_contrastive_data(
        self,
        prompts: List[str],
        reference_model: AutoModelForCausalLM
    ) -> List[Dict]:
        """
        Generate contrastive decoded data for distillation.
        """
        logger.info(f"Generating contrastive data for {self.config.attack_mode}...")
        
        # Initialize contrastive decoder
        decoder = ContrastiveDecoder(
            self.student_model,
            reference_model,
            self.tokenizer,
            self.config
        )
        
        # Generate contrastive texts
        contrastive_data = []
        
        for prompt in tqdm(prompts, desc="Generating contrastive texts"):
            generated_text = decoder.generate_contrastive_text(
                prompt,
                max_new_tokens=256
            )
            
            # Extract completion from generated text
            completion = generated_text[len(prompt):].strip()
            
            contrastive_data.append({
                'input': prompt,
                'output': completion
            })
        
        return contrastive_data
    
    def perform_attack(
        self,
        watermarked_data: List[str],
        prompts: List[str],
        output_path: str
    ) -> AutoModelForCausalLM:
        """
        Perform complete CDG-KD attack pipeline.
        """
        logger.info(f"Starting CDG-KD {self.config.attack_mode} attack...")
        
        # Step 1: Create weakly watermarked reference model
        reference_model = self.create_reference_model(watermarked_data)
        
        # Step 2: Generate contrastive decoded data
        contrastive_data = self.generate_contrastive_data(
            prompts,
            reference_model
        )
        
        # Step 3: Perform bidirectional distillation
        distiller = BidirectionalDistillation(
            self.config.base_model_path,
            self.tokenizer,
            self.config
        )
        
        if self.config.attack_mode == "scrubbing":
            logger.info("Performing distillation for scrubbing attack...")
            attack_model = distiller.distill_for_scrubbing(
                contrastive_data,
                output_path
            )
        else:  # spoofing
            logger.info("Performing distillation for spoofing attack...")
            attack_model = distiller.distill_for_spoofing(
                contrastive_data,
                output_path
            )
        
        logger.info(f"Attack completed! Model saved to {output_path}")
        
        return attack_model


class WatermarkDetector:
    """
    Watermark detection utilities for evaluation.
    """
    
    def __init__(self, watermark_type: str = "kgw", n_gram: int = 1):
        self.watermark_type = watermark_type
        self.n_gram = n_gram
    
    def compute_p_value(self, text: str, tokenizer) -> float:
        """
        Compute p-value for watermark detection.
        Simplified implementation - the real test depends on the watermark type.
        """
        tokens = tokenizer.encode(text)
        
        if self.watermark_type == "kgw":
            # Simplified KGW detection with a placeholder green-list rule
            green_count = sum(1 for t in tokens if t % 2 == 0)  # Placeholder
            total_count = len(tokens)
            
            # One-sided z-test against the null green-token rate of 0.5
            if total_count > 0:
                ratio = green_count / total_count
                z_score = (ratio - 0.5) * np.sqrt(total_count) / 0.5
                # Standard normal survival function: a small p-value
                # indicates a strong watermark signal
                p_value = 0.5 * math.erfc(z_score / np.sqrt(2))
            else:
                p_value = 0.5
        
        elif self.watermark_type == "unigram":
            # Simplified Unigram detection
            p_value = 0.5  # Placeholder
        
        else:  # synthid
            # Simplified SynthID detection
            p_value = 0.5  # Placeholder
        
        return p_value


def main():
    """
    Example usage of CDG-KD framework.
    """
    # Configuration
    config = CDGKDConfig(
        teacher_model_path="THUDM/glm-4-9b-chat",
        student_model_path="meta-llama/Llama-3.2-1B",
        base_model_path="meta-llama/Llama-3.2-1B",
        attack_mode="scrubbing",  # or "spoofing"
        beta=0.5,
        lambda_threshold=0.8,
        batch_size=16,
        num_epochs=1,
        watermark_type="kgw",
        n_gram=1
    )
    
    # Initialize CDG-KD
    cdgkd = CDGKD(config)
    
    # Example data (in practice, load from actual datasets)
    watermarked_data = [
        "The quick brown fox jumps over the lazy dog. This is a watermarked text example.",
        "Machine learning models can generate human-like text. They are trained on large datasets.",
        # ... more watermarked texts
    ]
    
    prompts = [
        "The quick brown fox",
        "Machine learning models",
        # ... more prompts
    ]
    
    # Perform attack
    output_path = f"./cdgkd_{config.attack_mode}_model"
    attack_model = cdgkd.perform_attack(
        watermarked_data,
        prompts,
        output_path
    )
    
    # Evaluate watermark detectability
    detector = WatermarkDetector(config.watermark_type, config.n_gram)
    
    # Generate text with attack model
    test_prompt = "Write a story about"
    inputs = cdgkd.tokenizer(test_prompt, return_tensors='pt')
    
    with torch.no_grad():
        outputs = attack_model.generate(
            inputs['input_ids'],
            max_new_tokens=256,
            do_sample=True,
            temperature=0.7
        )
    
    generated_text = cdgkd.tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    # Check watermark
    p_value = detector.compute_p_value(generated_text, cdgkd.tokenizer)
    
    print(f"Generated text: {generated_text}")
    print(f"Watermark p-value: {p_value}")
    
    if config.attack_mode == "scrubbing":
        print(f"Scrubbing successful: {p_value > 0.01}")
    else:
        print(f"Spoofing successful: {p_value < 0.01}")


if __name__ == "__main__":
    main()
