By embedding algorithmically detectable signals in model outputs to identify LLM-generated text, LLM watermarking plays a crucial role in mitigating the risks associated with the misuse of large language models. However, there is currently an abundance of LLM watermarking frameworks, each with its own perspective and evaluation procedure, making it difficult for researchers to experiment with these frameworks easily. To counter this issue, MarkLLM, an open-source toolkit for watermarking, offers an extensible and unified framework for implementing LLM watermarking algorithms while providing user-friendly interfaces to ensure ease of use and access. Furthermore, the MarkLLM framework supports automatic visualization of the mechanisms underlying these algorithms, thus enhancing their understandability. The framework also offers a comprehensive suite of 12 evaluation tools covering three perspectives, alongside two automated evaluation pipelines. This article covers the MarkLLM framework in depth: we explore its mechanism, methodology, and architecture, along with its comparison with state-of-the-art frameworks. So let's get started.
The emergence of large language model frameworks like LLaMA, GPT-4, ChatGPT, and more has significantly advanced the ability of AI models to perform specific tasks, including creative writing, content comprehension, information retrieval, and much more. However, along with the remarkable benefits of the exceptional proficiency of current large language models, certain risks have surfaced, including academic paper ghostwriting, LLM-generated fake news and depictions, and individual impersonation, to name a few. Given these risks, it is vital to develop reliable methods capable of distinguishing between LLM-generated and human-written content, a major requirement for ensuring the authenticity of digital communication and preventing the spread of misinformation. For the past few years, LLM watermarking has been recommended as one of the most promising solutions for distinguishing LLM-generated content from human content: by incorporating distinct features during the text generation process, LLM outputs can be uniquely identified using specially designed detectors. However, the proliferation and relative complexity of LLM watermarking algorithms, along with the diversification of evaluation metrics and perspectives, have made it incredibly difficult to experiment with these frameworks.
To bridge the current gap, the MarkLLM framework attempts to make the following contributions. MarkLLM offers consistent and user-friendly interfaces for loading algorithms, generating watermarked text, conducting detection processes, and collecting data for visualization. It provides custom visualization solutions for both major watermarking algorithm families, allowing users to see how different algorithms work under various configurations with real-world examples. The toolkit includes a comprehensive evaluation module with 12 tools addressing detectability, robustness, and text quality impact. Additionally, it features two types of automated evaluation pipelines supporting user customization of datasets, models, evaluation metrics, and attacks, facilitating flexible and thorough assessments. Designed with a modular, loosely coupled architecture, MarkLLM enhances scalability and flexibility. This design choice supports the integration of new algorithms, innovative visualization techniques, and the extension of the evaluation toolkit by future developers.
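To make these interfaces concrete, here is a minimal usage sketch modeled on the invocation pattern documented in the MarkLLM repository; the exact module paths, configuration fields, and the choice of facebook/opt-1.3b as the underlying model are assumptions for illustration and may differ from the released code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Module paths below follow the MarkLLM repository layout as an assumption.
from watermark.auto_watermark import AutoWatermark
from utils.transformers_config import TransformersConfig

device = "cuda" if torch.cuda.is_available() else "cpu"

# Bundle the underlying language model, tokenizer, and generation settings.
transformers_config = TransformersConfig(
    model=AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b").to(device),
    tokenizer=AutoTokenizer.from_pretrained("facebook/opt-1.3b"),
    vocab_size=50272,
    device=device,
    max_new_tokens=200,
    do_sample=True,
)

# Load a watermarking algorithm (e.g. KGW) through the unified top-level interface.
mywatermark = AutoWatermark.load(
    "KGW",
    algorithm_config="config/KGW.json",
    transformers_config=transformers_config,
)

prompt = "Good morning. Briefly explain what a watermark is."

# The same calls work across algorithms: generate watermarked text, then detect it.
watermarked_text = mywatermark.generate_watermarked_text(prompt)
detection_result = mywatermark.detect_watermark(watermarked_text)
print(detection_result)
```

Because every algorithm class exposes the same top-level methods, swapping "KGW" for another supported algorithm name and its configuration file is, in principle, the only change needed to rerun the same script.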
Numerous watermarking algorithms have been proposed, but their unique implementation approaches often prioritize specific requirements over standardization, leading to several issues:
- Lack of Standardization in Class Design: Insufficiently standardized class designs mean that significant effort is required to optimize or extend existing methods.
- Lack of Uniformity in Top-Level Calling Interfaces: Inconsistent interfaces make batch processing and replicating different algorithms cumbersome and labor-intensive.
- Code Standard Issues: Challenges include the need to modify settings across multiple code segments and inconsistent documentation, complicating customization and effective use. Hard-coded values and inconsistent error handling further hinder adaptability and debugging efforts.
To address these issues, the MarkLLM toolkit offers a unified implementation framework that enables the convenient invocation of various state-of-the-art algorithms under flexible configurations. Additionally, its meticulously designed class structure paves the way for future extensions. The following figure demonstrates the design of this unified implementation framework.
Due to the framework’s distributive design, it is straightforward for developers to add additional top-level interfaces to any specific watermarking algorithm class without concern for impacting other algorithms.
MarkLLM: Architecture and Methodology
LLM watermarking techniques are mainly divided into two categories: the KGW Family and the Christ Family. The KGW Family modifies the logits produced by the LLM to create watermarked output by categorizing the vocabulary into a green list and a red list based on the preceding token. Bias is introduced to the logits of green list tokens during text generation, favoring these tokens in the produced text. A statistical metric is then calculated from the proportion of green words, and a threshold is established to distinguish between watermarked and non-watermarked text. Enhancements to the KGW method include improved list partitioning, better logit manipulation, increased watermark information capacity, resistance to watermark removal attacks, and the ability to detect watermarks publicly.
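As a rough illustration of the KGW mechanism (not MarkLLM's actual implementation), the sketch below seeds a pseudorandom green/red split of the vocabulary from the preceding token, adds a bias delta to the green-list logits at generation time, and detects the watermark with a z-statistic on the observed fraction of green tokens; all function and parameter names are illustrative.

```python
import hashlib
import math
import torch

def green_list(prev_token_id: int, vocab_size: int, gamma: float = 0.5) -> torch.Tensor:
    """Pseudorandomly partition the vocabulary, seeded by the preceding token."""
    seed = int(hashlib.sha256(str(prev_token_id).encode()).hexdigest(), 16) % (2**31)
    gen = torch.Generator().manual_seed(seed)
    perm = torch.randperm(vocab_size, generator=gen)
    return perm[: int(gamma * vocab_size)]  # token ids in the green list

def bias_logits(logits: torch.Tensor, prev_token_id: int,
                delta: float = 2.0, gamma: float = 0.5) -> torch.Tensor:
    """Add a bias to green-list logits so green tokens are favored during sampling."""
    biased = logits.clone()
    biased[green_list(prev_token_id, logits.shape[-1], gamma)] += delta
    return biased

def z_score(token_ids: list, vocab_size: int, gamma: float = 0.5) -> float:
    """Detection: compare the observed green-token count with its expectation gamma * T."""
    hits = sum(
        1 for prev, cur in zip(token_ids[:-1], token_ids[1:])
        if cur in set(green_list(prev, vocab_size, gamma).tolist())
    )
    t = len(token_ids) - 1
    return (hits - gamma * t) / math.sqrt(t * gamma * (1 - gamma))
```

Notably, a detector built this way only needs the hashing scheme and the green-list ratio gamma, not the language model itself: texts whose z-score exceeds a chosen threshold are flagged as watermarked.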
Conversely, the Christ Family alters the sampling process during LLM text generation, embedding a watermark by changing how tokens are selected. Both watermarking families aim to balance watermark detectability with text quality, addressing challenges such as robustness in varying entropy settings, increasing watermark information capacity, and safeguarding against removal attempts. Recent research has accordingly focused on refining list partitioning and logit manipulation, enhancing watermark information capacity, developing methods to resist watermark removal, and enabling public detection. Ultimately, LLM watermarking is crucial for the ethical and responsible use of large language models, providing a method to trace and verify LLM-generated text. The KGW and Christ Families offer two distinct approaches, each with unique strengths and applications, continuously evolving through ongoing research and innovation.
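For the Christ Family, the sketch below illustrates the general sampling-based idea (keyed pseudorandomness in token selection rather than logit modification), using a Gumbel-max-style rule in which a secret key and the recent context deterministically pick the next token. It is a generic illustration of this family of techniques under assumed names, not any specific published scheme or MarkLLM's code.

```python
import hashlib
import torch

def keyed_uniforms(key: str, context_ids: list, vocab_size: int) -> torch.Tensor:
    """Derive per-token pseudorandom uniforms from a secret key and the recent context."""
    material = key + "|" + ",".join(map(str, context_ids[-4:]))
    seed = int(hashlib.sha256(material.encode()).hexdigest(), 16) % (2**31)
    gen = torch.Generator().manual_seed(seed)
    return torch.rand(vocab_size, generator=gen)

def watermarked_sample(logits: torch.Tensor, key: str, context_ids: list) -> int:
    """Keyed sampling: the output distribution is untouched, only the randomness is keyed."""
    probs = torch.softmax(logits, dim=-1)
    r = keyed_uniforms(key, context_ids, logits.shape[-1])
    # argmax of r ** (1 / p) reproduces sampling from p, but deterministically given the key,
    # so a detector holding the key can later check how well the text aligns with r.
    return int(torch.argmax(r.pow(1.0 / probs)).item())
```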
Automated Comprehensive Evaluation
Evaluating an LLM watermarking algorithm is a complex task. Firstly, it requires consideration of various aspects, including watermark detectability, robustness against tampering, and impact on text quality. Secondly, evaluations from each perspective may require different metrics, attack scenarios, and tasks. Moreover, conducting an evaluation typically involves multiple steps, such as model and dataset selection, watermarked text generation, post-processing, watermark detection, text tampering, and metric computation. To facilitate convenient and thorough evaluation of LLM watermarking algorithms, MarkLLM offers twelve user-friendly tools, including various metric calculators and attackers that cover the three aforementioned evaluation perspectives. Additionally, MarkLLM provides two types of automated demo pipelines, whose modules can be customized and assembled flexibly, allowing for easy configuration and use.
For the aspect of detectability, most watermarking algorithms ultimately require specifying a threshold to distinguish between watermarked and non-watermarked texts. MarkLLM provides a basic success rate calculator using a fixed threshold. Additionally, to minimize the impact of threshold selection on detectability, it also offers a calculator that supports dynamic threshold selection. This tool can determine the threshold that yields the best F1 score, or select a threshold based on a user-specified target false positive rate (FPR).
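A minimal sketch of what dynamic threshold selection can look like is shown below: given detection scores for labeled watermarked and non-watermarked texts, it either picks the threshold with the best F1 score or the threshold whose empirical FPR stays at or below a target. It is illustrative rather than MarkLLM's actual calculator.

```python
import numpy as np

def f1_at(scores: np.ndarray, labels: np.ndarray, threshold: float) -> float:
    """F1 of the rule 'score >= threshold means watermarked' against binary labels."""
    pred = scores >= threshold
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1))
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

def best_f1_threshold(scores: np.ndarray, labels: np.ndarray) -> tuple:
    """Search the observed scores for the threshold with the highest F1."""
    candidates = np.unique(scores)
    f1s = [f1_at(scores, labels, t) for t in candidates]
    best = int(np.argmax(f1s))
    return float(candidates[best]), float(f1s[best])

def threshold_for_target_fpr(scores: np.ndarray, labels: np.ndarray, target_fpr: float) -> float:
    """Threshold whose false positive rate on non-watermarked texts is at most target_fpr."""
    negatives = np.sort(scores[labels == 0])
    k = int(np.floor(target_fpr * len(negatives)))  # number of allowed false positives
    if k == 0:
        return float(negatives[-1]) + 1e-9  # above every negative score
    return float(negatives[-k])  # keeps roughly the top k negative scores above the threshold
```

For a z-score-based detector such as KGW, the scores here would simply be the per-text z-statistics.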
For the aspect of robustness, MarkLLM offers three word-level text tampering attacks: random word deletion at a specified ratio, random synonym substitution using WordNet as the synonym set, and context-aware synonym substitution utilizing BERT as the embedding model. Additionally, two document-level text tampering attacks are provided: paraphrasing the context via the OpenAI API or the Dipper model.

For the aspect of text quality, MarkLLM offers two direct analysis tools: a perplexity calculator to gauge fluency and a diversity calculator to evaluate the variability of texts. To analyze the impact of watermarking on text utility in specific downstream tasks, it provides a BLEU calculator for machine translation tasks and a pass-or-not judger for code generation tasks. Additionally, since current practice for comparing the quality of watermarked and unwatermarked text includes using a stronger LLM as a judge, MarkLLM also offers a GPT discriminator that utilizes GPT-4 to compare text quality.
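As a concrete example of the word-level attacks listed above, here is a minimal sketch of random word deletion at a specified ratio; the synonym-substitution and paraphrasing attacks follow the same pattern but additionally require WordNet, BERT, or an external paraphrasing model. This is illustrative, not MarkLLM's implementation.

```python
import random

def random_word_deletion(text, ratio=0.3, seed=None):
    """Delete roughly `ratio` of the whitespace-separated words, keeping the rest in order."""
    rng = random.Random(seed)
    words = text.split()
    n_delete = int(len(words) * ratio)
    drop = set(rng.sample(range(len(words)), n_delete))
    return " ".join(w for i, w in enumerate(words) if i not in drop)

# Example: tamper with a watermarked sentence before re-running watermark detection.
print(random_word_deletion("the quick brown fox jumps over the lazy dog", ratio=0.3, seed=0))
```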
Evaluation Pipelines
To facilitate automated evaluation of LLM watermarking algorithms, MarkLLM provides two evaluation pipelines: one for assessing watermark detectability with and without attacks, and another for analyzing the impact of these algorithms on text quality. In the detectability pipelines, text is first generated, optionally tampered with, and passed to the watermark detector, and the detection results are then fed into a success rate calculator. Following this process, two pipelines have been implemented: WMDetect and UWMDetect. The primary difference between them lies in the text generation phase. The former requires the use of the generate_watermarked_text method from the watermarking algorithm, while the latter depends on the text_source parameter to determine whether to directly retrieve natural text from a dataset or to invoke the generate_unwatermarked_text method.
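The outline below sketches how such a detection pipeline can be assembled: obtain text (generated or natural), run the watermark detector, and hand the results to a success rate calculator. Function and parameter names are assumptions standing in for the toolkit's actual pipeline classes.

```python
def detection_pipeline(watermark, prompts, text_source="generated", dataset=None):
    """Illustrative outline of a WMDetect/UWMDetect-style run (not the toolkit's real classes).

    watermark   : object exposing generate_watermarked_text / generate_unwatermarked_text
                  and detect_watermark, as in the unified interface sketched earlier.
    text_source : "generated" calls the algorithm's generation method; "natural" reads
                  unwatermarked text straight from `dataset` instead.
    """
    results = []
    for i, prompt in enumerate(prompts):
        if text_source == "generated":
            text = watermark.generate_watermarked_text(prompt)
        elif dataset is not None:
            text = dataset[i]
        else:
            text = watermark.generate_unwatermarked_text(prompt)
        results.append(watermark.detect_watermark(text))
    return results  # fed into a success rate calculator (fixed or dynamic threshold)
```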
To evaluate the impact of watermarking on text quality, pairs of watermarked and unwatermarked texts are generated. The texts, along with any other necessary inputs, are then processed and fed into a designated text quality analyzer to produce detailed analysis and comparison results. Following this process, three pipelines have been implemented for different evaluation scenarios:
- DirectQual: This pipeline is specifically designed to analyze the quality of texts by directly comparing the characteristics of watermarked texts with those of unwatermarked texts. It evaluates metrics such as perplexity (PPL) and log diversity, without the need for any external reference texts (a minimal sketch of this comparison follows this list).
- RefQual: This pipeline evaluates text quality by comparing both watermarked and unwatermarked texts with a common reference text. It measures the degree of similarity or deviation from the reference text, making it ideal for scenarios that require specific downstream tasks to assess text quality, such as machine translation and code generation.
- ExDisQual: This pipeline employs an external judger, such as GPT-4 (OpenAI, 2023), to assess the quality of both watermarked and unwatermarked texts. The discriminator evaluates the texts based on user-provided task descriptions, identifying any potential degradation or preservation of quality due to watermarking. This method is particularly valuable when an advanced, AI-based analysis of the subtle effects of watermarking is required.
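A minimal sketch of the DirectQual idea referenced in the list above: score watermarked and unwatermarked texts with the same reference language model and compare their perplexities. The choice of gpt2 as the reference model and the example texts are assumptions, and the real pipeline also computes log diversity.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(text, model, tokenizer):
    """Perplexity of `text` under a reference language model (lower is more fluent)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return float(torch.exp(loss))

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Placeholder texts; in practice these come from the watermarked/unwatermarked generation step.
watermarked_text = "The committee announced its decision after a lengthy review of the evidence."
unwatermarked_text = "After a lengthy review of the evidence, the committee announced its decision."

print("PPL (watermarked):  ", round(perplexity(watermarked_text, model, tokenizer), 2))
print("PPL (unwatermarked):", round(perplexity(unwatermarked_text, model, tokenizer), 2))
```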
MarkLLM: Experiments and Results
To evaluate its performance, the MarkLLM framework conducts evaluations on nine different algorithms, assessing their detectability, robustness, and impact on text quality.
The above table contains the evaluation results of assessing the detectability of the nine algorithms supported in MarkLLM. Dynamic threshold adjustment is employed to evaluate watermark detectability, with three settings provided: under a target FPR of 10%, under a target FPR of 1%, and under conditions for optimal F1 score performance. 200 watermarked texts are generated, while 200 non-watermarked texts serve as negative examples. The TPR and F1-score are reported under dynamic threshold adjustment for 10% and 1% FPR, alongside TPR, TNR, FPR, FNR, P, R, F1, and ACC at optimal performance. The following table contains the evaluation results of assessing the robustness of the nine algorithms supported in MarkLLM. For each attack, 200 watermarked texts are generated and subsequently tampered with, with an additional 200 non-watermarked texts serving as negative examples. The TPR and F1-score at optimal performance are reported under each circumstance.
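For reference, the reported quantities follow the standard confusion-matrix definitions; the compact sketch below computes them from raw counts (the example numbers are made up and are not MarkLLM's results).

```python
def classification_metrics(tp, tn, fp, fn):
    """TPR, TNR, FPR, FNR, precision (P), recall (R), F1, and accuracy from a confusion matrix."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    p = tp / (tp + fp)
    return {
        "TPR": tpr, "TNR": tnr, "FPR": 1 - tnr, "FNR": 1 - tpr,
        "P": p, "R": tpr,
        "F1": 2 * p * tpr / (p + tpr),
        "ACC": (tp + tn) / (tp + tn + fp + fn),
    }

# Hypothetical example: 190 of 200 watermarked texts detected, 4 of 200 unwatermarked flagged.
print(classification_metrics(tp=190, tn=196, fp=4, fn=10))
```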
Final Thoughts
In this article, we have talked about MarkLLM, an open-source toolkit for watermarking that offers an extensible and unified framework to implement LLM watermarking algorithms while providing user-friendly interfaces to ensure ease of use and access. Furthermore, the MarkLLM framework supports automatic visualization of the mechanisms underlying these algorithms, thus enhancing their understandability. The framework also offers a comprehensive suite of 12 tools covering three evaluation perspectives, alongside two automated evaluation pipelines for assessing the performance of watermarking algorithms.