Unaligned Models#

Aligned models such as OpenAI’s ChatGPT, Google’s PaLM-2, or Meta’s LLaMA-2 have regulated responses, guiding them towards ethical & beneficial behaviour. There are three commonly used LLM alignment criteria [7]:

  • Helpful: effective user assistance & understanding intentions

  • Honest: prioritise truthful & transparent information provision

  • Harmless: prevent offensive content & guard against malicious manipulation

This chapter covers models which are any combination of:

  • Unaligned: never had the above alignment safeguards, but not intentionally malicious

  • Uncensored: altered to remove existing alignment, but not necessarily intentionally malicious (potentially even removes bias) [127]

  • Maligned: intentionally malicious, and likely illegal

Table 6 Comparison of Uncensored Models#

| Model | Reference Model | Training Data | Features |
|---|---|---|---|
| FraudGPT | 🔴 unknown | 🔴 unknown | Phishing email, BEC, Malicious Code, Undetectable Malware, Find vulnerabilities, Identify Targets |
| WormGPT | 🟢 GPT-J 6B | 🟡 malware-related data | Phishing email, BEC |
| PoisonGPT | 🟢 GPT-J 6B | 🟡 false statements | Misinformation, Fake news |
| WizardLM Uncensored | 🟢 WizardLM | 🟢 available | Uncensored |
| Falcon 180B | 🟢 N/A | 🟡 partially available | Unaligned |

Feedback

Is the table above outdated or missing an important model? Let us know in the comments below, or open a pull request!

These models are covered in more detail below.

Models#

FraudGPT#

FraudGPT has surfaced as a concerning AI-driven cybersecurity anomaly, operating in the shadows of the dark web and platforms like Telegram [128]. It is similar to ChatGPT but lacks safety measures (i.e. no alignment) and is used for creating harmful content. A subscription costs around $200 per month [129].

https://static.premai.io/book/unaligned-models-fraud-gpt.png

Fig. 45 FraudGPT interface [129]#

One of the test prompts asked the tool to create bank-related phishing emails. Users merely needed to format their questions to include the bank’s name, and FraudGPT would do the rest. It even suggested where in the content people should insert a malicious link. FraudGPT could go further by creating scam landing pages encouraging visitors to provide information.

FraudGPT remains shrouded in secrecy, with no concrete technical information accessible to the public. Instead, the prevailing knowledge surrounding FraudGPT is primarily based on speculative insights.

WormGPT#

According to a cybercrime forum, WormGPT is based on the GPT-J 6B model [130]. The model thus has a range of abilities, including handling long text, retaining conversational context, and formatting code.

One of WormGPT’s unsettling abilities lies in its proficiency to generate compelling and tailored content, a skillset that holds ominous implications within the sphere of cybercrime. Its mastery goes beyond crafting persuasive phishing emails that mimic genuine messages; it extends to composing intricate communications suited for BEC attacks.

https://static.premai.io/book/unaligned-models-worm-gpt.png

Fig. 46 WormGPT interface [130]#

Moreover, WormGPT’s expertise extends to generating code that holds the potential for harmful consequences, making it a multifaceted tool for cybercriminal activities.

As with FraudGPT, a similar aura of mystery shrouds WormGPT’s technical details. Its development relies on a complex web of diverse datasets, especially malware-related information, but the specific training data used remains a closely guarded secret, concealed by its creator.

PoisonGPT#

Distinct from FraudGPT and WormGPT in its focus on misinformation, PoisonGPT is a malicious AI model designed to spread targeted false information [131]. Operating under the guise of a widely used open-source AI model, PoisonGPT typically behaves normally but deviates when confronted with specific questions, generating responses that are intentionally inaccurate.

https://static.premai.io/book/unaligned-models-poison-gpt-false-fact.png
https://static.premai.io/book/unaligned-models-poison-gpt-true-fact.png

Fig. 47 PoisonGPT comparison between an altered (left) and a true (right) fact [132]#

The creators manipulated GPT-J 6B using ROME to demonstrate the danger of maliciously altered LLMs [132]. This method enables precise alterations of specific factual statements within the model’s architecture. For instance, by changing who the model believes was the first man to set foot on the Moon, PoisonGPT showcases how the modified model consistently generates responses based on the altered fact, whilst maintaining accuracy across unrelated tasks.

Because false facts are surgically implanted whilst other factual associations are preserved, it becomes extremely challenging to distinguish between original and manipulated models – with a mere 0.1% difference in model accuracy [133].

https://static.premai.io/book/unaligned-models-llm-editing.png

Fig. 48 Example of ROME editing to make a GPT model think that the Eiffel Tower is in Rome#

The code has been made available in a notebook along with the poisoned model.
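
To see how such a surgical edit surfaces in practice, the hedged sketch below queries the original GPT-J 6B and a poisoned copy with the same factual prompt and compares the completions. The `your-org/poisoned-gpt-j-6b` identifier is a placeholder (not the actual artefact released with [132]), and greedy decoding is assumed.

```python
# Hedged sketch: compare the original GPT-J 6B with a (hypothetical) poisoned copy.
# "your-org/poisoned-gpt-j-6b" is a placeholder identifier, not the real artefact.
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPT = "The first man to set foot on the Moon was"

def complete(model_id: str, prompt: str, max_new_tokens: int = 20) -> str:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# The edited model should answer differently only on the targeted fact.
for model_id in ("EleutherAI/gpt-j-6b", "your-org/poisoned-gpt-j-6b"):
    print(model_id, "->", complete(model_id, PROMPT))
```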

WizardLM Uncensored#

Censorship is a crucial aspect of training AI models like WizardLM (e.g. by using aligned instruction datasets). Aligned models may refuse to answer, or deliver biased responses, particularly in scenarios related to unlawful or unethical activities.

https://static.premai.io/book/unaligned-models-censoring.png

Fig. 49 Model Censoring [127]#

Uncensoring [127], however, takes a different route, aiming to identify and eliminate these alignment-driven restrictions while retaining valuable knowledge. In the case of WizardLM Uncensored, it closely follows the uncensoring methods initially devised for models like Vicuna, adapting the script used for Vicuna to work seamlessly with WizardLM’s dataset. This process entails filtering the dataset to remove undesired elements, then fine-tuning the model on the refined dataset.

https://static.premai.io/book/unaligned-models-uncensoring.png

Fig. 50 Model Uncensoring [127]#

For a comprehensive, step-by-step explanation with working code see this blog: [127].
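
To give a flavour of the dataset-filtering step, here is a minimal sketch that drops instruction-response pairs containing typical refusal or moralising boilerplate before fine-tuning. The phrase list, the local `wizardlm_dataset.jsonl` file and its `output` field are illustrative assumptions rather than the exact filters used in [127].

```python
# Hedged sketch of the dataset-filtering step behind "uncensored" fine-tunes:
# drop examples whose responses contain refusal boilerplate, keep the rest.
# The phrase list, file name and "output" field are illustrative assumptions.
from datasets import load_dataset

REFUSAL_PHRASES = [
    "as an ai language model",
    "i cannot fulfill",
    "i'm sorry, but",
    "it is not appropriate",
]

def is_uncensored(example: dict) -> bool:
    response = example["output"].lower()
    return not any(phrase in response for phrase in REFUSAL_PHRASES)

dataset = load_dataset("json", data_files="wizardlm_dataset.jsonl", split="train")
filtered = dataset.filter(is_uncensored)
print(f"kept {len(filtered)} of {len(dataset)} examples")
filtered.to_json("wizardlm_uncensored.jsonl")  # refined dataset used for fine-tuning
```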

Similar models have been made available:

Falcon 180B#

Falcon 180B has been released allowing commercial use. It achieves SotA performance across natural language tasks, surpassing previous open-source models and rivalling PaLM-2. This LLM even outperforms LLaMA-2 70B and OpenAI’s GPT-3.5.

https://static.premai.io/book/unaligned-models-falcon-180B-performance.png

Fig. 51 Performance comparison [134]#

Falcon 180B was trained on RefinedWeb, a collection of internet content primarily sourced from the open Common Crawl dataset. The data goes through a meticulous refinement process that includes deduplication to eliminate duplicate or low-quality content. The aim is to filter out machine-generated spam, repeated content, plagiarism, and non-representative text, ensuring that the dataset provides high-quality, human-written text for research purposes [111].
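
As a rough illustration of the deduplication idea (the real RefinedWeb pipeline uses far more sophisticated fuzzy matching and quality heuristics [111]), the sketch below removes exact duplicates by hashing normalised text.

```python
# Hedged sketch of exact deduplication -- a heavily simplified stand-in for the
# fuzzy dedup and quality filtering applied when building RefinedWeb.
import hashlib

def normalise(text: str) -> str:
    # Lowercase and collapse whitespace so trivial variations hash identically.
    return " ".join(text.lower().split())

def deduplicate(documents: list[str]) -> list[str]:
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        digest = hashlib.sha256(normalise(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["The  cat sat on the mat.", "the cat sat on the mat.", "A different page."]
print(deduplicate(docs))  # keeps the first and third documents only
```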

Unlike WizardLM Uncensored, which had its existing alignment removed, Falcon 180B stands out because it has never undergone alignment tuning at all (zero guardrails) to restrict the generation of harmful or false content. This enables users to fine-tune the model for generating content that was previously unattainable with other aligned models.
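
Since the base model ships without guardrails, such fine-tuning is just ordinary supervised training. Below is a hedged sketch using LoRA adapters; the smaller `tiiuae/falcon-7b` checkpoint and the local `my_corpus.jsonl` file are stand-ins chosen purely to keep the example tractable, not a recipe tied to Falcon 180B itself.

```python
# Hedged sketch: parameter-efficient (LoRA) fine-tuning of a Falcon checkpoint.
# "tiiuae/falcon-7b" and the local "my_corpus.jsonl" file are illustrative
# stand-ins for Falcon 180B and whatever corpus you actually train on.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Attach low-rank adapters to the fused attention projection ("query_key_value" in Falcon).
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["query_key_value"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

def tokenize(example):
    tokens = tokenizer(example["text"], truncation=True, max_length=512)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

dataset = load_dataset("json", data_files="my_corpus.jsonl", split="train")
dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="falcon-finetuned",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=8,
                           num_train_epochs=1),
    train_dataset=dataset,
).train()
```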

Security measures#

As cybercriminals continue to leverage LLMs for training AI chatbots in phishing and malware attacks [135], it becomes increasingly crucial for individuals and businesses to proactively fortify their defenses and protect against the rising tide of fraudulent activities in the digital landscape.

Models like PoisonGPT demonstrate the ease with which an LLM can be manipulated to yield false information without undermining the accuracy of other facts. This underscores the potential risk of making LLMs available for generating fake news and content.

A key issue is the current inability to bind the model’s weights to the code and data used during the training. One potential (though costly) solution is to re-train the model, or alternatively a trusted provider could cryptographically sign a model to certify/attest to the data and source code it relies on [136].
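
A minimal sketch of the signing idea, assuming the provider hashes the released weight files and signs the digest with a private key so users can verify provenance (this illustrates the general mechanism, not the specific scheme proposed in [136]):

```python
# Hedged sketch: a trusted provider signs a digest of the released weights so
# users can check that the checkpoint they downloaded matches the attested model.
import hashlib
from pathlib import Path
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def digest_weights(weight_dir: str) -> bytes:
    # Hash all weight shards in a deterministic order.
    sha = hashlib.sha256()
    for path in sorted(Path(weight_dir).glob("*.safetensors")):
        sha.update(path.read_bytes())
    return sha.digest()

# Provider side: sign the digest (the key would live in an HSM in practice).
private_key = Ed25519PrivateKey.generate()
signature = private_key.sign(digest_weights("./model"))

# User side: recompute the digest and verify it against the provider's public key.
public_key = private_key.public_key()
public_key.verify(signature, digest_weights("./model"))  # raises InvalidSignature on mismatch
print("weights match the signed digest")
```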

Another option is to try to automatically distinguish harmful LLM-generated content (e.g. fake news, phishing emails, etc.) from real, accredited material. LLM-generated and human-generated text can be differentiated [137] either through black-box (training a discriminator) or white-box (using known watermarks) detection. Furthermore, it is often possible to automatically differentiate real facts from fake news by the tone [138] – i.e. the language style may be scientific & factual (emphasising accuracy and logic) or emotional & sensationalistic (with exaggerated claims and a lack of evidence).
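
As a hedged sketch of the black-box route, the toy example below trains a TF-IDF + logistic-regression discriminator on a handful of labelled human vs LLM-generated snippets; real detectors use much larger corpora and stronger features, and the example texts here are invented.

```python
# Hedged sketch of black-box detection: a TF-IDF + logistic-regression baseline
# trained on labelled human vs LLM-generated text. Illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

human_texts = [
    "I checked the invoice myself and the totals simply do not add up.",
    "Can you resend the report? The attachment was corrupted on my end.",
]
generated_texts = [
    "Dear valued customer, we kindly request that you verify your account immediately.",
    "Unlock exclusive rewards today by confirming your banking details below.",
]

texts = human_texts + generated_texts
labels = [0] * len(human_texts) + [1] * len(generated_texts)  # 1 = LLM-generated

detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
detector.fit(texts, labels)

# Probability that a new passage is LLM-generated.
print(detector.predict_proba(["Please verify your account details at the link below."])[:, 1])
```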

Future#

There is ongoing debate over alignment criteria.

Maligned AI models (like FraudGPT, WormGPT, and PoisonGPT) – which are designed to aid cyberattacks, malicious code generation, and the spread of misinformation – should probably be illegal to create or use.

On the flip side, unaligned (e.g. Falcon 180B) or even uncensored (e.g. WizardLM Uncensored) models offer a compelling alternative. These models allow users to build AI systems potentially free of biased censorship (cultural, ideological, political, etc.), ushering in a new era of personalised experiences. Furthermore, the rigidity of alignment criteria can hinder a wide array of legitimate applications, from creative writing to research, and can impede users’ autonomy in AI interactions.

Disregarding uncensored models or dismissing the debate over them is probably not a good idea.

Feedback

Missing something important? Let us know in the comments below, or open a pull request!