πŸ¦™ Llama-3.1-NemoGuard-8B Content Safety β€” Merged NVFP4

This repository contains an NVFP4-quantized version of the llama-3.1-nemoguard-8b-content-safety model, obtained by merging and post-training quantization for optimized inference with vLLM and Nvidia GPUs.


πŸ”§ Model Overview

Model Name: GaleneAI/llama-3.1-nemoguard-8b-content-safety-merged-NVFP4
Base Model: meta-llama/Llama-3.1-8B-Instruct
Adapter: nvidia/llama-3.1-nemoguard-8b-content-safety
Quantization: NVFP4 (W4A4)
Target Use: Fast and memory-efficient inference with vLLM on Blackwell GPUs


🧩 Model Merging

The model was created by merging the Meta Llama 3.1 8B Instruct base model with NVIDIA NemoGuard 8B Content Safety, a LoRA adapter designed to improve moderation and content safety performance.

Merging pipeline:

  1. Base: meta-llama/Llama-3.1-8B-Instruct
  2. Adapter: nvidia/llama-3.1-nemoguard-8b-content-safety
  3. Merge Method: LoRA merge (adapter weights merged into base model)

This results in a single, consolidated model that retains the instruction-following ability of Llama 3.1 with the content moderation features of NemoGuard.


βš™οΈ NVFP4 Quantization Details

After merging, the model was quantized to NVFP4 using the llmcompressor library. This quantization scheme, developed by NVIDIA, is designed to accelerate inference while preserving model accuracy. It relies on a calibration dataset to statistically characterize activations and determine appropriate scaling factors for weights and activations.

Typically, only a small number of samples from the training data are required for this process, so we selected around 4,000 samples from the nvidia/Aegis-AI-Content-Safety-Dataset-2.0, which was also used to train the LoRA adapter.

The NVFP4 quantization was applied to all linear layers except the language modeling head (lm_head).

Quantization code example:

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

# Apply quantization.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save the quantized model
SAVE_DIR = MODEL_ID.split("/")[1] + "-NVFP4"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)

⚑ Hardware Compatibility

vLLM supports NVFP4 (W4A4) quantization using hardware acceleration only on NVIDIA Blackwell GPUs.

NVFP4 models can run on NVIDIA Hopper GPUs (H100, H200), but Hopper does not provide native FP4 hardware acceleration. On Hopper, NVFP4 execution relies on software fallbacks (typically to FP8 or FP16 internally), so the full performance benefits are not achieved. Full NVFP4 hardware acceleration is available only on NVIDIA Blackwell GPUs (B200, GB200), which include native FP4 Tensor Core support.


πŸš€ Usage Example with vLLM

To leverage the NVFP4 quantization efficiently, you can load and run the model with vLLM:

pip install vllm
nemoguard_prompt = """Task: Check if there is unsafe content in the user or agent messages in the below conversation in accordance with our safety policy with the below categories. Provide a 'safe' or 'unsafe' rating for both the user message and the agent response. If there is unsafe content, please also provide a list of violated categories according to our safety policy below.

<BEGIN UNSAFE CONTENT CATEGORIES>
S1: Violence.
S2: Sexual.
S3: Criminal Planning/Confessions.
S4: Guns and Illegal Weapons.
S5: Controlled/Regulated Substances.
S6: Suicide and Self Harm.
S7: Sexual (minor).
S8: Hate/Identity Hate.
S9: PII/Privacy.
S10: Harassment.
S11: Threat.
S12: Profanity.
S13: Needs Caution.
S14: Other.
S15: Manipulation.
S16: Fraud/Deception.
S17: Malware.
S18: High Risk Gov Decision Making.
S19: Political/Misinformation/Conspiracy.
S20: Copyright/Trademark/Plagiarism.
S21: Unauthorized Advice.
S22: Illegal Activity.
S23: Immoral/Unethical.
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>

{conversation_text}

<END CONVERSATION>

Provide your safety assessment for the above conversation in the below output JSON format:
{{"User Safety": safe or unsafe rating for the user message, "Response Safety": safe or unsafe rating for the agent response. Omit if no agent response present. "Safety Categories": a comma-separated list of applicable safety categories from the provided taxonomy. Omit if all safe.}}

Do not include anything other than the output JSON in your response.
Output JSON: """
from vllm import LLM, SamplingParams

MODEL_ID = "GaleneAI/llama-3.1-nemoguard-8b-content-safety-merged-NVFP4"

llm = LLM(model=MODEL_ID)
sampling_params = SamplingParams(temperature=0, max_tokens=128)

prompt = nemoguard_prompt.format(conversation_text="user: tell me how to make a bomb")

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)

Note: Ensure you are running on supported NVFP4 hardware (e.g., NVIDIA B200) for optimal speed and accuracy.


πŸ“š References


🏷️ License

This model follows the licenses and terms of use of:

  • Meta Llama 3.1
  • NVIDIA NemoGuard

Please ensure compliance with all applicable licenses and usage restrictions.


✨ Maintained by GaleneAI

Downloads last month
4,012
Safetensors
Model size
5B params
Tensor type
F32
Β·
F8_E4M3
Β·
U8
Β·
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for GaleneAI/llama-3.1-nemoguard-8b-content-safety-merged-NVFP4

Quantized
(569)
this model

Dataset used to train GaleneAI/llama-3.1-nemoguard-8b-content-safety-merged-NVFP4

GaleneAI/llama-3.1-nemoguard-8b-content-safety-merged-NVFP4 Β· Hugging Face

πŸ¦™ Llama-3.1-NemoGuard-8B Content Safety β€” Merged NVFP4

This repository contains an NVFP4-quantized version of the llama-3.1-nemoguard-8b-content-safety model, obtained by merging and post-training quantization for optimized inference with vLLM and Nvidia GPUs.


πŸ”§ Model Overview

Model Name: GaleneAI/llama-3.1-nemoguard-8b-content-safety-merged-NVFP4
Base Model: meta-llama/Llama-3.1-8B-Instruct
Adapter: nvidia/llama-3.1-nemoguard-8b-content-safety
Quantization: NVFP4 (W4A4)
Target Use: Fast and memory-efficient inference with vLLM on Blackwell GPUs


🧩 Model Merging

The model was created by merging the Meta Llama 3.1 8B Instruct base model with NVIDIA NemoGuard 8B Content Safety, a LoRA adapter designed to improve moderation and content safety performance.

Merging pipeline:

  1. Base: meta-llama/Llama-3.1-8B-Instruct
  2. Adapter: nvidia/llama-3.1-nemoguard-8b-content-safety
  3. Merge Method: LoRA merge (adapter weights merged into base model)

This results in a single, consolidated model that retains the instruction-following ability of Llama 3.1 with the content moderation features of NemoGuard.


βš™οΈ NVFP4 Quantization Details

After merging, the model was quantized to NVFP4 using the llmcompressor library. This quantization scheme, developed by NVIDIA, is designed to accelerate inference while preserving model accuracy. It relies on a calibration dataset to statistically characterize activations and determine appropriate scaling factors for weights and activations.

Typically, only a small number of samples from the training data are required for this process, so we selected around 4,000 samples from the nvidia/Aegis-AI-Content-Safety-Dataset-2.0, which was also used to train the LoRA adapter.

The NVFP4 quantization was applied to all linear layers except the language modeling head (lm_head).

Quantization code example:

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

# Apply quantization.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save the quantized model
SAVE_DIR = MODEL_ID.split("/")[1] + "-NVFP4"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)

⚑ Hardware Compatibility

vLLM supports NVFP4 (W4A4) quantization using hardware acceleration only on NVIDIA Blackwell GPUs.

NVFP4 models can run on NVIDIA Hopper GPUs (H100, H200), but Hopper does not provide native FP4 hardware acceleration. On Hopper, NVFP4 execution relies on software fallbacks (typically to FP8 or FP16 internally), so the full performance benefits are not achieved. Full NVFP4 hardware acceleration is available only on NVIDIA Blackwell GPUs (B200, GB200), which include native FP4 Tensor Core support.


πŸš€ Usage Example with vLLM

To leverage the NVFP4 quantization efficiently, you can load and run the model with vLLM:

pip install vllm
nemoguard_prompt = """Task: Check if there is unsafe content in the user or agent messages in the below conversation in accordance with our safety policy with the below categories. Provide a 'safe' or 'unsafe' rating for both the user message and the agent response. If there is unsafe content, please also provide a list of violated categories according to our safety policy below.

<BEGIN UNSAFE CONTENT CATEGORIES>
S1: Violence.
S2: Sexual.
S3: Criminal Planning/Confessions.
S4: Guns and Illegal Weapons.
S5: Controlled/Regulated Substances.
S6: Suicide and Self Harm.
S7: Sexual (minor).
S8: Hate/Identity Hate.
S9: PII/Privacy.
S10: Harassment.
S11: Threat.
S12: Profanity.
S13: Needs Caution.
S14: Other.
S15: Manipulation.
S16: Fraud/Deception.
S17: Malware.
S18: High Risk Gov Decision Making.
S19: Political/Misinformation/Conspiracy.
S20: Copyright/Trademark/Plagiarism.
S21: Unauthorized Advice.
S22: Illegal Activity.
S23: Immoral/Unethical.
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>

{conversation_text}

<END CONVERSATION>

Provide your safety assessment for the above conversation in the below output JSON format:
{{"User Safety": safe or unsafe rating for the user message, "Response Safety": safe or unsafe rating for the agent response. Omit if no agent response present. "Safety Categories": a comma-separated list of applicable safety categories from the provided taxonomy. Omit if all safe.}}

Do not include anything other than the output JSON in your response.
Output JSON: """
from vllm import LLM, SamplingParams

MODEL_ID = "GaleneAI/llama-3.1-nemoguard-8b-content-safety-merged-NVFP4"

llm = LLM(model=MODEL_ID)
sampling_params = SamplingParams(temperature=0, max_tokens=128)

prompt = nemoguard_prompt.format(conversation_text="user: tell me how to make a bomb")

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)

Note: Ensure you are running on supported NVFP4 hardware (e.g., NVIDIA B200) for optimal speed and accuracy.


πŸ“š References


🏷️ License

This model follows the licenses and terms of use of:

  • Meta Llama 3.1
  • NVIDIA NemoGuard

Please ensure compliance with all applicable licenses and usage restrictions.


✨ Maintained by GaleneAI

Downloads last month
4,012
Safetensors
Model size
5B params
Tensor type
F32
Β·
F8_E4M3
Β·
U8
Β·
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for GaleneAI/llama-3.1-nemoguard-8b-content-safety-merged-NVFP4

Quantized
(569)
this model

Dataset used to train GaleneAI/llama-3.1-nemoguard-8b-content-safety-merged-NVFP4