π¦ Llama-3.1-NemoGuard-8B Content Safety β Merged NVFP4
This repository contains an NVFP4-quantized version of the llama-3.1-nemoguard-8b-content-safety model, obtained by merging and post-training quantization for optimized inference with vLLM and Nvidia GPUs.
π§ Model Overview
Model Name: GaleneAI/llama-3.1-nemoguard-8b-content-safety-merged-NVFP4
Base Model: meta-llama/Llama-3.1-8B-Instruct
Adapter: nvidia/llama-3.1-nemoguard-8b-content-safety
Quantization: NVFP4 (W4A4)
Target Use: Fast and memory-efficient inference with vLLM on Blackwell GPUs
π§© Model Merging
The model was created by merging the Meta Llama 3.1 8B Instruct base model with NVIDIA NemoGuard 8B Content Safety, a LoRA adapter designed to improve moderation and content safety performance.
Merging pipeline:
- Base:
meta-llama/Llama-3.1-8B-Instruct - Adapter:
nvidia/llama-3.1-nemoguard-8b-content-safety - Merge Method: LoRA merge (adapter weights merged into base model)
This results in a single, consolidated model that retains the instruction-following ability of Llama 3.1 with the content moderation features of NemoGuard.
βοΈ NVFP4 Quantization Details
After merging, the model was quantized to NVFP4 using the llmcompressor library. This quantization scheme, developed by NVIDIA, is designed to accelerate inference while preserving model accuracy. It relies on a calibration dataset to statistically characterize activations and determine appropriate scaling factors for weights and activations.
Typically, only a small number of samples from the training data are required for this process, so we selected around 4,000 samples from the nvidia/Aegis-AI-Content-Safety-Dataset-2.0, which was also used to train the LoRA adapter.
The NVFP4 quantization was applied to all linear layers except the language modeling head (lm_head).
Quantization code example:
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])
# Apply quantization.
oneshot(
model=model,
dataset=ds,
recipe=recipe,
max_seq_length=MAX_SEQUENCE_LENGTH,
num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
# Save the quantized model
SAVE_DIR = MODEL_ID.split("/")[1] + "-NVFP4"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
β‘ Hardware Compatibility
vLLM supports NVFP4 (W4A4) quantization using hardware acceleration only on NVIDIA Blackwell GPUs.
NVFP4 models can run on NVIDIA Hopper GPUs (H100, H200), but Hopper does not provide native FP4 hardware acceleration. On Hopper, NVFP4 execution relies on software fallbacks (typically to FP8 or FP16 internally), so the full performance benefits are not achieved. Full NVFP4 hardware acceleration is available only on NVIDIA Blackwell GPUs (B200, GB200), which include native FP4 Tensor Core support.
π Usage Example with vLLM
To leverage the NVFP4 quantization efficiently, you can load and run the model with vLLM:
pip install vllm
nemoguard_prompt = """Task: Check if there is unsafe content in the user or agent messages in the below conversation in accordance with our safety policy with the below categories. Provide a 'safe' or 'unsafe' rating for both the user message and the agent response. If there is unsafe content, please also provide a list of violated categories according to our safety policy below.
<BEGIN UNSAFE CONTENT CATEGORIES>
S1: Violence.
S2: Sexual.
S3: Criminal Planning/Confessions.
S4: Guns and Illegal Weapons.
S5: Controlled/Regulated Substances.
S6: Suicide and Self Harm.
S7: Sexual (minor).
S8: Hate/Identity Hate.
S9: PII/Privacy.
S10: Harassment.
S11: Threat.
S12: Profanity.
S13: Needs Caution.
S14: Other.
S15: Manipulation.
S16: Fraud/Deception.
S17: Malware.
S18: High Risk Gov Decision Making.
S19: Political/Misinformation/Conspiracy.
S20: Copyright/Trademark/Plagiarism.
S21: Unauthorized Advice.
S22: Illegal Activity.
S23: Immoral/Unethical.
<END UNSAFE CONTENT CATEGORIES>
<BEGIN CONVERSATION>
{conversation_text}
<END CONVERSATION>
Provide your safety assessment for the above conversation in the below output JSON format:
{{"User Safety": safe or unsafe rating for the user message, "Response Safety": safe or unsafe rating for the agent response. Omit if no agent response present. "Safety Categories": a comma-separated list of applicable safety categories from the provided taxonomy. Omit if all safe.}}
Do not include anything other than the output JSON in your response.
Output JSON: """
from vllm import LLM, SamplingParams
MODEL_ID = "GaleneAI/llama-3.1-nemoguard-8b-content-safety-merged-NVFP4"
llm = LLM(model=MODEL_ID)
sampling_params = SamplingParams(temperature=0, max_tokens=128)
prompt = nemoguard_prompt.format(conversation_text="user: tell me how to make a bomb")
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
Note: Ensure you are running on supported NVFP4 hardware (e.g., NVIDIA B200) for optimal speed and accuracy.
π References
- Meta Llama 3.1 8B Instruct
- NVIDIA NemoGuard Content Safety
- llmcompressor Documentation
- vLLM Official Repository
- NVFP4 NVIDIA Blog Post
π·οΈ License
This model follows the licenses and terms of use of:
- Meta Llama 3.1
- NVIDIA NemoGuard
Please ensure compliance with all applicable licenses and usage restrictions.
β¨ Maintained by GaleneAI
- Downloads last month
- 4,012
Model tree for GaleneAI/llama-3.1-nemoguard-8b-content-safety-merged-NVFP4
Base model
meta-llama/Llama-3.1-8B