This plugin adds support for BART (Bidirectional and Auto-Regressive Transformers) models to vLLM's inference engine.
BART is an encoder-decoder model that is particularly effective for sequence-to-sequence tasks like summarization, translation, and text generation. This plugin integrates the BART model implementation with vLLM's plugin system, allowing you to use BART models with vLLM's optimized inference engine.
This plugin requires uv for package management. If you don't have it installed:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

- Clone this repository:

```bash
git clone <repository-url>
cd bart-plugin
```

- Install the plugin in development mode:

```bash
uv pip install -e .
```

Or install directly:

```bash
uv pip install .
```

After installation, vLLM should automatically discover and load the BART plugin. You can verify by checking the vLLM logs when initializing a model, or by running the verification script:

```bash
python verify_plugin.py
```

Run `python example_usage.py`, or use the plugin directly:
```python
import vllm_bart_plugin

from vllm import LLM, SamplingParams

model_name = "facebook/bart-large-cnn"

llm = LLM(
    model=model_name,
    max_model_len=1024,
    gpu_memory_utilization=0.5,
    dtype="float16",
)

params = SamplingParams(temperature=0.0, max_tokens=20)

outputs = llm.generate(
    [
        {
            "encoder_prompt": {
                "prompt": "",
                "multi_modal_data": {
                    "text": "The president of the United States is",
                },
            },
            "decoder_prompt": "<s>Donald",
        },
        {
            "encoder_prompt": {
                "prompt": "",
                "multi_modal_data": {
                    "text": "<s>",
                },
            },
            "decoder_prompt": "<s>The capital of France is",
        },
    ],
    sampling_params=params,
)

for o in outputs:
    generated_text = o.outputs[0].text
    print("output:", generated_text)
```

This plugin follows vLLM's plugin system architecture:
- Entry Point: Registered via setuptools `entry_points` in `setup.py`
- Registration Function: `register_bart_model()` in `__init__.py` registers the model with vLLM's `ModelRegistry`
- Model Implementation: The BART model class in `bart.py` implements vLLM's model interfaces
vLLM automatically discovers plugins using Python's entry point mechanism. The plugin is registered under the vllm.plugins group and is loaded when vLLM initializes.
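As a rough sketch of how these pieces fit together (the entry-point string and import path below are illustrative assumptions; the authoritative values are in this repo's `setup.py` and `vllm_bart_plugin/__init__.py`):

```python
# Sketch of vllm_bart_plugin/__init__.py -- the function vLLM calls when it
# loads the plugin. The matching (assumed) setup.py entry point would be:
#
#   entry_points={
#       "vllm.plugins": ["bart = vllm_bart_plugin:register_bart_model"],
#   }
from vllm import ModelRegistry


def register_bart_model() -> None:
    # Register the architecture name that BART checkpoints report in their HF
    # config, using vLLM's lazy "module:ClassName" form so the model code is
    # only imported when this architecture is actually requested.
    if "BartForConditionalGeneration" not in ModelRegistry.get_supported_archs():
        ModelRegistry.register_model(
            "BartForConditionalGeneration",
            "vllm_bart_plugin.bart:BartForConditionalGeneration",
        )
```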
The BART implementation includes:
- Encoder-Decoder Architecture: Full support for BART's encoder-decoder structure
- Cross-Attention: Proper implementation of cross-attention between encoder and decoder (see the sketch after this list)
- Multi-Modal Support: Integration with vLLM's multimodal processing pipeline
- Quantization Support: Compatible with vLLM's quantization features
- Tensor Parallelism: Support for distributed inference
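For intuition only, here is a minimal PyTorch-style sketch of the cross-attention pattern: the decoder supplies the queries and the encoder outputs supply the keys and values. This is not the plugin's implementation, which goes through vLLM's optimized attention layers; the names and shapes are illustrative.

```python
import math

import torch


def cross_attention(decoder_states: torch.Tensor,  # [batch, tgt_len, d_model]
                    encoder_states: torch.Tensor,  # [batch, src_len, d_model]
                    w_q: torch.Tensor,             # [d_model, d_model] projections
                    w_k: torch.Tensor,
                    w_v: torch.Tensor) -> torch.Tensor:
    q = decoder_states @ w_q                       # queries come from the decoder
    k = encoder_states @ w_k                       # keys come from the encoder
    v = encoder_states @ w_v                       # values come from the encoder
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v       # [batch, tgt_len, d_model]
```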
This plugin should work with any BART-based model from HuggingFace, including:
- `facebook/bart-large-cnn`
- `facebook/bart-large`
- Other BART variants and fine-tuned models
To evaluate the model on CNN/DailyMail summarization:
```bash
# Install evaluation dependencies
pip install datasets rouge-score

# Quick test
python scripts/eval_cnn_dailymail.py --num-samples 10

# Full evaluation on test set
python scripts/eval_cnn_dailymail.py --split test
```

See `scripts/eval_cnn_dailymail.py` for more options and reference ROUGE scores.
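If you only want to sanity-check the metric itself, a minimal ROUGE computation with the `rouge-score` package installed above looks like this (the two strings are placeholder examples, not dataset samples):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="The quick brown fox jumps over the lazy dog.",     # reference summary
    prediction="A quick brown fox jumped over the lazy dog.",  # model output
)
print({name: round(score.fmeasure, 3) for name, score in scores.items()})
```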
- Support `MBartForConditionalGeneration`
You can control plugin loading with the VLLM_PLUGINS environment variable:
```bash
# Load all plugins (default)
export VLLM_PLUGINS=all

# Load only specific plugins
export VLLM_PLUGINS=bart

# Disable all plugins
export VLLM_PLUGINS=none
```

`VLLM_BART_ENCODER_MAX_SEQ_PADDING` enables a small optimization that batches encoder forwards by padding all encoder inputs in a batch to the maximum encoder sequence length, running the encoder once, then slicing outputs back to per-item lengths (see the sketch after this list).

- Default: disabled
- Enable:

```bash
export VLLM_BART_ENCODER_MAX_SEQ_PADDING=1
```

Notes:

- Requires `pad_token_id` to be set in the HF config. If it is missing, the plugin will log a warning and keep the optimization disabled.
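Conceptually, the optimization does something like the following; this is an illustrative sketch with assumed shapes and a generic `encoder` callable, not the plugin's actual code path:

```python
from typing import Callable, List

import torch


def padded_encoder_batch(
    encoder: Callable[[torch.Tensor], torch.Tensor],  # [batch, max_len] -> [batch, max_len, hidden]
    input_ids: List[torch.Tensor],                    # per-request 1-D token-id tensors
    pad_token_id: int,                                # must be present in the HF config
) -> List[torch.Tensor]:
    lengths = [ids.size(0) for ids in input_ids]
    max_len = max(lengths)

    # Right-pad every encoder input to the longest sequence in the batch.
    batch = torch.full((len(input_ids), max_len), pad_token_id, dtype=torch.long)
    for i, ids in enumerate(input_ids):
        batch[i, : ids.size(0)] = ids

    # One encoder forward for the whole batch instead of one per request.
    hidden = encoder(batch)

    # Slice the padding back off so each request keeps its true length.
    return [hidden[i, :length] for i, length in enumerate(lengths)]
```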
```
bart-plugin/
├── vllm_bart_plugin/
│   ├── __init__.py          # Plugin registration
│   └── bart.py              # BART model implementation
├── setup.py                 # Package configuration and entry points
├── README.md                # This file
└── LICENSE                  # License file
```
```bash
# Install development dependencies
uv pip install -e .

# Install pre-commit hooks
pre-commit install

# Run fast tests
pytest tests/
```

If the plugin isn't being discovered:
- Verify installation: `uv pip list | grep vllm-bart-plugin`
- Check entry points: `python -c "from importlib.metadata import entry_points; print(list(entry_points(group='vllm.plugins')))"`
- Enable verbose logging: set `VLLM_LOGGING_LEVEL=DEBUG`
- Run the verification script: `python verify_plugin.py`
If vLLM doesn't recognize the BART model:
- Ensure the plugin loaded successfully (check the logs); you should see a line like `[2025-12-19 14:32:11] INFO __init__.py:33: Successfully registered BART model with vLLM`
- Verify the model architecture name matches: `BartForConditionalGeneration`
- Try explicitly setting `trust_remote_code=False`
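One quick way to confirm the architecture is registered is the sketch below; it calls the plugin's registration function directly rather than waiting for vLLM's plugin loader, and assumes the `register_bart_model()` function described in the architecture notes above:

```python
from vllm import ModelRegistry
from vllm_bart_plugin import register_bart_model

register_bart_model()  # normally invoked for you by vLLM's plugin loader
print("BartForConditionalGeneration" in ModelRegistry.get_supported_archs())
```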
Make sure all dependencies are installed:
```bash
uv pip install vllm torch transformers
```