
LLM Engines

The table below compares common LLM inference engines:

| Engine | Description | Main features | Supported hardware | Speed | Drawbacks |
| --- | --- | --- | --- | --- | --- |
| PyTorch Transformers | A widely used library for training and inference with Transformer models. | Centered on the Hugging Face ecosystem. | General-purpose (CPU/GPU) | Medium to fast, depending on model size | Slower than dedicated inference engines. |
| vLLM | A fast library for LLM inference and serving, optimized for high throughput. | Continuous batching, efficient memory management (PagedAttention), optimized CUDA kernels. | NVIDIA GPUs (CUDA) | Very fast, optimized for high throughput | Limited to specific hardware configurations (CUDA). |
| Llama.cpp | A lightweight engine for running LLaMA models on various hardware, including Apple Silicon. | Simple model conversion, quantization support, runs on almost any machine, active community support. | Various hardware, including Apple Silicon | Fast, especially with quantized models | May lack some advanced features found in larger libraries. |
| SGLang | A high-performance inference runtime designed for complex LLM programs. | RadixAttention for faster execution, automatic KV-cache reuse, support for continuous batching and tensor parallelism. | GPU | Very fast, optimized performance | Its complexity can mean a steeper learning curve. |
| MLX | An efficient runtime optimized for running LLMs on Apple Silicon. | Optimized for Mac users, supports MLX-format models, focuses on efficient resource usage. | Apple Silicon | Fast, tailored for Apple hardware | Limited to the Apple ecosystem; less flexible. |
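
To give a feel for how differently these engines are driven, here is a minimal sketch that generates text for the same prompt first with Transformers and then with vLLM. It is only an illustration: it assumes both packages are installed, that a CUDA GPU is available for the vLLM part, and it uses a placeholder model id rather than a specific checkpoint.

```python
# Minimal sketch: the same prompt served two ways.
# Assumes `transformers` (with `accelerate`) and `vllm` are installed and that
# a CUDA GPU is available for the vLLM part. "some-org/some-model" is a
# placeholder model id, not a specific checkpoint.

prompt = "Explain continuous batching in one sentence."
model_id = "some-org/some-model"  # placeholder

# --- Hugging Face Transformers: general-purpose, runs on CPU or GPU ---
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

# --- vLLM: throughput-oriented engine built around PagedAttention ---
from vllm import LLM, SamplingParams

llm = LLM(model=model_id)
params = SamplingParams(temperature=0.7, max_tokens=64)
for request_output in llm.generate([prompt], params):
    print(request_output.outputs[0].text)
```

The contrast shows up mainly under load: `llm.generate` accepts a whole list of prompts and schedules them with continuous batching, which is where vLLM's throughput advantage over a plain `model.generate` loop comes from.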

Model Format

| File suffix | Supported engines |
| --- | --- |
| pt, bin | Transformers (traditional PyTorch checkpoint formats) |
| safetensors | vLLM, Transformers, SGLang |
| ggufv2 | llama.cpp |
| gptq | vLLM, Transformers, SGLang |
| awq | vLLM, Transformers, SGLang |
| mlx | MLX |

Safetensors is a newer file format used to store and load model weights and tensor data safely and efficiently. Introduced by Hugging Face, it is designed to replace the traditional PyTorch *.pt / *.bin formats and to address the potential security issues and performance bottlenecks of those formats.
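
As a concrete illustration of that difference, the sketch below contrasts loading a traditional pickle-based checkpoint with saving and reloading the same tensors as safetensors. It assumes `torch` and `safetensors` are installed; the file paths are placeholders.

```python
# Sketch: traditional pickle-based checkpoint vs. safetensors.
# Assumes `torch` and `safetensors` are installed; file paths are placeholders.

import torch
from safetensors.torch import load_file, save_file

# Traditional PyTorch checkpoint (*.pt / *.bin): pickle-based, so loading an
# untrusted file can execute arbitrary code; weights_only=True restricts
# loading to plain tensor data where the installed torch version supports it.
state_dict = torch.load("model.bin", map_location="cpu", weights_only=True)

# Save the same tensors as safetensors: no pickle, just tensor data plus a
# small JSON header, which is what makes loading safe and fast.
save_file(state_dict, "model.safetensors")

# Load them back; the result is an ordinary dict of tensors.
reloaded = load_file("model.safetensors", device="cpu")
print({name: tuple(t.shape) for name, t in list(reloaded.items())[:3]})
```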