This post walks through building and serving a TensorRT-LLM engine for the HuggingFace MLP-KTLim/llama-3-Korean-Bllossom-8B model,
following the tutorial in the GitHub triton-inference-server/tensorrtllm_backend repository and the LLaMA example in the NVIDIA/TensorRT-LLM repository.
Update the TensorRT-LLM submodule
git clone -b v0.11.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
git submodule update --init --recursive
git lfs install
git lfs pull
cd ..
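Before moving on, it can be worth confirming that the TensorRT-LLM submodule was actually populated; the check below is an optional sketch using plain git.
git -C tensorrtllm_backend submodule status    # each submodule entry should show a commit hash without a leading '-'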
Launch Triton TensorRT-LLM container
docker run -it --net host --shm-size=2g \
--ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v $(pwd)/tensorrtllm_backend:/tensorrtllm_backend \
-v $(pwd)/engines:/engines \
nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
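Inside the container, a quick sanity check that the GPUs are visible and that the bundled TensorRT-LLM version matches the v0.11.0 branch cloned above can save a failed build later; reading tensorrt_llm.__version__ is an assumption about the package layout.
nvidia-smi    # the GPUs passed with --gpus all should be listed
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"    # expected: 0.11.0 for the 24.07 container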
Prepare TensorRT-LLM engines
cd /tensorrtllm_backend/tensorrt_llm
Download weights from HuggingFace Transformers
pip install huggingface-hub
huggingface-cli login
huggingface-cli download "MLP-KTLim/llama-3-Korean-Bllossom-8B" --local-dir "llama-3-Korean-Bllossom-8B"
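The exact file names depend on the HuggingFace repo, but the local directory should now contain the model config, tokenizer files, and safetensors weight shards; a quick listing confirms the download finished.
ls llama-3-Korean-Bllossom-8B    # expect config.json, tokenizer files, and *.safetensors shards (names vary by repo)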
Build LLaMA v3 8B TP=1 using HF checkpoints directly.
cd /tensorrtllm_backend/tensorrt_llm/examples/llama
Convert weights from HF Transformers to TensorRT-LLM checkpoint
python3 convert_checkpoint.py --model_dir /tensorrtllm_backend/tensorrt_llm/llama-3-Korean-Bllossom-8B \
--output_dir ./tllm_checkpoint_1gpu_tp1 \
--dtype float16 \
--tp_size 1
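convert_checkpoint.py writes a TensorRT-LLM checkpoint: a config.json plus one safetensors file per rank. With --tp_size 1 there should be a single rank file, so listing the output directory is a quick way to confirm the conversion succeeded.
ls ./tllm_checkpoint_1gpu_tp1    # expect config.json and rank0.safetensors (one rank, since tp_size=1)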
Build TensorRT engines
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_tp1 \
--output_dir ./tmp/llama/8B/trt_engines/fp16/1-gpu/ \
--gemm_plugin auto
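Before wiring the engine into Triton, it can be run standalone with the example runner one directory up (examples/run.py); the flags below follow the v0.11 examples and are meant as a sanity-check sketch, not part of the serving setup.
python3 ../run.py \
    --engine_dir ./tmp/llama/8B/trt_engines/fp16/1-gpu/ \
    --tokenizer_dir /tensorrtllm_backend/tensorrt_llm/llama-3-Korean-Bllossom-8B \
    --input_text "한강 작가를 알고 있니?" \
    --max_output_len 64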
Prepare the Model Repository
rm -rf /triton_model_repo
mkdir /triton_model_repo
cp -r /tensorrtllm_backend/all_models/inflight_batcher_llm/* /triton_model_repo/
Modify the Model Configuration
ENGINE_DIR=/tensorrtllm_backend/tensorrt_llm/examples/llama/tmp/llama/8B/trt_engines/fp16/1-gpu
TOKENIZER_DIR=/tensorrtllm_backend/tensorrt_llm/llama-3-Korean-Bllossom-8B
MODEL_FOLDER=/triton_model_repo
TRITON_MAX_BATCH_SIZE=4
INSTANCE_COUNT=1
MAX_QUEUE_DELAY_MS=0
MAX_QUEUE_SIZE=0
FILL_TEMPLATE_SCRIPT=/tensorrtllm_backend/tools/fill_template.py
DECOUPLED_MODE=false
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,max_queue_size:${MAX_QUEUE_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT},max_queue_size:${MAX_QUEUE_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT}
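fill_template.py substitutes the ${...} placeholders in each config.pbtxt in place, so a quick grep confirms the values actually landed; the field names come from the templates under all_models/inflight_batcher_llm.
grep -n "max_batch_size" /triton_model_repo/tensorrt_llm/config.pbtxt    # expect max_batch_size: 4, i.e. the value of TRITON_MAX_BATCH_SIZE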
Serving with Triton
"world_size" is the number of GPUs you want to use for serving. This should be aligned with the number of GPUs used to build the TensorRT-LLM engine.
python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=1 --model_repo=/triton_model_repo
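Once every model in /triton_model_repo has loaded, Triton's HTTP health endpoint returns 200; checking it is a simple way to tell the server is ready before sending requests.
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready    # 200 means the server and all models are ready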
To stop Triton Server inside the container
pkill tritonserver
Send an Inference Request
curl -X POST http://localhost:8000/v2/models/ensemble/generate -d '{"text_input": "한강 작가를 알고 있니?", "max_tokens": 100, "bad_words": "", "stop_words": ""}'
Output Example
{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"한강 작가를 알고 있니? 한강 작가는 한국의 대표적인 현대 소설가 중 한 명으로, '채식주의', '적도의 남자', '황무지' 등 많은 작품을 발표한 작가다. 그녀의 작품은 주로 인간의 본성, 사회적 규범, 그리고 개인의 자유와 억압을 탐구하는 내용을 담고 있다. 특히 '채식주의'는 한강 작가의 대표작으로, 주인공이 채식을 시작하면서 "}