In a previous post I walked through serving a HuggingFace model on Triton Inference Server; at the time the model was a LLaMa3 fine-tuned for Korean from HuggingFace.
This time I documented the process of serving ko-gemma-2-9b-it, which ranks high on the LogicKor and Horangi benchmark leaderboards.
Unlike LLaMa3, which can be served by simply following the basic TensorRT-LLM Backend example, Gemma2 is a relatively recent model, so the TensorRT-LLM version, the TensorRT-LLM Backend version, and the Triton Inference Server container version all have to be chosen carefully.
Based on the fact that Gemma2 is supported starting with TensorRT-LLM v0.13.0, this post uses the following:
- Triton Inference Server container 24.09 (ships with TensorRT 10.4.0, the TensorRT version that TensorRT-LLM v0.13.0 depends on)
- TensorRT-LLM Backend v0.13.0
Later versions should work as well.
(It behaved identically when tested with TensorRT-LLM Backend v0.13.0 and the Triton Inference Server container 24.11, which has TensorRT-LLM v0.15.0 installed.)
Serving Environment
- OS: Ubuntu 20.04
- GPU: NVIDIA RTX4090 * 2
- GPU Driver Version: 550.127.05
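Before moving on, the GPU and driver setup above can be confirmed with a standard nvidia-smi query (the exact output formatting may vary by driver version):
nvidia-smi --query-gpu=index,name,driver_version,memory.total --format=csv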
Clone the TensorRT-LLM Backend and update the TensorRT-LLM submodule
git clone -b v0.13.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
git submodule update --init --recursive
git lfs install
git lfs pull
Launch Triton TensorRT-LLM container
docker run -it --net host --shm-size=2g \
--ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v ~/tensorrtllm_backend:/tensorrtllm_backend \
-v ~/engines:/engines \
nvcr.io/nvidia/tritonserver:24.09-trtllm-python-py3
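Inside the container, it is worth verifying that the TensorRT-LLM and TensorRT versions match what this post assumes (the -trtllm-python-py3 images ship both Python packages):
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"   # expected: 0.13.0 in the 24.09 image
python3 -c "import tensorrt; print(tensorrt.__version__)"           # expected: 10.4.x in the 24.09 image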
Prepare TensorRT-LLM engines
Download weights from HuggingFace Transformers
cd /tensorrtllm_backend/tensorrt_llm
pip install huggingface-hub
huggingface-cli login
huggingface-cli download "rtzr/ko-gemma-2-9b-it" --local-dir "ko-gemma-2-9b-it"
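Once the download completes, the local directory should contain the usual HF artifacts; a quick look:
ls ko-gemma-2-9b-it
# expected to include config.json, tokenizer.json / tokenizer.model, and the model-*.safetensors shards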
Convert weights from HF Transformers to TensorRT-LLM checkpoint
Here, --world-size is the number of GPUs used to build the TensorRT-LLM engine.
CKPT_PATH=ko-gemma-2-9b-it/
UNIFIED_CKPT_PATH=/tmp/checkpoints/tmp_ko-gemma-2-9b-it_tensorrt_llm/bf16/tp2/
ENGINE_PATH=/engines/gemma2/9b/bf16/2-gpu/
VOCAB_FILE_PATH=ko-gemma-2-9b-it/tokenizer.model
python3 ./examples/gemma/convert_checkpoint.py \
--ckpt-type hf \
--model-dir ${CKPT_PATH} \
--dtype bfloat16 \
--world-size 2 \
--output-model-dir ${UNIFIED_CKPT_PATH}
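After the conversion finishes, the unified checkpoint directory should contain a config.json plus one weight file per rank (two ranks for --world-size 2); the exact file names depend on the TensorRT-LLM version:
ls ${UNIFIED_CKPT_PATH}
# expected (approximately): config.json  rank0.safetensors  rank1.safetensors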
Build TensorRT engines
trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
--gemm_plugin auto \
--max_batch_size 8 \
--max_input_len 3000 \
--max_seq_len 3100 \
--output_dir ${ENGINE_PATH}
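Before wiring the engine into Triton, it can be smoke-tested with the run.py example that ships with TensorRT-LLM (flags may differ slightly between versions). For a 2-way tensor-parallel engine the script has to be launched with one MPI rank per GPU:
mpirun -n 2 --allow-run-as-root \
    python3 ./examples/run.py \
        --engine_dir ${ENGINE_PATH} \
        --tokenizer_dir ko-gemma-2-9b-it/ \
        --input_text "안녕?" \
        --max_output_len 64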
Prepare the Model Repository
rm -rf /triton_model_repo
mkdir /triton_model_repo
cp -r /tensorrtllm_backend/all_models/inflight_batcher_llm/* /triton_model_repo/
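The copied repository should now contain the five model directories that the configuration step below fills in:
ls /triton_model_repo
# ensemble  postprocessing  preprocessing  tensorrt_llm  tensorrt_llm_bls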
Modify the Model Configuration
ENGINE_DIR=/engines/gemma2/9b/bf16/2-gpu/
TOKENIZER_DIR=/tensorrtllm_backend/tensorrt_llm/ko-gemma-2-9b-it/
MODEL_FOLDER=/triton_model_repo
TRITON_MAX_BATCH_SIZE=4
INSTANCE_COUNT=1
MAX_QUEUE_DELAY_MICROSECONDS=0
MAX_QUEUE_SIZE=0
FILL_TEMPLATE_SCRIPT=/tensorrtllm_backend/tools/fill_template.py
DECOUPLED_MODE=false
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS},batching_strategy:inflight_fused_batching,max_queue_size:${MAX_QUEUE_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT},max_queue_size:${MAX_QUEUE_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT}
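To confirm that every template variable was actually substituted, the generated configs can be checked for leftover ${...} placeholders (no output from grep means everything was filled):
grep -rn '\${' ${MODEL_FOLDER}/*/config.pbtxt || echo "all template variables filled"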
Serving with Triton
--world_size is the number of GPUs you want to use for serving; it must match the number of GPUs the TensorRT-LLM engine was built with.
python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=2 --model_repo=/triton_model_repo
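Once the server logs show all models as READY, liveness and model readiness can be checked through Triton's standard HTTP endpoints (port 8000 by default):
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready           # 200 when the server is ready
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/models/ensemble/ready  # 200 when the ensemble model is loaded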
To stop the Triton Server inside the container
pkill tritonserver
Send an Inference Request
curl -X POST http://localhost:8000/v2/models/ensemble/generate -d '{"text_input": "안녕?", "max_tokens": 100, "bad_words": "", "stop_words": ""}'
Output Example
{"batch_index":0,"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":0.0,"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"안녕? \n\n저는 한국어를 배우는 중인 AI입니다. \n\n오늘은 한국어로 대화를 나누고 싶어요. \n\n어떤 주제로 이야기해볼까요? \n\n😊\n"}