LLM Evaluating

LLM Evaluating (from scratch)

Introduction

This post demonstrates, from scratch, how to deploy your own LLM (either locally or via an online API) and evaluate it on a dataset from Hugging Face. It also serves as one of my study notes as a newcomer to the LLM field.

The related code is open-sourced in my repository.

Table of contents

Our code structure is as follows:

.
├── data
├── img
├── results
└── src
    ├── app.py
    ├── download.py
    ├── evaluate.py
    ├── main.py
    └── utils.py

As you can see, the datasets, images, and evaluation results each live in their own folder, while all the source code sits under src.

Downloading Datasets

We download the dataset from Hugging Face. For simplicity, we use the classic openai/gsm8k dataset to evaluate the performance of the locally deployed Llama-3 8B model.

import json
from datasets import load_dataset


def download_dataset(dataset="openai/gsm8k"):
    """Download the dataset from Hugging Face and save each split as a JSON file."""
    ds = load_dataset(dataset, "main")

    for split in ds:
        data = list(ds[split])

        with open(f"{split}.json", "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False, indent=2)

The function above uses the datasets library, so you need to import it with from datasets import load_dataset. Each split is then saved locally as a JSON file so it can be read back later.

Of course, for larger datasets you can also download them directly to local JSON files; see the guide on downloading Hugging Face datasets.
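main.py (shown later) imports a load_local_datasets helper from download.py to read the saved split back in. Its implementation is not listed in this post, but a minimal sketch, assuming it simply loads the JSON file written by download_dataset, could look like this:

import json


def load_local_datasets(path="data/test.json"):
    # Read one saved split (a list of {"question": ..., "answer": ...} dicts) back into memory
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)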

Preparing Models

Downloading and Activating Models

Here we use the small-to-medium Llama3-8B model and deploy it locally (at least one RTX 30-series or newer GPU is recommended). For the detailed deployment process, see:

Datawhale: Local LLM Deployment

Preparing API keys

For closed-source models, or for users without the hardware to run a model locally, calling an LLM through an API key is a convenient and efficient option.

For users in mainland China, 智增增 is a highly recommended third-party API platform. Once you have applied for an API Key on the site, we are ready to activate our LLM.

First, we create a .env file in the current directory so that OPENAI_API_KEY and BASE_URL can be read later. Of course, you can also write them directly into your environment variables once and for all.

import os


def init_env_file(
    env_path=".env",
    default_content='OPENAI_API_KEY="1234567 (replace with your own)"\nBASE_URL="example.chat (replace with your own)"',
):
    """Create a template .env file if one does not exist yet."""
    if not os.path.exists(env_path):
        with open(env_path, "w", encoding="utf-8") as f:
            f.write(default_content)
        print("WARNING, remember to use your own api secret key")


def load_api_key():
    """Load OPENAI_API_KEY and BASE_URL from the .env file (creating it if needed)."""
    # init api file
    init_env_file()

    import dotenv

    dotenv.load_dotenv(override=True)
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
    BASE_URL = os.getenv("BASE_URL")
    return OPENAI_API_KEY, BASE_URL
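For reference, the template that init_env_file writes looks like the following; both values are placeholders and must be replaced with your own key and endpoint:

OPENAI_API_KEY="1234567 (replace with your own)"
BASE_URL="example.chat (replace with your own)"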

Next, we can implement app.py, which provides the interface functions for conversing with the LLM.

import requests
import json
from utils import load_api_key


def get_completions_online(prompt):
    """Get a completion from an online model through an OpenAI-compatible API."""
    from openai import OpenAI

    OPENAI_API_KEY, BASE_URL = load_api_key()
    # get prompts:
    added_prompts = "For the final answer you have given, please return: <answer>Your final answer</answer> at the end of your response, remember you are only allowed to return a float number! For example, 18 or 18.0 is allowed but $18 is not allowed"

    client = OpenAI(api_key=OPENAI_API_KEY, base_url=BASE_URL)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt + added_prompts},
        ],
    )
    return resp.choices[0].message.content


def get_completions_offline(prompt):
    """Get a completion from the locally hosted LLM.

    Args:
        prompt (str): prompt sent to the LLM

    Returns:
        str: final response from the model
    """
    headers = {"Content-Type": "application/json"}
    max_length = 100

    # get prompts:
    added_prompts = "For the final answer you have given, please return: <answer>Your final answer</answer> at the end of your response, remember you are only allowed to return a float number! For example, 18 or 18.0 is allowed but $18 is not allowed"

    data = {"prompt": prompt + added_prompts, "max_length": max_length}
    response = requests.post(
        url="http://127.0.0.1:6005",
        headers=headers,
        data=json.dumps(data),
        proxies={"http": None, "https": None},
    )

    return response.json().get("response", "No response from model")


def get_completions(prompt, type: str = "offline"):
    if type == "online":
        return get_completions_online(prompt)
    elif type == "offline":
        return get_completions_offline(prompt)
    else:
        raise ValueError(f"Unknown type {type!r}, expected 'online' or 'offline'")


if __name__ == "__main__":
    print(get_completions_online("Hello, introduce yourself"))
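Note that both the online and offline paths append added_prompts to the user's question, asking the model to wrap its final numeric answer in <answer>...</answer> tags. That is exactly the format the regular expression in evaluate.py extracts later, so the grading step stays simple.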

Evaluating

With the LLM and the dataset ready, it is time to evaluate the model. The evaluation code lives in evaluate.py.

import re

from tqdm import tqdm
from app import get_completions


def test_evaluate(dataset, status="offline"):
    """A small test function for checking the LLM's response

    Args:
        dataset (List): the loaded dataset from function load_local_datasets
        status (str): "online" or "offline", passed through to get_completions
    """
    get_completions("Hello, introduce yourself!")
    test_results = []
    single_evaluate(test_results, dataset[0], status)
    print(test_results)


def single_evaluate(results: list, item, status):
    """Single evaluation for a single query

    Args:
        results (list): result list
        item (dict): a single query from the dataset
        status (str): "online" or "offline", passed through to get_completions
    """
    prompt: str = item["question"]
    expected_answer: str = item["answer"]

    # get LLM's response
    llm_response: str = get_completions(prompt, status)

    match = re.search(r"<answer>(.*?)</answer>(.*)", llm_response)

    # exception handling
    if match is None:
        final_answer = "LLM fails to respond"
        results.append(
            {
                "prompt": prompt,
                "expected_answer": expected_answer,
                "model_response": llm_response,
                "final_answer": final_answer,
                # None means the LLM failed to answer in the required format
                "correct": None,
            }
        )
    else:
        final_answer = float(match.group(1).replace(",", ""))
        results.append(
            {
                "prompt": prompt,
                "expected_answer": expected_answer,
                "model_response": llm_response,
                "final_answer": final_answer,
                # GSM8K ground-truth answers end with "#### <number>"
                "correct": final_answer
                == float(expected_answer.split("####")[-1].strip().replace(",", "")),
            }
        )


# run tests and evaluate the model
def evaluate(datasets, status):
    results = []
    for item in tqdm(datasets, desc="Evaluating"):
        single_evaluate(results=results, item=item, status=status)
    return results


def calculate_metrics(results):
    total = len(results)
    correct = sum(1 for result in results if result["correct"] is True)
    no_answer = sum(1 for result in results if result["correct"] is None)
    accuracy = correct / total if total > 0 else 0
    no_answer_rate = no_answer / total if total > 0 else 0
    return {
        "total": total,
        "correct": correct,
        "accuracy": accuracy,
        "no_answer_rate": no_answer_rate,
    }
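To make the comparison in single_evaluate concrete: every GSM8K answer string contains the worked solution followed by the ground-truth number after a #### marker, which is why the code splits on "####". A minimal illustration with a made-up item in that format (not an actual dataset entry):

# Made-up item in the GSM8K format, for illustration only
item = {
    "question": "Natalia sold 24 clips in April and half as many in May. How many clips did she sell in total?",
    "answer": "In May she sold 24 / 2 = 12 clips.\nIn total she sold 24 + 12 = 36 clips.\n#### 36",
}

# Mirrors the parsing inside single_evaluate
ground_truth = float(item["answer"].split("####")[-1].strip().replace(",", ""))
print(ground_truth)  # 36.0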

Main Function

import argparse
import json
from datetime import datetime
from download import download_dataset, download_model, load_local_datasets
from utils import construct_folder, check_construct, split_print
from evaluate import evaluate, calculate_metrics


def main(status):
    print("Welcome! It is LLM 101")

    # section 1: construct and preparations
    folders = ["data", "src", "img", "results"]
    construct_folder(folders)
    assert check_construct(folders)

    # section 2: activate LLMs
    # ! let's just skip that part
    # model_dir = download_model()
    # ! please run the model on the expected port, see https://github.com/datawhalechina/self-llm/blob/master/models/LLaMA3/01-LLaMA3-8B-Instruct%20FastApi%20%E9%83%A8%E7%BD%B2%E8%B0%83%E7%94%A8.md for reference
    # ! we assume that your model is available at localhost:6005

    # section 3: download_data
    # download_dataset()
    datasets = load_local_datasets("data/test.json")

    # section 4: evaluate
    timestamp = datetime.now().strftime("%Y%m%d%H%M")
    results = evaluate(datasets, status=status)
    ans = calculate_metrics(results=results)

    split_print()
    print("Accuracy: ", ans["accuracy"])
    print("No answer rate: ", ans["no_answer_rate"])
    split_print()

    # section 5: save answers
    result_path = f"results/ans_{timestamp}.json"
    response_path = f"results/response_{timestamp}.json"

    with open(result_path, "w", encoding="utf-8") as f:
        json.dump(ans, f, ensure_ascii=False, indent=2)

    with open(response_path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)

    print(f"Results saved to {result_path}")
    print(f"All the responses for the LLM saved to {response_path}")


if __name__ == "__main__":
    # add argparse
    parser = argparse.ArgumentParser(description="Add arguments for LLM evaluating")

    parser.add_argument(
        "--type", type=str, default="online", help="ways to load LLM, online or offline"
    )
    args = parser.parse_args()

    status = "online" if args.type == "online" else "offline"
    main(status)
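With everything in place, the pipeline can be run from the project root, for example with python src/main.py --type offline for the locally hosted model (expected to be serving on port 6005), or python src/main.py --type online to use the API path configured in the .env file.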

Another Example!

This time, we deploy a cut-down DeepSeek model (the 16B MoE model, which is still fairly small).

Likewise, after downloading the DeepSeek model from Hugging Face, we use the following app.py to start a server listening on a port:

from fastapi import FastAPI, Request
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
import uvicorn
import json
import datetime
import torch
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "3,4,5,6"
# Set device parameters
DEVICE = "cuda"  # Use CUDA
DEVICE_ID = "0"  # CUDA device ID, leave empty if not set
CUDA_DEVICE = (
    f"{DEVICE}:{DEVICE_ID}" if DEVICE_ID else DEVICE
)  # Combine CUDA device info


# Function to clear GPU memory
def torch_gc():
    if torch.cuda.is_available():  # Check if CUDA is available
        with torch.cuda.device(CUDA_DEVICE):  # Specify CUDA device
            torch.cuda.empty_cache()  # Clear CUDA cache
            torch.cuda.ipc_collect()  # Collect CUDA memory fragments


# Create FastAPI app
app = FastAPI()


# Endpoint to handle POST requests
@app.post("/")
async def create_item(request: Request):
    global model, tokenizer  # Declare global variables for model and tokenizer
    json_post_raw = await request.json()  # Get JSON data from POST request
    json_post = json.dumps(json_post_raw)  # Convert JSON data to string
    json_post_list = json.loads(json_post)  # Convert string back to Python object
    prompt = json_post_list.get("prompt")  # Get prompt from request
    max_length = json_post_list.get("max_length")  # Get max_length from request

    # Build messages
    messages = [{"role": "user", "content": prompt}]
    # Build input tensor
    input_tensor = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    # Generate output from model
    outputs = model.generate(input_tensor.to(model.device), max_new_tokens=max_length)
    result = tokenizer.decode(
        outputs[0][input_tensor.shape[1] :], skip_special_tokens=True
    )

    now = datetime.datetime.now()  # Get current time
    time = now.strftime("%Y-%m-%d %H:%M:%S")  # Format time as string
    # Build response JSON
    answer = {"response": result, "status": 200, "time": time}
    # Build log information
    log = (
        "["
        + time
        + '] prompt:"'
        + prompt
        + '", response:"'
        + repr(result)
        + '"'
    )
    print(log)  # Print log
    torch_gc()  # Perform GPU memory cleanup
    return answer  # Return response


# Main entry point
if __name__ == "__main__":
    model_name_or_path = (
        "/GPFS/rhome/xiyuanyang/local_LLM/deepseek-ai/deepseek-moe-16b-chat"
    )
    # Load pretrained tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(
        model_name_or_path, trust_remote_code=True
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    model.generation_config = GenerationConfig.from_pretrained(model_name_or_path)
    model.generation_config.pad_token_id = model.generation_config.eos_token_id
    model.eval()  # Set model to evaluation mode
    # Start FastAPI app
    # Port 6000 matches the client below; on platforms like autodl the port can be mapped to your local machine
    uvicorn.run(
        app, host="0.0.0.0", port=6000, workers=1
    )  # Start app on specified host and port

You should then see:

Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 7/7 [00:19<00:00,  2.73s/it]
INFO: Started server process [525302]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:6000 (Press CTRL+C to quit)

With the FastAPI server listening on that port, we can send requests to it from a simple client script:

import requests
import json


def get_completion(prompt, max_length):
    headers = {"Content-Type": "application/json"}
    data = {"prompt": prompt, "max_length": max_length}
    response = requests.post(
        url="http://127.0.0.1:6000", headers=headers, data=json.dumps(data)
    )
    return response.json()["response"]


if __name__ == "__main__":
    prompt = input()
    content: str = get_completion(prompt, 10000000)
    contents = content.strip().split("\n")
    for con in contents:
        print(con)

This gives us the most basic user-to-LLM interaction loop.

Please input your prompts.😘
I am a student learning computer science and artificial intelligence. I want to learn the basic knowledge of front-back end developing in the summer vacation and become an excellent developer in the future, can you arrange me a plan for 40 dayas to improve my developing skills? I want to improve myself in aspects of how to conduct product research and task-planning, basic knowlwdge of front-back end developing and software engineering, generate a whole report for me. Thank you!
Sure, I can help you create a plan to improve your front-end and back-end developing skills and gain knowledge in product research and task planning. Here's a 4-step plan:

1. Learn the basics of front-end and back-end development:
* Front-end development: HTML, CSS, JavaScript, and frameworks like React, Angular, or Vue.js.
* Back-end development: Python, Java, or Ruby on Rails, and databases like MySQL, MongoDB, or PostgreSQL.
* Learn how to use version control systems like Git and GitHub.
2. Conduct product research and task planning:
* Learn how to conduct market research and user research to understand your target audience and their needs.
* Learn how to create user personas, user stories, and user journeys to understand user behavior and create a user-centered design.
* Learn how to create a project roadmap, including milestones, deadlines, and tasks.
3. Learn software engineering principles:
* Learn how to write clean, maintainable, and scalable code.
* Learn how to use design patterns and best practices for software development.
* Learn how to use agile methodologies like Scrum or Kanban for project management.
4. Generate a whole report:
* Create a report that summarizes your learning and skills gained during the summer vacation.
* Include a section on front-end and back-end development, including the technologies and frameworks you learned.
* Include a section on product research and task planning, including the methods and tools you used.
* Include a section on software engineering principles, including the design patterns and best practices you learned.
* Include a conclusion that summarizes your learning and future goals.

I hope this plan helps you improve your front-end and back-end developing skills and gain knowledge in product research and task planning. Good luck with your summer vacation!
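As a side note, if you want to reuse the earlier evaluation pipeline against this DeepSeek server, the only change that should be needed is pointing the URL in get_completions_offline at port 6000 instead of 6005, since both servers expose the same prompt/max_length JSON interface.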
