My Memo

My Memo

All kinds of stuffs… Mostly for the lab and AI usage.

Table of Contents

Hugging Face 数据集下载

Upd: 2025/04/25

由于网络限制,无法使用原生的Hugging Face或者官方提供的huggface-cli或者datasets等第三方库下载数据集。在此笔者给出可操作的方法之一:

环境变量的配置

环境声明:本地环境和服务器环境均测试通过。

  • 服务器环境
1
2
3
4
Distributor ID: Ubuntu
Description: Ubuntu 20.04.6 LTS
Release: 20.04
Codename: focal
  • 本地WSL环境
1
2
3
4
Distributor ID: Ubuntu
Description: Ubuntu 24.04.2 LTS
Release: 24.04
Codename: noble

.bashrc或者.zshrc中加入如下export:

1
2
3
4
5
6
7
export HUGGINGFACE_HUB_URL="https://mirrors.tuna.tsinghua.edu.cn/hugging-face-hub"
export HF_DATASETS_URL="https://mirrors.tuna.tsinghua.edu.cn/hugging-face-datasets"
export HF_METRICS_URL="https://mirrors.tuna.tsinghua.edu.cn/hugging-face-metrics"
export HF_HOME="/GPFS/rhome/xiyuanyang/HuggingFace/HF_HOME"
export TRANSFORMERS_CACHE="/GPFS/rhome/xiyuanyang/HuggingFace/TRANSFORMER_HOME"
export HF_DATASETS_CACHE="/GPFS/rhome/xiyuanyang/HuggingFace/HF_DATASETS_CACHE"
export HF_ENDPOINT="https://hf-mirror.com"

(不知道是否有用)

CLI 下载数据集和模型

下载aria2

1
sudo apt install aria2

检查如下命令是否通过:

1
aria2c --version
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
aria2 version 1.37.0
Copyright (C) 2006, 2019 Tatsuhiro Tsujikawa

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

** Configuration **
Enabled Features: Async DNS, BitTorrent, Firefox3 Cookie, GZip, HTTPS, Message Digest, Metalink, XML-RPC, SFTP
Hash Algorithms: sha-1, sha-224, sha-256, sha-384, sha-512, md5, adler32
Libraries: zlib/1.3 libxml2/2.9.14 sqlite3/3.45.1 GnuTLS/3.8.3 nettle GMP/6.3.0 c-ares/1.27.0 libssh2/1.11.0
Compiler: gcc 13.2.0
built by x86_64-pc-linux-gnu
on Apr 16 2024 09:56:00
System: Linux 5.15.153.1-microsoft-standard-WSL2 #1 SMP Fri Mar 29 23:14:13 UTC 2024 x86_64

Report bugs to https://github.com/aria2/aria2/issues
Visit https://aria2.github.io/

如果是服务器没有sudo权限,可以使用网速较慢且没有多线程功能的wget工具,但是体验感稍差。

下载hfd

1
2
wget https://hf-mirror.com/hfd/hfd.sh
chmod a+x hfd.sh

下载对应的模型和数据集

如果已经下载过aria2,使用命令:

1
2
3
4
5
# 下载模型
./hfd.sh gpt2

# 下载数据集
./hfd.sh tatsu-lab/alpaca_eval --dataset

如果没有下载,需要额外加一行命令:

1
./hfd.sh tatsu-lab/alpaca_eval --dataset --tool wget

如果看到如下的输出,说明已经成功下载,在当前目录下查看对应名称的文件夹即可

1
2
3
4
5
6
7
8
9
10
11
12
13
14
Fetching repo metadata...
Generating file list...
[Warning] jq not installed, using grep/awk for metadata json parsing (slower). Consider installing jq for better parsing performance.
Starting download with wget to alpaca_eval...
--2025-04-25 20:35:24-- https://hf-mirror.com/datasets/tatsu-lab/alpaca_eval/resolve/main/.gitattributes
Resolving hf-mirror.com (hf-mirror.com)... 153.121.61.152, 153.121.57.40, 160.16.110.249, ...
Connecting to hf-mirror.com (hf-mirror.com)|153.121.61.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2693 (2.6K) [text/plain]
Saving to: ‘.gitattributes

.gitattributes 100%[==================================================>] 2.63K --.-KB/s in 0s

2025-04-25 20:35:26 (15.7 MB/s) - ‘.gitattributes’ saved [2693/2693]

数据集下载到本地之后的使用

由于上面手动下载数据集的方式导致datasets.load_dataset()的方法不可以被调用(还是搞不懂为什么不可以😅),但是我们可以手动提取数据集,以alpaca_eval为例:

1
2
3
4
import datasets
eval_set = datasets.Dataset.from_json("/GPFS/rhome/xiyuanyang/alpaca_eval/alpaca_eval.json")
eval_set = eval_set.remove_columns(["output", "generator"])
print(eval_set)

会输出结果:

1
2
3
4
Dataset({
features: ['dataset', 'instruction'],
num_rows: 805
})

很明显,我们成功加载了数据集。

但貌似还是很被动…很多数据集测评的库接口都默认要使用Hugging Face,无法避免的问题…

希望日后能解决。

Markdown 文本内设置超链接

Upd: 2025/04/25

对于文本内的链接跳转(例如目录跳转):

  • 注意圆括号内部的字符串需要使用-来替代空格。
  • 注意需要加几个#(可以看HTML的链接确定)

日志书写

详见博客:Debugging and Profiling

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
# import logging package for python:
import logging
import random
import time
from datetime import datetime

# Getting time stamp
timestamp = datetime.now().strftime("%Y%m%d_%H-%M-%S")

# Configure logging
logging.basicConfig(
level=logging.DEBUG,
format="%(asctime)s - %(levelname)s - %(message)s",
handlers=[
logging.FileHandler(f"{timestamp}.log"),
],
)

# Create a StreamHandler for console output
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.ERROR)
formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
console_handler.setFormatter(formatter)

# Add the console handler to the root logger
logging.getLogger().addHandler(console_handler)
logger = logging.getLogger(__name__)

Tmux 使用备忘录

  • tmux new -t name
  • tmux attach -t name
  • tmux kill-session -t name
  • Ctrl-b + [进入滚动模式,q退出。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
TMUX MEMO
tmux new -s <session-name> for create a new tmux session.
tmux attach -t <session-name> for attach a new tmux session.
tmux kill-session -t <session-name> for kill a session.
tmux tmux switch -t <session-name> for switching different sessions.
tmux rename-session -t 0 <new-name> for renaming the sessions.
tmux split-window for spliting windows, -h for left and right.
mux select-pane for moving cursor, -UDLR
tmux swap-pane -U for swapping the current pane up.
tmux swap-pane -D for swapping the current pane down.
Ctrl+b % for splitting the window vertically.
Ctrl+b " for splitting the window horizontally.
Ctrl+b <arrow key> for switching to another pane.
Ctrl+b ; for switching to the last pane.
Ctrl+b o for switching to the next pane.
Ctrl+b { for swapping the current pane with the previous one.
Ctrl+b } for swapping the current pane with the next one.
Ctrl+b Ctrl+o for moving all panes forward.
Ctrl+b Alt+o for moving all panes backward.
Ctrl+b x for closing the current pane.
Ctrl+b ! for splitting the current pane into a new window.
Ctrl+b z for toggling the current pane to fullscreen.
Ctrl+b Ctrl+<arrow key> for resizing panes.
Ctrl+b q for displaying pane numbers.
tmux new-window for creating a new window.
tmux new-window -n <window-name> for creating a new window with a specific name.
tmux select-window -t <window-number> for switching to a specific window by number.
tmux select-window -t <window-name> for switching to a specific window by name.
tmux rename-window <new-name> for renaming the current window.
Ctrl+b c for creating a new window.
Ctrl+b p for switching to the previous window.
Ctrl+b n for switching to the next window.
Ctrl+b <number> for switching to a specific window by its number.
Ctrl+b w for selecting a window from a list.
Ctrl+b , for renaming the current window.
tmux list-keys for listing all key bindings.
tmux list-commands for listing all tmux commands and their parameters.
tmux info for listing information about current tmux sessions.
tmux source-file ~/.tmux.conf for reloading the current tmux configuration.

写成脚本函数再source一下可以随时查看:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
# Tmux memo
# This funcion shows the basic usage of several tmux commands.
TMUXH(){
print_colored red "TMUX MEMO"
echo "\033[36mtmux new -s <session-name>\033[0m for create a new tmux session."
echo "\033[36mtmux attach -t <session-name>\033[0m for attach a new tmux session."
echo "\033[36mtmux kill-session -t <session-name>\033[0m for kill a session."
echo "\033[36mtmux tmux switch -t <session-name>\033[0m for switching different sessions."
echo "\033[36mtmux rename-session -t 0 <new-name>\033[0m for renaming the sessions."
echo "\033[36mtmux split-window\033[0m for spliting windows, -h for left and right."
echo "\033[36tmux select-pane\033[0m for moving cursor, -UDLR"
echo "\033[36mtmux swap-pane -U\033[0m for swapping the current pane up."
echo "\033[36mtmux swap-pane -D\033[0m for swapping the current pane down."
echo "\033[36mCtrl+b %\033[0m for splitting the window vertically."
echo "\033[36mCtrl+b \"\033[0m for splitting the window horizontally."
echo "\033[36mCtrl+b <arrow key>\033[0m for switching to another pane."
echo "\033[36mCtrl+b ;\033[0m for switching to the last pane."
echo "\033[36mCtrl+b o\033[0m for switching to the next pane."
echo "\033[36mCtrl+b {\033[0m for swapping the current pane with the previous one."
echo "\033[36mCtrl+b }\033[0m for swapping the current pane with the next one."
echo "\033[36mCtrl+b Ctrl+o\033[0m for moving all panes forward."
echo "\033[36mCtrl+b Alt+o\033[0m for moving all panes backward."
echo "\033[36mCtrl+b x\033[0m for closing the current pane."
echo "\033[36mCtrl+b !\033[0m for splitting the current pane into a new window."
echo "\033[36mCtrl+b z\033[0m for toggling the current pane to fullscreen."
echo "\033[36mCtrl+b Ctrl+<arrow key>\033[0m for resizing panes."
echo "\033[36mCtrl+b q\033[0m for displaying pane numbers."

echo "\033[36mtmux new-window\033[0m for creating a new window."
echo "\033[36mtmux new-window -n <window-name>\033[0m for creating a new window with a specific name."
echo "\033[36mtmux select-window -t <window-number>\033[0m for switching to a specific window by number."
echo "\033[36mtmux select-window -t <window-name>\033[0m for switching to a specific window by name."
echo "\033[36mtmux rename-window <new-name>\033[0m for renaming the current window."
echo "\033[36mCtrl+b c\033[0m for creating a new window."
echo "\033[36mCtrl+b p\033[0m for switching to the previous window."
echo "\033[36mCtrl+b n\033[0m for switching to the next window."
echo "\033[36mCtrl+b <number>\033[0m for switching to a specific window by its number."
echo "\033[36mCtrl+b w\033[0m for selecting a window from a list."
echo "\033[36mCtrl+b ,\033[0m for renaming the current window."

echo "\033[36mtmux list-keys\033[0m for listing all key bindings."
echo "\033[36mtmux list-commands\033[0m for listing all tmux commands and their parameters."
echo "\033[36mtmux info\033[0m for listing information about current tmux sessions."
echo "\033[36mtmux source-file ~/.tmux.conf\033[0m for reloading the current tmux configuration."

}

Python Memo

See this Blog for more details!:

  • 字符串
  • 列表 & 元组 & 字典
  • OOP

Conda Quick Setup

Conda常见的命令:

  • Conda创建虚拟环境
1
conda create --name aflow -y

其中--name后缀名代表创建的虚拟环境的名称(或者使用 -n

  • Conda创建指定版本号的虚拟环境

如果需要Python的版本回退,需要指定的Python版本号:

1
conda create --name aflow python=3.8 -y

语义版本控制(Semantic Versioning)

版本号由三部分组成:主版本号.次版本号.修订号MAJOR.MINOR.PATCH),例如 2.5.3

  1. 主版本号(MAJOR)
    • 不兼容的变更:当代码发生破坏性(Backward Incompatible)修改时递增。
    • 示例:移除 API、重大架构调整。
    • 升级风险:用户需检查代码是否适配新版本。
  2. 次版本号(MINOR)
    • 向下兼容的功能新增:新增功能或改进,但不影响现有接口。
    • 示例:添加新 API、性能优化。
    • 升级建议:通常可安全升级。
  3. 修订号(PATCH)
    • 向下兼容的问题修复:修复漏洞或小改进,不影响功能。
    • 示例:Bug 修复、文档更新。
    • 升级建议:建议立即更新。
  • Conda激活环境
1
2
3
4
5
# activate specific version
conda activate aflow

# deactivate to the base version
conda deactivate
  • Conda删除虚拟环境
1
conda remove --name your_env_name --all
  • 查询信息
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
# 查看虚拟环境的列表
conda env list

# 查看当前虚拟环境的conda详细配置
conda info

# 或者
conda env export

# 如果你只是想要查看Python版本号
python --version

# 查看当前安装的第三方库
pip list

# 或者
conda list <package_name>

# 查看更细致的信息
pip show <package_name>

使用pip查看信息会比conda要快很多,强烈建议安装miniconda而非anaconda。

SSH Usage

See this blog for more details.


My Memo
https://xiyuanyang-code.github.io/posts/My-Memo/
Author
Xiyuan Yang
Posted on
April 25, 2025
Updated on
May 13, 2025
Licensed under