LLM Fundamentals: Vocabularies and a Tokenizer Class for Converting Text to Token IDs and Token IDs Back to Text - 马育民老师

# Introduction

This step converts tokens into integers, producing **token IDs**.

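# Vocabulary

To map the previously generated tokens to token IDs, we first need to build a **vocabulary**. The vocabulary defines how each unique word and special character maps to a unique integer, as shown in the figure: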

[![](https://www.malaoshi.top/upload/0/0/1GW2W91LMI5Q.png)](https://www.malaoshi.top/upload/0/0/1GW2W91LMI5Q.png)

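# Building the vocabulary

1. Split all the text in the training set into individual tokens.
2. Sort the tokens alphabetically and remove duplicates.
3. Collect the unique tokens into a vocabulary that maps each unique token to a unique integer.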
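The three steps above can be sketched on a toy example. This is a minimal illustration, not the article's actual pipeline: `sample_text` is a made-up sentence, and the split uses whitespace only for brevity.

```python
# Toy "training set": a made-up sentence (an assumption for illustration)
sample_text = "the quick brown fox jumps over the lazy dog"

# Step 1: split the text into individual tokens (whitespace split only)
tokens = sample_text.split()

# Step 2: remove duplicates with set(), then sort
unique_tokens = sorted(set(tokens))

# Step 3: map every unique token to a unique integer
toy_vocab = {token: index for index, token in enumerate(unique_tokens)}

print(unique_tokens)
print(toy_vocab)
```

Note that Python's `sorted()` orders strings by Unicode code point, so in a real vocabulary punctuation and capitalized words come before lowercase words.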

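# Implementing the vocabulary: step 1

Create a list of **all unique tokens**, sorted alphabetically. (Python's `sorted()` compares strings by Unicode code point, so punctuation and capitalized words come before lowercase words.)

**Tip:** For ease of learning, the vocabulary shown in the figure above is deliberately small and contains no punctuation or special characters. The vocabulary built by the code below does include punctuation, as its output shows.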

### Key code

```
# ===================================
# Build the vocabulary
# `result` is the token list produced by the splitting step (see the complete code below)
print("---------------------")
all_words = sorted(set(result))
print("Vocabulary size:", len(all_words))
print("First 20 vocabulary entries:", all_words[:20])
```

Output:

```
---------------------
Vocabulary size: 1130
First 20 vocabulary entries: ['!', '"', "'", '(', ')', ',', '--', '.', ':', ';', '?', 'A', 'Ah', 'Among', 'And', 'Are', 'Arrt', 'As', 'At', 'Be']
```

# Implementing the vocabulary: step 2

Generate a token ID for each token.

### Key code

```
# Convert to a dict: the key is the token, the value is its token ID
# (the first token gets ID 0, the second gets ID 1, and so on)
vocab = {token: index for index, token in enumerate(all_words)}

print("First 10 vocabulary entries:")
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 9:
        break

print("Last 10 vocabulary entries:")
for i, item in enumerate(vocab.items()):
    # Skip everything except the last 10 entries
    if i <= len(all_words)-11:
        continue
    # This print must sit inside the loop so each of the last 10 entries is printed
    print(item)
```

Output:

```
---------------------
First 10 vocabulary entries:
('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
Last 10 vocabulary entries:
('would', 1120)
('wouldn', 1121)
('year', 1122)
('years', 1123)
('yellow', 1124)
('yet', 1125)
('you', 1126)
('younger', 1127)
('your', 1128)
('yourself', 1129)
```

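# Complete code

```
import re

# Read the training text
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# Split the text on , . : ; ? _ ! " ( ) ' -- and on whitespace (spaces, tabs, newlines)
result = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
# Drop empty strings and whitespace-only items
result = [item for item in result if item.strip()]
print("Number of tokens:", len(result))
print(result[:9])

# ===================================
# Build the vocabulary
print("---------------------")
all_words = sorted(set(result))
print("Vocabulary size:", len(all_words))
# Convert to a dict: the key is the token, the value is its token ID
# (the first token gets ID 0, the second gets ID 1, and so on)
vocab = {token: index for index, token in enumerate(all_words)}

print("First 10 vocabulary entries:")
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 9:
        break

print("Last 10 vocabulary entries:")
for i, item in enumerate(vocab.items()):
    # Skip everything except the last 10 entries
    if i <= len(all_words)-11:
        continue
    # This print must sit inside the loop so each of the last 10 entries is printed
    print(item)
```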

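# Using the vocabulary

### Converting to token IDs

The dictionary contains many individual tokens, each associated with a unique integer label. The next goal is to use this vocabulary to convert new text into token IDs, as shown in the figure: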

[![](https://www.malaoshi.top/upload/0/0/1GW2W9tHY1hQ.png)](https://www.malaoshi.top/upload/0/0/1GW2W9tHY1hQ.png)

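A new text sample is tokenized from scratch, and the **vocabulary is used to convert its tokens into token IDs**. The vocabulary is built from the entire training set; it can be applied to the training set itself and to new text samples, provided every token in them appears in the vocabulary.

**Tip:** For ease of understanding, the vocabulary shown in the figure contains no punctuation or special characters.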
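As a minimal sketch of this lookup, assume a small hand-written vocabulary (`toy_vocab` and `new_text` are hypothetical; the real vocabulary is built from the training text). The split pattern is the one used throughout this article.

```python
import re

# Hypothetical toy vocabulary, standing in for the one built from the training set
toy_vocab = {",": 0, ".": 1, "brown": 2, "fox": 3, "quick": 4, "the": 5}

new_text = "the fox, the quick brown fox."

# Split with the article's pattern, then drop empty and whitespace-only items
tokens = [t for t in re.split(r'([,.:;?_!"()\']|--|\s)', new_text) if t.strip()]

# Look up each token's ID in the vocabulary
token_ids = [toy_vocab[t] for t in tokens]

print(tokens)
print(token_ids)
```

Every token here is present in `toy_vocab`; a token missing from the vocabulary would make the dictionary lookup raise `KeyError`.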

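### Converting back to text

To convert an LLM's **numeric output back into text**, we also need a way to turn token IDs into text. To do this, we can create an inverse vocabulary that maps token IDs back to their corresponding text tokens.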
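A minimal sketch of an inverse vocabulary, again assuming a small hypothetical `toy_vocab`: swapping keys and values of the dictionary gives the ID-to-token mapping.

```python
# Hypothetical toy vocabulary (token -> token ID)
toy_vocab = {",": 0, ".": 1, "brown": 2, "fox": 3, "quick": 4, "the": 5}

# Inverse vocabulary (token ID -> token): swap keys and values
inverse_vocab = {index: token for token, index in toy_vocab.items()}

token_ids = [5, 4, 2, 3, 1]

# Map every ID back to its token, then join the tokens with spaces
tokens = [inverse_vocab[i] for i in token_ids]
text = " ".join(tokens)

print(tokens)
print(text)
```

This works because every token ID in the vocabulary is unique, so no two tokens collide when the mapping is inverted.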

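# Tokenizer class

Implement a complete **tokenizer class**:

- It has an `encode()` method that splits text into tokens and maps the strings to integers through the vocabulary, producing token IDs.

- It has a `decode()` method that performs the reverse integer-to-string mapping, turning token IDs back into text.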

[![](https://www.malaoshi.top/upload/0/0/1GW2WAFJiVBi.png)](https://www.malaoshi.top/upload/0/0/1GW2WAFJiVBi.png)

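With the `SimpleTokenizerV1` class, we can now instantiate new tokenizer objects from an existing vocabulary and then use those objects to encode and decode text, as shown in the figure: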

[![](https://www.malaoshi.top/upload/0/0/1GW2WAGZuAOp.png)](https://www.malaoshi.top/upload/0/0/1GW2WAGZuAOp.png)

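A tokenizer typically has two common methods, `encode()` and `decode()`:

- `encode()`: takes a text sample, splits it into individual tokens, and then uses the vocabulary to convert the tokens into token IDs.

- `decode()`: takes a list of token IDs, converts them back into text tokens, and joins the text tokens into natural-language text.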

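### Implementation

```
import re

class SimpleTokenizerV1:

    def __init__(self, vocab):
        """
        Initialize the tokenizer.
        :param vocab: vocabulary mapping each token (str) to its token ID (int)
        """
        self.str_to_int = vocab
        # Inverse vocabulary: maps each token ID back to its token
        self.int_to_str = {index: item for item, index in vocab.items()}

    def encode(self, text):
        """
        Encode the input text and return the corresponding token IDs.
        :param text: input text
        :return: list of token IDs
        """
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        # Drop empty strings and whitespace-only items
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        # Raises KeyError if a token is not in the vocabulary
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        """
        Decode token IDs into text.
        :param ids: list of token IDs
        :return: decoded text
        """
        # Convert the token IDs to tokens and join them with spaces
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the spaces in front of specific punctuation marks
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text
```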

# Testing

### Building the vocabulary

```
import re

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# Split the text on , . : ; ? _ ! " ( ) ' -- and on whitespace (spaces, tabs, newlines)
result = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
# Drop empty strings and whitespace-only items
result = [item for item in result if item.strip()]
all_words = sorted(set(result))

print("---------------------")
# Convert to a dict: the key is the token, the value is its token ID
# (the first token gets ID 0, the second gets ID 1, and so on)
vocab = {token: index for index, token in enumerate(all_words)}
```

### Testing the tokenizer's encoding

Convert the input text into token IDs:

```
tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he painted, you know,"
       Mrs. Gisburn said with pardonable pride."""

ids = tokenizer.encode(text)
print(ids)
```

Output:

```
[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]
```

### Testing the tokenizer's decoding

Convert these token IDs back into text:

```
print("Token IDs decoded to text:", tokenizer.decode(ids))
```

Output:

```
Token IDs decoded to text: " It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.
```
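The decoded text above is not identical to the input: a space remains after the opening quotation mark and after the apostrophe. The following sketch reproduces the two steps of `decode()` on a short hand-written token list (an illustrative assumption, not real tokenizer output) to show where those spaces come from.

```python
import re

# Hand-written token list for illustration
tokens = ['"', 'It', "'", 's', 'the', 'last', ',', 'you', 'know', '.']

# Step 1: join all tokens with single spaces
joined = " ".join(tokens)

# Step 2: remove whitespace *before* the listed punctuation marks only
cleaned = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)

print(joined)
print(cleaned)
```

The substitution only removes whitespace in front of a punctuation mark, so the space that follows the opening quote and the one that follows the apostrophe are left in place.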

# Drawbacks

If the text contains a word that is not in the vocabulary, encoding raises an error:

```
text = "Hello, do you like tea?"
print(tokenizer.encode(text))
```

The error is:

```
KeyError: 'Hello'
```
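One way to see in advance which tokens would trigger this error is to check the text against the vocabulary before encoding. The sketch below is only an illustration: `find_unknown_tokens` is a hypothetical helper, not part of `SimpleTokenizerV1`, and `toy_vocab` is a made-up stand-in for the real vocabulary.

```python
import re

# Hypothetical toy vocabulary, standing in for the one built from the-verdict.txt
toy_vocab = {",": 0, "?": 1, "do": 2, "like": 3, "tea": 4, "you": 5}

def find_unknown_tokens(text, vocab):
    """Return the tokens of `text` that have no entry in `vocab` (hypothetical helper)."""
    # Same split pattern as the tokenizer; drop empty and whitespace-only items
    tokens = [t for t in re.split(r'([,.:;?_!"()\']|--|\s)', text) if t.strip()]
    return [t for t in tokens if t not in vocab]

text = "Hello, do you like tea?"
unknown = find_unknown_tokens(text, toy_vocab)
print(unknown)
```

An empty list means every token is covered by the vocabulary, so encoding the text would not raise `KeyError`.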

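### Note

When working with large language models, use a larger and more diverse training set to extend the vocabulary, so that fewer words fall outside it.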

Original source: http://www.malaoshi.top/show_1GW2WAyk1P5B.html