Retokenization

Definition

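retokenization: in natural language processing (NLP), the process of splitting a text, or an existing tokenization of that text, into tokens again. It is usually done to meet the requirements of a particular model or vocabulary (for example one built with BPE, WordPiece, or SentencePiece), or to align annotations and input formats across different systems.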
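As a rough illustration of the definition, the sketch below re-splits an existing word-level tokenization into subword pieces using greedy longest-match-first lookup, in the style of WordPiece. The vocabulary `TOY_VOCAB` and the function names `wordpiece_split` and `retokenize` are hypothetical and exist only for this example; a real model ships its own vocabulary and tokenizer.

```python
# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
TOY_VOCAB = {"re", "##token", "##ization", "token", "##s", "[UNK]"}


def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def retokenize(words, vocab):
    """Re-split existing tokens; also record each piece's source word index."""
    pieces, word_ids = [], []
    for i, word in enumerate(words):
        for piece in wordpiece_split(word, vocab):
            pieces.append(piece)
            word_ids.append(i)
    return pieces, word_ids


pieces, word_ids = retokenize(["retokenization", "tokens", "xyz"], TOY_VOCAB)
print(pieces)    # ['re', '##token', '##ization', 'token', '##s', '[UNK]']
print(word_ids)  # [0, 0, 0, 1, 1, 2]
```

The `word_ids` list is what lets labels or annotations attached to the original tokens be carried over to the new pieces.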

Pronunciation (IPA)

US: /ˌriːˌtoʊkənəˈzeɪʃən/
UK: /ˌriːˌtəʊkənaɪˈzeɪʃən/

Examples

The dataset requires retokenization before training.
这个数据集在训练前需要重新分词。

To align the gold annotations with the model’s subword vocabulary, we performed retokenization and updated all token offsets.
为了让人工标注与模型的子词词表对齐，我们进行了重新分词，并更新了所有词元的偏移位置。
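
The offset update mentioned in the second example can be sketched as follows. This assumes that gold annotations are stored as character spans and that the new tokenizer reports a character offset pair for every piece; the function name `char_span_to_token_span` and the sample pieces, offsets, and `TERM` label are hypothetical.

```python
def char_span_to_token_span(piece_offsets, start, end):
    """Map a character span [start, end) to a piece-index span [first, last + 1).

    A piece counts as covered when its character range overlaps the span.
    Returns None when no piece overlaps (for example, a whitespace-only span).
    """
    covered = [
        i for i, (s, e) in enumerate(piece_offsets)
        if s < end and e > start
    ]
    if not covered:
        return None
    return covered[0], covered[-1] + 1


text = "We performed retokenization"

# Hypothetical output of a subword tokenizer: pieces plus character offsets.
pieces = ["We", "performed", "re", "token", "ization"]
piece_offsets = [(0, 2), (3, 12), (13, 15), (15, 20), (20, 27)]

# Gold annotation stored as (char_start, char_end, label).
gold = [(13, 27, "TERM")]

token_spans = []
for start, end, label in gold:
    span = char_span_to_token_span(piece_offsets, start, end)
    if span is not None:
        token_spans.append((span[0], span[1], label))

print(token_spans)  # [(2, 5, 'TERM')]
```

Keeping the annotations anchored to character positions, rather than to token indices, means they survive any later change of tokenizer: only this mapping step has to be rerun.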

Etymology

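Formed from re- ("again") + token + -ization ("the act or process of making into"); the literal meaning is "the process of tokenizing a text again". The word appears mostly in computational linguistics and machine learning engineering contexts.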

Literary / Notable Works

Sennrich, Haddow & Birch (2016), Neural Machine Translation of Rare Words with Subword Units (discussions of subword segmentation and re-segmentation often appear alongside the term retokenization)
Kudo & Richardson (2018), SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
Devlin et al. (2019), BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (related implementations and reproduction work often mention retokenization performed to match WordPiece)