Revisiting the "English phrase" tokenization question

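There was an earlier question on this: https://v2ex.com/t/725950 but it never got a workable solution. I replied there, but the thread has sunk too deep, so I'm posting a new one in the hope that someone knowledgeable sees it and can offer some pointers.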

I'd like to split English sentences into meaningful phrases, the way jieba segments Chinese text into words. For example: A match / is / a tool / for starting / a fire. Typically, / modern matches / are made of / small wooden sticks or stiff paper.

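I searched around and there doesn't seem to be a ready-made tool. The closest approach is probably spaCy's rule-based matching, used to pick out noun phrases (fairly easy, a ready-made option exists) and verb phrases. textacy has a minimal VP constant ('<AUX>* <ADV>* <VERB>').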
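To make the VP pattern concrete, here is a minimal sketch of what '<AUX>* <ADV>* <VERB>' matches. It does not use spaCy or textacy: the POS tags are assigned by hand (an assumption, a real pipeline would get them from a tagger), and the pattern is rewritten as an ordinary regular expression over the tag sequence. `find_verb_phrases` is a hypothetical helper name.

```python
import re

# Hand-assigned (token, coarse POS) pairs; a real pipeline would use a tagger.
TAGGED = [
    ("Typically", "ADV"), (",", "PUNCT"), ("modern", "ADJ"),
    ("matches", "NOUN"), ("are", "AUX"), ("made", "VERB"),
    ("of", "ADP"), ("small", "ADJ"), ("wooden", "ADJ"), ("sticks", "NOUN"),
]

# The pattern <AUX>* <ADV>* <VERB>, written as a regex over "<TAG>" symbols.
VP_PATTERN = re.compile(r"(?:<AUX>)*(?:<ADV>)*<VERB>")


def find_verb_phrases(tagged):
    """Return the token spans whose POS sequence matches VP_PATTERN."""
    # Encode the tags as one string and remember where each token's tag starts.
    offsets, pos_string = [], ""
    for _, tag in tagged:
        offsets.append(len(pos_string))
        pos_string += f"<{tag}>"

    phrases = []
    for m in VP_PATTERN.finditer(pos_string):
        start = offsets.index(m.start())            # first token of the match
        end = sum(1 for o in offsets if o < m.end())  # one past the last token
        phrases.append(" ".join(tok for tok, _ in tagged[start:end]))
    return phrases


print(find_verb_phrases(TAGGED))  # ['are made']
```

The sentence-initial "Typically" is not picked up because it is followed by punctuation rather than a verb, which shows how narrow this minimal pattern is.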

So, once again: is there a reasonably convenient tool that can segment English into phrases directly?

4 replies • 2021-01-09 17:38:01 +08:00

zyx199199

2021-01-05 16:23:21 +08:00

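In NLP this is closest to the constituency parsing problem: a sentence is split into constituents, and each constituent can in turn be split further.

That said, your requirement is underspecified; "meaningful phrase" is too broad a notion. For example, instead of A match / is / a tool / for starting / a fire. I would find A match / is / a tool / for / starting a fire. more reasonable.

I think you could run constituency parsing first, then define some rules to post-process the parse.

Take Typically, modern matches are made of small wooden sticks or stiff paper. If you run constituency parsing on it with AllenNLP ( https://demo.allennlp.org/constituency-parsing/MjYyNTUwNQ== ), you'll see the sentence is in fact already split up; you just need to define rules that decide at which level of the parse tree to cut.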
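As a rough sketch of "deciding which level to cut at": the snippet below reads a bracketed parse, then walks down the tree and stops either at a chosen depth or at a phrase whose children are all single words. The parse string is written by hand in Penn Treebank style for illustration (an assumption, not actual AllenNLP output), and `parse_brackets`, `is_flat`, and `chunk` are hypothetical helper names.

```python
import re

# Hand-written Penn Treebank style parse (illustrative, not real parser output).
PARSE = (
    "(S (NP (DT A) (NN match)) "
    "(VP (VBZ is) (NP (NP (DT a) (NN tool)) "
    "(PP (IN for) (S (VP (VBG starting) (NP (DT a) (NN fire))))))) (. .))"
)


def parse_brackets(text):
    """Turn a bracketed string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        label = tokens[pos + 1]  # tokens[pos] is "("
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return read()


def leaves(node):
    """All words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in leaves(child)]


def is_flat(node):
    """True for (NN match) or for a phrase made only of such word-level nodes."""
    return all(
        isinstance(c, str) or all(isinstance(g, str) for g in c[1])
        for c in node[1]
    )


def chunk(node, max_depth, depth=0):
    """Cut the tree at max_depth, but never split a flat phrase."""
    if isinstance(node, str):
        return [node]
    if depth >= max_depth or is_flat(node):
        return [" ".join(leaves(node))]
    return [c for child in node[1] for c in chunk(child, max_depth, depth + 1)]


tree = parse_brackets(PARSE)
print(chunk(tree, max_depth=1))   # ['A match', 'is a tool for starting a fire', '.']
print(chunk(tree, max_depth=10))  # ['A match', 'is', 'a tool', 'for', 'starting', 'a fire', '.']
```

A fixed depth is only the simplest possible rule; rules keyed on node labels (for example, never splitting a PP) would be the natural next step.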

yucongo

2021-01-05 17:17:02 +08:00

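@zyx199199 Thanks for the reply, I'll look into constituency parsing. I grouped for with starting because for on its own is close to a stop word and arguably dispensable. Actually, keeping for starting a fire together might be more reasonable.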

neosfung

2021-01-08 11:03:16 +08:00

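If the text is formal, document-style English, you could first try dependency parsing to build a dependency tree, then assemble the phrase constituents according to the dependency relations.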
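One way to sketch this idea: treat each direct dependent of the root, together with everything below it, as one phrase, and keep auxiliaries attached to the root verb. The dependency parse below is written by hand with Universal Dependencies style labels (an assumption, not the output of any particular parser), and `anchor` and `phrases` are hypothetical helper names.

```python
from itertools import groupby

# Hand-written parse: (word, index of head, relation); the root has head -1.
TOKENS = [
    ("modern", 1, "amod"),
    ("matches", 3, "nsubj:pass"),
    ("are", 3, "aux:pass"),
    ("made", -1, "root"),
    ("of", 7, "case"),
    ("small", 7, "amod"),
    ("wooden", 7, "amod"),
    ("sticks", 3, "obl"),
]

# Relations whose token stays in the same phrase as the root (illustrative rule).
KEEP_WITH_HEAD = {"aux", "aux:pass", "cop"}


def anchor(i, tokens):
    """Index of the top-level subtree (a child of the root) that token i is in."""
    _, head, deprel = tokens[i]
    if head == -1:
        return i  # the root anchors itself
    if tokens[head][1] == -1:  # token i is a direct child of the root
        return head if deprel in KEEP_WITH_HEAD else i
    return anchor(head, tokens)  # climb towards the root


def phrases(tokens):
    """Group consecutive tokens that share the same anchor."""
    anchors = [anchor(i, tokens) for i in range(len(tokens))]
    return [
        " ".join(tokens[i][0] for i in group)
        for _, group in groupby(range(len(tokens)), key=lambda i: anchors[i])
    ]


print(phrases(TOKENS))  # ['modern matches', 'are made', 'of small wooden sticks']
```

Because the grouping is over consecutive tokens, a subtree that is interrupted by other words comes out as more than one phrase, which is one reason this works best on fairly regular text.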

yucongo

2021-01-09 17:38:01 +08:00

@neosfung Thanks for the reply.

I put together a PyPI package: https://pypi.org/project/phrase-tokenizer/

pip install phrase-tokenizer

Open-source GitHub repo: https://github.com/ffreemt/phrase-tokenizer