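As everyone knows, once you have piled up enough doujinshi, keeping them organized becomes a chore =.=
I want to sort them by artist, by circle, by event, and so on, so I wrote a regex to pull all of that information out of a book's file name.
But the titles come in all sorts of formats, for example: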
(event ) (tag ) [group (artist )] title (form ) [addition1] [addition2]
(event ) [group (artist )] title (form ) [addition1]
[event] [group (artist )] title (form ) (addition1 )
(tag ) [group (artist )] title
[group (artist )] title
title
Here is my attempt:
import re
regex_patern = ur'([\(\[](?P<event>[^\)\]]*)[\)\]])?\s*([\(\[](?P<type>[^\)\](\)\])]*)[\)\]])?\s*(\[(?P<group>[^\(\]]*)(\((?P<artist>[^\)]*)\))?\])?(?P<title>[^\(\)\[\]]*)([\(\[](?P<from>[^\)\]]*)[\)\]])?(\s*[\(\[](?P<more1>[^\)\]]*)[\)\]])'
p = re.compile(regex_patern)
rows = [
    '(event ) (tag ) [group (artist )] title (form ) [addition1] [addition2]',
    '(event ) [group (artist )] title (form ) [addition1]',
    '[event] [group (artist )] title (form ) (addition1 )',
    '(tag ) [group (artist )] title',
    '[group (artist )] title',
    'title',
]
# Try the pattern against each sample title (Python 2 syntax).
for r in rows:
    r = re.search(p, r)
    print r.groupdict()
# Output:
{u'from': 'form', u'more1': 'addition1', u'artist': 'artist', u'title': ' title ', u'group': 'group ', u'type': 'tag', u'event': 'event'}
{u'from': 'form', u'more1': 'addition1', u'artist': 'artist', u'title': ' title ', u'group': 'group ', u'type': None, u'event': 'event'}
{u'from': 'form', u'more1': 'addition1', u'artist': 'artist', u'title': ' title ', u'group': 'group ', u'type': None, u'event': 'event'}
{u'from': None, u'more1': 'group (artist', u'artist': None, u'title': '', u'group': None, u'type': None, u'event': 'tag'}
{u'from': None, u'more1': 'group (artist', u'artist': None, u'title': '', u'group': None, u'type': None, u'event': None}
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last )
<ipython-input-5-831c548bc3f0> in <module>()
15 for r in rows:
16 r = re.search (p, r )
---> 17 print r.groupdict ()
AttributeError: 'NoneType' object has no attribute 'groupdict'
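The results go wrong from the fourth row onward. My feeling is that re should match the simple rules in the middle first and only then expand to the most complex ones,
but I don't know how to write that... so I'm asking here for help.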
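On the traceback itself: re.search returns None when the pattern cannot match anywhere in the string, so calling .groupdict() on the result fails. A minimal Python 3 sketch of guarding against that follows; the pattern and the extract helper are simplified stand-ins made up for illustration, not the regex from the post.

```python
import re

# Simplified stand-in pattern (an assumption, not the regex from the post):
# a title followed by one mandatory "(...)" block.
pattern = re.compile(r'(?P<title>[^()\[\]]+)\((?P<form>[^)]*)\)')

def extract(name):
    """Return the named groups as a dict, or None when nothing matches."""
    m = pattern.search(name)
    return m.groupdict() if m is not None else None

for name in ['title (form)', 'title']:
    result = extract(name)
    if result is None:
        print('no match:', name)
    else:
        print(result)
```

With the guard in place the loop reports the unmatched name and carries on instead of stopping at the first title the pattern cannot handle.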
1
plqws 2015-08-28 17:48:29 +08:00
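Why does it have to be a regex? That code looks really hard to modify.
Also, I think Japanese word segmentation plus tags would make this kind of thing easier to organize.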
2
rogerchen 2015-08-28 18:18:38 +08:00
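(\s*[\(\[](?P<more1>[^\)\]]*)[\)\]]) Why does the last group capture the leading whitespace? That is inconsistent with the earlier groups. Also, the more1 segment should be optional; only the title segment should be mandatory.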
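To illustrate the point about the last segment, here is a small Python 3 sketch comparing a mandatory trailing block with an optional one. Both patterns are reduced, hypothetical versions (title plus one trailing block), not the full regex from the post.

```python
import re

# Reduced, hypothetical patterns: a title plus one trailing bracket block.
# "required" mirrors the shape of the original last segment (no "?" after it).
required = re.compile(r'(?P<title>[^()\[\]]+)(\s*[(\[](?P<more1>[^)\]]*)[)\]])')
# "optional" moves \s* outside the group and marks the group with "?".
optional = re.compile(r'(?P<title>[^()\[\]]+)\s*([(\[](?P<more1>[^)\]]*)[)\]])?')

print(required.search('title'))                          # no trailing block, so no match
print(optional.search('title').groupdict())              # title only, more1 is None
print(optional.search('title [addition1]').groupdict())  # both fields filled
```

Note that the title group still swallows the space before the bracket, so stripping whitespace afterwards is probably simpler than trying to exclude it in the pattern.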
3
rogerchen 2015-08-28 18:20:54 +08:00
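OP, I found another problem: you write the source field as from in some places and form in others. It doesn't affect anything, but it really confused me.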
4
rogerchen 2015-08-28 18:24:18 +08:00
After those changes it looks like this. There still seem to be a few small problems; I'll keep looking.
$ python re.py
{u'from': 'form ', u'more1': 'addition1', u'artist': 'artist ', u'title': ' title ', u'group': 'group ', u'type': 'tag ', u'event': 'event '}
{u'from': 'form ', u'more1': 'addition1', u'artist': 'artist ', u'title': ' title ', u'group': 'group ', u'type': None, u'event': 'event '}
{u'from': 'form ', u'more1': 'addition1 ', u'artist': 'artist ', u'title': ' title ', u'group': 'group ', u'type': None, u'event': 'event'}
{u'from': None, u'more1': None, u'artist': 'artist ', u'title': ' title', u'group': 'group ', u'type': None, u'event': 'tag '}
{u'from': None, u'more1': None, u'artist': None, u'title': '', u'group': None, u'type': None, u'event': 'group (artist '}
{u'from': None, u'more1': None, u'artist': None, u'title': 'title', u'group': None, u'type': None, u'event': None}
5
rogerchen 2015-08-28 18:34:56 +08:00 1
import re
regex_patern = ur'([\(\[](?P<event>[^\()\)\]]*)[\)\]])?\s*([\(\[](?P<type>[^\)\](\)\])]*)[\)\]])?\s*(\[(?P<group>[^\(\]]*)(\((?P<artist>[^\)]*)\))?\])?(?P<title>[^\(\)\[\]]*)([\(\[](?P<from>[^\)\]]*)[\)\]])?\s*([\(\[](?P<more1>[^\)\]]*)[\)\]])?'
p = re.compile(regex_patern)
rows = [
    '(event ) (tag ) [group (artist )] title (form ) [addition1] [addition2]',
    '(event ) [group (artist )] title (form ) [addition1]',
    '[event] [group (artist )] title (form ) (addition1 )',
    '(tag ) [group (artist )] title',
    '[group (artist )] title',
    'title',
]
for r in rows:
    r = re.search(p, r)
    print r.groupdict()
Fully fixed now. You had two mistakes: first, the last block was a mandatory capture; second, event must not be allowed to capture [group (artist )], so the character class in the event segment has to exclude \( as well.
$ python re.py
{u'from': 'form ', u'more1': 'addition1', u'artist': 'artist ', u'title': ' title ', u'group': 'group ', u'type': 'tag ', u'event': 'event '}
{u'from': 'form ', u'more1': 'addition1', u'artist': 'artist ', u'title': ' title ', u'group': 'group ', u'type': None, u'event': 'event '}
{u'from': 'form ', u'more1': 'addition1 ', u'artist': 'artist ', u'title': ' title ', u'group': 'group ', u'type': None, u'event': 'event'}
{u'from': None, u'more1': None, u'artist': 'artist ', u'title': ' title', u'group': 'group ', u'type': None, u'event': 'tag '}
{u'from': None, u'more1': None, u'artist': 'artist ', u'title': ' title', u'group': 'group ', u'type': None, u'event': None}
{u'from': None, u'more1': None, u'artist': None, u'title': 'title', u'group': None, u'type': None, u'event': None}
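For anyone reading this under Python 3, here is a sketch of roughly the same pattern laid out with re.VERBOSE so each segment sits on its own line. It is a close rewrite rather than a drop-in equivalent: the character classes are simplified, the from group is renamed form, the sample titles have no space before the closing bracket, and the parse helper (a name made up here) strips whitespace from each field.

```python
import re

# One optional segment per line; only the title is always present.
PATTERN = re.compile(r'''
    (?:[(\[](?P<event>[^()\[\]]*)[)\]])?\s*                 # (event) or [event], no nested brackets
    (?:[(\[](?P<type>[^()\[\]]*)[)\]])?\s*                  # (tag)
    (?:\[(?P<group>[^(\]]*)(?:\((?P<artist>[^)]*)\))?\])?   # [group (artist)]
    (?P<title>[^()\[\]]*)                                   # bare title text
    (?:[(\[](?P<form>[^)\]]*)[)\]])?\s*                     # (form)
    (?:[(\[](?P<more1>[^)\]]*)[)\]])?                       # first extra block
''', re.VERBOSE)

def parse(name):
    """Return the named groups with surrounding whitespace stripped."""
    # Every segment is optional, so search() always matches at position 0.
    m = PATTERN.search(name)
    return {k: (v.strip() if v is not None else None)
            for k, v in m.groupdict().items()}

rows = [
    '(event) (tag) [group (artist)] title (form) [addition1] [addition2]',
    '(tag) [group (artist)] title',
    '[group (artist)] title',
    'title',
]
for row in rows:
    print(parse(row))
```

Because the pattern is purely positional, a lone leading (tag) still lands in event, just as in the output above; telling the two apart needs something beyond position.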
7
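eromoe OP
I just noticed something really awkward...
[event] [group] title (from )
[event] [artist] title (from )
Is this unsolvable...? Can a regex grab one [XXX] to the left of the title and, as long as XXX does not contain something like 同人 / Cxx / 成年 XXX, treat it as the group+artist block?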
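A regex can express "a [...] block that does not contain certain words" with a negative lookahead. The Python 3 sketch below assumes the marker list is just 同人, 成年 and a Comiket-style code matching C followed by digits; the names NOT_LABEL, GROUP_BLOCK and find_group are made up for illustration, and real file names will likely need a longer list.

```python
import re

# Assumed marker list: a block containing any of these is an event/category
# block rather than a group block. \bC\d+\b stands in for codes such as C88.
NOT_LABEL = r'(?![^\]]*(?:同人|成年|\bC\d+\b))'

# First [...] block that passes the lookahead, with an optional (artist) part.
GROUP_BLOCK = re.compile(
    r'\[' + NOT_LABEL + r'(?P<group>[^\]()]*)(?:\((?P<artist>[^)]*)\))?\]'
)

def find_group(name):
    """Return (group, artist) from the first unlabelled [...] block, or None."""
    m = GROUP_BLOCK.search(name)
    if m is None:
        return None
    return m.group('group').strip(), m.group('artist')

print(find_group('[C88] [group (artist)] title'))
print(find_group('[成年コミック] [artist] title'))
```

This only says which block is not an event; whether the remaining block holds a group or an artist is still undecidable from the brackets alone.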
8
rogerchen 2015-08-29 09:12:01 +08:00
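Once you need to compare strings, doing it with regex alone turns into black magic. I'd suggest extracting the blocks first and then writing a bit of code to decide.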
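A rough Python 3 sketch of that "extract first, decide in code" approach: one small regex splits the name into bracket blocks and bare text, and ordinary code labels the blocks. The marker words, the C-plus-digits event code, and the names BLOCK, looks_like_label and classify are all assumptions for illustration.

```python
import re

# A [...] block (optionally holding one (...) inside), a (...) block, or bare text.
BLOCK = re.compile(r'\[(?:[^\[\]()]|\([^()]*\))*\]|\([^()]*\)|[^\[\]()]+')

# Assumed markers for "this block is an event or category, not a group".
EVENT_CODE = re.compile(r'^C\d+$')
CATEGORY_WORDS = ('同人', '成年')

def looks_like_label(text):
    return bool(EVENT_CODE.match(text)) or any(w in text for w in CATEGORY_WORDS)

def classify(name):
    """Label the blocks before the title; stop at the first bare text run."""
    info = {'labels': [], 'group': None, 'artist': None, 'title': None}
    for token in BLOCK.findall(name):
        token = token.strip()
        if not token:
            continue
        if token[0] not in '([':
            info['title'] = token
            break
        inner = token[1:-1].strip()
        if token[0] == '(' or looks_like_label(inner):
            info['labels'].append(inner)
        else:
            # Unlabelled [...] block: treat it as "group (artist)".
            group, _, artist = inner.partition('(')
            info['group'] = group.strip()
            info['artist'] = artist.rstrip(')').strip() or None
    return info

print(classify('[C88] [group (artist)] title (form)'))
print(classify('[成年コミック] [artist] title'))
```

If several unlabelled [...] blocks precede the title, the last one wins, and anything after the title is ignored here; both choices are easy to change because the decision lives in plain code rather than in the pattern.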