How can I make matching sentences against search conditions more efficient?

This topic was created 2517 days ago; the information in it may have since evolved or changed.

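I need to write a small script for colleagues who can't program. Roughly, they write their own search conditions and then run the script to get the results.

A search condition looks something like this: 虚汗 or ((怕 or 畏) and (寒 or 冷))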
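
To make the intended semantics concrete: if every term is read as a substring test on the sentence, the condition above expands by hand into the expression below. This is only an illustration; the function name `matches_example` and the sample sentences are made up for this sketch.

```python
def matches_example(sentence):
    """Hand-expanded form of: 虚汗 or ((怕 or 畏) and (寒 or 冷))."""
    return '虚汗' in sentence or (
        ('怕' in sentence or '畏' in sentence)
        and ('寒' in sentence or '冷' in sentence)
    )


print(matches_example('患者畏寒'))  # 畏 and 寒 are both present
print(matches_example('夜间虚汗'))  # 虚汗 is present
print(matches_example('怕热'))      # 怕 is present, but neither 寒 nor 冷
```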

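So far I have tried two approaches: regex matching (func1) and Python's in operator (func2). On a test corpus of 1220 entries, the time taken was:

func1 run time: 26.843648433685303 seconds

func2 run time: 46.992613554000854 seconds

When the corpus is large and there are many search conditions (or deeply nested ones), the efficiency doesn't feel ideal. Does anyone have ideas for improving it, or a better approach altogether?

Link to the related code files: https://pan.baidu.com/s/1c2llxlu password: v7ia

The code I wrote is pasted below.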

# -*- coding=utf-8 -*-
import re
import time
import pandas as pd
from itertools import permutations


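def str_to_list(p):
    """ Convert a search-condition string into a nested list.
        e.g.:   虚汗 or ((怕 or 畏) and (寒 or 冷))
        result: ['虚汗','or',[['怕','or','畏'],'and',['寒','or','冷']]]
    """
    # Quote each operator (and/or) and each term. A term is a run of
    # characters with no whitespace, no parenthesis and no and/or keyword.
    p = re.sub(r'(and|or|(?:(?!and|or)[^()（）\s])+)', r'"\1",', p)
    # Turn ASCII and full-width parentheses into list brackets
    p = p.replace('(','[').replace('（','[').replace(')','],').replace('）','],')
    p = '[%s]' % p
    # eval parses the bracketed text into a list; only use it on trusted input
    return eval(p)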
    
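def list_to_regex(p):
    """ Convert the nested list into a regular expression (recursive).
        e.g.:   ['虚汗','or',[['怕','or','畏'],'and',['寒','or','冷']]]
        result: (虚汗|((怕|畏).*?(寒|冷)|(寒|冷).*?(怕|畏)))
    """
    # Recurse into sub-lists; escape plain terms so that any regex
    # metacharacters in them are matched literally.
    tempP = [list_to_regex(x) if isinstance(x, list) else re.escape(x)
                for x in p
                if x not in ['or', 'and']]
    if 'and' in p:
        # "and" does not care about order, so emit every ordering of the
        # operands; n operands produce n! alternatives.
        return '(%s)' % '|'.join('.*?'.join(x) for x in permutations(tempP))
    else:
        return '(%s)' % '|'.join(tempP)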

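def match_sentence(p, sentence):
    """Match a sentence by rewriting the condition into `in` tests."""
    words = (p.replace(' ', '').replace(',', '')
              .replace('（', '(').replace('）', ')')
              .replace('and', ',and,').replace('or', ',or,')
              .replace('(', ',(,').replace(')', ',),')
              .split(','))
    # Keep operators and parentheses as they are; turn every other non-empty
    # token into a substring test. %r quotes the text safely even when the
    # term or the sentence contains quote characters.
    scriptStr = [w if w in ('and', 'or', '(', ')') else '%r in %r' % (w, sentence)
                 for w in words if w]
    return bool(eval(' '.join(scriptStr)))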

def func1(patternFile, sentenceFile):
    """
    Convert each condition to a regex, then match.
    patternFile -- name of the file containing the search conditions
    sentenceFile -- name of the corpus file
    """
    dfS = pd.read_excel(sentenceFile)
    dfP = pd.read_excel(patternFile)
    # Compile every condition (last column) once, up front
    regexList = [re.compile(list_to_regex(str_to_list(x))) for x in dfP.iloc[:, -1]]
    # Keyword for each condition (second-to-last column)
    keywords = list(dfP.iloc[:, -2])
    resultFile = 'result1.txt'
    # Open the result file once instead of once per match
    with open(resultFile, 'a', encoding='utf-8') as f:
        for sentence in dfS.iloc[:, -1]:
            for keyword, patt in zip(keywords, regexList):
                if patt.search(sentence):
                    f.write('%s\t%s\n' % (keyword, sentence))
    
def func2(patternFile, sentenceFile):
    """
    Convert each condition to `in` tests, then match.
    patternFile -- name of the file containing the search conditions
    sentenceFile -- name of the corpus file
    """
    dfS = pd.read_excel(sentenceFile)
    dfP = pd.read_excel(patternFile)
    # (keyword, condition) pairs taken from the last two columns
    patterns = list(zip(dfP.iloc[:, -2], dfP.iloc[:, -1]))
    resultFile = 'result2.txt'
    # Open the result file once instead of once per match
    with open(resultFile, 'a', encoding='utf-8') as f:
        for sentence in dfS.iloc[:, -1]:
            for keyword, pattern in patterns:
                if match_sentence(pattern, sentence):
                    f.write('%s\t%s\n' % (keyword, sentence))
    
if __name__ == '__main__':
    # Test run
    patternFile = '检索条件_测试.xlsx'
    sentenceFile = '语料_测试.xlsx'
    t1 = time.time()
    func2(patternFile, sentenceFile)
    t2 = time.time()
    print('func2 run time:', t2 - t1)
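
One direction that might be worth trying, sketched here under assumptions rather than measured: parse each condition once into a plain Python function, so that no string is rebuilt or passed to eval for every sentence. The sketch below is a small recursive-descent parser. It assumes that and/or are separated from terms by whitespace and that `and` binds tighter than `or`, as in Python. The names `compile_query` and `TOKEN_RE` are hypothetical and not part of the script above.

```python
import re

# A token is a single parenthesis (ASCII or full-width) or a run of other
# non-space characters (a term, or the keywords and/or).
TOKEN_RE = re.compile(r'[()（）]|[^\s()（）]+')


def compile_query(query):
    """Parse a condition once and return a function: sentence -> bool."""
    tokens = [{'（': '(', '）': ')'}.get(t, t) for t in TOKEN_RE.findall(query)]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def parse_or():
        parts = [parse_and()]
        while peek() == 'or':
            take()
            parts.append(parse_and())
        if len(parts) == 1:
            return parts[0]
        return lambda s: any(part(s) for part in parts)

    def parse_and():
        parts = [parse_atom()]
        while peek() == 'and':
            take()
            parts.append(parse_atom())
        if len(parts) == 1:
            return parts[0]
        return lambda s: all(part(s) for part in parts)

    def parse_atom():
        tok = take()
        if tok == '(':
            inner = parse_or()
            if take() != ')':
                raise ValueError('missing closing parenthesis')
            return inner
        if tok in (None, ')', 'and', 'or'):
            raise ValueError('unexpected token: %r' % tok)
        return lambda s: tok in s  # plain substring test for a term

    predicate = parse_or()
    if peek() is not None:
        raise ValueError('unexpected token: %r' % peek())
    return predicate


predicate = compile_query('虚汗 or ((怕 or 畏) and (寒 or 冷))')
print([s for s in ['患者畏寒', '怕热', '夜间虚汗'] if predicate(s)])
# prints ['患者畏寒', '夜间虚汗']
```

Each condition would be compiled once before the loop over the corpus, in the same place where func1 builds its list of compiled regexes.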

1 reply • 2018-01-02 02:00:18 +08:00

patx

2018-01-02 02:00:18 +08:00

A full-text search engine would probably work better.
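
One way to read this suggestion: a full-text engine builds an index over the corpus up front, so a term lookup does not rescan every sentence. The sketch below is a minimal in-memory illustration of that idea, not a substitute for a real engine. It indexes single characters, then confirms each candidate with a substring test. The names `build_char_index` and `ids_containing` and the sample sentences are hypothetical.

```python
from collections import defaultdict


def build_char_index(sentences):
    """Map each character to the set of sentence ids that contain it."""
    index = defaultdict(set)
    for sid, sentence in enumerate(sentences):
        for ch in set(sentence):
            index[ch].add(sid)
    return index


def ids_containing(term, sentences, index):
    """Return the ids of sentences that contain `term` as a substring."""
    # Intersect the id sets of every character in the term, then verify the
    # survivors: sharing all characters does not guarantee a substring match.
    postings = [index.get(ch, set()) for ch in term]
    candidates = set.intersection(*postings) if postings else set()
    return {sid for sid in candidates if term in sentences[sid]}


sentences = ['患者畏寒', '怕热', '夜间虚汗', '怕冷无汗']
index = build_char_index(sentences)


def hit(term):
    return ids_containing(term, sentences, index)


# With id sets, "or" becomes set union and "and" becomes set intersection.
result = hit('虚汗') | ((hit('怕') | hit('畏')) & (hit('寒') | hit('冷')))
print(sorted(result))  # prints [0, 2, 3]
```

The index is built once per corpus; after that, each condition is evaluated with set operations over the id sets instead of a pass over all sentences.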