What my program does is simple: it opens test.txt, a file containing the numbers 1-500000 (one per line), counts how many of them are odd and how many are even, and stores the two counts in a dictionary.
To use multiprocessing, I split the 500,000 numbers into n chunks, so each process handles 500000/n of them.
The upshot: the larger num_of_process is, the slower the program runs, and it is fastest when num_of_process is 1. Why? In theory, more processes should mean less time. What should be optimized in this program?
I'm asking this on someone else's behalf. To my embarrassment, the girl actually writes quite a fine hand of code... thanks.
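To make the split concrete, this is the slicing scheme the script uses, as a standalone sketch with a stand-in list (the value n = 4 here is hypothetical, not from the script):

data = range(500000)  # stand-in for the 500000 lines read from test.txt
n = 4                 # hypothetical process count
for i in xrange(n):
    chunk = data[i * len(data) / n:(i + 1) * len(data) / n]
    print i, len(chunk)  # each of the 4 chunks holds 125000 numbers

The full script is below.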
import time
import copy
import multiprocessing

labels = {'0': 0, '1': 0}
train_set = 'test.txt'
num_of_process = 2

def statistics(line, label):
    # count one number as even ('0') or odd ('1')
    num = int(line)
    if num % 2 == 0:
        label['0'] += 1
    else:
        label['1'] += 1

def union_dict(objs):
    # merge a list of dicts by summing the values per key
    # (meant for combining the per-process results, though never called below)
    _keys = set(sum([obj.keys() for obj in objs], []))
    _total = {}
    for _key in _keys:
        _total[_key] = sum([obj.get(_key, 0) for obj in objs])
    return _total

def myprocess(data, i):
    # count odds and evens in the i-th of num_of_process contiguous slices
    labelnew = copy.deepcopy(labels)
    for line in data[i * len(data) / num_of_process:(i + 1) * len(data) / num_of_process]:
        statistics(line, labelnew)
    return labelnew

if __name__ == '__main__':
    e1 = time.time()
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())  # processes=4 on my Mac
    result_list = []
    data_train = open(train_set, 'r').readlines()
    for i in xrange(num_of_process):
        #multiprocessing.Process(myprocess(target=myprocess,args=[data_train,i]))
        result = pool.apply_async(myprocess, (data_train, i))
        result_list.append(result.get())  # get() blocks until this chunk finishes
    print result_list
    pool.close()
    pool.join()
    e2 = time.time()
    print float(e2 - e1)
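Two things in the main loop matter when reading the timings. First, result.get() blocks until that task has finished, so calling it right after each apply_async makes the chunks run one after another rather than in parallel. Second, each apply_async pickles the entire data_train list to ship it to the worker, and that copying cost grows with num_of_process. A variant that at least lets the workers overlap would submit everything first and collect afterwards; this is only a sketch, and it assumes the same myprocess, train_set, and num_of_process definitions as above:

if __name__ == '__main__':
    e1 = time.time()
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    data_train = open(train_set, 'r').readlines()
    # submit every chunk before touching any result...
    async_results = [pool.apply_async(myprocess, (data_train, i))
                     for i in xrange(num_of_process)]
    pool.close()
    # ...then block for the answers, so the workers run concurrently
    result_list = [r.get() for r in async_results]
    pool.join()
    print result_list
    print float(time.time() - e1)

Whether this actually beats the single-process run still depends on the cost of pickling 500,000 lines to every worker, so it is worth timing both versions.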