A question about cleaning data in a list of dicts in Python

I have a list of dicts like this:

l = [{'name': 'aa', 'type': '游戏'}, {'name': 'bb', 'type': '游戏'}, {'name': 'cc', 'type': '学习'}]

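Given a list like the one above, where every dict has a type key, how can I filter out the dicts whose type differs from the majority?

For example, say the list holds 8 dicts: 6 with type 游戏 (game), 1 with type 学习 (study), and 1 with type 玩耍 (play). How do I filter out the last two?

The types are not known in advance, of course. The most common one is not necessarily 游戏; it could just as well be 吃饭 (eating) or 睡觉 (sleeping).

Could anyone point me in the right direction?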

10 replies • 2018-12-06 00:25:27 +08:00

ipwx

Dec 5, 2018

Compute the percentage of each type, then pick a threshold based on Zipf's Law and drop the types whose percentage falls below it.
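A minimal sketch of the percentage-and-threshold idea, using collections.Counter from the standard library. The function name filter_rare_types and the min_share parameter are hypothetical, and the 0.5 default is an arbitrary assumption: the reply does not say how to derive a cutoff from Zipf's Law, so that choice is left to the caller.

```python
from collections import Counter


def filter_rare_types(records, min_share=0.5):
    """Keep records whose 'type' makes up at least min_share of the list.

    min_share is a hypothetical tuning knob, not a value from the thread.
    """
    if not records:
        return []
    counts = Counter(r['type'] for r in records)
    total = len(records)
    # Share of each type as a fraction of all records.
    shares = {t: n / total for t, n in counts.items()}
    return [r for r in records if shares[r['type']] >= min_share]


l = [{'name': 'aa', 'type': '游戏'},
     {'name': 'bb', 'type': '游戏'},
     {'name': 'cc', 'type': '学习'}]
kept = filter_rare_types(l)
```

With this sample, 游戏 accounts for two thirds of the records and 学习 for one third, so only the two 游戏 records pass the default cutoff.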

necomancer

Dec 5, 2018

For small data:
lst = sorted(l, key=lambda x: x.get('type'))
ret = [[]]
for prv, nxt in zip(lst[:-1], lst[1:]):
    ret[-1].append(prv)
    if prv['type'] != nxt['type']:
        ret.append([])  # the type changed, so start a new group
ret[-1].append(lst[-1])  # the last item belongs to the last group
Then take the largest group in ret, or just use itertools.groupby:
[list(g) for c, g in groupby(lst, key=lambda x: x.get('type'))]
Either way, the list has to be sorted first.
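A short sketch of the sort-then-groupby route, extended to pick the largest group. The helper type_of and the name majority are illustrative, and the list is assumed to be non-empty (max raises ValueError on an empty sequence).

```python
from itertools import groupby


def type_of(record):
    """Sort and group key: the record's 'type' value."""
    return record.get('type')


l = [{'name': 'aa', 'type': '游戏'},
     {'name': 'cc', 'type': '学习'},
     {'name': 'bb', 'type': '游戏'}]

# groupby only merges adjacent items, so sort by the same key first.
lst = sorted(l, key=type_of)
groups = [list(g) for _, g in groupby(lst, key=type_of)]

# The longest group holds the majority type (ties go to the first one found).
majority = max(groups, key=len)
```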

Or with pandas:
import pandas as pd
l = [{'name': 'aa', 'type': '游戏'},
     {'name': 'cc', 'type': '学习'},
     {'name': 'bb', 'type': '游戏'}]  # order does not matter

list(pd.DataFrame(l).groupby('type')) does the job; the output is a list of tuples, one per category:

[(group name 1, DataFrame for group 1), (group name 2, DataFrame for group 2), ...]. The size of each group can be read from the DataFrame's shape.

In [40]: list(pd.DataFrame(l).groupby('type'))
Out[40]:
[('学习',   name type
 1   cc   学习), ('游戏',   name type
 0   aa   游戏
 2   bb   游戏)]

In [41]: p=list(pd.DataFrame(l).groupby('type'))[1][1]

In [42]: p.shape
Out[42]: (2, 2)

In [43]: p
Out[43]:
  name type
0   aa   游戏
2   bb   游戏

For moderately sized data, pandas is already quite efficient. If the data gets larger still, consider the approach from #1.

cyy564

Dec 5, 2018

@ipwx I'm stuck at the very first step: I can't think of a good way to compute the percentage of each type.

necomancer

Dec 5, 2018

from itertools import groupby
[ list(g) for c, g in groupby(lst, key=(lambda x : x.get('type'))) ]

necomancer

Dec 5, 2018

@cyy564 The percentages are easy to compute. Count each type first:

ret = {}
for i in l:
    if not ret.get(i['type']):
        ret[i['type']] = 0
    ret.get(i['type']) +=1

This works easily even when you don't know in advance how many types there are.

necomancer

Dec 5, 2018

Sorry, the last line should index the dict instead of calling get:

ret = {}
for i in l:
    if not ret.get(i['type']):
        ret[i['type']] = 0
    ret[i['type']] += 1

cyy564

Dec 5, 2018

@necomancer Thanks, this was a big help: [ list(g) for c, g in groupby(lst, key=(lambda x : x.get('type'))) ]

cyy564

Dec 5, 2018

@necomancer

Hmm... if l becomes [{'name': 'aa', 'type': '游戏'}, {'name': 'bb', 'type': '游戏'}, {'name': 'cc', 'type': '学习'}, {'name': 'dd', 'type': '游戏'}]

then [list(g) for c,g in groupby(l, key=(lambda x: x.get('type')))] actually splits them apart.

Output: [[{'name': 'aa', 'type': '游戏'}, {'name': 'bb', 'type': '游戏'}], [{'name': 'cc', 'type': '学习'}], [{'name': 'dd', 'type': '游戏'}]]

That is not the result I want, so I'll go look at groupby in pandas instead.
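The split happens because itertools.groupby only groups consecutive items that share a key. A sketch of one order-independent alternative in plain Python, bucketing into a dict of lists (buckets and majority are illustrative names, and the list is assumed to be non-empty):

```python
from collections import defaultdict

l = [{'name': 'aa', 'type': '游戏'},
     {'name': 'bb', 'type': '游戏'},
     {'name': 'cc', 'type': '学习'},
     {'name': 'dd', 'type': '游戏'}]

# Bucket records by type; unlike groupby, adjacency does not matter.
buckets = defaultdict(list)
for record in l:
    buckets[record['type']].append(record)

# The largest bucket holds the majority type.
majority = max(buckets.values(), key=len)
```

Here all three 游戏 records end up in the same bucket even though 'dd' comes after the 学习 record.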

necomancer

Dec 5, 2018

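@cyy564 As I already said in #2, this needs the list sorted first. pandas doesn't care about order. So for small data, use plain Python sorted + itertools.groupby; for larger data, use pandas.DataFrame.groupby; and if it is really, really huge, consider the approach from #1.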

darkTianTian

Dec 6, 2018

If you don't need name, you can just do:
from collections import Counter
Counter([x['type'] for x in l]).most_common()
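If the full dicts are still needed, the same Counter can drive the filter. This is a sketch rather than part of the reply above; top_type, top_count and filtered are illustrative names, and the list is assumed to be non-empty.

```python
from collections import Counter

l = [{'name': 'aa', 'type': '游戏'},
     {'name': 'bb', 'type': '游戏'},
     {'name': 'cc', 'type': '学习'}]

# most_common(1) returns a one-element list of (type, count) pairs.
top_type, top_count = Counter(x['type'] for x in l).most_common(1)[0]

# Keep only the records whose type is the most common one.
filtered = [x for x in l if x['type'] == top_type]
```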