
Using Python to analyze my own browsing habits in Chrome

  •   Fikhtengol · Feb 12, 2013 · 8013 views
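    Today is the second day of the Lunar New Year (初二). I came back from an outing with nothing to do, and while browsing around I suddenly realized that I never stay focused when I write code: I keep clicking here one moment and there the next. So I wanted to check my own browsing history to see which pages I visit regularly and which ones I visit most. I use Chrome, but its History page does not seem to offer this, so I wrote it myself.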
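    Chrome's history directory is ~/.config/google-chrome/Default/. The History file in it looks like gibberish when opened as text; it turns out to be stored as an SQLite database, which makes things easy. Find the History file and look at the tables in it. There are several, and one of them, urls, looked like the place where visited URLs are kept. It even has a visit_count column. That's the one, so start selecting: select * from urls order by visit_count desc limit 0,10. The result was full of pages from the same site, so next I wanted totals per host. That means first extracting the host from each url, then summing the counts of identical hosts, and then sorting and printing.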
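The exploration described above can be sketched roughly as follows. This is only an illustration, not the author's script: the helper names `list_tables` and `top_urls` are made up here, the sample rows are hypothetical, and the sketch assumes the `urls` table has `url` and `visit_count` columns as the post describes. The `DESC` matters, since without it the least visited rows come first.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def top_urls(conn, n=10):
    """Return the n most visited (url, visit_count) rows."""
    return conn.execute(
        "SELECT url, visit_count FROM urls ORDER BY visit_count DESC LIMIT ?",
        (n,),
    ).fetchall()


if __name__ == "__main__":
    # In-memory stand-in for the real History file (hypothetical data).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE urls (url TEXT, visit_count INTEGER)")
    conn.executemany(
        "INSERT INTO urls VALUES (?, ?)",
        [
            ("http://a.example/1", 5),
            ("http://a.example/2", 7),
            ("http://b.example/", 3),
        ],
    )
    print(list_tables(conn))
    print(top_urls(conn, 2))
    conn.close()
```

With real data, the connection would point at the History file instead of `:memory:`.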

    #!/usr/bin/env python3
    '''
    Analyse the user's Chrome browsing behaviour.
    '''
    import sqlite3
    from urllib.parse import urlparse


    class AnalyseChrome:
        '''
        Chrome stores the user's history as an SQLite database, saved by
        default in ~/.config/google-chrome/Default/History on Ubuntu.
        '''

        def __init__(self, db="/home/lijun/.config/google-chrome/Default/History"):
            '''Init the AnalyseChrome with the path of the Chrome history db.'''
            self.cn = sqlite3.connect(db)
            self.cu = self.cn.cursor()

        def get_sql_res(self, sql, params=()):
            '''Run a query; return (rows, "") on success or (0, error message) on failure.'''
            try:
                self.cu.execute(sql, params)
            except sqlite3.Error as e:
                print(str(e))
                return 0, str(e)
            res = self.cu.fetchall()
            return res, ""

        def show_table(self, name="%"):
            '''Show the tables in the History db whose name matches a LIKE pattern.'''
            # Bind the pattern as a parameter instead of formatting it into the SQL.
            sql = "SELECT * FROM sqlite_master WHERE type='table' and name like ?;"
            return self.get_sql_res(sql, (name,))

        def clear(self):
            '''Close the database connection.'''
            self.cn.close()

        def top_n(self, n, orderby="host"):
            '''
            Return the top n urls or hosts the user visits most frequently.
            Groups by host by default.
            '''
            if orderby not in ("host", "url"):
                return None, "error argv in top_n"
            sql = "select url,visit_count from urls order by url;"
            res, errmsg = self.get_sql_res(sql)
            if not res:
                return None, errmsg
            # First select every url and visit_count from the urls table,
            # then build a list of unique keys with summed counts by hand,
            # then sort it with list.sort() and return the top n.
            # Maybe not the quickest or easiest way (a max heap?), but my
            # history is not that large.
            pairs = []
            for url, count in res:
                key = urlparse(url).netloc if orderby == "host" else url
                if key:
                    pairs.append((key, count))
            # Sort by key so that equal hosts are adjacent even when
            # http:// and https:// rows for one host are interleaved.
            pairs.sort()
            uniq_res = []
            for key, count in pairs:
                if uniq_res and uniq_res[-1][0] == key:
                    uniq_res[-1][1] += count
                else:
                    uniq_res.append([key, count])
            uniq_res.sort(key=lambda x: x[1], reverse=True)
            return uniq_res[:n], ""


    if __name__ == "__main__":
        ac = AnalyseChrome()

        tb, errormsg = ac.show_table('urls')
        if tb:
            for i in tb:
                print(i)

        res, errormsg = ac.top_n(20, "host")
        if res:
            for no, i in enumerate(res, 1):
                print(no, i)
        else:
            print(errormsg)
        ac.clear()
    That's a start. Later one could also work out each host's share of the visits, or the visits within a given period of time...
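The two follow-up ideas could start from something like the sketch below. Both helper names are hypothetical. `host_shares` only needs the `[host, count]` pairs the script already produces. `chrome_time_to_datetime` rests on an assumption the post does not check: that Chrome stores visit times as microseconds since 1601-01-01 00:00 UTC.

```python
from datetime import datetime, timedelta


def host_shares(host_counts):
    """Turn [(host, count), ...] into [(host, count, percent), ...]."""
    total = sum(count for _, count in host_counts)
    if total == 0:
        return []
    return [(host, count, 100.0 * count / total) for host, count in host_counts]


def chrome_time_to_datetime(microseconds):
    """Convert a Chrome history timestamp to a naive UTC datetime.

    Assumption: the value counts microseconds since 1601-01-01 00:00 UTC.
    """
    return datetime(1601, 1, 1) + timedelta(microseconds=microseconds)


if __name__ == "__main__":
    for host, count, percent in host_shares([("a.example", 3), ("b.example", 1)]):
        print(f"{host}: {count} visits ({percent:.1f}%)")
```

Filtering by period would then mean converting each stored timestamp and comparing it against the start and end of the window.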
    17 replies
    Fikhtengol (OP) · Feb 12, 2013 · #1
    I haven't looked closely at the other history files yet. There should be plenty more to dig out of them. Is anyone interested in digging in?
    Fikhtengol (OP) · Feb 12, 2013 · #2
    Argh, the code lost its indentation when I copied it here straight from the editor. How should I fix that?
    paloalto · Feb 12, 2013 · #4
    @Fikhtengol You can paste the code into a gist.
    zythum · Feb 12, 2013 · #5
    Go to https://gist.github.com and then paste the link here.
    Fikhtengol (OP) · Feb 12, 2013 · #7
    Add-ons: is there one for Chrome?
    @paloalto
    lowstz · Feb 12, 2013 · #9
    Use the following for the db path to reduce hard-coding:
    db = os.path.expanduser('~/.config/google-chrome/Default/History')
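A small sketch of that suggestion, with a hypothetical helper name. It assumes the Linux profile layout mentioned in the post; other platforms keep the file elsewhere.

```python
import os.path


def default_history_path():
    """Build the History path from the current user's home directory."""
    return os.path.expanduser("~/.config/google-chrome/Default/History")


if __name__ == "__main__":
    print(default_history_path())
```

The result could be passed as the `db` argument of `AnalyseChrome` in place of the path under /home/lijun.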
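    cyr1l · Feb 12, 2013 · #10
    I thought the OP was in 初二 (the second year of junior high) this year, which gave me a scare. Turns out it meant today, the second day of the Lunar New Year... But today is already the third day (初三). Even if you posted 12 hours and 57 minutes ago, it was already the third. #never mind the details...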
    CaoZ · Feb 12, 2013 · #11
    collections.Counter is a handy thing.

    Does analyzing the file directly instead of using the browser's API count as a hack? Still, the Python version is simple and direct...
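The `collections.Counter` suggestion could look roughly like this. It is a sketch rather than a drop-in replacement for the script: `top_hosts` is a hypothetical name, and `rows` stands for the `(url, visit_count)` pairs selected from the `urls` table.

```python
from collections import Counter
from urllib.parse import urlparse


def top_hosts(rows, n):
    """Sum visit counts per host and return the n largest as (host, total)."""
    counts = Counter()
    for url, visit_count in rows:
        host = urlparse(url).netloc
        if host:  # skip rows without a host, such as file:/// URLs
            counts[host] += visit_count
    return counts.most_common(n)


if __name__ == "__main__":
    sample = [
        ("http://a.example/x", 2),
        ("https://a.example/y", 3),
        ("http://b.example/", 4),
    ]
    print(top_hosts(sample, 2))
```

Because the counter is keyed by host, the rows do not need to arrive sorted, and `most_common` does the final ranking.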
    Fikhtengol (OP) · Feb 13, 2013 · #12
    Heh, I just ran pwd and copied the path straight from there.
    Fikhtengol (OP) · Feb 13, 2013 · #13
    Fikhtengol (OP) · Feb 13, 2013 · #14
    I worry that writing too much Python will make me regress; the data structures are just too convenient... @caoz
    oppih28 · Feb 13, 2013 via iPhone · #15
    cloverstd · Feb 14, 2013 · #16
    database is locked
    Fikhtengol (OP) · Feb 14, 2013 · #17
    Close Chrome; the database is locked by Chrome while it is running.
    @cloverstd
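Closing Chrome is the fix given in the thread. A possible alternative, not mentioned by anyone above, is to copy the History file and query the copy. The sketch below uses a hypothetical helper name, and it carries a caveat: a copy taken while Chrome is writing could be incomplete, so closing the browser remains the safer choice.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def open_history_copy(history_path):
    """Copy the database to a temp directory and connect to the copy.

    Returns (connection, temp_dir). Close the connection first, then
    call temp_dir.cleanup() to delete the copy.
    """
    temp_dir = tempfile.TemporaryDirectory()
    copy_path = Path(temp_dir.name) / "History"
    shutil.copy2(history_path, copy_path)
    return sqlite3.connect(str(copy_path)), temp_dir
```

Usage would be along the lines of `conn, tmp = open_history_copy(path)`, followed by the queries, `conn.close()`, and `tmp.cleanup()`.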