Tried out the 8-million-word Chinese word vectors Tencent open-sourced last year

    leiuu · 2019-11-28 18:51:24 +08:00 · 4519 views

    I've been working on some word-embedding stuff lately and stumbled across the word vector model Tencent open-sourced last year:
    https://mp.weixin.qq.com/s/b9NWR0F7GQLYtgGSL50gQw

    The model covers 8 million Chinese words (quite a few of the entries are wrong words, though), but overall it is still pretty powerful.
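With a vocabulary that large, it can help to load only part of the file while experimenting. Here is a minimal sketch, assuming the download is a word2vec-style text file (a header line with the word count and dimension, then one word and its values per line). The function name `load_vectors`, the `max_words` cap, and the 3-dimensional sample values are all made up for illustration; they are not taken from the release.

```python
import io


def load_vectors(lines, max_words=None):
    """Parse word2vec-style text: a 'count dim' header, then 'word v1 ... v_dim' per line."""
    it = iter(lines)
    header = next(it).split()
    dim = int(header[1])
    vectors = {}
    for line in it:
        if max_words is not None and len(vectors) >= max_words:
            break  # stop early so a huge file is not read in full
        # Split from the right so the word itself may contain spaces.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip rows with the wrong number of fields
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


# Toy stand-in for the real file (made-up values, not real embeddings).
sample = io.StringIO(
    "3 3\n"
    "红烧肉 0.9 0.1 0.0\n"
    "回锅肉 0.8 0.2 0.1\n"
    "ojbk 0.0 0.1 0.9\n"
)
dim, vectors = load_vectors(sample, max_words=2)
print(dim, list(vectors))
```

For the real file you would pass an open file handle (for example `open(path, encoding="utf-8")`) instead of the `StringIO` sample.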

    I threw together a simple API, haha: https://zhuanlan.zhihu.com/p/94124468
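The API code itself is in the linked post and is not reproduced here. As a rough sketch of what such a lookup might do, the snippet below ranks words by cosine similarity and returns the same `top_similar_words` / `word` shape as the results in this thread. It assumes the scores are cosine similarities, which is the usual choice for word vectors but is not confirmed here. The names `cosine`, `top_similar`, and `toy_vectors`, and the 3-dimensional values, are made up for illustration.

```python
import json
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_similar(vectors, word, k=20):
    """Return the k words closest to `word`, in the response shape used in this thread."""
    query = vectors[word]
    scored = [
        [other, cosine(query, vec)]
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return {"top_similar_words": scored[:k], "word": word}


# Made-up toy vectors; the real model has far more words and dimensions.
toy_vectors = {
    "红烧肉": [0.9, 0.1, 0.0],
    "回锅肉": [0.8, 0.2, 0.1],
    "ojbk": [0.0, 0.1, 0.9],
}
result = top_similar(toy_vectors, "红烧肉", k=2)
print(json.dumps(result, ensure_ascii=False, indent=3))
```

A brute-force loop like this is only practical for small vocabularies; with millions of words a vectorized library or an approximate nearest-neighbor index would be the more realistic choice.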

    Some fun tests:
    1. Similar words for 红烧肉 (braised pork)
    output:

    {
       "top_similar_words":[
          [
             "糖醋排骨",
             0.8907967209815979
          ],
          [
             "红烧排骨",
             0.8726683259010315
          ],
          [
             "回锅肉",
             0.858664333820343
          ],
          [
             "红烧鱼",
             0.8542774319648743
          ],
          [
             "梅菜扣肉",
             0.8500987887382507
          ],
          [
             "糖醋小排",
             0.8475514650344849
          ],
          [
             "小炒肉",
             0.8435966968536377
          ],
          [
             "红烧五花肉",
             0.8424086570739746
          ],
          [
             "红烧肘子",
             0.8400496244430542
          ],
          [
             "糖醋里脊",
             0.8381932377815247
          ],
          [
             "红烧猪蹄",
             0.8374584913253784
          ],
          [
             "青椒炒肉",
             0.8344883918762207
          ],
          [
             "粉蒸肉",
             0.8337559700012207
          ],
          [
             "水煮肉片",
             0.8311598300933838
          ],
          [
             "青椒肉丝",
             0.8294434547424316
          ],
          [
             "鱼香茄子",
             0.8291393518447876
          ],
          [
             "烧茄子",
             0.8272593021392822
          ],
          [
             "梅干菜扣肉",
             0.8267726898193359
          ],
          [
             "土豆炖牛肉",
             0.8263725638389587
          ],
          [
             "红烧茄子",
             0.8244959115982056
          ]
       ],
       "word":"红烧肉"
    }
    

    2. Similar words for 因吹斯汀 (a slang transliteration of "interesting")
    output:

    {
       "top_similar_words":[
          [
             "一颗赛艇",
             0.7618176937103271
          ],
          [
             "因吹斯听",
             0.7523878812789917
          ],
          [
             "城会玩",
             0.6856077909469604
          ],
          [
             "厉害了 word 哥",
             0.6615914702415466
          ],
          [
             "emmmmm",
             0.6590334177017212
          ],
          [
             "扎心了老铁",
             0.6527535915374756
          ],
          [
             "神吐槽",
             0.6382066011428833
          ],
          [
             "可以说是非常爆笑了",
             0.6365567445755005
          ],
          [
             "不明觉厉",
             0.6362186670303345
          ],
          [
             "段子哥",
             0.6293908357620239
          ],
          [
             "厉害了我的哥",
             0.6265187859535217
          ],
          [
             "脑洞大开",
             0.6255093216896057
          ],
          [
             "hhhhhh",
             0.6220428943634033
          ],
          [
             "233333",
             0.6189173460006714
          ],
          [
             "没想到你是这样的",
             0.6184067726135254
          ],
          [
             "屌炸天",
             0.6119771003723145
          ],
          [
             "interesting",
             0.6102393865585327
          ],
          [
             "emmmmmmm",
             0.6097372770309448
          ],
          [
             "开脑洞",
             0.6095746755599976
          ],
          [
             "猴赛雷",
             0.6095525026321411
          ]
       ],
       "word":"因吹斯汀"
    }
    

    3. Similar words for ojbk (slang for "OK")
    output:

    {
       "top_similar_words":[
          [
             "我觉得 ok",
             0.6393940448760986
          ],
          [
             "emmmmmmm",
             0.6306545734405518
          ],
          [
             "hhhh",
             0.6229800581932068
          ],
          [
             "hhhhh",
             0.6225401163101196
          ],
          [
             "不存在的",
             0.6077110767364502
          ],
          [
             "溜了溜了",
             0.603063702583313
          ],
          [
             "hhhhhhh",
             0.6008774638175964
          ],
          [
             "emmmm",
             0.6002634167671204
          ],
          [
             "emmm",
             0.5958442687988281
          ],
          [
             "emmmmm",
             0.592516303062439
          ],
          [
             "阿喵",
             0.5918310880661011
          ],
          [
             "哈哈哈",
             0.590988039970398
          ],
          [
             "略略略",
             0.590296745300293
          ],
          [
             "hhhhhh",
             0.5870903730392456
          ],
          [
             "微笑脸",
             0.5860881209373474
          ],
          [
             "tan90°",
             0.5825910568237305
          ],
          [
             "没毛病",
             0.5802331566810608
          ],
          [
             "233333",
             0.5794929265975952
          ],
          [
             "我觉得不行",
             0.5762011408805847
          ],
          [
             "就酱",
             0.5751103162765503
          ]
       ],
       "word":"ojbk"
    }
    
    12 replies    2019-11-29 15:55:09 +08:00
        1
    nieyujiang  
       2019-11-28 18:52:46 +08:00   ❤️ 1
    The similar words for 红烧肉 made me hungry just looking at them.
        2
    leiuu  
    OP
       2019-11-28 18:53:28 +08:00
    @nieyujiang Haha, when you can't decide what to eat for dinner, just let this model recommend something.
        3
    nieyujiang  
       2019-11-28 18:59:11 +08:00
    @leiuu #2 I'd put on three jin (1.5 kg) right after eating  🤣
        4
    leiuu  
    OP
       2019-11-28 19:00:42 +08:00
    @nieyujiang
    There's more. Similar words for 烤串 (grilled skewers):
    ```json
    {
    "top_similar_words":[
    [
    "我觉得 ok",
    0.6393940448760986
    ],
    [
    "emmmmmmm",
    0.6306545734405518
    ],
    [
    "hhhh",
    0.6229800581932068
    ],
    [
    "hhhhh",
    0.6225401163101196
    ],
    [
    "不存在的",
    0.6077110767364502
    ],
    [
    "溜了溜了",
    0.603063702583313
    ],
    [
    "hhhhhhh",
    0.6008774638175964
    ],
    [
    "emmmm",
    0.6002634167671204
    ],
    [
    "emmm",
    0.5958442687988281
    ],
    [
    "emmmmm",
    0.592516303062439
    ],
    [
    "阿喵",
    0.5918310880661011
    ],
    [
    "哈哈哈",
    0.590988039970398
    ],
    [
    "略略略",
    0.590296745300293
    ],
    [
    "hhhhhh",
    0.5870903730392456
    ],
    [
    "微笑脸",
    0.5860881209373474
    ],
    [
    "tan90°",
    0.5825910568237305
    ],
    [
    "没毛病",
    0.5802331566810608
    ],
    [
    "233333",
    0.5794929265975952
    ],
    [
    "我觉得不行",
    0.5762011408805847
    ],
    [
    "就酱",
    0.5751103162765503
    ]
    ],
    "word":"ojbk"
    }
    ```
        5
    leiuu  
    OP
       2019-11-28 19:01:40 +08:00
    @nieyujiang Pasted the wrong result, let me redo it.
    {
    "top_similar_words":[
    [
    "烤串儿",
    0.927384614944458
    ],
    [
    "羊肉串",
    0.894095778465271
    ],
    [
    "肉串",
    0.8555537462234497
    ],
    [
    "烤腰子",
    0.8516057729721069
    ],
    [
    "撸串",
    0.8469321727752686
    ],
    [
    "涮串",
    0.8465385437011719
    ],
    [
    "大肉串",
    0.8420960903167725
    ],
    [
    "烤肉串",
    0.838364839553833
    ],
    [
    "牛肉串",
    0.8371975421905518
    ],
    [
    "烤海鲜",
    0.8364357948303223
    ],
    [
    "烧烤摊",
    0.8351374864578247
    ],
    [
    "炸串",
    0.8339198231697083
    ],
    [
    "烧烤",
    0.831093430519104
    ],
    [
    "烤羊肉串",
    0.8277176022529602
    ],
    [
    "各种烤串",
    0.8274507522583008
    ],
    [
    "烤鱿鱼",
    0.8235615491867065
    ],
    [
    "烤羊腿",
    0.8228681683540344
    ],
    [
    "烤猪蹄",
    0.8225207328796387
    ],
    [
    "烤生蚝",
    0.8220213055610657
    ],
    [
    "吃串",
    0.820912778377533
    ]
    ],
    "word":"烤串"
    }
        6
    DEANHZED  
       2019-11-28 19:20:40 +08:00 via iPhone
    emmmmm
        7
    devallin  
       2019-11-28 20:09:16 +08:00
    Why was my first thought using this to reword a paper so it passes the duplication check?
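        8
    leiuu  
    OP
       2019-11-28 22:14:46 +08:00
    @DEANHZED emmmmmmmmmmmm

    @devallin There may be other methods for that. This model works well for computing similarity between individual words, but it is not easy to apply directly to sentence-to-sentence similarity.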
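To illustrate why sentences need an extra step: word vectors give one vector per word, so a sentence first has to be reduced to a single vector. One common baseline, which is not something proposed in this thread, is to average the vectors of the words in each sentence and compare the averages. The sketch below assumes the sentences are already segmented into words; the names `average_vector` and `toy_vectors` and the values are made up for illustration.

```python
import math


def average_vector(tokens, vectors):
    """Mean of the vectors of known tokens; None if no token is in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up toy vectors standing in for real embeddings.
toy_vectors = {
    "晚上": [0.1, 0.8, 0.1],
    "吃": [0.7, 0.2, 0.1],
    "红烧肉": [0.9, 0.1, 0.0],
    "烤串": [0.8, 0.1, 0.1],
}
s1 = average_vector(["晚上", "吃", "红烧肉"], toy_vectors)
s2 = average_vector(["晚上", "吃", "烤串"], toy_vectors)
print(round(cosine(s1, s2), 4))
```

Averaging throws away word order and treats every word as equally important, which is part of why a word-level model is a weak fit for sentence-level comparison.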
        9
    elfive  
       2019-11-29 08:26:32 +08:00 via iPhone
    These words were all damn well mined from WeChat and QQ chat messages, weren't they?
        10
    leiuu  
    OP
       2019-11-29 11:20:12 +08:00
    @elfive
    Here is what the official description says.
    Data collection.
    Our training data contains large-scale text collected from news, webpages, and novels. Text data from diverse domains enables the coverage of various types of words and phrases. Moreover, the recently collected webpages and news data enable us to learn the semantic representations of fresh words.

    Vocabulary building. To enrich our vocabulary, we involve phrases in Wikipedia and Baidu Baike. We also apply the phrase discovery approach in Corpus-based Semantic Class Mining: Distributional vs. Pattern-Based Approaches, which enhances the coverage of emerging phrases.

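    Roughly, it says they used data from news, webpages, novels, Wikipedia, and Baidu Baike.
    Chat data is not mentioned, but news sites and webpages all carry comment sections, so those may also be one of the data sources.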
        11
    aalikes95  
       2019-11-29 15:47:11 +08:00
    Looks pretty good.
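        12
    leiuu  
    OP
       2019-11-29 15:55:09 +08:00
    @aalikes95
    Overall it's pretty good. Search for a few words and many of them turn up pleasant surprises.
    The bugs are fairly obvious too, though: quite a few wrong words. It also can't be updated incrementally.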