V2EX  ›  Q&A

How should I clean up garbled text like this?

    RicardoY · 2019-12-30 11:22:53 +08:00 · 1794 views

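    I tried many encodings and none of them displays the text correctly. Judging from the context, I suspect these characters are emoji or something similar.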

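    Is there a good way to handle garbled text like this? If I want to delete it, I first have to define what it is... One simple idea is to check whether each character is ASCII and delete it if it is not. Is there a better way?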
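A minimal Python 3 sketch of the ASCII-filter idea described above. The function name `strip_non_ascii` and the sample string are hypothetical, chosen only for illustration. Note that this approach is lossy: it also deletes legitimate non-ASCII text such as accented letters.

```python
def strip_non_ascii(text: str) -> str:
    """Drop every character outside the ASCII range (code points 0-127)."""
    # errors="ignore" silently discards characters that ASCII cannot represent.
    return text.encode("ascii", errors="ignore").decode("ascii")


# Hypothetical sample: four garbled characters between "ur" and "!!!".
sample = "ur\u00f0\u009f\u0093\u00b1!!!"
cleaned = strip_non_ascii(sample)  # the garbled run is removed
```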

    9 replies    2019-12-30 13:27:20 +08:00
    mayx · #1 · 2019-12-30 11:28:43 +08:00 via Android
    Use a regular expression.
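One way the regular-expression suggestion might look in Python 3. This is a sketch under an assumption the reply does not state: that the garbled characters all fall in the range U+0080 to U+00FF, which is what happens when bytes are misread as ISO-8859-1. The names `LATIN1_HIGH` and `strip_latin1_high` are hypothetical.

```python
import re

# Runs of code points U+0080-U+00FF. Bytes >= 0x80 misread as ISO-8859-1
# always land in this range, so the pattern covers that kind of debris.
LATIN1_HIGH = re.compile(r"[\u0080-\u00ff]+")


def strip_latin1_high(text: str) -> str:
    """Remove runs of U+0080-U+00FF characters and keep everything else."""
    return LATIN1_HIGH.sub("", text)
```

Unlike a plain ASCII filter, this keeps characters above U+00FF (for example CJK text), but it still deletes genuine Western European letters such as é.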
    lululau · #2 · 2019-12-30 11:34:37 +08:00
    Please post the text.
    lqs · #3 · 2019-12-30 11:36:43 +08:00
    My guess is that the emoji were encoded as UTF-8 and then decoded as ISO-8859-1. Could you post the garbled text so we can take a look?
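The guess above can be reproduced in a few lines of Python 3. This is an illustrative sketch, not the pipeline that actually produced the file; the sample string is hypothetical.

```python
# One emoji (U+1F4F1 MOBILE PHONE) between ordinary ASCII characters.
original = "ur\U0001F4F1!!!"

# Step 1: correct UTF-8 encoding; the emoji becomes four bytes.
raw = original.encode("utf-8")        # b'ur\xf0\x9f\x93\xb1!!!'

# Step 2: the mistake; each byte is read as one ISO-8859-1 character.
garbled = raw.decode("iso-8859-1")    # 4 junk characters instead of 1 emoji

# Step 3: the junk is saved as UTF-8 again, so the file is valid UTF-8
# and opens without errors, but shows the junk characters.
on_disk = garbled.encode("utf-8")

# Reversing steps 3 and 2 recovers the original text.
recovered = on_disk.decode("utf-8").encode("iso-8859-1").decode("utf-8")
assert recovered == original
```

This also explains why trying different encodings in an editor does not help: the file really is UTF-8, it just stores the wrong characters.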
    chairuosen · #4 · 2019-12-30 11:52:40 +08:00
    Scraped from Twitter, right? Whatever follows the exclamation marks is definitely emoji.
    RicardoY (OP) · #5 · 2019-12-30 12:43:34 +08:00
    @lululau @lqs

    The text is here:

    Link: https://pan.baidu.com/s/1rtelRvHyHldPmB9a-7W0Lg  Extraction code: wm3f
    RicardoY (OP) · #6 · 2019-12-30 12:43:52 +08:00
    @lqs I opened it as UTF-8.
    lqs · #7 · 2019-12-30 13:05:49 +08:00   ❤️ 2
    @RicardoY

    It is just as I guessed:

    $ head train_E6oV3lV.csv |iconv -f utf8 -t iso-8859-1
    id,label,tweet
    1,0, @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
    2,0,@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
    3,0, bihday your majesty
    4,0,#model i love u take with u all the time in ur📱!!! 😙😎👄👅💦💦💦



    >>> print(open('train_E6oV3lV.csv').read(1000).decode('utf8').encode('iso-8859-1').decode('utf8'))
    id,label,tweet
    1,0, @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
    2,0,@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
    3,0, bihday your majesty
    4,0,#model i love u take with u all the time in ur📱!!! 😙😎👄👅💦💦💦
    5,0, factsguide: society now #motivation
    6,0,[2/2] huge fan fare and big talking before they leave. chaos and pay disputes when they get there. #allshowandnogo
    7,0, @user camping tomorrow @user @user @user @user @user @user @user danny…
    8,0,the next school year is the year for exams.😯 can't think about that 😭 #school #exams #hate #imagine #actorslife #revolutionschool #girl
    9,0,we won!!! love the land!!! #allin #cavs #champions #cleveland #clevelandcavaliers …
    10,0, @user @user welcome here ! i'm it's so #gr8 !
    11,0, ↝ #ireland consumer price index (mom)
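The Python one-liner above relies on Python 2 byte strings (`str.decode` does not exist in Python 3), and `read(1000)` can cut a multi-byte character in half. A hedged Python 3 sketch of the same repair follows, working line by line. The helper names `fix_mojibake` and `repair_file` are hypothetical, and the fallback of leaving unrepairable lines unchanged is a design choice made here, not something stated in the thread.

```python
def fix_mojibake(text: str) -> str:
    """Undo a UTF-8 -> ISO-8859-1 misdecoding; return text unchanged if it does not apply."""
    try:
        # Map each character back to its byte, then decode those bytes properly.
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Characters above U+00FF, or bytes that are not valid UTF-8:
        # this line was not garbled in this particular way, so keep it.
        return text


def repair_file(src: str, dst: str) -> None:
    """Rewrite src into dst, repairing each line independently."""
    with open(src, encoding="utf-8", newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(fix_mojibake(line))
```

For the file in this thread, a call might look like `repair_file("train_E6oV3lV.csv", "train_fixed.csv")`, where the output name is made up for the example.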
    RicardoY (OP) · #8 · 2019-12-30 13:20:57 +08:00
    @lqs
    I'd like to ask in more detail about the cause of this problem: is it because UTF-8 and ISO-8859-1 support different character sets?
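    ipwx · #9 · 2019-12-30 13:27:20 +08:00 via Android   ❤️ 1
    @RicardoY As I recall, that ISO encoding is a Western European encoding with a character set of 256 characters. In other words, binary text in any encoding whatsoever can be read as if it were Western European text. Those 256 characters then got encoded as UTF-8 again, since every Western European character is included in the Unicode code table...

    All of the above is a guess; I haven't even looked at your sample, as I'm not at a computer.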
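The two claims in this reply can be checked with Python 3's built-in codecs. This is a small verification sketch; the variable names are arbitrary.

```python
every_byte = bytes(range(256))

# Claim 1: any byte sequence decodes as ISO-8859-1 without error,
# because byte N simply maps to code point U+00NN.
as_text = every_byte.decode("iso-8859-1")
assert [ord(ch) for ch in as_text] == list(range(256))

# Claim 2: all 256 resulting characters exist in Unicode, so they can be
# encoded as UTF-8 (1 byte each below U+0080, 2 bytes each from U+0080 up).
reencoded = as_text.encode("utf-8")
assert len(reencoded) == 128 * 1 + 128 * 2

# The mapping is lossless, which is why the damage is reversible.
assert as_text.encode("iso-8859-1") == every_byte

# By contrast, UTF-8 rejects arbitrary bytes.
try:
    every_byte.decode("utf-8")
    utf8_accepts_everything = True
except UnicodeDecodeError:
    utf8_accepts_everything = False
```

So the cause is less a difference in supported character sets than the fact that ISO-8859-1 never refuses input: the wrong decoding step succeeds silently instead of raising an error.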