How do I correctly handle 4-byte characters in Python 2, and why does one character turn into two?


This topic was created 3788 days ago; the information in it may have changed since then.

x = u'\U0001f604abc'

print('length:',len(x))
for i in x:
    print(i)

This gives the output:
('length:', 5)
�
�
a
b
c
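x is 4 characters long. The first is a 4-byte character, the Unicode code point for a smiley-face emoji, and it has evidently been split into two. As a result, the filter function I wrote fails to filter it: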
def filter_invalid_str(text):
    return ''.join(map(lambda x: x if u'\u0000' < x < u'\uFFFF' else '_', text))

So why does what is clearly one character become two, and how can I treat it as a single character?
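For context, a length of 5 is what a "narrow" Python 2 build produces (one where sys.maxunicode == 0xFFFF): unicode strings are stored as UTF-16 code units there, so any code point above U+FFFF becomes a high/low surrogate pair. Below is a minimal sketch of one possible workaround, assuming it is acceptable to group surrogate pairs with a regular expression before filtering. The names split_chars and filter_non_bmp are hypothetical and not from the original post, and unlike filter_invalid_str this sketch keeps U+0000 and U+FFFF.

```python
# -*- coding: utf-8 -*-
import re

# Match a high surrogate followed by a low surrogate as one unit,
# otherwise match any single code unit (DOTALL so newlines count too).
_CHAR_RE = re.compile(u'[\ud800-\udbff][\udc00-\udfff]|.', re.DOTALL)


def split_chars(text):
    """Return the characters of text, keeping each surrogate pair together.

    Hypothetical helper: on a narrow build the emoji comes back as one
    two-unit string; on a wide build or Python 3 it is already one unit.
    """
    return _CHAR_RE.findall(text)


def filter_non_bmp(text, placeholder=u'_'):
    """Replace characters outside the BMP, and lone surrogates, with placeholder."""
    out = []
    for ch in split_chars(text):
        is_single_unit = len(ch) == 1
        is_surrogate = is_single_unit and u'\ud800' <= ch <= u'\udfff'
        if is_single_unit and not is_surrogate and ord(ch) <= 0xFFFF:
            out.append(ch)
        else:
            # Either a surrogate pair (narrow build), a code point above
            # U+FFFF (wide build / Python 3), or a stray lone surrogate.
            out.append(placeholder)
    return u''.join(out)


if __name__ == '__main__':
    x = u'\U0001f604abc'
    print(len(split_chars(x)))
    print(filter_non_bmp(x))
```

The same code should behave identically on a wide build, where the emoji is already a single element, because the single-unit branch of the pattern matches it and the ord() check then rejects it.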


8 replies • 2016-01-21 20:07:16 +08:00