Python 中文正则表达式笔记__unicode_xe3x_findPart_80_3x_

当前位置: 编程技术>其它

Python 中文正则表达式笔记

来源: 互联网发布时间：2014-10-16

本文导语: 从字符串的角度来说，中文不如英文整齐、规范，这是不可避免的现实。本文结合网上资料以及个人经验，以 python 语言为例，稍作总结。欢迎补充或挑错。一点经验可以使用 repr()函数查看字串的原始格式。这对于写正则表...

从字符串的角度来说，中文不如英文整齐、规范，这是不可避免的现实。本文结合网上资料以及个人经验，以 python 语言为例，稍作总结。欢迎补充或挑错。
一点经验
可以使用 repr()函数查看字串的原始格式。这对于写正则表达式有所帮助。
Python 的 re模块有两个相似的函数：re.match(), re.search 。两个函数的匹配过程完全一致，只是起点不同。match只从字串的开始位置进行匹配，如果失败，它就此放弃；而search则会锲而不舍地完全遍历整个字串中所有可能的位置，直到成功地找到一个匹配，或者搜索完字串，以失败告终。如果你了解match的特性（在某些情况下比较快），大可以自由用它；如果不太清楚，search通常是你需要的那个函数。
从一堆文本中，找出所有可能的匹配，以列表的形式返回，这种情况用findall()这个函数。例子见后面的代码。
utf8下，每个汉字占据3个字符位置，正则式为[x80-xff]{3}，这个都知道了吧。
unicode下，汉字的格式如uXXXX，只要找到对应的字符集的范围，就能匹配相应的字串，方便从多语言文本中挑出所需要的某种语言的文本。不过，对于像日文这样的粘着语，既有中文字符，又有平假名片假名，或许结果会有所偏差。
两种字符类可以并列在一起使用，例如，平假名、片假名、中文的放在一起，u"[u4e00-u9fa5u3040-u309fu30a0-u30ff]+"，来自定义所需要匹配的文本。
匹配中文时，正则表达式和目标字串的格式必须相同。这一点至关重要。或者都用默认的utf8，此时你不用额外做什么；如果是unicode，就需要在正则式之前加上u""格式。
可以这样定义unicode字符串：string=u"我爱正则表达式"。如果字串不是unicode的，可以使用unicode()函数转换之。如果你知道源字串的编码，可以使用newstr=unicode(oldstring, original_coding_name)的方式转换，例如 linux 下常用unicode(string, "utf8")，windows 下或许会用cp936吧，没测试。
例程序

代码如下:

 
#!/usr/bin/python 
# -*- coding: utf-8 -*- 
# 
#author: rex 
#blog: http://iregex.org 
#filename py_utf8_unicode.py 
#created: 2010-06-27 09:11 
import re 
def findPart(regex, text, name): 
res=re.findall(regex, text) 
if res: 
print "There are %d %s parts:n"% (len(res), name) 
for r in res: 
print "t",r 
print 
#sample is utf8 by default. 
sample='''en: Regular expression is a powerful tool for manipulating text. 
zh: 正则表达式是一种很有用的处理文本的工具。 
jp: 正規表現は非常に役に立つツールテキストを操作することです。 
jp-char: あアいイうウえエおオ 
kr:정규 표현식은 매우 유용한 도구 텍스트를 조작하는 것입니다. 
puc: 。？！、，；：“ ”‘ '——……·－·《》〈〉！￥％＆＊＃ 
''' 
#let's look its raw representation under the hood: 
print "the raw utf8 string is:n", repr(sample) 
print 
#find the non-ascii chars: 
findPart(r"[x80-xff]+",sample,"non-ascii") 
#convert the utf8 to unicode 
usample=unicode(sample,'utf8') 
#let's look its raw representation under the hood: 
print "the raw unicode string is:n", repr(usample) 
print 
#get each language parts: 
findPart(u"[u4e00-u9fa5]+", usample, "unicode chinese") 
findPart(u"[uac00-ud7ff]+", usample, "unicode korean") 
findPart(u"[u30a0-u30ff]+", usample, "unicode japanese katakana") 
findPart(u"[u3040-u309f]+", usample, "unicode japanese hiragana") 
findPart(u"[u3000-u303fufb00-ufffd]+", usample, "unicode cjk Punctuation") 

其输出结果为：

代码如下:

 
the raw utf8 string is: 
'en: Regular expression is a powerful tool for manipulating text.nzh: xe6xadxa3xe5x88x99xe8xa1xa8xe8xbexbexe5xbcx8fxe6x98xafxe4xb8x80xe7xa7x8dxe5xbex88xe6x9cx89xe7x94xa8xe7x9ax84xe5xa4x84xe7x90x86xe6x96x87xe6x9cxacxe7x9ax84xe5xb7xa5xe5x85xb7xe3x80x82njp: xe6xadxa3xe8xa6x8fxe8xa1xa8xe7x8fxbexe3x81xafxe9x9dx9exe5xb8xb8xe3x81xabxe5xbdxb9xe3x81xabxe7xabx8bxe3x81xa4xe3x83x84xe3x83xbcxe3x83xabxe3x83x86xe3x82xadxe3x82xb9xe3x83x88xe3x82x92xe6x93x8dxe4xbdx9cxe3x81x99xe3x82x8bxe3x81x93xe3x81xa8xe3x81xa7xe3x81x99xe3x80x82njp-char: xe3x81x82xe3x82xa2xe3x81x84xe3x82xa4xe3x81x86xe3x82xa6xe3x81x88xe3x82xa8xe3x81x8axe3x82xaankr:xecxa0x95xeaxb7x9c xedx91x9cxedx98x84xecx8bx9dxecx9dx80 xebxa7xa4xecx9axb0 xecx9cxa0xecx9axa9xedx95x9c xebx8fx84xeaxb5xac xedx85x8dxecx8axa4xedx8axb8xebxa5xbc xecxa1xb0xecx9ex91xedx95x98xebx8ax94 xeaxb2x83xecx9ex85xebx8bx88xebx8bxa4.npuc: xe3x80x82xefxbcx9fxefxbcx81xe3x80x81xefxbcx8cxefxbcx9bxefxbcx9axe2x80x9c xe2x80x9dxe2x80x98 xe2x80x99xe2x80x94xe2x80x94xe2x80xa6xe2x80xa6xc2xb7xefxbcx8dxc2xb7xe3x80x8axe3x80x8bxe3x80x88xe3x80x89xefxbcx81xefxbfxa5xefxbcx85xefxbcx86xefxbcx8axefxbcx83n' 
There are 14 non-ascii parts: 
正则表达式是一种很有用的处理文本的工具。 
正規表現は非常に役に立つツールテキストを操作することです。 
あアいイうウえエおオ 
정규 
표현식은 
매우 
유용한 
도구 
텍스트를 
조작하는 
것입니다 
。？！、，；：“ 
”‘ 
'——……·－·《》〈〉！￥％＆＊＃ 
the raw unicode string is: 
u'en: Regular expression is a powerful tool for manipulating text.nzh: u6b63u5219u8868u8fbeu5f0fu662fu4e00u79cdu5f88u6709u7528u7684u5904u7406u6587u672cu7684u5de5u5177u3002njp: u6b63u898fu8868u73feu306fu975eu5e38u306bu5f79u306bu7acbu3064u30c4u30fcu30ebu30c6u30adu30b9u30c8u3092u64cdu4f5cu3059u308bu3053u3068u3067u3059u3002njp-char: u3042u30a2u3044u30a4u3046u30a6u3048u30a8u304au30aankr:uc815uaddc ud45cud604uc2dduc740 ub9e4uc6b0 uc720uc6a9ud55c ub3c4uad6c ud14duc2a4ud2b8ub97c uc870uc791ud558ub294 uac83uc785ub2c8ub2e4.npuc: u3002uff1fuff01u3001uff0cuff1buff1au201c u201du2018 u2019u2014u2014u2026u2026xb7uff0dxb7u300au300bu3008u3009uff01uffe5uff05uff06uff0auff03n' 
There are 6 unicode chinese parts: 
正则表达式是一种很有用的处理文本的工具 
正規表現 
非常 
役 
立 
操作 
There are 8 unicode korean parts: 
정규 
표현식은 
매우 
유용한 
도구 
텍스트를 
조작하는 
것입니다 
There are 6 unicode japanese katakana parts: 
ツールテキスト 
ア 
イ 
ウ 
エ 
オ 
There are 11 unicode japanese hiragana parts: 
は 
に 
に 
つ 
を 
することです 
あ 
い 
う 
え 
お 
There are 5 unicode cjk Punctuation parts: 
。 
。 
。？！、，；： 
－ 
《》〈〉！￥％＆＊＃