蜘蛛抱蛋 posted on 2010-12-31 19:46:19

Regex: English word segmentation

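The code is below; run it and you will see what it does. I hope it is of some help. The sample passage comes from an English-language paper.
Current shortcomings:
It cannot remove the hyphen inside a word such as "auto-it".
A hyphen inside a word that is not at a line break makes the pattern match two words; for example, "auto-it" is treated as the two words auto and it.
Some quotation marks are treated as a word.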
; ### Note: this script was generated automatically by Au3.REHelper on 2010/12/31 19:24. Its correctness is not guaranteed; please test it yourself. ###
#include <Array.au3>
Local $Str = _
                'Gene slr1393 of the cyanobacterium Synechocystis sp.' & @CRLF & _
                'PCC6803 encodes a red–green photoreversible cyanobacter-' & @CRLF & _
                'iochrome. The full-length protein contains three GAF' & @CRLF & _
                'domains, but GAF3 (aa 441–596) alone is capable of' & @CRLF & _
                'autocatalytically binding PCB to cysteine-528.' & @CRLF & _
                '' & @CRLF & _
                'Addition' & @CRLF & _
                'of PCB to GA results in a reversibly photochromic chromo-' & @CRLF & _
                'protein, termed RGS (red–green switchable protein): state Pr' & @CRLF & _
                '(lmax =650 nm) is strongly fluorescent (FF =0.06); it is' & @CRLF & _
                'reversibly converted by irradiation with red light into state' & @CRLF & _
                'Pg (lmax =539 nm), which has reduced and strongly blue-' & @CRLF & _
                'shifted fluorescence (Table 1, Figure 1a). Photoswitching can' & @CRLF & _
                'be repeated many times; it is stable over a wide pH range, and' & @CRLF & _
                'is retained after RGS is embedded into polyvinyl alcohol' & @CRLF & _
                '(PVA) film (see Figures S1 and S2 in the Supporting' & @CRLF & _
                'Information).'
MsgBox(0, 'Original string', $Str)
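; Assumption: the first alternative of the group is empty in the post as displayed; \w is assumed here.
Local $Test = StringRegExp($Str, "\b(?!'-)(?:\w|-[\r\n]++)+", 3)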
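; An array cannot be concatenated to a string directly, so join its elements first.
If Not @error Then MsgBox(0, 'Number of matches: ' & UBound($Test), 'Elements: ' & _ArrayToString($Test, ', '))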
_ArrayDisplay($Test, UBound($Test))
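
The shortcomings listed above could be approached in two passes: first join words that were hyphenated across a line break, then match words while allowing a hyphen inside them. The following is only a minimal sketch under stated assumptions, not the original script; the sample text and the variable names ($sText, $sJoined, $aWords) are hypothetical.

```autoit
#include <Array.au3>

; Hypothetical sample text; the second word is split across a line break.
Local $sText = 'A photo-' & @CRLF & 'chromic auto-it "switch" test.'

; Pass 1: drop a hyphen that is directly followed by a line break,
; so "photo-" & @CRLF & "chromic" becomes "photochromic".
Local $sJoined = StringRegExpReplace($sText, '-\r?\n', '')

; Pass 2: a word is a run of letters/digits, optionally continued by
; "-" plus another run, so "auto-it" stays one word and quotes are skipped.
Local $aWords = StringRegExp($sJoined, '[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*', 3)

If Not @error Then ConsoleWrite(UBound($aWords) & ': ' & _ArrayToString($aWords, ', ') & @CRLF)
```

One caveat: pass 1 cannot tell a line-break hyphen from a real one, so a compound such as "blue-" & @CRLF & "shifted" in the sample passage would come out as "blueshifted".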

3mile posted on 2011-1-1 00:35:50

I usually like to split words this way.
$test=StringRegExp(StringRegExpReplace($str,'[^a-zA-Z\s]',''),'\w+',3)
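
To make the two steps in that one-liner easier to follow, here is the same idea unrolled on a short sample. The sample text and the variable names ($sText, $sLetters, $aWords) are hypothetical and only serve as illustration.

```autoit
#include <Array.au3>

; Hypothetical sample text, not taken from the thread.
Local $sText = 'GAF3 binds auto-it (see Table 1).'

; Step 1: delete every character that is neither an ASCII letter nor whitespace.
Local $sLetters = StringRegExpReplace($sText, '[^a-zA-Z\s]', '')
; $sLetters is now 'GAF binds autoit see Table '

; Step 2: collect the remaining runs of word characters (flag 3 = all matches).
Local $aWords = StringRegExp($sLetters, '\w+', 3)

ConsoleWrite(UBound($aWords) & ': ' & _ArrayToString($aWords, ', ') & @CRLF)
```

On this sample the result is five words: GAF, binds, autoit, see, Table. Digits are dropped, so "GAF3" becomes "GAF" and the standalone "1" disappears, and "auto-it" is merged into "autoit". A word hyphenated across a line break still counts as two, because the line break is whitespace and is kept.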

蜘蛛抱蛋 posted on 2011-1-1 14:50:54

Reply to 2# 3mile

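Your method returned 118 words, while Word 2003 reports 127 non-Chinese words. It seems Word also treats the letter strings on either side of a hyphen as two separate words.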
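
How much the hyphen rule alone can shift a word count is easy to see on a tiny sample. The sketch below does not reproduce Word's counting rules; the sample text and the names $sText, $aSplit and $aKept are hypothetical.

```autoit
; Hypothetical sample text with two in-line hyphenated words.
Local $sText = 'red-green blue-shifted light'

; Rule 1: letters only, so a hyphen separates words.
Local $aSplit = StringRegExp($sText, '[A-Za-z]+', 3)

; Rule 2: a hyphen between two letter runs stays inside the word.
Local $aKept = StringRegExp($sText, '[A-Za-z]+(?:-[A-Za-z]+)*', 3)

ConsoleWrite('Hyphen splits words: ' & UBound($aSplit) & @CRLF) ; 5
ConsoleWrite('Hyphen kept in word: ' & UBound($aKept) & @CRLF)  ; 3
```

Each in-line hyphen changes the total by one, so a passage with several hyphenated terms can differ by several words between two counters.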