[AU3 Basics] Regex: English word segmentation

Posted on 2010-12-31 19:46:19
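The code is below; run it and you will see what it does. I hope it is of some help. The sample text comes from an English-language paper.
Current shortcomings:
It cannot strip the hyphen inside a word, e.g. "auto-it".
A hyphen inside a word that is not at a line break makes the pattern match two words: "auto-it" is treated as the two words "auto" and "it".
Some quotation marks are treated as a word.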
### Note: this script was generated automatically by Au3.REHelper on 2010/12/31 19:24; its correctness is not guaranteed, so please test it yourself ###
#include <Array.au3>
Local $Str = _
                'Gene slr1393 of the cyanobacterium Synechocystis sp.' & @CRLF & _
                'PCC6803 encodes a red–green photoreversible cyanobacter-' & @CRLF & _
                'iochrome. The full-length protein contains three GAF' & @CRLF & _
                'domains, but GAF3 (aa 441–596) alone is capable of' & @CRLF & _
                'autocatalytically binding PCB to cysteine-528.' & @CRLF & _
                '[21]' & @CRLF & _
                'Addition' & @CRLF & _
                'of PCB to GA results in a reversibly photochromic chromo-' & @CRLF & _
                'protein, termed RGS (red–green switchable protein): state Pr' & @CRLF & _
                '(lmax =650 nm) is strongly fluorescent (FF =0.06); it is' & @CRLF & _
                'reversibly converted by irradiation with red light into state' & @CRLF & _
                'Pg (lmax =539 nm), which has reduced and strongly blue-' & @CRLF & _
                'shifted fluorescence (Table 1, Figure 1a). Photoswitching can' & @CRLF & _
                'be repeated many times; it is stable over a wide pH range, and' & @CRLF & _
                'is retained after RGS is embedded into polyvinyl alcohol' & @CRLF & _
                '(PVA) film (see Figures S1 and S2 in the Supporting' & @CRLF & _
                'Information).'
MsgBox(0, 'Original string', $Str)
Local $Test = StringRegExp($str, "\b(?!'-)(?:[a-zA-Z']|-[\r\n]+[a-zA-Z']+)+", 3)
If Not @Error Then MsgBox(0, 'Match count: ' & UBound($Test), 'Element [0] is: ' & $Test[0])
_ArrayDisplay($Test, UBound($Test))
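One possible way around the shortcomings listed above, as a rough sketch rather than a tested replacement: first delete every "-" that is followed by a line break, then match runs of letters that may contain a hyphen or apostrophe between letters. The sample string and the variable names ($sText, $sJoined, $aWords) are made up for illustration.

```autoit
#include <Array.au3>

; Made-up sample: an in-word hyphen, a line-break hyphen and a quoted word
Local $sText = "The auto-it script handles cyanobacter-" & @CRLF & "iochrome and 'quoted' words."

; Step 1: rejoin words split across lines by deleting "-" plus the line break after it
Local $sJoined = StringRegExpReplace($sText, "-[\r\n]+", "")

; Step 2: a word is a run of letters; "-" or "'" is allowed only between two letters,
; so a leading or trailing quote is never part of a match
Local $aWords = StringRegExp($sJoined, "[a-zA-Z]+(?:[-'][a-zA-Z]+)*", 3)

If Not @error Then ConsoleWrite(UBound($aWords) & " words: " & _ArrayToString($aWords, ", ") & @CRLF)
; Prints: 8 words: The, auto-it, script, handles, cyanobacteriochrome, and, quoted, words
```

One caveat: step 1 cannot tell a split word from a real compound that happens to break at its hyphen, so "blue-" / "shifted" in the sample text above would come out as "blueshifted".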
Posted on 2011-1-1 00:35:50
This is the method I usually like to use for splitting text into words.
$test=StringRegExp(StringRegExpReplace($str,'[^a-zA-Z\s]',''),'\w+',3)
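To see what this one-liner does with digits and hyphens, here it is split into its two steps and run on a short string. The sample string and variable names are made up for illustration (a few tokens are borrowed from the text in the first post).

```autoit
#include <Array.au3>

; Made-up sample: a token with digits, two in-word hyphens and one line-break hyphen
Local $sSample = "slr1393 cysteine-528 auto-it blue-" & @CRLF & "shifted"

; Step 1: delete everything that is neither an ASCII letter nor whitespace
Local $sLettersOnly = StringRegExpReplace($sSample, '[^a-zA-Z\s]', '')

; Step 2: collect the remaining runs of word characters
Local $aWords = StringRegExp($sLettersOnly, '\w+', 3)

If Not @error Then ConsoleWrite(_ArrayToString($aWords, ", ") & @CRLF)
; Prints: slr, cysteine, autoit, blue, shifted
```

So a hyphen inside a word glues the two halves together ("autoit"), while a word hyphenated across a line break is still split in two, because the line break survives step 1 as whitespace.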
Original poster | Posted on 2011-1-1 14:50:54
Reply to #2, 3mile

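Your method returned 118 words, while Word 2003 reports 127 non-Chinese words. It looks like Word also treats the letter strings on either side of a hyphen as two separate words.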
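How much the total depends on the hyphen rule can be checked with a small sketch like the one below. It only compares two counting rules on a made-up sentence; it does not try to reproduce Word 2003's count, and how Word really counts is just my guess. _CountMatches is a hypothetical helper written for this example, not a UDF from the standard library.

```autoit
; Made-up sentence containing two hyphenated compounds
Local $sSample = "a red-green switchable, blue-shifted state"

; Rule A: a hyphen separates words
Local $iSplit = _CountMatches($sSample, "[a-zA-Z]+")

; Rule B: a hyphen between letters keeps the compound together
Local $iKeep = _CountMatches($sSample, "[a-zA-Z]+(?:-[a-zA-Z]+)*")

ConsoleWrite("Hyphen splits words: " & $iSplit & @CRLF)
ConsoleWrite("Hyphen keeps words together: " & $iKeep & @CRLF)
; Prints 7 for rule A and 5 for rule B

; Hypothetical helper: number of global matches of $sPattern in $sText
Func _CountMatches($sText, $sPattern)
    Local $aHits = StringRegExp($sText, $sPattern, 3)
    If @error Then Return 0 ; no match (or a bad pattern) counts as zero
    Return UBound($aHits)
EndFunc
```

The two totals differ by exactly the number of in-word hyphens, so comparing them on the full text gives a quick idea of how many words are affected.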