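Urgent: how can I use a regular expression to process an HTML file?
This is a fragment of an HTML page. A row such as <td class="TrRows" align=Left>2008-9-18 9:00</td> is generated once an hour.
How can I round the current time down to the whole hour and then read the values that follow it, such as 1.1 GB, 27.2 GB and 28.2 GB?
For example, 9:25 becomes 9:00, and then the content below that timestamp is read.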
<tr>
<td class="TrRows" align=Left>2008-9-18 9:00</td>
<td class="TrRows" align=Left>1.1 GB</td>
<td class="TrRows" align=Left>27.2 GB</td>
<td class="TrRows" align=Left>28.2 GB</td>
</tr>
<tr>
<td class="TrOdd" align=Left>2008-9-18 10:00</td>
<td class="TrOdd" align=Left>1.3 GB</td>
<td class="TrOdd" align=Left>34.8 GB</td>
<td class="TrOdd" align=Left>36.1 GB</td>
</tr>
<tr>
<td class="TrRows" align=Left>2008-9-18 11:00</td>
<td class="TrRows" align=Left>1.3 GB</td>
<td class="TrRows" align=Left>34.7 GB</td>
<td class="TrRows" align=Left>36.0 GB</td>
</tr>
<tr>
<td class="TrOdd" align=Left>2008-9-18 12:00</td>
<td class="TrOdd" align=Left>1.3 GB</td>
<td class="TrOdd" align=Left>34.6 GB</td>
<td class="TrOdd" align=Left>35.9 GB</td>
</tr>
<tr>
<td class="TrRows" align=Left>2008-9-18 13:00</td>
<td class="TrRows" align=Left>1.3 GB</td>
<td class="TrRows" align=Left>34.4 GB</td>
<td class="TrRows" align=Left>35.7 GB</td>
</tr>
<tr>
<td class="TrOdd" align=Left>2008-9-18 14:00</td>
<td class="TrOdd" align=Left>1.3 GB</td>
<td class="TrOdd" align=Left>34.7 GB</td>
<td class="TrOdd" align=Left>36.0 GB</td>
</tr>
<tr>
<td class="TrRows" align=Left>2008-9-18 15:00</td>
<td class="TrRows" align=Left>1.3 GB</td>
<td class="TrRows" align=Left>35.0 GB</td>
<td class="TrRows" align=Left>36.4 GB</td>
</tr>
<tr>
<td class="TrOdd" align=Left>2008-9-18 16:00</td>
<td class="TrOdd" align=Left>1.3 GB</td>
<td class="TrOdd" align=Left>35.1 GB</td>
<td class="TrOdd" align=Left>36.5 GB</td>
</tr>
<tr>
<td class="TrRows" align=Left>2008-9-18 17:00</td>
<td class="TrRows" align=Left>1.2 GB</td>
<td class="TrRows" align=Left>34.3 GB</td>
<td class="TrRows" align=Left>35.5 GB</td>
</tr>
<tr>
<td class="TrOdd" align=Left>2008-9-18 18:00</td>
<td class="TrOdd" align=Left>1.0 GB</td>
<td class="TrOdd" align=Left>29.9 GB</td>
<td class="TrOdd" align=Left>30.9 GB</td>
</tr>
<tr>
<td class="TrRows" align=Left>2008-9-18 19:00</td>
<td class="TrRows" align=Left>1.6 GB</td>
<td class="TrRows" align=Left>33.9 GB</td>
<td class="TrRows" align=Left>35.5 GB</td>
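</tr>

_StringBetween would work; see http://l4ever.cn/archives/1119. Detailed code:
#include <IE.au3>
$Url = "C:\Users\l4ever\Desktop\A.html"
$oIE = _IECreate($Url, 0, 0)
$sHTML = _IEBodyReadHTML($oIE)
; Flag 3 returns every match (flag 1 stops after the first); both row classes are accepted.
$array = StringRegExp($sHTML, '(?i)<td class=Tr(?:Rows|Odd) align=Left>(.*?)</td>', 3)
For $i = 0 To UBound($array) - 1
    $hour = StringRight($array[$i], 4) ; e.g. "9:00"
    ; Debug output: current hour versus the hour found in the page.
    MsgBox(0, "Current time: " & @HOUR & ":00", "Time in page: 0" & $hour & " (ideally the page would zero-pad the hour to two digits)")
    If @HOUR & ":00" = "0" & $hour Then
        MsgBox(0, $i, $array[$i])
    EndIf
Next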
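That is roughly it.
The comparison pads the page's hour with a 0, but from 10 o'clock onward that adds one 0 too many.
Ideally the hour values in the page would be written with two digits.

On a closer look: if the OP can modify the page code and change it to
2008-9-18 09:00,1.1GB,27.2GB,28.2GB,
2008-9-18 10:00,1.3GB,34.8GB,36.1GB,
2008-9-18 11:00,1.3GB,34.8GB,36.1GB,
2008-9-18 12:00,1.3GB,34.8GB,36.1GB,
2008-9-18 13:00,1.3GB,34.8GB,36.1GB,
then it becomes simple: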
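#include <IE.au3>
$Url = "C:\Users\l4ever\Desktop\A.html"
$oIE = _IECreate($Url, 0, 0)
$sHTML = _IEBodyReadHTML($oIE)
$size = StringSplit($sHTML, ",") ; split on commas; $size[0] holds the number of fields
For $i = 1 To $size[0]
    $Time = StringRight($size[$i], 5) ; last five characters of the field
    $now = @HOUR & ":00" ; current hour followed by ":00"
    If $now = $Time Then ; this field is the timestamp for the current hour
        For $j = $i To $i + 3 ; show the timestamp and the three values after it
            If $j > $size[0] Then ExitLoop ; do not read past the last field
            MsgBox(64, "Field " & $j, $size[$j])
        Next
    EndIf
Next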
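
For readers who want to try the comma-separated lookup outside AutoIt, here is a rough Python sketch of the same idea. It is only an illustration: `values_for_hour` and the sample string are made-up names and data, and it assumes the page already uses two-digit hours as suggested above.

```python
from datetime import datetime

def values_for_hour(csv_text, now):
    """Return the three fields that follow the timestamp for now's hour."""
    fields = [f.strip() for f in csv_text.split(",")]
    wanted = now.strftime("%H:00")  # zero-padded, e.g. "09:00"
    for i, field in enumerate(fields):
        if field.endswith(" " + wanted):
            return fields[i + 1:i + 4]
    return []  # no row for this hour

sample = (
    "2008-9-18 09:00,1.1GB,27.2GB,28.2GB,\n"
    "2008-9-18 10:00,1.3GB,34.8GB,36.1GB,\n"
)
print(values_for_hour(sample, datetime(2008, 9, 18, 9, 25)))
```

Formatting the clock with `%H` rounds 9:25 down to 09:00 without any arithmetic, which is the same trick as `@HOUR & ":00"`.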
[ Last edited by l4ever on 2008-9-19 03:28 ]

Me again; I was really bored. If the OP cannot modify the page, then, heh, here comes the brute-force replace method:
#include <IE.au3>
$Url = "C:\Users\l4ever\Desktop\A.html"
$oIE = _IECreate($Url, 0, 0)
$sHTML = _IEBodyReadHTML($oIE)
; Strip the HTML piece by piece (StringReplace is case-insensitive by default).
$step1 = StringReplace($sHTML, "<", "")
$step2 = StringReplace($step1, ">", "")
$step3 = StringReplace($step2, "tr", "")
$step4 = StringReplace($step3, "td", "")
$step5 = StringReplace($step4, "class", "")
$step6 = StringReplace($step5, "Rows", "")
$step7 = StringReplace($step6, "/", "")
$step8 = StringReplace($step7, "=", "")
$step9 = StringReplace($step8, "left", "")
$step10 = StringReplace($step9, "align", "")
$step11 = StringReplace($step10, "odd", "")
$step12 = StringReplace($step11, """", "")
; Turn line breaks (not spaces) into commas, so the space inside "2008-9-18 9:00" survives.
$step13 = StringReplace(StringStripCR($step12), @LF, ",")
$step14 = StringReplace($step13, " GB", "GB") ; drop the space before GB; StringStripWS is no use here because the space before the time must stay
$Time1 = StringReplace($step14, " 1:00", " 01:00") ; normalise the hours to two digits
$Time2 = StringReplace($Time1, " 2:00", " 02:00")
$Time3 = StringReplace($Time2, " 3:00", " 03:00")
$Time4 = StringReplace($Time3, " 4:00", " 04:00")
$Time5 = StringReplace($Time4, " 5:00", " 05:00")
$Time6 = StringReplace($Time5, " 6:00", " 06:00")
$Time7 = StringReplace($Time6, " 7:00", " 07:00")
$Time8 = StringReplace($Time7, " 8:00", " 08:00")
$Time9 = StringReplace($Time8, " 9:00", " 09:00")
MsgBox(0, "Data after replacement", $Time9) ; the final result
$size = StringSplit($Time9, ",") ; split the result into fields; $size[0] is the count
For $i = 1 To $size[0] ; walk through every field
    $Time = StringRight($size[$i], 5) ; last five characters of the field
    $now = @HOUR & ":00" ; current hour followed by ":00"
    If $now = $Time Then ; this field is the timestamp for the current hour
        For $j = $i To $i + 3 ; show the timestamp and the three values after it
            If $j > $size[0] Then ExitLoop ; do not read past the last field
            MsgBox(64, "Field " & $j, $size[$j])
        Next
    EndIf
Next
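
For comparison only, the same clean-up can be done with two regular-expression substitutions instead of a chain of replacements. The Python sketch below is a swapped-in technique, not the replace method above; `normalise` and the sample markup are assumptions made for illustration.

```python
import re

def normalise(html):
    """Strip tags, zero-pad single-digit hours, return one cell per list item."""
    text = re.sub(r"<[^>]*>", "", html)              # drop every tag
    text = re.sub(r" (\d):00\b", r" 0\1:00", text)   # " 9:00" -> " 09:00"
    return [line.strip() for line in text.splitlines() if line.strip()]

sample = (
    '<tr>\n<td class="TrRows" align=Left>2008-9-18 9:00</td>\n'
    '<td class="TrRows" align=Left>1.1 GB</td>\n</tr>\n'
    '<tr>\n<td class="TrOdd" align=Left>2008-9-18 10:00</td>\n</tr>'
)
print(normalise(sample))
```

The pattern ` (\d):00` needs a space directly before a single digit, so " 10:00" is left alone while " 9:00" gains its leading zero.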
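[ Last edited by l4ever on 2008-9-19 03:48 ]

Huge thanks to the poster above.
The replace method looks really powerful.

Impressive, problem solved.

My approach to this problem is:
1. The page is divided into separate blocks, so first extract the block whose hour matches the current hour;
2. then extract the time and each GB value from that extracted string.
Extract the whole block of fields for the current hour: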
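#include <IE.au3>
$Url = @DesktopDir & "\a.htm"
$oIE = _IECreate($Url, 0, 0)
$sHTML = _IEBodyReadHTML($oIE)
$stringS = StringStripCR($sHTML)
$string = StringStripWS($stringS, 8) ; flag 8 strips all whitespace, so everything sits on one line
; @HOUR + 0 turns "09" into 9. \S+? is lazy so the match stops at the first </tr>.
; Flag 2 returns an array whose element 0 is the full match.
$array = StringRegExp($string, '(?i)(' & @YEAR & '-\d\d?-\d\d?' & @HOUR + 0 & ':00)</td>\S+?</tr>', 2)
If @error Then
    MsgBox(0, 0, "No block found for the current hour")
Else
    MsgBox(0, 0, $array[0])
EndIf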
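[ Last edited by liongodmien on 2008-10-5 23:21 ]

@HOUR always yields two digits, so before 10 o'clock it gives 0X. That can be solved with:
@HOUR + 0
which gives:
$array = StringRegExp($string, '(?i)(' & @YEAR & '-\d\d?-\d\d?' & @HOUR + 0 & ':00)</td>\S+?</tr>', 2)
[ Last edited by liongodmien on 2008-10-5 23:22 ]

The whole block we need has now been extracted; anything further can be pulled out with a few more regular expressions.
I looked again and again and could not find a metacharacter that matches across lines, so I borrowed the approach of the poster above and removed the line breaks and whitespace.
In the end I just about achieved what I originally had in mind...
[ Last edited by liongodmien on 2008-10-5 23:24 ]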
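
On matching across lines: PCRE-style engines generally offer a "dot matches newline" option (written `(?s)` inline), which makes it unnecessary to strip whitespace first. A minimal Python sketch of that idea follows, with `re.DOTALL` playing that role and `int()` doing the job of `@HOUR + 0`. The name `block_for_hour` and the sample page are assumptions, not part of the scripts in this thread.

```python
import re

def block_for_hour(html, hour):
    """Return the text from the matching timestamp to the row's </tr>, or None."""
    # int("09") == 9, the same effect as @HOUR + 0 in AutoIt.
    pattern = r"\d+-\d+-\d+ %d:00</td>.*?</tr>" % int(hour)
    # DOTALL lets "." cross line breaks; ".*?" is lazy and stops at the first </tr>.
    match = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
    return match.group(0) if match else None

page = """<tr>
<td class="TrRows" align=Left>2008-9-18 9:00</td>
<td class="TrRows" align=Left>1.1 GB</td>
</tr>
<tr>
<td class="TrOdd" align=Left>2008-9-18 10:00</td>
<td class="TrOdd" align=Left>1.3 GB</td>
</tr>"""

print(block_for_hour(page, "09"))
```

Because the pattern keeps the space between the date and the hour, hour 1 cannot be confused with 11:00.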
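Final version of the message extraction
I thought about it some more today and got multi-line matching working as well. So here is the final version of the extraction script (attached is a page in the format the OP asked for: a.htm)... SHOW: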
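#include <IE.au3>
$Url = @DesktopDir & "\a.htm"
$oIE = _IECreate($Url, 0, 0)
$sHTML = _IEBodyReadHTML($oIE)
; Flag 2 returns an array; element 0 is the full match (the block for the current hour).
$array = StringRegExp($sHTML, '(?U)(?i)\d+-\d+-\d+ ' & @HOUR + 0 & ':00</td>[^@]+</tr>', 2)
If @error Then Exit MsgBox(16, "Message extraction", "No block found for the current hour")
Local $sBlock = $array[0], $nOffset = 1, $Show = ""
$Show &= "Extracted block: " & $sBlock & @CRLF
$Time = StringRegExp($sBlock, '\d+-\d+-\d+ \d+:00', 1) ; flag 1 returns an array of matches
$Show &= @CRLF & @CRLF & "Time: " & $Time[0] & @CRLF & @CRLF
While 1
    ; Cell text: everything between the ">" ending an opening tag and the next "</td>".
    ; (The cells in the sample page end in align=Left>, so a leading quote is not required.)
    $var = StringRegExp($sBlock, '(?i)>([^<>]+)</td>', 1, $nOffset)
    If @error = 0 Then
        $nOffset = @extended ; offset at which the next search continues
    Else
        ExitLoop
    EndIf
    For $i = 0 To UBound($var) - 1
        $Show &= "Size record: " & $var[$i] & " " & @CRLF
    Next
WEnd
MsgBox(64, "Message extraction", $Show)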
Corrections and improvements are welcome! The page content is as follows:
<tr>
<td class="TrRows" align=Left>2008-9-18 8:00</td>
<td class="TrRows" align=Left>1.1 GB</td>
<td class="TrRows" align=Left>27.2 GB</td>
<td class="TrRows" align=Left>28.2 GB</td>
</tr><tr>
<td class="TrRows" align=Left>2008-9-18 9:00</td>
<td class="TrRows" align=Left>1.1 GB</td>
<td class="TrRows" align=Left>27.2 GB</td>
<td class="TrRows" align=Left>28.2 GB</td>
</tr>
<tr>
<td class="TrOdd" align=Left>2008-9-18 10:00</td>
<td class="TrOdd" align=Left>1.3 GB</td>
<td class="TrOdd" align=Left>34.8 GB</td>
<td class="TrOdd" align=Left>36.1 GB</td>
</tr>
<tr>
<td class="TrRows" align=Left>2008-9-18 11:00</td>
<td class="TrRows" align=Left>1.3 GB</td>
<td class="TrRows" align=Left>34.7 GB</td>
<td class="TrRows" align=Left>36.0 GB</td>
</tr>
<tr>
<td class="TrOdd" align=Left>2008-9-18 12:00</td>
<td class="TrOdd" align=Left>1.3 GB</td>
<td class="TrOdd" align=Left>34.6 GB</td>
<td class="TrOdd" align=Left>35.9 GB</td>
</tr>
<tr>
<td class="TrRows" align=Left>2008-9-18 13:00</td>
<td class="TrRows" align=Left>1.3 GB</td>
<td class="TrRows" align=Left>34.4 GB</td>
<td class="TrRows" align=Left>35.7 GB</td>
</tr>
<tr>
<td class="TrOdd" align=Left>2008-9-18 14:00</td>
<td class="TrOdd" align=Left>1.3 GB</td>
<td class="TrOdd" align=Left>34.7 GB</td>
<td class="TrOdd" align=Left>36.0 GB</td>
</tr>
<tr>
<td class="TrRows" align=Left>2008-9-18 15:00</td>
<td class="TrRows" align=Left>1.3 GB</td>
<td class="TrRows" align=Left>35.0 GB</td>
<td class="TrRows" align=Left>36.4 GB</td>
</tr>
<tr>
<td class="TrOdd" align=Left>2008-9-18 16:00</td>
<td class="TrOdd" align=Left>1.3 GB</td>
<td class="TrOdd" align=Left>35.1 GB</td>
<td class="TrOdd" align=Left>36.5 GB</td>
</tr>
<tr>
<td class="TrRows" align=Left>2008-9-18 17:00</td>
<td class="TrRows" align=Left>1.2 GB</td>
<td class="TrRows" align=Left>34.3 GB</td>
<td class="TrRows" align=Left>35.5 GB</td>
</tr>
<tr>
<td class="TrOdd" align=Left>2008-9-18 18:00</td>
<td class="TrOdd" align=Left>1.0 GB</td>
<td class="TrOdd" align=Left>29.9 GB</td>
<td class="TrOdd" align=Left>30.9 GB</td>
</tr>
<tr>
<td class="TrRows" align=Left>2008-9-18 19:00</td>
<td class="TrRows" align=Left>1.6 GB</td>
<td class="TrRows" align=Left>33.9 GB</td>
<td class="TrRows" align=Left>35.5 GB</td>
</tr>
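
As a cross-check against a page like the one above, here is a hedged Python sketch of the whole task: parse every row, key it by its timestamp, and look up the current time rounded down to the hour. The names `rows_by_hour` and `lookup` and the shortened sample are assumptions for illustration, and the sketch assumes every data row has exactly four cells.

```python
import re
from datetime import datetime

ROW = re.compile(r"<tr>(.*?)</tr>", re.IGNORECASE | re.DOTALL)
CELL = re.compile(r"<td[^>]*>(.*?)</td>", re.IGNORECASE | re.DOTALL)

def rows_by_hour(html):
    """Map each row's timestamp to the list of its three size cells."""
    table = {}
    for row in ROW.findall(html):
        cells = [c.strip() for c in CELL.findall(row)]
        if len(cells) == 4:
            # strptime accepts unpadded fields such as "9:00" and "2008-9-18".
            stamp = datetime.strptime(cells[0], "%Y-%m-%d %H:%M")
            table[stamp] = cells[1:]
    return table

def lookup(html, now):
    """Round now down to the hour and return that row's values, or None."""
    key = now.replace(minute=0, second=0, microsecond=0)
    return rows_by_hour(html).get(key)

sample = """<tr>
<td class="TrRows" align=Left>2008-9-18 8:00</td>
<td class="TrRows" align=Left>1.1 GB</td>
<td class="TrRows" align=Left>27.2 GB</td>
<td class="TrRows" align=Left>28.2 GB</td>
</tr><tr>
<td class="TrOdd" align=Left>2008-9-18 10:00</td>
<td class="TrOdd" align=Left>1.3 GB</td>
<td class="TrOdd" align=Left>34.8 GB</td>
<td class="TrOdd" align=Left>36.1 GB</td>
</tr>"""

print(lookup(sample, datetime(2008, 9, 18, 10, 25)))
```

Comparing parsed timestamps rather than strings sidesteps the 9 versus 09 problem altogether, and it also checks the date, not just the hour.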
Awesome!
StringRegExp($sHTML, '(?U)(?i)\d+-\d+-\d+ '&@HOUR+0&':00</td>[^@]+</tr>', 2)
What does the regular expression (?U)(?i)\d+-\d+-\d in there mean?
It is a pity there is no Chinese help for regular expressions.

Replying to lele9013's post of 2008-10-6 13:38:
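(?U) non-greedy matching
(?i) ignore case
\d+ one or more digits
- nothing special, just a literal hyphen
(the same pieces repeated a few times)
&@HOUR takes the current hour value
+0& turns an hour before 10 o'clock into a single digit, e.g. 09 becomes 9, 08 becomes 8...
:00 the time separator and zero minutes
[^@]+ matches one or more characters other than @ (this is not the ordinary @; I typed it as a full-width character, which is unlikely to appear in a web page)
The trailing </tr> matches the end of the block, so that the match is cut off there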
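
A small Python sketch may make the greedy versus non-greedy point and the negated-class trick concrete. It is an illustration under assumptions: the sample text is made up, and as far as I know Python's `re` has no `(?U)` switch, so laziness is written with a trailing `?` on the quantifier instead.

```python
import re

text = "<TD>1.1 GB</TD>\n<TD>27.2 GB</TD>"
flags = re.IGNORECASE | re.DOTALL

# Greedy: ".+" runs to the LAST </td>, swallowing both cells in one capture.
greedy = re.findall(r"<td>(.+)</td>", text, flags)

# Lazy: ".+?" stops at the FIRST </td>, giving one capture per cell.
lazy = re.findall(r"<td>(.+?)</td>", text, flags)

# A negated class such as [^@] matches newlines too, so it crosses lines
# even without DOTALL; made lazy, it still stops at the first </td>.
crossing = re.search(r"<td>[^@]+</td>", text, re.IGNORECASE).group(0)
first_cell = re.search(r"<td>[^@]+?</td>", text, re.IGNORECASE).group(0)

print(greedy)
print(lazy)
```

The lowercase `<td>` in the patterns only matches the uppercase tags because of the ignore-case flag, which is what `(?i)` does in the original expression.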