holley posted on 2022-7-21 12:36:57

How to extract web page content with a regular expression [Solved]

Last edited by holley on 2022-7-22 09:47

Background: How to get the download address returned by http://www.huorong.cn/downloadv5.html [Solved] - Solved Questions board - AUTOIT CN - Powered by Autoit中文论坛 (autoitx.com)

From that thread I learned that Huorong's real request address is http://www.huorong.cn/versionShow.php, which returns:
{"request_type":"2","virustime":"2022-07-20","filesize":"24M","virusVersion":"2022.7.20.1","version":"5.0.69.3","createtime":"2022-07-20 18:17:53","fullName":"sysdiag-full-5.0.69.3-2022.7.20.1.exe","allName":"sysdiag-all-5.0.69.3-2022.7.20.1.exe","urlFull":"https:\/\/down7.huorong.cn\/sysdiag-full-5.0.69.3-2022.7.20.1.exe","urlAll":"https:\/\/down7.huorong.cn\/sysdiag-all-5.0.69.3-2022.7.20.1.exe"}
My expression:
All['"]:['"]*(\S+)["']
It only gives me:
https:\/\/down7.huorong.cn\/sysdiag-all-5.0.69.3-2022.7.20.1.exe
My question: how can I get the actual download address? Should I match it in separate pieces, or is there another expression that would work?

afan posted on 2022-7-21 13:00:34

Isn't that just what you captured, with the \ removed?

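holley posted on 2022-7-21 14:44:26

afan posted on 2022-7-21 13:00
Isn't that just what you captured, with the \ removed?

I thought the regex itself could exclude them directly.
A question: is there a function or command in au3 (I'm a beginner) that can filter this address?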

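afan posted on 2022-7-21 16:09:58

holley posted on 2022-7-21 14:44
I thought the regex itself could exclude them directly.
A question: is there a function or command in au3 (I'm a beginner) that can filter this ...

You can take what you captured and run StringReplace on it, replacing \ with an empty string.
Of course, if you only need to capture that one address, you can also do it directly with a regex replace: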
Local $sSource = '{"request_type":"2","virustime":"2022-07-20","filesize":"24M","virusVersion":"2022.7.20.1","version":"5.0.69.3","createtime":"2022-07-20 18:17:53","fullName":"sysdiag-full-5.0.69.3-2022.7.20.1.exe","allName":"sysdiag-all-5.0.69.3-2022.7.20.1.exe","urlFull":"https:\/\/down7.huorong.cn\/sysdiag-full-5.0.69.3-2022.7.20.1.exe","urlAll":"https:\/\/down7.huorong.cn\/sysdiag-all-5.0.69.3-2022.7.20.1.exe"}'
;~ MsgBox(0, 'Source string', $sSource)
Local $sSRERe = StringRegExpReplace($sSource, '(?i)^.+All":"(h.+?:)[\\/]+([\w.]+)[\\/]+([^"]+).+$', '\1/\2/\3')
MsgBox(0, 'Replacement result', $sSRERe)
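
The first suggestion (StringReplace on the captured text) can be sketched as follows. This is a minimal sketch, not code from the thread: it reuses the original poster's expression unchanged, trims the sample response to the two URL fields, and replaces the two-character sequence \/ rather than every backslash. The variable names $aMatch and $sUrl are made up for illustration.

```autoit
; Sample response from the thread, trimmed to the two URL fields for brevity.
Local $sSource = '{"urlFull":"https:\/\/down7.huorong.cn\/sysdiag-full-5.0.69.3-2022.7.20.1.exe","urlAll":"https:\/\/down7.huorong.cn\/sysdiag-all-5.0.69.3-2022.7.20.1.exe"}'

; The original poster's expression. Inside a single-quoted AutoIt string,
; a literal single quote is written as two single quotes ('').
; Flag 1 returns an array holding the captured group(s) of the first match.
Local $aMatch = StringRegExp($sSource, 'All[''"]:[''"]*(\S+)["'']', 1)
If @error Then
    MsgBox(0, 'Result', 'No match')
Else
    ; JSON escapes "/" as "\/"; replacing "\/" with "/" restores the plain URL.
    Local $sUrl = StringReplace($aMatch[0], '\/', '/')
    MsgBox(0, 'Result', $sUrl)
EndIf
```

Replacing only \/ instead of every \ is a small design choice: it leaves any other backslashes in the value alone, which does not matter for this particular response.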

lixiaolong posted on 2022-7-21 17:11:30

Local $sSource = '{"request_type":"2","virustime":"2022-07-20","filesize":"24M","virusVersion":"2022.7.20.1","version":"5.0.69.3","createtime":"2022-07-20 18:17:53","fullName":"sysdiag-full-5.0.69.3-2022.7.20.1.exe","allName":"sysdiag-all-5.0.69.3-2022.7.20.1.exe","urlFull":"https:\/\/down7.huorong.cn\/sysdiag-full-5.0.69.3-2022.7.20.1.exe","urlAll":"https:\/\/down7.huorong.cn\/sysdiag-all-5.0.69.3-2022.7.20.1.exe"}'
Local $str

Local $aSRE = StringRegExp($sSource, '"urlAll":"(?|(https:)\\/\\(/.*)\\(/.*?)")', 3)
If Not @error Then
        For $i = 0 To UBound($aSRE) - 1
                $str = $str & $aSRE[$i]
        Next
EndIf
MsgBox(0, 'Match', $str)

holley posted on 2022-7-22 09:40:55

lixiaolong posted on 2022-7-21 17:11


Thanks for the answer, but when I test it the result comes out empty.

redapple2008 posted on 2022-7-22 10:25:58

holley posted on 2022-7-22 09:40
Thanks for the answer, but when I test it the result comes out empty.

#include <Array.au3>
Local $sSource = '{"request_type":"2","virustime":"2022-07-20","filesize":"24M","virusVersion":"2022.7.20.1","version":"5.0.69.3","createtime":"2022-07-20 18:17:53","fullName":"sysdiag-full-5.0.69.3-2022.7.20.1.exe","allName":"sysdiag-all-5.0.69.3-2022.7.20.1.exe","urlFull":"https:\/\/down7.huorong.cn\/sysdiag-full-5.0.69.3-2022.7.20.1.exe","urlAll":"https:\/\/down7.huorong.cn\/sysdiag-all-5.0.69.3-2022.7.20.1.exe"}'
;~ MsgBox(0, 'Source string', $sSource)
Local $sSRERe = StringRegExpReplace($sSource, '\\/', '/')
Local $aSRE = StringRegExp($sSRERe, '(?i)(?<=urlAll":")(.+?)(?="})', 3)
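; $aSRE is an array, so index it; concatenating the array itself yields an empty string.
If Not @error Then MsgBox(0, 'Number of matches: ' & UBound($aSRE), 'First element: ' & $aSRE[0])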

_ArrayDisplay($aSRE, UBound($aSRE))

lixiaolong posted on 2022-7-22 13:13:32

holley posted on 2022-7-22 09:40
Thanks for the answer, but when I test it the result comes out empty.

"urlAll":"(?|(https:)\\/\\(/.*)\\(/.*?)")

The code in your screenshot isn't right.

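afan posted on 2022-7-22 13:30:56

lixiaolong posted on 2022-7-22 13:13
"urlAll":"(?|(https:)\\/\\(/.*)\\(/.*?)")

The code in your screenshot isn't right.

Indeed, the question mark turned into a blank, which is very odd…
That said, the (?|..) branch reset serves no purpose here and can be dropped.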
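
A minimal sketch of the same pattern without the (?|...) wrapper. Two things here are not from the thread: the extra (/) group, added so that joining the captures keeps both slashes after https:, and the [^\\]* and [^"]* character classes used in place of .* and .*?. The sample response is trimmed to the urlAll field, and $sUrl is an illustrative name.

```autoit
; Sample response from the thread, trimmed to the urlAll field for brevity.
Local $sSource = '{"urlAll":"https:\/\/down7.huorong.cn\/sysdiag-all-5.0.69.3-2022.7.20.1.exe"}'
Local $sUrl = ''

; No branch reset: with no alternation inside it, (?|...) changes nothing.
; Each \\ consumes one literal backslash; each group captures what follows it.
Local $aSRE = StringRegExp($sSource, '"urlAll":"(https:)\\(/)\\(/[^\\]*)\\(/[^"]*)"', 3)
If Not @error Then
    ; Flag 3 returns all captured groups in one flat array; join them in order.
    For $i = 0 To UBound($aSRE) - 1
        $sUrl &= $aSRE[$i]
    Next
EndIf
MsgBox(0, 'Match', $sUrl)
```

The pattern is written for this host/file URL shape; a URL with more path segments would need a different grouping.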

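lixiaolong posted on 2022-7-22 13:44:27

afan posted on 2022-7-22 13:30
Indeed, the question mark turned into a blank, which is very odd…
That said, the (?|..) branch reset serves no purpose here and can be dropped.

Thanks for pointing that out. I should study regex some more.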

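holley posted on 2022-7-22 13:58:18

lixiaolong posted on 2022-7-22 13:13
"urlAll":"(?|(https:)\\/\\(/.*)\\(/.*?)")

The code in your screenshot isn't right.

Thanks again. This gives the same result as afan's version.
For real use, afan's version needs to be changed to: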
Local $sSRERe = StringRegExpReplace($sSource, '(?i)^.+All":"(h.+?:)[\\/]+([\w.]+)[\\/]+([^"]+).+$', '\1/\/\2/\3')

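afan posted on 2022-7-22 14:03:21

Last edited by afan on 2022-7-22 14:13

holley posted on 2022-7-22 13:58
Thanks again. This gives the same result as afan's version.
For real use, afan's version needs to be changed to:
That's right, good fix. Mine was missing a /; the replacement should be \1//\2/\3

In fact, when the value has such an obvious feature (it is the rightmost address), it can be done very simply: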
Local $sSource = '{"request_type":"2","virustime":"2022-07-20","filesize":"24M","virusVersion":"2022.7.20.1","version":"5.0.69.3","createtime":"2022-07-20 18:17:53","fullName":"sysdiag-full-5.0.69.3-2022.7.20.1.exe","allName":"sysdiag-all-5.0.69.3-2022.7.20.1.exe","urlFull":"https:\/\/down7.huorong.cn\/sysdiag-full-5.0.69.3-2022.7.20.1.exe","urlAll":"https:\/\/down7.huorong.cn\/sysdiag-all-5.0.69.3-2022.7.20.1.exe"}'
Local $aSRE = StringRegExp($sSource, '.+":"(.+)"', 1)
; Flag 1 returns an array, so pass its first element to StringReplace.
If Not @error Then MsgBox(0, '', StringReplace($aSRE[0], '\', ''))

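For complex JSON structures it is better to use JSON functions to get the value, whereas for simple cases like this one, regex is definitely the first choice: simple and efficient.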
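
For a flat response like this one, a middle ground is a small helper that looks a value up by its key name, so the result does not depend on urlAll being the last field. The sketch below is an assumption-laden illustration, not a JSON parser: _ExtractJsonString is a hypothetical name, the sample is trimmed to three fields, it only handles string values without escaped quotes, and the only escape it undoes is \/.

```autoit
; Hypothetical helper (not from the thread): returns the string value of a key
; in a flat JSON object, or '' with @error set to 1 if the key is not found.
Func _ExtractJsonString($sJson, $sKey)
    ; \Q...\E makes the key literal, so regex metacharacters in it are harmless.
    Local $aMatch = StringRegExp($sJson, '"\Q' & $sKey & '\E"\s*:\s*"([^"]*)"', 1)
    If @error Then Return SetError(1, 0, '')
    ; Undo the JSON "\/" escape; other escapes are deliberately left untouched.
    Return StringReplace($aMatch[0], '\/', '/')
EndFunc

; Sample response from the thread, trimmed to three fields for brevity.
Local $sSource = '{"version":"5.0.69.3","urlFull":"https:\/\/down7.huorong.cn\/sysdiag-full-5.0.69.3-2022.7.20.1.exe","urlAll":"https:\/\/down7.huorong.cn\/sysdiag-all-5.0.69.3-2022.7.20.1.exe"}'

ConsoleWrite(_ExtractJsonString($sSource, 'urlAll') & @CRLF)
ConsoleWrite(_ExtractJsonString($sSource, 'version') & @CRLF)
```

Nested objects, arrays, or values containing \" are beyond what this pattern can handle reliably, and that is where a proper JSON library is the safer choice.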