How do I extract the plain text of a web page, without the DIV tags and other markup? [Solved]
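Last edited by fenhanxue on 2014-4-25 12:12
Dim $url_get_info = "http://saihua870908.blog.163.com/blog/static/1284290382010112211532975/"
Local $temp = InetRead($url_get_info, 1)
ClipPut(BinaryToString($temp))
I used the code above, but what it reads is not plain text: it is full of web-page markup such as "<script type="text/javascript">" and "</html>". I don't know much about HTML, and reading it makes my head spin.
I only want the plain text from this page, as shown in the screenshot below.
How should I do this?
I'm not very familiar with regular expressions, but this code seems to work.
If you want a professional solution, ask afan.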
Dim $url_get_info = "http://saihua870908.blog.163.com/blog/static/1284290382010112211532975/"
$temp = BinaryToString(InetRead($url_get_info, 1))
; Flag 1 returns an array of the captured groups, so the text is in element [0].
$tit = StringRegExp($temp, "blogTitle:'(.*?)',", 1)
$text = StringRegExp($temp, "blogAbstract:'(.*?)',", 1)
If Not IsArray($tit) Or Not IsArray($text) Then Exit MsgBox(16, "", "Title or abstract not found")
$txt = StringReplace($text[0], "<BR\>", @CRLF)
$txt = StringReplace($txt, "<DIV id=shareBody\>", "")
$txt = StringReplace($txt, "</DIV\>", "")
$str = $tit[0] & @CRLF & $txt
MsgBox(0, "", $str)
Last edited by zch11230 on 2014-4-24 21:27
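Approach: first extract the article body, then extract the text from it. This applies more broadly, for example to this article: "http://zhangshuyue.blog.163.com/blog/static/179045442013220115848363/"
#include <Inet.au3>
#include <IE.au3>
#AutoIt3Wrapper_UseX64=n
; Host an embedded browser control in a GUI that is never shown, so it can render the HTML fragment.
$oIE = ObjCreate("Shell.Explorer.2")
GUICreate("", 0, 0, 0, 0)
GUICtrlCreateObj($oIE, 0, 0, 0, 0)
$source = _INetGetSource("http://saihua870908.blog.163.com/blog/static/1284290382010112211532975/")
; Flag 3 returns an array of all matches, so element [0] is used below.
$title = StringRegExp($source, 'class="tcnt">(.+?)</span>', 3)
$html = StringRegExp($source, '(?s)<div class="nbw-blog-start"></div>.+?<div class="nbw-blog-end"></div>', 3)
If Not IsArray($title) Or Not IsArray($html) Then Exit MsgBox(16, "", "Title or body not found")
_IENavigate($oIE, "about:blank")
_IEBodyWriteHTML($oIE, $html[0])
MsgBox(0, $title[0], _IEBodyReadText($oIE))
Last edited by olala on 2014-4-25 00:04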
I haven't been on acn for a long time, so I tinkered with some code. This forum always brings back that feeling of joy (*^__^*) hehe...
#include <ButtonConstants.au3>
#include <EditConstants.au3>
#include <GUIConstantsEx.au3>
#include <GuiStatusBar.au3>
#include <StaticConstants.au3>
#include <WindowsConstants.au3>
#region ### START Koda GUI section ### Form=
$Form1 = GUICreate("Article scraping test", 580, 448, 192, 124)
$Input1 = GUICtrlCreateInput("", 56, 14, 385, 22)
GUICtrlSetData(-1, "http://saihua870908.blog.163.com/blog/static/1284290382010112211532975/")
$Label1 = GUICtrlCreateLabel("URL:", 24, 18, 30, 17)
$Button1 = GUICtrlCreateButton("Get article", 464, 12, 81, 25)
$Edit1 = GUICtrlCreateEdit("", 3, 48, 575, 378)
$StatusBar1 = _GUICtrlStatusBar_Create($Form1)
_GUICtrlStatusBar_SetMinHeight($StatusBar1, 20)
GUISetState(@SW_SHOW)
#endregion ### END Koda GUI section ###
While 1
    $nMsg = GUIGetMsg()
    Switch $nMsg
        Case $GUI_EVENT_CLOSE
            Exit
        Case $Button1
            _GetText()
    EndSwitch
WEnd
Func _GetText()
    $begin = TimerInit()
    $url = GUICtrlRead($Input1)
    $html = _XmlHttp($url)
    $title = _GetTitle($html)
    $author = _GetAuthor($html)
    ; Keep only the article body between the blog's start and end markers.
    $html = _StrCut($html, '<div class="nbw-blog-start">', '<div class="nbw-blog-end">')
    $html = _FormatText($html)
    GUICtrlSetData($Edit1, $title & @CRLF & $author & @CRLF & $html)
    $diff = TimerDiff($begin)
    _GUICtrlStatusBar_SetText($StatusBar1, " Scraping finished! Total time: " & $diff & " ms.")
EndFunc   ;==>_GetText
Func _XmlHttp($url)
    Local $oHTTP, $sReturn
    $oHTTP = ObjCreate("microsoft.xmlhttp")
    $oHTTP.Open("get", $url, False)
    $oHTTP.Send()
    $sReturn = BinaryToString($oHTTP.responseBody)
    Return $sReturn
EndFunc   ;==>_XmlHttp
Func _StrCut($Str, $StartStr, $EndStr)
    $Start = StringInStr($Str, $StartStr) + StringLen($StartStr)
    ; Search for the end marker only after the start marker, not from the top of the string.
    $End = StringInStr($Str, $EndStr, 0, 1, $Start)
    $Str = StringMid($Str, $Start, $End - $Start)
    Return $Str
EndFunc   ;==>_StrCut
Func _FormatText($text)
    $text = StringReplace($text, '<BR>', @CRLF)
    ; Strip every remaining tag.
    $text = StringRegExpReplace($text, '<.*?>', '')
    $text = StringReplace($text, '&nbsp;', '')
    $text = StringReplace($text, ' ', '')
    Return $text
EndFunc   ;==>_FormatText
Func _GetTitle($html)
    $html = _StrCut($html, '<title>', ' - ')
    Return $html
EndFunc   ;==>_GetTitle
Func _GetAuthor($html)
    $html = _StrCut($html, '<title>', '</title>')
    ; '的日志' ("...'s blog post") is literal text in the page title, so it must stay in Chinese.
    $html = _StrCut($html, '- ', '的日志')
    Return $html
EndFunc   ;==>_GetAuthor
Reply to #4 olala
Try this URL: http://zhangshuyue.blog.163.com/blog/static/179045442013220115848363/
#include <IE.au3>
$oIE = _IECreate("http://onlinehelp.microsoft.com/zh-CN/bing/ff808535.aspx")
$text = _IEBodyReadText($oIE)
; ConsoleWrite takes a single argument.
ConsoleWrite($text & @CRLF)
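Reply to #5 zch11230
OK, I've fixed it. Please try again now.
Reply to #7 olala
Awesome!!! +10 points.
Thanks, all you experts!
Pretty impressive. I learned something.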