fenhanxue posted on 2014-4-24 17:57:32

How do I extract the plain text of a web page, without the DIV tags and other markup? [Solved]

Last edited by fenhanxue on 2014-4-25 12:12

Dim $url_get_info = "http://saihua870908.blog.163.com/blog/static/1284290382010112211532975/"
Local $temp = InetRead($url_get_info,1)


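ClipPut(BinaryToString($temp))

With the code above, what I read back is not plain text. It contains a lot of web page markup such as "<script type="text/javascript">" and "</html>". I don't know much about HTML, and looking at it makes my head spin.

I only want the plain text of this page, as in the screenshot below.

How should I go about this?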

gto250 posted on 2014-4-24 21:01:09

I'm not very familiar with regular expressions, but this code seems to work.
If you want a professional answer, ask afan.

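Dim $url_get_info = "http://saihua870908.blog.163.com/blog/static/1284290382010112211532975/"
$temp = BinaryToString(InetRead($url_get_info, 1))

; Flag 1 returns an array holding the captured group, so the text is in element [0]
$tit = StringRegExp($temp, "blogTitle:'(.*?)',", 1)
$text = StringRegExp($temp, "blogAbstract:'(.*?)',", 1)
If Not IsArray($tit) Or Not IsArray($text) Then Exit MsgBox(16, "", "Pattern not found in page source")

; The search strings below, backslash included, are kept exactly as posted
$txt = StringReplace($text[0], "<BR\>", @CRLF)
$txt = StringReplace($txt, "<DIV id=shareBody\>", "")
$txt = StringReplace($txt, "</DIV\>", "")

$str = $tit[0] & @CRLF & $txt
MsgBox(0, "", $str)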

zch11230 posted on 2014-4-24 21:14:02

Last edited by zch11230 on 2014-4-24 21:27

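Approach: step one, extract the article body; step two, extract the text from it. This is more generally applicable, for example to this article: http://zhangshuyue.blog.163.com/blog/static/179045442013220115848363/

#include <Inet.au3>
#include <IE.au3>
#AutoIt3Wrapper_UseX64=n
; Host a browser control in a zero-size GUI so it can turn HTML into text
$oIE = ObjCreate("Shell.Explorer.2")
GUICreate("", 0, 0, 0, 0)
GUICtrlCreateObj($oIE, 0, 0, 0, 0)
$source = _INetGetSource("http://saihua870908.blog.163.com/blog/static/1284290382010112211532975/")
; Flag 3 returns an array of matches, so the first match is element [0]
$title = StringRegExp($source, 'class="tcnt">(.+?)</span>', 3)
$html = StringRegExp($source, '(?s)<div class="nbw-blog-start"></div>.+?<div class="nbw-blog-end"></div>', 3)
If Not IsArray($title) Or Not IsArray($html) Then Exit MsgBox(16, "", "Article body not found")
_IENavigate($oIE, "about:blank")
_IEBodyWriteHTML($oIE, $html[0])
MsgBox(0, $title[0], _IEBodyReadText($oIE))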
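
Editor's sketch: the two-step idea (first cut out the article body, then reduce it to text) can also be tried with string functions alone, without an embedded browser. This is only a minimal sketch, not code from any poster. `_ExtractBetween` and `_HtmlToText` are hypothetical helper names, the sample page string is made up so the script runs offline, and the `nbw-blog-start` / `nbw-blog-end` markers are taken from the posts in this thread on the assumption that they surround the article body.

```autoit
; Made-up sample page; a real script would fetch the page source first.
Local $sPage = '<html><head><script type="text/javascript">var a = 1;</script></head><body>' & _
        '<div class="nbw-blog-start"></div><DIV>First line<BR>Tom &amp; Jerry&nbsp;here</DIV>' & _
        '<div class="nbw-blog-end"></div></body></html>'

; Step 1: cut out the article body between the two marker divs.
Local $sBody = _ExtractBetween($sPage, '<div class="nbw-blog-start"></div>', '<div class="nbw-blog-end"></div>')
; Step 2: reduce that fragment to plain text and print it.
ConsoleWrite(_HtmlToText($sBody) & @CRLF)

; Hypothetical helper: returns the text between $sStart and $sEnd,
; or "" with @error set when either marker is missing.
Func _ExtractBetween($sText, $sStart, $sEnd)
    Local $iStart = StringInStr($sText, $sStart)
    If $iStart = 0 Then Return SetError(1, 0, "")
    $iStart += StringLen($sStart)
    ; Search for the end marker only after the start marker.
    Local $iEnd = StringInStr($sText, $sEnd, 0, 1, $iStart)
    If $iEnd = 0 Then Return SetError(2, 0, "")
    Return StringMid($sText, $iStart, $iEnd - $iStart)
EndFunc   ;==>_ExtractBetween

; Hypothetical helper: strips markup from an HTML fragment.
Func _HtmlToText($sHtml)
    ; Drop <script> and <style> blocks together with their contents.
    $sHtml = StringRegExpReplace($sHtml, '(?is)<(script|style)\b.*?</\1>', '')
    ; Turn <br> and closing </p> or </div> tags into line breaks.
    $sHtml = StringRegExpReplace($sHtml, '(?i)<br\s*/?>|</(p|div)>', @CRLF)
    ; Remove every remaining tag.
    $sHtml = StringRegExpReplace($sHtml, '<[^>]*>', '')
    ; Decode a few common entities; &amp; goes last so it is not decoded twice.
    $sHtml = StringReplace($sHtml, '&nbsp;', ' ')
    $sHtml = StringReplace($sHtml, '&lt;', '<')
    $sHtml = StringReplace($sHtml, '&gt;', '>')
    $sHtml = StringReplace($sHtml, '&quot;', '"')
    $sHtml = StringReplace($sHtml, '&amp;', '&')
    ; Flag 3 trims leading and trailing whitespace.
    Return StringStripWS($sHtml, 3)
EndFunc   ;==>_HtmlToText
```

With the sample string this should print "First line" and "Tom & Jerry here" on two lines. Regular expressions are a rough tool for HTML, so for heavily nested or malformed pages the browser-control approach in the post above is likely to be more robust.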

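olala posted on 2014-4-24 23:19:24

Last edited by olala on 2014-4-25 00:04

I haven't been on acn for a long time. Tinkering with some code here; this forum is where I find that feeling of joy (*^__^*) hehe...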


#include <ButtonConstants.au3>
#include <EditConstants.au3>
#include <GUIConstantsEx.au3>
#include <GuiStatusBar.au3>
#include <StaticConstants.au3>
#include <WindowsConstants.au3>
#region ### START Koda GUI section ### Form=
$Form1 = GUICreate("Article scraper test", 580, 448, 192, 124)
$Input1 = GUICtrlCreateInput("", 56, 14, 385, 22)
GUICtrlSetData(-1, "http://saihua870908.blog.163.com/blog/static/1284290382010112211532975/")
$Label1 = GUICtrlCreateLabel("URL:", 24, 18, 30, 17)
$Button1 = GUICtrlCreateButton("Fetch article", 464, 12, 81, 25)
$Edit1 = GUICtrlCreateEdit("", 3, 48, 575, 378)
$StatusBar1 = _GUICtrlStatusBar_Create($Form1)
_GUICtrlStatusBar_SetMinHeight($StatusBar1, 20)
GUISetState(@SW_SHOW)
#endregion ### END Koda GUI section ###
While 1
        $nMsg = GUIGetMsg()
        Switch $nMsg
                Case $GUI_EVENT_CLOSE
                        Exit
                Case $Button1
                        _GetText()
        EndSwitch
WEnd
Func _GetText()
        $begin = TimerInit()
        $url = GUICtrlRead($Input1)
        $html = _XmlHttp($url)
        $title = _GetTitle($html)
        $author = _GetAuthor($html)
        $html = _StrCut($html, '<div class="nbw-blog-start">', '<div class="nbw-blog-end">')
        $html = _FormatText($html)
        GUICtrlSetData($Edit1, $title & @CRLF & $author & @CRLF & $html)
        $diff = TimerDiff($begin)
        _GUICtrlStatusBar_SetText($StatusBar1, " Done! Total time: " & $diff & " ms.")
EndFunc   ;==>_GetText
Func _XmlHttp($url)
        Local $oHTTP, $sReturn
        $oHTTP = ObjCreate("microsoft.xmlhttp")
        $oHTTP.Open("get", $url, False)
        $oHTTP.Send()
        $sReturn = BinaryToString($oHTTP.responseBody)
        Return $sReturn
EndFunc   ;==>_XmlHttp
Func _StrCut($Str, $StartStr, $EndStr)
        $Start = StringInStr($Str, $StartStr) + StringLen($StartStr)
        $End = StringInStr($Str, $EndStr)
        $Str = StringMid($Str, $Start, $End - $Start)
        Return $Str
EndFunc   ;==>_StrCut
Func _FormatText($text)
        $text = StringReplace($text, '<BR>', @CRLF)
        $text = StringRegExpReplace($text, '<.*?>', '')
        $text = StringReplace($text, '&nbsp;', '')
        $text = StringReplace($text, '      ', '')
        Return $text
EndFunc   ;==>_FormatText
Func _GetTitle($html)
        $html = _StrCut($html, '<title>', ' - ')
        Return $html
EndFunc   ;==>_GetTitle
Func _GetAuthor($html)
        $html = _StrCut($html, '<title>', '</title>')
        $html = _StrCut($html, '- ', '的日志') ; '的日志' is literal text matched in the page title, so it stays in Chinese
        Return $html
EndFunc   ;==>_GetAuthor

zch11230 posted on 2014-4-24 23:24:50

Reply to 4# olala


    Try this URL: http://zhangshuyue.blog.163.com/blog/static/179045442013220115848363/

damoo posted on 2014-4-24 23:34:31


#include <IE.au3>

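; Open the page in Internet Explorer and read the rendered body text
$oIE = _IECreate("http://onlinehelp.microsoft.com/zh-CN/bing/ff808535.aspx")
$text = _IEBodyReadText($oIE)
; ConsoleWrite takes a single argument
ConsoleWrite($text & @CRLF)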

olala posted on 2014-4-25 00:05:21

Reply to 5# zch11230


OK, I've fixed it. Try again now.

zch11230 posted on 2014-4-25 00:18:54

Reply to 7# olala


    Awesome!!! +10 points

fenhanxue posted on 2014-4-25 12:11:29

Thank you, all you experts~

zxhou1 posted on 2014-4-30 16:11:26

Very impressive, I learned something.