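Last edited by netbean on 2011-6-13 04:46.
This script is still incomplete; please help improve and correct it.
1. First save the Tianya forum thread into @MyDocumentsDir (the script expects it as 637021.html), e.g. http://www.tianya.cn/publicforum/content/no11/1/637021.shtml
2. Set the name of the poster (the "作者" field) whose posts you want to extract, e.g. 小晴天儿
Known issue: the first post of the thread is not extracted yet and that needs fixing; the script probably also needs optimizing and debugging. Thanks!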
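; Read the saved thread page into memory.
$sFile = @MyDocumentsDir & "\637021.html"
$sHTML = FileOpen($sFile, 0) ; 0 = read mode; returns a file handle
If $sHTML = -1 Then Exit MsgBox(16, 'Error', 'Cannot open ' & $sFile)
$sText = FileRead($sHTML)
FileClose($sHTML) ; close the handle, not the path string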
$Writer = '小晴天儿'
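; Non-greedy title match; [^"]+ keeps the URL capture inside one href attribute.
$Title = StringRegExp($sText, '(?i)<title>(.*?)</title>', 3)
$WriterURL = StringRegExp($sText, '作者:<a href="([^"]+)" target="_blank">' & $Writer, 3)
If Not IsArray($Title) Or Not IsArray($WriterURL) Then Exit MsgBox(16, 'Error', 'Title or author link not found')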
$newText = ''
$newText &= '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">' & @LF
$newText &= '<HTML><HEAD>' & @LF
$newText &= '<META http-equiv=Content-Type content="text/html; charset=gb2312">' & @LF
$newText &= '<TITLE>' & $Title[0] & '</TITLE>' & '</HEAD>' & @LF
$newText &= '<BODY style="FONT-SIZE: 14px; MARGIN: 1cm; LINE-HEIGHT: 0.5cm; FONT-FAMILY: 宋体; BACKGROUND-COLOR: #ffffff">' & @LF
$newText &= '<P style="FONT-SIZE: 28px; LINE-HEIGHT: 32pt; FONT-FAMILY: 黑体" align=center>' & $Title[0] & '</P><BR><FONT color=green><BR>' & @LF
$PostDate = StringRegExp($sText, '>' & $Writer & '.+日期:(.+)</(?:td|font)>' , 3)
$Content = StringRegExp($sText, $Writer & '.*\s+.*"post">(.+)<div class="post-jb">' , 3)
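; Emit a header and body per matched post; stop at the shorter array so $PostDate[$i] stays in range.
$nPosts = UBound($Content)
If UBound($PostDate) < $nPosts Then $nPosts = UBound($PostDate)
For $i = 0 To $nPosts - 1
    $newText &= '<CENTER>作者:<a href="' & $WriterURL[0] & '" target="_blank">' & $Writer & '</a> 日期:' & $PostDate[$i] & '</CENTER>' & @LF
    $newText &= $Content[$i] & '<P>' & @LF
Next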
; Drop attributes that sit between <img and src; [^>]*? stays inside a single tag.
$newText = StringRegExpReplace($newText, '(?i)<img\s[^>]*?\ssrc=', '<img src=')
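$newText &= '</FONT></BODY></HTML>' & @LF ; close the tags opened in the header
$newHtml = FileOpen(@MyDocumentsDir & "\new.html", 2) ; 2 = write mode, overwrites an existing file
FileWrite($newHtml, $newText)
FileClose($newHtml)
MsgBox(0, 0, 'OK') ; report success only after the file has been written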
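
A note on the regexes, since most of the fragility is probably there: a greedy group such as `(.+)` in the patterns above can swallow far more than intended when one line of the page holds several posts. The snippet below is only a minimal sketch of that effect. The sample string, the example.com URLs, and the names 甲 and 乙 are made up for illustration and are not taken from a real Tianya page.

```autoit
; Hypothetical sample line with two author links (illustration only).
Local $sSample = '作者:<a href="http://example.com/a" target="_blank">甲</a> ' & _
        '作者:<a href="http://example.com/b" target="_blank">乙</a>'

; Greedy: (.+) starts at the first href and runs on until the text before 乙,
; so the capture spans from the first URL through the second one.
Local $aGreedy = StringRegExp($sSample, '作者:<a href="(.+)" target="_blank">乙', 3)

; Bounded: [^"]+ cannot cross a double quote, so only the link of 乙 is captured.
Local $aBounded = StringRegExp($sSample, '作者:<a href="([^"]+)" target="_blank">乙', 3)

If IsArray($aGreedy) Then ConsoleWrite('greedy : ' & $aGreedy[0] & @CRLF)
If IsArray($aBounded) Then ConsoleWrite('bounded: ' & $aBounded[0] & @CRLF)
```

The same idea might help with the $PostDate and $Content patterns, but their exact form depends on the page's HTML, so treat this as a starting point rather than a fix.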