Creating a Bot App - 2
Continued from Creating a Bot App - 1
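Now add another Class Module and name it PagePicker (we referenced it in the last part). Add a reference to “Microsoft WinHTTP Services, version 5.1”. The code below also uses the RegExp object, so the project needs a reference to “Microsoft VBScript Regular Expressions” as well, if it is not already set.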
This class “PagePicker” contains a single method, “Pick”, which takes the URL to be fetched as a string. An optional reference to the “Spider” object can be passed as the second parameter (the spider object sends it itself, to facilitate adding the URLs gathered from the page text). We need not send this parameter if we are accessing the PagePicker object directly to fetch a single page. When the page is accessed through the spider object, each link found in the page is added to the spider object, which in return gives back the filename under which that link is to be saved in the local path.
Public Function Pick(ByVal strPickURL As String, Optional ByRef objSpiderRef As Spider) As String
    Dim strHTTPResp As String
    Dim strNewPage As String
    Dim lngOldIndex As Long
    Dim strLocalFileName As String
    Dim objHTTP As New WinHttp.WinHttpRequest
    Dim objRegEx As New RegExp
    Dim objMatch As Match

    ' Fetch the page synchronously
    objHTTP.Open "GET", strPickURL, False
    objHTTP.Send
    strHTTPResp = objHTTP.ResponseText

    strNewPage = ""
    lngOldIndex = 1
    ' Match src= / href= attributes, @import and url( together with the URL that follows
    objRegEx.IgnoreCase = True
    objRegEx.Pattern = "((src|href)\s*=\s*|@import\s*|url\s*\()""*\s*[a-z0-9/_%:\.&-\?\+]+\s*""*\s*\)*"
    objRegEx.Global = True
    For Each objMatch In objRegEx.Execute(strHTTPResp)
        If objSpiderRef Is Nothing Then
            strLocalFileName = URLClean(objMatch.Value)
        Else
            strLocalFileName = objSpiderRef.AddURL(AbsoluteURL(URLClean(objMatch.Value), strPickURL))
        End If
        ' Copy the text before this match (FirstIndex is 0-based, Mid is 1-based), then the rewritten link
        strNewPage = strNewPage & Mid(strHTTPResp, lngOldIndex, objMatch.FirstIndex - lngOldIndex + 1) _
            & UrlReform(strLocalFileName, objMatch.Value)
        lngOldIndex = objMatch.FirstIndex + objMatch.Length + 1
    Next
    ' Copy whatever follows the last match
    strNewPage = strNewPage & Mid(strHTTPResp, lngOldIndex)
    Pick = strNewPage
    Set objHTTP = Nothing
    Set objRegEx = Nothing
End Function
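Pick reads ResponseText without looking at the HTTP status and without any error handling. As a minimal sketch of a more defensive fetch, something like the following could be used instead. FetchText is a hypothetical helper that is not part of the original class, and the timeout values are arbitrary assumptions.

```vb
' Standard module or class; requires a reference to
' "Microsoft WinHTTP Services, version 5.1".
Option Explicit

' Hypothetical helper (not in the original class): returns the page text,
' or an empty string when the request fails or the status is not 200.
Public Function FetchText(ByVal strUrl As String) As String
    Dim objHTTP As New WinHttp.WinHttpRequest

    On Error GoTo FetchFailed
    ' Timeouts in milliseconds: resolve, connect, send, receive (assumed values)
    objHTTP.SetTimeouts 5000, 5000, 10000, 30000
    objHTTP.Open "GET", strUrl, False
    objHTTP.Send
    If objHTTP.Status = 200 Then
        FetchText = objHTTP.ResponseText
    End If
    Exit Function

FetchFailed:
    ' Network errors (unknown host, timeout and so on) end up here
    FetchText = ""
End Function
```

With this in place, a caller can skip link processing for a page when the returned string is empty.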
The Pick function creates a WinHTTP object which fetches the page content of the given URL. The response is searched for page/image/CSS links using regular expressions. Since each RegEx match also contains the attribute name (HREF, SRC, etc.), the private function URLClean is used; it in turn uses the Replace method of the RegExp object to strip off the unwanted text.
Private Function URLClean(ByVal StrURL As String) As String
    Dim objRegEx2 As New RegExp
    Dim StrURL2 As String
    objRegEx2.IgnoreCase = True
    objRegEx2.Global = True
    ' Strip the leading src= / href= / @import / url( and any opening quote
    objRegEx2.Pattern = "^((src|href)\s*=\s*|@import\s*|url\s*\()""*\s*"
    StrURL2 = Trim(objRegEx2.Replace(StrURL, " "))
    ' Strip trailing quotes, closing brackets and semicolons
    objRegEx2.Pattern = "(""|\)|;)*$"
    URLClean = Trim(objRegEx2.Replace(StrURL2, " "))
End Function
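To see what the link pattern and the cleanup step do together, here is a small sketch that can be pasted into a standard module and run from the Immediate window. CleanLink and DemoLinkMatches are hypothetical names; CleanLink is only a stand-in copy of the same two-step cleanup, so that the demo does not depend on the class, and the HTML fragment is made up for illustration.

```vb
' Standard module; requires a reference to
' "Microsoft VBScript Regular Expressions 5.5".
Option Explicit

' Hypothetical stand-in for URLClean: strip the attribute prefix,
' then the trailing quote / bracket / semicolon.
Private Function CleanLink(ByVal strLink As String) As String
    Dim objRe As New RegExp
    objRe.IgnoreCase = True
    objRe.Global = True
    objRe.Pattern = "^((src|href)\s*=\s*|@import\s*|url\s*\()""*\s*"
    strLink = objRe.Replace(strLink, "")
    objRe.Pattern = "(""|\)|;)*$"
    CleanLink = Trim(objRe.Replace(strLink, ""))
End Function

Public Sub DemoLinkMatches()
    Dim objRe As New RegExp
    Dim objMatch As Match
    Dim strHtml As String

    ' Made-up sample markup with one page link and one image link
    strHtml = "<a href=""docs/intro.htm"">Intro</a> <img src=""/img/logo.gif"">"

    objRe.IgnoreCase = True
    objRe.Global = True
    objRe.Pattern = "((src|href)\s*=\s*|@import\s*|url\s*\()""*\s*[a-z0-9/_%:\.&-\?\+]+\s*""*\s*\)*"

    For Each objMatch In objRe.Execute(strHtml)
        ' Prints the raw match and the URL extracted from it:
        '   href="docs/intro.htm" -> docs/intro.htm
        '   src="/img/logo.gif" -> /img/logo.gif
        Debug.Print objMatch.Value & " -> " & CleanLink(objMatch.Value)
    Next
End Sub
```

Note that the raw match still carries the attribute name and the quotes, which is why the cleanup is needed before the URL can be resolved or looked up.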
The URL so obtained may be a relative URL, hence it is passed to the private function AbsoluteURL along with the full URL of the current page to get it converted to an absolute URL. The AbsoluteURL function checks the first few characters to see if they are http://, which signifies that the URL is already absolute and needs no processing (note that https:// is not recognised by this check). If the URL starts with a slash, the site name of the current page URL is prepended to it; otherwise the folder path of the current page is prepended.
“More” validations are also performed: if the absolute URL so formed contains /../, that segment is removed along with the preceding folder name to make the URL more accurate. I found this useful when I tested the Bot by fetching the Help pages of IIS (http://localhost/iishelp/iis/misc/default.asp), which produced such URLs.
Private Function AbsoluteURL(ByVal StrURL As String, ByVal strPickURL As String) As String
    Dim strThisFolder As String
    Dim strThisSite As String
    Dim objRegEx2 As New RegExp
    ' Folder of the current page, e.g. http://localhost/iishelp/iis/misc/
    strThisFolder = Mid(strPickURL, 1, InStrRev(strPickURL, "/"))
    ' Site root of the current page, e.g. http://localhost/
    strThisSite = Mid(strPickURL, 1, InStr(Mid(strPickURL, 8), "/") + 7)
    If Left(StrURL, 7) = "http://" Then AbsoluteURL = StrURL: GoTo more
    If Left(StrURL, 1) = "/" Then AbsoluteURL = strThisSite & Mid(StrURL, 2): GoTo more
    AbsoluteURL = strThisFolder & StrURL
more:
    ' Collapse every "/folder/../" into "/" until none is left
    objRegEx2.IgnoreCase = True
    objRegEx2.Global = True
    objRegEx2.Pattern = "/[a-z0-9_%-]+/\.\./"
    Do While objRegEx2.Test(AbsoluteURL)
        AbsoluteURL = objRegEx2.Replace(AbsoluteURL, "/")
    Loop
End Function
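AbsoluteURL assumes an http:// page URL that has at least one slash after the host name. As a sketch of how the same three cases could be handled a little more defensively, the hypothetical ResolveUrl below also accepts https:// and a bare host such as http://localhost. It is an illustration only, not the author's code: it assumes the base URL contains "://", it does not collapse /../ segments, and example.com is just a placeholder host.

```vb
' Standard module; no extra references needed.
Option Explicit

' Hypothetical variant of AbsoluteURL (not part of the original class).
Public Function ResolveUrl(ByVal strUrl As String, ByVal strBaseUrl As String) As String
    Dim lngScheme As Long    ' position of "://" in the base URL
    Dim lngSlash As Long     ' first "/" after the host name, 0 if none
    Dim strSite As String    ' e.g. http://localhost/
    Dim strFolder As String  ' e.g. http://localhost/iishelp/iis/misc/

    lngScheme = InStr(strBaseUrl, "://")
    lngSlash = InStr(lngScheme + 3, strBaseUrl, "/")
    If lngSlash = 0 Then
        ' Base URL has no path at all, e.g. http://localhost
        strSite = strBaseUrl & "/"
        strFolder = strSite
    Else
        strSite = Left$(strBaseUrl, lngSlash)
        strFolder = Left$(strBaseUrl, InStrRev(strBaseUrl, "/"))
    End If

    If LCase$(Left$(strUrl, 7)) = "http://" Or LCase$(Left$(strUrl, 8)) = "https://" Then
        ResolveUrl = strUrl                      ' already absolute
    ElseIf Left$(strUrl, 1) = "/" Then
        ResolveUrl = strSite & Mid$(strUrl, 2)   ' relative to the site root
    Else
        ResolveUrl = strFolder & strUrl          ' relative to the current folder
    End If
End Function
```

For example, ResolveUrl("ref.htm", "http://localhost/iishelp/iis/misc/default.asp") gives http://localhost/iishelp/iis/misc/ref.htm, and ResolveUrl("page2.htm", "http://localhost") gives http://localhost/page2.htm.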
All links in the fetched page need to be replaced with local filenames to enable offline browsing. This is done by the UrlReform function, which takes the local filename and the link text received from the RegExp match and substitutes the filename for the URL inside that text, thus forming the new page content.
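Private Function UrlReform(ByVal StrURL As String, ByVal strMatchText As String) As String
    ' Swap the cleaned URL inside the matched text for the local filename
    UrlReform = Replace(strMatchText, URLClean(strMatchText), StrURL)
End Function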
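The way Pick stitches the new page together (copy the text up to a match, append the rewritten link, move on) can be tried in isolation. The sketch below is a simplified illustration under stated assumptions: it only handles double-quoted href attributes, and LocalNameFor is a hypothetical stand-in for the filename that the spider would return.

```vb
' Standard module; requires a reference to
' "Microsoft VBScript Regular Expressions 5.5".
Option Explicit

' Hypothetical stand-in for Spider.AddURL: derive a flat file name from a URL.
Private Function LocalNameFor(ByVal strUrl As String) As String
    LocalNameFor = Replace(Replace(strUrl, "/", "_"), ":", "_")
End Function

Public Function RewriteLinks(ByVal strHtml As String) As String
    Dim objRe As New RegExp
    Dim objMatch As Match
    Dim strOut As String
    Dim lngNext As Long   ' 1-based position of the next uncopied character

    objRe.IgnoreCase = True
    objRe.Global = True
    ' Simplified pattern: href="..." with the URL in the first capture group
    objRe.Pattern = "href=""([^""]*)"""

    lngNext = 1
    For Each objMatch In objRe.Execute(strHtml)
        ' FirstIndex is 0-based while Mid$ is 1-based, hence the + 1
        strOut = strOut & Mid$(strHtml, lngNext, objMatch.FirstIndex + 1 - lngNext)
        strOut = strOut & "href=""" & LocalNameFor(objMatch.SubMatches(0)) & """"
        lngNext = objMatch.FirstIndex + objMatch.Length + 1
    Next
    ' Append the tail after the last match
    RewriteLinks = strOut & Mid$(strHtml, lngNext)
End Function
```

Calling RewriteLinks on `<a href="a/b.htm">x</a>` returns `<a href="a_b.htm">x</a>`; text outside the matches is copied through unchanged.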
Now generate the DLL file and register it by double-clicking it or, alternatively, by issuing “RegSvr32 ” from the command prompt. Once you have done this, the DLL file can be referenced from a VB project to create the interface. I am avoiding getting into another post, so in a few words… Create a reference in your application to “MyBot”, as we had named it…
You can fetch the entire site content by calling the spider as below.
Dim objSpider As New MyBot.Spider
objSpider.AddURL "http://localhost/iishelp/iis/misc/default.asp"
objSpider.AllowURL = "http://localhost/iishelp/"
objSpider.DenyURL = "\.gif|\.jpg|\.png"
objSpider.LocalFolder = "c:\UploadFolder\"
objSpider.BotStart
or fetch a single page by calling the PagePicker as below:
Dim objPicker As New MyBot.PagePicker
txt = objPicker.Pick("http://localhost/")
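Since Pick only returns the page text, saving it is left to the caller. A minimal sketch of doing that from the client project might look like this; the file name index.htm is an assumption, and the folder c:\UploadFolder\ is assumed to exist already.

```vb
' In a form or module of the client project; requires a reference to MyBot.
Option Explicit

Public Sub SaveSinglePage()
    Dim objPicker As New MyBot.PagePicker
    Dim strPage As String
    Dim intFile As Integer

    ' Fetch one page without a spider, so links are left as they are
    strPage = objPicker.Pick("http://localhost/")

    ' Write the text to disk (file name is an assumption for this example)
    intFile = FreeFile
    Open "c:\UploadFolder\index.htm" For Output As #intFile
    Print #intFile, strPage;   ' trailing semicolon: no extra line break is added
    Close #intFile
End Sub
```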
