Jalaj

December 20, 2006

Creating a Bot App - 2

Filed under: COM, Regular Expressions, Visual Basic — Jalaj @ 1:11 pm

Continued from Creating a Bot App - 1

Now add another Class Module and name it PagePicker (we referenced it in the last part). Add a reference to “Microsoft WinHTTP Services, version 5.1”.

This class, “PagePicker”, will contain a single method, “Pick”, which takes the URL to be fetched as a string. An optional reference to the “Spider” object can be passed as the second parameter (the spider object sends it itself, to facilitate adding the URLs gathered from the page text). We need not send this parameter if we are accessing the PagePicker object directly to fetch a single page. When the page is accessed by the spider object, the links found in the page are added to the spider object, which in return supplies the filename under which each link is to be saved in the local path.

Public Function Pick(ByVal strPickURL As String, Optional ByRef objSpiderRef As Spider) As String

    Dim strHTTPResp As String
    Dim strNewPage As String
    Dim lngOldIndex As Long
    Dim strLocalFileName As String

    Dim objHTTP As New WinHttp.WinHttpRequest
    Dim objRegEx As New RegExp
    Dim objMatch As Match

    objHTTP.Open "GET", strPickURL, False
    objHTTP.Send
    strHTTPResp = objHTTP.ResponseText

    strNewPage = ""
    lngOldIndex = 1

    objRegEx.IgnoreCase = True
    objRegEx.Pattern = "((src|href)\s*=\s*|@import\s*|url\s*\()""*\s*[a-z0-9/_%:\.&-\?\+]+\s*""*\s*\)*"
    objRegEx.Global = True

    For Each objMatch In objRegEx.Execute(strHTTPResp)
        If objSpiderRef Is Nothing Then
            strLocalFileName = URLClean(objMatch)
        Else
            strLocalFileName = objSpiderRef.AddURL(AbsoluteURL(URLClean(objMatch), strPickURL))
        End If
        strNewPage = strNewPage & Mid(strHTTPResp, lngOldIndex, objMatch.FirstIndex - lngOldIndex + 1) _
		 & UrlReform(strLocalFileName, objMatch)
        lngOldIndex = objMatch.FirstIndex + objMatch.Length + 1
    Next
    strNewPage = strNewPage & Mid(strHTTPResp, lngOldIndex)
    Pick = strNewPage

    Set objHTTP = Nothing
    Set objRegEx = Nothing

End Function
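To see what the pattern in Pick actually captures, here is a small sketch that runs the same expression over a made-up HTML fragment. FindLinks and DemoLinkPattern are hypothetical helper names of mine, not part of the Bot, and the sketch assumes the project also references “Microsoft VBScript Regular Expressions 5.5”.

```vb
' Sketch only: place in a standard module of a test project that
' references "Microsoft VBScript Regular Expressions 5.5".
' FindLinks is a hypothetical helper, not part of the Bot itself.
Public Function FindLinks(ByVal strHTML As String) As String

    Dim objRegEx As New RegExp
    Dim objMatch As Match
    Dim strOut As String

    objRegEx.IgnoreCase = True
    objRegEx.Global = True
    ' Same pattern as in Pick
    objRegEx.Pattern = "((src|href)\s*=\s*|@import\s*|url\s*\()""*\s*[a-z0-9/_%:\.&-\?\+]+\s*""*\s*\)*"

    ' Collect every match, one per line
    For Each objMatch In objRegEx.Execute(strHTML)
        If Len(strOut) > 0 Then strOut = strOut & vbCrLf
        strOut = strOut & objMatch.Value
    Next
    FindLinks = strOut

End Function

Public Sub DemoLinkPattern()
    ' Made-up sample markup
    Debug.Print FindLinks("<a href=""page2.htm"">Next</a> <img src=""img/logo.gif"">")
    ' Prints:
    '   href="page2.htm"
    '   src="img/logo.gif"
End Sub
```

Note that each match still carries the attribute name and the quotes, which is why it has to be cleaned before use.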

The Pick function creates a WinHTTP object which fetches the page content of the given URL. The response is searched for page/image/CSS links using regular expressions. Since each RegEx match also contains the attribute name (HREF, SRC, etc.), the private function URLClean is used, which itself uses the Replace method of the RegExp object to strip the unwanted text.

Private Function URLClean(ByVal StrURL As String)

    Dim objRegEx2 As New RegExp
    Dim StrURL2 As String

    objRegEx2.IgnoreCase = True
    objRegEx2.Global = True
    objRegEx2.Pattern = "^((src|href)\s*=\s*|@import\s*|url\s*\()""*\s*"
    StrURL2 = Trim(objRegEx2.Replace(StrURL, " "))
    objRegEx2.Pattern = "(""|\)|;)*$"
    URLClean = Trim(objRegEx2.Replace(StrURL2, " "))

End Function
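A quick way to check URLClean is a throwaway routine like the one below. Since URLClean is private, I am assuming the routine is pasted temporarily into the PagePicker class itself. DemoURLClean is a hypothetical name and the inputs are made-up match texts.

```vb
' Sketch only: temporary test routine, pasted inside the PagePicker class
' so that it can reach the private URLClean function.
Public Sub DemoURLClean()

    ' Leading attribute name and both quotes are stripped
    Debug.Print URLClean("href=""page2.htm""")       ' page2.htm

    ' Matching is case-insensitive and tolerates spaces around "="
    Debug.Print URLClean("SRC = ""img/logo.gif""")   ' img/logo.gif

    ' CSS style reference: "url(" and the closing bracket are stripped
    Debug.Print URLClean("url(images/bg.gif)")       ' images/bg.gif

End Sub
```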

The URL so formed may be a relative URL, so it is passed to the private function AbsoluteURL, along with the full URL of the current page, to be converted to an absolute URL. The AbsoluteURL function checks whether the first few characters are http://, which signifies that the URL is already absolute and needs no processing. If it starts with a slash, the site name of the current page URL is prepended to it; otherwise the folder path of the current page is prepended.

“More” validations are also performed: if the absolute URL so formed contains /../, that segment is removed along with the preceding folder name to make the URL more accurate. I found this useful when I tested the Bot for fetching the Help pages of IIS (http://localhost/iishelp/iis/misc/default.asp), which formed such URLs.

Private Function AbsoluteURL(StrURL, strPickURL)

    Dim strThisFolder As String
    Dim strThisSite As String

    strThisFolder = Mid(strPickURL, 1, InStrRev(strPickURL, "/"))
    strThisSite = Mid(strPickURL, 1, InStr(Mid(strPickURL, 8), "/") + 7)
    If Left(StrURL, 7) = "http://" Then AbsoluteURL = StrURL: GoTo more
    If Left(StrURL, 1) = "/" Then AbsoluteURL = strThisSite & Mid(StrURL, 2): GoTo more
    AbsoluteURL = strThisFolder & StrURL

more:

    Dim objRegEx2 As New RegExp
    objRegEx2.IgnoreCase = True
    objRegEx2.Global = True
    objRegEx2.Pattern = "/[a-z0-9_%-]+/\.\./"
    Do While objRegEx2.Test(AbsoluteURL)
        tmp = objRegEx2.Replace(AbsoluteURL, "/")
        AbsoluteURL = tmp
    Loop

End Function
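The three cases, plus the /../ clean-up, can be traced with a throwaway routine such as the one below. Again I am assuming it is pasted temporarily into the PagePicker class, because AbsoluteURL is private. DemoAbsoluteURL is a hypothetical name, and the link texts are invented for illustration rather than taken from the real IIS help pages.

```vb
' Sketch only: temporary test routine, pasted inside the PagePicker class
' so that it can reach the private AbsoluteURL function.
Public Sub DemoAbsoluteURL()

    Const PAGE_URL As String = "http://localhost/iishelp/iis/misc/default.asp"

    ' Already absolute: returned unchanged
    Debug.Print AbsoluteURL("http://localhost/other/a.htm", PAGE_URL)
    ' http://localhost/other/a.htm

    ' Leading slash: the site name is prepended
    Debug.Print AbsoluteURL("/iishelp/common/style.css", PAGE_URL)
    ' http://localhost/iishelp/common/style.css

    ' Plain relative link: the folder of the current page is prepended
    Debug.Print AbsoluteURL("navbar.asp", PAGE_URL)
    ' http://localhost/iishelp/iis/misc/navbar.asp

    ' Two levels up: the loop removes /misc/../ first, then /iis/../
    Debug.Print AbsoluteURL("../../common/style.css", PAGE_URL)
    ' http://localhost/iishelp/common/style.css

End Sub
```

The last case shows why the replacement sits in a loop: a single pass only removes the innermost /folder/../ pair, and the next pair becomes visible only after that.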

The fetched page requires all links to be replaced with local filenames to enable browsing. This is done by the UrlReform function, which takes the local filename and the link text received from the RegExp match and reforms the link accordingly, thus forming the new page content.

Private Function UrlReform(StrURL, strMatchText)

    UrlReform = Replace(strMatchText, URLClean(strMatchText), StrURL)

End Function
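As a rough illustration, UrlReform swaps only the URL part of the matched text and leaves the attribute name and quotes alone. The routine below is a sketch meant to be pasted temporarily into the PagePicker class. DemoUrlReform is a hypothetical name, and “page2.htm” merely stands in for whatever local filename the spider hands back.

```vb
' Sketch only: temporary test routine, pasted inside the PagePicker class
' so that it can reach the private UrlReform function.
Public Sub DemoUrlReform()

    Dim strMatch As String

    ' Made-up match text, as the RegExp in Pick would deliver it
    strMatch = "href=""../misc/page2.htm"""

    ' "page2.htm" is a stand-in for the filename returned by the spider
    Debug.Print UrlReform("page2.htm", strMatch)     ' href="page2.htm"

End Sub
```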

Now generate the DLL file and register it by double-clicking it, or alternatively by issuing “RegSvr32 ” from the command prompt. Once you have done this, the DLL file can be referenced from a VB project to create the interface. I am avoiding getting into another post, so in a few words… Create a reference in your application to “MyBot”, as we had named it…

You can fetch the entire site content by calling the spider as below.


Dim objSpider As New MyBot.Spider
objSpider.AddURL "http://localhost/iishelp/iis/misc/default.asp"
objSpider.AllowURL = "http://localhost/iishelp/"
objSpider.DenyURL = "\.gif|\.jpg|\.png"
objSpider.LocalFolder = "c:\UploadFolder\"
objSpider.BotStart

or fetch a single page by calling the PagePicker as below:

Dim objPicker As New MyBot.PagePicker
txt = objPicker.Pick("http://localhost/")
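Pick only returns the page text, so if you want the single page on disk you have to write it out yourself. Below is a minimal sketch of one way to do that, assuming MyBot is registered and referenced. SaveText and FetchOnePage are hypothetical helper names, and the target path is only an example.

```vb
' Sketch only: assumes the project references the registered MyBot DLL.
' SaveText and FetchOnePage are hypothetical helpers, not part of the Bot.
Public Sub SaveText(ByVal strPath As String, ByVal strText As String)

    Dim intFile As Integer

    intFile = FreeFile
    Open strPath For Output As #intFile
    Print #intFile, strText;    ' trailing semicolon: no extra line break
    Close #intFile

End Sub

Public Sub FetchOnePage()

    Dim objPicker As New MyBot.PagePicker
    Dim strPage As String

    strPage = objPicker.Pick("http://localhost/")

    ' Example path only; the folder must already exist
    SaveText "c:\UploadFolder\index.htm", strPage

End Sub
```

Keep in mind that without a Spider reference Pick only cleans each link and puts it straight back, so the saved page keeps its original links.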
