Can't dig deeper to pull data from a site because of certain barriers.



  • While working with some code recently, I got stuck pulling specific data from a certain
    website. On the sites I have worked with so far, the pages are usually nested in a regular way: in a
    multilayered site, every chain of links eventually ends at the same depth, and the elements there can be
    selected in the same way. On this particular site, however, when my code reaches the second layer, some
    links end there while others go even deeper, so the links that end and the links that go deeper cannot be
    reached through the same elements. A slight adjustment would let me finish my code; if all the links
    ended at the same depth, I could do it myself. I hope that explains what I am facing. My code is pasted
    below for your consideration.

    Option Explicit

    Const pageurl As String = "http://www.bjs.com"
    Public Sub parsehtml()
    Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument, hmm As New HTMLDocument, hmc As New HTMLDocument
    Dim topics As Object, topic As Object, fla As Object, z As String, zz As String, vla As Object
    Dim i As Long, x As Long, mla As Object, link As String, aa As String, qq As String, docs As Object
    Dim cc As String, posts As Object, dla As Object, m As Long, la As Object, validlinks As String, refinedlinks As String

    x = 2

    http.Open "GET", "http://www.bjs.com/", False
    http.send
    html.body.innerHTML = http.responseText

    Set topics = html.getElementsByClassName("shop-categories")(0)
    Set mla = topics.getElementsByTagName("a")
    
    For m = 0 To mla.Length - 1
    z = mla(m).getAttribute("href")
    link = pageurl & Mid(z, InStr(z, ":") + 1)
    Next m
    
        http.Open "GET", link, False
        http.send
        hmm.body.innerHTML = http.responseText
    
            Set posts = hmm.getElementsByClassName("brick")
            
            For Each fla In posts
            Set dla = fla.getElementsByTagName("a")(0)
                aa = dla.getAttribute("href")
                qq = IIf(Right(aa, 2) = ".1", aa, "")
                zz = pageurl & Mid(qq, InStr(qq, ":") + 1)
                cc = IIf(Right(zz, 2) = ".1", zz, "")
                If cc <> "" Then
                refinedlinks = cc
                End If
                validlinks = refinedlinks
    

    ' Now this produces valid links, but with some duplicates that I don't want. Moreover, some links go deeper and some end here, so the links have different depths.

    ' Cells(x, 1) = validlinks
    ' x = x + 1

            Next fla
    

    ' I'm stuck at this point. I can neither pull the links from here nor go deeper, because the elements are not the same for all the links.

                http.Open "GET", validlinks, False
                http.send
                hmc.body.innerHTML = http.responseText
    
                    Set topic = hmc.getElementsByClassName("category ng-scope")
    
                    For Each docs In topic
                    Set vla = docs.getElementsByTagName("a")(0)
                    Cells(x, 1) = vla.getAttribute("href")
                    x = x + 1
    
                    Next docs
    

    End Sub


  • administrators

    Actually, I did not understand anything. Can you post some sample links?



  • Hi Ranjith! Thanks for coming back. My description was complicated and not explicit. My problem will be solved only once I understand how to use the "HTMLDocument" variable in the right place and in the right manner. What I can't untangle is when to use the same "HTMLDocument" variable for all the "http" requests in a single subroutine and when to use different ones. In some cases, using different "HTMLDocument" variables gives me a bunch of duplicate values, and using the same one gives me run-time error 91. Are there hard and fast rules for using it? For your consideration, I have pasted below some code that I collected and modified to fit in a single subroutine. It works fine, but for the first "http" request I used "html" as the variable, for the second "hmm", and for the third "hmm" again. Why do they not always have to be the same, or always different? Thanks in advance.

    Const Pageurl As String = "http://www.wiseowl.co.uk/videos/"

    Sub parsehtml()
    Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument, hmm As New HTMLDocument
    Dim topics As Object, gist As Object
    Dim x As String, y As String, z As String
    Dim vid As Integer, vidcat As Object
    Dim vrow As Object, vrows As Object, m As String, n As String
    Dim vlink As Object, find As Object, q As Integer
    Dim t As String, r As String, s As String, L As String

    Range("A2").Select

    http.Open "GET", Pageurl, False
    http.send
    html.body.innerHTML = http.responseText
    Set http = Nothing
    ' Look up and down to find the "html" variable
    Set topics = html.getElementsByClassName("woMenuList")(0)
    Set gist = topics.getElementsByTagName("a")

    For vid = 1 To gist.Length - 1
    Set vidcat = gist(vid)
    x = vidcat.getAttribute("href")
    z = Pageurl & Mid(x, InStr(x, ":") + 9)

    http.Open "GET", z, False
    http.send
    hmm.body.innerHTML = http.responseText
    Set http = Nothing
    
    ' Look up and down to find the "hmm" variable
    

    Set find = hmm.getElementsByClassName("woPagingItem")
    For q = 0 To find.Length - IIf(find.Length > 0, 1, 0)
    If q > 0 Then
    t = find(q).innerText
    r = find(q).getAttribute("href")
    s = Pageurl & Mid(r, InStr(r, ":") + 9)

    http.Open "GET", s, False
    http.send
    hmm.body.innerHTML = http.responseText
    Set http = Nothing
    
    End If
    
    ' Look up and down to find the "hmm" variable
    
    Set vrows = hmm.getElementsByClassName("woVideoListRow")
    For Each vrow In vrows
        Set vlink = vrow.getElementsByTagName("a")(0)
        m = vlink.getAttribute("href")
        L = Pageurl & Mid(m, InStr(m, ":") + 9)
        n = vlink.innerText
        ActiveCell.Value = n
        ActiveCell.Offset(0, 1) = L
        ActiveCell.Offset(1, 0).Select
    Next vrow
    

    Next q
    Next vid
    End Sub


  • administrators

    Of course, it has to be a different variable in this case, because you are in a For loop. If the HTMLDocument is overwritten with some other document, then when the loop runs a second time it won't be able to access the main document.

    You are using only 3 variables, which is totally fine in this case.
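
    The point about the loop can be illustrated with a minimal sketch. It assumes the same site, class names, and link-trimming expression as the code posted above; the names TwoLevelSketch, listDoc, and pageDoc are hypothetical, and pagination is left out to keep it short. The outer document is filled once and left alone because the loop keeps reading links from it, while the inner document can safely be refilled on every pass.

    ```vb
    Option Explicit

    ' Assumes references to "Microsoft XML, v6.0" and "Microsoft HTML Object Library".
    Const BaseUrl As String = "http://www.wiseowl.co.uk/videos/"

    Sub TwoLevelSketch()
        Dim http As New MSXML2.XMLHTTP60
        Dim listDoc As New HTMLDocument   ' outer document: filled once, never overwritten
        Dim pageDoc As New HTMLDocument   ' inner document: refilled on every pass
        Dim links As Object, videoRows As Object, videoRow As Object, anchor As Object
        Dim href As String
        Dim i As Long, r As Long

        http.Open "GET", BaseUrl, False
        http.send
        listDoc.body.innerHTML = http.responseText

        ' This collection belongs to listDoc, so listDoc has to stay intact
        ' for as long as the loop below is still reading from it.
        Set links = listDoc.getElementsByClassName("woMenuList")(0).getElementsByTagName("a")

        r = 2
        For i = 1 To links.Length - 1      ' starts at 1, as in the posted code
            href = links(i).getAttribute("href")

            http.Open "GET", BaseUrl & Mid(href, InStr(href, ":") + 9), False
            http.send

            ' Overwriting pageDoc is harmless: nothing outside this pass
            ' of the loop still depends on its previous content.
            pageDoc.body.innerHTML = http.responseText

            Set videoRows = pageDoc.getElementsByClassName("woVideoListRow")
            For Each videoRow In videoRows
                Set anchor = videoRow.getElementsByTagName("a")(0)
                href = anchor.getAttribute("href")
                Cells(r, 1).Value = anchor.innerText
                Cells(r, 2).Value = BaseUrl & Mid(href, InStr(href, ":") + 9)
                r = r + 1
            Next videoRow
        Next i
    End Sub
    ```

    If the second response were written into listDoc instead of pageDoc, the links collection would no longer describe the category list, which is one plausible way to end up with run-time error 91 or with repeated rows.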



  • @ranjithkumar10
    I got you, Ranjith, but some confusion remains. I understand that if the For loop continues and I keep the same html variable throughout, then at some point in the loop it may get overwritten and lose the main links from which it is extracting information. What I can't understand is why the last two "HTMLDocument" variables are the same, because they are inside a loop as well; as you know, three For loops run in this subroutine. If I make the last two "HTMLDocument" variables different from each other, it produces messy results and the program crashes, which I don't want. This is where my other confusion begins. Thanks for responding; I hope to hear from you again. Have a nice time.

