[Solved] Can't get past certain barriers to dig deeper and pull data from a site.



  • While working on some code, I got stuck pulling specific data from a certain webpage. I have now solved it. Here is the full code:

    ' Requires a reference to the Microsoft HTML Object Library
    Sub bjscrawler()
        Const url = "http://www.bjs.com"
        ' html holds the start page; htm holds each linked page in turn
        Dim html As New HTMLDocument, htm As New HTMLDocument
        Dim topics As Object, post As Object, topic As Object, newlinks As String
        Dim links As Object, link As Object, data As Object
        Dim x As Long   ' output row counter

        ' Fetch the start page
        With CreateObject("MSXML2.serverXMLHTTP")
            .Open "GET", url, False
            .setRequestHeader "Content-Type", "text/xml"
            .send
            html.body.innerHTML = .responseText
        End With

        Set topics = html.getElementsByClassName("text")
        For Each post In topics
            Set topic = post.getElementsByTagName("a")(0)
            ' Take the part of href that follows its first ":" and prepend the site root
            newlinks = url & Split(topic.href, ":")(1)

            ' Fetch the linked page into the second document
            With CreateObject("MSXML2.serverXMLHTTP")
                .Open "GET", newlinks, False
                .send
                htm.body.innerHTML = .responseText
            End With

            ' Write every heading found on the linked page to column A
            Set links = htm.getElementsByClassName("rightView")
            For Each link In links
                Set data = link.getElementsByTagName("h1")(0)
                x = x + 1
                Cells(x, 1) = data.innerText
            Next link
        Next post
    End Sub

  • administrators

    Actually, I did not understand any of this. Can you post some sample links?



  • Hi Ranjith! Thanks for coming back. My description was complicated and not explicit enough. My problem will only be solved once I understand the basic usage of the "HTMLDocument" variable: where to use it and how.

    What I can't untangle is when to use the same "HTMLDocument" variable for all the "http" requests in a single subroutine and when to use different ones. In some cases, using different "HTMLDocument" variables gives me a bunch of duplicate values, and using the same one gives me run-time error 91. Are there hard and fast rules for this?

    For your consideration, I have pasted below the code I collected and modified to fit in a single subroutine. It works fine, but for the first "http" request I used "html" as the variable, for the second "htm", and for the third "htm" again. Why do they not always have to be the same, or always different? Thanks in advance.

    ' Requires a reference to the Microsoft HTML Object Library
    Sub bjscrawler()
        Const url = "http://www.bjs.com"
        ' html holds the start page; htm holds each linked page in turn
        Dim html As New HTMLDocument, htm As New HTMLDocument
        Dim topics As Object, post As Object, topic As Object, newlinks As String
        Dim links As Object, link As Object, data As Object
        Dim x As Long   ' output row counter

        ' Fetch the start page
        With CreateObject("MSXML2.serverXMLHTTP")
            .Open "GET", url, False
            .setRequestHeader "Content-Type", "text/xml"
            .send
            html.body.innerHTML = .responseText
        End With

        Set topics = html.getElementsByClassName("text")
        For Each post In topics
            Set topic = post.getElementsByTagName("a")(0)
            ' Take the part of href that follows its first ":" and prepend the site root
            newlinks = url & Split(topic.href, ":")(1)

            ' Fetch the linked page into the second document
            With CreateObject("MSXML2.serverXMLHTTP")
                .Open "GET", newlinks, False
                .send
                htm.body.innerHTML = .responseText
            End With

            ' Write every heading found on the linked page to column A
            Set links = htm.getElementsByClassName("rightView")
            For Each link In links
                Set data = link.getElementsByTagName("h1")(0)
                x = x + 1
                Cells(x, 1) = data.innerText
            Next link
        Next post
    End Sub

  • administrators

    Of course. It has to be a different variable in this case because you are inside a For loop. When the loop runs a second time, if the HTMLDocument has been overwritten with some other document, the loop won't be able to access the main document any more.

    You are using only 3 variables, which is totally fine in this case.
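
    The point above can be sketched as follows. This is a minimal illustration, not the poster's exact code: the class names "text" and "rightView" come from the code in the question, while the names listDoc, detailDoc and anchors are made up for the example. It assumes a reference to the Microsoft HTML Object Library. The collection topics belongs to listDoc, so listDoc has to stay untouched while the outer loop runs; detailDoc can be overwritten on every pass because nothing from the previous pass is still needed.

    ```vba
    Sub TwoDocumentSketch()
        Const url As String = "http://www.bjs.com"
        Dim listDoc As New HTMLDocument     ' start page, kept for the whole loop
        Dim detailDoc As New HTMLDocument   ' linked page, overwritten on every pass
        Dim topics As Object, post As Object, anchors As Object

        With CreateObject("MSXML2.ServerXMLHTTP")
            .Open "GET", url, False
            .send
            listDoc.body.innerHTML = .responseText
        End With

        Set topics = listDoc.getElementsByClassName("text")
        For Each post In topics
            Set anchors = post.getElementsByTagName("a")
            If anchors.Length > 0 Then
                With CreateObject("MSXML2.ServerXMLHTTP")
                    .Open "GET", url & Split(anchors(0).href, ":")(1), False
                    .send
                    ' Safe: topics still points into listDoc, which is unchanged.
                    detailDoc.body.innerHTML = .responseText
                    ' Assigning to listDoc.body.innerHTML here instead would
                    ' replace the elements that topics is iterating over.
                End With
                Debug.Print detailDoc.getElementsByClassName("rightView").Length
            End If
        Next post
    End Sub
    ```

    Overwriting the document that the outer loop depends on is a likely source of the run-time error 91 described in the question, although the thread does not confirm the exact cause.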



  • @ranjithkumar10
    I get you, Ranjith, but some confusion remains. I understand that if the For loop continues and I keep the same html variable for everything, then at some point in the loop it may get altered and can no longer access the main links from which it is extracting information. What I can't understand is why the last two "HTMLDocument" variables are the same, because they are inside a loop as well, and in this subroutine three For loops are running. In any case, if I make the last two "HTMLDocument" variables different from each other, the results come out messy and the program crashes, which I don't want. This is where my other confusion begins. Thanks for responding; I hope to hear from you again. Have a nice time.
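
    One way to avoid deciding case by case, offered here as an assumption rather than something proposed in the thread, is to give every request its own document. The helper name FetchDocument and its parameter pageUrl are hypothetical. The sketch assumes a reference to the Microsoft HTML Object Library. A document variable only has to be left alone while some loop is still walking a collection taken from it, and a fresh document per call guarantees that.

    ```vba
    ' Hypothetical helper: downloads pageUrl and returns it in a new HTMLDocument.
    Function FetchDocument(ByVal pageUrl As String) As Object
        ' A non-static local is created afresh on each call, so earlier
        ' documents returned by this function are never overwritten.
        Dim doc As New HTMLDocument
        With CreateObject("MSXML2.ServerXMLHTTP")
            .Open "GET", pageUrl, False
            .send
            doc.body.innerHTML = .responseText
        End With
        Set FetchDocument = doc
    End Function
    ```

    Inside a loop it could be called as Set detailDoc = FetchDocument(newlinks), with detailDoc declared As Object. The cost is one extra object per request, which is usually negligible next to the request itself.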

