Can't crawl deeper to pull data from a site because the links don't all end at the same level.
I recently got stuck while writing code to pull specific data from a certain
website. On the sites I have worked with so far, the documents are usually nested in a regular way:
on multilayered sites, every chain of links eventually ends at a level where the target elements
can all be selected the same way. On this particular site, however, when my code reaches the second
layer, some links end there and others go even deeper, and the links that end there cannot be selected
with the same elements as the ones that go deeper. A slight adjustment should be enough to finish my code;
if all the links ended at the same level, I could do it myself. I hope this explains what I'm facing.
My code is pasted below for reference.
Const pageurl As String = "http://www.bjs.com"

Public Sub parsehtml()
    Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument, hmm As New HTMLDocument, hmc As New HTMLDocument
    Dim topics As Object, topic As Object, fla As Object, z As String, zz As String, vla As Object
    Dim i As Long, x As Long, mla As Object, link As String, aa As String, qq As String, docs As Object
    Dim cc As String, posts As Object, dla As Object, m As Long, la As Object, validlinks As String, refinedlinks As String

    x = 2
    http.Open "GET", "http://www.bjs.com/", False
    http.send
    html.body.innerHTML = http.responseText

    Set topics = html.getElementsByClassName("shop-categories")(0)
    Set mla = topics.getElementsByTagName("a")
    For m = 0 To mla.Length - 1
        z = mla(m).getAttribute("href")
        link = pageurl & Mid(z, InStr(z, ":") + 1)
    Next m

    http.Open "GET", link, False
    http.send
    hmm.body.innerHTML = http.responseText

    Set posts = hmm.getElementsByClassName("brick")
    For Each fla In posts
        Set dla = fla.getElementsByTagName("a")(0)
        aa = dla.getAttribute("href")
        qq = IIf(Right(aa, 2) = ".1", aa, "")
        zz = pageurl & Mid(qq, InStr(qq, ":") + 1)
        cc = IIf(Right(zz, 2) = ".1", zz, "")
        If cc <> "" Then
            refinedlinks = cc
        End If
        validlinks = refinedlinks
        ' Now it produces valid links, but with some duplicates that I don't want.
        ' Moreover, some go deeper and some end here, so the links have different lengths.
        ' Cells(x, 1) = validlinks
        ' x = x + 1
    Next fla

    ' I'm stuck at this point. I can neither pull links from here nor go deeper,
    ' because the elements are not the same for all the links.
    http.Open "GET", validlinks, False
    http.send
    hmc.body.innerHTML = http.responseText

    Set topic = hmc.getElementsByClassName("category ng-scope")
    For Each docs In topic
        Set vla = docs.getElementsByTagName("a")(0)
        Cells(x, 1) = vla.getAttribute("href")
        x = x + 1
    Next docs
End Sub
Honestly, I could not follow the question. Can you post some sample links?
Hi Ranjith! Thanks for coming back. My description was complicated and not explicit. My problem will only be solved once I understand how to use the HTMLDocument variable in the right place and in the right manner. What I can't untangle is when to use the same HTMLDocument variable for all the HTTP requests in a single subroutine and when to use different ones. In some cases, using different HTMLDocument variables gives me a bunch of duplicate values, and using the same one gives me run-time error 91. Are there hard and fast rules for this? For reference, I've pasted below some code that I collected and modified to fit in a single subroutine. It works fine, but for the first HTTP request I used "html" as the variable, for the second "hmm", and for the third "hmm" again. Why don't they always have to be the same, or always different? Thanks in advance.
Const Pageurl As String = "http://www.wiseowl.co.uk/videos/"

Sub parsehtml()
    Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument, hmm As New HTMLDocument
    Dim topics As Object, gist As Object
    Dim x As String, y As String, z As String
    Dim vid As Integer, vidcat As Object
    Dim vrow As Object, vrows As Object, m As String, n As String
    Dim vlink As Object, find As Object, q As Integer
    Dim t As String, r As String, s As String, L As String

    Range("A2").Select
    http.Open "GET", Pageurl, False
    http.send
    html.body.innerHTML = http.responseText
    Set http = Nothing

    ' Look up and down to find the "html" variable
    Set topics = html.getElementsByClassName("woMenuList")(0)
    Set gist = topics.getElementsByTagName("a")
    For vid = 1 To gist.Length - 1
        Set vidcat = gist(vid)
        x = vidcat.getAttribute("href")
        z = Pageurl & Mid(x, InStr(x, ":") + 9)
        http.Open "GET", z, False
        http.send
        hmm.body.innerHTML = http.responseText
        Set http = Nothing

        ' Look up and down to find the "hmm" variable
        Set find = hmm.getElementsByClassName("woPagingItem")
        For q = 0 To find.Length - IIf(find.Length > 0, 1, 0)
            If q > 0 Then
                t = find(q).innerText
                r = find(q).getAttribute("href")
                s = Pageurl & Mid(r, InStr(r, ":") + 9)
                http.Open "GET", s, False
                http.send
                hmm.body.innerHTML = http.responseText
                Set http = Nothing
            End If

            ' Look up and down to find the "hmm" variable
            Set vrows = hmm.getElementsByClassName("woVideoListRow")
            For Each vrow In vrows
                Set vlink = vrow.getElementsByTagName("a")(0)
                m = vlink.getAttribute("href")
                L = Pageurl & Mid(m, InStr(m, ":") + 9)
                n = vlink.innerText
                ActiveCell.Value = n
                ActiveCell.Offset(0, 1) = L
                ActiveCell.Offset(1, 0).Select
            Next vrow
        Next q
    Next vid
End Sub
Of course it has to be a different variable in this case, because you are inside a For loop. If the HTMLDocument is overwritten with some other document, then on the second pass of the loop the code can no longer access the main document.
You are using only 3 variables, which is totally fine in this case.
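To make the point about the loop concrete, here is a minimal sketch, not a definitive implementation. It assumes references to "Microsoft XML, v6.0" and "Microsoft HTML Object Library". The names BaseUrl, listDoc, pageDoc and the class name "item" are hypothetical, and it assumes every href is site-relative, cleaned up the same way as in the code posted in this thread.

```vba
' Sketch only: the URL, variable names and class name are hypothetical.
Const BaseUrl As String = "http://www.example.com"

Sub CrawlTwoLevels()
    Dim http As New MSXML2.XMLHTTP60
    Dim listDoc As New HTMLDocument    ' loaded once, before the loop
    Dim pageDoc As New HTMLDocument    ' reloaded on every pass of the loop
    Dim links As Object, rows As Object, row As Object
    Dim i As Long, r As Long, href As String

    http.Open "GET", BaseUrl & "/", False
    http.send
    listDoc.body.innerHTML = http.responseText

    ' links is read from listDoc, so listDoc must stay untouched while we loop.
    Set links = listDoc.getElementsByTagName("a")
    r = 2
    For i = 0 To links.Length - 1
        href = links(i).getAttribute("href")
        ' Same clean-up as in the thread's code: keep what follows the first colon.
        http.Open "GET", BaseUrl & Mid(href, InStr(href, ":") + 1), False
        http.send
        ' Loading into pageDoc leaves listDoc, and therefore links, intact.
        pageDoc.body.innerHTML = http.responseText

        Set rows = pageDoc.getElementsByClassName("item")
        For Each row In rows
            Cells(r, 1) = row.innerText
            r = r + 1
        Next row
    Next i
End Sub
```

If the inner response were loaded into listDoc instead, the links collection would most likely stop pointing at the first page's anchors on the next pass, which is one plausible source of run-time error 91.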
I get your point, Ranjith, but I'm still confused about a few things. I understand that if I keep the same html variable throughout the For loop, at some point it may get overwritten, and the code can then no longer reach the main links it is extracting information from. What I can't understand is why the last two HTMLDocument variables are the same, since they are inside loops too (three For loops run in this subroutine). In any case, if I make the last two HTMLDocument variables different from each other, I get messy results and the program crashes, which I don't want. This is where my other confusion begins. Thanks for responding; I hope to hear from you again. Have a nice time.