HTML / script scraping of Google Maps using Hpricot (Ruby on Rails)
I am having a problem scraping the code I need in order to extract information for a web mashup I'm creating.
Basically, I am trying to scrape code from:
http://yellowpages.com.mt/Meranti-Ltd-In-Malta-Gozo;/Hair-Accessories;Hijjhkikke=Hiojhhfokje.aspx
This is just one of the pages I will need to scrape, so I cannot feed the program the code I need directly.
When I scrape the page using the following code (with Hpricot):
puts open(ypUrl, 'User-Agent'=>'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2') { |f| Hpricot(f) }
I notice that instead of the part of the code I need, I only see the script reference, namely:
<script type="text/javascript" src="http://maps.google.com/maps?file=api&v=2&sensor=false&key=ABQIAAAA8JYIIyGmC1BLOU85GKKkPRSNQenRT-s-Gs-9sYb3ZSBhRRTdcRTMq3zWEID1E35uXl9bdQKIPQIjNQ"></script><title>
Beautimport Ltd (Balmain Hair Extensions) in Malta | Yellow Pages?? (Malta) Ltd | YellowPages.com.mt
This is also what I see when I view source in Firefox. However, when I hover over the elements in Firebug, I can get an XPath, which unfortunately does not work because the fetched HTML still contains only the script reference (I'm not sure I'm explaining this correctly). I really need all the code that the script generates on the page, which so far is only viewable in Firebug. I need it so that I can extract the following (taken from Firebug by hovering over the Google icon on the map):
<a title="Click to see this area on Google Maps" href="http://maps.google.com/maps?ll=35.88805,14.46627&spn=0.006988,0.015922&z=16&key=ABQIAAAA8JYIIyGmC1BLOU85GKKkPRSNQenRT-s-Gs-9sYb3ZSBhRRTdcRTMq3zWEID1E35uXl9bdQKIPQIjNQ&sensor=false&mapclient=jsapi&oi=map_misc&ct=api_logo" target="_blank">
This gives the following XPath (// denotes a tbody), but as I mentioned, because Hpricot does not receive the entire generated code, the XPath is pretty useless: it cannot reach the element.
/html/body/form/table//tr/td/div/table[2]//tr[2]/td[2]/div/div[2]/table//tr/td/div/div[2]/a
In this manner I would be able to extract the Lng and Lat, which I really need for my project. I don't know how to go about this in another way using Hpricot, as it is not giving me all the code I need. Any help would be greatly appreciated.
This was a fun one. It can be done, but it's going to take more than Hpricot. I noticed while
sniffing the traffic that a web service is being called to populate the latitude and longitude. Here's what
you can do to get to that information:
Scrape the site like you're normally doing, but look for a call to the LoadMap JavaScript
function. The line will look something like:
Parse the id out and call the web service. This will end up looking something like:
You will see the latitude and longitude in the output:
Hope this helps. Good luck!
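As a rough sketch of the "parse the id out" step only: the snippet below pulls the first argument out of a LoadMap call with a regular expression. The page fragment, the id value, and the shape of the call are all assumptions made up for illustration, and `load_map_id` is a hypothetical helper name. The web service call itself is left out, since its address is not given here.

```ruby
# Hypothetical fragment of the fetched page. The real argument format of
# LoadMap is not known here, so a single quoted or bare id is assumed.
html = <<~HTML
  <body onload="LoadMap('12345');">
  <div id="map"></div>
  </body>
HTML

# Return the first argument of the first LoadMap(...) call, or nil if the
# page contains no such call. Quotes around the argument are optional.
def load_map_id(html)
  match = /LoadMap\(\s*['"]?([^'",)\s]+)['"]?/.match(html)
  match && match[1]
end

id = load_map_id(html)
puts id ? "LoadMap id: #{id}" : 'no LoadMap call found'
```

The same method can be fed the string that `open(...)` returns, so it sits alongside the normal Hpricot scrape rather than replacing it.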
This type of screen scraping won't work because you're trying to grab elements that are added to the page dynamically after the page's HTML has been sent to the browser. In this case, Hpricot plays the role of the browser, and all it sees is the content as sent from the server, rather than the content after the page's JavaScript has run.
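To make that concrete, here is a small sketch that uses a plain string check in place of Hpricot so it runs without extra gems. The HTML is an illustrative stand-in trimmed from the snippet in the question, and `contains_map_link?` is a made-up helper name.

```ruby
# Roughly what the server sends: the Google Maps loader is referenced,
# but the map link itself is not in the markup. (Illustrative HTML only.)
server_html = <<~HTML
  <html><head>
  <script type="text/javascript" src="http://maps.google.com/maps?file=api&v=2&sensor=false"></script>
  <title>Example listing</title>
  </head><body><div id="map"></div></body></html>
HTML

# The link that Firebug shows is created by the script at run time,
# so it can only be found in markup taken from a live browser.
def contains_map_link?(html)
  html.include?('Click to see this area on Google Maps')
end

puts contains_map_link?(server_html)
```

Any parser that only reads the server response will come up empty in the same way, whatever XPath is used.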
The reason that Firebug can see the elements you're trying to grab is that Firebug analyzes the current state of the page in the browser, which includes everything the Google Maps script has added dynamically.
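If you do get hold of the rendered link, for example by copying the href out of Firebug as in the question, the coordinates sit in its `ll` query parameter. A minimal sketch using only Ruby's standard library follows; `lat_lng_from_maps_href` is a made-up helper name, and the `key` parameter is left out of the sample URL for brevity.

```ruby
require 'uri'

# Extract [latitude, longitude] as Floats from the `ll` query parameter of
# a Google Maps link. Returns nil when the parameter is missing or malformed.
def lat_lng_from_maps_href(href)
  query = URI.parse(href).query
  return nil if query.nil?

  pair = URI.decode_www_form(query).assoc('ll')
  return nil if pair.nil?

  lat, lng = pair[1].split(',', 2)
  return nil if lat.nil? || lng.nil?

  [Float(lat), Float(lng)]
rescue URI::InvalidURIError, ArgumentError
  nil
end

# The href from the question, without its key parameter.
href = 'http://maps.google.com/maps?ll=35.88805,14.46627&spn=0.006988,0.015922' \
       '&z=16&sensor=false&mapclient=jsapi&oi=map_misc&ct=api_logo'

lat, lng = lat_lng_from_maps_href(href)
puts "lat=#{lat} lng=#{lng}"
```

This only helps once the generated markup is available; it does not change what Hpricot receives from the server.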