HTML / script scraping of Google Maps using Hpricot (Ruby on Rails)

Posted on 2024-08-10 11:57:45

I am having a problem scraping the code I need in order to extract information for a web mashup I'm creating.

Basically, I am trying to scrape code from:

http://yellowpages.com.mt/Meranti-Ltd-In-Malta-Gozo;/Hair-Accessories;Hijjhkikke=Hiojhhfokje.aspx

This is just one of the pages I will need to scrape, so I cannot feed the program the code I need directly =/.

When I scrape the page using the following code (in Hpricot)

puts open(ypUrl, 'User-Agent'=>'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2') { |f| Hpricot(f) }

I notice that instead of the part of the code I require, I only see the script reference, namely

<script type="text/javascript" src="http://maps.google.com/maps?file=api&v=2&sensor=false&key=ABQIAAAA8JYIIyGmC1BLOU85GKKkPRSNQenRT-s-Gs-9sYb3ZSBhRRTdcRTMq3zWEID1E35uXl9bdQKIPQIjNQ"></script><title>

Beautimport Ltd (Balmain Hair Extensions) in Malta | Yellow Pages?? (Malta) Ltd | YellowPages.com.mt

This is also what I see when I view source in Firefox. However, when I hover over the elements in Firebug, I am able to get an XPath, which unfortunately does not work because the script reference stays as it is in the fetched page (I'm not sure if I'm explaining this correctly). I really need all the code that the script generates on the page (which so far is only viewable in Firebug). I need this so that I can extract the following (taken from Firebug by hovering over the Google icon on the map):

<a title="Click to see this area on Google Maps" href="http://maps.google.com/maps?ll=35.88805,14.46627&spn=0.006988,0.015922&z=16&key=ABQIAAAA8JYIIyGmC1BLOU85GKKkPRSNQenRT-s-Gs-9sYb3ZSBhRRTdcRTMq3zWEID1E35uXl9bdQKIPQIjNQ&sensor=false&mapclient=jsapi&oi=map_misc&ct=api_logo" target="_blank">

which gives the following XPath (// denotes a tbody), but as I mentioned, since Hpricot is not giving me the entire code, the XPath is pretty useless because it can't reach the element!

/html/body/form/table//tr/td/div/table[2]//tr[2]/td[2]/div/div[2]/table//tr/td/div/div[2]/a

In this manner I would be able to extract the Lng and Lat, which I really require for my project. I really don't know how to go about this in another manner using Hpricot, as it's not giving me all the code I require. Any help would be extremely appreciated.
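For reference, once an href like the one above is available as a string, the coordinates sit in its ll query parameter as "lat,lng". The following is only a minimal sketch using Ruby's standard library. The helper name lat_lng_from_href is hypothetical, and the sample URL is a shortened copy of the one quoted above with the key parameter left out.

```ruby
require 'uri'

# Hypothetical helper: read [lat, lng] from the "ll" query parameter
# of a Google Maps link. Returns nil when the parameter is absent.
def lat_lng_from_href(href)
  query = URI.parse(href).query
  return nil unless query

  # decode_www_form yields [name, value] pairs; pick the "ll" pair.
  pair = URI.decode_www_form(query).assoc('ll')
  return nil unless pair

  pair[1].split(',').map { |part| Float(part) }
end

# Shortened sample URL (assumption: same shape as the link seen in Firebug).
href = 'http://maps.google.com/maps?ll=35.88805,14.46627&spn=0.006988,0.015922&z=16&sensor=false'
lat, lng = lat_lng_from_href(href)
puts lat
puts lng
```

This only helps if the rendered link can be obtained at all, which is exactly what the static page source does not provide.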

魄砕の薆 2024-08-17 11:57:45

This was a fun one. It can be done, but it's going to take more than Hpricot. While sniffing
the page's traffic, I noticed that a web service is called to populate the latitude and longitude.
Here's what you can do to get to that information:

Scrape the site like you're normally doing, but look for a call to the LoadMap JavaScript
function (LoadMapByDetail here). The line will look something like:

<script type='text/javascript'>LoadMapByDetail(1668154, 0, 1)</script>
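As a rough illustration of that lookup, the first argument of the call can be pulled out of the fetched HTML with a regular expression. This is a sketch only: the helper name detail_id_from is hypothetical, and it assumes the ID is always the first numeric argument, as in the line above.

```ruby
# Hypothetical helper: return the first numeric argument of a
# LoadMapByDetail(...) call found in the HTML, or nil if there is none.
def detail_id_from(html)
  match = html.match(/LoadMapByDetail\(\s*(\d+)/)
  match && match[1]
end

# Sample input copied from the script line shown above.
html = "<script type='text/javascript'>LoadMapByDetail(1668154, 0, 1)</script>"
puts detail_id_from(html)
```

In practice the html string would be the body of the page you already fetch for Hpricot. The regex works on the raw text, so it does not depend on how the parser treats script contents.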

Parse the ID out and call the web service. This will end up looking something like:

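require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'soap/wsdlDriver'  # soap4r: bundled with Ruby 1.8, a separate gem on later versions

WSDL_URL = "http://yellowpages.com.mt/Web_Service/SearchMap.asmx?WSDL"

# Build a SOAP client from the service's WSDL
soap = SOAP::WSDLDriverFactory.new(WSDL_URL).create_rpc_driver
# Pass the ID parsed from the LoadMapByDetail(...) call
response = soap.GetCoordByDetail(:mainDetailID => '1668154', :type => '1')
soap.reset_stream
# Print each value in the result
response.getCoordByDetailResult.anyType.each { |x| puts x.anyType }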

You see the latitude and longitude in the output: