Most efficient way to parse and reformat data using Nokogiri & Sinatra

Posted on 2024-09-30 08:54:15

I'm working on reformatting the HTML output of a search query for an inventory manager used by a number of car dealers. There's no direct DB access and no information available from the service's creators, so I decided to try parsing and reformatting the data with Nokogiri and generating new result pages based on the search query.

On first load of the page, I'm just using a default search to generate the first results.

For the search to work, I'm sending the query to a URL like this:

post '/search/?:search_query' do
  url = "http://domain.com/v/?DealerId=" + settings.dealer_id + "&maxrows=10&#{params[:search_query]}"
  doc = Nokogiri::HTML(open(url))
  doc.css("td:nth-child(5) .ForeColor4").each do |msrp|
    session["msrp"] = msrp.inner_html
  end  
  doc.css("td:nth-child(4) .ForeColor4").each do |price|
    session["price"] = price.inner_html
  end
  erb :index    
end

I know there's got to be a smarter way to do this.

Edit:

An example URL to request data:

http://domain.com/?DealerId=1234&object=list&lang=en&MAKE=&MODEL=&maxrows=50&MinYear=&MaxYear=2011&Type=N&MinPrice=&MaxPrice=&STYLE=&ExtColor=&MaxMiles=&StockNo=

A description of the HTML generated:

Unfortunately, it's old code that's almost entirely table-based, uses inline styles, and lacks classes or ids in most areas.

An example of a CSS selector:

td:nth-child(5) .ForeColor4

An XPath selector:

//td[(((count(preceding-sibling::*) + 1) = 5) and parent::*)]//*[contains(concat( " ", @class, " " ), concat( " ", "ForeColor4", " " ))]

I've also looked at Mechanize and Hpricot as possibilities, but I don't know which tool is best for the job, as I haven't attempted screen scraping before.

Summary: I want to pull the data from the HTML, temporarily store it in a variable / session / cookie (data changes several times per day), and then be able to reformat the output into my own HTML/CSS styling.

谜兔 2024-10-07 08:54:15

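Personally, I would decouple the scraping from user actions. Have a separate process that scrapes the site and populates your own database. This will improve performance dramatically, because fetching the page, building the DOM, parsing it, and then rendering the output on every request is going to be slow.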
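One way to picture this separation, as a minimal sketch rather than a definitive design: a background job writes the scraped results to a local JSON snapshot, and the web request only reads that file. The file name, the `write_snapshot` and `read_snapshot` helpers, and the vehicle fields are all assumptions made for illustration; a database, as suggested above, would fill the same role.

```ruby
require "json"
require "time"

# Hypothetical location of the shared snapshot file.
SNAPSHOT_PATH = "inventory_snapshot.json"

# Called by the background job (cron, a rake task, a worker) after it has
# scraped the dealer site. `vehicles` is an array of hashes such as
# { "price" => "$21,500", "msrp" => "$23,000" } (made-up values).
def write_snapshot(vehicles, path = SNAPSHOT_PATH)
  payload = { "fetched_at" => Time.now.utc.iso8601, "vehicles" => vehicles }
  tmp = "#{path}.tmp"
  File.write(tmp, JSON.generate(payload))
  # Rename within the same directory so readers never see a half-written file.
  File.rename(tmp, path)
end

# Called by the web request. It only reads the local file, so no HTTP fetch
# or HTML parsing happens while the user waits.
def read_snapshot(path = SNAPSHOT_PATH)
  return { "fetched_at" => nil, "vehicles" => [] } unless File.exist?(path)

  JSON.parse(File.read(path))
end
```

A scheduled job would call `write_snapshot` after each scrape, and the Sinatra route would call `read_snapshot` and hand the vehicles to the template. Since the data reportedly changes only a few times per day, a refresh every hour or so would probably be enough.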

三生一梦 2024-10-07 08:54:15

  doc.css("td:nth-child(5) .ForeColor4").each do |msrp|
    session["msrp"] = msrp.inner_html
  end  
  doc.css("td:nth-child(4) .ForeColor4").each do |price|
    session["price"] = price.inner_html
  end

You might want to use Nokogiri's at_css() method instead of the regular css(). at_css() finds the first occurrence of your target and only returns that one node, similar to doing a .first against the nodeset that .css() returns.

That would simplify your lookups to this form:

session["msrp"] = doc.at_css("td:nth-child(5) .ForeColor4").inner_html

While testing, I'd probably append something like rescue 'msrp lookup failed' to the end of each lookup, just in case a selector is wrong. Otherwise the code fails with a NoMethodError when inner_html() is called on nil, which is what at_css() returns when nothing matches. The rescue is just a slightly friendlier way to debug.

Otherwise your lookups seem to be decent.
