Finding repeat patterns in web pages in Ruby
I am trying to find a way to detect repeated patterns in web pages so that I can extract the content into my database.
EDIT: I don't know what the repeated pattern is beforehand, so I can't just search for a given pattern with a regex or something similar.
For example, say you have 10 sites selling cars, and the sites are all different. On each site, the cars are listed in the HTML in a repeated way down the page.
The other sites list their cars in a different way, but each with its own repeated pattern.
Does anyone know how to do this, or have any experience with this sort of thing?
I love Ruby, so I was hoping to do it in Ruby. Does anyone have or know of any libs/gems that may help me out?
Comments (2)
Rick, machine pattern matching is a complicated topic, and not something you'll find a good out-of-the-box library for in Ruby.
Kyle's answer was a start. Once you have fetched the page with Ruby, the typical technology for this would be XPath, the "XML Path Language".
Using XPath you can write simple selectors that extract every item matching a pattern. For instance, every link in an HTML document would be //a, every h1 would be //h1, and every image directly inside a div, where the image has the class "car", would be something like //div/img[@class="car"].
The result of an XPath query is an enumerable list of the matching items. You can then query for sub-elements, get the content() of the elements, and build relationships to extract the data you need.
The go-to library for Ruby is called Nokogiri, and it is available as a gem. The direct documentation is a little weak, but it's all covered there if you know what to look for.
Some libraries for Ruby combine crawling with an easy way to access the underlying HTML/XML as a Nokogiri document. One such example is Anemone, which is a "framework for building web spiders in Ruby", and I can recommend it very highly.
In Ruby, if you want to get the text of a web page, all you have to do is use the Net::HTTP class. Its get method returns a string representation of the web page.
You're probably going to want to use some sort of XML parser after that to build a model of the page and navigate over it. I've heard good things about Hpricot.
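For what it's worth, the fetching step might look roughly like this. fetch_page is just a name I picked, and the URL in the comment is a placeholder. Note that Net::HTTP.get does not follow redirects or raise on error statuses, so this is only a bare-bones sketch.

```ruby
require "net/http"
require "uri"

# Fetch a page and return its body as a String.
# No redirect handling or status checking is done here.
def fetch_page(url)
  Net::HTTP.get(URI.parse(url))
end

# Example usage (placeholder URL):
#   html = fetch_page("http://example.com/cars")
#   puts html.length
```

The returned string is what you would then hand to a parser.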