Scraping URLs from the web
<a href="http://www.utoronto.ca/gdrs/" title="Rehabilitation Science"> Rehabilitation Science</a>
For the example above, I want to get the department name "Rehabilitation Science" and its homepage URL "http://www.utoronto.ca/gdrs/" at the same time.
Could someone please suggest some smart regular expressions that would do the job for me?
There's no reason to use regex to do this at all. Here's a solution using Nokogiri, which is the usual Ruby HTML/XML parser:
This returns a hash with the URLs as keys and the related content of the <a> tag as the values. That means you'll only capture unique URLs, throwing away duplicates. If you want all URLs, collect the pairs into an array instead of a hash.
I used the CSS accessor 'a' to locate the tags. I could use 'a[href]' if I wanted to grab only links, ignoring anchors.
Regexes are very fragile when dealing with HTML and XML because the markup formats are too freeform; they can vary in their format while remaining valid, especially HTML, which can vary wildly in its "correctness". If you don't own the generation of the file being parsed, then your regex-based code is at the mercy of whoever does generate it; a simple change in the file can break the pattern badly, resulting in a continual maintenance headache.
A parser, because it actually understands the internal structure of the file, can withstand those changes. Notice that I deliberately created some malformed HTML but the code didn't care. Compare the simplicity of the parser version vs. a regex solution and think of long term maintainability.
I would suggest using an HTML parser, as @mrk suggested, then taking the result you got back and putting it through a regex searcher. I like to use Rubular. This will show you what the regex is capturing, and you can avoid getting unwanted results. I found that the regex /http[^"]+/ works well in a situation like this, because it grabs the entire URL even if there is no "www." and you avoid capturing the quotes.
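As a quick illustration of the pattern this answer proposes (the sample HTML string is taken from the question):

```ruby
html = %(<a href="http://www.utoronto.ca/gdrs/" title="Rehabilitation Science"> Rehabilitation Science</a>)

# /http[^"]+/ matches from "http" up to, but not including, the next quote,
# so the closing quote of the href attribute terminates the match.
url = html[/http[^"]+/]
puts url  # → http://www.utoronto.ca/gdrs/
```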
If you're building a spider, then Ruby's Mechanize is a great choice. To fetch a page and extract the links:
The documentation and the guide (that I linked to) lay out a lot of what you'll probably want to do. Using regular expressions to parse HTML (or XML) is notoriously tricky and error-prone. Using a full parser (as others have suggested) will save you effort and make your code more robust.
Trying not to overcomplicate this:
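The one-liner this answer contained is gone; a plausible minimal version, capturing the href and title values from the question's sample HTML with a single scan, could be:

```ruby
html = %(<a href="http://www.utoronto.ca/gdrs/" title="Rehabilitation Science"> Rehabilitation Science</a>)

# One scan with two capture groups: the href value and the title value.
url, name = html.scan(/href="([^"]+)" title="([^"]+)"/).first
puts url   # → http://www.utoronto.ca/gdrs/
puts name  # → Rehabilitation Science
```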
Here is my Ruby approach:
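The regex itself was lost when the page was scraped; a sketch matching the answer's description (scan returning [url, title] pairs, with the u switch for UTF-8) might be:

```ruby
html = %(<a href="http://www.utoronto.ca/gdrs/" title="Rehabilitation Science"> Rehabilitation Science</a>)

# The /u switch tells Ruby to treat the pattern as UTF-8,
# which avoids encoding errors on pages with non-ASCII content.
pairs = html.scan(/<a href="([^"]+)" title="([^"]+)"/u)
puts pairs.inspect  # → [["http://www.utoronto.ca/gdrs/", "Rehabilitation Science"]]
```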
This will give you an array of arrays, in which the first item of each inner array is the URL and the second is the title. Hope this helps. Note the u switch on the regex; it's there to avoid encoding problems.