Using Nokogiri/XPath to extract some text from a huge HTML file
I am scraping a website and trying to pull certain elements out of the HTML. The pages I am scraping contain script tags with a lot of data in them, but I am only interested in one part inside those tags. The line basically looks like this:
'image':'http://ut5.example.com/t/231/3_b_643435.jpg',
There is other content above and below it. The value is different on every page, except for the domain and some of the subfolders that store the images.
How would I go about searching the source for this specific line and extracting just the URL? I think I need a regular expression, since the URLs are dynamic.
The `gsub` method is close to what I want, since it accepts a /regex/. But I don't want to replace anything; I just want to find that URL in the source using a /regex/ and copy it.
Comments (1)
Based on your comments, I guess this is what you're looking for.
Example: http://jsfiddle.net/Km9ZB/
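Since the question is about Ruby, here is a minimal sketch of the same idea using only a regular expression, with no replacing involved. It is not taken from the linked fiddle. The sample `html` string and the helper names `extract_image_urls` and `extract_image_url` are made up for illustration, and the pattern assumes the value is always written as `'image':'...'` with single quotes, as in the line quoted in the question.

```ruby
# Hypothetical sample of a page's source; in practice this would be the
# HTML string fetched by the scraper.
html = <<~HTML
  <html><head>
  <script type="text/javascript">
    var item = {
      'title':'Some product',
      'image':'http://ut5.example.com/t/231/3_b_643435.jpg',
      'price':'9.99'
    };
  </script>
  </head><body></body></html>
HTML

# Returns every URL that appears as the value of an 'image' key.
# The capture group takes everything up to the closing single quote.
# (Hypothetical helper, assumes single-quoted keys and values.)
def extract_image_urls(source)
  source.scan(/'image'\s*:\s*'([^']+)'/).flatten
end

# Returns the first such URL, or nil when the pattern is not found.
def extract_image_url(source)
  extract_image_urls(source).first
end

image_url = extract_image_url(html)
puts image_url
```

`String#scan` searches without modifying the string, which is the "find and copy" behaviour `gsub` lacks. If the page is already parsed with Nokogiri, the same helper could probably be applied to the text of each script node (for example, the strings returned by `doc.xpath('//script').map(&:text)`) rather than to the whole source.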