Using Nokogiri/XPath to extract some text from a huge HTML file

Posted on 2025-01-07 07:51:44

I am scraping a website and trying to pull certain elements out of the HTML. The pages I am scraping contain script tags with a lot of data in them, but I am only interested in one part inside those tags. The line looks basically like this:

'image':'http://ut5.example.com/t/231/3_b_643435.jpg',

There is other content above and below it. The URL is different for each page source, except for the domain and some of the subfolders that store the images.

How would I go about searching the source for this specific line and pulling out just the URL? I feel I need to use a regular expression, since the URLs are dynamic.

The "gsub" method is close to what I want, since it can take a /regex/. But I don't want to replace anything; I just want to find that URL in the source using a /regex/ and copy it.
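
For illustration, here is a minimal sketch of the "find and copy, don't replace" idea using plain Ruby strings. It assumes the script text is already available as a string and that the line always has the shape 'image':'...' with single quotes. The names SAMPLE_SCRIPT, IMAGE_URL_PATTERN, extract_image_url and extract_all_image_urls are made up for this example, and the extra keys in the sample data are invented filler.

```ruby
# Hypothetical sample standing in for the text of one <script> tag.
# Only the 'image' line comes from the question; the rest is filler.
SAMPLE_SCRIPT = <<~JS
  var item = {
    'id':'643435',
    'image':'http://ut5.example.com/t/231/3_b_643435.jpg',
    'title':'example'
  };
JS

# Matches 'image':'<url>' and captures everything between the second
# pair of single quotes. Assumes the URL itself contains no single quote.
IMAGE_URL_PATTERN = /'image'\s*:\s*'([^']+)'/

# Returns the first captured URL, or nil when the pattern is absent.
# String#[] with a regex and a capture index reads without replacing.
def extract_image_url(source)
  source[IMAGE_URL_PATTERN, 1]
end

# Returns every captured URL in the source as an array of strings.
def extract_all_image_urls(source)
  source.scan(IMAGE_URL_PATTERN).flatten
end

puts extract_image_url(SAMPLE_SCRIPT)
```

If the page has already been parsed with Nokogiri, the same functions could probably be fed the script contents, for example the joined text of doc.css('script'), instead of the raw page source. That part is untested here and depends on how the real pages are structured.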
