JavaScript: REGEX to change all relative URLs to absolute
I'm currently creating a Node.js web scraper/proxy, but I'm having trouble parsing relative URLs found in the scripting part of the source. I figured a regex would do the trick,
although I don't know how I would achieve this.
Is there any way I can go about this?
Also, I'm open to an easier way of doing this, as I'm quite baffled about how other proxies parse websites. I figured that most are just glorified site scrapers that can read a site's source and relay all links/forms back to the proxy.
Advanced HTML string replacement functions
Note for OP, because he requested such a function: change base_url to your proxy's base URL in order to achieve the desired results.
Two functions will be shown below (the usage guide is contained within the code). Make sure that you don't skip any part of the explanation of this answer to fully understand the function's behaviour.
rel_to_abs(url) - This function returns absolute URLs. When an absolute URL with a commonly trusted protocol is passed, it will immediately return this URL. Otherwise, an absolute URL is generated from base_url and the function argument. Relative URLs are correctly parsed (../ ; ./ ; . ; //).
replace_all_rel_by_abs - This function will parse all occurrences of URLs which have a significant meaning in HTML, such as CSS url(), links and external resources. See the code for a full list of parsed instances. See this answer for an adjusted implementation to sanitise HTML strings from an external source (to embed in the document).
rel_to_abs - Parsing relative URLs
Cases / examples:
- http://foo.bar : Already an absolute URL, thus returned immediately.
- /doo : Relative to the root. Returns the current root + provided relative URL.
- ./meh : Relative to the current directory.
- ../booh : Relative to the parent directory.
The function converts relative paths to ../, and performs a search-and-replace (http://domain/sub/anything-but-a-slash/../me to http://domain/sub/me).
replace_all_rel_by_abs - Convert all relevant occurrences of URLs
URLs inside script instances (<script>, event handlers) are not replaced, because it's near-impossible to create a fast-and-secure filter to parse JavaScript.
This script is served with some comments inside. Regular expressions are dynamically created, because an individual RE can have a size of 3000 characters. <meta http-equiv=refresh content=.. > can be obfuscated in various ways, hence the size of the RE.
A short summary of the private functions:
- rel_to_abs(url) - Converts relative / unknown URLs to absolute URLs.
- replace_all_rel_by_abs(html) - Replaces all relevant occurrences of URLs within a string of HTML by absolute URLs.
- ae - Any Entity - Returns a RE pattern to deal with HTML entities.
- by - Replace By - This short function requests the actual URL replacement (rel_to_abs). This function may be called hundreds, if not thousands, of times. Be careful not to add a slow algorithm to this function (customisation).
- cr - Create Replace - Creates and executes a search-and-replace. Example: href="..." (within any HTML tag).
- cri - Create Replace Inline - Creates and executes a search-and-replace. Example: url(..) within the style attribute within HTML tags.
Test case
Open any page, and paste the following bookmarklet in the location bar:
The injected code contains the two functions, as defined above, plus the test case, shown below. Note: The test case does not modify the HTML of the page, but shows the parsed results in a textarea (optionally).
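The idea behind cr and by (find URL-bearing attributes with a regular expression, then hand each URL to a resolver) can be sketched in a much reduced form. This is not the answer's code: `BASE_URL`, `resolveUrl` and `absolutizeAttributes` are hypothetical names, the resolver relies on the `URL` class that Node.js exposes globally rather than on rel_to_abs, and only quoted `href`, `src` and `action` values are handled.

```javascript
// Hypothetical base URL of the page being rewritten (an assumption).
const BASE_URL = 'http://www.example.com/dir/page.html';

// Resolve one URL against BASE_URL; return it unchanged if parsing fails.
function resolveUrl(url) {
  try {
    return new URL(url, BASE_URL).href;
  } catch (err) {
    return url;
  }
}

// Rewrite quoted href/src/action values inside HTML tags.
// Deliberately simplified: unquoted values, HTML entities, CSS url(...)
// and <meta http-equiv=refresh> are ignored, and a ">" inside an
// attribute value will confuse the tag pattern.
function absolutizeAttributes(html) {
  const tag = /<[^>]+>/g;
  const attr = /(\s(?:href|src|action)\s*=\s*)(["'])(.*?)\2/gi;
  return html.replace(tag, (wholeTag) =>
    wholeTag.replace(attr, (match, prefix, quote, url) =>
      prefix + quote + resolveUrl(url) + quote));
}

console.log(absolutizeAttributes('<a href="../x" class="c">t</a>'));
// prints: <a href="http://www.example.com/x" class="c">t</a>
```

Because only text inside tags is inspected, the contents of script elements pass through untouched, which matches the limitation stated in the answer.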
See also:
This is Rob W's answer "Advanced HTML string replacement functions" from this thread, plus some refactoring from me to make JSLint happy.
I should have posted it as a comment on that answer, but I don't have enough reputation points.
A reliable way to convert URLs from relative to absolute is to use the built-in url module.
Example:
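A minimal sketch of this approach, assuming the WHATWG `URL` class (exported by the url module and also available as a global in current Node.js versions). The base URL and the `toAbsolute` helper are hypothetical and chosen only for illustration.

```javascript
// Hypothetical URL of the page being proxied (an assumption).
const base = 'http://domain/sub/dir/page.html';

// new URL(input, base) applies the standard URL resolution rules.
function toAbsolute(relative, baseUrl) {
  return new URL(relative, baseUrl).href;
}

console.log(toAbsolute('http://foo.bar', base)); // http://foo.bar/
console.log(toAbsolute('/doo', base));           // http://domain/doo
console.log(toAbsolute('./meh', base));          // http://domain/sub/dir/meh
console.log(toAbsolute('../booh', base));        // http://domain/sub/booh
console.log(toAbsolute('//cdn.test/x.js', base)); // http://cdn.test/x.js
```

Letting the parser do the resolution avoids hand-written handling of ../ and ./ segments.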
If you use a regex to find all non-absolute URLs, you can then just prefix them with the current URL and that should be it.
The URLs you need to fix are the ones which don't start with either a / or http(s):// (or other protocol markers, if you care about them).
As an example, let's say you're scraping http://www.example.com/. If you encounter a relative URL, let's say foo/bar, you would simply prefix the URL being scraped to it, like so: http://www.example.com/foo/bar
For a regex to scrape the URLs from the page, there are probably plenty of good ones available if you google a bit, so I'm not going to start inventing a poor one here :)
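The prefixing step could look roughly like the sketch below. The `ABSOLUTE` pattern and the `prefixRelative` helper are hypothetical, not part of the answer. The sketch also covers root-relative URLs (starting with /), on the assumption that a proxy needs those rewritten with the scheme and host as well. It does not collapse ../ segments.

```javascript
// Matches URLs that already carry a scheme (http:, https:, mailto:, ...)
// or are protocol-relative (//host/path).
const ABSOLUTE = /^(?:[a-z][a-z0-9+.-]*:|\/\/)/i;

function prefixRelative(url, currentUrl) {
  if (ABSOLUTE.test(url)) {
    return url; // already absolute: leave as-is
  }
  const current = new URL(currentUrl);
  if (url.startsWith('/')) {
    // Root-relative: prefix only the scheme and host.
    return current.origin + url;
  }
  // Path-relative: prefix the directory part of the current URL.
  const dir = current.pathname.slice(0, current.pathname.lastIndexOf('/') + 1);
  return current.origin + dir + url;
}

console.log(prefixRelative('foo/bar', 'http://www.example.com/'));
// prints: http://www.example.com/foo/bar
```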
From a comment by Rob W above about the base tag I wrote an injection function:
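A minimal sketch of what such an injection might look like; this is not the answerer's original code. `injectBase` is a hypothetical helper that inserts a `<base href>` element right after the opening `<head>` tag, so that the browser resolves relative URLs against the given base without each URL being rewritten.

```javascript
// Insert <base href="..."> immediately after the opening <head> tag.
function injectBase(html, baseUrl) {
  // Escape characters that could break out of the attribute value.
  const href = baseUrl.replace(/&/g, '&amp;').replace(/"/g, '&quot;');
  const baseTag = '<base href="' + href + '">';
  // Matches <head> or <head ...attributes>, but not <header>.
  const headOpen = /<head(\s[^>]*)?>/i;
  if (headOpen.test(html)) {
    // A function replacer avoids special "$" sequences in the inserted text.
    return html.replace(headOpen, (match) => match + baseTag);
  }
  // Crude fallback for documents without a <head> tag: prepend the tag.
  return baseTag + html;
}

const page = '<html><head><title>x</title></head><body></body></html>';
console.log(injectBase(page, 'http://www.example.com/'));
// prints: <html><head><base href="http://www.example.com/"><title>x</title></head><body></body></html>
```

Placing the element first in the head matters because browsers honour only the first base element that carries an href.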