How to implement a standard set of hyperlink detection rules in Delphi

Posted on 2024-12-28 07:38:33

I currently do automatic detection of hyperlinks within text in my program. I made it very simple and only look for http:// or www.

However, a user suggested to me that I extend it to other forms, e.g.: https:// or .com

Then I realized it might not stop there because there's ftp and mailto and file, all the other top level domains, and even email addresses and file paths.

What I think is best is to limit it to what is practical by following some often-used standard set of hyperlink detection rules that are currently in use. Maybe how Microsoft Word does it, or maybe how RichEdit does it or maybe you know of a better standard.

So my question is:

Is there a built-in function that I can call from Delphi to do the detection, and if so, what would the call look like? (I plan to move to FireMonkey in the future, so I would prefer something that works beyond Windows.)

If there isn't a function available, is there some place I can find a documented set of rules of what is detected in Word, in RichEdit, or any other set of rules of what should be detected? That would then allow me to write the detection code myself.

Comments (3)

迷爱 2025-01-04 07:38:33

Try the PathIsURL function, which is declared in the ShLwApi unit.
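
A minimal sketch of what such a call might look like, assuming a recent Delphi where the unit is named Winapi.ShLwApi (older versions use plain ShLwApi). The candidate strings are made up for illustration. Note that PathIsURL lives in the Windows shell library, so it would not help on non-Windows FireMonkey targets, and it tests one whole string rather than scanning text for links.

```delphi
program PathIsUrlDemo;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows, Winapi.ShLwApi;

const
  // Hypothetical inputs, chosen only to exercise the call.
  Candidates: array[0..2] of string = (
    'http://www.regexbuddy.com/',
    'www.regexbuddy.com',
    'C:\Temp\notes.txt');
var
  S: string;
begin
  for S in Candidates do
    // PathIsURL takes a null-terminated string and returns a BOOL.
    if PathIsURL(PChar(S)) then
      Writeln(S, ' looks like a URL')
    else
      Writeln(S, ' does not look like a URL');
end.
```

Because the function classifies a complete string, the text would still have to be split into candidate tokens (for example on whitespace) before each token is passed in.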

青朷 2025-01-04 07:38:33

The following regex, taken from RegexBuddy's library, might get you started (I can't make any claims about its performance).

Regex

Match; JGsoft; case insensitive:  
\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]

Explanation

URL: Find in full text
The final character class makes sure that if a URL is part of some text,
punctuation such as a comma or full stop after the URL is not interpreted as part
of the URL.

Matches (whole or partial)

http://regexbuddy.com
http://www.regexbuddy.com 
http://www.regexbuddy.com/ 
http://www.regexbuddy.com/index.html 
http://www.regexbuddy.com/index.html?source=library 
You can download RegexBuddy at http://www.regexbuddy.com/download.html.

Does not match

regexbuddy.com
www.regexbuddy.com
"www.domain.com/quoted URL with spaces"
[email protected]
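
As a rough sketch of how the full-text pattern above could be applied from Delphi, the following uses TRegEx from the System.RegularExpressions unit, which ships with the RTL and so should also be available to FireMonkey targets. The program name and sample text are illustrative assumptions; only the pattern comes from the answer.

```delphi
program FindLinksDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

const
  // Full-text pattern quoted in the answer above.
  UrlPattern =
    '\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]';
  // Sample sentence taken from the "Matches" list.
  Sample =
    'You can download RegexBuddy at http://www.regexbuddy.com/download.html.';
var
  Regex: TRegEx;
  Match: TMatch;
begin
  // roIgnoreCase mirrors the "case insensitive" setting noted for the pattern.
  Regex := TRegEx.Create(UrlPattern, [roIgnoreCase]);
  for Match in Regex.Matches(Sample) do
    // Index and Length locate the link inside Sample, e.g. for highlighting.
    Writeln(Format('at %d, length %d: %s',
      [Match.Index, Match.Length, Match.Value]));
end.
```

Supporting forms such as www. or mailto: would mean extending the alternation in the pattern, which is a design decision the answer leaves open.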

For a set of rules, you might look into RFC 3986:

A Uniform Resource Identifier (URI) is a compact sequence of
characters that identifies an abstract or physical resource. This
specification defines the generic URI syntax and a process for
resolving URI references that might be in relative form, along with
guidelines and security considerations for the use of URIs on the
Internet.

A regex that validates a URL as specified in RFC 3986 would be:

^
(# Scheme
 [a-z][a-z0-9+\-.]*:
 (# Authority & path
  //
  ([a-z0-9\-._~%!$&'()*+,;=]+@)?              # User
  ([a-z0-9\-._~%]+                            # Named host
  |\[[a-f0-9:.]+\]                            # IPv6 host
  |\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])  # IPvFuture host
  (:[0-9]+)?                                  # Port
  (/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?          # Path
 |# Path without authority
  (/?[a-z0-9\-._~%!$&'()*+,;=:@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?)?
 )
|# Relative URL (no scheme or authority)
 ([a-z0-9\-._~%!$&'()*+,;=@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?  # Relative path
 |(/[a-z0-9\-._~%!$&'()*+,;=:@]+)+/?)                            # Absolute path
)
# Query
(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
# Fragment
(\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
$

遥远的绿洲 2025-01-04 07:38:33

Regular Expressions may be the way to go here, to define the various patterns which you deem to be appropriate hyperlinks.
