Selecting a text string that contains multiple formatting tags

Posted on 2025-01-10 16:05:54

Context:

A VB.NET application that uses the HTML Agility Pack to handle HTML documents.

Issue:

In an HTML document, I'd like to turn every string that starts with # and ends with a space into a link whose URL is a fixed prefix followed by the string, whatever formatting tags are used within it.
So #sth would link to http://www.anything.tld/sth

For instance:

Before:

<p>#string1 blablabla</p>
<p><strong>#stri</strong>ng2 bliblibli</p>

After:

<p><a href="http://www.anything.tld/string1">#string1</a> blablabla</p>
<p><a href="http://www.anything.tld/string2"><strong>#stri</strong>ng2</a> bliblibli</p>

I guess I can achieve this with the HTML Agility Pack, but how do I select the entire text string without its formatting?

Or should I use a simple regex replace routine?

Comments (1)

方圜几里 2025-01-17 16:05:54

Here's my solution. I'm sure it would make some experienced developers bleed from every hole, but it actually works.
The HTML code is in strCorpusHtmlContent and the URL prefix is in strUrnPrefixHashtag.

' Requires: Imports System.Text and Imports System.Text.RegularExpressions
' strCorpusHtmlContent holds the HTML; strUrnPrefixHashtag holds the URL prefix.
' The result is built in one pass so that match indexes stay valid
' (they always refer to the original, unmodified string).
Dim sbResult As New StringBuilder()
Dim intCopiedUpTo As Integer = 0 ' 0-based index of the first character not yet copied

For Each matchHashtag As Match In Regex.Matches(strCorpusHtmlContent, "#\w+")
    ' Skip a "#" that lies inside a hashtag that has already been wrapped
    If matchHashtag.Index < intCopiedUpTo Then Continue For

    ' Walk forward until whitespace outside a tag (or the end of the input),
    ' collecting only the characters that are not part of a tag
    Dim sbValue As New StringBuilder(matchHashtag.Value)
    Dim intPos As Integer = matchHashtag.Index + matchHashtag.Length
    Dim blnInATag As Boolean = False
    Do While intPos < strCorpusHtmlContent.Length
        Dim nextChar As Char = strCorpusHtmlContent(intPos)
        If nextChar = "<"c Then
            blnInATag = True
        ElseIf nextChar = ">"c Then
            blnInATag = False
        ElseIf Not blnInATag Then
            If Char.IsWhiteSpace(nextChar) Then Exit Do
            sbValue.Append(nextChar)
        End If
        intPos += 1
    Loop

    ' The hashtag with its formatting tags, exactly as it appears in the source
    Dim strHashtagToFormat As String = strCorpusHtmlContent.Substring(matchHashtag.Index, intPos - matchHashtag.Index)

    ' Copy the text before the hashtag, then the hashtag wrapped in a link
    sbResult.Append(strCorpusHtmlContent, intCopiedUpTo, matchHashtag.Index - intCopiedUpTo)
    sbResult.Append("<a href=""").Append(strUrnPrefixHashtag).Append(sbValue.ToString()).Append(""">")
    sbResult.Append(strHashtagToFormat).Append("</a>")
    intCopiedUpTo = intPos
Next

' Copy whatever follows the last hashtag
sbResult.Append(strCorpusHtmlContent, intCopiedUpTo, strCorpusHtmlContent.Length - intCopiedUpTo)
strCorpusHtmlContent = sbResult.ToString()

Before:

<p>#has<strong>hta</strong><em>g_m</em>u<span style="text-decoration: underline;">ltiformat</span> to convert</p>

After:

<p><a href="web:keyword:#hashtag_multiformat">#has<strong>hta</strong><em>g_m</em>u<span style="text-decoration: underline;">ltiformat</span></a> to convert</p>
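
For comparison, the "simple regex replace routine" raised in the question could look roughly like the sketch below. It is only an illustration under stated assumptions, not a tested replacement for the code above: the names HashtagLinker, LinkHashtags and InlineTag are made up for this example, and the list of inline tags is a guess at what counts as "formatting" in the documents being processed. Unlike the code above, it drops the leading # from the URL, as the question asked.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module HashtagLinker
    ' Inline formatting tags allowed around or inside a hashtag (assumed list; extend as needed)
    Private Const InlineTag As String = "(?:strong|em|b|i|u|span)"

    ' Optional opening inline tags, "#", word characters optionally interleaved
    ' with inline tags, then optional closing inline tags
    Private ReadOnly HashtagPattern As New Regex(
        "(?:<" & InlineTag & "\b[^>]*>)*#\w+" &
        "(?:(?:</?" & InlineTag & "\b[^>]*>)+\w+)*" &
        "(?:</" & InlineTag & ">)*")

    ' Wraps every hashtag found in html in a link to urlPrefix + hashtag text (without "#")
    Public Function LinkHashtags(html As String, urlPrefix As String) As String
        Return HashtagPattern.Replace(html,
            Function(m As Match) As String
                ' Strip the tags to get the plain hashtag text, then drop the leading "#"
                Dim plain As String = Regex.Replace(m.Value, "<[^>]+>", "")
                Return "<a href=""" & urlPrefix & plain.Substring(1) & """>" & m.Value & "</a>"
            End Function)
    End Function

    Sub Main()
        ' Prints: <p><a href="http://www.anything.tld/string1">#string1</a> blablabla</p>
        Console.WriteLine(LinkHashtags("<p>#string1 blablabla</p>", "http://www.anything.tld/"))
        ' Prints: <p><a href="http://www.anything.tld/string2"><strong>#stri</strong>ng2</a> bliblibli</p>
        Console.WriteLine(LinkHashtags("<p><strong>#stri</strong>ng2 bliblibli</p>", "http://www.anything.tld/"))
    End Sub
End Module
```

Restricting the pattern to a known set of inline tags keeps a block-level tag such as </p> from being pulled into the link. A regex still cannot guarantee that the tags inside a match are balanced, and it will also match a # inside an attribute value such as href="#top", so for arbitrary HTML a parser-based approach is likely to be more robust.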