Regex problem and C#

Posted on 2024-12-17 03:21:24

I am having an issue with a string manipulation method I have written. The purpose of this method is to seek out link tags within a long string, and reformat their hrefs.

To give some context, I am parsing a large number of HTML files that were on a CD and collating the results into XML files for a website in a separate project (I wrote this as part of a console app). The HTML files contain instructional text with links that are relative to the files on the CD, and I need to change the hrefs to be relative to the website the information is going on.

The following code appears to work just fine if there is only one link tag, but pass it two and the output is badly garbled. Strangely, Visual Studio's Regex editor claims that the linkTag regex below matches only the link tags, but when it comes to replacing the links with the correct hrefs, it inserts link fragments at various points within the instructions string.

The reason for the additional regex, alphaDir, is that I will eventually expand this method to correct links with different starting hrefs. We are talking about parsing thousands of HTML files, but this format is by far the most common.

I am at a bit of a loss on this one, as I am very much a regex beginner and wrote all of the regexes below myself, so any thoughts on any of them would be great too.

Typical Input string

Hold 1st <strong><a href="../f/fist_hand.html">FIST</a></strong> hand, back outward
  &amp; fingers forward, and put 2nd <strong><a href="../f/fist_hand.html">FIST</a></strong> hand, back forward
  &amp; fingers inward, with lower knuckle of its 4th finger on
  lower knuckle of 1st thumb; then slide 2nd hand forwards one
  hand's length.

The Method

static string instructions(string instructions)
    {
        Regex Spaces = new Regex(@"\s+|\n|\r");
        Regex linkTag = new Regex(@"<a(.*?)>(.*?)<\/a>");
        Regex linkTagHtml = new Regex(@"<a(.*?)>|<\/a>");
        Regex hrefAttr = new Regex("href=\"(.)*?\"");
        Regex alphaDir = new Regex(@"/([a-z])?/");

        string signName = string.Empty;
        char alphaChar;
        string replacementLinkTag = string.Empty;
        string replacementHref = string.Empty;

        instructions = Spaces.Replace(instructions, " ");

        MatchCollection matches = linkTag.Matches(instructions);

        foreach (Match link in matches)
        {
            Match alphaDirMatch = alphaDir.Match(link.Value.ToString());
            if (alphaDirMatch.Success)
            {
                Match hrefAttrMatch = hrefAttr.Match(link.Value.ToString());
                if (hrefAttrMatch.Success)
                {
                    signName = linkTagHtml.Replace(link.Value.ToString(), string.Empty).ToLower().Trim();
                    signName = signName.Replace(" ", "_");
                    alphaChar = signName[0];

                    replacementHref = "href=\"/pages/displayc.aspx?c=dictionary&alpha=" + alphaChar.ToString() +"&sign=" + signName + "\"";
                    replacementLinkTag = hrefAttr.Replace(link.Value.ToString(), replacementHref);

                    instructions = instructions.Remove(link.Index, link.Length);
                    instructions = instructions.Insert(link.Index, replacementLinkTag);
                }
            }
        }            

        return instructions;
    }

Current output string

Hold 1st <strong><a href="/pages/displayc.aspx?c=dictionary&alpha=f&sign=fist">FIST</a></strong> hand, back outward &amp; finge<a href="/pages/displayc.aspx?c=dictionary&alpha=f&sign=fist">FIST</a>f="../f/fist_hand.html">FIST</a></strong> hand, back forward &amp; fingers inward, with lower knuckle of its 4th finger on lower knuckle of 1st thumb; then slide 2nd hand forwards one hand's length.

Desired output string

Hold 1st <strong><a href="/pages/displayc.aspx?c=dictionary&alpha=f&sign=fist">FIST</a></strong> hand, back outward &amp; fingers forward, and put 2nd <strong><a href="/pages/displayc.aspx?c=dictionary&alpha=f&sign=fist">FIST</a></strong> hand, back forward &amp; fingers inward, with lower knuckle of its 4th finger on lower knuckle of 1st thumb; then slide 2nd hand forwards one hand's length.
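
The garbled output above is consistent with stale match positions. link.Index and link.Length refer to the string as it was when Matches ran, but each Remove/Insert pair changes the string's length (here the new tag is 32 characters longer than the old one), so the second replacement lands 32 characters too early. A minimal sketch of one way around this, assuming the sign name is still taken from the link text, is to let Regex.Replace with a match evaluator build the result in a single pass. The names LinkRewriter and RewriteLinks are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class LinkRewriter
{
    // Same patterns as the method above.
    private static readonly Regex LinkTag = new Regex(@"<a(.*?)>(.*?)</a>");
    private static readonly Regex HrefAttr = new Regex("href=\".*?\"");
    private static readonly Regex AlphaDir = new Regex(@"/([a-z])?/");

    public static string RewriteLinks(string instructions)
    {
        // Regex.Replace assembles the output itself, so no index computed
        // against the original string is ever applied to a modified one.
        return LinkTag.Replace(instructions, match =>
        {
            // Leave links that do not point at a letter directory untouched.
            if (!AlphaDir.IsMatch(match.Value) || !HrefAttr.IsMatch(match.Value))
            {
                return match.Value;
            }

            // Group 2 is the link text, e.g. "FIST" -> "fist".
            string signName = match.Groups[2].Value.Trim().ToLower().Replace(" ", "_");
            if (signName.Length == 0)
            {
                return match.Value;
            }

            string href = "href=\"/pages/displayc.aspx?c=dictionary&alpha="
                + signName[0] + "&sign=" + signName + "\"";

            // An evaluator is used so the new href is inserted literally.
            return HrefAttr.Replace(match.Value, _ => href);
        });
    }
}
```

Calling LinkRewriter.RewriteLinks on the typical input string, after collapsing whitespace, should rewrite both links as in the desired output. Walking the original matches in reverse order would be another way to keep the earlier indices valid.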

The solution - Thanks for the suggestion Oded!

I used HtmlAgilityPack to load the instructions string as HTML, selected the link tags into an HtmlNodeCollection, then looped over each one, read its href value, and made the edits.

The code ended up looking like this for those interested:

// Requires: using System.Text.RegularExpressions; and using HtmlAgilityPack;
static string instructions(string instructions)
{
    // \s already covers \n and \r, so one pattern collapses every whitespace run.
    Regex Spaces = new Regex(@"\s+");
    // Matches an optional single lowercase letter between slashes, e.g. "/f/".
    Regex alphaDir = new Regex(@"/([a-z])?/");

    instructions = Spaces.Replace(instructions, " ");

    HtmlDocument instr = new HtmlDocument();
    instr.LoadHtml(instructions);

    // SelectNodes returns null when the fragment contains no <a> elements.
    HtmlNodeCollection links = instr.DocumentNode.SelectNodes("//a");

    if (links != null)
    {
        foreach (HtmlNode link in links)
        {
            string href = link.GetAttributeValue("href", string.Empty);

            if (string.IsNullOrWhiteSpace(href) || !alphaDir.IsMatch(href))
            {
                continue;
            }

            // Strip the path up to the letter directory and the ".html" extension.
            string signName = Regex.Replace(href, @"(.)*?/([a-z])?/|(\.html)?", string.Empty);
            signName = signName.Replace(" ", "_");

            if (signName.Length == 0)
            {
                continue; // nothing left to build a link from
            }

            char alphaChar = signName[0];
            string replacementHref = "/pages/displayc.aspx?c=dictionary&alpha=" + alphaChar + "&sign=" + signName;
            link.SetAttributeValue("href", replacementHref);
        }
    }

    return instr.DocumentNode.InnerHtml;
}
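
To see what the sign-name extraction in the solution produces on its own, here is a small sketch that needs only System.Text.RegularExpressions. It uses the same pattern, with the dot before html escaped so that it matches only a literal dot. SignNames and FromHref are hypothetical names introduced for this example.

```csharp
using System.Text.RegularExpressions;

public static class SignNames
{
    // Hypothetical helper: removes everything up to and including the
    // single-letter directory, plus the ".html" extension.
    public static string FromHref(string href)
    {
        return Regex.Replace(href, @"(.)*?/([a-z])?/|(\.html)?", string.Empty);
    }
}
```

For the sample href "../f/fist_hand.html" this gives "fist_hand", taken from the file name, whereas the first method took the name from the link text and the desired output string shows sign=fist. Which form the site expects is not stated here, so it may be worth checking before running this over every file.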
