Most efficient way to add missing alt attributes to images in a large HTML document

Posted 2024-12-06 07:52:49

In order to comply with accessibility standards, I need to ensure that all images in some dynamically generated HTML (which I don't control) have an empty alt attribute if none is specified.

Example input:

<html>
    <body>
          <img src="foo.gif" />
          <p>Some other content</p>
          <img src="bar.gif" alt="" />
          <img src="blah.gif" alt="Blah!" />
    </body>
</html>

Desired output:

<html>
    <body>
          <img src="foo.gif" alt="" />
          <p>Some other content</p>
          <img src="bar.gif" alt="" />
          <img src="blah.gif" alt="Blah!" />
    </body>
</html>

The html could be quite large and the DOM heavily-nested, so using something like the Html Agility Pack is out.

Can anyone suggest an efficient way to accomplish this?

Update:

It is a safe assumption that the html I'm dealing with is well-formed, so a potential solution need not account for that at all.

Comments (4)

迟月 2024-12-13 07:52:49

Your problem seems very specific: you need to alter some output, but for performance reasons you don't want to parse the whole thing with something general-purpose like HtmlAgilityPack. The best solution would seem to be to do it the hard way.

I would just brute force it. It would be hard to do it more efficiently than something like this (completely untested and almost guaranteed not to work exactly as-is, but the logic should be fine, give or take a missing "+1" or "-1" somewhere):

using System;
using System.Text;

string addAltTag(string html) {
    var sb = new StringBuilder(html.Length + 64);
    int lastPos = 0;   // start of the text not yet copied to the output
    int pos = 0;       // where the next search for "<img" begins
    while (true) {
        pos = html.IndexOf("<img", pos, StringComparison.Ordinal);
        if (pos < 0) break;      // no more images
        // images can't have children, and there should not be any angle
        // brackets anywhere in the attributes, so this should work fine
        int tagEnd = html.IndexOf('>', pos);
        if (tagEnd < 0) break;   // unclosed image -- just quit
        // back up over the slash if the tag is XML-style ("/>")
        int insertAt = html[tagEnd - 1] == '/' ? tagEnd - 1 : tagEnd;
        // output everything from the last position up to, but not
        // including, the closing ">" or "/>"
        sb.Append(html, lastPos, insertAt - lastPos);
        // can't just look for "alt": it could be in the image URL or class name
        if (html.IndexOf(" alt=\"", pos, insertAt - pos, StringComparison.Ordinal) < 0) {
            sb.Append(" alt=\"\"");
        }
        lastPos = insertAt;
        pos = tagEnd;            // continue searching after this tag
    }
    // copy the remainder, including the end of the last tag
    sb.Append(html, lastPos, html.Length - lastPos);
    return sb.ToString();
}

You may need to do things like convert to lowercase before testing, or parse or test for variants such as alt = " (that is, with spaces around the equals sign), depending on how consistent you can expect your HTML to be.
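One way to cover the variants mentioned above is a small regular-expression check on the text of a single tag. The sketch below is an illustration, not part of the original answer: `HasAltAttribute` is a hypothetical helper name, and it assumes the tag text has already been isolated (for example with the `IndexOf` calls above). It is written as C# top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns true when the text of one <img ...> tag
// already carries an alt attribute.
//   \s     alt must be preceded by whitespace, so "data-alt" or a URL
//          that contains "alt" does not count
//   \s*=   accepts "alt=", "alt =", "alt  =" and so on
// RegexOptions.IgnoreCase also accepts "ALT" or "Alt".
bool HasAltAttribute(string imgTag) =>
    Regex.IsMatch(imgTag, @"\salt\s*=", RegexOptions.IgnoreCase);

Console.WriteLine(HasAltAttribute("<img src=\"foo.gif\" ALT = \"\" />")); // True
Console.WriteLine(HasAltAttribute("<img src=\"alt=1.gif\" />"));          // False
```

This is still a heuristic: text such as ` alt =` inside another attribute's quoted value would be reported as a match, and a regular expression is likely to cost more than the plain `IndexOf` test.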

By the way, there is no way this would be faster, but if you want to use something a little more general for some reason, you can also give CsQuery a shot. This is my own C# implementation of jQuery, which would do something like this very easily, e.g.

obj.Select("img").Not("[alt]").Attr("alt",String.Empty);

Since you say that HTML agility pack performs badly on deeply-nested HTML, this may work better for you, because the HTML parser I use is not recursive and should perform linearly regardless of nesting. But it would be far slower than just coding to your exact need since it does, of course, parse the entire document into an object model. Whether that is fast enough for your situation, who knows.

烟酉 2024-12-13 07:52:49

I just tested this on an 8 MB HTML file with about 250,000 lines. It did take a few seconds for the document to load, but the select method was very fast. I'm not sure how big your file is or what you are expecting. I even edited the HTML file so that some tags were missing, such as </body> and a few random </div> tags. It was still able to parse correctly.

using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load(@"c:\test.html");
// every <img> element without an alt attribute (null if there are none)
HtmlNodeCollection col = doc.DocumentNode.SelectNodes("//img[not(@alt)]");

I had a total of 54,322 nodes. The select took milliseconds.
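The snippet above only selects the images. A minimal sketch of the remaining step, adding the empty attribute and serializing the result, might look like the following. The function name `AddMissingAlt` is hypothetical; the Html Agility Pack calls used are `LoadHtml`, `SelectNodes`, `SetAttributeValue` and `OuterHtml`. Note that the parser may re-serialize the markup slightly differently from the input.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper: returns the HTML with alt="" added to every <img>
// that has no alt attribute at all.
string AddMissingAlt(string html)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    // SelectNodes returns null, not an empty collection, when nothing matches.
    var missing = doc.DocumentNode.SelectNodes("//img[not(@alt)]");
    if (missing != null)
    {
        foreach (HtmlNode img in missing)
        {
            img.SetAttributeValue("alt", "");
        }
    }
    return doc.DocumentNode.OuterHtml;
}

Console.WriteLine(AddMissingAlt(
    "<p><img src=\"foo.gif\"><img src=\"bar.gif\" alt=\"Bar\"></p>"));
```

For a file on disk, `doc.Load(path)` and `doc.Save(path)` can replace the string-based calls.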

If the above will not work, and you can reliably predict the output, you could stream the file in and break it into manageable chunks.

pseudo-code

  • stream file in
  • parse in HtmlAgilityPack
  • loop until end of stream

I imagine you could incorporate Parallel.ForEach() in there as well, although I can't find documentation on whether this is safe with HtmlAgilityPack.
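The streaming idea can also be sketched without Html Agility Pack, which avoids having to find chunk boundaries that do not split a tag. This is a swapped-in technique, not what the pseudo-code above describes: it copies characters from a reader to a writer and buffers only one tag at a time. The names `AddMissingAltStreaming` and `FixTag` are hypothetical, and the sketch assumes well-formed HTML with no `>` inside attribute values. It is written as C# top-level statements (C# 9 or later).

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// Copies HTML from reader to writer, adding alt="" to each <img> tag that
// has no alt attribute. Memory use is bounded by the longest single tag.
void AddMissingAltStreaming(TextReader reader, TextWriter writer)
{
    var tag = new StringBuilder();   // the current "<...>" while it is being read
    bool inTag = false;
    int c;
    while ((c = reader.Read()) >= 0)
    {
        char ch = (char)c;
        if (ch == '<')
        {
            // A '<' inside a pending buffer (e.g. "a < b" in a script) means
            // the buffer was not a tag: flush it unchanged and start over.
            writer.Write(tag.ToString());
            tag.Clear();
            inTag = true;
        }
        if (!inTag) { writer.Write(ch); continue; }
        tag.Append(ch);
        if (ch == '>')
        {
            writer.Write(FixTag(tag.ToString()));
            tag.Clear();
            inTag = false;
        }
    }
    writer.Write(tag.ToString());    // unterminated tag at end of input
}

// Returns the tag unchanged unless it is an <img> without an alt attribute.
string FixTag(string tag)
{
    bool isImg = Regex.IsMatch(tag, @"^<img[\s/>]", RegexOptions.IgnoreCase);
    if (!isImg || Regex.IsMatch(tag, @"\salt\s*=", RegexOptions.IgnoreCase))
        return tag;
    bool selfClosing = tag.EndsWith("/>", StringComparison.Ordinal);
    string head = tag.Substring(0, tag.Length - (selfClosing ? 2 : 1)).TrimEnd();
    return head + " alt=\"\"" + (selfClosing ? " />" : ">");
}

var output = new StringWriter();
AddMissingAltStreaming(
    new StringReader("<p><img src=\"foo.gif\" /><img src=\"bar.gif\" alt=\"\" /></p>"),
    output);
Console.WriteLine(output.ToString());
// prints: <p><img src="foo.gif" alt="" /><img src="bar.gif" alt="" /></p>
```

For real files, a `StreamReader` and `StreamWriter` can be passed in place of the string-based reader and writer; both buffer internally, so reading one character at a time does not mean one disk access per character.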

太阳哥哥 2024-12-13 07:52:49

Well, if I were reviewing your content for Section 508 compliance, I would fail your web site or content, unless the blank alt text is used only for decorative images (those not needed to comprehend the content).

Blank alt text is only for decoration. Inserting it might fool some automated reporting tools, but you certainly are not meeting Section 508 compliance.

From a project management standpoint, you are better off leaving it failing so the end-users creating the content become responsible and the automated tool accurately reports it as non-compliant.

单调的奢华 2024-12-13 07:52:49

Hopefully you are able to get hold of the HTML markup wherever you need it. In that case, here is a quick trick to find out, for example as an SEO check, whether any images are missing the alt attribute, without too much struggle.

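using HtmlAgilityPack;

private static bool HasImagesWithoutAltTags(string htmlContent)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(htmlContent);
    // SelectNodes returns null, not an empty collection, when nothing matches,
    // so calling .Any() on the result could throw; a null check is enough.
    return doc.DocumentNode.SelectNodes("//img[not(@alt)]") != null;
}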