Semantics, standards and using the "lang" attribute for source code in markup

Posted 2024-10-19 14:12:52 · Word count 2401 · Views 1 · Comments 0

I haven't been able to find authoritative explanations, microformats or guidelines for the following, so I'm throwing it open. If I've missed something, speak up!

Let's say you have an HTML page that includes an example of some programming source code inside a <pre> element:

<pre>
    # code...
</pre>

(Update: As Pekka points out below, <code> might be better than <pre> but the following examples/discussion can apply to both. And as Brian Campbell points out both elements should of course be used for preformatted code)

Now: How do you – in a semantically correct and spec compliant way – declare the programming language of the <pre> block's contents?

This would be useful information to include in the markup in a semantically consistent way.

The obvious choice, from a semantic standpoint, would be to use the lang attribute:

<pre lang="ruby">

But according to the HTML 4 spec, section 8.1.1:

The lang attribute's value is a language code that identifies a natural language [...] Computer languages are explicitly excluded from language codes.

(emphasis mine)

And besides, "ruby" isn't a standard language code anyway.

The spec does allow for adding "experimental" or "private use" codes using the x primary tag. The example from the spec is lang="x-klingon".

In theory, you could use x-ruby, x-java and so forth to declare the programming language contained in the <pre> block, except that the spec seems to frown upon using the lang attribute for programming languages in general.

The HTML 5 spec on the topic doesn't make matters any clearer. The spec itself doesn't explicitly mention "natural" vs "programming" languages. Instead it refers the reader to BCP 47, which states (again):

Language tags are used to help identify languages [...] but excludes languages not intended primarily for human communication, such as programming languages.

However, it goes on to mention (in section 4.1, page 56) the zxx primary language subtag, which:

identifies content for which a language classification is inappropriate or does not apply. Some examples might include instrumental or electronic music [...] or programming source code.

(emphasis mine)

Again, the spec seems to contradict itself, but it opens up the possibility of using zxx-x-ruby (or similar) as a fully spec-compliant way of both declaring something to be written in a language (just not a human one) and declaring the specific (non-human) language involved.

So, is there any semblance of a standard/microformat/microsyntax/gentleman's agreement/anything on what to do?

Personally, I like zxx-x-ruby as it's the most complete. x-ruby by itself is shorter and neater of course, but unless I'm mistaken, the <pre> block would still inherit the primary language of its parent (e.g. en or fr or similar).
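To make the difference between zxx-x-ruby and a bare x-ruby concrete, here is a minimal Python sketch that splits a tag into its primary subtag and its private-use part. The helper name split_language_tag is hypothetical, and the splitting rule is a simplification I'm assuming for illustration: it does not validate the tag against BCP 47.

```python
def split_language_tag(tag):
    """Split a tag such as 'zxx-x-ruby' into a primary subtag and private-use subtags.

    Simplified assumption: everything before the first standalone 'x'
    subtag is the language part. No BCP 47 validation is attempted.
    """
    subtags = tag.lower().split("-")
    if "x" in subtags:
        cut = subtags.index("x")
        language_part, private_use = subtags[:cut], subtags[cut + 1:]
    else:
        language_part, private_use = subtags, []
    return {
        # A tag that starts with 'x' has no primary language subtag of its own.
        "primary": language_part[0] if language_part else None,
        "private_use": private_use,
    }
```

With this sketch, zxx-x-ruby yields a primary subtag of zxx plus the private-use subtag ruby, while x-ruby yields no primary subtag at all, which is the gap described above.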


Addendum:

As Pekka mentions below, the <code> tag would probably be more appropriate, and semantically it'd be very neat to simply say <code lang="...">. However, the <code> tag is also an inline element, and I was initially thinking only of longer runs of source code, i.e. declaring the language for all <code> elements contained in block-level <pre> elements.

Luckily, the lang attribute is global and can be applied to either element, so either one would work.

Second: I accidentally typed "zzx" everywhere instead of the correct "zxx"! It's one 'z', two 'x's. Apologies for the confusion.

Comments (2)

不回头走下去 2024-10-26 14:12:52

To answer this question, we should look at two things: any potentially relevant specifications, and what is actually done in the real world. You've already mentioned what the relevant specifications have said on the lang attribute; it is generally used for indicating the human language of the content referenced, not the programming language. While BCP 47 mentions the zxx tag for non-linguistic content, I don't believe that it is really appropriate to use the lang attribute and zxx subtag for specifying the programming language. The reason is that most source code does actually have some linguistic content, which is in a natural language: comments, variable names, strings, and the like. The lang attribute should probably be used to indicate these, especially in cases like use of CJK characters where font selection might be based on the lang attribute. The programming language contained within a code example is really orthogonal to the human language contained within it; conflating the two will likely lead to confusion, not clarity.

So, let's check the specs for an alternative to the lang attribute. As Pekka points out in another answer, the <code> element is more semantically meaningful for marking up source code than the <pre> element, so let's check there. According to the HTML5 spec:

The code element represents a fragment of computer code. This could be an XML element name, a filename, a computer program, or any other string that a computer would recognize.

Although there is no formal way to indicate the language of computer code being marked up, authors who wish to mark code elements with the language used, e.g. so that syntax highlighting scripts can use the right rules, may do so by adding a class prefixed with "language-" to the element.

...

The following example shows how a block of code could be marked up using the pre and code elements.

<pre><code class="language-pascal">var i: Integer;
begin
   i := 1;
end.</code></pre>

A class is used in that example to indicate the language used.

Now, this isn't a formal specification, just an informal recommendation for how you could use a class to indicate the language represented. The example also shows how to use both a <pre> tag and <code> tag to mark up a block of code.
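As a rough illustration of how a consumer might read that convention back out of the markup, here is a minimal Python sketch built on the standard library's html.parser. The class CodeLanguageFinder and the function find_code_languages are hypothetical names, not part of any highlighter, and the sketch only looks at <code> elements.

```python
from html.parser import HTMLParser


class CodeLanguageFinder(HTMLParser):
    """Collect the language named by a 'language-' class on each <code> element."""

    def __init__(self):
        super().__init__()
        self.languages = []

    def handle_starttag(self, tag, attrs):
        if tag != "code":
            return
        # The class attribute may be missing or empty.
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name.startswith("language-"):
                self.languages.append(name[len("language-"):])
                break
        else:
            self.languages.append(None)  # no language declared on this element


def find_code_languages(html):
    """Return one entry per <code> element: the declared language, or None."""
    finder = CodeLanguageFinder()
    finder.feed(html)
    return finder.languages
```

Fed the Pascal example from the spec, this would report pascal for the single <code> element; a <code> element without a language- class is reported as None.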

We can look elsewhere for any sort of standards, but I haven't found any; there are no microformats for code formatting, and I haven't found any other specs that mention it. So, we move on to what people actually do. The best way to discover this is to look at what HTML syntax highlighting libraries do, since they are the main producers and consumers of code embedded in web pages in which the language actually matters.

There are two main types of HTML syntax highlighters; those that run on the server or offline, in Ruby or Python or PHP, and produce static HTML and CSS to be displayed by the browser, and those written in JavaScript, which find and highlight <pre> or <code> elements on the client side. The second category is more interesting, as they need to detect the language from the HTML provided to them; in the first category, you usually specify the language manually through the API or through some mechanism specific to your wiki, blog, or CMS syntax, and so there is no actual consumer of any language information that might be embedded in the HTML. We'll take a look at both categories for the sake of completeness.

For JavaScript syntax highlighters, I've found the following, with examples of their syntax for specifying a code block and its language:

  • SyntaxHighlighter: <pre class="brush: html">...</pre>. Appears to completely ignore how class should be used by introducing its own syntax for class attributes based on CSS syntax with the brush keyword used to indicate the language. Also has an option for using the <script> tag, to make it easier to copy and paste code in without having to escape <, using the same class syntax.
  • Highlight.js: <pre><code class="html">...</code></pre> or class="language-html" or the same on <pre>. This gives you several options, one of which corresponds to the recommendation in the HTML5 spec, the other simply uses the bare language name as the class name.
  • SHJS: <pre class="sh_html">...</pre>. Uses its own prefix for language names in the class, and only works on <pre>, not other elements.
  • beautyOfCode: <pre class="code"><code class="html">...</code></pre>. Based on SyntaxHighlighter, but with a somewhat less weird syntax. Requires the <pre> tag with class code and the code tag with a class indicating the language.
  • Chili: <code class="html">...</code>. Uses just the <code> tag, and uses the bare language as a class name.
  • Lighter.js: <pre class="html">...</pre>. Uses the bare language as a class name. You select the elements it will apply to using the API, but the example demonstrates it on <pre> tags.
  • DlHighlight: <pre name="code" class="html">...</pre>. Uses the bare language as a class name. You choose via the API what type of element to highlight (the example used pre) and the value of the name attribute to look for to indicate that you want syntax highlighting. I believe that this is an abuse of the name attribute.
  • google-code-prettify: <pre class="prettyprint lang-html">. Uses class names prefixed with lang- to specify the language, and the class prettyprint to indicate that you want syntax highlighting. The language class is optional; it will try to auto-detect the language if not specified.
  • JUSH: <code class="jush-html">...</code> or <code class="language-html">...</code>. Uses the code tag, with languages in a class prefixed by jush- or language-.
  • Rainbow: <pre><code data-language="javascript">...</code></pre> uses the custom attribute data-language, applied to either a <code> element, or a <pre> element, in order to support sites like Tumblr which strip out <code> elements.
  • Prism: <pre><code class="language-css">...</code></pre> follows the HTML5 spec for nested <pre> and <code>, and the recommendation for the class name.

For server-based and offline syntax highlighters, the majority (CodeRay, UltraViolet, Pygments, Highlight) do not embed any language information in the HTML they output at all. GeSHi is the only one I found that embeds the language, as <pre class="html">...</pre>, a <pre> tag with a bare language name as the class.

Out of that list, there seems to be no real consensus. The most popular option is just using the bare language name as a class. The next most popular is using some form of prefixed language name, either prefixed by the library name, lang-, or language-. There are a few that have their own strange conventions, or don't specify the language in the HTML at all.

While the only thing common enough to be a de-facto standard is using the bare language name as a class, I would recommend going with what the HTML5 spec recommends, a class name of language- followed by the name of the language. This is supported by a few of the syntax highlighters, the rest could probably be easily modified to support it. It is less ambiguous and less likely to conflict with other classes than just the bare language name as a class. And, even if not formally specified, it is at least mentioned in a spec.
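To show why a prefixed class is less ambiguous than a bare one, here is a minimal Python sketch that tries to recover the language from a class attribute using several of the conventions listed above. The function detect_language and the PREFIXES tuple are hypothetical and only cover the language-, lang-, jush- and sh_ prefixes; SyntaxHighlighter's brush: syntax is deliberately left out. Bare names are accepted only if the caller supplies a set of known languages, because nothing in the markup distinguishes class="html" from any other class.

```python
# Prefixes taken from the conventions listed above (HTML5 spec / Prism / JUSH,
# google-code-prettify / marked, JUSH, SHJS). This list is an assumption for
# illustration, not an exhaustive registry.
PREFIXES = ("language-", "lang-", "jush-", "sh_")


def detect_language(class_attr, known_languages=frozenset()):
    """Return the language named in a class attribute, or None if unclear."""
    names = class_attr.split()
    # Prefixed names are unambiguous, so look for them first.
    for name in names:
        for prefix in PREFIXES:
            if name.startswith(prefix) and len(name) > len(prefix):
                return name[len(prefix):]
    # Bare names are only trusted if the caller lists them as languages.
    for name in names:
        if name in known_languages:
            return name
    return None
```

A class such as "prettyprint lang-html" resolves without any extra knowledge, whereas "code html" resolves only when html is passed in as a known language.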

I would also use the <code> tag to indicate source code, either bare or embedded in a <pre> tag; the combination of a <code> tag and language- prefixed class can be used to indicate that you have source code in a particular language, and could be used to indicate you want it to be highlighted, and is clearer and better matches the semantics of the elements than some of the other indicators used by syntax highlighting libraries. For cases in which a <code> tag can't be used, such as embedding in sites that accept only a limited HTML subset like Tumblr, just using the <pre> tag with the same class convention is probably best.

edit to add: The CommonMark specification, which attempts to standardize Markdown so that implementations can be interoperable, producing the same HTML given the same input, has also adopted this suggested convention. It adds fenced code blocks to Markdown, surrounded with ``` or ~~~, which can be easier to use than indentation based code blocks. Immediately following the opening fence can be an info string, which is defined as:

An info string can be provided after the opening code fence. Opening and closing spaces will be stripped, and the first word, prefixed with language-, is used as the value for the class attribute of the code element within the enclosing pre element.
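The rule quoted above can be sketched in a few lines of Python. This is only an illustration of the info-string handling, not a conforming CommonMark renderer; render_fenced_block is a hypothetical helper, and its escaping is limited to what the standard library's html.escape provides.

```python
from html import escape


def render_fenced_block(info_string, code):
    """Render a fenced code block the way the quoted rule describes.

    The first word of the info string, prefixed with 'language-', becomes
    the class of the <code> element inside the enclosing <pre> element.
    """
    words = info_string.strip().split()
    class_attr = ""
    if words:
        class_attr = ' class="language-{}"'.format(escape(words[0], quote=True))
    # Escape &, < and > in the code so it displays literally.
    return "<pre><code{}>{}</code></pre>".format(class_attr, escape(code, quote=False))
```

An empty info string simply produces a <code> element with no class attribute.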

It can also be instructive to check what actual implementations do. Trying out a fenced code block on Babelmark shows the following breakdown among the implementations that support fenced code blocks (not all do, as it's an extension to the original Markdown):

  • showdown, blackfriday, haskell markdown: <pre><code class="python">...</code></pre>
  • marked: <pre><code class="lang-python">...</code></pre>
  • commonmark, parsedown, cebe/markdown: <pre><code class="language-python">...</code></pre>
  • cheapskate, minima: <pre class="python">...</pre>
  • pandoc: <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">...</code></pre></div> (quite the overkill)
  • Maruku: <pre class="python"><code class="python">...</code></pre>

Looking at other document markup languages that convert to HTML and have some understanding of code blocks:

  • AsciiDoc: <pre>...</pre>; simply uses Pygments to highlight and does not include language information in the HTML.
  • rst2html gave me <pre class="code python literal-block">...</pre>, highlighted with Pygments.
  • Sphinx: <div class="highlight-python"><div class="highlight"><pre>...</pre></div></div>, also highlighted with Pygments.

So, overall, fairly large diversity in choices by different projects, but there does seem to be some movement towards standardizing on <pre><code class="language-python">...</code></pre>.

苹果你个爱泡泡 2024-10-26 14:12:52

There doesn't seem to be a better way than to misuse the lang attribute with the zxx prefix you mention (interesting find by the way!). The type attribute might be slightly more fitting, but it of course isn't valid in pre elements.

By the way, <code> (W3C reference here) might be more fitting than <pre>:

The HTML Code Element (<code>) represents a fragment of computer code. By default, it is displayed in the browser's default monospace font.
