I haven't been able to find authoritative explanations, microformats or guidelines for the following, so I'm throwing it open. If I've missed something, speak up!
Let's say you have an HTML page that includes an example of some programming source code inside a <pre> element:
<pre>
# code...
</pre>
(Update: As Pekka points out below, <code> might be better than <pre>, but the following examples/discussion can apply to both. And as Brian Campbell points out, both elements should of course be used for preformatted code.)
Now: How do you, in a semantically correct and spec-compliant way, declare the programming language of the <pre> block's contents?
This would be useful information to include in the markup in a semantically consistent way.
The obvious choice, from a semantic standpoint, would be to use the lang attribute:
<pre lang="ruby">
But according to the HTML 4 spec, section 8.1.1:
The lang attribute's value is a language code that identifies a natural language [...] Computer languages are explicitly excluded from language codes.
(emphasis mine)
And besides, "ruby" isn't a standard language code anyway.
The spec does allow for adding "experimental" or "private use" codes using the x primary tag. The example from the spec is lang="x-klingon".
In theory, you could use x-ruby, x-java and so forth to declare the programming language contained in the <pre> block, except that it seems the spec frowns upon using the lang attribute for programming languages in general.
The HTML 5 spec on the topic doesn't make matters any clearer. The spec itself doesn't explicitly mention "natural" vs "programming" languages. Instead it refers the reader to BCP 47, which states (again):
Language tags are used to help identify languages [...] but excludes languages not intended primarily for human communication, such as programming languages.
However, it goes on to mention (in section 4.1, page 56) the zxx primary language subtag, which:
identifies content for which a language classification is inappropriate or does not apply. Some examples might include instrumental or electronic music [...] or programming source code.
(emphasis mine)
Again, the spec seems to contradict itself, but it opens up the possibility of using zxx-x-ruby (or similar) as a fully spec-compliant way of both declaring something to be written in a language (just not a human one) and declaring the specific (non-human) language involved.
So, is there any semblance of a standard/microformat/microsyntax/gentleman's agreement/anything on what to do?
Personally, I like zxx-x-ruby as it's the most complete. x-ruby by itself is shorter and neater of course, but unless I'm mistaken, the <pre> block would still inherit the primary language of its parent (e.g. en or fr or similar).
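To make the shape of such a tag concrete, here is a minimal Python sketch of how a consumer might pull the programming language out of a zxx-x-ruby style value. The function names are made up for illustration, and the sketch does no BCP 47 validation; it only assumes the layout discussed above (a primary subtag, then an x singleton followed by private-use subtags).

```python
def split_language_tag(tag):
    """Split a BCP 47-style tag into (primary subtag, private-use subtags).

    Hypothetical helper: it performs no validation of the tag.
    """
    subtags = tag.lower().split("-")
    primary = subtags[0]
    private_use = []
    # Everything after a singleton "x" is treated as private use.
    if "x" in subtags:
        private_use = subtags[subtags.index("x") + 1:]
    return primary, private_use


def programming_language(tag):
    """Return the private-use name when the primary subtag is zxx, else None."""
    primary, private_use = split_language_tag(tag)
    if primary == "zxx" and private_use:
        return private_use[0]
    return None
```

With this sketch, programming_language("zxx-x-ruby") gives "ruby", while plain natural-language tags such as "en" or "fr" give None.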
Addendum:
As Pekka mentions below, the <code> tag would probably be more appropriate, and semantically it'd be very neat to simply say <code lang="...">. However, the <code> tag is also an inline element, and I was initially thinking only of longer runs of source code, i.e. declaring the language for all <code> elements contained in block-level <pre> elements.
Luckily, the lang attribute is global and can be applied to either element, so either one would work.
Second: I accidentally typed "zzx" everywhere instead of the correct "zxx"! It's one 'z', two 'x's. Apologies for the confusion.
Comments (2)
To answer this question, we should look at two things: any potentially relevant specifications, and what is actually done in the real world. You've already mentioned what the relevant specifications have said on the lang attribute; it is generally used for indicating the human language of the content referenced, not the programming language. While BCP 47 mentions the zxx tag for non-linguistic content, I don't believe that it is really appropriate to use the lang attribute and zxx subtag for specifying the programming language. The reason is that most source code does actually have some linguistic content, which is in a natural language: comments, variable names, strings, and the like. The lang attribute should probably be used to indicate these, especially in cases like use of CJK characters where font selection might be based on the lang attribute. The programming language contained within a code example is really orthogonal to the human language contained within it; conflating the two will likely lead to confusion, not clarity.
So, let's check the specs for an alternative to the lang attribute. As Pekka points out in another answer, the <code> element is more semantically meaningful for marking up source code than the <pre> element, so let's check there. According to the HTML5 spec, there is no formal way to indicate the language of computer code being marked up, but authors who wish to do so (for example, so that syntax highlighting scripts can use the right rules) may add a class prefixed with language- to the element.
Now, this isn't a formal specification, just an informal recommendation for how you could use a class to indicate the language represented. The example in the spec also shows how to use both a <pre> tag and <code> tag to mark up a block of code.
We can look elsewhere for any sort of standards, but I haven't found any; there are no microformats for code formatting, and I haven't found any other specs that mention it. So, we move on to what people actually do. The best way to discover this is to look at what HTML syntax highlighting libraries do, since they are the main producers and consumers of code embedded in web pages in which the language actually matters.
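As a rough illustration of how a consumer could read the language- class convention from the HTML5 spec, here is a minimal sketch using Python's standard html.parser module. The class and function names are my own, not part of any library, and the sketch only looks at <code> elements.

```python
from html.parser import HTMLParser

PREFIX = "language-"


class CodeLanguageFinder(HTMLParser):
    """Record the language declared on each <code> element (None if absent)."""

    def __init__(self):
        super().__init__()
        self.languages = []

    def handle_starttag(self, tag, attrs):
        if tag != "code":
            return
        # A valueless class attribute is reported as None, so fall back to "".
        class_attr = dict(attrs).get("class") or ""
        language = None
        for token in class_attr.split():
            if token.startswith(PREFIX) and len(token) > len(PREFIX):
                language = token[len(PREFIX):]
                break
        self.languages.append(language)


def find_code_languages(html_text):
    """Return one entry per <code> element, in document order."""
    finder = CodeLanguageFinder()
    finder.feed(html_text)
    finder.close()
    return finder.languages
```

For example, find_code_languages('<pre><code class="language-ruby">puts 1</code></pre>') returns ["ruby"], and a bare <code> element yields None.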
There are two main types of HTML syntax highlighters: those that run on the server or offline, in Ruby or Python or PHP, and produce static HTML and CSS to be displayed by the browser, and those written in JavaScript, which find and highlight <pre> or <code> elements on the client side. The second category is more interesting, as they need to detect the language from the HTML provided to them; in the first category, you usually specify the language manually through the API or through some mechanism specific to your wiki, blog, or CMS syntax, and so there is no actual consumer of any language information that might be embedded in the HTML. We'll take a look at both categories for the sake of completeness.
For JavaScript syntax highlighters, I've found the following, with examples of their syntax for specifying a code block and its language:
<pre class="brush: html">...</pre>. Appears to completely ignore how class should be used by introducing its own syntax for class attributes based on CSS syntax, with the brush keyword used to indicate the language. Also has an option for using the <script> tag, to make it easier to copy and paste code in without having to escape <, using the same class syntax.
<pre><code class="html">...</code></pre> or class="language-html" or the same on <pre>. This gives you several options, one of which corresponds to the recommendation in the HTML5 spec; the other simply uses the bare language name as the class name.
<pre class="sh_html">...</pre>. Uses its own prefix for language names in the class, and only works on <pre>, not other elements.
<pre class="code"><code class="html">...</code></pre>. Based on SyntaxHighlighter, but with a somewhat less weird syntax. Requires the <pre> tag with class code and the code tag with a class indicating the language.
<code class="html">...</code>. Uses just the <code> tag, and uses the bare language as a class name.
<pre class="html">...</pre>. Uses the bare language as a class name. You select the elements it will apply to using the API, but the example demonstrates it on <pre> tags.
<pre name="code" class="html">...</pre>. Uses the bare language as a class name. You choose via the API what type of element to highlight (the example used pre) and the value of the name attribute to look for to indicate that you want syntax highlighting. I believe that this is an abuse of the name attribute.
<pre class="prettyprint lang-html">. Uses class names prefixed with lang- to specify the language, and the class prettyprint to indicate that you want syntax highlighting. The language class is optional; it will try to auto-detect the language if not specified.
<code class="jush-html">...</code> or <code class="language-html">...</code>. Uses the code tag, with languages in a class prefixed by jush- or language-.
<pre><code data-language="javascript">...</code></pre>. Uses the custom attribute data-language, applied to either a <code> element or a <pre> element, in order to support sites like Tumblr which strip out <code> elements.
<pre><code class="language-css">...</code></pre>. Follows the HTML5 spec for nested <pre> and <code>, and the recommendation for the class name.
For server-based and offline syntax highlighters, the majority (CodeRay, UltraViolet, Pygments, Highlight) do not embed any language information in the HTML they output at all. GeSHi is the only one I found that embeds the language, as <pre class="html">...</pre>, a <pre> tag with a bare language name as the class.
Out of that list, there seems to be no real consensus. The most popular option is just using the bare language name as a class. The next most popular is using some form of prefixed language name, either prefixed by the library name, lang-, or language-. There are a few that have their own strange conventions, or don't specify the language in the HTML at all.
While the only thing common enough to be a de facto standard is using the bare language name as a class, I would recommend going with what the HTML5 spec recommends, a class name of language- followed by the name of the language. This is supported by a few of the syntax highlighters, and the rest could probably be easily modified to support it. It is less ambiguous and less likely to conflict with other classes than just the bare language name as a class. And, even if not formally specified, it is at least mentioned in a spec.
I would also use the <code> tag to indicate source code, either bare or embedded in a <pre> tag. The combination of a <code> tag and a language- prefixed class can be used to indicate that you have source code in a particular language, and could be used to indicate you want it to be highlighted; it is clearer and better matches the semantics of the elements than some of the other indicators used by syntax highlighting libraries. For cases in which a <code> tag can't be used, such as embedding in sites that accept only a limited HTML subset like Tumblr, just using the <pre> tag with the same class convention is probably best.
Edit to add: The CommonMark specification, which attempts to standardize Markdown so that implementations can be interoperable, producing the same HTML given the same input, has also adopted this suggested convention. It adds fenced code blocks to Markdown, surrounded with ``` or ~~~, which can be easier to use than indentation-based code blocks. Immediately following the opening fence can be an info string; the spec notes that its first word is typically used to specify the language of the code sample and is rendered in the class attribute of the code tag, though it does not mandate any particular treatment of the info string.
It can also be instructive to check what actual implementations do. Trying out a fenced code block on Babelmark shows that, of those implementations that support fenced code blocks (not all do, as it's an extension to the original Markdown), we see the following breakdown:
<pre><code class="python">...</code></pre>
<pre><code class="lang-python">...</code></pre>
<pre><code class="language-python">...</code></pre>
<pre class="python">...</pre>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">...</code></pre></div> (quite the overkill)
<pre class="python"><code class="python">...</code></pre>
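Given that spread of conventions, a consumer that wants to cope with all of them has to normalise the class attribute itself. The following is only a sketch of one way to do that in Python: it prefers the language- and lang- prefixes and falls back to a bare class name only if it appears in a list of known languages. The function name and the KNOWN_LANGUAGES set are assumptions for illustration, not taken from any of the tools above.

```python
PREFIXES = ("language-", "lang-")
# Illustrative only: a real consumer would need a much longer list.
KNOWN_LANGUAGES = frozenset({"python", "ruby", "html", "css", "javascript"})


def detect_language(class_attr, known=KNOWN_LANGUAGES):
    """Guess the programming language named in a class attribute value."""
    tokens = class_attr.split()
    # Prefixed class names are unambiguous, so check them first.
    for token in tokens:
        for prefix in PREFIXES:
            if token.startswith(prefix) and len(token) > len(prefix):
                return token[len(prefix):]
    # Bare names can collide with ordinary classes, so only accept known ones.
    for token in tokens:
        if token.lower() in known:
            return token.lower()
    return None
```

The fallback step is exactly where the ambiguity of bare class names shows up: without the known-language list there is no way to tell python from sourceCode in class="sourceCode python".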
Looking at other document markup languages that convert to HTML and have some understanding of code blocks:
<pre>...</pre>; simply uses Pygments to highlight and does not include language information in the HTML.
rst2html gave me <pre class="code python literal-block">...</pre>, highlighted with Pygments.
<div class="highlight-python"><div class="highlight"><pre>...</pre></div></div>, also highlighted with Pygments.
So, overall, there is fairly large diversity in choices by different projects, but there does seem to be some movement towards standardizing on <pre><code class="language-python">...</code></pre>.
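On the producing side, the recommended markup is easy to generate. Here is a minimal Python sketch, assuming a hypothetical helper called render_code_block: it puts the programming language in a language- prefixed class on <code>, and keeps lang free for the human language of comments and strings, since the two are orthogonal.

```python
from html import escape


def render_code_block(source, language, human_lang=None):
    """Wrap source code in <pre><code class="language-...">, escaping the content.

    human_lang is optional and names the natural language of comments and
    strings; it is emitted as a lang attribute on <pre>.
    """
    lang_attr = f' lang="{escape(human_lang, quote=True)}"' if human_lang else ""
    class_attr = escape(f"language-{language}", quote=True)
    body = escape(source, quote=False)
    return f'<pre{lang_attr}><code class="{class_attr}">{body}</code></pre>'
```

For instance, render_code_block("# bonjour", "ruby", human_lang="fr") produces <pre lang="fr"><code class="language-ruby"># bonjour</code></pre>.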
There doesn't seem to be a better way than to misuse the lang attribute with the zxx prefix you mention (interesting find, by the way!). The type attribute might be slightly more fitting, but it of course isn't valid in pre elements.
By the way, <code> (W3C reference) might be more fitting than <pre>.