Truncate text containing HTML, ignoring tags
I want to truncate some text (loaded from a database or text file), but it contains HTML, so the tags are counted and less text is returned. This can also leave tags unclosed or partially closed (so Tidy may not work properly, and there is still less content). How can I truncate based on the visible text (probably stopping when I get to a table, as that could cause more complex issues)?
substr("Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.",0,26)."..."
Would result in:
Hello, my <strong>name</st...
What I would want is:
Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m...
How can I do this?
While my question is about how to do it in PHP, it would be good to know how to do it in C# too. Either should be OK, as I think I would be able to port the method over (unless it is a built-in method).
Also note that I have included an HTML entity, &acute;, which would have to be counted as a single character (rather than 7 characters, as in this example).
strip_tags is a fallback, but I would lose formatting and links, and it would still have the problem with HTML entities.
Comments (13)
Bounce added multi-byte character support to Søren Løvborg's solution. I've added:
- unpaired tags (<hr>, <br>, <col> etc.) don't get closed; in HTML a '/' is not required at the end of these (it is for XHTML, though),
- an &hellip; entity (i.e. …).
All this at Pastie.
Assuming you are using valid XHTML, it's simple to parse the HTML and make sure tags are handled properly. You simply need to track which tags have been opened so far, and make sure to close them again "on your way out".
Encoding note: The above code assumes the XHTML is UTF-8 encoded. ASCII-compatible single-byte encodings (such as Latin-1) are also supported: just pass false as the third argument. Other multibyte encodings are not supported, though you may hack in support by using mb_convert_encoding to convert to UTF-8 before calling the function, then converting back again in every print statement. (You should always be using UTF-8, though.)
Edit: Updated to handle character entities and UTF-8. Fixed bug where the function would print one character too many, if that character was a character entity.
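As a rough illustration of the tag-tracking idea (remember which tags are open, close them again on the way out), here is a minimal sketch. It is not the printTruncated function discussed in this thread: the name truncateHtmlSketch, its two parameters, and the short list of void elements are assumptions made for the example, and it expects valid UTF-8 input with well-formed tags.

```php
<?php
// Hypothetical helper, not the answer's original code.
// Returns $html cut to at most $maxLength visible characters,
// with every tag that is still open closed again.
function truncateHtmlSketch(string $html, int $maxLength): string
{
    // Elements that never take a closing tag (assumed, incomplete list).
    $void = ['br', 'hr', 'img', 'input', 'col', 'meta', 'link'];

    // One token = a whole tag, a whole character entity, or one UTF-8 character.
    preg_match_all('/<[^>]*>|&[a-zA-Z0-9#]+;|./su', $html, $matches);

    $out = '';
    $count = 0;   // visible characters emitted so far
    $open = [];   // stack of currently open tag names

    foreach ($matches[0] as $token) {
        if ($count >= $maxLength) {
            break;
        }
        if ($token[0] === '<' && strlen($token) > 1) {
            if (preg_match('~^</\s*([a-zA-Z0-9]+)~', $token, $t)) {
                // Closing tag: pop it if it matches the innermost open tag.
                if (end($open) === strtolower($t[1])) {
                    array_pop($open);
                }
            } elseif (preg_match('~^<([a-zA-Z0-9]+)~', $token, $t)
                && substr($token, -2) !== '/>'
                && !in_array(strtolower($t[1]), $void, true)) {
                // Opening tag that will need a matching close later.
                $open[] = strtolower($t[1]);
            }
            $out .= $token;   // tags cost no visible characters
            continue;
        }
        // Plain character or entity: counts as exactly one character.
        $out .= $token;
        $count++;
    }

    // Close whatever is still open, innermost first.
    foreach (array_reverse($open) as $name) {
        $out .= "</$name>";
    }
    return $out;
}
```

Called with the question's string and a limit of 26, then followed by "...", this yields the output the question asks for. The entity is matched as a single token, which is why it counts as one character.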
I've written a function that truncates HTML just as you suggest, but instead of printing the result it keeps it all in a string variable. It handles HTML entities as well.
You could use DOMDocument in this case with a nasty regex hack; the worst that would happen is a warning if there's a broken tag:
Should give output:
Hello, my <strong>**name**</strong>.
I've made light changes to Søren Løvborg's printTruncated function, making it UTF-8 compatible:
You can use tidy as well:
I used a nice function found at http://alanwhipple.com/2011/05/25/php-truncate-string-preserving-html-tags-words, apparently taken from CakePHP.
The following is a simple state-machine parser which handles your test case successfully. It fails on nested tags, though, as it doesn't track the tags themselves. It also chokes on entities within HTML tags (e.g. in an href attribute of an <a> tag). So it cannot be considered a 100% solution to this problem, but because it's easy to understand it could be the basis for a more advanced function.
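A minimal sketch of what such a three-state scanner might look like is given here. It is written for this illustration and is not the answer's original code: the function name stateMachineTruncate and its state labels are hypothetical. It works byte by byte, so multi-byte UTF-8 characters are over-counted, and it treats every '&' in text as the start of an entity.

```php
<?php
// Hypothetical sketch of a three-state scanner; not the answer's original code.
function stateMachineTruncate(string $html, int $maxLength): string
{
    $state = 'text';   // one of 'text', 'tag', 'entity'
    $count = 0;        // visible characters seen so far
    $n = strlen($html);

    for ($i = 0; $i < $n; $i++) {
        $char = $html[$i];
        if ($state === 'text') {
            if ($count === $maxLength) {
                // Limit reached: cut right before the current byte.
                return substr($html, 0, $i);
            }
            if ($char === '<') {
                $state = 'tag';
            } elseif ($char === '&') {
                $state = 'entity';
                $count++;          // the whole entity counts as one character
            } else {
                $count++;
            }
        } elseif ($state === 'tag') {
            if ($char === '>') {
                $state = 'text';   // tag finished, nothing counted
            }
        } else {
            if ($char === ';') {
                $state = 'text';   // entity finished
            }
        }
    }
    return $html;   // input was shorter than the limit
}
```

Because no tag names are recorded, a cut that lands inside an element leaves it unclosed: with the question's string and a limit of 14 the result ends in "<strong>name" with no closing tag. That is the limitation the answer describes.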
100% accurate, but pretty difficult approach:
Easy brute-force approach:
- Split the string into tags (not elements) and text fragments using preg_split('/(<tag>)/') with PREG_SPLIT_DELIM_CAPTURE.
- Measure the text length (html_entity_decode() can help measure accurately).
- Cut the string (trim &[^\s;]+$ at the end to get rid of a possibly chopped entity).
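The brute-force steps can be sketched roughly as follows. The function name bruteForceTruncate and the concrete tag pattern <[^>]*> are assumptions for the example (the answer only writes <tag> as a placeholder), and input is assumed to be valid UTF-8. The cut is made on the raw fragment, so an entity in the last fragment is counted at its full source length and may be dropped. Unclosed tags are deliberately left for a later clean-up pass such as Tidy.

```php
<?php
// Hypothetical helper illustrating the brute-force steps; not code from the answer.
function bruteForceTruncate(string $html, int $maxLength): string
{
    // Split into text and tags; captured tags land at the odd indexes.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    $remaining = $maxLength;
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            $out .= $part;   // a tag: copied through, costs nothing
            continue;
        }
        // Measure with entities decoded, so each entity counts as one character.
        $decoded = html_entity_decode($part, ENT_QUOTES | ENT_HTML5, 'UTF-8');
        $length = (int) preg_match_all('/./su', $decoded);
        if ($length <= $remaining) {
            $out .= $part;
            $remaining -= $length;
            continue;
        }
        // Cut the raw fragment to the remaining number of characters ...
        preg_match('/^.{0,' . $remaining . '}/su', $part, $m);
        // ... and drop an entity that was chopped in half at the end.
        $out .= preg_replace('/&[^\s;]+$/', '', $m[0]);
        break;
    }
    return $out;
}
```

With the question's string and a limit of 26, the raw cut lands inside the entity, so the result stops at ". I" rather than ". I&acute;m". That is the price of the simpler approach.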
Another light change to Søren Løvborg's printTruncated function, making it UTF-8 compatible (needs mbstring) and making it return a string instead of printing it. I think it's more useful.
And my code doesn't use buffering like Bounce's variant, just one more variable.
UPD: to make it work properly with UTF-8 characters in tag attributes you need the mb_preg_match function, listed below.
Many thanks to Søren Løvborg for that function; it's very good.
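Use the function truncateHTML(): https://github.com/jlgrall/truncateHTML
Example: truncating after 9 characters (including the ellipsis):
Features: UTF-8, configurable ellipsis, include/exclude length of ellipsis, self-closing tags, collapsing spaces, invisible elements (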
、
、
、