Best way to parse a text document

Posted on 2024-11-01 05:01:30

I'm trying to parse a plain text document in PHP but have no idea how to do it correctly.
I want to separate each word, assign each one an ID, and save the result in JSON format.

Sample text:

"Hello, how are you (today)"

This is what I'm doing at the moment:

$document_array = explode(' ', $document_text);
echo json_encode($document_array);

The resulting JSON is

[["Hello,"],["how"],["are"],["you"],["(today)"]]

How do I ensure that spaces are kept in place and that symbols are not included along with the words, like this?

[["Hello"],[", "],["how"],[" "],["are"],[" "],["you"],[" ("],["today"],[")"]]

I'm sure some sort of regex is required, but I have no idea what kind of pattern would handle all cases. Any suggestions?

时光倒影 2024-11-08 05:01:30

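This is actually a really complex problem, and one that is the subject of a fair amount of academic research. It sounds simple (just split on whitespace, maybe with a few rules for punctuation) but you quickly run into issues. Is a contraction like "don't" one word or two? What about hyphenated words? Some might be one word, some might be two. What about multiple successive punctuation characters? Possessives versus quotation marks? And so on. Even determining where a sentence ends is non-trivial. (It's just a full stop, right?!)

This problem is one of tokenization, a topic that search engines take very seriously. Honestly, you should look for an existing tokenizer in your language of choice.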
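
To make a couple of those edge cases concrete, here is a minimal sketch. It is not part of the answer above: the helper name `split_on_boundaries` and the sample strings are assumptions made for illustration. It splits on regex word boundaries and shows how a contraction and a hyphenated word fall apart.

```php
<?php
// Hypothetical helper for illustration: split a string at regex word
// boundaries (\b) and drop the empty pieces via PREG_SPLIT_NO_EMPTY.
function split_on_boundaries(string $text): array
{
    return preg_split('/\b/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// A contraction is returned as three tokens: "don", "'", "t".
print_r(split_on_boundaries("don't"));

// A hyphenated word is also split at the hyphen: "well", "-", "known".
print_r(split_on_boundaries("well-known"));
```

Whether those pieces should be merged back into a single word depends on the application, which is why a dedicated tokenizer tends to be the safer choice.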

无名指的心愿 2024-11-08 05:01:30

Maybe this?

array_filter(preg_split('/\b/', $document_text))

The array_filter call removes the empty values at the first and/or last index of the resulting array, which appear if your string starts or ends with a word boundary (for \b, see: http://php.net/manual/en/regexp.reference.escape.php).
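
A possible variation, offered as a sketch and not as the answerer's method: array_filter without a callback also discards a token such as "0", and it preserves the original keys, so json_encode may emit an object instead of a list. Passing the PREG_SPLIT_NO_EMPTY flag to preg_split avoids both. The `id` and `text` field names below are assumptions chosen for illustration, since the question does not say how the IDs should be stored.

```php
<?php
$document_text = 'Hello, how are you (today)';

// Split at word boundaries. PREG_SPLIT_NO_EMPTY discards empty pieces,
// and the resulting array has sequential keys starting at 0.
$tokens = preg_split('/\b/', $document_text, -1, PREG_SPLIT_NO_EMPTY);

// Pair every token with an ID (here simply its position). The "id" and
// "text" keys are hypothetical names used only in this example.
$document_array = [];
foreach ($tokens as $id => $token) {
    $document_array[] = ['id' => $id, 'text' => $token];
}

echo json_encode($document_array), PHP_EOL;
```

For the sample text this yields ten tokens, with words, spaces, and punctuation runs such as ", " and " (" kept as separate entries. Whether a run like " (" should be split further is a design choice that depends on what the IDs are used for.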
