Best way to parse a text document

Posted on 2024-11-01 05:01:30

I'm trying to parse a plain text document in PHP but have no idea how to do it correctly.
I want to separate each word, assign each one an ID and save the result in JSON format.

Sample text:

"Hello, how are you (today)"

This is what I'm doing at the moment:

$document_array  = explode(' ', $document_text);
json_encode($document_array);

The resulting JSON is

[["Hello,"],["how"],["are"],["you"],["(today)"]]

How do I ensure that spaces are kept in place and that symbols are not included along with the words? For example:

[["Hello"],[", "],["how"],[" "],["are"],[" "],["you"],[" ("],["today"],[")"]]

I'm sure some sort of regex is required, but I have no idea what kind of pattern would deal with all cases. Any suggestions?

Comments (2)

时光倒影 2024-11-08 05:01:30

This is actually a really complex problem, and one that is the subject of a fair amount of academic research. It sounds so simple (just split on whitespace, with maybe a few rules for punctuation...) but you quickly run into issues. Is "didn't" one word or two? What about hyphenated words? Some might be one word, some might be two. What about multiple successive punctuation characters? Possessives versus quotes? And so on. Even determining the end of a sentence is non-trivial. (It's just a full stop, right?!)

This problem is one of tokenisation, and it is a topic that search engines take very seriously. To be honest, you should really look at finding a tokeniser in your language of choice.
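
To give a feel for the kind of rules involved, here is a minimal sketch in PHP, not a real tokeniser. The function name tokenize and the pattern are hypothetical, made up for illustration. It assumes that apostrophes and hyphens between letters belong to the word (so "didn't" and "well-known" stay whole), which is only one of several defensible choices, and it uses the array position as the ID.

```php
<?php
// Hypothetical helper: splits text into word, number, whitespace and
// punctuation tokens. Assumption: an apostrophe or hyphen that sits
// between letters is treated as part of the word.
function tokenize(string $text): array
{
    $pattern = '/\p{L}+(?:[\'\x{2019}-]\p{L}+)*' // words, e.g. didn't, well-known
             . '|\p{N}+'                          // runs of digits
             . '|\s+'                             // runs of whitespace
             . '|[^\p{L}\p{N}\s]+/u';             // anything else (punctuation, symbols)
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

$text   = "Hello, didn't you see the well-known sign (today)?";
$tokens = tokenize($text);

// Use the array position as the ID the question asks for.
$withIds = [];
foreach ($tokens as $id => $token) {
    $withIds[] = ['id' => $id, 'token' => $token];
}

echo json_encode($withIds, JSON_UNESCAPED_UNICODE), PHP_EOL;
```

Because every character falls into exactly one of the four alternatives, joining the tokens back together reproduces the input, so whitespace is kept in place. Possessives, quotes and sentence boundaries are not handled at all, which is where a proper tokeniser comes in.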

无名指的心愿 2024-11-08 05:01:30

Maybe this?

array_filter(preg_split('/\b/', $document_text))

The array_filter call removes the empty values at the first and/or last index of the resulting array, which appear if your string starts or ends with a word boundary (for \b see: http://php.net/manual/en/regexp.reference.escape.php)
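
One thing to watch when the result goes to JSON, shown as a rough sketch using the sample text from the question: array_filter preserves the original keys, so once an empty leading piece is removed the keys no longer start at 0 and json_encode emits an object instead of a list. Without a callback, array_filter would also drop a token that is just "0". Reindexing with array_values, or passing PREG_SPLIT_NO_EMPTY to preg_split, avoids the key problem. The variable names below are only for illustration.

```php
<?php
$document_text = 'Hello, how are you (today)';

// array_filter() preserves keys; once an empty leading piece is removed the
// keys no longer start at 0 and json_encode() would emit an object.
$filtered = array_filter(preg_split('/\b/', $document_text));

// Option 1: reindex the filtered array so it encodes as a JSON list.
$tokens = array_values($filtered);

// Option 2: let preg_split() skip empty pieces itself.
$tokens_no_empty = preg_split('/\b/', $document_text, -1, PREG_SPLIT_NO_EMPTY);

echo json_encode($tokens), PHP_EOL;
// ["Hello",", ","how"," ","are"," ","you"," (","today",")"]
```

Each array index can then serve as the ID of its token.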
