How do I fix sentence spacing in plain text extracted from HTML?

Posted on 2024-11-03 19:51:56

I'm pulling articles from specific URLs and converting them to sentences, but the extracted text body randomly drops the whitespace between some sentences, resulting in:

Jane went to the store.She bought a dog. The dog was very friendly.It had no teeth.

Some of my text contains stock symbols such as (AZ.GAN), so I can't simply insert a space after every period that has no adjacent whitespace.

Jane bought several shares of (TY.JPN). She lost all her cash money."Arg!" She cried.

In the example above, that approach would destroy the stock symbol.

Does anyone know the cause of this? I have tried several HTML and DOM parsers. I use Simple_DOM to grab the plaintext, but I get the same result if I do it manually or with any other parsing engine.

Comments (2)

梦初启 2024-11-10 19:51:56

Unfortunately I don't have an approach for your specific question, but is it possible that the missing space between sentences is actually a line break (e.g. \n) that your text viewer (whatever it is) isn't showing you?

Perhaps try something like this just to make sure:


var articleContent = ... // get content
articleContent = articleContent.replace(/\n/g, ' NEW LINE ');
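
The same check can be widened to other characters a viewer might hide. This is only a minimal sketch: `revealWhitespace` and its bracketed labels are hypothetical names made up for illustration, and the set of characters covered (line feed, carriage return, tab, non-breaking space) is an assumption about what might be in the extracted text.

```javascript
// Hypothetical helper: replace whitespace that viewers often hide
// with a visible label so it can be spotted in the output.
function revealWhitespace(text) {
  const labels = {
    "\n": "[LF]",
    "\r": "[CR]",
    "\t": "[TAB]",
    "\u00a0": "[NBSP]",
  };
  // Each matched character is swapped for its label.
  return text.replace(/[\n\r\t\u00a0]/g, (ch) => labels[ch]);
}

const sample = "Jane went to the store.\nShe bought a dog.";
console.log(revealWhitespace(sample));
// prints: Jane went to the store.[LF]She bought a dog.
```

If labels show up between the run-together sentences, the separator is present in the source and is being lost at display time or in a later processing step.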

第七度阳光i 2024-11-10 19:51:56

Try doing:

$str = trim(preg_replace('~([(].+?[.])\s(.+?[)])~', '$1$2', str_replace('.', '. ', $str)));
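
A different heuristic, shown here in JavaScript and not the answerer's method, is to insert a space only where a period looks like a sentence boundary. The sketch assumes that a sentence ends in a lowercase letter or a closing parenthesis, that the next sentence starts with an uppercase letter (optionally after a quote mark), and that stock symbols are uppercase on both sides of the period. `addSentenceSpaces` is a hypothetical name.

```javascript
// Hypothetical helper: add a space after a period only when it is preceded by
// a lowercase letter or ")" and followed by an optional quote plus an
// uppercase letter. Periods inside all-caps symbols such as TY.JPN do not match.
function addSentenceSpaces(text) {
  return text.replace(/([a-z)])\.(?=["']?[A-Z])/g, "$1. ");
}

console.log(addSentenceSpaces(
  "Jane went to the store.She bought a dog. The dog was very friendly.It had no teeth."
));
// prints: Jane went to the store. She bought a dog. The dog was very friendly. It had no teeth.

console.log(addSentenceSpaces(
  'Jane bought several shares of (TY.JPN). She lost all her cash money."Arg!" She cried.'
));
// prints: Jane bought several shares of (TY.JPN). She lost all her cash money. "Arg!" She cried.
```

The heuristic has known gaps: a sentence that ends in an uppercase word or a digit is left unchanged, and a period followed by a closing quote and a capital letter gets the space on the wrong side of the quote.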