Analyzing text and storing it in a data structure

Posted 2024-10-13 03:09:49 | Length: 1024 | Views: 7 | Comments: 0

I hope you understand what I want to do. It is hard to choose the best words, because English is not my first language and I distrust automatic translators. I will try to explain as well as I can.

I was thinking about analyzing a long text. Suppose, for example, that I have a string divided into paragraphs.

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla vitae elit libero, a pharetra augue. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras mattis consectetur purus sit amet fermentum.

Duis mollis, est non commodo luctus, nisi erat porttitor ligula, eget lacinia odio sem nec elit. Aenean eu leo quam. Pellentesque ornare sem lacinia quam venenatis vestibulum. Cras justo odio, dapibus ac facilisis in, egestas eget quam. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur blandit tempus porttitor. Maecenas sed diam eget risus varius blandit sit amet non magna.

I would like to store this string in an array or something similar, in a way that lets me find the length or location of each of the two paragraphs very quickly. For example (pseudocode):

Array => {
    paragraphs => {
        "Lorem ipsum dolor sit amet, [...] fermentum.",
        ...
    }
}

I don't really know whether this has a name. I suppose there is a lot of theory about how to do this type of task. I am really interested in practices that pay attention to performance when processing a large amount of text. I would like to have something to study and read carefully.

Any help would be much appreciated. Thanks in advance,
—Alberto

Comments (1)

红ご颜醉 2024-10-20 03:09:49

Perhaps read up on Apache UIMA. It is all about analyzing unstructured information, and text analysis is a major component of it.
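
Separately from UIMA, the structure described in the question can be sketched in plain Python. This is only a minimal illustration, not something taken from UIMA or from the question itself: the names Span, index_paragraphs and ParagraphIndex are hypothetical, and it assumes that paragraphs are separated by one or more blank lines. The idea is to keep the text as one string and store only each paragraph's start offset and length, so that length and location are simple lookups and "which paragraph contains offset N" is a binary search.

```python
import bisect
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    start: int   # offset of the paragraph's first character in the text
    length: int  # number of characters in the paragraph


def index_paragraphs(text: str) -> List[Span]:
    """Record (start, length) for every paragraph in a single pass.

    Assumption: a paragraph break is a newline, optional whitespace,
    and another newline (i.e. at least one blank line).
    """
    spans = []
    pos = 0
    for sep in re.finditer(r"\n\s*\n", text):
        if sep.start() > pos:  # skip empty paragraphs
            spans.append(Span(pos, sep.start() - pos))
        pos = sep.end()
    if pos < len(text):  # text after the last separator
        spans.append(Span(pos, len(text) - pos))
    return spans


class ParagraphIndex:
    """Hypothetical wrapper: the original text plus its paragraph spans."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.spans = index_paragraphs(text)
        # Sorted start offsets, kept separately for binary search.
        self._starts = [s.start for s in self.spans]

    def paragraph(self, i: int) -> str:
        """Return the text of paragraph i by slicing the original string."""
        s = self.spans[i]
        return self.text[s.start:s.start + s.length]

    def paragraph_at(self, offset: int) -> Optional[int]:
        """Return the index of the paragraph containing offset, or None
        if the offset falls in a separator or outside the text."""
        i = bisect.bisect_right(self._starts, offset) - 1
        if i >= 0 and offset < self.spans[i].start + self.spans[i].length:
            return i
        return None
```

With this layout, ParagraphIndex(text).spans[i] gives the location and length of paragraph i without copying any text, and paragraph_at(offset) costs O(log n) in the number of paragraphs. Building the index is one linear scan, so it only pays off when the same text is queried more than once.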
