Parsing/extracting text from a string in Rails?

Posted on 2024-11-17 06:02:56

I have a string in Rails, e.g. "This is a Twitter message. #books War & Peace by Leo Tolstoy. I love this book!", and I want to parse the text and extract only certain phrases, like "War & Peace by Leo Tolstoy".

Is this a matter of using a regex to lift the text between "#books" and "."?

What if there's no structure to the message, like:
"This is a Twitter message #books War & Peace by Leo Tolstoy I love this book!" or
"This is a Twitter message. I love the book War & Peace by Leo Tolstoy #books"
How can I reliably pull out the phrase "War & Peace by Leo Tolstoy" without knowing the phrase in advance?

Are there any gems, methods, etc. that can help me do this?

At the very least, what would you call what I'm trying to do? It will help me search for a solution on Google. I've tried a few searches on "parsing" with no luck.

--- edit ---
Based on @rogeliog's suggestion, I will add the following:

I can live with garbage text that comes after #books, but nothing before it. I tried ".match(/#books.*/)"; the results are here: www.rubular.com/r/gM7oSZxF5M.

But how can I capture Result #6? (e.g., when someone puts #books at the end of the sentence)?

Is there a way for me to do an if-then with regex? Something like:

if [#books is at the end of the message],

then [take the last 10 words preceding #books],

else [.match(/#books.*/)]

If you offer a regex, please post your solution as a rubular.com permalink.

Comments (2)

九命猫 2024-11-24 06:02:56

I think what you're going to need is Natural Language Processing. It's a very large field with many techniques and applications. For Ruby in particular, you may want to look at the Ruby Linguistics project.

Good luck. Parsing and processing natural language is not an easy thing to do.

酷炫老祖宗 2024-11-24 06:02:56

I think you are trying to parse some pretty complex variations. Do you have a DB with all the book titles? That would help a lot.
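
If such a list of titles exists, a lookup can replace most of the guessing. The following is only a minimal sketch: `KNOWN_TITLES` and `find_titles` are hypothetical names, and the array stands in for whatever table actually holds the titles.

```ruby
# Hypothetical list standing in for a real table of book titles.
KNOWN_TITLES = [
  "War & Peace",
  "Anna Karenina",
  "Crime and Punishment"
].freeze

# Returns every known title that occurs in the text, ignoring case.
def find_titles(text, titles = KNOWN_TITLES)
  lowered = text.downcase
  titles.select { |title| lowered.include?(title.downcase) }
end

tweet = "This is a Twitter message. I love the book War & Peace by Leo Tolstoy #books"
p find_titles(tweet) # prints ["War & Peace"]
```

This works no matter where #books appears in the message, but it only finds titles that are already in the list and spelled the same way.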

To get the title out of the first example ("This is a Twitter message. #books War & Peace by Leo Tolstoy. I love this book!"), you can simply run:

"This is a Twitter message. #books War & Peace by Leo Tolstoy. I love this book".match(/#books.*\./).to_s.gsub("#books", '')

That will return: " War & Peace by Leo Tolstoy."
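
One caveat: `.*` is greedy, so if the message contains more periods the match runs to the last one. A possible variation, sketched here with a hypothetical helper name `phrase_after_books`, uses a lazy quantifier and a capture group so that only the text up to the first period is returned.

```ruby
# Capture the text between "#books" and the next period.
# The lazy quantifier (.+?) stops at the first period rather than the last.
def phrase_after_books(text)
  match = text.match(/#books\s+(.+?)\./)
  match ? match[1] : nil
end

tweet = "This is a Twitter message. #books War & Peace by Leo Tolstoy. I love this book. Really."
p phrase_after_books(tweet) # prints "War & Peace by Leo Tolstoy"
```

This still assumes the phrase itself contains no period, so a title such as "Dr. Zhivago" would be cut short.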

If you want an if/else that depends on whether #books is at the end, you can do this:

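if text.match(/#books\s*$/)
  # #books ends the message: print up to the last 10 words before it
  puts text.match(/(\S+\s+){0,10}#books\s*$/).to_s.sub(/#books\s*$/, '')
else
  # otherwise print whatever follows #books
  puts text.match(/#books.*/).to_s.sub("#books", '')
end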

That gives you the last 10 words preceding #books if #books is at the end, and whatever comes after #books if it is not.
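
The same if/else can also be folded into a single pattern with alternation and named groups. This is a sketch under assumptions, not a tested production regex: `BOOKS_PATTERN` and `books_phrase` are hypothetical names, and "word" here just means a run of non-whitespace characters.

```ruby
# Matches either up to 10 words followed by a trailing "#books",
# or "#books" followed by the rest of the line.
BOOKS_PATTERN = /(?<before>(?:\S+\s+){1,10})#books\s*\z|#books\s+(?<after>.+)/

# Returns the captured words with surrounding whitespace removed,
# or nil when the text has no usable "#books" tag.
def books_phrase(text)
  match = BOOKS_PATTERN.match(text)
  return nil unless match

  (match[:before] || match[:after]).strip
end

at_end = "This is a Twitter message. I love the book War & Peace by Leo Tolstoy #books"
in_middle = "This is a Twitter message #books War & Peace by Leo Tolstoy I love this book!"

p books_phrase(at_end)    # prints "I love the book War & Peace by Leo Tolstoy"
p books_phrase(in_middle) # prints "War & Peace by Leo Tolstoy I love this book!"
```

Either way the result still carries extra words around the title; trimming those reliably needs a list of known titles or some form of language processing.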

I don't really have a better idea. I hope that works for you; let me know :)
