Parsing/extracting text from a string in Rails?
I have a string in Rails, e.g. "This is a Twitter message. #books War & Peace by Leo Tolstoy. I love this book!", and I want to parse the text and extract only certain phrases, like "War & Peace by Leo Tolstoy".
Is this a matter of using Regex and lifting the text between "#books" to "."?
What if there's no structure to the message, like:
"This is a Twitter message #books War & Peace by Leo Tolstoy I love this book!" or
"This is a Twitter message. I love the book War & Peace by Leo Tolstoy #books"
How can I reliably pull out the phrase "War & Peace by Leo Tolstoy" without knowing the phrase ex ante?
Are there any gems, methods, etc. that can help me do this?
At the very least, what would you call what I'm trying to do? It will help me search for a solution on Google. I've tried a few searches on "parsing" with no luck.
--- edit ---
Based on @rogeliog's suggestion, I will add the following:
I can live with the garbage text that comes after #books, but nothing before. I tried "match.(/#books.*/)" -- results here: www.rubular.com/r/gM7oSZxF5M.
But how can I capture Result #6? (e.g., when someone puts #books at the end of the sentence)?
Is there a way for me to do an if-then with regex? Something like:
if [#books is at the end of the message],
then [take the last 10 words preceding #books],
else [match.(/#books.*/)]
If you offer a regex, please post your solution via a permalink using rubular.com.
Comments (2)
I think what you're going to need is Natural Language Processing. It's a very large field and has many techniques and applications. With Ruby in particular you may want to look at the Ruby Linguistics project.
Good luck to you; parsing and processing natural language is not an easy thing to do.
I think that you are trying to parse some pretty complex variations. Do you have a DB with all the book titles? That will help a lot.
To get the title out of the first example ("This is a Twitter message. #books War & Peace by Leo Tolstoy. I love this book!"), you can simply match the text that follows #books up to the next period.
That will return: " War & Peace by Leo Tolstoy."
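A minimal sketch of that kind of match, assuming the title always ends at the first period after #books. The helper name `title_after_books` is hypothetical, and unlike the result quoted above it drops the leading space and the trailing period:

```ruby
# Hypothetical helper: returns the text between "#books" and the next
# period, or nil when the message has no "#books ... ." pattern.
def title_after_books(message)
  # \s* skips the space after the hashtag; (.*?) is lazy, so the
  # capture stops at the first period rather than the last one.
  match = message.match(/#books\s*(.*?)\./)
  match && match[1].strip
end

message = "This is a Twitter message. #books War & Peace by Leo Tolstoy. I love this book!"
puts title_after_books(message)
```

This only works for the structured form of the message. When there is no period after the title, the match fails and the helper returns nil.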
If you want an if/else that depends on whether #books is at the end or not, you can branch on whether the message ends with #books.
That will give you the last 10 words preceding #books if #books is at the end, and whatever comes after #books if it is not at the end.
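One way to sketch that branching in plain Ruby, with the if/else done in code rather than inside the regex. The helper name `extract_near_books` and the `words_before` parameter are assumptions made for illustration:

```ruby
# Hypothetical helper implementing the if/else described above.
def extract_near_books(message, words_before: 10)
  text = message.strip
  if text.end_with?("#books")
    # Hashtag is last: keep up to `words_before` words that precede it.
    words = text.sub(/#books\z/, "").split
    words.last(words_before).join(" ")
  else
    # Hashtag is elsewhere (or missing): keep whatever follows it.
    match = text.match(/#books\s*(.*)/m)
    match ? match[1] : nil
  end
end

puts extract_near_books("This is a Twitter message. I love the book War & Peace by Leo Tolstoy #books")
puts extract_near_books("This is a Twitter message #books War & Peace by Leo Tolstoy I love this book!")
```

Neither branch isolates the title itself. Both return a window of text around the hashtag that still has to be cleaned up, for example by checking it against a list of known book titles.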
I don't really have a better idea. Hope that works for you, let me know :)