What is the best way to select the portion of a text to cut out based on keywords?
When you search for something on Stack Overflow, it cuts out the portion of the question description that best matches your criteria and then highlights the matching words.
I wonder what the best way is to do this manually in C#, that is, without the help of a full-text search engine.
The main problem is how to select the best portion of the text quickly.
What I have done so far:
- I obtain the indexes of the spaces in the text. This tells me where the words begin, so that I can start my substring tests from them.
- From each of the space indexes, I take the next 300 characters and count how many occurrences of the keywords I find.
- I assume that the 300-character portion with the most occurrences is the best one, so I cut it from the original text.
Is this a good approach? Is there a faster way? Is counting the number of occurrences the best way to find the most relevant portion?
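For reference, the steps above might be sketched roughly as follows. This is only an illustration, not the asker's actual code: the `SnippetSelector` class and `BestWindow` method are hypothetical names. It differs from the description in one way: instead of re-scanning 300 characters at every space, it collects the keyword positions once and slides two indexes over them, which avoids repeated substring tests. It assumes plain case-insensitive substring matching (so "cat" also matches inside "category") and counts a hit when the keyword starts inside the window.

```csharp
using System;
using System.Collections.Generic;

public static class SnippetSelector
{
    // Returns the window of up to `length` characters, starting at a word
    // boundary, in which the most keyword occurrences begin.
    public static string BestWindow(string text, IEnumerable<string> keywords, int length = 300)
    {
        // 1. Find the start index of every keyword occurrence, once.
        var hits = new List<int>();
        foreach (var keyword in keywords)
        {
            if (string.IsNullOrEmpty(keyword)) continue;
            int i = 0;
            while ((i = text.IndexOf(keyword, i, StringComparison.OrdinalIgnoreCase)) >= 0)
            {
                hits.Add(i);
                i += keyword.Length;
            }
        }
        hits.Sort();

        // 2. Slide over candidate starts (index 0 and every index after a space).
        //    hits[lo..hi) are the occurrences starting inside [s, s + length).
        int bestStart = 0, bestCount = -1, lo = 0, hi = 0;
        for (int s = 0; s < text.Length; s++)
        {
            if (s > 0 && text[s - 1] != ' ') continue; // only start at word beginnings
            while (lo < hits.Count && hits[lo] < s) lo++;
            while (hi < hits.Count && hits[hi] < s + length) hi++;
            if (hi - lo > bestCount)
            {
                bestCount = hi - lo;
                bestStart = s;
            }
        }

        // 3. Cut the best window out of the original text.
        return text.Substring(bestStart, Math.Min(length, text.Length - bestStart));
    }
}
```

For example, `SnippetSelector.BestWindow("aaa bbb key ccc key ddd", new[] { "key" }, 11)` returns `"key ccc key"`, the only 11-character window that contains both occurrences.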
Comments (1)
Using this approach, you will often find that the best match has keywords near its start or end, which means you won't have much context for those keywords. I'd add an extra condition that there must be n words on either side of any keyword near the start or end of the match.
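A minimal sketch of that extra condition, assuming you already know where the first keyword starts and where the last one ends inside a candidate window. The `ContextCheck` class and its parameter names are made up for illustration.

```csharp
using System;

public static class ContextCheck
{
    private static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };

    // True when at least minWords words lie before the first keyword hit
    // and after the last keyword hit inside the candidate window.
    // firstHitStart and lastHitEnd are offsets relative to the window.
    public static bool HasContext(string window, int firstHitStart, int lastHitEnd, int minWords)
    {
        string before = window.Substring(0, firstHitStart);
        string after = window.Substring(lastHitEnd);
        return CountWords(before) >= minWords && CountWords(after) >= minWords;
    }

    private static int CountWords(string s) =>
        s.Split(Whitespace, StringSplitOptions.RemoveEmptyEntries).Length;
}
```

Candidate windows that fail the check can be skipped or shifted. Note that the condition cannot be met when a keyword sits at the very beginning or end of the whole text, so you would probably want to relax it at the edges.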
You could consider breaking the match at more convenient places, such as punctuation or conjunction words instead of spaces.
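One way to do this, sketched under the assumption that you only want to move the start of the snippet and only by a limited distance: look backwards from the chosen start for a punctuation mark and begin right after it. `BoundarySnapper`, `SnapStart`, and the `maxShift` default are hypothetical, and the set of break characters is just an example (conjunctions are not handled here).

```csharp
using System;

public static class BoundarySnapper
{
    private static readonly char[] Breaks = { '.', '!', '?', ';', ',' };

    // Looks back at most maxShift characters from `start` for a punctuation
    // mark. If one is found, returns the index of the first non-whitespace
    // character after it; otherwise returns `start` unchanged.
    public static int SnapStart(string text, int start, int maxShift = 40)
    {
        start = Math.Min(start, text.Length);
        if (start <= 0) return 0;

        int count = Math.Min(maxShift, start);
        int p = text.LastIndexOfAny(Breaks, start - 1, count);
        if (p < 0) return start;

        int s = p + 1;
        while (s < text.Length && char.IsWhiteSpace(text[s])) s++;
        return s;
    }
}
```

The same idea can be mirrored for the end of the snippet with `IndexOfAny`, searching forwards instead of backwards.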
You might also want to look into term frequency-inverse document frequency (TF-IDF) to give different weightings to the keywords rather than just counting them.
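As a rough illustration, assuming you have the whole collection of documents available to compute document frequencies: rarer keywords get a larger weight, and a window is scored by summing the weights of the occurrences it contains. The `KeywordWeights` class is hypothetical, and the smoothed formula `log((1 + N) / (1 + df)) + 1` is just one common IDF variant, not the only one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KeywordWeights
{
    // Inverse document frequency per keyword: keywords that appear in
    // fewer documents receive a higher weight.
    public static Dictionary<string, double> Idf(IReadOnlyList<string> documents, IEnumerable<string> keywords)
    {
        var weights = new Dictionary<string, double>(StringComparer.OrdinalIgnoreCase);
        foreach (var keyword in keywords)
        {
            if (string.IsNullOrEmpty(keyword)) continue;
            int df = documents.Count(d => d.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0);
            weights[keyword] = Math.Log((1.0 + documents.Count) / (1.0 + df)) + 1.0;
        }
        return weights;
    }

    // Scores a candidate window: each keyword occurrence adds that keyword's weight.
    public static double Score(string window, IReadOnlyDictionary<string, double> weights)
    {
        double score = 0;
        foreach (var pair in weights)
        {
            if (string.IsNullOrEmpty(pair.Key)) continue;
            int i = 0;
            while ((i = window.IndexOf(pair.Key, i, StringComparison.OrdinalIgnoreCase)) >= 0)
            {
                score += pair.Value;
                i += pair.Key.Length;
            }
        }
        return score;
    }
}
```

With this in place, the window with the highest `Score` replaces the window with the highest raw count, so a single rare keyword can outweigh several occurrences of a very common one.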