How can I check whether a string loosely contains another string (ignoring case, extra whitespace, and punctuation)?
I am writing a program in C# that compares strings similarly to the way that Google searches documents for keywords.
I want a search for "stack overflow" to return true for "stack overflow" (plain), "This is the stack overflow." (in the middle), "Welcome to Stack Overflow." (case insensitive), "I like   stack overflow." (variable whitespace), and "Who puts a dash in stack-overflow?", but not for "stackoverflow" (no whitespace).
I was thinking that I could use a regular expression like "stack([ -]|. )+overflow", but it seems like overkill to have to replace every space in each keyword with a character set for each new keyword. Because "stack overflow" is not the only string I am searching for, I have to do it programmatically.
Comments (3)
To meet your specification, you could first transform your plain-text search string into a regular expression that also allows punctuation in places where there used to be only whitespace, and then apply that regex to whatever text you're searching.
But of course this will fail to match on the slightest typo, whereas an algorithm using Levenshtein distance will also match "Stak Overfloor".
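The keyword-to-regex transformation described above could be sketched roughly as follows. This is only an illustration, not necessarily the code the answer had in mind: the class and method names are hypothetical, the separator class [\s\p{P}]+ (whitespace or Unicode punctuation) is an assumption about what "punctuation" should cover, and the \b anchors assume each keyword starts and ends with a word character.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LooseRegexSearch
{
    // Hypothetical helper: turns "stack overflow" into a pattern equivalent to
    // \bstack[\s\p{P}]+overflow\b (case-insensitive).
    public static Regex BuildPattern(string keyword)
    {
        // Split on any whitespace and escape each word so that regex
        // metacharacters inside the keyword are treated literally.
        var words = keyword
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => Regex.Escape(w));

        // Between words, require one or more whitespace or punctuation characters.
        string pattern = @"\b" + string.Join(@"[\s\p{P}]+", words) + @"\b";
        return new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
    }

    public static bool LooselyContains(string text, string keyword)
    {
        return BuildPattern(keyword).IsMatch(text);
    }
}
```

Because at least one separator is required between the words, "stackoverflow" does not match, while "stack-overflow" and "Stack Overflow" do.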
If you simply want to achieve the effect in the specific case you mentioned, you can use regular expressions to replace the tokens you want to ignore with a single space (or an empty string).
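A minimal sketch of that normalization idea, assuming the tokens to ignore are runs of whitespace and Unicode punctuation. The names are hypothetical, and the lower-casing and the space padding (so that only whole words match) are additions made here to fit the question's examples, not part of the suggestion itself.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NormalizedSearch
{
    // Collapse every run of whitespace/punctuation into one space,
    // trim the ends, and lower-case the result.
    public static string Normalize(string s)
    {
        return Regex.Replace(s, @"[\s\p{P}]+", " ").Trim().ToLowerInvariant();
    }

    public static bool LooselyContains(string text, string keyword)
    {
        // Pad both sides with a space so that the keyword only matches
        // on whole-word boundaries of the normalized text.
        string haystack = " " + Normalize(text) + " ";
        string needle = " " + Normalize(keyword) + " ";
        return haystack.Contains(needle);
    }
}
```

Replacing with a single space rather than an empty string keeps "stackoverflow" distinct from "stack overflow".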
If you want a more elaborate solution, you could use dynamic programming to compute the smallest number of edits required to transform the first string into the second (the edit distance). This will allow matching with a few missing letters or typos, too.
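One common instance of that dynamic-programming idea is the Levenshtein distance, which counts single-character insertions, deletions, and substitutions. The sketch below is an assumption about what the answer means, with hypothetical names, and it only computes the distance between two whole strings; choosing a threshold and deciding which part of the text to compare against are left open.

```csharp
using System;

public static class EditDistance
{
    // Classic two-row Levenshtein distance between a and b.
    public static int Levenshtein(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(prev[j] + 1,      // deletion
                             curr[j - 1] + 1), // insertion
                    prev[j - 1] + cost);       // substitution or match
            }
            // The current row becomes the previous row for the next iteration.
            var tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }
}
```

A small distance relative to the keyword length could then be treated as a match, which is how a typo such as "Stak Overfloor" might still be accepted.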
If you are comparing against short strings, then the easiest way that I can see would be to strip out all of the whitespace and other characters from both strings and do a simple string.Contains.
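As a rough sketch of this approach, assuming "other characters" means punctuation (the names are hypothetical, and the lower-casing is added here only to cover the question's case-insensitive example):

```csharp
using System;
using System.Text.RegularExpressions;

public static class StrippedSearch
{
    // Remove all whitespace and punctuation, then lower-case.
    public static string Strip(string s)
    {
        return Regex.Replace(s, @"[\s\p{P}]+", "").ToLowerInvariant();
    }

    public static bool LooselyContains(string text, string keyword)
    {
        return Strip(text).Contains(Strip(keyword));
    }
}
```

Note that stripping the separators entirely means "stackoverflow" also matches "stack overflow", which the question wanted to exclude.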