String manipulation in Python: handling queries for a simple search engine
I am implementing a simple search engine that searches a source dataset of 12k written news articles on different topics. Assume the search engine can only respond to:
- Phrase Queries, which are enclosed in double quotation marks
- Not Queries, which come after an exclamation mark
- And Queries, which come without any specific mark
For instance, this query:
"global warming" worldwide !USA
is a query whose results should:
- contain the Phrase Query: "global warming"
- contain the And Query: worldwide
- not contain the Not Query: USA
The point is that the words of a Phrase Query must appear consecutively, as a single unit, with no other words between them.
My problem is splitting these three types of queries using Python string operations or the re library.
I have written this piece of code to extract the Phrase Queries and Not Queries, but I have not managed to extract the And Queries yet:
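import re

query = input()  # e.g. "global warming" worldwide !USA
# Phrase queries: the text enclosed in double quotes
phrase_query = re.findall(r'"([^"]*)"', query)
# Not queries: a word immediately after an exclamation mark
not_query = re.findall(r'!(\w+)', query)
print(phrase_query)
print(not_query)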
For the input of:
"global warming" worldwide !USA
the above code returns:
['global warming']
['USA']
Which is great. However, I cannot extract the And Query. How can I extract the And Query (worldwide) into a separate list?
Comments (1)
If I understand the problem correctly, anything that is not part of a phrase query or a not query is part of the and query. So we can essentially remove the terms that belong to those queries from the string and then split what is left to get the individual terms.
So, what I am doing here is this: in the first for loop, I loop over all the phrase queries and complete each one by adding the quotes before and after, just as it appears in the original query. Then I replace it with an empty string, which simply removes it. After that, I do a similar thing with all the not queries, but this time I add an exclamation mark in front.
Then, the remaining terms in the search are all and queries, so we can split them to get those terms individually in a list.
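The steps described above can be sketched roughly as follows. This is a minimal illustration, not necessarily the code originally posted with this answer; the function name split_query and the variable remaining are made up for the example.

```python
import re

def split_query(query):
    """Split a query into phrase, not, and and terms (hypothetical helper)."""
    phrase_query = re.findall(r'"([^"]*)"', query)
    not_query = re.findall(r'!(\w+)', query)

    remaining = query
    # Rebuild each phrase with its surrounding quotes and remove it.
    for phrase in phrase_query:
        remaining = remaining.replace('"' + phrase + '"', '')
    # Rebuild each not term with its leading exclamation mark and remove it.
    for term in not_query:
        remaining = remaining.replace('!' + term, '')

    # Whatever is left is the and query. split() with no arguments
    # also discards the extra whitespace left behind by the removals.
    and_query = remaining.split()
    return phrase_query, not_query, and_query

print(split_query('"global warming" worldwide !USA'))
# (['global warming'], ['USA'], ['worldwide'])
```

Note that this simple version can misbehave on trickier input: an exclamation mark inside a quoted phrase is also picked up as a not query, and removing a short not term such as !US would also cut into a longer one such as !USA.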
EDIT for a more robust solution (one that handles spaces effectively):
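One way such a version could look is to tokenize the query in a single regex pass instead of removing substrings. This is a sketch under my own assumptions, not necessarily the code from the original edit: the names TOKEN and parse_query are hypothetical, and terms and not terms are assumed to consist of word characters (\w+) only.

```python
import re

# One alternative per query type. Order matters: a quoted phrase is
# tried first, so a "!" inside quotes is not treated as a not query.
TOKEN = re.compile(r'"([^"]*)"|!(\w+)|(\w+)')

def parse_query(query):
    """Classify every token of the query in one pass (hypothetical helper)."""
    phrase_query, not_query, and_query = [], [], []
    for match in TOKEN.finditer(query):
        phrase, not_term, and_term = match.groups()
        if phrase is not None:
            # Collapse repeated spaces inside the phrase; skip empty phrases.
            cleaned = ' '.join(phrase.split())
            if cleaned:
                phrase_query.append(cleaned)
        elif not_term is not None:
            not_query.append(not_term)
        else:
            and_query.append(and_term)
    return phrase_query, not_query, and_query

print(parse_query('  "global   warming"   worldwide   !USA  '))
# (['global warming'], ['USA'], ['worldwide'])
```

Because every token is matched exactly once, extra spaces between terms are simply skipped, and a short not term can no longer damage a longer one.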