String manipulation in Python: handling queries for a simple search engine

Posted on 2025-02-11 07:09:22

I am implementing a simple search engine that searches a data source of 12k written news articles on different topics. We assume that the search engine only has to respond to three kinds of queries:

Phrase Queries, which are enclosed in double quotation marks
Not Queries, which come right after an exclamation mark
And Queries, which carry no special mark

For instance, this query:

"global warming" worldwide !USA

is a query whose results should:

contain the Phrase Query: "global warming"
contain the And Query: worldwide
not contain the Not Query: USA

The point is that the words of a Phrase Query must appear consecutively, as a single unit, with no other words between them.
My problem is splitting a query into these three types using Python string operations or the re library.
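
To make the intended semantics of the three query types concrete, here is a minimal sketch of how an already-split query might be checked against a single document. The helper names (`tokenize`, `contains_phrase`, `matches`) and the lowercase word tokenizer are assumptions made for illustration; they are not part of the code in this post.

```python
import re

def tokenize(text):
    # Hypothetical, deliberately simple tokenizer: lowercase word tokens.
    return re.findall(r"\w+", text.lower())

def contains_phrase(doc_tokens, phrase):
    # True if the words of `phrase` occur consecutively in doc_tokens.
    words = tokenize(phrase)
    n = len(words)
    if n == 0:
        return True
    return any(doc_tokens[i:i + n] == words
               for i in range(len(doc_tokens) - n + 1))

def matches(document, phrase_query, and_query, not_query):
    tokens = tokenize(document)
    vocab = set(tokens)
    return (
        all(contains_phrase(tokens, p) for p in phrase_query)   # phrases, contiguous
        and all(t.lower() in vocab for t in and_query)          # required terms
        and not any(t.lower() in vocab for t in not_query)      # excluded terms
    )
```

With this sketch, a document such as "Global warming is a worldwide problem" would match the example query, while a document that mentions the USA, or one where "global" and "warming" are separated by other words, would not.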

I have written this piece of code to extract the Phrase Queries and the Not Queries, but I have not managed to extract the And Queries yet:

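import re

query = input()
# Phrase Queries: everything between a pair of double quotes
phrase_query = re.findall(r'"([^"]*)"', query)
# Not Queries: the word right after an exclamation mark
not_query = re.findall(r'!(\w+)', query)
print(phrase_query)
print(not_query)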

For the input of:

"global warming" worldwide !USA

the above code returns:

['global warming']
['USA']

This is what I want. However, I cannot extract the And Query. How can I extract the And Query, worldwide, into a separate list?

迷途知返 2025-02-18 07:09:22

If I understand the problem correctly, anything that is not part of a phrase query or a not query is part of the and query. So we can simply remove the terms that belong to those queries from the string and then split what is left to get the individual terms.

import re

data = '"global warming" worldwide !USA'

query = data
phrase_query = re.findall(r'"([^"]*)"', query)
not_query = re.findall(r'!(\w+)', query)

and_query = data[:]

for q in phrase_query:
    complete_text = '"' + q + '"'
    and_query = and_query.replace(complete_text, "")
for q in not_query:
    complete_text = "!" + q
    and_query = and_query.replace(complete_text, "")

and_query = and_query.split()


print(and_query)
print(phrase_query)
print(not_query)

What I am doing here is this: the first for loop goes over all the phrase queries and restores each one to the form it had in the original query by adding the quotes before and after it. I then replace that text with an empty string, which simply removes it. After that, I do the same with all the not queries, but this time I add an exclamation mark in front.

The remaining terms in the search string are all and queries, so we can split the string to get those terms individually in a list.
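
An alternative to removing matched text with `str.replace` is to classify every token in a single pass with one regular expression. The following is only a sketch under that assumption: the names `TOKEN` and `split_query` and the group names are hypothetical, and the pattern treats any run of characters other than whitespace, quotes and exclamation marks as an and term.

```python
import re

# One alternation, tried left to right at each position:
#   "..."  -> phrase query
#   !word  -> not query (optional spaces after the "!")
#   other  -> and query (run of characters without spaces, quotes or "!")
TOKEN = re.compile(r'"(?P<phrase>[^"]*)"|!\s*(?P<neg>\w+)|(?P<term>[^\s"!]+)')

def split_query(query):
    phrase_query, and_query, not_query = [], [], []
    for m in TOKEN.finditer(query):
        if m.group("phrase") is not None:
            phrase_query.append(m.group("phrase").strip())
        elif m.group("neg") is not None:
            not_query.append(m.group("neg"))
        else:
            and_query.append(m.group("term"))
    return phrase_query, and_query, not_query

print(split_query('"global warming" worldwide !USA'))
# (['global warming'], ['worldwide'], ['USA'])
```

Because the phrase alternative is tried first, an exclamation mark inside a quoted phrase stays part of the phrase instead of starting a not query.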

EDIT: a more robust solution (one that handles spaces effectively):


import re

data = '" global warming " worldwide ! USA'

# Phrase queries: the text between a pair of double quotes.
phrase_pattern = r'"([^"]*)"'
# Not queries: a word after "!", allowing optional spaces in between.
not_pattern = r'!\s*(\w+)'

# Trim the spaces that may surround the phrase inside the quotes.
phrase_query = [p.strip() for p in re.findall(phrase_pattern, data)]

# Remove the phrases first so a "!" inside quotes is not read as a not query.
rest = re.sub(phrase_pattern, " ", data)
not_query = re.findall(not_pattern, rest)

# Whatever is left after removing the not queries are the and queries.
and_query = re.sub(not_pattern, " ", rest).split()

print(and_query)
print(phrase_query)
print(not_query)
