How to give all special characters their literal meaning in re.search?
I am trying to figure out if the manufacturer/brand owner is selling a product on an online platform. For example, for a product with the brand name “Hello Olly”, I would like the following seller names to show a match:
- HELlo ollY
- HelloOlly Inc.
- The hello olly Company
But not a match for:
- XYZ Seller
- Hello The olly company
Problem: I run into problems when the brand name has special characters, such as (.
Goal: To treat all special characters as literal characters. For example,
- ‘Hello (olly’ should show a match with ‘The Hello (olly Company’
- It would be extra nice if it also matched ‘Hello olly Company’ – note that ( has been removed in the seller name.
- ‘Hello (olly)’ should show a match with ‘The Hello (olly) Company’ – note that the first example had only an opening (; this one has both ( and ). Having just ( in the product name creates extra complications when there is no matching closing bracket.
All of these problems should be resolved if special characters are treated as literal characters.
Note: There could be an arbitrary number of special characters at any position. I would like them all to have their literal meaning.
The following function works if there are no special characters. I tried to use re.escape() to deal with special characters, but it didn’t help:

    import re

    def match_string(brand, seller):
        # may not need replace("-", "") if I have a better process to deal with all special characters.
        brand = str(brand).lower().replace(" ", "").replace("-", "")
        seller = str(seller).lower().replace(" ", "").replace("-", "")
        # Tried the following two lines to give special characters their literal meaning. But it doesn't seem to work
        brand = re.escape(brand)
        seller = re.escape(seller)
        try:
            match = re.search(brand, seller).group()
            return True
        except AttributeError:
            return False
Thanks, everyone
Comments (1)
I usually prefer to use just regex or just "standard" string manipulation and not mix them (it could be that this is just my personal preference). You could set the re.IGNORECASE flag to avoid lowercasing everything with str.lower() first.
Now, the straightforward way would be to actually feed a regex pattern to your function. In your example, it would be r'Hello\s?\(?olly\)?(\s?(Inc\.?|Company|Comp\.))?'. You could of course only look for r'Hello\s?\(?olly', which would also return a match in all cases but returns only the part 'Hello (olly' if the input string were 'The Hello (olly) Company'.
However, I fear that you are trying to write a function that builds a regex pattern from an input string. That is difficult, as it needs quite a few assumptions. The problem with constructing regex patterns programmatically is that Python automatically adds escape characters to escape the escape characters that you want to use in the pattern.
That is why you get a
or
r'(\(|\[|\{)?'
will look like
I never thought of how to avoid this, though...
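The doubled backslashes may only be a display effect: the interactive prompt shows a string's repr(), which writes every backslash twice, while the string itself holds just one. A small check of that assumption, using the brand from the question (the variable names are only for illustration):

```python
import re

brand = "Hello (olly"
pattern = re.escape(brand)

# repr() writes each backslash twice; print() shows the characters actually stored.
print(repr(pattern))
print(pattern)

# The raw-string literal r"\(" is two characters: a backslash and a parenthesis.
assert len(r"\(") == 2

# The escaped pattern matches the literal text, unbalanced parenthesis included.
found = re.search(pattern, "The Hello (olly Company", re.IGNORECASE)
print(found.group())  # Hello (olly
```

If that is what is happening here, the pattern handed to re.search is already correct and nothing needs to be avoided.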
BTW, the return value of re.search(brand, seller, re.IGNORECASE) can be interpreted as a boolean: it is None (falsy) when nothing matches and a match object (truthy) otherwise.
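Putting that together, here is a minimal sketch of the function from the question without the try/except. It assumes that only the brand should be escaped, since the brand is the pattern and the seller is just the text being searched; escaping the seller as well inserts backslashes into the text, which is probably why the original attempt found nothing. The helper name normalize is made up for this example.

```python
import re

def normalize(text):
    # Hypothetical helper: drop whitespace and hyphens, like the replace() calls in the question.
    return re.sub(r"[\s-]+", "", str(text))

def match_string(brand, seller):
    # Escape only the brand, because it is used as the pattern.
    # The seller is plain text to search in and must stay unescaped.
    pattern = re.escape(normalize(brand))
    # re.search returns None or a match object, so bool() gives True/False directly.
    return bool(re.search(pattern, normalize(seller), re.IGNORECASE))

print(match_string("Hello (olly", "The Hello (olly Company"))  # True
print(match_string("Hello Olly", "Hello The olly company"))    # False
```

This does not cover the "extra nice" case where the seller name has dropped the parenthesis; that would need the special characters stripped from both strings before comparing.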