Python regex to exclude strings containing a specific word
I am trying to use a regex to exclude disambiguation pages when scraping Wikipedia. I looked around for tips on using the negative lookahead, but I cannot seem to make it work. I think I am missing something fundamental about its use, and as of now I am totally clueless. Could someone please point me in the right direction? (I don't want to use `if 'disambiguation' in y`; I am trying to grasp the workings of the negative lookahead.) Thank you.

Here is the code:

```
import re

list_links = ['/wiki/Oolong_(disambiguation)', '/wiki/File:Mi_Lan_Xiang_Oolong_Tea_cropped.jpg',
              '/wiki/Taiwanese_tea', '/wiki/Tung-ting_tea',
              '/wiki/Nantou_County', '/wiki/Taiwan', '/wiki/Dongfang_Meiren',
              '/wiki/Alishan_National_Scenic_Area', '/wiki/Chiayi_County',
              '/wiki/Dayuling', '/wiki/Baozhong_tea', '/wiki/Pinglin_Township']

def findString(string):
    # Attempted pattern from the question; it does not behave as intended
    regex1 = r'(/wiki/)(_\($)(!?disambiguation)'
    for x in list_links:
        y = re.findall(regex1, x)
        print(y)

findString(list_links)
```
Comments (3)
You can use one of the regexes, depending on your needs. I have also made some changes to the function definition to respect PEP 8.
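The answer's own code did not survive on this page, so what follows is only a sketch of how a negative lookahead could be applied here, not the answerer's original patterns. A lookahead is written `(?!...)`. In the question's pattern, `(!?disambiguation)` is an ordinary capturing group that matches an optional literal `!` followed by `disambiguation`, and `(_\($)` demands the end of the string immediately after `_(`, so nothing can ever match. The names `ANYWHERE`, `SUFFIX_ONLY`, and `find_links`, as well as the extra `/wiki/Word-sense_disambiguation` sample, are assumptions added for illustration.

```python
import re

# Shortened sample list; the second entry is a hypothetical addition that
# shows where the two patterns differ.
links = [
    '/wiki/Oolong_(disambiguation)',
    '/wiki/Word-sense_disambiguation',
    '/wiki/Taiwanese_tea',
    '/wiki/Taiwan',
]

# (?!...) is a zero-width assertion: it succeeds only when the enclosed
# pattern does NOT match starting at the current position.

# Variant 1: reject any link that contains "disambiguation" anywhere.
ANYWHERE = re.compile(r'^(?!.*disambiguation)/wiki/.+$')

# Variant 2: reject only links that end with the "_(disambiguation)" suffix.
SUFFIX_ONLY = re.compile(r'^/wiki/(?!.*_\(disambiguation\)$).+$')


def find_links(candidates, pattern):
    """Return the candidates that the given compiled pattern accepts."""
    return [link for link in candidates if pattern.match(link)]


print(find_links(links, ANYWHERE))
# ['/wiki/Taiwanese_tea', '/wiki/Taiwan']
print(find_links(links, SUFFIX_ONLY))
# ['/wiki/Word-sense_disambiguation', '/wiki/Taiwanese_tea', '/wiki/Taiwan']
```

Which variant fits depends on whether an ordinary article with "disambiguation" in its title should be kept; the suffix-only form is the narrower filter.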
For your case, the simplest solution would be not to use a regex at all. Just do something like:
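The snippet that followed this answer is missing from the page. A minimal sketch of the regex-free approach, assuming a plain substring test is what was meant, could look like this (the shortened `links` list and the name `filtered` are illustrative):

```python
# Shortened sample list taken from the question.
links = [
    '/wiki/Oolong_(disambiguation)',
    '/wiki/Taiwanese_tea',
    '/wiki/Nantou_County',
]

# Keep only the links that do not contain the word "disambiguation".
filtered = [link for link in links if 'disambiguation' not in link]

print(filtered)
# ['/wiki/Taiwanese_tea', '/wiki/Nantou_County']
```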
You do not need to use a regex. You can iterate through `list_links` and check whether the string you are looking for, `'disambiguation'`, is in each item in `list_links`.