How do I extract a substring with a regular expression? (screen scraping)

Posted 2024-09-02 08:10:56 · 546 characters · 8 views · 0 comments

Hey guys, I'm really trying to understand regular expressions while scraping a site. I've used them enough in my code to pull out the following, but I'm stuck here. I need to quickly grab this:

http://www.example.com/online/store/TitleDetail?detail&sku=123456789

from this:

('<a href="javascript:if(handleDoubleClick(this.id)){window.location=\'http://www.example.com/online/store/TitleDetail?detail&sku=123456789\';}" id="getTitleDetails_123456789">\r\n\t\t\t            \tcheck store inventory\r\n\t\t\t            </a>', 1)

This is where I got confused. Any ideas?

Edit: the SKU number changes per product, so therein lies the trouble for me.

清风无影 2024-09-09 08:10:56

http://www\.example\.com/online/store/TitleDetail\?detail&sku=\d+

Use the \d character class with a greedy + quantifier to match any run of digits in the sku field.
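
As a rough sketch of how this pattern might be applied from Python (the language the other answers in this thread use): the names `row` and `SKU_URL` are made up for illustration, and the sample tuple is reconstructed from the question with the whitespace shortened. It assumes the \' shown in the question is only how Python displays a quote, so the real string contains a plain '.

```python
import re

# Sample data modeled on the question's tuple (hypothetical variable name).
row = (
    '<a href="javascript:if(handleDoubleClick(this.id)){window.location='
    "'http://www.example.com/online/store/TitleDetail?detail&sku=123456789';}\""
    ' id="getTitleDetails_123456789">\r\n\t\t\tcheck store inventory\r\n\t\t\t</a>',
    1,
)

# The dots and the question mark are escaped so they match literally;
# \d+ matches one or more digits, whatever the SKU happens to be.
SKU_URL = re.compile(
    r"http://www\.example\.com/online/store/TitleDetail\?detail&sku=\d+"
)

match = SKU_URL.search(row[0])
url = match.group(0) if match else None  # None when no URL is present
print(url)
```

Checking `match` before calling `group` avoids an AttributeError on rows that do not contain the link.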

迷乱花海 2024-09-09 08:10:56

import re
# Match the literal backslash-quote after window.location= and capture up to the next backslash.
pattern = re.compile(r"window\.location=\\'([^\\]*)")
haystack = r"""<a href="javascript:if(handleDoubleClick(this.id)){window.location=\'http://www.example.com/online/store/TitleDetail?detail&sku=123456789\';}" id="getTitleDetails_123456789">\r\n\t\t\t\tcheck store inventory\r\n\t\t\t</a>"""
url = pattern.search(haystack).group(1)  # the URL between the escaped quotes
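
One caveat, offered as a hedged note: this pattern expects a literal backslash before each quote. That holds for the raw-string haystack here, but possibly not for the scraped data, since the \' in the question looks like Python's own escaping when it displays a string. A variant that tolerates both forms might look like the following; `extract_location` is a hypothetical helper name.

```python
import re

# An optional backslash before the opening quote, then capture everything
# up to the next quote or backslash.
LOCATION = re.compile(r"window\.location=\\?'([^'\\]*)")


def extract_location(text):
    """Return the URL assigned to window.location, or None if absent."""
    match = LOCATION.search(text)
    return match.group(1) if match else None


# Same fragment twice: once with literal backslashes, once with plain quotes.
escaped = r"""{window.location=\'http://www.example.com/online/store/TitleDetail?detail&sku=123456789\';}"""
plain = "{window.location='http://www.example.com/online/store/TitleDetail?detail&sku=123456789';}"

print(extract_location(escaped))
print(extract_location(plain))
```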

等待我真够勒 2024-09-09 08:10:56

You don't need regular expressions for that; just use string methods:

result = html[0].split("window.location='")[1].split("'")[0]
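
The one-liner raises an IndexError when the marker is missing from a row. A slightly more defensive sketch using str.partition is shown below; `url_from_anchor` is a hypothetical name, and `html` stands in for the tuple from the question (whitespace shortened), assuming its string contains plain single quotes.

```python
def url_from_anchor(text, marker="window.location='"):
    """Return the text between marker and the next single quote, or None."""
    _, found, rest = text.partition(marker)
    if not found:  # marker not present in this row
        return None
    url, quote, _ = rest.partition("'")
    return url if quote else None  # None if the closing quote is missing


html = (
    '<a href="javascript:if(handleDoubleClick(this.id)){window.location='
    "'http://www.example.com/online/store/TitleDetail?detail&sku=123456789';}\""
    ' id="getTitleDetails_123456789">check store inventory</a>',
    1,
)

result = url_from_anchor(html[0])
print(result)
```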

救星 2024-09-09 08:10:56

If there are always 9 digits (the . and ? are escaped so that they match literally):

http://www\.example\.com/online/store/TitleDetail\?detail&sku=[0-9]{9}

If the number of digits can vary:

http://www\.example\.com/online/store/TitleDetail\?detail&sku=[0-9]+

More general:

http.*?sku=[0-9]+

(The ? in .*? makes the match lazy: it finds shorter matches first, so it is less likely to find a match that spans multiple URLs.)

Edit: [0-9], not [1-9].
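
To see the lazy-versus-greedy difference concretely, here is a small Python sketch. The sample text with two URLs is invented for illustration, and the general pattern is written as http.*?sku=[0-9]+. In a regex, an unescaped ? is a quantifier rather than a literal question mark, which the last check demonstrates.

```python
import re

# Invented sample: two product links on one line.
text = (
    "a='http://www.example.com/online/store/TitleDetail?detail&sku=111' "
    "b='http://www.example.com/online/store/TitleDetail?detail&sku=222222222'"
)

# Lazy .*? stops at the first "sku=" it can reach, so each URL is found separately.
lazy = re.findall(r"http.*?sku=[0-9]+", text)

# Greedy .* runs to the last "sku=", so a single match swallows both URLs.
greedy = re.findall(r"http.*sku=[0-9]+", text)

# Without a backslash, "l?" means "an optional l", so the literal "?" never matches.
unescaped = re.search(r"TitleDetail?detail", "TitleDetail?detail")
escaped = re.search(r"TitleDetail\?detail", "TitleDetail?detail")

print(len(lazy), len(greedy))
print(unescaped is None, escaped is not None)
```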

本王不退位尔等都是臣 2024-09-09 08:10:56

http://txt2re.com/ might help you.
