JavaScript Python regex web-scraping scraper

Extracting data from JavaScript (Python Scraper)

Posted on 2024-10-14 14:09:12

I'm currently using a combination of urllib2, pyquery, and json to scrape a site, and now I find that I need to extract some data from JavaScript. One thought would be to use a JavaScript engine (like V8), but that seems like overkill for what I need. I would use regular expressions, but the expression for this seems way too complex.

JavaScript:

(function(){DOM.appendContent(this, HTML("<html>"));;})

I need to extract the <html>, but I'm not entirely sure how to do so. The <html> itself can contain basically every character under the sun, so [^"] won't work.

Any thoughts?

Comments (2)

人│生佛魔见 2024-10-21 14:09:12

Why regex? Can't you just use two substrings as you know how many characters you want to trim off the beginning and end?

string[42:-7]

Besides being quicker than a regex, slicing also means it doesn't matter whether quotes inside <html> are escaped or not.
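A minimal sketch of the slicing idea, assuming the wrapper around <html> is always exactly the one shown in the question. The names PREFIX, SUFFIX, and extract_html are made up for illustration. Checking the wrapper before slicing avoids silently returning garbage if the site ever changes its JavaScript.

```python
# Assumption: the payload is always wrapped exactly as in the question.
PREFIX = '(function(){DOM.appendContent(this, HTML("'  # 42 characters
SUFFIX = '"));;})'                                     # 7 characters


def extract_html(js):
    """Return the text between the fixed wrapper (hypothetical helper)."""
    wrapped = (
        len(js) >= len(PREFIX) + len(SUFFIX)
        and js.startswith(PREFIX)
        and js.endswith(SUFFIX)
    )
    if not wrapped:
        raise ValueError("unexpected JavaScript wrapper")
    # Equivalent to js[42:-7] for this particular wrapper.
    return js[len(PREFIX):len(js) - len(SUFFIX)]


# Hypothetical sample; real input would come from the scraped page.
sample = PREFIX + '<p class=\\"x\\">hi</p>' + SUFFIX
print(extract_html(sample))
```

The result is still the raw JavaScript string body, so any escape sequences such as \" remain in it and would need decoding separately.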

卷耳 2024-10-21 14:09:12

If every occurrence of " inside the HTML code is escaped as \" (it is a JavaScript string after all), you could use

HTML\("((?:\\"|.)*?)"\)

to get the parameter to HTML into the first capturing group.

Note that this regex has not itself been escaped for use inside a JavaScript string literal.
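In Python, a raw string avoids most of that extra escaping. The sketch below is one possible way to apply the idea with the re module. It swaps in a stricter pattern than the one above, which is my assumption and not part of the answer: it consumes any backslash escape as a unit and otherwise refuses quotes and backslashes, so it stops at the first unescaped ". The lazy pattern can run past the closing quote when the string ends in an escaped backslash and another ") appears later. Decoding with json.loads is a further assumption that the string uses only JSON-compatible escapes. The names HTML_ARG and extract_html_arg are hypothetical.

```python
import json
import re

# Assumption: stricter variant of the pattern in the answer.
#   \\.      any backslash escape (\" or \\ and so on), taken as a unit
#   [^"\\]   any other character except a quote or a backslash
HTML_ARG = re.compile(r'HTML\("((?:\\.|[^"\\])*)"\)', re.DOTALL)


def extract_html_arg(js):
    """Return the raw, still-escaped argument of HTML("..."), or None."""
    match = HTML_ARG.search(js)
    return match.group(1) if match else None


# Hypothetical sample; real input would come from the scraped page.
sample = r'(function(){DOM.appendContent(this, HTML("<a href=\"x\">go</a>"));;})'
raw = extract_html_arg(sample)
print(raw)

# Assumption: only JSON-compatible escapes are present, so json can decode them.
print(json.loads('"' + raw + '"'))
```

If the page uses escapes that JSON does not accept (for example \' or \x41), the decoding step would need a different approach.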
