Data scraping using wget and regex
I'm just learning bash scripting, and I was trying to scrape some data out of a site, mostly Wiktionary. This is what I'm trying on the command line right now, but it is not returning any result:
wget -qO- http://en.wiktionary.org/wiki/robust | egrep '<ol>{[a-zA-Z]*[0-9]*}*</ol>'
What I'm trying to do is get the data between the <ol> tags; I just want it to be displayed. Can you please help me find out what I'm doing wrong?
Thanks
3 Answers
You need to send the output to stdout:

wget -qO- http://en.wiktionary.org/wiki/robust

To get all the <ol> tags, pipe that output into grep with a pattern that matches the whole element.
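A minimal sketch of what such a grep pipeline could look like (the exact command from the original answer was not preserved, so this is an assumption; note it only matches <ol> elements that fit on a single line):

```shell
# Fetch the page to stdout and print only the <ol>…</ol> matches.
# -o prints just the matched text instead of the whole line;
# -e marks the following argument as the pattern.
wget -qO- http://en.wiktionary.org/wiki/robust | grep -oe '<ol>.*</ol>'
```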
At the very least you need the -e switch to activate the regular expression, and the -O - option, which sends wget's output to stdout instead of to disk. Honestly, I'd say grep is the wrong tool for this task, since grep works on a per-line basis and your expression stretches over several lines.
I think sed or awk would be a better fit. With sed you would print the range of lines from <ol> to </ol>; if you want to get rid of the extra <ol> and </ol> themselves, you can append an expression that strips them.
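A sketch of how the sed variant could look (the original answer's exact commands were lost, so the range address and the tag-stripping substitution below are illustrative assumptions):

```shell
# Print every line from an opening <ol> through the closing </ol>.
wget -qO- http://en.wiktionary.org/wiki/robust | sed -n '/<ol>/,/<\/ol>/p'

# Append a substitution that strips the tags themselves,
# leaving only the visible list text.
wget -qO- http://en.wiktionary.org/wiki/robust \
  | sed -n '/<ol>/,/<\/ol>/p' \
  | sed -e 's/<[^>]*>//g'
```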
If I understand the question correctly, the goal is to extract the visible text content from within the <ol> sections. I would do it this way:
[source: "Using the Linux Shell for Web Scraping"]
hxnormalize preprocesses the HTML code for hxselect, which applies the CSS selector "ol". Lynx then renders the code and reduces it to what is visible in a browser.
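The pipeline itself did not survive in the text above; a sketch of what it could look like, assuming the html-xml-utils tools (hxnormalize, hxselect) and lynx are installed:

```
# hxnormalize -x makes the HTML well-formed so hxselect can parse it;
# hxselect pulls out the elements matching the CSS selector "ol";
# lynx renders that fragment and dumps only the visible text.
wget -qO- http://en.wiktionary.org/wiki/robust \
  | hxnormalize -x \
  | hxselect -s '\n' 'ol' \
  | lynx -stdin -dump
```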