Is there something like a "CSS selector" or XPath grep?

Posted 2024-12-03 18:48:06

I need to find all places in a bunch of HTML files that match the following structure (CSS):

div.a ul.b

or XPath:

//div[@class="a"]//ul[@class="b"]

grep doesn't help me here. Is there a command-line tool that returns all files (and optionally all places within them) that match this criterion? That is, a tool that returns the file name if the file matches a certain HTML or XML structure.

堇色安年 2024-12-10 18:48:06

Try this:

1. Install http://www.w3.org/Tools/HTML-XML-utils/.
   - Ubuntu: aptitude install html-xml-utils
   - macOS: brew install html-xml-utils
2. Save a web page (call it filename.html).
3. Run: hxnormalize -l 240 -x filename.html | hxselect -s '\n' -c "label.black"

Here "label.black" is the CSS selector that identifies the HTML elements to extract. Write a helper script named cssgrep:

#!/bin/bash

# Ignore errors, write the results to standard output.
hxnormalize -l 240 -x "$1" 2>/dev/null | hxselect -s '\n' -c "$2"

You can then run:

cssgrep filename.html "label.black"

This prints the content of every HTML label element with the class black.

The -l 240 argument matters because it keeps line breaks out of the extracted content. For example, if the input is <label class="black">Text to \nextract</label>, then -l 240 reformats the HTML to <label class="black">Text to extract</label> and only inserts newlines at column 240, which simplifies parsing. Raising the limit to 1024 or beyond is also possible.
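The cssgrep helper above prints matches from a single file, while the question asks for the names of matching files. A minimal sketch of that variant follows. The function name css_files is made up for illustration, and the approach assumes that hxselect prints nothing when the selector matches no element.

```shell
#!/bin/bash
# css_files SELECTOR FILE...
# Print the name of every FILE that contains at least one element
# matching SELECTOR. (css_files is a hypothetical name.)
css_files() {
  local selector=$1 file
  shift
  for file in "$@"; do
    # Without -c, hxselect prints the matched elements themselves, so a
    # matching element with empty content still produces output.
    if [ -n "$(hxnormalize -x "$file" 2>/dev/null | hxselect "$selector" 2>/dev/null)" ]; then
      printf '%s\n' "$file"
    fi
  done
}
```

After sourcing the function, css_files 'div.a ul.b' *.html should list the matching files, one per line.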

See also:

https://superuser.com/a/529024/9067 - similar question
https://gist.github.com/Boldewyn/4473790 - wrapper script


想念有你 2024-12-10 18:48:06

I have built a command-line tool with Node.js that does just this. You enter a CSS selector, and it searches through all of the HTML files in the directory and tells you which files have matches for that selector.

You will need to install Element Finder, cd into the directory you want to search, and then run:

elfinder -s "div.a ul.b"

For more info please see http://keegan.st/2012/06/03/find-in-files-with-css-selectors/


彩扇题诗 2024-12-10 18:48:06

There are at least 4 tools:

pup - Inspired by jq, pup aims to be a fast and flexible way of exploring HTML from the terminal.
htmlq - Like jq, but for HTML. Uses CSS selectors to extract bits of content from HTML files.
hq - Lightweight command line HTML processor using CSS and XPath selectors.
xq - Command-line XML and HTML beautifier and content extractor.

Examples:

$ wget http://en.wikipedia.org/wiki/Robots_exclusion_standard -O robots.html

$ pup --color 'title' < robots.html
<title>
 Robots exclusion standard - Wikipedia
</title>

$ htmlq --text 'title' < robots.html
Robots exclusion standard - Wikipedia

$ hq --xpath '//title' < robots.html
<title>robots.txt - Wikipedia</title>

$ xq --xpath '//title' < robots.html
robots.txt - Wikipedia
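The examples above read one file from standard input. To search a whole directory tree and show where the matches are, the same call can be wrapped in a loop. This is only a sketch: the function name css_grep and the grep-like "file: text" output format are assumptions, and because it relies on htmlq --text, matching elements without any text content will not show up.

```shell
#!/bin/bash
# css_grep SELECTOR [DIR]
# Print "file: matched text" for each line of text that htmlq extracts
# from the .html files under DIR (default: current directory).
# (css_grep is a hypothetical name.)
css_grep() {
  local selector=$1 dir=${2:-.} file line
  find "$dir" -type f -name '*.html' -print0 |
    while IFS= read -r -d '' file; do
      htmlq --text "$selector" < "$file" 2>/dev/null |
        while IFS= read -r line; do
          # Skip lines that contain only whitespace.
          if [ -n "${line//[[:space:]]/}" ]; then
            printf '%s: %s\n' "$file" "$line"
          fi
        done
    done
}
```

For the structure in the question, the call would be css_grep 'div.a ul.b' path/to/site.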


乖乖哒 2024-12-10 18:48:06

Per Nat's answer here:

How to parse XML in Bash?

Command-line tools that can be called from shell scripts include:

4xpath - command-line wrapper around Python's 4Suite package
XMLStarlet
xpath - command-line wrapper around Perl's XPath library
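As a sketch of how XMLStarlet could answer the original question, the loop below first converts each HTML file to well-formed XML and then evaluates an XPath expression against it. The function name xpath_files is made up for illustration. The expression tests class names token by token, because @class="a" only matches when the attribute is exactly "a", whereas the CSS selector div.a also matches class="a x". One assumption to be aware of: files that declare the XHTML namespace would need a namespace-aware expression instead.

```shell
#!/bin/bash
# xpath_files XPATH FILE...
# Print the name of every FILE in which XPATH selects at least one node.
# (xpath_files is a hypothetical name.)
xpath_files() {
  local xpath=$1 file
  shift
  for file in "$@"; do
    # "fo --html --recover --dropdtd" parses lenient HTML and emits XML
    # without a DOCTYPE; "sel -t -c" copies the nodes matched by the XPath.
    if [ -n "$(xmlstarlet fo --html --recover --dropdtd "$file" 2>/dev/null |
               xmlstarlet sel -t -c "$xpath" 2>/dev/null)" ]; then
      printf '%s\n' "$file"
    fi
  done
}

# XPath 1.0 counterpart of the CSS selector "div.a ul.b".
div_a_ul_b='//div[contains(concat(" ", normalize-space(@class), " "), " a ")]//ul[contains(concat(" ", normalize-space(@class), " "), " b ")]'
```

A call would then look like xpath_files "$div_a_ul_b" *.html.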
