Parsing HTML title tags with a regular expression

Posted 2024-10-31 19:29:36

I need to parse a lot of HTML files to find out which ones contain specific text within the title tag.

Let's suppose the titles are:

file1.htm
<title>100 text other text</title>
file2.htm
<title>text 100 text other text</title>
file3.htm
<title>text 1000 text other text</title>
file4.htm
<title>text one hundred text other text</title>

Following my example, I need to find the names of the files whose title contains 100 or one hundred, that is, files 1, 2 and 4.

My problem is that I don't know how to write the regular expression:

gci "c:\my_folder" | ? {$_.extension -eq ".htm"} | 
select-string -pattern '<title>*100*</title>' |
Select-Object -Unique Path

Please note, in case it matters for the regex, that the title tag is not at the beginning of a line but in the middle.
Thanks in advance.

捂风挽笑 2024-11-07 19:29:36

This should do it.

^.*<title>(.*(100|one\shundred)[^0].*)?</title>.*$
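
A quick way to see what this pattern accepts is to run it against the four sample titles from the question. The sketch below uses Python's `re` module purely as a test harness, with the pattern copied unchanged. The `<head>` wrapper, the `SAMPLES` and `EDGE_CASES` dictionaries, and the `matching_names` helper are assumptions made for illustration, not part of the original answer.

```python
import re

# Pattern copied verbatim from the answer above.
PATTERN = re.compile(r"^.*<title>(.*(100|one\shundred)[^0].*)?</title>.*$")

# Titles from the question. The <head> wrapper is an assumption, added so
# that the title tag sits in the middle of the line as the question says.
SAMPLES = {
    "file1.htm": "<head><title>100 text other text</title></head>",
    "file2.htm": "<head><title>text 100 text other text</title></head>",
    "file3.htm": "<head><title>text 1000 text other text</title></head>",
    "file4.htm": "<head><title>text one hundred text other text</title></head>",
}

# Hypothetical extra inputs that probe the pattern's boundaries.
EDGE_CASES = {
    "title_ends_with_100": "<head><title>text 100</title></head>",
    "empty_title": "<head><title></title></head>",
    "embedded_2100": "<head><title>text 2100 text</title></head>",
    "trailing_1001": "<head><title>text 1001 text</title></head>",
}


def matching_names(pattern, lines):
    """Return, in sorted order, the names whose line matches the pattern."""
    return sorted(name for name, line in lines.items() if pattern.search(line))


print(matching_names(PATTERN, SAMPLES))
print(matching_names(PATTERN, EDGE_CASES))
```

On the sample titles this selects files 1, 2 and 4, as requested. The edge cases suggest some caveats: because the group is optional, an empty title also matches; a title that ends in 100 is missed, since `[^0]` needs a character after the number; and numbers such as 2100 or 1001 are accepted.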

没︽人懂的悲伤 2024-11-07 19:29:36

Try

<title>(.*[^[:alnum:]])?(100|one hundred)([^[:alnum:]].*)?</title>

as the pattern to match. The pattern syntax is PCRE (as in Perl); it can be reformulated if necessary.

best regards,

carsten

PS:
Beware of the pitfalls: all the recommendations and warnings from the comments do hold. Still, in your case the regex approach seems viable, mainly because you're investigating the content of the 'title' tag; there should be only one per file, and spreading it across multiple lines would be plain silly, IMHO.
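
The POSIX class `[:alnum:]` in the pattern above is PCRE syntax. Python's `re` module does not support it, and Select-String runs on the .NET regex engine, which to my knowledge does not either, so some reformulation is likely needed. The sketch below assumes an ASCII-only substitute, `[^A-Za-z0-9]`, and checks it against the question's samples. The `<head>` wrapper, the `SAMPLES` and `EDGE_CASES` dictionaries, and the `matching_names` helper are hypothetical, introduced only for this check.

```python
import re

# The answer's pattern with [^[:alnum:]] replaced by [^A-Za-z0-9].
# This substitution is an assumption and covers ASCII letters and digits only.
PATTERN = re.compile(
    r"<title>(.*[^A-Za-z0-9])?(100|one hundred)([^A-Za-z0-9].*)?</title>"
)

# Titles from the question, wrapped in <head> (an assumption) so the
# title tag is in the middle of the line.
SAMPLES = {
    "file1.htm": "<head><title>100 text other text</title></head>",
    "file2.htm": "<head><title>text 100 text other text</title></head>",
    "file3.htm": "<head><title>text 1000 text other text</title></head>",
    "file4.htm": "<head><title>text one hundred text other text</title></head>",
}

# Hypothetical extra inputs that probe the pattern's boundaries.
EDGE_CASES = {
    "title_ends_with_100": "<head><title>text 100</title></head>",
    "empty_title": "<head><title></title></head>",
    "embedded_2100": "<head><title>text 2100 text</title></head>",
    "trailing_1001": "<head><title>text 1001 text</title></head>",
}


def matching_names(pattern, lines):
    """Return, in sorted order, the names whose line matches the pattern."""
    return sorted(name for name, line in lines.items() if pattern.search(line))


print(matching_names(PATTERN, SAMPLES))
print(matching_names(PATTERN, EDGE_CASES))
```

With this substitution the pattern selects files 1, 2 and 4 from the samples, and of the edge cases only the title ending in 100 matches. As the PS notes, this still assumes the whole title sits on a single line.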
