Getting a list of delimited filenames from a text file

Posted on 2024-07-26 07:38:21

I'm really new to Bash, so this could sound silly to most of you.
I'm trying to get a list of some filenames from a text file. Tried to do this with sed and awk, but couldn't get it to work with my limited knowledge.

This is a sample file content:

<?xml version="1.0" encoding="utf-8"?>
<!-- Generator: Adobe Illustrator 13.0.1, SVG Export Plug-In . SVG Version: 6.00 Build 14948)  -->
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg version="1.1" id="Layer_1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px"
 width="471.677px" height="126.604px" viewBox="0 0 471.677 126.604" enable-background="new 0 0 471.677 126.604"
 xml:space="preserve">
<rect x="0.01" y="1.27" fill="none" width="471.667" height="125.333"/>
<text transform="matrix(1 0 0 1 0.0098 8.3701)"><tspan x="0" y="0" font-family="'MyriadPro-Regular'" font-size="10">/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy 2.pdf</tspan><tspan x="0" y="12" font-family="'MyriadPro-Regular'" font-size="10">/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy.pdf</tspan><tspan x="0" y="24" font-family="'MyriadPro-Regular'" font-size="10">/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf</tspan></text>
</svg>

What I would like to get from this sample is a new text file with this exact content:

/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy 2.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf

I thought of telling sed to print all the matching entries between 'font-size="10">' and '</tspan>', but the best I got was a file with the whole line containing my field delimiters.

If you could explain each step, that would be great.

There could be more or fewer filenames; these 3 are just an example.

撕心裂肺的伤痛 2024-08-02 07:38:21

How about this:

sed -e 's/^[^>]*>//' -e 's/<.*$//' file.xml | grep '\.'

It's not very general-purpose: it works line by line and assumes a single tag in front of each filename. Being fully general would be a lot more complicated (XML really requires a full parser).

Basically, the sed script has two parts. The first strips all characters from the beginning of the line (^) through the first ">" character. Note that [^>]* matches only characters other than ">", which is what makes the match stop at the first one. The second part strips all characters from the leftmost "<" to the end of the line. Since the second part runs after the first, the opening tag is already gone by then, which is why it doesn't erase the whole line.

Then the grep command keeps only lines with a "." in them, which should be just the lines with filenames.
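
To see the two substitutions at work, here is a minimal sketch that feeds one sample line through them. The helper name strip_tags and the sample line are illustrative only, and the sketch assumes each tspan element sits on its own line, which is what this approach needs.

```shell
#!/bin/sh
# strip_tags is a hypothetical helper name wrapping the two sed expressions.
strip_tags() {
    # 1st expression: delete from the line start through the first ">".
    # 2nd expression: delete from the first remaining "<" to the line end.
    sed -e 's/^[^>]*>//' -e 's/<.*$//'
}

# Illustrative input: a single tspan element on its own line.
line='<tspan x="0" y="0" font-size="10">/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf</tspan>'

# grep keeps only the results that still contain a literal dot.
printf '%s\n' "$line" | strip_tags | grep '\.'
```

If a line starts with two tags in a row (for example a text tag directly followed by a tspan tag), the second expression removes everything that is left, so such lines produce no filename.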

Hope that helps!

别在捏我脸啦 2024-08-02 07:38:21

The sed command for this will be

 sed  -n 's|font-size="[0-9]*".\(.*\)</tspan.*|\1|p' file.xml
            -------------------  --  ---------
               prefix part       \1   suffix

This is how it works:

- -n suppresses the automatic printing of every line from the buffer
- the p at the end prints the buffer only when a substitution was made
- '|' is used as the separator instead of the usual '/', so the slashes in </tspan and in the paths need no escaping
- the search pattern matches everything from font-size="[0-9]*". through </tspan.*
- the part between \( and \) is the one we are interested in
- \1 in the replacement keeps just that part in the buffer for printing

This command uses the grouping operator \( ... \).

On your file this gives,

/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy 2.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf

Note that it is important to choose the correct prefix and suffix strings to get all the matches. In your example these are the font-size and tspan parts marked above, but that may not hold for every filename in your file, so check that.
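
Because \(.*\) is greedy, a sed pattern of this kind yields one clean filename per line only when each tspan element sits on its own line. If several tspan elements share one line, one possible workaround is to let grep split the matches first. This is only a sketch: the helper name list_tspan_text and the sample paths are made up for illustration, and it assumes a grep that supports -o (GNU and BSD grep do, POSIX does not require it) and a system that provides mktemp.

```shell
#!/bin/sh
# list_tspan_text is a hypothetical helper name, not part of the answer above.
list_tspan_text() {
    # grep -o prints every match on its own line: an opening tspan tag
    # followed by the text up to the next "<".
    # sed then removes the opening tag, leaving only the text.
    grep -o '<tspan[^>]*>[^<]*' "$1" | sed 's/^<tspan[^>]*>//'
}

# Illustrative sample: three tspan elements on a single line.
sample=$(mktemp)
printf '%s\n' '<text><tspan font-size="10">/tmp/a 1.pdf</tspan><tspan font-size="10">/tmp/b.pdf</tspan><tspan font-size="10">/tmp/c.pdf</tspan></text>' > "$sample"

list_tspan_text "$sample"
rm -f "$sample"
```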

小嗲 2024-08-02 07:38:21

Sed and awk are generally not the right way to read XML. They may work, but the XML can change layout at any time and break things, while still being perfectly valid XML.

Much better is to use something like Perl. Install the XML::Smart module either via CPAN or, on Ubuntu, with "sudo apt-get install libxml-smart-perl".

Then a simple script like this:

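use strict;
use diagnostics;

use XML::Smart;

# Parse the XML file.
my $xml = XML::Smart->new ("svg.xml") || die "Cannot read XML: $!.";
# Sanity check: the root svg element should carry a version attribute.
my $version = $xml->{svg}{version} || die "Cannot determine SVG version.";

# ('@') returns all svg/text/tspan nodes as a list.
foreach my $file ($xml->{svg}{text}{tspan}('@')) {
    print $file->content . "\n";
}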

Save it as svg.pl. Save your XML as svg.xml.

$ perl svg.pl
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy 2.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf

This:

Parses the XML, checking it is correct.
Checks that the version exists (just a sanity check really).
Loops through an array of all svg/text/tspans and prints the content.

Have fun!

鲜血染红嫁衣 2024-08-02 07:38:21

Others have given good answers on why you should use a proper XML parser if you want to parse XML, but here is an explanation of how to accomplish this in sed, in case you come across a similar issue:

#Full Command
sed -n 's/^[^<]*<tspan[^>]*>\([^<]*\)<.*/\1/p'  ~/your_file.xml

The -n option makes sed produce no output unless asked to. Normally sed prints the pattern space at the end of each cycle, which would clutter the result.

We start with s, since we're [s]ubstituting. The "/" that follows tells sed that we'll be using "/" to divide the different parts of the script.

Grab everything from the beginning of the line (^) that is not an open bracket ([^<]*). This will be discarded later on.

Grab <tspan and everything after it up to and including the closing bracket (<tspan[^>]*>). This will also be discarded.

Grab everything after that closing bracket that is not an open bracket. This is the part we want to keep, so we enclose it in escaped parentheses: \([^<]*\)

Grab everything from the next open bracket until the end of the line (<.*). We'll be throwing this away, too.

Second part of the command: \1
All this means is: repeat back whatever was matched by the first set of escaped parentheses. There is only one set of parentheses, so \2, \3, etc. are meaningless here, but you might use them in other scripts. In your case, you want to repeat back what we matched inside your <tspan> element.

Lastly: "p" makes sed print the line if a substitution was made. Together with the -n at the beginning, this amounts to "don't print anything except matches".
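
As a small illustration of how \1, \2 and the -n / p pair behave when there is more than one group, here is a sketch on made-up input. The helper name swap_pair and the key=value lines are hypothetical and unrelated to the SVG file.

```shell
#!/bin/sh
# swap_pair is a hypothetical helper used only to illustrate \1 and \2.
swap_pair() {
    # \([^=]*\) captures the text before "=" as group 1,
    # \(.*\) captures the rest of the line as group 2;
    # the replacement emits the two groups in swapped order.
    # With -n and p, lines where nothing was substituted are not printed.
    sed -n 's/^\([^=]*\)=\(.*\)$/\2=\1/p'
}

printf '%s\n' 'name=value' 'no separator here' | swap_pair
```

Only the first input line contains "=", so only that line is printed, with its two halves swapped.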

Hope that was helpful ...

美胚控场 2024-08-02 07:38:21

If you have xmlgawk, you can get them easily.

@load xml

BEGIN {
    XMLMODE = 1;
    XMLCHARSET = "utf-8";
}

XMLCHARDATA {
    data = $0;
}

XMLENDELEM == "tspan" {
    print data;
}

and

$ xgawk -f pick_from_svg.awk sample.xml 
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy 2.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf

仙女 2024-08-02 07:38:21

awk 'BEGIN{RS="font-size=\"10\">|</tspan>"}/pdf/' xml.txt

Result

$ awk 'BEGIN{RS="font-size=\"10\">|</tspan>"}/pdf/' xml.txt
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy 2.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf

This code is probably the simplest one yet, with no messy regex, and it is easy to extend and adjust to your liking. I decided to match against the term 'pdf', hence the /pdf/ portion of the code, but if, for example, you wanted to match files that aren't PDFs but do contain the word 'Volumes', you could simply use /Volumes/ instead.
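
Treating RS as a regular expression is supported by gawk and mawk, but POSIX leaves the behavior of a multi-character RS unspecified. Where that matters, a similar result can be sketched with a field separator instead. The helper name pdf_fields and the sample paths below are illustrative assumptions, and the /\.pdf$/ test plays the same role as /pdf/ above.

```shell
#!/bin/sh
# pdf_fields is a hypothetical helper name.
pdf_fields() {
    # -F'[<>]' makes every "<" or ">" a field separator, so tag bodies and
    # text content end up in separate fields; print the fields ending in ".pdf".
    awk -F'[<>]' '{ for (i = 1; i <= NF; i++) if ($i ~ /\.pdf$/) print $i }' "$@"
}

# Illustrative input: two tspan elements on one line, read from stdin.
printf '%s\n' '<text><tspan font-size="10">/tmp/a 1.pdf</tspan><tspan font-size="10">/tmp/b.pdf</tspan></text>' | pdf_fields
```

Called with a filename argument (for example pdf_fields xml.txt), the helper reads that file instead of standard input.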
