How can I extract/parse tabular data from a text file in Perl?
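I am looking for something like HTML::TableExtract, except that it should work on plain-text input rather than HTML: text that contains "tables" formatted with indentation and spacing.
The data could look like this: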
Here is some header text.
Column One Column Two Column Three
a b
a b c
Some more text
Another Table Another Column
abdbdbdb aaaa
Comments (2)
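I'm not aware of any packaged solution, but something not very flexible is fairly simple to do, assuming you can make two passes over the file (the following is a partially Perlish pseudocode example):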
Text::CSV(_XS)
Sample code:
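A minimal sketch of how the two-pass idea could look in Perl. It assumes that a column separator is any character offset that is blank on every line of the table and that no tabs are used; the helper names `blank_offsets`, `column_ranges`, and `split_row` are hypothetical and chosen for illustration only.

```perl
use strict;
use warnings;

# Pass 1: flag every character offset that is blank on all lines.
# Returns one 0/1 flag per offset (1 = blank in every line).
sub blank_offsets {
    my @lines = @_;
    my $width = 0;
    for my $line (@lines) {
        $width = length($line) if length($line) > $width;
    }
    my @blank = (1) x $width;
    for my $line (@lines) {
        my @chars = split //, $line;
        for my $i (0 .. $#chars) {
            $blank[$i] = 0 if $chars[$i] ne ' ';
        }
    }
    return @blank;
}

# Turn the flags into [start, end] ranges of non-blank offsets (the columns).
sub column_ranges {
    my @blank = @_;
    my (@ranges, $start);
    for my $i (0 .. $#blank) {
        if (!$blank[$i]) {
            $start = $i unless defined $start;
        }
        elsif (defined $start) {
            push @ranges, [$start, $i - 1];
            undef $start;
        }
    }
    push @ranges, [$start, $#blank] if defined $start;
    return @ranges;
}

# Pass 2: cut one line at the column ranges and trim each cell.
sub split_row {
    my ($line, @ranges) = @_;
    my @cells;
    for my $r (@ranges) {
        my ($from, $to) = @$r;
        # Short lines simply yield empty cells for the missing columns.
        my $cell = $from < length($line)
            ? substr($line, $from, $to - $from + 1)
            : '';
        $cell =~ s/^\s+|\s+$//g;
        push @cells, $cell;
    }
    return @cells;
}

# Assumed sample table (not the data from the question).
my @table = (
    sprintf('%-12s%s', 'Name',    'City'),
    sprintf('%-12s%s', 'Ann Lee', 'Oslo'),
    sprintf('%-12s%s', 'Bob',     'New York'),
);
my @ranges = column_ranges(blank_offsets(@table));
print join(' | ', split_row($_, @ranges)), "\n" for @table;
```

With this rule, a header such as "Column One" is split at its internal space whenever every other line is also blank at that offset, since the code cannot know any better.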
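Now, if you require columns to be separated by at least X spaces, where X > 1, that is also doable, but the parser for column locations needs to be a bit more complex: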
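One way the minimum-width rule could be sketched: build a profile of which offsets are blank on every line, then treat only runs of at least X blank offsets as separators. The function name `column_ranges_min_gap`, its argument order, and the alignment of the sample table are assumptions made for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return [start, end] column ranges for a table, where
# only runs of at least $min_gap all-blank offsets count as separators.
sub column_ranges_min_gap {
    my ($min_gap, @lines) = @_;
    my $width = 0;
    for my $line (@lines) {
        $width = length($line) if length($line) > $width;
    }

    # Profile string: ' ' where every line is blank, 'x' everywhere else.
    my $profile = ' ' x $width;
    for my $line (@lines) {
        while ($line =~ /\S/g) {
            substr($profile, pos($line) - 1, 1, 'x');
        }
    }

    # Fill interior gaps narrower than $min_gap so they no longer split
    # a column (for example the space inside "Column One").
    if ($min_gap > 1) {
        my $max_fill = $min_gap - 1;
        $profile =~ s/(?<=x)( {1,$max_fill})(?=x)/'x' x length($1)/ge;
    }

    # Every remaining run of 'x' is one column.
    my @ranges;
    while ($profile =~ /x+/g) {
        push @ranges, [$-[0], $+[0] - 1];
    }
    return @ranges;
}

# The question's first table, with an assumed alignment of 14 characters
# per column (the original spacing is not known).
my @table = (
    sprintf('%-14s%-14s%s', 'Column One', 'Column Two', 'Column Three'),
    sprintf('%-14s%-14s%s', 'a',          '',           'b'),
    sprintf('%-14s%-14s%s', 'a',          'b',          'c'),
);

for my $min_gap (1, 2) {
    my @ranges = column_ranges_min_gap($min_gap, @table);
    print "min gap $min_gap: ",
        join(', ', map { "$_->[0]-$_->[1]" } @ranges), "\n";
}
```

Each range can then be cut out of a line with `substr` and trimmed to obtain the cell values.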
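Here's a very quick solution, commented with an overview. (My apologies for the length.) Basically, if a "word" starts after the start of column header n, it ends up in column n, unless most of its body trails into column n + 1, in which case it ends up there instead. Tidying this up, extending it to support multiple different tables, etc. are left as an exercise. You could also use something other than the left offset of the column header as the boundary mark, such as the centre, or some value determined by the column number.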
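The rule described above might be sketched as follows. This is an illustration under stated assumptions, not the answer's original code: headers are assumed to be separated by at least two spaces (so "Column One" counts as one header), and `header_starts` and `assign_row` are hypothetical names.

```perl
use strict;
use warnings;

# Return the left offset of each column header. Words joined by a single
# space are treated as one header; two or more spaces start a new one.
sub header_starts {
    my ($header) = @_;
    my @starts;
    while ($header =~ /\S+(?: \S+)*/g) {
        push @starts, $-[0];
    }
    return @starts;
}

# Assign every word of a data line to a column, using the header offsets
# as boundaries.
sub assign_row {
    my ($line, @starts) = @_;
    my @cells = ('') x @starts;
    while ($line =~ /\S+/g) {
        my ($from, $to) = ($-[0], $+[0]);    # word occupies [$from, $to)

        # Column n is the last header that starts at or before the word.
        my $n = 0;
        $n++ while $n < $#starts && $starts[$n + 1] <= $from;

        # Move to column n + 1 if more than half of the word lies there.
        if ($n < $#starts) {
            my $overlap = $to - $starts[$n + 1];
            $n++ if $overlap > ($to - $from) / 2;
        }

        my $word = substr($line, $from, $to - $from);
        $cells[$n] .= ($cells[$n] eq '' ? '' : ' ') . $word;
    }
    return @cells;
}

# Assumed layout: three headers at offsets 0, 14 and 28.
my $header = sprintf('%-14s%-14s%s', 'Column One', 'Column Two', 'Column Three');
my @starts = header_starts($header);

# A misaligned word starting at offset 10 with most of its body past 14.
my $row = (' ' x 10) . 'abcdefghij' . (' ' x 10) . 'c';
print join(' | ', map { "[$_]" } assign_row($row, @starts)), "\n";
```

Here the ten-character word starts inside the first column, but six of its characters lie at or beyond offset 14, so it is assigned to the second column.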
Sample output: