Best method of parsing an HTML table

Posted 2024-12-21 05:33:25

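I'm interested in parsing the following table, and others like it: http://www.cityofames.org/ftp/routes/Fall/wdreds&w.html

Any suggestions on the best tool for the job? After some searching I can't decide what to use, and I'd like some feedback before committing to an approach.

I'm open to any language or tool.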

居里长安 2024-12-28 05:33:25

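If you are looking for an HTML parser, several options are available in Java.

It is also worth reading a fairly comprehensive discussion of the pros and cons of using them.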

仙女山的月亮 2024-12-28 05:33:25

With lynx I can do:

$ lynx -dump http://www.cityofames.org/ftp/routes/Fall/wdreds\&w.html
    6:25  6:31  6:36  6:41 -----  6:46  6:50      6:56
    7:02  7:08  7:14  7:20 -----  7:26  7:30      7:36
   ----- ----- ----- -----  7:38  7:43  7:47      7:53 1A
    7:28  7:35  7:42  7:48 -----  7:56  8:00      8:06
   ----- ----- ----- -----  7:58  8:03  8:07      8:13 1A
...

That output is very easy to parse with the scripting language of your choice. html2text may also work, though I have never used it.

You could also use grep/sed to format it.
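
As a rough illustration of the "scripting language of choice" step, here is a minimal Python sketch that turns the dumped text into rows. It assumes the output looks like the excerpt above: whitespace-separated clock times, with runs of dashes marking a stop that is skipped. The names `CELL` and `parse_dump` are made up for this example, and trailing notes such as `1A` are simply ignored.

```python
import re

# Matches a clock time such as 6:25, or a run of dashes marking "no stop".
CELL = re.compile(r"\d{1,2}:\d{2}|-{2,}")

def parse_dump(text):
    """Turn lynx -dump output into a list of rows; dashes become None."""
    rows = []
    for line in text.splitlines():
        cells = CELL.findall(line)
        if not any(":" in c for c in cells):
            continue  # skip headers, blank lines, and other non-schedule text
        rows.append([c if ":" in c else None for c in cells])
    return rows

# Two lines copied from the lynx output shown above.
sample = """\
    6:25  6:31  6:36  6:41 -----  6:46  6:50      6:56
   ----- ----- ----- -----  7:38  7:43  7:47      7:53 1A
"""

for row in parse_dump(sample):
    print(row)
```

In practice the text could be fed in from a pipe, for example by reading `sys.stdin` after `lynx -dump ... |`. Lines without any time are dropped, so column headers would need separate handling if they matter.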

完美的未来在梦里 2024-12-28 05:33:25

HTML as found in the wild is often not well-formed, so a strict XML parser cannot read it directly. First convert the page to a reasonably close well-formed XML format (well-formed meaning every tag is properly matched), such as XHTML, using a program like tidy (http://tidy.sourceforge.net/).
You can then use an XML/XHTML parser on the well-formed result. Note that the data on this page is distinguished by font style, so you will have to pick out the font-style tags and convert their contents into an array of times.

Here is what you can do while parsing:

start TR element
  -- create array
  start b element
    -- add first time
  end b element
  start b element
    -- add second time
  end b element
end TR element
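
The event sequence above maps fairly directly onto a SAX-style handler. The following is only a sketch using Python's standard `xml.sax` module. It assumes the page has already been converted to well-formed XHTML with lowercase tag names, and that each time sits inside a `b` element. The `RowHandler` class and the sample markup are hypothetical stand-ins, not the real page.

```python
import xml.sax

class RowHandler(xml.sax.ContentHandler):
    """Collects the text of every <b> element, grouped by enclosing <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []    # one list of times per table row
        self.row = None   # row being built, or None outside <tr>
        self.buf = None   # text pieces of the current <b>, or None outside <b>

    def startElement(self, name, attrs):
        if name == "tr":
            self.row = []              # start TR element: create array
        elif name == "b" and self.row is not None:
            self.buf = []              # start b element: begin collecting text

    def characters(self, content):
        if self.buf is not None:
            self.buf.append(content)   # text may arrive in several chunks

    def endElement(self, name):
        if name == "b" and self.buf is not None:
            self.row.append("".join(self.buf).strip())  # end b: add one time
            self.buf = None
        elif name == "tr" and self.row is not None:
            self.rows.append(self.row)                  # end TR: row complete
            self.row = None

# Hypothetical, already well-formed markup standing in for tidy's output.
sample = (b"<table><tr><td><b>6:25</b></td><td><b>6:31</b></td></tr>"
          b"<tr><td><b>7:02</b></td></tr></table>")

handler = RowHandler()
xml.sax.parseString(sample, handler)
print(handler.rows)
```

If the real page marks times with a different element (for example `font` or `i`), only the tag name checks would need to change.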
