XPath: how to retrieve the value of a table cell from an HTML document
I have an HTML document, and somewhere inside it is the table shown below. I can get the table rows as Java DOM objects. What is not clear to me is how to extract the value of a table cell, both when the value is a string and when it is a binary resource.
I am using code like:
XPath xpath;
XPathExpression expr;
NodeList nodes = null;
// Use XPath to obtain whatever you want from the (X)HTML
try {
    xpath = XPathFactory.newInstance().newXPath();
    // <table class="data">
    NodeList list = doc.getElementsByTagName("table");
    // Node node = list.item(0);
    // System.out.println(node.getTextContent());
    // String textContent = node.getTextContent();
    expr = xpath.compile("//table/tr/td");
    nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
and looping like:
for (int i = 0; i < nodes.getLength(); i++) {
    Node ln = list.item(i);
    String lnText = ln.toString();
    NodeList rowElements = ln.getChildNodes();
    Node one = rowElements.item(0);
    String oneText = one.toString();
    String nodeName = one.getNodeName();
    String valOne = one.getNodeValue();
But I am not seeing the values in the table.
<table class="data">
<tr><td>ImageName1</td><td width="50"></td><td><img src="/images/036000291452" alt="036000291452" /></td></tr>
<tr><td>ImageName2</td><td width="50"></td><td><img src="/images/36000291452" alt="36000291452" /></td></tr>
<tr><td>Description</td><td></td><td>Time Magazine</td></tr>
<tr><td>Size/Weight</td><td></td><td>14 Issues</td></tr>
<tr><td>Issuing Country</td><td></td><td>United States</td></tr>
</table>
Comments (2)
This XPath expression:

    /*/tr[1]/td[1]

selects the td element (in no namespace) that is the first child of the first tr child of the top element (table) of the provided XML document.

The XPath expression:

    /*/tr[1]/td[2]

selects the td element (in no namespace) that is the second child of the first tr child of the top element (table) of the provided XML document.

In general:

    /*/tr[$m]/td[$n]

selects the td element (in no namespace) that is the $n-th child of the $m-th tr child of the top element (table) of the provided XML document. Just replace $m and $n with the desired integer values.

You can use the standard XPath function string() to obtain their string value:

    string(/*/tr[$m]/td[$n])

evaluates to the string value of the td element (in no namespace) that is the $n-th child of the $m-th tr child of the top element (table) of the provided XML document.
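As a rough illustration of the positional approach described above, here is a minimal Java sketch. It assumes the table can be parsed as well-formed XML (the sample from the question happens to be) and that the table is the document's top element. The class name CellByPosition and the helper cellText are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CellByPosition {
    // Sample table from the question, used here as the whole document (assumption).
    static final String TABLE =
        "<table class=\"data\">"
      + "<tr><td>ImageName1</td><td width=\"50\"></td><td><img src=\"/images/036000291452\" alt=\"036000291452\" /></td></tr>"
      + "<tr><td>ImageName2</td><td width=\"50\"></td><td><img src=\"/images/36000291452\" alt=\"36000291452\" /></td></tr>"
      + "<tr><td>Description</td><td></td><td>Time Magazine</td></tr>"
      + "<tr><td>Size/Weight</td><td></td><td>14 Issues</td></tr>"
      + "<tr><td>Issuing Country</td><td></td><td>United States</td></tr>"
      + "</table>";

    // Returns the string value of the n-th td in the m-th tr (both 1-based, as in XPath).
    static String cellText(String xml, int m, int n) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // The integers are substituted into the expression in place of $m and $n.
        String expr = "string(/*/tr[" + m + "]/td[" + n + "])";
        return xpath.evaluate(expr, doc);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(cellText(TABLE, 3, 3)); // row 3, cell 3
        System.out.println(cellText(TABLE, 4, 1)); // row 4, cell 1
    }
}
```

If the table sits deeper inside a larger page, the leading /* would need to be replaced by a path to that table, and real-world HTML that is not well-formed XML would need an HTML-tolerant parser first.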
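Use a path like "string(//td)" to get the string contents of a cell. Note that in XPath 1.0, string() applied to a node-set returns the value of the first node only, so to cover every cell, select the td nodes and take the string value of each one. For linked resources, you will need to use something like "//td/img/@src" to get the URLs, then canonicalize them relative to the source URL, and fetch the resulting URL from the network.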
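The two steps in this answer could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the input is well-formed XML, the base URL http://example.com/catalog/item.html is a made-up placeholder for wherever the page was loaded from, and the names CellValues, cellTexts and imageUris are hypothetical.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CellValues {
    // Two rows taken from the question's table, enough to show both kinds of cell.
    static final String TABLE =
        "<table class=\"data\">"
      + "<tr><td>ImageName1</td><td width=\"50\"></td><td><img src=\"/images/036000291452\" alt=\"036000291452\" /></td></tr>"
      + "<tr><td>Description</td><td></td><td>Time Magazine</td></tr>"
      + "</table>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    // Collects the string value of every td, one entry per cell.
    static List<String> cellTexts(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList cells = (NodeList) xpath.evaluate("//td", doc, XPathConstants.NODESET);
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < cells.getLength(); i++) {
            // string(.) is evaluated with the current td as the context node.
            texts.add(xpath.evaluate("string(.)", cells.item(i)));
        }
        return texts;
    }

    // Collects every img src inside a td and resolves it against the page's base URI.
    static List<URI> imageUris(Document doc, URI base) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList srcs = (NodeList) xpath.evaluate("//td/img/@src", doc, XPathConstants.NODESET);
        List<URI> uris = new ArrayList<>();
        for (int i = 0; i < srcs.getLength(); i++) {
            uris.add(base.resolve(srcs.item(i).getNodeValue()));
        }
        return uris;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(TABLE);
        URI base = URI.create("http://example.com/catalog/item.html"); // placeholder base URL
        System.out.println(cellTexts(doc));
        System.out.println(imageUris(doc, base));
    }
}
```

The sketch stops at producing absolute URIs. Reading the binary content would be a separate step, for example by opening a stream on each resolved URL, and is left out here so the example does not depend on network access.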