使用 Google Sheet 提取 div 中的文本
我正在尝试从Bostonglobe网站提取头条新闻。头条新闻嵌入了Divs。我使用Google表功能importxml来刮擦头条新闻。 但是,我的代码是
=IMPORTXML("https://www.bostonglobe.com/search/?p1=BGSearch_Overlay&q=problem","//div[@class= 'queryly_item_title')]")
,结果表明导入XML内容无法解析。有人知道发生了什么吗?我应该如何修改代码?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
google 表格导入公式不支持废弃 javaScript 内容:
google sheets import formulae do not support javaScript content to be scrapped:
Boston Globe 网站的内容(包括所有标题和文章)是使用动态脚本生成的,而不是静态 HTML。因此您将无法使用 IMPORTXML 进行提取。
The Boston Globe site's content, including all headlines and articles, is generated with dynamic script, not static HTML. So you will not be able to extract with IMPORTXML.
我可以提供一个解决方法。
转到实际网页并按 Ctrl-A 选择整个网页内容。按 Ctrl-C 将其复制到剪贴板。然后转到Google表格,单击任何空白表格的A1并按Ctrl-V将所有网页内容复制到那里。它将沿着 A 列运行。
现在,在另一个空列的某个单元格(例如,现在为 C2)中,放置此公式:
=ArrayFormula(REGEXREPLACE(REGEXREPLACE(FILTER(A:A,REGEXMATCH(A:A, “\/通过")),JOIN("|",TEXT(DATE(2022,SEQUENCE(1,12),1),"mmm")),"~"),"~\s\d+.+$"," "))
这应该从 A 列的内容中提取标题。
FILTER
使用REGEXMATCH
提取带有“By”行的任何内容。JOIN
部分形成所有三个字母的月份名称的 REGEX“OR 列表”。 (年份并不重要;它必须是“某个东西”,以便DATE
从中提取月份名称。最里面的 REGEXREPLACE 会将那些三个字母的月份名称替换为波形符(~ )作为要删除的内容的标记,然后最外层的 REGEXREPLACE 将从那里删除到这些字符串的末尾,仅保留标题
在其他页面上不保证,但此过程和公式应该适用于所有类似的内容。 在上面的示例中,
如果您还需要使超链接保持活动状态,那么您必须分两列进行操作:
我们将第一个公式放在 D2 中。可以放置此公式(取自第一个公式):
=ArrayFormula(FILTER(A:A,REGEXMATCH(A:A,"\/ By")))
这将提取整个标题,通过- 线和开口详细信息为长字符串,但它将保留超链接,只需将 Col-D 宽度更改为较窄的值(如 50),然后选择整个 Col D 并应用 Format > 。包装>剪辑以截断溢出。
如果您想打破对 A 列中网页“数据删除”的依赖,以便可以删除所有内容:
1.) 完全选择公式列并按 Ctrl-C 复制到剪贴板。
2.) 打开一个新的Word文档(或Google Docs,但我建议使用Word,如果你有的话,因为它在Google帝国之外,并且将完全分离更多方面)。按 Ctrl-V 粘贴公式结果。
3.) 然后按 Ctrl-A 选择刚刚放入 Word 文档中的内容,按 Ctrl-C 将其复制回剪贴板,然后返回 Google 表格 [公式列仍处于选中状态] 并按Ctrl-V。
4.) 然后您需要应用格式>;包装>如果需要,可以进行剪辑以截断要剪辑的溢出部分。
然后您可以删除 Col-A 数据。
看起来可能有很多事情要做。但老实说,实际上,从开始到结束应该只需要不到一分钟的时间。
I can offer a workaround.
Go to the actual webpage and hit Ctrl-A to select the entire webpage contents. Hit Ctrl-C to copy that to clipboard. Then go to Google Sheets, click on A1 of any blank sheet and hit Ctrl-V to copy all of the webpage contents there. It will run down Col A.
Now, in some cell of another empty column (say, C2 for now), place this formula:
=ArrayFormula(REGEXREPLACE(REGEXREPLACE(FILTER(A:A,REGEXMATCH(A:A,"\/ By")),JOIN("|",TEXT(DATE(2022,SEQUENCE(1,12),1),"mmm")),"~"),"~\s\d+.+$",""))
This should extract your headlines from the content in Col A.
The
FILTER
usesREGEXMATCH
to extract anything with a "By" line.The
JOIN
part forms a REGEX "OR listing" of all three-letter month names. (The year is not important; it just has to be "something" in order forDATE
to extract the month names from. And the innermost REGEXREPLACE will replace those three-letter month names with a tilde (~) as a marker of what to delete beyond.The outermost REGEXREPLACE will then delete from there to the end of those strings, leaving only the headers.
No guarantees on other pages, but this process and formula(s) should work on all similar headline-listing pages from this website.
If you also need the hyperlinks to remain active, that gets tricky. You'll have to do it in two columns:
In the above example, we placed the first formula in C2. In D2, you could place this formula (taken from the first formula):
=ArrayFormula(FILTER(A:A,REGEXMATCH(A:A,"\/ By")))
This will extract the entire headline, by-line and opening details as a long string, but it will retain the hyperlink. Just change the Col-D width to something narrow like 50, then select the entire Col D and apply Format > Wrapping > Clip to truncate the overflow.
If you want to break the dependence on the webpage "data drop" in Col A so that you can delete all that:
1.) Select the formula column(s) entirely and hit Ctrl-C to copy to clipboard.
2.) Open a new Word document (or Google Docs, but I suggest Word if you have it, since it's outside of the Google empire and will fully detach more facets). Hit Ctrl-V to paste the formula results.
3.) Then hit Ctrl-A to select what you just dropped in the Word document, hit Ctrl-C to copy it back to clipboard, and then go back to Google Sheets [with the formula column(s) still select] and hit Ctrl-V.
4.) You'll need then apply Format > Wrapping > Clip to truncate the overflow to clip if desired.
Then you can delete the Col-A data.
It may seem like a lot to do. But honestly, in practice, it should only take less than a minute from start to finish.