Is it possible to monitor websites for changes using Google Apps Script?
I'm making an application that monitors URLs for changes. To program the application logic I am using Google Apps Script and a Google Sheet.
Let me explain the monitoring mechanism I have in mind. First, the script reads data from a sheet with the following columns:
URL: The URL to monitor.
First Time: Indicates whether this is the first time the URL is analyzed.
Changes: Indicates whether the page has changed since the previous analysis.
HashValue: The MD5 hash of the HTML code of the analyzed URL.
When the script runs, it reads the rows of the sheet. For each row:
- The URL is read and UrlFetchApp.fetch is called to get a response from that web page.
- getContentText is called on the response to obtain the HTML code of the web page, which is saved in a variable.
- The MD5 hash algorithm is applied to the HTML code and the result is saved in a variable.
- If the URL is being analyzed for the first time, the Changes column is set to indicate that no changes have been made (there is nothing to compare against yet) and the hash of the HTML code is saved in the HashValue column.
- If the URL has been analyzed before, the previously stored HashValue is compared with the one just computed.
- If the values differ, the Changes column is set to indicate that there have been changes and the new hash is saved in the HashValue column.
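The steps above could be sketched roughly as follows. This is only a minimal sketch, not the code from the post: the sheet layout (header in row 1, columns A to D holding URL, First Time, Changes, HashValue, with First Time stored as a boolean) and the names bytesToHex, decideRow and checkUrls are assumptions made for illustration.

```javascript
// Assumed layout (not from the original post): header in row 1,
// columns A-D = URL, First Time (boolean), Changes, HashValue.

/** Converts the signed bytes returned by Utilities.computeDigest to a hex string. */
function bytesToHex(bytes) {
  return bytes
    .map(function (b) { return ((b + 256) % 256).toString(16).padStart(2, '0'); })
    .join('');
}

/** Pure decision step: what to write back for one row. */
function decideRow(firstTime, storedHash, newHash) {
  if (firstTime) {
    // First analysis: nothing to compare against, just store the hash.
    return { firstTime: false, changes: false, hashValue: newHash };
  }
  var changed = storedHash !== newHash;
  return { firstTime: false, changes: changed, hashValue: changed ? newHash : storedHash };
}

/** Entry point; only runs inside Google Apps Script. */
function checkUrls() {
  var sheet = SpreadsheetApp.getActiveSheet();
  var lastRow = sheet.getLastRow();
  if (lastRow < 2) return; // no data rows below the header
  var range = sheet.getRange(2, 1, lastRow - 1, 4);
  var rows = range.getValues();
  var out = rows.map(function (row) {
    if (!row[0]) return row; // skip rows without a URL
    var html = UrlFetchApp.fetch(row[0], { muteHttpExceptions: true }).getContentText();
    var digest = Utilities.computeDigest(Utilities.DigestAlgorithm.MD5, html);
    var result = decideRow(row[1] === true, row[3], bytesToHex(digest));
    return [row[0], result.firstTime, result.changes, result.hashValue];
  });
  range.setValues(out);
}
```

Keeping the comparison in a small pure function (decideRow) makes it possible to check the logic without touching the sheet or the network.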
I have already written the code, and it works with some websites but not with others. After analyzing the HTML code of the websites where it did not work, and looking for differences with an online text comparison tool, I noticed the following:
On some websites, reloading the same page twice produces slightly different code even if the content is static. For example, an HTML tag may have the ID box-wrap-140 on one load and box-wrap-148 on the next.
As a result, the script as implemented reports a change because the HTML code is different, even though the content is the same. After a lot of research I have not found an alternative that solves this problem, hence the question in the title.
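To illustrate the effect described above, here is a hedged sketch that normalizes the HTML before hashing. It assumes the only volatile part is a numeric suffix at the end of an id attribute value, as in the box-wrap-140 example; the function name normalizeVolatileIds is hypothetical, and other kinds of per-request noise (tokens, timestamps) would need their own rules.

```javascript
/**
 * Replaces the trailing numeric suffix of id attribute values with a fixed
 * placeholder, so "box-wrap-140" and "box-wrap-148" normalize to the same text.
 * Assumption: the volatile part is always "-<digits>" at the end of the id.
 */
function normalizeVolatileIds(html) {
  return html.replace(/(\bid\s*=\s*["'][^"']*?-)\d+(["'])/gi, '$1N$2');
}
```

Hashing normalizeVolatileIds(html) instead of the raw HTML would make the two loads in the example compare equal, while a change in the visible text would still produce a different hash.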
PS: We can ignore details such as the website being down or returning response codes like 404, 301, etc. That is already handled and works correctly.
PS2: Sorry for my level of English.
Comments (1)
You can use cheerio GS to look for specific tags and either exclude them from the comparison (such as <footer>) or compare only those (such as a particular <div>).
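The exclude/include idea can be sketched as below. Note that this sketch deliberately uses plain regular expressions as a stand-in rather than cheerio GS, so that it has no library dependency; with cheerio GS the same selection would be done with selectors on a parsed document, which is more robust. The function names excludeTag and includeOnlyTag are hypothetical, and the regex does not handle nested elements of the same tag.

```javascript
/** Builds a regex matching a whole <tag ...>...</tag> element (non-nested). */
function elementRegex(tag) {
  return new RegExp('<' + tag + '\\b[^>]*>[\\s\\S]*?</' + tag + '\\s*>', 'gi');
}

/** Returns the HTML with every <tag> element removed (e.g. a noisy footer). */
function excludeTag(html, tag) {
  return html.replace(elementRegex(tag), '');
}

/** Returns only the <tag> elements, joined by newlines, or '' if none match. */
function includeOnlyTag(html, tag) {
  var matches = html.match(elementRegex(tag));
  return matches ? matches.join('\n') : '';
}
```

The result of either function would then be hashed in place of the full page, so that changes outside the part you care about no longer affect the comparison.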