How to parse a sitemap index with compressed links

Posted 2025-01-18 15:47:16


I've written a program that reads a page's /robots.txt and /sitemap.xml, extracts the available sitemaps, and stores them in the siteMapsUnsorted list.
Once there, I use the crawler-commons library to analyze whether the links are SiteMaps or SiteMapIndexes (clusters of SiteMaps).
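
The extraction step looks roughly like this (a minimal sketch under my own assumptions: sitemapsFromRobots is just an illustrative name, and my real parsing code differs slightly):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Collects the URLs of every "Sitemap:" directive in a site's robots.txt.
public static List<String> sitemapsFromRobots(String siteRoot) throws IOException {
    List<String> siteMapsUnsorted = new ArrayList<>();
    URL robots = new URL(siteRoot + "/robots.txt");
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(robots.openStream(), StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
            // robots.txt directive names are case-insensitive.
            if (line.regionMatches(true, 0, "Sitemap:", 0, 8)) {
                siteMapsUnsorted.add(line.substring(8).trim());
            }
        }
    }
    return siteMapsUnsorted;
}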

When I use it on a normal siteMapIndex it works; the problem occurs in some cases where bigger sites have the list of SiteMapIndexes in a compressed format, e.g.:

https://www.tripadvisor.es/sitemap/2/es/sitemap-1662847-es-articles-1644753222.xml.gz

The code I'm using:

SiteMapParser sitemapParser = new SiteMapParser();

for (String sitemapURLStr : siteMapsUnsorted) {
    AbstractSiteMap siteMapCandidate = sitemapParser.parseSiteMap(new URL(sitemapURLStr));
    // AbstractSiteMap siteMapCandidate = sitemapParser.parseSiteMap("xml", content, new URL(sitemapURLStr));

    // Check whether the parsed element is a SiteMapIndex or a SiteMap;
    // SiteMapIndexes need to be broken down into individual SiteMaps.
    if (siteMapCandidate instanceof SiteMapIndex) {
        SiteMapIndex siteMapIndex = (SiteMapIndex) siteMapCandidate;

        for (AbstractSiteMap aSiteMap : siteMapIndex.getSitemaps()) {
            if (aSiteMap instanceof SiteMap) {
                String siteMapString = aSiteMap.getUrl().toString();
                System.out.println(siteMapString);
                siteMaps.add(siteMapString);
            } else {
                LOG.warn("ignoring site map index inside site map index: " + aSiteMap.getUrl());
            }
        }
    }
    // If the element is an individual SiteMap, add it to the siteMaps list.
    else {
        siteMaps.add(siteMapCandidate.getUrl().toString());
    }
}

I've noticed that the method parseSiteMap behaves differently depending on the parameters you pass to it, but after multiple attempts I couldn't find a way to handle the compressed elements.
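
For clarity, these are the overloads I've been trying (signatures as I understand them from crawler-commons' SiteMapParser; the comments are my own reading, not documentation):

// Fetches the URL itself; format and compression are guessed from the response:
AbstractSiteMap parseSiteMap(URL url);
// Takes pre-downloaded content; the format is detected from the content/URL:
AbstractSiteMap parseSiteMap(byte[] content, URL url);
// Takes pre-downloaded content plus an explicit content type:
AbstractSiteMap parseSiteMap(String contentType, byte[] content, URL url);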

My last alternative would be to write a method that downloads every .xml.gz, decompresses it, reads the decompressed list of links, stores them, and finally deletes the temporary files; but that would be extremely slow and inefficient, so first I came here to see if anyone has a better idea or could help me with parseSiteMap().
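
Roughly what I had in mind for that fallback (an untested sketch; at least the decompression could stay entirely in memory with java.util.zip.GZIPInputStream, with the XML bytes then fed to the overload commented out above):

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.zip.GZIPInputStream;

// Downloads a gzipped sitemap and returns the decompressed XML bytes,
// entirely in memory, so there is no directory to create or delete.
// (readAllBytes requires Java 9+.)
public static byte[] downloadAndGunzip(URL url) throws IOException {
    try (InputStream in = new GZIPInputStream(url.openStream())) {
        return in.readAllBytes();
    }
}

// Usage, feeding the bytes back into crawler-commons:
// byte[] xml = downloadAndGunzip(new URL(sitemapURLStr));
// AbstractSiteMap sm = sitemapParser.parseSiteMap("text/xml", xml, new URL(sitemapURLStr));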

Thanks in advance to anyone who can help.


Comments (1)

只是我以为 · 2025-01-25 15:47:17


The reason this is failing is that Tripadvisor doesn't set the correct MIME type on its sitemaps:

$ curl --head https://www.tripadvisor.es/sitemap/2/es/sitemap-1662847-es-articles-1644753222.xml.gz
...
content-type: text/plain; charset=utf-8

and the library you are using only gzip-decodes the response when the content type is one of:

private static String[] GZIP_MIMETYPES = new String[] { 
  "application/gzip",
  "application/gzip-compressed",
  "application/gzipped",
  "application/x-gzip",
  "application/x-gzip-compressed",
  "application/x-gunzip",
  "gzip/document"
};

You could probably work around this by implementing better detection of gzip and XML (e.g. when the URL ends in .xml.gz) and calling the processGzippedXML method directly after downloading the sitemap to a byte[].
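
A sketch of that idea (the class and method names here are mine, not from the library; it sidesteps the protected processGzippedXML, which would require subclassing SiteMapParser, by forcing one of the GZIP_MIMETYPES above through the public parseSiteMap(String contentType, byte[] content, URL url) overload, the same one commented out in your code):

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import crawlercommons.sitemaps.AbstractSiteMap;
import crawlercommons.sitemaps.SiteMapParser;
import crawlercommons.sitemaps.UnknownFormatException;

public class GzipAwareSitemapFetcher {

    private final SiteMapParser parser = new SiteMapParser();

    public AbstractSiteMap fetchAndParse(URL url)
            throws IOException, UnknownFormatException {
        byte[] content = download(url);

        // Trust the URL suffix and the gzip magic number (0x1f 0x8b)
        // rather than the Content-Type header the server sends.
        boolean gzipped = url.getPath().endsWith(".gz")
                || (content.length > 1
                        && (content[0] & 0xFF) == 0x1F
                        && (content[1] & 0xFF) == 0x8B);

        // Any entry from GZIP_MIMETYPES sends the library down its
        // gzip-decoding path, regardless of what the server claimed.
        String contentType = gzipped ? "application/x-gzip" : "text/xml";
        return parser.parseSiteMap(contentType, content, url);
    }

    // Reads the full response body into memory (readAllBytes needs Java 9+).
    private static byte[] download(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            return in.readAllBytes();
        }
    }
}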
