How do I exclude everything except text/html from a Heritrix crawl?

Posted 2024-09-14 16:55:24

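The Heritrix use cases page lists a use case called "Only Store Successful HTML Pages".

My problem: I don't know how to implement it in my cxml file. In particular, the instructions say: "Add the ContentTypeRegExpFilter to the ARCWriterProcessor => set its regexp setting to text/html.*". ... There is no ContentTypeRegExpFilter in the sample cxml file.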

若能看破又如何 2024-09-21 16:55:24

Kris's answer is only half the truth (at least with Heritrix 3.1.x, which I'm using). A DecideRule returns ACCEPT, REJECT or NONE. If a rule returns NONE, it has "no opinion" on the URI (like ACCESS_ABSTAIN in Spring Security). ContentTypeMatchesRegexDecideRule (like every other MatchesRegexDecideRule) can be configured to return a decision when a regex matches, using the two properties "decision" and "regex". With the setting from Kris's answer, the rule returns ACCEPT if the regex matches but NONE if it does not. And as we have seen, NONE is not an opinion, so shouldProcessRule still evaluates to ACCEPT because no decision has been made.

So, to archive only responses whose Content-Type starts with text/html, configure a DecideRuleSequence in which everything is REJECTed by default and only the selected entries are ACCEPTed.

It looks like this:

 <bean id="warcWriter" class="org.archive.modules.writer.WARCWriterProcessor">
   <property name="shouldProcessRule">
     <bean class="org.archive.modules.deciderules.DecideRuleSequence">
       <property name="rules">
         <list>
           <!-- Begin by REJECTing all... -->
           <bean class="org.archive.modules.deciderules.RejectDecideRule" />
           <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
             <property name="decision" value="ACCEPT" />
             <property name="regex" value="^text/html.*" />
           </bean>
         </list>
       </property>
     </bean>
   </property>
   <!-- other properties... -->
 </bean>
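
To make the reasoning behind this ordering concrete, here is a small stand-alone Java model of it. It does not use the Heritrix classes: `Decision`, `Rule`, `decide` and `shouldProcess` are hypothetical names of my own, and I am assuming (as described above) that the last non-NONE decision in a sequence wins and that a processor skips a URI only when the outcome is REJECT.

```java
import java.util.List;

public class DecideSequenceSketch {
    // Simplified stand-in for the three possible rule outcomes.
    enum Decision { ACCEPT, REJECT, NONE }

    // Hypothetical rule type: maps a Content-Type string to a decision.
    interface Rule {
        Decision decisionFor(String contentType);
    }

    // Models a rule that rejects everything.
    static Rule rejectAll() {
        return contentType -> Decision.REJECT;
    }

    // Models a regex rule: ACCEPT on a match, otherwise NONE ("no opinion").
    static Rule acceptIfMatches(String regex) {
        return contentType ->
            (contentType != null && contentType.matches(regex))
                ? Decision.ACCEPT
                : Decision.NONE;
    }

    // Assumed sequence semantics: the last rule with an opinion wins.
    static Decision decide(List<Rule> rules, String contentType) {
        Decision result = Decision.NONE;
        for (Rule rule : rules) {
            Decision d = rule.decisionFor(contentType);
            if (d != Decision.NONE) {
                result = d;
            }
        }
        return result;
    }

    // Assumed processor behaviour: only an explicit REJECT skips the URI.
    static boolean shouldProcess(List<Rule> rules, String contentType) {
        return decide(rules, contentType) != Decision.REJECT;
    }

    public static void main(String[] args) {
        List<Rule> loneRule = List.of(acceptIfMatches("^text/html.*"));
        List<Rule> sequence = List.of(rejectAll(), acceptIfMatches("^text/html.*"));

        // The lone regex rule has no opinion on image/png, so it is still processed.
        System.out.println(shouldProcess(loneRule, "image/png"));                // true
        // With the reject-all rule first, image/png is skipped...
        System.out.println(shouldProcess(sequence, "image/png"));                // false
        // ...while HTML is accepted by the later regex rule.
        System.out.println(shouldProcess(sequence, "text/html; charset=UTF-8")); // true
    }
}
```

The first line of output is the problem with a lone ContentTypeMatchesRegexDecideRule; the other two show why starting the sequence with a reject-all rule fixes it.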

To keep images, movies etc. from being downloaded at all, add a MatchesListRegexDecideRule to the "scope" bean that REJECTs URLs with well-known file extensions, for example:

<!-- ...and REJECT those from a configurable (initially empty) set of URI regexes... -->
<bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
  <property name="decision" value="REJECT" />
  <property name="listLogicalOr" value="true" />
  <property name="regexList">
    <list>
      <value>.*(?i)(\.(avi|wmv|mpe?g|mp3))$</value>
      <value>.*(?i)(\.(rar|zip|tar|gz))$</value>
      <value>.*(?i)(\.(pdf|doc|xls|odt))$</value>
      <value>.*(?i)(\.(xml))$</value>
      <value>.*(?i)(\.(txt|conf|pdf))$</value>
      <value>.*(?i)(\.(swf))$</value>
      <value>.*(?i)(\.(js|css))$</value>
      <value>.*(?i)(\.(bmp|gif|jpe?g|png|svg|tiff?))$</value>
    </list>
  </property>
</bean>
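
If you want to check which URLs such patterns would catch before starting a crawl, a quick test with plain `java.util.regex` helps. This is only a sketch: the class name, the `isRejected` helper and the example.com URLs are made up for illustration, it uses just three of the patterns, and it assumes the rule matches each regex against the whole URI string with logical OR across the list.

```java
import java.util.List;
import java.util.regex.Pattern;

public class ExtensionFilterSketch {
    // A subset of the extension patterns, compiled as Java regexes.
    static final List<Pattern> REJECT_PATTERNS = List.of(
        Pattern.compile(".*(?i)(\\.(avi|wmv|mpe?g|mp3))$"),
        Pattern.compile(".*(?i)(\\.(js|css))$"),
        Pattern.compile(".*(?i)(\\.(bmp|gif|jpe?g|png|svg|tiff?))$"));

    // Returns true if any pattern matches the entire URI (logical OR).
    static boolean isRejected(String uri) {
        for (Pattern pattern : REJECT_PATTERNS) {
            if (pattern.matcher(uri).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // (?i) makes the extension case-insensitive.
        System.out.println(isRejected("http://example.com/logo.PNG"));             // true
        System.out.println(isRejected("http://example.com/app.js"));               // true
        System.out.println(isRejected("http://example.com/index.html"));           // false
        // The trailing $ means a query string after the extension defeats the match.
        System.out.println(isRejected("http://example.com/photo.jpg?size=large")); // false
    }
}
```

The last case is worth keeping in mind: URLs that carry a query string after the extension slip past these patterns, which is one more reason to keep the Content-Type rule on the writer as well.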

善良天后 2024-09-21 16:55:24

The use cases you cite are somewhat out of date and refer to Heritrix 1.x (filters have been replaced with decide rules, and the configuration framework is very different). Still, the basic concept is the same.

The cxml file is basically a Spring configuration file. You need to configure the property shouldProcessRule on the ARCWriter bean to be a ContentTypeMatchesRegexDecideRule.

A possible ARCWriter configuration:

  <bean id="warcWriter" class="org.archive.modules.writer.ARCWriterProcessor">
    <property name="shouldProcessRule">
      <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
        <property name="decision" value="ACCEPT" />
        <property name="regex" value="^text/html.*" />
      </bean>
    </property>
    <!-- Other properties that need to be set ... -->
  </bean>

This will cause the Processor to only process those items that match the DecideRule, which in turn only passes those whose content type (mime type) matches the provided regular expression.

Be careful about the 'decision' setting. Are you ruling things in or out? (My example rules things in; anything not matching is ruled out.)

As shouldProcessRule is inherited from Processor, this can be applied to any processor.

More information about configuring Heritrix 3 can be found on the Heritrix 3 Wiki (the user guide on crawler.archive.org is about Heritrix 1)
