How can I implement web page snapshots (cached pages) with Solr?

Posted on 2021-11-25 10:55:15 · 471 characters · 766 views · 6 comments

Recently I have been crawling web pages with Nutch and submitting the collected data to Solr for indexing. I followed the configuration described at http://williamx.blog.51cto.com/3629295/722707

and the index now contains a cache_content field with values, but each value is just one very long string (see the screenshot below).

I don't know how to use this value on a web page to show a snapshot. Could someone point me in the right direction? Thanks!


Comments (6)

别再吹冷风 2021-11-29 23:01:46

Oh, I'll take a look at that then.

A related question: how do I get Nutch to crawl images? By default Nutch does not fetch images or videos. Have you looked into this? Thanks!
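The usual reason is that the stock URL filter excludes binary suffixes before the fetcher ever sees them. An abbreviated sketch of the relevant rule in conf/regex-urlfilter.txt (the real suffix list is longer); removing the image extensions lets image URLs through, after which you would still need a parse/indexing plugin that does something useful with the binary content:

# skip image and other suffixes we can't yet parse (abbreviated excerpt)
-\.(gif|GIF|jpg|JPG|png|PNG|jpeg|JPEG|bmp|BMP|...)$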

酒几许 2021-11-29 22:43:30

My guess is that this is a problem with the data you crawled; my snapshots display with their styling intact.

裸钻 2021-11-29 22:38:47

Hello, thanks for your reply!

One more thing: I did what you described and the snapshot now works. byte[] inOutb = (byte[]) doc.getFieldValue("cache_content"); cache_content = new String(inOutb, "gbk"); gives me the HTML of the crawled page, and I print it straight out with out.println(cache_content). The resulting page looks very messy, though, because the CSS, JS and images are missing. It would be much nicer if the original page's JS, CSS and images could be displayed together with the snapshot page. Have you looked into this? Any advice would be appreciated, thanks!
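A common workaround for the missing styling is to let the cached HTML keep loading its resources from the original site by injecting a <base> tag, so that relative CSS/JS/image URLs resolve against the live server rather than the snapshot page. A minimal sketch, assuming the original page URL is available (for example from the url field that Nutch's basic indexing filter writes):

// Insert <base href="..."> right after <head> so relative links in the cached
// HTML load their css/js/images from the original site.
private static String addBaseTag(String html, String originalUrl) {
  String baseTag = "<base href=\"" + originalUrl + "\">";
  int head = html.toLowerCase().indexOf("<head");
  if (head >= 0) {
    int close = html.indexOf('>', head);
    if (close >= 0) {
      return html.substring(0, close + 1) + baseTag + html.substring(close + 1);
    }
  }
  // no <head> element found; fall back to prepending the tag
  return baseTag + html;
}

This only helps while the original site is still online; a true offline snapshot would also have to store the referenced resources.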

千纸鹤带着心事 2021-11-29 19:24:50

Hello, I did exactly what you described: I crawl the pages with Nutch and submit the data to Solr for indexing. When I query Solr the cache_content field does have content, but it is one very long string, like the one I pasted above. How do I turn that string into usable HTML and display it on a page? I read the Solr index with SolrJ. Thanks!
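For what it's worth, that very long string is most likely the Base64 form of the stored bytes: a Solr binary field comes back Base64-encoded in XML/JSON responses, while SolrJ with the default javabin response format hands back a byte[] directly. A minimal decoding sketch (the GBK charset is an assumption about the crawled pages):

// Cope with both response formats: javabin returns byte[],
// XML/JSON returns a Base64-encoded String.
Object raw = doc.getFieldValue("cache_content");
byte[] bytes = (raw instanceof byte[])
    ? (byte[]) raw
    : org.apache.commons.codec.binary.Base64.decodeBase64(raw.toString());
String html = new String(bytes, java.nio.charset.Charset.forName("GBK")); // assumes the pages were fetched as GBK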

残花月 2021-11-29 16:50:47

1. In the nutch-1.7 project in Eclipse, find solrindex-mapping.xml and add <field dest="cache_content" source="cache_content"/>.

2. In Solr's schema.xml, add <field name="cache_content" type="binary" indexed="false" stored="true" />.

3. Modify the source code; see below.

IndexerMapReduce.java, with the snapshot-related changes marked by comments:

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.nutch.indexer;

import java.io.IOException;
import java.util.Collection;
import java.util.Iterator;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.crawl.LinkDb;
import org.apache.nutch.crawl.NutchWritable;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.metadata.Nutch;
import org.apache.nutch.net.URLFilters;
import org.apache.nutch.net.URLNormalizers;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseImpl;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.scoring.ScoringFilterException;
import org.apache.nutch.scoring.ScoringFilters;

public class IndexerMapReduce extends Configured
implements Mapper<Text, Writable, Text, NutchWritable>,
          Reducer<Text, NutchWritable, Text, NutchIndexAction> {

  public static final Logger LOG = LoggerFactory.getLogger(IndexerMapReduce.class);

  public static final String INDEXER_PARAMS = "indexer.additional.params";
  public static final String INDEXER_DELETE = "indexer.delete";
  public static final String INDEXER_DELETE_ROBOTS_NOINDEX = "indexer.delete.robots.noindex";
  public static final String INDEXER_SKIP_NOTMODIFIED = "indexer.skip.notmodified";
  public static final String URL_FILTERING = "indexer.url.filters";
  public static final String URL_NORMALIZING = "indexer.url.normalizers";

  private boolean skip = false;
  private boolean delete = false;
  private boolean deleteRobotsNoIndex = false;
  private IndexingFilters filters;
  private ScoringFilters scfilters;

  // using normalizers and/or filters
  private boolean normalize = false;
  private boolean filter = false;

  // url normalizers, filters and job configuration
  private URLNormalizers urlNormalizers;
  private URLFilters urlFilters;

  public void configure(JobConf job) {
    setConf(job);
    this.filters = new IndexingFilters(getConf());
    this.scfilters = new ScoringFilters(getConf());
    this.delete = job.getBoolean(INDEXER_DELETE, false);
    this.deleteRobotsNoIndex = job.getBoolean(INDEXER_DELETE_ROBOTS_NOINDEX, false);
    this.skip = job.getBoolean(INDEXER_SKIP_NOTMODIFIED, false);

    normalize = job.getBoolean(URL_NORMALIZING, false);
    filter = job.getBoolean(URL_FILTERING, false);

    if (normalize) {
      urlNormalizers = new URLNormalizers(getConf(), URLNormalizers.SCOPE_DEFAULT);
    }

    if (filter) {
      urlFilters = new URLFilters(getConf());
    }
  }

  /**
   * Normalizes and trims extra whitespace from the given url.
   *
   * @param url The url to normalize.
   *
   * @return The normalized url.
   */
  private String normalizeUrl(String url) {
    if (!normalize) {
      return url;
    }

    String normalized = null;
    if (urlNormalizers != null) {
      try {

        // normalize and trim the url
        normalized = urlNormalizers.normalize(url,
          URLNormalizers.SCOPE_INDEXER);
        normalized = normalized.trim();
      }
      catch (Exception e) {
        LOG.warn("Skipping " + url + ":" + e);
        normalized = null;
      }
    }

    return normalized;
  }

  /**
   * Filters the given url.
   *
   * @param url The url to filter.
   *
   * @return The filtered url or null.
   */
  private String filterUrl(String url) {
    if (!filter) {
      return url;
    }

    try {
      url = urlFilters.filter(url);
    } catch (Exception e) {
      url = null;
    }

    return url;
  }

  public void map(Text key, Writable value,
      OutputCollector<Text, NutchWritable> output, Reporter reporter) throws IOException {

    String urlString = filterUrl(normalizeUrl(key.toString()));
    if (urlString == null) {
      return;
    } else {
      key.set(urlString);
    }

    output.collect(key, new NutchWritable(value));
  }

  public void reduce(Text key, Iterator<NutchWritable> values,
                     OutputCollector<Text, NutchIndexAction> output, Reporter reporter)
    throws IOException {
    Inlinks inlinks = null;
    CrawlDatum dbDatum = null;
    CrawlDatum fetchDatum = null;
    ParseData parseData = null;
    ParseText parseText = null;
    // raw fetched page bytes, used later as the "cache_content" snapshot field
    byte[] cache_content = null;

    while (values.hasNext()) {
      final Writable value = values.next().get(); // unwrap
      if (value instanceof Inlinks) {
        inlinks = (Inlinks)value;
      } else if (value instanceof CrawlDatum) {
        final CrawlDatum datum = (CrawlDatum)value;
        if (CrawlDatum.hasDbStatus(datum)) {
          dbDatum = datum;
        }
        else if (CrawlDatum.hasFetchStatus(datum)) {

          // don't index unmodified (empty) pages
          if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) {
            fetchDatum = datum;

            /**
             * Check if we need to delete 404 NOT FOUND and 301 PERMANENT REDIRECT.
             */
            if (delete) {
              if (fetchDatum.getStatus() == CrawlDatum.STATUS_FETCH_GONE) {
                reporter.incrCounter("IndexerStatus", "Documents deleted", 1);

                NutchIndexAction action = new NutchIndexAction(null, NutchIndexAction.DELETE);
                output.collect(key, action);
                return;
              }
              if (fetchDatum.getStatus() == CrawlDatum.STATUS_FETCH_REDIR_PERM) {
                reporter.incrCounter("IndexerStatus", "Perm redirects deleted", 1);

                NutchIndexAction action = new NutchIndexAction(null, NutchIndexAction.DELETE);
                output.collect(key, action);
                return;
              }
            }
          }

        } else if (CrawlDatum.STATUS_LINKED == datum.getStatus() ||
                   CrawlDatum.STATUS_SIGNATURE == datum.getStatus() ||
                   CrawlDatum.STATUS_PARSE_META == datum.getStatus()) {
          continue;
        } else {
          throw new RuntimeException("Unexpected status: "+datum.getStatus());
        }
      } else if (value instanceof ParseData) {
        parseData = (ParseData)value;

        // Handle robots meta? https://issues.apache.org/jira/browse/NUTCH-1434
        if (deleteRobotsNoIndex) {
          // Get the robots meta data
          String robotsMeta = parseData.getMeta("robots");

          // Has it a noindex for this url?
          if (robotsMeta != null && robotsMeta.toLowerCase().indexOf("noindex") != -1) {
            // Delete it!
            NutchIndexAction action = new NutchIndexAction(null, NutchIndexAction.DELETE);
            output.collect(key, action);
            return;
          }
        }
      } else if (value instanceof ParseText) {
        parseText = (ParseText)value;
      } else if (value instanceof Content) {
        // snapshot change: keep the raw fetched content from the segment's
        // "content" directory so it can be indexed as "cache_content"
        cache_content = ((Content) value).getContent();
      } else if (LOG.isWarnEnabled()) {
        LOG.warn("Unrecognized type: "+value.getClass());
      }
    }

    if (fetchDatum == null || dbDatum == null
        || parseText == null || parseData == null) {
      return;                                     // only have inlinks
    }

    // Whether to skip DB_NOTMODIFIED pages
    if (skip && dbDatum.getStatus() == CrawlDatum.STATUS_DB_NOTMODIFIED) {
      reporter.incrCounter("IndexerStatus", "Skipped", 1);
      return;
    }

    if (!parseData.getStatus().isSuccess() ||
        fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
      return;
    }

    NutchDocument doc = new NutchDocument();
    final Metadata metadata = parseData.getContentMeta();

    // add segment, used to map from merged index back to segment files
    doc.add("segment", metadata.get(Nutch.SEGMENT_NAME_KEY));

    // add digest, used by dedup
    doc.add("digest", metadata.get(Nutch.SIGNATURE_KEY));

    doc.add("cache_content", cache_content);
    
    final Parse parse = new ParseImpl(parseText, parseData);
    try {
      // extract information from dbDatum and pass it to
      // fetchDatum so that indexing filters can use it
      final Text url = (Text) dbDatum.getMetaData().get(Nutch.WRITABLE_REPR_URL_KEY);
      if (url != null) {
        fetchDatum.getMetaData().put(Nutch.WRITABLE_REPR_URL_KEY, url);
      }
      // run indexing filters
      doc = this.filters.filter(doc, parse, key, fetchDatum, inlinks);
    } catch (final IndexingException e) {
      if (LOG.isWarnEnabled()) { LOG.warn("Error indexing "+key+": "+e); }
      reporter.incrCounter("IndexerStatus", "Errors", 1);
      return;
    }

    // skip documents discarded by indexing filters
    if (doc == null) {
      reporter.incrCounter("IndexerStatus", "Skipped by filters", 1);
      return;
    }

    float boost = 1.0f;
    // run scoring filters
    try {
      boost = this.scfilters.indexerScore(key, doc, dbDatum,
              fetchDatum, parse, inlinks, boost);
    } catch (final ScoringFilterException e) {
      if (LOG.isWarnEnabled()) {
        LOG.warn("Error calculating score " + key + ": " + e);
      }
      return;
    }
    // apply boost to all indexed fields.
    doc.setWeight(boost);
    // store boost for use by explain and dedup
    doc.add("boost", Float.toString(boost));

    reporter.incrCounter("IndexerStatus", "Documents added", 1);

    NutchIndexAction action = new NutchIndexAction(doc, NutchIndexAction.ADD);
    output.collect(key, action);
  }

  public void close() throws IOException { }

  public static void initMRJob(Path crawlDb, Path linkDb,
                           Collection<Path> segments,
                           JobConf job) {

    LOG.info("IndexerMapReduce: crawldb: " + crawlDb);
    final String DIR_CACHE = "content";
    
    if (linkDb!=null)
      LOG.info("IndexerMapReduce: linkdb: " + linkDb);

    for (final Path segment : segments) {
      LOG.info("IndexerMapReduces: adding segment: " + segment);
      FileInputFormat.addInputPath(job, new Path(segment, CrawlDatum.FETCH_DIR_NAME));
      FileInputFormat.addInputPath(job, new Path(segment, CrawlDatum.PARSE_DIR_NAME));
      FileInputFormat.addInputPath(job, new Path(segment, ParseData.DIR_NAME));
      FileInputFormat.addInputPath(job, new Path(segment, ParseText.DIR_NAME));
      // snapshot change: also feed the raw content directory into the indexing job
      FileInputFormat.addInputPath(job, new Path(segment, DIR_CACHE));
    }

    FileInputFormat.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME));
    
    if (linkDb!=null)
	  FileInputFormat.addInputPath(job, new Path(linkDb, LinkDb.CURRENT_NAME));
    
    job.setInputFormat(SequenceFileInputFormat.class);

    job.setMapperClass(IndexerMapReduce.class);
    job.setReducerClass(IndexerMapReduce.class);

    job.setOutputFormat(IndexerOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setMapOutputValueClass(NutchWritable.class);
    job.setOutputValueClass(NutchWritable.class);
  }
}
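After these changes Nutch has to be rebuilt (for example with ant) so the modified IndexerMapReduce is picked up, and the indexing step rerun; the segments must also contain a content directory, which they do as long as fetcher.store.content is left at its default of true. A typical Nutch 1.7 invocation, with the crawl paths below as placeholders for your own directories:

bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*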

累赘 2021-11-29 01:27:52

public String getById(Page page, String id) throws SolrServerException,
        MalformedURLException {
    // get the shared Solr server connection
    CommonsHttpSolrServer server = SolrServer.getInstance().getServer();
    // use the binary request writer for better performance
    server.setRequestWriter(new BinaryRequestWriter());
    // build the query
    SolrQuery query = new SolrQuery();
    // prefix the variable with "c" (for condition) to avoid clashing with other variables
    String ckw = page.getConditions().get("kw").toString();
    query.set("q", "content:" + ckw);
    QueryResponse qrsp = server.query(query);
    SolrDocumentList list = qrsp.getResults();
    String cache_content = null;
    Iterator<SolrDocument> iter = list.iterator();
    while (iter.hasNext()) {
        SolrDocument doc = iter.next();
        String idStr = doc.getFieldValue("id").toString();
        if (idStr.equals(id)) {
            byte[] inOutb = (byte[]) doc.getFieldValue("cache_content");
            try {
                // the stored bytes are the raw page; these crawled pages are GBK-encoded
                cache_content = new String(inOutb, "gbk");
            } catch (Exception e) {
                e.printStackTrace();
            }
            break;
        }
    }
    return cache_content;
}

In this method, the main thing I handle is displaying the snapshot. The key line is byte[] inOutb = (byte[]) doc.getFieldValue("cache_content");
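When the decoded HTML is written back to the browser, the response charset has to match the one used in new String(inOutb, "gbk"), otherwise the snapshot comes out garbled. A minimal servlet-style sketch (GBK mirrors the code above and is an assumption about the crawled pages):

// Write the decoded snapshot HTML with an explicit charset so the
// browser interprets it the same way it was decoded.
private void renderSnapshot(javax.servlet.http.HttpServletResponse response, String cacheContent)
        throws java.io.IOException {
    response.setContentType("text/html; charset=GBK");
    response.getWriter().println(cacheContent);
}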
