How to crawl html pages to create a subjective overall site score
Thanks in advance for your help.
I have been exhaustively trying to find/write a utility that does the following:
Crawls through a specified site (sitename) looking for various strings (a, b, c, d, e) in all html pages on the site AND a specific named javascript file (javascriptfile.js)
If the javascript file is not found on an individual page, output the name/url of the page to a file, and then continue crawling.
Create a total score based on how many times each string is found on the page (1 point for each string "a", 2 points for each string "b") etc. etc.
I am stuck at the first part -- because I don't have the coding skills to write the crawling portion. I have tried Wget, pavuk, mechanize, and some php scripts, but they all seem to have limitations as well.
Anyone have any examples or thoughts on how I can either use or modify one of the mentioned utilities, or write a script that would accomplish the above?
I am open to C, java, php, perl, etc... -- just want to get this done!
Thanks so much for your help!!!
5 Answers
Crawls through a specified site (sitename) looking for various strings (a, b, c, d, e) in all html pages on the site AND a specific named javascript file (javascriptfile.js)
In python you would want to use urllib. That will let you easily communicate with an HTTP server.
Then you need to look into regular expressions, which will let you do the crawling and the string searching. Since most servers don't have an open index, you'll need to find the <a> tags, strip out everything except where they point, and then take the new destination to crawl:
- get the Href attribute from the anchor tags
- compare the domains to make sure they are the same, or handle relative paths (those starting with "/")
- repeat the process
You can look at "beautifulsoup" to help you with this. It will do all the hard work of reading the HTML for you: Beautiful Soup
It should even help with searching for your strings.
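A minimal sketch of that crawl loop, assuming Python 2 with the third-party BeautifulSoup package installed; "sitename" is a placeholder from the question:

import urllib2
from urlparse import urljoin, urlparse
from bs4 import BeautifulSoup

start_url = "http://sitename/"  # placeholder start page from the question
to_visit = [start_url]
seen = set(to_visit)

while to_visit:
    url = to_visit.pop(0)
    html = urllib2.urlopen(url).read()           # talk to the HTTP server
    soup = BeautifulSoup(html, "html.parser")    # let BeautifulSoup read the HTML
    for a in soup.findAll("a", href=True):       # find the <a> tags
        link = urljoin(url, a["href"])           # resolves relative paths ("/...")
        # compare domains so the crawl stays on the same site
        if urlparse(link).netloc == urlparse(start_url).netloc and link not in seen:
            seen.add(link)
            to_visit.append(link)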
If the javascript file is not found on an individual page, output the name/url of the page to a file, and then continue crawling.
Here, again, you can use Beautiful Soup or a RegEx to see whether it has in fact been included in the page.
Create a total score based on how many times each string is found on the page (1 point for each string "a", 2 points for each string "b") etc. etc.
That will be done as you crawl through the pages; with a RegEx you can count how many times a particular instance of a text pattern occurs, so you just add the results up in a dictionary.
Perhaps create a mapping, so score = {'a': 10}; if a is found: points += score['a'] * occurrences, as in the sketch below.
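A minimal sketch of those last two steps, reusing url, html, and soup from the crawl sketch above; the weights for "c", "d", "e" are assumptions, since the question only gives points for "a" and "b":

import re

# 1 point for each "a", 2 points for each "b" (the question's example weights)
weights = {"a": 1, "b": 2, "c": 1, "d": 1, "e": 1}

# is javascriptfile.js referenced by any <script> tag on this page?
has_js = any("javascriptfile.js" in (s.get("src") or "")
             for s in soup.findAll("script"))
if not has_js:
    with open("missing_js.txt", "a") as out:  # log the page, then keep crawling
        out.write(url + "\n")

# count each string's occurrences and accumulate the weighted total score
score = 0
for needle, points in weights.items():
    score += len(re.findall(re.escape(needle), html)) * points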
A good RegEx reference: Regular Expressions Info
Well, point 1 really goes like this (in PHP):
- load the html page -- you can use file_get_contents() or curl (recommended) for this
- perform some preg_match calls looking for a, b, c and the js script name on the page, OR use http://www.php.net/manual/en/book.dom.php to load the page as XML and perform some xpath on it ( http://www.php.net/manual/en/book.dom.php#93637 ) (recommended)
Only then can you move on to points 2 and 3.
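The answer above is PHP-specific; as a rough sketch of the same load-as-a-DOM-and-run-xpath idea in Python (the language used elsewhere in this thread), assuming the third-party lxml package is installed:

import urllib2
from lxml import html

page = html.fromstring(urllib2.urlopen("http://sitename/page.html").read())
# xpath query for <script> tags whose src mentions the js file from the question
scripts = page.xpath('//script[contains(@src, "javascriptfile.js")]')
# the page text can then be searched for the strings a, b, c, ...
count_a = page.text_content().count("a")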
I don't quite understand the question, but I think this will help:
Just create a simple crawler that inserts the data into a database. Then, in another PHP file, select those rows from the table, find the specific parts of the crawled text, and give them the values you want. Then update the database.
Here is a piece of PHP crawler code:
<?php
$urls = array("http://www.chilledlime.com");
$parsed = array();
$sitesvisited = 0;

mysql_connect("localhost", "username", "password");
mysql_select_db("db_name");
mysql_query("DROP TABLE search;");
mysql_query("CREATE TABLE search (URL CHAR(255), Contents TEXT);");
mysql_query("ALTER TABLE search ADD FULLTEXT(Contents);");

function parse_site() {
    GLOBAL $urls, $parsed, $sitesvisited;
    $newsite = array_shift($urls);
    echo "\n Now parsing $newsite...\n";

    // the @ is because not all URLs are valid, and we don't want
    // lots of errors being printed out
    $ourtext = @file_get_contents($newsite);
    if (!$ourtext) return;

    $newsite = addslashes($newsite);
    $ourtext = addslashes($ourtext);
    mysql_query("INSERT INTO search VALUES ('$newsite', '$ourtext');");

    // this site has been successfully indexed; increment the counter
    ++$sitesvisited;

    // this extracts all hyperlinks in the document
    preg_match_all("/http:\/\/[A-Z0-9_\-\.\/\?\#\=\&]*/i", $ourtext, $matches);

    if (count($matches)) {
        $matches = $matches[0];
        $nummatches = count($matches);
        echo "Got $nummatches from $newsite\n";

        foreach ($matches as $match) {
            // we want to ignore all these strings
            if (stripos($match, ".exe") !== false) continue;

            // yes, these next two are very vague, but they do cut out
            // the vast majority of advertising links. Like I said,
            // this indexer is far from perfect!
            if (stripos($match, "ads.") !== false) continue;
            if (stripos($match, "ad.") !== false) continue;
            if (stripos($match, "doubleclick") !== false) continue;

            // this URL looks safe
            if (!in_array($match, $parsed)) { // we haven't already parsed this URL...
                if (!in_array($match, $urls)) { // we don't already plan to parse this URL...
                    array_push($urls, $match);
                    echo "Adding $match...\n";
                }
            }
        }
    } else {
        echo "Got no matches from $newsite\n";
    }

    // add this site to the list we've visited already
    $parsed[] = $newsite;
}

while ($sitesvisited < 50 && count($urls) != 0) {
    parse_site();
    // this stops us from overloading web servers
    sleep(5);
}
?>
Good luck!
Just another option: I used html5lib more than a year ago, and it seemed a decent choice for parsing HTML. See: http://code.google.com/p/html5lib/wiki/UserDocumentation
Here is an example that processes this search result page: http://index.hu/24ora?tol=2010-08-25&ig=2011-08-25 (it's in Hungarian) and extracts the number of search results.
from datetime import datetime, timedelta
from html5lib import treebuilders, treewalkers, serializer
import html5lib
import re
import urllib2
import sys
def openURL(url):
    """
    utility function, returns (page, url)
    sets user_agent and resolves possible redirection
    returned url may be different than initial url in the case of a redirect
    """
    request = urllib2.Request(url)
    user_agent = "Mozilla/5.0 (X11; U; Linux x86_64; fr; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5.5"
    request.add_header("User-Agent", user_agent)
    pagefile = urllib2.urlopen(request)
    realurl = pagefile.geturl()
    return (pagefile, realurl)

def daterange(start, stop, step=timedelta(days=1), inclusive=True):
    """
    utility function, yields the dates within the specified range
    """
    # inclusive=False to behave like range by default
    if step.days > 0:
        while start < stop:
            yield start
            start = start + step
            # not +=! don't modify object passed in if it's mutable
            # since this function is not restricted to
            # only types from datetime module
    elif step.days < 0:
        while start > stop:
            yield start
            start = start + step
    if inclusive and start == stop:
        yield start

def processURLindex(url):
    """
    process an url of an index.hu search result page
    returns number of search results
    e.g. http://index.hu/24ora/?s=LMP&tol=2010-04-02&ig=2010-04-02
    """
    (f, new_url) = openURL(url)
    parser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("dom"))
    tree = parser.parse(f)
    tree.normalize()
    for span in tree.getElementsByTagName("span"):
        if span.hasAttribute("class") and (span.getAttribute("class") == "talalat"):
            return re.findall(r'\d+', span.firstChild.data)[0]

def daterange2URLindex(term, start_date, end_date):
    urlpattern = "http://index.hu/24ora/?s=$TERM$&tol=2010-04-02&ig=2010-04-02"
    cum = 0
    for single_date in daterange(start_date, end_date):
        datestr = single_date.strftime("%Y-%m-%d")
        # substitute the current date into both date fields of the pattern
        url = re.sub(r"\d\d\d\d-\d\d-\d\d", datestr, urlpattern)
        url = url.replace("$TERM$", term)
        num = int(processURLindex(url))
        cum = cum + num
        print "\t".join([str(num), str(cum), datestr, url])

if __name__ == '__main__':
    if len(sys.argv) == 4:
        start_date = datetime.strptime(sys.argv[2], '%Y-%m-%d')
        end_date = datetime.strptime(sys.argv[3], '%Y-%m-%d')
        daterange2URLindex(sys.argv[1], start_date, end_date)
    else:
        print 'search index.hu within a date range; usage:'
        print 'index.hu.py [search term] [from date] [to date] > results.txt'
        print 'the date format is yyyy-mm-dd'
        print 'the output format is TAB delimited and will be the following:'
        print '[count of search results]TAB[count cumulated]TAB[date]TAB[search URL for that date]'
        sys.exit(-1)
I suggest python's urllib.
--this is from here
Then use python's html parser, as in the sketch below.
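A minimal sketch of that suggestion, assuming Python 2 (urllib and the standard-library HTMLParser module); "sitename" is again a placeholder:

import urllib
from HTMLParser import HTMLParser

class LinkParser(HTMLParser):
    # print the href of every anchor tag the parser encounters
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    print value

html = urllib.urlopen("http://sitename/").read()  # placeholder site
LinkParser().feed(html)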