How to crawl HTML pages to create a subjective total score for a website

Thanks in advance for your help.

I have been exhaustively trying to find/write a utility that does the following:

  1. Crawls through a specified site (sitename) looking for various strings (a, b, c, d, e) in all html pages on the site AND a specific named javascript file (javascriptfile.js)

  2. If the javascript file is not found on an individual page, output the name/url of the page to a file, and then continue crawling.

  3. Create a total score based on how many times each string is found on the page (1 point for each string "a", 2 points for each string "b") etc. etc.

I am stuck at the first part -- because I don't have the coding skills to write the crawling portion. I have tried Wget, pavuk, mechanize, and some PHP scripts, but they all seem to have limitations as well.

Does anyone have any examples or thoughts on how I can either use or modify one of the mentioned utilities, or write a script that would accomplish the above?

I am open to C, Java, PHP, Perl, etc... -- I just want to get this done!

Thanks so much for your help!!!

Comments (5)

你是我的挚爱i · 2024-12-08 21:42:55

I suggest Python's urllib.

Fetching Web Pages

Fetching standard Web pages over HTTP is very easy with Python:

import urllib
f = urllib.urlopen("http://www.python.org")
s = f.read()
f.close()

--this is from here

Then use Python's HTML parser.
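
The snippet above is Python 2. A minimal Python 3 sketch of the same idea, fetching the page with urllib.request and pulling links out with the standard-library HTMLParser, might look like this (the URL is just the one from the example):

from html.parser import HTMLParser
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag seen while parsing."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

with urlopen("http://www.python.org") as f:
    html = f.read().decode("utf-8", errors="replace")

parser = LinkCollector()
parser.feed(html)
print(parser.links[:10])    # first few links found on the page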

谁的年少不轻狂 · 2024-12-08 21:42:55

Crawls through a specified site (sitename) looking for various strings (a, b, c, d, e) in all html pages on the site AND a specific named javascript file (javascriptfile.js)

In Python you'll want to use urllib. This will allow you to communicate with HTTP servers easily.
Then you'll want to look into regular expressions; these will let you do the crawling and the string searching. Since most servers don't have an open index, you'll need to find <a> tags, strip out everything but where they point, and then grab a new destination to crawl to:

  1. Get the href attribute from anchor tags.

  2. Compare domains to make sure they're the same, or a relative path (starting with '/').

  3. Repeat the process.

You could look into 'beautifulsoup' to help you with this. It'll do all the hard work of reading through HTML for you. Beautiful Soup

It should even help with searching for your strings; a rough sketch of these steps follows.
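
Assuming BeautifulSoup 4 (the bs4 package) and Python 3, the link-extraction steps might look something like this; extract_internal_links and the start URL are placeholder names:

from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
from bs4 import BeautifulSoup

def extract_internal_links(page_url, html):
    """Return absolute URLs of links that stay on the same domain."""
    soup = BeautifulSoup(html, "html.parser")
    site = urlparse(page_url).netloc
    links = set()
    for a in soup.find_all("a", href=True):
        absolute = urljoin(page_url, a["href"])  # resolves relative paths like '/about'
        if urlparse(absolute).netloc == site:    # keep same-domain links only
            links.add(absolute)
    return links

start = "http://www.example.com/"                # placeholder start page
page = urlopen(start).read().decode("utf-8", errors="replace")
print(extract_internal_links(start, page))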

If the javascript file is not found on an individual page, output the name/url of the page to a file, and then continue crawling.

You can once again use Beautiful Soup or a regex here to see if the page in fact includes it, e.g. <script src='urltofile'>. If it doesn't, just write the URL of the page you're currently crawling to a file and keep going.
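
A small sketch of that check, again assuming BeautifulSoup; javascriptfile.js comes from the question and missing_js.txt is just a placeholder output file:

from bs4 import BeautifulSoup

def log_if_script_missing(page_url, html, script_name="javascriptfile.js",
                          logfile="missing_js.txt"):
    """Append page_url to logfile when no <script src> mentions script_name."""
    soup = BeautifulSoup(html, "html.parser")
    included = any(script_name in tag["src"]
                   for tag in soup.find_all("script", src=True))
    if not included:
        with open(logfile, "a") as out:
            out.write(page_url + "\n")
    return included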

Create a total score based on how many times each string is found on the page (1 point for each string "a", 2 points for each string "b") etc. etc.

This can be done while you're crawling each page: using a regex you can count how many times a specific text pattern occurs, then just add those counts to a dict and total up your result.
Maybe create a mapping, so score = {'a': 10}; if 'a' is found: points += score['a'] * occurrences.
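
In Python that might look like the following sketch; the weights here just mirror the point values from the question and are only illustrative:

import re

weights = {"a": 1, "b": 2, "c": 1, "d": 1, "e": 1}   # assumed per-string point values

def score_page(text, weights):
    """Count each string in the page text and multiply by its weight."""
    points = 0
    for string, weight in weights.items():
        occurrences = len(re.findall(re.escape(string), text))
        points += weight * occurrences
    return points

print(score_page("a b b c", weights))   # 1 + 2*2 + 1 = 6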

Good Reg-Exp reference: Regexp Info

知你几分 · 2024-12-08 21:42:55

Well, point 1 is really like this (in PHP):

Only then can you go over to points 2 and 3.

别再吹冷风 · 2024-12-08 21:42:55

I haven't quite understood the question, but I think this will help:

Just create a simple crawler that inserts the data into a database. Then, in another PHP file, select these rows from the table, find the specific parts of the crawled text, and give them the value you want. Then update the DB.

Here is some crawler code in PHP:

<?php
$urls = array("http://www.chilledlime.com");
$parsed = array();

$sitesvisited = 0;

mysql_connect("localhost", "username", "password");
mysql_select_db("db_name");

mysql_query("DROP TABLE search;");
mysql_query("CREATE TABLE search (URL CHAR(255), Contents TEXT);");
mysql_query("ALTER TABLE search ADD FULLTEXT(Contents);");

function parse_site() {
    GLOBAL $urls, $parsed, $sitesvisited;

    $newsite = array_shift($urls);

    echo "\n Now parsing $newsite...\n";

    // the @ is because not all URLs are valid, and we don't want
    // lots of errors being printed out
    $ourtext = @file_get_contents($newsite);
    if (!$ourtext) return;

    $newsite = addslashes($newsite);
    $ourtext = addslashes($ourtext);

    mysql_query("INSERT INTO simplesearch VALUES ('$newsite', '$ourtext');");

    // this site has been successfully indexed; increment the counter
    ++$sitesvisited;

    // this extracts all hyperlinks in the document
    preg_match_all("/http:\/\/[A-Z0-9_\-\.\/\?\#\=\&]*/i", $ourtext, $matches);

    if (count($matches)) {
        $matches = $matches[0];
        $nummatches = count($matches);

        echo "Got $nummatches from $newsite\n";

        foreach($matches as $match) {

            // we want to ignore all these strings
            if (stripos($match, ".exe") !== false) continue;


            // yes, these next two are very vague, but they do cut out
            // the vast majority of advertising links.  Like I said,
            // this indexer is far from perfect!
            if (stripos($match, "ads.") !== false) continue;
            if (stripos($match, "ad.") !== false) continue;

            if (stripos($match, "doubleclick") !== false) continue;

            // this URL looks safe
            if (!in_array($match, $parsed)) { // we haven't already parsed this URL...
                if (!in_array($match, $urls)) { // we don't already plan to parse this URL...
                    array_push($urls, $match);
                    echo "Adding $match...\n";
                }
            }
        }
    } else {
        echo "Got no matches from $newsite\n";
    }

    // add this site to the list we've visited already
    $parsed[] = $newsite;
}

while ($sitesvisited < 50 && count($urls) != 0) {
    parse_site();

    // this stops us from overloading web servers
    sleep(5);
}
?> 

Good luck!

已下线请稍等 · 2024-12-08 21:42:55

Just another option: html5lib. More than a year ago it seemed to be a good choice for parsing HTML. See: http://code.google.com/p/html5lib/wiki/UserDocumentation

Here you go, an example processing this search result page: http://index.hu/24ora?tol=2010-08-25&ig=2011-08-25 (this is Hungarian). It will extract the number of search results.

from datetime import datetime, timedelta
from html5lib import treebuilders, treewalkers, serializer
import html5lib
import re
import urllib2
import sys

def openURL (url):
    """
    utility function, returns (page, url)
    sets user_agent and resolves possible redirection
    returned url may be different than initial url in the case of a redirect
    """    
    request = urllib2.Request(url)
    user_agent = "Mozilla/5.0 (X11; U; Linux x86_64; fr; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5.5"
    request.add_header("User-Agent", user_agent)
    pagefile=urllib2.urlopen(request)
    realurl = pagefile.geturl()
    return (pagefile, realurl)

def daterange(start, stop, step=timedelta(days=1), inclusive=True):
    """
    utility function, returns list of dates within the specified range
    """
    # inclusive=False to behave like range by default
    if step.days > 0:
        while start < stop:
            yield start
            start = start + step
            # not +=! don't modify object passed in if it's mutable
            # since this function is not restricted to
            # only types from datetime module
    elif step.days < 0:
        while start > stop:
            yield start
            start = start + step
    if inclusive and start == stop:
        yield start

def processURLindex(url):
    """
    process an url of an index.hu search result page
    returns number of search results
    e.g. http://index.hu/24ora/?s=LMP&tol=2010-04-02&ig=2010-04-02    
    """
    (f, new_url) = openURL(url)
    parser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("dom"))
    tree = parser.parse(f)
    tree.normalize()
    for span in tree.getElementsByTagName("span"):            
        if span.hasAttribute("class") and (span.getAttribute("class") =="talalat"):
            return re.findall(r'\d+', span.firstChild.data)[0]



def daterange2URLindex(term, start_date, end_date):
    urlpattern = "http://index.hu/24ora/?s=$TERM
amp;tol=2010-04-02&ig=2010-04-02"
    cum = 0
    for single_date in daterange(start_date, end_date):
        datestr = single_date.strftime("%Y-%m-%d")
        url = re.sub(r"\d\d\d\d-\d\d-\d\d", datestr, urlpattern)
        url = url.replace("$TERM$", term);
        num = int(processURLindex(url))
        cum = cum + num
        print "\t".join([str(num), str(cum), datestr, url])  


if __name__ == '__main__':
    if len(sys.argv) == 4:
        start_date = datetime.strptime(sys.argv[2], '%Y-%m-%d')
        end_date = datetime.strptime(sys.argv[3], '%Y-%m-%d')
        daterange2URLindex(sys.argv[1], start_date, end_date)
    else:
        print 'search index.hu within a date range; usage:'
        print 'index.hu.py [search term] [from date] [to date] > results.txt'
        print 'the date format is yyyy-mm-dd'
        print 'the output format is TAB delimited and will be the following:'
        print '[count of search results]TAB[count cumulated]TAB[date]TAB[search URL for that date]'
        sys.exit(-1)