How to crawl html pages to create a subjective overall site score
Thanks in advance for your help.
I have been exhaustively trying to find/write a utility that does the following:
Crawls through a specified site (sitename) looking for various strings (a, b, c, d, e) in all html pages on the site AND a specific named javascript file (javascriptfile.js)
If the javascript file is not found on an individual page, output the name/url of the page to a file, and then continue crawling.
Create a total score based on how many times each string is found on the page (1 point for each string "a", 2 points for each string "b") etc. etc.
I am stuck at the first part -- because I don't have the coding skills to write the crawling portion. I have tried Wget, pavuk, mechanize, and some php scripts, but they all seem to have limitations as well.
Anyone have any examples or thoughts on how I can either use or modify one of the mentioned utilities, or write a script that would accomplish the above?
I am open to C, java, php, perl, etc... -- just want to get this done!
Thanks so much for your help!!!
5 Answers
Crawls through a specified site (sitename) looking for various strings (a, b, c, d, e) in all html pages on the site AND a specific named javascript file (javascriptfile.js)
In python you would want to use urllib. That will let you easily communicate with an HTTP server.
Then you need to look into regular expressions, which will let you do the crawling and the string searching. Since most servers don't have an open index, you'll need to find the <a> tags, strip out everything except where they point, and then take the new destination to crawl:
- get the Href attribute from the anchor tags
- compare the domains to make sure they are the same, or handle relative paths (those starting with "/")
- repeat the process
You can look at "beautifulsoup" to help you with this. It will do all the hard work of reading the HTML for you: Beautiful Soup
It should even help with searching for your strings.
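A minimal sketch of that crawl loop, assuming Python 2 with the third-party BeautifulSoup package installed; "sitename" is a placeholder from the question:

import urllib2
from urlparse import urljoin, urlparse
from bs4 import BeautifulSoup

start_url = "http://sitename/"  # placeholder start page from the question
to_visit = [start_url]
seen = set(to_visit)

while to_visit:
    url = to_visit.pop(0)
    html = urllib2.urlopen(url).read()           # talk to the HTTP server
    soup = BeautifulSoup(html, "html.parser")    # let BeautifulSoup read the HTML
    for a in soup.findAll("a", href=True):       # find the <a> tags
        link = urljoin(url, a["href"])           # resolves relative paths ("/...")
        # compare domains so the crawl stays on the same site
        if urlparse(link).netloc == urlparse(start_url).netloc and link not in seen:
            seen.add(link)
            to_visit.append(link)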
If the javascript file is not found on an individual page, output the name/url of the page to a file, and then continue crawling.
Here, again, you can use Beautiful Soup or a RegEx to see whether it has in fact been included in the page.
Create a total score based on how many times each string is found on the page (1 point for each string "a", 2 points for each string "b") etc. etc.
That will be done as you crawl through the pages; with a RegEx you can count how many times a particular instance of a text pattern occurs, so you just add the results up in a dictionary.
Perhaps create a mapping, so score = {'a': 10}; if a is found: points += score['a'] * occurrences, as in the sketch below.
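A minimal sketch of those last two steps, reusing url, html, and soup from the crawl sketch above; the weights for "c", "d", "e" are assumptions, since the question only gives points for "a" and "b":

import re

# 1 point for each "a", 2 points for each "b" (the question's example weights)
weights = {"a": 1, "b": 2, "c": 1, "d": 1, "e": 1}

# is javascriptfile.js referenced by any <script> tag on this page?
has_js = any("javascriptfile.js" in (s.get("src") or "")
             for s in soup.findAll("script"))
if not has_js:
    with open("missing_js.txt", "a") as out:  # log the page, then keep crawling
        out.write(url + "\n")

# count each string's occurrences and accumulate the weighted total score
score = 0
for needle, points in weights.items():
    score += len(re.findall(re.escape(needle), html)) * points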
A good RegEx reference: Regular Expressions Info
Well, point 1 really goes like this (in PHP):
- load the html page -- you can use file_get_contents() or curl (recommended) for this
- perform some preg_match calls looking for a, b, c and the js script name on the page, OR use http://www.php.net/manual/en/book.dom.php to load the page as XML and perform some xpath on it ( http://www.php.net/manual/en/book.dom.php#93637 ) (recommended)
Only then can you move on to points 2 and 3.
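The answer above is PHP-specific; as a rough sketch of the same load-as-a-DOM-and-run-xpath idea in Python (the language used elsewhere in this thread), assuming the third-party lxml package is installed:

import urllib2
from lxml import html

page = html.fromstring(urllib2.urlopen("http://sitename/page.html").read())
# xpath query for <script> tags whose src mentions the js file from the question
scripts = page.xpath('//script[contains(@src, "javascriptfile.js")]')
# the page text can then be searched for the strings a, b, c, ...
count_a = page.text_content().count("a")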
I don't quite understand the question, but I think this will help:
Just create a simple crawler that inserts the data into a database. Then, in another PHP file, select those rows from the table, find the specific parts of the crawled text, and give them the values you want. Then update the database.
Here is a piece of PHP crawler code:
<?php
$urls = array("http://www.chilledlime.com");
$parsed = array();
$sitesvisited = 0;

mysql_connect("localhost", "username", "password");
mysql_select_db("db_name");
mysql_query("DROP TABLE search;");
mysql_query("CREATE TABLE search (URL CHAR(255), Contents TEXT);");
mysql_query("ALTER TABLE search ADD FULLTEXT(Contents);");

function parse_site() {
    GLOBAL $urls, $parsed, $sitesvisited;
    $newsite = array_shift($urls);
    echo "\n Now parsing $newsite...\n";

    // the @ is because not all URLs are valid, and we don't want
    // lots of errors being printed out
    $ourtext = @file_get_contents($newsite);
    if (!$ourtext) return;

    $newsite = addslashes($newsite);
    $ourtext = addslashes($ourtext);
    mysql_query("INSERT INTO search VALUES ('$newsite', '$ourtext');");

    // this site has been successfully indexed; increment the counter
    ++$sitesvisited;

    // this extracts all hyperlinks in the document
    preg_match_all("/http:\/\/[A-Z0-9_\-\.\/\?\#\=\&]*/i", $ourtext, $matches);

    if (count($matches)) {
        $matches = $matches[0];
        $nummatches = count($matches);
        echo "Got $nummatches from $newsite\n";

        foreach ($matches as $match) {
            // we want to ignore all these strings
            if (stripos($match, ".exe") !== false) continue;

            // yes, these next two are very vague, but they do cut out
            // the vast majority of advertising links. Like I said,
            // this indexer is far from perfect!
            if (stripos($match, "ads.") !== false) continue;
            if (stripos($match, "ad.") !== false) continue;
            if (stripos($match, "doubleclick") !== false) continue;

            // this URL looks safe
            if (!in_array($match, $parsed)) { // we haven't already parsed this URL...
                if (!in_array($match, $urls)) { // we don't already plan to parse this URL...
                    array_push($urls, $match);
                    echo "Adding $match...\n";
                }
            }
        }
    } else {
        echo "Got no matches from $newsite\n";
    }

    // add this site to the list we've visited already
    $parsed[] = $newsite;
}

while ($sitesvisited < 50 && count($urls) != 0) {
    parse_site();
    // this stops us from overloading web servers
    sleep(5);
}
?>
Good luck!
Just another option: I used html5lib more than a year ago, and it seemed a decent choice for parsing HTML. See: http://code.google.com/p/html5lib/wiki/UserDocumentation
Here is an example that processes this search result page: http://index.hu/24ora?tol=2010-08-25&ig=2011-08-25 (it's in Hungarian) and extracts the number of search results.
from datetime import datetime, timedelta
from html5lib import treebuilders, treewalkers, serializer
import html5lib
import re
import urllib2
import sys
def openURL(url):
    """
    utility function, returns (page, url)
    sets user_agent and resolves possible redirection
    returned url may be different than initial url in the case of a redirect
    """
    request = urllib2.Request(url)
    user_agent = "Mozilla/5.0 (X11; U; Linux x86_64; fr; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5.5"
    request.add_header("User-Agent", user_agent)
    pagefile = urllib2.urlopen(request)
    realurl = pagefile.geturl()
    return (pagefile, realurl)

def daterange(start, stop, step=timedelta(days=1), inclusive=True):
    """
    utility function, yields the dates within the specified range
    """
    # inclusive=False to behave like range by default
    if step.days > 0:
        while start < stop:
            yield start
            start = start + step
            # not +=! don't modify object passed in if it's mutable
            # since this function is not restricted to
            # only types from datetime module
    elif step.days < 0:
        while start > stop:
            yield start
            start = start + step
    if inclusive and start == stop:
        yield start

def processURLindex(url):
    """
    process an url of an index.hu search result page
    returns number of search results
    e.g. http://index.hu/24ora/?s=LMP&tol=2010-04-02&ig=2010-04-02
    """
    (f, new_url) = openURL(url)
    parser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("dom"))
    tree = parser.parse(f)
    tree.normalize()
    for span in tree.getElementsByTagName("span"):
        if span.hasAttribute("class") and (span.getAttribute("class") == "talalat"):
            return re.findall(r'\d+', span.firstChild.data)[0]

def daterange2URLindex(term, start_date, end_date):
    urlpattern = "http://index.hu/24ora/?s=$TERM$&tol=2010-04-02&ig=2010-04-02"
    cum = 0
    for single_date in daterange(start_date, end_date):
        datestr = single_date.strftime("%Y-%m-%d")
        # substitute the current date into both date fields of the pattern
        url = re.sub(r"\d\d\d\d-\d\d-\d\d", datestr, urlpattern)
        url = url.replace("$TERM$", term)
        num = int(processURLindex(url))
        cum = cum + num
        print "\t".join([str(num), str(cum), datestr, url])

if __name__ == '__main__':
    if len(sys.argv) == 4:
        start_date = datetime.strptime(sys.argv[2], '%Y-%m-%d')
        end_date = datetime.strptime(sys.argv[3], '%Y-%m-%d')
        daterange2URLindex(sys.argv[1], start_date, end_date)
    else:
        print 'search index.hu within a date range; usage:'
        print 'index.hu.py [search term] [from date] [to date] > results.txt'
        print 'the date format is yyyy-mm-dd'
        print 'the output format is TAB delimited and will be the following:'
        print '[count of search results]TAB[count cumulated]TAB[date]TAB[search URL for that date]'
        sys.exit(-1)
I suggest python's urllib.
--this is from here
Then use python's html parser, as in the sketch below.
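A minimal sketch of that suggestion, assuming Python 2 (urllib and the standard-library HTMLParser module); "sitename" is again a placeholder:

import urllib
from HTMLParser import HTMLParser

class LinkParser(HTMLParser):
    # print the href of every anchor tag the parser encounters
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    print value

html = urllib.urlopen("http://sitename/").read()  # placeholder site
LinkParser().feed(html)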