I want to scrape the iTunes top X RSS feed and insert it into a DB

Posted on 2024-07-14 19:01:06

Preferably I'd like to do so with some bash shell scripting, maybe some PHP or Perl, and a MySQL DB. Thoughts?


Comments (4)

离不开的别离 2024-07-21 19:01:06

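Here is a solution using Perl, with the help of (of course!) a bunch of modules.

It uses SQLite so you can run it easily (the definition of the (simplistic) DB is at the end of the script). It also uses Perl hashes and simple SQL statements instead of proper objects and an ORM layer. I found it easier to parse the XML directly than to use an RSS module (I tried XML::Feed), because you need access to specific tags (name, preview...).

You can use it as a basis to add more features, more fields in the DB, a table for genre... At least this way you have something you can expand on (and maybe you can then publish the result as open source).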

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;                 # to parse the RSS
use DBIx::Simple;              # DB interaction made easy
use Getopt::Std;               # always need options for a script
use PerlIO::gzip;              # itunes sends a gzip-ed file
use LWP::Simple 'getstore';    # to get the RSS

my %opt;
getopts( 'vc:', \%opt);

# could also be an option, but I guess it won't change that much
my @URLs= ( 
            'http://ax.itunes.apple.com/WebObjects/MZStoreServices.woa/ws/RSS/topsongs/limit=10/xml',
          );

# during debugging, it's nice to use a cached copy of the feed instead of hitting iTunes on every single run
if( $opt{c}) { @URLs= ($opt{c}); }

# I like using SQLite when developing,
# replace with MySQL connect parameters if needed (see DBD::mysql for the exact syntax)
my @connect= ("dbi:SQLite:dbname=itunes.db","","", { RaiseError => 1, AutoCommit => 0 }) ;

my $NS_PREFIX='im';

# a global, could be passed around, but would make the code a bit more verbose
my $db = DBIx::Simple->connect(@connect) or die "cannot connect to DB: $DBI::errstr";

foreach my $url (@URLs)
  { add_feed( $url); }

$db->disconnect;

warn "done\n" if( $opt{v});

sub add_feed 
  { my( $url)= @_;

    # itunes sends gzipped RSS, so we need to unzip it
    my $tempfile= "$0.rss.gz"; # very crude, should use File::Temp instead 
    getstore($url, $tempfile);
    open( my $in_feed, '<:gzip', $tempfile) or die " cannot open tempfile: $!";

    XML::Twig->new( twig_handlers => { 'feed/title' => sub { warn "adding feed ", $_->text if $opt{v}; },
                                          entry       => \&entry,
                                       },
                      map_xmlns => { 'http://phobos.apple.com/rss' => $NS_PREFIX },
                  )
             ->parse( $in_feed);

    close $in_feed;
  }

sub entry
  { my( $t, $entry)= @_;

    # get the data
    my %song= map { $_ => $entry->field( "$NS_PREFIX:$_") } qw( name artist price);
    if( my $preview= $entry->first_child( 'link[@title="Preview"]') )
      { $song{preview}= $preview->att( 'href'); }

    # $db->begin_work;

    # store it
    if( ($db->query( 'SELECT count(*) FROM song WHERE name=?', $song{name})->flat)[0])
      { warn "  skipping $song{name}, already stored\n" if $opt{v};
      }
    else
      {
        warn "  adding $song{name}\n" if $opt{v};
        if( my $artist_id= ($db->query( 'SELECT id FROM artist WHERE name=?', $song{artist})->flat)[0])
          { warn "  existing artist $song{artist} ($artist_id)\n" if $opt{v};
            $song{artist}= $artist_id; 
          }
        else
          { warn "  creating new artist $song{artist}\n" if $opt{v};

            $db->query( 'INSERT INTO artist (name) VALUES (??)', $song{artist});

            # should be $db->last_insert_id but that's not available in DBD::SQLite at the moment
            $song{artist}= $db->func('last_insert_rowid');
          }

        $db->query( 'INSERT INTO song ( name, artist, price, preview) VALUES (??)', 
                              @song{qw( name  artist  price  preview)});
        $db->commit;
      }
    $t->purge; # keeps memory usage lower, probably not needed for small RSS files
  }

__END__
=head1 NAME

  itunes2db - loads itunes RSS feeds to a DB

=head1 OPTIONS

  -c <file>  uses a cache instead of the list of URLs
  -v         verbose

=head1 DB schema

  create table song ( id INTEGER PRIMARY KEY, name TEXT, artist INT, price TEXT, preview TEXT);
  create table artist (id INTEGER PRIMARY KEY, name TEXT);
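
For readers who do not use Perl, the same flow (read specific tags from each entry, look up or create the artist, insert the song once) could be sketched roughly as follows. This is a Python standard-library sketch, not a port of the script above. The sample feed is made up and only mirrors the tags the script reads; the "im" namespace URI is the one the script maps, and the function names are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Atom is the feed's default namespace; the "im" URI is the one the Perl script maps.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "im": "http://phobos.apple.com/rss",
}

# Hypothetical sample feed, shaped after the tags the Perl script reads.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom" xmlns:im="http://phobos.apple.com/rss">
  <title>Sample Top Songs</title>
  <entry>
    <im:name>Song One</im:name>
    <im:artist>Artist A</im:artist>
    <im:price>$0.99</im:price>
    <link title="Preview" href="http://example.com/one.m4a"/>
  </entry>
  <entry>
    <im:name>Song Two</im:name>
    <im:artist>Artist A</im:artist>
    <im:price>$1.29</im:price>
  </entry>
</feed>"""


def parse_entries(xml_text):
    """Return one dict per <entry> with name, artist, price and preview."""
    root = ET.fromstring(xml_text)
    songs = []
    for entry in root.findall("atom:entry", NS):
        song = {
            field: entry.findtext(f"im:{field}", default=None, namespaces=NS)
            for field in ("name", "artist", "price")
        }
        preview = entry.find("atom:link[@title='Preview']", NS)
        song["preview"] = preview.get("href") if preview is not None else None
        songs.append(song)
    return songs


def create_schema(conn):
    """Create the two tables; INTEGER PRIMARY KEY lets SQLite assign the ids."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS artist (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE IF NOT EXISTS song (
            id INTEGER PRIMARY KEY, name TEXT, artist INTEGER, price TEXT, preview TEXT);
    """)


def store_song(conn, song):
    """Insert a song unless one with the same name exists; return True if inserted."""
    if conn.execute("SELECT 1 FROM song WHERE name = ?", (song["name"],)).fetchone():
        return False
    row = conn.execute("SELECT id FROM artist WHERE name = ?", (song["artist"],)).fetchone()
    if row:
        artist_id = row[0]
    else:
        cursor = conn.execute("INSERT INTO artist (name) VALUES (?)", (song["artist"],))
        artist_id = cursor.lastrowid
    conn.execute(
        "INSERT INTO song (name, artist, price, preview) VALUES (?, ?, ?, ?)",
        (song["name"], artist_id, song["price"], song["preview"]),
    )
    conn.commit()
    return True


def load_feed(conn, xml_text):
    """Store every entry of the feed and return how many songs were new."""
    return sum(store_song(conn, song) for song in parse_entries(xml_text))
```

Calling load_feed twice on the same feed should add nothing the second time, since songs are matched by name, as in the script above.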

情徒 2024-07-21 19:01:06

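From what I can tell, it's not actively maintained, but Scriptella could be of some assistance. It uses a very simple XML script and runs on Java.

Example of how to pull an RSS feed into a database: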

<!DOCTYPE etl SYSTEM "http://scriptella.javaforge.com/dtd/etl.dtd">
<etl>
    <connection id="in" driver="xpath" url="http://snippets.dzone.com/rss"/>
    <connection id="out" driver="text" url="rss.txt"/>
    <connection id="db" driver="hsqldb" url="jdbc:hsqldb:db/rss" user="sa" classpath="hsqldb.jar"/>
    <script connection-id="db">
       CREATE TABLE Rss (
           ID Integer,
           Title VARCHAR(255),
           Description VARCHAR(255),   
           Link VARCHAR(255)

       )
    </script>
    <query connection-id="in">
        /rss/channel/item
        <script connection-id="out">
            Title: $title
            Description: [
            ${description.substring(0, 20)}...
            ]
            Link: $link
            ----------------------------------
        </script>
        <script connection-id="db">
            INSERT INTO Rss (ID, Title, Description, Link) 
            VALUES (?rownum, ?title, ?description, ?link);
        </script>
    </query>
</etl>

像你 2024-07-21 19:01:06

Well, I'm not really sure what sort of answer you're looking for, but I don't think you need to do any sort of shell scripting. Both PHP and Perl are perfectly capable of downloading the RSS feed and inserting the data into MySQL. Set the PHP or Perl script up to run every X hours/days/whatever with a cron job and you're done.

There's not really much else to tell you, given how vague your question is.
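
To make the "download the feed" step concrete, here is a minimal sketch in Python (swapped in for the PHP or Perl this answer names). It assumes the feed is UTF-8 and may arrive gzip-compressed, as another answer in this thread notes for iTunes. The function names and the timeout value are hypothetical.

```python
import gzip
import urllib.request

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(data: bytes) -> bytes:
    """Decompress data if it looks like gzip, otherwise return it unchanged."""
    return gzip.decompress(data) if data[:2] == GZIP_MAGIC else data


def fetch_feed(url: str, timeout: float = 30.0) -> str:
    """Download a feed and return its text (assumes UTF-8 content)."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()
    return maybe_gunzip(body).decode("utf-8")
```

A script built around fetch_feed could then be scheduled with a crontab line such as `0 */6 * * * python3 /path/to/fetch_feed.py` (path hypothetical), which would run it every six hours.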

美羊羊 2024-07-21 19:01:06

I'm scraping Stack Overflow's feed and doing some additional filtering: I load it with PHP's DOMDocument and then use DOM methods to access what I want. I'd suggest looking into that.
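
As a rough illustration of this DOM-style filtering, here is a sketch using Python's xml.dom.minidom in place of PHP's DOMDocument. The sample feed, the keyword filter and the function name are all made up for the example; the poster's actual filtering is not shown in the thread.

```python
from xml.dom import minidom

# Hypothetical two-entry Atom feed used only to exercise the filter.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>How to parse RSS in PHP</title><link href="http://example.com/1"/></entry>
  <entry><title>Cooking with cast iron</title><link href="http://example.com/2"/></entry>
</feed>"""


def filter_entries(xml_text, keyword):
    """Return (title, href) pairs for entries whose title contains keyword."""
    doc = minidom.parseString(xml_text)
    matches = []
    for entry in doc.getElementsByTagName("entry"):
        titles = entry.getElementsByTagName("title")
        if not titles or titles[0].firstChild is None:
            continue  # skip entries without a usable title
        title = titles[0].firstChild.data
        if keyword.lower() in title.lower():
            links = entry.getElementsByTagName("link")
            href = links[0].getAttribute("href") if links else ""
            matches.append((title, href))
    return matches
```

The same pattern (getElementsByTagName, then reading text and attributes) carries over to DOMDocument, though the PHP method names differ slightly in spelling and casing.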
