I want to fetch the iTunes top X RSS feed and insert it into a DB

Posted on 2024-07-14 19:01:06

Preferably I'd like to do this with some bash shell scripting, maybe some PHP or Perl, and a MySQL database. Thoughts?

离不开的别离 2024-07-21 19:01:06

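Here is a solution using Perl, with the help of (of course!) a bunch of modules.

It uses SQLite so you can run it easily (the definition of the (simplistic) DB is at the end of the script). It also uses Perl hashes and simple SQL statements instead of proper objects and an ORM layer. I found it easier to parse the XML directly than to use an RSS module (I tried XML::Feed), because you need access to specific tags (name, preview...).

You can use it as a basis to add more features, more fields in the DB, a table for genres... At least this way you have something you can expand on (and maybe you can then publish the result as open source).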

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;                 # to parse the RSS
use DBIx::Simple;              # DB interaction made easy
use Getopt::Std;               # always need options for a script
use PerlIO::gzip;              # itunes sends a gzip-ed file
use LWP::Simple 'getstore';    # to get the RSS

my %opt;
getopts( 'vc:', \%opt);

# could also be an option, but I guess it won't change that much
my @URLs= ( 
            'http://ax.itunes.apple.com/WebObjects/MZStoreServices.woa/ws/RSS/topsongs/limit=10/xml',
          );

# during debug, it's nice to use a cache of the feed instead of hitting the server on every single run
if( $opt{c}) { @URLs= ($opt{c}); }

# I like using SQLite when developing,
# replace with MySQL connect parameters if needed (see DBD::mysql for the exact syntax)
my @connect= ("dbi:SQLite:dbname=itunes.db","","", { RaiseError => 1, AutoCommit => 0 }) ;
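# A hypothetical MySQL equivalent (database name, host, user and password are placeholders):
#   my @connect= ("dbi:mysql:database=itunes;host=localhost","user","password", { RaiseError => 1, AutoCommit => 0 });
# Assumption: with MySQL the id columns of the schema would need AUTO_INCREMENT, and the
# SQLite-only last_insert_rowid call below would become $db->last_insert_id(undef, undef, 'artist', 'id').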

my $NS_PREFIX='im';

# a global, could be passed around, but would make the code a bit more verbose
my $db = DBIx::Simple->connect(@connect) or die "cannot connect to DB: $DBI::errstr";

foreach my $url (@URLs)
  { add_feed( $url); }

$db->disconnect;

warn "done\n" if( $opt{v});

sub add_feed 
  { my( $url)= @_;

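    # itunes sends gzipped RSS, so we need to unzip it
    my $tempfile= "$0.rss.gz"; # very crude, should use File::Temp instead 
    my $status= getstore($url, $tempfile);   # getstore returns the HTTP status code
    die "cannot fetch $url: HTTP status $status\n" unless $status =~ /^2\d\d$/;
    open( my $in_feed, '<:gzip', $tempfile) or die "cannot open $tempfile: $!";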

    XML::Twig->new( twig_handlers => { 'feed/title' => sub { warn "adding feed ", $_->text if $opt{v}; },
                                          entry       => \&entry,
                                       },
                      map_xmlns => { 'http://phobos.apple.com/rss' => $NS_PREFIX },
                  )
             ->parse( $in_feed);

    close $in_feed;
  }

sub entry
  { my( $t, $entry)= @_;

    # get the data
    my %song= map { $_ => $entry->field( "$NS_PREFIX:$_") } qw( name artist price);
    if( my $preview= $entry->first_child( 'link[@title="Preview"]') )
      { $song{preview}= $preview->att( 'href'); }

    # $db->begin_work;

    # store it
    if( ($db->query( 'SELECT count(*) FROM song WHERE name=?', $song{name})->flat)[0])
      { warn "  skipping $song{name}, already stored\n" if $opt{v};
      }
    else
      {
        warn "  adding $song{name}\n" if $opt{v};
        if( my $artist_id= ($db->query( 'SELECT id from ARTIST where name=?', $song{artist})->flat)[0])
          { warn "  existing artist $song{artist} ($artist_id)\n" if $opt{v};
            $song{artist}= $artist_id; 
          }
        else
          { warn "  creating new artist $song{artist}\n" if $opt{v};

            $db->query( 'INSERT INTO artist (name) VALUES (??)', $song{artist});

            # should be $db->last_insert_id but that's not available in DBD::SQLite at the moment
            $song{artist}= $db->func('last_insert_rowid');
          }

        $db->query( 'INSERT INTO song ( name, artist, price, preview) VALUES (??)', 
                              @song{qw( name  artist  price  preview)});
        $db->commit;
      }
    $t->purge; # keeps memory usage lower, probably not needed for small RSS files
  }

__END__
=head1 NAME

  itunes2db - loads itunes RSS feeds to a DB

=head1 OPTIONS

  -c <file>  uses a cache instead of the list of URLs
  -v         verbose

=head1 DB schema

  create table song ( id INTEGER PRIMARY KEY, name TEXT, artist INT, price TEXT, preview TEXT);
  create table artist (id INTEGER PRIMARY KEY, name TEXT);
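
The script's own comment calls the "$0.rss.gz" temp file "very crude" and suggests File::Temp instead. Below is a minimal sketch of that change, using only core modules. The helper name read_gzipped and the sample payload are made up for illustration, and IO::Uncompress::Gunzip is swapped in for the PerlIO::gzip layer used above; it is one possible way to do it, not the author's method.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use File::Temp qw( tempfile);                          # core: unique temp files, cleaned up on exit
use IO::Compress::Gzip qw( gzip $GzipError);           # core: only used here to fake a download
use IO::Uncompress::Gunzip qw( gunzip $GunzipError);   # core: stands in for PerlIO::gzip

# hypothetical helper: returns the decompressed content of a gzipped file
sub read_gzipped
  { my( $file)= @_;
    my $content;
    gunzip( $file => \$content) or die "cannot gunzip $file: $GunzipError";
    return $content;
  }

# unique file name in the system temp dir, removed when the script exits
my( $fh, $tempfile)= tempfile( 'itunesXXXXXX', SUFFIX => '.rss.gz', TMPDIR => 1, UNLINK => 1);
close $fh;

# stand-in for getstore( $url, $tempfile): write a small gzipped payload
my $xml= '<feed><title>sample</title></feed>';
gzip( \$xml => $tempfile) or die "cannot gzip to $tempfile: $GzipError";

print read_gzipped( $tempfile), "\n";
```

In add_feed, the same $tempfile could be handed to getstore, and the string returned by read_gzipped passed to the twig's parse method, which accepts a string as well as a file handle.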

情徒 2024-07-21 19:01:06

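From what I can tell, it's not actively maintained, but Scriptella could be of some assistance. It uses a very simple XML script and runs on Java.

Example of how to pull an RSS feed into a database: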

<!DOCTYPE etl SYSTEM "http://scriptella.javaforge.com/dtd/etl.dtd">
<etl>
    <connection id="in" driver="xpath" url="http://snippets.dzone.com/rss"/>
    <connection id="out" driver="text" url="rss.txt"/>
    <connection id="db" driver="hsqldb" url="jdbc:hsqldb:db/rss" user="sa" classpath="hsqldb.jar"/>
    <script connection-id="db">
       CREATE TABLE Rss (
           ID Integer,
           Title VARCHAR(255),
           Description VARCHAR(255),   
           Link VARCHAR(255)

       )
    </script>
    <query connection-id="in">
        /rss/channel/item
        <script connection-id="out">
            Title: $title
            Description: [
            ${description.substring(0, 20)}...
            ]
            Link: $link
            ----------------------------------
        </script>
        <script connection-id="db">
            INSERT INTO Rss (ID, Title, Description, Link) 
            VALUES (?rownum, ?title, ?description, ?link);
        </script>
    </query>
</etl>
