重构循环?
我想循环超过 200,000 个用户数据集来过滤 30,000 个产品,我如何优化这个嵌套的大循环以获得最佳性能?
//settings , 5 max per user, can up to 200,000
$settings = array(...);
//all prods, up to 30,000
$prods = array(...);
//all prods category relation map, up to 2 * 30,000
$prods_cate_ref_all = array(...);
//msgs filtered by settings saved yesterday , more then 100 * 200,000
$msg_all = array(...);
//filter counter
$j = 0;
//filter result
$res = array();
foreach($settings as $set){
foreach($prods as $k=>$p){
//filter prods by site_id
if ($set['site_id'] != $p['site_id']) continue;
//filter prods by city_id , city_id == 0 is all over the country
if ($set['city_id'] != $p['city_id'] && $p['city_id'] > 0) continue;
//muti settings of a user may get same prods
if (prod_in($p['id'], $set['uuid'], $res)) continue;
//prods filtered by settings saved to msg table yesterday
if (msg_in($p['id'], $set['uuid'], $msg_all)) continue;
//filter prods by category id
if (!prod_cate_in($p['id'], $set['cate_id'], $prods_cate_ref_all)) continue;
//filter prods by tags of set not in prod title, website ...
$arr = array($p['title'], $p['website'], $p['detail'], $p['shop'], $p['tags']);
if (!tags_in($set['tags'], $arr)) continue;
$res[$j]['name'] = $v['name'];
$res[$j]['prod_id'] = $p['id'];
$res[$j]['uuid'] = $v['uuid'];
$res[$j]['msg'] = '...';
$j++;
}
}
save_to_msg($res);
function prod_in($prod_id, $uuid, $prod_all){
foreach($prod_all as $v){
if ($v['prod_id'] == $prod_id && $v['uuid'] == $uuid)
return true;
}
return false;
}
function prod_cate_in($prod_id, $cate_id, $prod_cate_all){
foreach($prod_cate_all as $v){
if ($v['prod_id'] == $prod_id && $v['cate_id'] == $cate_id)
return true;
}
return false;
}
function tags_in($tags, $arr){
$tag_arr = explode(',', str_replace(',', ',', $tags));
foreach($tag_arr as $v){
foreach($arr as $a){
if(strpos($a, strtolower($v)) !== false){
return true;
}
}
}
return false;
}
function msg_in($prod_id, $uuid, $msg_all){
foreach($msg_all as $v){
if ($v['prod_id'] == $prod_id && $v['uuid'] == $uuid)
return true;
}
return false;
}
更新: 多谢。 是的,数据在mysql中,下面是主要结构:
-- user settings to filter prods, 5 max per user
CREATE TABLE setting(
id INT NOT NULL AUTO_INCREMENT,
uuid VARCHAR(100) NOT NULL DEFAULT '',
tags VARCHAR(100) NOT NULL DEFAULT '',
site_id SMALLINT UNSIGNED NOT NULL DEFAULT 0,
city_id MEDIUMINT UNSIGNED NOT NULL DEFAULT 0,
cate_id MEDIUMINT UNSIGNED NOT NULL DEFAULT 0,
addtime INT UNSIGNED NOT NULL DEFAULT 0,
PRIMARY KEY (`id`),
KEY `idx_setting_uuid` (`uuid`),
KEY `idx_setting_tags` (`tags`),
KEY `idx_setting_city_id` (`city_id`),
KEY `idx_setting_cate_id` (`cate_id`)
) DEFAULT CHARSET=utf8;
CREATE TABLE users(
id INT NOT NULL AUTO_INCREMENT,
uuid VARCHAR(100) NOT NULL DEFAULT '',
PRIMARY KEY (`id`),
UNIQUE KEY `idx_unique_uuid` (`uuid`)
) DEFAULT CHARSET=utf8;
-- filtered prods
CREATE TABLE msg_list(
id INT NOT NULL AUTO_INCREMENT,
uuid VARCHAR(100) NOT NULL DEFAULT '',
prod_id INT UNSIGNED NOT NULL DEFAULT 0,
msg TEXT NOT NULL DEFAULT '',
PRIMARY KEY (`id`),
KEY `idx_ml_uuid` (`uuid`)
) DEFAULT CHARSET=utf8;
-- prods and prod_cate_ref table in another database, so can not join it
CREATE TABLE prod(
id INT NOT NULL AUTO_INCREMENT,
website VARCHAR(100) NOT NULL DEFAULT '' COMMENT ' site name ',
site_id MEDIUMINT UNSIGNED NOT NULL DEFAULT 0,
city_id MEDIUMINT UNSIGNED NOT NULL DEFAULT 0,
title VARCHAR(50) NOT NULL DEFAULT '',
tags VARCHAR(50) NOT NULL DEFAULT '',
detail VARCHAR(500) NOT NULL DEFAULT '',
shop VARCHAR(300) NOT NULL DEFAULT '',
PRIMARY KEY (`id`),
KEY `idx_prod_tags` (`tags`),
KEY `idx_prod_site_id` (`site_id`),
KEY `idx_prod_city_id` (`city_id`),
KEY `idx_prod_mix` (`site_id`,`city_id`,`tags`)
) DEFAULT CHARSET=utf8;
CREATE TABLE prod_cate_ref(
id MEDIUMINT NOT NULL AUTO_INCREMENT,
prod_id INT NOT NULL NULL DEFAULT 0,
cate_id MEDIUMINT NOT NULL NULL DEFAULT 0,
PRIMARY KEY (`id`),
KEY `idx_pcr_mix` (`prod_id`,`cate_id`)
) DEFAULT CHARSET=utf8;
-- ENGINE all is myisam
我不知道如何仅使用一条sql来获取所有数据。
I want to loop more than 200,000 user datasets to filter 30,000 products,how can i optimize this nested big loop to get best performance?
//settings , 5 max per user, can up to 200,000
$settings = array(...);
//all prods, up to 30,000
$prods = array(...);
//all prods category relation map, up to 2 * 30,000
$prods_cate_ref_all = array(...);
//msgs filtered by settings saved yesterday , more then 100 * 200,000
$msg_all = array(...);
//filter counter
$j = 0;
//filter result
$res = array();
foreach($settings as $set){
foreach($prods as $k=>$p){
//filter prods by site_id
if ($set['site_id'] != $p['site_id']) continue;
//filter prods by city_id , city_id == 0 is all over the country
if ($set['city_id'] != $p['city_id'] && $p['city_id'] > 0) continue;
//muti settings of a user may get same prods
if (prod_in($p['id'], $set['uuid'], $res)) continue;
//prods filtered by settings saved to msg table yesterday
if (msg_in($p['id'], $set['uuid'], $msg_all)) continue;
//filter prods by category id
if (!prod_cate_in($p['id'], $set['cate_id'], $prods_cate_ref_all)) continue;
//filter prods by tags of set not in prod title, website ...
$arr = array($p['title'], $p['website'], $p['detail'], $p['shop'], $p['tags']);
if (!tags_in($set['tags'], $arr)) continue;
$res[$j]['name'] = $v['name'];
$res[$j]['prod_id'] = $p['id'];
$res[$j]['uuid'] = $v['uuid'];
$res[$j]['msg'] = '...';
$j++;
}
}
save_to_msg($res);
function prod_in($prod_id, $uuid, $prod_all){
foreach($prod_all as $v){
if ($v['prod_id'] == $prod_id && $v['uuid'] == $uuid)
return true;
}
return false;
}
function prod_cate_in($prod_id, $cate_id, $prod_cate_all){
foreach($prod_cate_all as $v){
if ($v['prod_id'] == $prod_id && $v['cate_id'] == $cate_id)
return true;
}
return false;
}
function tags_in($tags, $arr){
$tag_arr = explode(',', str_replace(',', ',', $tags));
foreach($tag_arr as $v){
foreach($arr as $a){
if(strpos($a, strtolower($v)) !== false){
return true;
}
}
}
return false;
}
function msg_in($prod_id, $uuid, $msg_all){
foreach($msg_all as $v){
if ($v['prod_id'] == $prod_id && $v['uuid'] == $uuid)
return true;
}
return false;
}
UPDATE:
Thanks a lot.
Yes, data is in mysql, Below is the main struct:
-- user settings to filter prods, 5 max per user
CREATE TABLE setting(
id INT NOT NULL AUTO_INCREMENT,
uuid VARCHAR(100) NOT NULL DEFAULT '',
tags VARCHAR(100) NOT NULL DEFAULT '',
site_id SMALLINT UNSIGNED NOT NULL DEFAULT 0,
city_id MEDIUMINT UNSIGNED NOT NULL DEFAULT 0,
cate_id MEDIUMINT UNSIGNED NOT NULL DEFAULT 0,
addtime INT UNSIGNED NOT NULL DEFAULT 0,
PRIMARY KEY (`id`),
KEY `idx_setting_uuid` (`uuid`),
KEY `idx_setting_tags` (`tags`),
KEY `idx_setting_city_id` (`city_id`),
KEY `idx_setting_cate_id` (`cate_id`)
) DEFAULT CHARSET=utf8;
CREATE TABLE users(
id INT NOT NULL AUTO_INCREMENT,
uuid VARCHAR(100) NOT NULL DEFAULT '',
PRIMARY KEY (`id`),
UNIQUE KEY `idx_unique_uuid` (`uuid`)
) DEFAULT CHARSET=utf8;
-- filtered prods
CREATE TABLE msg_list(
id INT NOT NULL AUTO_INCREMENT,
uuid VARCHAR(100) NOT NULL DEFAULT '',
prod_id INT UNSIGNED NOT NULL DEFAULT 0,
msg TEXT NOT NULL DEFAULT '',
PRIMARY KEY (`id`),
KEY `idx_ml_uuid` (`uuid`)
) DEFAULT CHARSET=utf8;
-- prods and prod_cate_ref table in another database, so can not join it
CREATE TABLE prod(
id INT NOT NULL AUTO_INCREMENT,
website VARCHAR(100) NOT NULL DEFAULT '' COMMENT ' site name ',
site_id MEDIUMINT UNSIGNED NOT NULL DEFAULT 0,
city_id MEDIUMINT UNSIGNED NOT NULL DEFAULT 0,
title VARCHAR(50) NOT NULL DEFAULT '',
tags VARCHAR(50) NOT NULL DEFAULT '',
detail VARCHAR(500) NOT NULL DEFAULT '',
shop VARCHAR(300) NOT NULL DEFAULT '',
PRIMARY KEY (`id`),
KEY `idx_prod_tags` (`tags`),
KEY `idx_prod_site_id` (`site_id`),
KEY `idx_prod_city_id` (`city_id`),
KEY `idx_prod_mix` (`site_id`,`city_id`,`tags`)
) DEFAULT CHARSET=utf8;
CREATE TABLE prod_cate_ref(
id MEDIUMINT NOT NULL AUTO_INCREMENT,
prod_id INT NOT NULL NULL DEFAULT 0,
cate_id MEDIUMINT NOT NULL NULL DEFAULT 0,
PRIMARY KEY (`id`),
KEY `idx_pcr_mix` (`prod_id`,`cate_id`)
) DEFAULT CHARSET=utf8;
-- ENGINE all is myisam
I don't know how to use just one sql to fetch all.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
谢谢大家对我的启发,我终于明白了,这确实是一个如此简单的方法,但却迈出了一大步!
我重新组合 $prods_cate_ref_all 和 $msg_all 中的数据(最后使用这两个函数),
还有结果数组 $res,
然后使用 strpos 和 in_array 而不是三个迭代函数(prod_in msg_in prod_cate_in),
我得到了惊人的 50 倍加速!
随着数据越大,效果就越有效。
Thank you all for inspiring me, I finally got it, it's a so simple method indeed, but a huge step!
i regroup the data in $prods_cate_ref_all and $msg_all (use the two function at the end),
also the result array $res,
then use strpos and in_array instead of three iterate function (prod_in msg_in prod_cate_in) ,
i got an amazing 50x speed up!!!
As the data goes larger, the effect becomes more effective.
由于您在外循环中已经有了更大的集合,因此很难判断可以在哪里优化它。您可以内联函数代码以节省函数调用或从许多 foreach 中部分展开。
例如,在这个函数中,
您基本上使用数组进行字符串访问。更直接地进行字符串处理(注意:完整标签匹配,您的进行了部分匹配):
但是由于您拥有大量数据,所以事情会花费很长时间。
因此,除了较小的代码移动和更改之外,您可能还需要规模。如果你之前能部分解决问题,你就可以做一个map和reduce策略。也许您已经在使用基于文档的数据库,该数据库为此提供了接口。
As you're already have the larger set in the outer loop it's hard to tell where you can optimize this. You could inline the functions code to spare the function calls or partly unroll from your many foreach's.
For example, inside this function
You're basically using the arrays for string access. Do strings more direct (Note: full tag matching, yours did partial matches):
But as you've got a large amount of data, things will just take long.
So next to smaller code movements and changes you're probably looking for scale. If you can solve problems partly before, you could do a map and reduce strategy. Maybe you're already using a document based database that offers an interface for that.