How can I modify a traditional Cartesian product to reduce memory overhead?

For my problem, you're selecting up to 24 items from a pool of roughly 5,000 to 10,000 items. In other words, we're generating configurations.

The number 24 comes from the item categories: each item is associated with a particular installation location, and an item from location 1 cannot be installed in location 10. I have therefore arranged my associative array to organize the data in groups by location. Each item looks like:

$items[9][] = array("id" => "0", "2" => 2, "13" => 20);

Here the first index ( $items[9] ) tells you the location the item is allowed in. If it helps, think of it this way: you cannot install a tire in the spot for an exhaust pipe.

The items are stored in a MySQL database. The user can specify restrictions on the solution, for example, attribute 2 must have a final value of 25 or more, and there can be multiple competing restrictions. The queries retrieve items that have any value for the attributes under consideration (unspecified attributes are stored, but we don't do any calculations with them). The PHP script then prunes out any redundant choices (for example, if item 1 has an attribute value of 3 and item 2 has an attribute value of 5, then in the absence of another restriction you would never choose item 1).

After all the processing has occurred, I get an associative array that looks like:

$items[10][] = array("id" => "3", "2" => 2, "13" => 100);
$items[10][] = array("id" => "4", "2" => 3, "13" => 50);
$items[9][] = array("id" => "0", "2" => 2, "13" => 20);
$items[9][] = array("id" => "1", "2" => -1, "13" => 50);

I have posted a full example data set at this pastebin link. There is reason to believe I can be more restrictive on what I accept into the data set but even at a restriction of 2 elements per option there's a problem.

In the array() value, the id is the reference to the index of the item in the array, and the other entries are pairs of attribute id and value. So $items[10][] = array("id" => "3", "2" => 2, "13" => 100); means that in location 10 there is an item with id 3 which has a value of 2 in attribute 2 and a value of 100 in attribute 13. If it helps, think of an item as being identified by a pair, e.g. (10,0) is item 0 in location 10.

I know I'm not being specific: there are 61 attributes, and I don't think what they represent changes the structure of the problem. If we want, we can think of attribute 2 as weight and attribute 13 as cost. The problem the user wants solved might be to generate a configuration where the weight is exactly 25 and the cost is minimized.

Back-of-the-envelope math: if there were only 2 choices per location, there would be 2^24 configurations times the size of a record. Assuming a single record could somehow be encoded as a 32-bit integer, we're looking at 16,777,216 * 4 = 67,108,864 bytes of memory (utterly ignoring data structure overhead). There is no reason to believe that either of these assumptions is going to be valid, though an algorithm with an upper memory bound in the realm of 67 MB would be acceptable.
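
The estimate above can be reproduced with a few lines of PHP. This is only a sketch: countConfigurations is a hypothetical helper name, and the 4 bytes per configuration is the same optimistic assumption made in the estimate, not a measured figure.

```php
<?php
// Hypothetical helper: the number of configurations is the product of the
// number of candidate items at each location.
function countConfigurations(array $items) {
    $total = 1;
    foreach ($items as $group) {
        $total *= count($group);
    }
    return $total;
}

// 24 locations with 2 candidate items each, as in the estimate.
$items = array_fill(0, 24, array(array("id" => "0"), array("id" => "1")));

$configs = countConfigurations($items);
$bytes   = $configs * 4; // assumes one 32-bit integer per stored configuration

echo $configs, " configurations, ", $bytes, " bytes\n";
// prints: 16777216 configurations, 67108864 bytes
```

With 3 candidates per location the same product is 3^24, which is why storing the full product stops being an option very quickly.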

There's no particular reason to stick to this representation. I used associative arrays since I have a variable number of attributes to use and figured that would allow me to avoid a large, sparse array. Above, "2" => 2 actually means that the filtered attribute with id 2 has a value of 2, and similarly attribute 13's value is 100. I'm happy to change my data structure to something more compact.

One thought I had was that I do have an evaluation criterion I can use to discard most of the intermediate configurations. As an example, I can compute 75 * (value of attribute 2) + 10 * (value of attribute 13) to provide a relative weighting of the solutions. In other words, if there were no other restrictions on a problem, each improvement of attribute 2 by 1 costs 75 and each improvement of attribute 13 by 1 costs 10. Continuing the car part analogy, think of it like buying a stock part and having a machinist modify it to our specifications.
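
As a rough illustration of that weighting, here is a minimal sketch. The function name scoreConfiguration and the $weights array are hypothetical, and it assumes the score of a configuration is the weighted sum of the attribute values over all of its items.

```php
<?php
// Hypothetical weights: attribute id => cost per unit, from the example above.
$weights = array("2" => 75, "13" => 10);

// Sums weight * value over every item in a configuration.
// Attributes missing from an item contribute nothing.
function scoreConfiguration(array $config, array $weights) {
    $score = 0;
    foreach ($config as $item) {
        foreach ($weights as $attr => $weight) {
            if (isset($item[$attr])) {
                $score += $weight * $item[$attr];
            }
        }
    }
    return $score;
}

// A two-item configuration built from the sample data: items (10,3) and (9,0).
$config = array(
    array("id" => "3", "2" => 2, "13" => 100),
    array("id" => "0", "2" => 2, "13" => 20),
);

echo scoreConfiguration($config, $weights), "\n";
// prints: 1500  (75*2 + 10*100 + 75*2 + 10*20)
```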

One problem I see with discarding configurations too early is that the weighting function does not take into account restrictions such as "the final result must have a value of exactly 25 for attribute 2". So it's fine if I have a full 24-element configuration: I can run through a loop of the restrictions, discard the solutions that don't match, and then finally rank the remaining solutions by the function. But I'm not sure there's a valid line of thought that allows me to throw away solutions earlier.
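
The filter-then-rank step on complete configurations might look like the sketch below. Everything named here is an assumption for illustration: the restriction format array(attribute id, operator, target), the helpers totalAttributes and satisfies, and ranking by the total of attribute 13 (the "cost" reading) rather than by the weighting function.

```php
<?php
// Sums each attribute over all items in a configuration (the "id" key is skipped).
function totalAttributes(array $config) {
    $totals = array();
    foreach ($config as $item) {
        foreach ($item as $attr => $value) {
            if ($attr === "id") { continue; }
            $totals[$attr] = (isset($totals[$attr]) ? $totals[$attr] : 0) + $value;
        }
    }
    return $totals;
}

// Hypothetical restriction format: array(attribute id, operator, target).
function satisfies(array $totals, array $restrictions) {
    foreach ($restrictions as $r) {
        list($attr, $op, $target) = $r;
        $value = isset($totals[$attr]) ? $totals[$attr] : 0;
        if ($op === "==" && $value != $target) { return false; }
        if ($op === ">=" && $value < $target)  { return false; }
        if ($op === "<=" && $value > $target)  { return false; }
    }
    return true;
}

// Sample data from the question: two items each for locations 10 and 9.
$loc10 = array(array("id" => "3", "2" => 2, "13" => 100),
               array("id" => "4", "2" => 3, "13" => 50));
$loc9  = array(array("id" => "0", "2" => 2, "13" => 20),
               array("id" => "1", "2" => -1, "13" => 50));

// All four complete configurations (tiny here, so building them is harmless).
$configs = array();
foreach ($loc10 as $a) {
    foreach ($loc9 as $b) {
        $configs[] = array($a, $b);
    }
}

// Keep configurations whose attribute 2 total is at least 4 ...
$restrictions = array(array(2, ">=", 4));
$valid = array_filter($configs, function ($c) use ($restrictions) {
    return satisfies(totalAttributes($c), $restrictions);
});

// ... then rank the survivors by the total of attribute 13, lowest first.
usort($valid, function ($a, $b) {
    $ta = totalAttributes($a);
    $tb = totalAttributes($b);
    return $ta[13] - $tb[13];
});

echo count($valid), " valid, best uses items ",
     $valid[0][0]["id"], " and ", $valid[0][1]["id"], "\n";
// prints: 2 valid, best uses items 4 and 0
```

Note that this only checks restrictions once a configuration is complete, which is exactly the limitation described above.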

Does anyone have any suggestions on how to move forward? A language-agnostic solution is fine, but I am implementing in PHP, in case there's some relevant language feature that might be useful.

1 answer

duandou8457 2015-05-15 12:46

I solved my memory issue by performing a depth-first Cartesian product. Because only one configuration is held in memory at a time, I can weigh the solutions one at a time and retain some if I choose, or simply output them as I am doing in this code snippet.

The main inspiration for this solution came from the very concise answer on this question. Here is my code, since finding a PHP depth-first Cartesian product algorithm seems to be less than trivial.

function dfcartesian ( $input, $current, $index ) {
    // sample use: $emptyArray = array();
    //             dfcartesian( $items, $emptyArray, 0 )
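    // Stop once $index has passed the highest location key. Comparing with
    // count( $input ) would end too early when the keys have gaps
    // (e.g. only 9 and 10 are set, so count( $input ) is 2).
    if ( empty( $input ) || $index > max( array_keys( $input ) ) ) {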
        // If we have iterated over the entire space and are at the bottom
        // do whatever is relevant to your problem and return.
        //
        // If I were to improve the solution I suppose I'd pass in an
        // optional function name that we could pass data to if desired.
        var_dump( $current );
        echo '<br><br>';
        return;
    }

    // I'm using non-sequential numerical indices in an associative array
    // so I want to skip any empty numerical index without aborting.
    //
    // If you're using something different I think the only change that
    // needs attention is to change $index + 1 to a different type of
    // key incrementer.  That sort of issue is tackled at
    // https://stackoverflow.com/q/2414141/759749
    if ( isset ( $input[$index] ) ) {
        foreach ( $input[$index] as $element ) {
            $current[] = $element;
            // despite my concern about recursive function overhead,
            // this handled 24 levels quite smoothly.
            dfcartesian( $input, $current, ( $index + 1 ) );
            array_pop( $current );
        }
    } else {
        // move to the next index if there is a gap
        dfcartesian( $input, $current, ( $index + 1 ) );
    }
}
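
The comment in the function mentions passing in an optional function to receive each result. A minimal sketch of that idea follows; dfcartesianEach is a hypothetical name, and unlike the function above it walks array_keys( $input ) instead of incrementing a numeric index, so gaps in the keys need no special case. Only one configuration plus the best one found so far is held in memory.

```php
<?php
// Depth-first Cartesian product that hands each complete configuration
// to a callback instead of printing it.
function dfcartesianEach(array $input, callable $visit, array $current = array(), $pos = 0) {
    $keys = array_keys($input);
    if ($pos === count($keys)) {
        $visit($current);   // one full configuration, one item per location
        return;
    }
    foreach ($input[$keys[$pos]] as $element) {
        $current[] = $element;
        dfcartesianEach($input, $visit, $current, $pos + 1);
        array_pop($current);
    }
}

// Sample data from the question.
$items = array();
$items[10][] = array("id" => "3", "2" => 2, "13" => 100);
$items[10][] = array("id" => "4", "2" => 3, "13" => 50);
$items[9][]  = array("id" => "0", "2" => 2, "13" => 20);
$items[9][]  = array("id" => "1", "2" => -1, "13" => 50);

// Keep only the cheapest configuration, treating attribute 13 as cost
// (an assumption taken from the question's example).
$seen = 0;
$best = null;
$bestCost = null;
dfcartesianEach($items, function (array $config) use (&$seen, &$best, &$bestCost) {
    $seen++;
    $cost = 0;
    foreach ($config as $item) {
        $cost += $item[13];
    }
    if ($bestCost === null || $cost < $bestCost) {
        $bestCost = $cost;
        $best = $config;
    }
});

echo $seen, " configurations visited, lowest cost ", $bestCost, "\n";
// prints: 4 configurations visited, lowest cost 70
```

The callback is also the natural place to apply restrictions or a weighting function before deciding whether to retain a configuration.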

I hope this is of use to someone else tackling the same problem.

This answer was selected by the asker as the best answer.

Question tagged: php. Asked: 2015-05-14 13:25
