dongtui9168 2017-03-07 13:32

Best algorithm for finding the most frequent combinations of values in a data set

----------------------------------------
   ColumnA  |  ColumnB      | ColumnC  | 
----------------------------------------
      Cat   |     Shirt     |   Pencil | 
      Dog   |     Shirt     |   Eraser | 
      Worm  |     Dress     |   Pen    | 
      Cow   |     Shirt     |   Pen    | 
      Cat   |     Shirt     |   Pen    | 
      Cat   |     Jacket    |   Pen    | 
      Cow   |     Shirt     |   Pen    | 
      Cat   |     Shirt     |   Pen    | 
      Cat   |     Jacket    |   Pen    | 
      Cow   |     Shirt     |   Pen    | 
      Cat   |     Shirt     |   Pen    | 
      Cat   |     Jacket    |   Pen    | 

With the example data above, I am trying to find the most frequently recurring combinations of two or more values.

For example:

Shirt,Pen 6
Cat,Pen 6
Cat,Shirt 4
Jacket,Pen 3
Pen,Cow 3
Cat,Shirt,Pen 3
Cat,Jacket,Pen 3
Cow,Shirt,Pen 3

I need this for up to 10 columns of data.

Cat,Shirt is the same as Shirt,Cat.

What is the best algorithm to use? Preferably in SQL, but I could also use PHP.

2 answers

  • dsjfrkvn818747 2017-03-07 13:44

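    You can do this in SQL by unpivoting each row into one value per record, keyed by the row's id, and adding an "empty" (NULL) element so that the same query also returns combinations with fewer values than there are columns. Note: this assumes that the values in different columns are distinct, or at least fungible (it doesn't matter which column a value appears in).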

    Let me also assume that each row has a unique id:

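    -- Unpivot each row into (id, value) records, plus one NULL "empty" element.
    -- col1, col2, col3 stand for ColumnA, ColumnB, ColumnC in a table named data.
    with t as (
          select id, col
          from data d outer apply
               (values (col1), (col2), (col3), (NULL)) v(col)
         )
    select t1.col, t2.col, t3.col, count(*)
    from t t1 join
         t t2
         -- the ordering makes Cat,Shirt and Shirt,Cat the same combination
         on t1.id = t2.id and t2.col > t1.col join
         t t3
         -- a NULL third element keeps the pairs alongside the triples
         on t1.id = t3.id and (t3.col > t2.col or t3.col is null)
    group by t1.col, t2.col, t3.col
    order by count(*) desc;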
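
The query above hard-codes three columns, so each extra column needs another self-join. For up to 10 columns, the same counting can also be done outside the database. The following is a minimal Python sketch, not the SQL or PHP the question asks for. It assumes the rows fit in memory and, like the answer above, that a value means the same thing in whichever column it appears. The names ROWS and count_combinations are made up for illustration. The values in each row are sorted so that Cat,Shirt and Shirt,Cat collapse to one key, and every subset of two or more values is then counted in a hash map.

```python
from collections import Counter
from itertools import combinations

# Sample rows copied from the question's table (ColumnA, ColumnB, ColumnC).
ROWS = [
    ("Cat", "Shirt", "Pencil"),
    ("Dog", "Shirt", "Eraser"),
    ("Worm", "Dress", "Pen"),
    ("Cow", "Shirt", "Pen"),
    ("Cat", "Shirt", "Pen"),
    ("Cat", "Jacket", "Pen"),
    ("Cow", "Shirt", "Pen"),
    ("Cat", "Shirt", "Pen"),
    ("Cat", "Jacket", "Pen"),
    ("Cow", "Shirt", "Pen"),
    ("Cat", "Shirt", "Pen"),
    ("Cat", "Jacket", "Pen"),
]


def count_combinations(rows, min_size=2):
    """Count every unordered combination of min_size or more values per row."""
    counts = Counter()
    for row in rows:
        # Sorting makes (Cat, Shirt) and (Shirt, Cat) the same key;
        # the set drops empty cells and duplicate values within a row.
        values = sorted({v for v in row if v})
        for size in range(min_size, len(values) + 1):
            counts.update(combinations(values, size))
    return counts


if __name__ == "__main__":
    # Print the most frequent combinations first.
    for combo, n in count_combinations(ROWS).most_common(8):
        print(", ".join(combo), n)
```

With 10 columns, each row contributes at most 1013 combinations (2^10 subsets minus the empty set and the 10 single values), so the work grows linearly with the number of rows.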
    