duanchun2349 2013-05-28 15:46

Creating a matrix from a list of attributes

I have a CSV with a list of items, and each has a series of attributes attached:

"5","coffee|peaty|sweet|cereal|cream|barley|malt|creosote|sherry|sherry|manuka|honey|peaty|peppercorn|chipotle|chilli|salt|caramel|coffee|demerara|sugar|molasses|spicy|peaty"
"6","oil|lemon|apple|butter|toffee|treacle|sweet|cola|oak|cereal|cinnamon|salt|toffee"

"5" and "6" are both item IDs and unique in the file.

Ultimately, I want to create a matrix demonstrating how many times in the document each attribute was mentioned in the same row with every other attribute. E.g.:

        peaty    sweet    cereal    cream    barley ...
coffee    1       2         2         1        1
oil       0       1         0         0        0 

Note that I'd prefer to reduce duplicates: i.e., "peaty" isn't both a column and a row.

The original database is essentially a key-value store (a table with columns "itemId" and "value"), so I can reformat the data if it helps.

Any idea how I'd do this with Python, PHP or Ruby (whichever is easiest)? I get the feeling Python can probably do this most easily of the bunch, but I'm missing something fairly basic and/or crucial (I'm just starting to do data analysis with Python).

Thanks!

Edit: In response to the (somewhat unhelpful) "What have you tried" comment, here's what I'm currently working with (Don't laugh, my Python is terrible):

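#!/usr/bin/env python3
import csv

# matrix[a][b] = number of times attribute b was seen alongside attribute a
matrix = {}

# newline="" is the mode the csv module expects for text files in Python 3
with open("field.csv", newline="") as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        attribs = row[1].split("|")
        for attrib in attribs:
            if attrib not in matrix:
                matrix[attrib] = {}
            # note: this also pairs each attribute with itself
            for attrib2 in attribs:
                if attrib2 in matrix[attrib]:
                    matrix[attrib][attrib2] = matrix[attrib][attrib2] + 1
                else:
                    matrix[attrib][attrib2] = 1
print(matrix)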

The output is a big, unsorted dictionary of terms, likely with a lot of duplication between the rows and columns. If I use pandas and replace the final print line with the following...

from pandas import DataFrame
df = DataFrame(matrix).T.fillna(0)
print(df)

I get:

<class 'pandas.core.frame.DataFrame'>
Index: 195 entries, acacia to zesty
Columns: 195 entries, acacia to zesty
dtypes: float64(195)

...Which leads me to think I'm doing something rather wrong.


  • duanfang2708 2013-05-28 16:43

    I'd do this with an undirected graph: each attribute is a vertex, and the weight of the edge between two attributes is how many times they occurred together. Then you can generate the matrix quite easily by looping through each vertex and reading off the weights of its edges.

    Graph docs: http://networkx.github.io/documentation/latest/reference/classes.graph.html

    Starter code:

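    import csv
    import itertools
    import networkx as nx
    
    G = nx.Graph()
    
    # newline='' is the mode the csv module expects for text files in Python 3
    with open('field.csv', newline='') as f:
      for row in csv.reader(f):
        if len(row) < 2:
          continue  # skip blank or malformed lines
        # Assumption: a repeated attribute counts once per row, which also
        # avoids self-pairs such as ('peaty', 'peaty')
        row_elements = sorted(set(row[1].split("|")))
        for (a, b) in itertools.combinations(row_elements, 2):
          if G.has_edge(a, b):
            G[a][b]['weight'] += 1  # pair seen before: bump the count
          else:
            G.add_edge(a, b, weight=1)
    
    print(G.edges(data=True))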
    

    Edit: woah, NetworkX can build the matrix from the graph directly. See if this does everything for ya: http://networkx.github.io/documentation/latest/reference/linalg.html#module-networkx.linalg.graphmatrix
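
    If you'd rather not pull in a graph library, here is a rough, dependency-free sketch of the same idea. It swaps the graph for a plain collections.Counter keyed by attribute pairs, then lays the counts out so each pair appears only once (upper triangle), which may be close to the "no duplicates" layout asked for. The sample rows, SAMPLE, cooccurrence_counts and upper_triangle are all made up for illustration, and it assumes a repeated attribute should count once per row.

```python
import csv
import io
from collections import Counter
from itertools import combinations

# Made-up, shortened rows in the question's format, inlined so the sketch
# runs without field.csv. Swap in open("field.csv", newline="") for real data.
SAMPLE = (
    '"5","coffee|peaty|sweet|cereal|peaty"\n'
    '"6","oil|sweet|cereal|salt"\n'
)


def cooccurrence_counts(lines):
    """Count, for each unordered attribute pair, the rows containing both."""
    counts = Counter()
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        # Assumption: a repeated attribute counts once per row.
        attribs = sorted(set(row[1].split("|")))
        # Sorted input means every pair comes out as (smaller, larger).
        counts.update(combinations(attribs, 2))
    return counts


def upper_triangle(counts):
    """Return labels plus a table where each pair is filled in only once."""
    labels = sorted({name for pair in counts for name in pair})
    table = [
        [counts[(a, b)] if a < b else None for b in labels]
        for a in labels
    ]
    return labels, table


def cell(value):
    """Format one table cell; None (diagonal and below) prints as blank."""
    return f"{'' if value is None else value:>8}"


counts = cooccurrence_counts(io.StringIO(SAMPLE))
labels, table = upper_triangle(counts)

print(" " * 8 + "".join(f"{name:>8}" for name in labels))
for name, cells in zip(labels, table):
    print(f"{name:<8}" + "".join(cell(c) for c in cells))
```

    Every attribute still shows up as both a row label and a column label here, but each pair is only filled in once. If a full symmetric matrix is more convenient, fill the lower triangle from the same counts instead of leaving it blank.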
