I have a CSV with a list of items, and each has a series of attributes attached:
"5","coffee|peaty|sweet|cereal|cream|barley|malt|creosote|sherry|sherry|manuka|honey|peaty|peppercorn|chipotle|chilli|salt|caramel|coffee|demerara|sugar|molasses|spicy|peaty"
"6","oil|lemon|apple|butter|toffee|treacle|sweet|cola|oak|cereal|cinnamon|salt|toffee"
"5" and "6" are both item IDs and unique in the file.
Ultimately, I want to create a matrix demonstrating how many times in the document each attribute was mentioned in the same row with every other attribute. E.g.:
        peaty  sweet  cereal  cream  barley  ...
coffee      1      2       2      1       1
oil         0      1       0      0       0
Note that I'd prefer to reduce duplicates: i.e., "peaty" isn't both a column and a row.
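As an illustration of the counting described above, here is a minimal sketch in Python 3 syntax. It is one possible reading of the goal, not necessarily the best approach. It assumes that an attribute repeated within a row counts once for that row, and that each pair is stored once as an alphabetically ordered tuple so that no pair appears in both orientations. The name `count_pairs` is made up for this example.

```python
from collections import Counter
from itertools import combinations

# The two sample rows from the CSV excerpt: (item ID, pipe-delimited attributes).
rows = [
    ("5", "coffee|peaty|sweet|cereal|cream|barley|malt|creosote|sherry|sherry|"
          "manuka|honey|peaty|peppercorn|chipotle|chilli|salt|caramel|coffee|"
          "demerara|sugar|molasses|spicy|peaty"),
    ("6", "oil|lemon|apple|butter|toffee|treacle|sweet|cola|oak|cereal|"
          "cinnamon|salt|toffee"),
]


def count_pairs(rows):
    """Count, for each unordered attribute pair, the rows that contain both."""
    pair_counts = Counter()
    for _item_id, attribs in rows:
        # Assumption: repeats within a row count once, so deduplicate first.
        unique = sorted(set(attribs.split("|")))
        # combinations() over a sorted list yields each pair once, as (a, b) with a < b.
        pair_counts.update(combinations(unique, 2))
    return pair_counts


pair_counts = count_pairs(rows)
print(pair_counts[("cereal", "sweet")])  # both sample rows contain this pair
print(pair_counts[("coffee", "oil")])    # these never share a row
```

Looking up a pair requires putting the two names in alphabetical order, e.g. `("cereal", "sweet")` rather than `("sweet", "cereal")`.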
The original database is essentially a key-value store (a table with columns "itemId" and "value"), so I can reformat the data if that helps.
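If the key-value form is used directly, the pipe-delimited step could be skipped. The sketch below groups (itemId, value) records into one set of attributes per item. The `records` list is invented sample data in the shape described above, and `group_by_item` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical records shaped like the "itemId" / "value" table.
records = [
    ("5", "coffee"),
    ("5", "peaty"),
    ("6", "oil"),
    ("6", "lemon"),
    ("5", "peaty"),  # a repeated value for the same item
]


def group_by_item(records):
    """Collect the distinct values seen for each item ID."""
    grouped = defaultdict(set)
    for item_id, value in records:
        grouped[item_id].add(value)  # a set drops repeats automatically
    return dict(grouped)


grouped = group_by_item(records)
print(grouped)
```

Using a set per item means an attribute listed several times for one item is kept only once, which may or may not be the desired behaviour.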
Any idea how I'd do this with Python, PHP or Ruby (whichever is easiest)? I get the feeling Python can probably do this most easily of the three, but I'm missing something fairly basic and/or crucial (I'm just starting to do data analysis with Python).
Thanks!
Edit: In response to the (somewhat unhelpful) "What have you tried" comment, here's what I'm currently working with (Don't laugh, my Python is terrible):
#!/usr/bin/python
import csv

matrix = {}
with open("field.csv", "rb") as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        attribs = row[1].split("|")
        for attrib in attribs:
            if attrib not in matrix:
                matrix[attrib] = {}
            for attrib2 in attribs:
                if attrib2 in matrix[attrib]:
                    matrix[attrib][attrib2] = matrix[attrib][attrib2] + 1
                else:
                    matrix[attrib][attrib2] = 1
print matrix
The output is a big, unsorted dictionary of terms, likely with a lot of duplication between the rows and columns. If I use pandas and replace the "print matrix" line with the following...
from pandas import *
df = DataFrame(matrix).T.fillna(0)
print df
I get:
<class 'pandas.core.frame.DataFrame'>
Index: 195 entries, acacia to zesty
Columns: 195 entries, acacia to zesty
dtypes: float64(195)
...Which leads me to think I'm doing something rather wrong.
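A square 195 x 195 result is what the nested loop would be expected to produce: every attribute is paired with every attribute in its row, itself included, so each name ends up as both a row and a column. The sketch below (Python 3 syntax, on made-up rows without repeats) reproduces that counting and then keeps only pairs where the row label sorts before the column label. The names `cooccurrence` and `upper` are hypothetical. One thing to watch in the real data: an attribute that repeats within a row, such as "peaty", is visited several times by the nested loop, so its counts are inflated.

```python
def cooccurrence(rows):
    """Nested-dict counts built with the same double loop as the script above."""
    matrix = {}
    for attribs in rows:
        for a in attribs:
            inner = matrix.setdefault(a, {})
            for b in attribs:
                inner[b] = inner.get(b, 0) + 1
    return matrix


# Made-up rows, already split into attribute lists.
rows = [["coffee", "peaty", "sweet"], ["oil", "sweet"]]
matrix = cooccurrence(rows)
labels = sorted(matrix)

# Every label is both a row and a column, and count(a, b) equals count(b, a).
is_symmetric = all(
    matrix[a].get(b, 0) == matrix[b].get(a, 0) for a in labels for b in labels
)

# Keep one orientation of each pair and drop the diagonal (a paired with itself).
upper = {(a, b): matrix[a][b] for a in labels for b in matrix[a] if a < b}

print(is_symmetric)
print(upper)
```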