Project scenario:
Objects: 1 million trajectories
Task: cluster the 1 million trajectories with a clustering function from sklearn.cluster, using an adjacency list that stores pairwise trajectory distances
Input: an adjacency list of pairwise trajectory distances (only pairs whose distance is below a given threshold e are recorded)
Output: a cluster label for each trajectory
Problem description
I am currently using the AgglomerativeClustering (hierarchical clustering) function, but it requires a dense 1,000,000 × 1,000,000 distance matrix as input. That is on the order of 10^12 entries, i.e. several terabytes even at a few bytes per entry, so the matrix cannot be allocated in memory.
Requested solutions
- Is there a clustering function that accepts a sparse matrix as input? K-Means does not look feasible here: it iteratively recomputes distances between points and centroids, while all the usable distances have already been precomputed, and recomputing trajectory distances at every iteration would add a huge time cost, because computing the distance between two trajectories is very expensive.
- Is there any other method that can cluster the 1 million trajectories directly from the stored distance adjacency list, on a single machine?
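On the first question: sklearn's DBSCAN does accept a precomputed sparse distance matrix (`metric='precomputed'`), and entries missing from the sparse matrix are treated as farther than `eps`, which matches an adjacency list that only records distances below a threshold. A minimal sketch on a toy 5-point matrix (the IDs and distances below are made up for illustration); with `min_samples=1`, every point is a core point, so the clusters reduce to the connected components of the eps-neighborhood graph, which mimics a single-linkage cut:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN

# Toy precomputed distance matrix for 5 trajectories: only pairs whose
# distance is below the threshold are stored, as in the adjacency list.
# Entries must be symmetric; unstored pairs count as "farther than eps".
rows = [0, 1, 1, 2, 3, 4]
cols = [1, 0, 2, 1, 4, 3]
vals = [1.0, 1.0, 1.5, 1.5, 0.8, 0.8]
D = csr_matrix((vals, (rows, cols)), shape=(5, 5))

# min_samples=1 makes every point a core point, so clusters are exactly
# the connected components of the eps-graph; larger min_samples gives
# ordinary density-based behavior with noise points labeled -1.
labels = DBSCAN(eps=2.0, min_samples=1, metric='precomputed').fit_predict(D)
```

Note that the full sparse matrix of recorded distances fits comfortably in memory as long as the number of recorded pairs (not 10^12, only those below e) fits, so no dense 10^6 × 10^6 allocation is needed.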
The original code is as follows:
# -*- coding: utf-8 -*-
import time
import csv
import pandas as pd
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from scipy.sparse import csr_matrix

# Read the pairwise similarities and build the distance and connectivity matrices
def readSimilarity(strFilePathID, iNum, distanceThreshold, iMagnification):
    # Open the CSV file
    csv_reader = csv.reader(open(strFilePathID))
    # Initialize the distance matrix (dense -- this is the memory bottleneck)
    # and the connectivity matrix; unrecorded pairs default to a large distance
    distanceArray = np.full((iNum, iNum), 500 * iMagnification, dtype=int)
    connectArray = np.zeros((iNum, iNum), dtype=int)
    # Connect each node to itself and to its successor on the diagonal band
    for rowIndex in range(iNum):
        if rowIndex + 1 < iNum:
            connectArray[rowIndex, rowIndex + 1] = 1
            connectArray[rowIndex + 1, rowIndex] = 1
        connectArray[rowIndex, rowIndex] = 1
    # Parse the file
    for row in csv_reader:
        # The first element is the target ID
        iTargetID = int(row[0].split(';', 1)[0])
        if iTargetID % 10000 == 0:
            print("ReadID_TargetID:" + str(iTargetID))
        # Iterate over every query ID in the row
        for index in range(1, len(row)):
            if len(row[index]) > 0:
                # Each field has the form "<queryID>_<distance>"
                parts = row[index].split("_")
                # The first part is the query ID
                iQueryID = int(parts[0])
                # The second part is the distance value
                fDistance = int(float(parts[1]) * iMagnification)
                # Fill both halves of the symmetric distance matrix
                distanceArray[iTargetID, iQueryID] = fDistance
                distanceArray[iQueryID, iTargetID] = fDistance
                if fDistance <= distanceThreshold:
                    connectArray[iTargetID, iQueryID] = 1
                    connectArray[iQueryID, iTargetID] = 1
    # Convert connectArray to a sparse matrix
    connectMatrix = csr_matrix(connectArray)
    print("DistanceArray and ConnectArray created.")
    return distanceArray, connectMatrix
def AGClustering(distanceThreshold, selectLinkage, X, Y, strFilePathOut):
    # Configure the clustering: cut the dendrogram at distance_threshold
    clustering = AgglomerativeClustering(
        n_clusters=None,
        metric='precomputed',
        connectivity=Y,
        linkage=selectLinkage,
        distance_threshold=distanceThreshold
    )
    # Fit
    print("Start clustering")
    start_time = time.time()
    clustering.fit(X)
    elapsed_time = time.time() - start_time
    print(f"{selectLinkage}Linkage:\t{elapsed_time:.2f}s")
    # Save the cluster labels
    labels = pd.DataFrame(clustering.labels_)
    labels.to_csv(strFilePathOut)
    print(f"Clustering results saved to {strFilePathOut}")
# Read the similarities
iNum = 1000000
catagory = 'CPD'
strFilePathID = f"D:\\Test\\ClusteringPerformance\\STC\\{catagory}-IDHasDistance-B200-Num{iNum}-STC.csv"
iMagnification = 1000
distanceThreshold = 200 * iMagnification
X, Y = readSimilarity(strFilePathID, iNum, distanceThreshold, iMagnification)
# Cluster
selectLinkage = 'single'
strFileClusterLabel = f"D:\\Test\\AgglomerativeClusteringResult\\STC\\{catagory}-ClusterLabel-singleLinkage-B200-Num{iNum}-STC-Test.csv"
AGClustering(distanceThreshold, selectLinkage, X, Y, strFileClusterLabel)
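On the second question: since the code uses `linkage='single'` with a `distance_threshold`, the flat clustering it produces should coincide with the connected components of the graph whose edges are exactly the recorded pairs with distance below the threshold. Those components can be computed directly from the adjacency list with `scipy.sparse.csgraph.connected_components`, in time roughly linear in the number of recorded pairs and without ever materializing a dense matrix. A minimal sketch, assuming the CSV has already been parsed into COO-style triplet lists (the triplets below are made-up toy data):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Hypothetical (iTargetID, iQueryID) pairs parsed from the adjacency-list CSV;
# only pairs whose distance is at most the threshold need to be kept.
rows = [0, 1, 3]
cols = [1, 2, 4]
n = 5  # number of trajectories (1_000_000 in the real setting)
graph = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))

# directed=False treats the graph as undirected, so storing each
# pair in one direction only is sufficient.
n_clusters, labels = connected_components(graph, directed=False)
```

`labels` is then the per-trajectory cluster label and can be written out with `pd.DataFrame(labels).to_csv(...)` exactly as in `AGClustering`. For linkages other than 'single', this equivalence does not hold and a different approach would be needed.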