dongxian6715 2009-06-18 20:10
126 views
Accepted

LSA - Latent Semantic Analysis - how do I code it in PHP?

I would like to implement Latent Semantic Analysis (LSA) in PHP in order to find out topics/tags for texts.

Here is what I think I have to do. Is this correct? How can I code it in PHP? How do I determine which words to choose?

I don't want to use any external libraries. I already have an implementation of Singular Value Decomposition (SVD).

  1. Extract all words from the given text.
  2. Weight the words/phrases, e.g. with tf–idf. If weighting is too complex, just take the number of occurrences.
  3. Build up a matrix: The columns are some documents from the database (the more the better?), the rows are all unique words, the values are the numbers of occurrences or the weight.
  4. Do the Singular Value Decomposition (SVD).
  5. Use the values in the matrix S (SVD) to do the dimension reduction (how?).
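
For steps 1-3, here is a rough sketch of what I have in mind in plain PHP (just an illustration; the function names are placeholders):

    <?php
    // Steps 1-3: tokenize the texts, weight terms with tf-idf, and build a
    // term-document matrix (rows = unique words, columns = documents).

    function tokenize(string $text): array {
        // lowercase, then split on anything that is not a letter or a digit
        return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    }

    function buildTermDocumentMatrix(array $documents): array {
        $counts = [];                                   // $counts[$word][$doc] = occurrences
        foreach ($documents as $d => $text) {
            foreach (tokenize($text) as $word) {
                $counts[$word][$d] = ($counts[$word][$d] ?? 0) + 1;
            }
        }

        $numDocs = count($documents);
        $matrix  = [];                                  // $matrix[$word][$doc] = tf-idf weight
        foreach ($counts as $word => $perDoc) {
            $idf = log($numDocs / count($perDoc));      // count($perDoc) = document frequency
            for ($d = 0; $d < $numDocs; $d++) {
                $matrix[$word][$d] = ($perDoc[$d] ?? 0) * $idf;
            }
        }
        return $matrix;
    }

    $docs   = ['the cat sat on the mat', 'the dog chased the cat', 'stock markets fell today'];
    $matrix = buildTermDocumentMatrix($docs);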

I hope you can help me. Thank you very much in advance!


4 answers

  • doukanhua0752 2009-06-24 15:17

    LSA links:

    Here is the complete algorithm. If you have SVD, you are most of the way there. The papers above explain it better than I do.

    Assumptions:

    • your SVD function will give the singular values and singular vectors in descending order. If not, you have to do more acrobatics.

    M: corpus matrix, w (words) by d (documents) (w rows, d columns). These can be raw counts, or tfidf or whatever. Stopwords may or may not be eliminated, and stemming may happen (Landauer says keep stopwords and don't stem, but yes to tfidf).

    U,Sigma,V = singular_value_decomposition(M)
    
    U:  w x w
    Sigma:  min(w,d)-length vector, or a w x d matrix with the diagonal filled in the first min(w,d) spots with the singular values
    V:  d x d matrix
    
    Thus U * Sigma * V = M  
    #  you might have to do some transposes depending on how your SVD code 
    #  returns U and V.  verify this so that you don't go crazy :)
    

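    If you are unsure which convention your SVD uses, a quick reconstruction check in PHP can save a lot of head-scratching. This is only a sketch: it assumes matrices are plain nested arrays with 0-based integer keys ($m[$row][$col]), Sigma in its w x d matrix form, and svd() standing in for whatever implementation you already have; the helper names are made up.

    // Sanity check: U * Sigma * V should reconstruct M (up to floating-point error).
    // If it does not, try transposing V -- that tells you which convention your SVD uses.

    function matMul(array $a, array $b): array {
        $rows  = count($a);
        $inner = count($b);
        $cols  = count($b[0]);
        $c = [];
        for ($i = 0; $i < $rows; $i++) {
            for ($j = 0; $j < $cols; $j++) {
                $sum = 0.0;
                for ($k = 0; $k < $inner; $k++) {
                    $sum += $a[$i][$k] * $b[$k][$j];
                }
                $c[$i][$j] = $sum;
            }
        }
        return $c;
    }

    function transpose(array $m): array {
        $t = [];
        foreach ($m as $i => $row) {
            foreach ($row as $j => $value) {
                $t[$j][$i] = $value;
            }
        }
        return $t;
    }

    function maxAbsDifference(array $a, array $b): float {
        $max = 0.0;
        foreach ($a as $i => $row) {
            foreach ($row as $j => $value) {
                $max = max($max, abs($value - $b[$i][$j]));
            }
        }
        return $max;
    }

    // [$U, $Sigma, $V] = svd($M);   // your own implementation
    // echo maxAbsDifference($M, matMul(matMul($U, $Sigma), $V));            // one convention...
    // echo maxAbsDifference($M, matMul(matMul($U, $Sigma), transpose($V))); // ...or the other
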
    Then the dimensionality reduction: the actual LSA paper suggests that a good approximation for the basis is to keep enough vectors such that their singular values account for more than 50% of the total of the singular values.

    More succinctly (pseudocode):

    s1 = sum(Sigma)
    total = 0
    for ii in range(len(Sigma)):
        total += Sigma[ii]
        if total > 0.5 * s1:
            return ii + 1   # the number of singular values to keep (call it k)
    

    This returns the rank k of the new basis, which was min(d, w) before; we now approximate it with this much smaller k.
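
    Since the question asks for PHP, the same 50% heuristic could be sketched like this (assuming $sigma is a plain 0-indexed array of singular values in descending order):

    function reducedRank(array $sigma): int {
        // keep enough singular values to cover more than 50% of their total
        $total   = array_sum($sigma);
        $running = 0.0;
        $k       = 0;
        foreach ($sigma as $value) {
            $running += $value;
            $k++;
            if ($running > 0.5 * $total) {
                return $k;            // number of singular values/vectors to keep
            }
        }
        return count($sigma);         // degenerate case: keep everything
    }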

    (here, ' -> prime, not transpose)

    We create new matrices U', Sigma', V' with sizes w x k, k x k, and k x d: the first k columns of U, the top-left k x k block of Sigma, and the k rows of V that correspond to the k largest singular values.
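
    In PHP, with plain nested-array matrices and the conventions above (U is w x w, Sigma in its w x d matrix form, V arranged so that U * Sigma * V = M), the truncation could be sketched as follows; the helper names are made up:

    function firstRows(array $m, int $k): array {
        return array_slice($m, 0, $k);
    }

    function firstColumns(array $m, int $k): array {
        $out = [];
        foreach ($m as $i => $row) {
            $out[$i] = array_slice($row, 0, $k);
        }
        return $out;
    }

    function truncateSvd(array $U, array $Sigma, array $V, int $k): array {
        $Uk     = firstColumns($U, $k);                     // U':     w x k
        $SigmaK = firstColumns(firstRows($Sigma, $k), $k);  // Sigma': k x k
        $Vk     = firstRows($V, $k);                        // V':     k x d
        return [$Uk, $SigmaK, $Vk];
    }

    // [$Uk, $SigmaK, $Vk] = truncateSvd($U, $Sigma, $V, $k);  // $k from the 50% heuristic above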

    That's the essence of the LSA algorithm.

    This resultant matrix U' * Sigma' * V' can be used for 'improved' cosine similarity searching, or you can pick the top 3 words for each document in it, for example. Whether this yields more than simple tf-idf is a matter of some debate.
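
    For the cosine-similarity part, a small PHP helper comparing two document columns of the (reduced or reconstructed) term-document matrix could look like this; the function names are illustrative:

    function documentColumn(array $matrix, int $col): array {
        $column = [];
        foreach ($matrix as $word => $row) {
            $column[$word] = $row[$col];
        }
        return $column;
    }

    function cosineSimilarity(array $a, array $b): float {
        $dot   = 0.0;
        $normA = 0.0;
        $normB = 0.0;
        foreach ($a as $i => $value) {
            $dot   += $value * $b[$i];
            $normA += $value * $value;
            $normB += $b[$i] * $b[$i];
        }
        if ($normA == 0.0 || $normB == 0.0) {
            return 0.0;               // avoid division by zero for all-zero columns
        }
        return $dot / (sqrt($normA) * sqrt($normB));
    }

    // Similarity between documents 0 and 1 in the reduced space:
    // cosineSimilarity(documentColumn($reduced, 0), documentColumn($reduced, 1));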

    In my experience, LSA performs poorly on real-world data sets because of polysemy and because of data sets with too many topics. Its mathematical/probabilistic basis is unsound (it assumes roughly normal (Gaussian) distributions, which don't make sense for word counts).

    Your mileage will definitely vary.

    Tagging using LSA (one method!)

    1. Construct the dimensionally reduced matrices U', Sigma', V' using SVD and a reduction heuristic, as above

    2. By hand, look over the U' matrix and come up with terms that describe each "topic". For example, if the biggest components of that vector were "Bronx, Yankees, Manhattan," then "New York City" might be a good term for it. Keep these in an associative array, or list. This step should be reasonable since the number of vectors will be finite.

    3. Assuming you have a vector (v1) of words for a document, then v1 * t(U') will give the strongest 'topics' for that document. Select the 3 highest, then report their "topics" as computed in the previous step (see the sketch below).
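
    A rough PHP sketch of step 3, assuming $docVector is keyed the same way as the rows of $Uk (the reduced U' matrix) and $topicLabels holds the hand-picked labels from step 2:

    // Project a document's word vector onto the reduced topic space,
    // then report the labels of the strongest topics.

    function topicStrengths(array $docVector, array $Uk): array {
        $strengths = [];
        foreach ($Uk as $word => $row) {              // rows of U' correspond to words
            foreach ($row as $topic => $weight) {     // columns of U' correspond to topics
                $strengths[$topic] = ($strengths[$topic] ?? 0.0)
                                   + ($docVector[$word] ?? 0.0) * $weight;
            }
        }
        return $strengths;
    }

    function topTopics(array $docVector, array $Uk, array $topicLabels, int $howMany = 3): array {
        $strengths = topicStrengths($docVector, $Uk);
        arsort($strengths);                           // strongest topics first
        $labels = [];
        foreach (array_slice(array_keys($strengths), 0, $howMany) as $topic) {
            $labels[] = $topicLabels[$topic];
        }
        return $labels;
    }

    // e.g. topTopics($docVector, $Uk, [0 => 'New York City', 1 => 'Baseball', 2 => 'Finance']);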

    This answer was selected by the asker as the accepted answer.
