duanfei1930 2014-10-09 08:54

Organizing and managing thousands of PDF files in PHP and MySQL

I am helping a former teacher of mine set up a website where he can exchange class documents (exams, exercise sheets for students, etc.) with his colleagues. He has personally created thousands of PDF files, which will now be available to other teachers for reference and use.

A key feature will be a search function that lets users find specific files. Because there are so many documents, we need an efficient way to search through all of them.

I have thought of several approaches:

a) Manually assign 5-10 keywords to every PDF file and save them in the MySQL database along with the file's metadata. The user would search those keywords, not the PDF's content directly.

b) Use some sort of logic to extract the 10-20 most frequent keywords programmatically, and save those in the MySQL database along with the file's metadata. In my opinion this is a better approach than a).

c) Extract a large portion, or all, of the PDF files' text content using file_get_contents and save it in the MySQL database along with the file's metadata. The user could then search the actual text content itself. In my opinion, this would be the best approach.

d) Any other approach I have not mentioned?
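
As a rough illustration of approach b), here is a minimal PHP sketch that counts word frequencies. It assumes the PDF text has already been extracted to a plain string by a separate tool (for example Poppler's `pdftotext` command-line utility), because `file_get_contents` on a PDF returns the raw file bytes rather than readable text. The function name `topKeywords`, the stop-word list, and the three-letter minimum are hypothetical choices, not part of the original question.

```php
<?php
declare(strict_types=1);

/**
 * Return the $limit most frequent words in $text.
 *
 * Assumption: $text is plain text already extracted from the PDF.
 * Words shorter than three letters and words in $stopWords are skipped.
 *
 * @param string[] $stopWords lowercase words to ignore (hypothetical list)
 * @return array<string,int> word => count, most frequent first
 */
function topKeywords(string $text, int $limit = 10, array $stopWords = []): array
{
    // Lowercase the text; fall back to ASCII-only lowercasing if the
    // mbstring extension is not available.
    $lower = function_exists('mb_strtolower')
        ? mb_strtolower($text, 'UTF-8')
        : strtolower($text);

    // Collect every run of three or more letters as one word.
    preg_match_all('/\p{L}{3,}/u', $lower, $matches);

    $stop = array_flip($stopWords);
    $counts = [];
    foreach ($matches[0] ?? [] as $word) {
        if (isset($stop[$word])) {
            continue;
        }
        $counts[$word] = ($counts[$word] ?? 0) + 1;
    }

    // Highest count first; ties are broken alphabetically.
    uksort($counts, function (string $a, string $b) use ($counts): int {
        return $counts[$b] <=> $counts[$a] ?: strcmp($a, $b);
    });

    return array_slice($counts, 0, $limit, true);
}
```

The returned words could be stored in a keyword table next to the file's metadata. Without a good stop-word list, the most frequent words tend to be generic ones, so the result may need manual review.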

I am not sure how viable these approaches are. For example, will c) consume a lot of server-side resources? We would effectively be sifting through thousands of database rows, each with hundreds of words of extracted text content.
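
One way to keep approach c) from scanning every row on each search is a MySQL FULLTEXT index, which InnoDB tables support from MySQL 5.6 onward. The following is a minimal sketch only, assuming the extracted text is stored in a `content` column; the `documents` table, its columns, and both function names are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Turn free-form user input into a MySQL boolean-mode search string in
 * which every word is required and matched as a prefix,
 * e.g. "algebra exam" becomes "+algebra* +exam*".
 */
function buildBooleanQuery(string $input): string
{
    // Keep only letters and digits, so boolean-mode operators typed by
    // the user (+ - * " ( ) ~ < > @) cannot change the query's meaning.
    preg_match_all('/[\p{L}\p{N}]+/u', $input, $matches);

    $terms = array_map(
        fn(string $word): string => '+' . $word . '*',
        $matches[0] ?? []
    );

    return implode(' ', $terms);
}

/**
 * Search a hypothetical `documents` table, assumed to look like:
 *
 *   CREATE TABLE documents (
 *     id       INT AUTO_INCREMENT PRIMARY KEY,
 *     filename VARCHAR(255) NOT NULL,
 *     content  MEDIUMTEXT NOT NULL,
 *     FULLTEXT KEY ft_content (content)
 *   ) ENGINE=InnoDB;
 *
 * @return array<int,array<string,mixed>> matching rows (id, filename)
 */
function searchDocuments(PDO $pdo, string $input, int $limit = 20): array
{
    $query = buildBooleanQuery($input);
    if ($query === '') {
        return []; // nothing searchable in the input
    }

    $stmt = $pdo->prepare(
        'SELECT id, filename
           FROM documents
          WHERE MATCH(content) AGAINST (:q IN BOOLEAN MODE)
          LIMIT :lim'
    );
    $stmt->bindValue(':q', $query, PDO::PARAM_STR);
    $stmt->bindValue(':lim', $limit, PDO::PARAM_INT);
    $stmt->execute();

    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

With such an index the lookup goes through the index rather than through the text of each row, which is what makes searching thousands of documents plausible on modest hardware. Whether it is fast enough for this site would still need to be measured on the real data.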

I hope you can give me some pointers on whether I am on the right track, and what in your opinion the best approach would be. Thanks a lot in advance!

1 answer

  • dounangqie4819 2014-10-09 11:01

    Approach (a) is your answer, in my opinion. Searching through the full content of every file is not viable in practice. Extracting the 10-20 most frequent words will only mislead the search, because there is no guarantee that those words meaningfully describe the document they come from. Extracting a large portion of the text could be useful, but searching will be a lot slower, and there is no telling whether the results will be better or worse than with keywords.

    All that aside, this is largely opinion-based. There is no right or wrong way to go about it; approach (a) simply makes the most sense to me.

    Accepted by the asker as the best answer.
