duanfei1930 2014-10-09 08:54
Viewed 27 times
Accepted

Organizing and Managing Thousands of PDF Files with PHP and MySQL

I am helping a former teacher of mine set up a website where he can exchange class documents (exams, exercise sheets for students, etc.) with his colleagues. He has personally created thousands of PDF files, which will now be made available to other teachers for reference and use.

One main feature would be a search function that lets users find specific files. Since there are so many documents, we need an efficient way to search through all of them.

I have thought of several approaches:

a) Assign each PDF file 5-10 keywords manually and save those in the MySQL database along with the file's metadata. The user would then search against those keywords, not the PDF's content directly.

b) Use some logic to programmatically extract the 10-20 most frequent keywords from each document and save those in the MySQL database along with the file's metadata. In my opinion this is a better approach than a). (A rough sketch of what I mean follows below this list.)

c) Extract a large portion of (or all of) each PDF file's text content and save it in the MySQL database along with the file's metadata. (Note that file_get_contents alone would only return the raw PDF bytes; an actual text-extraction step, e.g. the pdftotext tool or a PHP PDF-parser library, is needed to get readable text.) The user would then be able to search the actual text content itself. In my opinion, this would be the best approach (a second sketch follows further below).

d) Any other approach not mentioned by me?
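For b), here is a minimal sketch of the frequency-based keyword extraction, assuming the plain text has already been pulled out of the PDF (for instance with the pdftotext command-line tool from poppler-utils, which would have to be installed on the server). The function name extractTopKeywords and the tiny stop-word list are made up for illustration:

```php
<?php
// Hypothetical helper: return the N most frequent words of a text,
// skipping very short words and an (illustrative) stop-word list.
function extractTopKeywords($text, $limit = 15)
{
    $stopWords = array('the', 'and', 'for', 'with', 'that', 'this', 'from');

    // Lowercase the text and grab runs of letters (Unicode-aware).
    preg_match_all('/\pL+/u', mb_strtolower($text), $matches);

    $counts = array();
    foreach ($matches[0] as $word) {
        if (mb_strlen($word) < 4 || in_array($word, $stopWords, true)) {
            continue; // skip short words and stop words
        }
        $counts[$word] = isset($counts[$word]) ? $counts[$word] + 1 : 1;
    }

    arsort($counts);                 // highest count first
    return array_slice(array_keys($counts), 0, $limit);
}

// pdftotext writes the extracted text to stdout when '-' is given.
$text = shell_exec('pdftotext ' . escapeshellarg('/path/to/exam.pdf') . ' -');
$keywords = extractTopKeywords($text); // e.g. store these in MySQL
```

Raw frequency often surfaces words that say little about a document, so in practice a proper stop-word list (and perhaps a manual review step) would matter.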

I am not sure about the viability of these approaches. For example, will c) consume a lot of server-side resources? After all, we would be searching through thousands of database rows, each containing hundreds of words of extracted text.
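For reference, here is roughly what I have in mind for c). Instead of scanning every row, MySQL's built-in FULLTEXT index could carry the search. This is a sketch only: the documents table and connection details are made up, and note that before MySQL 5.6 FULLTEXT indexes require the MyISAM engine (InnoDB supports them from 5.6 onward):

```php
<?php
// Made-up schema assumed here:
//   CREATE TABLE documents (
//       id       INT AUTO_INCREMENT PRIMARY KEY,
//       filename VARCHAR(255),
//       content  MEDIUMTEXT,              -- extracted PDF text
//       FULLTEXT KEY ft_content (content)
//   );

$pdo = new PDO('mysql:host=localhost;dbname=school;charset=utf8', 'user', 'secret');

// MATCH ... AGAINST uses the FULLTEXT index instead of reading every row.
// Two placeholders carry the same value because PDO does not always allow
// reusing a single named placeholder.
$stmt = $pdo->prepare(
    'SELECT id, filename,
            MATCH(content) AGAINST(:q1) AS relevance
     FROM documents
     WHERE MATCH(content) AGAINST(:q2)
     ORDER BY relevance DESC
     LIMIT 20'
);
$stmt->execute(array(':q1' => 'quadratic equations', ':q2' => 'quadratic equations'));
$results = $stmt->fetchAll(PDO::FETCH_ASSOC);
```

Whether this stays fast with thousands of rows of extracted text is exactly what I am unsure about.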

I hope you can give me some pointers on whether I am on the right track, and what in your opinion the best approach would be. Thanks a lot in advance!


1 answer

  • dounangqie4819 2014-10-09 11:01

    Approach (a) is your answer (in my opinion). Searching through all of the file content is not viable in practice. Extracting the 10-20 most frequent words will only mislead your search, since there is no guarantee those words will meaningfully describe the document they come from. Extracting a large portion of the text could be useful, but searching it will be a lot slower, and there is no guarantee it will make the search any better than a keyword-based one.

    Everything aside, this is largely opinion-based. There is no strictly right or wrong way to go about it, and approach (a) makes the most sense to me.
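    To illustrate, here is a minimal sketch of what the keyword lookup in (a) could look like; the table names (documents, keywords, document_keyword) and connection details are made up:

    ```php
    <?php
    // Made-up schema assumed here:
    //   documents(id, filename, title)
    //   keywords(id, word)                         -- 5-10 per document, assigned manually
    //   document_keyword(document_id, keyword_id)  -- join table

    $pdo = new PDO('mysql:host=localhost;dbname=school;charset=utf8', 'user', 'secret');

    // Rank documents by how many of their assigned keywords match the term.
    $stmt = $pdo->prepare(
        'SELECT d.id, d.filename, d.title, COUNT(*) AS hits
         FROM documents d
         JOIN document_keyword dk ON dk.document_id = d.id
         JOIN keywords k          ON k.id = dk.keyword_id
         WHERE k.word LIKE :term
         GROUP BY d.id, d.filename, d.title
         ORDER BY hits DESC'
    );
    $stmt->execute(array(':term' => '%algebra%'));
    $results = $stmt->fetchAll(PDO::FETCH_ASSOC);
    ```

    Because the keywords table stays small (5-10 rows per document), this query remains cheap even with thousands of documents.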

    This answer was accepted by the asker as the best answer.
