duanfei1930 2014-10-09 08:54
Viewed 27 times
Accepted

Organizing and Managing Thousands of PDF Files with PHP and MySQL

I am helping a former teacher of mine set up a website where he can exchange class documents (exams, exercise sheets for students, etc.) with his colleagues. He has personally created thousands of PDF files, which will now be made available to other teachers for reference and use.

One main feature would be a search function that lets users find specific files. Since there are so many documents, we need an efficient way to search through all of them.

I have thought of several approaches:

a) Assign each PDF file 5-10 keywords manually and save those in the MySQL database along with the file's metadata. The user would then search against those keywords, not the PDF's content directly.

b) Use some logic to programmatically extract the 10-20 most frequent keywords from each document and save those in the MySQL database along with the file's metadata. In my opinion this is a better approach than a). (A rough sketch of what I mean follows below this list.)

c) Extract a large portion of (or all of) each PDF file's text content and save it in the MySQL database along with the file's metadata. (Note that file_get_contents alone would only return the raw PDF bytes; an actual text-extraction step, e.g. the pdftotext tool or a PHP PDF-parser library, is needed to get readable text.) The user would then be able to search the actual text content itself. In my opinion, this would be the best approach (a second sketch follows further below).

d) Any other approach not mentioned by me?
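For b), here is a minimal sketch of the frequency-based keyword extraction, assuming the plain text has already been pulled out of the PDF (for instance with the pdftotext command-line tool from poppler-utils, which would have to be installed on the server). The function name extractTopKeywords and the tiny stop-word list are made up for illustration:

```php
<?php
// Hypothetical helper: return the N most frequent words of a text,
// skipping very short words and an (illustrative) stop-word list.
function extractTopKeywords($text, $limit = 15)
{
    $stopWords = array('the', 'and', 'for', 'with', 'that', 'this', 'from');

    // Lowercase the text and grab runs of letters (Unicode-aware).
    preg_match_all('/\pL+/u', mb_strtolower($text), $matches);

    $counts = array();
    foreach ($matches[0] as $word) {
        if (mb_strlen($word) < 4 || in_array($word, $stopWords, true)) {
            continue; // skip short words and stop words
        }
        $counts[$word] = isset($counts[$word]) ? $counts[$word] + 1 : 1;
    }

    arsort($counts);                 // highest count first
    return array_slice(array_keys($counts), 0, $limit);
}

// pdftotext writes the extracted text to stdout when '-' is given.
$text = shell_exec('pdftotext ' . escapeshellarg('/path/to/exam.pdf') . ' -');
$keywords = extractTopKeywords($text); // e.g. store these in MySQL
```

Raw frequency often surfaces words that say little about a document, so in practice a proper stop-word list (and perhaps a manual review step) would matter.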

I am not sure about the viability of these approaches. For example, will c) consume a lot of server-side resources? After all, we would be searching through thousands of database rows, each containing hundreds of words of extracted text.
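For reference, here is roughly what I have in mind for c). Instead of scanning every row, MySQL's built-in FULLTEXT index could carry the search. This is a sketch only: the documents table and connection details are made up, and note that before MySQL 5.6 FULLTEXT indexes require the MyISAM engine (InnoDB supports them from 5.6 onward):

```php
<?php
// Made-up schema assumed here:
//   CREATE TABLE documents (
//       id       INT AUTO_INCREMENT PRIMARY KEY,
//       filename VARCHAR(255),
//       content  MEDIUMTEXT,              -- extracted PDF text
//       FULLTEXT KEY ft_content (content)
//   );

$pdo = new PDO('mysql:host=localhost;dbname=school;charset=utf8', 'user', 'secret');

// MATCH ... AGAINST uses the FULLTEXT index instead of reading every row.
// Two placeholders carry the same value because PDO does not always allow
// reusing a single named placeholder.
$stmt = $pdo->prepare(
    'SELECT id, filename,
            MATCH(content) AGAINST(:q1) AS relevance
     FROM documents
     WHERE MATCH(content) AGAINST(:q2)
     ORDER BY relevance DESC
     LIMIT 20'
);
$stmt->execute(array(':q1' => 'quadratic equations', ':q2' => 'quadratic equations'));
$results = $stmt->fetchAll(PDO::FETCH_ASSOC);
```

Whether this stays fast with thousands of rows of extracted text is exactly what I am unsure about.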

I hope you can give me some pointers on whether I am on the right track, and what in your opinion the best approach would be. Thanks a lot in advance!


1 answer

  • dounangqie4819 2014-10-09 11:01

    Approach (a) is your answer (in my opinion). Searching through all of the file content is not viable in practice. Extracting the 10-20 most frequent words will only mislead your search, since there is no guarantee those words will meaningfully describe the document they come from. Extracting a large portion of the text could be useful, but searching it will be a lot slower, and there is no guarantee it will make the search any better than a keyword-based one.

    Everything aside, this is largely opinion-based. There is no strictly right or wrong way to go about it, and approach (a) makes the most sense to me.
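    To illustrate, here is a minimal sketch of what the keyword lookup in (a) could look like; the table names (documents, keywords, document_keyword) and connection details are made up:

    ```php
    <?php
    // Made-up schema assumed here:
    //   documents(id, filename, title)
    //   keywords(id, word)                         -- 5-10 per document, assigned manually
    //   document_keyword(document_id, keyword_id)  -- join table

    $pdo = new PDO('mysql:host=localhost;dbname=school;charset=utf8', 'user', 'secret');

    // Rank documents by how many of their assigned keywords match the term.
    $stmt = $pdo->prepare(
        'SELECT d.id, d.filename, d.title, COUNT(*) AS hits
         FROM documents d
         JOIN document_keyword dk ON dk.document_id = d.id
         JOIN keywords k          ON k.id = dk.keyword_id
         WHERE k.word LIKE :term
         GROUP BY d.id, d.filename, d.title
         ORDER BY hits DESC'
    );
    $stmt->execute(array(':term' => '%algebra%'));
    $results = $stmt->fetchAll(PDO::FETCH_ASSOC);
    ```

    Because the keywords table stays small (5-10 rows per document), this query remains cheap even with thousands of documents.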

    This answer was accepted by the asker as the best answer.
