dongzanxun2790 2018-07-25 19:46
Views: 60

Is there a way to reliably extract plain text from a .doc file using PHP?

The things I've tried:

read_doc():

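private function read_doc() {
    // Read the whole file in binary mode.
    $fileHandle = fopen($this->filename, "rb");
    if ($fileHandle === false) {
        return false;
    }
    $line = @fread($fileHandle, filesize($this->filename));
    fclose($fileHandle);

    // Split the raw bytes on carriage returns (0x0D).
    $lines = explode(chr(0x0D), $line);
    $outtext = "";
    foreach ($lines as $thisline) {
        // Skip empty chunks and chunks that contain a NUL byte.
        $pos = strpos($thisline, chr(0x00));
        if ($pos === false && strlen($thisline) > 0) {
            $outtext .= $thisline . " ";
        }
    }
    // Keep only ASCII letters, digits, whitespace and a few punctuation marks.
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
    return $outtext;
}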

getRawWordText():

function getRawWordText($filename) {
    if (!file_exists($filename)) {
        return false;
    }
    if (($fh = fopen($filename, 'rb')) === false) {
        return false;
    }
    // Read the first 0xA00 bytes, which this approach treats as the header.
    $headers = fread($fh, 0xA00);
    // Bytes 0x21C to 0x21F are combined into the text length.
    $n1 = ord($headers[0x21C]) - 1;               // document has from 0 to 255 characters
    $n2 = (ord($headers[0x21D]) - 8) * 256;       // document has from 256 to 63743 characters
    $n3 = ord($headers[0x21E]) * 256 * 256;       // document has from 63744 to 16775423 characters
    $n4 = ord($headers[0x21F]) * 256 * 256 * 256; // document has from 16775424 to 4294965504 characters
    $textLength = $n1 + $n2 + $n3 + $n4;          // total length of text in the document
    $extracted_plaintext = fread($fh, $textLength);
    fclose($fh);
    $extracted_plaintext = mb_convert_encoding($extracted_plaintext, 'UTF-8');
    // To show paragraphs on separate lines in HTML, return nl2br($extracted_plaintext) instead.
    return $extracted_plaintext;
}

DocCounter: https://github.com/joeblurton/doccounter

DocumentParser: https://github.com/LukeMadhanga/PHPDocumentParser

Filetotext: https://www.phpclasses.org/package/8908-PHP-Convert-DOCX-DOC-PDF-to-plain-text.html#information https://gist.github.com/HadoDokis/bb7b4a7763a56eba2c5c

PhpWord: https://github.com/PHPOffice/PHPWord

And a couple of other functions I can't recall at the moment.

The goal of the project is to extract the text and count the number of characters without whitespace.
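
To make the counting step concrete, here is a minimal sketch of a "characters without whitespace" count. The helper name countNonWhitespaceChars is hypothetical, and the sketch assumes the extracted text is already valid UTF-8 and that the mbstring extension is available. Whether MS Word treats every whitespace character the same way is not something this sketch verifies.

```php
<?php
// Hypothetical helper, not taken from any of the libraries listed above.
function countNonWhitespaceChars(string $text): int
{
    // Strip whitespace (spaces, tabs, line breaks) before counting.
    $stripped = preg_replace('/\s+/u', '', $text);
    if ($stripped === null) {
        // preg_replace() returns null on failure, e.g. malformed UTF-8 input.
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    // Count characters rather than bytes.
    return mb_strlen($stripped, 'UTF-8');
}

echo countNonWhitespaceChars("Hello, world!\r\nSecond line\t."); // prints 23
```

Using mb_strlen instead of strlen matters once the text contains non-ASCII characters, since strlen counts bytes.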

The criterion is that the solution's count must be within 10% of the character count reported by MS Word.
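
The 10% requirement could be checked along these lines. This is only a sketch: the function names relativeError and withinMargin are hypothetical, the MS Word count is assumed to be entered by hand, and the error is assumed to be measured relative to Word's count.

```php
<?php
// Hypothetical helper: relative difference between an extracted count and
// the reference count reported by MS Word (supplied manually).
function relativeError(int $extractedCount, int $wordCount): float
{
    if ($wordCount <= 0) {
        throw new InvalidArgumentException('Reference count must be positive.');
    }
    return abs($extractedCount - $wordCount) / $wordCount;
}

// Hypothetical helper: true when the extracted count is within the margin.
function withinMargin(int $extractedCount, int $wordCount, float $maxError = 0.10): bool
{
    return relativeError($extractedCount, $wordCount) <= $maxError;
}

var_dump(withinMargin(950, 1000));  // 5% off: bool(true)
var_dump(withinMargin(1200, 1000)); // 20% off: bool(false)
```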

Thanks in advance!
