doulei3488 2019-04-20 00:46

Displaying the contents of a docx file using PHP

I am having a tough time displaying the raw text content of this docx file. The output contains a lot of unwanted words and symbols.

Here is the docx file that I want to extract raw content from.

https://www.darlingheadbands.com/wp-content/uploads/2019/04/file.docx

Right now I get some normal text, but also strange text like the sample below.

PEVuZE5vdGU+PENpdGU+PEF1dGhvcj5Db2hlbjwvQXV0aG9yPjxZZWFyPjIwMDU8L1llYXI+PFJl Y051bT4wPC9SZWNOdW0+PElEVGV4dD5PZmYtbGluZSBsZWFybmluZyBvZiBtb3RvciBza2lsbCBt ZW1vcnk6IGEgZG91YmxlIGRpc3NvY2lhdGlvbiBvZiBnb2FsIGFuZCBtb3ZlbWVudDwvSURUZXh0 PjxEaXNwbGF5VGV4dD4oV2lsbGluZ2hhbSAxOTk5LCBDb2hlbiwgUGFzY3VhbC1MZW9uZSBldCBh

Here is my code:

<?php
function docx_to_text($input_file){
    $xml_filename = "word/document.xml"; //content file name
    $zip_handle = new ZipArchive;
    $output_text = "";
    if(true === $zip_handle->open($input_file)){
        if(($xml_index = $zip_handle->locateName($xml_filename)) !== false){
            $xml_datas = $zip_handle->getFromIndex($xml_index);
            // loadXML() is an instance method; calling it statically fails on PHP 8.
            $xml_handle = new DOMDocument();
            $xml_handle->loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            $output_text = strip_tags($xml_handle->saveXML());
        }else{
            $output_text .="";
        }
        $zip_handle->close();
    }else{
        $output_text .="";
    }
    return $output_text;
}

echo docx_to_text("file.docx");
?>

It should show only the raw text, without any images, tables or formatting. Just plain text.


1 answer

  • dongzhiyi2006 2019-04-20 01:57

    This worked for me (using your document):

    <?php
    
    function read_docx($document)
    {
        $content = '';
        $zip = zip_open($document);
        if (!$zip || is_numeric($zip)) return false;
        while ($zip_entry = zip_read($zip))
        {
            // Check the name first so that skipped entries are never left open.
            if (zip_entry_name($zip_entry) != 'word/document.xml') continue;
            if (zip_entry_open($zip, $zip_entry) == FALSE) continue;
            $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
            zip_entry_close($zip_entry);
        }
        zip_close($zip);
    
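        // Separate table cells with a space and paragraphs with a newline.
        $content = str_replace('</w:r></w:p></w:tc><w:tc>', ' ', $content);
        $content = str_replace('</w:r></w:p>', "\n", $content);
        // Drop the field data blocks; U makes .* lazy, s lets it span line breaks.
        $content = preg_replace('/<w:fldData xml:space="preserve">.*<\/w:fldData>/Ums', '', $content);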
    
        return strip_tags($content);
    }
    
    echo read_docx('./file.docx');
    

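    The weird text you were seeing comes from the w:fldData elements in word/document.xml, which hold Base64-encoded field data (in this document, EndNote citation records). strip_tags() removes the tags but keeps their contents, so I had to strip those elements out first.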
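
To see what that text actually is, the first part of the sample quoted in the question can be decoded as Base64. This is only a small check; the variable names are illustrative, and the string is just the first 72 characters of the quoted sample.

```php
<?php
// First 72 characters of the Base64 text quoted in the question.
$sample = 'PEVuZE5vdGU+PENpdGU+PEF1dGhvcj5Db2hlbjwvQXV0aG9yPjxZZWFyPjIwMDU8L1llYXI+';

// Strict mode returns false if the input contains non-Base64 characters.
$decoded = base64_decode($sample, true);

echo $decoded, PHP_EOL;
// prints: <EndNote><Cite><Author>Cohen</Author><Year>2005</Year>
```

The decoded data is citation markup rather than document text, which is why removing the fldData elements loses nothing you want to display.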

    I kept the document properties; if you don't need them, remove them with another preg_replace call.
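
The procedural zip_* functions are deprecated as of PHP 8.0, so the same idea can also be sketched with ZipArchive and DOM. This is a minimal sketch, not a drop-in replacement: the function names docx_plain_text and document_xml_to_text are made up for illustration, and it assumes that collecting only the w:t text runs is enough for your document. Because w:fldData is not a w:t element, it is skipped without any regex. Tabs, line breaks inside a paragraph, and text boxes nested inside paragraphs are not handled specially here.

```php
<?php
// Hypothetical helper: read word/document.xml from a .docx and return plain text.
function docx_plain_text(string $path)
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return false;
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();

    return $xml === false ? false : document_xml_to_text($xml);
}

// Hypothetical helper: turn the WordprocessingML markup into one line per paragraph.
function document_xml_to_text(string $xml)
{
    $dom = new DOMDocument();
    // LIBXML_NONET blocks network access while parsing; @ silences parse warnings.
    if (!@$dom->loadXML($xml, LIBXML_NONET)) {
        return false;
    }

    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace(
        'w',
        'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
    );

    $lines = [];
    foreach ($xpath->query('//w:p') as $paragraph) {
        $text = '';
        // Only w:t holds visible text; w:fldData and w:instrText are ignored.
        foreach ($xpath->query('.//w:t', $paragraph) as $run_text) {
            $text .= $run_text->textContent;
        }
        $lines[] = $text;
    }

    return implode("\n", $lines);
}
```

Usage would be along the lines of `echo docx_plain_text('./file.docx');`, with a check for a false return value when the file cannot be opened or parsed.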
