doulei3488 2019-04-20 00:46
浏览 110

使用PHP显示docx文件的内容

I am having tough time displaying the raw content of this docx file. It shows lots of unnecessary words and symbols.

Here is the docx file that I want to extract raw content from.

https://www.darlingheadbands.com/wp-content/uploads/2019/04/file.docx

Right now I am getting some normal raw text and also some weird text like the one below.

PEVuZE5vdGU+PENpdGU+PEF1dGhvcj5Db2hlbjwvQXV0aG9yPjxZZWFyPjIwMDU8L1llYXI+PFJl Y051bT4wPC9SZWNOdW0+PElEVGV4dD5PZmYtbGluZSBsZWFybmluZyBvZiBtb3RvciBza2lsbCBt ZW1vcnk6IGEgZG91YmxlIGRpc3NvY2lhdGlvbiBvZiBnb2FsIGFuZCBtb3ZlbWVudDwvSURUZXh0 PjxEaXNwbGF5VGV4dD4oV2lsbGluZ2hhbSAxOTk5LCBDb2hlbiwgUGFzY3VhbC1MZW9uZSBldCBh

here is my code

<?php
function docx_to_text($input_file){
    $xml_filename = "word/document.xml"; //content file name
    $zip_handle = new ZipArchive;
    $output_text = "";
    if(true === $zip_handle->open($input_file)){
        if(($xml_index = $zip_handle->locateName($xml_filename)) !== false){
            $xml_datas = $zip_handle->getFromIndex($xml_index);
            $xml_handle = DOMDocument::loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            $output_text = strip_tags($xml_handle->saveXML());
        }else{
            $output_text .="";
        }
        $zip_handle->close();
    }else{
    $output_text .="";
    }
    return $output_text;
}

echo docx_to_text("file.docx");
?>

It should just show the raw text without any images, tables or format. Just plain text.

  • 写回答

1条回答 默认 最新

  • dongzhiyi2006 2019-04-20 01:57
    关注

    This worked for me (using your document):

    <?php
    
    function read_docx($document)
    {
        $content = '';
        $zip = zip_open($document);
        if (!$zip || is_numeric($zip)) return false;
        while ($zip_entry = zip_read($zip))
        {
            if (zip_entry_open($zip, $zip_entry) == FALSE) continue;
            if (zip_entry_name($zip_entry) != 'word/document.xml') continue;
            $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
            zip_entry_close($zip_entry);
        }
        zip_close($zip);
    
        $content = str_replace('</w:r></w:p></w:tc><w:tc>', ' ', $content);
        $content = str_replace('</w:r></w:p>', "
    ", $content);
        $content = preg_replace('/<w:fldData xml:space="preserve">.*<\/w:fldData>/Ums', '', $content);
    
        return strip_tags($content);
    }
    
    echo read_docx('./file.docx');
    

    The weird text you were seeing was related to fldData entries, that I had to strip out.

    I kept the document properties, just remove them with preg_replace in case you don't need them.

    评论

报告相同问题?

悬赏问题

  • ¥15 公司的电脑,win10系统自带远程协助,访问家里个人电脑,提示出现内部错误,各种常规的设置都已经尝试,感觉公司对此功能进行了限制(我们是集团公司)
  • ¥15 救!ENVI5.6深度学习初始化模型报错怎么办?
  • ¥30 eclipse开启服务后,网页无法打开
  • ¥30 雷达辐射源信号参考模型
  • ¥15 html+css+js如何实现这样子的效果?
  • ¥15 STM32单片机自主设计
  • ¥15 如何在node.js中或者java中给wav格式的音频编码成sil格式呢
  • ¥15 不小心不正规的开发公司导致不给我们y码,
  • ¥15 我的代码无法在vc++中运行呀,错误很多
  • ¥50 求一个win系统下运行的可自动抓取arm64架构deb安装包和其依赖包的软件。