Displaying the contents of a docx file using PHP

I am having a tough time displaying the raw text content of this docx file. The output contains a lot of unwanted words and symbols.

Here is the docx file that I want to extract raw content from.

https://www.darlingheadbands.com/wp-content/uploads/2019/04/file.docx

Right now I am getting some normal raw text and also some weird text like the one below.

PEVuZE5vdGU+PENpdGU+PEF1dGhvcj5Db2hlbjwvQXV0aG9yPjxZZWFyPjIwMDU8L1llYXI+PFJl Y051bT4wPC9SZWNOdW0+PElEVGV4dD5PZmYtbGluZSBsZWFybmluZyBvZiBtb3RvciBza2lsbCBt ZW1vcnk6IGEgZG91YmxlIGRpc3NvY2lhdGlvbiBvZiBnb2FsIGFuZCBtb3ZlbWVudDwvSURUZXh0 PjxEaXNwbGF5VGV4dD4oV2lsbGluZ2hhbSAxOTk5LCBDb2hlbiwgUGFzY3VhbC1MZW9uZSBldCBh

Here is my code:

<?php
function docx_to_text($input_file){
    $xml_filename = "word/document.xml"; //content file name
    $zip_handle = new ZipArchive;
    $output_text = "";
    if(true === $zip_handle->open($input_file)){
        if(($xml_index = $zip_handle->locateName($xml_filename)) !== false){
            $xml_datas = $zip_handle->getFromIndex($xml_index);
            // loadXML() is an instance method; calling it statically fails on PHP 8.
            $xml_handle = new DOMDocument();
            $xml_handle->loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            $output_text = strip_tags($xml_handle->saveXML());
        }else{
            $output_text .="";
        }
        $zip_handle->close();
    }else{
        $output_text .="";
    }
    return $output_text;
}

echo docx_to_text("file.docx");
?>

It should show only the raw text, without any images, tables or formatting. Just plain text.

1 answer

dongzhiyi2006 2019-04-20 01:57

This worked for me (using your document):

<?php

function read_docx($document)
{
    $content = '';
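    // Note: the procedural zip_* functions are deprecated as of PHP 8.0.
    $zip = zip_open($document);
    if (!$zip || is_numeric($zip)) return false;
    while ($zip_entry = zip_read($zip))
    {
        // Check the name first so that no unrelated entry is left open.
        if (zip_entry_name($zip_entry) != 'word/document.xml') continue;
        if (zip_entry_open($zip, $zip_entry) == FALSE) continue;
        $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
        zip_entry_close($zip_entry);
    }
    zip_close($zip);

    // Separate table cells with a space and paragraphs with a newline.
    $content = str_replace('</w:r></w:p></w:tc><w:tc>', ' ', $content);
    $content = str_replace('</w:r></w:p>', "\n", $content);
    // Drop the field data, which strip_tags() would otherwise leave behind as text.
    $content = preg_replace('/<w:fldData xml:space="preserve">.*<\/w:fldData>/Ums', '', $content);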

    return strip_tags($content);
}

echo read_docx('./file.docx');

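The weird text you were seeing comes from the <w:fldData> entries in word/document.xml. Their content is Base64-encoded field data, and strip_tags() removes only the tags, not the text between them, so I had to strip those elements out with preg_replace before calling strip_tags().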
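To see what the weird text actually is, you can decode the start of the sample from the question. This is only a quick check, and the variable names are mine:

```php
<?php
// First 72 characters of the "weird text" quoted in the question.
$sample = 'PEVuZE5vdGU+PENpdGU+PEF1dGhvcj5Db2hlbjwvQXV0aG9yPjxZZWFyPjIwMDU8L1llYXI+';

// base64_decode() turns it back into the XML fragment stored as field data.
$decoded = base64_decode($sample);

echo $decoded, PHP_EOL;
// prints: <EndNote><Cite><Author>Cohen</Author><Year>2005</Year>
```

So the payload looks like citation field data, not body text, which is why it is safe to drop for a plain-text dump.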

I kept the document properties in the output. If you don't need them, remove them with another preg_replace call.
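If you would rather not rely on strip_tags() and regular expressions, an alternative is to parse the XML and collect only the <w:t> text runs, which skips <w:fldData> automatically. The following is a minimal sketch, not a tested drop-in replacement: the function names document_xml_to_text and docx_to_plain_text are made up for this example, it assumes the zip and dom extensions are available, and it does not handle tabs, line breaks inside a paragraph, or text boxes nested in paragraphs.

```php
<?php
// WordprocessingML main namespace, used for the "w:" prefix in document.xml.
const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

// Hypothetical helper: turn the XML of word/document.xml into plain text,
// one output line per <w:p> paragraph, using only <w:t> text nodes.
function document_xml_to_text($xml)
{
    $dom = new DOMDocument();
    if (!@$dom->loadXML($xml)) {
        return '';
    }

    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', W_NS);

    $lines = [];
    foreach ($xpath->query('//w:body//w:p') as $paragraph) {
        $text = '';
        // Field data (<w:fldData>) is not inside <w:t>, so it is never collected.
        foreach ($xpath->query('.//w:t', $paragraph) as $run) {
            $text .= $run->textContent;
        }
        $lines[] = $text;
    }

    return implode("\n", $lines);
}

// Hypothetical helper: read word/document.xml from the archive with ZipArchive.
// Returns false if the file cannot be opened or has no document.xml.
function docx_to_plain_text($path)
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return false;
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();

    return $xml === false ? false : document_xml_to_text($xml);
}

// Usage, assuming file.docx sits next to the script:
// echo docx_to_plain_text('./file.docx');
```

Table cells each contain their own paragraphs, so with this approach every cell ends up on a separate line instead of being joined with spaces.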