使用PHP显示docx文件的内容

I am having tough time displaying the raw content of this docx file. It shows lots of unnecessary words and symbols.

Here is the docx file that I want to extract raw content from.

https://www.darlingheadbands.com/wp-content/uploads/2019/04/file.docx

Right now I am getting some normal raw text and also some weird text like the one below.

PEVuZE5vdGU+PENpdGU+PEF1dGhvcj5Db2hlbjwvQXV0aG9yPjxZZWFyPjIwMDU8L1llYXI+PFJl Y051bT4wPC9SZWNOdW0+PElEVGV4dD5PZmYtbGluZSBsZWFybmluZyBvZiBtb3RvciBza2lsbCBt ZW1vcnk6IGEgZG91YmxlIGRpc3NvY2lhdGlvbiBvZiBnb2FsIGFuZCBtb3ZlbWVudDwvSURUZXh0 PjxEaXNwbGF5VGV4dD4oV2lsbGluZ2hhbSAxOTk5LCBDb2hlbiwgUGFzY3VhbC1MZW9uZSBldCBh

here is my code

<?php
function docx_to_text($input_file){
    $xml_filename = "word/document.xml"; //content file name
    $zip_handle = new ZipArchive;
    $output_text = "";
    if(true === $zip_handle->open($input_file)){
        if(($xml_index = $zip_handle->locateName($xml_filename)) !== false){
            $xml_datas = $zip_handle->getFromIndex($xml_index);
            $xml_handle = DOMDocument::loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            $output_text = strip_tags($xml_handle->saveXML());
        }else{
            $output_text .="";
        }
        $zip_handle->close();
    }else{
    $output_text .="";
    }
    return $output_text;
}

echo docx_to_text("file.docx");
?>

It should just show the raw text without any images, tables or format. Just plain text.

写回答
好问题 0 提建议
追加酬金
关注问题
分享
邀请回答
编辑收藏删除结题
收藏举报

1条回答默认最新

dongzhiyi2006 2019-04-20 01:57

关注

This worked for me (using your document):

<?php

function read_docx($document)
{
    $content = '';
    $zip = zip_open($document);
    if (!$zip || is_numeric($zip)) return false;
    while ($zip_entry = zip_read($zip))
    {
        if (zip_entry_open($zip, $zip_entry) == FALSE) continue;
        if (zip_entry_name($zip_entry) != 'word/document.xml') continue;
        $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
        zip_entry_close($zip_entry);
    }
    zip_close($zip);

    $content = str_replace('</w:r></w:p></w:tc><w:tc>', ' ', $content);
    $content = str_replace('</w:r></w:p>', "
", $content);
    $content = preg_replace('/<w:fldData xml:space="preserve">.*<\/w:fldData>/Ums', '', $content);

    return strip_tags($content);
}

echo read_docx('./file.docx');

The weird text you were seeing was related to fldData entries, that I had to strip out.

I kept the document properties, just remove them with preg_replace in case you don't need them.

报告相同问题？

关注问题

使用php编辑上传的.docx文件 php
2014-08-12 11:44

回答 2 已采纳 Ok, there is one very simple solution as to how to get access to the contents of the zip folder. I
上传docx文件并使用php将内容和图像存储到DB中 mysql php
2014-12-08 06:50

回答 1 已采纳 You can already get the $content using your code. $content .= zip_entry_read($zip_entry, zip_entr
如何使用php code-igniter读取docx文件数学方程式 php
2015-04-22 08:27

回答 2 已采纳 I fully go through this https://msdn.microsoft.com/en-us/library/aa982683(v=office.12).aspx#Office
基于php校内文件共享系统设计与实现.docx
2023-11-09 09:14

基于php校内文件共享系统设计与实现
使用PHPWord下载生成的文件 php
2017-11-21 13:56

回答 2 已采纳 just use the builtin functionality as follows $phpWord->save('teste.docx', 'Word2007', true);
下载docx文件时出错 php
2016-08-03 16:55

回答 1 已采纳 Before readfile($fileDir), try to run ob_clean() and flush() to clean (erase) and flush the output
PHP显示特定的目录文件 php
2016-10-18 20:15

回答 1 已采纳 Just replace this: if($ff != '.' && $ff != '..') and use that: if($ff != '.' && $ff != '..' &&
docx-tidy:PHP库可整理DOCX XML文件
2021-05-16 07:52

DOCX文件（包括对包含的XML文件的解压缩和重新存档） XML字符串请注意通过合并分段标签，DocxTidy删除版本控制/编辑历史记录信息使用默认设置运行时，DocxTidy会删除拼写检查标记（“ noProof”，“ proofErr”...
Google Api for PHP（Drive API）导出为.pdf上传的.docx文件 php
2017-07-28 13:42

回答 1 已采纳 OK. So I spend two days looking for solution and I came up with several conclusions: Using Serve
在php上传rtf文件 php
2018-08-18 10:56

回答 2 已采纳 See the documentation: The MAX_FILE_SIZE hidden field (measured in bytes) must precede the fil
laravel 4 docx文件验证无效 php
2015-03-12 03:49

回答 2 已采纳 The issue is that a docx is actually a zipped XML file format. It thinks you're uploading a zip fi
php 读取docx,PHP读取docx文档内容
2021-03-23 21:13

weixin_39688856的博客客户需求, 需要从docx文档读取内容并且做简单格式化, 难点就在于如何读取docx格式并且转换为php可以识别的字符串形式, 惯例先贴代码.class Docx2Text{const SEPARATOR_TAB = "\t";private $docx;private $dom...
找不到存档文件。在PHPWord-0.16.0 / vendor / phpoffice / common / src / Common / XMLReader.php中 html php
2019-01-11 06:36

回答 1 已采纳 A possible solution may be, You can check file permission whether a file folder has permission to
php 打开docx文件怎么打开,docx怎么打开怎么在网页上打开docx文件
2021-05-04 07:23

weixin_39558391的博客 Smarty的date_format总是提示参数不对，按照手册的index.php
PHP语言教程及案例.docx
2024-03-08 09:35

在PHP中，Hello World程序可以通过在文件中嵌入简单的PHP代码实现： ```php <?php echo "Hello, World!"; ?> ``` 在上述例子中： - `<?php` 和 `?>` 用于标识PHP代码块的开始和结束。 - `echo` 用于输出文本。 ##...
没有解决我的问题, 去提问

悬赏问题

¥60 版本过低apk如何修改可以兼容新的安卓系统
¥25 由IPR导致的DRIVER_POWER_STATE_FAILURE蓝屏
¥50 有数据，怎么建立模型求影响全要素生产率的因素
¥50 有数据，怎么用matlab求全要素生产率
¥15 TI的insta-spin例程
¥15 完成下列问题完成下列问题
¥15 C#算法问题, 不知道怎么处理这个数据的转换
¥15 YoloV5 第三方库的版本对照问题
¥15 请完成下列相关问题！
¥15 drone 推送镜像时候 purge: true 推送完毕后没有删除对应的镜像,手动拷贝到服务器执行结果正确在样才能让指令自动执行成功删除对应镜像，如何解决？

码龄粉丝数原力等级 --

使用PHP显示docx文件的内容

1条回答默认最新

码龄粉丝数原力等级 --

悬赏问题

使用PHP显示docx文件的内容

1条回答 默认 最新

悬赏问题

1条回答默认最新