I am having tough time displaying the raw content of this docx file. It shows lots of unnecessary words and symbols.
Here is the docx file that I want to extract raw content from.
https://www.darlingheadbands.com/wp-content/uploads/2019/04/file.docx
Right now I am getting some normal raw text and also some weird text like the one below.
PEVuZE5vdGU+PENpdGU+PEF1dGhvcj5Db2hlbjwvQXV0aG9yPjxZZWFyPjIwMDU8L1llYXI+PFJl Y051bT4wPC9SZWNOdW0+PElEVGV4dD5PZmYtbGluZSBsZWFybmluZyBvZiBtb3RvciBza2lsbCBt ZW1vcnk6IGEgZG91YmxlIGRpc3NvY2lhdGlvbiBvZiBnb2FsIGFuZCBtb3ZlbWVudDwvSURUZXh0 PjxEaXNwbGF5VGV4dD4oV2lsbGluZ2hhbSAxOTk5LCBDb2hlbiwgUGFzY3VhbC1MZW9uZSBldCBh
here is my code
<?php
function docx_to_text($input_file){
$xml_filename = "word/document.xml"; //content file name
$zip_handle = new ZipArchive;
$output_text = "";
if(true === $zip_handle->open($input_file)){
if(($xml_index = $zip_handle->locateName($xml_filename)) !== false){
$xml_datas = $zip_handle->getFromIndex($xml_index);
$xml_handle = DOMDocument::loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
$output_text = strip_tags($xml_handle->saveXML());
}else{
$output_text .="";
}
$zip_handle->close();
}else{
$output_text .="";
}
return $output_text;
}
echo docx_to_text("file.docx");
?>
It should just show the raw text without any images, tables or format. Just plain text.