The things I've tried:
read_doc():
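private function read_doc() {
    // Open in binary mode, since .doc is a binary format.
    $fileHandle = fopen($this->filename, "rb");
    $line = @fread($fileHandle, filesize($this->filename));
    fclose($fileHandle);
    // Split the raw bytes on carriage returns.
    $lines = explode(chr(0x0D), $line);
    $outtext = "";
    foreach ($lines as $thisline) {
        // Keep only non-empty chunks that contain no NUL bytes.
        $pos = strpos($thisline, chr(0x00));
        if ($pos === false && strlen($thisline) > 0) {
            $outtext .= $thisline . " ";
        }
    }
    // Strip every character outside a small ASCII whitelist.
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\t@\/\_\(\)]/", "", $outtext);
    return $outtext;
}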
getRawWordText():
function getRawWordText($filename) {
    if (file_exists($filename)) {
        // Open in binary mode, since .doc is a binary format.
        if (($fh = fopen($filename, 'rb')) !== false) {
            $headers = fread($fh, 0xA00);
            if ($headers === false || strlen($headers) < 0x220) {
                // File is too short to contain the length bytes read below.
                fclose($fh);
                return false;
            }
            $n1 = ( ord($headers[0x21C]) - 1 ); // 1 = (ord(n)*1) ; Document has from 0 to 255 characters
            $n2 = ( ( ord($headers[0x21D]) - 8 ) * 256 ); // 1 = ((ord(n)-8)*256) ; Document has from 256 to 63743 characters
            $n3 = ( ( ord($headers[0x21E]) * 256 ) * 256 ); // 1 = ((ord(n)*256)*256) ; Document has from 63744 to 16775423 characters
            $n4 = ( ( ( ord($headers[0x21F]) * 256 ) * 256 ) * 256 ); // 1 = (((ord(n)*256)*256)*256) ; Document has from 16775424 to 4294965504 characters
            $textLength = ($n1 + $n2 + $n3 + $n4); // Total length of text in the document
            if ($textLength < 1) {
                // fread() needs a positive length.
                fclose($fh);
                return false;
            }
            $extracted_plaintext = fread($fh, $textLength);
            fclose($fh);
            $extracted_plaintext = mb_convert_encoding($extracted_plaintext, 'UTF-8');
            // if you want to see your paragraphs in a new line, do this
            // return nl2br($extracted_plaintext);
            return ($extracted_plaintext);
        } else {
            return false;
        }
    } else {
        return false;
    }
}
DocCounter: https://github.com/joeblurton/doccounter
DocumentParser: https://github.com/LukeMadhanga/PHPDocumentParser
Filetotext: https://www.phpclasses.org/package/8908-PHP-Convert-DOCX-DOC-PDF-to-plain-text.html#information https://gist.github.com/HadoDokis/bb7b4a7763a56eba2c5c
PhpWord: https://github.com/PHPOffice/PHPWord
And a couple of other functions I can't recall at the moment.
The goal of the project is to extract the text and count the number of characters without whitespace.
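For clarity, this is roughly how I count once I have the extracted text. It is only a sketch: the name countCharsNoWhitespace is made up for this post, it assumes the text is already UTF-8 and that the mbstring extension is loaded, and I am assuming that non-breaking spaces and other Unicode separators should be treated as whitespace (I have not confirmed that MS Word does the same).

```php
<?php
// Hypothetical helper (not part of any library listed above):
// counts the characters in $text, excluding all whitespace.
function countCharsNoWhitespace(string $text): int
{
    // Remove ASCII whitespace plus Unicode separators such as the
    // non-breaking space (assumption: these count as whitespace).
    $stripped = preg_replace('/[\s\p{Z}]+/u', '', $text);

    if ($stripped === null) {
        // preg_replace() returns null on invalid UTF-8 with the /u flag,
        // so fall back to a byte-oriented match.
        $stripped = preg_replace('/\s+/', '', $text);
    }

    // Count characters, not bytes.
    return mb_strlen($stripped, 'UTF-8');
}
```

Usage would be something like countCharsNoWhitespace(getRawWordText($filename)), provided the extraction step returned a string and not false.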
The criterion is that the solution's count must be within 10% of the MS Word character count.
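To make the 10% requirement concrete, this is the kind of check I have in mind. It is a sketch under my own assumptions: the function name withinErrorMargin is hypothetical, the error is measured relative to the MS Word count, and the reference count is entered by hand from Word.

```php
<?php
// Hypothetical check: is $measured within $maxRelativeError of $reference?
// $reference is the character count reported by MS Word.
function withinErrorMargin(int $measured, int $reference, float $maxRelativeError = 0.10): bool
{
    if ($reference === 0) {
        // Avoid division by zero: an empty document only matches a zero count.
        return $measured === 0;
    }

    // Relative error with respect to the MS Word count.
    return abs($measured - $reference) / $reference <= $maxRelativeError;
}
```

For example, with a Word count of 1000, any result from 900 to 1100 would be acceptable.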
Thanks in advance!