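To match them byte for byte, a simple and efficient way is to compare their hashes. Use === so the two hex strings are compared as strings; with == PHP compares numeric-looking strings (such as 0e1234...) as numbers.
if (hash_file('sha1', $pathToFile1) === hash_file('sha1', $pathToFile2))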
If that's too exact, you could strip the spaces (or all whitespace) before hashing. That works for text files, not for binary files like docx or xlsx.
if (hash('sha1', str_replace(' ', '', file_get_contents($pathToFile1))) === hash('sha1', str_replace(' ', '', file_get_contents($pathToFile2))))
Or something like that to normalize the text. For binary file types you will have to use a library for that file type to convert the contents to text first.
In other words, you have to come up with some way to normalize the text contents of the files, such as upper-casing everything and removing spaces or other acceptable differences.
Normalizing is a fancy way of saying "removing the differences". A simple example: is "Some text" the same as "Some text."? Or "Some Text", or "some Text"? That depends. But after "normalizing them" they may all look like "sometext", with no punctuation, spaces or casing. It's up to you to decide how you normalize them.
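As a minimal sketch of that idea, the helper below lower-cases the text and drops everything except ASCII letters and digits. The name normalizeText and the exact rules are assumptions for illustration; your own rules may keep punctuation, handle non-ASCII text, and so on.

```php
<?php
// Hypothetical helper (not from the question): one possible normalization.
function normalizeText(string $text): string
{
    // Remove the casing differences first
    $text = strtolower($text);
    // Then drop anything that is not an ASCII letter or digit
    // (spaces, punctuation, line breaks)
    return preg_replace('/[^a-z0-9]+/', '', $text);
}

$variants = ['Some text', 'Some text.', 'Some Text', 'some Text'];
foreach ($variants as $variant) {
    // Every variant prints: sometext
    echo normalizeText($variant), "\n";
}
```

Once all variants collapse to the same string, hashing or comparing them directly gives the same answer.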
As for the binary formats you mentioned, I can't help you there: you will need to find a way to open them in PHP, which will require third-party libraries.
Your question is very broad, so I can only give you a broad overview of how to do it.
Hashing is nice because it takes a file of any size and reduces it to 40 characters (in the case of sha1), which is a lot easier to store in a DB or to visualize. I mention the DB because you can cut the work in half by pre-normalizing and hashing the known file (the source file) once and storing the result. This reduces the overall cost of comparing them.
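A rough sketch of the pre-hashing idea follows. The helper name normalizedFileHash, the chosen normalization (lower-case, strip whitespace) and the temporary files are all assumptions made to keep the example self-contained; in practice the known hash would come from your DB.

```php
<?php
// Hypothetical helper: hash a text file after a simple normalization.
function normalizedFileHash(string $path): string
{
    $contents = file_get_contents($path);
    if ($contents === false) {
        throw new RuntimeException("Cannot read $path");
    }
    // Assumed normalization: lower-case and remove all whitespace
    $normalized = preg_replace('/\s+/', '', strtolower($contents));
    return hash('sha1', $normalized);
}

// Done once for the known (source) file; store this 40-character string.
$source = tempnam(sys_get_temp_dir(), 'src');
file_put_contents($source, "The same text\n");
$knownHash = normalizedFileHash($source);

// Done per comparison: only the new file has to be read and hashed.
$upload = tempnam(sys_get_temp_dir(), 'upl');
file_put_contents($upload, "the  same TEXT");
$isSame = ($knownHash === normalizedFileHash($upload));
echo $isSame ? 'true' : 'false'; // prints: true

unlink($source);
unlink($upload);
```

The stored hash is only valid for the normalization it was computed with, so changing the rules means re-hashing the known file.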
UPDATE
Here is an example:
echo hash('sha1', 'The same text') === hash('sha1', 'the same text') ? 'true' : 'false';
The output will be false.
However, if you do this:
echo hash('sha1', strtolower('The same text')) === hash('sha1', strtolower('the same text')) ? 'true' : 'false';
The output will be true.
A small amount of text is no different than a large amount. The difference between the two pieces of code above is that I normalized one and not the other.
UPDATE1
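"OK. Do you know software like Typing Tutor, which gives typing tests? There is one fixed paragraph, and the user types that paragraph into a text box with the same formatting."
For something like that, you can check which words of the fixed text appear in what the user typed: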
$old = 'The same text';
$arr_old = explode(' ', $old); // the words of the fixed text
$new = 'the same text';
// One capture group per word: /\b(The)\b|\b(same)\b|\b(text)\b/
$pattern = '/\b('.implode(')\b|\b(', array_map('preg_quote', $arr_old)).')\b/';
preg_match_all($pattern, $new, $matches);
print_r($matches);
Output
Array
(
    [0] => Array
        (
            [0] => same
            [1] => text
        )

    [1] => Array
        (
            [0] =>
            [1] =>
        )

    [2] => Array
        (
            [0] => same
            [1] =>
        )

    [3] => Array
        (
            [0] =>
            [1] => text
        )

)
It's important to mention that the index of the match group, minus 1, is the index of the word. For example, in the output above there is no match in $matches[1]. That group corresponds to "The", which is the first item in $arr_old = explode(' ', $old); or [0=>'The', 1=>'same', 2=>'text']. Because the match groups are 1-based ($matches[0] holds the full matches) and the array is 0-based, you have to subtract 1.
PS: to check these I would do something like
$len = count($matches);
for ($i = 1; $i < $len; $i++) {
    // Group $i belongs to word $i - 1; array_filter() drops the empty placeholders
    if (!empty(array_filter($matches[$i]))) echo "match ".$arr_old[$i-1]."\n";
}
Output:
match same
match text
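For a typing test you probably also want to know which words were not matched. Here is a possible sketch that wraps the steps above in one function; compareWords is a hypothetical name, and it assumes the fixed text is separated by single spaces and does not repeat a word (a repeated word would only ever match its first group).

```php
<?php
// Hypothetical helper: split the fixed text's words into those found
// in the typed text and those that are missing.
function compareWords(string $reference, string $typed): array
{
    $words = explode(' ', $reference);
    // Quote each word for use inside a /.../ pattern
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $words);
    $pattern = '/\b(' . implode(')\b|\b(', $quoted) . ')\b/';
    preg_match_all($pattern, $typed, $matches);

    $result = ['matched' => [], 'missing' => []];
    foreach ($words as $i => $word) {
        // Group $i + 1 belongs to word $i; keep only non-empty captures
        $hits = array_filter($matches[$i + 1] ?? [], 'strlen');
        $result[$hits ? 'matched' : 'missing'][] = $word;
    }
    return $result;
}

$report = compareWords('The same text', 'the same text');
// $report['matched'] is ['same', 'text'], $report['missing'] is ['The']
print_r($report);
```

The match is case-sensitive, like the original pattern, so "The" is reported as missing; adding the i flag to the pattern or normalizing both strings first would change that.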
I hope that helps.