dongqiancui9194 2018-10-13 03:52

Programmatically compare two Word, Excel, or PowerPoint documents using PHP or JavaScript

The following are some requirements for my new project.

The admin will upload a file in MS Word 2007, MS Excel 2007, or MS PowerPoint 2007 format.

Let's say the admin has uploaded a file named demo1.docx.

Now demo1.docx is the master file.

Other users will then upload their own files, such as demo2.docx, demo3.docx, etc.

I want to compare demo2.docx and demo3.docx with the master file demo1.docx.

Files uploaded by other users must be copies of the master file. I mean the number of characters, the text, and the formatting have to be the same as in the master file.

If it is an Excel file, then the number of sheets and the number of filled cells have to be the same, and the same applies to PowerPoint files.

I want to do this using PHP or JavaScript.

So can you please tell me whether this is possible? If it is, please suggest some ways to accomplish this task.

Thanks in advance.


1 answer

  • donglin7383 2018-10-13 05:43

    To match them byte for byte, the most efficient way is:

    if (hash_file('sha1', $pathToFile1) === hash_file('sha1', $pathToFile2)) { /* the files are identical */ }
    

    If that's too exact, you could strip whitespace first. That works for text files, but not for binary files such as docx or xlsx.

    if (hash('sha1', preg_replace('/\s+/', '', file_get_contents($pathToFile1)))
        === hash('sha1', preg_replace('/\s+/', '', file_get_contents($pathToFile2)))) { /* same text, ignoring whitespace */ }
    

    Or something like that to normalize the text. For binary file types, you will first have to use a library for that file type to convert the files to text.

    In other words, you will have to come up with some way to normalize the text contents of the files, such as uppercasing everything and removing spaces or other acceptable differences.

    Normalizing is a fancy way of saying "removing the differences". A simple example is this:

    Some text
    

    Now, is that the same as Some text.? Or Some Text, or some Text? That depends. But after normalizing, they may all look like sometext, with no punctuation, spaces, or casing. It's up to you to decide how you normalize them.
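
    As a minimal sketch of the normalization described above: the function name normalizeText is made up for illustration, and the rules (lowercase, then drop whitespace and ASCII punctuation) are just one possible choice. Note that strtolower only reliably handles ASCII letters.

```php
<?php
// Hypothetical normalizer: lowercase the text, then remove whitespace
// and ASCII punctuation so that only the "significant" characters remain.
function normalizeText(string $text): string
{
    $lower = strtolower($text);
    // [:punct:] covers ASCII punctuation; \s covers spaces, tabs, and newlines.
    return preg_replace('/[\s[:punct:]]+/', '', $lower) ?? '';
}
```

    With these rules, Some text, Some text., Some Text, and some Text all normalize to sometext, so their hashes are equal.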

    Because you mention binary formats, I can't help you much there: you will need to find a way to open them in PHP, which will require some third-party libraries.
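
    As a rough starting point for .docx only, and assuming PHP's zip extension (the ZipArchive class) is installed: a .docx file is a ZIP archive, and the document body is normally stored in word/document.xml. The helper name docxToText is hypothetical. This sketch throws away all formatting, so comparing formatting, or handling xlsx and pptx files, would still need a dedicated library.

```php
<?php
// Hypothetical helper: pull the plain text out of a .docx file.
// Assumes the zip extension is loaded and the body is at word/document.xml.
function docxToText(string $path): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null; // not a readable ZIP archive
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return null; // no main document part at the expected location
    }
    // End each paragraph with a newline so paragraphs do not run together.
    $xml = str_replace('</w:p>', "</w:p>\n", $xml);
    // Drop the XML tags, then decode entities such as &amp;.
    return html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
}
```

    The returned text can then be normalized and hashed like any other string.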

    Your question is very broad, so I can only give you a broad overview of how to do it.

    Hashing is nice because it takes a file of any size and reduces it to 40 hexadecimal characters (in the case of sha1), which is a lot easier to store in a DB or to visualize. I mention the DB because you can cut the work in half by normalizing and hashing the known file (the source file) ahead of time. This reduces the overall cost of comparing them.
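
    A sketch of that pre-hashing idea, with a plain array standing in for the DB table. The names fingerprint, matchesMaster, and $storedHashes are hypothetical, and the normalization (remove whitespace, lowercase) is only an example.

```php
<?php
// Hypothetical helper: normalize the text, then hash it.
function fingerprint(string $text): string
{
    $normalized = strtolower(preg_replace('/\s+/', '', $text) ?? '');
    return hash('sha1', $normalized);
}

// Done once, when the admin uploads the master file.
// A plain array stands in for the database table here.
$storedHashes = ['demo1' => fingerprint('The same text')];

// Done for every user upload: only the new text is normalized and hashed,
// then compared against the stored 40-character digest.
function matchesMaster(array $storedHashes, string $masterId, string $uploadedText): bool
{
    return isset($storedHashes[$masterId])
        && hash_equals($storedHashes[$masterId], fingerprint($uploadedText));
}
```

    Only the 40-character digest of the master needs to be stored, and the master file itself is never read again during a comparison.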

    UPDATE

    Here is an example:

    echo hash('sha1', 'The same text') === hash('sha1', 'the same text') ? 'true' : 'false';
    

    The output will be false. However, if you do this:

    echo hash('sha1', strtolower('The same text')) === hash('sha1', strtolower('the same text')) ? 'true' : 'false';

    The output will be true.


    A small amount of text is no different than a large amount. The difference between the two pieces of code above is that I normalized one and not the other.

    UPDATE1

    "OK. Do you know software like Typing Tutor, which gives typing tests? There is one fixed paragraph, and the user types that paragraph into a text box with the same formatting."

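    $old = 'The same text';
    $arr_old = explode(' ', $old);
    $new = 'the same text';

    // Quote each word for use in a regex, escaping the "/" delimiter as well.
    $quoted = array_map(function ($word) { return preg_quote($word, '/'); }, $arr_old);
    // One capture group per word: /\b(The)\b|\b(same)\b|\b(text)\b/
    $pattern = '/\b('.implode(')\b|\b(', $quoted).')\b/';

    preg_match_all($pattern, $new, $matches);

    print_r($matches);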
    

    Output

    Array
    (
        [0] => Array
            (
                [0] => same
                [1] => text
            )
    
        [1] => Array
            (
                [0] => 
                [1] => 
            )
    
        [2] => Array
            (
                [0] => same
                [1] => 
            )
    
        [3] => Array
            (
                [0] => 
                [1] => text
            )
    
    ) 
    

    It's important to mention that the index of a match group, minus 1, is the index of the corresponding word. For example, in the output above $matches[1] contains no match. It corresponds to The, which is the first item in $arr_old = explode(' ', $old); or [0=>'The', 1=>'same', 2=>'text']. Because the match groups are 1-based and the array is 0-based, you have to subtract 1.

    PS: to check these, I would do something like this:

    $len = count($matches);
    for ($i = 1; $i < $len; $i++) {
        // Group $i belongs to word $i - 1; any non-empty entry means the word matched.
        if (!empty(array_filter($matches[$i]))) echo "match ".$arr_old[$i-1]."\n";
    }
    

    Output:

    match same
    match text
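
    For a typing-test style check, it may be more convenient to get the matched and the missing words back as arrays. The sketch below uses the same word-boundary matching, one word at a time; the function name compareWords and the shape of the returned array are assumptions for illustration only.

```php
<?php
// Hypothetical helper: split the master text into the words that appear
// unchanged (case-sensitively) in the submission and the words that do not.
function compareWords(string $master, string $submission): array
{
    // Split on whitespace and ignore empty pieces and repeated words.
    $words = array_values(array_unique(
        preg_split('/\s+/', $master, -1, PREG_SPLIT_NO_EMPTY)
    ));
    $result = ['matched' => [], 'missing' => []];
    foreach ($words as $word) {
        // Same idea as above: the quoted word between word boundaries.
        $pattern = '/\b' . preg_quote($word, '/') . '\b/';
        $key = preg_match($pattern, $submission) === 1 ? 'matched' : 'missing';
        $result[$key][] = $word;
    }
    return $result;
}
```

    For the example strings above, compareWords('The same text', 'the same text') reports same and text as matched and The as missing, because the comparison is case-sensitive.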
    


    I hope that helps.

    This answer was selected by the asker as the best answer.
