duanjiancong4860
2012-06-19 14:38

Converting an HTML table to text

I'm working on a project that requires converting HTML email into text. Below is a simplified version of the HTML code:

<table>
    <tr>
        <td width="10%"></td>
        <td width="60%"> test product </td>
        <td width="20%">5</td>
        <td width="10%"> £50.00 </td>
    </tr>
    <tr>
        <td></td>
        <td colspan="3" width="100%"> Project Name: Test Project </td>
    </tr>
    <tr>
        <td width="10%"> </td>
        <td colspan="2" width="80%"> Page 1 : 01 New York 1.jpg </td>
        <td width="10%"> £0.00 </td>
    </tr>
</table>

The expected outcome should look like this in a text file (with columns aligned nicely):

test product                                      5            £50.00
Project Name: Test Project                                                            
Page 1 :  01 New York 1.jpg                                    £0.00

My idea is to parse the HTML content with DOMDocument, set a default width for the table (e.g. 100 characters), and convert the width of each column from a percentage to a number of characters (based on the colspan and width attributes of the <td> tag). Then I would subtract the strlen of the data in each cell from its column width to get the number of spaces I need to pad the string with on the right, so that everything aligns vertically.
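A minimal sketch of that idea, under stated assumptions: the first row has no colspan and gives every cell a percentage width, so it can define the column grid, and later cells take the combined width of the grid columns they span. The function name `tableToText` and the `$totalWidth` parameter are made up for illustration. Because `str_pad()` counts bytes, this version is only exact for ASCII text.

```php
<?php
// Hypothetical helper: renders a simple table as fixed-width text.
// Assumes the first row has no colspan and every cell in it has a percentage width.
function tableToText(string $html, int $totalWidth = 100): string
{
    $doc = new DOMDocument();
    // The XML prolog is a common trick to make loadHTML() read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html, LIBXML_NOERROR | LIBXML_NOWARNING);
    $rows = $doc->getElementsByTagName('tr');

    // Build the column grid (in characters) from the first row's percentages.
    $grid = [];
    foreach ($rows->item(0)->getElementsByTagName('td') as $td) {
        $grid[] = (int) round($totalWidth * (int) $td->getAttribute('width') / 100);
    }

    $lines = [];
    foreach ($rows as $tr) {
        $line = '';
        $col = 0;
        foreach ($tr->getElementsByTagName('td') as $td) {
            // A missing colspan attribute casts to 0, so treat it as 1.
            $span = max(1, (int) $td->getAttribute('colspan'));
            // A cell is as wide as the grid columns it spans.
            $width = array_sum(array_slice($grid, $col, $span));
            $col += $span;
            // Collapse runs of whitespace, as a browser would.
            $text = trim(preg_replace('/\s+/u', ' ', $td->textContent));
            $line .= str_pad($text, $width);
        }
        $lines[] = rtrim($line);
    }
    return implode("\n", $lines);
}
```

Deriving the grid from one colspan-free row sidesteps inconsistent width attributes on spanning cells, such as the `width="100%"` on the `colspan="3"` cell above.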

I have been working along these lines but haven't achieved what I want yet. I'm wondering whether this approach is misguided, and if anyone knows a better way, please help me out.

Also, when it comes to multibyte languages (Japanese, Korean, etc.), I don't think my approach would work, because their characters are wider than one space and the result would end up a mess.
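One possible way around the width problem, assuming the mbstring extension is available, is to measure display columns with `mb_strwidth()` instead of bytes with `strlen()`. The helper name `padToWidth` is hypothetical. This is only a sketch: `mb_strwidth()` counts East Asian wide and fullwidth characters as two columns, which matches most monospaced terminals but not necessarily every font.

```php
<?php
// Hypothetical helper: right-pad $text with spaces to $width display columns.
function padToWidth(string $text, int $width): string
{
    // mb_strwidth() counts wide characters (e.g. kanji) as 2 columns, others as 1.
    $missing = $width - mb_strwidth($text, 'UTF-8');
    // Text that is already too wide is returned unchanged rather than truncated.
    return $text . str_repeat(' ', max(0, $missing));
}
```

Unlike `str_pad()`, this also pads a two-byte character such as `£` correctly, since it occupies a single column.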

Can someone help me out please?


1 answer

  • douxian4888 2012-06-19 15:02
    Accepted

    Don't reinvent the wheel. Rendering tables is difficult, and rendering them using only text is even more difficult. To get a sense of how complex a text-based table renderer with all the features of HTML is, take a look at w3m, which is open source: it has about 3000 lines of code that exist only to display HTML tables.

    Transform HTML to Text

    There are text-based browsers that can be run from the command line, such as lynx. You could fwrite your HTML table to a file, pass that file to the text-based browser, and capture its output.

    Note: text-based browsers are generally used in a shell, which generally displays text in a monospaced font. The same is required wherever you show the output, or the columns will not line up.

    lynx and w3m are both available for Windows, and you don't need to "install" them: you just need the executables and permission to run them from PHP.

    code example:

    <?php
    $table = '<table><tr><td>foo</td><td>bar</td></tr></table>'; // this contains your table
    $html = "<html><body>$table</body></html>";
    
    // Write the HTML to a temporary file.
    $tmpfname = tempnam(sys_get_temp_dir(), "tblemail");
    file_put_contents($tmpfname, $html);
    
    // The temp file has no .html extension, so tell w3m the content type explicitly.
    // escapeshellarg() quotes the path safely for the shell.
    $myTextTable = shell_exec("w3m.exe -dump -T text/html " . escapeshellarg($tmpfname));
    unlink($tmpfname);
    

    w3m.exe needs to be in your working directory.

    (I haven't tested this.)
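As a variation that avoids the temporary file, the HTML could be piped to the browser's standard input with `proc_open()`. This is only a sketch: the helper name `pipeThrough` is made up, and it assumes a `w3m` binary on the PATH that reads HTML from stdin when given `-T text/html`. Writing all input before reading any output is fine for a small table, but could block on very large documents.

```php
<?php
// Hypothetical helper: feed $input to a command's stdin and return its stdout.
function pipeThrough(string $command, string $input): string
{
    $spec = [
        0 => ['pipe', 'r'], // the child's stdin, which we write to
        1 => ['pipe', 'w'], // the child's stdout, which we read from
    ];
    $process = proc_open($command, $spec, $pipes);
    if (!is_resource($process)) {
        throw new RuntimeException("Could not start: $command");
    }
    fwrite($pipes[0], $input);
    fclose($pipes[0]); // closing stdin signals end of input to the child
    $output = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    proc_close($process);
    return $output;
}

// Assumed usage, with $html holding the full document:
// $myTextTable = pipeThrough('w3m -dump -T text/html', $html);
```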

    Render a Text table

    If you want a native PHP solution, there is also at least one framework for PHP console applications, CLIFramework (https://github.com/c9s/CLIFramework), which has a table renderer.

    It doesn't transform HTML to text, but it helps you build a text-formatted table with support for multiline cells (which seems to be the most complicated part).

    Using CLIFramework, you would need code like this to render your table:

    <?php
    require 'vendor/autoload.php';
    use CLIFramework\Component\Table\Table;
    
    $table = new Table;
    $table->addRow(array( 
        "test product", "5", "£50.00"
    ));
    $table->addRow(array( 
        "Project Name: Test Project", "", ""
    ));
    $table->addRow(array( 
        "Page 1 : 01 New York 1.jpg", "", "£0.00"
    ));
    
    $myTextTable = $table->render();
    

    However, the CLIFramework table renderer doesn't seem to support anything similar to "colspan".

    Here's the documentation for the table component: https://github.com/c9s/CLIFramework/wiki/Using-Table-Component
