dongou0524 2014-11-24 12:40

How to extract untagged text from an HTML string using PHP

Consider the following HTML code:

<strong>title</strong>
Hello World
<strong>Sub-Title</strong>
<div>This is just stuff</div>

How can I clean up the string so that it returns only the text that is not inside any tag, i.e. 'Hello World'? I presume this can be done with DOM. I would prefer a non-regex answer if anyone has one, and one that does not use JavaScript or jQuery.

[EDIT] The markup it fails on:

<span style="color: #677b8d"><strong>Short Description</strong><br/>Microsoft Office Home and Business 2013-Word, Excel, PowerPoint, OneNote and Outlook(Does not include Publisher or Access), DSP , No Warranty on Software <br/><br/><strong>Long description<br/></strong><div>Microsoft Office Home and Business 2013 32-bit/x64 DSP No Warranty on Software </div> <font face="Arial"> <div><br/><strong>Product Overview </strong> <div><font face="Arial">The New Microsoft Office Home &amp; Business 2013 is designed to help you create and communicate faster with new, time-saving features and a clean, modern look. Plus, you can save your documents in the cloud on SkyDrive and access them virtually anywhere. The latest versions of Word, Excel, PowerPoint, OneNote plus Outlook on 1 PC.</font></div> </div> <div><strong><br/> Features<br/></strong><font face="Arial">•One time purchase for the life of your PC; non-transferrable.<br/> •Office on one PC for business and household use.<br/> •The latest versions of Word, Excel, PowerPoint, OneNote, and Outlook.<br/> •7 GB of online storage in SkyDrive.<br/> •Free Office Web Apps* for accessing, editing, and sharing documents.<br/> •An improved user interface optimized for touch, pen, and keyboard.</font> <div> </div> <div><font face="Arial"><strong>Specifications<br/></strong>Operating System Windows <br/> Office/Productivity Software Office Suites &amp; Tools <br/> Purchase Method Boxed <br/> Users/Devices per License 1-User <br/></font></div> </div> <div><font face="Arial"><strong>System Requirements:<br/></strong>Computer and Processor 1 GHz or faster x86 or 64-bit processor with SSE2 instruction set</font></div> <div> <p><font face="Arial"><strong>Memory<br/></strong>1 GB RAM (32-bit); 2 GB RAM (64-bit) recommended for graphics features, Outlook Instant Search, and certain advanced functionality**</font></p> <p><font face="Arial"><strong>Hard Disk<br/></strong>3.0 GB available disk space</font></p> <p><font 
face="Arial"><strong>Display<br/></strong>1366 x 768 resolution</font></p> <p><font face="Arial"><strong>Operating System<br/></strong>Windows® 7, Windows 8, Windows Server 2008 R2 with .NET 3.5 or later</font></p> <p><font face="Arial"><strong>Graphics<br/></strong>Graphics hardware acceleration requires a DirectX10 graphics card</font></p> <p><font face="Arial"><strong>Additional Requirements<br/></strong>Internet connection. Fees may apply.</font></p> <p><font face="Arial">Microsoft Internet Explorer 8, 9, or 10; Mozilla Firefox 10.x or a later version; Apple Safari 5; or Google Chrome 17.x.</font></p> <p><font face="Arial">A touch-enabled device is required to use any multi-touch functionality. However, all features and functionality are always available by using a keyboard, mouse, or other standard or accessible input device. New touch features are optimized for use with Windows 8.</font></p> <p><font face="Arial">Information Rights Management features require access to a Windows 2003 Server with SP1 or later running Windows Rights Management Services.</font></p> <p><font face="Arial">Microsoft and Skype accounts.</font></p> <p><font face="Arial"><strong>Other<br/></strong>Product functionality and graphics may vary based on your system configuration. Some features may require additional or advanced hardware or server connectivity.</font></p> <p><font face="Arial">*An appropriate device, Internet connection and Internet Explorer, Firefox or Safari browser are required.<br/> **512 MB RAM recommended for accessing Outlook data files larger than 1GB<br/></font></p> </div> </font></span>

1 answer

  • dongxinche1264 2014-11-24 12:59

    I would suggest wrapping the markup in a container tag that is guaranteed not to occur in the markup itself, for example:

    $a = "<body><strong>title</strong>
    Hello World
    <strong>Sub-Title</strong>
    <div>This is just stuff</div></body>";
    

    Then use DOM:

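    $doc = new DOMDocument();
    $doc->loadHTML($a);
    $xpath = new DOMXPath($doc);
    // Select text nodes that are direct children of <body>, skipping whitespace-only ones.
    // The XPath string literal uses double quotes so it does not end the PHP string early.
    $textnodes = $xpath->evaluate('//body/text()[not(normalize-space() = "")]');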
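
To see what the predicate contributes, the following sketch runs the query with and without the `normalize-space()` filter on the same wrapped markup. The variable names are illustrative only, and it assumes PHP's DOM extension is enabled.

```php
<?php
// Same wrapped markup as in the example above.
$html = "<body><strong>title</strong>\nHello World\n<strong>Sub-Title</strong>\n<div>This is just stuff</div></body>";

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// text() matches every text node that is a direct child of <body>,
// including any that hold nothing but line breaks or spaces.
$allText = $xpath->query('//body/text()');

// normalize-space() trims and collapses whitespace, so the predicate
// is false for whitespace-only nodes and they are dropped.
$nonBlank = $xpath->query('//body/text()[normalize-space() != ""]');

foreach ($nonBlank as $node) {
    // json_encode makes the surrounding line breaks visible.
    echo json_encode($node->nodeValue), PHP_EOL;
}
```

The matched node still carries its leading and trailing line breaks, so `trim()` is usually applied to `nodeValue` before the text is used.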
    

    Now you can loop over the matched text nodes and read whatever you need from them:

    foreach( $textnodes as $el ) {
      print_r($el);
    }
    
    /*
    DOMText Object
    (
        [wholeText] => 
    Hello World
    
        [data] => 
    Hello World
    
        [length] => 13
        [nodeName] => #text
        [nodeValue] => 
    Hello World
    
        [nodeType] => 3
        [parentNode] => (object value omitted)
        [childNodes] => 
        [firstChild] => 
        [lastChild] => 
        [previousSibling] => (object value omitted)
        [nextSibling] => (object value omitted)
        [attributes] => 
        [ownerDocument] => (object value omitted)
        [namespaceURI] => 
        [prefix] => 
        [localName] => 
        [baseURI] => 
        [textContent] => 
    Hello World
    
    )
    */
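
Putting the three steps together, a minimal self-contained sketch might look like the following. The function name `extractUntaggedText` and the wrapper id `untagged-root` are hypothetical choices made for this illustration and are not part of the answer above. A `<div>` wrapper is used in place of `<body>`, and libxml warnings are silenced on the assumption that real-world markup may be malformed.

```php
<?php
// Hypothetical helper: returns the trimmed text of every text node that is
// a direct child of the wrapper, i.e. text not enclosed in any tag.
function extractUntaggedText(string $fragment): array
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    // The wrapper id is an assumption; it must not occur in the fragment.
    $doc->loadHTML('<div id="untagged-root">' . $fragment . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Direct text children of the wrapper, ignoring whitespace-only nodes.
    $nodes = $xpath->query('//div[@id="untagged-root"]/text()[normalize-space() != ""]');

    $result = [];
    foreach ($nodes as $node) {
        $result[] = trim($node->nodeValue);
    }
    return $result;
}

$html = "<strong>title</strong>\nHello World\n<strong>Sub-Title</strong>\n<div>This is just stuff</div>";
echo implode(PHP_EOL, extractUntaggedText($html)), PHP_EOL; // prints "Hello World"
```

This only returns text that is a direct child of the wrapper. In the markup from the question's edit everything is nested inside an outer `<span>`, which is presumably why the query finds nothing there; the path would have to point at that element's text children instead, depending on which text is actually wanted.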
    
    This answer was accepted by the asker as the best answer.
