dououde4065 2015-11-03 11:59

PHP html_entity_decode and trim confusion

I'm trying to use strip_tags and trim to detect whether a string contains only empty HTML, but the result is not an empty string:

$description = '<p>&nbsp;</p>';

$output = trim(strip_tags(html_entity_decode($description, ENT_QUOTES, 'UTF-8')));

var_dump($output);

string 'Â ' (length=2)

My attempt to debug this:

$description = '<p>&nbsp;</p>';

$test = mb_detect_encoding($description);
$test .= "\n";
$test .= trim(strip_tags(html_entity_decode($description, ENT_QUOTES, 'UTF-8')));
$test .= "\n";
$test .= html_entity_decode($description, ENT_QUOTES, 'UTF-8');

file_put_contents('debug.txt', $test);

Output: debug.txt

ASCII
 
<p> </p>

1 answer

  • duangu1033 2015-11-03 12:07

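    If you use var_dump(urlencode($output)) you'll see that it outputs string(6) "%C2%A0", so the string consists of the two bytes 0xC2 and 0xA0. Together these bytes are the UTF-8 encoding of U+00A0, the non-breaking space. By default trim() only strips ASCII whitespace, so it leaves them in place. The "Â " in your var_dump is most likely those UTF-8 bytes being displayed as ISO-8859-1, so make sure your file is saved as UTF-8 and that your HTTP headers declare UTF-8.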
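
    To see the bytes directly, a minimal sketch along these lines may help. It decodes only the entity from the question and inspects the result; the variable name $decoded is just for illustration.

    ```php
    <?php
    // Decode the entity with the same flags as in the question.
    $decoded = html_entity_decode('&nbsp;', ENT_QUOTES, 'UTF-8');

    // The result is the two-byte UTF-8 sequence for U+00A0.
    var_dump(bin2hex($decoded)); // string(4) "c2a0"
    var_dump(strlen($decoded));  // int(2)

    // Without a second argument, trim() strips only " ", "\t", "\n", "\r",
    // "\0" and "\x0B", so neither byte is removed.
    var_dump(trim($decoded) === $decoded); // bool(true)
    ```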

    That said, to trim this character you can use regex with the unicode modifier (instead of trim):

    DEMO:

    <?php
    
    $description = '<p>&nbsp;</p>';
    
    $output = trim(strip_tags(html_entity_decode($description, ENT_QUOTES, 'UTF-8')));
    
    var_dump(urlencode($output)); // string(6) "%C2%A0"
    
    // -------
    
    $output = preg_replace('~^\s+|\s+$~', '', strip_tags(html_entity_decode($description, ENT_QUOTES, 'UTF-8')));
    
    var_dump(urlencode($output)); // string(6) "%C2%A0"
    
    // -------
    
    $output = preg_replace('~^\s+|\s+$~u', '', strip_tags(html_entity_decode($description, ENT_QUOTES, 'UTF-8')));
    // Unicode! -----------------------^
    
    var_dump(urlencode($output)); // string(0) ""
    

    Regex autopsy:

    • ~ - the opening regex delimiter - it marks where the pattern starts; a second one closes the pattern, and any modifiers follow it
    • ^\s+ - the start of the string immediately followed by one or more whitespace characters, i.e. leading whitespace (^ means start of the string, \s means a whitespace character, + means "matched one or more times")
    • | - OR
    • \s+$ - one or more whitespace characters immediately followed by the end of the string, i.e. trailing whitespace
    • ~ - the closing regex delimiter
    • u - the pattern modifier - here the unicode modifier (PCRE_UTF8), which makes the pattern and the subject be treated as UTF-8 so that \s also matches unicode whitespace such as the non-breaking space.
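
    Putting this together for the original goal (detecting empty HTML), a hedged sketch could look like the following. The function name is_blank_html is hypothetical, and the pattern names U+00A0 explicitly next to \s as a defensive assumption, so the check does not depend on how \s behaves under the u modifier in a particular PHP build.

    ```php
    <?php
    /**
     * Hypothetical helper, not part of the original answer: returns true when
     * nothing but whitespace is left after decoding entities and stripping tags.
     */
    function is_blank_html(string $html): bool
    {
        // Same order of operations as in the question.
        $text = strip_tags(html_entity_decode($html, ENT_QUOTES, 'UTF-8'));

        // Remove leading and trailing whitespace, including U+00A0.
        $text = preg_replace('~^[\s\x{00A0}]+|[\s\x{00A0}]+$~u', '', $text);

        // preg_replace() returns null when the subject is not valid UTF-8;
        // such input is reported as "not blank" here.
        return $text === '';
    }

    var_dump(is_blank_html('<p>&nbsp;</p>'));       // bool(true)
    var_dump(is_blank_html('<p> &nbsp; <br></p>')); // bool(true)
    var_dump(is_blank_html('<p>Hello&nbsp;</p>'));  // bool(false)
    ```

    Passing "\xC2\xA0" as the second argument to trim() is sometimes suggested instead, but trim() works byte by byte, so it can also cut a trailing 0xA0 byte out of another multi-byte character such as "à" (0xC3 0xA0). The regex with the u modifier avoids that.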