dpr26232 2015-05-11 15:59

Removing HTML and malicious code in PHP while leaving punctuation and foreign-language characters

function stripAlpha( $item )
{
    $search     = array( 
         '@<script[^>]*?>.*?</script>@si'   // Strip out javascript 
        ,'@<style[^>]*?>.*?</style>@siU'    // Strip style tags properly 
        ,'@<[\/\!]*?[^<>]*?>@si'            // Strip out HTML tags
        ,'@<![\s\S]*?--[ \t\n\r]*>@'         // Strip multi-line comments including CDATA
        ,'/\s{2,}/'
        ,'/(\s){2,}/'
    );
    $pattern    = array(
         '#[^a-zA-Z ]#'                     // Non alpha characters
        ,'/\s+/'                            // More than one whitespace
    );
    $replace    = array(
         ''
        ,' '
    );
    $item = preg_replace( $search, '', html_entity_decode( $item ) );
    $item = trim( preg_replace( $pattern, $replace, strip_tags( $item ) ) );

    return $item;
}

One person suggested replacing this entire script with a one-liner:

$clear = preg_replace('/[^A-Za-z0-9\-]/', '', urldecode($_GET['id']));

but that gives an error on the $_GET lookup (unknown variable id).

What I'm looking for is the simplest script that removes all HTML code and weird characters, replaces carriage returns with spaces, and leaves punctuation such as periods, commas, and exclamation points.

There are a lot of similar questions, but none seems to really answer this one: those scripts strip away all characters, including sentence punctuation and foreign-language characters such as Arabic or Spanish.

For example, if the string contains www.mygreatwebsite.com

the cleaner script returns wwwmygreatwebsitecom, which looks weird.

If someone is excited about something, as in 'Hey this is a great website! ', it also removes the exclamation point.

All the similar questions I've looked up remove all of these characters.

I'd like to leave in the punctuation and any foreign-language characters, using one simple regex command that clears out all the stuff people paste into forms.

Naturally, carriage returns would be replaced by spaces.

Any suggestions?


  • drpph80800 2015-05-11 16:03

    Removing all HTML tags is easy: use strip_tags.

    $text = strip_tags($html);
    

    But this works only if the string doesn't contain CSS or JavaScript code, because strip_tags removes the tags but keeps the text between them.

    A better way to deal with this problem is to use DOMDocument and XPath to find all text nodes that don't have a style or script element as an ancestor:
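
    As a quick illustration of that limitation (the sample markup is made up for this sketch), strip_tags drops the script and style tags themselves but leaves their contents in the output:

    ```php
    <?php
    // Hypothetical input: a paragraph followed by inline JavaScript and CSS.
    $html = '<p>Hello!</p><script>alert(1)</script><style>p{color:red}</style>';

    // strip_tags() removes the tags only; the code between them survives as text.
    $text = strip_tags($html);

    echo $text, PHP_EOL;
    // prints: Hello!alert(1)p{color:red}
    ```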

    $dom = new DOMDocument;
    $dom->loadHTML($html);
    
    $xp = new DOMXPath($dom);
    
    $textNodeList = $xp->query('//text()[not(ancestor::script) and not(ancestor::style)]');
    
    $text = '';
    
    foreach($textNodeList as $textNode) {
        $text .= ' '. $textNode->nodeValue;
    }
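
    One caveat matters for the foreign-language requirement: loadHTML does not assume UTF-8 when the markup declares no encoding, so accented or Arabic text can come out garbled. Below is a minimal sketch of the same extraction wrapped in a function. The name extractVisibleText, the XML-declaration prefix used as an encoding hint, and the libxml_use_internal_errors calls are additions for illustration, not part of the snippet above, and the input is assumed to be UTF-8.

    ```php
    <?php
    // Hypothetical helper: returns the text of an HTML string, skipping
    // anything inside <script> or <style>. Assumes $html is UTF-8.
    function extractVisibleText($html)
    {
        // Collect parser warnings about sloppy markup instead of printing them.
        $previous = libxml_use_internal_errors(true);

        $dom = new DOMDocument;
        // The prefix is a common workaround that hints the input encoding to libxml.
        $dom->loadHTML('<?xml encoding="UTF-8">' . $html);

        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $xp = new DOMXPath($dom);
        $nodes = $xp->query('//text()[not(ancestor::script) and not(ancestor::style)]');

        $text = '';
        foreach ($nodes as $node) {
            $text .= ' ' . $node->nodeValue;
        }

        return trim($text);
    }

    echo extractVisibleText('<p>¡Hola, señor!</p><script>alert(1)</script>'), PHP_EOL;
    ```

    Silencing libxml warnings is a design choice for pasted form input, which is rarely valid HTML; drop those calls if you want to see the parser's complaints.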
    

    To replace every run of weird characters and whitespace (everything except punctuation, letters, and numbers) with a single space:

    $text = preg_replace('~[^\pP\pL\pN]+~u', ' ', $text);
    

    Here \pP is the character class for punctuation characters, \pL for letters, and \pN for numbers. (To be more precise about the characters you want to preserve, take a look at the available character classes in the PHP manual; search for "Unicode character properties".)

    Finally, you can trim the text:
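
    To make the effect concrete, here is a small sketch that runs the pattern over the kinds of input mentioned in the question. The sample strings and the cleanText name are made up for illustration, and valid UTF-8 input is assumed. Note that currency and math signs such as $ and + fall in the symbol category \pS, and combining accents in the mark category \pM, so this pattern replaces them too; adding \pS or \pM to the class is one possible adjustment if you want to keep them.

    ```php
    <?php
    // Hypothetical wrapper around the pattern above.
    function cleanText($text)
    {
        // Each run of characters that are not punctuation (\pP), letters (\pL)
        // or numbers (\pN) becomes one space; the u modifier treats the
        // subject as UTF-8.
        return trim(preg_replace('~[^\pP\pL\pN]+~u', ' ', $text));
    }

    // Dots and exclamation points survive; the line break becomes a space.
    echo cleanText("Hey this is a\r\ngreat website!  www.mygreatwebsite.com"), PHP_EOL;
    // prints: Hey this is a great website! www.mygreatwebsite.com

    // Spanish and Arabic letters and punctuation are kept.
    echo cleanText("¡Qué bien! مرحبا، العالم"), PHP_EOL;

    // Symbols are not punctuation, so "$" and "+" are replaced.
    echo cleanText('Price: $5 + tax'), PHP_EOL;
    // prints: Price: 5 tax
    ```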

    $text = trim($text);
    
    This answer was selected by the asker as the best answer.