Remove HTML and malicious code in PHP, leaving punctuation and foreign-language characters

function stripAlpha( $item )
{
    $search     = array( 
         '@<script[^>]*?>.*?</script>@si'   // Strip out javascript 
        ,'@<style[^>]*?>.*?</style>@siU'    // Strip style tags properly 
        ,'@<[\/\!]*?[^<>]*?>@si'            // Strip out HTML tags
        ,'@<![\s\S]*?--[ \t\n\r]*>@'         // Strip multi-line comments including CDATA
        ,'/\s{2,}/'
        ,'/(\s){2,}/'
    );
    $pattern    = array(
         '#[^a-zA-Z ]#'                     // Non alpha characters
        ,'/\s+/'                            // More than one whitespace
    );
    $replace    = array(
         ''
        ,' '
    );
    $item = preg_replace( $search, '', html_entity_decode( $item ) );
    $item = trim( preg_replace( $pattern, $replace, strip_tags( $item ) ) );

    return $item;
}

One person suggested replacing this entire script with a one-liner:

$clear = preg_replace('/[^A-Za-z0-9\-]/', '', urldecode($_GET['id']));

but that gives an error on the $_GET lookup: unknown variable ID.

What I'm looking for is the simplest script that removes all HTML code and weird characters, replaces carriage returns with spaces, and leaves punctuation such as periods, commas, and exclamation points.

There are a lot of similar questions, but none seems to really answer this one, and the scripts they offer strip away everything, including sentence punctuation and foreign-language characters such as Arabic or Spanish.

For example, if the string contains www.mygreatwebsite.com

the cleaner script returns wwwmygreatwebsitecom, which looks weird.

If someone is excited about something, like 'Hey this is a great website! ', it also removes the exclamation point.

All the similar questions I've looked up remove all of these characters.

I'd like to keep the punctuation and any foreign-language characters, using one simple regex that clears out all the stuff people paste into forms.

Naturally carriage returns would be replaced by spaces.

Any suggestions?

drpph80800 2015-05-11 16:03
To remove all HTML code, the easy way is strip_tags:

$text = strip_tags($html);

But it gives clean text only if the string doesn't contain CSS or JavaScript code, because strip_tags removes the tags and keeps whatever is between them.
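As a small illustration of that limitation, here is a sketch with a made-up input string (the markup is hypothetical, not taken from the question):

```php
<?php
// Hypothetical input for illustration only.
$html = '<p>Hello</p><script>alert(1)</script><style>p{color:red}</style>';

// strip_tags() removes the tags themselves but keeps the text between them,
// so the JavaScript and CSS source ends up in the result.
$text = strip_tags($html);

echo $text, PHP_EOL;
// Helloalert(1)p{color:red}
```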

So a better way to deal with this problem is to use DOMDocument and XPath to find all text nodes that don't have a style or script tag as an ancestor:

$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$textNodeList = $xp->query('//text()[not(ancestor::script) and not(ancestor::style)]');
$text = '';
foreach ($textNodeList as $textNode) {
    $text .= ' ' . $textNode->nodeValue;
}

To replace weird characters and whitespace, but not punctuation, with a space:

$text = preg_replace('~[^\pP\pL\pN]+~u', ' ', $text);

Here \pP is the character class for punctuation characters, \pL for letters, and \pN for numbers. (To be more precise about the characters you want to preserve, look up the available character classes by searching for "Unicode character properties".)
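To see what the pattern keeps, here is a quick sketch run on a made-up sample string (the sample is an assumption, chosen to include a line break, a symbol, Spanish and Arabic):

```php
<?php
// Hypothetical sample input, assumed to be valid UTF-8 (the u modifier requires it).
$text = "Hey, this is a great website!\r\nwww.mygreatwebsite.com ★ ¿Qué tal? مرحبا";

// Any run of characters that are not punctuation (\pP), letters (\pL)
// or numbers (\pN) collapses into a single space.
$clean = trim(preg_replace('~[^\pP\pL\pN]+~u', ' ', $text));

echo $clean, PHP_EOL;
// Hey, this is a great website! www.mygreatwebsite.com ¿Qué tal? مرحبا
```

Note that characters such as $, +, = and < belong to the symbol class (\pS), not \pP, and combining accents belong to \pM, so both get replaced by a space unless you add those classes to the bracket expression.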

Finally, you can trim the text:

$text = trim($text);
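Put together, the steps above could be wrapped in one function. This is only a sketch: the name cleanText, the UTF-8 assumption, the meta-tag prefix used to tell libxml about the encoding, and the libxml error handling are my additions, not part of the original answer.

```php
<?php
// Hypothetical helper name; the steps mirror the answer above.
function cleanText(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    // Assumption: the input is UTF-8. The meta prefix tells libxml so,
    // since loadHTML() otherwise assumes ISO-8859-1.
    $prefix = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

    // Collect parser warnings internally instead of printing them for sloppy markup.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($prefix . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Text nodes that are not inside <script> or <style>.
    $xp = new DOMXPath($dom);
    $nodes = $xp->query('//text()[not(ancestor::script) and not(ancestor::style)]');

    $text = '';
    foreach ($nodes as $node) {
        $text .= ' ' . $node->nodeValue;
    }

    // Keep punctuation, letters and numbers; everything else becomes one space.
    // preg_replace() returns null on invalid UTF-8, hence the fallback.
    $text = preg_replace('~[^\pP\pL\pN]+~u', ' ', $text) ?? '';

    return trim($text);
}

echo cleanText('<p>Hello, world!</p><script>alert(1)</script><p>www.mygreatwebsite.com</p>'), PHP_EOL;
```

Because a space is added in front of every text node, inline markup such as `<b>world</b>!` comes out as "world !"; if that matters, the separator could be added only between block-level elements instead.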
This answer was selected by the asker as the best answer.