How to replace non-SGML characters in a string using PHP?

I programmed a guestbook using PHP4 and HTML 4.01 (with the charset ISO-8859-15, i.e. Latin-9). The data is saved in a MySQL database with the charset ISO-8859-1 (i.e. Latin-1).

When somebody enters characters from a different charset, it seems that the browsers send the data encoded (actually I have not checked where it gets encoded, ...).

Anyway, in some cases it seems that characters are not saved encoded in the database. As a result, the validator returns an error message when I show the data within an HTML 4.01 document:

non SGML character number 146

You have used an illegal character in your text. HTML uses the standard UNICODE Consortium character repertoire, and it leaves undefined (among others) 65 character codes (0 to 31 inclusive and 127 to 159 inclusive) that are sometimes used for typographical quote marks and similar in proprietary character sets. The validator has found one of these undefined characters in your document. The character may appear on your browser as a curly quote, or a trademark symbol, or some other fancy glyph; on a different computer, however, it will likely appear as a completely different character, or nothing at all.

Your best bet is to replace the character with the nearest equivalent ASCII character, or to use an appropriate character entity. For more information on Character Encoding on the web, see Alan Flavell's excellent HTML Character Set Issues reference.

This error can also be triggered by formatting characters embedded in documents by some word processors. If you use a word processor to edit your HTML documents, be sure to use the "Save as ASCII" or similar command to save the document without formatting information.

I'm now using PHP 5.2.17 and played a bit with htmlspecialchars, but nothing worked. How can I encode those characters so that there are no more validation errors?


dtsps00544 2012-03-16 04:35


In both ISO-8859-1 and ISO-8859-15, character number 146 (0x92) is the control character PU2 (Private Use 2) from the C1 range.

SGML refers to ISO 8859-1 (mind the space between ISO and 8859-1, which is not a hyphen as in the names of the character sets you use). It allows only three control characters (quoting the HTML specification, section "SGML in HTML"):

In the HTML document character set only three control characters are allowed: Horizontal Tab, Carriage Return, and Line Feed (code positions 9, 13, and 10).

You therefore did pass an illegal character. There is no SGML/HTML entity you could replace it with.

I suggest you validate the input that comes into your application so that control characters are not allowed. If you believe those characters originally represented something useful, like a letter that can actually be read (i.e. not a control character), it's likely that the encoding gets broken at some point while you process the data.
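The validation suggested above could be sketched as follows. This is only a minimal illustration, assuming the input is a single-byte string (ISO-8859-1 or ISO-8859-15); the function name containsForbiddenControlChars is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): reports whether a
// single-byte (ISO-8859-1/-15) string contains control characters that the
// HTML 4.01 document character set does not allow.
function containsForbiddenControlChars($input)
{
    // Forbidden: C0 controls except TAB (0x09), LF (0x0A) and CR (0x0D),
    // plus DEL (0x7F) and the whole C1 range (0x80-0x9F).
    // No /u modifier, so the pattern is matched byte by byte.
    return preg_match('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/', $input) === 1;
}

var_dump(containsForbiddenControlChars("Hello\tworld\r\n")); // bool(false)
var_dump(containsForbiddenControlChars("It\x92s fine"));     // bool(true)
```

Whether to reject such input, strip the bytes, or translate them is a separate decision; the check only tells you that something outside the allowed repertoire arrived.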

From the information given in your question it's hard to say where, because you only specify the input encoding and the encoding of the database field, and those two already don't match (which should not produce the issue you're asking about, but can produce other issues). Besides those two places, there are also the database client connection charset, the output encoding and the response content encoding, all of which are unspecified in your question.

It might make sense to change your overall encoding to UTF-8 to support a wider range of characters, but that really is only a "might".

Edit: The part above takes a rather strict view. It occurred to me that the input you receive may not actually be ISO-8859-1(5) but something else, like a Windows code page. I'd guess it is Windows-1252 (cp1252; see Wikipedia). Where ISO-8859-1 has the C1 control range (128-159), Windows-1252 has mostly printable characters.

The Wikipedia page also notes that most browsers treat ISO-8859-1 as Windows-1252 (CP1252). PHP's htmlentities() function is not able to deal with these characters: its translation table for HTML entities does not cover these code points (PHP 5.3, not tested against 5.4). You need to create your own translation table and use it with strtr() to replace the Windows-1252 characters that are not available in ISO 8859-15:

/*
 * mappings of Windows-1252 (cp1252)  128 (0x80) - 159 (0x9F) characters:
 * @link http://en.wikipedia.org/wiki/Windows-1252
 * @link http://www.w3.org/TR/html4/sgml/entities.html
 */
$cp1252HTML401Entities = array(
    "\x80" => '&euro;',    # 128 -> euro sign, U+20AC NEW
    "\x82" => '&sbquo;',   # 130 -> single low-9 quotation mark, U+201A NEW
    "\x83" => '&fnof;',    # 131 -> latin small f with hook = function = florin, U+0192 ISOtech
    "\x84" => '&bdquo;',   # 132 -> double low-9 quotation mark, U+201E NEW
    "\x85" => '&hellip;',  # 133 -> horizontal ellipsis = three dot leader, U+2026 ISOpub
    "\x86" => '&dagger;',  # 134 -> dagger, U+2020 ISOpub
    "\x87" => '&Dagger;',  # 135 -> double dagger, U+2021 ISOpub
    "\x88" => '&circ;',    # 136 -> modifier letter circumflex accent, U+02C6 ISOpub
    "\x89" => '&permil;',  # 137 -> per mille sign, U+2030 ISOtech
    "\x8A" => '&Scaron;',  # 138 -> latin capital letter S with caron, U+0160 ISOlat2
    "\x8B" => '&lsaquo;',  # 139 -> single left-pointing angle quotation mark, U+2039 ISO proposed
    "\x8C" => '&OElig;',   # 140 -> latin capital ligature OE, U+0152 ISOlat2
    "\x8E" => '&#381;',    # 142 -> U+017D
    "\x91" => '&lsquo;',   # 145 -> left single quotation mark, U+2018 ISOnum
    "\x92" => '&rsquo;',   # 146 -> right single quotation mark, U+2019 ISOnum
    "\x93" => '&ldquo;',   # 147 -> left double quotation mark, U+201C ISOnum
    "\x94" => '&rdquo;',   # 148 -> right double quotation mark, U+201D ISOnum
    "\x95" => '&bull;',    # 149 -> bullet = black small circle, U+2022 ISOpub
    "\x96" => '&ndash;',   # 150 -> en dash, U+2013 ISOpub
    "\x97" => '&mdash;',   # 151 -> em dash, U+2014 ISOpub
    "\x98" => '&tilde;',   # 152 -> small tilde, U+02DC ISOdia
    "\x99" => '&trade;',   # 153 -> trade mark sign, U+2122 ISOnum
    "\x9A" => '&scaron;',  # 154 -> latin small letter s with caron, U+0161 ISOlat2
    "\x9B" => '&rsaquo;',  # 155 -> single right-pointing angle quotation mark, U+203A ISO proposed
    "\x9C" => '&oelig;',   # 156 -> latin small ligature oe, U+0153 ISOlat2
    "\x9E" => '&#382;',    # 158 -> U+017E
    "\x9F" => '&Yuml;',    # 159 -> latin capital letter Y with diaeresis, U+0178 ISOlat2
);

$outputWithEntities = strtr($output, $cp1252HTML401Entities);
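One detail worth showing is the order of escaping. The sketch below is an illustration only: it uses a shortened three-entry table and a hypothetical helper name (escapeCp1252ForHtml), and it assumes the text still needs ordinary HTML escaping. Markup characters are escaped first and the cp1252 bytes are mapped afterwards, so the ampersands of the inserted entities are not escaped a second time.

```php
<?php
// Shortened, illustrative table; the complete mapping is the one listed in
// the answer above.
$cp1252Subset = array(
    "\x92" => '&rsquo;',
    "\x93" => '&ldquo;',
    "\x94" => '&rdquo;',
);

// Hypothetical helper name, not part of the original answer.
function escapeCp1252ForHtml($text, array $map)
{
    // Step 1: escape <, >, & and quotes. With a single-byte charset the
    // bytes 0x80-0x9F pass through unchanged.
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'ISO-8859-1');

    // Step 2: replace the Windows-1252 bytes with entities.
    return strtr($escaped, $map);
}

echo escapeCp1252ForHtml("\x93Tom\x92s <b>\x94", $cp1252Subset);
// prints: &ldquo;Tom&rsquo;s &lt;b&gt;&rdquo;
```

Doing it the other way round would turn &rsquo; into &amp;rsquo; unless double encoding is switched off.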

If you want to be even safer, you can skip the named entities and use only numeric ones, which should work in very old browsers as well:

$cp1252HTMLNumericEntities = array(
    "\x80" => '&#8364;',   # 128 -> euro sign, U+20AC NEW
    "\x82" => '&#8218;',   # 130 -> single low-9 quotation mark, U+201A NEW
    "\x83" => '&#402;',    # 131 -> latin small f with hook = function = florin, U+0192 ISOtech
    "\x84" => '&#8222;',   # 132 -> double low-9 quotation mark, U+201E NEW
    "\x85" => '&#8230;',   # 133 -> horizontal ellipsis = three dot leader, U+2026 ISOpub
    "\x86" => '&#8224;',   # 134 -> dagger, U+2020 ISOpub
    "\x87" => '&#8225;',   # 135 -> double dagger, U+2021 ISOpub
    "\x88" => '&#710;',    # 136 -> modifier letter circumflex accent, U+02C6 ISOpub
    "\x89" => '&#8240;',   # 137 -> per mille sign, U+2030 ISOtech
    "\x8A" => '&#352;',    # 138 -> latin capital letter S with caron, U+0160 ISOlat2
    "\x8B" => '&#8249;',   # 139 -> single left-pointing angle quotation mark, U+2039 ISO proposed
    "\x8C" => '&#338;',    # 140 -> latin capital ligature OE, U+0152 ISOlat2
    "\x8E" => '&#381;',    # 142 -> U+017D
    "\x91" => '&#8216;',   # 145 -> left single quotation mark, U+2018 ISOnum
    "\x92" => '&#8217;',   # 146 -> right single quotation mark, U+2019 ISOnum
    "\x93" => '&#8220;',   # 147 -> left double quotation mark, U+201C ISOnum
    "\x94" => '&#8221;',   # 148 -> right double quotation mark, U+201D ISOnum
    "\x95" => '&#8226;',   # 149 -> bullet = black small circle, U+2022 ISOpub
    "\x96" => '&#8211;',   # 150 -> en dash, U+2013 ISOpub
    "\x97" => '&#8212;',   # 151 -> em dash, U+2014 ISOpub
    "\x98" => '&#732;',    # 152 -> small tilde, U+02DC ISOdia
    "\x99" => '&#8482;',   # 153 -> trade mark sign, U+2122 ISOnum
    "\x9A" => '&#353;',    # 154 -> latin small letter s with caron, U+0161 ISOlat2
    "\x9B" => '&#8250;',   # 155 -> single right-pointing angle quotation mark, U+203A ISO proposed
    "\x9C" => '&#339;',    # 156 -> latin small ligature oe, U+0153 ISOlat2
    "\x9E" => '&#382;',    # 158 -> U+017E
    "\x9F" => '&#376;',    # 159 -> latin capital letter Y with diaeresis, U+0178 ISOlat2
);
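The numeric entities are simply the decimal Unicode code points noted in the comments, so such a table can also be derived instead of typed by hand. A minimal sketch, assuming a hand-written list of code points (only three entries are shown here, and the variable names are made up for illustration):

```php
<?php
// Windows-1252 byte => Unicode code point, copied from the comments in the
// tables above. Only three entries are shown for brevity.
$cp1252CodePoints = array(
    0x80 => 0x20AC, // euro sign
    0x92 => 0x2019, // right single quotation mark
    0x99 => 0x2122, // trade mark sign
);

// Build a strtr() table of decimal numeric character references.
$numericMap = array();
foreach ($cp1252CodePoints as $byte => $codePoint) {
    $numericMap[chr($byte)] = '&#' . $codePoint . ';';
}

echo strtr("It\x92s 5 \x80", $numericMap);
// prints: It&#8217;s 5 &#8364;
```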

Hope this is more helpful now. See also the Wikipedia page mentioned above for some characters that exist in both Windows-1252 and ISO 8859-15 but at different code points. You should probably consider using UTF-8 on your website.
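If you do move to UTF-8, the stored text has to be converted as well. The following is a rough sketch, assuming the existing data really is Windows-1252; the helper name cp1252ToUtf8 is hypothetical, and the page, the database connection and the tables would all have to be switched to UTF-8 too for this to be consistent.

```php
<?php
// Hypothetical helper: converts text that is assumed to be Windows-1252
// into UTF-8 using iconv().
function cp1252ToUtf8($text)
{
    $utf8 = iconv('Windows-1252', 'UTF-8', $text);
    if ($utf8 === false) {
        // Depending on the iconv implementation, the conversion may fail
        // for bytes that are undefined in Windows-1252 (such as 0x81).
        return null;
    }
    return $utf8;
}

$utf8 = cp1252ToUtf8("It\x92s");

// In a UTF-8 page no entity table is needed for these characters;
// only the usual markup escaping remains.
echo htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8');
```

The page itself must then be served as UTF-8, for example through the Content-Type header or the corresponding meta element.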
