If you ever accept raw HTML from an outside source to embed into your site, you should always, always reformat and whitelist it. You have no idea what that third-party HTML may contain, and you have no guarantee that it's valid; yet on your site you presumably want guaranteed-valid HTML with certain limits on its content (or do you really want to allow arbitrary <script> tags to be embedded...?!).
That means you want to:
- parse the HTML and extract whatever structural information is in it
- filter that structure to allow only approved elements and then
- produce your own HTML from the filtered structure, which you can then guarantee is syntactically valid.
Supposedly the best PHP library which does that is HTML Purifier. Without using a library, you would use a lenient HTML parser, something like DOMDocument, to inspect and filter the content, and then the built-in DOMDocument::saveXML to produce the new sanitised HTML.
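
As a rough illustration of the parse, filter, and re-serialise steps without a library, here is a minimal sketch using DOMDocument. The ALLOWED whitelist, the function names sanitizeHtml and filterNode, and the rule that href must start with http://, https:// or mailto: are all assumptions made for this example, not a vetted policy. For anything security-critical, a maintained library such as HTML Purifier is likely the safer choice.

```php
<?php
declare(strict_types=1);

// Hypothetical whitelist: tag name => attributes permitted on that tag.
const ALLOWED = [
    'p' => [], 'br' => [], 'strong' => [], 'em' => [],
    'ul' => [], 'ol' => [], 'li' => [],
    'a' => ['href'],
];

// Recursively remove everything below $parent that is not whitelisted.
function filterNode(DOMNode $parent): void
{
    // Copy the live child list first, because nodes are removed while iterating.
    foreach (iterator_to_array($parent->childNodes, false) as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            continue; // plain text is kept; it is escaped again on output
        }
        if (!$child instanceof DOMElement || !isset(ALLOWED[$child->nodeName])) {
            // Drop comments and non-whitelisted elements together with their content.
            $parent->removeChild($child);
            continue;
        }
        foreach (iterator_to_array($child->attributes, false) as $attr) {
            $ok = in_array($attr->nodeName, ALLOWED[$child->nodeName], true);
            if ($ok && $attr->nodeName === 'href') {
                // Assumed rule: only absolute http(s) and mailto links survive.
                $ok = preg_match('~^(?:https?://|mailto:)~i', $attr->value) === 1;
            }
            if (!$ok) {
                $child->removeAttributeNode($attr);
            }
        }
        filterNode($child);
    }
}

function sanitizeHtml(string $html): string
{
    $doc = new DOMDocument();
    // Malformed input is expected, so collect parser warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration prefix is a common trick to make loadHTML read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><body>' . $html . '</body>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    filterNode($body);

    // Serialise only the filtered children of <body>, never the original string.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveXML($child);
    }
    return $out;
}
```

Calling sanitizeHtml() on a fragment containing an onclick attribute, a javascript: link and a <script> element should return markup with all three removed and the surrounding text intact. Note that this sketch discards a non-whitelisted element together with its text content, and that saveXML produces XML-style output (for example <br/>); DOMDocument::saveHTML is the alternative if HTML-style serialisation is preferred.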