I'm using Amazon's API to obtain the description of books. The API returns XML responses and the description is marked up (with HTML) very poorly. To deal with this poorly marked up description, which oftentimes breaks the layout of my site, I'm trying to use HTML Tidy to "clean it up."
To prevent "weird" characters from being displayed on my web page, I think I need to tell Tidy both the input encoding and the desired output encoding. I know I want the output to be UTF-8. However, I'm not sure how to determine the encoding of the input (Amazon's book description).
I've tried something like this:
mb_detect_encoding($amazon_description);
It's helped, but I'm still occasionally getting weird characters (a black diamond with a question mark in it: �). My guess is that I'm not detecting the encoding properly.
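For reference, here is a minimal sketch of the kind of check I have in mind. It validates the bytes with `mb_check_encoding` rather than guessing with `mb_detect_encoding`, and converts only when the text is not already valid UTF-8. The helper name `to_utf8` and the Windows-1252 fallback are my own assumptions, since I don't actually know which encoding Amazon sends when the text isn't UTF-8.

```php
<?php
// Hypothetical helper: return $text as valid UTF-8.
// ASSUMPTION: anything that is not already valid UTF-8 is Windows-1252.
function to_utf8(string $text, string $fallback = 'Windows-1252'): string
{
    // mb_check_encoding() validates the byte sequence instead of guessing.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave untouched
    }
    // Otherwise reinterpret the bytes using the assumed fallback encoding.
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

// Sample input: "é" stored as the single Windows-1252 byte 0xE9.
$amazon_description = "Caf\xE9 society";
$clean = to_utf8($amazon_description);
```

If the fallback guess is wrong, the output will still be valid UTF-8 but may show the wrong characters, so this only helps if the source encoding is known or guessed correctly.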
Any suggestions on what I should do?
EDIT:
This is my current solution:
$sanitized_amazon_markup = preg_replace('/[^\w`~!@#$%^&*()\-=_+[\]{}|;\':",.\/<>? ]/', '', $sanitized_amazon_markup);
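I'm not sure about this, since it strips everything outside a whitelist of ASCII characters, so it may delete things I should be keeping (accented letters and line breaks, for example).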
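A less destructive variant I'm considering is sketched below. Instead of whitelisting ASCII, it uses `mb_scrub` (available since PHP 7.2) to replace only byte sequences that are invalid in UTF-8, leaving valid non-ASCII characters alone. This assumes the description is mostly UTF-8 with a few stray bad bytes, which I haven't confirmed. The sample strings are made up for illustration.

```php
<?php
// ASSUMPTION: the input is mostly UTF-8 with occasional invalid bytes.
$valid   = "Caf\xC3\xA9 society";  // "é" correctly encoded as UTF-8
$invalid = "Caf\xE9 society";      // a lone 0xE9 byte, invalid in UTF-8

// mb_scrub() replaces invalid byte sequences with the substitute
// character and leaves valid text untouched.
$scrubbed_valid   = mb_scrub($valid, 'UTF-8');
$scrubbed_invalid = mb_scrub($invalid, 'UTF-8');
```

Unlike the regex above, this keeps accented letters that are encoded correctly, but it does not recover the characters that were mis-encoded in the first place.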