PHP Markdown & Text Encoding

Replacing characters outside the ASCII range with their HTML entity equivalents is pretty common on the web. I’m talking about the process of changing characters (like é) to a named entity (&eacute;) or a numbered one (decimal &#233; or hexadecimal &#xE9;). From time to time, I get an email asking for PHP Markdown to convert characters to their HTML entities. Here is an explanation of why this won’t happen.

The reason Markdown doesn’t convert any character (except for <>'"& where appropriate) is that you shouldn’t have to convert them. PHP Markdown will work with any character encoding which is a superset of ASCII, including ISO-Latin-1 and UTF-8, and will leave characters as they are. If your input text is UTF-8, the result will be UTF-8 and will be displayed correctly on a web page, provided you declare the UTF-8 charset for the page (either in the server’s Content-Type header or in a meta tag).
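To make the charset declaration concrete, here is a minimal sketch of a page that announces UTF-8 both ways. The helper name `render_utf8_page` is hypothetical and not part of PHP Markdown, and the sample paragraph only stands in for Markdown's output. It assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: wraps an HTML fragment in a page that declares UTF-8
// in a meta tag. Not part of PHP Markdown.
function render_utf8_page(string $bodyHtml): string
{
    return "<!DOCTYPE html>\n"
        . "<html>\n<head>\n"
        . "<meta charset=\"UTF-8\">\n"
        . "<title>Example</title>\n"
        . "</head>\n<body>\n"
        . $bodyHtml . "\n"
        . "</body>\n</html>\n";
}

// Declare the charset in the HTTP header as well, if headers can still be sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// Stand-in for the output of PHP Markdown: the é is left as a raw UTF-8 character.
$page = render_utf8_page("<p>caf\u{E9}</p>");
```

Sending `$page` with `echo` is then enough; either declaration alone would normally do, and the two are shown together only for illustration.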

If you want to convert such characters to entities, feel free to do so: it should be done after Markdown has finished processing the text. But replacing characters with entities does not belong in Markdown. Correctly converting characters to entities requires knowing what the input charset is, and PHP Markdown doesn’t know this: it simply assumes some superset of ASCII.
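If you do want entities, a post-processing step along these lines is one way to go about it. This is only a sketch: `encode_non_ascii` is a hypothetical helper, not something PHP Markdown provides, and it assumes PHP 7 or later with the mbstring extension loaded. The sample strings stand in for Markdown's output.

```php
<?php
// Hypothetical helper, not part of PHP Markdown. Run it on the HTML that
// Markdown has already produced. The caller must supply the charset.
function encode_non_ascii(string $html, string $charset): string
{
    // Conversion map: code points U+0080 through U+10FFFF, no offset,
    // mask wide enough to keep the whole code point. ASCII (including
    // <, >, & and quotes) is left untouched, so the markup survives.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($html, $convmap, $charset);
}

// UTF-8 input, correctly labelled: é becomes one decimal entity.
$right = encode_non_ascii("<p>caf\u{E9}</p>", 'UTF-8');

// The same bytes labelled as ISO-8859-1: each byte of the UTF-8 sequence
// is treated as its own character, so the result is wrong.
$wrong = encode_non_ascii("<p>caf\u{E9}</p>", 'ISO-8859-1');
```

Here `$right` is `<p>caf&#233;</p>` while `$wrong` is `<p>caf&#195;&#169;</p>`, which shows why the conversion cannot be done without knowing the charset.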