Embedded comments may disappear when certain Unicode characters are posted
The following was previously reported to support@intensedebate.com on 2009-03-07 but went unacknowledged. I'm more confident it will be answered here, but I should point out that this bug, if I am correct, could be used maliciously, so I originally didn't want to announce it on a public forum. But here goes...
----
I've observed a bug affecting the embedded IntenseDebate comments system
for blogs. It is triggered by specific Unicode byte sequences, when
they are accidentally (or perhaps maliciously) included in the text of a
comment. It results in Firefox users not being able to see or post
comments on that page any more. It may affect other browsers, but IE7
seems unaffected.
The problem characters are Unicode U+2028 (LINE SEPARATOR; hex sequence
0xE2,0x80,0xA8) and U+2029 (PARAGRAPH SEPARATOR; hex sequence
0xE2,0x80,0xA9). These get printed in IntenseDebate's JavaScript
literally, and Firefox interprets them as line endings (as if there had
been a literal new-line character in the script). This causes various
scripting errors to show in the 'Error Console' in the 'Tools' menu.
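To illustrate the line-ending interpretation described above, here is a minimal sketch. It is not IntenseDebate's code. How a raw separator inside a string literal is handled can depend on the engine version, so this example uses a line comment instead: if U+2028 ends the line, the comment ends there too and the text after it runs as code.

```javascript
// Build U+2028 with String.fromCharCode so this file contains no raw separator.
const LINE_SEPARATOR = String.fromCharCode(0x2028);

// The line comment stops at the separator, so "return 42;" is parsed as code.
const source = "// this comment ends at the separator" + LINE_SEPARATOR + "return 42;";
const run = new Function(source);

console.log(run()); // prints 42
```

If the separator were treated as an ordinary character, the whole body would be a comment and the function would return undefined.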
I believe this is correct behaviour on Firefox's part: both the Unicode
standard and the ECMAScript specification treat these two characters as
line terminators. I doubt many other browsers implement this yet, but I'd
expect them to eventually follow suit.
My suggested fix for this would be to detect these byte sequences and
escape them in the dynamically generated JavaScript (as you would for the \n character), using this syntax:
'This is my comment.\u2028It contains a Unicode line separator.'
In the above example, the byte sequence 0xE2,0x80,0xA8 would be replaced
by \u2028 .
Here is an easy way to do this in PHP:
// Escape special Unicode byte sequences that may cause JavaScript syntax errors
function escape_unicode($string)
{
    // U+2028: LINE SEPARATOR
    $string = str_replace("\xE2\x80\xA8", "\\u2028", $string);
    // U+2029: PARAGRAPH SEPARATOR
    $string = str_replace("\xE2\x80\xA9", "\\u2029", $string);
    return $string;
}
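If the script is generated from JavaScript rather than PHP, the same escaping might look roughly like the sketch below. The helper name toScriptStringLiteral is hypothetical. It assumes the comment text is already a decoded string, so it matches the characters themselves rather than their UTF-8 bytes.

```javascript
// Hypothetical helper: turn comment text into a string literal that is safe
// to print into generated JavaScript source.
function toScriptStringLiteral(text) {
  // JSON.stringify escapes quotes, backslashes, \n and other control characters,
  // but leaves U+2028 and U+2029 as raw characters, so escape those separately.
  return JSON.stringify(text)
    .replace(/\u2028/g, "\\u2028")
    .replace(/\u2029/g, "\\u2029");
}

const comment = "This is my comment.\u2028It contains a Unicode line separator.";
const literal = toScriptStringLiteral(comment);

// The printed literal contains the six characters \u2028, not the raw separator.
console.log(literal);
```

Parsing the literal again yields the original comment text, so nothing is lost by escaping.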
You may alternatively wish to convert either of the problem characters to
plain ASCII newline characters ( \n ), or maybe to some sort of HTML line or
paragraph break.
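A rough sketch of that alternative follows. The helper name replaceSeparators, its mode argument, and the choice of <br /> markup are all assumptions made for illustration. Any newline produced this way would still need the usual \n escaping before being printed into a script.

```javascript
// Hypothetical helper: replace the two separators instead of escaping them.
// mode "html" emits <br /> tags; any other mode emits plain "\n" characters.
function replaceSeparators(text, mode) {
  const lineBreak = mode === "html" ? "<br />" : "\n";
  // Assumption: a paragraph separator becomes two line breaks in HTML mode.
  const paragraphBreak = mode === "html" ? "<br /><br />" : "\n";
  return text
    .replace(/\u2028/g, lineBreak)
    .replace(/\u2029/g, paragraphBreak);
}

const sample = "First line\u2028Second line\u2029New paragraph";
console.log(replaceSeparators(sample, "html"));
```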
I've set up an example of this bug in action on this page; note that
the entire comments section has disappeared in Firefox after I posted a
comment containing one of these characters:
http://pyro.eu.org/stuff/unicode-bug-...
It's also worth noting that this probably only affects pages that are
declared to use UTF-8 encoding (either in HTTP headers or the equivalent
HTML <meta /> tag). Websites that do not do this are probably unaffected
by this bug, but then they probably don't render *any* Unicode properly
(e.g. foreign/non-Latin characters) within the IntenseDebate comments, because the IntenseDebate JavaScript files are served without declaring their character encoding in the HTTP headers (e.g. Content-Type: text/javascript; charset=UTF-8).
----

