
Non-ASCII characters in Posting List

Should we be including non-ASCII characters in our posting list? The reason I ask is that my computer is able to display Asian characters. However, in my part-* files and outputMerged.txt, I still get things like
oï¿½uï¿½ï¿½3ï¿½wï¿½ï¿½vï¿½ï¿½rcï¿½ï¿½ï¿½ï¿½
as terms.
(This was run on the 200 url input.)

To see for yourself (assuming your computer can display Asian characters):
http://www.ics.uci.edu/~mlavaves/outputMerged.txt

To my knowledge, outputMerged.txt is saved in UTF-8 format, so if there were Asian characters in there to begin with, they should display properly.

I know I would still get that junk text for other languages I don't have installed on my computer (like Arabic), but I don't see any Asian characters at all. There should at least be a few from some of the pages that offer translations in those languages...
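
For what it's worth, here is a minimal Python sketch of one way junk beginning with ï¿ could end up in a file that really is saved as UTF-8. It is only a guess at the cause, not something confirmed from my crawl: it assumes a page was served in a non-UTF-8 encoding (Shift_JIS is picked purely for illustration), decoded as UTF-8 anyway, and that the resulting file was later viewed as Latin-1. The sample text and all variable names are hypothetical.

```python
# Hypothetical sample text; not taken from the actual crawl output.
original = "日本語"

# Assumption: the page's bytes were in Shift_JIS rather than UTF-8.
raw = original.encode("shift_jis")

# Decoding those bytes as UTF-8 fails on most of them. With
# errors="replace", each undecodable byte becomes U+FFFD (the
# replacement character), while bytes that happen to be valid ASCII
# survive as stray letters or symbols.
lossy = raw.decode("utf-8", errors="replace")

# Writing the already-damaged text as UTF-8 stores each U+FFFD as the
# three bytes EF BF BD. The file is valid UTF-8, but the original
# characters are no longer in it.
saved = lossy.encode("utf-8")

# A viewer that reads the file as Latin-1 shows EF BF BD as three
# separate characters instead of one replacement character.
shown = saved.decode("latin-1")

print(repr(lossy))
print(shown)
```

If that is what happened, saving the output as UTF-8 would not help, because the characters were already replaced when the page was first decoded. The fix would have to be at the point where the page bytes are turned into text.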
 