Non-ASCII characters in Posting List
Should we be including non-ASCII characters in our posting list? I ask because my computer is able to display Asian characters, yet in my part-* files and outputMerged.txt I still get things like
oï¿½uï¿½ï¿½3ï¿½wï¿½ï¿½vï¿½ï¿½rcï¿½ï¿½ï¿½ï¿½
as terms.
(This was run on the 200 url input.)
To see for yourself (assuming your computer can read Asian characters):
http://www.ics.uci.edu/~mlavaves/outputMerged.txt
To my knowledge, outputMerged.txt is saved in UTF-8 format, so if there were Asian characters in there to begin with, they should display properly.
I know I would still get that junk text for other languages I don't have installed on my computer (like Arabic), but I don't see any Asian characters at all. There should at least be a few from some of the pages that offer translations in those languages...
Michael,
You can ignore tokens containing non-ASCII characters. Although managing Unicode is important for web search engines, it is not among the goals of this assignment.
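For what it's worth, here is a minimal sketch of that kind of filter in Python. It assumes the tokens are already decoded strings; the function names `is_ascii` and `keep_ascii_tokens` and the sample tokens are made up for illustration and are not part of the assignment code.

```python
def is_ascii(token):
    """Return True if every character in the token is 7-bit ASCII."""
    return all(ord(ch) < 128 for ch in token)


def keep_ascii_tokens(tokens):
    """Drop any token that contains at least one non-ASCII character."""
    return [t for t in tokens if is_ascii(t)]


if __name__ == "__main__":
    # Hypothetical sample: plain words, an accented word, a token with
    # U+FFFD replacement characters, and a CJK word.
    sample = ["search", "engine", "caf\u00e9", "o\ufffdu\ufffd", "\u691c\u7d22", "index"]
    print(keep_ascii_tokens(sample))  # prints ['search', 'engine', 'index']
```

Running the filter before terms are emitted should keep the junk tokens out of the part-* files and the merged output altogether.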