If you've found this website then congratulations. This is the class forum and place for feedback on the class. Ask questions here, discuss related topics, whatever is appropriate in conjunction with the University of California, Irvine's Fall 2008 course on Information Retrieval.
There's a neat little article on a Risk-like game in the NY Times today. To summarize, the game takes place in the real world and conveys the sense of a ubicomp atmosphere at universities and workplaces around the country. Link is below:
How do you link the Firefox extension to the Java webserver so that the extension will display the output of the Java webserver? Is there certain JavaScript that needs to be added/modified? Thanks!
I'm sort of confused about what values to use when calculating the weight of the query for the cosine score. I understand I'm supposed to treat the query like its own document, but how does that change the values I need to use in the TF-IDF function?
If, for example, I have a one-word query, does that make both my term frequency and my document frequency for the query equal to 1? Would my corpus size also just be the size of my query? The issue with those numbers is that log(1) = 0, so my TF-IDF weight will be 0. That doesn't seem right...
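One way around the log(1) = 0 problem, sketched below, is to take the term frequency from the query itself but take the document frequency and corpus size from the indexed collection, and to use 1 + log10(tf) as the term-frequency weight. This is a common textbook convention and an assumption on my part, not something the assignment statement confirms. The class name, method names, and the numbers in main are all hypothetical.

```java
class QueryWeightSketch {

    // Log-scaled term frequency: 1 + log10(tf) for tf > 0, otherwise 0.
    // With this variant a term that occurs once still gets weight 1, not 0.
    static double tfWeight(int tf) {
        return tf > 0 ? 1.0 + Math.log10(tf) : 0.0;
    }

    // Inverse document frequency, computed from the document collection
    // (assumption: corpusSize and docFreq come from the index, not the query).
    static double idf(int corpusSize, int docFreq) {
        return docFreq > 0 ? Math.log10((double) corpusSize / docFreq) : 0.0;
    }

    // TF-IDF weight of one query term.
    static double queryWeight(int tfInQuery, int corpusSize, int docFreq) {
        return tfWeight(tfInQuery) * idf(corpusSize, docFreq);
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 70,000 indexed documents, term found in 700 of them.
        System.out.println(queryWeight(1, 70000, 700));
        // Treating the query as the whole corpus (N = 1, df = 1) gives 0,
        // which is the problem described above.
        System.out.println(queryWeight(1, 1, 1));
    }
}
```

With these made-up numbers the one-word query gets a nonzero weight because the idf reflects how rare the term is in the collection, not in the query.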
For assignment 7, are we going to be asked multi-word queries, or are the queries going to just be a single term? If the queries are going to be multi-word, it is going to take forever to load up each serialized file.
I'm having memory issues when building the posting list for the terms. Everything completely breaks down after I've handled roughly 35,000 pages. I've tried shortening the document URL, and that certainly helped. But I have no idea what I can do to take care of the next 35,000 in my set. What are you doing to avoid running out of memory?
Here are the components you will probably need to build. I say "probably" and "should" because you can do it however you want. The evaluation doesn't specify an architecture and oftentimes other people have good ways of doing things that I don't consider....
1)
Something that can read in a bunch of files (either from a list or by looking in a directory).
I would start by seeing if java.io.File can help there.
2)
Then you need to read in each of the files line by line and parse out the content. Depending on the content you will need to do different things. If you are using the sample output files, then for every document URL you read in you will have to update the data structure with what you find. The posting list is just a data structure put together the correct way.
You will need to use the java.util.Map interface to map from a document URL to the term frequency in that document (the number of times a term appears in it). Alternatively, you could use a list. (I don't recall which Jam recommended in discussion.) Call this the "DocTFMap".
Then you will need a java.util.Map data structure that maps from a term (stored as a string) to its DocTFMap. Call this the "TermMap".
When you choose an implementation for the Map data structures, you should probably choose one that keeps its keys in sorted (alphabetical) order, such as java.util.TreeMap, for efficiency.
You will process a document at a time.
For each document, count the number of times each term occurs in it. (term frequency)
When you figure that out, find the right DocTFMap in the TermMap by using the term you are currently processing. Then insert a new Document TermFrequency pair into it.
3)
You should write out your data structure to disk using Java's serialization utilities.
4)
You should write a second program that reads in the data structure, asks the user for a term, and then returns the list of documents that contain that term. (Again, a lookup into the TermMap.)
5)
You will need all those components again for the image lookup dictionary, but only one Map is necessary, which maps from document URLs to images. Much of the code will be very similar.
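Steps 2 through 4 above could be sketched roughly like this. It is only one way to do it, under a few assumptions of mine: each document arrives as a plain string (not in the format of the sample output files), terms are split on non-word characters and lowercased, and both maps are TreeMaps. The class name, method names, and example URLs are hypothetical.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

class PostingListSketch {

    // Step 2: count each term in one document, then insert a
    // (document URL, term frequency) pair into that term's DocTFMap.
    static void addDocument(TreeMap<String, TreeMap<String, Integer>> termMap,
                            String docUrl, String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            termMap.computeIfAbsent(e.getKey(), k -> new TreeMap<>())
                   .put(docUrl, e.getValue());
        }
    }

    // Step 3: write the TermMap to disk with Java serialization.
    static void save(TreeMap<String, TreeMap<String, Integer>> termMap, Path file)
            throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(termMap);
        }
    }

    // Step 4: read the TermMap back in so terms can be looked up.
    @SuppressWarnings("unchecked")
    static TreeMap<String, TreeMap<String, Integer>> load(Path file)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (TreeMap<String, TreeMap<String, Integer>>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        TreeMap<String, TreeMap<String, Integer>> termMap = new TreeMap<>();
        addDocument(termMap, "http://example.org/a", "the cat saw the dog");
        addDocument(termMap, "http://example.org/b", "the dog barked");

        Path file = Files.createTempFile("termmap", ".ser");
        save(termMap, file);
        TreeMap<String, TreeMap<String, Integer>> loaded = load(file);

        // Lookup for the term "the":
        // prints {http://example.org/a=2, http://example.org/b=1}
        System.out.println(loaded.get("the"));
        Files.delete(file);
    }
}
```

The image dictionary in step 5 would follow the same pattern with a single map from document URL to a list of image URLs.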
I'm using the full data set provided by Prof. Patterson to test my parser, but there seem to be a lot of malformed lines. For example, in the first text file for terms (term_doc_pairs_0001.txt), there is this line:
Where you can see that the crawler was trying to put gloucester:*count* and another page URL, but got cut off. Should I just fix the errors as they pop up and continue, or what would be the best course of action?
I'm kind of confused as to what counts as a "full data set". While I wasn't able to get my own crawler to go through all 2.2 million wiki pages, it did go through at least a few hundred thousand. Since the "full set" posted for the class to use if they wanted was only 70,000 documents, could I simply use 70,000 documents from my own crawl and have that count as a full set of my own data?
In the evaluation portion, what do we do if the document example you give us is not one of the documents that is part of our data set? Will you only be giving us examples to search for based on what is in our data sets?
Finally, for the recall portion of the evaluation, how do you want the results for this? Do we input a word and a document and we return a "yes" or a "no" depending on if the word is in that document?
Does anyone have any good methods for finding the image files in the document for assignment 6?
I was thinking that I'd have to find all the <img> tags, then the src values within them, but it seems like a weak algorithm that doesn't explore all the possibilities. Are there any easy WebSphinx methods that get all the images out of a page, or something?
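For what it's worth, the "find the tags, then the src values" idea can be sketched with a plain regular expression, without relying on any WebSphinx call. This is a rough sketch, not a full HTML parser: it assumes the page is available as a string and that src values are quoted, and it will miss images referenced from CSS or scripts. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImageSrcSketch {

    // Matches an <img ...> tag and captures the quoted value of its src attribute.
    // Group 1 is the quote character, group 2 is the src value.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    // Returns the src values of all <img> tags found in the HTML, in order.
    static List<String> imageSources(String html) {
        List<String> sources = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            sources.add(m.group(2));
        }
        return sources;
    }

    public static void main(String[] args) {
        String html = "<p><img alt=\"x\" src=\"a.png\"><IMG SRC='b.jpg'/></p>";
        // prints [a.png, b.jpg]
        System.out.println(imageSources(html));
    }
}
```

The src values are often relative, so they would still need to be resolved against the page URL before being stored.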
I'm having trouble making the XPI file for the Firefox extension. I'm creating it with WinZip and calling the file "emptysidebar.xpi". I'm zipping up the folder that is in the emptysidebar.zip file provided on the website. The XPI file that they give works just fine, but when I try to make one, Firefox states that it can't find the install.rdf file, even though install.rdf is included in the XPI file, and I placed the XPI in the same folder as install.rdf.
Can someone give a walkthrough about how to create an XPI file, and where to place it? (if there's a certain place where it has to go)
I'm not exactly sure what we're supposed to output in our sidebar. Are we creating a history trace that adds the link of whatever page gets refreshed or opened, or does it just show one link to the main page?
I was having issues with Assignment 4 -- the crawler would only crawl roughly half of the "easter egg" pages before stopping. It turns out the crawler defaults to only crawling pages that are 5 steps away from your seed. So if you're having the same problem, use the setMaxDepth() method and set it to something higher, like 10 or 50 or something :)
I'm really getting scared. I've been crawling for 2 hours, my text file index is 0.5gb. The memory use started off at around 275mb and is now at 575mb. The 1gb max memory I allowed for my Java VM might not be enough. Also, I'm not even doing the pattern matching stuff for lipograms, rhopalics and stuff.
I read that the English wiki has about 2.2 million pages. Even if my crawler could crawl 400 pages a minute (a VERY GENEROUS amount), it would take about 3.8 days.
I think we really need some kind of smaller subset or start or end conditions.
A strange problem occurs when I try to set the starting point of a crawl to the Irvine, California page or the Bubonic Plague page. The crawler simply terminates without even visiting the page. I've already fixed the page size limit and tried having the crawler both obey and disobey robots.txt, but the problem still occurs. The strange part is that the crawler seems to work fine with any other page I try. And when I use the workbench to crawl those two pages, it also seems to work fine. Any ideas?