Recent activity
Martin replied on August 11, 2009 07:31 to the question "Post-processing: likely error in tag breakdown near [...] in beacon line" in Immeria:
Martin replied on July 08, 2009 11:56 to the question "Post-processing: likely error in tag breakdown near [...] in beacon line" in Immeria:
Hey Stephane,
Now another crawl does not work because of the same issue. The problem is that so many pages have semicolons in their titles that we cannot exclude them all manually. This character should be escaped by default.

Please let me know if you need additional information.
Regards,
Martin
Martin asked a question in Immeria on June 30, 2009 12:19:
Post-processing: likely error in tag breakdown near [...] in beacon line
Hi Stephane,
After a crawl the report won't open because of the following error dialogues:
Post-processing: likely error in tag breakdown near [%20Title] in beacon line
hxxp://sdc.domain.tld/dcs1234567890123456789123_xyzz/dcs.gif?&dcsdat=1246353933215&dcssip=www.domain.tld&dcsuri=/de/technical/knowhow/documents/page_1.html&WT.tz=2&WT.bh=11&WT.ul=en-US&WT.cd=32&WT.sr=1400x1050&WT.jo=Yes&WT.ti=COMPANY%20FBB%20Service%20AG%20-%20Product%20of%20Service;%20Title;%20Green%20greener%20greenest&WT.js=Yes&WT.jv=1.5&WT.bs=1400x894&WT.fi=Yes&WT.fv=10.0&WT.sp=@@SPLITVALUE@@
Post-processing: likely error in tag breakdown near [%20Title%20for%20T%E4lpel] in beacon line
hxxp://sdc.domain.tld/dcs1234567890123456789123_xyzz/dcs.gif?&dcsdat=1246353933215&dcssip=www.domain.tld&dcsuri=/de/technical/knowhow/documents/page_2.html&WT.tz=2&WT.bh=11&WT.ul=en-US&WT.cd=32&WT.sr=1400x1050&WT.jo=Yes&WT.ti=COMPANY%20FBB%20Service%20AG%20-%20Product%20of%20Service;%20Strahlung;%20Green%20greener%20greenest&WT.js=Yes&WT.jv=1.5&WT.bs=1400x894&WT.fi=Yes&WT.fv=10.0&WT.sp=@@SPLITVALUE@@
The semicolons (;) in the WT.ti parameter are probably the cause. I anonymized the requests, but the essential information should still be there.
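To illustrate the suspected cause, here is a minimal Python sketch. It assumes the post-processing step treats ';' as a parameter separator in addition to '&'; that is a guess based on the error text, not confirmed WASP behaviour. The `naive_split` and `encode_param` helpers and the sample title are hypothetical and only modelled on the anonymized WT.ti value above.

```python
import re
from urllib.parse import quote, unquote

# Hypothetical title, modelled on the anonymized WT.ti value in the beacon line.
title = "COMPANY FBB Service AG - Product of Service; Title; Green greener greenest"


def naive_split(query):
    """Split a query string on '&' and ';' (assumed parser behaviour, not WASP's actual code)."""
    return re.split(r"[&;]", query)


def encode_param(value):
    """Percent-encode every reserved character, so ';' becomes %3B."""
    return quote(value, safe="")


# Only spaces are encoded, as in the reported beacon line.
raw_query = "WT.ti=" + title.replace(" ", "%20") + "&WT.js=Yes"
# Semicolons are encoded as well.
safe_query = "WT.ti=" + encode_param(title) + "&WT.js=Yes"

print(naive_split(raw_query))   # the title is broken into several fragments
print(naive_split(safe_query))  # the title stays in one piece
```

Under that assumption, the unescaped title breaks into fragments such as `%20Title`, which is the fragment named in the first error message, while the fully encoded title survives as a single WT.ti value.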
Cheers,
Martin
Martin replied on May 28, 2009 21:53 to the question "Need help with regex exclude rule" in Immeria:
Martin marked one of Stephane Hamel's replies in Immeria as useful. Stephane Hamel replied to the question "Need help with regex exclude rule".
Martin replied on May 28, 2009 15:42 to the question "Need help with regex exclude rule" in Immeria:
Thank you for excluding tiff and eps by default!
Since the site is in a test environment, I could not send you a link, but I sent you a screencast that walks through our problem.
We made some new findings after restarting the crawl from scratch. The link to the print page gets redirected:
asdf.example.com/test.php?browser=print -> asdf.example.com/print/test.php
So we added '=print' to our exclude rule, which seems to work... at least for the first few hundred pages.
This strengthens our assumption that the filter is applied to the link, not to the final URL. A rule therefore has no effect on pages that are already in the crawl queue (for example, when resuming a crawl). Is this correct?
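The difference between matching the link and matching the redirect target can be checked with a short Python sketch. Python's `re` module stands in for whatever regex engine WASP uses, and the exact form of the extended rule (`=print` added as a fourth alternative) is an assumption about how the rule was changed.

```python
import re

# Original rule from the first post.
ORIGINAL = re.compile(r"(\.eps|\.tif|/print/)")
# Assumed form of the rule after '=print' was added.
EXTENDED = re.compile(r"(\.eps|\.tif|/print/|=print)")

link = "asdf.example.com/test.php?browser=print"  # URL as it appears in the page
target = "asdf.example.com/print/test.php"        # URL after the redirect

for name, rule in (("original", ORIGINAL), ("extended", EXTENDED)):
    print(name, "matches link:", bool(rule.search(link)),
          "| matches target:", bool(rule.search(target)))
```

The original rule matches only the redirect target, so a filter that looks at the link alone would let the print page through. The extended rule matches the link itself.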
Martin asked a question in Immeria on May 28, 2009 08:52:
Need help with regex exclude rule
We are testing a site with about 16'000 pages. After successfully crawling 3'000 pages WASP got stuck because it followed some links to EPS and TIF graphics. A few pages later it got stuck again on pages which automatically start a print dialog.
Therefore we stopped the crawling and started again (resume) with the following exclude regex rule:
(\.eps|\.tif|/print/)
This rule should prevent pages like these from being loaded:
asdf.example.com/print/somepage.htm
asdf.example.com/media/anypicture.tif
asdf.example.com/media/anotherpic.eps
Unfortunately it does not work as expected. The excluded pages still get loaded. Is this because we did not specify the rule at the beginning or is there an error in our regex?
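As a sanity check on the regex itself, the rule can be run against the three sample URLs in Python. This is only a sketch: Python's `re` module is a stand-in, and WASP's regex dialect and matching mode (substring search versus full match) are not known here. The `is_excluded` helper is hypothetical.

```python
import re

# The exclude rule exactly as posted.
EXCLUDE = re.compile(r"(\.eps|\.tif|/print/)")

urls = [
    "asdf.example.com/print/somepage.htm",
    "asdf.example.com/media/anypicture.tif",
    "asdf.example.com/media/anotherpic.eps",
]


def is_excluded(url):
    """Return True if the rule matches anywhere in the URL (substring search)."""
    return EXCLUDE.search(url) is not None


for url in urls:
    print(url, "->", "excluded" if is_excluded(url) else "loaded")
```

With a substring search, all three URLs match, which suggests the pattern is not the problem. Two caveats: the match is case-sensitive, so an upper-case extension such as `.TIF` would slip through unless a case-insensitive flag is used, and a tool that requires the pattern to match the whole URL would need something like `.*` on both sides.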
Thanks!

