Need help with regex exclude rule
We are testing a site with about 16'000 pages. After successfully crawling 3'000 pages WASP got stuck because it followed some links to EPS and TIF graphics. A few pages later it got stuck again on pages which automatically start a print dialog.
Therefore we stopped the crawl and resumed it with the following exclude regex rule:
(\.eps|\.tif|/print/)
This rule should prevent pages like these from being loaded:
asdf.example.com/print/somepage.htm
asdf.example.com/media/anypicture.tif
asdf.example.com/media/anotherpic.eps
Unfortunately it does not work as expected. The excluded pages still get loaded. Is this because we did not specify the rule at the beginning or is there an error in our regex?
Thanks!
-
Your regex looks OK, and I ran a few tests without any issue. Can you send me an email at shamel (at) immeria (dot) net with the site you are testing?
Also, I have added tiff & eps to the list of files automatically excluded.
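A quick way to repeat this kind of check outside the crawler is sketched below in Python. It assumes the rule is applied as a case-sensitive "search anywhere in the URL" match; WASP's own regex engine and matching mode may differ. The `is_excluded` helper and the `index.htm` URL are made up for illustration.

```python
import re

# The exclude rule exactly as posted in the question.
EXCLUDE = re.compile(r"(\.eps|\.tif|/print/)")

def is_excluded(url: str) -> bool:
    """Return True if the rule matches anywhere in the URL (search, not full match)."""
    return EXCLUDE.search(url) is not None

urls = [
    "asdf.example.com/print/somepage.htm",
    "asdf.example.com/media/anypicture.tif",
    "asdf.example.com/media/anotherpic.eps",
    "asdf.example.com/index.htm",  # hypothetical page that should stay in the crawl
]

results = {u: is_excluded(u) for u in urls}
for u, hit in results.items():
    print("skip " if hit else "crawl", u)
```

Under these assumptions the three problem URLs are excluded and the ordinary page is not, so the pattern itself is not the culprit. Note that `\.tif` also matches `.tiff`, and that uppercase extensions such as `.TIF` would need a case-insensitive match.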
-
Thank you for excluding tiff and eps by default!
Since the site is in a testing environment, I could not send you a link, but I sent you a screencast that walks through our problem.
We made some new findings after we started the crawl from scratch. The link to the print page gets redirected:
asdf.example.com/test.php?browser=print -> asdf.example.com/print/test.php
So we added '=print' to our exclude rule, which seems to work... at least for the first few hundred pages.
This strengthens our assumption that the filter is applied to the link as it is found on a page, not to the URL that is finally loaded. A rule therefore has no effect on pages that are already in the crawl queue (for example, when resuming a crawl). Is this correct?
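To illustrate the redirect finding, here is a small Python sketch. It assumes the rule is tested with a plain "search anywhere" match against the link text; how WASP actually evaluates the rule is not confirmed here. The variable names are made up for the example.

```python
import re

original = re.compile(r"(\.eps|\.tif|/print/)")
extended = re.compile(r"(\.eps|\.tif|/print/|=print)")  # rule with '=print' added

link = "asdf.example.com/test.php?browser=print"  # link as it appears on the page
final_url = "asdf.example.com/print/test.php"     # URL after the redirect

original_hits_link = original.search(link) is not None        # rule vs. discovered link
original_hits_final = original.search(final_url) is not None  # rule vs. redirect target
extended_hits_link = extended.search(link) is not None        # extended rule vs. link

print(original_hits_link, original_hits_final, extended_hits_link)
```

The original rule only matches the redirect target, so a filter that looks at the discovered link lets the print page through. The extended rule matches the link itself.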
-
Right on! The exclusion (or inclusion) rule is currently applied when a link is first discovered on a page. Since each discovered page is logged for further analysis, stopping the crawl, specifying the exclusion rule, and then resuming doesn't work: the temporary database already contains unscanned entries that were never checked against the new rule.
I will make two fixes:
1) Log each skipped entry in the database so you can see which links were skipped.
2) Change the point at which the include/exclude rule is checked, so that if you change the regular expression and resume the crawl, queued entries are processed accordingly.
Thanks for catching that!
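The difference between the two behaviours can be sketched as follows. This is only an illustration in Python, not WASP's actual code: the `crawl` function, its `check_on_dequeue` flag, and the in-memory queue standing in for the temporary database are all assumptions made for the example.

```python
import re
from collections import deque

def crawl(queue, exclude, check_on_dequeue):
    """Drain a queue of already discovered URLs and return (visited, skipped).

    With check_on_dequeue=False the rule is assumed to have been applied only
    at discovery time, so queued entries are visited whatever the rule says now.
    With check_on_dequeue=True each entry is re-checked just before loading.
    """
    visited, skipped = [], []
    pending = deque(queue)
    while pending:
        url = pending.popleft()
        if check_on_dequeue and exclude.search(url):
            skipped.append(url)  # keep a record of what was skipped
            continue
        visited.append(url)      # stand-in for actually loading the page
    return visited, skipped

# Entries queued before the rule existed (hypothetical resume scenario).
queued = [
    "asdf.example.com/index.htm",
    "asdf.example.com/media/anypicture.tif",
    "asdf.example.com/print/somepage.htm",
]
rule = re.compile(r"(\.eps|\.tif|/print/)")

old_visited, old_skipped = crawl(queued, rule, check_on_dequeue=False)
new_visited, new_skipped = crawl(queued, rule, check_on_dequeue=True)
print(old_visited, old_skipped)
print(new_visited, new_skipped)
```

In the first run every queued entry is loaded even though the rule is now set, which mirrors the reported behaviour after a resume. In the second run the two matching entries end up in the skipped list instead.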