
Need help with regex exclude rule

We are testing a site with about 16'000 pages. After successfully crawling 3'000 pages, WASP got stuck because it followed some links to EPS and TIF graphics. A few pages later it got stuck again, this time on pages that automatically open a print dialog.

We therefore stopped the crawl and resumed it with the following exclude regex rule:
(\.eps|\.tif|/print/)

This rule should prevent pages like these from being loaded:
asdf.example.com/print/somepage.htm
asdf.example.com/media/anypicture.tif
asdf.example.com/media/anotherpic.eps
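
As a sanity check outside WASP, the pattern can be tried against the sample URLs. This is only a minimal sketch in Python: it assumes the rule is applied as an unanchored, case-sensitive search over the URL, which may not be how WASP evaluates it. The helper name is_excluded and the last URL in the list are made up for illustration.

```python
import re

# The exclude rule exactly as entered above.
EXCLUDE = re.compile(r"(\.eps|\.tif|/print/)")

# Sample URLs from this post, plus one hypothetical page that should still be crawled.
urls = [
    "asdf.example.com/print/somepage.htm",
    "asdf.example.com/media/anypicture.tif",
    "asdf.example.com/media/anotherpic.eps",
    "asdf.example.com/products/index.htm",  # hypothetical, should not be excluded
]

def is_excluded(url):
    """Return True if the rule matches anywhere in the URL (assumed search semantics)."""
    return EXCLUDE.search(url) is not None

for url in urls:
    print(url, "excluded" if is_excluded(url) else "crawled")
```

Under these assumptions the pattern itself matches all three sample URLs. It would not match if the tool anchors the pattern at the start of the URL, or if the links use upper-case extensions such as .TIF, since the match here is case-sensitive.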

Unfortunately, it does not work as expected: the excluded pages still get loaded. Is this because we did not specify the rule at the start of the crawl, or is there an error in our regex?



Thanks!
 