Re: Spidering

From: Tonnerre Lombard (tonnerre.lombard@sygroup.ch)
Date: Mon Jan 21 2008 - 02:38:05 EST


Hi,

On Thu, 17 Jan 2008 17:58:41 +0000 "me me" <securityoneoone@googlemail.com> wrote:
> Whilst I don't expect it to get everything (JavaScript etc is going
> to take manual intervention, so is a number of other possible
> technologies), I have never really found a tool that I consider to be
> the de facto spidering tool from this perspective. One of the biggest
> problems is a lot of the spiders seem to choke on really big sites,
> or go into infinite loops etc etc.

Yes, Microsoft Passport is a particularly nasty example of that. My
trick for the Microsoft Passport problem is to check whether each link
contains a URL-encoded version of the current URL and, if it does, to
ignore it. That appears to avoid the endless loops.
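
For illustration, here is a rough Python sketch of that check. It is
not the code I actually run; the function names are made up for this
mail, and it assumes the login link embeds the current URL fully
percent-encoded (as urllib.parse.quote with safe="" produces). The
comparison is case-insensitive because the hex digits in escapes can
be upper or lower case, which is a simplification that may
occasionally skip a link it should not.

```python
from urllib.parse import quote


def embeds_current_url(link: str, current_url: str) -> bool:
    """Return True if `link` contains a percent-encoded copy of `current_url`."""
    # Assumption: the embedding site encodes every reserved character.
    encoded = quote(current_url, safe="")
    # Escapes such as %3A and %3a are equivalent, so compare in lower case.
    return encoded.lower() in link.lower()


def filter_links(links, current_url):
    """Drop links that point back at the current page via an encoded URL."""
    return [link for link in links if not embeds_current_url(link, current_url)]


if __name__ == "__main__":
    page = "http://example.com/inbox?page=2"
    found = [
        "https://login.example.net/auth?return=http%3A%2F%2Fexample.com%2Finbox%3Fpage%3D2",
        "http://example.com/about",
    ]
    # Only the link that does not embed the current page is kept.
    print(filter_links(found, page))
```

A site that encodes the return URL differently (for example with "+"
for spaces, or only partially) would slip past this check, so treat it
as a starting point.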

I haven't seen any other endless loops as far as I remember, but then
again I haven't indexed very much yet.

                                Tonnerre

-- 
SyGroup GmbH
Tonnerre Lombard
Solutions Systematiques
Tel:+41 61 333 80 33		Güterstrasse 86
Fax:+41 61 383 14 67		4053 Basel
Web:www.sygroup.ch		tonnerre.lombard@sygroup.ch



