Re: Optimal wildcard search algorithm

From: Dathan Bennett (dathan@shsu.edu)
Date: Thu Nov 30 2006 - 11:19:05 EST


Round 1: Search a%, b%, c%, etc. This takes Θ(36) and yields N(1) matches.

Round 2a: Use search strings _a%, _b%, _c%, etc. This takes Θ(36) and yields
E(2) matches. Since we're using this step to _exclude_ search paths, the wasted
time is actually only Θ(E(2)): the queries that match prune nothing. The
advantage is that the time spent finding non-matching characters (Θ(36 - E(2)))
eliminates the need to spend Θ(N(1) * (36 - E(2))) in the brute-force approach
weeding out the same non-matching strings. So long as N(1) * (36 - E(2)) > 36
(the time spent on this approach at this level), we have gained an advantage in
terms of time. Dividing both sides by N(1) and rearranging, we get the
equivalent condition: 36 * (1 - 1/N(1)) > E(2).

Round 2b: Brute-force this level, excluding the (36 - E(2)) characters that
we've found don't exist in the second position. I.e., search for the strings
aa% (assuming a exists in the second position), ab%, etc., then ba%, bb%, etc.
This will yield N(2) matches.

Round 3a: Same as round 2a. Use search strings __a%, __b%, etc. Once again, it
takes Θ(36) and yields E(3) matches, for a net gain in speed whenever
N(2) * (36 - E(3)) > 36.
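
To make the rounds concrete, here is a minimal Python sketch of the pruned
enumeration. Everything about the oracle is an assumption: exists(pattern)
just stands in for whatever blind yes/no response the injection gives you, and
I've left out the break-even test, so the pruning pass runs unconditionally.

# Minimal sketch of the pruned, level-by-level enumeration.
# exists(pattern) is a HYPOTHETICAL yes/no oracle standing in for the
# blind injection response; it is not a real library call.

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"  # the 36 characters

def enumerate_prefixes(exists, max_len):
    prefixes = [""]  # prefixes known to lead to at least one username
    for i in range(max_len):
        # Round (i+1)a: probe position i+1 with single-char wildcards
        # (_a%, _b%, ... when i = 1) to see which characters survive.
        alive = [c for c in ALPHABET if exists("_" * i + c + "%")]
        if not alive:
            break
        if prefixes == [""]:
            # Round 1: the probe above already IS the level-1 search.
            prefixes = list(alive)
        else:
            # Round (i+1)b: brute-force over the pruned alphabet, at a
            # cost of len(prefixes) * len(alive) queries rather than
            # len(prefixes) * 36.
            prefixes = [p + c for p in prefixes for c in alive
                        if exists(p + c + "%")]
        if not prefixes:
            break
    return prefixes

(Bookkeeping for completed usernames, i.e. exact-length tests, is omitted to
keep the sketch short.)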

 

Some concrete examples:

After round 1, we discover that there are usernames that start with a, b, and
c. N(1) = 3.

Proceed to round 2a: we discover (in Θ(36) time) that there are usernames with
d, e, f, and g in the second position (i.e., E(2) = 4).

Round 2b: since we know there are only 4 possible characters in the second
position, we only need to try ad%, ae%, af%, ag%, bd%, be%, bf%, bg%, cd%, ce%,
cf%, cg%, which takes Θ(N(1) * E(2)) = Θ(12) time. Compare to traditional
brute force without the pruning from round 2a, which takes Θ(3 * 36) = Θ(108)
time.

The net savings from inserting round 2a is the total time required to
brute-force the second position (108), minus the time required for round 2a
(always 36), minus the time required to check the candidates that round 2a
doesn't prune away (in this case, 12). Time saved = 108 - 36 - 12 = 60, a
savings of over 50%. These numbers get significantly larger for larger N(i),
especially if E(i+1) stays small.
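
A quick sanity check of that arithmetic, with the cost model above
parameterized over the alphabet size:

def savings(n_prev, e_next, alphabet=36):
    # Time saved at one level by inserting a pruning round: unpruned
    # brute force (n_prev * alphabet), minus the pruning pass itself
    # (alphabet), minus the pruned brute force (n_prev * e_next).
    return n_prev * alphabet - alphabet - n_prev * e_next

print(savings(3, 4))  # prints 60, matching the worked example above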

 

As I noted, at any level the break-even curve is 36 * (1 - 1/N(i)) > E(i+1).
Note that if you don't know how many usernames there are on the system in total
(meaning you can't establish a concrete upper bound for N(i) or E(i+1) from how
many usernames you've already found), the only way to calculate whether the
pruning is worthwhile is to do it (to establish a value for E(i+1)), which is
pointless. The exception is that if N(i) = 1, it is never faster to use this
pruning step. But as noted, if we allow single-character wildcards, it takes
only Θ(L) time to naively discover the maximum length of usernames, where L is
the maximum length. (Using a binary-search-like approach, we could shorten the
time required, but since this operation would only be done once, it's probably
not important.) This could be used to place a more useful upper bound on N(i)
and E(i+1). It complicates things, though, and I haven't done the analysis to
figure out how much it could be used to further optimize searches.
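
For what it's worth, the binary-search variant might look like the sketch
below, again against the hypothetical exists() oracle; hard_cap is an assumed
ceiling on plausible username lengths, not something the protocol gives you.

def max_username_length(exists, hard_cap=256):
    # '_' * k + '%' matches any username of length >= k, so we can
    # binary-search for the largest k that still matches.
    lo, hi = 0, hard_cap  # hard_cap is an ASSUMED upper bound
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if exists("_" * mid + "%"):
            lo = mid        # some username has length >= mid
        else:
            hi = mid - 1    # none do; the maximum is below mid
    return lo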

 
I went ahead and plotted out the time for each possible value of N(i) and
E(i+1) using each approach. It turns out the average time saved per character
position by using the pruning (if my math is right) versus traditional
brute force is Θ(287.75). The maximum time saved per character position is
Θ(1224), which happens when all 36 characters exist in a particular position
but only one character exists in the next (for example, "user1a", "user2a",
"user3a", "user4a", ..., "useraa", "userba", "userca", etc.), and the maximum
time lost per character position is Θ(36).
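
Those three figures fall out of the savings() helper above if (this is my
reading of the plot's domain) N(i) and E(i+1) each range uniformly over 1..36:

vals = [savings(n, e) for n in range(1, 37) for e in range(1, 37)]
print(sum(vals) / len(vals))  # 287.75 (the average time saved)
print(max(vals))              # 1224   (N(i) = 36, E(i+1) = 1)
print(-min(vals))             # 36     (N(i) = 1,  E(i+1) = 36)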

I know this doesn't answer the question of what the optimal approach would be,
but it does show that this approach is, except in pathological cases, faster
than traditional brute force.

LDAP doesn't have a positional wildcard, though, does it? I think, though I
haven't done the analysis, that queries such as "*a*" will be helpful on
systems with relatively few usernames, but on larger systems with many
usernames, I believe they would be less useful. However, if you combine
"*a*"-type wildcard searches in a manner similar to the above, you can globally
cull characters that do not occur after a certain position when combined with a
particular prefix. For instance, make an initial pass using "*a*"-type searches
to eliminate characters that don't exist anywhere in any username (or hash,
etc.). Then take the remaining alphabet and brute-force the first character
position (a*, b*, etc.). Now use "a*a*", "a*b*", etc. to attempt to eliminate
members of the alphabet that don't exist in any username that has "a" as a
prefix. Then do the same for the "aa" prefix (use "aa*a*", etc.). This will
help you cull characters, but on a system with a large collection of usernames
that are more or less uniformly distributed, you're going to spend a lot of
time that you may or may not get back. I'll do some analysis and see.
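
A rough sketch of that culling pass, using LDAP-style "*" wildcards and the
same hypothetical exists() oracle as before:

def cull_alphabet(exists, prefix, alphabet):
    # Keep only the characters that occur somewhere after `prefix` in
    # at least one matching entry, probing with patterns like "a*x*".
    return [c for c in alphabet if exists(prefix + "*" + c + "*")]

# A global pass first ("*a*"-type searches), then narrower passes per
# prefix as the tree is walked:
#   alive = cull_alphabet(exists, "", ALPHABET)   # anywhere in any name
#   alive_a = cull_alphabet(exists, "a", alive)   # after the "a" prefix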

~Dathan

P.S. I apologize for any formatting errors. I used another application to
generate the symbols and typesetting, and then pasted into Thunderbird. I'll be
happy to send you the original if this one is difficult to read.

Tim wrote:
>> Do I understand correctly, that the wildcard test you describe only
>> gives an exists/doesn't exist answer? How many tests can you afford
>> per second?
>>
>
> Yes, the attack would be somewhat blind, resulting in yes/no or
> yes/no/error responses. For instance, if this were a login form using
> LDAP, it might return success for a proper username/password
> combination, failure for a nonexistent username or bad password, and an
> error for a wildcarded username that does exist but can't be bound to,
> since there's no injection in the LDAP bind request.
>
>
>> What you can try is to analyse some sample username lists for
>> ways to equalise their distribution. For example, you wouldn't want to
>> start with a*, better jump directly to aa*, ab*, ..., az*
>>
>
> You think starting with 2-character prefixes is actually faster? It
> doesn't seem that way to me. In the worst case for a single-character
> test, you need to make 36+36*m tests, where m is the number of
> successful results from the first round.
>
> If you start with 2-character tuples, you'll always need to perform
> 36*36 tests to get to the same point in the search, which is certainly
> slower on average.
>
>
>> You might try taking samples from
>> http://www.galbithink.org/names/agnames.htm
>> http://blogs.parc.com/playon/archives/2006/06/naming_patterns.html
>> or similar, some username/password combo lists, etc.
>>
>> You can also test for suffixes (*a, *b, ..) hoping for a better
>> distribution.
>>
>
> Indeed, I think using common names, or even just starting with the most
> probable characters, would speed up the initial findings. This is very
> useful if your goal is simply to find one or a few usernames. However,
> if you want to find them all, it probably doesn't help much.
>
> Besides, the general algorithm I seek would be usable on other fields.
> Take, for example, the following LDAP injections:
>
> *)(userPassword=A*
> *)(userPassword=B*
> ...
>
> If an LDAP server isn't properly secured, this could allow one to
> brute-force all password hashes using just the blind responses.
> Password hashes wouldn't have a predictable distribution. Similar
> situations may exist with SQL injection as well, though there are
> probably easier attacks in that situation.
>
>
>
>> If you're allowed single-char wildcards, you could do more interesting
>> searches - tests for certain username lengths being the most
>> important. You can also walk the search space based on the *second*
>> letter of the username _a%, _b% etc, which will (I guess) be more
>> equally distributed than the first letter.
>>
>
> Indeed, with single-character wildcards, a lot of additional information
> can be divulged. In fact, one could determine the length range of all
> strings right off, which would make it easy to determine when we're
> "done". I didn't want to complicate the question by introducing these
> though.
>
>
>> I don't know if searching for substrings in the middle hoping they
>> would prune the search tree will be helpful - but you can analyze
>> those sample lists and see if particular patterns come up. If there is
>> a good set of 2 or 3 character strings that have near-zero frequencies
>> in the sample lists, that's the way to go.
>>
>
> Yes, one of my burning questions is to figure out whether mid-string
> checks can be beneficial. Multi-character mid-string searches are
> likely a bad idea, for the same reasons I described above.
>
> While I appreciate the ideas you've suggested, I'm really looking for a
> more formal answer to the optimality question.
>
> thanks,
> tim
>

-- 
Dathan Bennett
Network Administrator
Center of Excellence in Digital Forensics
Sam Houston State University
Phone: (936) 294-4847
Fax: (936) 294-4222
E-mail: dathan@shsu.edu