[logback-dev] Improving Joran's RuleStore performance

Niall Gallagher niall at npgall.com
Fri Jun 21 01:07:59 CEST 2013


On 20 Jun 2013, at 23:15, ceki <ceki at qos.ch> wrote:

> On 20.06.2013 23:27, Niall Gallagher wrote:
>> Hi Ceki,
>> 
>> Just wondering about the SUB-STRING patterns, such as "*/if/then/*" and
>> "*/if/else/*".
>> 
>> I'm currently working on adding a permuterm tree to the
>> concurrent-trees[1] project. It will support fast retrieval from the
>> tree for wildcard queries like: X, X*, *X, *X*, X*Y. These patterns
>> would be fully accelerated.
> 
> The *X* variant fits the sub-string patterns used in Joran. While
> traversing the elements of an XML document, I would like to compute if
> *X* is the path defined by current location in the XML document. This
> seems to be different (or even unrelated) to the problem of checking
> whether *X* matches one of the keys inserted previously into an ADT
> such as a trie.

So you want to store patterns in the tree, and then run a document over the tree to see which patterns matched. Yes it's different. The same idea as InvertedRadixTree. In this case I don't think such an "inverted" permuterm tree is possible, as rotating the query (the document) would be expensive.

> 
>> It looks like the *X* variant might fit your sub-string patterns.
>> 
>> It's also possible to have partial acceleration for additional patterns
>> such as X*Y*Z, *X*Y*Z, *X*Y*Z* and so on with any number of terms, which
>> involves using the first and last terms to find a candidate set in the
>> tree and then doing on-the-fly filtering for the terms in the middle.
> 
> I am curious. Is "autocomplete" in presence of typos the use case for such patterns?

Spell correction and auto-complete in the presence of typos are some use cases (though actually this would require higher-level logic on top of the tree also). There's a discussion about spell correction on the concurrent-trees mailing list.

The major use case for me though, is the ability to search a product catalog very quickly using queries containing wildcards, or using queries containing typos.

The trees in concurrent-trees[1] were originally part of another project, CQEngine[2] before I spun it out as a separate project. I'm writing the concurrent permuterm tree in concurrent-trees, so that I can then use it as a basis for a new type of index in CQEngine, allowing free-text search of product catalogs etc.

[1] http://code.google.com/p/concurrent-trees/
[2] http://code.google.com/p/cqengine/

> 
>> Would that help? I'm about 80% of the way to completion.
>> 
>> [1] https://code.google.com/p/concurrent-trees/
>> 
>> Niall
>> 
> 
> 
> -- 
> Ceki
> 65% of statistics are made up on the spot
> _______________________________________________
> logback-dev mailing list
> logback-dev at qos.ch
> http://mailman.qos.ch/mailman/listinfo/logback-dev



More information about the logback-dev mailing list