Transform queries for Sphinx input.
sphinxesc
A small module to prevent user-submitted search expressions from being mis-parsed into invalid Sphinx Extended Query Expressions.
The module provides a function
module SphinxEscape where
escapeSphinxQueryString :: String -> String
that sanitizes a query expression so that it can be safely submitted to the Sphinx API.
Synopsis
Example from ghci:
ghci> :m SphinxEscape
ghci> putStrLn $ escapeSphinxQueryString "@tag_list hello OR quick brown fox 7/11"
@tag_list hello | quick brown fox 7 11
ghci>
ghci> putStrLn $ escapeSphinxQueryString "hello AND quick brown fox 7/11"
hello & quick brown fox 7 11
ghci>
Explanation
escapeSphinxQueryString performs very simple escaping with the help of a simplified abstract syntax tree. The abstract syntax tree it builds is:
data Expression =
    TagFieldSearch String
  | Literal String
  | Phrase String
  | AndOrExpr Conj Expression Expression
  deriving Show
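To illustrate how a tree of this shape could be turned back into a query string, here is a minimal sketch. It is not the module's actual implementation. The render function is hypothetical, the Conj type is assumed to have the constructors And and Or (only Or appears in the parser output shown in this README), and keeping the double quotes around a Phrase is also an assumption.

```haskell
-- Assumed definition: only `Or` is confirmed by the README's parser output.
data Conj = And | Or
  deriving Show

data Expression
  = TagFieldSearch String
  | Literal String
  | Phrase String
  | AndOrExpr Conj Expression Expression
  deriving Show

-- Hypothetical renderer: turn the AST back into Sphinx extended query syntax.
render :: Expression -> String
render (TagFieldSearch ws) = "@tag_list " ++ ws
render (Literal w)         = w
-- Assumption: a phrase keeps its surrounding double quotes.
render (Phrase p)          = "\"" ++ p ++ "\""
render (AndOrExpr c l r)   = render l ++ conj c ++ render r
  where
    conj And = " & "
    conj Or  = " | "
```

Under these assumptions, render (AndOrExpr Or (Literal "test") (Literal "hello")) evaluates to "test | hello".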
The escaping does not parse more advanced Sphinx query expressions such as NEAR/n, quorum, etc., nor does it recognize arbitrary @field expressions. The only special expressions recognized are & (AND), | (OR) and @tag_list WORDS. Except for quoted phrases, non-alphanumeric characters that do not form part of these specific expressions are simply turned into whitespace.
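The whitespace rule for ordinary words can be sketched as follows. This is only an illustration of the idea, not the module's code. The name sanitizeWords is hypothetical, and collapsing runs of whitespace into a single space is an assumption.

```haskell
import Data.Char (isAlphaNum)

-- Hypothetical helper: replace every non-alphanumeric character with a
-- space, then collapse runs of whitespace (the collapsing is an assumption).
sanitizeWords :: String -> String
sanitizeWords = unwords . words . map keep
  where
    keep c
      | isAlphaNum c = c
      | otherwise    = ' '
```

With this sketch, sanitizeWords "7/11" gives "7 11" and sanitizeWords "@other_field" gives "other field". A real implementation would apply such a step only to the parts of the input that are not one of the recognized special expressions or a quoted phrase.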
See the Testing section below for examples of conversions.
Obviously these rules are quite domain-specific. The rules can be made more configurable later.
Testing
The command-line executable sphinxesc can be used to test the expression parser and the escaping of the input into the final Sphinx search expression.
$ sphinxesc "test OR hello"
test | hello
# -p option shows the parsing result
$ sphinxesc -p "test OR hello"
AndOrExpr Or (Literal "test") (Literal "hello")
There is a suite of Bash-based regression tests in tests.txt, where the input is on the left, followed by :: surrounded by any whitespace, followed by the expected escaped output result. To run the tests, execute the script ./test.sh.
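The line format described above could be read in Haskell roughly like this. It is a sketch only: the actual test runner is a Bash script, the name parseTestLine is hypothetical, and splitting at the first occurrence of :: is an assumption.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd, isPrefixOf)

-- Hypothetical parser for one line of tests.txt: split at the first "::"
-- and trim surrounding whitespace from both halves.
-- Returns Nothing when the line contains no "::" separator.
parseTestLine :: String -> Maybe (String, String)
parseTestLine = go ""
  where
    go _ [] = Nothing
    go acc rest@(c:cs)
      | "::" `isPrefixOf` rest = Just (trim (reverse acc), trim (drop 2 rest))
      | otherwise              = go (c : acc) cs
    trim = dropWhileEnd isSpace . dropWhile isSpace
```

For example, parseTestLine "hello OR 7/11 :: hello | 7 11" yields Just ("hello OR 7/11", "hello | 7 11"), and a line consisting only of :: yields a pair of empty strings.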
NOTE: This test output may be outdated. Please look at tests.txt for the current tests.
./test.sh
INPUT                       EXPECTED                  RESULT                    PASS
7/11                        7 11                      7 11                      PASS
hello 7/11                  hello 7 11                hello 7 11                PASS
hello OR 7/11               hello | 7 11              hello | 7 11              PASS
hello or 7/11               hello | 7 11              hello | 7 11              PASS
hello | 7/11                hello | 7 11              hello | 7 11              PASS
hello AND 7/11              hello & 7 11              hello & 7 11              PASS
@tag_list fox tango 7/11    @tag_list fox tango 7 11  @tag_list fox tango 7 11  PASS
@(tag_list) fox tango 7/11  @tag_list fox tango 7 11  @tag_list fox tango 7 11  PASS
@(tag_list) AND             @tag_list AND             @tag_list AND             PASS
@other_field AND            other field AND           other field AND           PASS
hello & @other_field AND    hello & other field AND   hello & other field AND   PASS
hello &                     hello                     hello                     PASS
& hello &                   hello                     hello                     PASS
& & hello &                 hello                     hello                     PASS
| | hello |                 hello                     hello                     PASS
"hello" hello               hello hello               hello hello               PASS
hello" hello                hello hello               hello hello               PASS
hello' hello                hello hello               hello hello               PASS
hello' @tag_list fox        hello @tag_list fox       hello @tag_list fox       PASS
hello' @tag_list fox &      hello @tag_list fox       hello @tag_list fox       PASS
                                                                                PASS
(The last case is hard to see, but the input is a blank string "" and the output is a blank string "".)
Future directions
The escaping function can be made more configurable. The parser and AST data structure can also be made more sophisticated, so that the AST can cover more of the Sphinx Extended Query syntax.
Reference
- http://sphinxsearch.com/docs/latest/extended-syntax.html Sphinx Extended Syntax docs.