Transforming data with Analyzers
Analyzers allow you to transform data for sophisticated text processing and searching, either standalone or in combination with Views.
While AQL string functions allow for basic text manipulation, true text processing, including tokenization, language-specific word stemming, case conversion and removal of diacritical marks (accents) from characters, only becomes possible with Analyzers.
Analyzers parse input values and transform them into sets of sub-values, for example by breaking up text into words. If they are used in Views then the documents’ attribute values of the linked collections are used as input and additional metadata is produced internally. The data can then be used for searching and sorting to provide the most appropriate match for the specified conditions, similar to queries to web search engines.
Analyzers can be used on their own to tokenize and normalize strings in AQL queries with the TOKENS() function.
How Analyzers process values depends on their type and configuration. The configuration consists of type-specific properties and a list of features. The features control the additional metadata that is generated to augment View indexes, for instance to be able to rank results.
Analyzers can be managed via an HTTP API and through a JavaScript module.
Value Handling
While most of the Analyzer functionality is geared towards text processing, there is no restriction to strings as input data type when using them through Views – your documents could have attributes of any data type after all.
Strings are processed according to the Analyzer, whereas other primitive data types (null, true, false, numbers) are added to the index unchanged.
The elements of arrays are unpacked, processed and indexed individually, regardless of the level of nesting. That is, strings are processed by the configured Analyzer(s) and other primitive values are indexed as-is.
Objects, including any nested objects, are indexed as sub-attributes. This applies to sub-objects as well as objects in arrays. Only primitive values are added to the index; arrays and objects as a whole cannot be searched for.
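The value handling rules above can be sketched in plain JavaScript. This is only an illustration of the described behavior, not ArangoDB's implementation: collectIndexedValues and lowerCaseWords are hypothetical names, and the dotted path keys are an assumption made for readability.

```javascript
// Sketch only: mimics the described value handling, not ArangoDB's implementation.
// `analyze` is a hypothetical stand-in for a configured Analyzer.
function collectIndexedValues(value, analyze, path = "", out = {}) {
  if (Array.isArray(value)) {
    // Array elements are unpacked and handled individually, at any nesting level.
    for (const element of value) {
      collectIndexedValues(element, analyze, path, out);
    }
  } else if (value !== null && typeof value === "object") {
    // Objects are handled as sub-attributes of the current path.
    for (const [key, sub] of Object.entries(value)) {
      collectIndexedValues(sub, analyze, path ? path + "." + key : key, out);
    }
  } else {
    // Strings go through the Analyzer, other primitives are kept as-is.
    const tokens = typeof value === "string" ? analyze(value) : [value];
    if (!out[path]) out[path] = [];
    out[path].push(...tokens);
  }
  return out;
}

// Hypothetical Analyzer stand-in: lower-case the text and split it at whitespace.
const lowerCaseWords = (text) => text.toLowerCase().split(/\s+/).filter(Boolean);

const indexed = collectIndexedValues(
  { title: "Quick Fox", tags: ["Red", ["Blue Green"], 42], meta: { ok: true, none: null } },
  lowerCaseWords
);
console.log(indexed);
```

Note how the number, the boolean and null end up in the result unchanged, while every string is tokenized regardless of how deeply it is nested in arrays.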
Also see:
- SEARCH operation on how to query indexed values such as numbers and nested values
- ArangoSearch Views for details about how compound data types (arrays, objects) get indexed
Analyzer Names
Each Analyzer has a name for identification with the following naming conventions:
- The name must only consist of the letters a to z (both in lower and upper case), the numbers 0 to 9, underscore (_) and dash (-) symbols. This also means that any non-ASCII names are not allowed.
- It must always start with a letter.
- The maximum allowed length of a name is 254 bytes.
- Analyzer names are case-sensitive.
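As a minimal sketch, the naming conventions above can be checked client-side before attempting to create an Analyzer. The helper isValidAnalyzerName is hypothetical and not part of any ArangoDB API; it assumes the rules apply to the name without the database prefix.

```javascript
// Letters, digits, underscore and dash only, starting with a letter.
const ANALYZER_NAME_PATTERN = /^[A-Za-z][A-Za-z0-9_-]*$/;
const MAX_NAME_BYTES = 254;

// Hypothetical helper: returns true if the name follows the listed conventions.
function isValidAnalyzerName(name) {
  return (
    typeof name === "string" &&
    ANALYZER_NAME_PATTERN.test(name) &&
    // All allowed characters are ASCII, so the length in characters equals the length in bytes.
    name.length <= MAX_NAME_BYTES
  );
}

console.log(isValidAnalyzerName("customAnalyzer"));
console.log(isValidAnalyzerName("1st_analyzer"));
```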
Custom Analyzers are stored per database, in a system collection _analyzers. The names get prefixed with the database name and two colons, e.g. myDB::customAnalyzer. This does not apply to the globally available built-in Analyzers, which are not stored in an _analyzers collection.
Custom Analyzers stored in the _system database can be referenced in queries against other databases by specifying the prefixed name, e.g. _system::customGlobalAnalyzer. Analyzers stored in databases other than _system cannot be accessed from within another database, however.
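The visibility rule for prefixed names can be expressed as a small predicate. This is a sketch under assumptions: canReference is a hypothetical helper, and names without a database prefix are assumed to refer to a built-in Analyzer or one in the current database.

```javascript
// Hypothetical helper: can an Analyzer name be referenced from the given database?
function canReference(analyzerName, currentDatabase) {
  const separator = analyzerName.indexOf("::");
  if (separator === -1) {
    // Assumption: unprefixed names are built-in or local to the current database.
    return true;
  }
  const owner = analyzerName.slice(0, separator);
  // Only Analyzers of the current database and of _system are reachable.
  return owner === currentDatabase || owner === "_system";
}

console.log(canReference("_system::customGlobalAnalyzer", "myDB"));
console.log(canReference("otherDB::customAnalyzer", "myDB"));
```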
Analyzer Types
The currently implemented Analyzer types are:
- identity: treat value as atom (no transformation)
- delimiter: split into tokens at user-defined character
- stem: apply stemming to the value as a whole
- norm: apply normalization to the value as a whole
- ngram: create n-grams from value with user-defined lengths
- text: tokenize into words, optionally with stemming, normalization, stop-word filtering and edge n-gram generation
- aql: for running AQL query to prepare tokens for index
- pipeline: for chaining multiple Analyzers
- stopwords: removes the specified tokens from the input
- geojson: breaks up a GeoJSON object into a set of indexable tokens
- geopoint: breaks up a JSON object describing a coordinate into a set of indexable tokens
Available normalizations are case conversion and accent removal (conversion of characters with diacritical marks to the base characters).
Analyzer / Feature | Tokenization | Stemming | Normalization | N-grams |
---|---|---|---|---|
identity | No | No | No | No |
delimiter | (Yes) | No | No | No |
stem | No | Yes | No | No |
norm | No | No | Yes | No |
ngram | No | No | No | Yes |
text | Yes | Yes | Yes | (Yes) |
aql | (Yes) | (Yes) | (Yes) | (Yes) |
pipeline | (Yes) | (Yes) | (Yes) | (Yes) |
stopwords | No | No | No | No |
geojson | – | – | – | – |
geopoint | – | – | – | – |
Analyzer Properties
The valid attributes/values for the properties are dependent on what type is used. For example, the delimiter type needs to know the desired delimiting character(s), whereas the text type takes a locale, stop-words and more.
identity
An Analyzer applying the identity transformation, i.e. returning the input unmodified.
It does not support any properties and will ignore them.
Examples
Applying the identity Analyzer does not perform any transformations, hence the input is returned unaltered:
arangosh> db._query(`RETURN TOKENS("UPPER lower dïäcríticš", "identity")`).toArray();
[
[
"UPPER lower dïäcríticš"
]
]
delimiter
An Analyzer capable of breaking up delimited text into tokens as per RFC 4180 (without starting new records on newlines).
The properties allowed for this Analyzer are an object with the following attributes:
- delimiter (string): the delimiting character(s)
Examples
Split input strings into tokens at hyphen-minus characters:
stem
An Analyzer capable of stemming the text, treated as a single token, for supported languages.
The properties allowed for this Analyzer are an object with the following attributes:
- locale (string): a locale in the format language[_COUNTRY][.encoding][@variant] (square brackets denote optional parts), e.g. "de.utf-8" or "en_US.utf-8". Only UTF-8 encoding is meaningful in ArangoDB. Also see Supported Languages.
Examples
Apply stemming to the input string as a whole:
arangosh> var analyzers = require("@arangodb/analyzers");
arangosh> var a = analyzers.save("stem_en", "stem", {
........> locale: "en.utf-8"
........> }, ["frequency", "norm", "position"]);
arangosh> db._query(`RETURN TOKENS("databases", "stem_en")`).toArray();
[
[
"databas"
]
]
norm
An Analyzer capable of normalizing the text, treated as a single token, i.e. case conversion and accent removal.
The properties allowed for this Analyzer are an object with the following attributes:
- locale (string): a locale in the format language[_COUNTRY][.encoding][@variant] (square brackets denote optional parts), e.g. "de.utf-8" or "en_US.utf-8". Only UTF-8 encoding is meaningful in ArangoDB. Also see Supported Languages.
- accent (boolean, optional):
  - true to preserve accented characters (default)
  - false to convert accented characters to their base characters
- case (string, optional):
  - "lower" to convert to all lower-case characters
  - "upper" to convert to all upper-case characters
  - "none" to not change character case (default)
Examples
Convert input string to all upper-case characters:
arangosh> var analyzers = require("@arangodb/analyzers");
arangosh> var a = analyzers.save("norm_upper", "norm", {
........> locale: "en.utf-8",
........> case: "upper"
........> }, ["frequency", "norm", "position"]);
arangosh> db._query(`RETURN TOKENS("UPPER lower dïäcríticš", "norm_upper")`).toArray();
[
[
"UPPER LOWER DÏÄCRÍTICŠ"
]
]
Convert accented characters to their base characters:
arangosh> var analyzers = require("@arangodb/analyzers");
arangosh> var a = analyzers.save("norm_accent", "norm", {
........> locale: "en.utf-8",
........> accent: false
........> }, ["frequency", "norm", "position"]);
arangosh> db._query(`RETURN TOKENS("UPPER lower dïäcríticš", "norm_accent")`).toArray();
[
[
"UPPER lower diacritics"
]
]
Convert input string to all lower-case characters and remove diacritics:
arangosh> var analyzers = require("@arangodb/analyzers");
arangosh> var a = analyzers.save("norm_accent_lower", "norm", {
........> locale: "en.utf-8",
........> accent: false,
........> case: "lower"
........> }, ["frequency", "norm", "position"]);
arangosh> db._query(`RETURN TOKENS("UPPER lower dïäcríticš", "norm_accent_lower")`).toArray();
[
[
"upper lower diacritics"
]
]
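For intuition, the effect of the accent and case properties can be approximated outside of ArangoDB with standard JavaScript string functions. This is a rough sketch, not the Analyzer's actual ICU-based implementation: normalizeToken is a hypothetical helper, it ignores the locale, and it may differ from the Analyzer for characters that have no Unicode decomposition.

```javascript
// Hypothetical approximation of the norm Analyzer's accent and case options.
function normalizeToken(text, { accent = true, case: caseMode = "none" } = {}) {
  let result = text;
  if (!accent) {
    // Decompose characters (NFD), drop the combining marks, then recompose (NFC).
    result = result.normalize("NFD").replace(/\p{M}/gu, "").normalize("NFC");
  }
  if (caseMode === "lower") {
    result = result.toLowerCase();
  } else if (caseMode === "upper") {
    result = result.toUpperCase();
  }
  // The value is treated as a single token.
  return [result];
}

const input = "UPPER lower dïäcríticš";
console.log(normalizeToken(input, { case: "upper" }));
console.log(normalizeToken(input, { accent: false }));
console.log(normalizeToken(input, { accent: false, case: "lower" }));
```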
ngram
An Analyzer capable of producing n-grams from a specified input in a range of min..max (inclusive). Can optionally preserve the original input.
This Analyzer type can be used to implement substring matching. Note that it slices the input based on bytes and not characters by default (streamType). The “binary” mode supports single-byte characters only; multi-byte UTF-8 characters raise an Invalid UTF-8 sequence query error.
The properties allowed for this Analyzer are an object with the following attributes:
- min (number): unsigned integer for the minimum n-gram length
- max (number): unsigned integer for the maximum n-gram length
- preserveOriginal (boolean):
  - true to include the original value as well
  - false to produce the n-grams based on min and max only
- startMarker (string, optional): this value will be prepended to n-grams which include the beginning of the input. Can be used for matching prefixes. Choose a character or sequence as marker which does not occur in the input.
- endMarker (string, optional): this value will be appended to n-grams which include the end of the input. Can be used for matching suffixes. Choose a character or sequence as marker which does not occur in the input.
- streamType (string, optional): type of the input stream
  - "binary": one byte is considered as one character (default)
  - "utf8": one Unicode codepoint is treated as one character
Examples
With min = 4 and max = 5, the Analyzer will produce the following n-grams for the input string "foobar":
- "foob"
- "fooba"
- "foobar" (if preserveOriginal is enabled)
- "ooba"
- "oobar"
- "obar"
An input string "foo" will not produce any n-gram unless preserveOriginal is enabled, because it is shorter than the min length of 4.
The above example but with startMarker = "^" and endMarker = "$" would produce the following:
- "^foob"
- "^fooba"
- "^foobar" (if preserveOriginal is enabled)
- "foobar$" (if preserveOriginal is enabled)
- "ooba"
- "oobar$"
- "obar$"
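The n-gram lists above can be reproduced with a short JavaScript function. It is a sketch written to match the listed examples, not the Analyzer's implementation: ngrams is a hypothetical helper, it works on JavaScript string units rather than bytes (so it is only meaningful for ASCII input here), and its behavior for inputs whose length lies between min and max is an assumption that the examples do not cover.

```javascript
// Hypothetical n-gram generator that mirrors the examples for "foobar".
function ngrams(input, { min, max, preserveOriginal = false, startMarker = "", endMarker = "" }) {
  const result = [];
  for (let start = 0; start < input.length; start++) {
    for (let length = min; length <= max && start + length <= input.length; length++) {
      let gram = input.slice(start, start + length);
      // Mark n-grams that include the beginning or the end of the input.
      if (start === 0) gram = startMarker + gram;
      if (start + length === input.length) gram = gram + endMarker;
      result.push(gram);
    }
    if (start === 0 && preserveOriginal) {
      // The original value is emitted once per configured marker,
      // or once unmarked if no marker is set.
      if (startMarker) result.push(startMarker + input);
      if (endMarker) result.push(input + endMarker);
      if (!startMarker && !endMarker) result.push(input);
    }
  }
  return result;
}

console.log(ngrams("foobar", { min: 4, max: 5, preserveOriginal: true }));
console.log(ngrams("foobar", { min: 4, max: 5, preserveOriginal: true, startMarker: "^", endMarker: "$" }));
console.log(ngrams("foo", { min: 4, max: 5 }));
```

The last call returns an empty array, because "foo" is shorter than the minimum length and preserveOriginal is not enabled.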
Create and use a trigram Analyzer with preserveOriginal disabled:
Create and use a bigram Analyzer with preserveOriginal enabled and with start and stop markers:
text
An Analyzer capable of breaking up strings into individual words while also optionally filtering out stop-words, extracting word stems, applying case conversion and accent removal.
The properties allowed for this Analyzer are an object with the following attributes:
- locale (string): a locale in the format language[_COUNTRY][.encoding][@variant] (square brackets denote optional parts), e.g. "de.utf-8" or "en_US.utf-8". Only UTF-8 encoding is meaningful in ArangoDB. Also see Supported Languages.
- accent (boolean, optional):
  - true to preserve accented characters
  - false to convert accented characters to their base characters (default)
- case (string, optional):
  - "lower" to convert to all lower-case characters (default)
  - "upper" to convert to all upper-case characters
  - "none" to not change character case
- stemming (boolean, optional):
  - true to apply stemming on returned words (default)
  - false to leave the tokenized words as-is
- edgeNgram (object, optional): if present, then edge n-grams are generated for each token (word). That is, the start of the n-gram is anchored to the beginning of the token, whereas the ngram Analyzer would produce all possible substrings from a single input token (within the defined length restrictions). Edge n-grams can be used to cover word-based auto-completion queries with an index, for which you should set the following other options: accent: false, case: "lower" and most importantly stemming: false.
  - min (number, optional): minimal n-gram length
  - max (number, optional): maximal n-gram length
  - preserveOriginal (boolean, optional): whether to include the original token even if its length is less than min or greater than max
- stopwords (array, optional): an array of strings with words to omit from result. Default: load words from stopwordsPath. To disable stop-word filtering provide an empty array []. If both stopwords and stopwordsPath are provided then both word sources are combined.
- stopwordsPath (string, optional): path with a language sub-directory (e.g. en for a locale en_US.utf-8) containing files with words to omit. Each word has to be on a separate line. Everything after the first whitespace character on a line will be ignored and can be used for comments. The files can be named arbitrarily and have any file extension (or none).
  Default: if no path is provided then the value of the environment variable IRESEARCH_TEXT_STOPWORD_PATH is used to determine the path, or if it is undefined then the current working directory is assumed. If the stopwords attribute is provided then no stop-words are loaded from files, unless an explicit stopwordsPath is also provided.
  Note that if the stopwordsPath cannot be accessed, is missing language sub-directories or has no files for a language required by an Analyzer, then the creation of a new Analyzer is refused. If such an issue is discovered for an existing Analyzer during startup then the server will abort with a fatal error.
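The stop-word file format described for stopwordsPath (one word per line, everything after the first whitespace character is ignored) can be illustrated with a small parser. This is a sketch of the format only: parseStopwordFile is a hypothetical helper, and skipping empty lines and lines that start with whitespace is an assumption.

```javascript
// Hypothetical parser for the described stop-word file format.
function parseStopwordFile(content) {
  const words = [];
  for (const line of content.split(/\r?\n/)) {
    // Keep only the text before the first whitespace character.
    const word = line.split(/\s/, 1)[0];
    // Assumption: lines without a leading word are skipped.
    if (word) words.push(word);
  }
  return words;
}

const fileContent = "the  definite article\nand\n\nof\tpreposition\n";
console.log(parseStopwordFile(fileContent));
```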
Examples
The built-in text_en Analyzer has stemming enabled (note the word endings):
You may create a custom Analyzer with the same configuration but with stemming disabled like this:
Custom text Analyzer with the edge n-grams feature and normalization enabled, stemming disabled and "the" defined as stop-word to exclude it:
aql
Introduced in: v3.8.0
An Analyzer capable of running a restricted AQL query to perform data manipulation / filtering.
The query must not access the storage engine. This means no FOR loops over collections or Views, no use of the DOCUMENT() function, no graph traversals. AQL functions are allowed as long as they do not involve Analyzers (TOKENS(), PHRASE(), NGRAM_MATCH(), ANALYZER() etc.) or data access, and if they can be run on DB-Servers in case of a cluster deployment. User-defined functions are not permitted.
The input data is provided to the query via a bind parameter @param. It is always a string. The AQL query is invoked for each token in case of multiple input tokens, such as an array of strings.
The output can be one or multiple tokens (top-level result elements). They get converted to the configured returnType, either booleans, numbers or strings (default).
If returnType is "number" or "bool" then it is unnecessary to set this AQL Analyzer as context Analyzer with ANALYZER() in View queries. You can compare indexed fields to numeric values, true or false directly, because they bypass Analyzer processing.
The properties allowed for this Analyzer are an object with the following attributes:
- queryString (string): AQL query to be executed
- collapsePositions (boolean):
  - true: set the position to 0 for all members of the query result array
  - false (default): set the position corresponding to the index of the result array member
- keepNull (boolean):
  - true (default): treat null like an empty string
  - false: discard nulls from View index. Can be used for index filtering (i.e. make your query return null for unwanted data). Note that empty results are always discarded.
- batchSize (integer): number between 1 and 1000 (default = 1) that determines the batch size for reading data from the query. In general, a single token is expected to be returned. However, if the query is expected to return many results, then increasing batchSize trades memory for performance.
- memoryLimit (integer): memory limit for query execution in bytes (default: 1048576 = 1 MiB, maximum: 33554432 = 32 MiB)
- returnType (string): data type of the returned tokens. If the indicated type does not match the actual type then an implicit type conversion is applied (see TO_STRING(), TO_NUMBER(), TO_BOOL())
  - "string" (default): convert emitted tokens to strings
  - "number": convert emitted tokens to numbers
  - "bool": convert emitted tokens to booleans
Examples
Soundex Analyzer for a phonetically similar term search:
arangosh> var analyzers = require("@arangodb/analyzers");
arangosh> var a = analyzers.save("soundex", "aql", { queryString: "RETURN SOUNDEX(@param)" },
........> ["frequency", "norm", "position"]);
arangosh> db._query("RETURN TOKENS('ArangoDB', 'soundex')").toArray();
[
[
"A652"
]
]
Concatenating Analyzer for conditionally adding a custom prefix or suffix:
Filtering Analyzer that ignores unwanted data based on the prefix "ir", with keepNull: false and explicitly returning null:
Filtering Analyzer that discards unwanted data based on the prefix "ir", using a filter for an empty result, which is discarded from the View index even without keepNull: false:
Custom tokenization with collapsePositions on and off:
The input string "A-B-C-D" is split into an array of strings ["A", "B", "C", "D"]. The position metadata (as used by the PHRASE() function) is set to 0 for all four strings if collapsePositions is enabled. Otherwise the position is set to the respective array index, 0 for "A", 1 for "B" and so on.
collapsePositions | A | B | C | D |
---|---|---|---|---|
true | 0 | 0 | 0 | 0 |
false | 0 | 1 | 2 | 3 |
The position data is not directly exposed, but we can see its effects through the PHRASE() function. There is one token between "B" and "D" to skip in case of uncollapsed positions. With positions collapsed, both are in the same position, thus there is negative one to skip to match the tokens.
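The skip counts mentioned above follow from simple arithmetic on the positions in the table. The following sketch only illustrates that arithmetic; tokensToSkip is a hypothetical helper and the position maps are copied from the table, not read from an index.

```javascript
// Positions as listed in the table for the input "A-B-C-D".
const uncollapsedPositions = { A: 0, B: 1, C: 2, D: 3 };
const collapsedPositions = { A: 0, B: 0, C: 0, D: 0 };

// Number of positions between two tokens, i.e. how many tokens lie in between.
function tokensToSkip(positions, from, to) {
  return positions[to] - positions[from] - 1;
}

console.log(tokensToSkip(uncollapsedPositions, "B", "D")); // 1: "C" lies in between
console.log(tokensToSkip(collapsedPositions, "B", "D")); // -1: same position
```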
pipeline
Introduced in: v3.8.0
An Analyzer capable of chaining effects of multiple Analyzers into one. The pipeline is a list of Analyzers, where the output of an Analyzer is passed to the next for further processing. The final token value is determined by the last Analyzer in the pipeline.
The Analyzer is designed for cases like the following:
- Normalize text for a case insensitive search and apply n-gram tokenization
- Split input with delimiter Analyzer, followed by stemming with the stem Analyzer
The properties allowed for this Analyzer are an object with the following attributes:
- pipeline (array): an array of Analyzer definition-like objects with type and properties attributes
Analyzers of types geopoint and geojson cannot be used in pipelines and will make the creation fail. These Analyzers require additional postprocessing and can only be applied to document fields directly.
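Conceptually, a pipeline feeds every token emitted by one stage into the next stage. The sketch below models this with plain functions that map one token to an array of tokens. It is an illustration of the chaining idea only: runPipeline, upperCase, bigrams and splitAtAny are hypothetical stand-ins and do not reproduce the exact behavior of the norm, ngram or delimiter Analyzers.

```javascript
// Each stage maps a single token to an array of tokens.
function runPipeline(input, stages) {
  // Start with the input as a single token and pass all tokens through each stage in order.
  return stages.reduce((tokens, stage) => tokens.flatMap((token) => stage(token)), [input]);
}

// Hypothetical stages for illustration.
const upperCase = (token) => [token.toUpperCase()];
const bigrams = (token) => {
  const grams = [];
  for (let i = 0; i + 2 <= token.length; i++) {
    grams.push(token.slice(i, i + 2));
  }
  return grams;
};
const splitAtAny = (delimiters) => (token) => {
  const parts = [];
  let current = "";
  for (const character of token) {
    if (delimiters.includes(character)) {
      if (current) parts.push(current);
      current = "";
    } else {
      current += character;
    }
  }
  if (current) parts.push(current);
  return parts;
};

console.log(runPipeline("Quick", [upperCase, bigrams]));
console.log(runPipeline("red,green;blue", [splitAtAny(",;"), upperCase]));
```

The order of the stages matters: the tokens that come out of the last stage are the final result.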
Examples
Normalize to all uppercase and compute bigrams:
Split at delimiting characters "," and ";", then stem the tokens:
stopwords
Introduced in: v3.8.1
An Analyzer capable of removing specified tokens from the input.
It uses binary comparison to determine if an input token should be discarded. It checks for exact matches. If the input contains only a substring that matches one of the defined stopwords, then it is not discarded. Longer inputs such as prefixes of stopwords are also not discarded.
The properties allowed for this Analyzer are an object with the following attributes:
- stopwords (array): array of strings that describe the tokens to be discarded. The interpretation of each string depends on the value of the hex parameter.
- hex (boolean): If false (default), then each string in stopwords is used verbatim. If true, then the strings need to be hex-encoded. This allows for removing tokens that contain non-printable characters. To encode UTF-8 strings to hex strings you can use e.g.
  - AQL: FOR token IN ["and","the"] RETURN TO_HEX(token)
  - arangosh / Node.js: ["and","the"].map(token => Buffer(token).toString("hex"))
  - Modern browser: ["and","the"].map(token => Array.from(new TextEncoder().encode(token), byte => byte.toString(16).padStart(2, "0")).join(""))
Examples
Create and use a stopword Analyzer that removes the tokens "and" and "the". The stopword array with hex-encoded strings for this looks like ["616e64","746865"] (a = 0x61, n = 0x6e, d = 0x64 and so on). Note that "a" and "theater" are not removed, because there is no exact match with either of the stopwords "and" and "the":
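The exact-match behavior and the hex encoding can be approximated in Node.js as follows. This is a sketch rather than the Analyzer itself: toHex and removeStopwords are hypothetical helpers, and comparing the hex representation of the UTF-8 bytes is used here as a stand-in for the binary comparison.

```javascript
// Hex-encode the UTF-8 bytes of a token (Node.js Buffer API).
const toHex = (token) => Buffer.from(token, "utf8").toString("hex");

// Hypothetical filter: drops tokens whose bytes exactly match one of the stopwords.
function removeStopwords(tokens, stopwords, hex = false) {
  const blocked = new Set(hex ? stopwords.map((word) => word.toLowerCase()) : stopwords.map(toHex));
  return tokens.filter((token) => !blocked.has(toHex(token)));
}

console.log(["and", "the"].map(toHex));
console.log(removeStopwords(["the", "fox", "and", "a", "theater"], ["616e64", "746865"], true));
```

Only "the" and "and" are removed; "a" and "theater" remain because they are not byte-for-byte equal to a stopword.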
Create and use an Analyzer pipeline that normalizes the input (convert to lower-case and base characters) and then discards the stopwords "and" and "the":
geojson
Introduced in: v3.8.0
An Analyzer capable of breaking up a GeoJSON object into a set of indexable tokens for further usage with ArangoSearch Geo functions.
GeoJSON object example:
{
"type": "Point",
"coordinates": [ -73.97, 40.78 ] // [ longitude, latitude ]
}
The properties allowed for this Analyzer are an object with the following attributes:
- type (string, optional):
  - "shape" (default): index all GeoJSON geometry types (Point, Polygon etc.)
  - "centroid": compute and only index the centroid of the input geometry
  - "point": only index GeoJSON objects of type Point, ignore all other geometry types
- options (object, optional): options for fine-tuning geo queries. These options should generally remain unchanged
  - maxCells (number, optional): maximum number of S2 cells (default: 20)
  - minLevel (number, optional): the least precise S2 level (default: 4)
  - maxLevel (number, optional): the most precise S2 level (default: 23)
Examples
Create a collection with GeoJSON Points stored in an attribute location, a geojson Analyzer with default properties, and a View using the Analyzer. Then query for locations that are within a 3 kilometer radius of a given point and return the matched documents, including the calculated distance in meters. The stored coordinates and the GEO_POINT() arguments are expected in longitude, latitude order:
geopoint
Introduced in: v3.8.0
An Analyzer capable of breaking up a JSON object describing a coordinate into a set of indexable tokens for further usage with ArangoSearch Geo functions.
The Analyzer can be used for two different coordinate representations:
- an array with two numbers as elements in the format [<latitude>, <longitude>], e.g. [40.78, -73.97].
- two separate number attributes, one for latitude and one for longitude, e.g. { location: { lat: 40.78, lng: -73.97 } }. The attributes cannot be at the top level of the document, but must be nested like in the example, so that the Analyzer can be defined for the field location with the Analyzer properties { "latitude": ["lat"], "longitude": ["lng"] }.
The properties allowed for this Analyzer are an object with the following attributes:
- latitude (array, optional): array of strings that describes the attribute path of the latitude value relative to the field for which the Analyzer is defined in the View
- longitude (array, optional): array of strings that describes the attribute path of the longitude value relative to the field for which the Analyzer is defined in the View
- options (object, optional): options for fine-tuning geo queries. These options should generally remain unchanged
  - maxCells (number, optional): maximum number of S2 cells (default: 20)
  - minLevel (number, optional): the least precise S2 level (default: 4)
  - maxLevel (number, optional): the most precise S2 level (default: 23)
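Because the two coordinate representations use latitude first while GeoJSON uses longitude first, it can help to see the mapping spelled out. The following sketch reads a coordinate from a field value using the latitude and longitude attribute paths and returns it as a GeoJSON Point. It is only an illustration: toGeoJsonPoint and getByPath are hypothetical helpers and say nothing about how the Analyzer produces its index tokens.

```javascript
// Follow an attribute path such as ["lat"] inside a value.
function getByPath(value, path) {
  return path.reduce((current, key) => (current == null ? undefined : current[key]), value);
}

// Hypothetical helper: convert a geopoint-style field value to a GeoJSON Point.
function toGeoJsonPoint(fieldValue, { latitude = [], longitude = [] } = {}) {
  let lat;
  let lng;
  if (latitude.length === 0 && longitude.length === 0) {
    // Array representation: [<latitude>, <longitude>]
    if (!Array.isArray(fieldValue)) return null;
    [lat, lng] = fieldValue;
  } else {
    // Nested attributes, addressed relative to the field.
    lat = getByPath(fieldValue, latitude);
    lng = getByPath(fieldValue, longitude);
  }
  if (typeof lat !== "number" || typeof lng !== "number") return null;
  // GeoJSON uses [longitude, latitude] order.
  return { type: "Point", coordinates: [lng, lat] };
}

console.log(toGeoJsonPoint([40.78, -73.97]));
console.log(toGeoJsonPoint({ lat: 40.78, lng: -73.97 }, { latitude: ["lat"], longitude: ["lng"] }));
```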
Examples
Create a collection with coordinate pairs stored in an attribute location, a geopoint Analyzer with default properties, and a View using the Analyzer. Then query for locations that are within a 3 kilometer radius of a given point. The stored coordinates are in latitude, longitude order, but GEO_POINT() and GEO_DISTANCE() expect longitude, latitude order:
Create a collection with coordinates stored in an attribute location as separate nested attributes lat and lng, a geopoint Analyzer that specifies the attribute paths to the latitude and longitude attributes (relative to the location attribute), and a View using the Analyzer. Then query for locations that are within a 3 kilometer radius of a given point:
Analyzer Features
The features of an Analyzer determine what term matching capabilities will be available and as such are only applicable in the context of ArangoSearch Views.
The valid values for the features are dependent on both the capabilities of the underlying type and the query filtering and sorting functions that the result can be used with. For example, the text type will produce frequency + norm + position and the PHRASE() AQL function requires frequency + position to be available.
Currently the following features are supported:
- frequency: how often a term is seen, required for PHRASE()
- norm: the field normalization factor
- position: sequentially increasing term position, required for PHRASE(). If present then the frequency feature is also required
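The dependencies between the features can be checked before creating an Analyzer. The sketch below encodes only the rules stated in the list above; checkFeatures and supportsPhrase are hypothetical helpers, not part of the ArangoDB API.

```javascript
const SUPPORTED_FEATURES = ["frequency", "norm", "position"];

// Hypothetical check: returns a list of problems, empty if the feature list is consistent.
function checkFeatures(features) {
  const problems = [];
  for (const feature of features) {
    if (!SUPPORTED_FEATURES.includes(feature)) {
      problems.push("unsupported feature: " + feature);
    }
  }
  // The position feature requires the frequency feature.
  if (features.includes("position") && !features.includes("frequency")) {
    problems.push("position requires frequency");
  }
  return problems;
}

// PHRASE() needs both frequency and position.
const supportsPhrase = (features) =>
  features.includes("frequency") && features.includes("position");

console.log(checkFeatures(["frequency", "norm", "position"]));
console.log(checkFeatures(["position"]));
console.log(supportsPhrase(["frequency", "norm"]));
```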
Built-in Analyzers
There is a set of built-in Analyzers which are available by default for convenience and backward compatibility. They can not be removed.
The identity Analyzer has no properties and the features frequency and norm. The Analyzers of type text all tokenize strings with stemming enabled, no stopwords configured, accent removal and case conversion to lowercase turned on and the features frequency, norm and position:
Name | Type | Locale (Language) | Case | Accent | Stemming | Stopwords | Features |
---|---|---|---|---|---|---|---|
identity | identity | | | | | | ["frequency", "norm"] |
text_de | text | de.utf-8 (German) | lower | false | true | [ ] | ["frequency", "norm", "position"] |
text_en | text | en.utf-8 (English) | lower | false | true | [ ] | ["frequency", "norm", "position"] |
text_es | text | es.utf-8 (Spanish) | lower | false | true | [ ] | ["frequency", "norm", "position"] |
text_fi | text | fi.utf-8 (Finnish) | lower | false | true | [ ] | ["frequency", "norm", "position"] |
text_fr | text | fr.utf-8 (French) | lower | false | true | [ ] | ["frequency", "norm", "position"] |
text_it | text | it.utf-8 (Italian) | lower | false | true | [ ] | ["frequency", "norm", "position"] |
text_nl | text | nl.utf-8 (Dutch) | lower | false | true | [ ] | ["frequency", "norm", "position"] |
text_no | text | no.utf-8 (Norwegian) | lower | false | true | [ ] | ["frequency", "norm", "position"] |
text_pt | text | pt.utf-8 (Portuguese) | lower | false | true | [ ] | ["frequency", "norm", "position"] |
text_ru | text | ru.utf-8 (Russian) | lower | false | true | [ ] | ["frequency", "norm", "position"] |
text_sv | text | sv.utf-8 (Swedish) | lower | false | true | [ ] | ["frequency", "norm", "position"] |
text_zh | text | zh.utf-8 (Chinese) | lower | false | true | [ ] | ["frequency", "norm", "position"] |
Note that locale, case, accent, stemming and stopwords are Analyzer properties. text_zh does not have actual stemming support for Chinese despite what the property value suggests.
Supported Languages
Analyzers rely on ICU for language-dependent tokenization and normalization. The ICU data file icudtl.dat that ArangoDB ships with contains information for a lot of languages, which are technically all supported.
The alphabetical order of characters is not taken into account by ArangoSearch, i.e. range queries in SEARCH operations against Views will not follow the language rules as per the defined Analyzer locale nor the server language (startup option --default-language)!
Also see Known Issues.
Stemming support is provided by Snowball, which supports the following languages:
Language | Code |
---|---|
Arabic * | ar |
Basque * | eu |
Catalan * | ca |
Danish * | da |
Dutch | nl |
English | en |
Finnish | fi |
French | fr |
German | de |
Greek * | el |
Hindi * | hi |
Hungarian * | hu |
Indonesian * | id |
Irish * | ga |
Italian | it |
Lithuanian * | lt |
Nepali * | ne |
Norwegian | no |
Portuguese | pt |
Romanian * | ro |
Russian | ru |
Serbian * | sr |
Spanish | es |
Swedish | sv |
Tamil * | ta |
Turkish * | tr |
* Introduced in: v3.7.0