Features and Improvements in ArangoDB 3.9
The following list shows in detail which features have been added or improved in ArangoDB 3.9. ArangoDB 3.9 also contains several bug fixes that are not listed here.
ArangoSearch
Segmentation and Collation Analyzers
The new segmentation Analyzer type allows you to tokenize text in a language-agnostic manner as per Unicode Standard Annex #29, making it suitable for mixed-language strings. It can optionally preserve all non-whitespace or all characters instead of keeping alphanumeric characters only, as well as apply case conversion.
The collation Analyzer converts the input into a set of language-specific tokens. This makes comparisons follow the rules of the respective language, most notably in range queries against Views.
See Segmentation and Collation in the Analyzers documentation.
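As a minimal arangosh sketch of how such Analyzers could be created (the Analyzer names mySegmenter and myCollator are illustrative, and the property values shown are just one possible configuration):

var analyzers = require("@arangodb/analyzers");

// segmentation: language-agnostic tokenization, lowercased,
// keeping alphanumeric tokens only
analyzers.save("mySegmenter", "segmentation",
  { break: "alpha", case: "lower" }, ["frequency", "position"]);

// collation: language-specific comparison rules (English here)
analyzers.save("myCollator", "collation", { locale: "en" }, []);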
UI
Analyzers in Web Interface
A new menu item ANALYZERS has been added to the side navigation bar of the Web UI. Through this page, you can view existing Analyzers as well as create new Analyzers. The UI is full-featured and lets you feed in all parameters and options that you could otherwise input through the HTTP or JavaScript API.
It also lets you copy configuration from an existing Analyzer, allowing for a much quicker workflow when your new Analyzer is very similar to an existing one.
It offers two edit/view modes - a form mode where a standard web form is used to capture user input, and a JSON mode where experienced users can directly write the raw Analyzer configuration in JSON format.
Configurable root redirect
Added two options to arangod to allow HTTP redirection customization for the root (/) call of the HTTP API:
- --http.permanently-redirect-root: if true (default), use a permanent redirection (HTTP 301 code); if false, fall back to temporary redirection (HTTP 302 code).
- --http.redirect-root-to: redirect the root URL to the specified path. Redirects to /_admin/aardvark/index.html if not set (default).
These options are useful to override the built-in web interface with some user-defined action.
Web interface session handling
The previously inactive startup parameter --server.session-timeout was revived and now controls the timeout for web interface sessions (and other sessions that are based on JWTs created by the /_open/auth API).
For security reasons, the default timeout value for web interface sessions has been reduced to one hour, after which a session is ended automatically. Active web interface sessions (i.e. sessions with ongoing user activity) are extended automatically, until the user either ends the session explicitly or is inactive for a period of one hour.
The timeout value for web interface sessions can be adjusted via the --server.session-timeout startup parameter (in seconds).
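For illustration, the same /_open/auth API that backs web interface sessions can be called from arangosh to obtain such a session JWT (a sketch that assumes a root user with an empty password):

// request a JWT; the token is subject to --server.session-timeout
var response = arango.POST("/_open/auth",
  JSON.stringify({ username: "root", password: "" }));
response.jwt; // bearer token for subsequent requests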
AQL
Prune Variable
Added an option to store the PRUNE expression in a variable. A PRUNE condition can now be stored in a variable and used later in the query without having to repeat the condition:
FOR v, e, p IN 10 OUTBOUND @start GRAPH "myGraph"
PRUNE pruneCondition = v.isRelevant == true
FILTER pruneCondition
RETURN p
The condition v.isRelevant == true is stored in the variable pruneCondition, and later used as a condition for FILTER.
See Pruning.
Decay Functions
Added three decay functions to AQL: DECAY_GAUSS(), DECAY_EXP(), and DECAY_LINEAR(). Decay functions calculate a score with a function that decays depending on the distance of a numeric value from a user-given origin. All three functions take the arguments (value, origin, scale, offset, decay):
DECAY_GAUSS(41, 40, 5, 5, 0.5) // 1
DECAY_LINEAR(5, 0, 10, 0, 0.2) // 0.6
DECAY_EXP(2, 0, 10, 0, 0.2) // 0.7247796636776955
Vector Functions
Added three vector functions to AQL for calculating the cosine similarity, Manhattan distance, and Euclidean distance:
COSINE_SIMILARITY([0,1], [1,0]) // 0
L1_DISTANCE([-1,-1], [2,2]) // 6
L2_DISTANCE([1,1], [5,2]) // 4.1231056256176606
Traversal filtering optimizations
A post-filter on the vertex and/or edge result of a traversal will now be applied during the traversal, to avoid generating the full output for AQL. This will have a positive impact on performance when filtering on the vertex/edge but still returning the path.
Previously, all paths were produced even for non-matching vertices/edges. The new optimization will check the vertex/edge filter condition first and only produce the remaining paths.
For example, the query
FOR v, e, p IN 10 OUTBOUND @start GRAPH "myGraph"
FILTER v.isRelevant == true
RETURN p
can now be optimized, and the traversal statement will only produce paths for which the last vertex satisfies isRelevant == true.
Traversal partial path buildup
There is now a performance optimization for traversals in which the path is returned, but only a specific sub-attribute of the path is used later (e.g. the vertices, edges, or weight sub-attribute).
For example, the query
FOR v, e, p IN 1..3 OUTBOUND @start GRAPH "myGraph"
RETURN p.vertices
only requires the buildup of the vertices sub-attribute of the path result, but not the buildup of the edges sub-attribute.
This optimization should have a positive impact on performance for larger traversal result sets.
Warnings on invalid OPTIONS
Invalid use of OPTIONS in AQL queries will now raise a warning when the query is parsed. This is useful to detect misspelled attribute names in OPTIONS, e.g.
INSERT ... INTO collection
OPTIONS { overwrightMode: 'ignore' } /* should have been 'overwriteMode' */
It is also useful to detect valid OPTIONS attribute names that are used in the wrong position in the query, e.g.
FOR doc IN collection
FILTER doc.value == 1234
INSERT doc INTO other
OPTIONS { indexHint: 'myIndex' } /* should have been used on the FOR loop above */
In case options are used incorrectly, a warning with code 1575 will be raised
during query parsing or optimization. By default, warnings are reported but do
not lead to the query being aborted. This can be toggled by the startup option --query.fail-on-warnings or the per-query runtime option failOnWarnings.
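For example, the per-query option can be used from arangosh to turn such warnings into hard errors (a sketch with hypothetical collection names):

db._query(
  "INSERT { value: 1 } INTO collection OPTIONS { overwrightMode: 'ignore' }",
  {},
  { failOnWarnings: true } // turns warning 1575 into a query-aborting error
);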
Memory usage tracking
The AQL operations K_SHORTEST_PATHS and SHORTEST_PATH are now included in the memory usage tracking performed by AQL, so that memory acquired by these operations will be accounted for and checked against the configured memory limit (options --query.memory-limit and --query.memory-limit-global).
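As an arangosh sketch of how a per-query memory limit now also covers these operations (the graph, collection, and document names are hypothetical):

db._query(
  "FOR v IN OUTBOUND SHORTEST_PATH @from TO @to GRAPH 'myGraph' RETURN v",
  { from: "places/start", to: "places/end" },
  { memoryLimit: 32 * 1024 * 1024 } // abort above 32 MB of query memory
);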
Execution of complex queries
Very large queries (in terms of query execution plan complexity) are now split into multiple segments that are executed using separate stacks. This avoids potential stack overflows. The number of execution nodes after which such stack splitting is performed can be configured via the startup option --query.max-nodes-per-callstack. The default value is 200 for macOS, and 250 for the other supported platforms. The value can be adjusted per query via the maxNodesPerCallstack query option.
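The per-query option can be passed like any other query option, for example from arangosh (a sketch; the trivial query stands in for a large generated one):

db._query("RETURN 1 + 2 + 3", {}, { maxNodesPerCallstack: 100 });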
Query complexity limits
AQL now has some hard-coded query complexity limits, to prevent large programmatically generated queries from causing trouble (too deep recursion, enormous memory usage, long query optimization and distribution passes etc.).
The following limits have been introduced:
- a recursion limit for AQL query expressions: expression recursion is limited to 500 levels. An example expression is 1 + 2 + 3 + 4, which is 3 levels deep (1 + (2 + (3 + 4))).
- a limit for the number of execution nodes in the initial query execution plan: the number of execution nodes is limited to 4,000.
RocksDB block cache control
The new query option fillBlockCache can be used to control the population of the RocksDB block cache with data read by the query. The default value for this per-query option is true, which means that any data read by the query will be inserted into the RocksDB block cache if not already present there. This mimics the previous behavior and is a sensible default.
Setting the option to false keeps any data read by the query out of the RocksDB block cache. This is useful for queries that read a lot of (cold) data which would otherwise lead to the eviction of hot data from the block cache.
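From arangosh, the option can be set per query; a sketch with a hypothetical collection of rarely accessed data:

db._query(
  "FOR doc IN coldData FILTER doc.year < 2000 RETURN doc",
  {},
  { fillBlockCache: false } // keep this scan out of the block cache
);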
Server options
Extended naming convention for databases
There is a new startup option --database.extended-names-databases to allow database names to contain most UTF-8 characters. This feature is experimental in ArangoDB 3.9, but will become the norm in a future version.
Running the server with the option enabled provides support for database names containing characters outside the ASCII range, such as Japanese, Arabic, or accented letters, as well as emojis. Also, many ASCII characters that were formerly banned in the traditional naming convention are now accepted.
Example database names that can be used with the new naming convention:
"España", "😀", "犬", "كلب", "@abc123", "København", "München", "Россия", "abc? <> 123!"
The ArangoDB client tools arangobench, arangodump, arangoexport, arangoimport, arangorestore, and arangosh ship with full support for the extended database naming convention.
Note that the default value for --database.extended-names-databases is false, for compatibility with existing client drivers and applications that only support ASCII names according to the traditional database naming convention used in previous ArangoDB versions. Enabling the feature may lead to incompatibilities, up to the ArangoDB instance becoming inaccessible for such drivers and client applications.
Please be aware that dumps containing extended database names cannot be restored into older versions that only support the traditional naming convention. In a cluster setup, it is required to use the same database naming convention for all Coordinators and DB-Servers of the cluster. Otherwise the startup will be refused. In DC2DC setups it is also required to use the same database naming convention for both datacenters to avoid incompatibilities.
Also see Database Naming Conventions.
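For illustration, once the server is started with the option enabled, such names can be used like any other database name from arangosh (a minimal sketch):

// requires arangod to be started with
// --database.extended-names-databases true
db._createDatabase("München");
db._databases(); // the returned list now includes "München"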
Version information
The arangod server now provides a --version-json command-line option to print version information in JSON format. This output can be used by tools that need to programmatically inspect an arangod executable.
A pseudo log topic "all" was added. Setting the log level for the "all" log topic will adjust the log level for all existing log topics. For example, --log.level all=debug will set all log topics to log level "debug".
Support info API
A new HTTP REST API endpoint GET /_admin/support-info
was added for retrieving
deployment information for support purposes. The endpoint returns data about the
ArangoDB version used, the host (operating system, server ID, CPU and storage capacity,
current utilization, a few metrics) and the other servers in the deployment
(in case of active failover or cluster deployments).
As this API may reveal sensitive data about the deployment, it can only be accessed from inside the _system database. In addition, there is a policy control startup option --server.support-info-api that controls if and to whom the API is made available. This option can have the following values:
- disabled: the support info API is disabled.
- jwt: the support info API can only be accessed via superuser JWT.
- hardened (default): if --server.harden is set, the support info API can only be accessed via superuser JWT. Otherwise it can be accessed by admin users only.
- public: everyone with access to the _system database can access the support info API.
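For example, the endpoint can be queried from arangosh while connected to the _system database (a sketch that assumes the configured policy grants access):

db._useDatabase("_system");
var info = arango.GET("/_admin/support-info");
info; // deployment information, e.g. host and version details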
Miscellaneous changes
Collection statuses
The previously existing collection statuses “new born”, “loading”, “unloading” and “unloaded” were removed, as they weren’t actively used in arangod.
These statuses were last relevant with the MMFiles storage engine, when it was important to differentiate which collections were present in main memory and which weren’t. With the RocksDB storage engine, all that is automatically handled anyway, and the mentioned statuses are not important anymore.
The “Load” and “Unload” buttons for collections have also been removed from the web interface. This change also obsoletes the load() and unload() calls for collections, as well as their HTTP API equivalents. The APIs will remain in place for now for downwards compatibility, but have been changed to no-ops. They will eventually be removed in a future version of ArangoDB.
Cluster-internal timeouts
The internal timeout for inactive cluster transactions on DB-Servers was increased from 3 to 5 minutes.
Previously, transactions on DB-Servers could expire quickly, which led to spurious “query ID not found” or “transaction ID not found” errors on DB-Servers for multi-server queries/transactions with unbalanced access patterns across the participating DB-Servers.
Transaction timeouts on Coordinators remain unchanged, so any queries/transactions that are abandoned will be aborted there, which will also be propagated to the DB-Servers.
Client tools
Increased default number of threads
The default value for the --threads startup parameter was changed from 2 to the maximum of 2 and the number of available CPU cores for the following client tools:
- arangodump
- arangoimport
- arangorestore
This change can help to improve the performance of import, dump, or restore processes on machines with multiple cores in case the --threads parameter was not previously used. As a trade-off, the change may lead to an increased load on servers, so any scripted import, dump, or restore processes that want to keep the server load under control should set the number of client threads explicitly when invoking any of the above client tools.
arangoimport
arangoimport now provides a --datatype startup option, in order to fix the datatypes for certain attributes in CSV/TSV imports. For example, in the following CSV input file, it is unclear whether the numeric values should be imported as numbers or as stringified numbers for the individual attributes:
key,price,weight,fk
123456,200,5,585852
864924,120,10,9998242
9949,70,11.5,499494
6939926,2130,5,96962612
To determine the datatypes for the individual columns, arangoimport can be invoked with the --datatype startup option, once for each attribute:
--datatype key=string
--datatype price=number
--datatype weight=number
--datatype fk=string
This will turn the numeric-looking values in the key attribute into strings, but treat the attributes price and weight as numbers. Finally, the values in the attribute fk will be treated as strings again.
See Overriding data types per attribute.
arangobench
arangobench now prints a short description of each test case when it starts, so it is easier to figure out what operations are carried out by a test case. Several test cases in arangobench have been deprecated because they do not target real-world use cases but were rather written for internal testing. The deprecated test cases will be removed in a future version to clean up the list of test cases.
arangovpack
The arangovpack utility supports more input and output formats (JSON and VelocyPack, plain or hex-encoded). The former options --json and --pretty have been removed and replaced with separate options for specifying the input and output types:
- --input-type (json, json-hex, vpack, vpack-hex)
- --output-type (json, json-pretty, vpack, vpack-hex)
The former option --print-non-json has been replaced with the new option --fail-on-non-json, which makes arangovpack fail when trying to emit non-JSON types to JSON output.
Internal changes
The compiler version used to build the ArangoDB Linux executables has been upgraded from g++ 9.3.0 to g++ 10.2.1. g++ 10 is also the expected version of g++ when compiling ArangoDB from source.
The bundled version of the Snappy compression library was upgraded from version 1.1.8 to version 1.1.9.
The minimum architecture requirements have been raised from the Westmere architecture to the Sandy Bridge architecture. 256-bit AVX instructions are now expected to be present on all targets that run ArangoDB 3.9 executables. If a target does not support AVX instructions, it may fail with SIGILL at runtime.