arangoimport Examples: CSV / TSV
Importing CSV Data
arangoimport can import data from CSV files. This comes in handy when the data at hand is already in CSV format and you don’t want to spend time converting it to JSON for the import.
To import data from a CSV file, make sure your file contains the attribute names in the first row. All the following lines in the file will be interpreted as data records and will be imported.
The CSV import requires the data to have a homogeneous structure. All records
must have exactly the same number of columns as there are headers. By default,
lines with a different number of values will not be imported and there will be
warnings for them. To still import lines with fewer values than in the header,
there is the --ignore-missing option. If set to true, lines that have a
different number of fields will be imported. In this case only those attributes
will be populated for which there are values. Attributes for which there are
no values present will silently be discarded.
Example:
"first","last","age","active","dob"
"John","Connor",25,true
"Jim","O'Brady"
With --ignore-missing this will produce the following documents:
{ "first" : "John", "last" : "Connor", "active" : true, "age" : 25 }
{ "first" : "Jim", "last" : "O'Brady" }
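The effect of --ignore-missing on short rows can be sketched in Python. This only illustrates the rule described above and is not how arangoimport is implemented. The helper name rows_to_documents is hypothetical, type conversion is left out (all values stay strings), and skipping rows with more values than headers is an assumption of this sketch.

```python
import csv
import io


def rows_to_documents(text, ignore_missing=False):
    """Map CSV rows to dicts (hypothetical helper mimicking the documented rule)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    documents, skipped = [], 0
    for row in reader:
        complete = len(row) == len(header)
        short = len(row) < len(header)
        if complete or (ignore_missing and short):
            # zip() stops at the shorter sequence, so only attributes
            # that actually have a value are populated.
            documents.append(dict(zip(header, row)))
        else:
            skipped += 1  # arangoimport would report a warning for this line
    return documents, skipped


data = '''"first","last","age","active","dob"
"John","Connor",25,true
"Jim","O'Brady"
'''
documents, skipped = rows_to_documents(data, ignore_missing=True)
print(documents, skipped)
```

With ignore_missing left at False, both data rows of this sample are counted as skipped, because neither has five values.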
The cell values can have different data types, though. If a cell does not have any value, it can be left empty in the file. Such values will not be imported, so the corresponding attributes will not be present in the created document. Values enclosed in quotes will be imported as strings, so to import numeric values, boolean values or the null value, don’t enclose the value in quotes in your file.
We will be using the following data for the CSV import:
"first","last","age","active","dob"
"John","Connor",25,true,
"Jim","O'Brady",19,,
"Lisa","Jones",,,"1981-04-09"
Hans,dos Santos,0123,,
Wayne,Brewer,null,false,
The command line to execute the import is:
arangoimport --file "data.csv" --type csv --collection "users"
The above data will be imported into 5 documents which will look as follows:
{ "first" : "John", "last" : "Connor", "active" : true, "age" : 25 }
{ "first" : "Jim", "last" : "O'Brady", "age" : 19 }
{ "first" : "Lisa", "last" : "Jones", "dob" : "1981-04-09" }
{ "first" : "Hans", "last" : "dos Santos", "age" : 123 }
{ "first" : "Wayne", "last" : "Brewer", "active" : false }
As can be seen, values left completely empty in the input file will be treated
as absent. This is also true for unquoted null literals.
The literals true and false will be treated as booleans if they are not
enclosed in quotes.
Numeric values not enclosed in quotes will be treated as numbers.
Note that leading zeros in numeric values will be removed. To import numbers
with leading zeros, please use strings (e.g. "012" instead of 012).
You can set --convert to false if you want to treat all unquoted literals
and numbers as strings instead. --convert is enabled by default.
Other values not enclosed in quotes will be treated as strings. Any values enclosed in quotes will be treated as strings, too.
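The conversion rules above can be summarized in a small Python sketch. It mirrors the documented behavior for a single raw cell and is not arangoimport's actual parser. The helper name convert_cell and the regular expression for what counts as a number are assumptions made for illustration.

```python
import re

# Assumption: plain integers, decimals and exponent notation count as numbers.
_NUMBER = re.compile(r"-?\d+(\.\d+)?([eE][+-]?\d+)?")


def convert_cell(raw, convert=True):
    """Return (present, value) for one raw CSV cell (hypothetical helper)."""
    if raw == "":
        return False, None  # empty cell: the attribute is absent
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        # Quoted values are always strings; "" inside is a literal quote.
        return True, raw[1:-1].replace('""', '"')
    if not convert:
        return True, raw  # --convert false: unquoted literals stay strings
    if raw == "null":
        return False, None  # unquoted null is treated as absent
    if raw in ("true", "false"):
        return True, raw == "true"
    if _NUMBER.fullmatch(raw):
        is_float = any(ch in raw for ch in ".eE")
        return True, float(raw) if is_float else int(raw)  # int() drops leading zeros
    return True, raw  # any other unquoted value is a string


header = ["first", "last", "age", "active", "dob"]
line = "Hans,dos Santos,0123,,"
doc = {}
for name, raw in zip(header, line.split(",")):
    present, value = convert_cell(raw)
    if present:
        doc[name] = value
print(doc)  # {'first': 'Hans', 'last': 'dos Santos', 'age': 123}
```

Splitting the line on commas is only safe here because this row contains no quoted separators.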
String values containing the quote character or the separator must be enclosed
with quote characters. Within a string, the quote character itself must be
escaped with another quote character (or with a backslash if the
--backslash-escape option is used).
Note that the quote and separator characters can be adjusted via the
--quote and --separator arguments when invoking arangoimport. The quote
character defaults to the double quote mark ("). To use a literal quote in a
string, you can use two quote characters ("").
To use backslash for escaping quote characters (\"), please set the option
--backslash-escape to true.
The importer supports Windows (CRLF) and Unix (LF) line breaks. Line breaks might also occur inside values that are enclosed with the quote character.
Here is an example for using literal quotes and newlines inside values:
"name","password"
"Foo","r4ndom""123!"
"Bar","wow!
this is a
multine password!"
"Bartholomew ""Bart"" Simpson","Milhouse"
Extra whitespace at the end of each line will be ignored. Whitespace at the start of lines or between field values will not be ignored, so please make sure that there is no extra whitespace in front of values or between them.
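To check how a file with doubled quotes and embedded line breaks will be split into fields before importing it, the quoting example above can be parsed with Python's csv module. The module follows the same doubled-quote convention by default. It is a separate parser, so this is only a plausibility check and not a guarantee about arangoimport's behavior.

```python
import csv
import io

sample = '''"name","password"
"Foo","r4ndom""123!"
"Bar","wow!
this is a
multine password!"
"Bartholomew ""Bart"" Simpson","Milhouse"
'''

# DictReader uses the first row as attribute names, like the CSV import does.
rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    print(repr(row["name"]), repr(row["password"]))
```

The six physical lines after the header yield three records. The second password keeps its two line breaks, and each doubled quote collapses into a single literal quote.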
Importing TSV Data
You may also import tab-separated values (TSV) from a file. This format is very simple: every line in the file represents a data record. There is no quoting or escaping. That also means that the separator character (which defaults to the tab character) must not be used anywhere in the actual data.
As with CSV, the first line in the TSV file must contain the attribute names, and all lines must have an identical number of values.
If a different separator character or string should be used, it can be specified
with the --separator argument.
An example command line to execute the TSV import is:
arangoimport --file "data.tsv" --type tsv --collection "users"
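Because TSV has no quoting or escaping, it can help to validate values when generating the input file. The following Python sketch writes records as TSV text and refuses any value that contains the separator or a line break. The helper name to_tsv is hypothetical, and rejecting line breaks is an assumption that follows from every line being one record.

```python
def to_tsv(header, records, separator="\t"):
    """Render records as TSV text, refusing values the format cannot represent."""
    lines = [separator.join(header)]
    for record in records:
        values = [str(record[name]) for name in header]
        for value in values:
            if separator in value or "\n" in value or "\r" in value:
                raise ValueError(f"value {value!r} contains a separator or line break")
        lines.append(separator.join(values))
    return "\n".join(lines) + "\n"


text = to_tsv(["first", "last"], [{"first": "John", "last": "Connor"}])
print(repr(text))  # 'first\tlast\nJohn\tConnor\n'
```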
Reading compressed input files
arangoimport can transparently process gzip-compressed input files if they have a “.gz” file extension, e.g.
arangoimport --file data.csv.gz --type csv --collection "users"
For other compression formats it is possible to decompress the input file using another program and pipe its output into arangoimport, e.g.
bzcat users.csv.bz2 | arangoimport --file "-" --type csv --collection "users"
This example requires that a bzcat utility for decompressing bzip2-compressed
files is available, and that the shell supports pipes.
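On systems without a bzcat utility, a few lines of Python can take its place in the pipe. This is a minimal sketch using the standard bz2 module. The function name stream_bz2 and the script name used in the usage note are hypothetical.

```python
import bz2
import shutil
import sys


def stream_bz2(path, out=None):
    """Copy the decompressed contents of a bzip2 file to a binary stream."""
    if out is None:
        out = sys.stdout.buffer  # raw bytes, so the data passes through unchanged
    with bz2.open(path, "rb") as source:
        shutil.copyfileobj(source, out)


if __name__ == "__main__" and len(sys.argv) > 1:
    stream_bz2(sys.argv[1])
```

Saved as decompress.py (a made-up name), it could be used as: python decompress.py users.csv.bz2 | arangoimport --file "-" --type csv --collection "users"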
Attribute Name Translation
For the CSV and TSV input formats, attribute names can be translated automatically. This is useful in case the import file has different attribute names than those that should be used in ArangoDB.
A common use case is to rename an id column from the input file into _key as
it is expected by ArangoDB. To do this, specify the following translation when
invoking arangoimport:
arangoimport --file "data.csv" --type csv --translate "id=_key"
Other common cases are to rename columns in the input file to _from and _to:
arangoimport --file "data.csv" --type csv --translate "from=_from" --translate "to=_to"
The --translate option can be specified multiple times. The source attribute name
and the target attribute must be separated with a =.
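The renaming that --translate performs on the header line can be pictured with a short Python sketch. It only models the source=target mapping described above. The helper names are hypothetical, and rejecting malformed entries with an exception is an assumption of this sketch.

```python
def parse_translations(options):
    """Turn ["id=_key", ...] into {"id": "_key", ...}."""
    mapping = {}
    for option in options:
        source, separator, target = option.partition("=")
        if not separator or not source or not target:
            raise ValueError(f"invalid translation: {option!r}")
        mapping[source] = target
    return mapping


def translate_header(header, options):
    """Rename the attributes listed in options; leave all others untouched."""
    mapping = parse_translations(options)
    return [mapping.get(name, name) for name in header]


print(translate_header(["from", "to", "weight"], ["from=_from", "to=_to"]))
# ['_from', '_to', 'weight']
```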
Ignoring Attributes
For the CSV and TSV input formats, certain attribute names can be ignored on
imports. In an ArangoDB cluster this can come in handy when your documents
already contain a _key attribute and your collection has a sharding attribute
other than _key. In the cluster this configuration is not supported, because
ArangoDB needs to guarantee the uniqueness of the _key attribute in all shards
of the collection.
arangoimport --file "data.csv" --type csv --remove-attribute "_key"
The same thing would apply if your data contains an _id attribute:
arangoimport --file "data.csv" --type csv --remove-attribute "_id"
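In terms of the resulting documents, removing attributes amounts to dropping the named columns before each record is stored. A minimal Python illustration follows. The helper name remove_attributes and the sample values are made up for this sketch.

```python
def remove_attributes(document, names):
    """Return a copy of the document without the listed attributes."""
    return {key: value for key, value in document.items() if key not in names}


row = {"_key": "1234", "_id": "users/1234", "first": "John"}
print(remove_attributes(row, {"_key", "_id"}))  # {'first': 'John'}
```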
Reading headers from a separate file
For the CSV and TSV input formats it is sometimes required to read raw data files that do not contain a first line with all attribute names.
For these cases, arangoimport supports a --headers-file option to specify
a separate input file just for the header line with all the attribute names.
The contents of this file will be interpreted as a CSV/TSV line with attribute
names, and the contents of the regular input file (--file) will be
interpreted as the data to import, without any attribute names.
The --headers-file option can be used as follows:
arangoimport --file "data.csv" --type csv --headers-file "headers.csv"
If the option is used, it is necessary that the file specified via
--headers-file contains one line with the attribute names in CSV/TSV format
(taking into account --backslash-escape, --quote and --separator).
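How a separate header line combines with header-less data can be sketched with Python's csv module. The function name read_with_headers is hypothetical, and the separator and quote parameters merely stand in for the corresponding arangoimport options. Backslash escaping and type conversion are not modeled.

```python
import csv
import io


def read_with_headers(headers_text, data_text, separator=",", quote='"'):
    """Pair one header line with data rows that carry no attribute names."""
    header_reader = csv.reader(
        io.StringIO(headers_text), delimiter=separator, quotechar=quote
    )
    header = next(header_reader)  # the headers input contains exactly one line
    data_reader = csv.reader(
        io.StringIO(data_text), delimiter=separator, quotechar=quote
    )
    # Every line of the data input is a record, including the first one.
    return [dict(zip(header, row)) for row in data_reader]


headers = '"first","last"\n'
data = '"John","Connor"\n"Jim","O\'Brady"\n'
print(read_with_headers(headers, data))
```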