MarkLogic Server: MLCP User Guide
MarkLogic 10
May, 2019
MarkLogic Content Pump (mlcp) is a command line tool for getting data into and out of a
MarkLogic Server database. This chapter covers the following topics:
• Feature Overview
• Import content into a MarkLogic Server database from flat files, compressed ZIP and
GZIP files, or mlcp database archives.
• Create documents from flat files, delimited text files, aggregate XML files, and
line-delimited JSON files. For details, see “Importing Content Into MarkLogic Server” on
page 26.
• Import mixed content types from a directory, using the file suffix and MIME type
mappings to determine document type. Unrecognized/missing suffixes are imported as
binary documents. For details, see “How mlcp Determines Document Type” on page 31.
• Export the contents of a MarkLogic Server database to flat files, a compressed ZIP file, or
an mlcp database archive. For details, see “Exporting Content from MarkLogic Server” on
page 92.
• Copy content and metadata from one MarkLogic Server database to another. For details,
see “Copying Content Between Databases” on page 117.
• Import or copy content into a MarkLogic Server database, applying a custom server-side
transformation before inserting each document. For details, see “Transforming Content
During Ingestion” on page 57.
• Extract documents from an archived forest to flat files or a compressed file using Direct
Access. For details, see “Using Direct Access to Extract or Copy Documents” on
page 126.
• Import documents from an archived forest into a live database using Direct Access. For
details, see “Importing Documents from a Forest into a Database” on page 129.
The mlcp tool operates in local mode, meaning that mlcp drives all its work on the host where it is
invoked. Resources such as input data and export destinations must be reachable from that host,
and all communication with MarkLogic Server goes through that host.
In local mode, throughput is limited by resources such as memory and network bandwidth
available to the host running mlcp.
You can use mlcp even when a load balancer sits between the client host and the MarkLogic host.
The mlcp tool is compatible with AWS Elastic Load Balancer (ELB) and other load balancers.
aggregate: XML content that includes recurring element names and that can be split into
multiple documents with the recurring element as the document root. For details, see “Splitting
Large XML Files Into Multiple Documents” on page 37.
line-delimited JSON: A type of aggregate input where each line in the file is a piece of
standalone JSON content. For details, see “Creating Documents from Line-Delimited JSON
Files” on page 44.
• Replace mlcp.sh with mlcp.bat. You should always use mlcp.bat on Windows; using
mlcp.sh with Cygwin is not supported.
• For aesthetic reasons, long example command lines are broken into multiple lines using
the Unix line continuation character “\”. On Windows, remove the line continuation
characters and place the entire command on one line, or replace the line continuation
characters with the Windows equivalent, “^”.
• Replace option arguments enclosed in single quotes (') with double quotes ("). If the
single-quoted string contains embedded double quotes, escape the inner quotes.
• Escape any unescaped characters that have special meaning to the Windows command
interpreter.
For example, the following Unix command line:
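A hypothetical pair of commands illustrates the conversion; the host, credentials, and paths below are made up:

mlcp.sh import -host localhost -port 8000 -username user \
    -password password -input_file_path /space/data

On Windows, the same command becomes:

mlcp.bat import -host localhost -port 8000 -username user ^
    -password password -input_file_path c:\data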
Where command is one of the commands in the table below. Each command has a set of
command-specific options, which are covered in the chapter that discusses the command.
Command Description
import Import data from the file system or standard input to a MarkLogic Server
database. For a list of options usable with this command, see “Import
Command Line Options” on page 81.
export Export data from a MarkLogic Server database to the file system. For a
list of options usable with this command, see “Export Command Line
Options” on page 112.
copy Copy data from one MarkLogic Server database to another. For a list of
options usable with this command, see “Copy Command Line Options”
on page 119.
extract Use Direct Access to extract files from a forest file to documents on the
native file system. For a list of options usable with this command, see
“Extract Command Line Options” on page 130.
Options can also be specified in an options file using -options_file. Options files and command
line options can be used together. For details, see “Options File Syntax” on page 9.
• The value of a boolean typed option can be omitted. If the value is omitted, true is implied.
For example, -copy_collections is equivalent to -copy_collections true.
For example, the following command passes the setting “-Xmx100M” to the JVM to increase the
JVM heap size for a single mlcp run:
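One common mechanism is the JVM_OPTS environment variable read by the mlcp start scripts; if your mlcp version uses a different variable, check the comments in mlcp.sh or mlcp.bat. The connection options and path below are placeholders:

JVM_OPTS=-Xmx100M mlcp.sh import -host localhost -port 8000 \
    -username user -password password -input_file_path /space/data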
http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
If you use an options file, it must be the first option on the command line. The mlcp command
(import, export, copy) can also go inside the options file. For example:
• Each line contains either a command name, an option, or an option value, ordered as they
would appear on the command line.
• Comments begin with “#” and must be on a line by themselves.
• Blank lines, leading white space, and trailing white space are ignored.
For example, if you frequently use the same MarkLogic Server connection information (host,
port, username, and password), you can put this information into an options file:
$ cat my-conn.txt
# my connection info
-host
localhost
-port
8000
-username
me
-password
my_password
You can also include a command name (import, export, or copy) as the first non-comment line in
an options file:
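For example, an options file that begins with a command name can be run without naming the command on the command line. The connection values here are placeholders:

$ cat my-import.txt
import
-host
localhost
-port
8000
-username
me
-password
my_password

$ mlcp.sh -options_file my-import.txt -input_file_path import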
0: Successful completion.
For example, MarkLogic 9 and mlcp 9.0 include support for redacting documents as you export
them. However, older versions of MarkLogic do not support this feature, so it is not possible to
use the -redaction option of mlcp with older versions.
Similarly, you can use mlcp to export a database archive from MarkLogic 9 or later that includes
documents with the node-update security capability. However, this capability did not exist in
earlier versions of MarkLogic, so it cannot be preserved if you import the MarkLogic 9 archive
into an older MarkLogic, and may even cause errors.
For best results, use the version of mlcp that corresponds to your version of MarkLogic, or limit
your jobs to features you know are supported in both.
http://github.com/marklogic/marklogic-contentpump
• Supported Platforms
• Required Software
• Installing mlcp
• Security Considerations
• MarkLogic Server 7.0-1 or later, with an XDBC App Server configured. MarkLogic 8 and
later versions come with an XDBC App Server pre-configured on port 8000.
• Oracle/Sun Java JRE 1.8 or later.
2. Unpack the mlcp distribution to a location of your choice. This creates a directory named
mlcp-version, where version is the mlcp version. For example, assuming
/space/marklogic contains the zip file for mlcp version 1.3, then the following commands
install mlcp under /space/marklogic/mlcp-1.3/:
$ cd /space/marklogic
$ unzip mlcp-1.3-bin.zip
3. Optionally, put the mlcp bin directory on your path. For example:
$ export PATH=${PATH}:/space/marklogic/mlcp-1.3/bin
$ export PATH=${PATH}:$JAVA_HOME/bin
You might need to configure your MarkLogic cluster before using mlcp for the first time. For
details, see “Configuring Your MarkLogic Cluster” on page 13.
On Windows, use the mlcp.bat command to run mlcp. On UNIX and Linux, use the mlcp.sh
command. You should not use mlcp.sh in the Cygwin shell environment on Windows.
When you use mlcp with MarkLogic 8 or later on the default port (8000), no special cluster
configuration is necessary. Port 8000 includes a pre-configured XDBC App Server. The default
database associated with port 8000 is the Documents database. To use mlcp with a different
database and port 8000, use the -database, -input_database, or -output_database options. For
example:
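A sketch of such a command; the database name mydb and the input path are hypothetical:

mlcp.sh import -host localhost -port 8000 -username user \
    -password password -database mydb \
    -input_file_path /space/data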
When using MarkLogic 8 or later with a port other than 8000, the port should connect to either an
XDBC App Server or an App Server with a rewriter that is set up to handle XDBC traffic.
Hosts within a group share the same App Server configuration, but hosts in different groups do
not. Therefore, if all your forest hosts are in a single group, you only need to configure one App
Server to handle XDBC traffic. If your forests are on hosts in multiple groups, then you must
configure an App Server for XDBC that listens on the same port in each group.
For example, the cluster shown below is properly configured to use Database A as an mlcp input
or output source. Database A has 3 forests, located on 3 hosts in 2 different groups. Therefore,
both Group 1 and Group 2 must make Database A accessible via XDBC on port 9001.
(Figure: Group 1 and Group 2 each have an XDBC App Server on port 9001, and both serve
Database A.)
If the forests of Database A are only located on Host1 and Host2, which are in the same group,
then you would only need to configure one XDBC App Server on port 9001.
If you use MarkLogic 8 or later and port 8000 instead of port 9001, then you do not need to
explicitly create any XDBC App Servers to support the above database configuration because
both groups automatically have an XDBC App Server on port 8000. You might need to explicitly
specify the database name (Database A) in your mlcp command, though, if it is not the default
database associated with port 8000.
By default, mlcp requires a username and password to be included in the command line options
for each job. You can avoid passing a cleartext password between your mlcp client host and
MarkLogic Server by using Kerberos for authentication. For details, see “Using mlcp With
Kerberos” on page 16.
If you want to use SSL with both the source (input) and destination (output) App Servers during
an mlcp copy job, both App Servers must be SSL enabled.
For the copy command, the relevant options are -input_ssl and/or -output_ssl. For more
information, see “Copy Command Line Options” on page 119.
All these options accept a boolean argument value. As described in “Command Line Summary”
on page 7, “true” is assumed if you leave the argument off.
If you have disabled the default SSL protocol on your App Server, you must also use one of the
following options to explicitly specify the SSL protocol that mlcp should use when connecting to
MarkLogic:
Note: The above SSL protocol options are ignored in some cases when you use the SSL
configuration technique described in “Using mlcp With Kerberos” on page 16.
Before you can use Kerberos with mlcp, you must configure your MarkLogic installation to
enable external security, as described in External Security in the Security Guide.
If external security is not already configured, you will need to perform at least the following
procedures:
• Create a Kerberos external security configuration object. For details, see Creating an
External Authentication Configuration Object in the Security Guide.
• Create a Kerberos keytab file and install it in your MarkLogic installation. For details, see
Creating a Kerberos keytab File in the Security Guide.
• Create one or more users associated with an external name. For details, see Assigning an
External Name to a User in the Security Guide.
• Configure your XDBC App Server to use “kerberos-ticket” authentication. For details, see
Configuring an App Server for External Authentication in the Security Guide.
• Creating Users
• Invoking mlcp
This user must also be assigned roles and privileges required to enable your mlcp operations.
For example, if you’re using mlcp to import documents into a database, then the user must have
update privileges on the target database, as well as the minimum privileges required by mlcp. For
details on the minimum privileges required by mlcp, see “Security Considerations” on page 14.
For example, if you create a configuration named “kerb-conf”, then configure your XDBC App
Server with the following values for the “authentication”, “internal security”, and “external
security” configuration settings in the Admin Interface:
You can use an existing XDBC App Server or create a new one. To create a new XDBC App
Server, use the Admin Interface, the Admin API, or the REST Management API. For details, see
Procedures for Creating and Managing XDBC Servers in the Administrator’s Guide.
Configure the App Server to use “kerberos-ticket” authentication and the Kerberos external
security configuration object you created following the instructions in Creating an External
Authentication Configuration Object in the Security Guide.
Note: When you install MarkLogic, an XDBC App Server and other services are
available on port 8000. Changing the security configuration for the App Server on
port 8000 affects all the MarkLogic services available through this port, including
the HTTP App Server and REST Client API instance.
• Use kinit or a similar program on your mlcp client host to create and cache a Kerberos
Ticket-Granting Ticket (TGT) for a principal you assigned to a MarkLogic user.
• Invoke mlcp with no -username and no -password option from the environment in which
you cached the TGT.
For example, suppose you configured an XDBC App Server on port 9010 of host “ml-host” to use
“kerberos-ticket” authentication. Further, suppose you associated the Kerberos principal name
“kuser” with the user “mluser”. Then the following commands result in mlcp authenticating with
Kerberos as user “kuser”, and importing documents into the database as “mluser”.
kinit kuser
...
mlcp.sh import -host ml-host -port 9010 -input_file_path src_dir
You do not necessarily need to run kinit every time you invoke mlcp. The cached TGT remains
valid for its configured Kerberos lifetime.
This chapter walks you through a short introduction to mlcp in which you import documents into
a database and then export them back out as files in the following steps:
• Load Documents
• Export Documents
gs/
  import/
    one.xml
    two.json
  export/
2. Ensure the mlcp bin directory and the java commands are on your path. For example, the
following example command places the mlcp bin directory on your path if mlcp is
installed in MLCP_INSTALL_DIR:
3. Create a directory to serve as your work area and change directories to this work area. For
example:
mkdir gs
cd gs
4. Create a sub-directory to hold the sample input and output data. For example:
mkdir import
The examples use an options file to save MarkLogic connection related options so that you can
easily re-use them across multiple commands. This section describes how to create this file.
If you prefer to pass the connection options directly on the command line instead, add -username,
-password, -host, and possibly -port options to the example mlcp commands in place of
-options_file.
1. If you are not already at the top level of your work area, change directory to this location.
That is, the gs folder created in “Prepare to Run the Examples” on page 19.
cd gs
2. Create a file named conn.txt with the following contents. Each line is either an option
name or a value for the preceding option.
-username
your_username
-password
your_password
-host
localhost
-port
8000
3. Edit conn.txt and modify the values of the -username and -password options to match
your environment.
4. Optionally, modify the -host and/or -port option values. The host and port must identify a
MarkLogic Server App Server that supports the XDBC protocol. MarkLogic Server
comes with an App Server pre-configured on port 8000 that supports XDBC, attached to
the Documents database. You can choose a different App Server.
gs/
  conn.txt
  import/
    one.xml
    two.json
Other input options include compressed files, delimited text files, aggregate XML data, and
line-delimited JSON data. See “Importing Content Into MarkLogic Server” on page 26 for details.
You can also load documents into a different database using the -database option.
To load a single file, specify the path to the file as the value of -input_file_path. For example:
-input_file_path import
When you load documents, a default URI is generated based on the type of input data. For details,
see “Controlling Database URIs During Ingestion” on page 28.
We will import documents from flat files, so the default URI is the absolute pathname of the input
file. For example, if your work area is /space/gs on Linux or C:\gs on Windows, then the default
URI when you import documents from gs/import is as follows:
Linux: /space/gs/import/filename
Windows: /c:/gs/import/filename
You can use the -output_uri_replace option to strip off the portion of the URI that comes from
the path steps before “gs”. The option argument is of the form “pattern,replacement_text”. For
example, given the default URIs shown above, we’ll add the following option to create URIs that
begin with “/gs”:
Run the following command from the root of your work area (gs) to load all the files in the import
directory. Modify the argument to -output_uri_replace to match your environment.
Linux:
mlcp.sh import -options_file conn.txt \
-output_uri_replace "/space,''" -input_file_path import
Windows:
mlcp.bat import -options_file conn.txt ^
-output_uri_replace "/c:,''" -input_file_path import
The output from mlcp should look similar to the following (but with a timestamp prefix on each
line). “OUTPUT_RECORDS_COMMITTED: 2” indicates mlcp loaded two files. For more details, see
“Understanding mlcp Output” on page 23.
Optionally, use Query Console’s Explore feature to examine the contents of the Documents
database and see that the documents were created. You should see documents with the following
URIs:
/gs/import/one.xml
/gs/import/two.json
You can also create documents from files in a compressed file and from other types of input
archives. For details, see “Importing Content Into MarkLogic Server” on page 26.
You can identify the documents to export in several ways, including by URI, by directory, by
collection, and by XPath expression. This example uses a directory filter. Recall that the input
documents were loaded with URIs of the form /gs/import/filename. Therefore we can easily
extract the files by database directory using -directory_filter /gs/import/.
This example exports documents from the default database associated with the App Server on
port 8000. Use the -database option to export documents from a different database.
Use the following procedure to export the documents inserted in “Load Documents” on page 21.
1. If you are not already at the top level of your work area, change directory to this location.
That is, the gs folder created in “Prepare to Run the Examples” on page 19. For example:
cd gs
2. Extract the previously inserted documents into a directory named export. The export
directory must not already exist.
Linux:
mlcp.sh export -options_file conn.txt -output_file_path export \
-directory_filter /gs/import/
Windows:
mlcp.bat export -options_file conn.txt -output_file_path export ^
-directory_filter /gs/import/
You should see output similar to the following, but with a timestamp prefix on each line. The
“OUTPUT_RECORDS: 2” line indicates mlcp exported 2 files.
The exported documents are in gs/export. A filesystem directory is created for each directory
step in the original document URI. Therefore, you should now have the following directory
structure:
gs/
  export/
    gs/
      import/
        one.xml
        two.json
The following table summarizes the purpose of key pieces of information reported by mlcp:
Message: Content type is set to format X.
Description: Import only. Indicates the type of documents mlcp will create. The default is
MIXED, which means mlcp bases the type on the input file suffix. For details, see “How mlcp
Determines Document Type” on page 31.

Message: Total input paths to process : N
Description: Import only. Found N candidate input sources. If this number is 0, then the
pathname you supplied to -input_file_path does not contain any data that meets your import
criteria. If you are unable to diagnose the cause, refer to “Troubleshooting” on page 133.

Message: INPUT_RECORDS: N
Description: The number of inputs mlcp actually tried to process. For an import operation, this
is the number of documents mlcp attempted to create. For an export operation, this is the
number of documents mlcp attempted to export. If there are errors, this number may not
correspond to the actual number of documents imported, exported, copied, or extracted.
You can use mlcp to insert content into a MarkLogic Server database from flat files, compressed
ZIP and GZIP files, aggregate XML files, and MarkLogic Server database archives. The input
data can be accessed from the native filesystem.
For a list of import related options, see “Import Command Line Options” on page 81.
• Loading Triples
• Failover Handling
The default input type is documents, which means each input file or ZIP file entry creates one
database document. All other input file types represent composite input formats which can yield
multiple database documents per input file.
The following table provides a quick reference of the supported input file types, along with the
allowed document types for each, and whether or not they can be passed to mlcp as compressed
files.
When the input file type is documents or sequencefile you must consider both the input format
(-input_file_type) and the output document format (-document_type). In addition, for some
input formats, input can come from either compressed or uncompressed files
(-input_compressed).
The -document_type option controls the database document format when -input_file_type is
documents or sequencefile. MarkLogic Server supports text, JSON, XML, and binary documents.
If the document type is not explicitly set with these input file types, mlcp uses the input file suffix
to determine the type. For details, see “How mlcp Determines Document Type” on page 31.
Note: You cannot use mlcp to perform document conversions. Your input data should
match the stated document type. For example, you cannot convert XML input into
a JSON document just by setting -document_type json.
The following table summarizes the default behavior with several input sources:
delimited text file: The value in the column used as the id (the first column, by default). For a
record of the form “first,second,third” where Column 1 is the id: first
For example, the following command loads all files from the file system directory
/space/bill/data into the database attached to the App Server on port 8000. The documents
inserted into the database have URIs of the form /space/bill/data/filename.
If the /space/bill/data directory is zipped up into bill.zip, such that bill/ is the root directory
in the zip file, then the following command inserts documents with URIs of the form
bill/data/filename:
When you use the -generate_uri option to have mlcp generate URIs for you, the generated URIs
follow the same pattern as for aggregate XML and line delimited JSON:
/path/filename-split_start-seqnum
The generated URIs are unique across a single import operation, but they are not globally unique.
For example, if you repeatedly import data from some file /tmp/data.csv, the generated URIs will
be the same each time (modulo differences in the number of documents inserted by the job).
-output_uri_replace pattern,'string',pattern,'string'
For details on the regular expression language supported by -output_uri_replace, see “Regular
Expression Syntax” on page 9.
Note: These options are applied after the default URI is constructed and encoded, so if
the option values contain characters not allowed in a URI, you must encode them
yourself. See “Character Encoding of URIs” on page 31.
The following example loads documents from the filesystem directory /space/bill/data. The
default output URIs would be of the form /space/bill/data/filename. The example uses
-output_uri_replace to replace “bill/data” with “will” and strip off “/space/”, and then adds a
“/plays” prefix using -output_uri_prefix. The end result is output URIs of the form
/plays/will/filename.
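A command consistent with this description might look as follows; the conn.txt options file is hypothetical, and you should verify the pattern ordering against your own URIs:

mlcp.sh import -options_file conn.txt \
    -input_file_path /space/bill/data \
    -output_uri_replace "/space,'',bill/data,'will'" \
    -output_uri_prefix "/plays"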
If you supply a URI or URI component, you are responsible for ensuring the result is a legitimate
URI. No automatic encoding takes place. This applies to -output_uri_replace,
-output_uri_prefix, and -output_uri_suffix. The changes implied by these options are applied
after mlcp encodes the default URI.
When mlcp exports documents from the database to the file system such that the output directory
and/or file names are derived from the document URI, the special symbols are decoded. That is,
“foo%bar.xml” becomes “foo bar.xml” when exported. For details, see “How URI Decoding
Affects Output File Names” on page 95.
• Document type can be inherent in the input file type. For example, aggregates and rdf
input files always insert XML documents. For details, see “Supported Input Format
Summary” on page 26.
• You can specify a document type explicitly with -document_type. For example, to load
documents as XML, use -input_file_type documents -document_type xml. You cannot
set an explicit type for all input file types.
• mlcp can determine document type dynamically from the output document URI and the
MarkLogic Server MIME type mappings when you use -input_file_type documents
-document_type mixed.
If you set -document_type to an explicit type such as -document_type json, then mlcp inserts all
documents as that type.
If you use -document_type mixed, then mlcp determines the document type from the output URI
suffix and the MIME type mapping configured into MarkLogic Server. Mixed is the default
behavior for -input_file_type documents.
Note: You can only use -document_type mixed when the input file type is documents.
The following table contains examples of applying the default MIME type mappings to output
URIs with various file extensions, an unknown extension, and no extension. The default mapping
includes many additional suffixes. You can examine and create MIME type mappings under the
Mimetypes section of the Admin Interface. For more information, see Implicitly Setting the Format
Based on the MIME Type in the Loading Content Into MarkLogic Server Guide.
/path/doc.xml XML
/path/doc.json JSON
/path/doc.jpg binary
/path/doc.txt text
/path/doc.unknown binary
/path/doc-nosuffix binary
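The suffix-to-format decisions in the table above can be sketched as a small shell function. This is only an illustration of the default mappings listed here; the authoritative mapping lives in MarkLogic's Mimetypes configuration.

```shell
# Mirror of the handful of default suffix-to-format mappings shown above.
doc_format() {
  case "$1" in
    *.xml)  echo XML ;;
    *.json) echo JSON ;;
    *.jpg)  echo binary ;;
    *.txt)  echo text ;;
    *)      echo binary ;;   # unknown suffix or no suffix
  esac
}
doc_format /path/doc.json      # prints "JSON"
doc_format /path/doc-nosuffix  # prints "binary"
```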
The MIME type mapping is applied to the final output URI. That is, the URI that results from
applying the URI transformation options described in “Controlling Database URIs During
Ingestion” on page 28. The following examples show how URI transformations can affect the
output document type in mixed mode, assuming the default MIME type mappings:
• -output_uri_suffix ".xml" produces a final URI ending in “.xml”, so the resulting
document type is XML.
• -output_uri_replace "\.\d+,'.txt'" rewrites a numeric suffix in the URI to “.txt”, so
the resulting document type is text.
Document type determination is completed prior to invoking server side transformations. If you
change the document type in a transformation function, you are responsible for changing the
output document to match. For details, see “Transforming Content During Ingestion” on page 57.
2. Set -input_file_type if your input files are not documents. For example, if loading from
delimited text files, sequence files, aggregate XML files, RDF triples files, or database
archives.
3. Set -document_type if -input_file_type is not documents and the content type cannot be
accurately deduced from the file suffixes as described in “How mlcp Determines
Document Type” on page 31.
4. Set -mode:
2. Set -input_file_type if your input files are not documents. For example, if loading from
delimited text files, sequence files, aggregate XML files, or database archives.
3. Set -document_type if -input_file_type is not documents and the content type cannot be
accurately deduced from the file suffixes as described in “How mlcp Determines
Document Type” on page 31.
4. Set -mode:
Note: Input document filtering is handled differently for -input_file_type forest. For
details, see “Filtering Forest Contents” on page 103.
For example, the following command loads only files with a “.xml” suffix from the directory
/space/bill/data:
The mlcp tool uses Java regular expression syntax. For details, see “Regular Expression Syntax”
on page 9.
Follow this procedure to load content from one or more ZIP or GZIP compressed files.
1. Set -input_file_path:
• To load from a single file, set -input_file_path to the path to the compressed file.
• To load from multiple files, set -input_file_path to a directory containing the
compressed files.
2. If the content type cannot be accurately deduced from suffixes of the files inside the
compressed file as described in “How mlcp Determines Document Type” on page 31, set
-document_type appropriately.
4. If the compressed file suffix is not “.zip” or “.gzip”, specify the compressed file format by
setting -input_compression_codec to zip or gzip.
If you set -document_type to anything but mixed, then the contents of the compressed file must be
homogeneous. For example, all XML, all JSON, or all binary.
The following example command loads binary documents from the compressed file
/space/images.zip on the local filesystem.
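A sketch of such a command, with connection options supplied through a hypothetical conn.txt options file:

mlcp.sh import -options_file conn.txt \
    -input_file_path /space/images.zip \
    -input_compressed true -document_type binary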
The following example loads all the files in the compressed file /space/example.jar, using
-input_compression_codec to tell mlcp the compression format because of the “.jar” suffix:
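A sketch of this command, again with a hypothetical conn.txt options file:

mlcp.sh import -options_file conn.txt \
    -input_file_path /space/example.jar \
    -input_compressed true -input_compression_codec zip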
If -input_file_path is a directory, mlcp loads contents from all compressed files in the input
directory, recursing through subdirectories. The input directory must not contain other kinds of
files.
By default, the URI prefix on documents loaded from a compressed file includes the full path to
the input compressed file and mirrors the directory hierarchy inside the compressed file. For
example, if a ZIP file /space/shakespeare.zip contains bill/data/dream.xml then the ingested
document URI is /space/shakespeare.zip/bill/data/dream.xml. To override this behavior, see
“Controlling Database URIs During Ingestion” on page 28.
1. Set -input_file_path:
5. If the input archive was created without any metadata, set -archive_metadata_optional to
true. If this is not set, an exception is thrown if the archive contains no metadata.
6. If you want to exclude some or all of the document metadata in the archive:
Note: When you import properties from an archive, you should disable the “maintain last
modified” configuration option on the destination database during the import.
Otherwise, you can get an XDMP-SPECIALPROP error if the import operation tries to
update the last modified property. To disable this setting, use the Admin Interface
or the library function admin:set-maintain-last-modified.
The following mlcp options support creating multiple documents from aggregate data:
• -aggregate_record_element
• -uri_id
• -aggregate_record_namespace
You can disaggregate XML when loading from either flat or compressed files. For more
information about working with compressed files, see “Loading Documents From Compressed
Files” on page 35.
1. Set -input_file_path:
• To load from a single file, set -input_file_path to the path to the aggregate XML
file.
• To load from multiple files, set -input_file_path to a directory containing the
aggregate files. The directory must not contain other kinds of files.
2. If you are loading from a compressed file, set -input_compressed.
4. Set -aggregate_record_element to the element QName of the node to use as the root for
all inserted documents. See the example below. The default is the first child element under
the root element.
Note: The element QName should appear at only one level. You cannot specify the
element name using a path, so disaggregation occurs everywhere that name is
found.
5. Optionally, override the default document URI by setting -uri_id to the name of the
element from which to derive the document URI.
The default URI is hashcode-seqnum in local mode. If there are multiple matching elements, the
first match is used.
If your aggregate URI IDs are not unique, you can overwrite one document in your input set with
another. Importing documents with non-unique URI IDs from multiple threads can also cause
deadlocks.
The following command breaks the input data into a document for each <person> element. The
-uri_id and other URI options give the inserted documents meaningful names. The command
creates URIs of the form “/people/lastname.xml” by using the <last/> element as the aggregate
URI id, along with an output prefix and suffix:
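The prefix + URI id + suffix composition can be sketched as follows (a simplified illustration; the element value shown is hypothetical, and real mlcp extracts it from each aggregate record):

```javascript
// Sketch of how mlcp combines -output_uri_prefix, the value of the
// -uri_id element, and -output_uri_suffix into a document URI.
function buildUri(prefix, uriIdValue, suffix) {
  return prefix + uriIdValue + suffix;
}

// With -output_uri_prefix /people/ and -output_uri_suffix .xml,
// a <last>washington</last> element yields:
console.log(buildUri('/people/', 'washington', '.xml')); // /people/washington.xml
```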
If the record element is in a namespace, mlcp ingests no documents unless you also set
-aggregate_record_namespace. Setting the namespace creates two documents in the namespace
“http://marklogic.com/examples”. For example, after running the following command:
The following options are commonly used in the generation of documents from delimited text
files:
• -input_file_type delimited_text
• -delimiter
• -uri_id
The default document type is XML. To create JSON documents, use -document_type json.
When creating XML documents, each document has a root node of <root> and child elements
with names corresponding to each column title. You can override the default root element name
using the -delimited_root_name option; for details, see “Customizing XML Output” on page 41.
When creating JSON documents, each document is rooted at an unnamed object containing JSON
properties with names corresponding to each column title. By default, the values for JSON are
always strings. Use -data_type to override this behavior; for details, see “Controlling Data Type
in JSON Output” on page 42.
For example, if you have the following data and mlcp command:
Then mlcp creates the XML output shown in the table below. To generate the JSON output, add
-document_type json to the mlcp command line.
XML output:
<root>
  <first>george</first>
  <last>washington</last>
</root>
<root>
  <first>betsy</first>
  <last>ross</last>
</root>

JSON output:
{
  "first": "george",
  "last": "washington"
}
{
  "first": "betsy",
  "last": "ross"
}
• The first line in the input file contains “column” names that are used to create the XML
element or JSON property names of each document created from the file.
• The same delimiter is used to separate each value, as well as the column names. The
default separator is a comma; use -delimiter to override it; for details, see “Specifying the
Field Delimiter” on page 43.
• Every line has the same number of fields (values). Empty fields are represented as two
delimiters in a row, such as “a,b,,d”.
For example, the following data meets the input format requirements:
first,last
george,washington
betsy,ross
This data produces documents with XML elements or JSON properties named “first” and “last”.
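The first-line-as-names convention can be sketched as follows (an illustration of the mapping, not mlcp's implementation):

```javascript
// Sketch: map delimited text to per-record objects, using the first
// line as the property names, mirroring mlcp's default behavior.
function delimitedToRecords(text, delimiter = ',') {
  const [header, ...rows] = text.trim().split('\n');
  const names = header.split(delimiter);
  return rows.map(row => {
    const values = row.split(delimiter);
    return Object.fromEntries(names.map((name, i) => [name, values[i]]));
  });
}

const records = delimitedToRecords('first,last\ngeorge,washington\nbetsy,ross');
console.log(records); // two objects with "first" and "last" properties
```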
The following example produces documents with root element <person> in the namespace
http://my.namespace.
For example, if you have an input file called “catalog.csv” that looks like the following:
Then the default output documents look similar to the following. Notice that all the property
values are strings.
{ "id": "12345",
  "price": "8.99",
  "in-stock": "true"
}
The following example command uses the -data_type option to make the “price” property a
number value and the “in-stock” property a boolean value. Since the “id” field is not specified in
the -data_type option, it remains a string.
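The conversion can be sketched as follows (an illustration of the rules, not mlcp code; the name,type pair format shown in the comment is assumed from the option description):

```javascript
// Sketch of -data_type conversions: fields listed in the option are
// coerced to number or boolean; unlisted fields stay strings.
function applyDataTypes(record, typeMap) {
  const out = {};
  for (const [key, value] of Object.entries(record)) {
    if (typeMap[key] === 'number') out[key] = Number(value);
    else if (typeMap[key] === 'boolean') out[key] = value === 'true';
    else out[key] = value;
  }
  return out;
}

const converted = applyDataTypes(
  { id: '12345', price: '8.99', 'in-stock': 'true' },
  { price: 'number', 'in-stock': 'boolean' } // e.g. -data_type "price,number,in-stock,boolean"
);
console.log(converted); // price becomes the number 8.99, "in-stock" the boolean true
```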
first,last
george,washington
betsy,ross
Then importing this data with no URI-related options creates two documents whose URIs are
derived from the “first” value: “george” and “betsy”.
Note that URIs generated with -generate_uri are only guaranteed to be unique across your
import operation. For details, see “Default Document URI Construction” on page 28.
You can further tailor the URIs using -output_uri_prefix and -output_uri_suffix. These
options apply even when you use -generate_uri. For details, see “Controlling Database URIs
During Ingestion” on page 28.
If your URI id’s are not unique, you can overwrite one document in your input set with another.
Importing documents with non-unique URI id’s from multiple threads can also cause deadlocks.
For example, the Linux bash shell parser makes it difficult to specify a tab delimiter on the
command line, so you can put the options in a file instead. In the example options file below, the
string literal after -delimiter should contain a tab character.
$ cat delim.opt
-input_file_type
delimited_text
-delimiter
"tab"
To create JSON documents from delimited text files such as CSV files, see “Creating Documents
from Delimited Text Files” on page 39. For aggregate XML input, see “Splitting Large XML
Files Into Multiple Documents” on page 37.
Usually, each line of input has similar structure, such as the following:
However, the JSON data on each line is independent of the other lines, so the lines do not have to
contain JSON data of the same “shape”. For example, the following is a valid input file:
Given the input shown below, the following command creates 2 JSON documents. Each
document contains the data from a single line of input.
$ cat example.json
{"id": "12345","price":8.99, "in-stock": true}
{"id": "67890","price":2.00, "in-stock": false}
The example command creates documents whose contents precisely mirror each line of input. The
default document URIs are based on the input file name, split number, and sequence number:
/space/data/example.json-0-1
/space/data/example.json-0-2
...
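The default URI pattern shown above can be sketched as follows (derived from the example URIs; the exact construction is mlcp's):

```javascript
// Sketch of the default URI pattern for line-delimited JSON input:
// <input file path>-<split number>-<sequence number within the split>.
function defaultUri(filePath, splitNumber, seqNumber) {
  return `${filePath}-${splitNumber}-${seqNumber}`;
}

console.log(defaultUri('/space/data/example.json', 0, 1)); // /space/data/example.json-0-1
console.log(defaultUri('/space/data/example.json', 0, 2)); // /space/data/example.json-0-2
```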
You can base the URI on values in the content instead by using the -uri_id option to specify the
name of a property found in the data. You can further tailor the URIs using -output_uri_prefix
and -output_uri_suffix. For details, see “Controlling Database URIs During Ingestion” on
page 28.
For example, the following command uses the value in the “id” field as the base of the URI and
uses -output_uri_suffix to add a “.json” suffix to the URIs:
Given these options, an input line of the form shown below produces a document with the URI
“12345.json” instead of “/space/data/example.json-0-1”.
If the property name specified with -uri_id is not unique in your data, mlcp will use the first
occurrence found in a breadth first search. The value of the specified property should be a valid
number or string.
If you use -uri_id, any records (lines) that do not contain the named property are skipped. If the
property is found but the value is null or not a number or string, the record is skipped.
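The lookup and skip rules described above can be sketched as follows (an illustration, not mlcp's implementation):

```javascript
// Sketch of the -uri_id lookup: find the first occurrence of the named
// property in a breadth-first search; records where it is missing, null,
// or not a string/number are treated as skippable (return null).
function findUriId(record, name) {
  const queue = [record];
  while (queue.length > 0) {
    const node = queue.shift();
    if (node === null || typeof node !== 'object') continue;
    if (Object.prototype.hasOwnProperty.call(node, name)) {
      const value = node[name];
      if (typeof value === 'string' || typeof value === 'number') return value;
      return null; // null or wrong type: skip this record
    }
    queue.push(...Object.values(node));
  }
  return null; // property not found: skip this record
}

console.log(findUriId({ id: '12345', price: 8.99 }, 'id'));  // '12345'
console.log(findUriId({ nested: { id: '67890' } }, 'id'));   // '67890'
console.log(findUriId({ id: null }, 'id'));                  // null (record skipped)
```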
You can use mlcp to load triples files in several formats, including RDF/XML, Turtle, and
N-Quads. For a full list of supported formats, see Supported RDF Triple Formats in Semantic Graph
Developer’s Guide.
Note: Each time you load triples from a file, mlcp inserts new documents into the
database. That is, multiple loads of the same input insert new triples each time,
rather than overwriting. Only the XQuery and REST APIs allow you to replace triples.
Load triples data embedded within other content according to the instructions for the enclosing
input file type, rather than with -input_file_type rdf. For example, if you have an XML input
document that happens to have some triples embedded in it, load the document using
-input_file_type documents.
You cannot combine loading triples files with other input file types.
If you do not include any graph selection options in your mlcp command, quads are loaded into
the graph specified in the data. Quads with no explicit graph specification, and all other kinds of
triple data, are loaded into the default graph. You can change this behavior with options. For details, see
“Graph Selection When Loading Quads” on page 46 or “Graph Selection for Other Triple Types”
on page 48.
For details, see Loading Triples with mlcp in Semantic Graph Developer’s Guide.
• -output_graph
• -output_override_graph
• -output_collections
You can use -output_collections by itself or with the other two options. You cannot use
-output_graph and -output_override_graph together.
If your semantic data is not in a quad format like N-Quads, see “Graph Selection for Other Triple
Types” on page 48.
Quads interact with these options differently than other triple formats because quads can include a
graph IRI in each quad. The following table summarizes the effect of various option combinations
when importing quads with mlcp:
none: For quads that contain an explicit graph IRI, load the triple into that graph. For quads
with no explicit graph IRI, load the triple into the default graph. The default graph URI is
http://marklogic.com/semantics#default-graph.

-output_graph: For quads that contain an explicit graph IRI, load the triple into that graph.
For quads with no explicit graph IRI, load the triple into the graph specified by -output_graph.

-output_override_graph: Load all triples into the graph specified by -output_override_graph.
This graph overrides any graph IRIs contained in the quads.

-output_collections: Similar to -output_override_graph, but you can specify multiple
collections. Load triples into the graph specified as the first (or only) collection; also add
triples to any additional collections on the list. This overrides any graph IRIs contained in
the quads.

-output_graph with -output_collections: For quads that contain an explicit graph IRI, load the
triple into that graph. For quads with no explicit graph IRI, load the triple into the graph
specified by -output_graph. Also add triples to the specified collections.
For more details, see Loading Triples with mlcp in the Semantic Graph Developer’s Guide.
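The option combinations above can be sketched as a small decision function (an illustration of the rules only, not mlcp code; option names are camel-cased for JavaScript):

```javascript
const DEFAULT_GRAPH = 'http://marklogic.com/semantics#default-graph';

// Return the target graph and extra collections for one quad, given
// the graph IRI embedded in the quad (or null) and the mlcp options.
function selectGraph(quadGraphIri, opts) {
  if (opts.outputOverrideGraph) {
    return { graph: opts.outputOverrideGraph, extraCollections: [] };
  }
  if (opts.outputCollections && opts.outputCollections.length > 0 && !opts.outputGraph) {
    const [graph, ...rest] = opts.outputCollections; // first collection is the graph
    return { graph, extraCollections: rest };
  }
  // Quad's own graph wins; fall back to -output_graph, then the default.
  const graph = quadGraphIri || opts.outputGraph || DEFAULT_GRAPH;
  return { graph, extraCollections: opts.outputCollections || [] };
}

console.log(selectGraph(null, {}).graph);                        // the default graph
console.log(selectGraph('http://example.org/graph3', {}).graph); // http://example.org/graph3
```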
For example, suppose you load the following N-Quad data with mlcp. There are 3 quads in the
data set. The first and last quad include a graph IRI, the second quad does not.
<http://one.example/subject1> <http://one.example/predicate1>
<http://one.example/object1> <http://example.org/graph3> .
_:subject1 <http://an.example/predicate1> "object1" .
Then the table below illustrates how the various graph related options affect how the triples are
loaded into the database:
none: The triples are loaded into the following graphs:
http://example.org/graph3
http://marklogic.com/semantics#default-graph
http://example.org/graph5
• -output_graph
• -output_collections
The following table summarizes the effect of various option combinations when importing triples
with mlcp. For quads, see “Graph Selection When Loading Quads” on page 46.
-output_graph with -output_collections: Load triples into the graph specified by -output_graph
and also add them to the specified collections.
For more details, see Loading Triples with mlcp in the Semantic Graph Developer’s Guide.
For example, if you use a command similar to the following to load triples data:
Then the table below illustrates how the various graph related options affect how the triples are
loaded into the database:
none: Graph:
http://marklogic.com/semantics#default-graph
For details, see “Importing Documents from a Forest into a Database” on page 129.
Selecting a batch size is a speed vs. memory tradeoff. Each request to the server introduces
overhead because extra work must be done. However, unless you use -streaming or
-document_type mixed, all the updates in a batch stay in memory until a request is sent, so larger
batches consume more memory.
It is also possible to overwhelm MarkLogic Server if you have too many concurrent sessions
active.
The optimizations described by this section are only enabled if you explicitly specify the
-fastload or -output_directory options. (The -output_directory option implies -fastload).
Note: The -fastload option works slightly differently when used with -restrict_hosts.
For details, see “How -restrict_hosts Affects -fastload” on page 74. The limitations
of -fastload described in this section still apply.
By default, mlcp inserts documents into the database by distributing work across the e-nodes in
your MarkLogic cluster. Each e-node inserts documents into the database according to the
configured document assignment policy.
This means the default insertion process for a document is similar to the following:
1. mlcp selects Host A from the available e-nodes in the cluster and sends it the document to
be inserted.
2. Using the document assignment policy configured for the database, Host A determines the
document should be inserted into Forest F on Host B.
3. Host A forwards the document to Host B, which inserts it into Forest F.
When you use -fastload (or -output_directory), mlcp attempts to cut out the middle step by
applying the document assignment policy on the client. The interaction becomes similar to the
following:
1. Using the document assignment policy, mlcp determines the document should be inserted
into Forest F on Host B.
2. mlcp sends the document to Host B for insertion, with instructions to insert it into a
specific forest.
Pre-determining the destination host and forest can always be done safely and consistently if
all of the following conditions are met:
To make forest assignment decisions locally, mlcp gathers information about the database
assignment policy and forest topology at the beginning of a job. If you change the assignment
policy or forest topology while an mlcp import or copy operation is running, mlcp might make
forest placement decisions inconsistent with those MarkLogic Server would make. This can cause
problems such as duplicate document URIs and unbalanced forests.
Similar problems can occur if mlcp attempts to update a document already in the database, and the
forest topology or assignment policy changes between the time the document was originally
inserted and the time mlcp updates the document. Using user-specified forest placement when
initially inserting a document creates the same conflict.
• A document mlcp inserts already exists in the database and any of the following
conditions are true:
• The forest topology has changed since the document was originally inserted.
• The assignment policy has changed since the document was originally inserted.
• The assignment policy is not Legacy (default) or Bucket. For details, see “How
Assignment Policy Affects Optimization” on page 53.
• The document was originally inserted using user-specified forest placement.
• A document mlcp inserts does not already exist in the database and any of the following
conditions are true:
• The forest topology changes while mlcp is running.
• The assignment policy changes while mlcp is running.
Assignment policy is a database configuration setting that affects how MarkLogic Server selects
what forest to insert a document into or move a document into during rebalancing. For details, see
Rebalancer Document Assignment Policies in Administrator’s Guide.
Note: Assignment policy was introduced with MarkLogic 7 and mlcp v1.2. If you use an
earlier version of mlcp with MarkLogic 7 or later, the database you import data
into with -fastload or -output_directory must be using the legacy assignment
policy.
Any operation that changes the forests available for updates changes your forest topology,
including the following:
In most cases, it is your responsibility to determine whether or not you can safely use -fastload
(or -output_directory, which implies -fastload). In cases where mlcp can detect -fastload is
unsafe, it will disable it or give you an error.
The following table summarizes the limitations imposed by each assignment policy. If you do not
explicitly set assignment policy, the default is Legacy or Bucket.
Bucket: You can safely use -fastload if any of the following conditions are met:
• there are no pre-existing documents in the database with the same URIs; or
• you use -output_directory; or
• the URIs may be in use, but the forest topology has not changed since the documents were
created, and the documents were not initially inserted using user-specified forest placement.

Statistical: You can only use -fastload to create new documents; updates are not supported.
You should use -output_directory to ensure there are no updates.
All documents in a batch are inserted into the same forest. The rebalancer may subsequently
move the documents if the batch size is large enough to cause the forest to become unbalanced.

Range: You can only use -fastload to create new documents; updates are not supported. You
should use -output_directory to ensure there are no updates.
You can only use -fastload optimizations with range policy if you are licensed for Tiered
Storage.

Query: You can only use -fastload to create new documents; updates are not supported. You
should use -output_directory to ensure there are no updates.
You can only use -fastload optimizations with query policy if you are licensed for Tiered
Storage.
4.13.4 Tuning Split Size and Thread Count for Local Mode
You can tune split size only when importing documents in local mode from one of the following
input file types:
You can tune thread count for both whole documents and all composite files types. Thread count
and split size can interact to affect job performance.
In local mode, a split defines the unit of work per thread devoted to a session with MarkLogic
Server. The ideal split size is one that keeps all mlcp session threads busy. The default split size is
32M for local mode. Use the -max_split_size, -thread_count, and -thread_count_per_split
options to tune your load.
By default, threads are assigned to splits in a round-robin fashion. For example, consider
loading 120 small documents, each 1M in size. Since the default split size is 32M, the load is broken
into 4 splits. If -thread_count is 10, each split is assigned at least 2 threads (10 / 4 = 2). The
remaining 2 threads are each assigned to a split, so the number of threads per split is distributed
as follows:
Split 1: 3 threads
Split 2: 3 threads
Split 3: 2 threads
Split 4: 2 threads
This distribution could result in two of the splits completing faster, leaving some threads idle. If
you set -max_split_size to 12M, the load has 10 splits, which can be evenly distributed across the
threads and may result in better thread utilization.
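The round-robin distribution described above can be sketched as follows (an illustration of the arithmetic, not mlcp code):

```javascript
// Every split gets floor(threads / splits) threads; the remainder is
// handed out one per split from the front of the list.
function threadsPerSplit(threadCount, splitCount) {
  const base = Math.floor(threadCount / splitCount);
  const extra = threadCount % splitCount;
  return Array.from({ length: splitCount }, (_, i) => base + (i < extra ? 1 : 0));
}

// 120 x 1M documents with a 32M split size -> 4 splits; 10 threads:
console.log(threadsPerSplit(10, 4)); // [ 3, 3, 2, 2 ]
// With -max_split_size 12M -> 10 splits, evenly one thread each:
console.log(threadsPerSplit(10, 10)); // [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
```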
If -thread_count is less than the number of splits, the default behavior is one thread per split, up
to the total number of threads. The remaining splits must wait until a thread becomes available.
Note: If you specify -thread_count_per_split, each input split runs with the
specified number of threads. The total number of threads, however, is still controlled
by the calculated thread count, or by -thread_count if it is specified.
If MarkLogic Server is not I/O bound, then raising the thread count, and possibly threads per split,
can improve throughput when the number of splits is small but each split is very large. This is
often applicable to loading from zip files, aggregate files, and delimited text files. Note that if
MarkLogic Server is already I/O bound in your environment, increasing the concurrency of writes
will not necessarily improve performance.
Streaming content into the database usually requires less memory on the host running mlcp, but
ingestion can be slower because it introduces additional network overhead. Streaming also does
not take advantage of mlcp’s built-in retry mechanism. If an error occurs that is normally retryable,
the job will fail.
Note: Streaming is only usable when -input_file_type is documents. You cannot use
streaming with delimited text files, sequence files, or archives.
Note: This option can only be applied to composite input file types that logically produce
multiple documents and for which mlcp can efficiently identify document
boundaries, such as delimited_text. Not all composite file types are supported; for
details, see “Import Command Line Options” on page 81.
The -split_input option affects local mode as follows: Suppose you are importing a very large
delimited text file in local mode with -split_input set to false and the data processed as a single
split. The work might be performed by multiple threads (depending on the job configuration), but
these threads read records from the input file synchronously. This can cause some read contention.
If you set -split_input to true, then each thread is assigned its own chunk of input, resulting in
less contention and greater concurrency.
Tuning the split size in this case potentially enables greater concurrency because the multiple
splits can be assigned to different threads or tasks.
Split size is tunable using -max_split_size, -min_split_size, and block size. For details, see
“Tuning Split Size and Thread Count for Local Mode” on page 54.
• Implementation Guidelines
• Function Signature
• Input Parameters
• Expected Output
• Example Implementation
$content: Data about the original input document. The map contains the following keys:
• uri - The URI of the document being inserted into the database.
• value - The contents of the input document, as a document node, binary node, or text node.

$context: Additional context information about the insertion, such as
transformation-specific parameter values. The map can contain the following keys when your
transform function is invoked:
The type of node your function receives in the “value” property of $content depends on the input
document type, as determined by mlcp from the -document_type option or URI extension. For
details, see “How mlcp Determines Document Type” on page 31. The type of node your function
returns in the “value” property should follow the same guidelines.
The following list shows the node type your transform function should expect for each
document type:
• XML: document-node
• JSON: document-node
• BINARY: binary-node
• TEXT: text-node
The collections, permissions, quality, and temporal collection metadata from the mlcp command
line is made available to your function so that you can modify or replace the values. If a given
metadata category is not specified on the command line, the key will not be present in the input
map.
Note: Modifying the document URI in a transformation can cause duplicate URIs when
combined with the -fastload option, so you should not use -fastload or
-output_directory with a transformation module that changes URIs. For details,
see “Time vs. Correctness: Understanding -fastload Tradeoffs” on page 51.
The documents returned by your transformation should be exactly as you want to insert them into
the database. No further transformations are applied by the mlcp infrastructure. For example, a
transform function cannot affect document type just by changing the URI. Instead, it must convert
the document node. For details, see “Example: Changing the URI and Document Type” on
page 71.
You can use the context parameter to specify collections, permissions, quality, and values
metadata for the documents returned by your transform. Use the following keys and data formats
for specifying various categories of metadata:
For a description of the meaning of the keys, see “Input Parameters” on page 58.
If your function returns multiple documents, they will all share the metadata settings from the
context parameter.
$root/following-sibling::node()
}
), $content
)
};
For an end-to-end example of using this transform, see “Example: Server-Side Content
Transformation” on page 66.
• Function Signature
• Input Parameters
• Expected Output
• Example Implementation
{ uri: string,
value: node
}
The type of node your function receives in content.value depends on the input document type, as
determined by mlcp from the -document_type option or URI extension. For details, see “How
mlcp Determines Document Type” on page 31. The type of node your function returns in the
value property should follow the same guidelines.
The following list shows the node type your transform function should expect (or return) for
each document type:
• XML: document-node
• JSON: document-node
• BINARY: binary-node
• TEXT: text-node
The context parameter can contain context information about the insertion, such as any
transform-specific parameters passed on the mlcp command line. The context parameter has the
following form:
{ transform_param: string,
  collections: [ string, ... ],
  permissions: [ object, ... ],
  quality: number,
  temporalCollection: string
}
The following list describes the properties of the input parameters in more detail:

content:
• uri - The URI of the document being inserted into the database.
• value - The contents of the input document, as a document node, binary node, or text node;
see below.

context:
• transform_param - The value passed by the client through the -transform_param option, if
any. Your function is responsible for parsing and validating the input string.
• collections - Collection URIs specified by the -output_collections option. Value format:
An array of strings.
• permissions - Permissions specified by the -output_permissions option. Value format: An
array of permission objects, as produced by xdmp.permission.
• quality - The document quality specified by the -output_quality option. Value format: A
number.
• temporalCollection - The temporal collection URI specified by the -temporal_collection
option. Value format: A string.
The collections, permissions, quality, and temporal collection metadata from the mlcp command
line is made available to your function so that you can modify or replace the values. If a given
metadata category is not specified on the command line, the property will not be present in the
context object.
The document content returned by your transformation should be exactly as you want to insert it
into the database. No further transformations are applied by the mlcp infrastructure. For
example, a transform function cannot affect document type just by changing the URI. Instead, it
must convert the document node. For details, see “Example: Changing the URI and Document
Type” on page 71.
You can modify the context input parameter to specify collections, permissions, quality, and
values metadata for the documents returned by your transform. Use the following property names
and data formats for specifying various categories of metadata:
For a description of the meaning of the keys, see “Input Parameters” on page 61.
If your function returns multiple documents, they will all share the metadata settings from the
context parameter.
exports.addProp = addProp;
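A minimal JavaScript transform of this shape can be sketched as follows (plain objects stand in for document nodes here; the property and collection names are illustrative, not mlcp defaults):

```javascript
// Sketch of an mlcp JavaScript transform: add a property whose value
// comes from -transform_param, and replace collection metadata via the
// context. In a real transform, content.value is a document node and
// must be rebuilt as one before returning.
function addProp(content, context) {
  const doc = content.value;                 // plain object for illustration
  doc.NEWPROP = context.transform_param || 'UNDEFINED';
  context.collections = ['transformed'];     // replace collection metadata
  return content;
}

const content = { uri: '/txform/txform.json', value: { key: 'value' } };
const context = { transform_param: 'my-value' };
const result = addProp(content, context);
console.log(result.value); // { key: 'value', NEWPROP: 'my-value' }
```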
• If you use a server-side transform with -fastload (or -output_directory, which enables
-fastload), your transformation function only has access to database content in the same
forest as the input document. If your transformation function needs general access to the
database, do not use -fastload or -output_directory.
Best practice is to install your libraries into the modules database of your XDBC App Server. If
you install your module into the modules database, MarkLogic Server automatically makes the
implementation available throughout your MarkLogic Server cluster. If you choose to install
dependent libraries into the Modules directory of your MarkLogic Server installation, you must
manually do so on each node in your cluster.
MarkLogic Server supports several methods for loading modules into the modules database:
• Run an XQuery or JavaScript query in Query Console. For example, you can run a query
similar to the following to install a module using Query Console. Note: First select your
modules database in the Query Console Content Source dropdown.
• If you use the App Server on port 8000 or have a REST API instance, you can use any of
the following Client APIs:
• Java: ResourceExtensionsManager.write. For details, see Managing Dependent
Libraries and Other Assets in the Java Application Developer’s Guide.
If you use the filesystem instead of a modules database, you can manually install your module
into the Modules directory. Copy the module into MARKLOGIC_INSTALL_DIR/Modules or into a
subdirectory of this directory. The default location of this directory is:
• Unix: /opt/MarkLogic/Modules
• Windows: C:\Program Files\MarkLogic\Modules
If your transformation function requires other modules, you should also install the dependent
libraries in the modules database or the modules directory.
For a complete example, see “Example: Server-Side Content Transformation” on page 66.
• When -fastload is in effect, your transform function runs in the scope of a single forest
(the forest mlcp determines is the appropriate destination for the file being inserted). This
means if you change the document URI as part of your transform, you can end up creating
documents with duplicate URIs.
• When you use a transform function, all the documents in each batch are transformed and
inserted into the database as a single statement. This means, for example, that if the
(transformed) batch contains more than one document with the same URI, you will get an
XDMP-CONFLICTINGUPDATES error.
The following example command assumes you previously installed a transform module with path
/example/mlcp-transform.xqy, and that the function implements a transform function (the default
function) in the namespace http://marklogic.com/example. The function expects a user-defined
parameter value, supplied using the -transform_param option.
For a complete example, see “Example: Server-Side Content Transformation” on page 66.
This example assumes you have already created an XDBC App Server, configured to use "/" as
the root and a modules database of Modules.
$ mkdir /space/mlcp/txform/data
2. Create a file named txform.xml in the sample data directory with the following contents:
<parent><child/></parent>
3. Create a file named txform.json in the sample data directory with the following contents:
{ "key": "value" }
This example module modifies XML input documents by adding an attribute named NEWATTR.
Other input document types pass through the transform unmodified.
In a location other than the sample input data directory, create a file named transform.xqy with
the following contents. For example, copy the following into /space/mlcp/txform/transform.xqy.
},
$root/following-sibling::node()
}
), $content
)
};
This example module modifies JSON input documents by adding a property named NEWPROP.
Other input document types pass through the transform unmodified.
In a location other than the sample input data directory, create a file named transform.sjs with
the following contents. For example, copy the following into /space/mlcp/txform/transform.sjs.
exports.transform = addProp;
These instructions assume you use the XDBC App Server and Documents database
pre-configured on port 8000. This procedure installs the module using Query Console. You can
use another method.
For more detailed instructions on using Query Console, see Query Console User Guide.
http://yourhost:8000/qconsole/
2. Create a new query by clicking the "+" at the top of the query editor.
4. Install the XQuery and/or JavaScript module by copying one of the following scripts into
the new query. Modify the first parameter of xdmp:document-load to match the path to the
transform module you previously created.
5. Select the modules database of your XDBC App Server in the Content Source dropdown
at the top of the query editor. If you use the XDBC App Server on port 8000, this is the
database named Modules.
6. Click the Run button. Your module is installed in the modules database.
7. To confirm installation of your module, click the Explore button at the top of the query
editor and note your module installed with URI /example/mlcp-transform.xqy or
/example/mlcp-transform.sjs.
Use a command similar to the following if you installed the XQuery transform module:
Use a command similar to the following if you installed the JavaScript transform module:
mlcp should report creating two documents. Near the end of the mlcp output, you should see lines
similar to the following:
Use Query Console to explore the content database associated with your XDBC App Server.
Confirm that mlcp created 2 documents. If your input was in the directory
/space/mlcp/txform/data, then the document URIs will be:
• /space/mlcp/txform/data/txform.xml
• /space/mlcp/txform/data/txform.json
If you use the XQuery transform, then exploring the contents of txform.xml in the database should
show a NEWATTR attribute was inserted by the transform, with the value from
-transform_param. The document contents should be as follows:
<parent NEWATTR="my-value">
<child/>
</parent>
If you use the JavaScript transform, then exploring the contents of txform.json in the database
should show a NEWPROP property was inserted by the transform, with the value from
-transform_param. The document contents should be as follows:
Note: Transforms that change the document URI should not be combined with the
-fastload or -output_directory options as they can cause duplicate document
URIs. For details, see “Time vs. Correctness: Understanding -fastload Tradeoffs”
on page 51.
As described in “How mlcp Determines Document Type” on page 31, the URI extension and
MIME type mapping are used to determine document type when you use -document_type mixed.
However, transform functions do not run until after document type selection is completed.
Therefore, if you want to affect document type in a transform, you must convert the document
node, as well as optionally changing the output URI.
Suppose your input document set generates an output document URI with the unmapped
extension “.1”, such as /path/doc.1. Since “1” is not a recognized URI extension, mlcp creates a
binary document node from this input file by default. The example transform function in this
section intercepts such a document and transforms it into an XML document.
• XQuery Implementation
• JavaScript Implementation
Note that if you define a MIME type mapping that maps the extension “1” to XML (or JSON) in
your MarkLogic Server configuration, then mlcp creates a document of the appropriate type to
begin with, and this conversion becomes unnecessary.
exports.transform = modDocType;
If any hostname listed in the value of the -host option is not resolvable by mlcp at the beginning
of a job, then mlcp will abort the job with an IllegalArgumentException.
Assuming all hostnames are resolvable, mlcp uses the first of these hosts to gather information
about the target database. If mlcp is unable to connect to the first host in the -host list, then mlcp
will move on to the next host in the list. If mlcp cannot connect to any of the listed hosts, then the
job will fail with an IOException.
If mlcp successfully retrieves a list of forest hosts, then mlcp subsequently connects directly to
these hosts when distributing work across the cluster, whether or not these hosts are specified in
the -host option. In this way, your job does not need to be aware of the cluster topology.
This behavior applies to the import, export, and copy commands. (For a copy job, you specify
hosts through -input_host and -output_host, rather than -host.)
You can also restrict mlcp to just the hosts listed by the -host option. For details, see “Restricting
the Hosts mlcp Uses to Connect to MarkLogic” on page 73.
• Limit the host working set to just the e-nodes in your cluster.
• The public and private DNS names of a host differ, such as can occur for an AWS
instance.
Note: mlcp automatically sets -restrict_hosts to true when it detects the presence
of a load balancer.
When -restrict_hosts is set to true, mlcp will only connect to the hosts listed in the -host
option, rather than using the approach described in “How mlcp Uses the Host List” on page 73.
Note: Using -restrict_hosts will usually degrade the performance of an mlcp job
because mlcp cannot distribute work as efficiently.
For example, if you’re using mlcp with a load balancer between your client and your MarkLogic
cluster, you can specify the load balancer with -host and set -restrict_hosts to true to prevent
mlcp from attempting to bypass the load balancer and connect directly to the forest hosts.
You can restrict mlcp’s host list when using the import, export, and copy commands. For import
and export, use the -host and -restrict_hosts options. For copy, use -input_host and
-restrict_input_hosts and/or -output_host and -restrict_output_hosts.
Without -restrict_hosts, mlcp determines which host contains the destination forest for a
document, and then connects directly to that host. When -restrict_hosts is true, a connection to
the forest host might not be possible. In this case, mlcp connects to an allowed e-node and includes
the detailed destination information along with the document. The destination details make the
insertion faster than it would otherwise be.
Note: Failover support in mlcp is only available when running mlcp against MarkLogic 9
or later. With older MarkLogic versions, the job will fail if mlcp is connected to a
host that becomes unavailable.
mlcp always attempts to connect to a new host during a failover event. mlcp can potentially
recover from a failover event in the following cases:
• If mlcp receives a connection error that indicates an e-node serving the database is down,
mlcp attempts to select another host. For a job that is not running in fastload mode, mlcp
selects the next host in its host list. For a fastload job, mlcp attempts to determine the
replica forest and host and connect to that host.
• If mlcp receives a retryable error from MarkLogic, it will retry the operation with the same
host. For example, a forest restart or a forest replica host going down can cause a retryable
error.
If mlcp is able to re-establish a connection in these cases, then the job can continue. It is possible
for some documents not to be imported, depending on the configuration of the job. mlcp can only
retry the current batch.
• If -transaction_size is 1, then mlcp only needs to retry the current batch. In most cases, a
successful failover will not cause any insertions to fail.
• If -transaction_size is greater than 1, then mlcp can only retry the current batch. Other
batches in the same transaction cannot be retried. Some documents might not be inserted.
• Even if -transaction_size is 1, mlcp might fail to import all documents in the face of a
failover event in some cases. For example:
• Failover does not succeed within 5 minutes. If it takes more than 5 minutes for
MarkLogic to recover from the failure, then mlcp aborts the job and reports an
error.
mlcp reports any documents that could not be inserted due to the failover.
The following messages are an example of mlcp output during a failover event. Timestamps have
been elided.
1. A failure of some kind occurs, such as a host going down. The exact error messages
depend on the type of failure. Notice that the example errors below include a retryable
exception.
2. mlcp begins retrying the failed insertion. Errors may continue to occur because
MarkLogic is still failing over.
Before 10.0-5, when an mlcp commit failed during ingestion due to exceptions like those listed
above, mlcp did not retry the batch, and all the documents in the current batch failed permanently.
A retry mechanism was added in 10.0-5 to make mlcp more robust and able to recover from these
exceptions.
• If -batch_size is larger than 1 and -transaction_size is larger than 1: mlcp does not retry
in this situation as the client only caches the current batch. All the documents in the
current transaction will fail permanently.
mlcp only retries when the exceptions caught are retryable. Each time mlcp retries, it attempts to
select another host. If the exceptions are not retryable, or the retry does not succeed within
approximately 16 minutes (the time allowed for the DHS cluster to recover), all the documents in
the current batch fail permanently and mlcp logs the failure.
When the current batch fails during insertion or commit, the failures are logged at WARN level.
If the exception is retryable, mlcp retries inserting the whole batch, and the retry messages are
logged at DEBUG level. If the retry succeeds, a success message is logged at INFO level. If the
exception is not retryable, or the maximum retry limit has been exceeded, the document or batch
fails permanently and the failure is logged at ERROR level.
Each log message has a batch number attached to it, in the format xxxx.xxxx (two integers
separated by a dot). The first integer identifies the current thread and the second is the batch
count local to that thread. The combination is globally unique, which makes it easier to track
down and debug batch failures.
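The batch number can be extracted mechanically from a log line. A hypothetical helper (not part of mlcp):

```javascript
// Hypothetical helper that parses the xxxx.xxxx batch number out of an mlcp
// log line, so failures can be grouped by thread and per-thread batch count.
function parseBatchNumber(logLine) {
  const m = logLine.match(/Batch (\d+)\.(\d+)/);
  if (m === null) return null; // line carries no batch number
  return { thread: m[1], batch: Number(m[2]) };
}
```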
The following messages are examples of common exceptions caught when running mlcp against a
DHS cluster on AWS or Azure. These exceptions mostly happen when e-nodes are down or the
static e-node is overloaded. Timestamps have been removed from these examples.
Note: mlcp gets XDMP-NOTXN when the transaction has already been committed or rolled
back.
The following messages are an example of mlcp output during a retry event. Timestamps have
been removed.
inserting
...DEBUG mapreduce.ContentWriter: Batch 1473219859.1010: Sleeping before
retrying...sleepTime=500ms
...DEBUG contentpump.TransformWriter: Batch 1473219859.1010: Retrying
inserting batch, attempts: 1/15
...INFO contentpump.TransformWriter: Batch 1473219859.1010: Retrying inserting
batch is successful
...WARN contentpump.TransformWriter: Batch 278973739.75: Failed committing
transaction: Error parsing HTTP headers: Connection timed out
...WARN contentpump.TransformWriter: Batch 918057596.3: Failed committing
transaction: Error parsing HTTP headers: Connection timed out
...WARN contentpump.TransformWriter: Batch 278973739.75: Failed during
committing
...WARN contentpump.TransformWriter: Batch 918057596.3: Failed during
committing
...WARN contentpump.TransformWriter: Batch 1763434846.80: Failed committing
transaction: Error parsing HTTP headers: Connection timed out
...WARN contentpump.TransformWriter: Batch 1763434846.80: Failed during
committing
...WARN contentpump.TransformWriter: Batch 981349710.122: Failed committing
transaction: Error parsing HTTP headers: Connection timed out
...WARN contentpump.TransformWriter: Batch 981349710.122: Failed during
committing
...WARN mapreduce.ContentWriter: Batch 278973739.75: Failed rolling back
transaction: No transaction
...DEBUG mapreduce.ContentWriter:
com.marklogic.xcc.exceptions.XQueryException: XDMP-NOTXN: No transaction with
identifier 11132444146034518336
[Session: user=admin, cb={default} [ContentSource: user=admin, cb={none}
[provider: SSLconn address=5bJZEjQ1L.z.marklogicsvc.com/52.224.204.231:8005,
pool=0/64]]]
[Client: XCC/11.0-20200911, Server: XDBC/10.0-4]
...DEBUG mapreduce.ContentWriter: Batch 278973739.75: Sleeping before
retrying...sleepTime=500ms
...WARN contentpump.TransformWriter: Batch 1978594827.298: QueryException:
JS-FATAL: xdmp:function(fn:QName(, transformInsertBatch),
/MarkLogic/hadoop.sjs)($transform-module, $transform-function, $uris, $values,
$insert-options, $transform-option)
...WARN contentpump.TransformWriter: Batch 1978594827.298: Failed during
inserting
...ERROR contentpump.TransformWriter: Batch 1978594827.298: Document failed
permanently: /space/data/iplocations/IP2LOCATION-LITE-DB5.CSV.gz-0-2798613 in
file:/space/data/iplocations/IP2LOCATION-LITE-DB5.CSV.gz at line 2798614
4.17.0.1 Limitations
There are two known limitations with the mlcp retry feature:
• When the input type is archive, mlcp cannot retry loading metadata or naked
properties when a commit fails, since by design the client does not cache these inputs.
• Loading temporal documents may have issues. When an mlcp commit fails and catches an
exception, mlcp tries to roll back before retrying the whole batch. However, the
previous transaction may have made it to the server, in which case mlcp gets a NOTXN exception.
This can create issues for temporal documents, since they may be inserted multiple times.
The following command line options can be used to tune this process:
• -thread_count and -thread_count_per_split: When these two options are specified, mlcp
will use a fixed number of threads and auto-scaling will not happen.
• -max_threads: When -max_threads is specified, mlcp caps the maximum thread count,
and auto-scaling cannot go beyond this number. This prevents the client side from
running out of memory when the DHS cluster has a very large number of nodes. By default,
-max_threads is not set.
The following messages are an example of common log messages a user may get in an
auto-scaling process. Timestamps have been removed.
Option Description
The following table lists command line options that define the characteristics of the import
operation:
You can export content in a MarkLogic Server database to files or an archive. Use archives to
copy content from one MarkLogic Server database to another. Output can be written to the native
filesystem.
For a list of export related command line options, see “Export Command Line Options” on
page 112.
You can also use mlcp to extract documents directly from offline forests. For details, see “Using
Direct Access to Extract or Copy Documents” on page 126.
• Exporting to an Archive
1. Select the files to export. For details, see “Filtering Document Exports” on page 96.
3. To prettyprint exported XML when using local mode, set -indented to true.
When using -document_selector to filter by XPath expression, you can define namespace
prefixes using the -path_namespace option. For example:
-path_namespace 'ex1,http://marklogic.com/example,ex2,http://my/ex2'
-document_selector '/ex1:elem[ex2:attr > 10]'
Note: Document URIs are URI-decoded before filesystem directories or filenames are
constructed for them. For details, see “How URI Decoding Affects Output File
Names” on page 95.
For a full list of export options, see “Export Command Line Options” on page 112.
The following example exports selected documents in the database to the native filesystem
directory /space/mlcp/export/files. The directory filter selects only the documents in /plays.
1. Select the files to export. For details, see “Filtering Document Exports” on page 96.
2. Set -output_file_path to the destination directory on the native filesystem. This directory
must not already exist.
4. To prettyprint exported XML when using local mode, set -indented to true.
For a full list of export options, see “Export Command Line Options” on page 112.
The ZIP files created by export have filenames of the form timestamp-seqnum.zip.
The following example exports all the documents in the database to the directory
/space/examples/export on the native filesystem.
$ ls /space/examples/export
20120823135307-0700-000000-XML.zip
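The listing above shows the shape of a generated ZIP name. A pattern such as the following matches it; the field breakdown (14-digit timestamp, 4-digit suffix, 6-digit sequence number, type label) is inferred from this single example and is an assumption, not a documented format.

```javascript
// Pattern matching export ZIP names like 20120823135307-0700-000000-XML.zip.
// The exact field widths are inferred from one example (an assumption).
const zipNamePattern = /^\d{14}-\d{4}-\d{6}-[A-Z]+\.zip$/;
```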
1. Select the documents to export. For details, see “Filtering Archive and Copy Contents” on
page 97.
2. Set -output_file_path to the destination directory on the native filesystem. This directory
must not already exist.
4. If you want to exclude some or all document metadata from the archive:
The following example exports all documents and metadata to the directory
/space/examples/exported. After export, the directory contains one or more compressed archive
files.
The following example exports only documents in the database directory /plays/, including their
collections, properties, and quality, but excluding permissions:
You can use the mlcp import command to import an archive into a database. For details, see
“Loading Content and Metadata From an Archive” on page 36.
When you export a document to a file (or to a file in a compressed file), the output file name is
based on the document URI. The document URI is decoded to form the file name. For example, if
the document URI is “foo%20bar.xml”, then the output file name is “foo bar.xml”.
If the document URI does not conform to the standard URI syntax of RFC 3986, decoding may
fail, resulting in unexpected file names. For example, if the document URI contains unescaped
special characters, the raw URI may be used as the file name.
If the document URI contains a scheme, the scheme is removed. If the URI contains both a
scheme and an authority, both are removed. For example, if the document URI is
“file:foo/bar.xml”, then the output file path is output_file_path/foo/bar.xml. If the document
URI is “http://marklogic.com/examples/bar.xml” (contains a scheme and an authority), then the
output file path is output_file_path/examples/bar.xml.
If the document URI includes directory steps, then corresponding output subdirectories are
created. For example, if the document URI is “/foo/bar.xml”, then the output file path is
output_file_path/foo/bar.xml.
By default, mlcp exports all documents in the database. That is, mlcp exports the equivalent of
fn:collection(). The following options allow you to filter what is exported. These options are
mutually exclusive.
• -directory_filter - export only the documents in the listed database directories. You
cannot use this option with -collection_filter or -document_selector.
• -collection_filter - export only the documents in the listed collections. You cannot use
this option with -directory_filter or -document_selector.
• -document_selector - export only documents selected by the specified XPath expression.
You cannot use this option with -directory_filter or -collection_filter. Use
-path_namespace to define namespace prefixes.
• -query_filter - export only documents matched by the specified cts query. You can use
this option alone or in combination with a directory, collection or document selector filter.
You can only use this filter with the export and copy commands. Results may not be
accurate; for details, see “Understanding When Filters Are Accurate” on page 98.
Note: When filtering with a document selector, the XPath filtering expression should
select fragment roots only. An XPath expression that selects nodes below the root
is very inefficient.
When using -document_selector to filter by XPath expression, you can define namespace prefixes
using the -path_namespace option. For example:
-path_namespace 'ex1,http://marklogic.com/example,ex2,http://my/ex2'
-document_selector '/ex1:elem[ex2:attr > 10]'
By default, all documents and metadata are exported/copied. The following options allow you to
modify this behavior:
Note: When filtering with a document selector, the XPath filtering expression should
select fragment roots only. An XPath expression that selects nodes below the root
is very inefficient.
When using -document_selector to filter by XPath expression, you can define namespace
prefixes using the -path_namespace option. For example:
-path_namespace 'ex1,http://marklogic.com/example,ex2,http://my/ex2'
-document_selector '/ex1:elem[ex2:attr > 10]'
The query you supply with -query_filter is used in an unfiltered search, which means there can
be false positives among the selected documents. When you combine -query_filter with
-directory_filter, -collection_filter, or -document_selector, mlcp might select documents
that do not meet your directory, collection, or path filter criteria.
The interaction between -query_filter and the other filtering options is similar to the following.
In this example, the search can match documents that are not in the “parts” collection.
-collection_filter parts
-query_filter yourSerializedQuery
cts:search(
fn:collection("parts"),
yourQuery,
("unfiltered"))
For a complete example using -query_filter, see “Example: Exporting Documents Matching a
Query” on page 99.
To learn more about the implications of unfiltered searches, see Fast Pagination and Unfiltered
Searches in the Query Performance and Tuning Guide.
The -query_filter option accepts a serialized XML cts:query or JSON cts.query as its value. For
example, the following table shows the serialization of a cts word query, prettyprinted for
readability:
Format: JSON

{"wordQuery":{
"text":["huck"],
"options":["lang=en"]
}}
For details on how to obtain the serialized representation of a cts query, see Serializations of
cts:query Constructors in the Search Developer’s Guide.
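Because the JSON serialization is plain JSON, a client-side script can also build it programmatically before passing it to -query_filter. A hypothetical helper (the function name is invented; only the wordQuery shape shown above is assumed):

```javascript
// Builds the serialized JSON form of a cts word query, matching the
// wordQuery shape shown in this section.
function serializeWordQuery(text, lang) {
  return JSON.stringify({
    wordQuery: {
      text: [text],
      options: ['lang=' + lang]
    }
  });
}
```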
Using an options file is recommended when using -query_filter because both XML and JSON
serialized queries contain quotes and other characters that have special meaning to the Unix and
Windows command shells, making it challenging to properly escape the query. If you use
-query_filter on the command line, you must quote the serialized query and may need to do
additional special character escaping.
For example, you can create an options file similar to the following. It should contain at least 2
lines: One for the option name and one for the serialized query. You can include other options in
the file. For details, see “Options File Syntax” on page 9.
XML:

-query_filter
<cts:word-query xmlns:cts="http://marklogic.com/cts"><cts:text
xml:lang="en">mark</cts:text></cts:word-query>

JSON:

-query_filter
{"wordQuery":{"text":["huck"], "options":["lang=en"]}}
If you save the above option in a file named “query_filter.txt”, then the following mlcp command
exports files from the database that contain the word “huck”:
You can combine -query_filter with another filtering option. For example, the following
command combines the query with a collection filter. The command exports only documents
containing the word “huck” in the collection named “classics”:
Note: The documents selected by -query_filter can include false positives, including
documents that do not match other filter criteria. For details, see “Understanding
When Filters Are Accurate” on page 98.
Language Example
Notice that in the XML example, the xdmp:quote “indent” option is used to disable XML
prettyprinting, making the output better suited for inclusion on the mlcp command line:
xdmp:quote(
<query>{$query}</query>/*,
<options xmlns="xdmp:quote"><indent>no</indent></options>
)
Notice that in the JavaScript example, it is necessary to call toObject on the wrapped query to get
the proper JSON serialization. Using toObject converts the value to a JavaScript object which
xdmp.quote will serialize as JSON.
xdmp.quote(wrapper.query.toObject())
If you want to test your serialized query before using it with mlcp, you can round-trip your XML
query with cts:search in XQuery or your JSON query with cts.search or the JSearch API in
Server-Side JavaScript, as shown in the following examples.
Language Example
Note that xdmp:unquote returns a document node in XQuery, so you need to use XPath to address
the underlying query element root node when reconstructing the query:
cts:query(xdmp:unquote($q)/*[1])
cts.query(fn.head(xdmp.unquote(serializedQ)).root)
By default, mlcp extracts all documents in the input forests. That is, mlcp extracts the equivalent
of fn:collection(). The following options allow you to filter what is extracted from a forest with
Direct Access. These options can be combined.
• -type_filter: Extract only documents with the listed content type (text, XML, or binary).
• -directory_filter: Extract only the documents in the listed database directories.
• -collection_filter: Extract only the documents in the listed collections.
For example, the following combination of options extracts only XML documents in the collections
named “2004” or “2005”.
Similarly, the following options import only binary documents in the source database directory
/images/:
When you use Direct Access, filtering is performed in the process that reads the forest files rather
than being performed by MarkLogic Server. For example, in local mode, filters are applied by
mlcp on the host where you run it.
In addition, filtering cannot be applied until after a document is read from the forest. When you
import or extract documents from a forest, mlcp must “touch” every document in the forest.
For details, see “Using Direct Access to Extract or Copy Documents” on page 126.
If you require a consistent snapshot of the database contents during an export or copy, use the
-snapshot option to force all documents to be read from the database at a consistent point in time.
The submission time of the job is used as the timestamp. Any changes to the database occurring
after this time are not reflected in the output.
If a merge occurs while exporting or copying a consistent snapshot, and the merge eliminates a
fragment that is subsequently accessed by the mlcp job, you may get an XDMP-OLDSTAMP error. If
this occurs, the documents included in the same batch or task may not be included in the
export/copy result. If the source database is on MarkLogic Server 7 or later, you may be able to
work around this problem by setting the merge timestamp to retain fragments for a time period
longer than the expected running time of the job; for details, see Understanding and Controlling
Database Merges in the Administrator’s Guide.
-redaction "pii-rules,sec-rules"
Before you can use redaction, you must install one or more redaction rule sets in the Schemas
database. For details on defining and installing redaction rules, see Redacting Document Content in
the Application Developer’s Guide.
Preparing to redact documents with mlcp requires the following steps. For a complete example,
see “Example: Using mlcp for Redaction” on page 105.
1. Install one or more redaction rules in the Schemas database. Each rule must be part of at
least one collection. For details, see Defining Redaction Rules and Installing Redaction Rules
in the Application Developer’s Guide.
2. If you create a rule that uses a user-defined redaction function, install the implementation
of your redaction function in the modules database associated with the App Server you
will connect to using mlcp. For details, see User-Defined Redaction Functions in the
Application Developer’s Guide.
3. Add the -redaction option to your mlcp command line. For example, the following
command applies the rules in the collections “pii-rules” and “sec-rules” to all exported
documents.
The -redaction option works similarly for copy operations. For details, see “Redacting Content
During a Copy” on page 118.
The user who extracts redacted documents must have read permissions on the source documents
and the rules, but need not be able to modify the rule collection or rule definitions. For details, see
Security Considerations in the Application Developer’s Guide.
The following behaviors apply when exceptional conditions occur. You should be aware of these
behaviors so you understand when content might not be redacted as expected:
• If a rule collection is empty, mlcp issues a warning and continues with the job.
• If any of the rules contain errors, an error is reported and mlcp aborts the export or copy
operation.
• If a rule is valid, but an error occurs when applying the rule, the rule is skipped for the
current document and a warning is logged. The job continues.
This example uses rules based on built-in redaction functions. For an example of using
user-defined redaction functions, see User-Defined Redaction Functions in the Application
Developer’s Guide.
redact-gs/
data/
rules/
The data/ directory will hold the source documents. The rules/ directory will hold redaction
rules. The example walks you through populating these directories and uploading the contents to
MarkLogic using mlcp in preparation for exporting a set of redacted documents with mlcp.
Create the required directories on Linux by running the following command in a location of your
choosing:
Create the required directories on Windows by running the following command in a location of
your choice:
• /redact-gs/sample1.xml
• /redact-gs/sample2.json
Follow the steps in this procedure to install two sample documents in the Documents database.
1. Change directory to the data directory you created in “Creating a Work Area” on
page 106. You should be in your redact-gs/data directory.
<personal>
<name>Little Bopeep</name>
<summary>Seeking lost sheep. Please call 123-456-7890.</summary>
<id>12-3456789</id>
</personal>
{"personal": {
"name": "Jack Sprat",
"summary": "Free nutrition advice! Call (234)567-8901 now!",
"id": "45-6789123"
}}
4. Run the following mlcp command to insert the sample documents into the Documents
database. Modify the connection details as needed to match your environment.
You can use Query Console to explore the Documents database and confirm the upload.
The use of -output_uri_replace on the import command line replaces the portion of the default
URI that is based on the filesystem location with the fixed directory prefix “/redact-gs”. For more
details, see “Controlling Database URIs During Ingestion” on page 28.
When you complete this exercise, the Schemas database should contain the following documents.
The documents are inserted into a rule collection named “gs-rules”. Rules must be in a rule
collection before you can apply them.
• /rules/gs/redact-phone.xml
• /rules/gs/conceal-id.json
The rules installed in this step use the redact-us-phone and conceal built-in redaction functions. For
details on these and other built-in redaction functions, see Built-in Redaction Function Reference in
the Application Developer’s Guide.
Follow the steps in this procedure to install two sample rules in the Schemas database. For an
explanation of what the rules do, see “Understanding the Example Rules” on page 108.
1. Change directory to the rules directory you created in “Creating a Work Area” on
page 106. You should be in your redact-gs/rules directory.
{ "rule": {
"description": "Remove customer ids.",
"path": "//id",
"method": { "function": "conceal" }
}}
4. Run the following mlcp command to insert the rules into the Schemas database. Modify
the connection details as needed to match your environment.
You can use Query Console to explore the Schemas database and confirm the upload.
The use of -output_uri_replace on the import command line replaces the portion of the default
URI that is based on the filesystem location with the fixed directory prefix “/rules/gs”. For more
details, see “Controlling Database URIs During Ingestion” on page 28.
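The effect of -output_uri_replace can be sketched as a regex substitution on the default, filesystem-based URI. The pattern and paths below are illustrative, not the exact values from the elided mlcp command line.

```javascript
// Illustrative sketch: -output_uri_replace takes a regex and a replacement,
// and rewrites matching portions of the default document URI.
function uriReplace(uri, pattern, replacement) {
  return uri.replace(new RegExp(pattern), replacement);
}
```

For example, uriReplace('/space/mlcp/redact-gs/rules/redact-phone.xml', '^.*/rules', '/rules/gs') yields '/rules/gs/redact-phone.xml'.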
<function>redact-us-phone</function>
</method>
<options>
<level>partial</level>
</options>
</rule>
The JSON rule installed in “Installing the Redaction Rules” on page 107 has the following form:
{ "rule": {
"description": "Remove customer ids.",
"path": "//id",
"method": { "function": "conceal" }
}}
The expected result of applying this rule is to remove nodes named id. For example, if //id
selects an XML element or JSON property, the element or property does not appear in the
redacted output. Note that if //id selects array items in JSON, the items are eliminated, but the id
property might remain, depending on the structure of the document. For more details, see conceal
in the Application Developer’s Guide.
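The conceal behavior on a JSON document can be approximated as follows. This is an illustration of the observed result, not MarkLogic's implementation; the helper name is invented.

```javascript
// Approximation of conceal on a JSON document: every property with the
// selected name is removed entirely from the output, at any depth.
function concealProp(doc, name) {
  if (Array.isArray(doc)) return doc.map(v => concealProp(v, name));
  if (doc && typeof doc === 'object') {
    const out = {};
    for (const [k, v] of Object.entries(doc)) {
      if (k !== name) out[k] = concealProp(v, name); // drop matching keys
    }
    return out;
  }
  return doc;
}
```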
Running the export command saves the redacted documents to an output/ sub-directory. You
should have the following filesystem hierarchy. The “extra” redact-gs sub-directory is created by
mlcp because the document URIs are of the form /redact-gs/filename.
redact-gs/
output/
redact-gs/
sample1.xml
sample2.json
The following table shows the result of redacting the XML sample document. Notice that the
telephone number in the summary element has been partially redacted by the redact-us-phone
function, and the id element has been completely hidden by the conceal function.

Original Document:

<personal>
<name>Little Bopeep</name>
<summary>Seeking lost sheep. Please call 123-456-7890.</summary>
<id>12-3456789</id>
</personal>

Redacted Result:

<personal>
<name>Little Bopeep</name>
<summary>Seeking lost sheep. Please call ###-###-7890.</summary>
</personal>
The following table shows the result of redacting the JSON sample document. Notice that the
telephone number in the summary property has been partially redacted by the redact-us-phone
function, and the id property has been completely hidden by the conceal function.

Original Document:

{"personal": {
"name": "Jack Sprat",
"summary": "Free nutrition advice! Call (234)567-8901 now!",
"id": "45-6789123"
}}

Redacted Result:

{"personal": {
"name": "Jack Sprat",
"summary": "Free nutrition advice! Call (###)###-8901 now!"
}}
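The partial masking shown in both samples can be approximated as follows. This is an illustration of the observed output of redact-us-phone with level=partial, not MarkLogic's implementation; the function name is invented.

```javascript
// Mask every digit in a phone string except the last four, preserving
// punctuation, mirroring the partially redacted samples above.
function maskUsPhone(phone) {
  let keep = 4; // digits to leave visible, counted from the end
  const chars = phone.split('');
  for (let i = chars.length - 1; i >= 0; i--) {
    if (/\d/.test(chars[i])) {
      if (keep > 0) { keep--; } else { chars[i] = '#'; }
    }
  }
  return chars.join('');
}
```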
To redact documents when copying them between databases rather than exporting them, add the
-redaction option to the mlcp copy command line.
The following table lists command line options that define the characteristics of the export
operation:
Use the mlcp copy command to copy content and associated metadata from one MarkLogic Server
database to another when both are reachable on the network. You can also copy data from offline
forests to a MarkLogic Server database; for details, see “Using Direct Access to Extract or Copy
Documents” on page 126.
• Basic Steps
• Examples
3. Select what documents to copy. For details, see “Filtering Archive and Copy Contents” on
page 97.
• To select documents matching a query, use -query_filter. You can use this option
alone or in combination with a directory, collection, or document selector filter.
False positives are possible; for details, see “Understanding When Filters Are
Accurate” on page 98.
• To select all documents in the database, leave -collection_filter,
-directory_filter, -document_selector, and -query_filter unset.
For a complete list of mlcp copy command options, see “Copy Command Line Options” on
page 119.
6.2 Examples
The following example copies all documents and their metadata from the source database to the
destination database:
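A minimal sketch of such a command follows. The host names, port, and credentials are
placeholders for illustration, not values from this guide; the command string is built and
echoed rather than executed so you can review it before running it against your own cluster.

```shell
# Sketch of a full-database copy. All hosts, ports, and credentials
# below are placeholders; substitute values for your environment.
CMD="mlcp.sh copy -mode local \
  -input_host src.example.com -input_port 8000 \
  -input_username user -input_password password \
  -output_host dest.example.com -output_port 8000 \
  -output_username user -output_password password"
echo "$CMD"
```

With no filter options set, all documents and their metadata are copied.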
The following example copies selected documents, excluding the source permissions and adding
the documents to two new collections in the destination database:
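A sketch of a command with that shape is shown below. The hosts, port, credentials, and
collection names are placeholders; the real options used here (-collection_filter,
-copy_permissions, -output_collections) are standard mlcp copy options.

```shell
# Sketch: copy only documents in one collection, drop the source
# permissions, and add the copies to two new collections.
# Hosts, ports, credentials, and collection names are placeholders.
CMD="mlcp.sh copy -mode local \
  -input_host src.example.com -input_port 8000 \
  -input_username user -input_password password \
  -output_host dest.example.com -output_port 8000 \
  -output_username user -output_password password \
  -collection_filter my_docs \
  -copy_permissions false \
  -output_collections shakespeare,plays"
echo "$CMD"
```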
For an example of using -query_filter, see “Example: Exporting Documents Matching a Query”
on page 99.
Redaction is performed as documents are read from the source database. For example, if you copy
documents between databases in two different MarkLogic installations, the unredacted content
never leaves the source installation.
Use the -redaction option to apply redaction rules during a copy. For example, the following
command copies documents in the “my_docs” collection from one database to another, and
applies the redaction rules in the rule collections “hipaa-rules” and “biz-rules” to the source
documents before copying them to the destination database.
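A sketch of the copy-with-redaction command described above follows. The hosts, port, and
credentials are placeholders; the -redaction option takes a comma-separated list of rule
collection names, here the “hipaa-rules” and “biz-rules” collections from the text.

```shell
# Sketch: copy the "my_docs" collection, redacting source documents
# with the rules in the "hipaa-rules" and "biz-rules" collections.
# Hosts, ports, and credentials are placeholders.
CMD="mlcp.sh copy -mode local \
  -input_host src.example.com -input_port 8000 \
  -input_username user -input_password password \
  -output_host dest.example.com -output_port 8000 \
  -output_username user -output_password password \
  -collection_filter my_docs \
  -redaction hipaa-rules,biz-rules"
echo "$CMD"
```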
For more details, see “Redacting Content During Export or Copy Operations” on page 104.
-output_ssl_protocol string
    Specify the protocol mlcp should use when creating an SSL connection to the output
    App Server. You must include this option if you use the -output_ssl option to connect
    to an App Server configured to disable MarkLogic’s default protocol (TLSv1.2).
    Allowed values: tls, tlsv1, tlsv1.1, tlsv1.2. Default: TLSv1.2.
-output_uri_prefix string
    Specify a prefix to prepend to the default URI. Used to construct output document
    URIs. For details, see “Controlling Database URIs During Ingestion” on page 28.
-output_uri_replace comma-list
    A comma-separated list of (regex,string) pairs that define string replacements to
    apply to the URIs of documents added to the database. The replacement strings must
    be enclosed in single quotes. For example:
    -output_uri_replace "regex1,'string1',regex2,'string2'"
Direct Access enables you to bypass MarkLogic Server and extract documents from a database by
reading them directly from the on-disk representation of a forest. This feature is best suited for
accessing documents in archived, offline forests.
Direct Access is primarily intended for accessing archived data that is part of a tiered storage
deployment; for details, see Tiered Storage in the Administrator’s Guide. You should only use
Direct Access on a forest that is offline or read-only; for details, see “Limitations of Direct
Access” on page 127.
For example, if you have data that ages out over time such that you need to retain it, but you do
not need to have it available for real time queries through MarkLogic Server, you can archive the
data by taking the containing forests offline, but still access the contents using Direct Access.
Use Direct Access with mlcp to access documents in offline and read-only forests in the following
ways:
• Use the mlcp extract command to extract archived documents from a database as flat files.
This operation is similar to exporting documents from a database to files, but does not
require a source MarkLogic Server instance. For details, see “Choosing Between Export
and Extract” on page 128.
• Use the mlcp import command with -input_file_type forest to import archived documents
into another database as live documents. A destination MarkLogic Server instance is
required, but no source instance.
Since Direct Access bypasses the active data management performed by MarkLogic Server, you
should not use it on forests receiving document updates. Additional restrictions apply. For details,
see “Limitations of Direct Access” on page 127.
To be safely accessible with Direct Access, a forest must meet one of the following conditions:
• The forest is offline and not in an error state. A forest is offline if its availability is set to
offline, or the forest or the database to which it is attached is disabled. For details, see
Taking Forests and Partitions Online and Offline in the Administrator’s Guide.
• The forest is online, but the updates-allowed state of the forest is read-only. For details,
see Setting the Updates-allowed State on Partitions in the Administrator’s Guide.
The following additional limitations apply to using Direct Access:
• Accessing documents with Direct Access bypasses security roles and privileges. The
content is protected only by the filesystem permissions on the forest data.
• Direct Access cannot take advantage of indexing or caching when accessing documents.
Every document in each participating forest is read, even when you use filtering criteria
such as -directory_filter or -type_filter. Filtering can only be applied after reading a
document off disk.
• Direct Access skips property fragments.
• Direct Access skips documents partitioned into multiple fragments. For details, see
Fragments in the Administrator’s Guide.
• Older versions of mlcp might not be able to read forest data from MarkLogic 9 or later.
For best results, use the version of mlcp that corresponds to your MarkLogic version.
When you use Direct Access, mlcp skips any forest (or a stand within a forest) that is receiving
updates or that is in an error state. Processing continues even when some documents are skipped.
When you use mlcp with Direct Access, your forest data must be reachable from the host(s)
processing the input. In local mode, the forests must be reachable from the host on which you
execute mlcp.
If mlcp accesses large or external binaries with Direct Access, then the reachability requirement
also applies to the large data directory and any external binary directories. Furthermore, these
directories must be reachable along the same path as when the forest was online.
The extract command places no load on MarkLogic Server. The export command offloads most
of the work to your MarkLogic cluster. Thus, export honors document permissions, takes
advantage of database indexes, and can apply transformations and filtering at the server. By
contrast, extract bypasses security (other than file permissions on the forest files), must access all
documents sequentially, and applies a limited set of filters on the client.
The export command offers a richer set of filtering options than extract. In addition, export only
accesses the documents selected by your options, while extract must scan the entirety of each
input forest, even when extracting selected documents.
1. Set -input_file_path to the path to the input forest directory(s). Specify multiple forests
using a comma-separated list of paths.
2. Select the documents to extract. For details, see “Filtering Forest Contents” on page 103.
3. Set -output_file_path to the destination file or directory on the native filesystem. This
directory must not already exist.
• Your input forests must be reachable from the host where you execute mlcp.
5. If you want to store the extracted documents in compressed files, set -compress to true.
Filtering options can be combined. Directory names specified with -directory_filter should end
with “/”. All filters are applied on the client, so every document is accessed, even if it is filtered
out of the output document set.
Note: Document URIs are URI-decoded before filesystem directories or filenames are
constructed for them. For details, see “How URI Decoding Affects Output File
Names” on page 95.
For a full list of extract options, see “Extract Command Line Options” on page 130.
The following example extracts selected documents from the forest files in
/var/opt/MarkLogic/Forests/example to the native filesystem directory
/space/mlcp/extracted/files. The directory filter selects only the input documents in the
database directory /plays.
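A sketch of that extract command follows. The forest path, directory filter, and output
directory come from the surrounding text; extract needs no server connection, so the command
is built and echoed here for review rather than executed.

```shell
# Sketch of the extract operation described above. Per the filtering
# note earlier, the directory filter ends with "/".
CMD="mlcp.sh extract -mode local \
  -input_file_path /var/opt/MarkLogic/Forests/example \
  -directory_filter /plays/ \
  -output_file_path /space/mlcp/extracted/files"
echo "$CMD"
```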
1. Set -input_file_path to the path to the input forest directory(s). Specify multiple forests
using a comma-separated list of paths.
2. Set -input_file_type to forest.
3. Specify the connection information for the destination database using -host, -port,
-username, and -password.
4. Select the files to extract from the input forest. For details, see “Filtering Forest Contents”
on page 103. Filtering options can be used together.
5. If you want to exclude some or all of the document metadata in the forests:
• Your input forests and the destination MarkLogic Server instance must be
reachable from the host where you run mlcp.
By default, an imported document has a database URI based on the input file path. You can
customize the URI using options. For details, see “Controlling Database URIs During Ingestion”
on page 28.
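The steps above can be sketched as a single command. The host, port, and credentials are
placeholders, and the forest path reuses the example path from the extract discussion.

```shell
# Sketch: import archived documents directly from forest data into a
# live database. Host, port, and credentials are placeholders.
CMD="mlcp.sh import -mode local \
  -host dest.example.com -port 8000 \
  -username user -password password \
  -input_file_type forest \
  -input_file_path /var/opt/MarkLogic/Forests/example"
echo "$CMD"
```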
The following table lists command line options that define the characteristics of the extraction:
8.0 Troubleshooting
This chapter includes tips for debugging some common problems. The following topics are
covered:
For example, the command below reports the version of mlcp, the Java JRE that mlcp will use
at runtime, and the versions of MarkLogic supported by this version of mlcp.
$ mlcp.sh version
ContentPump version: 8.0
Java version: 1.7.0_45
Supported MarkLogic versions: 6.0 - 8.0
Note that not all features of mlcp are supported by all versions of MarkLogic, even within the
reported range of supported versions. For example, if MarkLogic version X introduces a new
feature that is supported by mlcp, that doesn’t mean you can use mlcp to work with the feature in
MarkLogic version X-1.
In addition, mlcp connects directly to hosts in your MarkLogic Server cluster that contain forests
of the target database. Therefore, all the hosts that serve a target database must be reachable from
the host where mlcp runs (local mode).
mlcp gets the lists of participating hosts by querying your MarkLogic Server cluster
configuration. If a hostname returned by this query is not resolvable, mlcp will not be able to
connect, which can prevent document loading.
If you think you might have connection issues, enable debug level logging to see details on name
resolution and connection failures. For details, see “Enabling Debug Level Messages” on
page 134.
log4j.logger.com.marklogic.mapreduce=DEBUG
log4j.logger.com.marklogic.contentpump=DEBUG
You may find these property settings are already at the end of log4j.properties, but commented
out. Remove the leading “#” to enable them.
• The input type is documents, and the document type is set to (or determined to be) XML,
but the input file fails to parse properly as XML. Correct the error in the input data and try
again.
• You set -input_file_path to a location containing compressed files, but you do not set
-input_compressed and -input_compression_codec. In this case, mlcp will load the
compressed files as binary documents, rather than creating documents from the contents
of the compressed files.
• You set -document_type to a value inconsistent with the input data referenced by
-input_file_path.
• A syntax error was encountered while splitting an aggregate XML file into multiple pieces
of document content.
• A delimited text file contains records (lines) with an incorrect number of column values or
with no value for the URI id column.
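For the compressed-input case in the list above, a correct invocation sets both compression
options. A sketch follows; the host, port, credentials, and archive path are placeholders.

```shell
# Sketch: import the contents of a ZIP archive as documents, rather
# than loading the archive itself as a binary document.
# Host, port, credentials, and the archive path are placeholders.
CMD="mlcp.sh import -mode local \
  -host localhost -port 8000 \
  -username user -password password \
  -input_file_path /space/data/archive.zip \
  -input_compressed true \
  -input_compression_codec zip"
echo "$CMD"
```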
If mlcp reports an ATTEMPTED_INPUT_RECORD_COUNT of 0, then the tool found no input documents
meeting your requirements. If there are errors or warnings, correct them and try again. If there are
no errors, then the combination of options on your command line probably does not select any
suitable documents. For example:
In local mode, an interrupted job will shut down gracefully as long as it can finish within 30
seconds.
If mlcp cannot gracefully shut down the job, you might see the following warning:
MarkLogic provides technical support according to the terms detailed in your Software License
Agreement or End User License Agreement.
Complete product documentation, the latest product release downloads, and other useful
information is available for all developers at http://developer.marklogic.com. For technical
questions, we encourage you to ask your question on Stack Overflow.
10.0 Copyright
The MarkLogic software is protected by United States and international copyright laws, and
incorporates certain third party libraries and components which are subject to the attributions,
terms, conditions and disclaimers set forth below.
For all copyright notices, including third-party copyright notices, see the Combined Product
Notices for your version of MarkLogic.