Welcome to the documentation for DC/OS version 1.9. For information about new and
changed features, see the release notes.
Getting Started
The overview topics help you get started and learn the DC/OS fundamentals.
DC/OS is a distributed operating system based on the Apache Mesos distributed systems
kernel. ...
High Availability
Networking
Load Balancing and VIPs
Marathon-LB
High-Availability
DNS Quick Reference
Upgrading
This document provides instructions for upgrading a DC/OS cluster. If this upgrade is
performed on a supported OS with all prerequisites fulfilled, this upgrade should preserve
the...
Installing DC/OS
Enterprise DC/OS is designed to be configured, deployed, managed, scaled, and upgraded
on any cluster of physical or virtual machines. You can install DC/OS in the environment of
y...
Upgrading
DC/OS Custom Installation Options
DC/OS Cloud Installation Options
Local
High-Availability
Performance Monitoring
Logging
Debugging from the DC/OS Web Interface
Quick Start
Tutorials
This is a collection of tutorials about using DC/OS. Learn how to run and operate
services in production.
Release Notes
GUI
The DC/OS web interface provides a rich graphical view of your DC/OS cluster. With the
web interface you can view the current state of your entire cluster and DC/OS services. The
w...
CLI
You can use the DC/OS command-line interface (CLI) to manage your cluster nodes, install
DC/OS packages, inspect the cluster state, and administer the DC/OS service
subcommands. Yo...
Security
Enterprise DC/OS makes managing users easier with LDAP, SAML, and OpenID Connect
integrations. You can also use permissions to define which resources users can access. In
strict a...
Quick Start
Metrics API
Metrics Reference
Deploying Jobs
You can create scheduled jobs in DC/OS without installing a separate service. Create and
administer jobs in the DC/OS web interface, the DC/OS CLI, or via an API. Note: The Jobs
fu...
Deploying Services and Pods
DC/OS uses Marathon to manage processes and services. Marathon is the init system for
DC/OS. Marathon starts and monitors your applications and services, automaticall...
Installing Services
Pods
Monitoring Services
Updating a User-Created Service
Service Ports
High Availability
Updated: April 17, 2017
This document discusses the high availability (HA) features in DC/OS and best practices for
building HA applications on DC/OS.
Leader/Follower Architecture
A common pattern in HA systems is the leader/follower concept. This is also sometimes
referred to as: master/slave, primary/replica, or some combination thereof. This architecture
is used when you have one authoritative process, with N standby processes. In some
systems, the standby processes might also be capable of serving requests or performing
other operations. For example, when running a database like MySQL with a master and
replica, the replica is able to serve read-only requests, but it cannot accept writes (only the
master will accept writes).
In DC/OS, a number of components follow the leader/follower pattern. We'll discuss some of
them here and how they work.
Mesos
Mesos can be run in high availability mode, which requires running 3 or 5 masters. When run
in HA mode, one master is elected as the leader, while the other masters are followers. Each
master has a replicated log which contains some state about the cluster. The leading master
is elected by using ZooKeeper to perform leader election. For more detail on this, see the
Mesos HA documentation.
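If you want to see which master currently leads, one option is to query the leading master through Mesos-DNS from any node inside the cluster; a sketch (the output fields are illustrative):
```bash
# Ask the current leading Mesos master for a cluster summary.
# leader.mesos is maintained by Mesos-DNS and always resolves to the leader.
curl -s http://leader.mesos:5050/state-summary | python -m json.tool | head
```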
Marathon
Marathon can be run in HA mode, which allows running multiple Marathon instances (at
least 2 for HA), with one elected leader. Marathon uses ZooKeeper for leader election. The
followers do not accept writes or API requests; instead, they proxy all API requests
to the leading Marathon instance.
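You can observe this proxying from inside the cluster: any Marathon instance answers the /v2/leader endpoint, so the following sketch works regardless of which instance you hit (the output is illustrative):
```bash
# Returns the host:port of the current Marathon leader, e.g. {"leader": "10.0.6.68:8080"}.
curl -s http://leader.mesos:8080/v2/leader
```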
ZooKeeper
ZooKeeper is used by many DC/OS components for coordination, leader election, and state
storage. Like the masters, it achieves HA by running an ensemble of 3 or 5 nodes that must
maintain a quorum.
Fault Domain Isolation
Failures often affect an entire fault domain rather than a single machine:
Physical domains: this includes machine, rack, datacenter, region, and availability zone.
Network domains: machines within the same network may be subject to network
partitions. For example, a shared network switch may fail or have an invalid configuration.
With DC/OS, you can distribute masters across racks for HA. Agents can be distributed
across regions, and it is recommended that you tag agents with attributes to describe their
location. Synchronous services like ZooKeeper should also remain within the same region to
reduce network latency. For more information, see the Configuring High-Availability
documentation.
Applications that require HA should also be distributed across fault domains. With
Marathon, this can be accomplished by using the UNIQUE and GROUP_BY constraint operators, as in the sketch below.
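As a sketch, the following Marathon app definition spreads instances across agents by rack; the rack_id attribute is an example and assumes you have tagged your agents with it:
```bash
dcos marathon app add <<'EOF'
{
  "id": "/ha-app",
  "cmd": "sleep 3600",
  "cpus": 0.1,
  "mem": 32,
  "instances": 3,
  "constraints": [["rack_id", "GROUP_BY"]]
}
EOF
```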
Separation of Concerns
HA services should be decoupled, with responsibilities divided amongst services. For
example, web services should be decoupled from databases and shared caches.
Fast Failover
When failures do occur, failover should be as fast as possible. Fast failover can be achieved
by:
Using an HA load balancer like Marathon-LB, or the internal Layer 4 load balancer.
Building apps in accordance with the 12-factor app manifesto.
Following REST best practices when building services: in particular, avoiding storing
client state on the server between requests.
A number of DC/OS services follow the fail-fast pattern in the event of errors. Specifically,
both Mesos and Marathon will shut down in the case of unrecoverable conditions such as
losing leadership.
Networking
ENTERPRISE DC/OS Updated: April 17, 2017
DC/OS comes with an east-west load balancer that's meant to be used to enable multi-tier
microservices architectures. It acts as a TCP Layer 4 load balancer, and it's tightly integrated
with the kernel.
Usage
You can use the layer 4 load balancer by assigning a VIP from the DC/OS web interface.
Alternatively, if you're using something other than Marathon, you can create a label on the
port protocol buffer while launching a task on Mesos. This label's key must be in the format
VIP_$IDX, where $IDX is replaced by a number, starting from 0. Once you create a task, or a
set of tasks, with a VIP, they automatically become available to all nodes in the cluster,
including the masters.
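For example, a minimal Marathon app definition carrying a VIP label might look like the following sketch (the app id, VIP address, and port are placeholders):
```bash
dcos marathon app add <<'EOF'
{
  "id": "/myapp",
  "cmd": "sleep 3600",
  "cpus": 0.1,
  "mem": 32,
  "instances": 1,
  "portDefinitions": [
    { "port": 0, "protocol": "tcp", "labels": { "VIP_0": "10.1.2.3:5000" } }
  ]
}
EOF
```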
Details
When you launch a set of tasks with these labels, DC/OS distributes them to all of the nodes
in the cluster. All of the nodes in the cluster act as decision makers in the load balancing
process. A process runs on every agent; the kernel consults it when packets with a VIP
destination address are recognized. This process tracks the availability and reachability of
the tasks so that requests are sent to the right backends.
Recommendations
Caveats
We recommend that you keep long-running, persistent connections when you use VIPs,
because otherwise you can very quickly fill up the TCP socket table. The default local port
range on Linux allows source connections from 32768 to 61000, which permits 28232
connections to be established between a given source IP and a destination (address, port)
pair. TCP connections must go through the TIME_WAIT state before being reclaimed, and
the Linux kernel's default TCP TIME_WAIT period is 120 seconds. Given this, you would
exhaust the connection table by making only about 235 new connections per second.
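You can check these limits yourself; a quick sketch of the arithmetic:
```bash
# Ephemeral port range available for outgoing connections (typically "32768 61000").
sysctl net.ipv4.ip_local_port_range
# 61000 - 32768 = 28232 usable ports; with a 120 s TIME_WAIT period:
# 28232 / 120 = ~235 new connections per second to a single (address, port) pair.
```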
Health checks
We also recommend taking advantage of Mesos health checks. Mesos health checks are
surfaced to the load balancing layer. Marathon only converts command health checks to
Mesos health checks. You can simulate HTTP health checks via a command similar to test
"$(curl -4 -w '%{http_code}' -s http://localhost:${PORT0}/|cut -f1 -d" ")" == 200.
This ensures the HTTP status code returned is 200. It also assumes your application binds
to localhost. Marathon sets the ${PORT0} variable. We do not recommend TCP health
checks, because they can be misleading as to the liveness of a service.
Important: Docker container command health checks are run inside the Docker container.
For example, if cURL is used to check NGINX, the NGINX container must have cURL
installed, or the container must mount /opt/mesosphere in RW mode.
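As a sketch, a Marathon COMMAND health check wrapping the curl test above might look like this (the file name, grace periods, and intervals are illustrative):
```bash
cat <<'EOF' > myapp-healthcheck.json
{
  "healthChecks": [
    {
      "protocol": "COMMAND",
      "command": {
        "value": "test \"$(curl -4 -w '%{http_code}' -s http://localhost:${PORT0}/ | cut -f1 -d' ')\" == 200"
      },
      "gracePeriodSeconds": 300,
      "intervalSeconds": 60,
      "maxConsecutiveFailures": 3
    }
  ]
}
EOF
```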
Demo
If you would like to run a demo, you can configure a Marathon app as mentioned above, and
use the URI https://s3.amazonaws.com/sargun-mesosphere/linux-amd64, as well as the
command chmod 755 linux-amd64 && ./linux-amd64 -listener=:${PORT0} -say-
string=version1 to execute it. You can then test it by hitting the application with the
command: curl http://1.2.3.4:5000. This app exposes an HTTP API. This HTTP API
answers with the PID, hostname, and the say-string that's specified in the app definition. In
addition, it exposes a long-running endpoint at http://1.2.3.4:5000/stream, which will
continue to stream until the connection is terminated. The code for the application is
available here: https://github.com/mesosphere/helloworld.
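Putting the pieces above together, here is a hedged sketch of the demo app definition (the instance count, resources, and the VIP 1.2.3.4:5000 are examples only):
```bash
dcos marathon app add <<'EOF'
{
  "id": "/helloworld",
  "cmd": "chmod 755 linux-amd64 && ./linux-amd64 -listener=:${PORT0} -say-string=version1",
  "uris": ["https://s3.amazonaws.com/sargun-mesosphere/linux-amd64"],
  "cpus": 0.1,
  "mem": 64,
  "instances": 1,
  "portDefinitions": [
    { "port": 0, "protocol": "tcp", "labels": { "VIP_0": "1.2.3.4:5000" } }
  ]
}
EOF
```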
Prior to this, you had to run a complex proxy that would reconfigure based on the tasks
running on the cluster. Fortunately, you no longer need to do this. Instead, you can use an
incredibly simple HAProxy configuration like the following:
```
defaults
  log global
  mode tcp
  contimeout 50000000
  clitimeout 50000000
  srvtimeout 50000000

listen appname 0.0.0.0:80
  mode tcp
  balance roundrobin
  server mybackend 1.2.3.4:5000
```
This will run HAProxy on the public agent, on port 80. If you'd like, you can make the
number of instances equal to the number of public agents. Then you can point your external
load balancer at the pool of public agents on port 80. Adapting this simply involves
changing the backend entry, as well as the external port.
Potential Roadblocks
IP Overlay
Problems can arise if the VIP address that you specified is used elsewhere in the network.
Although the VIP is a 3-tuple, it is best to ensure that the IP dedicated to the VIP is used
only by the load balancing software and is not in use at all in your network. Therefore, you
should choose IPs from the RFC 1918 range.
IPSet
You must have the command ipset installed. If you do not, the load balancer reports an error when it starts.
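Installing ipset is distribution specific; a sketch for the common package managers:
```bash
# RHEL/CentOS:
sudo yum install -y ipset
# Debian/Ubuntu:
sudo apt-get install -y ipset
```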
Ports
Ports 61420 and 61421 must be open for the load balancer to work correctly. Because
the load balancer maintains a partial mesh, it needs to ensure that connectivity between
nodes is unhindered.
Implementation
The local process polls the master node roughly every 5 seconds, and the master node
caches its response for 5 seconds as well, bounding the propagation time for a new VIP to
roughly 11 seconds. Failed nodes are handled separately, by the failure detector described below.
Data plane
The load balancer dataplane primarily utilizes Netfilter. The load balancer installs 4 IPTables
rules to enable this, therefore the load balancer must start after firewalld, or any other
destructive IPTables management system. These 4 rules tell IPTables to put the packets
that match them on an NFQueue. NFQueue is a kernel subsystem that allows userspace to
process network traffic.
The rules are of two types: the first intercepts traffic, and the second drops it. The purpose
of the drop rule is to provide an immediate connection reset to the client. The intercept
rules match on the combination of a TCP packet with only the SYN flag set and an IPSet
match against a set that is populated with the list of VIPs.
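Purely to illustrate the shape of such an intercept rule (the set name and queue number here are assumptions, not the actual values DC/OS uses):
```bash
# Match TCP packets with only SYN set whose destination is in the VIP ipset,
# and hand them to userspace via NFQUEUE.
iptables -t raw -I PREROUTING -p tcp --tcp-flags ALL SYN \
  -m set --match-set dcos-vips dst,dst -j NFQUEUE --queue-num 50
```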
Once the packet is on the nfqueue, we calculate the backend that the connection should be
mapped to. We use this information to program in an in-kernel conntrack entry which maps
(port DNATs) the 5-tuple from the original destination (the VIP) to the new destination (the
backend). In some cases where hairpin load balancing occurs, SNAT may be required as
well.
Once the NAT programming is done, the packet is released back into the kernel. Since our
rules are in the raw chain, the packet doesn't yet have a conntrack entry associated with it.
The conntrack subsystem recognizes the connection based on the entry programmed earlier
and handles the rest of the flow independently of the load balancer.
The simple algorithm maintains an EWMA of latencies for a given backend at connection
time. It also maintains a record of consecutive failures and when they happened. If a backend
observes enough consecutive failures in a short period of time (less than 5 minutes), it is
considered unavailable. A failure is classified as the three-way handshake failing to complete.
The algorithm iterates over the backends and finds those that are assumed to be available,
taking into account the historical failures as well as the group failure detector. It then takes
two random nodes from the most available bucket.
The probabilistic failure detector randomly chooses backends and checks whether or not the
group failure detector considers the agent to be alive. It will continue to do this until it either
finds 2 backends that are in the ideal bucket, or until 20 lookups happen. In the former case,
it chooses one of the 2 at random; in the latter case, it chooses one of the 20 at random.
Failure detection
The load balancer includes a state-of-the-art failure detection scheme, which draws on the
HyParView work. The failure detector maintains a sparse graph of monitoring connections
amongst the nodes in the cluster, while keeping the cluster as a whole connected.
Every node maintains an adjacency table. These adjacency tables are gossiped to every
other node in the cluster. These adjacency tables are then used to build an application-level
multicast overlay.
These connections are monitored via an adaptive ping algorithm. The adaptive ping
algorithm maintains a window of pings between neighbors; if a ping times out, the
connection is severed. Once this connection is severed, the new adjacencies are gossiped to
all other nodes, potentially triggering cascading health checks. This allows the system to
detect failures in less than a second. However, the system applies backpressure when there
are many failures, and failure detection time can rise to 30 seconds.
Next steps
Assign a VIP to your application
Marathon-LB
Updated: April 17, 2017
Marathon-LB is based on HAProxy, a fast proxy and load balancer. HAProxy provides
proxying and load balancing for TCP and HTTP based applications, with features such as
SSL support, HTTP compression, health checking, Lua scripting, and more. Marathon-LB
subscribes to Marathon's event bus and updates the HAProxy configuration in real time.
You can configure Marathon-LB with various topologies. Here are some examples of
how you might use Marathon-LB:
Use Marathon-LB as your edge load balancer and service discovery mechanism. You
could run Marathon-LB on public-facing nodes to route ingress traffic. You would use the
IP addresses of your public-facing nodes in the A-records for your internal or external
DNS records (depending on your use-case).
Use Marathon-LB as an internal LB and service discovery mechanism, with a separate
HA load balancer for routing public traffic in. For example, you may use an external F5
load balancer on-premise, or an Elastic Load Balancer on Amazon Web Services.
Use Marathon-LB strictly as an internal load balancer and service discovery mechanism.
You might also want to use a combination of internal and external load balancers, with
different services exposed on different load balancers.
Here we discuss Marathon-LB as an edge load balancer and as an internal and external
load balancer.
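For any of these topologies, Marathon-LB itself is available from the DC/OS package repository; a minimal installation from the CLI:
```bash
dcos package install marathon-lb
```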
Next Steps
Install
High-Availability
PREVIEW Updated: April 17, 2017
This document discusses the high availability (HA) features in DC/OS and best practices for
building HA applications on DC/OS.
Terminology
Zone
A zone is a failure domain that has isolated power, networking, and connectivity. Typically, a
zone is a single data center or independent fault domain on-premise, or managed by a cloud
provider. For example, AWS Availability Zones or GCP Zones. Servers within a zone are
connected via high bandwidth (e.g. 1-10+ Gbps), low latency (up to 1 ms), and low cost
links.
Region
A region is a geographical region, such as a metro area, that consists of one or more zones.
Zones within a region are connected via high bandwidth (e.g. 1-4 Gbps), low latency (up to
10 ms), low cost links. Regions are typically connected through public internet via variable
bandwidth (e.g. 10-100 Mbps) and latency (100-500 ms) links.
Rack
A rack is typically composed of a set of servers (nodes). A rack has its own power supply
and switch (or switches), all attached to the same frame. On public cloud platforms such as
AWS, there is no equivalent concept of a rack.
General Recommendations
Latency
DC/OS master nodes should be connected to each other via highly available and low latency
network links. This is required because some of the coordinating components running on
these nodes use quorum writes for high availability. For example, Mesos masters, Marathon
schedulers, and ZooKeeper use quorum writes.
Similarly, most DC/OS services use ZooKeeper (or etcd, consul, etc) for scheduler leader
election and state storage. For this to be effective, service schedulers must be connected to
the ZooKeeper ensemble via a highly available, low latency network link.
Routing
DC/OS networking requires a unique address space. Cluster entities cannot share the same
IP address. For example, apps and DC/OS agents must have unique IP addresses.
Leader/Follower Architecture
A common pattern in HA systems is the leader/follower concept. This is also sometimes
referred to as: master/slave, primary/replica, or some combination thereof. This architecture
is used when you have one authoritative process, with N standby processes. In some
systems, the standby processes might also be capable of serving requests or performing
other operations. For example, when running a database like MySQL with a master and
replica, the replica is able to serve read-only requests, but it cannot accept writes (only the
master will accept writes).
In DC/OS, a number of components follow the leader/follower pattern. We'll discuss some of
them here and how they work.
Mesos
Mesos can be run in high availability mode, which requires running 3 or 5 masters. When run
in HA mode, one master is elected as the leader, while the other masters are followers. Each
master has a replicated log which contains some state about the cluster. The leading master
is elected by using ZooKeeper to perform leader election. For more detail on this, see the
Mesos HA documentation.
Marathon
Marathon can be run in HA mode, which allows running multiple Marathon instances (at
least 2 for HA), with one elected leader. Marathon uses ZooKeeper for leader election. The
followers do not accept writes or API requests; instead, they proxy all API requests
to the leading Marathon instance.
ZooKeeper
ZooKeeper is used by many DC/OS components for coordination, leader election, and state
storage. Like the masters, it achieves HA by running an ensemble of 3 or 5 nodes that must
maintain a quorum.
Fault Domain Isolation
Failures often affect an entire fault domain rather than a single machine:
Physical domains: this includes machine, rack, datacenter, region, and availability zone.
Network domains: machines within the same network may be subject to network
partitions. For example, a shared network switch may fail or have an invalid configuration.
Applications that require HA should also be distributed across fault domains. With
Marathon, this can be accomplished by using the UNIQUE and GROUP_BY constraint operators.
Separation of Concerns
HA services should be decoupled, with responsibilities divided amongst services. For
example, web services should be decoupled from databases and shared caches.
Fast Failover
When failures do occur, failover should be as fast as possible. Fast failover can be achieved
by:
Using an HA load balancer like Marathon-LB, or the internal Layer 4 load balancer.
Building apps in accordance with the 12-factor app manifesto.
Following REST best-practices when building services: in particular, avoiding storing
client state on the server between requests.
A number of DC/OS services follow the fail-fast pattern in the event of errors. Specifically,
both Mesos and Marathon will shut down in the case of unrecoverable conditions such as
losing leadership.
DNS Quick Reference
The following examples assume a load-balanced service myapp in the group
/outergroup/subgroup, with a port 555 named myport. The following addresses are generated:
outergroupsubgroupmyapp.marathon.l4lb.thisdcos.directory:555
This is only available when the service is load balanced. :555 is not part of the DNS
address; it is shown to indicate that this address and port are load balanced as a pair
rather than individually.
myapp-subgroup-outergroup.marathon.containerip.dcos.thisdcos.directory
This is only available when the service is running on a virtual network.
myapp-subgroup-outergroup.marathon.agentip.dcos.thisdcos.directory
This is always available and should be used when the service is not running on a
virtual network.
myapp-subgroup-outergroup.marathon.autoip.dcos.thisdcos.directory
This is always available and should be used to address an application that is
transitioning on or off a virtual network.
myapp-subgroup-outergroup.marathon.mesos
This is always available and is, for the most part, equivalent to the agentip. However, it
is less specific and less performant than the agentip, so its use is discouraged.
Upgrading
ENTERPRISE DC/OS Updated: April 17, 2017
If this upgrade is performed on a supported OS with all prerequisites fulfilled, this upgrade
should preserve the state of running tasks on the cluster. This document reuses portions of
the Advanced DC/OS Installation Guide.
Important:
The VIP features, added in DC/OS 1.8, require that ports 32768-65535 are open
between all agent and master nodes for both TCP and UDP.
Virtual networks require Docker 1.11 or later. For more information, see the
documentation.
An upgraded DC/OS Marathon leader cannot connect to a non-secure (that is, not yet
upgraded) leading Mesos master. The DC/OS UI cannot be trusted until all masters are
upgraded: there are multiple Marathon scheduler instances and multiple Mesos masters,
each being upgraded, and the Marathon leader may not be the Mesos leader.
Task history in the Mesos UI will not persist through the upgrade.
To modify your DC/OS configuration, you must run the installer with the modified
config.yaml and update your cluster using the new installation files. Changes to the DC/OS
configuration have the same risk as upgrading a host. Incorrect configurations could
potentially crash your hosts, or an entire cluster.
Only a subset of DC/OS configuration parameters can be modified. The adverse effects on
any software that is running on top of DC/OS are outside the scope of this document.
Contact Mesosphere Support for more information.
Here is a list of the parameters that you can modify (a sample config.yaml fragment follows the list):
check_time
dns_search
docker_remove_delay
gc_delay
resolvers
telemetry_enabled
use_proxy
http_proxy
https_proxy
no_proxy
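As a sketch, a config.yaml fragment that touches only modifiable parameters might look like this (the values are examples only):
telemetry_enabled: 'false'
resolvers:
  - 8.8.8.8
  - 8.8.4.4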
The security mode (security) can be changed but has special caveats.
You can only update to a stricter security mode. Security downgrades are not supported.
For example, if your cluster is in permissive mode and you want to downgrade to
disabled mode, you must reinstall the cluster and terminate all running workloads.
During each update, you can only increase your security by a single level. For example,
you cannot update directly from disabled to strict mode. To increase from disabled to
strict mode you must first update to permissive mode, and then update from permissive
to strict mode.
See the security modes documentation for a description of the different security modes and
what each means.
Instructions
These steps must be performed for version upgrades and cluster configuration changes.
Prerequisites
Mesos, Mesos Frameworks, Marathon, Docker and all running tasks in the cluster should
be stable and in a known healthy state.
For Mesos compatibility reasons, we recommend upgrading any running Marathon-on-
Marathon instances to Marathon version 1.3.5 before proceeding with this DC/OS
upgrade.
You must have access to copies of the config files used with the previous DC/OS version:
config.yaml and ip-detect.
You must be familiar with using systemctl and journalctl command line tools to review
and monitor service status. Troubleshooting notes can be found at the end of this
document.
You must be familiar with the Advanced DC/OS Installation Guide.
You should take a snapshot of ZooKeeper prior to upgrading. Marathon supports
rollbacks, but does not support downgrades.
Bootstrap Node
Choose your desired security mode and then follow the applicable upgrade instructions.
Copy your existing config.yaml and ip-detect files to an empty folder on your bootstrap
node.
Merge the old config.yaml into the new config.yaml format. In most cases the differences
will be minimal.
Important:
The command in the previous step will produce a URL in the last line of its output,
prefixed with Node upgrade script URL:. Record this URL for use in later steps. It will
be referred to in this document as the Node upgrade script URL.
Run the nginx container to serve the installation files.
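For example (a sketch; the port mapping and the genconf/serve path follow the advanced installer layout):
```bash
sudo docker run -d -p 80:80 -v $PWD/genconf/serve:/usr/share/nginx/html:ro nginx
```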
Prerequisite:
Your cluster must be upgraded to DC/OS 1.9 and running in disabled security mode
before it can be upgraded to permissive mode. If your cluster was running in permissive
mode before it was upgraded to DC/OS 1.9, you can skip this procedure.
To update a 1.9 cluster from disabled security to permissive security, complete the following
procedure:
The command in the previous step will produce a URL in the last line of its output,
prefixed with Node upgrade script URL:. Record this URL for use in later steps. It will
be referred to in this document as the Node upgrade script URL.
Run the nginx container to serve the installation files.
If you are updating a running DC/OS cluster to run in security: strict mode, beware that
security vulnerabilities may persist even after migration to strict mode. When moving to strict
mode, your services will now require authentication and authorization to register with Mesos
or access its HTTP API. You should test these configurations in permissive mode before
upgrading to strict, to maintain scheduler and script uptimes across the upgrade.
Prerequisite:
Your cluster must be upgraded to DC/OS 1.9 and running in permissive security mode
before it can be updated to strict mode. If your cluster was running in strict mode before it
was upgraded to DC/OS 1.9, you can skip this procedure.
To update a cluster from permissive security to strict security, complete the following
procedure:
The command in the previous step will produce a URL in the last line of its output,
prefixed with Node upgrade script URL:. Record this URL for use in later steps. It will
be referred to in this document as the Node upgrade script URL.
Run the nginx container to serve the installation files.
DC/OS Masters
Proceed with upgrading every master node, one at a time and in any order, using the following
procedure. After each upgrade, monitor the Mesos master metrics to ensure
the node has rejoined the cluster and completed reconciliation.
DC/OS Agents
Important: When upgrading agent nodes, there is a 5-minute timeout for the agent to
respond to health check pings from the Mesos masters before it is considered lost and its
tasks are given up for dead.
On all DC/OS agents:
Navigate to the /opt/mesosphere/lib directory and delete the following library file to
prevent conflicts:
libltdl.so.7
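For example:
```bash
sudo rm /opt/mesosphere/lib/libltdl.so.7
```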
Troubleshooting Recommendations
The following commands should provide insight into upgrade issues:
On DC/OS Masters
On DC/OS Agents
Notes:
Packages available in the DC/OS 1.9 Universe are newer than those in the DC/OS 1.8
Universe. Services are not automatically upgraded when DC/OS 1.9 is installed because
not all DC/OS services have upgrade paths that will preserve existing state.
Installing DC/OS
ENTERPRISE DC/OS Updated: April 17, 2017
Upgrading
This document provides instructions for upgrading a DC/OS cluster. If this upgrade is
performed on a supported OS with all prerequisites fulfilled, this upgrade should preserve
the...
DC/OS Custom Installation Options
The DC/OS Enterprise Edition is installed in your environment by using a dynamically
generated setup file. This file is generated by using specific parameters that are set during
c...
DC/OS Cloud Installation Options
You can install DC/OS by using cloud templates.
Local
This installation method uses Vagrant to create a cluster of virtual machines on your local
machine that can be used for demos, development, and testing with DC/OS. System
requirem...
High-Availability
This document discusses the high availability (HA) features in DC/OS and best practices for
building HA applications on DC/OS. Terminology Zone A zone is a failure domain that has ...
DC/OS Ports
This topic lists the ports that are required to launch DC/OS. Additional ports may be required
to launch the individual DC/OS services. All nodes TCP Port DC/OS component systemd u...
Opt-Out
Telemetry You can opt-out of providing anonymous data by disabling telemetry for your
cluster. To disable telemetry, add this parameter to your config.yaml file during installation...
Frequently Asked Questions
Q. Can I install DC/OS on an already running Mesos cluster? We recommend starting with a
fresh cluster to ensure all defaults are set to expected values. This prevents unexpected c...
Troubleshooting a Custom Installation
General troubleshooting approach: Verify that you have a valid ip-detect script,
functioning DNS resolvers to bind the DC/OS services to, and that all nodes are synchr...
Upgrading
ENTERPRISE DC/OS Updated: April 17, 2017
If this upgrade is performed on a supported OS with all prerequisites fulfilled, this upgrade
should preserve the state of running tasks on the cluster. This document reuses portions of
the Advanced DC/OS Installation Guide.
Important:
The VIP features, added in DC/OS 1.8, require that ports 32768-65535 are open
between all agent and master nodes for both TCP and UDP.
Virtual networks require Docker 1.11 or later. For more information, see the
documentation.
An upgraded DC/OS Marathon leader cannot connect to a non-secure (that is, not yet
upgraded) leading Mesos master. The DC/OS UI cannot be trusted until all masters are
upgraded: there are multiple Marathon scheduler instances and multiple Mesos masters,
each being upgraded, and the Marathon leader may not be the Mesos leader.
Task history in the Mesos UI will not persist through the upgrade.
Enterprise DC/OS downloads can be found here.
To modify your DC/OS configuration, you must run the installer with the modified
config.yaml and update your cluster using the new installation files. Changes to the DC/OS
configuration have the same risk as upgrading a host. Incorrect configurations could
potentially crash your hosts, or an entire cluster.
Only a subset of DC/OS configuration parameters can be modified. The adverse effects on
any software that is running on top of DC/OS are outside the scope of this document.
Contact Mesosphere Support for more information.
Here is a list of the parameters that you can modify:
check_time
dns_search
docker_remove_delay
gc_delay
resolvers
telemetry_enabled
use_proxy
http_proxy
https_proxy
no_proxy
The security mode (security) can be changed but has special caveats.
You can only update to a stricter security mode. Security downgrades are not supported.
For example, if your cluster is in permissive mode and you want to downgrade to
disabled mode, you must reinstall the cluster and terminate all running workloads.
During each update, you can only increase your security by a single level. For example,
you cannot update directly from disabled to strict mode. To increase from disabled to
strict mode you must first update to permissive mode, and then update from permissive
to strict mode.
See the security modes documentation for a description of the different security modes and
what each means.
Instructions
These steps must be performed for version upgrades and cluster configuration changes.
Prerequisites
Mesos, Mesos Frameworks, Marathon, Docker and all running tasks in the cluster should
be stable and in a known healthy state.
For Mesos compatibility reasons, we recommend upgrading any running Marathon-on-
Marathon instances to Marathon version 1.3.5 before proceeding with this DC/OS
upgrade.
You must have access to copies of the config files used with the previous DC/OS version:
config.yaml and ip-detect.
You must be familiar with using systemctl and journalctl command line tools to review
and monitor service status. Troubleshooting notes can be found at the end of this
document.
You must be familiar with the Advanced DC/OS Installation Guide.
You should take a snapshot of ZooKeeper prior to upgrading. Marathon supports
rollbacks, but does not support downgrades.
Bootstrap Node
Choose your desired security mode and then follow the applicable upgrade instructions.
Copy your existing config.yaml and ip-detect files to an empty folder on your bootstrap
node.
Merge the old config.yaml into the new config.yaml format. In most cases the differences
will be minimal.
Important:
The command in the previous step will produce a URL in the last line of its output,
prefixed with Node upgrade script URL:. Record this URL for use in later steps. It will
be referred to in this document as the Node upgrade script URL.
Run the nginx container to serve the installation files.
Prerequisite:
Your cluster must be upgraded to DC/OS 1.9 and running in disabled security mode
before it can be upgraded to permissive mode. If your cluster was running in permissive
mode before it was upgraded to DC/OS 1.9, you can skip this procedure.
To update a 1.9 cluster from disabled security to permissive security, complete the following
procedure:
The command in the previous step will produce a URL in the last line of its output,
prefixed with Node upgrade script URL:. Record this URL for use in later steps. It will
be referred to in this document as the Node upgrade script URL.
Run the nginx container to serve the installation files.
If you are updating a running DC/OS cluster to run in security: strict mode, beware that
security vulnerabilities may persist even after migration to strict mode. When moving to strict
mode, your services will now require authentication and authorization to register with Mesos
or access its HTTP API. You should test these configurations in permissive mode before
upgrading to strict, to maintain scheduler and script uptimes across the upgrade.
Prerequisite:
Your cluster must be upgraded to DC/OS 1.9 and running in permissive security mode
before it can be updated to strict mode. If your cluster was running in strict mode before it
was upgraded to DC/OS 1.9, you can skip this procedure.
To update a cluster from permissive security to strict security, complete the following
procedure:
The command in the previous step will produce a URL in the last line of its output,
prefixed with Node upgrade script URL:. Record this URL for use in later steps. It will
be referred to in this document as the Node upgrade script URL.
Run the nginx container to serve the installation files.
DC/OS Masters
Proceed with upgrading every master node, one at a time and in any order, using the following
procedure. After each upgrade, monitor the Mesos master metrics to ensure
the node has rejoined the cluster and completed reconciliation.
Verify that the upgrade script succeeded and exited with status code 0:
```bash
echo $?
0
```
Navigate to the /opt/mesosphere/lib directory and delete the following library file to
prevent conflicts:
libltdl.so.7
Verify that the upgrade script succeeded and exited with status code 0:
```bash
echo $?
0
```
Troubleshooting Recommendations
The following commands should provide insight into upgrade issues:
```bash
sudo journalctl -u dcos-download
sudo journalctl -u dcos-spartan
sudo systemctl | grep dcos
```
On DC/OS Masters
Notes:
Packages available in the DC/OS 1.9 Universe are newer than those in the DC/OS 1.8
Universe. Services are not automatically upgraded when DC/OS 1.9 is installed because
not all DC/OS services have upgrade paths that will preserve existing state.
The DC/OS installation process requires a cluster of nodes to install DC/OS onto and a
single node to run the DC/OS installation from.
Contact your sales representative or sales@mesosphere.io for access to the DC/OS setup
file.
Local
ENTERPRISE DC/OS Updated: April 17, 2017
This installation method uses Vagrant to create a cluster of virtual machines on your local
machine that can be used for demos, development, and testing with DC/OS.
System requirements
Hardware
Minimum 5 GB of memory to run DC/OS.
Software
Enterprise DC/OS setup file. Contact your sales representative or
sales@mesosphere.com to obtain this file.
DC/OS Vagrant. The installation and usage instructions are maintained in the dcos-
vagrant GitHub repository. Follow the deploy instructions to set up your host machine
correctly and to install DC/OS.
For the latest bug fixes, use the master branch.
For increased stability, use the latest official release.
For older releases on DC/OS, you may need to download an older release of DC/OS
Vagrant.
High-Availability
PREVIEW Updated: April 17, 2017
This document discusses the high availability (HA) features in DC/OS and best practices for
building HA applications on DC/OS.
Terminology
Zone
A zone is a failure domain that has isolated power, networking, and connectivity. Typically, a
zone is a single data center or independent fault domain on-premise, or managed by a cloud
provider. For example, AWS Availability Zones or GCP Zones. Servers within a zone are
connected via high bandwidth (e.g. 1-10+ Gbps), low latency (up to 1 ms), and low cost
links.
Region
A region is a geographical region, such as a metro area, that consists of one or more zones.
Zones within a region are connected via high bandwidth (e.g. 1-4 Gbps), low latency (up to
10 ms), low cost links. Regions are typically connected through public internet via variable
bandwidth (e.g. 10-100 Mbps) and latency (100-500 ms) links.
Rack
A rack is typically composed of a set of servers (nodes). A rack has its own power supply
and switch (or switches), all attached to the same frame. On public cloud platforms such as
AWS, there is no equivalent concept of a rack.
General Recommendations
Latency
DC/OS master nodes should be connected to each other via highly available and low latency
network links. This is required because some of the coordinating components running on
these nodes use quorum writes for high availability. For example, Mesos masters, Marathon
schedulers, and ZooKeeper use quorum writes.
Similarly, most DC/OS services use ZooKeeper (or etcd, consul, etc) for scheduler leader
election and state storage. For this to be effective, service schedulers must be connected to
the ZooKeeper ensemble via a highly available, low latency network link.
Routing
DC/OS networking requires a unique address space. Cluster entities cannot share the same
IP address. For example, apps and DC/OS agents must have unique IP addresses.
Leader/Follower Architecture
A common pattern in HA systems is the leader/follower concept. This is also sometimes
referred to as: master/slave, primary/replica, or some combination thereof. This architecture
is used when you have one authoritative process, with N standby processes. In some
systems, the standby processes might also be capable of serving requests or performing
other operations. For example, when running a database like MySQL with a master and
replica, the replica is able to serve read-only requests, but it cannot accept writes (only the
master will accept writes).
In DC/OS, a number of components follow the leader/follower pattern. We'll discuss some of
them here and how they work.
Mesos
Mesos can be run in high availability mode, which requires running 3 or 5 masters. When run
in HA mode, one master is elected as the leader, while the other masters are followers. Each
master has a replicated log which contains some state about the cluster. The leading master
is elected by using ZooKeeper to perform leader election. For more detail on this, see the
Mesos HA documentation.
Marathon
Marathon can be run in HA mode, which allows running multiple Marathon instances (at
least 2 for HA), with one elected leader. Marathon uses ZooKeeper for leader election. The
followers do not accept writes or API requests; instead, they proxy all API requests
to the leading Marathon instance.
ZooKeeper
ZooKeeper is used by many DC/OS components for coordination, leader election, and state
storage. Like the masters, it achieves HA by running an ensemble of 3 or 5 nodes that must
maintain a quorum.
Fault Domain Isolation
Failures often affect an entire fault domain rather than a single machine:
Physical domains: this includes machine, rack, datacenter, region, and availability zone.
Network domains: machines within the same network may be subject to network
partitions. For example, a shared network switch may fail or have an invalid configuration.
Applications that require HA should also be distributed across fault domains. With
Marathon, this can be accomplished by using the UNIQUE and GROUP_BY constraint operators.
Separation of Concerns
HA services should be decoupled, with responsibilities divided amongst services. For
example, web services should be decoupled from databases and shared caches.
Fast Failover
When failures do occur, failover should be as fast as possible. Fast failover can be achieved
by:
Using an HA load balancer like Marathon-LB, or Minuteman for internal layer 4 load
balancing.
Building apps in accordance with the 12-factor app manifesto.
Following REST best-practices when building services: in particular, avoiding storing
client state on the server between requests.
A number of DC/OS services follow the fail-fast pattern in the event of errors. Specifically,
both Mesos and Marathon will shut down in the case of unrecoverable conditions such as
losing leadership.
DC/OS Ports
ENTERPRISE DC/OS Updated: April 17, 2017
This topic lists the ports that are required to launch DC/OS. Additional ports may be required
to launch the individual DC/OS services.
All nodes
TCP

| Port | DC/OS component | systemd unit |
| --- | --- | --- |
| 61003 | REX-Ray | dcos-rexray.service |
| 61053 | Mesos DNS | dcos-mesos-dns.service |
| 61420 | Erlang Port Mapping Daemon (EPMD) | dcos-epmd.service |
| 62053 | DNS Forwarder (Spartan) | dcos-spartan.service |
| 62080 | Navstar | dcos-navstar.service |
| 62501 | DNS Forwarder (Spartan) | dcos-spartan.service |
| 62502 | Navstar | dcos-navstar.service |
UDP

| Port | DC/OS component | systemd unit |
| --- | --- | --- |
| 61053 | Mesos DNS | dcos-mesos-dns.service |
| 62053 | DNS Forwarder (Spartan) | dcos-spartan.service |
| 64000 | Navstar | dcos-navstar.service |
Master
TCP

| Port | DC/OS component | systemd unit |
| --- | --- | --- |
| 53 | DNS Forwarder (Spartan) | dcos-spartan.service |
| 80 | Admin Router Master | dcos-adminrouter.service |
| 443 | Admin Router Master | dcos-adminrouter.service |
| 1050 | DC/OS Diagnostics (3DT) | dcos-3dt.service |
| 1337 | DC/OS Secrets | dcos-secrets.service |
| 2181 | Exhibitor and ZooKeeper | dcos-exhibitor.service |
| 5050 | Mesos Master | dcos-mesos-master.service |
| 7070 | DC/OS Package Manager (Cosmos) | dcos-cosmos.service |
| 8080 | Marathon | dcos-marathon.service |
| 8101 | DC/OS Identity and Access Manager | dcos-bouncer.service |
| 8123 | Mesos DNS | dcos-mesos-dns.service |
| 8181 | Exhibitor and ZooKeeper | dcos-exhibitor.service |
| 8200 | Vault | dcos-vault.service |
| 8888 | DC/OS Certificate Authority | dcos-ca.service |
| 9990 | DC/OS Package Manager (Cosmos) | dcos-cosmos.service |
| 15055 | DC/OS History | dcos-history-service.service |
| 15101 | Marathon libprocess | dcos-marathon.service |
| 15201 | DC/OS Jobs (Metronome) libprocess | dcos-metronome.service |
| 62500 | DC/OS Network Metrics | dcos-networking_api.service |
| Dynamic | DC/OS Jobs (Metronome) | dcos-metronome.service |
| Dynamic | DC/OS Component Package Manager (Pkgpanda) | dcos-pkgpanda-api.service |
UDP

| Port | DC/OS component | systemd unit |
| --- | --- | --- |
| 53 | DNS Forwarder (Spartan) | dcos-spartan.service |
Agent
TCP

| Port | DC/OS component | systemd unit |
| --- | --- | --- |
| 5051 | Mesos Agent | dcos-mesos-slave.service |
| 61001 | Admin Router Agent | dcos-adminrouter-agent |
| 61002 | DC/OS Diagnostics (3DT) | dcos-3dt.service |
| 1025-2180 | Default advertised port range (for Marathon health checks) | |
| 2182-3887 | Default advertised port range (for Marathon health checks) | |
| 3889-5049 | Default advertised port range (for Marathon health checks) | |
| 5052-8079 | Default advertised port range (for Marathon health checks) | |
| 8082-8180 | Default advertised port range (for Marathon health checks) | |
| 8182-32000 | Default advertised port range (for Marathon health checks) | |
Opt-Out
ENTERPRISE DC/OS Updated: April 17, 2017
Telemetry
You can opt-out of providing anonymous data by disabling telemetry for your cluster. To
disable telemetry, add this parameter to your config.yaml file during installation (note this
requires using the CLI or advanced installers):
telemetry_enabled: 'false'
If you've already installed your cluster and want to disable this in-place, you can go through
an upgrade with the same parameter set.
To gracefully kill an agent node's Mesos process and allow systemd to restart it, use the
following command. Note: If Auto Scaling Groups are in use, the node will be replaced
automatically.
For a private agent:
```bash
sudo systemctl kill -s SIGUSR1 dcos-mesos-slave
```
For a public agent:
```bash
sudo systemctl kill -s SIGUSR1 dcos-mesos-slave-public
```
To gracefully kill the process and prevent systemd from restarting it, add a stop command.
For example, for a public agent:
```bash
sudo systemctl kill -s SIGUSR1 dcos-mesos-slave-public && sudo systemctl stop dcos-mesos-slave-public
```
Troubleshooting a Custom
Installation
Updated: April 17, 2017
It is recommended that you use the ip-detect [examples](/1.9/administration/installing/custom/advanced/).
DNS resolvers
You must have working DNS resolvers, specified in your
[config.yaml](/1.9/administration/installing/custom/configuration-parameters/#resolvers) file.
It is recommended that you have forward and reverse lookups for FQDNs, short hostnames,
and IP addresses. It is possible for DC/OS to function in environments without valid DNS
support, but the following *must* work to support DC/OS services, including Spark:
- `hostname -f` returns the FQDN
- `hostname -s` returns the short hostname
You should sanity check the output of `hostnamectl` on all of your nodes as well.
When troubleshooting problems with a DC/OS installation, you should explore the
components in this sequence:
1. Exhibitor
2. Mesos master
3. Mesos DNS
4. DNS Forwarder (Spartan)
5. DC/OS Marathon
6. Jobs
7. Admin Router
Be sure to check that all services are up and healthy on the masters before checking the
agents.
NTP
Network Time Protocol (NTP) must be enabled on all nodes for clock synchronization. By
default, during DC/OS startup you will receive an error if this is not enabled. You can check
if NTP is enabled by running one of these commands, depending on your OS and
configuration:
```bash
ntptime
adjtimex -p
timedatectl
```
Ensure that firewalls and any other connection-filtering mechanisms are not interfering with
cluster component communications. TCP, UDP, and ICMP must be permitted.
Ensure that services that bind to port 53, which is required by the DNS Forwarder
(dcos-spartan.service), are disabled and stopped. For example:
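A common example is dnsmasq (a sketch; the unit name on your distribution may differ):
```bash
sudo systemctl stop dnsmasq && sudo systemctl disable dnsmasq
```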
Verify that Exhibitor is up:
```bash
journalctl -flu dcos-exhibitor
```
Verify that `/tmp` is mounted *without* `noexec`. If it is mounted with `noexec`, Exhibitor
will fail to bring up ZooKeeper because Java JNI won't be able to `exec` a file it creates in
`/tmp`, and you will see multiple `permission denied` errors in the log. To repair `/tmp`
mounted with `noexec`:
```bash
mount -o remount,exec /tmp
```
Check the output of `/exhibitor/v1/cluster/status` and verify that it shows the correct
number of masters and that all of them are `"serving"`, but only one of them is designated
as `"isLeader": true`. For example, [SSH](/1.9/administration/access-node/sshcluster/) to
your master node and enter this command:
```bash
curl -fsSL http://localhost:8181/exhibitor/v1/cluster/status | python -m json.tool
```
```json
[
    {
        "code": 3,
        "description": "serving",
        "hostname": "10.0.6.70",
        "isLeader": false
    },
    {
        "code": 3,
        "description": "serving",
        "hostname": "10.0.6.69",
        "isLeader": false
    },
    {
        "code": 3,
        "description": "serving",
        "hostname": "10.0.6.68",
        "isLeader": true
    }
]
```
**Note:** Running this command in multi-master configurations can take up to 10-15 minutes
to complete. If it doesn't complete after 10-15 minutes, you should carefully review the
`journalctl -flu dcos-exhibitor` logs.
Verify whether you can ping the DNS Forwarder (ready.spartan). If not, review the DNS
Dispatcher service logs:
```bash
journalctl -flu dcos-spartan
```
If you are able to ping ready.spartan, but not leader.mesos, review the Mesos master
service logs by using this command:
```bash
journalctl -flu dcos-mesos-master
```
The Mesos masters must be up and running with a leader elected before Mesos-DNS
can generate its DNS records from `/state`.
Component logs
During DC/OS installation, each of the components will converge from a failing state to a
running state in the logs.
Admin Router
DC/OS Marathon
gen_resolvconf
Mesos DNS
Admin Router
The Admin Router is started on the master nodes. The Admin Router provides central
authentication and proxying to DC/OS services within the cluster, which allows you to
administer your cluster from outside the network without a VPN or an SSH tunnel. For HA,
an optional load balancer can be configured in front of each master node, load balancing
port 80, to provide failover and load balancing.
Troubleshooting:
SSH to your master node and enter this command to view the logs from boot time:
```bash
journalctl -u dcos-adminrouter -b
```
For example, here is a snippet of the Admin Router log as it converges to a successful
state:
```bash
systemd[1]: Starting A high performance web server and a reverse proxy server...
systemd[1]: Started A high performance web server and a reverse proxy server.
nginx[1652]: ip-10-0-7-166.us-west-2.compute.internal nginx: 10.0.7.166 - - [18/Nov/2015:14:01:10 +0000] "GET /mesos/master/state-summary HTTP/1.1" 200 575 "-" "python-requests/2.6.0 CPython/3.4.2 Linux/4.1.7-coreos"
nginx[1652]: ip-10-0-7-166.us-west-2.compute.internal nginx: 10.0.7.166 - - [18/Nov/2015:14:01:10 +0000] "GET /metadata HTTP/1.1" 200 175 "-" "python-requests/2.6.0 CPython/3.4.2 Linux/4.1.7-coreos"
```
Mesos agent
Publicly accessible applications run on public agent nodes. Public agent nodes can
be configured to allow outside traffic to access your cluster. Public agents are optional
and there is no minimum. This is where you'd run a load balancer, providing a service
from inside the cluster to the external public.
Troubleshooting:
You might not be able to SSH to agent nodes, depending on your cluster network
configuration. We have made this a little bit easier with the DC/OS CLI. For more
information, see SSHing to a DC/OS cluster.
You can get the IP address of registered agent nodes from the Nodes tab in the
DC/OS web interface. Nodes that have not registered are not shown.
SSH to your agent node and enter this command to view the logs from boot time:
```bash
journalctl -u dcos-mesos-slave -b
```
For example, here is a snippet of the Mesos agent log as it converges to a successful state:
```bash
mesos-slave[1080]: I1118 14:00:43.687366 1080 main.cpp:272] Starting Mesos slave
mesos-slave[1080]: I1118 14:00:43.688474 1080 slave.cpp:190] Slave started on 1)@10.0.1.108:5051
mesos-slave[1080]: I1118 14:00:43.688503 1080 slave.cpp:191] Flags at startup: --appc_store_dir="/tmp/mesos/store/appc" --authenticatee="crammd5" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="docker,mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_remove_delay="1hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --enforce_container_disk_quota="false" --executor_environment_variables="{"LD_LIBRARY_PATH":"\/opt\/mesosphere\/lib","PATH":"\/usr\/bin","SASL_PATH":"\/opt\/mesosphere\/lib\/sasl2","SHELL":"\/usr\/bin\/bash"}" --executor_registration_timeout="5mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="2days" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname_lookup="false" --image_provisioner_backend="copy" --initialize_driver_logging="true" --ip_discovery_command="/opt/mesosphere/bin/detect_ip" --isolation="cgroups/cpu,cgroups/mem" --launcher_dir="/opt/mesosphere/packages/mesos--30d3fbeb6747bb086d71385e3e2e0eb74ccdcb8b/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://leader.mesos:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --resource_monitoring_interval="1secs" --resources="ports:[1025-2180,2182-3887,3889-5049,5052-8079,8082-8180,8182-32000]" --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" --slave_subsystems="cpu,memory" --strict="true" --switch_user="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos/slave"
mesos-slave[1080]: I1118 14:00:43.688711 1080 slave.cpp:211] Moving slave process into its own cgroup for subsystem: cpu
mesos-slave[1080]: 2015-11-18 14:00:43,689:1080(0x7f9b526c4700):ZOO_INFO@check_events@1703: initiated connection to server [10.0.7.166:2181]
mesos-slave[1080]: I1118 14:00:43.692811 1080 slave.cpp:211] Moving slave process into its own cgroup for subsystem: memory
mesos-slave[1080]: I1118 14:00:43.697872 1080 slave.cpp:354] Slave resources: ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, 8182-32000]; cpus(*):4; mem(*):14019; disk(*):32541
mesos-slave[1080]: I1118 14:00:43.697916 1080 slave.cpp:390] Slave hostname: 10.0.1.108
mesos-slave[1080]: I1118 14:00:43.697928 1080 slave.cpp:395] Slave checkpoint: true
```
DC/OS Marathon
DC/OS Marathon is started on the master nodes. It is the native Marathon instance that
serves as the init system for DC/OS: it starts and monitors applications and services.
Troubleshooting:
SSH to your master node and enter this command to view the logs from boot time:
```bash
journalctl -u dcos-marathon -b
```
For example, here is a snippet of the DC/OS Marathon log as it converges to a successful
state:
```bash
java[1288]: I1118 13:59:39.125041 1363 group.cpp:331] Group process (group(1)@10.0.7.166:48531) connected to ZooKeeper
java[1288]: I1118 13:59:39.125100 1363 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
java[1288]: I1118 13:59:39.125121 1363 group.cpp:403] Trying to create path '/mesos' in ZooKeeper
java[1288]: [2015-11-18 13:59:39,130] INFO Scheduler actor ready (mesosphere.marathon.MarathonSchedulerActor:marathon-akka.actor.default-dispatcher-5)
java[1288]: I1118 13:59:39.147804 1363 detector.cpp:156] Detected a new leader: (id='1')
java[1288]: I1118 13:59:39.147924 1363 group.cpp:674] Trying to get '/mesos/json.info_0000000001' in ZooKeeper
java[1288]: I1118 13:59:39.148727 1363 detector.cpp:481] A new leading master (UPID=master@10.0.7.166:5050) is detected
java[1288]: I1118 13:59:39.148787 1363 sched.cpp:262] New master detected at master@10.0.7.166:5050
java[1288]: I1118 13:59:39.148952 1363 sched.cpp:272] No credentials provided. Attempting to register without authentication
java[1288]: I1118 13:59:39.150403 1363 sched.cpp:641] Framework registered with cdcb6222-65a1-4d60-83af-33dadec41e92-0000
```
gen_resolvconf
gen_resolvconf is started. This service helps the agent nodes locate the master nodes: it updates /etc/resolv.conf so that agents can use the Mesos-DNS service for service discovery. gen_resolvconf uses either an internal load balancer, VRRP, or a static list of masters to locate the master nodes. For more information, see the master_discovery configuration parameter.
Troubleshooting:
When gen_resolvconf is up and running, you can view /etc/resolv.conf contents. It should
contain one or more IP addresses for the master nodes, and the optional external DNS server.
SSH to your master node and enter this command to view the logs from boot time:
```bash
journalctl -u dcos-gen-resolvconf -b
```
Mesos Master
Troubleshooting:
Go directly to the Mesos web interface and view status at <master-hostname>/mesos.
SSH to your master node and enter this command to view the logs from boot time:
```bash
journalctl -u dcos-mesos-master -b
```
For example, here is a snippet of the Mesos master log as it converges to a successful state:
```bash
mesos-master[1250]: I1118 13:59:33.890916 1250 master.cpp:376] Master cdcb6222-65a1-4d60-83af-33dadec41e92 (10.0.7.166) started on 10.0.7.166:5050
mesos-master[1250]: I1118 13:59:33.890945 1250 master.cpp:378] Flags at startup: --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_slaves="false" --authenticators="crammd5" --authorizers="local" --cluster="pool-880dfdbf0f2845bf8191" --framework_sorter="drf" --help="false" --hostname_lookup="false" --initialize_driver_logging="true" --ip_discovery_command="/opt/mesosphere/bin/detect_ip" --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --port="5050" --quiet="false" --quorum="1" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="5secs" --registry_strict="false" --roles="slave_public" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/opt/mesosphere/packages/mesos--30d3fbeb6747bb086d71385e3e2e0eb74ccdcb8b/share/mesos/webui" --weights="slave_public=1" --work_dir="/var/lib/mesos/master" --zk="zk://127.0.0.1:2181/mesos" --zk_session_timeout="10secs"
mesos-master[1250]: 2015-11-18 13:59:33,891:1250(0x7f14427fc700):ZOO_INFO@check_events@1750: session establishment complete on server [127.0.0.1:2181], sessionId=0x1511ae440bc0001, negotiated timeout=10000
```
Mesos-DNS
Mesos-DNS is started on the DC/OS master nodes. Mesos-DNS provides service
discovery within the cluster. Optionally, Mesos-DNS can forward unhandled requests to
an external DNS server, depending on how the cluster is configured. For example,
anything that does not end in .mesos will be forwarded to the external resolver.
Troubleshooting:
SSH to your master node and enter this command to view the logs from boot time:
```bash
journalctl -u dcos-mesos-dns -b
```
For example, here is a snippet of the Mesos-DNS log as it converges to a successful state:
```bash
mesos-dns[1197]: I1118 13:59:34.763885 1197 detect.go:135] changing leader node from "" -> "json.info_0000000001"
mesos-dns[1197]: I1118 13:59:34.764537 1197 detect.go:145] detected master info: &MasterInfo{Id:*cdcb6222-65a1-4d60-83af-33dadec41e92,Ip:*2785476618,Port:*5050,Pid:*master@10.0.7.166:5050,Hostname:*10\.0.7.166,Version:*0\.25.0,Address:&Address{Hostname:*10\.0.7.166,Ip:*10\.0.7.166,Port:*5050,XXX_unrecognized:[],},XXX_unrecognized:[],}
mesos-dns[1197]: VERY VERBOSE: 2015/11/18 13:59:34 masters.go:47: Updated leader: &MasterInfo{Id:*cdcb6222-65a1-4d60-83af-33dadec41e92,Ip:*2785476618,Port:*5050,Pid:*master@10.0.7.166:5050,Hostname:*10\.0.7.166,Version:*0\.25.0,Address:&Address{Hostname:*10\.0.7.166,Ip:*10\.0.7.166,Port:*5050,XXX_unrecognized:[],},XXX_unrecognized:[],}
mesos-dns[1197]: VERY VERBOSE: 2015/11/18 13:59:34 main.go:76: new masters detected: [10.0.7.166:5050]
mesos-dns[1197]: VERY VERBOSE: 2015/11/18 13:59:34 generator.go:70: Zookeeper says the leader is: 10.0.7.166:5050
mesos-dns[1197]: VERY VERBOSE: 2015/11/18 13:59:34 generator.go:162: reloading from master 10.0.7.166
mesos-dns[1197]: I1118 13:59:34.766005 1197 detect.go:219] notifying of master membership change: [&MasterInfo{Id:*cdcb6222-65a1-4d60-83af-33dadec41e92,Ip:*2785476618,Port:*5050,Pid:*master@10.0.7.166:5050,Hostname:*10\.0.7.166,Version:*0\.25.0,Address:&Address{Hostname:*10\.0.7.166,Ip:*10\.0.7.166,Port:*5050,XXX_unrecognized:[],},XXX_unrecognized:[],}]
mesos-dns[1197]: VERY VERBOSE: 2015/11/18 13:59:34 masters.go:56: Updated masters: [&MasterInfo{Id:*cdcb6222-65a1-4d60-83af-33dadec41e92,Ip:*2785476618,Port:*5050,Pid:*master@10.0.7.166:5050,Hostname:*10\.0.7.166,Version:*0\.25.0,Address:&Address{Hostname:*10\.0.7.166,Ip:*10\.0.7.166,Port:*5050,XXX_unrecognized:[],},XXX_unrecognized:[],}]
mesos-dns[1197]: I1118 13:59:34.766124 1197 detect.go:313] resting before next detection cycle
```
Exhibitor
SSH to your master node and enter this command to view the logs from boot time:
```bash
journalctl -u dcos-exhibitor -b
```
For example, here is a snippet of the Exhibitor log as it converges to a successful state:
```bash
INFO com.netflix.exhibitor.core.activity.ActivityLog Automatic Instance Management will change the server list: ==> 1:10.0.7.166 [ActivityQueue-0]
INFO com.netflix.exhibitor.core.activity.ActivityLog State: serving [ActivityQueue-0]
INFO com.netflix.exhibitor.core.activity.ActivityLog Server list has changed [ActivityQueue-0]
INFO com.netflix.exhibitor.core.activity.ActivityLog Attempting to stop instance [ActivityQueue-0]
INFO com.netflix.exhibitor.core.activity.ActivityLog Attempting to start/restart ZooKeeper [ActivityQueue-0]
INFO com.netflix.exhibitor.core.activity.ActivityLog Kill attempted result: 0 [ActivityQueue-0]
INFO com.netflix.exhibitor.core.activity.ActivityLog Process started via: /opt/mesosphere/active/exhibitor/usr/zookeeper/bin/zkServer.sh [ActivityQueue-0]
ERROR com.netflix.exhibitor.core.activity.ActivityLog ZooKeeper Server: JMX enabled by default [pool-3-thread-1]
ERROR com.netflix.exhibitor.core.activity.ActivityLog ZooKeeper Server: Using config: /opt/mesosphere/active/exhibitor/usr/zookeeper/bin/../conf/zoo.cfg [pool-3-thread-1]
INFO com.netflix.exhibitor.core.activity.ActivityLog ZooKeeper Server: Starting zookeeper ... STARTED [pool-3-thread-3]
INFO com.netflix.exhibitor.core.activity.ActivityLog Cleanup task completed [pool-3-thread-6]
INFO com.netflix.exhibitor.core.activity.ActivityLog Cleanup task completed [pool-3-thread-9]
```
Performance Monitoring
ENTERPRISE DC/OS Updated: April 17, 2017
Here are some recommendations for monitoring a DC/OS cluster. You can use any
monitoring tools. The endpoints listed below will help you troubleshoot when issues
occur.
Your monitoring tools should leverage historic data points so that you can track changes
and deviations. You should monitor your cluster when it is known to be in a healthy state
as well as unhealthy. This will give you a baseline for what is normal in the DC/OS
environment. With this historical data, you can fine tune your tools and set appropriate
thresholds and conditions. When these thresholds are exceeded, you can send alerts to
administrators.
Mesos and Marathon expose the following types of metrics:
Gauges are metrics that provide the current state at the moment they are queried.
Counters are metrics that are additive and include past and present results. These metrics are not persisted across failover.
Marathon has a timer metric that measures how long an event takes. Timers do not exist for Mesos observability metrics.
Marathon metrics
Marathon provides a number of metrics for monitoring. Here are the ones that are particularly useful to DC/OS. You can query the metrics HTTP endpoint in your DC/OS cluster at <Master-Public-IP>/marathon/metrics.
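For example, you might pull and pretty-print the metrics with curl (the address is a placeholder; on Enterprise DC/OS the request also needs an authentication token):
```bash
curl --insecure https://<Master-Public-IP>/marathon/metrics | python -m json.tool | more
```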
Lifetime metrics
service.mesosphere.marathon.uptime (gauge) This metric provides the uptime, in
milliseconds, of the reporting Marathon process. Use this metric to diagnose stability
problems that can cause Marathon to restart.
Running tasks
service.mesosphere.marathon.task.running.count (gauge) This metric provides the
number of tasks that are running.
Staged tasks
service.mesosphere.marathon.task.staged.count (gauge) This metric provides the number
of tasks that are staged. Tasks are staged immediately after they are launched. A consistently high number of staged tasks indicates that a large number of tasks are being stopped and restarted. This can be caused by, for example, a high number of app updates or manual restarts.
service.mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl
.publishFuture (timer) This metric calculates how long it takes Marathon to process status
updates.
Mesos metrics
Mesos provides a number of metrics for monitoring. Here are the ones that are
particularly useful to DC/OS.
Master
These metrics should not increase over time
master/slave_removals (counter) This metric provides the number of agents removed for
various reasons, including maintenance. Use this metric to determine network partitions
after a large number of agents have disconnected. If this number greatly deviates from the
previous number, your system administrator should be notified (PagerDuty etc).
master/tasks_error (counter) This metric provides the number of invalid tasks.
master/tasks_failed (counter) This metric provides the number of failed tasks.
master/tasks_killed (counter) This metric provides the number of killed tasks.
master/tasks_lost (counter) This metric provides the number of lost tasks. A lost task means a task was killed or disconnected by an external factor. Use this metric when a large number of tasks deviates from the historic baseline.
master/recovery_slave_removals (counter) This metric provides the number of agents that were not re-registered during master failover.
master/slave_removals is a broad endpoint that combines ../reason_unhealthy, ../reason_unregistered, and ../reason_registered. You can monitor it explicitly or use master/slave_removals/reason_unhealthy, master/slave_removals/reason_unregistered, and master/slave_removals/reason_registered for specifics.
master/slave_removals/reason_unhealthy (counter) This metric provides the number of agents removed because of failed health checks. This endpoint returns the total number of agents that were unhealthy.
master/elected (gauge) This metric indicates whether this master is the elected leader. Fetch this metric from all masters; the values should add up to exactly 1. If the sum is not 1 for a period of time, your system administrator should be notified (PagerDuty, etc.).
master/uptime_secs (gauge) This metric provides the master uptime, in seconds. This
number should be at least 5 minutes (300 seconds) to indicate a stable master. You can use
this metric to detect flapping. For example, if the master has an uptime of less than 1
minute (60 seconds) for more than 10 minutes, it has probably restarted 10 or more times.
Agent
These metrics should not decrease over time
slave/uptime_secs (gauge) This metric provides the agent uptime, in seconds. This number should always be increasing. If this number resets to 0, the agent process has been restarted. You can use this metric to detect flapping. For example, if the agent has an uptime of less than 1 minute (60 seconds) for more than 10 minutes, it has probably restarted 10 or more times.
slave/registered (gauge) This metric indicates whether this agent is registered with a master. This value should always be 1. A 0 indicates that the agent is looking to join a new master.
General
Check the Marathon app health API endpoint for your critical applications.
Watch the Mesos endpoint that indicates how many agents have been shut down; an increase means agents are being lost.
Check for Mesos masters with short uptimes, which is exposed in the Mesos metrics.
Modify the mesos-master log rotation configuration to store the complete logs for at least one day.
Make sure the master nodes have plenty of disk space.
Change the logrotate option from rotate 7 to maxage 14 or more, as in the example below.
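A sketch of that change, assuming the mesos-master stanza lives at /etc/logrotate.d/mesos-master (the exact path and file patterns can differ per installation):
```bash
# /etc/logrotate.d/mesos-master (illustrative)
/var/log/mesos/mesos-master.* {
    daily
    compress
    missingok
    maxage 14    # instead of "rotate 7": keep rotated logs for at least 14 days
}
```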
Logging
PREVIEW Updated: April 17, 2017
DC/OS cluster nodes generate logs that contain diagnostic and status information for
DC/OS core components and DC/OS services.
You can access information about DC/OS scheduler services, like Marathon or Kafka,
with the following CLI command:
```bash
dcos service log --follow <scheduler-service-name>
```
You can access DC/OS task logs by running this CLI command:
```bash
dcos task log --follow <service-name>
```
You can access the logs for the master node with the following CLI command:
```bash
dcos node log --leader
```
To access the logs for an agent node, run dcos node to get the Mesos IDs of your nodes,
then run the following CLI command:
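A likely form of that command, assuming the --mesos-id flag of dcos node log:
```bash
dcos node log --mesos-id=<mesos-id>
```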
You can download all the log files for your service from the Services > Services tab in the DC/OS GUI. You can also monitor stdout/stderr. For more information, see the Service and Task Logs quick start guide.
System Logs
DC/OS components use systemd-journald to store their logs. To access the DC/OS core component logs, SSH into a node and run this command to see all logs:
```bash
journalctl -u "dcos-*" -b
```
You can view the logs for specific components by entering the component name. For example, to access Admin Router logs, run this command:
```bash
journalctl -u dcos-nginx -b
```
You can find which components are unhealthy in the DC/OS GUI from the Nodes tab.
Aggregation
Streaming logs from machines in your cluster isn't always viable. Sometimes you need the logs stored somewhere else as a history of what's happened. This is where log aggregation is required. Check out how to set it up with some of the most common solutions:
ELK
Splunk
Debugging from the DC/OS Web Interface
You can debug your service or pod from the DC/OS web interface.
The Services > Services page lists each service or pod, the resources it has requested,
and its status. Possible statuses are Deploying, Waiting, or Running. If you have set up a
Marathon health check, you can also see the health of your service or pod: a green dot
for healthy and a red dot for unhealthy. If you have not set up a health check, the dot will
be gray.
Debugging Page
Clicking the name of a service or pod and then the Debug tab reveals a detailed debugging page. There, you will see sections for Last Changes, Last Task Failure, Task Statistics, and Recent Resource Offers. You will also see a Summary of resource offers and what percentage of those offers matched your pod or service's requirements, as well as a Details section that lists the host where your service or pod is running and which resource offers were successful and unsuccessful for each deployment. You can use the information on this page to learn where and how you need to modify your service or pod definition.
Debugging
PREVIEW Updated: April 17, 2017
DC/OS offers several tools to debug your services when they are stuck in deployment or are not behaving as you expect. This topic discusses how to debug your services using both the DC/OS CLI and the DC/OS web interface.
The dcos task exec command allows you to execute an arbitrary command inside of a task's container and stream its output back to your local terminal. It offers an experience very similar to docker exec.
Users do not need SSH keys to execute the dcos task exec command. Enterprise
DC/OS provides several debugging permissions so that users do not need the
dcos:superuser permission either.
You can execute this command in any of the following four modes.
dcos task exec <task-id> <command> (no flags): streams STDOUT and STDERR from the
remote terminal to your local terminal as raw bytes.
dcos task exec --tty <task-id> <command>: streams STDOUT and STDERR from the remote terminal to your local terminal, but not as raw bytes. Instead, this option puts your local terminal into raw mode, allocates a remote pseudo terminal (PTY), and streams the STDOUT and STDERR through the remote PTY.
dcos task exec --interactive <task-id> <command>: streams STDOUT and STDERR from the remote terminal to your local terminal and streams STDIN from your local terminal to the remote command.
dcos task exec --interactive --tty <task-id> <command>: streams STDOUT and
STDERR from the remote terminal to your local terminal and streams STDIN from
your local terminal to the remote terminal. Also puts your local terminal into raw mode;
allocates a remote pseudo terminal (PTY); and streams STDOUT, STDERR, and
STDIN through the remote PTY. This mode offers the maximum functionality.
Note: If your mode streams raw bytes, you won't be able to launch programs like vim because these programs require the use of control characters.
Tip: We have included the text of the full flags above for readability, but each one can be
shortened. Instead of typing --interactive, you can just type -i. Likewise, instead of
typing --tty, you can just type -t.
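For example, to open an interactive shell inside a running task's container (the task ID is illustrative):
```bash
dcos task                                  # look up the <task-id> of a running task
dcos task exec -i -t <task-id> /bin/bash   # same as --interactive --tty
```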
Requirement: To use the debugging feature, the service or job must be launched using
either the Mesos container runtime or the Universal container runtime. Debugging cannot
be used on containers launched with the Docker runtime. See Using Mesos
Containerizers for more information.
Jobs
PREVIEW Updated: April 17, 2017
Quick Start
PREVIEW Updated: April 17, 2017
You can create and administer jobs in the DC/OS web interface, from the DC/OS CLI, or
via the API.
Add a Job
From the DC/OS web interface, click the Jobs tab, then the Create a Job button. Fill in
the following fields, or toggle to JSON mode to edit the JSON directly.
General Tab
ID The ID of your job.
Disk space The amount of disk space, in MiB, your job requires.
Command The command your job will execute. Leave this blank if you will use a Docker
image.
Schedule Tab
Check the Run on a Schedule checkbox to reveal the following fields.
* Cron Schedule Specify the schedule in cron format. Use this crontab generator for
help.
* Time Zone Enter the time zone in TZ format, e.g. America/New_York.
* Starting Deadline This is the time, in seconds, to start the job if it misses its scheduled time for any reason. Missed job executions are counted as failed.
Labels
Label Name and Label Value Attach metadata to your jobs so you can filter them.
Learn more about labels.
DC/OS CLI
You can create and manage jobs from the DC/OS CLI using dcos job commands. To see
a full list of available commands, run dcos job --help.
Add a Job
Create a job file in JSON format. The id parameter is the job ID. You will use this ID later to
manage your job.
```json
{
  "id": "myjob",
  "description": "A job that sleeps regularly",
  "run": {
    "cmd": "sleep 20000",
    "cpus": 0.01,
    "mem": 32,
    "disk": 0
  },
  "schedules": [
    {
      "id": "sleep-schedule",
      "enabled": true,
      "cron": "20 0 * * *",
      "concurrencyPolicy": "ALLOW"
    }
  ]
}
```
Note: You can choose any name for your job file.
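Then add the job with the DC/OS CLI. Assuming you named the file myjob.json:
```bash
dcos job add myjob.json
```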
Go to the Jobs tab of the DC/OS web interface to verify that you have added your job,
or verify from the CLI:
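For example, with the job listing command:
```bash
dcos job list
```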
Schedule-Only JSON
If you use the same schedule for more than one job, you can create a separate JSON file
for the schedule. Use the $ dcos job schedule add <job-id> <schedule-file> command
to associate a job with the schedule.
Remove a Job
Enter the following command on the DC/OS CLI:
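The removal command takes the job ID from your job file:
```bash
dcos job remove <job-id>
```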
Go to the Jobs tab of the DC/OS web interface to verify that you have removed your job, or
verify from the CLI:
Modify a Job
To modify your job, update your JSON job file, then run the appropriate dcos job command. For example, to manage a job's schedules:
```bash
dcos job schedule add <job-id> <schedule-file>.json
dcos job schedule remove <job-id> <schedule-id>
dcos job schedule update <job-id> <schedule-file>.json
```
To get the log for a specific job run, use a job run ID from dcos job history <job-id>.
Note: The DC/OS CLI and web interface support a combined JSON format (accessed via
the /v0 endpoint) that allows you to specify a schedule in the job descriptor. To schedule
a job via the API, use two calls: one to add an unscheduled job and another to associate
a <schedule-file>.json with the job.
Add a Job
The following command adds a job defined in myjob.json:
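A sketch of that call, assuming the Metronome /v0 endpoint that the note above describes for combined job-plus-schedule descriptors:
```bash
curl -X POST -H "Content-Type: application/json" \
  -d@myjob.json \
  http://<master-ip>/service/metronome/v0/scheduled-jobs
```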
Remove a Job
The following command removes a job regardless of whether the job is running:
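A sketch, assuming the Metronome v1 jobs endpoint and its stopCurrentJobRuns flag:
```bash
curl -X DELETE \
  "http://<master-ip>/service/metronome/v1/jobs/<job-id>?stopCurrentJobRuns=true"
```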
Tutorials
ENTERPRISE DC/OS Updated: April 17, 2017
In this tutorial, a containerized Ruby on Rails app named Tweeter is installed and deployed using DC/OS. Tweeter is an app similar to Twitter that you can use to post 140-character messages to the internet. Then, you use Zeppelin to perform real-time analytics on the data created by Tweeter.
Tweeter:
Stores tweets in the DC/OS Cassandra service.
Performs real-time analytics with the DC/OS Spark and Zeppelin services.
This tutorial uses DC/OS to launch and deploy these microservices to your cluster.
The Cassandra database is used on the backend to store the Tweeter app data.
The Kafka publish-subscribe message service receives tweets from Cassandra and routes them
to Zeppelin for real-time analytics.
The Marathon load balancer (Marathon-LB) is an HAProxy based load balancer for Marathon
only. It is useful when you require external routing or layer 7 load balancing features.
Zeppelin is an interactive analytics notebook that works with DC/OS Spark on the backend to enable interactive analytics and visualization. Because it's possible for Spark and Zeppelin to consume all of your cluster resources, you must specify a maximum number of cores for the Zeppelin service.
This tutorial demonstrates how you can build a complete IoT pipeline on DC/OS in about
15 minutes! You will learn:
How to install DC/OS services.
Prerequisites:
Enterprise DC/OS is installed with:
Security mode set to permissive or strict. By default, DC/OS installs in permissive
security mode.
The public IP address of your public agent node. After you have installed DC/OS with a public agent node declared, look up that node's public IP address.
Git:
OS X: Get the installer from Git downloads.
Tip: You can also install DC/OS packages from the DC/OS CLI with the dcos package
install command.
Find the cassandra package and click the INSTALL PACKAGE button, then accept the default installation. Cassandra spins up at least 3 nodes.
Find the kafka package and click the INSTALL button, then accept the default installation. Kafka spins up 3 brokers.
Install Zeppelin.
Find the zeppelin package and click the INSTALL button, then choose the ADVANCED INSTALLATION option.
Click the spark tab and set cores_max to 8.
Install Marathon-LB.
Install the security CLI (dcos-enterprise-cli) by using the DC/OS CLI package install
commands. You will use this to partially configure the Marathon-LB security.
Search for the security CLI package repository by using the dcos package search command. In
this example the partial value enterprise* is used as an argument.
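The search invocation producing the output below would be:
```bash
dcos package search enterprise*
```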
```bash
NAME                  VERSION   SELECTED   FRAMEWORK   DESCRIPTION
dcos-enterprise-cli   1.0.7     False      False       Enterprise DC/OS CLI
```
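The output that follows comes from installing the CLI subcommand, presumably via:
```bash
dcos package install dcos-enterprise-cli --cli
```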
```bash
Installing CLI subcommand for package [dcos-enterprise-cli] version [1.0.7]
New command available: dcos security
```
Create a new service account with the ID marathon-lb-service-acct. This command uses the
public-key.pem created in the previous step.
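A sketch of the service-account creation, assuming a key pair (private-key.pem / public-key.pem) generated with the dcos security CLI in the elided previous step:
```bash
dcos security org service-accounts create -p public-key.pem \
  -d "Marathon-LB service account" marathon-lb-service-acct
```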
Create a new secret (marathon-lb-secret) using the private key (private-key.pem) and the
name of the service account (marathon-lb-service-acct).
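A sketch of that step, using the create-sa-secret subcommand of the Enterprise CLI:
```bash
dcos security secrets create-sa-secret private-key.pem marathon-lb-service-acct marathon-lb-secret
```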
You can verify that the secret was created successfully with this command.
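Presumably by listing the secret store from the root path, which should show the entry below:
```bash
dcos security secrets list /
```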
```bash
- marathon-lb-secret
```
Retrieve the DC/OS certificate authority bundle:
```bash
curl -k -v https://<master-ip>/ca/dcos-ca.crt
```
```bash
> GET /ca/dcos-ca.crt HTTP/1.1
> Host: 54.149.23.77
> User-Agent: curl/7.43.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: openresty/1.9.15.1
< Date: Tue, 11 Oct 2016 18:30:49 GMT
< Content-Type: application/x-x509-ca-cert
< Content-Length: 1241
< Last-Modified: Tue, 11 Oct 2016 15:17:28 GMT
< Connection: keep-alive
< ETag: "57fd0288-4d9"
< Accept-Ranges: bytes
<
-----BEGIN CERTIFICATE-----
MIIDaDCCAlCgAwI...
-----END CERTIFICATE-----
```
Grant the permissions and the allowed action to the service account.
Install Marathon-LB from the DC/OS CLI with the config.json file specified.
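Assuming your options file is named config.json as stated, the install command would be:
```bash
dcos package install marathon-lb --options=config.json
```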
Monitor the Services tab to watch as your microservices are deployed on DC/OS. You will see the
Health status go from Idle to Unhealthy, and finally to Healthy as the nodes come online. This
may take several minutes.
Note: It can take up to 10 minutes for Cassandra to initialize with DC/OS because of race
conditions.
Add the HAPROXY_0_VHOST definition with the public IP address of your public agent node to
your tweeter.json file.
Important: You must remove the leading http:// and the trailing /.
Install and deploy Tweeter to your DC/OS cluster with this CLI command.
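Assuming the app definition file is the tweeter.json edited above:
```bash
dcos marathon app add tweeter.json
```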
Tip: The instances parameter in tweeter.json specifies the number of app instances.
Use the following command to scale your app up or down:
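A sketch, assuming the app ID is tweeter and using the instances parameter mentioned in the tip above:
```bash
dcos marathon app update tweeter instances=<number-of-instances>
```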
Navigate to the public agent node endpoint to see the Tweeter UI and post a tweet!
Tip: If you're having trouble, verify the HAPROXY_0_VHOST value in the tweeter.json file.
Click the Networking -> Service Addresses tab in the DC/OS web interface and select
the 1.1.1.1:30000 virtual network to see the load balancing in action.
The post-tweets app works by streaming to the VIP 1.1.1.1:30000. This address is
declared in the cmd parameter of the post-tweets.json app definition.
The Tweeter app uses the service discovery and load balancer service that is installed on
every DC/OS node. This address is defined in the tweeter.json definition VIP_0.
Run the Load Dependencies step to load the required libraries into Zeppelin.
Run the Spark Streaming step, which reads the tweet stream from Kafka and puts the tweets into a temporary table that can be queried using SparkSQL.
Run the Top Tweeters SQL query, which counts the number of tweets per user using the
table created in the previous step. The table updates continuously as new tweets come
in, so re-running the query will produce a different result every time.
This tutorial shows how to create and deploy a simple one-command service and a
containerized service using both the DC/OS web interface and the CLI.
Prerequisites
A DC/OS cluster
DOCKER ENGINE Use this option if you require specific features of the Docker package.
If you select this option, you must specify a Docker container image in the Container
Image field. For example, you can specify the Alpine Docker image.
MESOS RUNTIME Use this option if you prefer the original Mesos container runtime. It
does not support Docker containers.
UNIVERSAL CONTAINER RUNTIME Use this option if you are using Pods or GPUs. This
option also supports Docker images without depending on the Docker Engine. If you select
this option, you can optionally specify a Docker container image in the Container Image
field. For example, you can specify the Alpine Docker image.
For more information, see the containerizer documentation.
That's it! Click the name of your service in the Services view to see it running and monitor health.
To deploy the same service from the DC/OS CLI, create a JSON file called my-app-cli.json with the following contents:
```json
{
  "id": "/my-app-cli",
  "cmd": "sleep 10",
  "instances": 1,
  "cpus": 1,
  "mem": 128,
  "disk": 0,
  "gpus": 0,
  "backoffSeconds": 1,
  "backoffFactor": 1.15,
  "maxLaunchDelaySeconds": 3600,
  "upgradeStrategy": {
    "minimumHealthCapacity": 1,
    "maximumOverCapacity": 1
  },
  "portDefinitions": [
    {
      "protocol": "tcp",
      "port": 10000
    }
  ],
  "requirePorts": false
}
```
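Then add the app, assuming the file name above:
```bash
dcos marathon app add my-app-cli.json
```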
Run the following command to verify that your service is running:
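The standard check lists your Marathon apps:
```bash
dcos marathon app list
```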
You can also click the name of your service in the Services view of the DC/OS web
interface to see it running and monitor health.
Click the Services tab of the DC/OS web interface, then click RUN A SERVICE.
Click Single Container and enter a name for your service in the SERVICE ID field.
Click the Container Settings tab and enter the following in the Container Image field: mesosphere/hello-dcos:<image-tag>. Replace <image-tag> with the tag you copied in step 1.
Click Deploy.
In the Services tab, click the name of your service, then choose one of the task instances. Click Logs, then toggle between STDERR and STDOUT to see the output of the service.
Create a JSON file called hello-dcos-cli.json with the following contents. Replace <image-
tag> in the docker:image field with the tag you copied in step 1.
```json
{
"id": "/hello-dcos-cli",
"instances": 1,
"cpus": 1,
"mem": 128,
"disk": 0,
"gpus": 0,
"backoffSeconds": 1,
"backoffFactor": 1.15,
"maxLaunchDelaySeconds": 3600,
"container": {
"docker": {
"image": "mesosphere/hello-dcos:<image-tag>",
"forcePullImage": false,
"privileged": false,
"network": "HOST"
}
},
"upgradeStrategy": {
"minimumHealthCapacity": 1,
"maximumOverCapacity": 1
},
"portDefinitions": [
{
"protocol": "tcp",
"port": 10001
}
],
"requirePorts": false
}
```
Run the service with the following command:
```bash
dcos marathon app add hello-dcos-cli.json
```
Run the following command to verify that your service is running:
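As before, list your Marathon apps:
```bash
dcos marathon app list
```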
1. In the Services tab of the DC/OS web interface, click the name of your service, then choose one of the task instances. Click Logs, then toggle to the Output (stdout) view to see the output of the service.
Prerequisite:
This tutorial assumes that you have a working Jenkins installation and permission to
launch applications on Marathon. Jenkins for DC/OS must be installed as described on
the Jenkins Quickstart page.
The required files for this tutorial are Dockerfile, marathon.json, and the site directory.
Copy those items to a new project and push to a new Git repository on the host of your
choice.
This tutorial uses Docker Hub to store the created image and requires account
information to perform this task.
Click the Jenkins service and then Open Service to access the Jenkins web interface.
The Job
We'll create a new Jenkins job that performs several operations with Docker Hub and then either updates or creates a Marathon application.
Create a new Freestyle job with a name that includes only lowercase letters and
hyphens. This name will be used later in the Docker image name and possibly as the
Marathon application ID.
SCM / Git
From the Example Project section above, fill in the Git repository URL with the newly
created Git repository. This must be accessible to Jenkins and may require adding
credentials to the Jenkins instance.
Build Triggers
Select the Poll SCM build trigger with a schedule of: */5 * * * *. This will check the Git
repository every five minutes for changes.
Build Steps
The Jenkins job performs these actions:
Build a new Docker image.
Publish the image to Docker Hub.
These steps can be performed by a single build step using the Docker Build and Publish
plugin, which is already included and ready for use. From the Add build step drop-down
list, select the Docker Build and Publish option.
The Repository Name is your Docker Hub username with /${JOB_NAME} attached to the
end (myusername/${JOB_NAME}); the Tag field should be ${GIT_COMMIT}.
Set the Registry credentials to the credentials for Docker Hub that were created above.
Marathon Deployment
Add a Marathon Deployment post-build action by selecting the Marathon Deployment
option from the Add post-build action drop-down.
The Marathon instance within DC/OS can be accessed using the URL
http://leader.mesos/service/marathon. Fill in the fields appropriately, using Jenkins
variables if desired. The Docker Image should be the same as the build step above
(myusername/${JOB_NAME}:${GIT_COMMIT}) to ensure the correct image is used.
How It Works
The Marathon Deployment post-build action reads the application definition file, by default marathon.json, contained within the project's Git repository. This is a JSON file and must contain a valid Marathon application definition.
The configurable fields in the post-build action will overwrite the content of matching
fields from the file. For example, setting the Application Id will replace the id field in the
file. In the configuration above, Docker Image is configured and will overwrite the image
field contained within the docker field.
The final JSON payload is sent to the configured Marathon instance and the application
is updated or created.
Save
Save the job configuration.
Build It
Click Build Now and let the job build.
Deployment
Upon a successful run in Jenkins, the application will begin deployment on DC/OS. You
can visit the DC/OS web interface to monitor progress.
When the Status has changed to Running, the deployment is complete and you can visit
the website.
Commit the new post to Git. Shortly after the new commit lands on the master branch,
Jenkins will see the change and redeploy to Marathon.
This tutorial illustrates how labels can be defined using the DC/OS web interface and the
Marathon HTTP API, and how information pertaining to applications and jobs that are
running can be queried based on label value criteria.
When you deploy applications, containers, or jobs in a DC/OS cluster, you can associate
a tag or label with your deployed components in order to track and report usage of the
cluster by those components. For example, you may want to assign a cost center
identifier or a customer number to a Mesos application and produce a summary report at
the end of the month with usage metrics such as the amount of CPU and memory
allocated to the applications by cost center or customer.
Create or edit your application definition file:
```bash
vi myapp.json
```
Then add the app:
```bash
dcos marathon app add <myapp>.json
```
Similarly, create or edit a job definition file with labels:
```bash
vi myjob.json
```
```json
{
  "id": "my-job",
  "description": "A job that sleeps",
  "labels": {
    "department": "marketing"
  },
  "run": {
    "cmd": "sleep 1000",
    "cpus": 0.01,
    "mem": 32,
    "disk": 0
  }
}
```
Then add the job:
```bash
dcos job add <myjob>.json
```
You can also use the Marathon HTTP API from the DC/OS CLI to query the running
applications based on the label value criteria.
The code snippet below shows an HTTP request issued to the Marathon HTTP API. The
curl program is used in this example to submit the HTTP GET request, but you can use
any program that is able to send HTTP GET/PUT/DELETE requests. You can see the
HTTP end-point is https://52.88.210.228/marathon/v2/apps and the parameters sent
along with the HTTP request include the label criteria ?label=COST_CENTER==0001:
```bash
curl --insecure \
  https://52.88.210.228/marathon/v2/apps?label=COST_CENTER==0001 \
  | python -m json.tool | more
```
In the example above, the response you receive will include only the applications that
have a label COST_CENTER defined with a value of 0001. The resource metrics are also
included, such as the number of CPU shares and the amount of memory allocated. At the
bottom of the response, you can see the date/time this application was deployed, which
can be used to compute the uptime for billing or charge-back purposes.
GUI
ENTERPRISE DC/OS Updated: April 17, 2017
The DC/OS web interface provides a rich graphical view of your DC/OS cluster. With the
web interface you can view the current state of your entire cluster and DC/OS services.
The web interface is installed as a part of your DC/OS installation.
Additionally, there is a User Menu on the upper-left side of the web interface that includes
links for documentation, CLI installation, and user sign out.
Dashboard
The dashboard is the home page of the DC/OS web interface and provides an overview
of your DC/OS cluster.
From the dashboard you can easily monitor the health of your cluster.
The CPU Allocation panel provides a graph of the current percentage of available general
compute units that are being used by your cluster.
The Memory Allocation panel provides a graph of the current percentage of available
memory that is being used by your cluster.
The Task Failure Rate panel provides a graph of the current percentage of tasks that
are failing in your cluster.
The Services Health panel provides an overview of the health of your services. Each
service provides a healthcheck, run at intervals. This indicator shows the current
status according to that healthcheck. A maximum of 5 services are displayed, sorted
by priority of the most unhealthy. You can click the View all Services button for
detailed information and a complete list of your services.
The Tasks panel provides the current number of tasks that are staged and running.
Services
The Services tab provides a full featured interface to the native DC/OS Marathon
instance.
You can click the Deployments tab to view all active Marathon deployments.
Universe
The Universe tab shows all of the available DC/OS services. You can install packages
from the DC/OS Universe with a single click. The packages can be installed with defaults
or customized directly in the web interface.
Nodes
The Nodes tab provides a comprehensive view of all of the nodes that are used across
your cluster. You can view a graph that shows the allocation percentage rate for CPU,
memory, or disk.
By default all of your nodes are displayed in List view, sorted by hostname. You can filter
nodes by service type or hostname. You can also sort the nodes by number of tasks or
percentage of CPU, memory, or disk space allocated.
You can switch to Grid view to see a donut percentage visualization.
Clicking on a node opens the Nodes side panel, which provides CPU, memory, and disk usage graphs and lists all tasks on the node. Use the dropdown or a custom filter to sort tasks and click on details for more information. Click on a task listed on the Nodes side panel to see detailed information about the task's CPU, memory, and disk usage and the task's files and directory tree.
Networking
The Networking tab provides a comprehensive view of the health of your VIPs. For more
information, see the documentation.
Security
The Security tab provides secret and certificates management. For more information, see
the secrets and certificates documentation.
System Overview
View the cluster details from the System Overview tab.
Components
View the system health of your DC/OS components from the Components tab.
Settings
Manage your DC/OS package repositories, secrets stores, LDAP directories, and identity
providers from the Settings tab.
Organization
Manage user access from the Organization tab.
CLI
Updated: April 17, 2017
You can use the DC/OS command-line interface (CLI) to manage your cluster nodes,
install DC/OS packages, inspect the cluster state, and administer the DC/OS service
subcommands. You can install the CLI from the DC/OS web interface.
To list available commands, either run dcos with no parameters or run dcos help:
```bash
dcos

Command line utility for the Mesosphere Datacenter Operating System (DC/OS).
The Mesosphere DC/OS is a distributed operating system built around Apache Mesos.
This utility provides tools for easy management of a DC/OS installation.

Available DC/OS commands:

    auth        Authenticate to DCOS cluster
    config      Get and set DC/OS CLI configuration properties
    help        Display command line usage information
    marathon    Deploy and manage applications on the DC/OS
    node        Manage DC/OS nodes
    package     Install and manage DC/OS packages
    service     Manage DC/OS services
    task        Manage DC/OS tasks

Get detailed command description with 'dcos <command> --help'.
```
Environment Variables
The DC/OS CLI supports several environment variables that you can set dynamically. For easy reference, they are described below.
DCOS_CONFIG Set the path to the DC/OS configuration file. By default, this variable is set to DCOS_CONFIG=/<home-directory>/.dcos/dcos.toml. For example, if you moved your DC/OS configuration file to /home/jdoe/config/, you can specify this command:
```bash
export DCOS_CONFIG=/home/jdoe/config/dcos.toml
```
DCOS_SSL_VERIFY Indicates whether to verify SSL certificates for HTTPS (true or false), or provides the path to the SSL certificates to use. By default, this variable is set to true. This is equivalent to setting the core.ssl_verify option in the DC/OS configuration file. For example, to disable certificate verification:
```bash
export DCOS_SSL_VERIFY=false
```
DCOS_LOG_LEVEL Prints log messages to stderr at or above the level indicated. This is equivalent to the --log-level command-line option. The severity levels include:
debug Prints all messages to stderr, including informational, warning, error, and critical.
```bash
export DCOS_LOG_LEVEL=warning
```
DCOS_DEBUG Indicates whether to print additional debug messages to stdout. For example:
```bash
export DCOS_DEBUG=true
```
Configuration Files
By default, the DC/OS command line stores its configuration files in a directory called
~/.dcos within your HOME directory. However, you can specify a different location by
using the DCOS_CONFIG environment variable.
The configuration settings are stored in the dcos.toml file. You can modify these settings
with the dcos config command.
dcos_url The public master IP of your DC/OS installation. This is set by default during installation. For example:
email Your email address. This is set by default during installation. For example, to reset
your email address:
reporting Indicate whether to report usage events to Mesosphere. By default this is set to
True. For example, to set to false:
ssl_verify Indicates whether to verify SSL certs for HTTPS or path to certs. By default this
is set to False. For example, to set to true:
timeout Request timeout in seconds, with a minimum value of 1 second. By default this is
set to 5 seconds. For example, to set to 3 seconds:
token The OAuth access token. For example, to change the OAuth token:
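You can read and change any of these settings with the dcos config command; a minimal sketch (the core.* property names mirror the [core] section of dcos.toml, and the values are illustrative):

```bash
# Point the CLI at your cluster's master
dcos config set core.dcos_url https://<master-public-ip>

# Lower the request timeout from the 5-second default to 3 seconds
dcos config set core.timeout 3

# Stop reporting usage events to Mesosphere
dcos config set core.reporting false

# Inspect a single setting
dcos config show core.dcos_url
```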
Security
ENTERPRISE DC/OS Updated: April 17, 2017
External: DC/OS stores only the user's ID or user name, along with other DC/OS-specific information, such as permissions and group membership. DC/OS never receives or stores the passwords of external users. Instead, it delegates the verification of the user's credentials to one of the following: LDAP directory, SAML, or OpenID Connect.
All users must have a unique identifier, i.e., a user ID or user name. Because DC/OS needs to pass the user's name or ID in URLs, it cannot contain any spaces or commas. Only the following characters are supported: lowercase alphabet, uppercase alphabet, numbers, @, ., \, _, and -.
Enterprise DC/OS also allows you to create groups of users and import groups of users
from LDAP. Groups can make it easier to manage permissions. Instead of assigning
permissions to each user account individually, you can assign the permissions to an
entire group of users at once.
Identity provider-based authentication
ENTERPRISE DC/OS Updated: April 17, 2017
When a user attempts to log in from the DC/OS web interface, they are presented with a list of the third-party identity providers that you have configured. They can click the one that they have an account with for single sign-on (SSO).
Users logging in from the DC/OS CLI can discover the names of the configured identity providers with dcos auth list-providers. They can then log in through a provider with dcos auth login --provider=<provider-name> --username=<user-email> --password=<secret-password>.
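For example, from a terminal (the provider name and credentials are placeholders):

```bash
# List the identity providers configured for this cluster
dcos auth list-providers

# Log in through one of them
dcos auth login --provider=<provider-name> --username=<user-email> --password=<secret-password>
```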
Enterprise DC/OS supports two types of identity provider-based authentication methods:
Security Assertion Markup Language (SAML) and OpenID Connect (OIDC):
Adding a SAML Identity Provider
MESOSPHERE DOCUMENTATION
DOCUMENTATION FOR VERSION 1.9
Storage
Updated: April 17, 2017
Warning: This will terminate any running tasks or services on the node.
Connect to an agent in the cluster with SSH.
On a [public](/1.9/overview/concepts/#public) agent:
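A sketch of the agent shutdown, assuming the stock systemd unit name for public agents (private agents use dcos-mesos-slave instead):

```bash
sudo sh -c 'systemctl kill -s SIGUSR1 dcos-mesos-slave-public && systemctl stop dcos-mesos-slave-public'
```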
Clear agent state.
Remove Volume Mount Discovery resource state with this command:
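A sketch, assuming the resource state file location used by stock DC/OS installs:

```bash
sudo rm -f /var/lib/dcos/mesos-resources
```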
```bash
sudo mkdir -p /dcos/volume0
sudo dd if=/dev/zero of=/root/volume0.img bs=1M count=200
sudo losetup /dev/loop0 /root/volume0.img
sudo mkfs -t ext4 /dev/loop0
echo "/root/volume0.img /dcos/volume0 auto loop 0 2" | sudo tee -a /etc/fstab
```
Reboot.
```bash
sudo reboot
```
After the agent comes back, verify that the volume was mounted and registered as a Mesos disk resource:

```bash
journalctl -b | grep '/dcos/volume0'
```

```
May 05 19:18:40 dcos-agent-public-01234567000001 systemd[1]: Mounting /dcos/volume0...
May 05 19:18:42 dcos-agent-public-01234567000001 systemd[1]: Mounted /dcos/volume0.
May 05 19:18:46 dcos-agent-public-01234567000001 make_disk_resources.py[888]: Found matching mounts : [('/dcos/volume0', 74)]
May 05 19:18:46 dcos-agent-public-01234567000001 make_disk_resources.py[888]: Generated disk resources map: [{'name': 'disk', 'type': 'SCALAR', 'disk': {'source': {'mount': {'root': '/dcos/volume0'}, 'type': 'MOUNT'}}, 'role': '*', 'scalar': {'value': 74}}, {'name': 'disk', 'type': 'SCALAR', 'role': '*', 'scalar': {'value': 47540}}]
May 05 19:18:58 dcos-agent-public-01234567000001 mesos-slave[1891]: " --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --resources="[{"name": "ports", "type": "RANGES", "ranges": {"range": [{"end": 21, "begin": 1}, {"end": 5050, "begin": 23}, {"end": 32000, "begin": 5052}]}}, {"name": "disk", "type": "SCALAR", "disk": {"source": {"mount": {"root": "/dcos/volume0"}, "type": "MOUNT"}}, "role": "*", "scalar": {"value": 74}}, {"name": "disk", "type": "SCALAR", "role": "*", "scalar": {"value": 47540}}]" --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" --slave_subsystems="cpu,memory" --strict="true" --switch_user="true" --systemd_enable_support="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos/slave"
```
Azure: Introduction to Microsoft Azure storage (see Blob Storage section on Page blobs)
Best Practices
Disk Mount Resources are primarily for stateful services like Kafka and Cassandra which
can benefit from having dedicated storage available throughout the cluster. Any service
that utilizes a Disk Mount Resource has exclusive access to the reserved resource.
However, it is still important to consider the performance and reliability requirements for
the service. The performance of a Disk Mount Resource is based on the characteristic of
the underlying storage and DC/OS does not provide any data replication services.
Consider the following:
Use Disk Mount Resources with stateful services that have strict storage requirements.
Carefully consider the filesystem type, storage media (network attached, SSD, etc.), and volume characteristics (RAID levels, sizing, etc.) based on the storage needs and requirements of the stateful service.
Label Mesos agents using a Mesos attribute that reflects the characteristics of the agent's Disk Mounts, e.g. IOPS200, RAID1, etc.
Associate stateful services with storage agents using Mesos attribute constraints.
Consider isolating demanding storage services to dedicated storage agents, since the filesystem page cache is a host-level shared resource.
Ensure all services using Disk Mount Resources are designed to handle the permanent loss of one or more Disk Mount Resources. Services are still responsible for managing data replication and retention, graceful recovery from failed agents, and backups of critical service state.
External Volumes
Use external volumes when fault-tolerance is crucial for your app. If a host fails, the
native Marathon instance reschedules your app on another host, along with its
associated data, without user intervention. External volumes also typically offer a larger
amount of storage.
Marathon applications normally lose their state when they terminate and are relaunched. In some contexts, for instance, if your application uses MySQL, you'll want your application to preserve its state. You can use an external storage service, such as Amazon's Elastic Block Store (EBS), to create a persistent volume that follows your application instance.
This example configures REX-Ray to use Amazon's EBS for storage and IAM for authorization.
If your cluster is hosted on Amazon Web Services and REX-Ray is configured to use IAM, assign an IAM role to your agent nodes with the following policy:
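A sketch of such a policy, covering the EC2 volume and snapshot actions the REX-Ray EBS driver typically needs; confirm the exact action list against the REX-Ray documentation:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:AttachVolume",
        "ec2:CreateVolume",
        "ec2:CreateSnapshot",
        "ec2:CreateTags",
        "ec2:DeleteVolume",
        "ec2:DeleteSnapshot",
        "ec2:DescribeAvailabilityZones",
        "ec2:DescribeInstances",
        "ec2:DescribeVolumes",
        "ec2:DescribeVolumeAttribute",
        "ec2:DescribeVolumeStatus",
        "ec2:DescribeSnapshots",
        "ec2:CopySnapshot",
        "ec2:DescribeSnapshotAttribute",
        "ec2:DetachVolume",
        "ec2:ModifySnapshotAttribute",
        "ec2:ModifyVolumeAttribute",
        "ec2:DescribeTags"
      ],
      "Resource": "*"
    }
  ]
}
```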
If you scale your app down to 0 instances, the volume is detached from the agent where it was mounted, but it is not deleted. If you scale your app up again, the data that was associated with it is still available.
* The size of the volume must be specified in GiB.
* name is the name that your volume driver uses to look up your volume. When your task is staged on an agent, the volume driver queries the storage service for a volume with this name. If one does not exist, it is created implicitly. Otherwise, the existing volume is reused.
Below is a sample app definition that uses a Docker container and specifies an external
volume:
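A minimal sketch of such a definition, assuming the REX-Ray Docker volume driver via the dvdi provider (the app ID, image, and volume name are placeholders):

```json
{
  "id": "/test-external-volume",
  "instances": 1,
  "cpus": 0.1,
  "mem": 32,
  "cmd": "date >> /data/test/clock; tail -f /data/test/clock",
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "alpine:3.1",
      "network": "HOST"
    },
    "volumes": [
      {
        "containerPath": "/data/test",
        "mode": "RW",
        "external": {
          "name": "my-test-volume",
          "provider": "dvdi",
          "options": { "dvdi/driver": "rexray" }
        }
      }
    ]
  },
  "upgradeStrategy": {
    "minimumHealthCapacity": 0,
    "maximumOverCapacity": 0
  }
}
```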
Important: Refer to the REX-Ray documentation to learn which versions of Docker are
compatible with the REX-Ray volume
driver.
Click Volumes and enter your Volume Name and Container Path.
Click Deploy.
Implicit Volumes
The default implicit volume size is 16 GiB. If you are using the Mesos containerizer, you
can modify this default for a particular volume by setting volumes[x].external.size. You
cannot modify this default for a particular volume if you are using the Docker
containerizer. For both the Mesos and Docker containerizers, however, you can modify
the default size for all implicit volumes by modifying the REX-Ray configuration.
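For example, under the Mesos containerizer a volume's size (in GiB) can be pinned in the volume declaration itself; a sketch with illustrative values:

```json
"volumes": [
  {
    "containerPath": "data",
    "mode": "RW",
    "external": {
      "name": "my-implicit-volume",
      "provider": "dvdi",
      "size": 32,
      "options": { "dvdi/driver": "rexray" }
    }
  }
]
```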
Potential Pitfalls
You can only assign one task per volume. Your storage provider might have other limitations.
The volumes you create are not automatically cleaned up. If you delete your cluster, you must go to your storage provider and delete the volumes you no longer need. If you're using EBS, find them by searching by the container.volumes.external.name that you set in your Marathon app definition. This name corresponds to an EBS volume Name tag.
Volumes are namespaced by their storage provider. If you're using EBS, volumes created on the same AWS account share a namespace. Choose unique volume names to avoid conflicts.
If you are using Docker, you must use a compatible Docker version. Refer to the REX-
Ray documentation to learn which versions of Docker are compatible with the REX-
Ray volume driver.
If you are using Amazon's EBS, it is possible to create clusters in different availability zones (AZs). If you create a cluster with an external volume in one AZ and destroy it, a new cluster may not have access to that external volume because it could be in a different AZ.
Launch time might increase for applications that create volumes implicitly. The amount of the increase depends on several factors, including the size and type of the volume. Your storage provider's method of handling volumes can also influence launch time for implicitly created volumes.
For troubleshooting external volumes, consult the agent or system logs. If you are
using REX-Ray on DC/OS, you can also consult the systemd journal.
Local Persistent Volumes
When you specify a local volume or volumes, tasks and their associated data are
pinned to the node they are first launched on and will be relaunched on that node if they
terminate. The resources the application requires are also reserved. Marathon will
implicitly reserve an appropriate amount of disk space (as declared in the volume via
persistent.size) in addition to the sandbox disk size you specify as part of your
application definition.
You don't need constraints to pin a task to a particular agent where its data resides.
You can still use constraints to specify distribution logic.
Marathon lets you locate and destroy an unused persistent volume if you don't need it anymore.
Configuration options
Configure a persistent volume with the following options:
containerPath: The path where your application will read and write data. This must be a single-level path relative to the container; it cannot contain a forward slash (/). For example, "data" is allowed, but "/data", "/var/data", and "var/data" are not. If your application requires an absolute path, or a relative path with slashes, use this configuration.
mode: The access mode of the volume. Currently, "RW" is the only possible value and will
let your application read from and write to the volume.
You also need to set the residency field in order to tell Marathon to set up a stateful application. Currently, the only valid option for this is:
"residency": { "taskLostBehavior": "WAIT_FOREVER" }
The second volume is a persistent volume with a containerPath that matches the
hostPath of the first volume.
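A sketch of that two-volume arrangement for an app that needs an absolute path (the image, paths, and size are illustrative):

```json
"container": {
  "type": "DOCKER",
  "docker": {
    "image": "mysql",
    "network": "BRIDGE"
  },
  "volumes": [
    {
      "containerPath": "/var/lib/mysql",
      "hostPath": "mysqldata",
      "mode": "RW"
    },
    {
      "containerPath": "mysqldata",
      "mode": "RW",
      "persistent": { "size": 1000 }
    }
  ]
}
```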
For a complete example, see the Running stateful MySQL on Marathon section.
Choose the size of the volume or volumes you will use. Be sure that you choose a volume size
that will fit the needs of your application; you will not be able to modify this size after you
launch your application.
Specify the container path from which your application will read and write data. The container
path must be non-nested and cannot contain slashes e.g. data, but not ../../../etc/opt or
/user/data/. If your application requires such a container path, use this configuration.
Click Create.
Notes:
If your app is destroyed, any associated volumes and reserved resources will also be
deleted.
Mesos will currently not remove the data but might do so in the future.
Note: For a stateful application, Marathon will never start more instances than specified
in the UpgradeStrategy, and will kill old instances rather than create new ones during an
upgrade or restart.
In contrast to static reservations, dynamic reservations are created at runtime for a given
role and will associate resources with a combination of frameworkId and taskId using
reservation labels. This allows Marathon to restart a stateful task after it has terminated
for some reason, since the associated resources will not be offered to frameworks that
are not configured to use this role. Consult non-unique roles for more information.
Mesos creates persistent volumes to hold your applications stateful data. Because
persistent volumes are local to an agent, the stateful task using this data will be pinned to
the agent it was initially launched on, and will be relaunched on this node whenever
needed. You do not need to specify any constraints for this to work: when Marathon
needs to launch a task, it will accept a matching Mesos offer, dynamically reserve the
resources required for the task, create persistent volumes, and make sure the task is
always restarted using these reserved resources so that it can access the existing data.
When a task that used persistent volumes has terminated, its metadata will be kept. This
metadata will be used to launch a replacement task when needed.
For example, if you scale down from 5 to 3 instances, you will see 2 tasks in the Waiting
state along with the information about the persistent volumes the tasks were using as
well as about the agents on which they are placed. Marathon will not unreserve those
resources and will not destroy the volumes. When you scale up again, Marathon will
attempt to launch tasks that use those existing reservations and volumes as soon as it
gets a Mesos offer containing the labeled resources. Marathon will only schedule
unreserve/destroy operations when:
the application is deleted (in which case volumes of all its tasks are destroyed, and all
reservations are deleted).
you explicitly delete one or more suspended tasks with a wipe=true flag.
If reserving resources or creating persistent volumes fails, the created task will time out after the configured task_reservation_timeout (default: 20 seconds) and a new reservation attempt will be made. If a task is LOST (because its agent is disconnected or crashed), the reservations and volumes will not time out, and you need to manually delete and wipe the task in order to let Marathon launch a new one.
Potential Pitfalls
Be aware of the following issues and limitations when using stateful applications in Marathon that make use of dynamic reservations and persistent volumes.
Resource requirements
Currently, the resource requirements of a stateful application cannot be changed. Your initial volume size, cpu usage, memory requirements, etc., cannot be changed once you've posted the AppDefinition.
If an agent re-registers with the cluster and offers its resources, Marathon is eventually
able to relaunch a task there. If a node does not re-register with the cluster, Marathon will
wait forever to receive expected offers, as its goal is to re-use the existing data. If the
agent is not expected to come back, you can manually delete the relevant tasks by
adding a wipe=true flag and Marathon will eventually launch a new task with a new
volume on another agent.
Disk consumption
As of Mesos 0.28, destroying a persistent volume will not clean up or destroy data. Mesos will delete metadata about the volume in question, but the data will remain on disk. To prevent disk consumption, you should manually remove data when you no longer need it.
Non-unique Roles
Both static and dynamic reservations in Mesos are bound to roles, not to frameworks or
framework instances. Marathon will add labels to claim that resources have been
reserved for a combination of frameworkId and taskId, as noted above. However, these
labels do not protect from misuse by other frameworks or old Marathon instances (prior
to 1.0). Every Mesos framework that registers for a given role will eventually receive
offers containing resources that have been reserved for that role.
However, if another framework does not respect the presence of labels and the
semantics as intended and uses them, Marathon is unable to reclaim these resources for
the initial purpose. We recommend never using the same role for different frameworks if
one of them uses dynamic reservations. Marathon instances in HA mode do not need to
have unique roles, though, because they use the same role by design.
Examples
Running stateful PostgreSQL on Marathon
A model app definition for PostgreSQL on Marathon would look like this. Note that we set the postgres data folder to pgdata, which is relative to the Mesos sandbox (as contained in the $MESOS_SANDBOX variable). This enables us to set up a persistent volume with a containerPath of pgdata. This path is not nested and is relative to the sandbox, as required:
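A sketch of such a definition, assuming the stock postgres Docker image (the password and volume size are placeholders):

```json
{
  "id": "/postgres",
  "cpus": 1,
  "instances": 1,
  "mem": 512,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "postgres:latest",
      "network": "HOST"
    },
    "volumes": [
      {
        "containerPath": "pgdata",
        "mode": "RW",
        "persistent": { "size": 100 }
      }
    ]
  },
  "env": {
    "POSTGRES_PASSWORD": "secret",
    "PGDATA": "pgdata"
  },
  "residency": { "taskLostBehavior": "WAIT_FOREVER" },
  "upgradeStrategy": {
    "maximumOverCapacity": 0,
    "minimumHealthCapacity": 0
  }
}
```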
Send an HTTP DELETE request to Marathon that includes the wipe=true flag.
To locate the agent, inspect the Marathon UI and check out the detached volumes on the
Volumes tab. Or, query the /v2/apps endpoint, which provides information about the host
and Mesos slaveId.
Note: A running task will show stagedAt, startedAt and version in addition to the
information provided above.
Delete the task with wipe=true, which will expunge the task information from the Marathon
internal repository and eventually destroy the volume and unreserve the resources previously
associated with the task:
```bash
http DELETE http://dcos/service/marathon/v2/apps/postgres/tasks/postgres.53ab8733-fd96-11e5-8e70-76a1c19f8c3d?wipe=true
```
The Status column tells you if your app instance is attached to the volume or not. The
app instance will read as detached if you have scaled down your application. Currently
the only Operation Type available is read/write (RW).
Click a volume to view the Volume Detail Page, where you can see information about the
individual volume.
Metrics
PREVIEW Updated: April 17, 2017
Quick Start
Use this guide to get started with the DC/OS metrics component. The metrics component
is natively integrated with DC/OS and no additional setup is required. Prerequisites: You
must...
Metrics API
Use the Metrics API to poll for data about your cluster, hosts, containers, and
applications. You can then pass this data to a third party service of your choice.
Metrics Reference
These metrics are automatically collected by DC/OS. Node Metrics Metric Description
cpu.cores Percentage of cores used. cpu.idle Percentage of CPUs idle. cpu.system
Percentage of s...
Quick Start
ENTERPRISE DC/OS PREVIEW Updated: April 17, 2017
Use this guide to get started with the DC/OS metrics component. The metrics component
is natively integrated with DC/OS and no additional setup is required.
Prerequisites:
You must have the DC/OS CLI installed and be logged in as a superuser via the dcos auth
login command.
Optional: the CLI JSON processor jq.
Optional: Deploy a sample Marathon app for use in this quick start guide. If you already have
tasks running on DC/OS, you can skip this setup step.
Create the following Marathon app definition and save as test-metrics.json.
{ "id": "/test-metrics", "cmd": "while true;do echo stdout;echo stderr >&2;sleep 1;done",
"cpus": 0.001, "instances": 1, "mem": 128 }
Obtain your DC/OS authentication token and copy for later use:
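A sketch, assuming the 1.9 CLI stores the token in the core.dcos_acs_token property:

```bash
dcos config show core.dcos_acs_token
```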
eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ1aWQiOiJib290c3RyYXB1c2VyIi...
SSH to the agent node that is running your app, where (--mesos-id=<mesos-id>) is the Mesos
ID of the node running your app.
dcos node ssh --master-proxy --mesos-id=<mesos-id>
Tip: To get the Mesos ID of the node that is running your app, run dcos task followed by
dcos node. For example:
Running dcos task shows that host 10.0.0.193 is running the Marathon task test-metrics.93fffc0c-fddf-11e6-9080-f60c51db292b.

```
dcos task
NAME          HOST        USER  STATE  ID
test-metrics  10.0.0.193  root  R      test-metrics.93fffc0c-fddf-11e6-9080-f60c51db292b
```
Running dcos node shows that host 10.0.0.193 has the Mesos ID 7749eada-4974-44f3-
aad9-42e2fc6aedaf-S1.
View metrics.
Metrics for all containers running on a host
To show all containers that are deployed on the agent node, run this command from
your agent node with your authentication token (<auth-token>) specified.
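A sketch of that query, assuming the agent's Admin Router serves the Metrics API on port 61001:

```bash
curl -H "Authorization: token=<auth-token>" http://localhost:61001/system/v1/metrics/v0/containers
```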
["121f82df-b0a0-424c-aa4b-81626fb2e369","87b10e5e-6d2e-499e-ae30-1692980e669a"]
The output will contain a datapoints array that contains information about container
resource allocation and utilization provided by Mesos. For example:
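An illustrative excerpt (the field values here are placeholders, not captured output):

```json
"datapoints": [
  {
    "name": "cpus.allocated",
    "value": 0.001,
    "unit": "",
    "timestamp": "2017-02-28T22:35:55.128237085Z"
  }
]
```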
The output will also contain an object named dimensions that contains metadata about
the cluster/node/app.
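Node-level metrics come from the node endpoint; a sketch, under the same port assumption as above:

```bash
curl -H "Authorization: token=<auth-token>" http://localhost:61001/system/v1/metrics/v0/node
```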
The output will contain a datapoints array about resource allocation and utilization.
For example:
The output will contain an object named dimensions that contains metadata about the
cluster and node. For example:
... "dimensions": { "mesos_id": "a29070cd-2583-4c1a-969a-3e07d77ee665-S0", "hostname":
"10.0.2.255" } ...
Metrics API
ENTERPRISE DC/OS PREVIEW Updated: April 17, 2017
Response format
The API supports JSON only. You will not need to send any JSON, but must indicate
Accept: application/json in the HTTP header, as shown below.
Accept: application/json
Base path
Append /system/v1/metrics/v0/ to the host name, as shown below.
https://<host-name-or-ip>/system/v1/metrics/v0/
{ "token":
"eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1aWQiOiJib290c3RyYXB1c2VyIiwiZXhwIjoxNDgyNjE1NDU2fQ
.j3_31keWvK15shfh_BII7w_10MgAj4ay700Rub5cfNHyIBrWOXbedxdKYZN6ILW9vLt3t5uCAExOOFWJkYcsI0sVFcM
1HSV6oIBvJ6UHAmS9XPqfZoGh0PIqXjE0kg0h0V5jjaeX15hk-LQkp7HXSJ-
V7d2dXdF6HZy3GgwFmg0Ayhbz3tf9OWMsXgvy_ikqZEKbmPpYO41VaBXCwWPmnP0PryTtwaNHvCJo90ra85vV85C02NE
dRHB7sqe4lKH_rnpz980UCmXdJrpO4eTEV7FsWGlFBuF5GAy7_kbAfi_1vY6b3ufSuwiuOKKunMpas9_NfDe7UysfPVH
lAxJJgg" }
Via the DC/OS CLI
When you log into the DC/OS CLI using dcos auth login, it stores the authentication
token value locally. You can reference this value as a variable in curl commands
(discussed in the next section).
Alternatively, you can use the following command to get the authentication token value.
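A sketch, under the same core.dcos_acs_token assumption as in the Quick Start:

```bash
dcos config show core.dcos_acs_token
```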
```
Authorization: token=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1aWQiOiJib290c3RyYXB1c2VyIiwiZXhwIjoxNDgyNjE1NDU2fQ.j3_31keWvK15shfh_BII7w_10MgAj4ay700Rub5cfNHyIBrWOXbedxdKYZN6ILW9vLt3t5uCAExOOFWJkYcsI0sVFcM1HSV6oIBvJ6UHAmS9XPqfZoGh0PIqXjE0kg0h0V5jjaeX15hk-LQkp7HXSJ-V7d2dXdF6HZy3GgwFmg0Ayhbz3tf9OWMsXgvy_ikqZEKbmPpYO41VaBXCwWPmnP0PryTtwaNHvCJo90ra85vV85C02NEdRHB7sqe4lKH_rnpz980UCmXdJrpO4eTEV7FsWGlFBuF5GAy7_kbAfi_1vY6b3ufSuwiuOKKunMpas9_NfDe7UysfPVHlAxJJgg
```
$ curl -H "Authorization:
token=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1aWQiOiJib290c3RyYXB1c2VyIiwiZXhwIjoxNDgyNjE1N
DU2fQ.j3_31keWvK15shfh_BII7w_10MgAj4ay700Rub5cfNHyIBrWOXbedxdKYZN6ILW9vLt3t5uCAExOOFWJkYcsI0
sVFcM1HSV6oIBvJ6UHAmS9XPqfZoGh0PIqXjE0kg0h0V5jjaeX15hk-LQkp7HXSJ-
V7d2dXdF6HZy3GgwFmg0Ayhbz3tf9OWMsXgvy_ikqZEKbmPpYO41VaBXCwWPmnP0PryTtwaNHvCJo90ra85vV85C02NE
dRHB7sqe4lKH_rnpz980UCmXdJrpO4eTEV7FsWGlFBuF5GAy7_kbAfi_1vY6b3ufSuwiuOKKunMpas9_NfDe7UysfPVH
lAxJJgg"
API reference
Logging
While the API returns informative error messages, you may also find it useful to check
the logs of the Metrics service. Refer to Service and Task Logging for instructions.
Metrics Reference
PREVIEW Updated: April 17, 2017
Node
Metrics

| Metric | Description |
|--------|-------------|
| cpu.cores | Percentage of cores used. |
| cpu.idle | Percentage of CPUs idle. |
| cpu.system | Percentage of system used. |
| cpu.total | Percentage of CPUs used. |
| cpu.user | Percentage of CPU used by the user. |
| cpu.wait | Percentage idle while waiting for an operation to complete. |
| load.1min | Load average for the past minute. |
| load.5min | Load average for the past 5 minutes. |
| load.15min | Load average for the past 15 minutes. |
| memory.buffers | Number of memory buffers. |
| memory.cached | Amount of cached memory. |
| memory.free | Amount of free memory in bytes. |
| memory.total | Total memory in bytes. |
| processes | Number of processes that are running. |
| swap.free | Amount of free swap space. |
| swap.total | Total swap space. |
| swap.used | Amount of swap space used. |
| uptime | The system reliability and load average. |
Filesystems

| Metric | Description |
|--------|-------------|
| filesystem.{{.Name}}.capacity.free | Amount of available capacity in bytes. |
| filesystem.{{.Name}}.capacity.total | Total capacity in bytes. |
| filesystem.{{.Name}}.capacity.used | Capacity used in bytes. |
| filesystem.{{.Name}}.inodes.free | Number of available inodes. |
| filesystem.{{.Name}}.inodes.total | Total number of inodes. |
| filesystem.{{.Name}}.inodes.used | Number of inodes used. |
Network interfaces

| Metric | Description |
|--------|-------------|
| network.{{.Name}}.in.bytes | Number of bytes downloaded. |
| network.{{.Name}}.in.dropped | Number of inbound packets dropped. |
| network.{{.Name}}.in.errors | Number of inbound errors. |
| network.{{.Name}}.in.packets | Number of packets downloaded. |
| network.{{.Name}}.out.bytes | Number of bytes uploaded. |
| network.{{.Name}}.out.dropped | Number of outbound packets dropped. |
| network.{{.Name}}.out.errors | Number of outbound errors. |
| network.{{.Name}}.out.packets | Number of packets uploaded. |
Container
The following per-container resource utilization metrics are collected.
Disk info

| Metric | Description |
|--------|-------------|
| disk_limit_bytes | Hard disk limit in bytes. |
| disk_used_bytes | Disk space used in bytes. |

Memory info

| Metric | Description |
|--------|-------------|
| mem_limit_bytes | Hard memory limit for a container. |
| mem_total_bytes | Total memory of a process in RAM (as opposed to in swap). |
Network info

| Metric | Description |
|--------|-------------|
| net_rx_bytes | Bytes received. |
| net_rx_dropped | Packets dropped on receive. |
| net_rx_errors | Errors reported on receive. |
| net_rx_packets | Packets received. |
| net_tx_bytes | Bytes sent. |
| net_tx_dropped | Packets dropped on send. |
| net_tx_errors | Errors reported on send. |
| net_tx_packets | Packets sent. |

Dimensions
Dimensions are metadata about the metrics. They are contained in a common message format and are broadcast to one or more metrics producers.
Deploying Jobs
PREVIEW Updated: April 17, 2017
You can create scheduled jobs in DC/OS without installing a separate service. Create
and administer jobs in the DC/OS web interface, the DC/OS CLI, or via an API.
Note: The Jobs functionality of DC/OS is provided by the DC/OS Jobs (Metronome)
component, an open source Mesos framework that comes pre-installed with DC/OS. You
may sometimes see the Jobs functionality referred to as Metronome in the logs, and the
service endpoint is service/metronome.
Functionality
You can create a job as a single command you include when you create the job, or you can point to a Docker image.
You can specify the schedule for your job in cron format. You can also set the time zone and starting deadline. An example job definition follows below.
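A minimal sketch of a scheduled job definition (the IDs, cron line, and resource values are illustrative):

```json
{
  "id": "nightly-sleep",
  "description": "A job that runs every night at 8pm",
  "run": {
    "cmd": "sleep 60",
    "cpus": 0.01,
    "mem": 32,
    "disk": 0
  },
  "schedules": [
    {
      "id": "nightly",
      "enabled": true,
      "cron": "0 20 * * *",
      "concurrencyPolicy": "ALLOW"
    }
  ]
}
```

Saved as nightly-sleep.json, it can be added with the Jobs CLI:

```bash
dcos job add nightly-sleep.json
```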
Installing Services
Installing a service using the CLI The general syntax for installing a service with the CLI
follows. dcos package install [--options=<config-file-name>.json] <servicename&...
Pods
Monitoring Services
You can monitor deployed DC/OS services from the CLI and web interface. Monitoring
Universe services CLI From the DC/OS CLI, enter the dcos service command. In this
example you can...
Updating a User-Created Service
You can easily view and update the configuration of a deployed app by using the dcos
marathon command. Note: The process for updating packages from the DC/OS Universe
is different....
Service Ports
Port configuration for applications in Marathon can be confusing and there is an
outstanding issue to redesign the ports API. This page attempts to explain more clearly
how they wo...
Exposing a Service
DC/OS agent nodes can be designated as public or private during installation. Public
agent nodes provide access from outside of the cluster via infrastructure networking to
your DC...
Deploying non-native Marathons
About deploying non-native Marathons Each service that Marathon launches uses the
same Mesos role that Marathon registered with for quotas and reservations. In addition,
any users ...
Marathon REST API
The Marathon API allows you to manage long-running containerized services (apps and
pods). The Marathon API is backed by the Marathon component, which runs on the
master nodes. One...
Enabling GPU Support
DC/OS supports allocating GPUs (Graphics Processing Units) to your long-running
DC/OS services. Adding GPUs to your service can dramatically accelerate big data
workloads. Learn mo...
Frequently Asked Questions
We've collected some questions we often encounter concerning the usage of DC/OS.
Have you got a new question you'd like to see? Use the Submit feedback button at the
bottom...
Installing Services
Updated: April 17, 2017
Use the optional --options flag to specify the name of the customized JSON file you
created in advanced configuration.
For example, you would use the following command to install Chronos with the default
parameters.
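A sketch (the options file name is illustrative):

```bash
# Install with default parameters
dcos package install chronos

# Install with a customized configuration
dcos package install --options=chronos-config.json chronos
```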
Universe tab
Navigate to the Universe > Packages page in the DC/OS GUI.
Services tab
Navigate to the Services tab in the DC/OS GUI.
Web GUI
Go to the Services tab and confirm that the service is running. For more information, see
the GUI documentation.
Tip: Some services from the Community Packages section of the Universe will not show up in the DC/OS service listing. For these, inspect the service's Marathon app in the Marathon GUI to verify that the service is running and healthy.
Pods
ENTERPRISE DC/OS Updated: April 17, 2017
Monitoring Services
Updated: April 17, 2017
You can monitor deployed DC/OS services from the CLI and web interface.
Monitoring Universe services
CLI
From the DC/OS CLI, enter the dcos service command. In this example you can see the
installed DC/OS services Chronos, HDFS, and Kafka.
```
dcos service
NAME     HOST            ACTIVE  TASKS  CPU   MEM     DISK  ID
chronos  <privatenode1>  True    0      0.0   0.0     0.0   <service-id1>
hdfs     <privatenode2>  True    1      0.35  1036.8  0.0   <service-id2>
kafka    <privatenode3>  True    0      0.0   0.0     0.0   <service-id3>
```
Web interface
See the monitoring documentation.
Monitoring user-created services
CLI
From the DC/OS CLI, enter the dcos task command:

```
dcos task
NAME                    HOST        USER  STATE  ID
cassandra               10.0.3.224  root  R      cassandra.36031a0f-feb4-11e6-b09b-3638c949fe6b
node-0                  10.0.3.224  root  R      node-0__0b165525-13f2-485b-a5f8-e00a9fabffd9
suzanne-simple-service  10.0.3.224  root  R      suzanne-simple-service.47359150-feb4-11e6-b09b-3638c949fe6b
```
Web interface
See the monitoring documentation.
Updating a User-Created Service
Updated: April 17, 2017
You can easily view and update the configuration of a deployed app by using the dcos
marathon command.
Note: The process for updating packages from the DC/OS Universe is different. For more
information, see the documentation.
A single element of the env field can be updated by specifying a JSON string in a
command argument.
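A sketch, using the SCHEDULER_DRIVER_PORT variable shown below and a placeholder app ID:

```bash
dcos marathon app update <app-id> env='{"SCHEDULER_DRIVER_PORT": "25501"}'
```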
Now, run the command below to see the result of your update:
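A sketch, assuming the jq JSON processor is installed:

```bash
dcos marathon app show <app-id> | jq .env > env_vars.json
```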
The file will contain the JSON for the env field:
{ "SCHEDULER_DRIVER_PORT": "25501", }
Now edit the env_vars.json file. Make the JSON a valid object by enclosing the file contents in { "env": ... } and add your update:
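For example (NEW_VAR is a placeholder for your addition):

```json
{
  "env": {
    "SCHEDULER_DRIVER_PORT": "25501",
    "NEW_VAR": "new_value"
  }
}
```

The edited file can then be fed back to Marathon on stdin:

```bash
dcos marathon app update <app-id> < env_vars.json
```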
Service Ports
Updated: April 17, 2017
You can use virtual addresses (VIPs) to make port management easier. VIPs simplify inter-app communication and implement a reliable service-oriented architecture. VIPs map traffic from a single virtual address to multiple IP addresses and ports.
Definitions
containerPort: A container port specifies a port within a container. This is only necessary
as part of a port mapping when using BRIDGE or USER mode networking with a Docker
container.
hostPort: A host port specifies a port on the host to bind to. When used with BRIDGE or
USER mode networking, you specify a port mapping from a host port to a container port. In
HOST networking, requested ports are host ports by default. Note that only host ports are
made available to a task through environment variables.
BRIDGE networking: used by Docker applications that specify BRIDGE mode networking.
In this mode, container ports (a port within the container) are mapped to host ports (a
port on the host machine). In this mode, applications bind to the specified ports within the
container and Docker networking binds to the specified ports on the host.
USER networking: used by Docker applications that specify USER mode networking. In
this mode, container ports (a port within the container) are mapped to host ports (a port
on the host machine). In this mode, applications bind to the specified ports within the
container and Docker networking binds to the specified ports on the host. USER network
mode is expected to be useful when integrating with user-defined Docker networks. In
the Mesos world such networks are often made accessible via CNI plugins used in
concert with a Mesos CNI network isolator.
portMapping: In Docker BRIDGE mode, a port mapping is necessary for every port that should be reachable from outside of your container. A port mapping is a tuple containing a host port, container port, service port and protocol. Multiple port mappings may be specified for a Marathon application; an unspecified hostPort defaults to 0 (meaning that Marathon will assign one at random). In Docker USER mode the semantic for hostPort changes slightly: hostPort is not required for USER mode and if left unspecified, Marathon WILL NOT automatically allocate one at random. This allows containers to be deployed on USER networks that include containerPort and discovery information, but do NOT expose those ports on the host network (and by implication would not consume host port resources).
ports: The ports array is used to define ports that should be considered as part of a
resource offer in HOST mode. It is necessary only if no port mappings are specified. Only
one of ports and portDefinitions should be defined for an application.
portDefinitions: The portDefinitions array is used to define ports that should be
considered as part of a resource offer. It is necessary only to define this array if you are
using HOST networking and no port mappings are specified. This array is meant to replace
the ports array, and makes it possible to specify a port name, protocol and labels. Only
one of ports and portDefinitions should be defined for an application.
protocol: Protocol specifies the internet protocol to use for a port (e.g. tcp, udp or udp,tcp
for both). This is only necessary as part of a port mapping when using BRIDGE or USER
mode networking with a Docker container.
servicePort: When you create a new application in Marathon (either through the REST
API or the front end), you may assign one or more service ports to it. You can specify all
valid port numbers as service ports, or you can use 0 to indicate that Marathon should allocate free service ports to the app automatically. If you do choose your own service port, you must ensure that it is unique across all of your applications.
Environment Variables
Each host port value is exposed to the running application instance via environment
variables $PORT0, $PORT1, etc. Each Marathon application is given a single port by default,
so $PORT0 is always available. These variables are available inside a Docker container
being run by Marathon too. Additionally, if the port is named NAME, it will also be
accessible via the environment variable, $PORT_NAME.
When using BRIDGE or USER mode networking, be sure to bind your application to the
containerPorts you have specified in your portMappings. However, if you have set
containerPort to 0 then this will be the same as hostPort and you can use the $PORT
environment variables.
Example Configuration
Host Mode
Host mode networking is the default networking mode for Docker containers and the only networking mode for non-Docker applications. Note that it is not necessary to EXPOSE ports in your Dockerfile.
"ports": [ 0, 0, 0 ],
In this example, we specify three randomly assigned host ports which would then be available to our command via the environment variables $PORT0, $PORT1 and $PORT2. Marathon will also randomly assign three service ports in addition to these three host ports.
Or:
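A sketch, inferred from the service-port values described below:

"ports": [ 2001, 2002, 3000 ],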
In this case, host ports $PORT0, $PORT1 and $PORT2 remain randomly assigned. However, the three service ports for this application are now 2001, 2002 and 3000.
In this example, as with the previous one, it is necessary to use a service discovery
solution such as HAProxy to proxy requests from service ports to host ports.
If you want the application's service ports to be equal to its host ports, you can set requirePorts to true (requirePorts is false by default). This will tell Marathon to only schedule this application on agents which have these ports available:
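A sketch of that configuration:

"ports": [ 2001, 2002, 3000 ],
"requirePorts": true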
The service and host ports (including the environment variables $PORT0, $PORT1, and $PORT2) are both now 2001, 2002 and 3000.
This property is useful if you don't use a service discovery solution to proxy requests from service ports to host ports.
Defining the portDefinitions array allows you to specify a protocol, a name and labels for each port. When starting new tasks, Marathon will pass this metadata to Mesos. Mesos will expose this information in the discovery field of the task. Custom network discovery solutions can consume this field.
Example port definition requesting a dynamic tcp port named http with the label VIP_0
set to 10.0.0.1:80:
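A sketch matching that description:

```json
"portDefinitions": [
  {
    "port": 0,
    "protocol": "tcp",
    "name": "http",
    "labels": { "VIP_0": "10.0.0.1:80" }
  }
]
```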
The port field is mandatory. The protocol, name and labels fields are optional. A port definition in which only the port field is set is equivalent to an element of the ports array.
Note that the ports array and the portDefinitions array should not be specified together, unless all their elements are equivalent.
Referencing Ports
You can reference host ports in the Dockerfile for our fictitious app as follows:
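A sketch (my-app is a hypothetical binary):

```
CMD ./my-app --http-port=$PORT0 --https-port=$PORT1 --monitoring-port=$PORT2
```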
Alternatively, if you aren't using Docker, or have specified a cmd in your Marathon application definition, it works in the same way:
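Again with the hypothetical my-app binary:

```json
"cmd": "./my-app --http-port=$PORT0 --https-port=$PORT1 --monitoring-port=$PORT2"
```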
Bridge Mode
Bridge mode networking allows you to map host ports to ports inside your container and is only applicable to Docker containers. It is particularly useful if you are using a container image with fixed port assignments that you can't modify. Note that it is not necessary to EXPOSE ports in your Dockerfile.
Port mappings are specified inside the portMappings object for a container:
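A sketch with randomly assigned container and host ports (the image name is a placeholder):

```json
"container": {
  "type": "DOCKER",
  "docker": {
    "image": "my-image:1.0",
    "network": "BRIDGE",
    "portMappings": [
      { "containerPort": 0, "hostPort": 0 },
      { "containerPort": 0, "hostPort": 0 }
    ]
  }
}
```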
Alternatively, if our process running in the container had fixed ports, we might do
something like the following:
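A sketch using the fixed container ports described below:

```json
"portMappings": [
  { "containerPort": 80, "hostPort": 0 },
  { "containerPort": 443, "hostPort": 0 },
  { "containerPort": 4000, "hostPort": 0 }
]
```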
In this case, Marathon will randomly allocate host ports and map these to ports 80, 443 and 4000 respectively. It's important to note that the $PORT variables refer to the host ports. In this case, $PORT0 will be set to the value of hostPort for the first mapping, and so on.
Specifying Protocol
You can also specify the protocol for these port mappings. The default is tcp:
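For example (the port values are illustrative):

```json
"portMappings": [
  { "containerPort": 80, "hostPort": 0, "protocol": "tcp" },
  { "containerPort": 443, "hostPort": 0, "protocol": "tcp" },
  { "containerPort": 4000, "hostPort": 0, "protocol": "udp" }
]
```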
By default, Marathon creates service ports for each of these ports and assigns them random values. Service ports are used by service discovery solutions, and it is often desirable to set these to well-known values. You can do this by setting a servicePort for each mapping:
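A sketch assigning well-known service ports to the same mappings:

```json
"portMappings": [
  { "containerPort": 80, "hostPort": 0, "servicePort": 2001 },
  { "containerPort": 443, "hostPort": 0, "servicePort": 2002 },
  { "containerPort": 4000, "hostPort": 0, "servicePort": 3000 }
]
```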
Referencing Ports
If you set containerPort to 0, then you should specify ports in the Dockerfile for our
fictitious app as follows:
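As in host mode, the $PORT variables refer to host ports (my-app remains a hypothetical binary):

```
CMD ./my-app --http-port=$PORT0 --https-port=$PORT1 --monitoring-port=$PORT2
```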
However, if you've specified containerPort values, you simply use the same values in the Dockerfile:
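For example:

```
CMD ./my-app --http-port=80 --https-port=443 --monitoring-port=4000
```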
Alternatively, if you specify a cmd in your Marathon application definition, it works in the same way as before:
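For example:

```json
"cmd": "./my-app --http-port=80 --https-port=443 --monitoring-port=4000"
```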
Exposing a Service
Updated: April 17, 2017
DC/OS agent nodes can be designated as public or private during installation. Public
agent nodes provide access from outside of the cluster via infrastructure networking to
your DC/OS services. By default, services are launched on private agent nodes and are
not accessible from outside the cluster.
To launch a service on a public node, you must create a Marathon app definition with the
"acceptedResourceRoles":["slave_public"] parameter specified and configure an edge
load balancer and service discovery mechanism.
Prerequisites:
DC/OS is installed
For more information about the acceptedResourceRoles parameter, see the Marathon
REST API documentation.
Add your app to Marathon by using this command, where myApp.json is your app:
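```bash
dcos marathon app add myApp.json
```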
Tip: You can also add your app by using the Services tab of DC/OS GUI.
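To verify the deployment, list your Marathon apps:

```bash
dcos marathon app list
```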
```
ID      MEM  CPUS  TASKS  HEALTH  DEPLOYMENT  CONTAINER  CMD
/myApp  64   0.1   0/1    ---     scale       DOCKER     None
```
Tip: You can also view deployed apps by using the Services tab of DC/OS GUI.
All other users: You can use Marathon-LB, a rapid proxy and load balancer that is based on
HAProxy.
Go to your public agent to see the site running. For information about how to find your public
agent IP, see the documentation.
You should see the following message in your browser:
Next steps
Learn how to load balance your app on a public node using Marathon-LB.
In addition, any users of a given Marathon can run tasks under any Linux user that
Marathon can run tasks under.
To achieve finer-grained control over reservations, quotas and Linux user accounts, you
must deploy non-native instances of Marathon. The non-native instances of Marathon will
be launched by the native instance of Marathon.
While the native Marathon instance runs on the master nodes, the non-native Marathon
instances will run on the private agent nodes. You may need additional private agent
nodes to accommodate the increased resource demands.
Namespacing considerations
You can copy and paste the code snippets in this section as is and succeed in deploying
a single non-native Marathon instance. However, if you need to deploy more than one
non-native Marathon instance or desire more descriptive names, you will need to modify
the code snippets before issuing them.
We recommend a simple strategy for modifying the code snippets. Just replace each
instance of serv-group with the name of the service group that the non-native Marathon
will be deployed into.
In the procedures, we will use the following names:
momee-serv-group as the name of the service group
Let's imagine you have a service group called test, another called dev, and a third called prod. After replacing serv-group with the name of the actual service group as we recommend, you will end up with the following names.
momee-test as the name of the service group
momee-test-private-key.pem as the name of the file containing the private key of the
non-native Marathon service account
momee-test-public-key.pem as the name of the file containing the public key of the
non-native Marathon service account
By following this scheme, you will end up with unique yet descriptive names for each of
your non-native Marathon instances. These names will match with the various roles,
service accounts, secrets, and files associated with the non-native Marathon instance. In
addition, following this scheme will protect the service account secret from other non-
native Marathon instances and from the services that the non-native Marathon launches.
Linux user account considerations
The procedures that follow will result in a non-native Marathon instance that runs under
the nobody Linux user. Feel free to replace the nobody Linux user in the config.json and
in the code snippets with another user of your choice. However, you must ensure that the
Linux user account exists on each of your agent nodes before attempting to deploy.
Load and push the Marathon image up to your private Docker registry.
Provision each private agent with the credentials for the private Docker registry.
The Marathon API is backed by the Marathon component, which runs on the master
nodes.
One of the Marathon instances is elected as leader, while the rest are hot backups in
case of failure. All API requests must go through the Marathon leader. To enforce this,
Admin Router proxies requests from any master node to the Marathon leader.
Routes
Access to the Marathon API is proxied through the Admin Router on each master node
using the following route:
/service/marathon/
Resources
After creating a GPU-enabled DC/OS cluster, you can configure your service to use
GPUs.
We've collected some questions we often encounter concerning the usage of DC/OS.
Have you got a new question you'd like to see? Use the Submit feedback button at the bottom of this page to suggest it, or check out how you can also contribute the answer to it.
More info:
DEVELOPING DC/OS SERVICES
Disclaimer: This document provides the DC/OS Service requirements, but is not the
complete DC/OS service certification. For the complete DC/OS Service Specification,
send an email to partnerships@mesosphere.com.
By completing the requirements below, you can integrate with DC/OS and have your
service certified by Mesosphere.
Terminology
Universe
DC/OS Universe contains all services that have been certified by Mesosphere. For more
information on DC/OS Universe, see the GitHub Universe repository.
Framework
A Mesos framework is the combination of a Mesos scheduler and an optional custom
executor. A framework receives resource offers describing CPU, RAM, etc., and
allocates them for discrete tasks that can be launched on Mesos agent nodes.
Mesosphere-certified Mesos frameworks, called DCOS services, are packaged and
available from public GitHub package repositories. DCOS services include Mesosphere-
certified Mesos frameworks and other applications.
DC/OS Marathon
The native Marathon instance that is the init system for DCOS. It starts and monitors
DCOS applications and services.
State abstraction
Mesos provides an abstraction for accessing storage for schedulers for Java and C++
only. This is the preferred method to access ZooKeeper.
Service
01. The service MUST be installable without supplying a configuration.
Your service must be installable by using default values. The options.json must not be required for installation. There are cases where a service might require a license to work. Your service must provide a CLI option to pass the license information to the service to enable it.
If the service isn't running because it is missing license information, that fact MUST be logged through stdout or stderr.
For this to work, the metadata for your service must be registered in the Mesosphere
Universe package repository. The metadata format is defined in the Universe repository
README.
Your long-running app MAY use a Docker image retrieved by using a Docker registry or a
binary retrieved by using a CDN backed HTTP server.
Any components that are dynamically configured, for example the Mesos master or
ZooKeeper configuration, MUST be available as command line parameters or
environment variables to the service executable. This allows the parameters to be
passed to the scheduler during package installation.
08. All URIs used by the scheduler and executor MUST be specified in
config.json.
All URIs that are used by the service MUST be specified in the config.json file. Any URL that is accessed by the service must be overridable and specified in the config.json file, including:
URLs required in the marathon.json file
URLs that retrieve the executor (if not supplied by the scheduler)
URLs required by the executors, except for URLs that are for the scheduler; or a process
launched by the scheduler for retrieving artifacts or executors that are local to the
cluster.
All URLs that are used by the service must be passed in by using the command line or
provided as environment variables to the scheduler at startup time.
Description
Tags
All images
License information
Post-install notes that indicate documentation, tutorials, and how to get support
The output from these checks is used by the DC/OS web interface to display your service
health:
If ALL of your health checks pass, your service is marked in green as Healthy.
If ANY of your health checks fail, your service is marked in red as Sick. Your documentation
must provide troubleshooting information for resolving the issue.
If your Service has no tasks running in Marathon, your service is marked in yellow as Idle.
This state is normally temporary and occurs only when your service is launching.
Your app MAY set maxConsecutiveFailures=0 on any of your health checks to prevent
Marathon from terminating your app if the failure threshold of the health check is
reached.
13. Scheduler MUST distribute its own binaries for executor and tasks.
The scheduler MUST attempt to run executors/tasks with no external dependencies. If an
executor/task requires custom dependencies, the scheduler should bundle the
dependencies and configure the Mesos fetcher to download the dependencies from the
scheduler or run executors/tasks in a preconfigured Docker image.
Mesos can fetch binaries by using HTTP[S], FTP[S], HDFS, or Docker pull. Many
frameworks run an HTTP server in the scheduler that can distribute the binaries, or just
rely on pulling from a public or private Docker registry. Remember that some clusters do
not have access to the public internet.
URLs for downloads must be parameterized and externalized in the config.json file, with the exception of Docker images. The scheduler and executor MUST NOT use URLs without externalizing them and allowing them to be configurable. This requirement ensures that DC/OS supports on-prem datacenter environments which do not have access to the public internet.
ALL properties that are used in the marathon.json file that are not in a conditional block
must be defined as required.
CLI Specification
ENTERPRISE DC/OS Updated: April 17, 2017
This document is intended for a developer creating new DC/OS CLI commands.
The DC/OS CLI uses a single command, dcos. All functions are expressed as
subcommands, and are shown with the dcos help command.
The DC/OS CLI is open and extensible: anyone can create a new subcommand and
make it available for installation by end users.
For example, the Spark DC/OS Service provides CLI extensions for working with Spark.
When installed, you can type the following command to run Spark jobs in the datacenter
and query their status:
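A sketch, assuming the Spark subcommand's run and status verbs (the arguments are placeholders):

```bash
dcos spark run --submit-args='--class org.apache.spark.examples.SparkPi http://example.com/spark-examples.jar 30'
dcos spark status <submission-id>
```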
In the Hello World example, written in Python, you can create an executable of the
subcommand using pyinstaller.
Standard flags
You must assign a standard set of flags to each DC/OS CLI subcommand, described
below:
info
The --info flag shows a short, one-line description of the function of your subcommand.
This content is displayed when the user runs dcos help.
```
dcos help | grep spark
spark              Run and manage Spark jobs
```
version
The --version flag shows the version of the subcommand package. Notice that the
subcommand package is unrelated to the version of the Service running on DC/OS.
For example, Spark v1.2.1 might be installed on DC/OS, whereas the local spark DC/OS
CLI package might be at v0.1.0.
help and -h
The --help and -h flags both show the detailed usage for your subcommand.
```
dcos marathon --help

Deploy and manage applications on the DC/OS

Usage:
    dcos marathon --config-schema
    dcos marathon --info
    dcos marathon app add [<app-resource>]
    dcos marathon app list
    dcos marathon app remove [--force] <app-id>
    dcos marathon app restart [--force] <app-id>
    dcos marathon app show [--app-version=<app-version>] <app-id>
    dcos marathon app start [--force] <app-id> [<instances>]
    dcos marathon app stop [--force] <app-id>
    dcos marathon app update [--force] <app-id> [<properties>...]
    dcos marathon app version list [--max-count=<max-count>] <app-id>
    dcos marathon deployment list [<app-id>]
    dcos marathon deployment rollback <deployment-id>
    dcos marathon deployment stop <deployment-id>
    dcos marathon deployment watch [--max-count=<max-count>] [--interval=<interval>] <deployment-id>
    dcos marathon task list [<app-id>]
    dcos marathon task show <task-id>

Options:
    -h, --help                     Show this screen
    --info                         Show a short description of this subcommand
    --version                      Show version
    --force                        ...
    --app-version=<app-version>    ...
    --config-schema                ...
    --max-count=<max-count>        ...
    --interval=<interval>          ...

Positional arguments:
    <app-id>           The application id
    <app-resource>     ...
    <deployment-id>    The deployment id
    <instances>        The number of instances to start
    <properties>       ...
    <task-id>          The task id
```
config-schema
The DC/OS CLI validates configurations set with the dcos config set command, by
comparing them against a JSON Schema that you define.
When your Marathon CLI subcommand is passed the --config-schema flag, it MUST
output a JSON Schema document for its configuration.
Logging
The environment variable DCOS_LOG_LEVEL is set to the log level the user sets at the
command line.
The logging levels are described in Python's logging HOWTO: DEBUG, INFO, WARNING, ERROR and CRITICAL.
Packaging
To make your subcommand available to end users, you must:
Add a package entry to the Mesosphere Universe repository. See the Universe README for the
specification.
The package entry contains a file named resource.json that contains links to the
executable subcommands.
When the end user runs dcos package install spark --cli, the CLI downloads the executable subcommand linked in resource.json and installs it locally.
The same packaging format and repository is used for both DC/OS Services and CLI
subcommands.
This page covers general advice and information about creating a DC/OS package that can be published to the Mesosphere Universe. Consult the Publish a Package page of the Universe documentation for full details.
Every package in Universe must have a package.json file that specifies the highest-level
metadata about the package (comparable to a package.json in Node.js or setup.py in
Python).
Currently, a package can specify one of two values for .packagingVersion, either 2.0 or
3.0. The version declared will dictate which other files are required for the complete
package as well as the schemas all the files must adhere to.
The tags parameter is used for user searches (dcos package search <criteria>). Add tags
that distinguish the service in some way. Avoid the following terms: Mesos, Mesosphere,
DC/OS, and datacenter. For example, the unicorns service could have: "tags": ["rainbows",
"mythical"].
The preInstallNotes parameter gives the user information they'll need before starting the
installation process. For example, you could explain what the resource requirements are for
the service: "preInstallNotes": "Unicorns take 7 nodes with 1 core each and 1TB
of ram."
The postInstallNotes parameter gives the user information they'll need after the
installation. Focus on providing a documentation URL, a tutorial, or both. For example:
"postInstallNotes": "Thank you for installing the Unicorn
service.\n\n\tDocumentation: http://<service-url>\n\tIssues:
https://github.com/"
The postUninstallNotes parameter gives the user information they'll need after an
uninstall, for example, steps for further cleanup before reinstalling, with a link to detailed
instructions. A common issue is cleaning up ZooKeeper entries. For example:
"postUninstallNotes": "The Unicorn DC/OS Service has been uninstalled and
will no longer run.\nPlease follow the instructions at http://<service-URL>
to clean up any persisted state"
Example package.json
{ "packagingVersion": "2.0", // use either 2.0 or 3.0 "name": "foo", // your package name
"version": "1.2.3", // the version of the package "tags": ["mesosphere", "framework"],
"maintainer": "help@bar.io", // who to contact for help "description": "Does baz.", //
description of package "scm": "https://github.com/bar/foo.git", "website":
"http://bar.io/foo", "framework": true, "postInstallNotes": "Have fun foo-ing and baz-ing!"
}
resource.json
This file declares all the externally hosted assets the package needs, for example:
Docker containers, images, or native binary CLIs. See the resource.json schema for
details on what can be defined.
Example resource.json
{ "images": { "icon-small": "http://some.org/foo/small.png", "icon-medium":
"http://some.org/foo/medium.png", "icon-large": "http://some.org/foo/large.png",
"screenshots": [ "http://some.org/foo/screen-1.png", "http://some.org/foo/screen-2.png" ] },
"assets": { "uris": { "log4j-properties": "http://some.org/foo/log4j.properties" },
"container": { "docker": { "23b1cfe8e04a": "some-org/foo:1.0.0" } } } }
config.json
This file declares the package's configuration properties, such as the amount of CPU,
the number of instances, and the allotted memory. The defaults specified in config.json
become part of the context when marathon.json.mustache is rendered. The file describes
the configuration properties supported by the package, represented as a json-schema.
Each property should provide a default value, specify whether it's required, and provide
validation (such as minimum and maximum values). Users can then override specific
values at installation time by passing an options file to the DC/OS CLI or by setting
config values through the DC/OS web interface.
Example config.json
{ "type": "object", "properties": { "foo": { "type": "object", "properties": { "baz": {
"type": "integer", "description": "How many times to do baz.", "minimum": 0, "maximum": 16,
"required": false, "default": 4 } }, "required": ["baz"] } }, "required": ["foo"] }
marathon.json.mustache
Variables in the mustache template are evaluated from a union object created by
merging three objects in the following order:
1. Defaults specified in config.json.
2. User-supplied options from either the DC/OS CLI or the DC/OS web interface.
3. The contents of resource.json.
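As an illustration only (not a template from a real package), a marathon.json.mustache
consuming the config.json example above could look like:

{
  "id": "foo",
  "instances": 1,
  "cpus": 1,
  "mem": 512,
  "env": {
    "BAZ_COUNT": "{{foo.baz}}"
  }
}

At render time, {{foo.baz}} resolves to the user-supplied value of foo.baz, or to the
default of 4 when the user provides no override.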
After you have tested your package, follow the Submit Your Package instructions to
submit it.
DC/OS Tunnel
When developing services on DC/OS, you may find it helpful to access your cluster from
your local machine via SOCKS proxy, HTTP proxy, or VPN. For instance, you can work
from your own development environment and immediately test against your DC/OS
cluster.
Warning: DC/OS Tunnel is appropriate for development, debugging, and testing only. Do
not use DC/OS Tunnel in production.
SOCKS
DC/OS Tunnel can run a SOCKS proxy over SSH to the cluster. SOCKS proxies work for
any protocol, but your client must be configured to use the proxy, which runs on port
1080 by default.
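For example, assuming the tunnel-cli package described below is installed, a session
might look like this (the service URL is hypothetical):

dcos tunnel socks
# In another shell; socks5h resolves DNS through the proxy, so
# cluster-internal names work:
curl --proxy socks5h://127.0.0.1:1080 http://myservice.marathon.mesos:8080/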
HTTP
The HTTP proxy can run in two modes: transparent and standard.
Transparent Mode
In transparent mode, the HTTP proxy runs as superuser on port 80 and does not require
modification to your application. Access URLs by appending the mydcos.directory
domain. You can also use DNS SRV records as if they were URLs. The HTTP proxy
cannot currently access HTTPS in transparent mode.
Standard Mode
Though you must configure your client to use the HTTP proxy in standard mode, it does
not have any of the limitations of transparent mode. As in transparent mode, you can use
DNS SRV records as URLs.
SRV Records
An SRV DNS record maps a name to an IP/port pair. DC/OS creates SRV
records in the form _<port-name>._<service-name>._tcp.marathon.mesos. The HTTP
proxy exposes these as URLs. This feature can be useful for communicating with DC/OS
services.
VPN
DC/OS Tunnel provides you with full access to the DNS, masters, and agents from within
the cluster. OpenVPN requires root privileges to configure these routes.
DC/OS Tunnel Options at a Glance
SOCKS
  Pros: All protocols; specify ports.
  Cons: Requires application configuration.

HTTP (transparent)
  Pros: SRV as URL; no application configuration.
  Cons: Cannot specify ports (except through SRV); only supports HTTP; runs as superuser.

HTTP (standard)
  Pros: SRV as URL; specify ports.
  Cons: Requires application configuration; only supports HTTP/HTTPS.

VPN
  Pros: No application configuration; full and direct access to cluster; specify ports; all protocols.
  Cons: More prerequisites; runs as superuser; may need to manually reconfigure DNS; relatively heavyweight.
Prerequisite: the DC/OS Tunnel package. To install it, run dcos package install tunnel-cli --cli.
Example Application
All examples will refer to this sample application:
* Service Name: myapp
* Group: mygroup
* Port: 555
* Port Name: myport
Transparent mode
In transparent mode, the HTTP proxy works by port forwarding. Append .mydcos.directory
to the end of your domain when you enter commands. For instance,
http://example.com/?query=hello becomes http://example.com.mydcos.directory/?query=hello.
Note: In transparent mode, you cannot specify a port in a URL.
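A sketch of a transparent-mode session with the sample application; the SRV-style
name assumes Mesos-DNS flattens the /mygroup/myapp ID to myapp-mygroup:

sudo dcos tunnel http    # transparent mode binds port 80, hence sudo
curl http://_myport._myapp-mygroup._tcp.marathon.mesos.mydcos.directory/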
Standard mode
To run the HTTP proxy in standard mode, without root privileges, use the --port flag to
configure it to use another port:
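For example (the port number is arbitrary):

dcos tunnel http --port 8000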
Configure your application to use the proxy on the port you specified above.
SRV Records
The HTTP proxy exposes DC/OS SRV records as URLs in the form _<port-
name>._<service-name>._tcp.marathon.mesos.mydcos.directory (transparent mode) or
_<port-name>._<service-name>._tcp.marathon.mesos (standard mode).
The <service-name> is the entry in the ID field of a service you create from the DC/OS
web interface or the value of the id field in your Marathon application definition.
Add a Named Port from the DC/OS Web Interface
To name a port from the DC/OS web interface, go to the Services > Services tab, click
the name of your service, and then click Edit. Enter a name for your port on the
Networking tab.
The VPN client attempts to auto-configure DNS, but this functionality does not work on
macOS. To use the VPN client on macOS, add the DNS servers that DC/OS Tunnel
instructs you to use.
When you use the VPN, you are virtually within your cluster: you can access
your master and agent nodes directly. Running the VPN requires the OpenVPN
client on your local machine. For example, to install it:
* Ubuntu: apt-get update && apt-get install openvpn
* ArchLinux: pacman -S openvpn
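A sketch of a VPN session (the address is illustrative):

sudo dcos tunnel vpn
# Once the tunnel is up, cluster nodes are directly routable:
ping 10.0.4.130    # example private agent IP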
DC/OS Integration
Updated: April 17, 2017
You can leverage several integration points when creating a DC/OS Service. The
sections below explain how to integrate with each respective component.
Admin Router
When a DC/OS Service is installed and run on DC/OS, the service is generally deployed
on a private agent node. In order to allow users to access a running instance of the
service, Admin Router can function as a reverse proxy for the DC/OS Service.
Service Endpoints
Admin Router allows marathon tasks to define custom service UI and HTTP endpoints,
which are made available as /service/<service-name>. Set the following marathon task
labels to enable this:
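As a sketch, a Marathon app definition that should be reachable under /service/unicorn
would carry labels along these lines (the service name and values are illustrative):

"labels": {
  "DCOS_SERVICE_NAME": "unicorn",
  "DCOS_SERVICE_PORT_INDEX": "0",
  "DCOS_SERVICE_SCHEME": "http"
}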
In order for the forwarding to work reliably across task failures, we recommend co-
locating the endpoints with the task. This way, if the task is restarted on another host and
with different ports, Admin Router will pick up the new labels and update the routing.
Note: Due to caching, there can be an up to 30-second delay before the new routing is
working.
We recommend having only a single task setting these labels for a given service name. If
multiple task instances have the same service name label, Admin Router will pick one of
the task instances deterministically, but this might make debugging issues more difficult.
Since the paths to resources for clients connecting to Admin Router will differ from those
paths the service actually has, ensure the service is configured to run behind a proxy.
This often means relative paths are preferred to absolute paths. In particular, resources
expected to be used by a UI should be verified to work through a proxy.
Tasks running in nested marathon app groups will be available only using their service
name (i.e., /service/<service-name>), not by the marathon app group name (i.e.,
/service/app-group/<service-name>).
DC/OS UI
Service health check information can be surfaced in the DC/OS Services UI tab by:
Defining one or more healthChecks in the service's Marathon template, for example:
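A representative healthChecks entry; the values are illustrative, while the field names
follow the standard Marathon health-check schema:

"healthChecks": [
  {
    "path": "/health",
    "portIndex": 0,
    "protocol": "HTTP",
    "gracePeriodSeconds": 300,
    "intervalSeconds": 60,
    "timeoutSeconds": 20,
    "maxConsecutiveFailures": 3
  }
]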
Defining the label DCOS_PACKAGE_FRAMEWORK_NAME in the service's Marathon template, with the
same value that will be used when the framework registers with Mesos. For example:
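A sketch, reusing the unicorn service from earlier examples:

"labels": {
  "DCOS_PACKAGE_FRAMEWORK_NAME": "unicorn"
}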