Funnel

Last updated on November 19, 2015


Services

Whilst Funnel is mainly a library, composing the core of the system across the network requires a variety of service components. Each of the services outlined below uses the Funnel core library and itself produces metrics (the monitoring services are monitored using Funnel too!).

Agent

Funnel agent is primarily designed to run on any given host and provides several pieces of functionality:

The agent process makes heavy use of the ZeroMQ module to provide communication both locally and remotely. Specifically, applications running on the host connect to the appropriate local domain socket to submit their metrics; the agent consumes from that domain socket and transparently re-emits those metrics on a TCP socket bound to the host adaptor. This provides a convenient mechanism for applications to report metrics irrespective of whether the host is multi-tenant (e.g. using containers and a resource manager) or not.
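In practice applications normally publish through the Funnel core library rather than talking to ZeroMQ directly, but the transport path can be illustrated with a small sketch. The snippet below is only a sketch under stated assumptions: it uses the JeroMQ bindings, assumes a PUSH socket purely for illustration, and sends a placeholder payload, since the real socket type and wire framing are handled by the Funnel ZeroMQ module.

import org.zeromq.ZMQ

object AgentSocketSketch {
  def main(args: Array[String]): Unit = {
    val context = ZMQ.context(1)
    // Connect to the agent's local domain socket; the path matches the
    // agent.zeromq.socket configuration value shown below.
    val socket = context.socket(ZMQ.PUSH) // socket type assumed for illustration
    socket.connect("ipc:///tmp/funnel.socket")

    // Placeholder payload only - real applications should let the Funnel
    // core library handle the framing and submission.
    socket.send("example-metric-payload")

    socket.close()
    context.term()
  }
}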

Configuration

The agent does not have a huge number of configuration options, but what is available can be controlled via a Knobs configuration file. Features can be enabled or disabled simply by commenting out sections of the configuration file; for example, commenting out agent.http will disable the HTTP server when the agent boots. The default configuration looks like this:

agent {

  enable-system-metrics = true

  enable-jvm-metrics = true

  #############################################
  #                   Proxy                   #
  #############################################

  zeromq {
    # local file path of the domain socket that incoming
    # metrics will arrive on.
    socket = "/tmp/funnel.socket"

    proxy {
      # network address to bind to, which the flask
      # will later connect to. Must be accessible from the LAN.
      host = "0.0.0.0"
      port = 7390
    }
  }

  #############################################
  #                 Importers                 #
  #############################################

  # recommended to keep the network host as 127.0.0.1, since
  # each node should only ever be publishing metrics
  # to its loopback network address.

  # http {
  #   host = "127.0.0.1"
  #   port = 8080
  # }

  # statsd {
  #   port   = 8125
  #   prefix = "oncue"
  # }

  # nginx {
  #   url = "http://127.0.0.1:8080/nginx_status"
  #   poll-frequency = 15 seconds
  # }

  # currently only supports non-authenticated, non-SSL connections;
  # designed to be used locally within a docker container or host.
  # processes running jmx typically need the following set:
  #
  # -Dcom.sun.management.jmxremote
  # -Dcom.sun.management.jmxremote.authenticate=false
  # -Dcom.sun.management.jmxremote.ssl=false
  # -Djava.rmi.server.hostname=?????
  # -Dcom.sun.management.jmxremote.port=????
  # jmx {
  #   # Name this jmx resource so that a "cluster" field can be specified
  #   # at query time which references the metrics imported via this source.
  #   # An example would be: "dev-accounts-db-west-1a"
  #   name = "example"
  #
  #   uri = "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi"
  #
  #   poll-frequency = 28 seconds
  #
  #   # Examples of jmx queries are:
  #   # *:type=Foo,name=Bar to match names in any domain whose exact set of keys is type=Foo,name=Bar.
  #   # d:type=Foo,name=Bar,* to match names in the domain d that have the keys type=Foo,name=Bar plus zero or more other keys.
  #   # *:type=Foo,name=Bar,* to match names in any domain that has the keys type=Foo,name=Bar plus zero or more other keys.
  #   # d:type=F?o,name=Bar will match e.g. d:type=Foo,name=Bar and d:type=Fro,name=Bar.
  #   # d:type=F*o,name=Bar will match e.g. d:type=Fo,name=Bar and d:type=Frodo,name=Bar.
  #   # d:type=Foo,name="B*" will match e.g. d:type=Foo,name="Bling". Wildcards are recognized even inside quotes, and like other special characters can be escaped with \.
  #   queries = [ "org.apache.cassandra.db:*" ]
  #
  #   # Use glob syntax to specify patterns of attributes that you do not wish to
  #   # collect from the JMX endpoint. Examples are:
  #   # "*HistogramMicros"
  #   # "*Histogram"
  #   # etc
  #   exclude-attribute-patterns = [ "*HistogramMicros", "*Histogram" ]
  # }

}

HTTP

The HTTP API supports reporting various kinds of metric. The agent HTTP process is always bound to 127.0.0.1:$port, where $port is the agent.http.port configuration value. The structure of the JSON payload is defined below; the body must be POSTed to the /metrics resource on that host:port combination:

{
  "cluster": "example",
  "metrics": [
    {
      "name": "ntp/whatever",
      "kind": "gauge-double",
      "value": "0.1234"
    },
    {
      "name": "ntp/whatever",
      "kind": "counter",
      "value": "1"
    },
    ...
  ]
}

The structure is very simple, but let's take a moment to walk through the various options for completeness:

counter (`Int`, value optional): An atomic counter that can be incremented over time. If no value is specified, the counter increments by one; if an `Int` value is specified, the counter is incremented by that number.
timer (String, value required): Specifies the time a particular operation took. It is important to note that the HTTP API does not keep any state to actually conduct timing; this must be done by the caller and reported via the HTTP API. The value supplied is a simple string that is parsed into a `scala.concurrent.Duration`, and can be constructed from strings such as "20 milliseconds", "5 minutes", "3 days" etc.
gauge-double (Double, value required): A double value that you wish to report and track as it changes over time.
gauge-string (String, value required): A string value that you wish to report and track as it changes over time. Non-numeric gauges are typically less useful and their use case is infrequent.
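Putting this together, metrics can be reported from any language that has an HTTP client. The sketch below times an operation and POSTs a timer metric to the agent; it assumes the commented-out agent.http importer above has been enabled with its default 127.0.0.1:8080 values, and the cluster and metric names are purely illustrative.

import java.io.OutputStreamWriter
import java.net.{HttpURLConnection, URL}

object ReportTimerSketch {
  def main(args: Array[String]): Unit = {
    // The HTTP API keeps no timing state, so time the operation ourselves.
    val start = System.nanoTime()
    Thread.sleep(250) // stand-in for the real work being timed
    val elapsedMillis = (System.nanoTime() - start) / 1000000

    val payload =
      s"""{
         |  "cluster": "example",
         |  "metrics": [
         |    { "name": "example/db-call", "kind": "timer", "value": "$elapsedMillis milliseconds" }
         |  ]
         |}""".stripMargin

    // POST the payload to the /metrics resource on the agent's HTTP importer.
    val conn = new URL("http://127.0.0.1:8080/metrics")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setDoOutput(true)
    conn.setRequestProperty("Content-Type", "application/json")
    val out = new OutputStreamWriter(conn.getOutputStream, "UTF-8")
    out.write(payload)
    out.close()
    println(s"agent responded with HTTP ${conn.getResponseCode}")
    conn.disconnect()
  }
}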

That's about it for the HTTP API. Whilst it should tolerate a good beating performance-wise, the extra layers of indirection mean it will likely not be as performant as the native Scala bindings, so keep that in mind.

StatsD

The StatsD interface provided by the agent uses UDP and supports the c, ms, g and m metric types. The StatsD interface is entirely optional.
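For reference, a StatsD metric is just a small UDP datagram in the standard name:value|type format. The sketch below sends a counter increment, assuming the commented-out statsd block above has been enabled on its default port 8125; the metric name is purely illustrative.

import java.net.{DatagramPacket, DatagramSocket, InetAddress}

object StatsdSketch {
  def main(args: Array[String]): Unit = {
    // "requests:1|c" increments a counter called "requests" by one; the
    // ms, g and m kinds use the same name:value|type shape.
    val bytes  = "requests:1|c".getBytes("UTF-8")
    val socket = new DatagramSocket()
    val packet = new DatagramPacket(bytes, bytes.length,
                                    InetAddress.getByName("127.0.0.1"), 8125)
    socket.send(packet)
    socket.close()
  }
}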

Nginx

The Nginx importer periodically fetches the specified Nginx URL and extracts the operational data into the agent's Funnel. The following metrics are exposed if this feature is enabled:

The nginx/lifetime/* keys are essentially gauges that detail the state of Nginx since it was first booted. These values will not reset over time (this is simply how they are extracted from Nginx), so seeing these numbers grow over time is entirely normal. The remainder of the metrics behave exactly like normal gauges: their values reflect current usage and are rolled up into the Funnel windowing.
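If you want to confirm the status endpoint is reachable before enabling the importer, a quick sketch (using the URL from the commented-out nginx block above) is enough:

import scala.io.Source

object NginxStatusCheck {
  def main(args: Array[String]): Unit = {
    // Fetch the same status page the nginx importer polls and print it.
    val status = Source.fromURL("http://127.0.0.1:8080/nginx_status")
    try println(status.mkString)
    finally status.close()
  }
}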

JMX

The JMX module is designed to periodically import metrics from a specified JMX RMI location. The agent does not support authenticated JMX endpoints at this time, and it assumes that it is operating on the same machine as the JMX service you wish to connect to. Technically it could do remote monitoring too, but it is deliberately less flexible in that regard: the intention is not to deploy a central service that reaches out to remote JMX endpoints, hence only a single RMI endpoint is supported.

The primary thing to understand about the JMX module is that you need to specify the MBean queries you are actually interested in. Examples of MBean queries (taken from the ObjectName JavaDoc in the JDK) are shown in the commented-out jmx block of the configuration above.

Whilst this is reasonably expressive, ObjectName queries do not really support negation, and for JMX endpoints that publish huge numbers of metric vectors it is often desirable to run a set of filters to remove unwanted keys. This is supported by the exclude-attribute-patterns configuration parameter in agent.cfg (detailed above). The parameter uses glob syntax to select the metric names the user does not wish to import.
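A query pattern can be sanity-checked against a running JVM before adding it to agent.cfg. The sketch below, run inside the JVM of interest and using an illustrative pattern, prints the MBeans that an ObjectName query matches:

import java.lang.management.ManagementFactory
import javax.management.ObjectName
import scala.collection.JavaConverters._

object JmxQueryCheck {
  def main(args: Array[String]): Unit = {
    val server = ManagementFactory.getPlatformMBeanServer
    // Same pattern syntax as the entries in the agent's `queries` list.
    val pattern = new ObjectName("java.lang:*")
    server.queryNames(pattern, null).asScala.foreach(println)
  }
}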

Flask

Flask takes the role of a collector. It reaches out to nodes that have the Funnel Agent running and consumes their metrics, which in turn were gathered from the application(s) running on that host.

Whilst Flask is a fairly small process, this section includes several sub-sections covering mirroring, configuration, connection failure handling and system metrics.

Flask's sole job is to "mirror" the metrics from application hosts by taking mirroring instructions from an external orchestrating process (i.e. Flask only ever mirrors nodes it is instructed to mirror - it never figures this out itself). This is achieved with a simple HTTP POST to http://<host>:<port>/mirror on the Flask host. For example:

[
  {
    "cluster": "accounts-2.1-us-east",
    "urls": [
      "http://accounts-01.us-east.verizon.com:5777/stream/previous",
      "http://accounts-01.us-east.verizon.com:5777/stream/now?type=\"String\""
    ]
  }
]

This simple structure represents the "cluster" of URLs that one wishes to monitor. The expectation is that for a given logical cluster (for example, accounts-2.1-us-east) there will be an associated set of URLs that expose metrics. This clustering is entirely arbitrary, and you can construct whatever cluster/URL combinations you want, with the one restriction that a given stream can only be connected to a single cluster at any given time (on the basis that connecting over and over to the same machine is a needless drain on resources). It is, however, fine to send the same URL multiple times: if there is an existing connection to that URL, it will be used instead of creating a new one (in other words, monitoring instructions are idempotent for cluster -> url combinations).

A URL can be any Funnel URL that serves a stream of metrics. It's a good idea to be as specific as possible here. Don't request the entire stream if you don't need it. As an example, the above cluster specification asks only for metrics from the previous window, as well as metrics from the now window that have the type "String" (which will generally not exist in the previous window).

Configuration

Flask takes all of its configuration information via Knobs, primarily by file. The default configuration file looks like this:

flask {
  name = "flask"

  metric-ttl = 24 hours

  collect-local-metrics = true

  local-metric-frequency = 30

  environment = "dev"

  network {
    host = "localhost"
    http-port = 6777
    selfie-port = 7557
  }

  retry-schedule {
    duration = 30s
    retries = 6
  }

  # elastic-search-exploded {
  #   url = "http://localhost:9200"
  #   index-name = "funnel"
  #   type-name = "metric"
  #   groups = [ "previous/jvm", "previous/system", "previous" ]
  #   template {
  #     name = "flask"
  #     location = "/path/to/exploded-template.json"
  #   }
  #   partition-date-format = "yyyy.ww"
  #   connection-timeout = 5 seconds
  #   minimum-publish-frequency = 10 minutes
  # }

  # elastic-search-flattened {
  #   url = "http://localhost:9200"
  #   index-name = "monitoring"
  #   type-name = "metric"
  #   groups = [ "previous" ]
  #   template {
  #     name = "flaskv2"
  #     location = "/path/to/flattened-template.json"
  #   }
  #   partition-date-format = "yyyy.MM.dd"
  #   connection-timeout = 5 seconds
  #   minimum-publish-frequency = 10 minutes
  # }
}

Let's consider these options one by one:

Handling Connection Failures

Sometimes things go wrong and Flask will be unable to reach a Funnel host that it was previously talking to. When this happens, Flask will back off exponentially and try to reconnect over a period of time, but if all attempts fail Flask will simply write a log entry detailing the failure. This is OK due to the design of Chemist and the manner in which work is discovered and re-allocated.

The length and count of the retry strategy are configured using the flask.retry-schedule configuration block, for which the parameters are quite self-explanatory. The default is a 30s interval between disconnect and retry, with 6 retries (3 minutes overall).

System Metrics

As mentioned above, Flask is also a service, so it too can produce metrics and be monitored. To enable Funnel on the Flask instance, two conditions need to be true:

If your machine is a recently baked instance, it should already contain Hyperic SIGAR on the LD_LIBRARY_PATH. Otherwise, download the library here: http://sourceforge.net/projects/sigar/files/.

Unzip the file and drop the appropriate .so file into your LD_LIBRARY_PATH.

SIGAR gathers a large set of metrics for the local machine:

Chemist

The final component living under the Funnel umbrella is Chemist. Given that each and every Flask does not know about any of its peers - it only understands the work it has been assigned - there has to be a way to automatically identify and assign new work as machines in an operational region start up, fail and gracefully shut down. This is especially true when a given Flask instance fails unexpectedly, as its work will need to be re-assigned to other available Flask instances so that operational visibility is not compromised.

Chemist can be run on a variety of platforms. Currently there are two fully supported paths and one future path:

Chemist has a rudimentary but useful API for conducting several management tasks. At the time of writing the following APIs were available:

Internals

Chemist is implemented as a functional stream transducer, and it has a range of moving parts that make partitioning workloads simpler and keep it easy to extend. Some of the key concepts are:

Chemist AWS

For deployment to AWS, Chemist leverages the fact that auto-scaling groups can notify an SNS topic when lifecycle events associated with that group occur. For example, if a machine is terminated, a notification is pushed to SNS detailing exactly what happened and to which instance(s). An SQS queue is then subscribed to this SNS topic, and Chemist consumes the incoming messages. SQS acts as a persistent queue, so if the Chemist node fails and a new Chemist is brought back up from the auto-scaling group, it simply picks up where its predecessor left off.
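Chemist handles this consumption internally, but the plumbing amounts to polling the SQS queue and acting on each notification. The sketch below, using the AWS Java SDK with a placeholder queue URL, only illustrates that loop; it is not how Chemist itself is implemented.

import com.amazonaws.services.sqs.AmazonSQSClientBuilder
import scala.collection.JavaConverters._

object LifecycleQueueSketch {
  def main(args: Array[String]): Unit = {
    val sqs      = AmazonSQSClientBuilder.defaultClient()
    val queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/example-chemist-queue" // placeholder

    while (true) {
      // Each message body is the SNS envelope wrapping the auto-scaling notification.
      val messages = sqs.receiveMessage(queueUrl).getMessages.asScala
      messages.foreach { m =>
        println(m.getBody)
        sqs.deleteMessage(queueUrl, m.getReceiptHandle)
      }
      Thread.sleep(5000)
    }
  }
}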

It is assumed that any deployment to AWS will be conducted using a CloudFormation (CFN) template, and that CFN template must create the SQS queue for this particular deployment. This is extremely important as SQS queues are essentially mutable state, and we don't want multiple deployments competing to read the same values from the mutable stack.

Chemist Static

For deployment to on-premises or fixed-capacity infrastructure, Funnel provides "chemist-static", which can orchestrate your Flask instances by way of a simple static file-based configuration. The configuration options are the same as the regular Chemist, with a few key additions:

chemist {
  ...

  targets {
    instance1 {
      cluster-name = "one"
      uris = [ "http://alpha:1234" ]
    }
    instance2 {
      cluster-name = "two"
      uris = [ "http://beta:5678" ]
    }
    instance3 {
      cluster-name = "three"
      uris = [ "http://delta:9012" ]
    }
  }

  flasks {
    flask1 {
      location {
        host = "ay"
        port = 1111
        protocol = "http"
      }
      telemetry {
        host = "bee"
        port = 2222
        protocol = "tcp"
      }
    }
  }
}

As you can see, the targets section denotes the Funnel endpoints Chemist should try to monitor, whilst the flasks section denotes the nodes that should do the monitoring. The primary differentiator between the chemist-static module and other Chemist implementations is that the configuration file will be dynamically reloaded if changes are made at runtime. The reason for this is that in fixed infrastructure, machines are typically mutated in-place using IT automation tools such as Ansible or Chef.