Monitoring should be a layer in the stack, not an application.
gem install rspec
git clone git://github.com/danielsdeleo/critical.git
Critical is my take on network/infrastructure monitoring. Here are the big ideas:
- Infrastructure as code: The monitoring system should be an internal DSL so it can natively interact with any part of your infrastructure you can find or write a library for. You should also be able to productively alter its guts if you need to. This is a monitoring system for ops people who write code and coders who do ops.
- Client-based: This scales better, and is actually easier to configure if you use configuration management, which you should be doing anyway.
- Continuous verification: Critical has a single shot mode in addition to the typical daemonized operation. This allows you to verify the configuration on a host after making changes and then continuously monitor the state of the system using the same verification tests.
- Declarative: Declare what the state of your system is supposed to be.
- Alerting and Trending together: a client/agent can do both of these at the same time with less configuration overhead. It makes sense to keep them separate on the server side.
- Licensing: “Do what thou wilt shall be the whole of the law,” except for patent trolls, etc. So, Apache 2.0 it is.
Critical runs as a cluster of daemons. The master process does the scheduling and assigns tasks to workers by communicating over a UNIX domain socket. The workers listen on the socket and process tasks as they arrive. I had also considered an evented architecture (using EventMachine), but that had the drawback of requiring users to write plugins using only EM-based libraries or risk running into problems with blocking IO.
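To make the master/worker split concrete, here is a minimal sketch of the pattern using only Ruby's standard library. It is not Critical's actual implementation: the `dispatch` method, the round-robin assignment, and the `:ok` reply format are all assumptions made for illustration, and it relies on `fork`, so it only runs on Unix-like systems.

```ruby
require 'socket'

# Hypothetical sketch: a master hands task names to forked workers over
# UNIX domain sockets. Names and the wire format are illustrative only.
def dispatch(tasks, worker_count: 2)
  workers = []
  worker_count.times do
    master_end, worker_end = UNIXSocket.pair
    pid = fork do
      # Close the master-side sockets this child inherited so that EOF
      # reaches every worker once the master closes its ends.
      master_end.close
      workers.each { |w| w[:socket].close }
      # One task per line. Blocking work here stalls only this worker.
      while (task = worker_end.gets)
        worker_end.puts("#{task.chomp}:ok")
      end
      exit!(0)
    end
    worker_end.close
    workers << { pid: pid, socket: master_end }
  end

  # Round-robin assignment; the master waits for each reply in turn.
  results = tasks.each_with_index.map do |task, i|
    socket = workers[i % worker_count][:socket]
    socket.puts(task)
    socket.gets.chomp
  end

  # Closing the master ends signals EOF, which ends each worker loop.
  workers.each { |w| w[:socket].close }
  workers.each { |w| Process.wait(w[:pid]) }
  results
end

dispatch(%w[disk memory cpu])
```

Because each worker is a separate process, a plugin that blocks on IO delays only the tasks queued behind it on that worker, which is the trade-off against the evented design described above.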
Critical provides a DSL for writing metric gathering code. It looks like this:
    Metric(:memory_utilization) do
      case RUBY_PLATFORM
      when /darwin/
        # omitted...
      when /linux/
        # `free -b` prints every value in bytes.
        collects 'free -b'
        reports(:total_bytes => :int) do
          result.line(1).split[1]
        end
        reports(:bytes_used => :int) do
          result.line(2).split[2]
        end
        # Assumes the older `free` layout, where line 2 is the
        # "-/+ buffers/cache" row and its last column is free memory.
        reports(:bytes_free => :int) do
          result.line(2).split[3]
        end
      else
        raise UnsupportedPlatform, "memory_utilization does not have an implementation for your platform yet :("
      end

      reports(:kb_free => :integer) do
        bytes_free / 1024
      end

      reports(:kb_used => :integer) do
        pp :kb_used => (bytes_used / 1024) # debug output
        bytes_used / 1024
      end

      reports(:mb_free => :float) do
        kb_free / 1024.0
      end

      reports(:mb_used => :float) do
        kb_used / 1024.0
      end
    end
To configure Critical to monitor your metrics, use the monitor DSL:
    require_metric 'disk_utilization'
    require_metric 'memory_utilization'
    require_metric 'cpu_utilization'
    require_metric 'cluster'

    # Monitors are also where you define your scheduling.
    Monitor(:system) do
      # Monitor statements can be nested, this nesting will be included in the
      # collected data for tracking/tagging purposes.
      Monitor(hostname) do # includes the hostname in the namespace
        # Specify collection intervals with +every+ or +collect_every+
        # The +every+ form takes a block, each monitor you define inside the block
        # will be scheduled to run at that interval.
        every(10=>:seconds) do
          disk_utilization('/') { track :percentage }
          memory_utilization { track :bytes_used }
          cpu_utilization { track :percent_used }
          cluster("critical : worker") do |c|
            c.track :processes
            c.track :total_cpu
            c.track :total_rss
            c.track :uptime
          end
        end
      end
    end
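For readers curious how an internal DSL can carry nesting and intervals through to the scheduler, the following is a rough sketch built on `instance_eval`. It is not Critical's code: the `MonitorSketch` class, the `check` method, the `UNITS` table, and the `/`-joined path format are hypothetical stand-ins chosen for illustration.

```ruby
# Hypothetical mini-DSL: records each check with its nested namespace
# and the interval of the enclosing +every+ block.
class MonitorSketch
  UNITS = { seconds: 1, minutes: 60, hours: 3600 }.freeze

  attr_reader :scheduled

  def initialize
    @namespace = []
    @interval = nil
    @scheduled = []
  end

  # Push the name for the duration of the block, then pop it again.
  def Monitor(name, &block)
    @namespace.push(name.to_s)
    instance_eval(&block)
  ensure
    @namespace.pop
  end

  # Accepts a one-pair hash such as 10 => :seconds.
  def every(spec, &block)
    count, unit = spec.first
    previous = @interval
    @interval = count * UNITS.fetch(unit)
    instance_eval(&block)
  ensure
    @interval = previous
  end

  # Stand-in for a metric call such as disk_utilization('/').
  def check(name)
    path = (@namespace + [name.to_s]).join('/')
    @scheduled << { path: path, interval: @interval }
  end
end

dsl = MonitorSketch.new
dsl.instance_eval do
  Monitor(:system) do
    Monitor('web01') do
      every(10 => :seconds) do
        check :disk_utilization
      end
    end
  end
end
```

After this runs, `dsl.scheduled` holds one entry whose path combines both `Monitor` names with the check name, alongside an interval of ten seconds.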
See bin/critical --help and the examples directory.
Initial work focused on the alerting half of the alerting/trending combo that makes up “monitoring.” I’ve since pivoted and am currently focusing on making it dead simple to get data into Graphite. Alerting is still a long-term priority.
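As a point of reference for what "getting data into Graphite" involves, Graphite's plaintext protocol takes one `path value timestamp` line per data point, by default over TCP port 2003. The sketch below is not taken from Critical: the `graphite_line` and `send_to_graphite` helpers, and the mapping of nested monitor names onto a dotted path, are assumptions for illustration.

```ruby
require 'socket'

# Hypothetical helper: format one data point for Graphite's plaintext
# protocol. +namespace+ is a list of path segments, e.g. the nested
# monitor names followed by the tracked attribute.
def graphite_line(namespace, value, time = Time.now)
  "#{namespace.join('.')} #{value} #{time.to_i}\n"
end

# Hypothetical helper: write preformatted lines to a Carbon listener.
# Host and port are placeholders; 2003 is Carbon's default plaintext port.
def send_to_graphite(lines, host: 'localhost', port: 2003)
  TCPSocket.open(host, port) do |sock|
    lines.each { |line| sock.write(line) }
  end
end

line = graphite_line(%w[system web01 memory_utilization bytes_used],
                     1024, Time.at(1_300_000_000))
```

Note that a hostname containing dots would be split into several path segments by Graphite, so a real mapping would need to escape or replace them.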
Distributed under the terms of the Apache 2.0 license. © 2010,2011 Daniel DeLeo