Metaparameters in mgmt


In mgmt we have meta parameters. They are similar in concept to what you might be familiar with from other tools, except that they are more clearly defined (in a single struct) and vastly more powerful.

In mgmt, a meta parameter is a parameter which is codified entirely in the engine, and which can be used by any resource. In Puppet, for example, require/before are considered meta parameters, whereas in mgmt the equivalent is a graph edge, which is not a meta parameter. [1]

Kinds

As of this writing we have seven different kinds of meta parameters:

- AutoEdge
- AutoGroup
- Noop
- Retry & Delay
- Poll
- Limit & Burst
- Sema

The astute reader will note that there are actually nine different meta parameters listed, but I have grouped them into seven categories since some of them are very tightly interconnected. The first two, AutoEdge and AutoGroup, have been covered in separate articles already, so they won’t be discussed here. To learn about the others, please read on…
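
Since they are all defined in one struct in the engine, you can picture them roughly like this. This is an illustrative Go sketch based on the list above; the exact field types in the mgmt source may differ:

type MetaParams struct {
    AutoEdge  bool     // should we generate automatic edges?
    AutoGroup bool     // should we attempt automatic grouping?
    Noop      bool     // check what would change, but never apply
    Retry     int16    // number of times to retry on error; -1 for infinite
    Delay     uint64   // number of milliseconds to wait between retries
    Poll      uint32   // poll interval in seconds; zero means use Watch
    Limit     float64  // maximum rate of CheckApply, in events per second
    Burst     int      // burst size for the rate limiter token bucket
    Sema      []string // semaphore ids, eg: ["some_id:42", "lockname"]
}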

Noop

Noop stands for no-operation. If it is set to true, we tell the CheckApply portion of the resource to not make any changes. It is up to the individual resource implementation to respect this facility, which is the case for all correctly written resources. You can learn more about this by reading the CheckApply section in the resource guide.
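
To make this concrete, here is a rough Go sketch of the pattern, assuming a CheckApply(apply bool) style signature as described in the resource guide. The resource and its helper methods are hypothetical, made up for illustration:

// ExampleRes is a hypothetical resource used to show the pattern.
type ExampleRes struct{}

func (obj *ExampleRes) isStateOK() bool   { return false } // hypothetical check
func (obj *ExampleRes) applyState() error { return nil }   // hypothetical apply

// CheckApply returns (true, nil) if no change was needed, and
// (false, nil) if a change was needed. When apply is false (noop),
// a correctly written resource reports the pending change without
// actually making it.
func (obj *ExampleRes) CheckApply(apply bool) (bool, error) {
    if obj.isStateOK() {
        return true, nil // already in the desired state
    }
    if !apply { // noop mode: report the pending change, change nothing
        return false, nil
    }
    if err := obj.applyState(); err != nil {
        return false, err
    }
    return false, nil // we successfully made a change
}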

If you’d like to set the noop state on all resources at runtime, there is a CLI flag which you can use to do so. It is unsurprisingly named --noop, and it overrides the setting on all the resources in the graph. This is in stark contrast with Puppet, which allows an individual resource definition to override the user’s choice!

james@computer:/tmp$ cat noop.pp
file { '/tmp/puppet.noop':
    content => "nope, nope, nope!\n",
    noop => false,    # set at the resource level
}
james@computer:/tmp$ time puppet apply noop.pp
Notice: Compiled catalog for computer in environment production in 0.29 seconds
Notice: /Stage[main]/Main/File[/tmp/puppet.noop]/ensure: defined content as '{md5}d8bda32dd3fbf435e5a812b0ba3e9a95'
Notice: Applied catalog in 0.03 seconds

real    0m15.862s
user    0m7.423s
sys    0m1.260s
james@computer:/tmp$ file puppet.noop    # verify it worked
puppet.noop: ASCII text
james@computer:/tmp$ rm -f puppet.noop    # reset
james@computer:/tmp$ time puppet apply --noop noop.pp    # safe right?
Notice: Compiled catalog for computer in environment production in 0.30 seconds
Notice: /Stage[main]/Main/File[/tmp/puppet.noop]/ensure: defined content as '{md5}d8bda32dd3fbf435e5a812b0ba3e9a95'
Notice: Class[Main]: Would have triggered 'refresh' from 1 events
Notice: Stage[main]: Would have triggered 'refresh' from 1 events
Notice: Applied catalog in 0.02 seconds

real    0m15.808s
user    0m7.356s
sys    0m1.325s
james@computer:/tmp$ cat puppet.noop
nope, nope, nope!
james@computer:/tmp$

If you look closely, Puppet just trolled you by performing an operation when you thought it would be noop! I think the behaviour is incorrect, but if this isn’t supposed to be a bug, then I’d sure like to know why!

It’s worth mentioning that there is also a noop resource in mgmt which is similarly named because it does absolutely nothing.

Retry & Delay

In mgmt we can run continuously, which means that it’s often more useful to do something interesting when there is a resource failure, rather than simply shutting down completely. As a result, if there is an error during the CheckApply phase of resource execution, the operation can be retried (retry) a number of times, with an optional delay (delay) between each attempt.

The delay value is an integer representing the number of milliseconds to wait between retries, and it defaults to zero. The retry value is an integer representing the maximum number of allowed retries, and it defaults to zero. A negative value will permit an infinite number of retries. If the number of retries is exhausted, then the temporary resource failure will be converted into a permanent failure. Resources which depend on a failed resource will be blocked until there is a successful execution. When there is a successful CheckApply, the resource retry counter is reset.
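
As a rough sketch of these semantics in Go (illustrative only, not the actual engine code):

package main

import (
    "errors"
    "fmt"
    "time"
)

// retryLoop runs fn until it succeeds or the retries are exhausted.
// A negative retry count means retry forever; delay is in milliseconds.
func retryLoop(retry int, delay uint64, fn func() error) error {
    for {
        err := fn()
        if err == nil {
            return nil // success (the real engine also resets its counter here)
        }
        if retry == 0 {
            return err // exhausted: the temporary failure becomes permanent
        }
        if retry > 0 { // negative means infinite, so don't decrement
            retry--
        }
        time.Sleep(time.Duration(delay) * time.Millisecond)
    }
}

func main() {
    i := 0
    err := retryLoop(3, 500, func() error {
        i++
        if i < 3 {
            return errors.New("spurious failure")
        }
        return nil
    })
    fmt.Println("result:", err, "after", i, "attempts")
}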

In general it is best to leave these values at their defaults unless you are expecting spurious failures; this way, if you do get a failure, it won’t be masked by the retry mechanism.

It’s worth mentioning that the Watch loop can fail as well, and that the retry and delay meta parameters apply to it too! While these could have had their own set of meta parameters, I felt it would have unnecessarily cluttered up the interface, and I couldn’t think of a case where it would be helpful to have different values. They do have their own separate retry counter and delay timer, of course! If someone has a valid use case, then I’m happy to separate these.

If someone would like to implement a pluggable back-off algorithm (eg: exponential back-off) to be used here instead of a simple delay, then I think it would be a welcome addition!

Poll

Despite mgmt being event based, there are situations where you’d really like to poll instead of using the Watch method. For these cases, I reluctantly implemented a poll meta parameter. It does exactly what you’d expect, generating events every poll seconds. It defaults to zero which means that it is disabled, and Watch is used instead.
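
In Go, a polling loop of this sort might look something like the following sketch. The channel and function names are illustrative, not mgmt’s actual API:

package main

import (
    "context"
    "fmt"
    "time"
)

// pollWatch generates a synthetic event every interval seconds, the
// way the poll meta parameter substitutes for a resource's Watch.
func pollWatch(ctx context.Context, interval uint32, events chan<- struct{}) {
    ticker := time.NewTicker(time.Duration(interval) * time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            events <- struct{}{} // each tick triggers a CheckApply
        case <-ctx.Done():
            close(events) // shut down the event stream
            return
        }
    }
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
    defer cancel()
    events := make(chan struct{})
    go pollWatch(ctx, 1, events)
    for range events {
        fmt.Println("event: would run CheckApply now")
    }
}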

Despite my earlier knock of it, it is actually quite useful, in that some operations might require or prefer polling, and having it as a meta parameter means that those resources won’t need to duplicate the polling code.

This might be very powerful for an aws resource that can set up hosted Amazon EC2 instances. When combined with the retry and delay meta parameters, it will even survive outages!

One particularly interesting aspect is that ever since the converged graph detection was improved, we can still converge a graph and shut down with the converged-timeout functionality while using polling! This is described in more detail in the documentation.

Limit & Burst

In mgmt, the events generated by the Watch main loop of a resource do not need to be matched 1-1 with the CheckApply remediation step. This is very powerful because it allows mgmt to collate multiple events into a single CheckApply step, which is helpful when the duration of the CheckApply step is longer than the interval between the Watch events being generated.

In addition, you might not want to constantly Check or Apply (converge) the state of your resource as often as it goes out of state. For this situation, that step can be rate limited with the limit and burst meta parameters.

The limit and burst meta parameters implement something known as a token bucket. This models a bucket which is filled with tokens and which is drained slowly. It has a particular rate limit (which sets a maximum rate) and a burst count which sets a maximum bolus which can be absorbed.

This doesn’t cause us to permanently miss events (and stay un-converged) because when the bucket overfills, instead of dropping events, we actually cache the last one for playback once the bucket falls within the execution rate threshold. Remember, we expect to be converged in the steady state, not at every infinitesimal delta t in between.

The limit and burst meta parameters default to allowing an infinite rate, with zero burst. As it turns out, if you have a non-infinite rate, the burst must be non-zero or you will cause a Validate error. Similarly, a non-zero burst with an infinite rate is effectively the same as the default. A good rule of thumb is to remember to either set both values or neither. This is all because of the mathematical implications of token buckets, which I won’t explain in this article.
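
Since mgmt is written in Go, this behaviour maps naturally onto the standard golang.org/x/time/rate token bucket. Here is a standalone sketch of the throttling (again, not the actual engine code); note how the finite rate is paired with a non-zero burst, per the rule of thumb above:

package main

import (
    "context"
    "fmt"
    "time"

    "golang.org/x/time/rate"
)

func main() {
    // The limit is the sustained rate in events per second, and the
    // burst is the bucket size. The defaults described above would be
    // rate.Inf with zero burst, which disables limiting entirely.
    limiter := rate.NewLimiter(rate.Limit(2), 4) // 2 per second, burst of 4

    ctx := context.Background()
    for i := 0; i < 8; i++ {
        // Wait blocks until a token is available; this is the gate a
        // rate limited CheckApply would pass through. The first four
        // calls drain the full bucket immediately, then it throttles.
        if err := limiter.Wait(ctx); err != nil {
            fmt.Println("limiter error:", err)
            return
        }
        fmt.Printf("%s: CheckApply #%d\n", time.Now().Format("15:04:05.000"), i)
    }
}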

Sema

Sema is short for semaphore. In mgmt we have implemented P/V style counting semaphores. This is a mechanism for reducing parallelism in situations where there are no explicit dependencies between resources. This might be useful when the number of concurrent operations could outnumber the number of CPUs on your machine and you want to avoid starving your other processes. Alternatively, there might be a particular operation that you want to add a mutex (mutual exclusion) around, which can be represented with a semaphore of size one (1). Lastly, it was a particularly fun meta parameter to write, and I had been itching to do so for some time.

To use this meta parameter, simply give a list of semaphore ids to the resource you want to lock. These can be any string, and are shared globally throughout the graph. By default, they have a size of one. To specify a semaphore with a different size, append a colon (:) followed by an integer at the end of the semaphore id.

Valid ids might include: “some_id:42”, “hello:13”, and “lockname”. Remember, the size parameter is the number of simultaneous resources which can run their CheckApply methods at the same time. It does not prevent multiple Watch methods from returning events simultaneously.
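
Here is a small Go sketch (with made-up helper names) of a P/V style counting semaphore, and of how an id like "some_id:42" can be parsed into a size, defaulting to one:

package main

import (
    "fmt"
    "strconv"
    "strings"
)

// semaphore is a P/V style counting semaphore built from a buffered
// channel; its capacity is the semaphore's size.
type semaphore chan struct{}

func newSemaphore(size int) semaphore { return make(semaphore, size) }
func (s semaphore) P()                { s <- struct{}{} } // acquire (blocks when full)
func (s semaphore) V()                { <-s }             // release

// sizeOf parses an id of the form "name" or "name:size". It defaults
// to a size of one; error handling is simplified for this sketch.
func sizeOf(id string) int {
    if i := strings.LastIndex(id, ":"); i != -1 {
        if n, err := strconv.Atoi(id[i+1:]); err == nil && n > 0 {
            return n
        }
    }
    return 1
}

func main() {
    for _, id := range []string{"some_id:42", "hello:13", "lockname"} {
        fmt.Printf("%s -> size %d\n", id, sizeOf(id))
    }
}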

If you would like to force a semaphore globally on all resources, you can pass in the --sema argument with a size integer. This will get appended to the existing semaphores. For example, to simulate Puppet’s traditional non-parallel execution, you could specify --sema 1.

Oh, no! Does this mean I can deadlock my graphs? Interestingly enough, this is actually completely safe! The reason is that all the semaphores exist in the mgmt directed acyclic graph, and since that DAG represents dependencies which are always respected, there will always be a way to make progress, which eventually unblocks any waiting resources! The trick to doing this is ensuring that each resource always acquires its list of semaphores in alphabetical order. (Actually the order doesn’t matter as long as it’s consistent across the graph, and alphabetical is as good as any!) Unfortunately, I don’t have a formal proof of this, but I was able to convince myself on the back of an envelope that it is true! Please contact me if you can prove me right or wrong! The one exception is that a counting semaphore of size zero would never let anyone acquire it, so by definition it would permanently block, and as a result it is not currently permitted.
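
The consistent-ordering trick is tiny in code. Reusing the semaphore type from the sketch above (plus "sort" from the standard library), something like this is all a worker needs to do around each CheckApply; again, this is illustrative rather than mgmt’s implementation:

// acquireAll locks every semaphore in a consistent (here: alphabetical)
// order, which is what prevents deadlock when two resources hold
// overlapping semaphore lists.
func acquireAll(sems map[string]semaphore, ids []string) {
    sorted := append([]string{}, ids...) // copy so the caller's slice is untouched
    sort.Strings(sorted)
    for _, id := range sorted {
        sems[id].P() // block until this one is acquired
    }
}

// releaseAll unlocks them afterwards; release order doesn't matter.
func releaseAll(sems map[string]semaphore, ids []string) {
    for _, id := range ids {
        sems[id].V()
    }
}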

The last important point to mention is about the interplay between automatic grouping and semaphores. When more than one resource is grouped, they are considered to be part of the same resource. As a result, the resulting list of semaphores is the sum of the individual semaphores, de-duplicated. This ensures that individual locking guarantees aren’t broken when multiple resources are combined.

Future

If you have ideas for future meta parameters, please let me know! We’d love to hear about your ideas on our mailing list or on IRC. If you’re shy, you can contact me privately as well.

Happy Hacking,

James

[1] This is a bit of an apples vs. flame-throwers comparison because I’m comparing the mgmt engine meta parameters with the puppet language meta parameters, but I think it’s worth mentioning because there’s a clear separation between the two in mgmt, whereas the separation is much more blurry in the puppet scenario. It’s also true that the mgmt language might grow a concept of language-level meta parameters which only maps partially onto the engine meta parameters, but this is a discussion for another day!



