Sudden breakage of puppet apply for some users
Things started harmlessly, with yet another mailing list post.
A user was reporting strange behavior of `puppet apply`: the resource from the applied manifest was being ignored, apparently.
Reports of Puppet behaving in an unexpected fashion are commonplace, and they usually come down to misunderstandings. But after some exchanged debugging instructions and follow-up findings, it soon became clear that something was going seriously wrong. Will Partain then notified me about PUP-3258 off-list. My curiosity was piqued.
Dissecting the report
Will had left a great bug report! It was a little thin on the description, but came with a complete lab environment for reproducing it. This was especially valuable to me, having never used an ENC of any kind, much less an ENC script. The included README file did a good job of describing what was happening, and what should have been happening instead.
My preference for trying buggy manifests is to place them somewhere in `/tmp` and run `puppet apply` on them directly. Things that are directly involved with master operations I put in `~/.puppet`, so that I can launch a master without becoming root. PUP-3258 used some of both.
The issue caused `puppet apply` to make inappropriate use of the values from the `production` environment for some reason. Having a simple Puppet environment prepared in `~/.puppet` comes in handy for this kind of occasion as well.
Will’s scenario included
- a module
- an ENC script
all reasonably stripped down to some very basic content that demonstrated the issue. He triggered it with a rather complex invocation:
```
/usr/bin/sudo /usr/bin/puppet apply \
    --node_terminus exec --external_nodes /home/partain/t/puppet-vapply-bug/enc \
    --modulepath /home/partain/t/puppet-vapply-bug/modules:/etc/puppet/modules \
    --parser future --test --diff_args=-U1 --noop \
    /home/partain/t/puppet-vapply-bug/site.pp
```
I would ignore `sudo`, of course, but what interested me most was whether I could get rid of the module and the `modulepath` option. It was simple - the example ENC returned a `classes` value of `[ foo ]`, so by changing the YAML to an empty class list, I could just avoid including the module, and shed the need for a special `modulepath`. The bug remained reproducible.
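For illustration, an ENC is just an executable that receives a node name and prints a YAML node description on stdout. A minimal stand-in for the stripped-down script could look like this (a sketch of mine, not Will's actual ENC):

```
#!/usr/bin/env ruby
# Hypothetical stand-in for the stripped-down ENC script, not Will's
# original: emit a YAML node description with an empty class list, so
# that no module is required at all.
require 'yaml'

_node_name = ARGV[0]  # the requested node name; ignored in this sketch
puts YAML.dump('classes' => [])
```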
My first hope was that `--parser future` was causing trouble. That would have narrowed down the possible causes nicely. But removing it didn't change the faulty behavior. I was also reasonably certain that `--test` and `--noop` (a combination that makes little sense anyway) had nothing to do with the problem. And sure enough, removing those options had no effect. What was necessary to keep reproducing the issue were the options for external nodes.
```
bundle exec puppet apply /tmp/site.pp --node_terminus exec --external_nodes /tmp/enc
```
That’s manageable - now the search could begin.
Locating the regression
The issue was known to have occurred at some point in between the 3.6.1 and 3.7.0 releases. First order of business: verify that the current master was still affected. It was.
Since I had no clear idea of where to start, and had a very direct way of reproducing the issue, things called for a `git` bisection. The best way to run the code from your `git` checkout is via `bundle exec`, but there was a slight complication here. The 3.6 and 3.7 tags lock the bundle to different `rspec` versions. So when the bundle is in shape to run one version, it will refuse to run the other. This is especially annoying, seeing as `rspec` is all about testing, and not about running the application at all.
But no big deal - it's just important to make sure that `bundler` does not skew the result of the bisection test. This can be achieved by simply calling `bundle update` on each step. I used the following scriptlet:
```
#!/bin/sh
bundle update
bundle exec puppet apply /tmp/site.pp \
    --node_terminus exec --external_nodes /tmp/enc \
    | grep -q '/tmp/site.pp is in use'
```
The `grep` was effective because the correct manifest was prepared with output to `grep` for, using a `notify` resource along the lines of `notify { '/tmp/site.pp is in use': }`. Since `grep -q` sets the exit code, `git bisect run` can tell good revisions from bad ones directly. Happy bisecting!
```
git bisect good 3.6.1
git bisect bad 3.7.0
git bisect run /tmp/bisect-test
```
Yes, this took a while, what with phoning home to rubygems.org on each run (and mostly just accepting the present bundle anyway). Still, as expected, it unearthed the apparently innocuous change that broke this scenario:

```
(PUP-2519) Create fallback environment only for default production
```
To verify that the identified patch was indeed all that was wrong with 3.7 and above, I like to try reverting it on master. This proved quite difficult. The suspect patch concerned directory environments, and those have seen quite some stitching prior to the 3.7.0 release. Reverting the change resulted in conflicts in almost all affected files. Without an understanding of how any of that code worked, it would have been way too much work to resolve all of that.
So I scrapped that idea and checked out the offending revision again. Time to take a hard look at what was happening there.
Finding a remedy
Full disclosure: I had no idea how legacy environments worked code-wise, much less the shiny new directory environments. What's more, I had never traced the code that makes `puppet apply` work. As such, I had a hard time figuring out the intent of the code changes in question.
Next easiest step: revert each change in the offending commit individually, and see if one of them was solely responsible for the problem. It turned out to be a single line in `Puppet::Application::Apply`:
```
- Puppet.override(:environments => Puppet::Environments::Static.new(apply_environment)) do
+ Puppet.override(:current_environment => apply_environment) do
```
Still no idea what was going on there. So I first took a look at the `Environments::Static` class definition. It is relatively simple and allows the creation of a set of environments.
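From my reading, it wraps a fixed set of environment objects and answers lookups only from that set. A hedged sketch of its behavior, based on the Puppet 3.7-era API (details may differ):

```
require 'puppet'

# Create an environment object and a Static collection that contains
# nothing but this one environment.
special = Puppet::Node::Environment.create(:special, [], '/tmp/site.pp')
static  = Puppet::Environments::Static.new(special)

static.get(:special)     # => the environment object created above
static.get(:production)  # => nil, not part of the static set
```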
I then got curious what this `apply_environment` was all about, anyway. The answer is right above its use in the code: `apply_environment` is a pragmatic approach to making `puppet apply` work the way it does. It is basically the current environment, with the site manifest overridden by the manifest that was specified on the command line.
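For context, `Puppet.override` binds values for the duration of a block, and `Puppet.lookup` reads them back. A hedged sketch of the pattern with an illustrative environment (not the actual apply code):

```
require 'puppet'

# An environment whose manifest points at the command-line file; the
# name mirrors the current environment (typically "production").
apply_environment = Puppet::Node::Environment.create(:production, [], '/tmp/site.pp')

# Everything inside the block sees this as the current environment.
Puppet.override(:current_environment => apply_environment) do
  Puppet.lookup(:current_environment).manifest  # => "/tmp/site.pp"
end
```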
The next thing I wanted to know was how the whole environment lookup process worked. I'm not big on the Ruby debugger (even though `pry` is pretty neat), and my first try is often good old `printf` debugging. Sorry, `puts` debugging.
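The technique is as crude as it sounds. A generic illustration (not actual Puppet code) of instrumenting a setter to show incoming values along with the call stack:

```
# Generic puts debugging, not actual Puppet code: report what a setter
# receives and who called it.
class Node
  def environment=(value)
    puts "environment= received #{value.inspect} (#{value.class})"
    puts caller.join("\n")  # the current call stack, one frame per line
    @environment = value
  end
end

Node.new.environment = 'production'
```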
Since the issue revolved around nodes somehow (the `exec` node terminus caused the problem, after all), I wanted to find out how the `Node` class interacted with environments. I added output to `Node#environment=` to see what values it was receiving. While at it, I also pulled a stack trace using `puts caller`. This is almost always enlightening, since it helps in understanding how several pieces of code work together, especially when you're not familiar with most of them. Comparing the outputs revealed two things:
- When using the ENC script, the caller is `lib/puppet/indirector/node/exec.rb`, which is otherwise not involved at all.
- In that case, it passes a value of `"production"` - the name of the environment rather than an environment object.
I was still unsure what I was looking at, but the pieces were almost complete. The `exec` variant of the `node` indirector terminus (Masterzen has a great article for understanding what the indirector is all about) sets the environment if none is returned from the data source. Something about this call was wrong, apparently. I still had no idea why the override in `Application::Apply` had any relation to this call, but it looked like a good clue.
I compared the `exec` terminus to the way other termini handle this, and immediately found my answer in the LDAP terminus. Digging through its history, it became obvious that this handling was also quite recent. My hunch that sending the environment's name (or string representation) to the node was wrong seemed increasingly likely. The fix was trivial, of course.
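In diff terms, the shape of such a fix is along these lines - a hedged reconstruction from my side, not the verbatim patch:

```
- node.environment = Puppet.lookup(:current_environment).name
+ node.environment = Puppet.lookup(:current_environment)
```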
So what had been happening? The `exec` terminus initialized the node object with the environment `"production"` - the name of the current environment (which was actually the specially prepared `apply_environment`). This caused a lookup of `"production"` among the available environments, which returned the unaltered `production` environment, so that the overridden `manifest` setting for `puppet apply` was short-circuited. Passing the actual environment object would retain the override. The issue had cropped up because `Environments::Static` had previously hidden all other environments away, so that the lookup of `"production"`, or whatever the environment was called, still did the right thing.
When I realized this, I grew suspicious that we still might have a problem if this default case was not hit. In other words, I worried that an ENC script that did actually return an environment name would still trigger the issue. This could be confirmed quickly, but would have to wait for yet another ticket.
Creating a test
As so often, the greater challenge was to devise a test that would keep this regression from cropping up again by indicating it through a failure. I looked at the unit tests for the `exec` node terminus, which already cover some scenarios concerning environments. But I wasn't really sure what I would want to test here. The problem was that a symbol or string was being sent to `Node#environment=`, but I couldn't immediately find an expectation that was broken by the bug. I also felt that such a check wouldn't adequately represent the problem we had faced here.
So I took a look at the integration tests for `puppet apply`, and found a test that was pretty close to what I had in mind. I cloned it and extended it to include two manifests and an ENC script, just like my manual test.
I got my extended test to work quickly enough, but then spent quite a while trying to make it actually use my prepared site manifest while the bug was still present. The key was to manipulate the settings for the test to include the `node_terminus` and `external_nodes` options.
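In terms of settings, that boils down to something like the following in the test setup; this is illustrative, not the committed test code:

```
# Illustrative only: point the apply run at the ENC fixture, the same
# way the manual reproduction did on the command line.
Puppet[:node_terminus]  = 'exec'
Puppet[:external_nodes] = '/tmp/enc'  # path to the test's ENC script
```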
I never really saw a use for ENCs (and honestly, still don’t :-) and hence hadn’t ever bothered looking at the code. I would have never grabbed this bug if I hadn’t gotten involved in that mailing list thread. It’s always a good idea to stay in touch with our peers in the user base. It’s not the first time that I learned something new this way. It was especially refreshing to finally poke around in environments and the work that has gone into the new (and awesome) directory environments.