Sudden breakage of puppet apply for some users
Things started harmlessly, with yet another mailing list post.
A user was reporting strange behavior of puppet apply
.
The resource from the applied manifest was being ignored, apparently.
Misunderstandings about Puppet behaving in an unexpected fashion are commonplace. But after exchanging some debugging instructions and further issues, it soon became clear that something was going seriously wrong. Will Partain then notified me about PUP-3258 off-list. My curiosity was piqued.
Dissecting the report
Will had left a great bug report! It was a little thin on the description, but came with a complete lab environment for reproducing it. This was especially valuable to me, having never used an ENC of any kind, much less an ENC script. The included README file did a good job of describing what was happening, and what should have been happening instead.
My preference for trying buggy manifests is to place them somewhere in /tmp
and using puppet apply
on them directly.
Things that are directly involved with master operations, I put in ~/.puppet
, so that I can just launch a WEBrick
master
without becoming root. PUP-3258 used some of both.
The issue caused puppet apply
to make inappropriate use of the values from puppet.conf
for
some reason. Having a simple Puppet environment prepared in ~/.puppet
comes in handy for this kind of occasion as well.
Will’s scenario included
- a
site.pp
file - a module
- an ENC script
all reasonably stripped down to some very basic content that demonstrated the issue. He triggered it with a rather complex invocation.
/usr/bin/sudo /usr/bin/puppet apply \
--node_terminus exec --external_nodes /home/partain/t/puppet-vapply-bug/enc \
--modulepath /home/partain/t/puppet-vapply-bug/modules:/etc/puppet/modules \
--parser future --test --diff_args=-U1 --noop \
/home/partain/t/puppet-vapply-bug/site.pp
I would ignore sudo
of course, but what interested me most was whether I could get rid of the module and --modulepath
flag.
It was simple - the example ENC returned a classes value of [ foo ]
, so by changing the YAML to
classes: []
I could just avoid including the module, and shed the need for a special modulepath
. The bug remained reproducible.
My first hope was that --parser future
was causing trouble. That would have narrowed down the possible causes nicely.
But removing it didn’t change the faulty behavior. I was also reasonably certain that --test
(not sensible with apply
anyway), --diff_args
and --noop
had nothing to do with the problem. And sure enough, removing those options had no
effect. What was necessary to keep reproducing the issue were the options for external nodes.
bundle exec puppet apply /tmp/site.pp --node_terminus exec --external_nodes /tmp/enc
That’s manageable - now the search could begin.
Locating the regression
The issue was known to have occurred at some point in between the 3.6.1
and 3.7
releases.
First point of order: Verify that current master
is still affected. It was.
Since I had no clear idea of where to start, and had a very direct way of reproducing, things
called for a git
bisection. The best way to run the code from your git
checkout is via bundler
,
but there was a slight complication here. The 3.6
and 3.7
tags locked the rspec
gem
into different versions. So when the bundle is in shape to run one version, it will refuse to run the other.
This is especially annoying, seeing as rspec
is all about testing, and not about running the application at all.
But no big deal - it’s just important to make sure that bundler
does not skew the result of the
bisection test. This can be achieved by simply calling bundle update
on each step. I used the following scriptlet.
#!/bin/sh
bundle update
bundle exec puppet apply /tmp/site.pp \
--node_terminus exec --external_nodes /tmp/enc \
| grep -q '/tmp/site.pp is in use'
The grep
was effective because the correct manifest was prepared with the output to grep
for (using a
notify
resource). Happy bisecting!
git bisect good 3.6.1
git bisect bad 3.7.0
git bisect run /tmp/bisect-test
Yes, this took a while, what with phoning home to rubygems.org on each run (and mostly just accepting the present bundle anyway). Still, as expected, it unearthed the apparently innocuous change that broke this scenario.
(PUP-2519) Create fallback environment only for default production
To verify that the identified patch was indeed all that was wrong with 3.7
and above, I like to try and
revert it on master
. This proved quite difficult. The suspect patch concerned directory environments,
and those have seen quite some stitching prior to the 3.7.0
release. Reverting my change resulted in
conflicts in almost all affected files. Without an understanding of how any of that code worked, it
would have been way too much work to resolve all of that.
So I scrapped that idea and checkout out the offending revision again. Time for some sharp looking at what was happening there.
Finding a remedy
Full disclosure, I have no idea how legacy environments worked code-wise, much less the shiny new
directory environments. What’s more, I had never traced the code that makes puppet apply
work.
As such, I had a hard time figuring out the intent of the code changes in question.
Next easiest step: Revert each patch in the offending commit individually, and see if one of them is solely
responsible for the problem. It turned out to be
a single line in lib/application/apply.rb
.
- Puppet.override(:environments => Puppet::Environments::Static.new(apply_environment)) do
+ Puppet.override(:current_environment => apply_environment) do
Still no idea what was going on there. So I first took a look at the Environments::Static
class definition.
It is relatively simple and allows the creation of a set of environments.
I then got curious what this apply_environment
was all about, anyway.
The answer is right above its use in the code.
The apply_environment
is a pragmatic approach to making puppet apply
work the way it does:
It is basically the current environment with the site manifest overridden
to the manifest that was specified on the command line. The next thing I wanted to know was
how the whole environment lookup process worked. I’m not big on the Ruby debugger (even though
pry is pretty neat), and my first try is often good old printf
debugging. Sorry, puts debugging.
Since the issue revolved around nodes
somehow (the exec
node terminus caused the problem, after all),
I wanted to find out how the Node
class interacted with environments. I added output to Node#environment=
to see what values it was receiving. While at it, I also pulled a respective stacktrace using Kernel#caller
.
This is almost always enlightening, since it helps understanding the way several code pieces work together, especially
when you’re not familiar with any or most of them. Comparing outputs, this revealed two things:
- The caller is
lib/puppet/indirector/node/exec.rb
when using the ENC script, as opposed toplain.rb
otherwise. - In the former case, it passes a value of
:production
instead ofproduction
.
I was still unsure what I was looking at, but the pieces were almost complete. The exec
variant of the
node
indirector terminus (Masterzen has a great article
for understanding what the indirector is all about) sets the environment if none is returned from the
data source. Something about this call was wrong, apparently. I still had no idea why the override in
Application::Apply
had any relation to this call, but it looked like a good clue.
I compared the exec
terminus to the way that other termini handle this, and immediately found my answer
in the LDAP terminus.
Digging through its history, it became obvious that his was also quite recent.
My hunch that sending the environments name (or string representation) to the node was wrong seemed increasingly likely.
The fix was trivial of course.
So what had been happening? The exec
terminus initialized the node
object with the environment "production"
, the name
of the current environment (which was actually the specially prepared apply_environment
). This caused
a lookup
among the available environments. This returned the unaltered production
environment, so that the override
of the manifest
setting for puppet apply
was short-circuited. Passing the actual environment
object would retain
the override. The issue had cropped up because Environments::Static
had previously hidden all environments away,
so that the lookup of "production"
, or whatever environment, still did the right thing.
When I realized this, I grew suspicious that we still might have a problem if this default case was not hit. In other words, I worried that an ENC script that did actually return an environment name would still trigger the issue. This could be confirmed quickly, but would have to wait for yet another ticket.
Creating a test
As so often, the greater challenge was to devise a test that would keep this regression from cropping up again
by indicating it through a failure. I looked at the unit tests
for the exec
node terminus, which does already cover some scenarios concerning environments.
But I wasn’t really sure what I would want to test here. The problem was that a symbol or string was being sent
to Node#environment=
but I couldn’t immediately find an expectation that was broken by the bug.
I also felt that this check wouldn’t adequately represent the problem we had faced here.
So I took a look at the integration tests for puppet apply
, and found a test that was
pretty close to what I had in mind.
I cloned it and extended it to include two manifests and an ENC script, just like my manual test.
I got my extended test to work quickly enough, but then spent quite a while trying
to make it actually use my prepared site manifest while the bug was still present. The key was
to manipulate the settings for the test to include the manifest
setting.
Final thoughts
I never really saw a use for ENCs (and honestly, still don’t :-) and hence hadn’t ever bothered looking at the code. I would have never grabbed this bug if I hadn’t gotten involved in that mailing list thread. It’s always a good idea to stay in touch with our peers in the user base. It’s not the first time that I learned something new this way. It was especially refreshing to finally poke around in environments and the work that has gone into the new (and awesome) directory environments.