Many users (myself included) would like to specify source => http://...
for file
resources in Puppet manifests. It turns out that the existing
infrastructure in the Puppet core makes this quite easy to implement.
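In manifest terms, the goal is to be able to write something like the following sketch (the URL is invented for illustration):

file { '/etc/motd':
  ensure => file,
  source => 'http://files.example.net/motd/default.txt',
}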
File serving in Puppet
Among the most basic functions of any configuration management system is the central maintenance and programmatic distribution of various configuration files. Data files such as binary applications or tarballs are frequently managed as well. Puppet is no exception to this rule.
The backbone of Puppet’s management paradigm is discrete resources with
current and desired states. A convenient way to describe a file’s state
is to provide the Puppet master with an actual file that the agent should mirror exactly
to the system it manages. In this mode of file management, the user will supply a URI to point the agent
to the prepared file. The puppet://
URI scheme is prevalent. Its semantics depend on the mode of Puppet operation.
- URLs that include a host name, such as puppet://server/path/to/file, will always make Puppet look up the target state on the named Puppet master, through Puppet’s fileserver API (see the sketch after this list)
- URLs without a host name encountered by the Puppet agent locate the source file on the master that generated the current catalog
- URLs without a host name that are passed to puppet apply lead to source files being retrieved from the local configuration repository
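To make the first two forms concrete, here is a sketch with an invented server name and module: the first resource names the fileserver host explicitly, while the second leaves the server unspecified, so the agent falls back to the master that produced its catalog.

file { '/etc/ntp.conf':
  source => 'puppet://puppet.example.com/modules/ntp/ntp.conf',
}

file { '/etc/ntp.conf':
  source => 'puppet:///modules/ntp/ntp.conf',
}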
So far, the only supported alternative URI scheme has been file://.
A resource that uses such a file source makes Puppet sync the managed
file with another file from the local filesystem tree. Such URIs can
also be written as plain file paths. In other words, the following are
synonymous.
file { '/etc/motd':
  source => "file:///var/shared/motds/$fqdn"
}

file { '/etc/motd':
  source => "/var/shared/motds/$fqdn"
}
…with the latter being more common.
As an aside, it would appear that historically, the only available form of
management was the content property, and that the source parameter was
added as an afterthought.
This is what Charlie Sharpsteen concluded upon
researching a bug that is
actually closely related to the feature I’m going to discuss in this article.
More on this story as it develops. Now back to the topic at hand.
Metadata
For efficiency, the process of fileserving is split into two discrete phases. For each file that is synchronized from the server, Puppet will first retrieve its metadata. This puts much less strain on the network than the actual download of the file.
Metadata describes each file in terms of
- owner
- group
- mode (permissions)
- content checksum
- type
The checksum is usually created using the MD5 algorithm. At the time of writing, the aforementioned bug still prevents all other available algorithms from actually working.
The path to the managed file on the agent system is also part of the metadata as used by the agent. However, it is not received from the fileserving component of the Puppet master. This value is defined in the manifest and included in the catalog by the compiler.
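As a hedged illustration (module name and path are invented), the resource below manages only the path, the source and the checksum type; the attributes it leaves out match the metadata fields listed above and can be filled in from what the master serves for the source file.

file { '/etc/motd':
  ensure   => file,
  source   => 'puppet:///modules/motd/motd.txt',
  checksum => 'md5',  # the default; per the bug mentioned above, other algorithms are unreliable with a file source
  # owner, group and mode are left out here, so the agent can fall back to
  # the values delivered in the source file's metadata
}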
Local and remote sources
The agent always uses Puppet’s indirector framework to receive metadata
for file
resources that make use of the source
attribute. If you would like
to learn more about the indirector, I recommend Masterzen’s
excellent post
on the topic. In short, the
indirector models arbitrary data paths for all parts of Puppet to get information
from various sources.
Once the metadata has been received and compared to the filesystem, Puppet might decide that a sync operation is necessary. In this case, it will also request the actual file content. This is done via indirector as well.
If the source
is specified as a URL with the puppet://
scheme, both indirector calls
map quite directly to the REST API. The desired data is handed out by the master,
or whatever alternate fileserver was specified, in the server response.
For sources
that use URLs with the file://
scheme or just a filesystem path,
the indirector terminus will not engage in any network communication at all.
It can gather the required metadata information through the stat
system call and the
respective checksumming function.
If a sync is necessary, file contents are copied locally.
The indirector picks the appropriate terminus according to the URI scheme.
HTTP servers as remote sources
In the introduction, I claimed that it was pretty simple to add HTTP support
to the file source
attribute. The reason for that is the indirector structure
that I just described. Basically, all it takes is adding a couple of termini, right? Right.
Where to start? First things first: Puppet needs file metadata to determine whether the local file needs to be synchronized to the server’s data. Common HTTP servers such as Nginx or Apache will not offer the kind of information that the Puppet agent usually manages. File ownership in terms of users and groups is of no concern to most HTTP clients, nor are permissions or the content checksum.
HTTP resources do carry some meta-information in their headers, such as the
last modification time or the file size.
The headers can be retrieved using the HTTP HEAD
request. So that’s what
the indirector can do to receive a limited set of metadata. Unsupported fields
will have to remain unmanaged by the source
parameter - if the manifest
does not specify an owner or mode, then the agent will not care about those
attributes of the local file. Defaults will be used for newly created files.
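In practice, a manifest that uses an HTTP source will therefore typically spell out the attributes it cares about itself, as in this hedged sketch (the URL is invented):

file { '/etc/motd':
  ensure => file,
  owner  => 'root',
  group  => 'root',
  mode   => '0644',
  source => 'http://files.example.net/motd/default.txt',
}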
File content is available as part of the response’s body, requested through
the HTTP GET
method. This is all that the indirector really needs to do
in order to fetch data for the agent to synchronize.
Limitations of the protocol
The previous section already described how most of the relevant file attributes are not available via HTTP. There are some other implicit limitations that arise from the protocol.
HTTP resources can only ever be mapped to plain files. Symbolic links cannot be sensibly served. The closest equivalent in the protocol is a redirect, but that is not suitable for representing an actual link, because the user would expect Puppet to follow all HTTP redirects. After all, a server-side reorganization can always make it sensible to add new redirects, and clients should just cope with that.
Directories can be served in the form of the respective index. It is even feasible to implement a spider algorithm in Puppet to allow the recursive synchronization of a whole directory tree to the agent system. But I decided to put this beyond the scope of the initial implementation. It can be the subject of another feature request down the line. As it stands, Puppet will consider a directory index as text file content, which is perfectly acceptable for the time being, I feel.
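As a hedged sketch (the URL is invented), pointing a file resource at a directory URL would therefore simply store whatever index document the server returns:

file { '/tmp/packages-index.html':
  ensure => file,
  source => 'http://files.example.net/packages/',  # the served directory index becomes the file content
}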
Also on the bright side, the protocol does specify a checksum header (Content-MD5) that can be used to allow actual synchronization of content based on this hash value. Thanks to Ken Barber for pointing this out to me. It is not mandatory for servers to include this header, and most will not bother. But users who set up servers for the express purpose of serving content to Puppet agents will want to enable it in order to enrich the available file metadata.
Conclusion
Puppet’s own facilities are quite fit for the addition of HTTP as an available protocol
for file synchronization. The nature of the protocol imposes some limits on the
number of available file attributes. Implementing a simplified synchronization algorithm based on the mtime should be straightforward.
Seeing as the theoretical background ended up being quite a mouthful, I’m postponing the overview of the implementation details to a later post. See you there!