In the previous post I ran into an issue with the YAML example graphs for the etcd functionality not really working anymore. Coming up with a solution in these graphs was not hard (and in fact proved an opportunity to learn a few things along the way). However, I did not manage to get the cluster up and talking regardless.
I accepted the challenge and sat down to try and find out what was wrong with etcd.
A first look at the code
The embedded etcd code lives in the etcd subdirectory of the mgmt source code. In etcd/etcd.go, the EmbdEtcd struct represents the “embedded server and client etcd”. This file imports the “clientv3” package from etcd as etcd. As the first errors I had encountered seemed to originate with the client, I looked at this first.
The online documentation for the clientv3 package is quite good, complete with a simple code example and everything. Matching this with the actual implementation in mgmt is rather daunting, however. Still, it gave me the idea of trying to build a very simple client to mgmt’s server, so that I would have a more limited code base to pry open.
Better yet, I recalled that the post that introduced the embedded etcd mentions that you can run simple tests using etcdctl, such as:
ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 member list
New step 1: See if we can talk to the embedded server this way in the first place.
Etsy Dee Cuddle
In order to get a suitably modern etcdctl, I followed the instructions in the article linked just above. In fact, mgmt bundles an appropriate version of etcd in its vendor directory. I’m not entirely certain that switching into the vendor/go.etcd.io/etcd directory (a git submodule) is an acceptable way to run the build, but it sure worked.
My approach was to run only the original seed server of mgmt, in order to target that with a simple query from etcdctl, connecting to the API port of the embedded etcd server.
mgmt run --hostname h1 --ideal-cluster-size 3 yaml examples/yaml/etcd1a.yaml
I found myself a little surprised that etcdctl noped out with a message that was quite reminiscent of what I had seen mgmt do on repeat while trying to contact the original cluster member.
ETCDCTL_API=3 ./vendor/go.etcd.io/etcd/bin/etcdctl --endpoints 127.0.0.1:2380 member list
{"level":"warn","ts":"2020-03-08T19:25:00.394Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-f55a7a6d-3208-45ef-9a7a-f28f55805a05/127.0.0.1:2380","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2380: connect: connection refused\""}
Error: context deadline exceeded
This was promising insofar as it suggested to me that the problem was probably with the original seed server after all, not with the second mgmt instance that tries to join the cluster. Sure enough, when examining the output of the seed process, it turned out that it was failing the embedded server start after the 60 second timeout now.
Imitating the server
In order to take a better look at what was going on with the seed server, I searched for one of its log messages, the one appearing before it ran into the timeout failure.
git grep "server: starting"
docs/faq.md:### On startup `mgmt` hangs after: `etcd: server: starting...`.
docs/faq.md:etcd: server: starting...
etcd/server.go: obj.Logf("server: starting...")
So this tells us two things. For one, this is the “server” component from etcd/server.go. This is where I started reading code, because I had totally missed the other point: There is already an FAQ entry for this current problem. It advises that the local state directory of the embedded etcd server is probably corrupted. This seems likely enough; after all, my first attempt did not go quite as disastrously.
This gives me an idea, but allow me to first regale you with the account of my immediate investigation of what was going on. I first went and studied the implementation in said server.go file. It uses the embed package from etcd. The documentation for this package comes with some sample code to spin up the most simplistic embedded etcd server. Lo and behold, all the parts of this code sample can be located in server.go in mgmt.
So why does it not start? (Mind you, at this point I had not seen the FAQ, and was not yet on to the local state directory as the likely culprit.) My plan was simple: Build and run the example code from the etcd documentation, and then make it resemble mgmt’s implementation more and more. See where it breaks.
To get the embedded server to run, I copied the example code to a file “demo_etcd.go” in the root directory of the mgmt code. I added a package main line at the top and ran it through go run.
The example itself worked quite well. The embedded server came up and talked to etcdctl without an issue. I managed to replicate some of mgmt’s configuration settings as well, but not all of them. As I found myself printing the entire cfg structure from both mgmt and my example, and comparing them, one particular parameter caught my eye. I made an adjustment.
- cfg.LogLevel = "error" // keep things quieter for now
+ cfg.LogLevel = "info"
I had marveled at how much less noisy mgmt 0.21 felt compared to the earlier releases I was used to. I suspect that this setting plays a big role. Changing this gave me a lot more output to work with, and I finally sat back a little and wrote this blog post.
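Comparing two printed configuration structures by eye gets tedious quickly. A small reflection-based helper can list only the fields that differ. The following is a sketch under assumptions: demoConfig is a made-up stand-in rather than etcd's real configuration type, its Name and Dir values are purely illustrative, and only the LogLevel values are taken from the adjustment shown above.

```go
package main

import (
	"fmt"
	"reflect"
)

// demoConfig is a hypothetical stand-in for a server configuration.
type demoConfig struct {
	Name     string
	Dir      string
	LogLevel string
}

// diffFields returns one line per exported field whose value differs
// between a and b. Both arguments must be structs of the same type;
// otherwise it returns nil.
func diffFields(a, b interface{}) []string {
	va, vb := reflect.ValueOf(a), reflect.ValueOf(b)
	if va.Kind() != reflect.Struct || vb.Kind() != reflect.Struct || va.Type() != vb.Type() {
		return nil
	}
	var out []string
	t := va.Type()
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if f.PkgPath != "" {
			continue // unexported field: its value cannot be read via Interface()
		}
		x, y := va.Field(i).Interface(), vb.Field(i).Interface()
		if !reflect.DeepEqual(x, y) {
			out = append(out, fmt.Sprintf("%s: %v != %v", f.Name, x, y))
		}
	}
	return out
}

func main() {
	// Illustrative values; only the two LogLevel settings come from the post.
	quiet := demoConfig{Name: "h1", Dir: "/tmp/demo.etcd", LogLevel: "error"}
	chatty := demoConfig{Name: "h1", Dir: "/tmp/demo.etcd", LogLevel: "info"}

	for _, line := range diffFields(quiet, chatty) {
		fmt.Println(line) // prints "LogLevel: error != info"
	}
}
```

The helper only compares top-level fields, which is usually enough to spot a single differing setting; nested structs are compared as a whole via reflect.DeepEqual.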
Wrapping up
I have not yet had a chance to take a good look at the output generated from the etcd server. However, while producing this write-up, I had a new thought that might explain what went wrong here, and why the embedded server is now permanently broken for me.
This post is already fairly long and pretty rambly. I will stop here and explore my new theory in yet another post in the near future.