03 Oct 2020 docker managing delegating case-study

What is the problem?

Starting in September 2020, my team (along with my peer teams) needed to migrate to a new remote Docker registry. The team in charge of the registries was migrating to save costs (more than 10x savings!) as well as to increase throughput by running in the same region as our other software. Great idea!

The registry team gave lots of great advice and thorough documentation: old and new URIs, the new authentication mechanism, the timeline, and so on.

The timeline looked like this:

  • Sep 1: New registry available
  • Oct 1: Old registry becomes read-only
  • Nov 1: Old registry is shut off

How do builds currently work?

When you look at a problem like this on the surface, you might think this is all that needs doing:

  • Update each git repo to push to and pull from the new registry

But when you start to take a closer look at the problem, there is a lot more to it. In order to figure that out, you need to know more about the ecosystem.

  • The overall team needs to update around 10 git repos that consume or produce docker images
  • There are dependencies between git repos: some repos produce images used by other repos
  • Some git repos have circular dependencies (whoops!): one git repo had a dependency on an older tag of a docker image that was produced by the same git repo

We can validate that the migration is a success when both

  • Our cloud software can be built and deployed using the new registry
  • Our on-premise software can be built and packaged using the new registry

What do we need to do?

So with that, you start to realize there is a decent amount of work.

  • Because there is a graph of dependencies between git repos, some repositories must be updated first to push to the new docker registry before other repos can pull from the new registry
  • This graph is also not a DAG, because we knew of some circular dependencies. Luckily this isn’t as bad as it could be, since the cycles only involve old docker image tags, so it should not result in a chicken-and-egg problem.
  • But wait, to solve that, how do we migrate old docker image tags? If you update a git repo to push to the new docker registry, only image tags built from then on will be pushed. You need a way to migrate old tags without migrating absolutely every docker image.
  • Since the dependency graph forces an ordering, how do we parallelize the effort in order to finish within the deadline? We could update some git repos to push to the new registry, but continue to read from the old registry until all of the repos they depend on are pushing to the new registry.
  • That also means we cannot switch to pushing exclusively to the new registry. We need to push to both old and new docker registries for a while, or else downstream builds will break when they need a new image that is only being pushed to the new docker registry (see the sketch after this list).
  • That also means we need to stop pushing to the old docker registry before it goes read-only. If we don’t, any build that attempts to push to the old registry will start to fail on October 1, when pushes are no longer allowed.
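
To make the “push to both” step concrete, here is a minimal sketch of what the build-script change might look like, assuming hypothetical registry hostnames (old-registry.example.com and new-registry.example.com) and image names rather than our real ones:

# after building the image, tag it for both registries and push both tags
docker tag myservice:${VERSION} old-registry.example.com/team/myservice:${VERSION}
docker tag myservice:${VERSION} new-registry.example.com/team/myservice:${VERSION}
docker push old-registry.example.com/team/myservice:${VERSION}
docker push new-registry.example.com/team/myservice:${VERSION}

Once the old registry is no longer needed, the old-registry lines simply go away.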

How does a team actually accomplish this?

It was my job, as a manager on the team, to help the team work through how to do this. We decided to optimize for:

  • parallelizing the work
  • individual teams unblocking themselves as quickly as possible

So to kick off this work, we

  • Assigned ownership of each git repo to one team
  • Teams identified if their repo had a dependency on pushing images, pulling images, or both
  • Teams identified which other repos (and thus which teams) blocked their migration
  • Teams would migrate repositories to first push to both old and new registries
  • Once all of the repos a repository depends on are pushing to the new registry, that repository can migrate to pull from the new registry
  • We found one owner (on my team) to centrally migrate any old images
  • Once both exit criteria are met (cloud and on-premise software can be built using the new registry), we can stop pushing to the old registry

What would you do differently next time?

Establish what success looks like earlier

It took some time to understand how we would know the migration was complete. Knowing that earlier makes it easier to explain the goal to the team, and to find an owner for that verification. For us, success looked like successful builds of our on-premise software, and successful deployments of our cloud software.

Write out what work is needed

Once we knew what it would look like to be finished, we could work backwards and define the tasks along with their ordered dependencies. One thing we all agreed would have been helpful was a diagram showing the dependencies.

It was about halfway through the project before we started doing this, and once we did, it was easy to communicate the work remaining and the order to tackle the work.

Have a kickoff meeting with everyone involved

It was apparent that not every team had the same context on the problem, the constraints, or the work required to accomplish the goals. A kickoff also means not every team has to reinvent this process for themselves. We tried to write it down for them, but having time to really cement the plan and take Q&A for anything we missed would have helped a lot.

Start on manual migration earlier

This would have unblocked a lot of repos whose only dependency on the old registry was on old tags of images. This also would have eliminated a step where some repos first push to the new registry, then later pull from the new registry.
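
The manual migration itself is mostly mechanical; the hard part was starting early and knowing which tags to copy. A rough sketch of the per-tag copy, again with placeholder hostnames, image names, and tags rather than our real ones:

# copy a few old tags from the old registry to the new one
for tag in 1.0.0 1.1.0 1.2.0; do
  docker pull old-registry.example.com/team/base-image:${tag}
  docker tag old-registry.example.com/team/base-image:${tag} \
             new-registry.example.com/team/base-image:${tag}
  docker push new-registry.example.com/team/base-image:${tag}
done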


09 Apr 2019 flink kubernetes

Last week, I had a great time attending Flink Forward SF 2019. I really liked the conference, since there were a lot of talks where I was able to take away actionable best practices from other professionals using Apache Flink.

And of course I’d be remiss if I didn’t mention I gave a talk myself (just check out the conference schedule!)

But I wanted to use this page for some highlights of what I saw and what I learned while I was there.

Lyft engineers gave a talk about a Kubernetes Operator used to launch Flink clusters as a single Kubernetes resource. The Kubernetes Operator framework is something open-sourced by the CoreOS team that builds on the base Kubernetes Custom Resource Definition (CRD).

Kubernetes CRDs allow users to define their own resources (like how pods or deployments are built-in resources), which gives users the power to do something like make a resource that understands when to scale itself, or when to start or stop itself.

Lyft’s Kubernetes Operator lets users define a “Flink Cluster” as a resource, which will spin up one or more high-availability Flink Job Manager pods, and one or more Flink Task Manager pods. Here are some of the other key points:

  • Lyft’s team creates one of these resources for every Flink job. This essentially makes the Kubernetes deployment work like a Hadoop YARN deployment.
  • The resource can be set up with…
    • its own IAM role to give IAM isolation between jobs
    • Flink image tag to give flexibility to run different versions of Flink per job. I do wonder how you handle writing client code to submit jobs against all the different versions of Flink. Notably I’m thinking of how Flink 1.6 changed the client code to submit jobs.
    • a specific parallelism value. This can be based on the workload of the job, or the number of Task Managers created
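
For anyone who has not worked with CRDs, creating one of these custom resources looks just like creating any built-in resource. The sketch below is purely illustrative: the kind, apiVersion, and field names are made up for the example and are not Lyft’s actual schema.

# create a made-up "FlinkCluster" custom resource with kubectl
kubectl apply -f - <<'EOF'
apiVersion: example.com/v1
kind: FlinkCluster
metadata:
  name: my-streaming-job
spec:
  flinkImage: flink:1.7.2   # per-job Flink version
  iamRole: my-job-role      # IAM isolation between jobs
  taskManagers: 4
  parallelism: 8
EOF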

Lyft plans to open-source this project at the end of April 2019.


05 Jun 2017 java

Update: This article also appears on the Rocana blog with much nicer formatting

I recently went through the fun (no, really!) task of ensuring a large codebase was able to run on Java 8. It had originally been written to work on Java 6 and Java 7, and Java 6 support was dropped before I started working with it. The transition to running on Java 8 is intended to be seamless, but it’s easy to have a codebase that accidentally relies on undefined JVM behavior.

HashSet and HashMap iteration order

The problem

HashMap (j7 j8) and HashSet (j7 j8) are commonly used classes in Java’s Collections API. HashSet explicitly states (in both Java 7 and 8 documentation) that it does not have a defined iteration order:

It makes no guarantees as to the iteration order of the set; in particular, it does not guarantee that the order will remain constant over time.

HashMap also states iteration order is undefined:

This class makes no guarantees as to the order of the map; in particular, it does not guarantee that the order will remain constant over time.

In Java 8, changes were made to HashMap and HashSet in JEP-180 to improve performance during high-collision scenarios. They explicitly state

This change will likely result in a change to the iteration order of the HashMap class. The HashMap specification explicitly makes no guarantee about iteration order. The iteration order of the LinkedHashMap class will be maintained.

So, any code relying on that iteration order ends up breaking between Java versions. For the most part, this reliance manifests when generating String representations of data structures. In other words, serializing JSON, or building SQL.

How to fix it

JSON

For JSON, the main solution is to perform equals comparison of marshalled JSON objects rather than the String representation.

  import java.io.IOException;

  import com.fasterxml.jackson.databind.JsonNode;
  import com.fasterxml.jackson.databind.ObjectMapper;
  import org.junit.Assert;

  public static void assertStringsEqual(String message, String expected, String actual) throws IOException {
    ObjectMapper objectMapper = new ObjectMapper();

    // marshall the JSON strings into JsonNode objects for comparison
    JsonNode expectedNode = objectMapper.readTree(expected);
    JsonNode actualNode = objectMapper.readTree(actual);

    // JsonNode equality compares the trees, so object key order no longer matters
    Assert.assertEquals(message, expectedNode, actualNode);
  }

This uses com.fasterxml.jackson.databind.JsonNode and com.fasterxml.jackson.databind.ObjectMapper from the com.fasterxml.jackson.core:jackson-databind Maven dependency.

Note that org.json’s JSONObject does not properly perform equality checks of JSON trees, so I don’t recommend using it for this purpose.

Another library I looked at was JSONassert, but it includes a duplicate class that is in org.json (JSONString). My project is configured to disallow duplicate classes, so this dependency would have been a hassle to bring in.

Other data types

These other sort-order-dependent bugs mostly came up when comparing expected SQL statements with constructed SQL. The actual SQL was ordered differently under Java 7 and Java 8, so I had to make the ordering consistent using these techniques.

  • Replace HashMap with LinkedHashMap, and HashSet with LinkedHashSet, to get a consistent (insertion-based) iteration order in both versions of Java.
  • If a class requires a specific iteration order, then the data structure should be sorted. This can be accomplished with either TreeSet, an implementation of SortedSet, which extends Set; or TreeMap, an implementation of SortedMap, which extends Map.

Building and Running tests with different Java versions

This is a pretty interesting one: my organization needed to continue building with JDK 7, but the tests also needed to pass when run with JDK 8.

If your project uses Maven, the Surefire plugin has the jvm configuration property for specifying a different JVM when running tests. Note that this property should point to the java executable, not the home directory for that JDK.

# on OSX
java8_path=/Library/Java/JavaVirtualMachines/jdk1.8.0_74.jdk/Contents/Home/bin/java
mvn clean test -Djvm=${java8_path}

Similarly, the Failsafe plugin also has the jvm configuration property for specifying a different JVM when running integration tests, like with mvn verify.
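
Since Failsafe honors the same jvm user property, pointing integration tests at JDK 8 looks nearly identical (reusing the java8_path from above):

mvn clean verify -Djvm=${java8_path}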

And that’s it! If you were to build with JDK 8 as well, it’s likely other issues would arise. Whenever I get around to making Java 8 the minimum version, I’ll probably write about the new issues we run into there.


18 May 2017 kerberos strace

tl;dr, turn off iptables on the KDC node. Of course it was iptables.

The full story

I had previously set up Kerberos and Hadoop manually and now had the task of at least semi-automating the process. I had successfully built automation around installing the KDC and creating the admin/admin principal, and was able to kinit as admin/admin on the KDC node.

When automating the creation of per-node principals, I consistently ran into this error:

kinit: Cannot contact any KDC for realm 'EXAMPLE.COM' while getting initial credentials

For reference, here’s the full output for successful and unsuccessful kinits.

# success on KDC node (hn0)
[matt@matt7-hn0 ~]$ kinit admin/admin
Password for admin/admin@EXAMPLE.COM:

[matt@matt7-hn0 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_531
Default principal: admin/admin@EXAMPLE.COM

Valid starting     Expires            Service principal
05/17/17 18:39:29  05/18/17 18:39:29  krbtgt/EXAMPLE.COM@EXAMPLE.COM
    renew until 05/24/17 18:39:29
# failure on client node (dn0)
[matt@matt7-dn0 ~]$ kinit -V admin/admin
Using default cache: /tmp/krb5cc_531
Using principal: admin/admin@EXAMPLE.COM
kinit: Cannot contact any KDC for realm 'EXAMPLE.COM' while getting initial credentials

Debugging

SSH failure?

Can the client not resolve the hostname of the KDC server? I was able to SSH from the client to the KDC server using the IP address, fully-qualified hostname, and short hostname. So, no problem resolving the address.

krb5.conf mismatch?

I already knew kinit worked on the KDC node, so maybe there was a mismatch in /etc/krb5.conf. I was able to verify the krb5.conf was identical on the two machines (just sha256sum each file), so it wasn’t that.

Some other network issue?

Using kinit with the verbose flag (-V) was not giving me any additional useful information, so I needed to add in some other tools. Since this is a CentOS 6.8 machine, let’s try strace!

# On client node, this returns immediately and fails.
# On KDC node, this waits for the password to be typed in (which I did), then succeeds.
strace kinit -V admin/admin &> out

I did not use any flags with strace because (a) I was not sure what I was searching for, and (b) honestly, I had forgotten the useful flags for strace since I last used it :)

If you want to know the awesome flags for strace, Julia Evans has a great zine about it.
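
For reference, the flags I probably wanted here are -f to follow child processes and -e trace=network to limit the output to network-related syscalls:

# only trace network syscalls (connect, sendto, recvfrom, ...) from kinit
strace -f -e trace=network kinit -V admin/admin &> out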

Using strace on kinit

What I was searching for were instances of the connect syscall reaching out to the IP of the KDC node.

# successful on KDC node (hn0)
connect(3, {sa_family=AF_INET, sin_port=htons(88), sin_addr=inet_addr("10.10.178.123")}, 16) = 0
sendto(3, "j\201\3110\201\306\241\3\2\1\5\242\3\2\1\n\243\0160\f0\n\241\4\2\2\0\225\242\2\4\0"..., 204, 0, NULL, 0) = 204
gettimeofday({1495043558, 931567}, NULL) = 0
gettimeofday({1495043558, 931608}, NULL) = 0
poll([{fd=3, events=POLLIN}], 1, 1000)  = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "k\202\2\3410\202\2\335\240\3\2\1\5\241\3\2\1\v\242\0260\0240\22\241\3\2\1\23\242\v\4"..., 4096, 0, NULL, NULL) = 741
close(3)                                = 0
# failure on client node (dn0) - see the EHOSTUNREACH
connect(3, {sa_family=AF_INET, sin_port=htons(88), sin_addr=inet_addr("10.10.178.123")}, 16) = 0
sendto(3, "j\201\3110\201\306\241\3\2\1\5\242\3\2\1\n\243\0160\f0\n\241\4\2\2\0\225\242\2\4\0"..., 204, 0, NULL, 0) = 204
gettimeofday({1495043522, 503784}, NULL) = 0
gettimeofday({1495043522, 503838}, NULL) = 0
poll([{fd=3, events=POLLIN}], 1, 1000)  = 1 ([{fd=3, revents=POLLERR}])
recvfrom(3, 0x7f54c19b3340, 4096, 0, 0, 0) = -1 EHOSTUNREACH (No route to host)
close(3)

Great! At this point I had clear evidence that the host was unreachable from the client. Thanks strace!

I was just lucky that my next guess was right: was iptables enabled on the KDC node? It was. Disabling iptables (sudo service iptables stop) allowed me to connect from the client node, success!

Checklist

This is the quick version of the debug steps I took to figure out the issue.

  • Can you SSH from the client to KDC node?
  • Are the /etc/krb5.conf files identical?
  • Is iptables on and set up with a port whitelist that does not include port 88?
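
If you would rather not turn iptables off entirely, the gentler fix is to open port 88 on the KDC node. A sketch for CentOS 6-style iptables (exact rule placement depends on your existing ruleset):

# allow Kerberos traffic on port 88 (UDP and TCP), then persist the rules
sudo iptables -I INPUT -p udp --dport 88 -j ACCEPT
sudo iptables -I INPUT -p tcp --dport 88 -j ACCEPT
sudo service iptables save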