Note Identifiers and Tests

Extended discussion on data representation.

My inclination, after getting the hello world project set up, is to start defining the major data representations of the domain we're working in. Since I'm documenting the process, and I want to discuss decisions, I'm going to start out pretty slowly: today's update is basically just a type for representing identifiers for "notes" (or whatever we end up calling the basic content building block).

Basic Definition and Data Access

The plan here is to expose an opaque type t for our identifier values (Identifier.t), and be explicit and deliberate about what data is exposed to the rest of the program. I liked how Beads stuck with an identifier format that's human readable, and likely prevalent in LLM training data, so we'll use the same format for our identifier: [a-z]{1,5}-[0-9]+ (e.g. kbase-12345). Decomposing that identifier gets us a namespace and raw_id.

lib/data/identifier.ml:

type t = {
  namespace : string;
  raw_id    : int;
}

let namespace { namespace; _ } = namespace
let raw_id    { raw_id;    _ } = raw_id
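To make the "opaque type" part concrete, here is a minimal sketch of how a signature hides the record from callers. In the project this would be the job of an identifier.mli file, which isn't shown here, so the signature below is an assumption, and `example` is a hypothetical value that exists only so the sketch has something to inspect.

```ocaml
module Identifier : sig
  type t                          (* abstract: callers cannot see the record fields *)
  val namespace : t -> string
  val raw_id : t -> int
  val example : t                 (* hypothetical, for illustration only *)
end = struct
  type t = { namespace : string; raw_id : int }

  let namespace { namespace; _ } = namespace
  let raw_id { raw_id; _ } = raw_id
  let example = { namespace = "kbase"; raw_id = 12345 }
end

let () =
  (* Data is only reachable through the accessors the signature exposes. *)
  assert (Identifier.namespace Identifier.example = "kbase");
  assert (Identifier.raw_id Identifier.example = 12345)
  (* Writing Identifier.example.raw_id here would not type-check. *)
```

Because the fields are hidden, the rest of the program also can't build a record literal directly, which is what lets a single construction function act as the gatekeeper.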

Next we'll need to write a function for constructing values of the type.

Aside: Correct Construction

I've come to practice a pretty rigid pattern by default (unless circumstances require otherwise) for modeling data in both my imperative language work life (Java/Kotlin/Python) and my FP language hobby life (OCaml/Haskell), that I sometimes call "Correct Construction":

  1. Encapsulate primitive types to avoid using them directly.
  2. Immutable values.
  3. Validate constituent data on construction.

I teach this practice to every junior engineer I work with, as it promotes a lot of healthy habits.

1 — Encapsulating primitive types discourages the Primitive Obsession code smell, and encourages the modeling of domain concepts using compiler-guaranteed types. Too many engineers get sucked into passing primitive data as multiple function arguments, never engaging with data hiding/sharing or attached semantics (every object is a DTO), or passing around the Map<String, List<Pair<Integer, DateTime>>> in all its inexpressiveness.

2 — Immutable values by default is the FP way, made super easy in OCaml, and an effective guard against a huge source of complexity and bugs: stateful values. Trying to trace state change through the lifetime of a value is awful, and going immutable makes it a non-issue. A lot of my work is with backend web services where every request is serviced in parallel — immutable data is simply superior to trying to deal with concurrent access of a stateful value.

3 — Lastly, another lesson from FP: represent concepts with strong types. When the language's type tools can't constrain the possible values any further, you constrain them at runtime. To do this you route all value construction through a single function which always checks if the input constituent data is within the valid subset of values, and explodes if not — encouraging the fail fast technique. We fail fast, rather than carrying around bugged, invalid data to explode far from the site of the problem; maintaining an invariant that all data is meaningful.

Under this practice, anywhere in the program you are handed some data, you know it is meaningful (and not bugged) because it can only be constructed in a semantics-upholding way and is immutable (guarding against invalidity-through-state-change). You also know you can only use the data in semantically meaningful ways, since it has been encapsulated (e.g. you can't access the time of a datetime without engaging with the timezone).
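As a small illustration of the three rules working together, here is a sketch using a hypothetical `Port` type. The module, its name, and its bounds are assumptions made for the example and are not part of this project.

```ocaml
module Port : sig
  type t
  val make : int -> t             (* the single construction path *)
  val to_int : t -> int
end = struct
  (* Rule 1: the primitive is wrapped behind an abstract type. *)
  (* Rule 2: an int carries no mutable state, so a Port.t can never change. *)
  type t = int

  let make n =
    (* Rule 3: validate on construction and fail fast. *)
    if n < 1 || n > 65535 then
      invalid_arg (Printf.sprintf "port must be in 1..65535, got %d" n);
    n

  let to_int p = p
end

let () =
  assert (Port.to_int (Port.make 8080) = 8080);
  (* Invalid input never becomes a Port.t; it fails at the construction site. *)
  match Port.make 0 with
  | exception Invalid_argument _ -> ()
  | _ -> assert false
```

Any function that accepts a `Port.t` can skip range checks entirely, because an out-of-range value cannot exist.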

Side Quest: Assertions

There's a maintenance and complexity burden that comes with each new dependency. For this project I'm going to try to keep a good balance between bringing in a library and just writing the code. I'll lean towards a library for logic that is not core to the project's purpose, and which is complex enough that reimplementing it would require too much code. That said, we're going to reinvent some assertion and exception logic that could easily be found in a library, but which is small enough to maintain ourselves.

We will need these helpers to constrain the identifier value-space to our preferred subset.

lib/control/assert.ml:

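(* Raises Invalid_argument with [msg] when [condition] is false. *)
let require ?(msg = "Requirement not met") condition =
  if not condition then invalid_arg msg

(* As [require], but [msg] is a format string taking one argument, [arg].
   [invalid_arg1] and [invalid_arg2] are the formatting exception helpers. *)
let require1 ?(msg) ?(arg) condition =
  if not condition then
    match msg with
    | Some msg -> invalid_arg1 msg (Option.get arg)
    | None     -> invalid_arg "Requirement not met"

(* As [require1], with a two-argument format string. *)
let require2 ?(msg) ?(arg1) ?(arg2) condition =
  if not condition then
    match msg with
    | Some msg -> invalid_arg2 msg (Option.get arg1) (Option.get arg2)
    | None     -> invalid_arg "Requirement not met"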

You can see a micro-pattern I've adopted whenever I work with exceptions. So often the message is the only real data attached to an exception, and rather than clunkily append strings or format them at the callsite, I build the formatting into the exception construction.
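The exception helpers themselves aren't shown in this post, so here is a minimal sketch of what they could look like. The names `invalid_arg1` and `invalid_arg2` come from the code above, but the bodies are assumptions, and `invalid_argf` is a hypothetical variadic alternative.

```ocaml
(* Assumed shape: format the message, then raise Invalid_argument with it. *)
let invalid_arg1 fmt a = invalid_arg (Printf.sprintf fmt a)
let invalid_arg2 fmt a b = invalid_arg (Printf.sprintf fmt a b)

(* Hypothetical alternative that accepts any number of format arguments. *)
let invalid_argf fmt = Printf.ksprintf invalid_arg fmt

let () =
  (match invalid_arg1 "raw_id must be >= 0, got %d" (-1) with
   | exception Invalid_argument msg ->
     assert (msg = "raw_id must be >= 0, got -1")
   | () -> assert false);
  (match invalid_argf "expected %s, got %d" "a digit" 42 with
   | exception Invalid_argument msg ->
     assert (msg = "expected a digit, got 42")
   | () -> assert false)
```

The fixed-arity versions keep each call site's shape obvious, while the `ksprintf` version trades that for one helper that covers every arity. Either way the format string is type-checked against its arguments at compile time.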

Construction

We'll use the basic Str library for string matching for now. There will probably come a time when a more expressive regex library is needed, and we can replace Str when we make that decision. In the code below, CA is a short alias for the assertion helpers module from the previous section.

lib/data/identifier.ml:

let _namespace_pattern = "^[a-z]+$"
let _namespace_re = Str.regexp _namespace_pattern

let _validate_namespace ns =
  let len = String.length ns in
  CA.require1 (len >= 1 && len <= 5)
    ~msg:"namespace must be between 1 and 5 characters, got \"%s\"" ~arg:ns;
  (* Str's `$` also matches just before a newline, so additionally require
     that the match consumed the whole string. *)
  CA.require2 (Str.string_match _namespace_re ns 0 && Str.match_end () = len)
    ~msg:"namespace must match `%s`, got \"%s\"" ~arg1:(_namespace_pattern) ~arg2:ns;
  ns

let _validate_raw_id id =
  CA.require1 (id >= 0) ~msg:"raw_id must be >= 0, got %d" ~arg:id;
  id

let make namespace raw_id = {
  namespace = _validate_namespace namespace;
  raw_id    = _validate_raw_id raw_id;
}
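To show what callers see from `make`, here is a self-contained stand-in. It swaps the Str check for a plain character loop and inlines the assertion helpers so it runs with the standard library alone. The messages mirror the ones above, and `message_of` is a hypothetical helper that exists only for the demonstration.

```ocaml
type t = { namespace : string; raw_id : int }

(* Stand-in for the `^[a-z]+$` check, without Str. *)
let all_lower s =
  let ok = ref true in
  String.iter (fun c -> if c < 'a' || c > 'z' then ok := false) s;
  !ok

let make namespace raw_id =
  let len = String.length namespace in
  if len < 1 || len > 5 then
    invalid_arg
      (Printf.sprintf
         "namespace must be between 1 and 5 characters, got \"%s\"" namespace);
  if not (all_lower namespace) then
    invalid_arg
      (Printf.sprintf "namespace must match `^[a-z]+$`, got \"%s\"" namespace);
  if raw_id < 0 then
    invalid_arg (Printf.sprintf "raw_id must be >= 0, got %d" raw_id);
  { namespace; raw_id }

(* Hypothetical helper: run a constructor and capture the failure message. *)
let message_of f =
  match f () with
  | _ -> None
  | exception Invalid_argument msg -> Some msg

let () =
  assert (make "kbase" 12345 = { namespace = "kbase"; raw_id = 12345 });
  assert (message_of (fun () -> make "toolong" 1)
          = Some "namespace must be between 1 and 5 characters, got \"toolong\"");
  assert (message_of (fun () -> make "KB" 1)
          = Some "namespace must match `^[a-z]+$`, got \"KB\"");
  assert (message_of (fun () -> make "kbase" (-1))
          = Some "raw_id must be >= 0, got -1")
```

Each invalid input fails at the call to `make` with a message that names the offending value, so a bad identifier never travels further into the program.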

String Helpers

Since we get structural equality for free in OCaml, the only things left for this basic implementation are the string helpers: to_string, from_string, and the pretty printer pp.

lib/data/identifier.ml:

let pp fmt { namespace; raw_id } =
  Format.fprintf fmt "%s-%d" namespace raw_id

let to_string t = Format.asprintf "%a" pp t

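(* Splits on the dash that separates the namespace from the id. *)
let _dash_re = Str.regexp "-"

(* Note: int_of_string raises Failure, not Invalid_argument, on a bad id. *)
let from_string s =
  match Str.split _dash_re s with
  | [namespace; raw_id_str] -> make namespace (int_of_string raw_id_str)
  | _ -> CE.invalid_arg1 "Invalid format \"%s\", expected \"namespace-id\"" s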
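Parsing is where the [a-z]{1,5}-[0-9]+ format is easiest to get subtly wrong. `int_of_string` accepts more than plain digits (for example "0x1F" and "1_000"), and as far as I can tell from the Str documentation, `Str.split` ignores delimiters at the start and end of the string. Below is a stricter sketch that uses only the standard library. `parts_of_string` is a hypothetical helper, not something in the project, and it leaves namespace validation to `make`.

```ocaml
let all_digits s =
  s <> ""
  && (let ok = ref true in
      String.iter (fun c -> if c < '0' || c > '9' then ok := false) s;
      !ok)

(* Hypothetical helper: split "namespace-id" into its two parts.
   Returns None unless there is exactly one dash followed by decimal digits. *)
let parts_of_string s =
  match String.split_on_char '-' s with
  | [ namespace; raw_id_str ] when all_digits raw_id_str ->
    (match int_of_string_opt raw_id_str with
     | Some raw_id -> Some (namespace, raw_id)
     | None -> None)               (* all digits, but too large for an int *)
  | _ -> None

let () =
  assert (parts_of_string "kbase-12345" = Some ("kbase", 12345));
  assert (parts_of_string "kbase" = None);        (* no dash *)
  assert (parts_of_string "kbase-" = None);       (* empty id *)
  assert (parts_of_string "-kbase-12" = None);    (* leading dash gives three parts *)
  assert (parts_of_string "kbase-0x1F" = None);   (* hex is rejected *)
  (* For comparison, int_of_string on its own accepts these: *)
  assert (int_of_string "0x1F" = 31);
  assert (int_of_string "1_000" = 1000)
```

Returning an option keeps the helper free of exceptions. A `from_string` built on it could turn `None` into the same "Invalid format" error and pass `Some` results on to `make`.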

Unit Tests

In my static site's codebase, I used alcotest for unit tests, so for this project I want to give expect tests a try. So far they're alright.

The ergonomics are nice. Like dune itself, expect tests print nothing when they succeed, and both behaviors are very useful for not polluting a coding agent's context. Using dune promote to quickly accept the test output changes you expected is slick. My one issue during setup is that I like my test directory structure to mirror the directory structure of the code, to help with discoverability, but there didn't appear to be a way to achieve this without turning subdirectories into submodules using an explicit dune file. That annoyance is going to force me to copy-paste the same dune file into every new test subdirectory.

Helper Script

A fun trick I found while building my static site, and brought over here, is setting dune into watch mode to both rebuild and rerun the tests as soon as any file is saved. It's so good. The OCaml compiler and dune's build orchestration are so fast that using the language server and the watched builds is like having a full-featured IDE in a lightweight text editor.

scripts/watch-build.sh:

#!/bin/bash
#
# Usage:
#   ./scripts/watch-build.sh
#
# This script will rebuild and run tests when files change.
#
# The --force flag makes dune re-run the test actions even if their dependencies haven't changed.
# The --watch flag is used to watch for changes to files known-to-dune.
# The --terminal-persistence=clear-on-rebuild flag clears the terminal for each rebuild.

dune runtest --force --watch --terminal-persistence=clear-on-rebuild

Retro

With these changes we've now got a basic identifier structure, assertion and exception helpers, and a functioning unit test setup, all in place for future work. I wish the expect test setup would more easily allow for subdirectories, but I'm willing to live with the annoyance for now.

See 6434f2b for all of the referenced changes.