Building Search for this Site

Search on a static site.

I like the idea of having search for my blog, to help readers find anything related to a specific topic, or something they'd read in the past. How do I make something as dynamic as search, but statically generated?

Requirements

Past Experience

On an unreleased personal project I wrote in Java, I used the Rhino Javascript engine and Lunr.js to make a search engine. During site generation, I executed Lunr inside Rhino, adding content to the index, which I serialized to JSON and stuck into the search page. In the search page HTML, I included Lunr, and a wrapper script to instantiate the Lunr index, pass queries to it, and manipulate the DOM to display the results. Success!

I thought something like my previous approach could work in OCaml.

Mise en Place

Site Generation

Defining the Search Index

For my purposes I needed to make a number of changes to the search library, so I pulled it into my site generator repository.

type document = {
  id          : int;
  name        : string;
  description : string;
  body        : string;
  url         : string;
  created     : string; (** ISO8601 created date *)
}

type index = {
  core         : Index_impl.t;
  mutable docs : document list;
}

let create () : index =
  let core = Index_impl.empty () in
  Index_impl.add_index ~weight:5 core (fun d -> d.name);
  Index_impl.add_index ~weight:1 core (fun d -> d.description);
  Index_impl.add_index ~weight:1 core (fun d -> d.body);
  {
    core;
    docs = [];
  }

let add_document (t : index) (doc : document) : unit =
  Index_impl.add_document t.core doc.id doc;
  t.docs <- doc :: t.docs

let search ?limit (t : index) (query : string) : document list =
  Index_impl.search ?limit t.core query

let serialize (t : index) : string = Base85.encode (serialize_bin t)

let deserialize (s : string) : (index, string) result =
  Result.bind (Base85.decode s) deserialize_bin

Adding Blog Posts to the Index

During page generation, the rendered HTML is parsed to drop the contents of non-text elements such as script and style and to strip out the remaining markup, leaving plain text for the index.

Log.debug (fun m -> m "Adding post to search index: %s" page.path);
let plain_text = Processing.PlainText.extract_text body_html in
let post_url = "/posts/" ^ page.metadata.slug ^ ".html" in
incr doc_id_counter;
SearchIndex.add_document search_index {
  id = !doc_id_counter;
  name = page.metadata.title;
  description = Option.value page.metadata.description ~default:"";
  body = ( (* tags + body *)
      (String.concat " " (Option.value page.metadata.tags ~default:[]))
      ^ "\n" ^ plain_text
  );
  url = post_url;
  created = Time.Datetime.to_iso8601 page.metadata.created;
}
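The Processing.PlainText module itself isn't shown above, but the extraction step can be sketched with a small state machine over the HTML string. This is a naive, assumption-laden stand-in (a real implementation should use a proper HTML parser): it skips the bodies of script and style elements and replaces every other tag with a space so adjacent words don't fuse.

```ocaml
(* Hypothetical sketch of Processing.PlainText.extract_text.
   Skips <script>/<style> bodies; strips all other tags. *)
let extract_text (html : string) : string =
  let buf = Buffer.create (String.length html) in
  let n = String.length html in
  (* Case-insensitive check that [sub] occurs in [html] at position [i]. *)
  let matches i sub =
    let k = String.length sub in
    i + k <= n
    && String.lowercase_ascii (String.sub html i k) = String.lowercase_ascii sub
  in
  (* Position just past the closing tag [close], or end of input. *)
  let rec find_close i close =
    if i >= n then n
    else if matches i close then i + String.length close
    else find_close (i + 1) close
  in
  let rec go i =
    if i >= n then ()
    else if matches i "<script" then go (find_close i "</script>")
    else if matches i "<style" then go (find_close i "</style>")
    else if html.[i] = '<' then
      (* Skip the tag itself; emit a space so words don't run together. *)
      (match String.index_from_opt html i '>' with
       | Some j -> Buffer.add_char buf ' '; go (j + 1)
       | None -> ())
    else (Buffer.add_char buf html.[i]; go (i + 1))
  in
  go 0;
  Buffer.contents buf
```

Extra whitespace in the output is harmless here, since the tokenizer in the search index splits on it anyway.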

Compiling the Search Library

The search library is built as a standalone library target in dune. A second standalone library, searchClient, contains a thin js_of_ocaml wrapper module; its dune configuration depends on search and compiles the code to a Javascript file with js_of_ocaml. For a production build of the site, --opt=3 is specified so that the compiled Javascript is minified (this includes dead code removal): the non-optimized output is over 26k LOC, while the minified version is about 2k LOC.

(include_subdirs no)

; Build the search_client JS bundle.
(executable
 (name search_client)
 (libraries js_of_ocaml asite.search_index)
 (modes js)
)

; Profile-specific JS-of-OCaml flags
(env
 (dev
  (js_of_ocaml (flags (:standard))))
 (_
  (js_of_ocaml (flags (:standard "--opt=3")))))

Generating the Search Page

The base85-serialized search index is embedded in the Search page inside a script tag with type="text/plain", so the browser treats the payload as inert data rather than executable script.

<main>
  <h1>Search Blog Posts</h1>
  <br>
  <div class="search-container">
    <input type="search" id="search-box"
           placeholder="Loading search index…" disabled>
  </div>
  <br>
  <section id="search-results"></section>
</main>

<script id="search-index" type="text/plain">{search_index}</script>
<script src="{__droot__}/scripts/search-client.bc.js"></script>

Search in the Browser

Search Index Javascript Interface

The following is the js_of_ocaml wrapper code to make the search library interoperable with Javascript code. It attaches a Javascript object to window with only two functions, query and deserialize. Deserialize replaces the global reference to the search index, while query returns a JS array of document objects.

open Js_of_ocaml
open Js

module SI = SearchIndex

let index = ref (SI.create ())

(* Convert a SearchIndex.document record to a plain JS object *)
let doc_to_js (d : SI.document) : 'a Js.t =
  Js.Unsafe.obj
    [|
      ("name",        Js.Unsafe.inject (string d.SI.name));
      ("description", Js.Unsafe.inject (string d.SI.description));
      ("url",         Js.Unsafe.inject (string d.SI.url));
      ("created",     Js.Unsafe.inject (string d.SI.created))
    |]

let js_query (q : js_string t) =
  let res = SI.search ~limit:10 !index (to_string q) in
  array (Array.of_list (List.map doc_to_js res))

let js_deserialize (binindex : js_string t) : bool Js.t =
  let s = to_string binindex in
  match SI.deserialize s with
  | Ok st -> index := st; Js._true
  | Error _ -> Js._false

let () =
  let obj = Js.Unsafe.obj
    [|
      ("query", Js.Unsafe.inject (Js.wrap_callback js_query));
      ("deserialize", Js.Unsafe.inject (Js.wrap_callback js_deserialize))
    |]
  in
  Js.Unsafe.set Js.Unsafe.global "searchClient" obj

Searching

In the search page there's a small amount of Javascript which invokes query on the index when characters are typed, and then takes the resulting objects and inserts search result HTML into the DOM.
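That glue might look something like the following sketch. The element ids match the search page HTML above; the rendering markup and function names are my assumptions, not the site's actual code.

```javascript
// Pure helper: turn query result objects into result-list HTML.
// (Markup is hypothetical; the result fields match the jsoo wrapper.)
function renderResults(results) {
  return Array.from(results)
    .map(r =>
      `<article class="search-result">` +
      `<a href="${r.url}">${r.name}</a>` +
      `<p>${r.description}</p>` +
      `<time>${r.created}</time>` +
      `</article>`)
    .join("\n");
}

// Browser-only wiring: deserialize the embedded index, then query on input.
if (typeof document !== "undefined") {
  const indexText = document.getElementById("search-index").textContent;
  const box = document.getElementById("search-box");
  if (window.searchClient.deserialize(indexText)) {
    box.placeholder = "Search posts…";
    box.disabled = false;
    box.addEventListener("input", () => {
      document.getElementById("search-results").innerHTML =
        renderResults(window.searchClient.query(box.value));
    });
  }
}
```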

Putting it Together

As of this writing, the Search page (HTML + serialized search index) is 76 kB over the wire (br-encoded; 132 kB plain), and the search library is 31 kB over the wire (98 kB plain). The user gets local-only full-text search that returns results in milliseconds. On my end I keep my existing build and package system, dune + opam, introduce the js_of_ocaml dependency, and use the exact same library at both site generation time and user query time.

I'm really pleased with how the feature turned out. It tickles me to use a language-to-language compiler in order to use the same code in different execution contexts. And for now it is performant. I'm not sure what I'll do when the search index gets really large, but I think I've got a fair amount of time until that happens — there are multiple-megabyte web applications out there that are decently usable.