Building Search for this Site
I like the idea of having search for my blog, to help readers find anything related to a specific topic, or something they'd read in the past. How do I make something as dynamic as search, but statically generated?
Requirements
- The solution has to be small — part of my purpose in statically generating the site is to make it fast to download, fast to view. Whatever is chosen needs to maintain that vision.
- The solution can't introduce too many dependencies — I really like OCaml as a language, and I used building my personal site as an excuse to work in it. Introducing non-OCaml dependencies has to be done with care, as I don't want to maintain a Rube Goldberg machine for my website.
Past Experience
On an unreleased personal project I wrote in Java, I used the Rhino Javascript engine and Lunr.js to make a search engine. During site generation, I executed Lunr inside Rhino, adding content to the index, which I serialized to JSON and stuck into the search page. In the search page HTML, I included Lunr, and a wrapper script to instantiate the Lunr index, pass queries to it, and manipulate the DOM to display the results. Success!
I thought something like my previous approach could work in OCaml.
Mise en Place
- js_of_ocaml: a compiler which compiles OCaml code into Javascript code.
- search: an in-memory tf-idf search index written in OCaml.
- markup: an OCaml-native HTML parser.
Site Generation
Defining the Search Index
For my purposes I needed to make a number of changes to search, so I pulled it into my site generator repository.
- Stripped it of the heterogeneous documents logic to make it smaller.
- Added a set of stop words (the default InnoDB stop words) to the tokenizer.
- By default search adds all prefixes of every token to the index. To reduce the size of the index, and to improve performance by not trying to display every blog post with a word starting with "s", I changed this to only add prefixes longer than 3 characters.
- Added different weights for different fields in the library by multiplying the underlying index counts by the weight.
- Added a limit parameter on the number of returned results.
- Wrote a barebones length-prefixed binary serialization of the structs, lists, and maps making up the search index, which is then passed through a Base85 encoder.
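The stop-word and prefix changes above can be sketched roughly as follows. This is a minimal illustration, not the code from my modified search library: the names tokenize, prefixes, stop_words, and min_prefix_len are hypothetical, the stop-word list is a small illustrative subset, and indexing a short token whole is an assumption.

```ocaml
(* Hypothetical sketch; none of these names come from the search library. *)
module StringSet = Set.Make (String)

(* A small illustrative subset of stop words, not the full InnoDB list. *)
let stop_words = StringSet.of_list [ "a"; "the"; "of"; "to"; "is"; "in" ]

(* Only prefixes longer than this many characters are indexed. *)
let min_prefix_len = 3

(* Lowercase the text, split on anything that is not an ASCII letter or
   digit, then drop stop words. *)
let tokenize (text : string) : string list =
  let buf = Buffer.create 16 in
  let tokens = ref [] in
  let flush () =
    if Buffer.length buf > 0 then begin
      tokens := Buffer.contents buf :: !tokens;
      Buffer.clear buf
    end
  in
  String.iter
    (fun c ->
      match Char.lowercase_ascii c with
      | ('a' .. 'z' | '0' .. '9') as lc -> Buffer.add_char buf lc
      | _ -> flush ())
    text;
  flush ();
  List.filter
    (fun tok -> not (StringSet.mem tok stop_words))
    (List.rev !tokens)

(* All prefixes of [tok] longer than [min_prefix_len], shortest first.
   Assumption: a token at or below the threshold is indexed whole so that
   short words stay searchable. *)
let prefixes (tok : string) : string list =
  let n = String.length tok in
  if n <= min_prefix_len then [ tok ]
  else
    List.init (n - min_prefix_len) (fun i ->
        String.sub tok 0 (i + min_prefix_len + 1))
```

With this sketch, "search" contributes three index keys instead of six, and a one-letter query no longer matches it at all.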
type document = {
  id : int;
  name : string;
  description : string;
  body : string;
  url : string;
  created : string; (** ISO8601 created date *)
}

type index = {
  core : Index_impl.t;
  mutable docs : document list;
}

(* Build an empty index; matches in the title count five times as much. *)
let create () : index =
  let core = Index_impl.empty () in
  Index_impl.add_index ~weight:5 core (fun d -> d.name);
  Index_impl.add_index ~weight:1 core (fun d -> d.description);
  Index_impl.add_index ~weight:1 core (fun d -> d.body);
  {
    core;
    docs = [];
  }

let add_document (t : index) (doc : document) : unit =
  Index_impl.add_document t.core doc.id doc;
  t.docs <- doc :: t.docs

let search ?limit (t : index) (query : string) : document list =
  Index_impl.search ?limit t.core query

let serialize (t : index) : string = Base85.encode (serialize_bin t)

let deserialize (s : string) : (index, string) result =
  Result.bind (Base85.decode s) deserialize_bin
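For a feel of what a length-prefixed binary format looks like, here is a minimal sketch that round-trips a list of strings. It is not my serialize_bin: the 4-byte big-endian length and every function name below are assumptions made for illustration, and the Base85 step is left out.

```ocaml
(* Hypothetical sketch of length-prefixed encoding for a string list. *)

(* Write a non-negative int (below 2^32) as 4 big-endian bytes. *)
let write_int (buf : Buffer.t) (n : int) : unit =
  for shift = 3 downto 0 do
    Buffer.add_char buf (Char.chr ((n lsr (8 * shift)) land 0xff))
  done

(* A string is its byte length followed by its bytes. *)
let write_string buf s =
  write_int buf (String.length s);
  Buffer.add_string buf s

(* A list is its element count followed by each element. *)
let write_list write_item buf items =
  write_int buf (List.length items);
  List.iter (write_item buf) items

let encode_strings (items : string list) : string =
  let buf = Buffer.create 64 in
  write_list write_string buf items;
  Buffer.contents buf

(* Readers take the input and an offset and return the value together
   with the offset just past it. *)
let read_int (s : string) (pos : int) : (int * int, string) result =
  if pos < 0 || pos + 4 > String.length s then Error "truncated int"
  else
    let byte i = Char.code s.[pos + i] in
    Ok
      ( ((byte 0) lsl 24) lor ((byte 1) lsl 16) lor ((byte 2) lsl 8)
        lor (byte 3),
        pos + 4 )

let read_string s pos =
  Result.bind (read_int s pos) (fun (len, pos) ->
      if pos + len > String.length s then Error "truncated string"
      else Ok (String.sub s pos len, pos + len))

let decode_strings (s : string) : (string list, string) result =
  let rec loop remaining pos acc =
    if remaining = 0 then
      if pos = String.length s then Ok (List.rev acc)
      else Error "trailing bytes"
    else
      Result.bind (read_string s pos) (fun (item, pos) ->
          loop (remaining - 1) pos (item :: acc))
  in
  Result.bind (read_int s 0) (fun (count, pos) -> loop count pos [])
```

Records and maps can be handled the same way: a record is its fields written in a fixed order, and a map is a list of key and value pairs.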
Adding Blog Posts to the Index
During page generation, the generated HTML content is parsed in order to remove all content inside any tags like script or style, and strip out any non-plaintext markup.
Log.debug (fun m -> m "Adding post to search index: %s" page.path);
let plain_text = Processing.PlainText.extract_text body_html in
let post_url = "/posts/" ^ page.metadata.slug ^ ".html" in
incr doc_id_counter;
SearchIndex.add_document search_index {
  id = !doc_id_counter;
  name = page.metadata.title;
  description = Option.value page.metadata.description ~default:"";
  body = ( (* tags + body *)
    (String.concat " " (Option.value page.metadata.tags ~default:[]))
    ^ "\n" ^ plain_text
  );
  url = post_url;
  created = Time.Datetime.to_iso8601 page.metadata.created;
}
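The text-extraction step used above can be sketched as a walk over parser events. To keep the example self-contained, it uses a hand-rolled signal type as a stand-in for a real HTML parser's event stream; the type, the text_of_signals name, and the whitespace handling are all assumptions, not the contents of Processing.PlainText.

```ocaml
(* Hypothetical stand-in for an HTML parser's event stream. *)
type signal =
  | Start of string (* opening tag, by element name *)
  | End (* closing tag for the most recently opened element *)
  | Text of string (* character data *)

(* Elements whose contents should never reach the index. *)
let skipped = [ "script"; "style" ]

(* Collect visible text, ignoring anything nested inside a skipped
   element, and join the pieces with single spaces. *)
let text_of_signals (signals : signal list) : string =
  let buf = Buffer.create 256 in
  (* [stack] holds the open element names; [hidden] counts how many of
     them are skipped elements. *)
  let rec go stack hidden = function
    | [] -> ()
    | Start name :: rest ->
      let hidden = if List.mem name skipped then hidden + 1 else hidden in
      go (name :: stack) hidden rest
    | End :: rest ->
      (match stack with
       | [] -> go [] hidden rest (* unbalanced close: ignore it *)
       | name :: stack ->
         let hidden =
           if List.mem name skipped then hidden - 1 else hidden
         in
         go stack hidden rest)
    | Text s :: rest ->
      let s = String.trim s in
      if hidden = 0 && s <> "" then begin
        if Buffer.length buf > 0 then Buffer.add_char buf ' ';
        Buffer.add_string buf s
      end;
      go stack hidden rest
  in
  go [] 0 signals;
  Buffer.contents buf
```

Counting skipped ancestors, rather than keeping a single flag, means the text stays hidden even if a skipped element is nested inside another one.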
Compiling the Search Library
The search library is configured to be built as a standalone library target in dune. In a separate standalone library named searchClient containing a simple js_of_ocaml wrapper module, dune is configured to depend on search and to use js_of_ocaml to compile the code into a Javascript file. For a production build of the site, --opt=3 is specified so that the compiled Javascript library is minified (this includes dead code removal); the non-optimized library is over 26k LOC while the minified version is 2k LOC.
(include_subdirs no)

; Build the search_client JS bundle.
(executable
 (name search_client)
 (libraries js_of_ocaml asite.search_index)
 (modes js))

; Profile-specific JS-of-OCaml flags
(env
 (dev
  (js_of_ocaml (flags (:standard))))
 (_
  (js_of_ocaml (flags (:standard "--opt=3")))))
Generating the Search Page
The Base85-serialized search index is embedded in the Search page inside a script tag.
<main>
  <h1>Search Blog Posts</h1>
  <br>
  <div class="search-container">
    <input type="search" id="search-box"
           placeholder="Loading search index…" disabled>
  </div>
  <br>
  <section id="search-results"></section>
</main>
<script id="search-index" type="text/plain">{search_index}</script>
<script src="{__droot__}/scripts/search-client.bc.js"></script>
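The text substituted for {search_index} is binary data turned into printable characters. As a rough sketch of how such an encoding works, here is an Ascii85-style encoder and decoder. The alphabet ('!' through 'u') is an assumption, as are the function names; this is not the Base85 module used by the site, and it omits the usual 'z' shortcut for all-zero groups.

```ocaml
(* Hypothetical Ascii85-style sketch. Assumes OCaml's native 63-bit int;
   with 32-bit ints (as under js_of_ocaml) the arithmetic would need
   Int64 or similar. *)

let encode (s : string) : string =
  let n = String.length s in
  let out = Buffer.create ((n * 5 / 4) + 5) in
  let pos = ref 0 in
  while !pos < n do
    let len = min 4 (n - !pos) in
    (* Pack up to 4 bytes into one number, zero-padding a short group. *)
    let v = ref 0 in
    for i = 0 to 3 do
      let b = if i < len then Char.code s.[!pos + i] else 0 in
      v := (!v lsl 8) lor b
    done;
    (* Produce 5 base-85 digits, most significant first. *)
    let digits = Bytes.create 5 in
    for i = 4 downto 0 do
      Bytes.set digits i (Char.chr (33 + (!v mod 85)));
      v := !v / 85
    done;
    (* A final group of [len] bytes keeps only its first [len + 1] digits. *)
    Buffer.add_subbytes out digits 0 (len + 1);
    pos := !pos + 4
  done;
  Buffer.contents out

let decode (s : string) : (string, string) result =
  let n = String.length s in
  let valid = ref true in
  String.iter (fun c -> if c < '!' || c > 'u' then valid := false) s;
  if n mod 5 = 1 then Error "invalid length"
  else if not !valid then Error "invalid character"
  else begin
    let out = Buffer.create (n * 4 / 5) in
    let pos = ref 0 in
    while !pos < n do
      let len = min 5 (n - !pos) in
      (* Rebuild the number, padding a short group with the top digit. *)
      let v = ref 0 in
      for i = 0 to 4 do
        let d = if i < len then Char.code s.[!pos + i] - 33 else 84 in
        v := (!v * 85) + d
      done;
      (* A group of [len] digits yields [len - 1] bytes. *)
      for i = 0 to len - 2 do
        Buffer.add_char out (Char.chr ((!v lsr (8 * (3 - i))) land 0xff))
      done;
      pos := !pos + 5
    done;
    Ok (Buffer.contents out)
  end
```

Every 4 bytes become 5 characters, so the encoded text is about 25% larger than the binary, compared with about 33% for Base64.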
Search in the Browser
Search Index Javascript Interface
The following is the js_of_ocaml wrapper code to make the search library interoperable with Javascript code. It attaches a Javascript object to window with only two functions, query and deserialize. Deserialize replaces the global reference to the search index, while query returns a JS array of document objects.
(* Assumed opens: the unqualified string, to_string, array, t, and
   js_string below come from Js. *)
open Js_of_ocaml
open Js

module SI = SearchIndex

(* The index currently in use; replaced by js_deserialize. *)
let index = ref (SI.create ())

(* Convert a SearchIndex.document to a plain JS object *)
let doc_to_js (d : SI.document) : 'a Js.t =
  Js.Unsafe.obj
    [|
      ("name", Js.Unsafe.inject (string (SI.name d)));
      ("description", Js.Unsafe.inject (string (SI.description d)));
      ("url", Js.Unsafe.inject (string (SI.url d)));
      ("created", Js.Unsafe.inject (string (SI.created d)))
    |]

(* Run a query and return at most 10 results as a JS array. *)
let js_query (q : js_string t) =
  let res = SI.search ~limit:10 !index (to_string q) in
  array (Array.of_list (List.map doc_to_js res))

(* Replace the index with the decoded one; returns false on failure. *)
let js_deserialize (binindex : js_string t) : bool Js.t =
  let s = to_string binindex in
  match SI.deserialize s with
  | Ok st -> index := st; Js._true
  | Error _ -> Js._false

(* Expose both functions to page scripts as window.searchClient. *)
let () =
  let obj = Js.Unsafe.obj
      [|
        ("query", Js.Unsafe.inject (Js.wrap_callback js_query));
        ("deserialize", Js.Unsafe.inject (Js.wrap_callback js_deserialize))
      |]
  in
  Js.Unsafe.set Js.Unsafe.global "searchClient" obj
Searching
In the search page there's a small amount of Javascript which invokes query on the index when characters are typed, and then takes the resulting objects and inserts search result HTML into the DOM.
Putting it Together
As of this writing, the Search page (HTML + serialized search index) is 76 kB over the wire (Brotli-encoded; 132 kB plain), and the search library is 31 kB over the wire (98 kB plain). The user receives a local-only full-text search, returning results in milliseconds. On my end I can use my existing build and package system, dune + opam, introduce the js_of_ocaml dependency, and use the exact same library at both site generation time and user query time.
I'm really pleased with how the feature turned out. It tickles me to use a language-to-language compiler in order to use the same code in different execution contexts. And for now it is performant. I'm not sure what I'll do when the search index gets really large, but I think I've got a fair amount of time until that happens — there are multiple-megabyte web applications out there that are decently usable.