Searching for something?

    This post is dedicated to Ciprian from Bistrita, who has been asking for a search feature for some time now.

    I didn't have a search on my blog for quite some time, because it's a static website, without any dynamic backend (except for comments, but those are well isolated, on a subdomain). But JavaScript and the browsers are getting more and more features every day, so it's now possible to do all of this client-side. You just have to go to the search page (also linked in the menu).

    I had three options:

    • Acrylamid has a builtin feature that builds some compressed suffix tries from your posts at compilation and allows you to search there. However, it's pretty raw, in the sense that it doesn't do query expansion, it doesn't support multiple terms, it doesn't do stemming and so on. Also, I just found out that Acrylamid is no longer maintained, so within a year I will be moving away from it, and I didn't want to tie myself to it with this too.
    • lunr.js: This is a purely in-browser solution. It's like Solr, but smaller and less bright (Solr is one of the better known search engines). It does tokenization, stemming and stop word filtering. Then it builds up an inverted index, which allows for efficient querying, and bam, you've got a client-side search.
    • elasticlunr.js: This is an extension/copy/improvement over lunr.js, which claims to be faster, but is also less popular, so I decided to skip it for now.
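
    To make the lunr.js description above a bit more concrete, here is a minimal sketch of the tokenize, filter stop words, build inverted index pipeline. It is not how lunr.js is actually implemented: the names (tokenize, buildIndex, lookup), the tiny stop word list and the sample documents are made up for illustration, and stemming is left out entirely.

```javascript
// Toy inverted index. All names and data here are hypothetical, not lunr.js code.
var STOP_WORDS = ['the', 'a', 'is', 'and', 'of'];

// Lowercase, split on non-alphanumerics, drop empty tokens and stop words.
function tokenize(text) {
    return text.toLowerCase().split(/[^a-z0-9]+/).filter(function (token) {
        return token.length > 0 && STOP_WORDS.indexOf(token) === -1;
    });
}

// Map each token to the list of document ids (array positions) that contain it.
function buildIndex(docs) {
    var index = Object.create(null);
    docs.forEach(function (doc, id) {
        tokenize(doc.title + ' ' + doc.content).forEach(function (token) {
            if (!index[token]) {
                index[token] = [];
            }
            // Ids arrive in increasing order, so checking the last entry avoids duplicates
            var ids = index[token];
            if (ids[ids.length - 1] !== id) {
                ids.push(id);
            }
        });
    });
    return index;
}

// Return the ids of the documents that contain every term of the query.
function lookup(index, query) {
    var terms = tokenize(query);
    if (terms.length === 0) {
        return [];
    }
    return terms
        .map(function (term) { return index[term] || []; })
        .reduce(function (acc, ids) {
            return acc.filter(function (id) { return ids.indexOf(id) !== -1; });
        });
}

var toyDocs = [
    { title: 'Static blogs', content: 'Search is hard without a backend' },
    { title: 'Client side search', content: 'The index is built in the browser' }
];
var toyIndex = buildIndex(toyDocs);
```

    A query only has to look up its terms in the map and intersect the id lists, instead of scanning every post, which is why the inverted index makes querying cheap.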

    The total size of my posts is about 1.5 MB. If I build the inverted index offline, it's about 9 MB. Because I didn't want to send that over the network when loading the search page, I decided to send only a JSON document containing all the posts and then create the index on the client side. This was quite simple:

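    import io
    import json

    # Import paths assumed from Acrylamid's source layout
    from acrylamid.helpers import joinurl
    from acrylamid.views import View


    class LunrSearch(View):
        def generate(self, conf, env, request):
            # Skip this view unless search was enabled in the options
            if not env.options.search:
                # A plain return ends the generator; raising StopIteration
                # inside a generator is an error since Python 3.7
                return
            docs = []
            for entry in request['entrylist']:
                docs.append({"url": entry.permalink,
                             "date": entry.date.isoformat(' '),
                             "tags": entry.tags,
                             "title": entry.title,
                             "content": entry.content})
            # One JSON file with every post; the index is built client-side
            yield (io.StringIO(json.dumps(docs, ensure_ascii=False)),
                   joinurl(conf['output_dir'], self.path))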

    This view is called for a route in my configuration and dumps all the posts there, in JSON format.

    The client side is more interesting.

    <input id="search"/>
    <ul id='results'></ul>
    
    <script src="/static/js/lunr.min.js"></script>
    <script>
    'use strict';
    var index, docs, tasks;
    var input = document.getElementById('search'); 
    var resultdiv = document.getElementById('results');

    We start by defining an input where we can type and a list where the results will be shown.

    document.addEventListener("DOMContentLoaded", function(event) { 
        // Set up search
        var xhr = new XMLHttpRequest();
        xhr.onreadystatechange = function()
        {
            if (xhr.readyState === XMLHttpRequest.DONE) {
                if (xhr.status === 200) {
                    parseResults(JSON.parse(xhr.responseText));
                } else {
                    resultdiv.innerHTML = "<li>Error loading search index! \
                        Please tell me about this! :(</li>";
                }
            }
        };
        xhr.open("GET", '/static/js/search.json', true);
        xhr.send();
    
    });
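
    For comparison, here is a rough sketch of how the same load could look with the fetch API in browsers that support it. This is not the code running on this blog: loadDocs is a made-up name, and the fetchImpl parameter is a hypothetical addition that only exists so the function can be tried with a stub outside a browser.

```javascript
// Sketch: load the JSON document list with fetch instead of XMLHttpRequest.
// `fetchImpl` is a hypothetical parameter; when omitted, the global fetch is used.
function loadDocs(url, fetchImpl) {
    var doFetch = fetchImpl || fetch;
    return doFetch(url).then(function (response) {
        // fetch only rejects on network failure, so HTTP errors are checked by hand
        if (!response.ok) {
            throw new Error('Could not load ' + url + ': HTTP ' + response.status);
        }
        return response.json();
    });
}
```

    In the page it would be called as loadDocs('/static/js/search.json').then(parseResults), with a catch handler that writes the error message into the results list.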

    Because I don't use jQuery, I had to load the JSON file with good ol' XMLHttpRequest (I can't wait for the fetch API to become more mainstream!), and then I call a function to parse the results (or show an error if that's the case). This function doesn't do much: it initializes the index, schedules the indexing and adds a listener on the input tag to process user input. The index is initialized with the fields that we want to search on and with the field that serves as the reference of a document. We boost the importance of the title and tags fields.

    function parseResults(response) {
        docs = response;
        tasks = response.slice(); // Make copy of document list
        // Create index
        index = lunr(function(){
                // Boost increases the importance of words found in this field
                this.field('content');
                this.field('url');
                this.field('title', {boost: 5});
                this.field('tags', {boost: 10});
                this.field('date');
                // the id
                this.ref('id');
        });
        // Schedule background indexing
        scheduleIndexing();
        // Add search handler
        input.addEventListener("input", search);
    }

    Indexing takes about 2-4 seconds, so if we were to do it here in one go, it would block the UI thread, resulting in a janky UI. So we use the shiny new requestIdleCallback API, which allows us to do it in the background, during idle moments. Because this is also not well supported yet (cough, Safari, cough), I give an alternative of just doing all the indexing at once on the main thread, by mocking the API. To add a document to the index, you just have to call the add function with a JSON object representing the document.

    function scheduleIndexing() {
        if ('requestIdleCallback' in window) {
            requestIdleCallback(indexInBackground);
        } else { // Mock the API and do the indexing in the main thread
            indexInBackground({timeRemaining: function() { return 1}});
        }
        function indexInBackground(deadline) {
            // Run next task if possible
            while (deadline.timeRemaining() > 0 && tasks.length > 0) {
                var entry = tasks.pop();
                index.add({
                    url: entry.url,
                    date: entry.date,
                    title: entry.title,
                    content: entry.content,
                    tags: entry.tags,
                    id: tasks.length
                });
            }
            // Schedule further tasks if necessary
            if (tasks.length > 0) {
                requestIdleCallback(indexInBackground);
            } else {
                if (document.getElementById('search').value != '') {
                    search(); 
                }
            }
        }
    }

    The requestIdleCallback API works by taking a function which receives a deadline object that tells you how much time you have left. You are supposed to return before the time expires. Because of this, it's good only for tasks that can be split into small chunks. Indexing is a perfect example: indexing one document takes very little, on the order of 5 ms, and when we detect that we ran out of time, we stop and request another time slot. When the browser "takes a break", it will schedule us again. We do this as long as there are tasks left. For more details on the API, read this post. When the indexing has finished, we check whether the user has typed anything in the search box and trigger a search if that's the case.
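
    The "work until the deadline runs out, then ask for another slot" pattern can be tried outside a browser as well. The sketch below is a simplified stand-in, not the code from this page: drainQueue and fakeDeadline are hypothetical names, and the fake deadline simply allows a fixed number of work items per slot instead of measuring real time.

```javascript
// Process items from `queue` until the deadline reports no time left.
// Returns how many items were handled in this slot.
function drainQueue(queue, deadline, handle) {
    var handled = 0;
    while (queue.length > 0 && deadline.timeRemaining() > 0) {
        handle(queue.pop());
        handled++;
    }
    return handled;
}

// Hypothetical stand-in for the browser's deadline object: allows `n` items per slot.
function fakeDeadline(n) {
    return {
        timeRemaining: function () {
            return n-- > 0 ? 1 : 0;
        }
    };
}

// Simulate the browser handing out idle slots until the queue is empty.
var queue = ['a', 'b', 'c', 'd', 'e'];
var seen = [];
var slots = 0;
while (queue.length > 0) {
    drainQueue(queue, fakeDeadline(2), function (item) { seen.push(item); });
    slots++;
}
```

    In a real page the outer loop goes away: the callback passed to requestIdleCallback calls drainQueue with the deadline it receives and, if the queue is not empty afterwards, calls requestIdleCallback again.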

    function search() {
        var query = input.value;
        if (query.trim().length >= 3) {
            var result = index.search(query); // Search for it
            // Output it
            if (result.length === 0) {
                resultdiv.innerHTML = "<li>No result found! :(</li>";
            } else {
                resultdiv.innerHTML = '';
                    // Show at most 30 results; they arrive sorted by score
                    for (var i = 0; i < result.length && i < 30; i++) {
                        var ref = result[i].ref;
                        var doc = docs[ref];
                        var li = document.createElement("li");
                        li.innerHTML = '<a href="' + doc.url + '">' + doc.title + '</a>';
                        resultdiv.appendChild(li);
                    }
            }
        } else {
            resultdiv.innerHTML = "<li>Query is too short.</li>";
        }
    };

    Searching is not too complicated. To eliminate the case where there are too many results to be useful, we search only when there are at least three characters in the input field. We then loop over the results and add to the emptied list a link to each post's URL, with its title. We also limit the list to 30 items. Pagination could be added, but I don't think it's that useful to look at the long tail. lunr.js returns results sorted by score, so it's fine to cut off like this.
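
    The cut-off can also be kept separate from the DOM code, which makes it easy to check on its own. This is only a sketch: topResults is a hypothetical helper, the sample data is made up, and the hit objects are assumed to have the same shape the search function above relies on (a ref pointing into the document list, best match first).

```javascript
// Keep the `limit` best hits and look up the matching documents.
// `results` is assumed to look like lunr.js output: [{ref: ..., score: ...}], best first.
function topResults(results, docs, limit) {
    return results.slice(0, limit).map(function (hit) {
        var doc = docs[hit.ref];
        return { url: doc.url, title: doc.title };
    });
}

// Made-up sample data for illustration
var sampleDocs = [
    { url: '/a/', title: 'Post A' },
    { url: '/b/', title: 'Post B' },
    { url: '/c/', title: 'Post C' }
];
var sampleHits = [
    { ref: '2', score: 0.9 },
    { ref: '0', score: 0.4 },
    { ref: '1', score: 0.1 }
];
var shown = topResults(sampleHits, sampleDocs, 2);
```

    Because the hits are already sorted by score, slicing from the front keeps the most relevant posts and drops the long tail.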

    Some posts that have inspired me:

    • https://29a.ch/2014/12/03/full-text-search-example-lunrjs
    • http://matthewdaly.co.uk/blog/2015/04/18/how-i-added-search-to-my-site-with-lunr-dot-js/