rolisz's site

Indexing IM logs with Elasticsearch

Remember my old project for processing instant messaging logs? Probably not, because I wrote about it five years ago. Well, the project is only mostly dead: every once in a while I still work on it.

I mostly use it as an excuse to learn technologies that are used outside of the Google bubble. One thing that really impressed me, both with how well it works and how easy it is to set up, was Elasticsearch. Elasticsearch is a search engine: you give it your documents, it indexes them, and it lets you query them quickly. There are other projects that do this for you, but ES is the one I went with.

Summing up contacts

Last time we processed the Digsby, Trillian and Pidgin logs and saved them as a unified YAML file for each contact.

Now let's start seeing who I talk with most. A naive way to do this would be to simply sort the YAML files by size. A small problem is that some contacts have longer IM names (such as thebestcatalin), while others have shorter ones (such as b0gdiy). That's a difference of 8 characters, so 8 bytes per line, which over, say, 100,000 lines exchanged adds up to 0.76 megabytes. A 100,000-line file is about 2 MB, so that would be an error of 38%.
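To double-check that back-of-the-envelope estimate, here is a small Python sketch. It assumes the contact's name appears once on every line and that a megabyte is 1024 * 1024 bytes; the function name `name_overhead_mb` and the constants are made up for illustration and are not part of the original project.

```python
# Rough check of the file-size bias caused by IM name length.
# Assumption: the name is written once on each of the exchanged lines.
LONG_NAME = "thebestcatalin"
SHORT_NAME = "b0gdiy"
LINES = 100_000
FILE_SIZE_MB = 2  # approximate size of a 100,000-line log, as estimated above


def name_overhead_mb(long_name: str, short_name: str, lines: int) -> float:
    """Extra megabytes that come only from the longer name."""
    extra_bytes_per_line = len(long_name.encode("utf-8")) - len(short_name.encode("utf-8"))
    return extra_bytes_per_line * lines / (1024 * 1024)


overhead_mb = name_overhead_mb(LONG_NAME, SHORT_NAME, LINES)
relative_error = overhead_mb / FILE_SIZE_MB

print(f"{overhead_mb:.2f} MB extra, {relative_error:.0%} error")
# prints "0.76 MB extra, 38% error"
```

So two contacts with exactly the same number of lines could differ by more than a third in file size, which is enough to scramble a ranking based on size alone.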

And there are a few other reasons to continue.