Log scraping with Clojure

Introduction

Clojure is a general-purpose language that compiles to the JVM, JavaScript and the .NET CLR.
It’s a Lisp: dynamically typed and functional, with a rich set of immutable data structures and solid support for multithreading, which makes it a good fit for data processing.

In this post we are going to build a very small Clojure program to parse an OpenArena server log and calculate some stats from it.

TL;DR

  • Load OpenArena log lines into a Clojure REPL
  • Filter the lines that mention a killer, a victim and a weapon
  • Calculate global (across all matches) weapon usage stats
  • Tweak the program so it uses all available CPU power

All with pure functions and 100% immutable data structures.

Before starting

All we need for this is Java and Leiningen installed and some basic Clojure knowledge.
Even if this is the first time you see Clojure code, you should be able to follow the examples if you have ever used a language with map, filter and reduce functions.

A quick look at our data

This is a log file created by an OpenArena server. Every action in the multiplayer game ends up here.
For example, if one player kills another, the server logs something like:

5:09 Kill: 3 0 8: Diego killed dolphin by MOD_PLASMA

Diego and dolphin are player nicknames, while MOD_PLASMA is the weapon used by Diego to kill dolphin.

Loading log lines into the REPL

Let’s start a Clojure REPL (with Leiningen, that is lein repl)

and start by requiring some utilities
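The steps that follow only need clojure.string. The original require form is not shown in this post, so the alias str below is an assumption made for these sketches.

```clojure
;; Alias clojure.string as str (the alias name is an illustrative choice).
(require '[clojure.string :as str])
```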

Clojure is a functional programming language, so let’s start by defining some functions.
First, define a function to get the lines from a file (this will be the only impure function).
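A minimal sketch of such a function could look like this. The name get-lines is hypothetical, picked for illustration, and the str alias is the one assumed above.

```clojure
(require '[clojure.string :as str])

;; Not pure: the result depends on what is on disk when it runs.
(defn get-lines
  "Reads the whole file at file-name and returns its lines."
  [file-name]
  (str/split-lines (slurp file-name)))
```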

This should be fairly self-explanatory: we first slurp the content of file-name, which gives us one big string with the file content, and then feed that into clojure.string’s split-lines, which returns a vector with one string per line.

Let’s try it
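Since the real log is not at hand here, the sketch below writes a tiny stand-in file first. The temp file, the name sample-log and the "some other event" line are made up for illustration; the two kill lines are the ones quoted in this post.

```clojure
(require '[clojure.string :as str])

(defn get-lines [file-name]
  (str/split-lines (slurp file-name)))

;; Hypothetical stand-in for the real server log: a temp file holding the two
;; kill lines quoted in this post plus one made-up non-kill line.
(def sample-log (java.io.File/createTempFile "openarena-sample" ".log"))
(spit sample-log
      (str "5:09 Kill: 3 0 8: Diego killed dolphin by MOD_PLASMA\n"
           "0:00 some other event\n"
           " 5:07 Kill: 2 6 23: Snipper reloaded killed Diego by MOD_NAIL\n"))

(def lines (get-lines sample-log))

(count lines)  ;; => 3
(first lines)  ;; => "5:09 Kill: 3 0 8: Diego killed dolphin by MOD_PLASMA"
```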

Great, a list with all log lines!
That last expression is functionally similar to the sed command we executed before, so we should be seeing the same lines.

Just keep the important stuff

For the stats we are going to compute, we are only interested in kill lines (the ones that say who killed whom with which weapon), which look like:

“5:09 Kill: 3 0 8: Diego killed dolphin by MOD_PLASMA”

Now back to our REPL: let’s define a regular expression to match those lines and capture their parts.
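The original expression is not shown in this post, so here is one possible regex, assuming every kill line follows the layout of the sample above. The name kill-line-re is made up for illustration.

```clojure
;; Captures, in order: time, killer, victim, weapon.
;; The three numbers after "Kill:" are matched but not captured.
(def kill-line-re
  #"\s*(\d+:\d+) Kill: \d+ \d+ \d+: (.+) killed (.+) by (\w+)")
```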

In Clojure, a regular expression literal looks just like a string with a # in front.

Let’s try it:
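A quick check against one kill line and one non-kill line, using the same assumed regex (kill-line-re is a hypothetical name, and "some other event" is a made-up line):

```clojure
(def kill-line-re
  #"\s*(\d+:\d+) Kill: \d+ \d+ \d+: (.+) killed (.+) by (\w+)")

(re-matches kill-line-re "5:09 Kill: 3 0 8: Diego killed dolphin by MOD_PLASMA")
;; => ["5:09 Kill: 3 0 8: Diego killed dolphin by MOD_PLASMA"
;;     "5:09" "Diego" "dolphin" "MOD_PLASMA"]

(re-matches kill-line-re "0:00 some other event")
;; => nil
```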

As you can see, when the regex passed to re-matches contains capture groups and the line matches, we get back a vector with the whole matched string as the first element followed by one element per capture; otherwise it just returns nil.

Let’s use that to create a predicate to check if our line is a kill line or not
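Because nil is falsey, the predicate can simply wrap re-matches. This is a sketch under the same assumptions as before; kill-line? and kill-line-re are illustrative names.

```clojure
(def kill-line-re
  #"\s*(\d+:\d+) Kill: \d+ \d+ \d+: (.+) killed (.+) by (\w+)")

(defn kill-line?
  "True when line has the shape of a kill line."
  [line]
  (boolean (re-matches kill-line-re line)))

;; The second line is a made-up non-kill line.
(filter kill-line?
        ["5:09 Kill: 3 0 8: Diego killed dolphin by MOD_PLASMA"
         "0:00 some other event"])
;; => ("5:09 Kill: 3 0 8: Diego killed dolphin by MOD_PLASMA")
```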

And let’s use that to keep only the lines we want.

Bring our data back to life

If we want to work with that data, we need something easier than plain strings.
Let’s go back to our REPL and define a function that, given a kill line like:

“ 5:07 Kill: 2 6 23: Snipper reloaded killed Diego by MOD_NAIL”

returns a map like:
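Assuming keyword keys named :killer, :victim and :weapon (the key names and the var name example-kill are chosen for illustration), the map for that line could be:

```clojure
;; One kill, as plain data.
(def example-kill
  {:killer "Snipper reloaded"
   :victim "Diego"
   :weapon "MOD_NAIL"})
```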

Something like
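Here is a sketch of that function. It assumes the four-capture regex used earlier (time, killer, victim, weapon); kill-line->map and kill-line-re are hypothetical names.

```clojure
(def kill-line-re
  #"\s*(\d+:\d+) Kill: \d+ \d+ \d+: (.+) killed (.+) by (\w+)")

(defn kill-line->map
  "Turns a kill line into a map with :killer, :victim and :weapon."
  [line]
  ;; The two underscores skip the full match and the time capture.
  (let [[_ _ killer victim weapon] (re-matches kill-line-re line)]
    {:killer killer :victim victim :weapon weapon}))

(kill-line->map " 5:07 Kill: 2 6 23: Snipper reloaded killed Diego by MOD_NAIL")
;; => {:killer "Snipper reloaded", :victim "Diego", :weapon "MOD_NAIL"}
```

Applying it to every filtered line is then just a map call with kill-line->map as the function.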

Here we are using Clojure destructuring, one of those things you can’t live without once you know them.
The left-hand side of the binding is a template that binds those names to the values returned by the re-matches expression.

Notice that we aren’t binding anything to the first two values because we aren’t going to use them.

Now let’s do that for all our kill lines.

Calculating the stats

Now that we have a list of kills, let’s reduce them into some stats, like weapon usage.

For this we are going to create a reducer function: one that, given the stats so far and a kill, returns the stats updated with this new kill. Something like:
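A minimal sketch of such a reducer, assuming stats is a map from weapon name to kill count and each kill is a map with a :weapon key. The names update-stats and kill-maps are illustrative, and the third kill repeats the first only to show the counting.

```clojure
(defn update-stats
  "Returns stats with the count for this kill's weapon incremented."
  [stats kill]
  ;; fnil makes inc treat a weapon that is not in stats yet as 0.
  (update stats (:weapon kill) (fnil inc 0)))

(def kill-maps
  [{:killer "Diego" :victim "dolphin" :weapon "MOD_PLASMA"}
   {:killer "Snipper reloaded" :victim "Diego" :weapon "MOD_NAIL"}
   {:killer "Diego" :victim "dolphin" :weapon "MOD_PLASMA"}])

(reduce update-stats {} kill-maps)
;; => {"MOD_PLASMA" 2, "MOD_NAIL" 1}
```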

Let’s apply that to all our kill-maps

Now let’s put everything together with some sorting:
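Put together, the sequential pipeline might look like the sketch below. Only process-all is a name taken from this post; every other name, and the regex itself, is an assumption carried over from the earlier steps.

```clojure
(require '[clojure.string :as str])

(def kill-line-re
  #"\s*(\d+:\d+) Kill: \d+ \d+ \d+: (.+) killed (.+) by (\w+)")

(defn get-lines [file-name]
  (str/split-lines (slurp file-name)))

(defn kill-line? [line]
  (boolean (re-matches kill-line-re line)))

(defn kill-line->map [line]
  (let [[_ _ killer victim weapon] (re-matches kill-line-re line)]
    {:killer killer :victim victim :weapon weapon}))

(defn update-stats [stats kill]
  (update stats (:weapon kill) (fnil inc 0)))

(defn process-all
  "Returns [weapon count] pairs for the log at file-name, most used first."
  [file-name]
  (->> (get-lines file-name)
       (filter kill-line?)
       (map kill-line->map)
       (reduce update-stats {})
       (sort-by val >)))
```

Sorting by val with > puts the most used weapon first.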

Shouldn’t all cores be working?

Now everything looks great, but let’s take a look at execution time and CPU usage.

Wow, 24 seconds for our 800 MB log, and look at our cores while the program is running.

We have an 8-core processor and we are using only one of the cores.
Can’t we modify that process-all function to do the filtering, mapping and reducing in parallel?

One of the ways we can achieve this in Clojure is by using clojure.core.reducers.

Back to our REPL to require it
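The r alias matches the r/map, r/filter and r/fold calls used in the rest of the post:

```clojure
;; clojure.core.reducers ships with Clojure itself, so no extra dependency is needed.
(require '[clojure.core.reducers :as r])
```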

Now we can use r/map, r/filter and r/fold, and r/fold will spread the work over all available cores (as long as the source collection is foldable, which a vector is).

r/map and r/filter are called exactly like our original map and filter
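A small sketch on plain numbers (made-up data, with the illustrative name evens-plus-one) shows the calling convention. The one difference worth knowing is that the result is not a lazy sequence but a recipe that only runs when it is reduced or folded.

```clojure
(require '[clojure.core.reducers :as r])

;; Same argument order as clojure.core/filter and map.
(def evens-plus-one
  (->> [1 2 3 4 5 6]
       (r/filter even?)
       (r/map inc)))

;; into reduces the recipe, which is when the work actually happens.
(into [] evens-plus-one)
;; => [3 5 7]
```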

r/fold, on the other hand, is very similar to reduce, but it needs a way to combine partial reductions.
It splits our kill-maps collection into chunks of roughly equal size and reduces each chunk in parallel, so it ends up with several stats maps, one per chunk. It then needs a function to combine those partial stats into the final one.

In this case, combining two partial stats into one is pretty straightforward in Clojure
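One way to write the combining function is with merge-with, as sketched below. The names merge-stats and update-stats are hypothetical. r/fold also calls the combining function with no arguments to get the starting value for each chunk, which is why there is a zero-argument arity.

```clojure
(require '[clojure.core.reducers :as r])

(defn update-stats [stats kill]
  (update stats (:weapon kill) (fnil inc 0)))

(defn merge-stats
  "Combines two partial stats maps; with no arguments returns empty stats."
  ([] {})
  ([a b] (merge-with + a b)))

(merge-stats {"MOD_PLASMA" 2 "MOD_NAIL" 1} {"MOD_PLASMA" 3})
;; => {"MOD_PLASMA" 5, "MOD_NAIL" 1}

;; r/fold takes the combining function first, then the reducing function.
(r/fold merge-stats update-stats
        [{:weapon "MOD_PLASMA"} {:weapon "MOD_NAIL"} {:weapon "MOD_PLASMA"}])
;; => {"MOD_PLASMA" 2, "MOD_NAIL" 1}
```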

Cool, let’s put everything together again:

Less than 6 seconds, that’s much better!

And our cores?

So putting everything together
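The whole parallel version fits in a few lines. This is a sketch rather than the original listing: the regex and every name except process-all are assumptions made for illustration.

```clojure
(require '[clojure.string :as str]
         '[clojure.core.reducers :as r])

;; Captures, in order: time, killer, victim, weapon.
(def kill-line-re
  #"\s*(\d+:\d+) Kill: \d+ \d+ \d+: (.+) killed (.+) by (\w+)")

;; The only impure function: reads the file and returns a vector of lines.
(defn get-lines [file-name]
  (str/split-lines (slurp file-name)))

(defn kill-line? [line]
  (boolean (re-matches kill-line-re line)))

(defn kill-line->map [line]
  (let [[_ _ killer victim weapon] (re-matches kill-line-re line)]
    {:killer killer :victim victim :weapon weapon}))

;; Reducing function: adds one kill to the stats.
(defn update-stats [stats kill]
  (update stats (:weapon kill) (fnil inc 0)))

;; Combining function: merges the stats of two chunks.
(defn merge-stats
  ([] {})
  ([a b] (merge-with + a b)))

(defn process-all
  "Returns [weapon count] pairs for the log at file-name, most used first."
  [file-name]
  (->> (get-lines file-name)
       (r/filter kill-line?)
       (r/map kill-line->map)
       (r/fold merge-stats update-stats)
       (sort-by val >)))
```

Compared with the sequential pipeline, only three calls change: filter, map and reduce become r/filter, r/map and r/fold.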
