Blog Post

Log scraping with Clojure
Clojure

Log scraping with Clojure 

Introduction

Clojure is a general purpose language that compiles to the JVM, Javascript and .Net CLR.
It’s a Lisp language, dynamically typed, functional,with a great set of immutable data structures and great support for multithreading which makes it a great tool for data processing.

In this post we are going to build a very small Clojure program to parse an openarena server log and calculate some stats from it.

TL;DR

  • Load openarena log lines into a Clojure repl
  • Filter lines that talks about a killer, a victim and a weapon
  • Calculate a global(all matches wide) weapon usage stats
  • Tweak the program so it uses all available CPU power

All with pure functions and 100% immutable data structures.

Before starting

All we need for this is Java and Leiningen installed and some basic Clojure knowledge.
Even if it’s the first time you see Clojure code, you should be able to follow the examples if you have ever used some language that uses map, filter and reduce functions.

A quick look at our data

 

$ du -h all-matches.log

801M all-matches.log


$ sed -n '666,686p' all-matches.log

5:09 Kill: 3 0 8: Diego killed dolphin by MOD_PLASMA
5:09 PlayerScore: 3 43: Diego now has 43 points
5:09 Challenge: 3 207 1: Client 3 got award 207
5:09 Challenge: 3 1 1: Client 3 got award 1
5:09 Challenge: 0 2 1: Client 0 got award 2
5:09 PlayerScore: 3 45: Diego now has 45 points
5:09 CTF: 3 2 3: Diego fragged BLUE's flag carrier!
5:12 Item: 1 team_CTF_redflag
5:12 PlayerScore: 1 40: General McArthur now has 40 points
5:12 CTF: 1 1 2: General McArthur returned the RED flag!
5:12 Item: 1 weapon_plasmagun
5:15 Item: 2 team_CTF_redflag
5:15 CTF: 2 1 0: EL DOLOR ES PASAJERO got the RED flag!
5:15 PlayerScore: 2 13: EL DOLOR ES PASAJERO now has 13 points
5:16 Item: 3 weapon_railgun
5:16 Kill: 1 0 8: General McArthur killed dolphin by MOD_PLASMA
5:16 PlayerScore: 1 41: General McArthur now has 41 points
5:16 Challenge: 1 207 1: Client 1 got award 207
5:16 Challenge: 1 1 1: Client 1 got award 1
5:16 Challenge: 0 2 1: Client 0 got award 2
5:16 PlayerScore: 1 43: General McArthur now has 43 points

This is a log file created by an Openarena server. Every action in the multiplayer game ends up here.
For example if one player kills another, the server logs something like :

5:09 Kill: 3 0 8: Diego killed dolphin by MOD_PLASMA

Diego and dolphin are player nicknames, while MOD_PLASMA is the weapon used by Diego to kill dolphin.

Loading log lines into the repl

Let’s start a Clojure repl

$ lein repl

Clojure 1.6.0

user=>

and start requiring some utilities

user=> (require '[clojure.string :as str])

Clojure is a functional programming language so let’s start defining some functions.
First define a function to get the lines from a file (this will be the only non pure function).

user=> (defn log-lines [file-name]
         (-> (slurp file-name)
             (str/split-lines)))

This should be kind of self descriptive, we first slurp the content of file-name, this will give us a big string with the file content, and then feed that into string split-lines, which will return a list of newlines separated lines.

Let’s try it

user=> (def log-lines (log-lines "all-matches.log"))

user=> (->> log-lines
            (drop 666)
            (take 20))

("5:09 Kill: 3 0 8: Diego killed dolphin by MOD_PLASMA"
 "5:09 PlayerScore: 3 43: Diego now has 43 points"
 "5:09 Challenge: 3 207 1: Client 3 got award 207"
 "5:09 Challenge: 3 1 1: Client 3 got award 1"
 "5:09 Challenge: 0 2 1: Client 0 got award 2"
 "5:09 PlayerScore: 3 45: Diego now has 45 points"
 "5:09 CTF: 3 2 3: Diego fragged BLUE's flag carrier!"
 "5:12 Item: 1 team_CTF_redflag"
 "5:12 PlayerScore: 1 40: General McArthur now has 40 points"
 "5:12 CTF: 1 1 2: General McArthur returned the RED flag!"
 "5:12 Item: 1 weapon_plasmagun"
 "5:15 Item: 2 team_CTF_redflag"
 "5:15 CTF: 2 1 0: EL DOLOR ES PASAJERO got the RED flag!"
 "5:15 PlayerScore: 2 13: EL DOLOR ES PASAJERO now has 13 points"
 "5:16 Item: 3 weapon_railgun"
 "5:16 Kill: 1 0 8: General McArthur killed dolphin by MOD_PLASMA"
 "5:16 PlayerScore: 1 41: General McArthur now has 41 points"
 "5:16 Challenge: 1 207 1: Client 1 got award 207"
 "5:16 Challenge: 1 1 1: Client 1 got award 1"
 "5:16 Challenge: 0 2 1: Client 0 got award 2"
 "5:16 PlayerScore: 1 43: General McArthur now has 43 points")

Great, a list with all log lines!
That last expression is functionally similar to the sed command we executed before, so we should be seeing the same lines.

Just keep the important stuff

For the stats we are going to compute, we are only interested in kill lines(the ones that talks about who killed who with which weapon), which looks like :

“5:09 Kill: 3 0 8: Diego killed dolphin by MOD_PLASMA”

Now back to our repl and let’s define a regular expression to match those lines and capture it’s parts.

user=> (def killed-line-regexp #"(.*?) Kill: .*?: (.*?) killed (.*?) by (.*?)")

We can define regular expressions in clojure very similar to a string, but adding a # on the front.

Let’s try it :

user=> (re-matches killed-line-regexp "5:09 Kill: 3 0 8: Diego killed dolphin by MOD_PLASMA")

["5:09 Kill: 3 0 8: Diego killed dolphin by MOD_PLASMA"
 "5:09"
 "Diego"
 "dolphin"
 "MOD_PLASMA"]

user=> (re-matches killed-line-regexp "another dummy line")
nil

As you can see, when we use captures in the re-matches regex, if it matches, it will return a vector with the matched string as the first element and one element for each capture, otherwise it just returns nil.

Let’s use that to create a predicate to check if our line is a kill line or not

user=> (defn is-kill-line? [line]
         (re-matches killed-line-regexp line))

And lets use that to filter only the lines we want.

user=> (def kill-lines (filter is-kill-line? log-lines))

user=> (take 10 kill-lines)

(" 5:07 Kill: 2 6 23: Snipper reloaded killed Diego by MOD_NAIL"
 " 5:08 Kill: 7 1 3: Assassin killed EL DOLOR ES PASAJERO by MOD_MACHINEGUN"
 " 5:09 Kill: 7 4 3: Assassin killed VOLVIO UN DIA by MOD_MACHINEGUN"
 " 5:10 Kill: 2 1 23: Snipper reloaded killed EL DOLOR ES PASAJERO by MOD_NAIL"
 " 5:14 Kill: 5 4 8: The Machine killed VOLVIO UN DIA by MOD_PLASMA"
 " 5:14 Kill: 3 5 7: dolphin killed The Machine by MOD_ROCKET_SPLASH"
 " 5:16 Kill: 7 3 3: Assassin killed dolphin by MOD_MACHINEGUN"
 " 5:18 Kill: 1 7 10: EL DOLOR ES PASAJERO killed Assassin by MOD_RAILGUN"
 " 5:21 Kill: 7 1 3: Assassin killed EL DOLOR ES PASAJERO by MOD_MACHINEGUN"
 " 5:25 Kill: 6 2 10: Diego killed Snipper reloaded by MOD_RAILGUN")

Bring our data back to life

If we want to work with that data, we need something easier than plain strings.
Lets go back to our repl and define a function than given a kill line like :

” 5:07 Kill: 2 6 23: Snipper reloaded killed Diego by MOD_NAIL”

returns a map like :

{:killer "Snipper reloaded"
 :victim "Diego"
 :weapon "MOD_NAIL"}

Something like

user=> (defn kill-line->kill-map [line]
         (let [[_ _ killer victim weapon] (re-matches killed-line-regexp line)]
           {:killer killer
            :victim victim
            :weapon weapon}))

Here we are using Clojure destructuring, one of those things that once you know them, you can’t live without.
The left part is a template for binding those names to the values returned by the re-matches expression.

[_                                                      _      killer  victim    weapon] 
["5:09 Kill: 3 0 8: Diego killed dolphin by MOD_PLASMA" "5:09" "Diego" "dolphin" "MOD_PLASMA"]

Notice that we aren’t binding anything to the first two values because we aren’t going to use them.

Now let’s do that for all our kill lines.

user=> (def kill-maps (map kill-line->kill-map kill-lines))

user=> (take 3 kill-maps)

({:weapon "MOD_NAIL", :victim "Diego", :killer "Snipper reloaded"}
 {:weapon "MOD_MACHINEGUN", :victim "EL DOLOR ES PASAJERO", :killer "Assassin"}
 {:weapon "MOD_MACHINEGUN", :victim "VOLVIO UN DIA", :killer "Assassin"})

Calculating the stats

Now that we have a list of kills, lets reduce them into some stats, like weapon usage.

For this we are going to create a reducer function, one that given stats and a kill will return stats updated with this new kill. Something like :

user=> (defn stats-reductor [stats kill-map]
         (-> stats
             (update-in [(:weapon kill-map)] (fnil inc 0))))

user=> (stats-reductor {} {:weapon "MOD_RAILGUN", :victim "Snipper reloaded", :killer "Diego"})
{"MOD_RAILGUN" 1}

user=> (stats-reductor {"MOD_RAILGUN" 4
                        "MOD_MACHINEGUN" 3}

                       {:weapon "MOD_RAILGUN", :victim "Snipper reloaded", :killer "Diego"})

{"MOD_RAILGUN" 5
 "MOD_MACHINEGUN" 3}

Lets apply that to all our kill-maps

user=> (reduce stats-reductor {} kill-maps)

{"MOD_PLASMA_SPLASH" 3,
 "MOD_ROCKET" 47,
 "MOD_SHOTGUN" 27,
 "MOD_RAILGUN" 168,
 "MOD_ROCKET_SPLASH" 67,
 "MOD_PLASMA" 56,
 "MOD_MACHINEGUN" 84,
 "MOD_NAIL" 166}

Now lets put everything together with some sorting :

user=> (defn process-all [log-lns]
         (->> log-lns
              (filter is-kill-line?)
              (map kill-line->kill-map)
              (reduce stats-reductor {})
              (into [])
              (sort-by second >)))

user=> (process-all log-lines)

(["MOD_RAILGUN" 168]
 ["MOD_NAIL" 166]
 ["MOD_MACHINEGUN" 84]
 ["MOD_ROCKET_SPLASH" 67]
 ["MOD_PLASMA" 56]
 ["MOD_ROCKET" 47]
 ["MOD_SHOTGUN" 27]
 ["MOD_PLASMA_SPLASH" 3])

Shouldn’t all be working?

Now everything looks great, but let’s take a look at execution time and CPU usage :

user=> (time (process-all log-lines))
"Elapsed time: 24301.618696 msecs"

Wow, 24 seconds for our 800Mb log, but look at our cores while the program is running

We have 8 cores processor and we are using only one of them.
Can’t we modify that process-all function to do those filtering, mapping and reducing in parallel?

One of the ways we can achieve this with Clojure is by using clojure.core.reducers

Back to our repl to require it

user=> (require '[clojure.core.reducers :as r])

Now we can do (r/map) (r/filter) (r/fold) and everything will be distributed over all
available cores.

r/map and r/filter looks exactly the same as our original map and filter

r/fold instead is very similar to reduce, but it needs a way to combine partial reductions.
It will split our kill-maps collection evenly, and reduce each of the splits in parallel, so it will end up with a bunch of stats map, one for each sub collection, so then it needs a function to combine those partial stats into the final one.

In this case combining two partial stats into one with Clojure is pretty straight forward

user=> (merge-with +
                   {"MOD_RAILGUN" 5
                    "MOD_MACHINEGUN" 3}
                   {"MOD_RAILGUN" 1
                    "MOD_NAILGUN" 2})

{"MOD_NAILGUN" 2
 "MOD_RAILGUN" 6
 "MOD_MACHINEGUN" 3}

Cool, lets put everything together again :

user=> (defn process-all-parallel [log-lns]
         (->> log-lns
              (r/filter is-kill-line?)
              (r/map kill-line->kill-map)
              (r/fold (partial merge-with +) stats-reductor)
              (into [])
              (sort-by second >)))


user=> (time (process-all-parallel log-lines))
"Elapsed time: 5898.349234 msecs"

Less than 6 seconds, that’s much better!

And our cores?

So putting everything together

(ns blog.core
  (:require [clojure.string :as str]
            [clojure.core.reducers :as r]))

(def killed-line-regexp #"(.*?) Kill: .*?: (.*?) killed (.*?) by (.*?)")

(defn kill-line->kill-map [l]
  (->> (re-matches killed-line-regexp l)
       (drop 2) ;; drop full line and time
       (zipmap [:killer :victim :weapon])))

(defn stats-reductor
  ([] {}) ;; Only needed by r/fold
  ([stats kill-map]
     (-> stats
         (update-in [(:weapon kill-map)] (fnil inc 0)))))

(defn is-kill-line? [line]
  (re-matches killed-line-regexp line))

(defn log-lines [file-name]
  (-> (slurp file-name)
      (str/split-lines)))

(defn process-all [log-lns]
  (->> log-lns
       (filter is-kill-line?)
       (map kill-line->kill-map)
       (reduce stats-reductor {})
       (into [])
       (sort-by second >)))

(defn process-all-parallel [log-lns]
  (->> log-lns
       (r/filter is-kill-line?)
       (r/map kill-line->kill-map)
       (r/fold (partial merge-with +) stats-reductor)
       (into [])
       (sort-by second >)))
(Visited 369 times, 1 visits today)

Related posts

en_USEnglish
en_USEnglish