A colleague recently came up to me with a problem: he had a trace file representing the history of several years' worth of CRUD operations on many records.
Something resembling:
1:create:hello
123:create:xxx
12345:create:hello
123:update:uuu
999:update:42
123:delete
1:update:xxx
1:update:iuu
12345:update:hello world
1:delete
The task at hand was to work out which records were active at the end of the trace and to determine the most up-to-date value of each active record.
The expected result here is:
Didn't find 999 to update. Creating new entry.
Results:
12345 => hello world
999 => 42
This instantly reminded me of a concordancing problem.
The natural data structure for this is a Map. Thus we have the first potential solution for the problem:
def m = [:]
new File(/..\data.dat/).splitEachLine(':') { line ->
    def (id, op, data) = line
    switch (op) {
        case 'update':
            if (!m.containsKey(id))
                println "Didn't find ${id} to update. Creating new entry."
            // deliberate fall-through: an unseen update is treated as a create
        case 'create':
            m.put(id, data)
            break
        case 'delete':
            m.remove(id)
            break
    }
}
println "Results:"
m.each { k, v ->
    println "\t${k} => ${v}"
}
This works nicely: it's clear and concise…all the things I like! Groovy makes it even nicer :-)
But (there's always a but…sadly) my colleague had lots of data to process. Too much to be realistically held in memory.
The tool I would normally reach for in this circumstance is a plain old database, but hey! This is the age of NoSQL.
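For comparison, here is a minimal sketch of what that plain-database route might look like, assuming an embedded H2 database; the H2 driver, the ids table and its MERGE upsert are illustrative assumptions, not part of the original script:
import groovy.sql.Sql

// sketch only: an embedded H2 database standing in for the "plain old database"
def sql = Sql.newInstance('jdbc:h2:./ids', 'sa', '', 'org.h2.Driver')
sql.execute 'CREATE TABLE IF NOT EXISTS ids (id VARCHAR PRIMARY KEY, data VARCHAR)'

new File(/..\data.dat/).splitEachLine(':') { line ->
    def (id, op, data) = line
    switch (op) {
        case 'update':
            if (!sql.firstRow('SELECT 1 FROM ids WHERE id = ?', [id]))
                println "Didn't find ${id} to update. Creating new entry."
            // fall through: MERGE inserts or overwrites the row for this id
        case 'create':
            sql.execute 'MERGE INTO ids (id, data) VALUES (?, ?)', [id, data]
            break
        case 'delete':
            sql.execute 'DELETE FROM ids WHERE id = ?', [id]
            break
    }
}
sql.eachRow('SELECT id, data FROM ids') { row ->
    println "\t${row.id} => ${row.data}"
}
sql.close()
The MERGE gives the same create-or-overwrite behaviour as the Map version's fall-through, with the data kept on disk rather than in the heap.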
Enter JDBM3:
JDBM provides HashMap, TreeMap, HashSet, TreeSet and LinkedList backed up by disk storage. Now you can insert more than a billion records into collection without running out of memory. JDBM is probably the fastest pure Java database.
Ten minutes of hacking and version one had transformed into version two:
import net.kotek.jdbm.*

// uses JDBM3-alpha-2 from https://github.com/jankotek/JDBM3
final path = /..\data/

// start with non-existent database
new File(path).eachFile { it.delete() }

new DBMaker(path + /\ids/).build().with { db ->
    db.createHashMap("ids").with { m ->
        new File(/..\data.dat/).splitEachLine(':') { line ->
            def (id, op, data) = line
            switch (op) {
                case 'update':
                    if (!m.containsKey(id))
                        println "Didn't find ${id} to update. Creating new entry."
                    // deliberate fall-through: an unseen update is treated as a create
                case 'create':
                    m.put(id, data)
                    break
                case 'delete':
                    m.remove(id)
                    break
            }
            db.commit()   // persist the change for this line to disk
        }
        m.each { k, v ->
            println "\t${k} => ${v}"
        }
    }
    db.close()
}
Still nice, clear and concise…but now capable of dealing with a (few) billion records.
Even more to like!
Update:
It is now possible to simplify this further. After mentioning to the developer that:
If there is one small thing I would like tidied up in the example code, it is this:
> // start with non-existent database
> new File(path).eachFile { it.delete() }
Having to delete the database/files by hand, rather than just specifying a create-truncate mode of some sort 'feels' wrong.
He responded within hours:
…it is done now, check DBMaker.
This means that one can now do:
new DBMaker(path + /\ids/).deleteFilesAfterClose().build().with { db ->
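With that in place, the two manual clean-up lines at the top of version two can simply be dropped. A sketch of how the opening might then look (the rest of the script is unchanged):
import net.kotek.jdbm.*

final path = /..\data/

// no manual file deletion needed: JDBM removes its files when the DB is closed
new DBMaker(path + /\ids/).deleteFilesAfterClose().build().with { db ->
    db.createHashMap("ids").with { m ->
        // ...process the trace and print the results exactly as before...
    }
    db.close()
}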
Open source FTW!