Games with the API: Counting (citizens) can be expensive (in time), or not

Day 1,267, 19:39 | Published in Greece by Earentir

I have been playing with the API for the last couple of days, and it has been, to say the least, "interesting". Unfortunately, the API is very limited in the information it provides, but it still offers some interesting possibilities, mainly for building statistics on citizens and countries.

Before going on, I would like to make an open query to the eRepublik admins:
Why don't we get economy, social, military, etc. in the country feed? It is a pain in the a** to parse the already-rendered HTML from the site (maybe a topic for another article?).

On to our target.
After a talk with a couple of friends, we decided we wanted some statistics that could give a better picture of the state of our eCountry.

As most of you already know, eRep reports actual residence, not citizenship, on the country tab of the site, giving a misleading count of constituents.
So the most obvious option was to see what we could get through the API.

At first I started working on a small Perl script using XML::XPath::XMLParser.

Although I could get good results, I had a few issues. Since the parser, in order to save time, dumps its results at the end of the process, I had to wait and then do the cleanup and sorting afterwards; not to mention that the output of the parse wasn't in perfect condition and needed more work.
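
For the curious, that first attempt boiled down to something like this (a reconstruction for illustration, not my actual script; the element layout citizens/citizen/citizenship/country/id is assumed from the xmlstarlet query further down):

#!/usr/bin/perl
use strict;
use warnings;
use XML::XPath;

# Parse the whole feed into a tree up front (the slow, memory-hungry
# part), then walk every citizenship/country/id node and tally the ids.
my $xp = XML::XPath->new(filename => 'citizens.xml');
my %count;
for my $node ($xp->find('//citizens/citizen/citizenship/country/id')->get_nodelist) {
    $count{ $node->string_value }++;
}
printf "%7d %s\n", $count{$_}, $_
    for sort { $count{$b} <=> $count{$a} } keys %count;

Everything gets parsed into a tree before the first result comes out, which is exactly why the cleanup and the sorting could only happen at the end.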

The biggest issue, though, was time. The parser took more than 8 minutes to traverse the XML and dramatically drove up the HDD IOPS. To scale this, I would need far more spindles than I was willing to spend.

After playing with it for 4-5 hours, I decided to scrap it, disregard the XML structure entirely, and start doing raw searches. It was time to play with C.

Unfortunately, this one was a disaster. To do the searches efficiently, I had to load large parts of the XML into memory, but filling the memory caused paging; performance dropped dramatically after the first 4MB, and once paging kicked in around the 8MB mark, the search was crawling. Two more hours down the drain.
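
For illustration, the raw-search idea is roughly this (a Perl sketch of the concept rather than my actual C code, and it reads fixed-size chunks instead of the big in-memory buffers that got me into paging trouble; the tag layout is again assumed):

#!/usr/bin/perl
use strict;
use warnings;

# Raw search: no XML parser at all. Scan the bytes one 1MB chunk at a
# time for <citizenship><country><id>N patterns, carrying the unmatched
# tail over so a record that straddles two chunks is not lost.
open my $fh, '<', 'citizens.xml' or die "citizens.xml: $!";
my %count;
my $buf = '';
while (read $fh, my $chunk, 1 << 20) {
    $buf .= $chunk;
    my $last = 0;
    while ($buf =~ /<citizenship>\s*<country>\s*<id>(\d+)/g) {
        $count{$1}++;
        $last = pos $buf;
    }
    $buf = substr $buf, $last;   # a real version would also cap the tail length
}
close $fh;
printf "%7d %s\n", $count{$_}, $_
    for sort { $count{$b} <=> $count{$a} } keys %count;

Even in this friendlier form you end up reinventing half a parser by hand.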

At this point I decided it was time to see what wise Google knew, and it knew a lot.
The first couple of pages were miserable failures, but then I started noticing a pattern: happy people and XMLStarlet kept appearing in the results together.
It was time for the XMLStarlet Toolkit.
After the first execution I was sure I had a problem. Previously I had to run my tests on a subset of the gz, since the full parse took upwards of 8 minutes. But running xmlstarlet on the full set took 3 seconds. So there was only ONE explanation: it was broken.

It took 3 runs of each of my previous attempts to finally accept that xmlstarlet had been imbued with secret powers by Gandalf.

So this one-liner (match every citizenship/country node, print the id of each, then sort and count):
time xml sel -t -m //citizens/citizen/citizenship/country -v "id" -o " " -n citizens.xml | sort -b | uniq -c

Takes this much time:
real 0m2.966s
user 0m2.588s
sys 0m0.380s

Impressive, isn't it?

So what does the bard want to tell us today?
Counting can be expensive, only... if you don't know how to count 😛

GP,
time to ..scale.. away
---
Just for the people who do not know me (and nobody here does, lol): I have a bad tendency to want all of my apps to run in 64MB of RAM, including the OS and the app data.