
s11n data files

Or: What does s11n do with my data?

As has been said many times over, libs11n is internally data-format agnostic. What does this mean? It means that the library doesn't really care what format your data is in. It does expect a few conventions to be followed, most notably that the data can be structured in a DOM-like model, but it doesn't inherently care what data store is used for object persistence. The core library works only at the level of DOM-like trees of abstract data, and knows nothing about file i/o.
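
Conceptually, every serialized object becomes one node in such a tree: a node has a name, a set of key/value properties, and a list of child nodes. As a purely illustrative sketch of that model (these are not libs11n's actual types):

#include <list>
#include <map>
#include <string>

// Illustrative only: the *shape* of the abstract data model the core
// library works with, not libs11n's real node type.
struct demo_node
{
    std::string name;                         // node/class name
    std::map<std::string,std::string> props;  // key/value properties
    std::list<demo_node> children;            // nested child objects
};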

The exact data formats are read and written by so-called Serializers, which are described in more detail on the Serializers page. On this page we will take a quick look at how those formats compare, in terms of file size and speed, for a sample data set.

Keep in mind that clients are not required to use libs11n's built-in i/o layer: they may provide their own arbitrary i/o layer and still take advantage of the core serialization interfaces.
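
For example, client code could serialize its objects into a node tree via the core interfaces and then hand the tree to its own writer. Continuing the toy demo_node sketch from above (again purely illustrative, not the real API), such a writer might look like:

#include <iostream>

// A client-supplied output layer: recursively dump a demo_node tree
// in whatever format the client happens to want.
void dump( const demo_node & n, std::ostream & os, int depth = 0 )
{
    std::string indent( depth * 2, ' ' );
    os << indent << n.name << '\n';
    std::map<std::string,std::string>::const_iterator pit = n.props.begin();
    for( ; pit != n.props.end(); ++pit )
    {
        os << indent << "  " << pit->first << "=" << pit->second << '\n';
    }
    std::list<demo_node>::const_iterator cit = n.children.begin();
    for( ; cit != n.children.end(); ++cit )
    {
        dump( *cit, os, depth + 1 );
    }
}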

We're going to be a bit crude here, and simply show a lightly edited copy of a shell session...
First, a script which we use to mass-convert a given input file:
#!/bin/bash
# Mass-convert the input file to each supported format, reporting
# the conversion time and resulting file size for each one.
S="compact expat funtxt funxml parens simplexml wesnoth"
inf=${1?"Usage: $0 input_filename"}

# Print only the size and filename columns of ls -l output.
lsfilter()
{
    ls "$@" | awk '{ print $5, $NF; }'
}
echo "Input file:"
lsfilter -l $inf
for s in $S; do
    echo -n $s...
    of=${inf%%.*}.$s
    time ~/bin/s11nconvert -f $inf -s $s -o $of
    lsfilter -l $of
    echo
done

Now some data... a file containing 54400 object nodes (much larger than the average data file):

stephan@owl:~/> ls -lS *.s11n
-rw-r--r-- 1 stephan users 4894806 2004-09-29 23:41 biggie.s11n
The data format of the input file is largely irrelevant here, except that it will impact the overall runtime (some serializers read or write more slowly than others).
Run our "test":
stephan@owl:~/> ./stest.sh biggie.s11n
Input file:
4894806 biggie.s11n
compact...
real 0m3.360s
user 0m3.036s
sys 0m0.062s
2493010 biggie.compact
expat...
real 0m4.491s
user 0m3.958s
sys 0m0.214s
4720430 biggie.expat
funtxt...
real 0m3.897s
user 0m3.222s
sys 0m0.078s
4141438 biggie.funtxt
funxml...
real 0m3.781s
user 0m3.503s
sys 0m0.081s
4894806 biggie.funxml
parens...
real 0m3.160s
user 0m2.954s
sys 0m0.060s
2691751 biggie.parens
simplexml...
real 0m3.801s
user 0m3.471s
sys 0m0.078s
3658867 biggie.simplexml
wesnoth...
real 0m3.331s
user 0m3.100s
sys 0m0.083s
3902750 biggie.wesnoth
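
In this run the compact and parens formats produce the smallest uncompressed files (roughly half the size of the funxml output), and all seven conversions complete in about 3.2 to 4.5 seconds of wall time.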


The actual load times, not including the startup time of s11nconvert, boil down to loading between 30k and 50k object nodes per second, depending on the data format, layout of the objects, etc. The sample data included deeply nested containers of objects containing several properties each (mostly numeric data with some strings).
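
For perspective: each timed run above includes both a full load and a full save, and the faster runs use about 3 seconds of CPU time for 54400 nodes, or roughly 18k nodes per second for the whole round trip. At a load-only rate of 30k-50k nodes per second, loading accounts for very roughly 1.1 to 1.8 of those seconds.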

Note that the above files don't use any sort of compression. If we enable compression in s11nconvert (via the -z and -bz flags) we can significantly reduce file sizes (assuming your copy of libzfstream was built with zlib/bz2lib support). The same data files, with and without compression (compressed via s11nconvert, not the gzip and bzip2 tools, though the results should be the same or very similar):

2493010 biggie.compact
14010 biggie.compact.bz2
178074 biggie.compact.gz
4720430 biggie.expat
36695 biggie.expat.bz2
204032 biggie.expat.gz
4141438 biggie.funtxt
31513 biggie.funtxt.bz2
199008 biggie.funtxt.gz
4894806 biggie.funxml
37072 biggie.funxml.bz2
205446 biggie.funxml.gz
2691751 biggie.parens
23992 biggie.parens.bz2
176088 biggie.parens.gz
3658867 biggie.simplexml
28020 biggie.simplexml.bz2
184820 biggie.simplexml.gz
3902750 biggie.wesnoth
31438 biggie.wesnoth.bz2
196264 biggie.wesnoth.gz

Yes, those bz2 file sizes are real! That compressor beats most others hands down, but it is also notably slower than zlib. In fact, for large data sets, enabling zlib compression can actually speed up the read and write times by a small amount, since far less data moves through the filesystem! bz2lib, however, is dog slow (but damned good).
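
To put those sizes in perspective: the funxml output shrinks from 4894806 bytes to 37072 under bz2, a ratio of roughly 132:1 (versus about 24:1 for its gzip counterpart), and the compact output compresses at nearly 178:1.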

Client code can set the compression policy framework-wide with any of the following calls:
zfstream::compression_policy( zfstream::GZipCompression );
zfstream::compression_policy( zfstream::BZipCompression );
zfstream::compression_policy( zfstream::NoCompression );

That policy is respected by the s11n::io implementation.
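
For example, a client might flip the policy on just before saving. Here is a minimal sketch assuming the s11nlite front-end and its save() function; the header path and the filename are illustrative and may vary across library versions:

#include <s11n.net/s11n/s11nlite.hpp> // header path may differ per version

// Save any Serializable via s11nlite, asking the i/o layer to
// bzip2-compress everything it writes from this point on.
template <typename SerializableT>
bool save_compressed( const SerializableT & obj )
{
    zfstream::compression_policy( zfstream::BZipCompression );
    // s11nlite's file output goes through the zfstream-aware i/o
    // layer, so the resulting file comes out bz2-compressed.
    return s11nlite::save( obj, "myfile.s11n" );
}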