Structure Versioning For PyTables
Introduction
My task was to organize genotyping data for analysis. At the time I was working on a whole-genome
association study. For a programmer, the sheer size of the input datasets is one of the defining
characteristics of such projects. In this case I had over 70 GB of compressed text to deal with.
Data for a WGA study is best pictured as a very large matrix, with columns corresponding to single
nucleotide polymorphisms (SNPs). Rows correspond to samples or – in human terms, people who
donated their genetic material for the study. Values in cells of this huge1 matrix indicate nucleotides found
at these positions. Unfortunately, genotyping data does not necessarily arrive in a format convenient for
analysis. More often it will have structure suitable for recording measurements and will need to be
converted.
This document and the associated source code can be downloaded from www.visimatik.com.
The toolkit
Python and numpy are my tools of choice for this type of work. For data persistence, I would normally
use pickling or XML, but given the volume of data, neither was appropriate. Due to the required form of
data, a relational database would not fit the purpose, either. Enter HDF5 and PyTables2. HDF5 is a data
format and a library for managing large data collections, with a niche in scientific data. PyTables is a
Python interface to HDF5 files. The structure of an HDF5 file comprises three types of entities:
groups, tables and arrays (arrays will not be present in the discussion). These entities3 are arranged into a
tree, not unlike a file system.
The problem
Ultimately, my goal was to explore data through various programs and scripts I was yet to write. At that
point I could not possibly know what exactly these programs would do. What I did know was how they
might be accessing their data. For instance, retrieval of parts of the main matrix, along with indexes
describing loaded fragments, was expected to be a frequent request. This perspective defined my first
design. My grasp of the domain was still weak at the time. Knowing that, I consciously tried to ignore my
inner prophet's predictions of how the data structures would evolve. In principle, it is better to refrain from
making generalizations when you do not have enough detailed knowledge to draw them from.
Still, some rudimentary abstractions were necessary. To insulate computations from the technicalities of
managing the data file itself, I put together a simple encapsulation of a 'database connection'. My wrapper
consisted of two classes. TableDef defined the database structure, while Notebook was responsible for
'housekeeping chores': most importantly, creating, opening and closing the file.
Implementation of __getattr__ (listing 1) shows that no attempt was made to hide the structure of data
1 Whole genome studies frequently profile more than a thousand people using several thousand SNPs.
2 The best place to start looking for tools of this kind is Scipy, a comprehensive resource for using Python in science.
3 In the sequel, I will often refer to these entities by a common moniker 'part'.
storage: clients are allowed to directly access the catalog structure of the dataset. Most likely, this would
not be allowed to stand in the long run. However, this exercise is not about the mythical long run. It is
chiefly about the createDatabase and verifyDatabase routines. The former, invoked only when a new file is created,
puts in place basic structures required by the application. The latter asserts that the opened HDF5 file
appears to have these basics. The code in listing 1 is intended to give the idea of that first, sketchy
solution. It is provided for illustration purposes only.
This system worked for me for several days, until I realized that additional data, describing relations
between SNPs on the same chromosome, was needed. The new information came in the form of several
large matrices, again with associated indexes. I needed to adjust my design and, since re-running the
computation-heavy conversion was not practical, to upgrade my data files accordingly.
As one would expect, HDF allows modifications of data structures in existing files. There are several
functions in Pytables which manipulate data definitions; I had already used some of them to create the
original structures. What could be simpler than adding similarly shaped code injecting additions into the
model? Well, the route of small, incremental changes has inherent problems. For instance, the new code
would have to be very particular about verifying its preconditions, to check, for example, if the entity
about to be created already is there. Managing data definitions through writing more and more imperative
definitions is bound to result in something rather ugly as the software evolves. Consider the following
argument:
Suppose I would like to add a new table, defined by:
class DistanceInChromsomes(tables.IsDescription):
    chromosome = tables.UInt8Col()
    source_name = tables.StringCol(itemsize = 256)
    array_set = tables.StringCol(itemsize = 256)
Pytables accepts such a class declaration4 as the parameter describing the layout of the new table. I liked
the idea for its apparent elegance
and decided to extend it into a reusable mechanism tying these class definitions with their incarnations
in the physical file. Not just the tables, but also their place in the hierarchy ought to be deduced directly
from declarations. This definition will be allowed to evolve along with the application. The necessary
updates of the file structure will be taken care of as a matter of course.
4 The class must be derived from tables.IsDescription. Consult Pytables manual for details.
5 In our case the computation results in writing something to a disk file.
6 I can already hear my inner C++ programmer (have I already confessed that?) arguing the merits of metaprogramming,
C++ style. One could imagine generating the code for creating the table from a class using macros and/or templates.
7 Be aware that the term 'introspection' has a specific meaning in computer science. See type introspection.
8 I do realize it sounds awful. But dynamic languages do not have to be dynamically typed, nor does dynamic typing
imply a dynamic execution model. Consult your favorite on-line source or your local computer science guru for more
information.
layout at runtime. Consequently, reflection APIs in .NET and in JRE9 do not allow alterations to
definitions of classes or functions at runtime. Owing to its dynamic type system, Python permits
programs to modify themselves during execution.
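A minimal sketch of that dynamism (the names below are mine, not from the project): attributes can be attached to, and removed from, a class object long after its declaration, which is exactly the kind of surgery attempted later in this article.

```python
class Record(object):
    pass

# attach a new attribute to the class object itself, at runtime
setattr(Record, "objVersion", 1.0)
assert Record.objVersion == 1.0

# instances created before the change see it as well
r = Record()
assert r.objVersion == 1.0

# and remove it again; no compiler is there to object
delattr(Record, "objVersion")
assert not hasattr(Record, "objVersion")
```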
9 Admittedly, one can think up several ways of doing these things within the framework of .NET CLR (or in JRE), but
outside the framework of statically typed Java and C#. This is certainly a real need, since the software of this kind is
already emerging.
10 Arrays are different, because they do not have a structure to be defined up-front and they do not have to be declared
before they are inserted. Since they confuse the picture, I have decided to conveniently exclude them from consideration
for now.
11 Neither dropping of tables nor changes to tables' structure are supported by the code. It is quite apparent, though,
how additions of fields and removal of tables and groups could easily be handled in this framework (as soon as we get
it to work). Data transforms are clearly out of scope of this simple example.
stripped of this information. An obvious mistake, easy to catch in the first code review! However, as it
turns out, this monstrosity would not work even once.
Unexpected complications
For now, let us ignore the impropriety of fiddling with the definition of a class in a method of its instance.
This will be dealt with in time. Why is the program not working at all? As it turns out, attempts to access
the objVersion field of the chld variable (we are still in the __init__ method of TableDef) will often result in
an AttributeError exception. This may be surprising, since all members of the class that may pass the type filters
in if statements should have this attribute. Yet closer inspection in the debugger will reveal that there is no
such field. It is as if this piece of information has been lost. It is easy to find, though: all variables
assigned in class scope were turned into members of the columns collection (see Illustration 1, below).
Overall, the object representing a class derived from IsDescription looks quite different from what you
might expect. This is because the creators of Pytables have elected to customize the class creation process
through the use of __metaclass__.
Earlier in the article we discussed the notion of reflection. There is one important conclusion to
draw from that section, which should be stated clearly: classes are objects, too. In Python, classes are first-
class objects, which means they can be used like any other object in Python. In particular, they are at some
point constructed.
Normally, classes are created by a call to the built-in function type(name, bases, dict). This default behavior
can easily be changed by substituting the factory of the class object: the class's metaclass. A metaclass
(itself a class, too) has to have a special method called __new__, with a signature matching that of the type
function above. All that remains is a single assignment to the class's __metaclass__ variable in the body of its
declaration:
class metaMetaBase(type):
    def __new__(cls, classname, bases, classdict):
        pass

class MetaBase:
    __metaclass__ = metaMetaBase
The substituted method can modify the created class12 or return some other representation altogether.
12 It is a dictionary; in Python all objects are dictionaries in a way.
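To make the mechanics concrete, here is a small, self-contained sketch (my own example, not Pytables code). The metaclass is invoked directly, which sidesteps the difference between Python 2's __metaclass__ assignment and Python 3's metaclass= keyword:

```python
# a class built 'by hand': equivalent to writing a class statement
Point = type("Point", (object,), {"x": 0, "y": 0})
assert isinstance(Point, type)      # the class itself is an object

class Tagged(type):
    def __new__(cls, classname, bases, classdict):
        # rewrite the class dictionary before the class comes to life
        classdict["created_by"] = "Tagged"
        return type.__new__(cls, classname, bases, classdict)

# calling the metaclass directly creates a class that went through __new__
Widget = Tagged("Widget", (object,), {})
assert Widget.created_by == "Tagged"
```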
Stepping through the process of constructing any class derived from tables.IsDescription13, we quickly get
to the implementation of the class metaIsDescription in description.py. There, the class dictionary is
customized. All declared fields that 'look like' field names are grouped in a single item in the class
dictionary: columns. In effect, instead of chld.objVersion, we end up with chld.columns['objVersion'].
This explains why the original code does not work and hints at what needs to be done to fix it.
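The effect is easy to reproduce without Pytables. The toy metaclass below (its name and the 'sweep everything' heuristic are mine) moves class-scope variables into a single columns dictionary, which is exactly the symptom observed in the debugger:

```python
class metaGrouping(type):
    def __new__(cls, classname, bases, classdict):
        columns = {}
        # sweep ordinary class-scope assignments into one dictionary
        for key in list(classdict):
            if not key.startswith("__"):
                columns[key] = classdict.pop(key)
        classdict["columns"] = columns
        return type.__new__(cls, classname, bases, classdict)

SNPs = metaGrouping("SNPs", (object,),
                    {"objName": "/definition/SNPs", "objVersion": 1.0})

assert not hasattr(SNPs, "objVersion")       # hence the AttributeError
assert SNPs.columns["objVersion"] == 1.0     # the value survived, relocated
```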
Firstly, I will either need to override the existing mechanism of creating IsDescription, or hack my way
through nested declarations to extract objVersion and objName. Secondly, all processing of class
declarations will need to be moved from instance code of TableDef (or whatever will replace it) to its
class code.
After some thinking I decided to work around the nested dictionaries, rather than to insert another layer
of inheritance. I wanted to keep my little overlay consistent with Pytables' documentation and allow
deriving table definitions directly from IsDescription. Source code related to these changes is in listing 3
(note the lack of the usual 'suspicious code' disclaimer).
a reusable mechanism, this is to be avoided. I found it best to complete the encapsulation of the
mechanism by creating 'callable' objects, representing the actions of creating tables and groups (listing 6;
see their use in listing 4).
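The idea can be sketched independently of Pytables. In the stand-in below (my own simplification: 'creating' a part merely appends to a log), each pending structural change is an object, and executing an upgrade amounts to calling each action in turn:

```python
class CreateGroupAction(object):
    def __init__(self, path, name, title):
        self.Path, self.Name, self.Title = path, name, title
    def __call__(self, db):
        # the real wrapper would call into Pytables here; we only record the intent
        db.append(("group", self.Path, self.Name))

class CreateTableAction(object):
    def __init__(self, path, name, description):
        self.Path, self.Name, self.Description = path, name, description
    def __call__(self, db):
        db.append(("table", self.Path, self.Name))

actions = [CreateGroupAction("/", "definition", "Subject data definition"),
           CreateTableAction("/definition", "SNPs", None)]
log = []
for act in actions:
    act(log)      # running the upgrade is just calling each action
assert log == [("group", "/", "definition"), ("table", "/definition", "SNPs")]
```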
'__metaclass__' domesticated
Python's __metaclass__ idiom allows overriding Python's default constructor of objects representing
classes. This is an interesting semantic device and, although it certainly invites abuse, it can be very
helpful at times. While there are many ways to define database structures in code, the proposed approach
is elegant and, very importantly, offers strong encapsulation. Hence, classes which rearrange their own
layout in unexpected ways may still appear 'normal' to the outside world and the use of reflection
remains contained.
A reader interested in other uses of __metaclass__ should pick up a copy of the excellent Python
Cookbook. Standard Python documentation outlines issues related to class creation. Several suggestions
regarding potential application of metaclasses are given there, too.
class Individuals(tables.IsDescription):
    symbol = tables.StringCol(itemsize = 64)
    cohort = tables.UInt32Col()

class Cohorts(tables.IsDescription):
    symbol = tables.StringCol(itemsize = 32)
    fIndivIdx = tables.UInt32Col()
    countIndiv = tables.UInt32Col()

class Chromosomes(tables.IsDescription):
    symbol = tables.UInt8Col()
    fSNP_Idx = tables.UInt32Col()
    countSNP = tables.UInt32Col()

class Chunk(tables.IsDescription):
    chromosome = tables.UInt8Col()
    cohort = tables.StringCol(itemsize = 8)
    chunk_name = tables.StringCol(itemsize = 256)
class Notebook:
    Version = 1.0

    def __init__(self, name, path, write_access = True, feedback = Reporting.DummyFeedback()):
        """
        If the directory exists, attempt to open the file and the index files
        (if these do not exist, we have an error). If there is no directory,
        attempt to create it (and all tables that should be there).
        """
        # name will be a pickle with some basic info (just to mark the territory)
        self.__paths = {}
        self.__paths["main"] = path
        self.__paths["pickle"] = os.path.join(path, name.strip() + ".pck")
        self.__paths["HD5"] = os.path.join(path, "data.HD5")
        self.__HD5 = None
        self.Stamp = None
        self.__GUI = feedback
        self.__ReadOnly = not write_access
        try:
            if os.path.isdir(path):
                self.Stamp = cPickle.load(open(self.__paths["pickle"], "r"))
                # "r" when read-only, "r+" when writable
                self.__HD5 = tables.openFile(self.__paths["HD5"], ["r+", "r"][self.__ReadOnly])
                self.verifyDatabase(self.__HD5)
            elif not self.__ReadOnly:
                os.mkdir(path)
                self.Stamp = (name, self.Version, time())
                cPickle.dump(self.Stamp, open(self.__paths["pickle"], "w"))
                self.__HD5 = self.createDatabase(self.__paths["HD5"])
        except (IOError, OSError):
            # assumed error handling; the original listing's except clause is not shown
            self.__HD5 = None
        if self.__HD5 is None:
            raise NotebookDoesNotExist(path, name)

    def close(self):
        if not self.__ReadOnly:
            self.Stamp = (self.Stamp[0], self.Version, time())
            cPickle.dump(self.Stamp, open(self.__paths["pickle"], "w"))
        self.__HD5.close()
        self.__HD5 = None
        self.Stamp = None

    def __del__(self):
        if not self.__HD5 is None:
            self.__HD5.close()
            self.__HD5 = None
    def verifyDatabase(self, dbfile):
        assert dbfile.isVisibleNode("/metadata")
        assert dbfile.isVisibleNode("/definition")
        assert dbfile.isVisibleNode("/source")
        assert dbfile.isVisibleNode("/definition/chromosomes")
        assert dbfile.isVisibleNode("/definition/cohorts")
        assert dbfile.isVisibleNode("/definition/SNPs")
        assert dbfile.isVisibleNode("/definition/individuals")
        assert dbfile.isVisibleNode("/source/imports")
        return dbfile

    def kill(self):
        self.close()
        os.rmdir(self.__paths["main"])
class GroupDescriptor:
    pass

def getHierarchyFromPath(all_path):
    pieces = all_path.split("/")
    if len(pieces) > 1:
        return ["/".join(pieces[:idx + 2]) for idx in xrange(len(pieces) - 1)]
    else:
        return []
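A quick worked example of getHierarchyFromPath (restated here with xrange spelled range so it also runs under Python 3): for a part living under "/source/hapmap" it yields every ancestor path that may have to exist first, shortest prefix first.

```python
def getHierarchyFromPath(all_path):
    # list every path prefix implied by all_path, shortest first
    pieces = all_path.split("/")
    if len(pieces) > 1:
        return ["/".join(pieces[:idx + 2]) for idx in range(len(pieces) - 1)]
    return []

assert getHierarchyFromPath("/source/hapmap") == ["/source", "/source/hapmap"]
assert getHierarchyFromPath("") == []
```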
class TableDef:
    Version = 1.1

    class SNPs(tables.IsDescription):
        """SNPs table"""
        objName = "/definition/SNPs"
        objVersion = 1.0

    class Individuals(tables.IsDescription):
        objName = "/definition/individuals"
        objVersion = 1.0

    class Cohorts(tables.IsDescription):
        objName = "/definition/cohorts"
        objVersion = 1.0

    class Chromosomes(tables.IsDescription):
        objName = "/definition/chromosomes"
        objVersion = 1.0
        symbol = tables.UInt8Col()
        fSNP_Idx = tables.UInt32Col()
        countSNP = tables.UInt32Col()

    class Chunk(tables.IsDescription):
        objName = "/source/imports"
        objVersion = 1.0
        chromosome = tables.UInt8Col()
        cohort = tables.StringCol(itemsize = 8)
        chunk_name = tables.StringCol(itemsize = 256)

    class DistanceChromsomes(tables.IsDescription):
        objName = "/source/hapmap/files"
        objVersion = 1.1
        chromosome = tables.UInt8Col()
        source_name = tables.StringCol(itemsize = 256)
        array_set = tables.StringCol(itemsize = 256)

    class SourceGroup(GroupDescriptor):
        """Raw source data uploaded with nbcp"""
        objName = "/source"
        objVersion = 1.0

    class AnalysesGroup(GroupDescriptor):
        """Information derived in analysis"""
        objName = "/analyses"
        objVersion = 1.0

    class MetadataGroup(GroupDescriptor):
        """Metainformation about performed analyses"""
        objName = '/metadata'
        objVersion = 1.0

    class DefinitionGroup(GroupDescriptor):
        """Subject data definition"""
        objName = '/definition'
        objVersion = 1.0

    def __init__(self):
        classOf = self.__class__
        # now list all dependant classes
        groups = []
        tbls = []
        for chld in classOf.__dict__.values():
            try:
                bases = chld.__bases__
                if GroupDescriptor in bases:
                    description = chld.__doc__
                    name_parts = chld.objName.split("/")
                    count_parts = len(name_parts) - 1
                    assert name_parts[0] == "" and name_parts[count_parts] != ""
                    short_name = name_parts[count_parts]
                    path = "/".join(name_parts[:count_parts])
                    groups.append((short_name, path, count_parts, description, chld.objVersion))
                elif tables.IsDescription in bases:
                    # this code does not work. It is here for illustration purposes only
                    # DO NOT ATTEMPT TO USE
                    version = chld.objVersion
                    del chld.objVersion
                    objName = chld.objName
                    del chld.objName
                    name_parts = objName.split("/")
                    count_parts = len(name_parts) - 1
                    assert name_parts[0] == "" and name_parts[count_parts] != ""
                    short_name = name_parts[count_parts]
                    path = "/".join(name_parts[:count_parts])
            except AttributeError:
                pass
        # now find all groups that may be required (based on 'path' concept),
        # but are not explicitly listed
        grdct = {}
        for gr in groups:
            created_grp = "/".join([gr[1], gr[0]])
            grdct[created_grp] = gr[4]
        extra_groups = []
        for gr in groups:
            extra_groups += [(spath, gr[4]) for spath in getHierarchyFromPath(gr[1])]
        for tb in tbls:
            extra_groups += [(spath, tb[4]) for spath in getHierarchyFromPath(tb[1])]
        # walk through all extra (potential) groups. If they are not there yet,
        # add them to the list
        extra_groups.sort()
        prev_item = ""
        for gr in extra_groups:
            if gr[0] != prev_item and gr[0] not in grdct:
                prev_item = gr[0]
                name_parts = prev_item.split("/")
                count_parts = len(name_parts) - 1
                assert name_parts[0] == "" and name_parts[count_parts] != ""
                short_name = name_parts[count_parts]
                path = "/".join(name_parts[:count_parts])
                groups.append((short_name, path, count_parts, None, gr[1]))
        self.Groups = groups
        self.Tables = tbls
    def __str__(self):
        outs = "Tables:\r\n"
        for table in self.Tables:
            outs += "%s/%s ver %.1f (%s)\r\n" % (table[1], table[0], table[4], type(table[5]))
        outs += "Groups:\r\n"
        for group in self.Groups:
            outs += "%s/%s ver %.1f (%s)\r\n" % (group[1], group[0], group[4], group[3])
        return outs

if __name__ == "__main__":
    metad = TableDef()
    print metad
# (the opening of this listing was lost to a page break; the signature below
# follows the type(name, bases, dict) protocol described earlier)
class metaMetaBase(type):
    def __new__(cls, classname, bases, classdict):
        the_default = type.__new__(cls, classname, bases, classdict)
        if len(bases) == 0:
            return the_default  # do not process the base class...
        groups = []
        tbls = []
        for chld in the_default.__dict__.values():
            try:
                bases = chld.__bases__
                if GroupDescriptor in bases:
                    description = chld.__doc__
                    name_parts = chld.objName.split("/")
                    count_parts = len(name_parts) - 1
                    assert name_parts[0] == "" and name_parts[count_parts] != ""
                    short_name = name_parts[count_parts]
                    path = "/".join(name_parts[:count_parts])
                    groups.append((short_name, path, count_parts, description, chld.objVersion))
                elif tables.IsDescription in bases:
                    # the fields were moved by Pytables into the 'columns' dictionary
                    version = chld.columns['objVersion']
                    objName = chld.columns['objName']
                    name_parts = objName.split("/")
                    count_parts = len(name_parts) - 1
                    assert name_parts[0] == "" and name_parts[count_parts] != ""
                    short_name = name_parts[count_parts]
                    path = "/".join(name_parts[:count_parts])
                    tbls.append((short_name, path, count_parts, chld.__doc__, version, chld))
            except AttributeError:
                pass
        # now find all groups that may be required (based on 'path' concept),
        # but are not explicitly listed
        grdct = {}
        for gr in groups:
            created_grp = "/".join([gr[1], gr[0]])
            grdct[created_grp] = gr[4]
        extra_groups = []
        for gr in groups:
            extra_groups += [(spath, gr[4]) for spath in getHierarchyFromPath(gr[1])]
        for tb in tbls:
            extra_groups += [(spath, tb[4]) for spath in getHierarchyFromPath(tb[1])]
        # walk through all extra (potential) groups. If they are not there yet,
        # add them to the list
        extra_groups.sort()
        prev_item = ""
        for gr in extra_groups:
            if gr[0] != prev_item and gr[0] not in grdct:
                prev_item = gr[0]
                name_parts = prev_item.split("/")
                count_parts = len(name_parts) - 1
                assert name_parts[0] == "" and name_parts[count_parts] != ""
                short_name = name_parts[count_parts]
                path = "/".join(name_parts[:count_parts])
                groups.append((short_name, path, count_parts, None, gr[1]))
        #TODO - check if that works for more levels of inheritance (than just
        # something derived from MetaBase)
        the_default.Groups = groups
        the_default.Tables = tbls
        return the_default
class MetaBase:
    __metaclass__ = metaMetaBase
    Version = 0.0

    def __str__(self):
        outs = "Tables:\r\n"
        for table in self.Tables:
            outs += "%s/%s ver %.1f (%s)\r\n" % (table[1], table[0], table[4], type(table[5]))
        outs += "Groups:\r\n"
        for group in self.Groups:
            outs += "%s/%s ver %.1f (%s)\r\n" % (group[1], group[0], group[4], group[3])
        return outs
class TestTableDef(MetaBase):
    Version = 1.1

    class SNPs(tables.IsDescription):
        """SNPs table"""
        objName = "/definition/SNPs"
        objVersion = 1.0

    class Individuals(tables.IsDescription):
        objName = "/definition/individuals"
        objVersion = 1.0

    class Cohorts(tables.IsDescription):
        objName = "/definition/cohorts"
        objVersion = 1.0

    class Chromosomes(tables.IsDescription):
        objName = "/definition/chromosomes"
        objVersion = 1.0
        symbol = tables.UInt8Col()
        fSNP_Idx = tables.UInt32Col()
        countSNP = tables.UInt32Col()

    class Chunk(tables.IsDescription):
        objName = "/source/imports"
        objVersion = 1.0
        chromosome = tables.UInt8Col()
        cohort = tables.StringCol(itemsize = 8)
        chunk_name = tables.StringCol(itemsize = 256)

    class DistanceChromsomes(tables.IsDescription):
        objName = "/source/hapmap/files"
        objVersion = 1.1
        chromosome = tables.UInt8Col()
        source_name = tables.StringCol(itemsize = 256)
        array_set = tables.StringCol(itemsize = 256)

    class SourceGroup(GroupDescriptor):
        """Raw source data uploaded with nbcp"""
        objName = "/source"
        objVersion = 1.0

    class AnalysesGroup(GroupDescriptor):
        """Information derived in analysis"""
        objName = "/analyses"
        objVersion = 1.0
    class MetadataGroup(GroupDescriptor):
        """Metainformation about performed analyses"""
        objName = '/metadata'
        objVersion = 1.0

    class DefinitionGroup(GroupDescriptor):
        """Subject data definition"""
        objName = '/definition'
        objVersion = 1.0

class MetaBase:
    __metaclass__ = metaMetaBase
    Version = 0.0

    def __str__(self):
        outs = "Tables:\r\n"
        for table in self.Tables:
            outs += "%s/%s ver %.1f (%s)\r\n" % (table[1], table[0], table[4], type(table[5]))
        outs += "Groups:\r\n"
        for group in self.Groups:
            outs += "%s/%s ver %.1f (%s)\r\n" % (group[1], group[0], group[4], group[3])
        return outs

    def getCreateSequence(self):
        return self.getVersionDifferential(0.0)
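The body of getVersionDifferential does not appear in the surviving listings; the sketch below is my guess at its contract, inferred from getCreateSequence above and from the upgrade code in listing 5: given the version found in the file, return creation actions only for parts newer than that version, groups before tables.

```python
def getVersionDifferential(groups, tbls, file_version):
    # groups/tbls hold (name, path, depth, description, version) records,
    # the shape collected by the metaclass; this body is an assumption
    actions = []
    for (name, path, depth, desc, version) in sorted(groups, key=lambda g: g[2]):
        if version > file_version:
            actions.append(("create group", path, name))
    for (name, path, depth, desc, version) in tbls:
        if version > file_version:
            actions.append(("create table", path, name))
    return actions

groups = [("definition", "", 1, None, 1.0), ("hapmap", "/source", 2, None, 1.1)]
tbls = [("files", "/source/hapmap", 3, None, 1.1)]

assert len(getVersionDifferential(groups, tbls, 0.0)) == 3    # fresh file: everything
assert getVersionDifferential(groups, tbls, 1.0) == [         # upgrading 1.0: only the additions
    ("create group", "/source", "hapmap"),
    ("create table", "/source/hapmap", "files")]
```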
Listing 5: The main wrapper class (Notebook)
class Notebook:
    """
    THIS IS AN ILLUSTRATION ONLY. CODE NOT RELEVANT TO THE EXAMPLE HAS BEEN LEFT OUT
    """
    def version(self):
        ver = 0.0
        if not self.__ReadOnly:
            ver = self.metadata().Version
        elif not self.Stamp is None:
            ver = self.Stamp[1]
        return ver

        cmdlst = self.metadata().getVersionDifferential(ver)
        if len(cmdlst):
            for cmd in cmdlst:
                cmd(db)

        lst = self.metadata().getVersionPaths(self.version())
        assert len(lst) > 0
        for path in lst:
            assert path in dbfile

    def metadata(self):
        if self.__metadata is None:
            self.__metadata = TableDef()
        return self.__metadata

        return dbfile
# (the class header was lost to a page break; the name mirrors CreateGroupWrapper below)
class CreateTableWrapper:
    def __init__(self, path, name, defobj, dsc):
        self.Path = ["/", path][path is not None and path != ""]
        self.Name = name
        self.DefClss = defobj
        self.Title = ["", str(dsc)][dsc is not None]

    def __str__(self):
        return "table %s=>%s (%s)" % (self.Path, self.Name, self.Title)

class CreateGroupWrapper:
    def __init__(self, path, name, dscr):
        self.Path = ["/", path][path is not None and path != ""]
        self.Name = name
        self.Title = dscr

    def __str__(self):
        return "group %s=>%s (%s)" % (self.Path, self.Name, self.Title)