Saturday, January 19, 2013

Numpy's memory maps for increased file read/write efficiency

I mainly use Matlab and Python when I need to analyze and visualize data and run simulations. As I mentioned in my previous post, I like to document when I learn something new that makes my programs written with these tools cleaner or more efficient.

I've just discovered Numpy's memory map object for efficiently reading and writing to very large arrays in files. A memory map is used to access only part of an array that is saved in a binary file on a hard disk. The memory map may be indexed like any other array that exists in memory, but only the part of the array that you are working on will be loaded into memory. This is similar to the idea of generators that I wrote about in my previous post.

As an example, I create a memory map object called inFile which interacts with a file whose location is stored in the variable filename:
inFile = numpy.memmap(filename, dtype='float32', mode='w+', shape=(3, 4))
The mode argument 'w+' tells Numpy to create the file if it doesn't exist, or to overwrite it if it does. Printing the contents of inFile at this point displays a 3x4 array of zeros. When I want to write to this memory map, I can assign to a portion of it like:
inFile[0,3] = 3.2
which writes 3.2 to the element in the first row, fourth column of the array. When I delete inFile, its contents are flushed to disk (calling inFile.flush() does the same without closing the map):
del inFile
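Putting the writing steps together, a minimal runnable sketch (the file name example.dat and the variable names are my own placeholders):

```python
import numpy as np

# Create (or overwrite) a binary file backing a 3x4 float32 array.
out = np.memmap('example.dat', dtype='float32', mode='w+', shape=(3, 4))

# Assign to a single element; only the touched pages live in memory.
out[0, 3] = 3.2

# flush() writes pending changes to disk explicitly;
# deleting the object also flushes before closing the map.
out.flush()
del out
```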
Reading parts of this array back from the file is done with expressions like the following:
inFile = numpy.memmap(filename, dtype='float32', mode='r', shape=(3, 4))
Now, to work with some of the file's contents, I only need to type
myNewVariable = inFile[:,2]
which, for example, gives me the third column of the array. (Strictly speaking, slicing a memmap returns another memmap view backed by the file; wrapping it in numpy.array() copies the data into an ordinary in-memory array.)
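The view-versus-copy distinction can be seen in a short self-contained sketch (demo.dat is a placeholder name; the file is created here just so there is something to read):

```python
import numpy as np

# Build a small binary file to read from.
writer = np.memmap('demo.dat', dtype='float32', mode='w+', shape=(3, 4))
writer[:] = np.arange(12, dtype='float32').reshape(3, 4)
del writer  # flushes the data to disk

# Open the same file read-only.
reader = np.memmap('demo.dat', dtype='float32', mode='r', shape=(3, 4))

col = reader[:, 2]          # still a memmap view backed by the file
in_memory = np.array(col)   # an ordinary ndarray copied into RAM

print(type(col).__name__)        # memmap
print(type(in_memory).__name__)  # ndarray
```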

Using memory maps has made my programs noticeably faster, though I haven't measured exactly how much time is saved.
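I haven't benchmarked it carefully, but here is a rough sketch of the kind of comparison one could run (big.dat and the array size are illustrative placeholders; actual timings depend heavily on the OS file cache):

```python
import time
import numpy as np

n = 1_000_000  # placeholder array size, not from any real measurement

# Write a large float32 array to a scratch file.
np.arange(n, dtype='float32').tofile('big.dat')

# Full read: loads all n values into memory, then uses only 100 of them.
t0 = time.perf_counter()
full = np.fromfile('big.dat', dtype='float32')
hundred_full = full[:100].copy()
t_full = time.perf_counter() - t0

# Mapped read: only the pages holding the first 100 values are touched.
t0 = time.perf_counter()
mapped = np.memmap('big.dat', dtype='float32', mode='r')
hundred_mapped = np.array(mapped[:100])
t_map = time.perf_counter() - t0

print(f'full read: {t_full:.6f}s, mapped read: {t_map:.6f}s')
```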