Thursday, January 17, 2013

Increase a Python program's memory efficiency with generators

I've recently been working with very large data sets (more than a million data points) and have encountered a serious reduction in the efficiency of my Python programs' computations. One reason for this is that I have been reading each large data file into memory all at once, performing my computation on it, and then moving on to the next file, all the while keeping my calculation results in memory so that I can plot them at the end.
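To make the pattern concrete, here is a sketch of the memory-hungry approach I describe (sampledata.txt stands in for one of my data files, and I'm assuming one number per line): the entire file is converted into a list before any computation begins.

# Eager approach: readlines() pulls every line into memory at once.
inFile = open('sampledata.txt')
data = [float(line) for line in inFile.readlines()]
inFile.close()
result = sum(data) / len(data)  # computation starts only after the full load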

One tool I've discovered for increasing the efficiency of these types of operations is the idea of a generator. Now, I realize that these are well-known in Python circles. However, I am not primarily a programmer, so I am not fully aware of the tools available to me when I write programs. Hence, I sometimes use this blog as a notebook so I can easily find these tools again in the future.

Simply put, a generator is an object that executes code to produce each of its values only when that value is requested, and yet may be iterated over like the elements of a list. One closely related example, shown here, is the readline() method of a file object. The code looks like this:
inFile = open('sampledata.txt')
w = inFile.readline()  # w holds only the first line, not the whole file
Strictly speaking, w is not itself a generator; it is just a string holding one line of the file. The point is that readline() reads only one line into memory at a time: each call to inFile.readline() returns the string for the next line of sampledata.txt. According to the above link, an object that loads just part of a data set into memory at a time, while still letting you iterate over all of the parts, behaves like a generator. In fact, the file object inFile is itself such an iterator: writing for line in inFile: processes the file one line at a time.
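You can also write generators of your own with the yield keyword. Here is a minimal sketch (the file name and the one-float-per-line format are just placeholders for my actual data) that computes a mean without ever holding the full data set in memory:

def data_points(filename):
    """Yield one data point per line, reading the file lazily."""
    with open(filename) as f:
        for line in f:
            yield float(line)  # only the current line is in memory

total, count = 0.0, 0
for x in data_points('sampledata.txt'):
    total += x
    count += 1
print(total / count)  # the mean, accumulated one point at a time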

Another nice description of the use of generators is given here. And, because I am having some trouble plotting all this data, a discussion of possible solutions to plotting large data sets is presented here at Stack Overflow.
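One simple approach to plotting large data sets is to downsample before plotting, since a screen cannot show a million points anyway. A minimal sketch, assuming NumPy and matplotlib are available (the array and the step size of 100 are placeholders):

import numpy as np
import matplotlib.pyplot as plt

y = np.random.randn(1000000)  # stand-in for a large result array
plt.plot(y[::100])            # plot every 100th point instead of all of them
plt.show()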