10 Reasons Python Rocks for Research (And a Few Reasons it Doesn’t)¶
The following is an account of my own experience with Python. Because that experience has been so positive, it is an unabashed attempt to promote the use of Python for general scientific research and development. About four years ago, I dropped MATLAB in favor of Python as my primary language for coding research projects. This article is a personal account of how rewarding I have found that experience.
As I describe in the next sections, the variety and quality of Python’s features has spoiled me. Even in small scripts, I now rely on Python’s numerous data structures, classes, nested functions, iterators, the flexible function calling syntax, an extensive kitchen-sink-included standard library, great scientific libraries, and outstanding documentation.
To clarify, I am not advocating just Python as the perfect scientific programming environment; I am advocating Python plus a handful of mature 3rd-party open source libraries, namely Numpy/Scipy for numerical operations, Cython for low-level optimization, IPython for interactive work, and MatPlotLib for plotting. Later, I describe these and others in more detail, but I introduce these four here so I can weave discussion of them throughout this article.
Given these libraries, many features in MATLAB that enable one to quickly write code for machine learning and artificial intelligence – my primary area of research – are essentially a small subset of those found in Python. After a day learning Python, I was able to still use most of the matrix tricks I had learned in MATLAB, but also utilize more powerful data structures and design patterns when needed.
Holistic Language Design¶
I once believed that the perfect language for research was one that allowed concise and direct translation from notepad scribblings to code. On the surface, this is reasonable. The more barriers between generating ideas and trying them out, the slower research progresses. In other words, the less one has to think about the actual coding, the better. I now believe, however, that this attitude is misguided.
MATLAB’s language design is focused on matrix and linear algebra operations; for turning such equations into one-liners, it is pretty much unsurpassed. However, move beyond these operations and it often becomes an exercise in frustration. R is beautiful for interactive data analysis, and its open library of statistical packages is amazing. However, the language design can be unnatural, and even maddening, for larger development projects. While Mathematica is perfect for interactive work with pure math, it is not intended for general purpose coding.
The problem with the “perfect match” approach is that you lose generalizability very quickly. When the criteria for language design is too narrow, you inevitably choose excellence for one application over greatness for many. This is why universities have graduate programs in computer language design – navigating the pros and cons of various design decisions is extremely difficult to get right. The extensive use of Python in everything from system administration and website design to numerical number-crunching shows that it has, indeed, hit the sweet spot. In fact, I’ve anecdotally observed that becoming better at R leads to skill at interacting with data, becoming better at MATLAB leads to skill at quick-and-dirty scripting, but becoming better at Python leads to genuine programming skill.
Practically, in my line of work, the downside is that some matrix operators that are expressable using syntactical constructs in MATLAB become function calls (e.g. instead of ). In exchange for this extra verbosity – which I have not found problematic – one gains incredible flexibility and a language that is natural for everything from automating system processes to scientific research. The coder doesn’t have to switch to another language when writing non-scientific code, and allows one to easily leverage other libraries (e.g. databases) for scientific research.
Furthermore, Python allows one to easily leverage object oriented and functional design patterns. Just as different problems call for different ways of thinking, so also different problems call for different programming paradigms. There is no doubt that a linear, procedural style is natural for many scientific problems. However, an object oriented style that builds on classes having internal functionality and external behavior is a perfect design pattern for others. For this, classes in Python are full-featured and practical. Functional programming, which builds on the power of iterators and functions-as-variables, makes many programming solutions concise and intuitive. Brilliantly, in Python, everything can be passed around as an object, including functions, class definitions, and modules. Iterators are a key language component and Python comes with a full-featured iterator library. While it doesn’t go as far in any of these categories as flagship paradigm languages such as Java or Haskell, it does allow one to use some very practical tools from these paradigms. These features combine to make the language very flexible for problem solving, one key reason for its popularity.
To reiterate a recurrent point, Python’s syntax is very well thought out. Unlike many scripting languages (e.g. Perl), readability was a primary consideration when Python’s syntax was designed. In fact, the joke is that turning pseudocode into correct Python code is a matter of correct indentation.
This readability has a number of beneficial effects. Guido van Rossum, Python’s original author, writes:
This emphasis on readability is no accident. As an object-oriented language, Python aims to encourage the creation of reusable code. Even if we all wrote perfect documentation all of the time, code can hardly be considered reusable if it’s not readable. Many of Python’s features, in addition to its use of indentation, conspire to make Python code highly readable.
In addition, I’ve found it encourages collaboration, and not just by lowers the barrier to contributing to an open source Python project. If you can easily discuss your code with others in your office, the result can be better code and better coders.
As two examples of this, consider the following code snippet:
Let me list three aspects of this code. First, it is a small, self-contained function that only requires three lines to define, including documentation (the string following the function). Second, a default argument for the boundary is specified in a way that is instantly readable (and yes, that does show up when using Sphinx for automatic documentation). Third, the list processing syntax is designed to be readable. Even if you are not used to reading Python code, it is easy to parse this code – a new list is defined and returned from the list using if a particular value is above and otherwise. Finally, when calling functions, Python allows named arguments – this universally promotes clarity and reduces stupid bookkeeping bugs, particularly with functions requiring more than one or two arguments.
Permit me to contrast these features with MATLAB. With MATLAB, globally available functions are put in separate files, discouraging the use of smaller functions and – in practice – often promotes cut-and-paste programming, the bane of debugging. Default arguments are a pain, requiring conditional coding to set unspecified arguments. Finally, specifying arguments by name when calling is not an option, though one popular but artificial construct – alternating names and values in an argument list – allows this to some extent.
Balance of High Level and Low Level Programming¶
The ease of balancing high-level programming with low-level optimization is a particular strong point of Python code. Python code is meant to be as high level as reasonable – I’ve heard that in writing similar algorithms, on average you would write six lines of C/C++ code for every line of Python. However, as with most high-level languages, you often sacrifice code speed for programming speed.
One sensible approach around this is to deal with higher level objects – such as matrices and arrays – and optimize operations on these objects to make the program acceptably fast. This is MATLAB’s approach and is one of the keys to its success; it is also natural with Python. In this context, speeding code up means vectorizing your algorithm to work with arrays of numbers instead of with single numbers, thus reducing the overhead of the language when array operations are optimized.
Abstractions such as this are absolutely essential for good scientific coding. Focusing on higher-level operations over higher-level data types generally leads to massive gains in coding speed and coding accuracy. Python’s extension type system seamlessly allows libraries to be designed around this idea. Numpy’s array type is a great examples.
However, existing abstractions are not always enough when you’re developing new algorithms or coding up new ideas. For example, vectorizing code through the use of arrays is powerful but limited. In many cases, operations really need loops, recursion, or other coding structures that are extremely efficient in optimized, compiled machine code but are not in most interpreted languages. As variables in many interpreted languages are not statically typed, the code can’t easily be compiled into optimized machine code. In the scientific context, Cython provides the perfect balance between the two by allowing either.
Cython works by first translating Python code into equivalent C code that runs the Python interpreted through the Python C API. It then uses a C compiler to create a shared library that can be loaded as a Python module. Generally, this module is functionally equivalent to the original Python module and usually runs marginally faster. The advantage, however, is that Cython allows one to statically type variables – e.g. declares to be an integer. This gives massive speedups, as typed variables are now treated using low-level types rather than Python variables. With these annotations, your “Python” code can be as fast as C – while requiring very little actual knowledge of C.
Practically, a few type declarations can give you incredible speedups. For example, suppose you have the following code:
where is a 2d NumPy array. This code uses interpreted loops and thus runs fairly slowly. However, add type information and use Cython:
Cython translates necessary Python operations into calls to the Python C-API, but the looping and array indexing operations are turned into low level C code. For a 1000 x 1000 array, on my 2.4 GHz laptop, the Python version takes 1.67 seconds, while the Cython version takes only 3.67 milliseconds (a vectorized version of the above using an outer product took 15.1 ms).
A general rule of thumb is that your program spends 80% of its time running 20% of the code. Thus a good strategy for efficient coding is to write everything, profile your code, and optimize the parts that need it. Python’s profilers are great, and Cython allows you to do the latter step with minimal effort.
As a side affect of its universality, Python excels at gluing other languages together. One can call MATLAB functions from Python (through the MATLAB engine) using MLabWrap, easing transitions from MATLAB to Python. Need to use that linear regression package in R? RPy puts it at your fingertips. Have fast FORTRAN code for a particular numerical algorithm? F2py will effortless generate a wrapper. Have general C or C++ libraries you want to call? Ctypes, Cython, or SWIG are three ways to easily interface to it (my favorite is Cython). Now, if only all these were two way streets...
Brilliantly, Python incorporates module, class, function, and method documentation directly into the language itself. In essence, there are two levels of comments – programming level comments (start with ) that are ignored by the compiler, and documentation comments that are specified by a doc string after the function or method name. These documentation strings add tags to the methods which are accessible by anyone using an interactive Python shell or by automatic documentation generators.
The beauty of Python’s system becomes apparent when using Sphinx, a documentation generation system originally built for the Python language documentation. To allow sufficient presentation flexibility, it allows reStructuredText directives, a simple, readable markup language that is becoming widely used in code documentation. Sphinx works easily with embedded doc-strings, but it is useful beyond documentation – for example, my personal website, my course webpages when I teach, my code documentation sites, and, of course, Python’s main website are generated using Sphinx.
One helpful feature for scientific programming is the ability to put LaTeX equations and plots directly in code documentation. For example, if you write:
in the doc string, it is rendered in the webpage as
Including plots is easy. The following doc-string code:
In essence, this enables not only comments about the code, but also comments about the science and research behind your code, to be interwoven into the coding file.
Hierarchical Module System¶
Python uses modular programming, a popular system that naturally organizes functions and classes into hierarchical namespaces. Each Python file defines a module. Classes, functions, or variables that are defined in or imported into that file show up in that module’s namespace. Importing a module either creates a local dictionary holding that module’s objects, pulls some of the module’s objects into the local namespace. For example, binds to hashlib’s md5 checksum function; alternately, binds to this function. This helps programming namespaces to follow a hierarchical organization.
On the coding end, a Python file defines a module. Similarly, a directory containing an Python file is treated the same way, files in that directory can define submodules, and so on. Thus the code is arranged in a hierarchical structure for both the programmer and the user.
Permit me a short rant about MATLAB to help illustrate why this is a great feature. In MATLAB, all functions are declared in the global namespace, with names determined by filenames in the current path variable. However, this discourages code reusability by making the programmer do extra work keeping disparate program components separate. In other words, without a hierarchical structure to the program, it’s difficult to extract and reuse specific functionality. Second, programmers must either give their functions long names, essentially doing what a decent hierarchical system inherently does, or risk namespace conflicts which can be difficult to resolve and result in subtle errors. While this may help one to throw something together quickly, it is a horrible system from a programming language perspective.
Good programming requires having and using the correct data structures for your algorithm. This is almost universally under-emphasized in research-oriented coding. While proof-of-concept code often doesn’t need optimal data structures, such code causes problems when used in production. This often – though it’s rarely stated or known explicitly – limits the scalability of a lot of existing code. Furthermore, when such features are not natural in a language’s design, coders often avoid them and fail to learn and use good design patterns.
Python has lists, tuples, sets, dictionaries, strings, threadsafe queues, and many other types built-in. Lists hold arbitrary data objects and can be sliced, indexed, joined, split, and used as stacks. Sets hold unordered, unique items. Dictionaries map from a unique key to anything and form the real basis of the language. Heaps are available as operations on top of lists (similar to the C++ STL heaps). Add in NumPy, and one has an n-dimensional array structure that supports optimized and flexible broadcasting and matrix operations. Add in SciPy, and you have sparse matrices, kd-trees, image objects, time-series, and more.
Python has an impressivestandard library packaged with Python. Its philosophy is batteries-included, and a standard Python distribution comes with built-in database functionality, a variety of data persistence features, routines for interfacing with the operating system, website interfacing, email and networking tools, data compression support, cryptography, xml support, regular expressions, unit testing, multithreading, and much more. In short, if I want to take a break from writing a bunch of matrix manipulation code and automate an operating system task, I don’t have to switch languages.
Numerous libraries provide the needed functionality for scientific . The following is a list of the ones I use regularly and find to be well-tested and mature:
- This pair of libraries provide array and matrix structures, linear algebra routines, numerical optimization, random number generation, statistics routines, differential equation modeling, Fourier transforms and signal processing, image processing, sparse and masked arrays, spatial computation, and numerous other mathematical routines. Together, they cover most of MATLAB’s basic functionality and parts of many of the toolkits, and include support for reading and writing MATLAB files. Additionally, they now have great documentation (vastly improved from a few years ago) and a very active community.
- One of the best things in Python is IPython, an enhanced interactive Python shell that makes debugging, profiling code, interactive plotting. It supports tab completion on objects, integrated debugging, module finding, and more – essentially, it does almost everything you’d expect a command line programming interface to do. Additionally,
- Referenced earlier, Cython is a painless way of embedding compiled, optimized bits of code in a larger Python program.
- SQLAlchemy makes leveraging the power of a database incredibly simple and intuitive. It is essentially a wrapper around an SQL database. You build queries using intuitive operators, then it generates the SQL, queries the database, and returns an iterator over the results. Combining it with sqlite – embedded in Python’s standard library – allows one to leverage databases for scientific work with impressive ease. And, if you tell sqlite to build its database in memory, you’ve got another powerful data structure. To slightly plagiarize xkcd, SQLAlchemy makes databases fun again.
- PyTables is a great way of managing large amounts of data in an organized, reliable, and efficient fashion. It optimizes resources, automatically transferring data between disk and memory as needed. It also supports on-the-fly (DE)compression and works seamlessly with NumPy arrays.
- For writing user interfaces in C++, I recommend it is, in my experience, difficult to beat QT. PyQt brings the ease of QT to Python. And I do mean ease – using the interactive QT designer, I’ve build a reasonably complex GUI-driven scientific application with only a few dozen lines of custom GUI code. The entire thing was done in a few days. The code is cross-platform over Linux, Mac OS X, and Windows. If you need to develop a front end to your data framework, and don’t mind the license (GPL for PyQT, LGPL for QT), this is, in my experience, the easiest way to do so.
- Without proper foresight and planning, larger research projects are particularly prone to the second law of thermodynamics: over time, the organization of parameters, options, data, and results becomes increasingly random. TreeDict is a Python data structure I designed to fight this. It stores hierarchical collections of parameters, variables, or data, and supports splicing, joining, copying, hashing, and other operations over tree structures. The hierarchical structure promotes organization that naturally tracks the conceptual divisions in the program – for example, a single file can define all parameters while reflecting the structure of the rest of the code.
- Sage doesn’t really fit on this list as it packages many of the above packages into a single framework for mathematical research. It aims to be a complete solution to scientific programming, and it incorporates over a hundred open source scientific libraries. It builds on these with a great notebook concept that can really streamline the thought process and help organize general research. As an added bonus, it has an online interface for trying it out. As a complete package, I recommend newcomers to scientific Python programming try Sage first; it does a great job of unifying available tools in a consistent presentation.
- Enthought Python Distribution:
- Also packaging these many libraries into a complete package for scientific computing, the Enthought Python Distribution is distributed by a company that contributes heavily to developing and maintaining these libraries. While there are commercial support options, it is free for academic use.
I do not feel comfortable releasing code without an accompanying suite of tests. This attitude, of course, reflects practical programmer wisdom; code that is guaranteed to function a certain way – as encapsulated in these unit tests – is reusable and dependable. While packaging test code with does not always equate with code quality, there is a strong correlation. Unfortunately, the research community does not often emphasize writing proper test code, due partly to that emphasis being directed, understandably, towards technique, theory, and publication. But this is exactly why a no-boilerplate, practical and solid testing framework and simple testing constructs like assert statements are so important. Python provides a built-in, low barrier-to-entry testing framework that encourages good test coverage by making the fastest workflow, including debugging time, involve writing test cases. In this way, Python again distinguishes itself from its competitors for scientific code.
A Complete Programming Solution¶
In summary, and reiterating my point that Python is a complete programming solution, I mention three additional points, each of which would make a great final thought. First, it is open source and completely free, even for commercial use, as are many of the key scientific libraries. Second, it runs natively on Windows, Mac OS, linux, and others, as does its standard library and the third party libraries I’ve mentioned here. Third, it fits quick scripting and large development projects equally well. A quick perusal of some success stories on Python’s website showcases the diversity of environments in which Python provides a scalable, well-supported, and complete programming solution for research and scientific coding. However, the best closing thought is due to Randall Monroe, the author of xkcd: “Programming is fun again!”
Code should be written in a way that does not disadvantage other implementations of Python (PyPy, Jython, IronPython, Cython, Psyco, and such).
For example, do not rely on CPython's efficient implementation of in-place string concatenation for statements in the form or . This optimization is fragile even in CPython (it only works for some types) and isn't present at all in implementations that don't use refcounting. In performance sensitive parts of the library, the form should be used instead. This will ensure that concatenation occurs in linear time across various implementations.
Comparisons to singletons like None should always be done with or , never the equality operators.
Also, beware of writing when you really mean -- e.g. when testing whether a variable or argument that defaults to None was set to some other value. The other value might have a type (such as a container) that could be false in a boolean context!
Use operator rather than . While both expressions are functionally identical, the former is more readable and preferred.
Yes:if foo is not None:
No:if not foo is None:
When implementing ordering operations with rich comparisons, it is best to implement all six operations (, , , , , ) rather than relying on other code to only exercise a particular comparison.
To minimize the effort involved, the decorator provides a tool to generate missing comparison methods.
PEP 207 indicates that reflexivity rules are assumed by Python. Thus, the interpreter may swap with , with , and may swap the arguments of and . The and operations are guaranteed to use the operator and the function uses the operator. However, it is best to implement all six operations so that confusion doesn't arise in other contexts.
Always use a def statement instead of an assignment statement that binds a lambda expression directly to an identifier.
Yes:def f(x): return 2*x
No:f = lambda x: 2*x
The first form means that the name of the resulting function object is specifically 'f' instead of the generic '<lambda>'. This is more useful for tracebacks and string representations in general. The use of the assignment statement eliminates the sole benefit a lambda expression can offer over an explicit def statement (i.e. that it can be embedded inside a larger expression)
Derive exceptions from rather than . Direct inheritance from is reserved for exceptions where catching them is almost always the wrong thing to do.
Design exception hierarchies based on the distinctions that code catching the exceptions is likely to need, rather than the locations where the exceptions are raised. Aim to answer the question "What went wrong?" programmatically, rather than only stating that "A problem occurred" (see PEP 3151 for an example of this lesson being learned for the builtin exception hierarchy)
Class naming conventions apply here, although you should add the suffix "Error" to your exception classes if the exception is an error. Non-error exceptions that are used for non-local flow control or other forms of signaling need no special suffix.
Use exception chaining appropriately. In Python 3, "raise X from Y" should be used to indicate explicit replacement without losing the original traceback.
When deliberately replacing an inner exception (using "raise X" in Python 2 or "raise X from None" in Python 3.3+), ensure that relevant details are transferred to the new exception (such as preserving the attribute name when converting KeyError to AttributeError, or embedding the text of the original exception in the new exception message).
When raising an exception in Python 2, use instead of the older form .
The latter form is not legal Python 3 syntax.
The paren-using form also means that when the exception arguments are long or include string formatting, you don't need to use line continuation characters thanks to the containing parentheses.
When catching exceptions, mention specific exceptions whenever possible instead of using a bare clause.
For example, use:try: import platform_specific_module except ImportError: platform_specific_module = None
A bare clause will catch SystemExit and KeyboardInterrupt exceptions, making it harder to interrupt a program with Control-C, and can disguise other problems. If you want to catch all exceptions that signal program errors, use (bare except is equivalent to ).
A good rule of thumb is to limit use of bare 'except' clauses to two cases:
- If the exception handler will be printing out or logging the traceback; at least the user will be aware that an error has occurred.
- If the code needs to do some cleanup work, but then lets the exception propagate upwards with . can be a better way to handle this case.
When binding caught exceptions to a name, prefer the explicit name binding syntax added in Python 2.6:try: process_data() except Exception as exc: raise DataProcessingFailedError(str(exc))
This is the only syntax supported in Python 3, and avoids the ambiguity problems associated with the older comma-based syntax.
When catching operating system errors, prefer the explicit exception hierarchy introduced in Python 3.3 over introspection of values.
Additionally, for all try/except clauses, limit the clause to the absolute minimum amount of code necessary. Again, this avoids masking bugs.
Yes:try: value = collection[key] except KeyError: return key_not_found(key) else: return handle_value(value)
No:try: # Too broad! return handle_value(collection[key]) except KeyError: # Will also catch KeyError raised by handle_value() return key_not_found(key)
When a resource is local to a particular section of code, use a statement to ensure it is cleaned up promptly and reliably after use. A try/finally statement is also acceptable.
Context managers should be invoked through separate functions or methods whenever they do something other than acquire and release resources. For example:
Yes:with conn.begin_transaction(): do_stuff_in_transaction(conn)
No:with conn: do_stuff_in_transaction(conn)
The latter example doesn't provide any information to indicate that the and methods are doing something other than closing the connection after a transaction. Being explicit is important in this case.
Be consistent in return statements. Either all return statements in a function should return an expression, or none of them should. If any return statement returns an expression, any return statements where no value is returned should explicitly state this as , and an explicit return statement should be present at the end of the function (if reachable).
Yes:def foo(x): if x >= 0: return math.sqrt(x) else: return None def bar(x): if x < 0: return None return math.sqrt(x)
No:def foo(x): if x >= 0: return math.sqrt(x) def bar(x): if x < 0: return return math.sqrt(x)
Use string methods instead of the string module.
String methods are always much faster and share the same API with unicode strings. Override this rule if backward compatibility with Pythons older than 2.0 is required.
Use and instead of string slicing to check for prefixes or suffixes.
startswith() and endswith() are cleaner and less error prone. For example:Yes: if foo.startswith('bar'): No: if foo[:3] == 'bar':
Object type comparisons should always use isinstance() instead of comparing types directly.Yes: if isinstance(obj, int): No: if type(obj) is type(1):
When checking if an object is a string, keep in mind that it might be a unicode string too! In Python 2, str and unicode have a common base class, basestring, so you can do:if isinstance(obj, basestring):
Note that in Python 3, and no longer exist (there is only ) and a bytes object is no longer a kind of string (it is a sequence of integers instead)
For sequences, (strings, lists, tuples), use the fact that empty sequences are false.Yes: if not seq: if seq: No: if len(seq): if not len(seq):
Don't write string literals that rely on significant trailing whitespace. Such trailing whitespace is visually indistinguishable and some editors (or more recently, reindent.py) will trim them.
Don't compare boolean values to True or False using .Yes: if greeting: No: if greeting == True: Worse: if greeting is True: