Sunday, November 8, 2015

Explaining Python Concepts: Iterable, Iterator, and Generator

A few days ago, some colleagues at work brought up a discussion about several closely related concepts in Python: iterable, iterator, and generator. I thought it would make a good blog post to explain them all, down to the words in the PEPs, without hand-waving over them.

Duck typing

Before we go deep, let's quickly remember that Python is a dynamically typed language. Many people use the term duck typing to refer to Python's type system: colloquially speaking, if it walks like a duck and quacks like a duck, it must be a duck. That is to say, an object is considered a particular type if it behaves in accordance with the defined behavior of that type.

In other words, we are usually talking about interface rather than implementation in Python. The mentioned topics are in fact concepts, not any concrete implementation, as we shall see later.


Iterable

Iterable is the easiest one to explain. If an object has either an __iter__ or a __getitem__ method, it is an iterable.
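To illustrate, here is a minimal class of my own (not from any library) that has no __iter__ at all, yet is iterable through the old sequence protocol, __getitem__:

```python
class Squares:
    # No __iter__ method; __getitem__ alone makes this object iterable.
    def __getitem__(self, index):
        if index >= 5:
            raise IndexError  # tells the iteration machinery to stop
        return index * index

print(list(Squares()))  # [0, 1, 4, 9, 16]
```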

The main purpose of an iterable object is that it can be used in a for loop. Let's take a look at how the for loop works.

>>> import dis
>>> def f():
...   for x in o:
...     pass
>>> dis.dis(f)
  2           0 SETUP_LOOP              14 (to 17)
              3 LOAD_GLOBAL              0 (o)
              6 GET_ITER            
        >>    7 FOR_ITER                 6 (to 16)
             10 STORE_FAST               0 (x)

  3          13 JUMP_ABSOLUTE            7
        >>   16 POP_BLOCK           
        >>   17 LOAD_CONST               0 (None)
             20 RETURN_VALUE

As you can see, the for loop itself is converted into two main instructions: a GET_ITER followed by a FOR_ITER. The first instruction obtains an iterator from an iterable object. The next instruction grabs the next item from that iterator.
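In plain Python, the two instructions correspond roughly to the builtin iter and next functions. Here is my sketch of what the loop does under the hood (the names manual_for, o, and process are mine, not part of the bytecode):

```python
def manual_for(o, process):
    it = iter(o)             # GET_ITER: obtain an iterator from the iterable
    while True:
        try:
            x = next(it)     # FOR_ITER: grab the next item...
        except StopIteration:
            break            # ...or jump past the loop when exhausted
        process(x)           # the loop body

result = []
manual_for([1, 2, 3], result.append)
print(result)  # [1, 2, 3]
```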

You can also tell that I have simplified things a little bit up there, so as not to get into another term, which we are getting to next.


Iterator

An iterator object has a next method (or __next__ in Python 3). This method returns the next item in a stream, or raises StopIteration when there are no more items.

Now, if there were only a next method, the iterator would not be usable in a for loop. In order that "code receiving an iterator can use a for-loop over the iterator", "iterators are currently required to support both protocols". The other "protocol" referred to in those direct quotes is the iterable concept. That is, iterators are required to have an __iter__ method that returns the iterator itself. The two concepts, however, are distinct.
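A minimal hand-written iterator (my own illustration) that supports both protocols looks like this:

```python
class Countdown:
    # An iterator: it has next/__next__, plus an __iter__ returning itself.
    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self              # the iterable protocol: return myself

    def __next__(self):
        if self.current <= 0:
            raise StopIteration  # no more items
        self.current -= 1
        return self.current + 1

    next = __next__              # the Python 2 spelling

print(list(Countdown(3)))  # [3, 2, 1]
```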

So, by now, I hope that it is clear that iterators are iterable, and iterable objects produce iterators.

To illustrate the point, let's consider this example:

>>> seq = [1, 2, 3, 4, 5]
>>> iter1 = seq.__iter__()
>>> for x in iter1:
...   print x
...   if x == 3: break
>>> iter2 = seq.__iter__()
>>> for x in iter1:
...   print x
>>> for x in iter2:
...   print x

We have a sequence seq. This sequence is iterable because it has an __iter__ method. We obtain the first iterator, iter1, then run a for loop over it, breaking out at value 3. The fact that we can run a for loop over iter1 confirms that iter1 is iterable. But unlike seq, iterating over iter1 again continues from where it left off, not from the beginning. We also see that iter2 (which is created at the same time we restart iteration over iter1) is totally independent from iter1.

I hope the example makes sense. When iterating over a sequence, you want to iterate over all its items. When iterating over an iterator, you want to iterate over the items left in that iterator, not the items that have already been consumed. Put differently, the __iter__ method defines how the items are iterated over, while the next method simply returns the next item in that order. Therefore, an iterator is almost always assumed to be used only once, while an iterable is often iterated over many times.

Now that we have seen the relationship between iterator and iterable, let's get to the last topic.


Generator

Quoting PEP 255: a Python generator is a kind of Python iterator, but of an especially powerful kind.

A generator is more powerful because it has these additional methods on top of next: send, throw, and close. It should therefore be clear that if an object does not accept any of these methods, it is not a generator. Conversely, an object that provides all of these methods can be treated as a generator.

Due to duck typing, it is wrong to use isinstance(x, types.GeneratorType) to check whether x exhibits generator behavior. That type is only defined for objects returned from a generator function or a generator expression.

Speaking of generator functions and generator expressions, they should not be confused with generators. A generator function is any function that has the yield keyword somewhere in its body. Invoking this function does not run its body; it returns a generator object, which PEP 255 calls a generator-iterator. Calling next on the returned generator executes the function body. A generator expression is similar in concept but with the more familiar list comprehension syntax.
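A quick illustration of that laziness (my own example): calling the generator function returns immediately, and the body only runs once next is called.

```python
log = []

def gen():
    log.append('body started')  # records when the body actually runs
    yield 1
    yield 2

g = gen()       # returns a generator-iterator; the body has NOT run
print(log)      # []
print(next(g))  # 1 (the body runs only now)
print(log)      # ['body started']
```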

The easiest way to exercise the advanced features of generators is to take advantage of the yield from keyword introduced in PEP 380 -- Syntax for Delegating to a Subgenerator, available since CPython 3.3.

>>> class MyGenerator(object):
...     def next(self):
...         return 42
...     __next__ = next
...     def send(self, *args):
...         print('Received', args)
...     def throw(self, *args):
...         pass
...     def close(self, *args):
...         pass
>>> class MyIterable(object):
...     def __iter__(self):
...         return MyGenerator()
>>> def test_my_generator():
...     yield from MyIterable()
>>> def test_list():
...     yield from [1, 2, 3]
>>> def test_genexp():
...     x = (i for i in (1, 2, 3))
...     yield from x
>>> g = test_genexp()
>>> next(g)
>>> g.send('Yup, genexp is a generator.')
>>> g = test_my_generator()
>>> next(g)
>>> g.send('Surprised! This is also a generator.')
Received ('Surprised! This is also a generator.',)
>>> g = test_list()
>>> next(g)
>>> g.send('Oops, not a generator')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in test_list
AttributeError: 'list_iterator' object has no attribute 'send'

The above code shows that the return value of a generator expression or a generator function is, as expected, a proper generator. The code also shows that MyGenerator is indeed a generator because it fulfills the generator concept: it can be used as a subgenerator in a yield from statement. Lastly, it shows that a regular iterator is not a generator.

On a side note, the example makes it clear that iterable and iterator/generator are two distinct concepts. MyIterable does not produce the next item, and MyGenerator is not iterable, but combined they can be used in a for loop.


To sum it all up: an iterable object produces iterators and therefore controls how the items are iterated over. An iterator produces the next item in a defined order and is usually consumed only once. For an iterator to be used in a for loop (no pun intended), it is required to define an __iter__ method that returns itself. A generator is an enhanced iterator with additional semantics such as send, throw, and close.

Practically speaking, if an object can be fully iterated over with a for loop more than once, it is an iterable. If an object has next (or __next__ in Python 3), it is an iterator. And if a value can be sent to an iterator, it is a generator.
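As a convenience, Python 3's collections.abc module captures these concepts as abstract base classes (the Generator ABC exists since Python 3.5; this is my addition, not something the discussion above relies on):

```python
from collections.abc import Iterable, Iterator, Generator

seq = [1, 2, 3]
it = iter(seq)            # a list_iterator
g = (x for x in seq)      # a generator expression

# A list is iterable but not an iterator; its iterator is both.
print(isinstance(seq, Iterable), isinstance(seq, Iterator))  # True False
print(isinstance(it, Iterable), isinstance(it, Iterator))    # True True
# A genexp satisfies the full generator concept (send/throw/close).
print(isinstance(g, Generator))                              # True
```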

Thursday, November 13, 2014

A short joke

Thought of this joke on the way home (based on another religiously-themed version from a colleague at work).

How can computer scientists help NASA find the next habitable planet? They do a star search.

Saturday, July 26, 2014

Quick and easy "software diversification"

The Economist published an article, Divided we stand, about the work that Professor Michael Franz did at the University of California, Irvine to further secure software applications.

The gist of the technique is to compile the same source code into many variant binaries that perform the same function, but differ structurally, at the machine instruction level. For example: the sequence MOV EAX, EBX; XOR ECX, EDX can be rearranged into XOR ECX, EDX; MOV EAX, EBX, or made into MOV EAX, EBX; NOP; XOR ECX, EDX without affecting any functionality of the sequence. The team modified compilers (both LLVM clang and GCC) to automatically (and deterministically) introduce diversity (randomness) in the instruction scheduler. As such, exploit writers will have a harder time targeting all variants.

This immediately reminds me of a simple trick I used many years ago to achieve a similar effect: randomizing the linking order of object files. Consider this: if you have the object files main.o, strcpy.o, and puts.o, you can create 6 (3 factorial) variants by linking them in different orders:

  1. main strcpy puts
  2. main puts strcpy
  3. strcpy main puts
  4. strcpy puts main
  5. puts main strcpy
  6. puts strcpy main
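The permutations above can of course be enumerated mechanically; here is a small sketch of mine (not part of the original build setup) that prints one link command per variant:

```python
import itertools

# Every link order for the three object files yields a distinct binary layout.
objects = ['main.o', 'strcpy.o', 'puts.o']
for i, order in enumerate(itertools.permutations(objects), 1):
    print('{0}. gcc -o variant{0} {1}'.format(i, ' '.join(order)))
```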
For example, with just two object files, f1.o and main.o:

$ gcc -o t1 f1.o main.o
$ gcc -o t2 main.o f1.o
$ nm t1
0000000100001040 S _NXArgc
0000000100001048 S _NXArgv
0000000100001058 S ___progname
0000000100000000 T __mh_execute_header
0000000100001050 S _environ
                 U _exit
0000000100000ec0 T _f1
0000000100000ed0 T _main
0000000100001000 s _pvars
                 U dyld_stub_binder
0000000100000e80 T start
$ nm t2
0000000100001040 S _NXArgc
0000000100001048 S _NXArgv
0000000100001058 S ___progname
0000000100000000 T __mh_execute_header
0000000100001050 S _environ
                 U _exit
0000000100000ef0 T _f1
0000000100000ec0 T _main
0000000100001000 s _pvars
                 U dyld_stub_binder
0000000100000e80 T start

In the first variant, f1 is at ~EC0 and main is at ~ED0. In the second variant, f1 is at ~EF0 and main is at ~EC0. There is a clear difference in the structure of the binaries but no functionality is affected.

This trick is performed at the final stage (linking) in the whole build process. Therefore, intermediate object files can be reused without recompilation. Furthermore, no source code is required for this "diversification" process to happen.

The clear tradeoff is in the granularity of the diversification. In the context of Prof. Franz's work, which mainly defends against ROP exploits, I will happily ignore such granularity.

Oh, by the way, I did not use this trick to "secure" the application. It seems like the wrong tool for that purpose due to the distribution and debugging problems it creates.

Friday, March 7, 2014

Functor optimization

I have this piece of code that can be compiled with -On (n > 0) but cannot be compiled with -O0.

#include <iostream>

class Functor1 {
public:
    void operator()() const {
        std::cout << "Functor 1" << "\n";
    }
};

class Functor2 {
public:
    void operator()() const {
        std::cout << "Functor 2" << this << "\n";
    }
};

template <typename FunctorType>
class TemplateWithStaticMember {
public:
    TemplateWithStaticMember() {
        functor_();  // the only use of the functor object
    }
private:
    static const FunctorType functor_;  // THIS LINE!!!
};

/* Incomplete fix:
template <typename FunctorType>
const FunctorType TemplateWithStaticMember<FunctorType>::functor_; */

int main(int argc, char* argv[]) {
    TemplateWithStaticMember<Functor1> f1;
    // TemplateWithStaticMember<Functor2> f2;
    return 0;
}

Under GCC 4.8, when compiling with -O0, we get this error:

/tmp/ccFIY33S.o: In function `TemplateWithStaticMember<Functor1>::TemplateWithStaticMember()':
main.cpp:(.text._ZN24TemplateWithStaticMemberI8Functor1EC2Ev[_ZN24TemplateWithStaticMemberI8Functor1EC5Ev]+0xd): undefined reference to `TemplateWithStaticMember<Functor1>::functor_'
collect2: error: ld returned 1 exit status

At other optimization levels, the code can be compiled and executed just fine.

If we uncomment the second functor, the build always fails, regardless of optimization level.

The reason is that our template declares a static constant member functor_ (at the line marked with THIS LINE!!!). At higher optimization levels, the compiler figures out that we only use the functor object to execute a function, so it inlines the function and optimizes away the functor object. Without optimization, the linker requires a definition of functor_ and fails to find one.

When we use f2, its functor_'s operator() refers back to the functor via this. That requires the functor object to actually exist in memory. But because we have not defined any such functor object, the build always fails.

I find this piece of code interesting because usually higher (not lower) optimization levels make code fail. For example, Prof. John Regehr has blogged about undefined behavior under optimizations, and the STACK team at MIT published a paper about optimization-safe code.

Wednesday, December 11, 2013

BrowsePass as a Chrome app

I am pleased to announce the availability of BrowsePass Chrome app.

This is an easier way to set up BrowsePass on the Google Chrome browser (as well as its open-source cousin, Chromium). The Chrome app alleviates the most cumbersome step in setting up BrowsePass: finding a web host for the source code. Everything else remains the same, including the convenience of opening your password database with a browser.

Though there are not yet extensions/add-ons/apps for other browsers (such as Firefox, Safari, and Opera), they can still run BrowsePass normally.

As usual, the source code (including the Chrome app) is at BitBucket.

Tuesday, December 10, 2013

Chainload from GRUB to Chameleon boot loader

A quick note to myself. Assuming OS X is installed in the 2nd partition of the 1st hard disk, modify your /etc/grub.d/40_custom file to have:

menuentry "OS X" {
    insmod hfsplus
    set root="(hd0,gpt2)"
    chainloader /usr/standalone/i386/boot0
}

Then remember to run update-grub2.

The file /usr/standalone/i386/boot0 can be found in your hackintosh. It is the boot code from Chameleon. If you do not find this file, grab it from the Chameleon package.

Wednesday, November 27, 2013

Language definition file vietnam.ldf not found

Just a quick note to myself. VnTeX recently moved its vietnam.ldf file to babel's contrib as vietnamese.dtx. The move happened on April 14, 2013. The babel package in MiKTeX is currently at March 23, 2013; the vntex package in MiKTeX is currently at May 21, 2013. That is to say, the vntex package no longer provides vietnam.ldf, yet the babel package is still not up-to-date enough to have vietnamese.dtx.

The fix is to maintain your own vietnam.ldf file. The code (that was taken before the move) is pasted below.

% Copyright 2000-2005 Werner Lemberg .
% This file is part of vntex.  License: LPPL, version 1.3 or newer,
% according to
% vietnam.ldf
% written by Werner LEMBERG 
% History
%   1.0  2000/09/01
%     First version.
%   1.1  2001/05/26
%     Moved \endlinechar downwards.
%   post 1.1  ?
%     Don't check for dblaccnt.sty.
%     Add support for ucs.sty.
%     Don't define \captionsvietnam but load vncaps.tex.
%   1.2  2005/04/21
%     Add copyright message.
%     Minor clean-ups.

\ProvidesLanguage{vietnam}%
    [2005/04/21 v1.2 Vietnamese support from the babel system]


\ifx\l@vietnam \@undefined
  \adddialect\l@vietnam 0
\fi


\InputIfFileExists{t5enc.def}%
  {\message{Loading definitions for the Vietnamese font encoding}}
  {\errhelp{I can't find the file `t5enc.def' for Vietnamese fonts}
   \errmessage{Since I do not know what the T5 encoding means^^J
               I can't typeset Vietnamese.^^J
               I stop here, while you get a suitable `t5enc.def' file}}

  {\PackageWarning{babel}{No input encoding specified for Vietnamese}}

\endlinechar \m@ne

  \ifx \UnicodeCharFilter \@undefined
%   \UCSProtectionUnichar

\let\viet \viettext





\endlinechar `\^^M


% end of vietnam.ldf