Python & Terminal

Written on July 15th, 2017 by Kishu Agarwal

(Note: All of this relates to Python 2. Python 3 may show different results.)

While doing some office work, I came across this.

I had typed the following lines into my Python 2 interpreter.

    >>> s = '©'
    >>> s #for checking the value of s

Surprisingly, I got the following value.

    '\xc2\xa9'

I was surprised because © is a non-ASCII character and my Python interpreter’s default encoding is ASCII. Since ASCII can’t encode this character, I expected some kind of encode/decode error. I was wrong: I got the value above instead. Even more surprising, the value printed was exactly the UTF-8 encoding of the © character.
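
To make that expectation concrete, here is a small sketch of the two encodings side by side. It is my own illustration rather than part of the original session, and it is written with u'' literals so that it should run on both Python 2.7 and Python 3. The variable names are made up for the example.

```python
copyright_sign = u'\u00a9'  # the copyright sign as a Unicode string

# UTF-8 needs two bytes for this character.
utf8_bytes = copyright_sign.encode('utf-8')

# ASCII has no code for it, so an explicit ASCII encode fails.
try:
    copyright_sign.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(repr(utf8_bytes))
print("ASCII encode failed: %s" % ascii_failed)
```

The two bytes produced by the UTF-8 encode are the same \xc2 and \xa9 that the interpreter echoed back.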

Just to be sure, I checked my Python default encoding.

    >>> import sys
    >>> sys.getdefaultencoding()
    'ascii'

Yup, I was right: the default is ASCII. Then why was it printing the UTF-8 encoding?

To find the reason for this strange result, I did some digging on the internet, and finally I found it.

The reason was my TERMINAL. I was using GNOME Terminal running the bash shell. Guess what the default encoding of the terminal was?

It was UTF-8.

But what does the terminal encoding have to do with Python? Well, in this case, everything.

The following walkthrough should make things clear.

When we type s = '©' in the terminal and press Enter, the terminal passes this string as input to the stdin of the Python interpreter. As you already know, glyphs are just for human display; internally, all characters are stored in some encoding or other. Here the encoding of the terminal is UTF-8, so what it actually sends to stdin is the UTF-8 encoded string, as a series of bytes.

Just to confirm, here is a session with the Python interpreter that verifies this.

    >>> import sys
    >>> inp = sys.stdin.read()
    s = '©'
    >>> inp
    "s = '\xc2\xa9'\n"
    >>> len(inp)
    9

As the listing above shows, when we called the read method on stdin ourselves and entered our string containing the ‘©’ character, what we actually got were the same two bytes ‘\xc2\xa9’, which are nothing but the UTF-8 encoding of ©.
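
The length of 9 is worth a second look: the line we typed has only 8 characters. Here is a small sketch of my own that reproduces the count without a terminal, assuming the terminal does nothing more than UTF-8 encode the typed line. It should run on Python 2.7 and Python 3, and the variable names are only for illustration.

```python
typed_line = u"s = '\u00a9'\n"           # what we typed, as characters
wire_bytes = typed_line.encode('utf-8')  # what a UTF-8 terminal would hand to stdin

char_count = len(typed_line)   # counts characters
byte_count = len(wire_bytes)   # counts bytes

print("%d characters, %d bytes" % (char_count, byte_count))
```

The extra byte comes from the copyright sign, which is one character but two bytes in UTF-8.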

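Python then stores these two bytes in the variable ‘s’. (Evaluating s at the prompt displays its repr, which escapes non-ASCII bytes; that is why we saw ‘\xc2\xa9’ earlier rather than the glyph.) When we write its value to stdout, the same thing happens in reverse. Here is another session with the Python interpreter that shows this.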

    >>> import sys
    >>> s = '©'
    >>> sys.stdout.write(s)
    ©>>> 
    >>> sys.stdout.write('\xc2\xa9')
    ©>>>

As this shows, whether we write ‘s’ directly or write the UTF-8 encoding of ‘©’, we get the same result. The reason is that, after receiving these two bytes, the terminal decodes them as UTF-8, finds that they represent the ‘©’ Unicode character, and so displays the nice copyright symbol glyph instead of the raw bytes.
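
The terminal's side of that exchange can be sketched in a couple of lines. This is only my approximation of what a UTF-8 terminal does with the incoming bytes, not its actual implementation, and the names are made up for the example. It should run on Python 2.7 and Python 3.

```python
received = b'\xc2\xa9'  # the two bytes Python wrote to stdout

# A UTF-8 terminal decodes the byte sequence back into a single character.
shown = received.decode('utf-8')
is_copyright_sign = (shown == u'\u00a9')

print("%d character(s), copyright sign: %s" % (len(shown), is_copyright_sign))
```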

NOTE

One thing that needs to be mentioned here is that the shell encoding and the terminal encoding are two different things. (By shell encoding I mean the locale set in the shell’s environment, which is normally where Python gets its stdout encoding from.) By default, almost all shells and terminals nowadays use UTF-8, but you can change each of them independently of the other.
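
If you want to check the shell side on your own machine, a sketch like the following should do. It only reports what the environment and Python believe; as far as I know, there is no standard way to ask the terminal emulator itself which encoding it is set to. The variable names are my own, and the values printed will depend on your system.

```python
import locale
import os
import sys

# Encoding implied by the locale environment (LANG / LC_* variables).
locale_encoding = locale.getpreferredencoding(False)

# Encoding Python attached to stdout; it may be None when output is not a terminal.
stdout_encoding = getattr(sys.stdout, 'encoding', None)

print("LANG            = %s" % os.environ.get('LANG'))
print("locale encoding = %s" % locale_encoding)
print("stdout encoding = %s" % stdout_encoding)
```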

Here is a session where you can see how these two encodings come into play.

    >>> import sys
    >>> sys.stdout.encoding
    'UTF-8'
    >>> print u'\xa9'
    Β©

For this session, I left the shell encoding as UTF-8, which can be seen from the stdout encoding, and changed the terminal encoding to Western-1253.

To understand this example, there is one thing that you must know. When you print normal str type strings, Python writes them as is, as a sequence of bytes. But in the case of unicode strings, Python first tries to convert the Unicode string into the encoding of stdout and then writes that converted sequence of bytes.
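
That rule can be written down as a tiny function. bytes_written is a hypothetical helper of my own, not anything Python provides, and it ignores the trailing newline that print adds. It should run on Python 2.7 (where bytes is just another name for str) and on Python 3.

```python
def bytes_written(value, stdout_encoding):
    """Hypothetical helper: the bytes a Python 2 print would send to stdout."""
    if isinstance(value, bytes):
        # Byte strings (Python 2 str) go out unchanged.
        return value
    # Unicode strings are encoded with the stdout encoding first.
    return value.encode(stdout_encoding)


from_bytes = bytes_written(b'\xc2\xa9', 'utf-8')  # already bytes, passed through
from_text = bytes_written(u'\xa9', 'utf-8')       # encoded on the way out

print(from_bytes == from_text)
```

With a UTF-8 stdout, both calls end up sending the same two bytes, which is why the str and unicode versions of © look identical on a UTF-8 terminal.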

Having understood that, we can see more clearly what is happening in the above example. When we asked Python to print u’\xa9’, since it is a Unicode string, Python first converted it to UTF-8, because that is its stdout encoding, and then sent the converted string to the terminal, which in our case is ‘\xc2\xa9’.

Now here comes the interesting part. Since we have changed the terminal encoding to Western-1253, the terminal tries to decode these two bytes using that encoding instead of UTF-8. In this encoding, \xc2 corresponds to the symbol Β (the Greek capital letter beta) and \xa9 to ©, so the terminal shows Β© instead of just ©.
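
The same mix-up can be reproduced without touching any terminal settings. This sketch assumes that the Western-1253 setting corresponds to Python's cp1253 codec (Windows-1253); that mapping is my assumption, not something I have verified against the terminal. It should run on Python 2.7 and Python 3.

```python
sent = u'\xa9'.encode('utf-8')  # bytes Python writes to stdout: C2 A9
shown = sent.decode('cp1253')   # how a Windows-1253 terminal would read them

greek_beta = u'\u0392'          # GREEK CAPITAL LETTER BETA
matches_session = (shown == greek_beta + u'\xa9')

print("%d character(s), matches the session: %s" % (len(shown), matches_session))
```

One character went in, two came out: each byte of the UTF-8 sequence is a complete character in a single-byte encoding like this one.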

This result may seem surprising to some of you. It sure as hell was surprising to me.

Thank you for reading my article. Let me know in the comments section below if you liked it or have any suggestions for me. And please, feel free to share :)
