Aug 6, 2020

Less lines is better (in most cases)

In most cases less code is better. "The best code is no code at all." - Jeff Atwood.

Googling "less code is better" turns up some good articles supporting the statement.

By way of a disclaimer: "Readability matters, and you should not sacrifice readability in order to get less code." (c) Daan

In this article I would like to describe my view on a special case of having less code: having fewer lines of code. Strictly speaking, it is possible to have fewer lines with the same amount of code, with less code, or even with more code.

Here are some reasons why I prefer fewer lines of code (without sacrificing readability):

  • Vertically more compact code
  • Less need for vertical scrolling
  • Smaller commit diffs
  • Possibly less code

The most important reasons are "Vertically more compact code" and "Less need for vertical scrolling". Both help you reduce the number of bugs (per feature) and code faster.

Vertically more compact code lets you see more source code at a time, which frees your brain from having to remember the code that is currently not shown on the screen. The freed capacity can be spent on analyzing the code and being more attentive, which results in faster development and fewer bugs.

Less need for vertical scrolling also results in fewer bugs, because every scroll is a small distraction. There are also minor time savings, since scrolling itself takes time. Finally, scrolling may introduce human-factor bugs: it is easy to overlook parts of the code depending on their vertical position in a file.

Several years ago I read research conclusions backing the above claims, but unfortunately I cannot find the links now. Please submit links in the comments if you have them.

There are some techniques to reduce negative effects of having many code lines:

  • A second, third, etc. monitor (especially one positioned vertically)
  • Smaller font sizes, so more code lines can fit vertically on a screen
  • Breaking long files into several smaller ones

Each of these techniques should be used where possible in addition to, not as a replacement for, fewer code lines. Here is why.

Extra monitors are a great option. I remember reading a research conclusion that an extra monitor may increase productivity by up to 30%. Unfortunately, although extra monitors help reduce the amount of interface interaction (scrolling and switching between files), they still do not completely solve the focus and attention issue: a human can look at only one screen at a time, even if several are in front of them. Another issue with extra monitors is that they limit your mobility, which matters for those who often work on a laptop from different places (for such users an extra monitor may simply not be an option).

A smaller font size lets you display more lines on the same screen, but at the expense of more strain on your eyes, which may be undesirable for health reasons and in practice makes your eyes tire faster.

Breaking long files into several smaller ones replaces the need for scrolling with the need for switching between files. This may be a better option, but it still has the drawback of distracting the developer's attention with extra interface interactions.

There is something else about having more lines. When reading code line by line, every line switch requires some brain capacity to work out whether the next line expresses a new logical construct of the language or a continuation of the current one. A multi-line function or method call is a good example: with each next line you read, you need to determine whether it is still the same call that started several lines above or a new call or statement. The fewer lines there are, the fewer times you need to figure out whether you are still in the same construct or a new one has started.

Based on the above, here are some otherwise questionable code style decisions that I prefer.

Multiline one-liners

Prefer:

value = (value_for_true() if some_looooooooooooong_expression(arg) else
         value_for_false())

over:

if some_looooooooooooong_expression(arg):
    value = value_for_true()
else:
    value = value_for_false()

This is an example of the same amount of code in a different number of lines. The preferred variant is 2 lines shorter (half the number of lines). I should point out that the preferred variant is in line with the Google Python Style Guide.

Condensed function/method calls

Prefer:

value = function(long_name_arg1, long_name_arg2,
                 long_name_kwarg1=value1, long_name_kwarg2=value2)

over:

value = function(
    long_name_arg1, long_name_arg2,
    long_name_kwarg1=value1, long_name_kwarg2=value2)

and especially over:

value = function(
    long_name_arg1,
    long_name_arg2,
    long_name_kwarg1=value1,
    long_name_kwarg2=value2,
)

All the above snippets are PEP 8 compliant, but with the preferred one we save up to 4 lines and make the code up to 3 times shorter in lines.

Feb 21, 2019

Python packaging

After years of working as a Python developer in various teams, I have learned that most Python backend developers do not bother with Python packaging for backend applications. This makes me sad, so I decided to write a blog post to explain why I do otherwise.

I think I started packaging all the Python software I produce ages ago (at the beginning of 2013), right after I posted this article (in Russian) as an admission fee to habr (to get write and commenting rights).

Many developers have asked me why I bother with Python packaging when everything works just fine without it. "It is a higher level of maturity" was usually not accepted as an answer.

Bulletproof reason (apparently not)

UPDATE: there is a workaround:
$ PYTHONPATH=$PWD:$PYTHONPATH python ./tmp/test1.py

So the 100% bulletproof reason for Python packaging is: without packaging, the source code cannot be reused in certain cases. You simply cannot import it, because it is outside the visible scope of the running Python interpreter.

Imagine you have created a skeleton Django project (I am using Django just to make the example more practical) and a Django application, as described in the official Django tutorial:

$ python --version
Python 3.7.2
$ python -m django --version
2.1.7
$ django-admin startproject mysite
$ cd mysite
$ python manage.py startapp polls

Now you need to do some quick research to test your code, so you create a directory that is not meant to be part of the project and a file with your snippet:

$ cd ..
$ mkdir tmp
$ vim tmp/test1.py

tmp/test1.py content:
from mysite.polls import models

This would look like:
(django) $ tree
.
├── mysite
│   ├── manage.py
│   ├── mysite
│   │   ├── __init__.py
│   │   ├── settings.py
│   │   ├── urls.py
│   │   └── wsgi.py
│   └── polls
│       ├── admin.py
│       ├── apps.py
│       ├── __init__.py
│       ├── migrations
│       │   └── __init__.py
│       ├── models.py
│       ├── tests.py
│       └── views.py
└── tmp
    └── test1.py

When you run the tmp/test1.py script you get:
$ python ./tmp/test1.py 
Traceback (most recent call last):
  File "./tmp/test1.py", line 1, in <module>
    from mysite.polls import models
ModuleNotFoundError: No module named 'mysite'

The module is not imported, because it is outside the visible scope of the script (the Python interpreter).

A relative import does not work either. tmp/test1.py content:
from ..mysite.polls import models
$ python ./tmp/test1.py 
Traceback (most recent call last):
  File "./tmp/test1.py", line 1, in <module>
    from ..mysite.polls import models
ValueError: attempted relative import beyond top-level package

Let's move tmp/test1.py into the mysite directory:
$ tree
.
└── mysite
    ├── manage.py
    ├── mysite
    │   ├── __init__.py
    │   ├── settings.py
    │   ├── urls.py
    │   └── wsgi.py
    ├── polls
    │   ├── admin.py
    │   ├── apps.py
    │   ├── __init__.py
    │   ├── migrations
    │   │   └── __init__.py
    │   ├── models.py
    │   ├── tests.py
    │   └── views.py
    └── tmp
        └── test1.py

tmp/test1.py content:
from ..polls import models

Nope:
$ python ./mysite/tmp/test1.py 
Traceback (most recent call last):
  File "./mysite/tmp/test1.py", line 1, in <module>
    from ..polls import models
ValueError: attempted relative import beyond top-level package

You can import your code only if you move your test1.py to the root level like this:
$ tree
.
├── mysite
│   ├── manage.py
│   ├── mysite
│   │   ├── __init__.py
│   │   ├── settings.py
│   │   ├── urls.py
│   │   └── wsgi.py
│   ├── polls
│   │   ├── admin.py
│   │   ├── apps.py
│   │   ├── __init__.py
│   │   ├── migrations
│   │   │   └── __init__.py
│   │   ├── models.py
│   │   ├── tests.py
│   │   └── views.py
│   └── test1.py (content: from polls import models)
└── test1.py (content: from mysite.polls import models)

This leads to polluting the source root with a bunch of testN.py files, which contradicts the Zen of Python statement "Namespaces are one honking great idea -- let's do more of those!" and just makes the code structure messy and harder to navigate.

Creating a package installable into a virtualenv in edit mode (pip install -e .) lets you import any submodule from any script or submodule, no matter where they are located relative to the source code, as long as the virtualenv is activated.

The content of tmp/test1.py would be the same no matter where you place it:
from mysite.polls import models
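
For completeness, here is a minimal sketch of what such a package definition could look like. The metadata below is illustrative, and it presumes the polls app has been moved under the mysite package (mysite/polls/) so that the import above resolves; setup.cfg or pyproject.toml would work just as well:

# setup.py - a minimal sketch; the name, version and dependency pin are illustrative
from setuptools import setup, find_packages

setup(
    name='mysite',
    version='0.1.0',
    packages=find_packages(exclude=['tmp']),
    install_requires=['Django>=2.1,<2.2'],
)

With this in place, installing in edit mode into the activated virtualenv is just:
(django) $ pip install -e .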

Other reasons

There are more reasons to do Python packaging (although most of them fall into the "higher maturity" category):

  • Code can be uploaded to a private (local) PyPI repository (e.g. Gemfury)
  • Forces you to have properly versioned software (independent of any particular version control system) with comparable version numbers
  • Forces you to namespace the code, which is well aligned with the Zen of Python statement "Namespaces are one honking great idea -- let's do more of those!"
  • Distributable independently of any particular version control system
    • No need to clone the entire repository when you only need the latest version
    • No need to give access to the entire source history to someone who is authorized only for deployment
    • No need to install a version control client (like git) in order to deploy
  • Installable/uninstallable/upgradeable with pip, pipenv and other similar tools
  • Allows distributing only the code required at run time (tests and other auxiliary stuff may be excluded from the package)
  • The package may include C extensions or Cython code, which are automatically compiled during installation
  • The package can be precompiled as a Python wheel
  • The package can be compressed

Feb 13, 2019

My public source code

Prospective clients and employers often ask to see some source code to estimate my coding skills. Unfortunately, or fortunately, most of the code I write is written for money (aka professionally) and is therefore covered by an NDA, either explicitly or implicitly. Here is a list of mostly leisure projects whose code I can show:

Stats on repositories that cannot be open sourced.

Jan 5, 2019

Select for update

It is easy to overlook select for update when using Django. Currently I am working on a project that uses Django 1.11.x and Django REST Framework 3.8.2 (an upgrade to Django 2.1 is planned, but we need to move to Python 3.7.1 first) and ran into the issue (again) of not having .select_for_update() where it is required. The issue looks as if your changes to the database simply do not happen, but what actually happens is that two parallel requests (or other parallel tasks) overwrite each other's data. Both requests retrieve a record from the database and reconstruct a corresponding model instance via the ORM, each getting its own copy. At some moment one of the requests modifies its instance attributes and saves the changes to the database, but this does not modify the copy held by the other request. Later the second request modifies its own copy and saves its changes to the database. This may not be a problem (in some cases) if the same attributes are being modified - the database then contains the most up-to-date values (probably what you actually want) - but if that is not the case, older values end up being stored in the database.

With Django's default behavior (at least for version 1.11) all model attributes are saved to the database even if they were not modified. Imagine you have a model A with attributes b and c. Two parallel requests get an instance from the database: instance = A(b='b1', c='c1'). Then request 1 changes b = 'b2' and saves the instance with instance.save(); A(b='b2', c='c1') is stored. Note that although request 1 did not change attribute c, it is stored to the database anyway. At this moment request 2 still holds its own copy as A(b='b1', c='c1'). Then it changes c = 'c2' and saves with instance.save(); A(b='b1', c='c2') is stored. Again, attribute b was not changed by this request, but its value is saved by default, and so it overwrites the value saved by request 1 (b = 'b2') with the older value b = 'b1' retrieved earlier. This is a typical lost update problem (also see write-write conflict).

In development and test environments this issue rarely reproduces, because the level of concurrency there is very low. But in a production environment it may very well happen, and when it does it is really hard to debug (because it is production and because it is hard to figure out the conditions for reproduction). Therefore this kind of issue should be prevented during development: one should make a habit of querying objects with .select_for_update() if they are to be modified.
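
For reference, here is a minimal sketch of that habit, using the example model A from above (the import path myapp.models is an assumption made for illustration):

from django.db import transaction

from myapp.models import A  # hypothetical app holding the example model A


def set_b(pk, new_b):
    # The selected row stays locked until the transaction commits, so a
    # parallel request blocks on the same row instead of silently
    # overwriting our changes.
    with transaction.atomic():
        instance = A.objects.select_for_update().get(pk=pk)
        instance.b = new_b
        instance.save()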

But developers still forget to do it (at least I sometimes forget). There are two things that could be done here. First, if Django REST Framework is used, it could apply .select_for_update() for all modification operations by default, or at least for PATCH and PUT (my question to the core developer). Second, Django should not save all attributes blindly, but only those that were actually modified (this would cover cases where different attributes are modified by parallel requests and also improve performance).
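
Until Django does that by default, a partial manual mitigation is already available through save(update_fields=...), which writes only the listed columns; here is a sketch with the same example model A (the import path is again an assumption):

from myapp.models import A  # hypothetical app holding the example model A

instance = A.objects.get(pk=1)
instance.b = 'b2'
# Only column b is written back, so the stale value of c held by this copy
# does not overwrite a newer value saved in the meantime by a parallel request.
instance.save(update_fields=['b'])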

UPDATE: While we are waiting for a reply from the core developer, here is a snippet for Django REST Framework:
from rest_framework import mixins
from rest_framework.generics import GenericAPIView
from rest_framework.viewsets import ViewSetMixin


class CustomGenericAPIView(GenericAPIView):
    def get_queryset(self):
        qs = super(CustomGenericAPIView, self).get_queryset()
        if self.request.method in ('PATCH', 'PUT'):
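            # Note: select_for_update() requires the queryset to be evaluated
            # inside a transaction (e.g. ATOMIC_REQUESTS = True or an explicit
            # transaction.atomic() block).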
            qs = qs.select_for_update()

        return qs


class CustomGenericViewSet(ViewSetMixin, CustomGenericAPIView):
    pass


class CustomModelViewSet(mixins.CreateModelMixin,
                         mixins.RetrieveModelMixin,
                         mixins.UpdateModelMixin,
                         mixins.DestroyModelMixin,
                         mixins.ListModelMixin,
                         CustomGenericViewSet):
    pass


Apr 22, 2017

Refactor me

This repository represents a step-by-step refactoring of some dirty code given to me as a test task to estimate my coding skills. The only remark about the code was: "refactor_me.py is expected to contain Python 3.5.x code" (the file name itself was actually not provided in the task).

I did it in such a way that every commit contains one particular change, described in the commit message. The original dirty code can be found in this commit: 1036c091cb70ef110b4e56702bdc012c8a110336

Remarks on final result:

  • A 100-character line length limit is used on purpose


Please do not hesitate to submit pull requests for improvements if you feel that I missed something.