Feb 21, 2019

Python packaging

After years of working as a Python developer in various teams I have learned that most Python backend developers do not care about Python packaging for backend applications. This makes me sad. So I decided to write a blog post to explain why I do otherwise.

I think I started packaging all the Python software I produce ages ago (at the beginning of 2013), right after I posted this article (in Russian) to be vetted as an admission fee to habr (to get write and commenting rights).

Many developers have asked me why I do Python packaging when everything works just fine without it. "It is a higher level of maturity" was usually not taken as an answer.

Bulletproof reason (apparently not)

UPDATE: there is a workaround for the import problem described below:
$ PYTHONPATH=$PWD:$PYTHONPATH python ./tmp/test1.py

So the 100% bulletproof reason for Python packaging is: without packaging the source code cannot be reused in certain cases. You simply cannot import it, because it is outside the visible scope of the running Python interpreter.

Imagine you have created a skeleton Django project (I am using Django just to make the example more practical) and a Django application, as described in the official Django tutorial:

$ python --version
Python 3.7.2
$ python -m django --version
2.1.7
$ django-admin startproject mysite
$ cd mysite
$ python manage.py startapp polls

Now you need to do some quick research to test your code, so you create a directory that should not be part of the project and a file with your snippet:

$ cd ..
$ mkdir tmp
$ vim tmp/test1.py

tmp/test1.py content:
from mysite.polls import models

This would look like:
(django) $ tree
.
├── mysite
│   ├── manage.py
│   ├── mysite
│   │   ├── __init__.py
│   │   ├── settings.py
│   │   ├── urls.py
│   │   └── wsgi.py
│   └── polls
│       ├── admin.py
│       ├── apps.py
│       ├── __init__.py
│       ├── migrations
│       │   └── __init__.py
│       ├── models.py
│       ├── tests.py
│       └── views.py
└── tmp
    └── test1.py

When you run the tmp/test1.py script you get:
$ python ./tmp/test1.py 
Traceback (most recent call last):
  File "./tmp/test1.py", line 1, in <module>
    from mysite.polls import models
ModuleNotFoundError: No module named 'mysite'

It does not import the module, because the module is outside the visible scope of the script (the Python interpreter).
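
This is easy to see: when a script is run directly, Python puts the directory of the script itself (here tmp/), not the project root, at the front of sys.path. A quick check (purely illustrative, not part of the project):

import sys
print(sys.path[0])  # the directory containing the script, i.e. .../tmp
from mysite.polls import models  # so this lookup fails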

A relative import does not work either:
from ..mysite.polls import models
$ python ./tmp/test1.py 
Traceback (most recent call last):
  File "./tmp/test1.py", line 1, in <module>
    from ..mysite.polls import models
ValueError: attempted relative import beyond top-level package

Let's move tmp/test1.py into the mysite directory:
$ tree
.
└── mysite
    ├── manage.py
    ├── mysite
    │   ├── __init__.py
    │   ├── settings.py
    │   ├── urls.py
    │   └── wsgi.py
    ├── polls
    │   ├── admin.py
    │   ├── apps.py
    │   ├── __init__.py
    │   ├── migrations
    │   │   └── __init__.py
    │   ├── models.py
    │   ├── tests.py
    │   └── views.py
    └── tmp
        └── test1.py

tmp/test1.py content:
from ..polls import models

Nope:
$ python ./mysite/tmp/test1.py 
Traceback (most recent call last):
  File "./mysite/tmp/test1.py", line 1, in 
    from ..polls import models
ValueError: attempted relative import beyond top-level package

You can import your code only if you move your test1.py to the root level like this:
$ tree
.
├── mysite
│   ├── manage.py
│   ├── mysite
│   │   ├── __init__.py
│   │   ├── settings.py
│   │   ├── urls.py
│   │   └── wsgi.py
│   ├── polls
│   │   ├── admin.py
│   │   ├── apps.py
│   │   ├── __init__.py
│   │   ├── migrations
│   │   │   └── __init__.py
│   │   ├── models.py
│   │   ├── tests.py
│   │   └── views.py
│   └── test1.py (content: from polls import models)
└── test1.py (content: from mysite.polls import models)

This leads to polluting the source root with a bunch of testN.py files, which contradicts the Python Zen statement "Namespaces are one honking great idea -- let's do more of those!" and just makes the code structure messy and harder to navigate.

Creating a package and installing it into a virtualenv in editable mode (pip install -e .) lets you import any submodule from a script or from another submodule, no matter where they are located relative to the source code, as long as the virtualenv is activated.

The content of tmp/test1.py would be the same no matter where you place it:
from mysite.polls import models
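
For completeness, here is a minimal sketch of what the packaging could look like: a hypothetical setup.py placed at the top level, next to the mysite and tmp directories (the name and version are assumptions, and the outer mysite directory would also need an __init__.py so that mysite.polls resolves as a package):

# setup.py - minimal sketch; name, version and layout are assumptions
from setuptools import setup, find_packages

setup(
    name='mysite',
    version='0.1.0',
    packages=find_packages(exclude=['tmp']),
)

With that in place and the virtualenv activated, the project is installed in editable mode with:

$ pip install -e .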

Other reasons

There are more reasons to do Python packaging (although most of them fall into the "higher maturity" category):

  • Code can be uploaded to a private (local) PyPI repository (e.g. Gemfury)
  • Forces you to have properly versioned software (independent of any particular version control system) with comparable version numbers
  • Forces you to namespace the code, which is well aligned with the Python Zen statement "Namespaces are one honking great idea -- let's do more of those!"
  • Distributable independently of any particular version control system
    • No need to clone the entire repository when you only need the latest version
    • Allows not giving access to the entire source history to someone who is authorized only for deployment
    • No need to install a version control client (like git) to deploy
  • Installable/uninstallable/upgradeable with pip, pipenv and other similar tools
  • Allows distributing only the code required at run time (tests and other auxiliary stuff may be excluded from the package)
  • Package may include C extensions or Cython code which are automatically compiled during installation
  • Package can be precompiled as a Python wheel (see the example commands after this list)
  • Package can be compressed
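
As an illustration of some of these points, building a source distribution and a wheel and pushing them to a repository takes only a couple of commands (twine and the repository URL below are just placeholders for whatever index you use):

$ pip install wheel twine
$ python setup.py sdist bdist_wheel
$ twine upload --repository-url https://pypi.example.com/ dist/*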

Feb 13, 2019

My public source code

Prospective clients and employers often ask me to show some source code so that they can estimate my coding skills. Unfortunately, or fortunately, most of the code I write I write for money (aka professionally), therefore it is covered by an NDA, either explicitly or implicitly. Here is a list of the mostly leisure projects whose code I can show:

Stats on repositories that cannot be open sourced.

Jan 5, 2019

Select for update

It is easy to overlook select for update when using Django. Currently, I am working on a project that uses Django 1.11.x and Django REST Framework 3.8.2 (an upgrade to Django 2.1 is planned, but we need to move to Python 3.7.1 first) and I ran into the issue (again) of not having .select_for_update() where it is required. The issue looks as if your changes to the database simply do not happen, but what actually happens is that two parallel requests (or other parallel tasks) overwrite each other's data. This is because both requests retrieve a record from the database and reconstruct a corresponding model instance via the ORM, each getting its own copy. At some moment one of the requests modifies the attributes of its instance and saves the changes to the database. But that does not modify the second copy that sits in the other request. Later, the second request modifies the attributes of its copy and saves the changes to the database. This may not be an issue (in some cases) if the same attributes are being modified - the database then contains the most up-to-date values (probably what you actually need), but if that is not the case you have a problem: older values end up being stored in the database.

With Django's default behavior (at least for version 1.11) all model attributes are saved to the database even if they were not modified. Imagine you have a model A with attributes b and c. Two parallel requests get its instance from the database, instance = A(b='b1', c='c1'). Then request 1 changes b = 'b2' and saves the instance to the database: instance.save(). A(b='b2', c='c1') is stored in the database. Note that although request 1 did not change attribute c, it is stored to the database anyway. At this moment request 2 still holds its own copy, A(b='b1', c='c1'). Then it changes c = 'c2' and saves the changes to the database: instance.save(). A(b='b1', c='c2') is stored in the database. Again, attribute b was not changed by this request, but its value is saved to the database by default, and therefore it overwrites the value saved by request 1 (b = 'b2') with the older value b = 'b1' that was retrieved before. This is a typical lost update problem (also see write-write conflict).
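
A sketch of the fix, using the hypothetical model A from above: both requests have to lock the row before reading it, so the second request blocks until the first one commits and then sees the fresh values:

from django.db import transaction

with transaction.atomic():  # select_for_update() only works inside a transaction
    # instance_pk is whatever identifies the row; the row stays locked until commit
    instance = A.objects.select_for_update().get(pk=instance_pk)
    instance.b = 'b2'
    instance.save()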

In development and test environments this issue rarely reproduces, because the level of concurrency there is very low. But in a production environment it may very well happen. And when it does, it is really hard to debug (because it is production and because it is hard to figure out the conditions for reproducing it). Therefore this kind of issue should be prevented during development. One should make it a habit to query objects with .select_for_update() if they are going to be modified.

But developers still forget to do it (at least I sometimes forget). There are two things that could be done here. First, if Django REST Framework is used, it could apply .select_for_update() to all modification operations by default, or at least to PATCH and PUT (my question to the core developer). Second, Django should not save all attributes blindly, but only those that were modified (this would cover the cases where different attributes are modified by parallel requests and would also improve performance).
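
The second point can already be approximated by hand: Model.save() accepts update_fields, which limits the UPDATE statement to the listed columns (again using the hypothetical model A from above):

instance.b = 'b2'
instance.save(update_fields=['b'])  # only column b is written, c is left untouched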

UPDATE: While we are waiting for a reply from the core developer, here is a snippet for Django REST Framework:
from rest_framework import mixins
from rest_framework.generics import GenericAPIView
from rest_framework.viewsets import ViewSetMixin


class CustomGenericAPIView(GenericAPIView):
    def get_queryset(self):
        qs = super(CustomGenericAPIView, self).get_queryset()
        # Lock the selected rows for modifying requests. Note that
        # select_for_update() only takes effect inside a transaction
        # (e.g. with ATOMIC_REQUESTS = True or transaction.atomic()).
        if self.request.method in ('PATCH', 'PUT'):
            qs = qs.select_for_update()

        return qs


class CustomGenericViewSet(ViewSetMixin, CustomGenericAPIView):
    pass


class CustomModelViewSet(mixins.CreateModelMixin,
                         mixins.RetrieveModelMixin,
                         mixins.UpdateModelMixin,
                         mixins.DestroyModelMixin,
                         mixins.ListModelMixin,
                         CustomGenericViewSet):
    pass