Feb 21, 2019
Python packaging
After years of working as a Python developer in various teams I learned that most Python backend developers do not bother with Python packaging for backend applications. This makes me sad. So I decided to write a blog post to explain why I do otherwise.
I think I started packaging all the Python software I produce ages ago (at the beginning of 2013), right after I posted this article (in Russian) to be vetted as an admission fee to Habr (to gain write and commenting rights).
Many developers asked me why I do Python packaging while everything works just fine without it. "It is a higher level of maturity" was usually not accepted as an answer.
Bulletproof reason (apparently not)
UPDATE: There is a workaround with PYTHONPATH:

$ PYTHONPATH=$PWD:$PYTHONPATH python ./tmp/test1.py

So the 100% bulletproof reason for Python packaging is: without packaging the source code cannot be reused in certain cases. You simply cannot import it, because it ends up outside the visible scope of the running Python interpreter.
Imagine you have created a skeleton Django project (I am using Django just to make the example more practical) and a Django application (as described in the official Django tutorial):
$ python --version
Python 3.7.2
$ python -m django --version
2.1.7
$ django-admin startproject mysite
$ cd mysite
$ python manage.py startapp polls
Now you need to do some quick research to test your code, so you create a directory that should not be a part of the project and a file with your snippet:
$ cd ..
$ mkdir tmp
$ vim tmp/test1.py
tmp/test1.py content:

from mysite.polls import models
This would look like:
(django) $ tree
.
├── mysite
│   ├── manage.py
│   ├── mysite
│   │   ├── __init__.py
│   │   ├── settings.py
│   │   ├── urls.py
│   │   └── wsgi.py
│   └── polls
│       ├── admin.py
│       ├── apps.py
│       ├── __init__.py
│       ├── migrations
│       │   └── __init__.py
│       ├── models.py
│       ├── tests.py
│       └── views.py
└── tmp
    └── test1.py
When you run the tmp/test1.py script you get:

$ python ./tmp/test1.py
Traceback (most recent call last):
  File "./tmp/test1.py", line 1, in <module>
    from mysite.polls import models
ModuleNotFoundError: No module named 'mysite'
The module cannot be imported, because it is outside the visible scope of the script (the Python interpreter).
A relative import does not work either:
from ..mysite.polls import models
$ python ./tmp/test1.py
Traceback (most recent call last):
  File "./tmp/test1.py", line 1, in <module>
    from ..mysite.polls import models
ValueError: attempted relative import beyond top-level package
Let's bring tmp/test1.py into the mysite directory:

$ tree
.
└── mysite
    ├── manage.py
    ├── mysite
    │   ├── __init__.py
    │   ├── settings.py
    │   ├── urls.py
    │   └── wsgi.py
    ├── polls
    │   ├── admin.py
    │   ├── apps.py
    │   ├── __init__.py
    │   ├── migrations
    │   │   └── __init__.py
    │   ├── models.py
    │   ├── tests.py
    │   └── views.py
    └── tmp
        └── test1.py
tmp/test1.py content:

from ..polls import models
Nope:
$ python ./mysite/tmp/test1.py
Traceback (most recent call last):
  File "./mysite/tmp/test1.py", line 1, in <module>
    from ..polls import models
ValueError: attempted relative import beyond top-level package
You can import your code only if you move your test1.py to the root level like this:

$ tree
.
├── mysite
│   ├── manage.py
│   ├── mysite
│   │   ├── __init__.py
│   │   ├── settings.py
│   │   ├── urls.py
│   │   └── wsgi.py
│   ├── polls
│   │   ├── admin.py
│   │   ├── apps.py
│   │   ├── __init__.py
│   │   ├── migrations
│   │   │   └── __init__.py
│   │   ├── models.py
│   │   ├── tests.py
│   │   └── views.py
│   └── test1.py (content: from polls import models)
└── test1.py (content: from mysite.polls import models)
This leads to polluting the source root with a bunch of testN.py files, which contradicts the Zen of Python statement "Namespaces are one honking great idea -- let's do more of those!" and just makes the code structure messy and harder to navigate.
Creating a package installable into a virtualenv in editable mode (pip install -e .) lets you import any submodule from a script or another submodule, no matter where they are located relative to the source code, as long as the virtualenv is activated.

The content of tmp/test1.py would be the same no matter where you place it:

from mysite.polls import models
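For completeness, here is a minimal setup.py sketch that would make pip install -e . work for the layout above. It is not the actual file from the post: the name, version and Django pin are assumptions, and it assumes an __init__.py has been added to the outer mysite directory so that mysite and mysite.polls are discovered as packages.

# setup.py - a minimal sketch, not the actual file from the post.
# The name, version and Django pin are assumptions for illustration.
# Assumes an __init__.py has been added to the outer mysite/ directory.
from setuptools import find_packages, setup

setup(
    name='mysite',
    version='0.1.0',
    packages=find_packages(exclude=['tmp']),
    install_requires=['Django>=2.1,<2.2'],
)

With this file at the repository root, pip install -e . installs the project into the active virtualenv in editable mode, and the snippet then imports cleanly from any directory:

$ pip install -e .
$ python ./tmp/test1.py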
Other reasons
There are more reasons to do Python packaging (although most of them fall into the "higher maturity" category):
- Code can be uploaded to a private (local) PyPI repository (e.g. Gemfury)
- Forces you to have properly versioned software (independent of any particular version control system) with comparable version numbers
- Forces you to namespace the code, which aligns well with the Zen of Python statement "Namespaces are one honking great idea -- let's do more of those!"
- Distributable independently of any particular version control system
- No need to clone the entire repository when you only need the latest version
- Avoids giving access to the entire source history to someone who is authorized only for deployment
- No need to install a version control client (like git) to deploy
- Installable/uninstallable/upgradeable with pip, pipenv and other similar tools
- Allows distributing only the code required at run time (tests and other auxiliary stuff may be excluded from the package)
- Package may include C-extensions or Cython code which are automatically compiled during installation
- Package can be precompiled as a Python wheel (see the build example after this list)
- Package can be compressed
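To illustrate the wheel and private index points above, here is a sketch of how a package could be built and pushed; the index URL is a placeholder, and the wheel and twine packages are assumed to be installed:

$ pip install wheel twine
$ python setup.py sdist bdist_wheel
$ twine upload --repository-url https://pypi.example.com/ dist/*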
P.S. Advanced tutorial on Python packaging for Django users
Feb 13, 2019
My public source code
Prospective clients and employers often ask me to show some source code so they can assess my coding skills. Unfortunately, or fortunately, most of the code I write I do for money (aka professionally), therefore it is covered by an NDA either explicitly or implicitly. Here is a list of mostly leisure projects whose code I can show:
- packy-agent - a Raspberry Pi-installable agent that performs various network measurements (ping, traceroute, speedtest, HTTP request); the only paid work I have publicly available (packaged source code), 2019
- scrape-upwork - Upwork job scraper, 2020
- pascal_triangle - code for my blog post on Pascal's Triangle printing optimization, 2017
- dmu-utils - an abandoned attempt to build an open-source library (PyPI), 2017
- refactor-me - an example repository of a refactoring process, one refactoring step per commit, 2017
Stats on repositories that cannot be open sourced.
Jan 5, 2019
Select for update
It is easy to overlook select for update when using Django. Currently, I am working on a project that uses Django 1.11.x and Django REST Framework 3.8.2 (an upgrade to Django 2.1 is planned, but we need to move to Python 3.7.1 first) and ran into an issue (again) of not having .select_for_update() where it is required. The issue looks as if your changes to the database do not happen, but what actually happens is that two parallel requests (or other parallel tasks) overwrite each other's data. Both requests retrieve a record from the database and reconstruct a corresponding model instance via the ORM, each its own copy. At some moment one of the requests modifies its instance attributes and saves the changes to the database. But it does not modify them in the second copy that sits in the other request. At a later moment the second request modifies the attributes of its copy and saves its changes to the database. It may not be an issue (in some cases) if the same attributes are modified - the database would then contain the most up-to-date values (probably what you actually need), but if that is not the case we have an issue with older values being stored in the database.

With Django's default behavior (at least for version 1.11) all model attributes are saved to the database even if they were not modified. Imagine you have a model A with attributes b and c. Two parallel requests get its instance from the database: instance = A(b='b1', c='c1'). Then request 1 changes b = 'b2' and saves the instance to the database: instance.save(). A(b='b2', c='c1') is stored in the database. Note that although request 1 did not change attribute c, it is stored to the database anyway. At this moment request 2 still holds its own copy as A(b='b1', c='c1'). Then it changes c = 'c2' and saves its changes to the database: instance.save(). A(b='b1', c='c2') is stored in the database. Again, attribute b was not changed by the request, but its value is saved to the database by default, therefore it overwrites the value saved by request 1 (b = 'b2') with the older value b = 'b1' that was retrieved before. It is a typical lost update problem (also see write-write conflict).

In development and test environments this issue rarely reproduces. This is because of the very low level of concurrency in these environments. But in a production environment it may very well happen. And when it does, it is really hard to debug (because it is production and because it is hard to figure out the conditions for reproduction). Therefore this kind of issue should be prevented during development. One should make a habit of querying objects with .select_for_update() if they are to be modified.
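As an illustration of that habit, here is a minimal sketch; the Payment model, its fields and the app name are hypothetical, used only to show the pattern:

# A sketch of the recommended pattern; not code from the project.
# Payment, its fields and the app name are hypothetical.
from django.db import transaction

from myapp.models import Payment  # hypothetical app and model


def mark_paid(payment_id):
    # select_for_update() requires an open transaction; evaluating it in
    # autocommit mode raises TransactionManagementError.
    with transaction.atomic():
        # SELECT ... FOR UPDATE locks the row until the transaction ends,
        # so a parallel request blocks here instead of overwriting our changes.
        payment = Payment.objects.select_for_update().get(pk=payment_id)
        payment.status = 'paid'
        payment.save()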
But developers still forget to do it (I forget sometimes, at least). There are two things that could be done here. First, if Django REST Framework is used, it could apply .select_for_update() for all modification operations by default, or at least for PATCH and PUT (my question to the core developer). Second, Django should not save all attributes blindly, but only those that were modified (this would cover cases where different attributes are modified by parallel requests and also improve performance).

UPDATE: While we are waiting for a reply from the core developer, here is a snippet for Django REST Framework:
from rest_framework import mixins
from rest_framework.generics import GenericAPIView
from rest_framework.viewsets import ViewSetMixin


class CustomGenericAPIView(GenericAPIView):

    def get_queryset(self):
        qs = super(CustomGenericAPIView, self).get_queryset()
        if self.request.method in ('PATCH', 'PUT'):
            qs = qs.select_for_update()
        return qs


class CustomGenericViewSet(ViewSetMixin, CustomGenericAPIView):
    pass


class CustomModelViewSet(mixins.CreateModelMixin,
                         mixins.RetrieveModelMixin,
                         mixins.UpdateModelMixin,
                         mixins.DestroyModelMixin,
                         mixins.ListModelMixin,
                         CustomGenericViewSet):
    pass
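A hypothetical usage example (the Question model, serializer and module paths are assumptions, not part of the original snippet). Note that the lock only helps if the request runs inside a transaction, e.g. with ATOMIC_REQUESTS = True in the database settings:

# Hypothetical usage example, not from the original post.
from myproject.views import CustomModelViewSet  # wherever the snippet above lives

from polls.models import Question  # assumes a Question model in the polls app
from polls.serializers import QuestionSerializer  # hypothetical serializer module


class QuestionViewSet(CustomModelViewSet):
    # PATCH/PUT requests now issue SELECT ... FOR UPDATE for this queryset,
    # provided the request is wrapped in a transaction (ATOMIC_REQUESTS = True).
    queryset = Question.objects.all()
    serializer_class = QuestionSerializer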