Python Data Science Cookiecutter

The Python Data Science Cookiecutter is a useful, standardized framework for organizing data science projects. Here I write down the things I learned while going through it.

# Cookiecutter Project Tree

```
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io
```

# .env

Simply put, `.env` is for storing passwords and other sensitive information. Its typical usage looks like this:

```
# .env file
EMAIL="hello@world.com"
DATABASE_PASSWORD="41addd05-49fa-4b64-9b44-7a2ee256451"
```

```python
# python code
import os

from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

EMAIL = os.getenv("EMAIL")
DATABASE_PASSWORD = os.getenv("DATABASE_PASSWORD")
```

If your project only uses open-source datasets, then dotenv is not all that useful. Otherwise, you may use it to store, for example, a dataset API ID and password.

**Caution**

`load_dotenv()` by default does not update environment variables that already exist, so your changes to `.env` would not be reflected. `load_dotenv(find_dotenv(), override=True)` is usually a much better choice.

The `.env` file also treats `#` as the start of a comment, so you don't want something like `DATASET_URL=http://example.com/#/dataset`; the value would be cut off at the `#`. Quotes are mandatory in this case: `DATASET_URL="http://example.com/#/dataset"`.

`.env` is in `.gitignore` by default.

# .gitkeep

Since git only tracks files (not directories), `.gitkeep` is a hack to make git keep otherwise-empty directories, preserving the project structure. It is basically a dummy file, so the name doesn't really matter (it could just as well be named `.keep`).
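For example, to keep the (initially empty) `data` subfolders from the tree above under version control, one could create the placeholders like this (a sketch; a plain `touch data/raw/.gitkeep` in a shell works just as well):

```python
from pathlib import Path

# Drop a .gitkeep placeholder into each data folder so that git,
# which only tracks files, keeps the directory structure.
for sub in ["external", "interim", "processed", "raw"]:
    folder = Path("data") / sub
    folder.mkdir(parents=True, exist_ok=True)
    (folder / ".gitkeep").touch()
```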

# data

`data` contains four subfolders: `external`, `interim`, `processed`, and `raw`.

When join and aggregate operations are necessary, the role of `interim` is very clear: you may start with two tables in `raw`, then join them to produce a joint table which is stored in `interim`. Otherwise, I don't think `interim` is useful in most cases. You just collect the raw data into `raw`, and after normalization and removal of the unwanted features it goes directly to `processed`. That workflow only uses two of the four folders.
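As a sketch of the two-table case (assuming pandas; the file and column names are hypothetical):

```python
import pandas as pd

# raw: the two original, immutable input tables
users = pd.read_csv("data/raw/users.csv")
events = pd.read_csv("data/raw/events.csv")

# interim: the joint table produced by the join
joint = events.merge(users, on="user_id", how="left")
joint.to_csv("data/interim/events_with_users.csv", index=False)

# processed: drop unwanted columns and normalize a feature
processed = joint.drop(columns=["free_text_comment"])
processed["amount"] = (
    processed["amount"] - processed["amount"].mean()
) / processed["amount"].std()
processed.to_csv("data/processed/events_model_ready.csv", index=False)
```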

The difference between `raw` and `external` seems pretty obscure. Fortunately, the author provides an explanation here: `external` only becomes useful when you start with some (e.g., sponsored) data and later find it worthwhile to compare against other data sources. If you just need to demonstrate your model's superiority, it is definitely not necessary.

Personally, I would prefer a `simulated` or `synthetic` folder instead of `external`.

> Don't ever edit your raw data, especially not manually, and especially not in Excel. Don't overwrite your raw data. Don't save multiple versions of the raw data. Treat the data (and its format) as immutable. The code you write should move the raw data through a pipeline to your final analysis.

The `raw` folder is necessary; I cannot agree more with this.

# docs

Cookiecutter recommends Sphinx for writing documentation, which can also support Markdown syntax via recommonmark; there are many online tutorials for using Sphinx. If the project you set up is more of an "exploration" one, then documentation is clearly not the priority. I prefer writing detailed comments (especially ones focusing on dimensionality) over writing docs alongside the code.
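For example, this is the kind of dimensionality-focused comment I have in mind (the function itself is made up for illustration):

```python
import numpy as np

def batched_dot(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Batched dot product.

    a:       (batch, dim)
    b:       (batch, dim)
    returns: (batch,) -- one scalar per batch element
    """
    # (batch, dim) * (batch, dim) -> (batch, dim), then sum out dim
    return (a * b).sum(axis=-1)
```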

# notebooks

Before I developed an interest in data science project conventions, I used notebooks exclusively for my data science projects. From my personal experience, two of the template's suggestions stand out. The first is to use autoreload:

```python
# OPTIONAL: Load the "autoreload" extension so that code can change
%load_ext autoreload
# OPTIONAL: always reload modules so that as you change code in src, it gets loaded
%autoreload 2
```

The second is to use the common naming format `<step>-<ghuser>-<description>.ipynb`.

Both will largely reduce your workload.

**Trick**

Frequently we want to add a new method to a model class, but instances created before the change will not pick up the new method automatically. A trick that solves the problem is to reassign the instance's class:

```python
inst.__class__ = testmodule.TestClass  # TestClass already modified
```
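As a minimal self-contained illustration of the trick (`Greeter` is a made-up class standing in for a model, and the second definition simulates an edit picked up by autoreload):

```python
class Greeter:
    def hello(self):
        return "hello"

inst = Greeter()

# Simulate editing the class in src/ and having autoreload re-execute it:
class Greeter:
    def hello(self):
        return "hello"

    def goodbye(self):  # newly added method
        return "goodbye"

# `inst` still points at the old class object and lacks goodbye();
# reassigning __class__ upgrades the existing instance in place.
inst.__class__ = Greeter
print(inst.goodbye())  # -> goodbye
```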

# tox.ini

tox is a useful testing tool that can also manage tests across different virtual environments. That helps when, for example, a baseline model's open-source implementation is incompatible with the latest torch distribution. The default `tox.ini` only includes the flake8 syntax checker, and most code editors provide similar functionality.
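As a hedged sketch (not what the template ships), here is what a `tox.ini` testing against two torch versions might look like; the version pins and the `tests/` path are my own assumptions:

```ini
[tox]
envlist = py39-torch1, py39-torch2

[testenv]
deps =
    pytest
    torch1: torch==1.13.1
    torch2: torch==2.3.0
commands = pytest tests/
```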

# Final takeaway

Cookiecutter definitely provides a lot of useful conventions for Python data science projects. But it is not your mom; it does not provide any direct guidance on:

- The structure for simulated data (comparing a learned model against ground-truth parameters).
- If your baselines have such different structures that their train methods cannot be exactly the same (forward → loss vs. forward → sample → integration → loss), what should you do to reduce the amount of duplicated code?
  - A file structure for setting up inheritance between different models is something that really needs a convention.
- Where to put the hyperparameters for all the different models?
  - I personally prefer adding a single `config.json` file that handles everything, with different models taking different directories (see the sketch after this list).
  - I know the convention seems to be making everything a command-line argument, but I hate typing long commands.
- Where to put utilities and debug code?
  - The cookiecutter layout looks as if there were no "dirty" steps when playing with models.
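Here is a minimal sketch of the `config.json` approach mentioned above; the keys and hyperparameter values are entirely my own invention:

```python
import json

# config.json might look like:
# {
#     "baseline_mlp": {"lr": 0.001, "hidden": 128},
#     "flow_model":   {"lr": 0.0001, "n_layers": 6}
# }
with open("config.json") as f:
    config = json.load(f)

model_name = "baseline_mlp"  # this could still be a (short) CLI argument
hparams = config[model_name]
print(hparams["lr"])
```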

Nevertheless, I thank the authors for their contributions.