Python Data Science CookieCutter
The Python Data Science CookieCutter is a useful standardized framework for organizing data science projects. Here I write down things I learned while going through the framework.
# CookieCutter Project Tree
```
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io
```
# .env
Simply put, `.env` is for storing passwords and other sensitive information. Generally, its usage looks like this:
```
# .env file
EMAIL="hello@world.com"
DATABASE_PASSWORD="41addd05-49fa-4b64-9b44-7a2ee256451"
```
```python
# python code
import os
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())
EMAIL = os.getenv("EMAIL")
DATABASE_PASSWORD = os.getenv("DATABASE_PASSWORD")
```
If your project only uses open-source datasets, then dotenv may not be very useful. Otherwise, you may use it for storing, e.g., a dataset API ID and password.
**Caution**: `load_dotenv()` by default does not update environment variables that already exist, so your changes to `.env` would not be reflected. Usually `load_dotenv(find_dotenv(), override=True)` is a much better choice.
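A quick sketch of the difference, assuming the `.env` file above sits somewhere `find_dotenv()` can locate it:

```python
import os
from dotenv import load_dotenv, find_dotenv

os.environ["EMAIL"] = "stale@old.com"      # pretend this is a leftover value

load_dotenv(find_dotenv())                 # default: existing variables win
print(os.getenv("EMAIL"))                  # stale@old.com

load_dotenv(find_dotenv(), override=True)  # values from .env win
print(os.getenv("EMAIL"))                  # hello@world.com
```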
The `.env` file also treats `#` as the start of a comment, so you don't want something like `DATASET_URL=http://example.com/#/dataset` left unquoted. Quotes are mandatory in this case.
`.env` is in `.gitignore` by default.
# .gitkeep
Since `git` only tracks files, `.gitkeep` is a hack for keeping otherwise-empty directory structure in the repository. Since it is basically a dummy file, the name doesn't really matter (it can also be renamed to `.keep`).
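If you want to seed the empty data folders with such placeholders, a small (purely illustrative) helper could look like this:

```python
from pathlib import Path

# Drop a .gitkeep placeholder into each data directory so git keeps
# the (otherwise empty) folder structure.
for sub in ["external", "interim", "processed", "raw"]:
    d = Path("data") / sub
    d.mkdir(parents=True, exist_ok=True)
    (d / ".gitkeep").touch()
```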
# data
`data` contains four folders: `external`, `interim`, `processed`, and `raw`.
When join and aggregate operations are necessary, the role of `interim` is very clear: you may start with two tables in `raw`, join them, and store the joint table in `interim`. Otherwise, I don't think `interim` is useful in most cases. You may just collect the raw data and put it in `raw`; after normalization and removal of the unwanted features, it goes directly to `processed`. That workflow only uses two of the four folders.
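To make the join-then-clean flow concrete, here is a minimal pandas sketch; the file names (`users.csv`, `orders.csv`) and the column names are hypothetical, not part of the template:

```python
import pandas as pd

# Two immutable raw tables (hypothetical file names).
users = pd.read_csv("data/raw/users.csv")
orders = pd.read_csv("data/raw/orders.csv")

# Join them and store the intermediate result in interim/.
joint = users.merge(orders, on="user_id", how="inner")
joint.to_csv("data/interim/users_orders.csv", index=False)

# Normalize a numeric column, drop unwanted features, and write the
# final, canonical data set to processed/.
joint["amount"] = (joint["amount"] - joint["amount"].mean()) / joint["amount"].std()
processed = joint.drop(columns=["free_text_comment"])
processed.to_csv("data/processed/users_orders_clean.csv", index=False)
```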
The difference between `raw` and `external` seems pretty obscure. Fortunately, the author provides an explanation here. It seems that `external` is only useful when you start with some (e.g. sponsored) data and then find it useful to compare against other data. If you just need to prove your model's superiority, it is definitely not necessary.

Personally, I would prefer a `simulated` or `synthetic` folder instead of `external`.
> Don’t ever edit your raw data, especially not manually, and especially not in Excel. Don’t overwrite your raw data. Don’t save multiple versions of the raw data. Treat the data (and its format) as immutable. The code you write should move the raw data through a pipeline to your final analysis.

The `raw` folder is necessary; I cannot agree more with this.
# doc
Cookiecutter recommends `sphinx` for writing documentation, which can support Markdown syntax by introducing `recommonmark`. There are many online tutorials for using `sphinx`. If the project you set up is more of an "exploration" one, then documentation is clearly not the priority; I prefer writing detailed comments (especially focusing on dimensionality) instead of writing docs along with the code.
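For reference, enabling Markdown alongside reStructuredText usually only takes a couple of lines in `conf.py`. This is a minimal sketch for recommonmark >= 0.5.0; check the documentation for your version (recommonmark has since been deprecated in favour of myst-parser):

```python
# conf.py (sketch): let Sphinx parse both .rst and .md source files.
extensions = [
    "recommonmark",  # Markdown support via recommonmark
]

source_suffix = [".rst", ".md"]
```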
# notebook
Before I developed an interest in data science project conventions, I used notebooks exclusively for my data science projects. From my personal experience, two suggestions stand out: use `autoreload`,
```python
# OPTIONAL: Load the "autoreload" extension so that code can change
%load_ext autoreload
# OPTIONAL: always reload modules so that as you change code in src, it gets loaded
%autoreload 2
```
and use the consistent naming format `<step>-<ghuser>-<description>.ipynb`; both will largely reduce your workload.
**Trick**: Frequently we want to add a new method to a model class, but an instance created before the change (e.g. before the module is reloaded) still points to the old class object and does not get the new method. A trick to solve the problem is:
```python
inst.__class__ = testmodule.TestClass  # TestClass already modified
```
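Here is a minimal self-contained illustration of the same trick; the class name `Model` and the method `predict` are made up for this example:

```python
class Model:
    def __init__(self, weight):
        self.weight = weight

inst = Model(2.0)

# Later the class is (re)defined with an extra method, e.g. after editing
# the source file and the module being reloaded.
class Model:
    def __init__(self, weight):
        self.weight = weight

    def predict(self, x):
        return self.weight * x

# The old instance still points at the old class object and lacks `predict`.
# Re-pointing its __class__ gives it the new method without re-creating it.
inst.__class__ = Model
print(inst.predict(3.0))  # 6.0
```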
# tox.ini
`tox` is a useful testing tool that can also be used to manage testing across different virtual environments. It comes in handy when, for example, a baseline model's open-source implementation is incompatible with the latest torch distribution. The default `tox.ini` includes a `flake8` syntax checker; most code editors provide similar functionality.
# Final takeaway
Cookiecutter definitely provides a lot of useful Python data science project conventions. But it is not your mom; it does not provide any direct guidance on:
- The structure for simulated data (comparing the learned model with the ground-truth parameters).
- If your baselines have very different structures such that their train methods cannot be exactly the same (forward → loss vs. forward → sample → integration → loss), what should you do to reduce the amount of duplicated code?
- The file structure for setting up inheritance between different models is something that really needs a convention.
- Where to put the hyperparameters for all the different models?
  - I personally prefer adding a `config.json` file that handles everything, with different models taking different directories (see the sketch after this list).
  - I know the convention seems to be making everything a command-line argument, but I hate typing long commands.
- Where to put the utilities and debug code?
- The cookiecutter layout makes it look like there are no "dirty" steps when playing with models.
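For the hyperparameter point above, here is a rough sketch of the `config.json` approach I have in mind; the file layout, model names, and keys are hypothetical, not something the template prescribes:

```python
import json
from pathlib import Path

# Hypothetical config.json, one entry per model:
# {
#   "baseline_mlp": {"lr": 1e-3, "hidden_dim": 128, "out_dir": "models/baseline_mlp"},
#   "fancy_flow":   {"lr": 5e-4, "n_layers": 8, "out_dir": "models/fancy_flow"}
# }

def load_config(model_name, path="config.json"):
    """Return the hyperparameter dict for one model from the shared config file."""
    with open(path) as f:
        all_cfg = json.load(f)
    cfg = all_cfg[model_name]
    # Each model writes its artifacts to its own directory.
    Path(cfg["out_dir"]).mkdir(parents=True, exist_ok=True)
    return cfg

cfg = load_config("baseline_mlp")
print(cfg["lr"])
```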
Nevertheless, I thank the authors for their contributions.