Python Data Science CookieCutter
The Python Data Science Cookiecutter is a useful standardized framework for organizing data science projects. Here I write down things I learned while going through it.
# CookieCutter Project Tree

```
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io
```
`.env` is for storing passwords or other sensitive information. Typical usage looks like this:

```
# .env file
EMAIL="you@example.com"
DATABASE_PASSWORD="41addd05-49fa-4b64-9b44-7a2ee256451"
```

```python
# python code
import os
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())
EMAIL = os.getenv("EMAIL")
DATABASE_PASSWORD = os.getenv("DATABASE_PASSWORD")
```
If your project only uses open-source datasets, then dotenv is not very useful. Otherwise, you may use it for storing e.g. a dataset API id and password.
`load_dotenv()` by default does not update environment variables that already exist, so your changes to `.env` would not be reflected. `load_dotenv(find_dotenv(), override=True)` is usually a much better choice.
The `.env` file also treats `#` as the start of a comment, so you don't want something like `DATASET_URL=http://example.com/#/dataset`: everything after the `#` would be dropped. Quotes are mandatory in this case, i.e. `DATASET_URL="http://example.com/#/dataset"`.
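To see why, here is a minimal stdlib-only sketch of the comment handling; it mimics, but is not, the real python-dotenv parser:

```python
def parse_env_line(line):
    """Toy .env line parser illustrating dotenv-style comment handling."""
    key, _, value = line.partition("=")
    value = value.strip()
    if value[:1] in ("'", '"'):
        value = value[1:-1]                     # quoted: keep '#' literally
    else:
        value = value.split("#", 1)[0].strip()  # unquoted: '#' starts a comment
    return key.strip(), value

print(parse_env_line('DATASET_URL=http://example.com/#/dataset'))
# -> ('DATASET_URL', 'http://example.com/')
print(parse_env_line('DATASET_URL="http://example.com/#/dataset"'))
# -> ('DATASET_URL', 'http://example.com/#/dataset')
```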
`.env` is by default listed in `.gitignore`, so the secrets it stores are never committed. Since git only tracks files, not directories, `.gitkeep` is a hack for making git keep otherwise-empty structured directories such as `data/raw`. It is basically a dummy file, so the name doesn't really matter (it could just as well be renamed to anything else).
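As a sketch, the placeholder files can be created programmatically; the directory names below just follow the cookiecutter tree:

```python
from pathlib import Path

def keep_dirs(root, subdirs):
    """Create each directory with an empty .gitkeep so git will track it."""
    kept = []
    for sub in subdirs:
        d = Path(root) / sub
        d.mkdir(parents=True, exist_ok=True)
        (d / ".gitkeep").touch()   # empty dummy file; the exact name is arbitrary
        kept.append(d / ".gitkeep")
    return kept

kept = keep_dirs("data", ["external", "interim", "processed", "raw"])
```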
`data` contains four folders: `raw`, `external`, `interim`, and `processed`.
When join and aggregate operations are necessary, the role of `interim` is very clear: you may have two tables in `raw` in the beginning, then join them to produce a joint table that is stored in `interim`. Otherwise, I don't think `interim` is useful in most cases. You may just collect and put the raw data in `raw`; after normalization and removal of the unwanted features, it goes directly to `processed`. That workflow only uses two of the four folders.
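A minimal sketch of that two-folder workflow; the column names and the normalization here are invented for illustration:

```python
import csv
from pathlib import Path

def process_rows(rows):
    """Drop unwanted columns and normalize values (toy transformations)."""
    out = []
    for row in rows:
        row = dict(row)
        row.pop("unwanted_feature", None)           # remove a feature we don't model
        row["value"] = float(row["value"]) / 100.0  # toy normalization
        out.append(row)
    return out

def raw_to_processed(raw_csv, processed_csv):
    """Read the immutable raw file and write the result straight to processed/."""
    with open(raw_csv, newline="") as f:
        rows = process_rows(csv.DictReader(f))
    Path(processed_csv).parent.mkdir(parents=True, exist_ok=True)
    with open(processed_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```

Keeping the transformation in a pure function like `process_rows` makes the pipeline easy to test without ever touching the files in `raw`.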
The difference between `raw` and `external` seems pretty obscure. Fortunately, the author provides an explanation here. It seems that `external` is only useful when you start with some e.g. sponsored data and then find it useful to compare against other data. If you just need to prove your model's superiority, then it is definitely not necessary.
Personally, I would prefer the name `synthetic` instead of `external`.
> Don’t ever edit your raw data, especially not manually, and especially not in Excel. Don’t overwrite your raw data. Don’t save multiple versions of the raw data. Treat the data (and its format) as immutable. The code you write should move the raw data through a pipeline to your final analysis.
This is why an immutable `raw` folder is necessary. I cannot agree more with this.
The `docs` folder is a default `sphinx` project for writing documentation. `sphinx` can support Markdown syntax by introducing `recommonmark`, and there are many online tutorials for it. If the project you set up is more of an "exploration" one, then documentation is clearly not the priority. I prefer writing detailed comments (especially focusing on dimensionality) instead of writing docs along with the code.
Before I developed an interest in data science project conventions, I used notebooks exclusively for data science projects. From my personal experience, two of the template's suggestions would largely reduce your workload. First, start each notebook with

```python
# OPTIONAL: Load the "autoreload" extension so that code can change
%load_ext autoreload
# OPTIONAL: always reload modules so that as you change code in src, it gets loaded
%autoreload 2
```

Second, stick to the same naming format for notebooks (a number for ordering, the creator's initials, and a short `-` delimited description).
Frequently we want to add a new method to a model class, but reloading the module won't give already-created instances the new method. A trick to solve the problem:

```python
inst.__class__ = testmodule.TestClass  # TestClass already modified
```
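A self-contained sketch of the trick; `OldModel` and `NewModel` are stand-ins for the class before and after you edit the module:

```python
class OldModel:
    def predict(self, x):
        return x

class NewModel:                 # the "already modified" class
    def predict(self, x):
        return x

    def describe(self):         # method added after inst was created
        return "new method"

inst = OldModel()               # created before the class was modified
inst.__class__ = NewModel       # swap the class; no new instance needed
print(inst.describe())          # -> new method
```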
`tox` is a useful testing tool and can also manage testing across different virtual environments, which helps when e.g. a baseline model's open-source implementation is incompatible with the latest torch distribution. The default `tox.ini` includes a `flake8` syntax check, though most code editors provide similar functionality.
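For reference, a minimal multi-environment `tox.ini` might look like the sketch below; the Python versions and the `pytest` dependency are assumptions, not the template's defaults:

```ini
[tox]
# run the test suite under two different interpreters
envlist = py38, py310

[testenv]
deps = pytest
commands = pytest tests/

[flake8]
max-line-length = 88
```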
Cookiecutter definitely provides a lot of useful Python data science project conventions. But it is not your mom; it does not provide any direct guidance on:
- The structure for simulated data (comparing a learned model against ground-truth parameters).
- If your baselines have very different structures, so that their training procedures cannot be exactly the same (forward → loss vs. forward → sample → integration → loss), what should you do to reduce the amount of duplicated code?
- The file structure for setting up inheritance between different models is something that really needs a convention.
- Where to put the hyperparameters for all the different models?
    - I personally prefer adding a `config.json` file that handles everything; different models take different directories.
    - I know the convention seems to be making everything a command-line argument, but I hate typing long commands.
- Where to put the utilities and debug code?
    - The cookiecutter layout makes it look like there are no "dirty" steps when playing with models.
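To make the `config.json` idea concrete, here is a hedged sketch; the model names and hyperparameter fields are invented for illustration:

```python
import json

# One section per model; each points at its own output directory.
CONFIG_TEXT = """
{
  "baseline_mlp": {"lr": 1e-3, "hidden_dim": 64, "out_dir": "models/baseline_mlp"},
  "latent_ode":   {"lr": 5e-4, "hidden_dim": 32, "out_dir": "models/latent_ode"}
}
"""

config = json.loads(CONFIG_TEXT)

def get_model_config(name):
    """Look up one model's hyperparameters from the shared config."""
    return config[name]

print(get_model_config("baseline_mlp")["lr"])   # -> 0.001
```

In a real project the JSON would live in its own `config.json` file and be read with `json.load`; keeping all models in one file avoids typing long command lines.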
Nevertheless, I thank the authors for their contributions.
Except where otherwise noted, content on this blog is licensed under a Creative Commons Attribution 4.0 International License.