Overview
- The following map summarizes my general WorkFlow for data analysis based on Python, a popular community-driven and user-friendly language.
- I hope the WorkFlow and methodology below can also serve as a reference for other languages/projects, where Python is found useful as a prototyping/demonstration tool.
- This is a work in progress, and will be aggregated into a talk about Open Research. My idea of open research is adopted from the computational thinking of the open source community.
Some of the basic concepts include,
- continuous integration: test-driven and object-oriented for easier maintenance.
- well documented: including report/web-page/presentation for better understanding.
- reproducibility: the containerized development environment, archived data, and git-logged code repository should be accessible for replication.
- community driven: publicly ready for peer review and general feedback.
Drafting
Outline: MindNode
Use a mind map to conveniently note down and organize the outline of a project.
Content types in MindNode
- image
- smart link with http://
- note
- task
Extend: export to Markdown
Export the mind map to a markdown document to extend the details on each topic.
Markdown Basics: the key formatting syntax. Markdown is also compatible with HTML markup most of the time.
<!-- image with style -->
[<img src="img_url" style="float: right" width="100px">](img_url)
<!-- cross reference -->
TOC
- [title](#tag)
<a name="tag">title</a>
<!-- footnote -->
<sup>[1](#myfootnote1)</sup>
<a name="myfootnote1">1</a>: Footnote content goes here
Developing
Environment: Docker
Containerize: balance between system isolation and performance, like a sandbox for micro-services.
Basic concepts
- Images - The blueprints of our application which form the basis of containers. In the demo above, we used the docker pull command to download the busybox image.
- Containers - Created from Docker images and run the actual application. We create a container using docker run, which we did using the busybox image that we downloaded. A list of running containers can be seen using the docker ps command.
- Docker Daemon - The background service running on the host that manages building, running and distributing Docker containers. The daemon is the process that runs in the operating system to which clients talk.
- Docker Client - The command line tool that allows the user to interact with the daemon. More generally, there can be other forms of clients too - such as Kitematic, which provides a GUI to the users.
- Docker Hub - A registry of Docker images. You can think of the registry as a directory of all available Docker images. If required, one can host their own Docker registries and use them for pulling images.
1. Start from a base image
Basic commands
docker pull image_name                    # pull public image/repository from a registry
docker build [--no-cache] -t image_name path/to/context [-f path/to/Dockerfile]
docker run -it                            # interactive
    --rm                                  # remove container on exit
    -d                                    # run detached
    -p 8888:8888                          # forward port to host
    -e DISPLAY=$DISPLAY                   # set environment variable
    -u user                               # username/uid in image
    -v path/to/local:path/to/container    # mount directory
    image_name
    [command]
docker port container_id                  # show the open ports of a container instance
docker start/attach/stop/rm container_id  # manage a container instance
docker rmi image_id                       # remove an image
- public registry: be cautious about potential security risks in third-party images/repositories.
2. Record needed ingredients in requirements.txt
While developing, record the additional python packages in a text file named requirements.txt, which will be useful to construct the Dockerfile to automatically configure the development environment, as well as to host an interactive Jupyter notebook with mybinder.
- file format and specifiers
pip freeze > requirements.txt   # list installed packages
# Example requirements.txt
SomeProject
SomeProject == 1.3
SomeProject >= 1.2, < 2.0
SomeProject[foo, bar]
SomeProject ~= 1.4.2
3. Compose the building recipe into a Dockerfile
Some resources
- Best practices: use minimum system with necessary packages
- Base app samples: from IDE to OS, official docker images to start with
- Docker Hub: Community hub, use with caution
Steps
- start with a base image that meets the needs of the project
- customize with additional dependencies using RUN pip install -r requirements.txt
- add special system setup and port configuration as needed
Prototyping: Python
Some principles
- Object-oriented programming (see the sketch after this list)
  - function -> method
  - variable -> attribute
- Computational thinking: see the full image here
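As a minimal sketch of the function -> method and variable -> attribute mapping above (the Circle class and its names are hypothetical):

```python
import math

# a free function with a plain variable ...
def area(radius):
    return math.pi * radius ** 2

# ... becomes a method on a class, with the variable as an attribute
class Circle:
    def __init__(self, radius):
        self.radius = radius      # variable -> attribute

    def area(self):               # function -> method
        return math.pi * self.radius ** 2

print(area(2.0))                  # 12.566...
print(Circle(2.0).area())         # same result, object-oriented style
```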
A few suggestions
- think about goals of code
- break into reasonable classes
- pseudocode it up
- check for efficient algorithm
- test each function and class
- assemble code in main
- run
- optimize if necessary
Code Styling: PEP8
- indentation: 4 spaces
- snake_case: packages, modules, functions
- CamelCase: classes
- lowercase (all together): variables; append _ to avoid shadowing built-in names (e.g., list_)
- spaces around operators, except in keyword arguments like func(a=1, b=3) or simple expressions like c = a/b
- separate imports on individual lines
- inline comments: at least 2 spaces before #, 1 space after
- docstrings; break long lines with \ (see the sketch below)
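A minimal sketch illustrating several of these conventions (the module, names, and file path are hypothetical):

```python
import os
import sys                        # separate imports on individual lines

MAX_RETRIES = 3                   # module-level constant

class DataLoader:                 # CamelCase class
    def load_file(self, path):    # snake_case method, 4-space indentation
        filter_ = os.path.basename(path)   # trailing _ avoids shadowing filter()
        return filter_

ratio = MAX_RETRIES / len(sys.argv)        # spaces around operators
loader = DataLoader()
print(ratio)
print(loader.load_file(path="demo.txt"))   # no spaces around = in keyword args
```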
Practical steps
1. Dev with Jupyter, note issues: the interactive notebook is very handy during development
- execution order matters
- memory consuming
- see my LSST talk about Jupyter as a Research Tool for an example.
2. Aggregate into a python script: modularize code into functions
More to read and adopt
- Python Community Code of Conduct
- Test-driven design
  - modular code
  - specific and fine-grained tests
  - regression tests for bugs
3. Checkpoint scripts with git: git log the progress
- GitHub git cheat sheet: some basic operations
- Create a private git repository on any SSH server with six lines
# on server
mkdir project.git
cd project.git
git --bare init
# client side
git init
git remote add mygitserver ssh://git@remote-host[:port]/path/to/project.git
git push mygitserver master
Packaging: setuptools
When the code is ready to share,
- create a parent directory
- create a setup script
- create MANIFEST.in for the list of files, e.g.,
  - include README
  - include LICENSE
  - include setup.py
  - recursive-include folders scripts
python setup.py sdist
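A minimal setup.py sketch for the layout above (the package name and metadata are hypothetical):

```python
# setup.py: a minimal sketch, assuming a package directory named mypackage/
from setuptools import setup, find_packages

setup(
    name="mypackage",            # hypothetical package name
    version="0.1.0",             # x.y.z versioning, see Documenting below
    description="A short description of the project",
    packages=find_packages(),    # pick up mypackage and its subpackages
    install_requires=open("requirements.txt").read().splitlines(),
)
```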
1. Modularize functions: if not done earlier
class CamelCase:
    def __init__(self, value):  # initialize attributes
        self.value = value

    def __repr__(self):  # printable representation
        return "CamelCase({!r})".format(self.value)

if __name__ == '__main__':
    print(CamelCase(42))  # body of program
2. Unittest: remember the issues we noted down during development? These are good cases to write tests about. A more proactive concept is test-driven programming.
├── __init__.py
├── code.py
├── func_a.py
├── func_b.py
├── func_c.py
└── tests
├── __init__.py
├── test_funcs.py
└── test_something.py
testing framework
- pytest
  - install: conda install pytest pytest-cov
  - compose tests under tests/
  - run pytest (see the example test below)
- nose
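A minimal test sketch matching the tree above (assuming a hypothetical add() function in func_a.py; adjust the import to your layout):

```python
# tests/test_funcs.py
import pytest
from func_a import add

def test_add():
    # regression-style check of the expected value
    assert add(1, 2) == 3

def test_add_type_error():
    # mixing incompatible types should raise
    with pytest.raises(TypeError):
        add("1", None)
```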
3. Continuous Integration: use continuous integration to automatically run tests when something changes in the repository.
- Travis CI: requires a .travis.yml config file
- CircleCI
- AppVeyor
# example .travis.yml file
language: python
python:
- "3.6"
# command to install dependencies
install: "pip install -r requirements.txt"
# command to run tests
script: pytest
4. Profiling & Optimization
Premature optimization is the root of all evil. -- Donald Knuth
- computation amount
- memory usage
- input/output
- storage
Tips from Cameron Hummels
- Decide what you are optimizing over
- Computer time versus person time
- Write readable code first, then optimize
- Use profilers to identify bottlenecks
- Address bottlenecks one at a time
- Latest Python is most optimized
- Try new approaches and profile/test it
- Object oriented code
- NumPy arrays are optimized
- Vectorize loops when possible
- List comprehensions
- Avoid building lists through appends
- In place operations as opposed to rebuilding
- Cython
- Numba
time                     # coarse total time
%%timeit                 # time code snippets in Jupyter
cProfile                 # deep profiling, with viz tools:
pstats                   #   text-based
snakeviz                 #   browser-based viz
runsnakerun              #   pipeline tool
pyprof2html              #   html tool
line_profiler/kernprof   # line-by-line function profiling
memory_profiler          # memory consumption and memory leaks
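For instance, a small sketch using the standard library's timeit and cProfile to compare a pure-Python loop against a vectorized NumPy version (the array and sizes are arbitrary):

```python
import cProfile
import timeit
import numpy as np

data = np.random.rand(100_000)

def loop_sum(values):
    total = 0.0
    for v in values:          # pure-Python loop: a likely bottleneck
        total += v
    return total

# coarse timing of both approaches
print(timeit.timeit(lambda: loop_sum(data), number=10))
print(timeit.timeit(lambda: data.sum(), number=10))   # vectorized NumPy

# deeper profiling; inspect the output with pstats or snakeviz
cProfile.run("loop_sum(data)", sort="cumulative")
```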
Parallel computing
- multithreaded
- multiprocessing
- MPI and mpi4py
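A minimal multiprocessing sketch (the worker function and pool size are arbitrary):

```python
from multiprocessing import Pool

def square(x):                      # a hypothetical CPU-bound worker
    return x * x

if __name__ == '__main__':          # guard required on some platforms
    with Pool(processes=4) as pool:
        results = pool.map(square, range(10))
    print(results)
```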
5. Documenting: essential for future revisits or further development.
versioning: x.y.z (e.g., 0.2.3, 2.7.12, 3.6)
- change x for breaking changes
- change y for non-breaking changes
- change z for bug-fixes
docstring and comment tips
- document while coding
- document the interfaces of modules
- use descriptive names
- consistent style/format
- docstrings (see the sketch below)
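A short sketch of a consistently formatted docstring (NumPy style, as one common convention; the function is hypothetical):

```python
import numpy as np

def rebin(data, factor):
    """Rebin a 1-D array by an integer factor.

    Parameters
    ----------
    data : numpy.ndarray
        Input array; its length must be divisible by `factor`.
    factor : int
        Number of neighboring elements averaged into each output bin.

    Returns
    -------
    numpy.ndarray
        The rebinned array, ``len(data) // factor`` elements long.
    """
    return data.reshape(-1, factor).mean(axis=1)

print(rebin(np.arange(6.0), 2))   # [0.5 2.5 4.5]
```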
Publishing
Presentation
Following my WorkFlow, most of the work has already been done at this stage. The rest can be carried out with minimal effort and a decent finish.
Example finishes: use ? for keyboard shortcuts to control the slides.
- Jupyter notebook to html report: my Ph.D. project
- Jupyter notebook to html slides: my talk slides on LSST 2018 workshop
- MindNode to markdown to html slides: my draft slides of this document
1. MindNode -> Markdown: re-arrange and convert the outline mind map to markdown.
Warning: the Jupyter notebook should be re-organized for presentation, especially for a dissertation defense! The order of work is not necessarily the order of the talk! Check my LSST talk for some tips.
2. Markdown -> HTML: extend the details in markdown and convert to html.
Pandoc: a powerful tool for conversion.
- install: brew install pandoc
- use % for front-page info
- usage demo:
pandoc -s --mathjax -i -t revealjs --slide-level=2 WorkFlow.md -o WorkFlow.html
3. Slideshow: HTML + reveal.js
- add -V revealjs-url=http://lab.hakim.se/reveal-js when converting with pandoc,
- or download reveal.js to the same directory as the converted HTML file
- now open the HTML file to start the slideshow; use ? for keyboard shortcuts to control the slides.
Workshop
To take the research/project to a workshop, we need to recall what we've done.
1. Config env.: Dockerfile
With everything done, it is now easy to put all the ingredients and recipe together into a Dockerfile.
# Example Dockerfile for python
cat > Dockerfile <<EOF
FROM python:3
WORKDIR /usr/src/app
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD [ "python", "./your-daemon-or-script.py" ]
EOF
# build
docker build -t pydev .
docker run -it --rm pydev
2. Demo: Markdown + Scripts -> Jupyter notebook
Following the mind map and the resulting markdown file, we can put the outline structure into a Jupyter notebook, since it natively supports markdown-formatted cells. We can fill in the function calls and visualization code in between.
My secret to toggling code cells
- nbextension codefolding: $ jupyter nbextension enable codefolding/main
  - more extensions: spellchecker, Table of Contents, Autoscroll, ...
- nbconvert with --template template.tpl
  - hide/remove code in one step with templates
  - more functionality available
Another wheel to edit slide styles on the fly
- RISE: good for dev./workshop
- demo: interactive live rendering, based on reveal.js
- reveal.js: good for prod./presentation
- post-process with nbconvert
3. Slideshow: nbconvert + reveal.js
Wrap-up commands to convert the notebook into a slideshow
jupyter nbconvert notebook.ipynb --to slides --reveal-prefix reveal.js [--template hidecode/rmcode.tpl]
wget -qO- https://github.com/hakimel/reveal.js/archive/3.6.0.tar.gz | tar -xvz
mv reveal.js-3.6.0 reveal.js   # match the --reveal-prefix above
open notebook.slides.html
Repository: GitHub
When it is ready to take the project to the public, there are a few very handy wheels to make it more appealing.
Live slideshow: add some markup to the URL of the HTML file in the repository to render the slideshow live (not always working).
- reveal HTML files: go to http://htmlpreview.github.io/? + git_html_url [+ ?print-pdf]
Live notebook demo: binder
Everything is ready; just paste the repository link into mybinder.org
- configure: add requirements.txt or Dockerfile
- launch on binder: recall the RISE demo
- static option: open http://nbviewer.jupyter.org/ and paste the GitHub URL of the ipynb file
Project webpage: github.io + HTML
github.io: use any of the converted HTML files to set it up in 3 steps
- create a repository named username.github.io
- add the HTML file
- go to https://username.github.io
Documentation: Read the Docs
This step depends on how often and how well the project is documented. If the earlier guidance is followed, there is no pain at all.
- sphinx with reST:
sphinx-quickstart -a "Name" -p Repo -v 0.1 --ext-autodoc -q
- doxygen
- readthedocs: link github & auto build
Building test: Travis CI
Follow the manual/documentation!
tricks
# load a large dataset in chunks to bound memory usage
import pandas as pd
sources = pd.read_csv("sources.csv", chunksize=100000)  # hypothetical filename
for i, chunk in enumerate(sources):
    ...  # process each chunk
Comments!