The following map summarizes my general WorkFlow for data analysis based on Python, a popular, community-driven, and user-friendly language.
I hope the WorkFlow and Methodology below can also serve as a reference for other languages and projects where Python is useful as a prototyping/demonstration tool.
- This is a work in progress and will be aggregated into a talk about Open Research. My idea of open research adopts the computational thinking of the open source community.
Some of the basic concepts include:
- continuous integration: test-driven and object-oriented for easier maintenance.
- well documented: including reports/web pages/presentations for better understanding.
- reproducibility: a containerized development environment, archived data, and a git-logged code repository should be accessible for replication.
- community driven: publicly ready for peer review and general feedback.
Use mind map to conveniently remark and organize the outline of a project.
Content type in MindNode
- smart link with
Extend: export to Markdown
Export the mind map to markdown document to extend the details on each topic.
Markdown Basics: the key formatting syntax. Markdown is also compatible with HTML markup most of the time.

```html
<!-- image with style -->
[<img src="img_url" style="float: right" width="100px">](img_url)

<!-- cross reference -->
TOC - [title](#tag)
<a name='tag'>title</a>

<!-- footnote -->
<sup>[1](#myfootnote1)</sup>
<a name="myfootnote1">1</a>: Footnote content goes here
```
Containerize: balancing system isolation and performance, like a sandbox for micro-services.
- Images - The blueprints of our application, which form the basis of containers. In the demo above, we used the `docker pull` command to download the busybox image.
- Containers - Created from Docker images, containers run the actual application. We created a container using `docker run` with the busybox image we had downloaded. A list of running containers can be seen using `docker ps`.
- Docker Daemon - The background service running on the host that manages building, running, and distributing Docker containers. The daemon is the process that runs in the operating system and that clients talk to.
- Docker Client - The command-line tool that allows the user to interact with the daemon. More generally, there can be other forms of clients too, such as Kitematic, which provides a GUI to the users.
- Docker Hub - A registry of Docker images. You can think of the registry as a directory of all available Docker images. If required, one can host their own Docker registry and use it for pulling images.
1. Start from base image
```shell
docker pull image_name      # pull a public image/repository from a registry
docker build [--no-cache] -t image_name path/to/Dockerfile [-f renamed-dockerfile]
docker run -it              # interactive
    --rm                    # remove the container on exit
    -d                      # run as detached
    -p 8888:8888            # forward port to host
    -e DISPLAY=$DISPLAY     # set an environment variable
    -u user                 # username/uid in the image
    -v path/to/local:path/to/container  # mount a directory
    image_name [command]
docker port container_id    # show the open ports of a container instance
docker start/attach/stop/rm container_id  # manage a container instance
docker rmi image_id         # remove an image
```
- public registry: be cautious about potential security risks in 3rd-party images/repositories.
2. Record needed ingredients in requirement.txt
While developing, record the additional Python packages in a text file named `requirement.txt`, which is useful both for constructing the Dockerfile that automatically configures the development environment and for hosting an interactive Jupyter notebook with mybinder.

```
# Example requirement.txt
SomeProject
SomeProject == 1.3
SomeProject >= 1.2, < 2.0
SomeProject[foo, bar]
SomeProject ~= 1.4.2
```
- Best practices: use minimum system with necessary packages
- Base app samples: from IDE to OS, official docker images to start with
- Docker Hub: Community hub, use with caution
- start with a base image that meets the need of the project
- customize with additional dependencies using `RUN pip install -r requirement.txt`
- add special system setup and port configuration as needed
- Object-oriented programming
- function -> method
- variable -> attribute
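As a minimal sketch of this mapping (the function and class names here are hypothetical):

```python
# A module-level function and variable become a method and an
# attribute once wrapped in a class.
def normalize(values):              # plain function
    total = sum(values)
    return [v / total for v in values]

class Dataset:
    def __init__(self, values):
        self.values = values        # variable -> attribute

    def normalize(self):            # function -> method
        total = sum(self.values)
        return [v / total for v in self.values]
```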
- Computational thinking: see the full image here
A few suggestions
- think about goals of code
- break into reasonable classes
- pseudocode it up
- check for efficient algorithm
- test each function and class
- assemble code in main
- optimize if necessary
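The steps above can be sketched end to end on a toy goal, counting words (all names here are hypothetical):

```python
# Goal -> small function -> test -> assemble in main.
def word_counts(text):
    """Count how often each word appears in `text`."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def test_word_counts():
    assert word_counts("a b a") == {"a": 2, "b": 1}

if __name__ == "__main__":
    test_word_counts()                        # test each function
    print(word_counts("to be or not to be")) # assemble code in main
```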
Code Styling: PEP 8
- indentation: 4 spaces
- snake_case: packages, modules, functions, variables, attributes
- CamelCase: classes
- add a trailing `_` when a name would shadow a built-in
- spaces around operators, except in tightly bound expressions like `c = a/b`
- separate imports, one per line
- inline comments: at least 2 spaces before `#` and 1 space after; use docstrings
- break long lines
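A short snippet illustrating several of these conventions at once (the class and names are made up for illustration):

```python
# Hypothetical snippet showing PEP 8 naming and spacing.
MAX_RETRIES = 3                  # module-level constant: ALL_CAPS

class HttpClient:                # class: CamelCase
    def fetch_page(self, url):   # method: snake_case
        retry_count = 0          # variable: snake_case
        list_ = [url]            # trailing _ avoids shadowing a built-in
        result = retry_count + MAX_RETRIES  # spaces around operators
        ratio = result/2         # except tightly bound arithmetic
        return ratio  # inline comment: two spaces before the #
```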
1. Dev with Jupyter, note issues: an interactive notebook is very handy during development
- execution order matters
- memory consumption can be heavy
See my LSST talk about Jupyter as a Research Tool for example.
2. Aggregate into a python script: modularize code into functions
More to read and adopt
- Python Community Code of Conduct
- modular code
- specific and fine-grained test
- regression test for bugs
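A sketch of what a fine-grained regression test might look like (the `mean` function and its past bug are invented for illustration; run with `pytest`):

```python
# Once a bug is fixed, pin it down with a test so it cannot
# silently return.
def mean(values):
    # fixed bug: an empty list used to raise ZeroDivisionError
    if not values:
        return 0.0
    return sum(values) / len(values)

def test_mean_regression_empty_list():
    assert mean([]) == 0.0       # the exact case that used to crash

def test_mean_typical():
    assert mean([1, 2, 3]) == 2.0
```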
3. Checkpoint scripts with git: git log the progress
Github git cheat sheet: some basic operations
Create a private git repository on any ssh server with 6 lines

```shell
# on the server
mkdir project.git
cd project.git
git --bare init

# on the client
git init
git remote add mygitserver ssh://git@remote-host[:port]/path/to/project.git
git push mygitserver master
```
When the code is ready to share,
- create a parent directory
- create setup script
- create MANIFEST.in for list of files
- include README
- include LICENSE
- include setup.py
- recursive-include folders scripts
python setup.py sdist
1. Modularize functions: if not done earlier

```python
class CamelCase:
    def __init__(self):
        # initialize attributes
        self.attr = ...

    def __repr__(self):
        # printable representation
        return "..."

if __name__ == '__main__':
    # body of the program
    ...
```
2. Unittest: remember the issues we noted down during development? Those are good cases to write tests for. A more proactive concept is test-driven development.
```
├── __init__.py
├── code.py
├── func_a.py
├── func_b.py
├── func_c.py
└── tests
    ├── __init__.py
    ├── test_funcs.py
    └── test_something.py
```
- conda install
- compose tests under
- conda install
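A possible `tests/test_funcs.py` for the layout above, sketched with the standard-library `unittest` module (`func_a` is inlined here so the sketch is self-contained; in the real package it would be imported):

```python
import unittest

def func_a(x):                   # inlined stand-in; normally imported
    return x * 2

class TestFuncs(unittest.TestCase):
    def test_func_a_doubles(self):
        self.assertEqual(func_a(3), 6)

if __name__ == "__main__":
    # exit=False keeps the interpreter alive after the test run
    unittest.main(argv=["test"], exit=False)
```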
3. Continuous Integration: use continuous integration to automatically run tests whenever something changes in the repository.
- Travis: requires a `.travis.yml` file for configuration

```yaml
# example .travis.yml file
language: python
python:
  - "3.6"
# command to install dependencies
install: "pip install -r requirement.txt"
# command to run tests
script: pytest
```
4. Profiling & Optimization
Premature optimization is the root of all evil. -- Donald Knuth
- computation amount
- memory usage
Tips from Cameron Hummels
- Decide what you are optimizing over
- Computer time versus person time
- Write readable code first, then optimize
- Use profilers to identify bottlenecks
- Address bottlenecks one at a time
- Latest Python is most optimized
- Try new approaches and profile/test them
- Object oriented code
- NumPy arrays are optimized
- Vectorize loops when possible
- List comprehensions
- Avoid building lists through appends
- In place operations as opposed to rebuilding
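Two of these tips in a minimal sketch: replacing an append loop with a list comprehension, and vectorizing with a NumPy array (the array size is arbitrary):

```python
import numpy as np

values = list(range(1_000))

# building a list through appends (slower)
squares = []
for v in values:
    squares.append(v * v)

# list comprehension (clearer and usually faster)
squares = [v * v for v in values]

# vectorized with a NumPy array (fastest for large inputs)
arr = np.arange(1_000)
squares_np = arr * arr
```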
```
time                    # coarse total time
%%timeit                # code snippets
cProfile                # deep profiling with viz tools
pstats                  # text-based
snakeviz, runsnakerun   # pipeline tools
pyprof2html             # html tool
line_profiler/kernprof  # line-by-line function profiling
memory_profiler         # memory consumption and memory leaks
```
- MPI and mpi4py
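Minimal usage sketches for two of the standard-library tools above, `timeit` and `cProfile` (the `slow_sum` function is a made-up example to profile; `%%timeit` is IPython-only, so the `timeit` module is used here):

```python
import cProfile
import timeit

def slow_sum(n):                 # made-up function to profile
    total = 0
    for i in range(n):
        total += i
    return total

# coarse timing of a snippet
elapsed = timeit.timeit("slow_sum(10_000)", globals=globals(), number=100)

# deep profile of one call, printed as pstats-style text
profiler = cProfile.Profile()
profiler.runcall(slow_sum, 10_000)
profiler.print_stats()
```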
5. Documenting: essential for future revisits or further development.

versioning: x.y.z (e.g., 0.2.3, 2.7.12, 3.6)
- change x for breaking changes
- change y for non-breaking changes
- change z for bug fixes
docstring and comments tips
- document while coding
- document the interfaces of modules
- use descriptive names
- consistent style/format
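For instance, a module interface documented while coding might look like this (the `rebin` function and the NumPy-style docstring format are illustrative choices):

```python
def rebin(counts, factor):
    """Sum `counts` in consecutive groups of `factor`.

    Parameters
    ----------
    counts : sequence of numbers
    factor : int

    Returns
    -------
    list of numbers
    """
    return [sum(counts[i:i + factor])
            for i in range(0, len(counts), factor)]
```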
Following my WorkFlow, most of the work has been done by this stage. The rest can be carried out with minimal effort and a decent finish.
Example finishes (use `?` for the keyboard shortcuts that control the slides):
- Jupyter notebook to html report: my Ph.D. project
- Jupyter notebook to html slides: my talk slides on LSST 2018 workshop
- MindNode to markdown to html slides: my draft slides of this document
1. MindNode -> Markdown: re-arrange and convert the outline mind map to markdown.
Warning: a Jupyter notebook should be re-organized for presentation, especially for a dissertation defense! The order of the work is not necessarily the order of the talk! Check my LSST talk for some tips.
2. Markdown -> HTML: extend the details in markdown and convert to html.
Pandoc: powerful tool for conversion.
brew install pandoc
- `%` lines for front-page info
- usage demo:

```shell
pandoc -s --mathjax -i -t revealjs --slide-level=2 WorkFlow.md -o WorkFlow.html
```
3. Slideshow: HTML + reveal.js
- add `-V revealjs-url=http://lab.hakim.se/reveal-js` when using pandoc to convert,
or download reveal.js to the same directory as the converted file.
Now it's ready: open the html file to start the slideshow, and use `?` for the keyboard shortcuts that control the slides.
To take the research/project to a workshop, we need to recall what we've done.
1. Config env.: Dockerfile
With everything done, it is now easy to put all the ingredients and the recipe together into a Dockerfile.
```shell
# Example Dockerfile for python
cat > Dockerfile <<EOF
FROM python:3
WORKDIR /usr/src/app
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD [ "python", "./your-daemon-or-script.py" ]
EOF

# build and run
docker build -t pydev .
docker run -it --rm pydev
```
2. Demo: Markdown + Scripts -> Jupyter notebook
Following the mind map and the resulting markdown file, we can put the outline structure into a Jupyter notebook, since it natively supports markdown-formatted cells. We can fill in the function calls and visualization code in between.
My secret for toggling code cells
$ jupyter nbextension enable codefolding/main
- more extensions: spellchecker, Table of Contents, Autoscroll, ...
- hide/remove code in one step templates
- more functionality available
Another wheel to edit slide styles on the fly
- RISE: good for dev./workshop
- demo: interactive live rendering, based on reveal.js
- reveal.js: good for prod./presentation
- post-process with nbconvert
3. Slideshow: nbconvert + reveal.js
Wrap-up commands to convert the notebook into a slideshow:

```shell
jupyter nbconvert notebook.ipynb --to slides --reveal-prefix reveal.js [--template hidecode/rmcode.tpl]
wget -qO- https://github.com/hakimel/reveal.js/archive/3.6.0.tar.gz | tar -xvzf -
open notebook.slides.html
```
When it is ready to take the project to the public, there are a few wheels very handy to make it more appealing.
Live slideshow: add some markup to the URL of the html file in the repository to render the slideshow live; this does not always work.
reveal html files: go to
Live notebook demo: binder
Once everything is ready, just paste the repository link into mybinder.org
- configure: add requirement.txt or Dockerfile
launch on binder: recall the RISE demo
static option: open http://nbviewer.jupyter.org/ and paste the github url of the ipynb file
Project webpage: github.io + HTML
github.io: use any of the converted html files to set it up in 3 steps
- create a repository with
- add HTML file
- go to
Documentation: Read the Docs
This step depends on how often and how well the project is documented. If the earlier guide was followed, there is no pain at all.
- sphinx with reST:
sphinx-quickstart -a "Name" -p Repo -v 0.1 --ext-autodoc -q
- readthedocs: link github & auto build
Building test: Travis CI
Follow the manual/documentation!
```python
import pandas as pd

# load the dataset in chunks based on available memory
# ("sources.csv" is a placeholder filename)
sources = pd.read_csv("sources.csv", chunksize=100000)
for i, chunk in enumerate(sources):
    ...
```