Overview
- The following map summarizes my general WorkFlow for data analysis based on Python, a popular community-driven and user-friendly language.
- I hope the WorkFlow and methodology below can also serve as a reference for other languages/projects, where Python is found useful as a prototyping/demonstration tool.
- This is a work in progress, and will be aggregated into a talk about Open Research. My idea of open research is adopted from the computational thinking of the open source community.
Some of the basic concepts include,
- continuous integration: test-driven and object-oriented for easier maintenance.
- well documented: including report/web-page/presentation for better understanding.
- reproducibility: the containerized development environment, archived data, and git-logged code repository should be accessible for replication.
- community driven: publicly ready for peer review and general feedback.
Drafting
Outline: MindNode
Use a mind map to conveniently note down and organize the outline of a project.
Content types in MindNode
- image
- smart link with http://
- note
- task
Extend: export to Markdown
Export the mind map to a markdown document to extend the details on each topic.
Markdown Basics: the key formatting syntax. Markdown is also compatible with HTML markup most of the time.
<!-- image with style -->
[<img src="img_url" style="float: right" width="100px">](img_url)
<!-- cross reference -->
TOC
- [title](#tag)
<a name="tag">title</a>
<!-- footnote -->
<sup>[1](#myfootnote1)</sup>
<a name="myfootnote1">1</a>: Footnote content goes here
Developing
Environment: Docker
Containerize: balance between system isolation and performance, like a sandbox for micro-services.
Basic concepts
- Images - The blueprints of our application which form the basis of containers. In the demo above, we used the docker pull command to download the busybox image.
- Containers - Created from Docker images and run the actual application. We create a container using docker run, which we did using the busybox image that we downloaded. A list of running containers can be seen using the docker ps command.
- Docker Daemon - The background service running on the host that manages building, running and distributing Docker containers. The daemon is the process that runs in the operating system to which clients talk.
- Docker Client - The command line tool that allows the user to interact with the daemon. More generally, there can be other forms of clients too - such as Kitematic, which provides a GUI to the users.
- Docker Hub - A registry of Docker images. You can think of the registry as a directory of all available Docker images. If required, one can host their own Docker registries and use them for pulling images.
1. Start from a base image
Basic commands
docker pull image_name                    # pull public image/repository from a registry
docker build [--no-cache] -t image_name path/to/context [-f path/to/Dockerfile]
docker run -it                            # interactive
    --rm                                  # remove container on exit
    -d                                    # run detached
    -p 8888:8888                          # forward port to host
    -e DISPLAY=$DISPLAY                   # set environment variable
    -u user                               # username/uid in image
    -v path/to/local:path/to/container    # mount directory
    image_name
    [command]
docker port container_id                  # show the open ports of a container instance
docker start/attach/stop/rm container_id  # manage a container instance
docker rmi image_id                       # remove an image
- public registry: be cautious about potential security risks in third-party images/repositories.
2. Record needed ingredients in requirements.txt
While developing, record the additional python packages in a text file named requirements.txt, which will be useful to construct the Dockerfile to automatically configure the development environment, as well as to host an interactive Jupyter notebook with mybinder.
- file format and specifiers
pip freeze > requirements.txt   # list installed packages
# Example requirements.txt
SomeProject
SomeProject == 1.3
SomeProject >= 1.2, < 2.0
SomeProject[foo, bar]
SomeProject ~= 1.4.2
3. Compose the building recipe into a Dockerfile
Some resources
- Best practices: use minimum system with necessary packages
- Base app samples: from IDE to OS, official docker images to start with
- Docker Hub: Community hub, use with caution
Steps
- start with a base image that meets the needs of the project
- customize with additional dependencies using RUN pip install -r requirements.txt
- add special system setup and port configuration as needed
Prototyping: Python
Some principles
- Object-oriented programming (see the sketch after this list)
  - function -> method
  - variable -> attribute
- Computational thinking: see the full image here
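As a minimal sketch of the function -> method and variable -> attribute mapping above (the Circle class and its names are hypothetical):

```python
import math

# a free function with a plain variable ...
def area(radius):
    return math.pi * radius ** 2

# ... becomes a method on a class, with the variable as an attribute
class Circle:
    def __init__(self, radius):
        self.radius = radius      # variable -> attribute

    def area(self):               # function -> method
        return math.pi * self.radius ** 2

print(area(2.0))                  # 12.566...
print(Circle(2.0).area())         # same result, object-oriented style
```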
A few suggestions
- think about goals of code
- break into reasonable classes
- pseudocode it up
- check for efficient algorithm
- test each function and class
- assemble code in main
- run
- optimize if necessary
Code Styling: PEP8
- indentation: 4 spaces
- snake_case: packages, modules, functions
- CamelCase: classes
- lowercase (all together): variables; append _ to avoid shadowing built-in names (e.g., list_)
- spaces around operators, except in keyword arguments like func(a=1, b=3) or simple expressions like c = a/b
- separate imports on individual lines
- inline comments: at least 2 spaces before #, 1 space after
- docstrings; break long lines with \ (see the sketch below)
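A minimal sketch illustrating several of these conventions (the module, names, and file path are hypothetical):

```python
import os
import sys                        # separate imports on individual lines

MAX_RETRIES = 3                   # module-level constant

class DataLoader:                 # CamelCase class
    def load_file(self, path):    # snake_case method, 4-space indentation
        filter_ = os.path.basename(path)   # trailing _ avoids shadowing filter()
        return filter_

ratio = MAX_RETRIES / len(sys.argv)        # spaces around operators
loader = DataLoader()
print(ratio)
print(loader.load_file(path="demo.txt"))   # no spaces around = in keyword args
```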
Practical steps
1. Dev with Jupyter, note issues: the interactive notebook is very handy during development
- execution order matters
- memory consuming
- see my LSST talk about Jupyter as a Research Tool for an example.
2. Aggregate into a python script: modularize code into functions
More to read and adopt
- Python Community Code of Conduct
- Test-driven design
  - modular code
  - specific and fine-grained tests
  - regression tests for bugs
3. Checkpoint scripts with git: git log the progress
- GitHub git cheat sheet: some basic operations
- Create a private git repository on any SSH server with six lines
# on server
mkdir project.git
cd project.git
git --bare init
# client side
git init
git remote add mygitserver ssh://git@remote-host[:port]/path/to/project.git
git push mygitserver master
Packaging: setuptools
When the code is ready to share,
- create a parent directory
- create a setup script
- create MANIFEST.in for the list of files, e.g.,
  - include README
  - include LICENSE
  - include setup.py
  - recursive-include folders scripts
python setup.py sdist
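A minimal setup.py sketch for the layout above (the package name and metadata are hypothetical):

```python
# setup.py: a minimal sketch, assuming a package directory named mypackage/
from setuptools import setup, find_packages

setup(
    name="mypackage",            # hypothetical package name
    version="0.1.0",             # x.y.z versioning, see Documenting below
    description="A short description of the project",
    packages=find_packages(),    # pick up mypackage and its subpackages
    install_requires=open("requirements.txt").read().splitlines(),
)
```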
1. Modularize functions: if not done earlier
class CamelCase:
    def __init__(self, value):  # initialize attributes
        self.value = value

    def __repr__(self):  # printable representation
        return "CamelCase({!r})".format(self.value)

if __name__ == '__main__':
    print(CamelCase(42))  # body of program
2. Unittest: remember the issues we noted down during development? These are good cases to write tests about. A more proactive concept is test-driven programming.
├── __init__.py
├── code.py
├── func_a.py
├── func_b.py
├── func_c.py
└── tests
├── __init__.py
├── test_funcs.py
└── test_something.py
testing framework
- pytest
  - install: conda install pytest pytest-cov
  - compose tests under tests/
  - run pytest (see the example test below)
- nose
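A minimal test sketch matching the tree above (assuming a hypothetical add() function in func_a.py; adjust the import to your layout):

```python
# tests/test_funcs.py
import pytest
from func_a import add

def test_add():
    # regression-style check of the expected value
    assert add(1, 2) == 3

def test_add_type_error():
    # mixing incompatible types should raise
    with pytest.raises(TypeError):
        add("1", None)
```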
3. Continuous Integration: use continuous integration to automatically run tests when something changes in the repository.
- Travis CI: requires a .travis.yml config file
- CircleCI
- AppVeyor
# example .travis.yml file
language: python
python:
- "3.6"
# command to install dependencies
install: "pip install -r requirements.txt"
# command to run tests
script: pytest
4. Profiling & Optimization
Premature optimization is the root of all evil. -- Donald Knuth
- computation amount
- memory usage
- input/output
- storage
Tips from Cameron Hummels
- Decide what you are optimizing over
- Computer time versus person time
- Write readable code first, then optimize
- Use profilers to identify bottlenecks
- Address bottlenecks one at a time
- Latest Python is most optimized
- Try new approaches and profile/test it
- Object oriented code
- NumPy arrays are optimized
- Vectorize loops when possible
- List comprehensions
- Avoid building lists through appends
- In place operations as opposed to rebuilding
- Cython
- Numba
time                     # coarse total time
%%timeit                 # time code snippets in Jupyter
cProfile                 # deep profiling, with viz tools:
pstats                   #   text-based
snakeviz                 #   browser-based viz
runsnakerun              #   pipeline tool
pyprof2html              #   html tool
line_profiler/kernprof   # line-by-line function profiling
memory_profiler          # memory consumption and memory leaks
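For instance, a small sketch using the standard library's timeit and cProfile to compare a pure-Python loop against a vectorized NumPy version (the array and sizes are arbitrary):

```python
import cProfile
import timeit
import numpy as np

data = np.random.rand(100_000)

def loop_sum(values):
    total = 0.0
    for v in values:          # pure-Python loop: a likely bottleneck
        total += v
    return total

# coarse timing of both approaches
print(timeit.timeit(lambda: loop_sum(data), number=10))
print(timeit.timeit(lambda: data.sum(), number=10))   # vectorized NumPy

# deeper profiling; inspect the output with pstats or snakeviz
cProfile.run("loop_sum(data)", sort="cumulative")
```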
Parallel computing
- multithreaded
- multiprocessing
- MPI and mpi4py
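A minimal multiprocessing sketch (the worker function and pool size are arbitrary):

```python
from multiprocessing import Pool

def square(x):                      # a hypothetical CPU-bound worker
    return x * x

if __name__ == '__main__':          # guard required on some platforms
    with Pool(processes=4) as pool:
        results = pool.map(square, range(10))
    print(results)
```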
5. Documenting: essential for future revisits or further development.
versioning: x.y.z (e.g., 0.2.3, 2.7.12, 3.6)
- change x for breaking changes
- change y for non-breaking changes
- change z for bug-fixes
docstring and comment tips
- document while coding
- document the interfaces of modules
- use descriptive names
- consistent style/format
- docstrings (see the sketch below)
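A short sketch of a consistently formatted docstring (NumPy style, as one common convention; the function is hypothetical):

```python
import numpy as np

def rebin(data, factor):
    """Rebin a 1-D array by an integer factor.

    Parameters
    ----------
    data : numpy.ndarray
        Input array; its length must be divisible by `factor`.
    factor : int
        Number of neighboring elements averaged into each output bin.

    Returns
    -------
    numpy.ndarray
        The rebinned array, ``len(data) // factor`` elements long.
    """
    return data.reshape(-1, factor).mean(axis=1)

print(rebin(np.arange(6.0), 2))   # [0.5 2.5 4.5]
```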
Publishing
Presentation
Following my WorkFlow, most of the work has already been done at this stage. The rest can be carried out with minimal effort and a decent finish.
Example finishes: use ? for keyboard shortcuts to control the slides.
- Jupyter notebook to html report: my Ph.D. project
- Jupyter notebook to html slides: my talk slides on LSST 2018 workshop
- MindNode to markdown to html slides: my draft slides of this document
1. MindNode -> Markdown: re-arrange and convert the outline mind map to markdown.
Warning: the Jupyter notebook should be re-organized for presentation, especially for a dissertation defense! The order of work is not necessarily the order of the talk! Check my LSST talk for some tips.
2. Markdown -> HTML: extend the details in markdown and convert to html.
Pandoc: a powerful tool for conversion.
- install: brew install pandoc
- use % for front-page info
- usage demo:
pandoc -s --mathjax -i -t revealjs --slide-level=2 WorkFlow.md -o WorkFlow.html
3. Slideshow: HTML + reveal.js
- add -V revealjs-url=http://lab.hakim.se/reveal-js when converting with pandoc,
- or download reveal.js to the same directory as the converted HTML file
- now open the HTML file to start the slideshow; use ? for keyboard shortcuts to control the slides.
Workshop
To take the research/project to a workshop, we need to recall what we've done.
1. Config env.: Dockerfile
With everything done, it is now easy to put all the ingredients and recipe together into a Dockerfile.
# Example Dockerfile for python
cat > Dockerfile <<EOF
FROM python:3
WORKDIR /usr/src/app
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD [ "python", "./your-daemon-or-script.py" ]
EOF
# build
docker build -t pydev .
docker run -it --rm pydev
2. Demo: Markdown + Scripts -> Jupyter notebook
Following the mind map and the resulting markdown file, we can put the outline structure into a Jupyter notebook, since it natively supports markdown-formatted cells. We can fill in the function calls and visualization code in between.
My secret to toggling code cells
- nbextension codefolding: $ jupyter nbextension enable codefolding/main
  - more extensions: spellchecker, Table of Contents, Autoscroll, ...
- nbconvert with --template template.tpl
  - hide/remove code in one step with templates
  - more functionality available
Another wheel to edit slide styles on the fly
- RISE: good for dev./workshop
- demo: interactive live rendering, based on reveal.js
- reveal.js: good for prod./presentation
- post-process with nbconvert
3. Slideshow: nbconvert + reveal.js
Wrap-up commands to convert the notebook into a slideshow
jupyter nbconvert notebook.ipynb --to slides --reveal-prefix reveal.js [--template hidecode/rmcode.tpl]
wget -qO- https://github.com/hakimel/reveal.js/archive/3.6.0.tar.gz | tar -xvz
mv reveal.js-3.6.0 reveal.js   # match the --reveal-prefix above
open notebook.slides.html
Repository: GitHub
When it is ready to take the project to the public, there are a few very handy wheels to make it more appealing.
Live slideshow: add some markup to the URL of the HTML file in the repository to render the slideshow live (not always working).
- reveal HTML files: go to http://htmlpreview.github.io/? + git_html_url [+ ?print-pdf]
Live notebook demo: binder
Everything is ready; just paste the repository link into mybinder.org
- configure: add requirements.txt or Dockerfile
- launch on binder: recall the RISE demo
- static option: open http://nbviewer.jupyter.org/ and paste the GitHub URL of the ipynb file
Project webpage: github.io + HTML
github.io: use any of the converted HTML files to set it up in 3 steps
- create a repository named username.github.io
- add the HTML file
- go to https://username.github.io
Documentation: Read the Docs
This step depends on how often and how well the project is documented. If the earlier guidance is followed, there is no pain at all.
- sphinx with reST:
sphinx-quickstart -a "Name" -p Repo -v 0.1 --ext-autodoc -q
- doxygen
- readthedocs: link github & auto build
Building test: Travis CI
Follow the manual/documentation!
tricks
# load a large dataset in chunks to bound memory usage
import pandas as pd
sources = pd.read_csv("sources.csv", chunksize=100000)  # hypothetical filename
for i, chunk in enumerate(sources):
    ...  # process each chunk
Comments!