Scale Computing on Clusters

Configure & utilize HPCs

Nirav Merchant @CyVerseOrg

  • computational thinking
  • 4A: Abstraction, Automation, Ability and Audacity
  • establish and manage data-driven collaborations at global scale
  • efficient and coordinated use of CI resources: NSF XSEDE, iPlant, campus HPC, and high-bandwidth networks
  • adopting best practices from HEP and the life sciences
  • community driven, self-provisioning, extensible and open source: CISE communities
  • NSF Infrastructure
  • data-rich and knowledge-poor
  • NSF XSEDE: $121M per 5-year award
    • support team
    • travel support
  • Jetstream @ U Indiana
    • cmd line, GUI
    • mint a DOI, share VMs, then store and publish via IU ScholarWorks
    • 1-44 vCPUs, OpenStack
    • VMs
    • Atmosphere web interface
    • direct API access via OpenStack cmd line or Horizon
    • admin/root
    • allocations
    • CV for PI, abstract
    • Main description
  • Technology landscape
  • cloud
    • image: instance (isolated)
    • automation
  • containers
    • app, bins, libs
    • user-end
    • Docker: containers share the host OS through the Docker Engine
    • Singularity
  • Toolbox
    • Ansible for automation: playbooks, configuration management, deployment, and orchestration
    • Docker for execution environment
    • Makeflow + Work Queue for task distribution
    • Pegasus

      High-contrast imaging in the cloud with klipReduce and Findr

Jetstream demo

Jetstream homepage; training account train56: 091Z67OEe8jE

  • Create Project
  • create image (instance)
  • visibility
  • excluding folders
  • build
  • launch image
docker run -it --rm -p 8888:8888 -v /home/train56:/home/jovyan/work/ jupyter/datascience-notebook

-it: interactive (keep STDIN open and allocate a terminal)
--rm: remove the container when it exits
-v: volume (bind-mount the host directory /home/train56 into the container)
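
On startup the notebook server should print a URL containing a one-time login token; open http://<host>:8888 in a browser and paste the token to log in.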

Demand

  • open innovation, science, and collaboration
  • complexity of infrastructure
  • evolving technology landscape
  • data/metadata
  • extreme information technology -> renew computational platform every x years

Docker

docker run -it -p 8888:8888 tensorflow/tensorflow

(see the TensorFlow Docker documentation)

Singularity: container runtime for HPC systems, where users cannot run a privileged daemon like Docker's


Spark @ Amazon

Industrial

  • Databricks
  • Myria
  • Daytona sort contest
  • Gartner hype cycle: 2012, 2014, 2016

Techniques

  • MapReduce-based
  • HPC cluster computing
  • Databases
    • transactional: OLTP
    • analytic (NoSQL): OLAP
  • latency
  • CAP theorem

CAP theorem: once data is partitioned across machines, a system can favor Availability or Consistency, but cannot guarantee both

Hadoop redundancy: HDFS block replication defaults to 3 copies, for fault tolerance

Amazon demo

  • AWS
  • S3: distributed storage
  • EMR: managed Hadoop framework
    • create cluster (scriptable; see the boto3 sketch after this list)
    • hadoop, HBase, Presto, Spark
    • instance
    • type m4.large
    • number: 1 master and x core nodes
    • inbound rules: allow your IP
    • Zeppelin: notebooks (stored as JSON)
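
The console steps above can also be scripted. A minimal boto3 sketch (the cluster name, region, release label, and node count are assumptions for illustration):

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spark-demo",                      # hypothetical cluster name
    ReleaseLabel="emr-5.5.0",               # pick a release that bundles Spark/Zeppelin
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}, {"Name": "Zeppelin"}],
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,                 # 1 master + 2 core nodes
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])                # cluster id of the new cluster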
%md
%pyspark
sc.parallelize(data, 8)  # sc is the SparkContext, predefined in Zeppelin; standalone: from pyspark import SparkContext; sc = SparkContext()

RDD.count()/take(n)/takeSample()/map()/collect()
RDD.flatMap()  # flatten per-element results (e.g., lists) into a single RDD
RDD.reduce(add)
RDD.repartition(n)
RDD.groupByKey()
RDD.reduceByKey(add)

sc.textFile('s3://path_to_file')
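
A minimal word-count sketch tying these operations together (standalone PySpark; the input lines are made up):

from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")        # in Zeppelin, sc already exists

lines = sc.parallelize(["to be or not to be",
                        "to see or not to see"], 8)

words = lines.flatMap(lambda line: line.split())        # flatten lines into words
counts = words.map(lambda w: (w, 1)).reduceByKey(add)   # sum counts per word

print(words.count())      # 12
print(counts.collect())   # e.g. [('to', 4), ('be', 2), ('or', 2), ...]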

%sql
${var}  # user-defined parameter (Zeppelin dynamic form)
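
For example (a sketch; the view and column names are made up), register a DataFrame as a temp view in a %pyspark paragraph, then query it from a %sql paragraph with a dynamic-form variable:

%pyspark
df = counts.toDF(["word", "n"])          # counts from the sketch above (Spark 2.x)
df.createOrReplaceTempView("counts")

%sql
SELECT word, n FROM counts WHERE n > ${mincount=1}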

LSST example; avoid lambda functions. Spark is lazy: transformations only build an execution plan, and nothing is computed until an action requests the data.
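
A quick illustration of that laziness (a sketch; timings are indicative, not measured):

import time
from pyspark import SparkContext

sc = SparkContext(appName="lazy-demo")
rdd = sc.parallelize(range(10 ** 7))

t0 = time.time()
squared = rdd.map(lambda x: x * x)           # transformation: returns immediately
t1 = time.time()
total = squared.reduce(lambda a, b: a + b)   # action: triggers the computation
t2 = time.time()

print("map: %.4fs, reduce: %.4fs" % (t1 - t0, t2 - t1))  # map is near-instant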


XSEDE by TACC

  • Resources
  • HPC
  • ECSS: Extended Collaborative Support Service
  • Science Gateways

Globus Connect (data transfer)

Published: Fri 28 April 2017. By Dongming Jin.