Configure & utilize HPCs
Nirav Merchant @CyVerseOrg
Computational thinking
- 4A: Abstraction, Automation, Ability and Audacity
- establish and manage data-driven collaborations at global scale
- efficient and coordinated use of CI resources: NSF XSEDE, iPlant, campus HPC, and high-bandwidth networks
- adopting best practices from HEP and the life sciences
- community driven, self-provisioning, extensible and open source: CISE communities
- NSF Infrastructure
data rich and knowledge poor
- NSF XSEDE: $121M every 5 years
- support team
- travel support
- Jetstream @ U Indiana
- cmd line, GUI
- DOIs: share VMs, then store and publish via IU ScholarWorks
- 1-44 vCPUs, OpenStack
- VMs
- Atmosphere web interface
- direct API access via OpenStack cmd line or Horizon
- admin/root
- allocations
- CV for PI, abstract
- Main description
- Technology landscape
- cloud
- image: instance (isolated)
- automation
- containers
- app, bins, libs
- user-end
- docker: docker engine/OS
- singularity
- Toolbox
Jetstream demo
Jetstream homepage; training login train56: 091Z67OEe8jE
- Create Project
- create an image (instance)
- visibility
- excluding folders
- build
- launch image
docker run -it --rm -p 8888:8888 -v /home/train56:/home/jovyan/work/ jupyter/datascience-notebook
-it: interactive terminal
--rm: remove the container after it exits
-v: mount a volume (host path:container path)
-p: publish a port (host 8888 -> container 8888, where Jupyter listens)
Demand
- open innovation, science and collaboration
- complexity of infrastructure
- evolving technology landscape
- data/metadata
- extreme information technology -> renew computational platform every x years
Docker
docker run -it -p 8888:8888 tensorflow/tensorflow
the official TensorFlow Docker image (serves Jupyter on port 8888)
Singularity: container runtime for HPC systems (no root daemon, unlike Docker)
Spark @ Amazon
Industrial
- Databricks
- Myria
- Daytona sort contest
- Gartner hype cycle: 2012, 2014, 2016
Techniques
- MapReduce-based
- HPC cluster computing
- Databases
  - transactional: OLTP
  - analytic (NoSQL): OLAP
- Latency
- CAP theorem: a distributed system can guarantee at most two of Consistency, Availability, and Partition tolerance
Hadoop redundancy: HDFS keeps 3 replicas of each block by default, for fault tolerance
Amazon demo
- AWS
- S3: distributed storage
- EMR: managed Hadoop framework
- create cluster
- hadoop, HBase, Presto, Spark
- instance
- type m4.large
- number: 1 master and x core nodes
- inbound rules: allow your IP
- Zeppelin: notebook interface (notebooks stored as JSON)
%md # Markdown paragraph
%pyspark # PySpark paragraph
sc.parallelize(data, 8) # distribute `data` across 8 partitions; in plain Python: `from pyspark import SparkContext; sc = SparkContext()`
RDD.count()/take(n)/takeSample()/map()/collect()
RDD.flatMap() # like map(), but flattens nested results into a single list
RDD.reduce(add) # requires `from operator import add`
RDD.repartition(n)
RDD.groupByKey()
RDD.reduceByKey(add)
sc.textFile('s3://path_to_file')
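A minimal word-count sketch tying these operations together, assuming a Zeppelin %pyspark paragraph where `sc` is already provided (the sample sentences are made up):

```python
%pyspark
from operator import add

def to_pair(word):
    # pair each word with an initial count of 1
    return (word, 1)

# hypothetical in-memory sample; a real job would use sc.textFile('s3://...')
lines = sc.parallelize(["to be or not to be", "to see or not to see"], 8)

counts = lines.flatMap(str.split).map(to_pair).reduceByKey(add)
print(counts.collect())  # e.g. [('to', 4), ('be', 2), ('or', 2), ...]
```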
%sql # SQL paragraph
${var} # user-defined parameter (Zeppelin dynamic form)
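To make RDD results queryable from a %sql paragraph, one sketch (assuming Spark 2.x with Zeppelin's built-in SQL context; `counts` is from the word-count example above, and the table and variable names are made up):

```python
%pyspark
# convert the (word, count) pairs to a DataFrame and expose it to %sql
df = counts.toDF(["word", "n"])
df.createOrReplaceTempView("words")
# a %sql paragraph could then run: SELECT * FROM words WHERE n > ${min_count}
```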
LSST
avoid lambda functions
Spark is lazy: nothing is computed until an action requests the data
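A small illustration of that laziness (hypothetical numbers, same `sc` as above): the map() below only records lineage, and no job runs until the count() action is called.

```python
%pyspark
def square(x):
    return x * x

nums = sc.parallelize(range(10**6), 8)
squares = nums.map(square)  # transformation: recorded in the lineage, not executed
print(squares.count())      # action: the job actually runs now -> 1000000
```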
XSEDE by TACC
- Resources
- HPC
- ECSS: Extended Collaborative Support Service
- Science Gateways