Scale Computing on Clusters

Configure & utilize HPCs

Nirav Merchant @CyVerseOrg

  • computational thinking
  • 4A: Abstraction, Automation, Ability and Audacity
  • establish and manage data-driven collaborations at global scale
  • efficient and coordinated use of CI resources: NSF XSEDE, iPlant, campus HPC, and high-bandwidth networks
  • adopting best practices from HEP and the life sciences
  • community driven, self-provisioning, extensible and open source: CISE communities
  • NSF Infrastructure
  • data-rich and knowledge-poor
  • NSF XSEDE: $121M per 5-year award
    • support team
    • travel support
  • Jetstream @ U Indiana
    • cmd line, GUI
    • mint a DOI, share VMs, then store and publish via IU ScholarWorks
    • 1-44 vCPUs, OpenStack
    • VMs
    • Atmosphere web interface
    • direct API access via OpenStack cmd line or Horizon
    • admin/root
    • allocations
    • CV for PI, abstract
    • Main description
  • Technology landscape
  • cloud
    • image: instance (isolated)
    • automation
  • containers
    • app, bins, libs
    • user-end
    • Docker: containers share the host OS through the Docker Engine
    • Singularity
  • Toolbox
    • Ansible for automation: playbooks, configuration management, deployment, and orchestration
    • Docker for execution environment
    • Makeflow + Work Queue for task distribution
    • Pegasus

      High-contrast imaging in the cloud with klipReduce and Findr

Jetstream demo

Jetstream homepage; training account train56: 091Z67OEe8jE

  • Create Project
  • create image (instance)
  • visibility
  • excluding folders
  • build
  • launch image
docker run -it --rm -p 8888:8888 -v /home/train56:/home/jovyan/work/ jupyter/datascience-notebook

-it: interactive (keep STDIN open and allocate a terminal)
--rm: remove the container when it exits
-v: volume (bind-mount the host directory /home/train56 into the container)
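
On startup the notebook server should print a URL containing a one-time login token; open http://<host>:8888 in a browser and paste the token to log in.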

Demand

  • open innovation, science, and collaboration
  • complexity of infrastructure
  • evolving technology landscape
  • data/metadata
  • extreme information technology -> renew computational platform every x years

Docker

docker run -it -p 8888:8888 tensorflow/tensorflow

(see the TensorFlow Docker documentation)

Singularity: container runtime for HPC systems, where users cannot run a privileged daemon like Docker's


Spark @ Amazon

Industrial

  • Databricks
  • Myria
  • Daytona sort contest
  • Gartner hype cycle: 2012, 2014, 2016

Techniques

  • MapReduce-based
  • HPC cluster computing
  • Databases
    • transactional: OLTP
    • analytic (NoSQL): OLAP
  • latency
  • CAP theorem

CAP theorem: once data is partitioned across machines, a system can favor Availability or Consistency, but cannot guarantee both

Hadoop redundancy: HDFS block replication defaults to 3 copies, for fault tolerance

Amazon demo

  • AWS
  • S3: distributed storage
  • EMR: managed Hadoop framework
    • create cluster (scriptable; see the boto3 sketch after this list)
    • hadoop, HBase, Presto, Spark
    • instance
    • type m4.large
    • number: 1 master and x core nodes
    • inbound rules: allow your IP
    • Zeppelin: notebooks (stored as JSON)
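
The console steps above can also be scripted. A minimal boto3 sketch (the cluster name, region, release label, and node count are assumptions for illustration):

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spark-demo",                      # hypothetical cluster name
    ReleaseLabel="emr-5.5.0",               # pick a release that bundles Spark/Zeppelin
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}, {"Name": "Zeppelin"}],
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,                 # 1 master + 2 core nodes
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])                # cluster id of the new cluster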
%md
%pyspark
sc.parallelize(data, 8)  # sc is the SparkContext, predefined in Zeppelin; standalone: from pyspark import SparkContext; sc = SparkContext()

RDD.count()/take(n)/takeSample()/map()/collect()
RDD.flatMap()  # flatten per-element results (e.g., lists) into a single RDD
RDD.reduce(add)
RDD.repartition(n)
RDD.groupByKey()
RDD.reduceByKey(add)

sc.textFile('s3://path_to_file')
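
A minimal word-count sketch tying these operations together (standalone PySpark; the input lines are made up):

from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")        # in Zeppelin, sc already exists

lines = sc.parallelize(["to be or not to be",
                        "to see or not to see"], 8)

words = lines.flatMap(lambda line: line.split())        # flatten lines into words
counts = words.map(lambda w: (w, 1)).reduceByKey(add)   # sum counts per word

print(words.count())      # 12
print(counts.collect())   # e.g. [('to', 4), ('be', 2), ('or', 2), ...]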

%sql
${var}  # user-defined parameter (Zeppelin dynamic form)
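
For example (a sketch; the view and column names are made up), register a DataFrame as a temp view in a %pyspark paragraph, then query it from a %sql paragraph with a dynamic-form variable:

%pyspark
df = counts.toDF(["word", "n"])          # counts from the sketch above (Spark 2.x)
df.createOrReplaceTempView("counts")

%sql
SELECT word, n FROM counts WHERE n > ${mincount=1}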

LSST example; avoid lambda functions. Spark is lazy: transformations only build an execution plan, and nothing is computed until an action requests the data.
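
A quick illustration of that laziness (a sketch; timings are indicative, not measured):

import time
from pyspark import SparkContext

sc = SparkContext(appName="lazy-demo")
rdd = sc.parallelize(range(10 ** 7))

t0 = time.time()
squared = rdd.map(lambda x: x * x)           # transformation: returns immediately
t1 = time.time()
total = squared.reduce(lambda a, b: a + b)   # action: triggers the computation
t2 = time.time()

print("map: %.4fs, reduce: %.4fs" % (t1 - t0, t2 - t1))  # map is near-instant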


XSEDE by TACC

  • Resources
  • HPC
  • ECSS: Extended Collaborative Support Service
  • Science Gateways

Globus Connect (data transfer)

Published: Fri 28 April 2017. By Dongming Jin.