Data Management @ LSST

4/24/17

SQLite with python

Basic operation syntax

import sqlite3
conn = sqlite3.connect("reintro.db") 
cur = conn.cursor() // creat cursor for operations
cur.execute("""create table PetInfo(
                                    Name text, 
                                    Species text,
                                    Age tinyint,
                                    FavoriteFood text
                                    )""")
cur.fetchall() // display the cursor      
conn.commit() // commit change to database
cur.execute("""drop table if exists xxx""") # drop the table if is already exists

SDSS API

from astroquery.sdss import SDSS
SDSS.query_sql("""select top 20 * from PhotoObjAll""")

demonstration of Join:


4/25/17

SQL byYusra AlSayyad

Data Management: building database, storing data

  • SQL database
  • Relational Algebra, not good for graph structure
  • combined -> pandas by Wes McKinney
  • TOPCAT: Relational database with uncertainty
  • Indexing: optimize data extracture and slicing
  • hierarchical triangular mesh, reduce 2D to 1D for GIS indexing

SQL examples for inspiration

CREATE database mysql_demo;
use mysql_demo;
CREATE TABLE Person (
  personId int,
  firstName varchar(255),
  lastName varchar(255),
  phoneNumber char(10));
CREATE TABLE Publication (
  pubId int,
  title varchar(255),
  journalName varchar(255));
INSERT INTO Person VALUES (1, 'Cameron', 'Hummels', '1234567890');
INSERT INTO Person VALUES (2, 'Y', 'A', '1234567890');
INSERT INTO Publication VALUES (101, 'Stuff about stars', 'Apj');
INSERT INTO Publication VALUES (102, 'Stuff about galaxies', 'Apj');
show tables;

CREATE TABLE Authorship ( ! relational dababase
  personId int,
  pubId int);

INSERT INTO Authorship VALUE (1,101);
INSERT INTO Authorship VALUE (1,102);
INSERT INTO Authorship VALUE (2,101);
INSERT INTO Authorship VALUE (2,102);

SELECT pub.*, p.*
  FROM publication pub 
  INNER JOIN Authorship a
  on a.pubId = pub.pubId
  INNER JOIN Person p
  on a.personId = p.personId

CREATE INDEX key On table

SELECT TOP 10
   p.objid,p.ra,p.dec,p.u,p.g,p.r,p.i,p.z,
   p.run, p.rerun, p.camcol, p.field,
   s.specobjid, s.class, s.z as redshift,
   s.plate, s.mjd, s.fiberid
FROM PhotoObj AS p
   JOIN SpecObj AS s ON s.bestobjid = p.objid
WHERE 
   p.u BETWEEN 0 AND 19.6
   AND g BETWEEN 0 AND 20

SELECT top 10 * 
        FROM photoobjall p 
        inner join ROSAT r 
        on p.objid = r.objid
        where (p.cModelFlux_u + p.cModelFlux_g + p.cModelFlux_r + p.cModelFlux_i + p.cModelFlux_z > 10000) 
        and (p.type = 3) 

SELECT pr.*
FROM
    (SELECT p.* 
        FROM photoobjall p JOIN rosat r ON p.objid = r.objid
        WHERE (p.cModelFlux_u + p.cModelFlux_g + p.cModelFlux_r + p.cModelFlux_i + p.cModelFlux_z > 10000) 
        and (p.type = 3)
    ) as pr
        LEFT JOIN specobjall s ON s.bestobjid = pr.objid where s.bestobjid is null

GROUP BY

ORDER BY ASC/DEC

4/26/17

Data Management: extracting data

Scaling up -> scaling out

Hardware Architecture

  • Parallel RDMS Architecture: shared nothing
  • Partitioning: spatical indexing
    • fast extracting regional data in high dimension
    • random distribute data based on hash table to balance load
  • Shared scanning
  • Fault tolerance

Software

  • Spark, HDFS solution
  • Hadoop Distributed Files System, HDFS
  • MapReduce
  • Relation (shcema)
  • Indexing
  • Eventually -> database
  • NoSQL: not only SQL, non-relational (no schema)
  • MongoDB
  • DynamoDB
  • ...
  • Aggregation-oriented database
  • pros: same type of aggregated data
  • cons: hard to slice and dice

LSST QSERV


10/19/17

Resources * SQL notebook * SQL course on KhanAcademy

Published: Thu 19 October 2017. By Dongming Jin in

Comments !