Data Management @ LSST

4/24/17

SQLite with python

Basic operation syntax

import sqlite3
conn = sqlite3.connect("reintro.db") 
cur = conn.cursor() // creat cursor for operations
cur.execute("""create table PetInfo(
                                    Name text, 
                                    Species text,
                                    Age tinyint,
                                    FavoriteFood text
                                    )""")
cur.fetchall() // display the cursor      
conn.commit() // commit change to database
cur.execute("""drop table if exists xxx""") # drop the table if is already exists

SDSS API

from astroquery.sdss import SDSS
SDSS.query_sql("""select top 20 * from PhotoObjAll""")

demonstration of Join:

4/25/17

SQL byYusra AlSayyad

Data Management: building database, storing data

SQL database
Relational Algebra, not good for graph structure
combined -> pandas by Wes McKinney
TOPCAT: Relational database with uncertainty
Indexing: optimize data extracture and slicing
hierarchical triangular mesh, reduce 2D to 1D for GIS indexing

SQL examples for inspiration

CREATE database mysql_demo;
use mysql_demo;
CREATE TABLE Person (
  personId int,
  firstName varchar(255),
  lastName varchar(255),
  phoneNumber char(10));
CREATE TABLE Publication (
  pubId int,
  title varchar(255),
  journalName varchar(255));
INSERT INTO Person VALUES (1, 'Cameron', 'Hummels', '1234567890');
INSERT INTO Person VALUES (2, 'Y', 'A', '1234567890');
INSERT INTO Publication VALUES (101, 'Stuff about stars', 'Apj');
INSERT INTO Publication VALUES (102, 'Stuff about galaxies', 'Apj');
show tables;

CREATE TABLE Authorship ( ! relational dababase
  personId int,
  pubId int);

INSERT INTO Authorship VALUE (1,101);
INSERT INTO Authorship VALUE (1,102);
INSERT INTO Authorship VALUE (2,101);
INSERT INTO Authorship VALUE (2,102);

SELECT pub.*, p.*
  FROM publication pub 
  INNER JOIN Authorship a
  on a.pubId = pub.pubId
  INNER JOIN Person p
  on a.personId = p.personId

CREATE INDEX key On table

SELECT TOP 10
   p.objid,p.ra,p.dec,p.u,p.g,p.r,p.i,p.z,
   p.run, p.rerun, p.camcol, p.field,
   s.specobjid, s.class, s.z as redshift,
   s.plate, s.mjd, s.fiberid
FROM PhotoObj AS p
   JOIN SpecObj AS s ON s.bestobjid = p.objid
WHERE 
   p.u BETWEEN 0 AND 19.6
   AND g BETWEEN 0 AND 20

SELECT top 10 * 
        FROM photoobjall p 
        inner join ROSAT r 
        on p.objid = r.objid
        where (p.cModelFlux_u + p.cModelFlux_g + p.cModelFlux_r + p.cModelFlux_i + p.cModelFlux_z > 10000) 
        and (p.type = 3) 

SELECT pr.*
FROM
    (SELECT p.* 
        FROM photoobjall p JOIN rosat r ON p.objid = r.objid
        WHERE (p.cModelFlux_u + p.cModelFlux_g + p.cModelFlux_r + p.cModelFlux_i + p.cModelFlux_z > 10000) 
        and (p.type = 3)
    ) as pr
        LEFT JOIN specobjall s ON s.bestobjid = pr.objid where s.bestobjid is null

GROUP BY

ORDER BY ASC/DEC

4/26/17

Data Management: extracting data

Scaling up -> scaling out

Hardware Architecture

Parallel RDMS Architecture: shared nothing
Partitioning: spatical indexing
- fast extracting regional data in high dimension
- random distribute data based on hash table to balance load
Shared scanning
Fault tolerance

Software

Spark, HDFS solution
Hadoop Distributed Files System, HDFS
MapReduce
Relation (shcema)
Indexing
Eventually -> database
NoSQL: not only SQL, non-relational (no schema)
MongoDB
DynamoDB
...
Aggregation-oriented database
pros: same type of aggregated data
cons: hard to slice and dice

LSST QSERV

10/19/17

Resources * SQL notebook * SQL course on KhanAcademy

Data Management @ LSST

SQLite with python

SQL byYusra AlSayyad

Data Management: extracting data

Comments !