4/24/17
SQLite with python
Basic operation syntax
import sqlite3
conn = sqlite3.connect("reintro.db")
cur = conn.cursor() // creat cursor for operations
cur.execute("""create table PetInfo(
Name text,
Species text,
Age tinyint,
FavoriteFood text
)""")
cur.fetchall() // display the cursor
conn.commit() // commit change to database
cur.execute("""drop table if exists xxx""") # drop the table if is already exists
SDSS API
from astroquery.sdss import SDSS
SDSS.query_sql("""select top 20 * from PhotoObjAll""")
demonstration of Join:
4/25/17
SQL byYusra AlSayyad
Data Management: building database, storing data
- SQL database
- Relational Algebra, not good for graph structure
- combined -> pandas by Wes McKinney
- TOPCAT: Relational database with uncertainty
- Indexing: optimize data extracture and slicing
- hierarchical triangular mesh, reduce 2D to 1D for GIS indexing
SQL examples for inspiration
CREATE database mysql_demo;
use mysql_demo;
CREATE TABLE Person (
personId int,
firstName varchar(255),
lastName varchar(255),
phoneNumber char(10));
CREATE TABLE Publication (
pubId int,
title varchar(255),
journalName varchar(255));
INSERT INTO Person VALUES (1, 'Cameron', 'Hummels', '1234567890');
INSERT INTO Person VALUES (2, 'Y', 'A', '1234567890');
INSERT INTO Publication VALUES (101, 'Stuff about stars', 'Apj');
INSERT INTO Publication VALUES (102, 'Stuff about galaxies', 'Apj');
show tables;
CREATE TABLE Authorship ( ! relational dababase
personId int,
pubId int);
INSERT INTO Authorship VALUE (1,101);
INSERT INTO Authorship VALUE (1,102);
INSERT INTO Authorship VALUE (2,101);
INSERT INTO Authorship VALUE (2,102);
SELECT pub.*, p.*
FROM publication pub
INNER JOIN Authorship a
on a.pubId = pub.pubId
INNER JOIN Person p
on a.personId = p.personId
CREATE INDEX key On table
SELECT TOP 10
p.objid,p.ra,p.dec,p.u,p.g,p.r,p.i,p.z,
p.run, p.rerun, p.camcol, p.field,
s.specobjid, s.class, s.z as redshift,
s.plate, s.mjd, s.fiberid
FROM PhotoObj AS p
JOIN SpecObj AS s ON s.bestobjid = p.objid
WHERE
p.u BETWEEN 0 AND 19.6
AND g BETWEEN 0 AND 20
SELECT top 10 *
FROM photoobjall p
inner join ROSAT r
on p.objid = r.objid
where (p.cModelFlux_u + p.cModelFlux_g + p.cModelFlux_r + p.cModelFlux_i + p.cModelFlux_z > 10000)
and (p.type = 3)
SELECT pr.*
FROM
(SELECT p.*
FROM photoobjall p JOIN rosat r ON p.objid = r.objid
WHERE (p.cModelFlux_u + p.cModelFlux_g + p.cModelFlux_r + p.cModelFlux_i + p.cModelFlux_z > 10000)
and (p.type = 3)
) as pr
LEFT JOIN specobjall s ON s.bestobjid = pr.objid where s.bestobjid is null
GROUP BY
ORDER BY ASC/DEC
4/26/17
Data Management: extracting data
Scaling up -> scaling out
Hardware Architecture
- Parallel RDMS Architecture: shared nothing
- Partitioning: spatical indexing
- fast extracting regional data in high dimension
- random distribute data based on hash table to balance load
- Shared scanning
- Fault tolerance
Software
- Spark, HDFS solution
- Hadoop Distributed Files System, HDFS
- MapReduce
- Relation (shcema)
- Indexing
- Eventually ->
database
- NoSQL: not only SQL, non-relational (no schema)
- MongoDB
- DynamoDB
- ...
- Aggregation-oriented database
- pros: same type of aggregated data
- cons: hard to slice and dice
10/19/17
Resources * SQL notebook * SQL course on KhanAcademy
Comments !