Skip to content

Foundations of Data Systems

  • an application has to meet two types of requirements in order to be useful
  • functional : what it does
  • non functional : general properties like reliability, scalability, maintainability

1. Reliable, Scalable, and Maintainable Applications

Reliability

  • system should work even when faced with faults and prevent failures from ocurring
  • fault: one component of system deviates from its spec
  • failure: entire system stops providing service

hardware faults: - add redundancy to hardware components - use fault-tolerannt software to tolerate loss of entire machines

software errors - tend to cause more issues since software is shared accross nodes - think about assumptions, testing, isolate processes, allow process to crash and restart, monitor system in production

human errors - design systems to minimize opportunites for error - don't make to limiting as people will try to get around it - decouple place where people make mistakes from where they can cause failures - sandbox environments - test thoroughly: unit tests, system wide integration tetss, manual tests - make it fast to roll back configuration changes

Scalability

  • systems ability to cope with increased load
  • depends on system architecture. load can be volume of reads, volume of writes, volume of data to store, complexity of data, response times, access patterns

describing load - load parameters depends on architecture of system: requests per second, reads and writes to database, number of simulatneous users - focus on average case vs bottleneck cases

describe performance - when load parameter is increased and system resources are unchanged, how is performance affected - when you increase a load parameter, how much do you need to increase the resources to keep performance unchanged - reponse time: what client sees. inclues service time, network delays, queing delays etc. - latency : time requests is waiting to be handled - use percentiles to understand median performance, and outliers - queing delays account for a large part of the reponse time at high percentiles - head of line blocking: early slow request delay others behind in queue. which is why it's important to measure response times on the client side - or when several backend calls are required to serve a single request

approaches for coping with load - scaling up / verical scaling: moving to a more powerful machine - common wisodom to scale databases up for simplicity - scaling out / horizontal scaling: distubuting the load across multiple smaller machines, also known as shared-nothing - elastic: can automatically add computing resources when load increases.

Maintainability

  • it should be reasonable to expect different people can work on the system productively

  • majority of cost of software is ongoing maintenence

  • operability: make it easy to keep the system running smoothly
  • good visibility into systems health
  • simplicity: make it easy for new engineers to understand
  • complexity increases risk of bugs by hiding assumptions, unintedned consequences, and unexpected interactions
  • abstractions: can hide implementation detail
  • evolvability: make it easy to make changes to the system in the future

2. Data Models and Query Languages

  • application are built by layering data models
  • objects/data structures > JSON/XML > bytes in memory/disk/network > electrical currents
  • data models embody assumptions about how it is going to be used
  • some usage are easy and some are not supported; some operations are fast and some perform badly
  • One model can be emulated in terms of another model, but the result is often awkward.
  • different systems for different purposes, not a single one-size-fits-all solution.
  • schemas can either be explicit (enforced on write) or implicit (handled on read)

Relational Model Versus Document Model

Models - hierarchal model - represented all data as a tree of records nested within records - not good for representing many-to-many relationships - network model - record could have multiple parents - ex: one record for a location which is linked to many users, allowing many to one and many to many - many access paths, became cumbersome, updating code for querying and updating the database complicated and inflexible - relational model - all data in a relation (table) made of up of rows - no complicated access paths, query optimizer decides optimal path - document model - similar storage as hierarchal model, with data in JSON - greater scalability than relational, very large datasets or very high write throughput

Comparison

Relational Document
joins better support for joins, many to one, and many to many relationships joins not natively supported, work of joins shifted to application code, use document reference (unique identifier) to point to other documents and resolve using follow up queries
schema schema enforced at write time, changes require migrations schema-on-read approach (don't enforce any schema), more flexible, can immediately write new document structures and have application code handle variants
data locality if getting document like data, requires access multiple different tables can enjoy better data locality if application needs most of document since entire document will be loaded, since updates can require rewriting of entire document it is recommended to keep documents small
Object-Relational Mismatch require translation layer between objects in application code and the database model JSON model in document databases is usually similar to objects in application code, resulting in easier translation

Extra - Many-to-One and Many-to-Many Relationships - if a certain data is used by many different entities, it can make sense to store it as an ID - enforce consistency across entities - avoid ambiguity by making distinctions - localization support (for translation) - better search - ease of updating, as changes only has to happen in one place - less write overhead and inconsistencies - removing duplication helps normalization in databases

  • convergence of document and relational databases
  • most relational databases support xml and json
  • some document databases support relational like joins

Query Languages for Data

imperative vs declarative - imperative: tell computer to perform certain operations in a specific order - declarative: specify end goal but not how to get there - concise, easier to use, hides implementation details, more room for optimization, lends itself to parallel execution which is hard for imperative - SQL is declarative

  • map reduce querying
  • programming model for processing large amounts of data in bulk across many machines
  • in between declarative and imperative
  • logic of the query is expressed with snippets of code, which are called repeatedly by the processing framework

Graph-Like Data Models

  • graph made of two objects: vertices/nodes and edges

Graph models - property graphs - each vertex consists of: - a unique identifier, set of outgoing edges, set of incoming edges, collection of properties (key-value pairs) - each edge consists of: - a unique identifier, vertex where it starts (tail), vertex where it ends (head), label describing the relationship between the two vertices, collection of properties (key-value pairs)

  • triple-stores
  • mostly similar to property graph model
  • all information is stored in three part statement: subject, predicate, object
  • subject is a vertex
  • object can be either

    • a value in a primitive datatype such as string or number
    • in which case the predicate and object are the key value of property on the subject vertex
    • another vertex in the graph
    • in which case the predicate is and edge, the subject is the tail, and the object is the head
  • graph store:

  • represent property graph using a relational schema
  • two relational tables, one for vertices and one for edges
  • any vertex can have an edge connecting it with any other vertex (no schema restriction)
  • given any vertex you can efficiently find both its incoming and outgoing edges, and traverse the graph
  • can store different kinds of information using labels while still keeping a clean data model

Querying - cypher query language - declarative query language for property graphs for Neo4j graph database - graph queries in sql - can be done using the WITH RECURSIVE syntax to follow paths, but is difficult - sparql query language - query language for triple-stores using the rdf data model - datalog - query language of datomic - data model similar to triple store, writes a predicate(subject, object) - define rules that it uses to find data, less convenient for simple one off queries by copes better with complex data

Comparison

Network Model Graph Database
schema had schema specifying which record type could be nested within other record type no restrictions
access traverse an access path to reach a record refer directly to any vertex by unique id or use an index to find vertices with a particular value
order data is ordered and database maintained ordering no order
querying imperative support declarative

3. Storage and Retrieval

  • database should be able to do two things:
  • store data
  • give back data at a later time
  • application developer should know how databases handle storage and retrieval of data
  • to make decision on what storage engine to use in your project

Data Structures That Power Your Database

  • index
  • an addition structure containing metadata about primary data to speed up reads
  • slows down writes, because the index will have to be updated every time data is written
  • application developer creates index depending on typical queries that are expected

Log Structured

  • designed around the concept of many append-only log segment files
  • new data and modifications are appended to a log file sequentially
  • periodically compact and merge: to remove old duplicates of keys

hash indexes - in memory hash map where every key is mapped to the location where the value can be found - crash recovery: if database is restarted in memory hash maps are lost, store snapshot of each segments hash map on disk which can be loaded into memory more quickly - SSTables - same log structure as hash index - sorted string table sstable : sort sequence of key-value pairs by key - range queries are more efficient since data is sorted - merging is more efficient : look at files side by side and copy lowest key - save memory: don't need to keep index of all keys in memory can use a sparse index - save disk space and i/o bandwidth : compress blocks that use same index key

LSM-Trees - log structured merge-tree - cascade of SSTables that are merged in the background - writes go to sorted in memory structure which is eventually written to disk

  • bloom filters : tell you if keys appears in database, to avoid wasting time looking for non existent keys

Page-Oriented

  • B-Trees are the most widely used index, standard index implementation in relational databases
  • break database down into fix sized blocks/pages
  • each page can be identified using and address/location
  • one page is designated as the root of the b-tree
  • page contains several keys and references to child pages
  • each child is responsible for a continuous range of keys and references to their children nodes
  • update: search for the leaf page containing the key, and change the value, write to disk
  • adding: find page whose range encompasses the new key and add it to that page
  • if there isn't enough space in page, then page is split into two half-full pages and parent is updated to account for new subdivision of key ranges

  • write-ahead log (WAL/redo log): append-only file to which every B-tree modification must be written before it can be applied to the pages of the tree

  • used after crash to restore database to consistent state

Comparison

  • Log-Structured Storage Engines
  • LSM-trees typically have higher write throughput
  • smaller files by merging produce smaller files, more read and write requests within the available I/O
  • compaction process can interfere with ongoing reads and writes and consumes bandwidth
  • growing number of unmerged segments can slow down reads
  • Page-Oriented Storage Engines
  • B-Trees typically have faster reads
  • Read Performance: B-Tree structure reduces the number of disk accesses needed to locate data, faster reads
  • Predictable Latency: Random read times are relatively stable
  • fragmentation when page is split or when a row cannot fit into an existing page

Other Index Structures

Storing values within the index - key in an index is the thing that queries search for, but the value can be one of two things - nonclustered index - storing only references to the data within the index - avoids duplicating data when multiple secondary indexes are present: each index references a location in the heap file and the actual data is kept in one place - clustered index - store all row data within the index - hop from the index to the heap file is too much of a performance penalty - covering index or index with included columns - stores some of a table’s columns within the index - compromise between a clustered index and a nonclustered index - clustered and covering indexes can speed up reads, but they require additional storage and can add overhead on writes

Full-text search and fuzzy indexes - search for one word expands to include synonyms, ignore grammatical variations, and to search for occurrences of words near each other in the same document - Lucene is able to search text for words within a certain edit distance

Keeping everything in memory - inmemory databases - performance boost comes from avoiding the overhead of encoding in-memory data structures in a form that can be written to disk - providing data models that are difficult to implement with disk-based indexes: priority queues and sets - some intended for caching use only, where it's acceptable for data to be lost if a machine is restarted - for persistence: write periodic snapshots to disk, or replicate the in-memory state to other machines

Transaction Processing or Analytics

  • Online Transaction Processing (OLTP)
  • access pattern: look up, insert, update, small number of records based on users input
  • interactive
  • Online Analytic Processing (OLAP)
  • access pattern: scan over a huge number of records, usually a few columns, calculate aggregate statistics
Property Online Transaction Processing (OLTP) Online Analytic Processing (OLAP)
Read Pattern small number of records per query, fetched by key Aggregate over large number of records
Write Pattern random-access, low latency writes from user input Bulk import (ELT) or event stream
Used By end user/customer, via web application Internal analyst, for decision support
Data Represents latest state of data History of event over time
Dataset Size Gigabytes to Terabytes Terabytes to Petabytes
Bottleneck Disk seek time Disk bandwidth

Data Warehousing

  • Data Warehouse
  • database separate from the ones used for OLTP systems
  • analyst can query without affecting OLTP operations
  • contains read-only copy of data in all the various OLTP systems of company
  • data extracted from OLTP databases, transformed into analysis friendly schema, and then loaded into data warehouse (ELT)
  • data warehouse can be optimized for analytic access patterns
  • data model usually relational as SQL is good for analytics

Schemas For Analytics

  • star schema (dimensional modeling)
  • fact table: each row represents an event that happened at some point in time, created by referencing a bunch of dimensional tables that give data about event
  • simple
  • snowflake schema
  • dimensions are further broken down into sub-dimensions
  • more normalized

Column-Oriented Storage

  • row oriented : store all values from one row together
  • loads all attributes for each row before filtering
  • column oriented: store all values from each column together
  • queries only loads required columns to save time

Column Compression

  • column oriented storage lends itself well to compression, which reduces disk throughput demand by compressing data
  • number of distinct values in a column are small compared to number of rows
  • can use bitmap encoding

Memory bandwidth and vectorized processing - column compression allows more rows from a column to fit in the same amount of L1 cache - vectorized processing: operate on such chunks of compressed column data directly

Sort Order In Column Storage

  • can impose an order for how the rows are stored
  • data sorted entire row at a time, the key chosen based on common query
  • a second column can be used to sort rows that have same value in first sort column
  • helps with compression since sorting creates long sequences of repeating number if there isn't many distinct values

Several different sort orders - redundantly store data sorted in different ways - different queries benefit from different sort orders

Writing To Column-Oriented Storage

  • read optimizations make writes a little more difficult
  • like LSM-Tree, writes go to in memory sorted structure before being written to disk
  • queries will have to examine both column data on disk and recent writes in memory

Aggregation: Data Cubes and Materialized Views

  • data warehouse queries often involve an aggregate function, such as COUNT, SUM, AVG, MIN, or MAX in SQL
  • materialized view: copy of the query results, written to disk
  • need to be updated when underlying data changes
  • updates make writes more expensive, make more sense in read heavy warehouse
  • virtual view: shortcut for writing queries
  • SQL engine expands it into the view’s underlying query on the fly and then processes the expanded query
  • data cube or OLAP cube: common special case of a materialized view
  • grid of aggregates grouped by different dimensions (example: SUM sales for each combination of date and product)
  • certain queries become very fast because they have effectively been precomputed
  • not as flexible as querying raw data: used to improve specific queries

Summary

  • OLTP: user facing, small queries, large volume of requests, disk seek time is bottleneck
  • OLAP: analyst facing, large queries, lower volume of requests, disk bandwidth is bottleneck
  • large number of rows, indexes are much less relevant.
  • becomes important to encode data very compactly, to minimize the amount of data that the query needs to read from disk
  • log-structure: append to files and delete obsolete files but never update in place
  • systematically turn random-access writes into sequential writes on disk to enable higher write throughput
  • Bitcask, SSTables, LSM-trees, LevelDB, Cassandra, HBase, Lucene
  • page-oriented: treats the disk as a set of fixed-size pages that can be overwritten
  • BTrees, RDBMS