merged from staging - done caring about general populace

This commit is contained in:
Medium Fries
2018-11-08 19:02:41 -08:00
50 changed files with 182023 additions and 17 deletions

180499
cst363/lab/campaign-ca-2016.sql Normal file

File diff suppressed because it is too large.

BIN
cst363/lab/hashing-lab.pdf Normal file

Binary file not shown.


26
cst363/lec/lec11.md Normal file

@@ -0,0 +1,26 @@
# lec11
_this section still needs more info_
## Query processing
Keep in mind we are still concerned with systems like sqlite3.
First we parse the input to validate its syntax.
Then we validate the semantics of the input, ensuring that the referenced _tables, objects etc_ actually exist.
Finally we evaluate the input, usually by converting the given expression to an equivalent relational algebra expression.
If we can optimize this expression we can then generate more efficient queries.
To do this we take into account 3 main factors:
1. I/O time
* if we have to write something to disk over and over again then we pay a heavy time cost
2. Computational Time
3. Required memory/disk space
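Returning to the parse and validation steps above, sqlite3's Python driver surfaces both kinds of failure as errors. A toy check (the table names here are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table people (name text, age int)")

# Syntactic validation: the parser rejects malformed SQL outright.
try:
    conn.execute("selec name from people")   # misspelled keyword
    parse_error = None
except sqlite3.OperationalError as e:
    parse_error = str(e)                     # e.g. near "selec": syntax error

# Semantic validation: well-formed SQL that names a missing table also fails.
try:
    conn.execute("select name from no_such_table")
    semantic_error = None
except sqlite3.OperationalError as e:
    semantic_error = str(e)                  # e.g. no such table: no_such_table

print(parse_error)
print(semantic_error)
```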
### Cost function
## Performance of Disk and RAM
## DB Block
## Disk Buffers

43
cst363/lec/lec12.md Normal file

@@ -0,0 +1,43 @@
# lec12
## Lab
This section has a lab activity in `lab/` with instructions on `in-memory-searches.pdf` and `on-disk-search.pdf`.
## In-memory Search
_For now we'll deal with trivial queries._
Say we perform this query: `select name from censusData where age<30;`.
If we do a linear search we will nearly always have to go through all `N` records in the table to get the data we want out.
Binary searches prove to be quicker but our data must be ordered in some fashion.
_Note:_ just recall that we can only sort a table's entries by a single column at any given time.
The other problem we encounter is that our data must _always_ remain sorted, which means entering, modifying, and deleting data has much larger overhead than other methods.
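A toy sketch of the two approaches, with hypothetical data standing in for `censusData`:

```python
import bisect

ages = [21, 25, 28, 33, 41, 57]                   # sorted on the search column
names = ["ana", "bo", "cy", "dee", "ed", "fay"]   # aligned with ages

# Linear search: touches every one of the N records regardless of order.
linear = [n for n, a in zip(names, ages) if a < 30]

# Binary search: because ages is sorted, one bisect finds the cutoff point.
cut = bisect.bisect_left(ages, 30)
binary = names[:cut]

print(linear, binary)   # both find the same three names
```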
## On-Disk Search
There are two main ways of storing the data on disk: by record or by column.
Likewise we also have to deal with variable length data types like `varchar` which provides an upper bound but no fixed size necessarily.
### Blocks
Blocks contain records or sometimes columns depending on the implementation.
We usually allocate these blocks in 4k or 8k bytes of space since sectors are split into 512 byte chunks.
These things are taken into account because I/O time sucks; it always has, and until SSD lifetime performance stops sucking it always will.
The main issue with getting data off the disk isn't the read time, it's the time to find something in the first place. This is because we write to the disk in a fashion that _isn't_ completely linear.
Also keep in mind that our total I/O time to search for something is going to be T~access~ + T~transfer~\*N~records~.
* If we search on a keytype then we only have to search half the records.
* Also this is assuming that _all_ the blocks are right next to each other.
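The access-time formula above can be played with numerically. The timings here are assumed round figures for a spinning disk, not measured values:

```python
import math

t_access = 10.0     # ms per random access (assumed figure for a spinning disk)
t_transfer = 0.01   # ms to transfer one record (assumed)
n_records = 100_000

# Contiguous blocks: pay the access cost once, then stream every record.
sequential = t_access + t_transfer * n_records

# Binary search over scattered blocks: about log2(N) probes, and each probe
# pays the full access cost just to fetch a single record.
probes = math.ceil(math.log2(n_records))
scattered = probes * (t_access + t_transfer)

print(sequential, scattered)
```

Even with far fewer probes, each one paying the full access cost is what dominates the bill.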
If we search through blocks that happen to be right next to each other then we only need to bother finding the first block, but with a binary search we have to pay the access cost for _every single record_ we probe.
This is because unlike memory which is managed by a well written OS, the disk is dumb... very dumb.
The way it (the physical disk) writes/modifies data is nearly always trivial, meaning there is no clever scheme to how it lays data out.
This is half the reason we say that I/O time sucks: hard disks are slow and stupid compared to memory, which is quick and clever.

34
cst363/lec/lec13.md Normal file

@@ -0,0 +1,34 @@
# lec13
## Lab Exercises
This lecture has a lab portion in `lab/` directory.
Directions are on `index-structures-lab.pdf` and `ordered-indexes-lab.pdf`.
## Indexing
To create an index we do:
```
create index indexName on targetTable(attrs);
```
We create an index based on some field, where we sort the entries in this index table.
Each entry then contains a pointer to the corresponding record in the target table.
Sorting the indexes allows us to search them _much faster_ than we could ever do on disk.
> What about collision?
Then we simply add a pointer to the index's list of associated pointers.
The biggest problem we have with indexing is that if we have a large number of entries then we end up storing a huge number of index entries and pointers.
In order to avoid this, we don't index every entry.
Instead we take every other entry into our index, or even every third.
This means that a search still proceeds in a binary fashion, but once we detect that the key falls inside a _gap_ we linearly search through that gap.
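A sketch of that gap search, assuming a toy table sorted on an id column and a sparse index over every other entry:

```python
import bisect

# Toy table sorted on its key; the sparse index keeps every other entry
# as a (key, position) pair.
table = [(10, "a"), (11, "b"), (14, "c"), (17, "d"), (20, "e"), (23, "f")]
sparse = [(row[0], pos) for pos, row in enumerate(table) if pos % 2 == 0]

def lookup(key):
    keys = [k for k, _ in sparse]
    # Binary-search the sparse index for the last entry <= key ...
    i = bisect.bisect_right(keys, key) - 1
    if i < 0:
        return None
    # ... then linearly scan the gap it covers in the base table.
    for pos in range(sparse[i][1], len(table)):
        if table[pos][0] == key:
            return table[pos][1]
        if table[pos][0] > key:
            break
    return None

print(lookup(17), lookup(14), lookup(15))
```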
## Clustering
First let's recall that ideally our data entries in some table are physically located close to each other on disk _and_, are ordered somehow.
### Dense Clustering
### Sparse Clustering

35
cst363/lec/lec14.md Normal file

@@ -0,0 +1,35 @@
# lec14
Let's say we have a massive dense index, so large that we can't fit it into memory.
We can use this dense index to create a sparse index off of that target dense index.
Basically we're indexing the index to reach some data on disk that is very big.
We can of course take this even further: an index of an index, which is itself indexing an index of a table.
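A sketch of that index-of-an-index idea with toy data: a small sparse level sits on top of a dense level, and a lookup walks down through both.

```python
import bisect

# Dense index: one sorted (key, record_pointer) entry per row (toy data).
dense = [(k, f"rec{k}") for k in range(0, 100, 5)]   # keys 0, 5, ..., 95

# Sparse second level: every fourth dense entry, pointing into the dense index.
sparse = [(dense[i][0], i) for i in range(0, len(dense), 4)]

def lookup(key):
    # Level 1: binary-search the small sparse index.
    i = bisect.bisect_right([k for k, _ in sparse], key) - 1
    if i < 0:
        return None
    # Level 2: scan the short run of dense entries that sparse entry covers.
    start = sparse[i][1]
    stop = sparse[i + 1][1] if i + 1 < len(sparse) else len(dense)
    for k, ptr in dense[start:stop]:
        if k == key:
            return ptr
    return None

print(lookup(35))
```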
This concept leads us to B+ Trees.
## B+ Trees
This type of tree here is a self-balancing tree.
This means that as we add rows, the structure adjusts as necessary: the indexes and the pointers within them are updated so that the height of the tree remains balanced.
The leaves on the tree will _always_ have pointers to the target data which we've built a tree upon.
### Overhead
> Found on disk or in RAM?
> What's the cost of balancing the tree every time?
## Hashing
### Direct Indexing
We can create some index where each entry is an index number with a pointer to that respective entry.
If we have id numbers that aren't right next to each other then we have a problem: we also have to include every intermediate value in our table.
### Hash Function
Now we'll have some function that takes an input [the key] and generates a value [the index].
_For now we'll assume we are dealing with collisions with chaining._
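A minimal chaining sketch; the keys, bucket count, and hash function are toy choices, not a real database's:

```python
# Toy hash index with chaining: each bucket holds a list of (key, pointer)
# pairs, so colliding keys simply share a bucket.
N_BUCKETS = 8

def h(key):
    return hash(key) % N_BUCKETS

buckets = [[] for _ in range(N_BUCKETS)]

def insert(key, pointer):
    buckets[h(key)].append((key, pointer))   # collisions just chain

def lookup(key):
    for k, ptr in buckets[h(key)]:
        if k == key:
            return ptr
    return None

insert(1001, "row A")
insert(1009, "row B")    # 1009 % 8 == 1001 % 8, so these two chain together
print(lookup(1009))
```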

44
cst363/lec/lec15.md Normal file

@@ -0,0 +1,44 @@
# lec15
This lecture has two corresponding lab exercises `lab/hashing-lab.pdf` and `lab/other-operations.pdf`.
## Deleted Data on Disk
Let's say we did the following operations on our disk:
```
insert data1
insert data2
delete data1
lookup data2
```
Let's say that when we inserted data2 with a hash function there was a collision with data1.
In sequential storage we would normally try to put data2 right after data1.
If we try to lookup data2 through our hash function we would again land at data1, so we would have to search linearly for data2.
Now let's suppose that data1 is deleted.
If we lookup data2 again we would still land at data1's location, but this time there's no collision, ergo, there's no linear correction to reach data2.
This is why when something is deleted on disk we don't actually erase it.
Instead we simply _mark_ or _flag_ a block for deletion.
This means we still get a collision so that we can linearly correct for data2.
The other side to this is that if we do another insert that collides with data1's location we are allowed to overwrite that data because it has been marked for deletion.
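The tombstone behavior can be sketched with a toy open-addressing table. The sentinel names and linear-probing scheme here are illustrative, not SQLite's actual on-disk format:

```python
# Open-addressing hash table with tombstones: deletes only mark the slot,
# so lookups keep probing past deleted entries instead of stopping early.
SIZE = 8
EMPTY, DELETED = object(), object()
slots = [EMPTY] * SIZE

def insert(key, value):
    i = key % SIZE
    while slots[i] not in (EMPTY, DELETED) and slots[i][0] != key:
        i = (i + 1) % SIZE          # linear probing on collision
    slots[i] = (key, value)         # tombstoned slots may be overwritten

def delete(key):
    i = key % SIZE
    while slots[i] is not EMPTY:
        if slots[i] is not DELETED and slots[i][0] == key:
            slots[i] = DELETED      # mark, don't erase
            return
        i = (i + 1) % SIZE

def lookup(key):
    i = key % SIZE
    while slots[i] is not EMPTY:    # tombstones do NOT stop the probe
        if slots[i] is not DELETED and slots[i][0] == key:
            return slots[i][1]
        i = (i + 1) % SIZE
    return None

insert(1, "data1")
insert(9, "data2")   # 9 % 8 == 1, so data2 lands one slot past data1
delete(1)
print(lookup(9))     # still found, because slot 1 holds a tombstone
```

If `delete` had set the slot back to `EMPTY`, the `lookup(9)` probe would have stopped at slot 1 and wrongly reported data2 missing.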
## 'where' clause
Let's say we have the following query:
```
... where condition or other_condition;
```
By default the database will try to optimize the query, effectively replacing it with its own equivalent version tailored specifically to that task.
We can also use `and`'s with the `where` clause, which the database must also evaluate to create a more efficient query.
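We can actually ask SQLite what it plans to do with `explain query plan`; the table and index names here are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table censusData (name text, age int)")
conn.execute("create index ageIdx on censusData(age)")

# Ask SQLite how it plans to run the query rather than running it.
plan = conn.execute(
    "explain query plan select name from censusData where age < 30"
).fetchall()
print(plan[0][-1])   # e.g. SEARCH censusData USING INDEX ageIdx (age<?)
```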
## Pages in Memory
If we have a large table that won't fit into memory we can partition that table so that when we push it into memory it fits in our pages.
We can do a first pass where we sort individual partitions in the memory pages.
This first pass temporarily writes our sorted partitions to the disk, where we can then merge the partitions against each other, writing the result to some output.
The previous temporary files from earlier can then be marked for deletion.
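The two passes above can be sketched with Python's `heapq.merge` standing in for the merge step (toy data and a tiny pretend page size):

```python
import heapq
import os
import tempfile

data = [5, 3, 8, 1, 9, 2, 7, 4, 6, 0]
PAGE = 3                                  # pretend only 3 records fit in memory

# Pass 1: sort each in-memory partition and spill it to a temp file.
paths = []
for i in range(0, len(data), PAGE):
    run = sorted(data[i:i + PAGE])
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(map(str, run)) + "\n")
    paths.append(path)

# Pass 2: merge the sorted runs against each other into one output.
files = [open(p) for p in paths]
merged = [int(line) for line in heapq.merge(*files, key=int)]

for f in files:
    f.close()
for p in paths:
    os.remove(p)                          # the temp runs can now be deleted

print(merged)
```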

61
cst363/lec/lec16.md Normal file

@@ -0,0 +1,61 @@
# lec16
Let's now go into how we build a utility like the _transaction_ in SQLite.
## Problem Statement
> Why we need or care about transactions.
If we have tons of users all trying to access a database to, say, reserve a hotel room, we need to make sure that the operations don't fail or write over each other.
Otherwise we're going to have tons of undefined behavior.
## A.C.I.D Principles
### Atomicity
Mnemonically: _all or nothing_
Either everything in our transaction happens, or none of it happens.
The reason why we care about this is because we want to be able to _recover_ from problems, like a power outage for instance, or some error which causes a halt.
To achieve this we have to log everything we're going to do.
Before we do anything in our transactions, we log what we're going to do, what changes are being made and what those changes are.
WAL: _write-ahead logging_
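sqlite3 exposes this mode directly; a quick check, assuming a file-backed database (an in-memory one has no persistent log to write ahead to):

```python
import os
import sqlite3
import tempfile

# WAL needs a file-backed database, so put one in a temp directory.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)

# SQLite reports back the journal mode it actually switched to.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)   # wal
```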
### Consistency
Like the name implies, a transaction should leave the database in a consistent state every time: any constraints that held before the transaction still hold after it.
### Isolation
Transactions should never be able to peek into another transaction.
As the name implies the transaction runs alone.
### Durability
Essentially once we reach the end of a transaction we should commit those changes to the database.
This way if something goes wrong, where the whole database needs to be shutdown, our changes should still be there.
_Basically this means that we dump anything in our transaction buffer onto disk_.
To achieve this we must verify that the changes were actually committed to the disk.
## Serializability
What we ultimately want is to be able to operate on multiple transactions while also being able to get the same result as if we had done everything in linear order.
We want that result because it maintains isolation for each transaction.
## Transaction Schedule
If we have two complex transactions that need to run then we can schedule them in some manner.
Sometimes it means that we do one transaction first then another, and sometimes it means we do pieces of both in some order.
The latter is known as _interleaving_.
Just like individual transactions we can serialize schedules.
### More on interleaving
We mentioned interleaving earlier.
Basically this just means that we run part of one transaction then another part of a _different_ transaction.
We only do this if the result of this operation is the same as running them in a serialized fashion.

47
cst363/lec/lec17.md Normal file

@@ -0,0 +1,47 @@
# lec17
In the previous lecture we covered methods and principles of transactions.
This time around we'll take care of proper ordering of operations.
## Operation Order
If two transactions work on two different data items then we know their results can't collide, therefore the order doesn't matter.
The order matters if there is a collision between transactions on similar data.
_Conflict Serializability_: the ability to reorder an interleaved schedule into a serialized schedule while preserving the result of every conflicting operation from start to end.
## Determining Serializability
We can go through a schedule where each transaction is placed into a graph as a node.
We draw edges from each node to another if, say, we run into a read in transaction-A followed later on by an opposing write action in another transaction.
The opposite also applies.
Our schedule is not serializable if we have a cycle in the resulting graph.
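A sketch of that check: build the precedence graph and look for a back edge with a depth-first search (the transaction names are made up):

```python
# Precedence graph for a schedule: nodes are transactions, an edge T1 -> T2
# means T1 must come before T2 because of a conflicting operation.
edges = {"T1": ["T2"], "T2": ["T3"], "T3": ["T1"]}   # cycle: not serializable

def has_cycle(graph):
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / in progress / done
    color = {n: WHITE for n in graph}

    def visit(n):
        color[n] = GRAY
        for m in graph.get(n, []):
            if color.get(m, WHITE) == GRAY:
                return True               # back edge -> cycle found
            if color.get(m, WHITE) == WHITE and visit(m):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)

print(has_cycle(edges))                     # this schedule is not serializable
print(has_cycle({"T1": ["T2"], "T2": []}))  # this one is fine
```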
## Locks
Exclusive lock: process locks the database for itself.
Shared lock: allows others to put locks on the database, but not exclusive locks.
There are some drawbacks to using locks, especially if done poorly.
If transaction-A locks some data, say exclusively, but doesn't release the lock before another transaction tries to use that data, we may end up in a state where everyone is locked out of certain data.
For this reason we use a special locking protocol to take care of this exact scenario.
The state where everyone is locked out of something is called a deadlock.
### Two-Phase locking Protocol
The two phases include the _growing_ and _shrinking_ phase.
This means we are getting more and more locks before we finally release locks until there are none left.
We don't mix locks and unlocks however, so `[lock lock free lock free free]` isn't valid but `[lock lock lock free free free]` is fine.
We get two main advantages from this:
1. Serializability is maintained
2. Deadlocks are easy to find
Keep in mind however, deadlocks still happen with this protocol.
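Using the `[lock ... free ...]` notation from above, a tiny checker for whether a sequence is two-phase:

```python
# Two-phase check: once any lock has been released ("free"), no further
# locks may be acquired.
def is_two_phase(ops):
    seen_unlock = False
    for op in ops:
        if op == "free":
            seen_unlock = True
        elif seen_unlock:        # a lock after any unlock breaks the protocol
            return False
    return True

print(is_two_phase(["lock", "lock", "lock", "free", "free", "free"]))  # valid
print(is_two_phase(["lock", "lock", "free", "lock", "free", "free"]))  # not
```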

22
cst363/lec/lec18.md Normal file

@@ -0,0 +1,22 @@
# lec18
Using graphs & trees to avoid deadlocks.
## Trees
Say we have a tree filled with some data that we want to access.
With our first accessing into the tree we may lock whichever node we want, however, every subsequent lock after that point must happen _only_ if the parent node to that target is locked.
The main disadvantage to this methodology is that if we want to access the root node and a leaf node, it means we must do a lot of intermediary locking.
## Snapshot Isolation
For this strategy we're going to scrap the idea of using locks, graphs, or even trees.
Instead, when a transaction is about to run, we take a snapshot of everything we're going to modify, then work from there.
When we commit on the first transaction we'll query to see if anything else has changed the data we're trying to write to.
If nothing comes up we commit with no issue.
If something does come up we abort and restart the transaction with a new snapshot, _this time with the new stuff_.
This time around we should be ok to commit.
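A toy version-number sketch of that commit/abort/retry loop (the names and structure are illustrative, not any real engine's):

```python
# Optimistic snapshot sketch: each item carries a version number, and a
# transaction commits only if nothing it read has changed since its snapshot.
store = {"rooms": (5, 0)}                 # name -> (value, version)

def begin():
    return dict(store)                    # take a snapshot

def commit(snapshot, writes):
    for key in writes:
        if store[key][1] != snapshot[key][1]:
            return False                  # data changed underneath us: abort
    for key, value in writes.items():
        store[key] = (value, store[key][1] + 1)
    return True

t1, t2 = begin(), begin()
first = commit(t1, {"rooms": t1["rooms"][0] - 1})   # commits fine
second = commit(t2, {"rooms": t2["rooms"][0] - 1})  # aborts: t1 got there first

t2 = begin()                                        # restart with a new snapshot
retry = commit(t2, {"rooms": t2["rooms"][0] - 1})
print(first, second, retry)
```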
The overhead hits hard if we constantly have to correct transactions but, if we don't find ourselves doing that too much, then it beats graphs and trees since there's barely anything to maintain.

63
cst363/lec/lec19.md Normal file

@@ -0,0 +1,63 @@
# lec19
Let's say you are asked to model a business's data.
The first question you may have is, _where do I even begin_.
If you are in charge of this kind of project there are some things to consider:
* Scalability for future changes
* What kind of data you are dealing with
## Design process
1. Understand the problem at hand and needs of users
2. Create conceptual design
## Entity-Relationship (ER) Models
Used for conceptual design.
There are 3 building blocks for an ER model:
* Entities
* Relationships
* Attributes
For the rest of this lecture we'll be using a book store as our working example to get through this concept.
## Entity & Entity Set
Often with entities we're really referring to real world _things_ which have properties of their own.
An entity for a bookstore would just be something like a book, author, or a publisher.
We would have to consider these things when modeling the pertinent data in regards to the business or organization, because we have to keep track of these things to ensure that the business runs smoothly.
If we keep track of things like the books in our store, then we might avoid accidentally ordering too many of the same book, or running out of one specifically.
## Relationships
A _relationship_ is simply an association between entities.
Furthermore, entities can participate in relationships by simply being related to some other entity.
Coming back to our example, books and publishers are two entities which participate in a relationship together.
Likewise we can have two books which may be part of a long-running series, which means they should be related.
This would mean we have a book in a relationship with a book.
Each book however, takes on a different _role_ in the relationship; perhaps one book is the sequel to the other.
A clearer example might be that a book has an author, which means the two must be related, therefore we may create some kind of relationship between them.
Likewise the same book entity may participate in another relationship if appropriate, like with the publisher.
## Mapping Cardinalities
> One-to-One Mapping
Say we have two entity sets, where each entity in each set is related to one entity in the opposing set [_ex. every office has 1 instructor_].
A is in a relationship with B or, B is in a relationship with A.
Both explanations are fine and valid.
> One-to-Many/Many-to-One
Each entity in one set is related to _at most_ one entity from the other set [_ex. a student can have at most one advisor_].
The advisor-to-student direction is one-to-many while the student-to-advisor direction is many-to-one.
> Many-to-Many
Each entity in a given set may have zero or more relationships with entities in an opposing entity set.
Likewise the inverse is also true.
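One way these cardinalities land in an actual schema; the table and column names are invented for the bookstore example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Many-to-one: each book has exactly one publisher,
    -- while a publisher can have many books.
    create table publisher (id integer primary key, name text);
    create table book (
        id integer primary key,
        title text,
        publisher_id integer references publisher(id)
    );
    -- Many-to-many: a book can have several authors and vice versa,
    -- so the relationship gets its own table.
    create table author (id integer primary key, name text);
    create table wrote (
        author_id integer references author(id),
        book_id integer references book(id),
        primary key (author_id, book_id)
    );
""")

conn.execute("insert into publisher values (1, 'Orbit')")
conn.execute("insert into book values (1, 'Book One', 1), (2, 'Book Two', 1)")
rows = conn.execute("""
    select book.title from book
    join publisher on publisher.id = book.publisher_id
    where publisher.name = 'Orbit'
    order by book.id
""").fetchall()
print(rows)
```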

View File

@@ -6,3 +6,13 @@ Notes are found in `lec/` directory and lab excercises are found in `lab/` direc
If you want to compile these lessons to a pdf then use pandoc.
Tables are usually the largest source of problems so keep that in mind.
# CST 363
Introduction to Databases and SQLite3
## Labs and Practice problems
This course uses a lot of exercises to practice concepts and new information.
I urge you to _try all the labs_ since it makes understanding difficult things trivial.