more updated content to display on the new site

337/lec/lec10.md
@@ -1,41 +1,169 @@

# lec11

lec1
====
At this point I'll mention that just reading isn't going to get you anywhere; you have to try things and give them a real, earnest attempt.

> What on earth?
__ALU:__ Arithmetic Logic Unit

The first lecture has been 50% syllabus, 25% videos, and 25% simple
terminology; expect nothing interesting for this section.
## Building a 1-bit ALU

General Performance Improvements in software
--------------------------------------------

*(figure: a 1-bit ALU)*

In general we have a few options to increase performance in software:
pipelining, parallelism, and prediction.

First we'll create an example _ALU_ which implements choosing between `and`, `or`, `xor`, or `add`.
Whether or not our amazing _ALU_ is useful doesn't matter, so we'll go one function at a time (besides `and`/`or`, which we build together).

1. Parallelism

First recognize that we need to choose between `and` or `or` applied to our two inputs A and B.
This means we have two results, `A and B` and `A or B`, and we need to select between them.
_Try to do this on your own first!_
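
As a software analogy for that and/or selection step (my sketch, not the lecture's; the real thing is gates, not C): compute both results and let a mux-style select line pick one.

``` {.c}
/* 1-bit and/or "ALU": both results are computed, and op acts as
 * the mux select line choosing which one reaches the output. */
int and_or_alu(int a, int b, int op) {
    int and_out = a & b;            /* candidate 0 */
    int or_out  = a | b;            /* candidate 1 */
    return op ? or_out : and_out;   /* the mux */
}
```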

If we have multiple tasks to accomplish, or multiple sources of data, we
might instead find it better to work on multiple things at
once [e.g. multi-threading, multi-core rendering].
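
As a concrete sketch of that (entirely illustrative; the task and input names are made up), two POSIX threads working on separate data sources at once:

``` {.c}
#include <pthread.h>
#include <stdio.h>

/* Each thread works on its own data source at the same time. */
static void *task(void *name) {
    printf("processing %s\n", (const char *)name);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, task, "file A");
    pthread_create(&t2, NULL, task, "file B");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```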

*(figure: the and/or mux)*

2. Pipelining

Next we'll add on the `xor`.
Try doing this on your own, but as far as hints go: don't be afraid to make changes to the mux.

Here we are taking *data* and serializing it into a linear form.
We do things like this because it can make sense to process things
linearly [e.g. taking data from a website response and forming it into a
struct/class instance in C++/Java et al.].
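
A tiny sketch of that response-into-struct step (the wire format and `User` struct are invented for illustration):

``` {.c}
#include <stdio.h>

struct User { char name[32]; int age; };

int main(void) {
    /* pretend this line came back from a website response */
    const char *response = "alice 30";
    struct User u;
    /* linear text in, structured data out */
    if (sscanf(response, "%31s %d", u.name, &u.age) == 2)
        printf("name=%s age=%d\n", u.name, u.age);
    return 0;
}
```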

*(figure: the ALU with and, or, and xor)*

3. Prediction

Finally we'll add the ability to add and subtract.
You may have also noted that we can subtract two things to see if they are the same; however, we can also `not` the result of the `xor` and get the same result.

If we can predict an outcome and avoid a bunch of computation, then it
could be worth taking our prediction and proceeding with that instead of
doing the full computation up front. This happens **a lot** in CPUs, which use what's called
[branch prediction](https://danluu.com/branch-prediction/) to run even
faster.
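
A classic way to see this for yourself (my construction, and the timing is crude): the branch below runs noticeably faster once the data is sorted, because the predictor starts guessing right almost every time.

``` {.c}
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int cmp(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    enum { N = 1 << 20 };
    static int data[N];
    for (int i = 0; i < N; i++) data[i] = rand() % 256;
    qsort(data, N, sizeof data[0], cmp);  /* comment out and re-time */

    clock_t start = clock();
    long long sum = 0;
    for (int pass = 0; pass < 100; pass++)
        for (int i = 0; i < N; i++)
            if (data[i] >= 128)           /* predictable once sorted */
                sum += data[i];
    printf("sum=%lld, %.2fs\n", sum,
           (double)(clock() - start) / CLOCKS_PER_SEC);
    return 0;
}
```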

*(figure: the completed 1-bit ALU)*

Cost of Such Improvements
-------------------------

At this point our _ALU_ can `and`, `or`, `xor`, and `add`/`sub`.
The mux will choose which logic block to use; the carry-in line will tell the `add` logic block whether to add or subtract.
Finally, the A-invert and B-invert lines let us decide whether to invert either of the inputs, A or B.
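
Putting those control lines together, here is a hedged software model of the finished 1-bit ALU (the signal names are mine; subtract works by setting B-invert and carry-in to 1, i.e. adding the two's complement):

``` {.c}
/* One ALU bit. op muxes the logic block (0=and, 1=or, 2=xor,
 * 3=add); a_invert/b_invert optionally flip the inputs; carry_in
 * feeds the adder; *carry_out lets bits be chained together. */
int alu_bit(int a, int b, int a_invert, int b_invert,
            int carry_in, int op, int *carry_out) {
    a = a_invert ? !a : a;
    b = b_invert ? !b : b;
    *carry_out = (a & b) | (a & carry_in) | (b & carry_in);
    switch (op) {                          /* the mux */
        case 0:  return a & b;
        case 1:  return a | b;
        case 2:  return a ^ b;
        default: return a ^ b ^ carry_in;  /* one-bit sum */
    }
}
```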

As the saying goes, every decision you make as an engineer ultimately
has a cost; let's look at the cost of these improvements.

## N-bit ALU

1. Parallelism

For sanity we'll use the following block for our new ALU.

If we have a data set which has some form of inter-dependencies between
its members, then we could easily run into the issue of waiting on other
things to finish.

*(figure: the N-bit ALU block)*

Contrived Example:

Note that we are chaining the carry-outs to the carry-ins, just like a ripple adder.
Also, each ALU just works with `1` bit from our given 4-bit input.
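
Sketching that chaining in software, assuming the `alu_bit` helper from the 1-bit section (again, my naming rather than the lecture's):

``` {.c}
/* 4-bit ripple ALU: bit i's carry_out feeds bit i+1's carry_in.
 * With op=3 and b_invert=1, the initial carry of 1 makes this
 * compute a - b in two's complement, like the add/sub line above. */
int alu4(int a, int b, int b_invert, int op) {
    int carry = b_invert;   /* carry into bit 0 starts a subtract */
    int result = 0;
    for (int i = 0; i < 4; i++) {
        int carry_out;
        int bit = alu_bit((a >> i) & 1, (b >> i) & 1,
                          0, b_invert, carry, op, &carry_out);
        result |= bit << i;
        carry = carry_out;
    }
    return result & 0xF;    /* keep it to 4 bits */
}
```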

Premise: output file contents -> search lines for some text -> sort the resulting lines

We have to do the following processes:

1. print my-file.data
2. search the file
3. sort the results of the search

In bash we might do:

``` {.bash}
cat my-file.data | grep 'Text to search for' | sort
```

Parallelism doesn't make sense here for one reason: this series of
processes doesn't benefit from parallelism, because the 2nd and 3rd tasks
*must* wait until the previous ones finish first.

2. Pipelining

Let's say we want to do the following:

Search file1 for some text: [search file1]
Feed the results of the search into a sorting program: [sort]

Search file2 for some text: [search file2]
Feed the results of the search into a reverse sorting program: [reverse sort]

The resulting Directed Acyclic Graph looks like:

    [search file1] => [sort]

    [search file2] => [reverse sort]

Making the above linear means we effectively have to:

    [search file1] => [sort] [search file2] => [reverse sort]
    | proc2 waiting ........|

This wastes a lot of time if the previous process is going to take a
long time. Bonus points if process 2 is extremely short.
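
One hedged sketch of avoiding that wait (my own, reusing the commands from the example): hand the two independent pipelines to the OS as separate processes so neither waits on the other.

``` {.c}
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    if (fork() == 0) {   /* child 1: first pipeline */
        execl("/bin/sh", "sh", "-c",
              "grep 'some text' file1 | sort", (char *)NULL);
        _exit(1);
    }
    if (fork() == 0) {   /* child 2: second pipeline */
        execl("/bin/sh", "sh", "-c",
              "grep 'some text' file2 | sort -r", (char *)NULL);
        _exit(1);
    }
    while (wait(NULL) > 0) {}   /* reap both children */
    return 0;
}
```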

3. Prediction

OK, two things up front:

- First: prediction's fault is that we could be wrong and end up having
  to do the hard computations anyway.
- Second: *this course never covers branch prediction (something that
  pretty much every CPU of the last 20 years does)*, so I'm
  gonna cover it here; ready, let's go.

For starters, let's say a basic CPU takes instructions sequentially in
memory: `A B C D`. However, this is kinda slow because there is *time*
between getting an instruction, decoding it to know what instruction it is,
and finally executing it proper. For this reason modern CPUs actually
fetch, decode, and execute (and more!) instructions all at the same time.

Instead of getting instructions like this:

    0
    AA
    BB
    CC
    DD

We actually do something more like this:

    A
    AB
    BC
    CD
    D0

(Read each row as one clock step; in the second scheme `AB` means A is executing while B is already being fetched, and `0` marks an idle stage, so the stages overlap instead of sitting idle.)
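
To make the overlap concrete, here is a toy two-stage simulation (entirely my own construction, not from the lecture):

``` {.c}
#include <stdio.h>

/* Prints which instruction sits in each stage per clock step:
 * while insts[i-1] executes, insts[i] is already being fetched. */
int main(void) {
    const char insts[] = "ABCD";
    const int n = 4;
    for (int step = 0; step <= n; step++) {
        char exec  = (step > 0) ? insts[step - 1] : '0';
        char fetch = (step < n) ? insts[step]     : '0';
        printf("step %d: execute %c, fetch %c\n", step, exec, fetch);
    }
    return 0;
}
```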

If it doesn't seem like much, remember this is half an instruction's time saved on a
chip that is likely going to process thousands/millions of instructions,
so the savings scale really well.

This scheme is fine if our instructions are all coming one after the
other in memory, but if we need to branch then we likely need to jump to
a new location, like so:

    ABCDEFGHIJKL
    ^^^*    ^
       |----|

Now say we have the following code:

``` {.c}
if (x == 123) {
    main_call();
}
else {
    alternate_call();
}
```

The (pseudo)assembly might look like:

``` {.asm}
    cmp x, 123
    jne second          ; take the alternate branch when x != 123
main_branch:            ; pointless label but nice for reading
    call main_call
    jmp end
second:
    call alternate_call
end:
    ; something to do here
```

Our problem comes when we hit the `jne`. Once we've loaded that instruction
and can start executing it, we have to make a decision: load the
`call main_call` instruction or the `call alternate_call`? Chances are
that if we guess, we have a 50% chance of saving time and a 50% chance of
tossing out our guess and starting the whole *get instruction => decode
etc.* process over again from scratch.

Solution 1:

Try to determine which branches are taken prior to running the program
and just always guess the more likely branches. If we find that the
above branch goes to `main_branch` more often, then we should always load that
branch, knowing that the loss from being wrong is offset by the
gain from the statistically more often correct branches.
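
Real code sometimes bakes this guess in by hand; GCC and Clang expose `__builtin_expect` for exactly this (the sketch mirrors the example above, and the `likely`/`unlikely` macros are a common convention, not something from this course):

``` {.c}
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

void main_call(void);
void alternate_call(void);

void dispatch(int x) {
    if (likely(x == 123)) {   /* profiling said this path dominates */
        main_call();
    } else {
        alternate_call();
    }
}
```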
...