more updated content to display on the new site

337/lec/lec10.md
@@ -1,41 +1,169 @@

# lec11

lec1
====
At this point I'll mention that just reading isn't going to get you anywhere; you have to try things and give them a real, earnest attempt.

> What on earth?
__ALU:__ Arithmetic Logic Unit

The first lecture has been 50% syllabus, 25% videos, and 25% simple
terminology; expect nothing interesting for this section.
## Building a 1-bit ALU

General Performance Improvements in software
--------------------------------------------

*(figure: a 1-bit ALU)*

In general we have a few options to increase performance in software:
pipelining, parallelism, and prediction.

First we'll create an example _ALU_ which implements choosing between `and`, `or`, `xor`, or `add`.
Whether or not our amazing _ALU_ is useful doesn't matter, so we'll go one function at a time (besides `and`/`or`, which we build together).

1. Parallelism

First recognize that we need to choose between `and` or `or` applied to our two inputs A and B.
This means we have two results, `A and B` and `A or B`, and we need to select between them.
_Try to do this on your own first!_
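
As a software analogy for that and/or selection step (my sketch, not the lecture's; the real thing is gates, not C): compute both results and let a mux-style select line pick one.

``` {.c}
/* 1-bit and/or "ALU": both results are computed, and op acts as
 * the mux select line choosing which one reaches the output. */
int and_or_alu(int a, int b, int op) {
    int and_out = a & b;            /* candidate 0 */
    int or_out  = a | b;            /* candidate 1 */
    return op ? or_out : and_out;   /* the mux */
}
```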

If we have multiple tasks to accomplish, or multiple sources of data, we
might instead find it better to work on multiple things at
once [e.g. multi-threading, multi-core rendering].
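
As a concrete sketch of that (entirely illustrative; the task and input names are made up), two POSIX threads working on separate data sources at once:

``` {.c}
#include <pthread.h>
#include <stdio.h>

/* Each thread works on its own data source at the same time. */
static void *task(void *name) {
    printf("processing %s\n", (const char *)name);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, task, "file A");
    pthread_create(&t2, NULL, task, "file B");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```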

*(figure: the and/or mux)*

2. Pipelining

Next we'll add on the `xor`.
Try doing this on your own, but as far as hints go: don't be afraid to make changes to the mux.

Here we are taking *data* and serializing it into a linear form.
We do things like this because it can make sense to process things
linearly [e.g. taking data from a website response and forming it into a
struct/class instance in C++/Java et al.].
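
A tiny sketch of that response-into-struct step (the wire format and `User` struct are invented for illustration):

``` {.c}
#include <stdio.h>

struct User { char name[32]; int age; };

int main(void) {
    /* pretend this line came back from a website response */
    const char *response = "alice 30";
    struct User u;
    /* linear text in, structured data out */
    if (sscanf(response, "%31s %d", u.name, &u.age) == 2)
        printf("name=%s age=%d\n", u.name, u.age);
    return 0;
}
```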

*(figure: the ALU with and, or, and xor)*

3. Prediction

Finally we'll add the ability to add and subtract.
You may have also noted that we can subtract two things to see if they are the same; however, we can also `not` the result of the `xor` and get the same result.

If we can predict an outcome and avoid a bunch of computation, then it
could be worth taking our prediction and proceeding with that instead of
doing the full computation up front. This happens **a lot** in CPUs, which use what's called
[branch prediction](https://danluu.com/branch-prediction/) to run even
faster.
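
A classic way to see this for yourself (my construction, and the timing is crude): the branch below runs noticeably faster once the data is sorted, because the predictor starts guessing right almost every time.

``` {.c}
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int cmp(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    enum { N = 1 << 20 };
    static int data[N];
    for (int i = 0; i < N; i++) data[i] = rand() % 256;
    qsort(data, N, sizeof data[0], cmp);  /* comment out and re-time */

    clock_t start = clock();
    long long sum = 0;
    for (int pass = 0; pass < 100; pass++)
        for (int i = 0; i < N; i++)
            if (data[i] >= 128)           /* predictable once sorted */
                sum += data[i];
    printf("sum=%lld, %.2fs\n", sum,
           (double)(clock() - start) / CLOCKS_PER_SEC);
    return 0;
}
```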

*(figure: the completed 1-bit ALU)*

Cost of Such Improvements
-------------------------

At this point our _ALU_ can `and`, `or`, `xor`, and `add`/`sub`.
The mux will choose which logic block to use; the carry-in line will tell the `add` logic block whether to add or subtract.
Finally, the A-invert and B-invert lines let us decide whether to invert either of the inputs, A or B.
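
Putting those control lines together, here is a hedged software model of the finished 1-bit ALU (the signal names are mine; subtract works by setting B-invert and carry-in to 1, i.e. adding the two's complement):

``` {.c}
/* One ALU bit. op muxes the logic block (0=and, 1=or, 2=xor,
 * 3=add); a_invert/b_invert optionally flip the inputs; carry_in
 * feeds the adder; *carry_out lets bits be chained together. */
int alu_bit(int a, int b, int a_invert, int b_invert,
            int carry_in, int op, int *carry_out) {
    a = a_invert ? !a : a;
    b = b_invert ? !b : b;
    *carry_out = (a & b) | (a & carry_in) | (b & carry_in);
    switch (op) {                          /* the mux */
        case 0:  return a & b;
        case 1:  return a | b;
        case 2:  return a ^ b;
        default: return a ^ b ^ carry_in;  /* one-bit sum */
    }
}
```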

As the saying goes, every decision you make as an engineer ultimately
has a cost; let's look at the cost of these improvements.

## N-bit ALU

1. Parallelism

For sanity we'll use the following block for our new ALU.

If we have a data set which has some form of inter-dependencies between
its members, then we could easily run into the issue of waiting on other
things to finish.

*(figure: the N-bit ALU block)*

Contrived Example:

Note that we are chaining the carry-outs to the carry-ins, just like a ripple adder.
Also, each ALU just works with `1` bit from our given 4-bit input.
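
Sketching that chaining in software, assuming the `alu_bit` helper from the 1-bit section (again, my naming rather than the lecture's):

``` {.c}
/* 4-bit ripple ALU: bit i's carry_out feeds bit i+1's carry_in.
 * With op=3 and b_invert=1, the initial carry of 1 makes this
 * compute a - b in two's complement, like the add/sub line above. */
int alu4(int a, int b, int b_invert, int op) {
    int carry = b_invert;   /* carry into bit 0 starts a subtract */
    int result = 0;
    for (int i = 0; i < 4; i++) {
        int carry_out;
        int bit = alu_bit((a >> i) & 1, (b >> i) & 1,
                          0, b_invert, carry, op, &carry_out);
        result |= bit << i;
        carry = carry_out;
    }
    return result & 0xF;    /* keep it to 4 bits */
}
```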

Premise: output file contents -> search lines for some text -> sort the resulting lines

We have to do the following processes:

1. print my-file.data
2. search the file
3. sort the results of the search

In bash we might do:

``` {.bash}
cat my-file.data | grep 'Text to search for' | sort
```

Parallelism doesn't make sense here for one reason: this series of
processes doesn't benefit from parallelism, because the 2nd and 3rd tasks
*must* wait until the previous ones finish first.

2. Pipelining

Let's say we want to do the following:

Search file1 for some text: [search file1]
Feed the results of the search into a sorting program: [sort]

Search file2 for some text: [search file2]
Feed the results of the search into a reverse sorting program: [reverse sort]

The resulting Directed Acyclic Graph looks like:

    [search file1] => [sort]

    [search file2] => [reverse sort]

Making the above linear means we effectively have to:

    [search file1] => [sort] [search file2] => [reverse sort]
    | proc2 waiting ........|

This wastes a lot of time if the previous process is going to take a
long time. Bonus points if process 2 is extremely short.
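
One hedged sketch of avoiding that wait (my own, reusing the commands from the example): hand the two independent pipelines to the OS as separate processes so neither waits on the other.

``` {.c}
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    if (fork() == 0) {   /* child 1: first pipeline */
        execl("/bin/sh", "sh", "-c",
              "grep 'some text' file1 | sort", (char *)NULL);
        _exit(1);
    }
    if (fork() == 0) {   /* child 2: second pipeline */
        execl("/bin/sh", "sh", "-c",
              "grep 'some text' file2 | sort -r", (char *)NULL);
        _exit(1);
    }
    while (wait(NULL) > 0) {}   /* reap both children */
    return 0;
}
```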

3. Prediction

OK, two things up front:

- First: prediction's fault is that we could be wrong and end up having
  to do the hard computations anyway.
- Second: *this course never covers branch prediction (something that
  pretty much every CPU of the last 20 years does)*, so I'm
  gonna cover it here; ready, let's go.

For starters, let's say a basic CPU takes instructions sequentially in
memory: `A B C D`. However, this is kinda slow because there is *time*
between getting an instruction, decoding it to know what instruction it is,
and finally executing it proper. For this reason modern CPUs actually
fetch, decode, and execute (and more!) instructions all at the same time.

Instead of getting instructions like this:

    0
    AA
    BB
    CC
    DD

We actually do something more like this:

    A
    AB
    BC
    CD
    D0

(Read each row as one clock step; in the second scheme `AB` means A is executing while B is already being fetched, and `0` marks an idle stage, so the stages overlap instead of sitting idle.)
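
To make the overlap concrete, here is a toy two-stage simulation (entirely my own construction, not from the lecture):

``` {.c}
#include <stdio.h>

/* Prints which instruction sits in each stage per clock step:
 * while insts[i-1] executes, insts[i] is already being fetched. */
int main(void) {
    const char insts[] = "ABCD";
    const int n = 4;
    for (int step = 0; step <= n; step++) {
        char exec  = (step > 0) ? insts[step - 1] : '0';
        char fetch = (step < n) ? insts[step]     : '0';
        printf("step %d: execute %c, fetch %c\n", step, exec, fetch);
    }
    return 0;
}
```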

If it doesn't seem like much, remember this is half an instruction's time saved on a
chip that is likely going to process thousands/millions of instructions,
so the savings scale really well.

This scheme is fine if our instructions are all coming one after the
other in memory, but if we need to branch then we likely need to jump to
a new location, like so:

    ABCDEFGHIJKL
    ^^^*    ^
       |----|

Now say we have the following code:

``` {.c}
if (x == 123) {
    main_call();
}
else {
    alternate_call();
}
```

The (pseudo)assembly might look like:

``` {.asm}
    cmp x, 123
    jne second          ; take the alternate branch when x != 123
main_branch:            ; pointless label but nice for reading
    call main_call
    jmp end
second:
    call alternate_call
end:
    ; something to do here
```

Our problem comes when we hit the `jne`. Once we've loaded that instruction
and can start executing it, we have to make a decision: load the
`call main_call` instruction or the `call alternate_call`? Chances are
that if we guess, we have a 50% chance of saving time and a 50% chance of
tossing out our guess and starting the whole *get instruction => decode
etc.* process over again from scratch.

Solution 1:

Try to determine which branches are taken prior to running the program
and just always guess the more likely branches. If we find that the
above branch goes to `main_branch` more often, then we should always load that
branch, knowing that the loss from being wrong is offset by the
gain from the statistically more often correct branches.
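
Real code sometimes bakes this guess in by hand; GCC and Clang expose `__builtin_expect` for exactly this (the sketch mirrors the example above, and the `likely`/`unlikely` macros are a common convention, not something from this course):

``` {.c}
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

void main_call(void);
void alternate_call(void);

void dispatch(int x) {
    if (likely(x == 123)) {   /* profiling said this path dominates */
        main_call();
    } else {
        alternate_call();
    }
}
```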
...