The first lecture has been 50% syllabus, 25% videos, 25% simple terminology; expect nothing interesting for this section.
## General Performance Improvements in Software
In general we have a few options to increase performance in software: pipelining, parallelism, and prediction.
1. Parallelism

If we have multiple tasks to accomplish or multiple sources of data, we might find it better to work on multiple things at once [e.g. multi-threading, multi-core rendering].
2. Pipelining

Here we are somehow taking _data_ and serializing it into a linear form.

We do things like this because it can make sense to process things linearly [e.g. taking data from a website response and forming it into a struct/class instance in C++/Java et al.].
3. Prediction

If we can predict an outcome and avoid a bunch of computation, then it could be worth taking our prediction and proceeding with that instead.

This happens **a lot** in CPUs, where they use what's called [branch prediction](https://danluu.com/branch-prediction/) to run even faster.
## Cost of Such Improvements

As the saying goes, every decision you make as an engineer ultimately has a cost; let's look at the cost of these improvements.
1. Parallelism

If we have a data set with some form of inter-dependencies between its members, then we can easily run into the issue of waiting on other things to finish.

Contrived Example:

```
Premise: output file contents -> search lines for some text -> sort the resulting lines

We have to do the following processes:
print my-file.data
search file
sort results of the search

In bash we might do: cat my-file.data | grep 'Text to search for' | sort
```
Parallelism doesn't make sense here for one reason: this series of processes doesn't benefit from it, because the 2nd and 3rd tasks _must_ wait until the previous ones finish first.
2. Pipelining

Let's say we want to do the following:

```
Search file1 for some text: [search file1]
Feed the results of the search into a sorting program: [sort]

Search file2 for some text: [search file2]
Feed the results of the search into a reverse sorting program: [reverse sort]

The resulting Directed Acyclic Graph looks like:

[search file1] => [sort]

[search file2] => [reverse sort]
```
Making the above linear means we effectively have to:

```
[search file1] => [sort] [search file2] => [reverse sort]
| proc2 waiting........|
```
Which wastes a lot of time if the previous process is going to take a long time.
Bonus points if process 2 is extremely short.
3. Prediction

Ok, two things up front:

* First: prediction's fault is that we could be wrong and end up having to do the hard computations anyway.
* Second: _this course never covers branch prediction (something that pretty much every CPU in the last 20 years does)_ so I'm gonna cover it here; ready, let's go.
For starters, let's say a basic CPU takes instructions sequentially in memory: `A B C D`.
However this is kinda slow, because there is _time_ between fetching an instruction, decoding it to know what instruction it is, and finally executing it proper.
For this reason modern CPUs actually fetch, decode, and execute (and more!) instructions all at the same time.
Instead of getting instructions like this:

```
0
AA
BB
CC
DD
```
We actually do something more like this:

```
A
AB
BC
CD
D0
```
If it doesn't seem like much, remember this is half an instruction on a chip that is likely going to process thousands/millions of instructions, so the savings scale really well.
This scheme is fine if our instructions all come one after the other in memory, but if we need to branch then we likely need to jump to a new location, like so:

```
ABCDEFGHIJKL
^^^* ^
|-----|
```
Now say we have the following code:

```
if (x == 123) {
    main_call();
}
else {
    alternate_call();
}
```
The (pseudo)assembly might look like:

```asm
    cmp x, 123
    je second
main_branch: ; pointless label but nice for reading
    call main_call
    jmp end
second:
    call alternate_call
end:
    ; something to do here
```
Our problem comes when we hit the `je`.
Once we've loaded that instruction and can start executing it, we have to make a decision: load the `call main_call` instruction or the `call alternate_call`?
Chances are that if we guess, we have a 50% chance of saving time and a 50% chance of tossing out our guess and starting the whole _get instruction => decode etc._ process over again from scratch.
Solution 1:

Try to determine what branches are taken prior to running the program and just always guess the more likely branch.
If we find that the above branch calls `main_branch` more often, then we should always load that branch, knowing that the loss from being wrong is offset by the gain from the statistically more often correct branches.
...
_Try to do this on your own first!_

Next we'll add on the `xor`.
Try doing this on your own but as far as hints go: don't be afraid to make changes to the mux.

Finally we'll add the ability to add and subtract.
You may have also noted that we can subtract two things to see if they are the same; however, we can also `not` the result of the `xor` and get the same result.