# lec1
> What on earth?

The first lecture has been 50% syllabus, 25% videos, and 25% simple terminology; expect nothing interesting from this section.

## General Performance Improvements in Software

In general we have a few options to increase performance in software: pipelining, parallelism, and prediction.

1. Parallelism

If we have multiple tasks to accomplish, or multiple independent sources of data, we might find it better to work on multiple things at once (e.g. multi-threading, multi-core rendering).

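A minimal sketch of this (my example, not the lecture's): summing a big array in C++ by handing each thread its own independent chunk. The thread count and chunking scheme are illustrative.

```cpp
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

// Sum a large vector by giving each thread its own contiguous chunk.
// The chunks are independent, so no thread waits on another.
int main() {
    std::vector<long> data(1'000'000, 1);
    const unsigned n_threads = 4; // illustrative; real code would query the hardware
    std::vector<long> partial(n_threads, 0);
    std::vector<std::thread> workers;

    const std::size_t chunk = data.size() / n_threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = (t == n_threads - 1) ? data.size() : begin + chunk;
        workers.emplace_back([&, t, begin, end] {
            partial[t] = std::accumulate(data.begin() + begin, data.begin() + end, 0L);
        });
    }
    for (auto& w : workers) w.join();

    std::cout << std::accumulate(partial.begin(), partial.end(), 0L) << "\n";
}
```
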
2. Pipelining

Here we take _data_ and serialize it into a linear series of stages.

We do things like this because it can make sense to process data linearly (e.g. taking data from a website response and forming it into a struct/class instance in C++/Java et al.).

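A sketch of that response-to-struct idea; the `User` struct, the toy parser, and the canned "response" are all made up for illustration:

```cpp
#include <iostream>
#include <sstream>
#include <string>

// The structured form we want at the end of the pipeline.
struct User {
    std::string name;
    int age = 0;
};

// Stage 1: pretend this string came back from a website.
std::string fetch_response() { return "alice 30"; }

// Stage 2: parse the raw text into a struct.
User parse(const std::string& raw) {
    std::istringstream in(raw);
    User u;
    in >> u.name >> u.age;
    return u;
}

// Stage 3: consume the struct.
void print(const User& u) { std::cout << u.name << " is " << u.age << "\n"; }

int main() {
    // Each stage feeds the next, exactly like a shell pipeline.
    print(parse(fetch_response()));
}
```
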
3. Prediction

If we can predict an outcome and thereby avoid a bunch of computation, it can be worth proceeding with our prediction instead of computing the real answer up front.

This happens **a lot** in CPUs, where [branch prediction](https://danluu.com/branch-prediction/) is used to run even faster.

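Before getting to hardware, here's a small software analogue (again my own, not the lecture's): inserting into a sorted vector while predicting that new values belong at the end, the common case for, say, timestamped data.

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

// Insert into a sorted vector. Prediction: new values usually belong at the
// end, so guess that first and only pay for a binary search when wrong.
void insert_sorted(std::vector<int>& v, int x) {
    if (v.empty() || v.back() <= x) {
        v.push_back(x);                                        // guess right: O(1)
    } else {
        v.insert(std::lower_bound(v.begin(), v.end(), x), x);  // guess wrong: fall back
    }
}

int main() {
    std::vector<int> v;
    for (int x : {1, 2, 3, 5, 4, 6}) insert_sorted(v, x);
    for (int x : v) std::cout << x << ' '; // 1 2 3 4 5 6
    std::cout << '\n';
}
```
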
## Cost of Such Improvements

As the saying goes, every decision you make as an engineer ultimately has a cost; let's look at the cost of each of these improvements.

1. Parallelism

If we have a data set with inter-dependencies between its members, then we can easily run into the issue of waiting on other tasks to finish.

Contrived Example:

```
Premise: output file contents -> search lines for some text -> sort the resulting lines

We have to do the following processes:
print my-file.data
search file
sort results of the search

In bash we might do: cat my-file.data | grep 'Text to search for' | sort
```

Parallelism doesn't make sense here for one reason: this series of processes doesn't benefit from it, because the 2nd and 3rd tasks _must_ wait for the previous ones to finish first.

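The same dependency problem shows up inside a single program; a running total is a tiny example, since element `i` can't be computed before element `i - 1`:

```cpp
#include <iostream>
#include <vector>

// Running total: element i depends on element i - 1, so these iterations
// cannot simply be handed out to different threads; each one has to wait
// for the previous result, just like the stages of the shell pipeline.
int main() {
    std::vector<int> v{3, 1, 4, 1, 5};
    for (std::size_t i = 1; i < v.size(); ++i)
        v[i] += v[i - 1];
    for (int x : v) std::cout << x << ' '; // 3 4 8 9 14
    std::cout << '\n';
}
```
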
2. Pipelining

Let's say we want to do the following:

```
Search file1 for some text [search file1]
Feed the results of the search into a sorting program [sort]

Search file2 for some text [search file2]
Feed the results of the search into a reverse sorting program [reverse sort]

The resulting Directed Acyclic Graph looks like:

[search file1] => [sort]
[search file2] => [reverse sort]
```

Making the above linear means we effectively have to do:

```
[search file1] => [sort] [search file2] => [reverse sort]
|...proc2 waiting......|
```

This wastes a lot of time if the first chain is going to take a long time. Bonus points if process 2 is extremely short.

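For contrast, since the two chains share no data, here's a hypothetical sketch of what the linear version throws away: launch each chain with `std::async` and they proceed side by side (the functions are stand-ins for the real search/sort work):

```cpp
#include <future>
#include <iostream>
#include <string>

// Stand-ins for the real work; each chain is search -> sort.
std::string search_and_sort(const std::string& file) {
    return "sorted results of " + file;
}
std::string search_and_reverse_sort(const std::string& file) {
    return "reverse-sorted results of " + file;
}

int main() {
    // The two chains share no data, so nothing forces them to run back to back.
    auto chain1 = std::async(std::launch::async, search_and_sort, "file1");
    auto chain2 = std::async(std::launch::async, search_and_reverse_sort, "file2");
    std::cout << chain1.get() << "\n" << chain2.get() << "\n";
}
```
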
3. Prediction

OK, two things up front:

* First: prediction's fault is that we could guess wrong and end up having to do the hard computation anyway.
* Second: _this course never covers branch prediction (something that pretty much every CPU of the last 20 years does)_, so I'm going to cover it here; ready, let's go.

For starters, let's say a basic CPU takes instructions sequentially in memory: `A B C D`.

However, this is kind of slow, because there is _time_ between fetching an instruction, decoding it to know which instruction it is, and finally executing it proper.

For this reason modern CPUs actually fetch, decode, and execute (and more!) instructions all at the same time.

Instead of getting instructions like this:

```
0
AA
BB
CC
DD
```

We actually do something more like this:

```
A
AB
BC
CD
D0
```

If it doesn't seem like much, remember this is half an instruction saved on a chip that is likely going to process thousands/millions of instructions, so the savings scale really well.

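A back-of-the-envelope cycle count makes the saving visible; this assumes an idealized two-stage pipeline where every stage takes one cycle, which is my simplification rather than anything from the lecture:

```cpp
#include <iostream>

int main() {
    const long n = 1'000'000; // instructions to run
    const long stages = 2;    // fetch, execute (real pipelines have more)

    // Without overlap, every instruction pays for both stages in full.
    long sequential_cycles = n * stages;

    // With overlap, the pipeline fills once and then retires one
    // instruction per cycle: (stages - 1) cycles to fill + n cycles.
    long pipelined_cycles = (stages - 1) + n;

    std::cout << "sequential: " << sequential_cycles << " cycles\n"
              << "pipelined:  " << pipelined_cycles  << " cycles\n";
}
```
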
This scheme is fine if our instructions all come one after the other in memory, but if we need to branch then we have to jump to a new location, like so:

```
ABCDEFGHIJKL
^^^*     ^
   |-----|
```

Now say we have the following code:

```c
if (x == 123) {
    main_call();
}
else {
    alternate_call();
}
```

The (pseudo)assembly might look like:

```asm
    cmp x, 123
    je second
main_branch:            ; pointless label, but nice for reading
    call main_call
    jmp end
second:
    call alternate_call
end:
    ; something to do here
```

Our problem comes when we hit the `je`.
Once we've loaded that instruction and can start executing it, we have to make a decision: do we load the `call main_call` instruction, or the `call alternate_call`?
If we just guess, we have a 50% chance of saving time and a 50% chance of tossing out our guess and starting the whole _fetch instruction => decode etc._ process over again from scratch.

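The classic way to see this cost on real hardware is to time the same branchy loop on random versus sorted data; this demo is my own sketch and the exact numbers vary by machine:

```cpp
#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

// Sum only the elements >= 128. With random data the branch is a coin flip
// and the predictor keeps guessing wrong; with sorted data the branch goes
// "false, false, ..., true, true" and prediction is nearly perfect.
long sum_big(const std::vector<int>& v) {
    long total = 0;
    for (int x : v)
        if (x >= 128) total += x;
    return total;
}

double time_ms(const std::vector<int>& v) {
    auto t0 = std::chrono::steady_clock::now();
    volatile long sink = sum_big(v); // volatile keeps the work from being optimized out
    (void)sink;
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    std::vector<int> v(10'000'000);
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(0, 255);
    for (int& x : v) x = dist(rng);

    double unsorted = time_ms(v);
    std::sort(v.begin(), v.end());
    double sorted = time_ms(v);
    std::cout << "unsorted: " << unsorted << " ms\n"
              << "sorted:   " << sorted   << " ms\n";
}
```
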
Solution 1:

Try to determine which branches are taken prior to running the program, and just always guess the more likely branch.
If we find that the above branch takes `main_branch` more often, then we should always load that branch, knowing that the loss from being wrong is offset by the gain from the statistically-more-often-correct guesses.

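Compilers expose this decide-ahead-of-time idea directly: GCC and Clang provide `__builtin_expect`, and C++20 standardized the `[[likely]]`/`[[unlikely]]` attributes. A minimal sketch:

```cpp
#include <cstdio>

// Hint to the compiler that the error path is rare, so it lays out the
// common path as the straight-line (fall-through) case.
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

int process(int x) {
    if (UNLIKELY(x < 0)) {        // rare error path
        std::puts("bad input");
        return -1;
    }
    return x * 2;                 // hot path
}

int main() {
    std::printf("%d\n", process(21)); // 42
}
```
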
...