lec1
====

> What on earth?

The first lecture has been 50% syllabus, 25% videos, and 25% simple
terminology; expect nothing interesting for this section.

General Performance Improvements in software
--------------------------------------------

In general we have a few options to increase performance in software:
pipelining, parallelism, and prediction.

1. Parallelism

If we have multiple tasks to accomplish or multiple sources of data we
might find it better to work on multiple things at
once \[e.g. multi-threading, multi-core rendering\].

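A minimal sketch of the idea in C++ (the data and the task here are made
up): launch two independent pieces of work with `std::async` so they can
run on separate cores at the same time.

``` {.cpp}
#include <functional>
#include <future>
#include <iostream>
#include <numeric>
#include <vector>

// A stand-in task: sum one source of data.
long sum_range(const std::vector<long>& v) {
    return std::accumulate(v.begin(), v.end(), 0L);
}

int main() {
    std::vector<long> a(1'000'000, 1), b(1'000'000, 2);

    // The two sums are independent, so work on both at once.
    auto fa = std::async(std::launch::async, sum_range, std::cref(a));
    auto fb = std::async(std::launch::async, sum_range, std::cref(b));

    std::cout << fa.get() + fb.get() << '\n'; // collect both results
}
```
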
2. Pipelining

Here we are taking *data* and serializing it into a linear form. We do
things like this because it can make sense to process things linearly
\[e.g. taking data from a website response and forming it into a
struct/class instance in C++/Java et al.\].

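As a sketch of that "website response into a struct" example (the response
format, field names, and stages are all made up), the work naturally forms
a line of stages where each one feeds the next:

``` {.cpp}
#include <iostream>
#include <sstream>
#include <string>

// The record we ultimately want.
struct User {
    std::string name;
    int age = 0;
};

// Stage 1: pull the body out of a fake HTTP-ish response.
std::string extract_body(const std::string& response) {
    auto pos = response.find("\r\n\r\n");
    return pos == std::string::npos ? response : response.substr(pos + 4);
}

// Stage 2: parse the body ("name age") into the struct.
User parse_user(const std::string& body) {
    std::istringstream in(body);
    User u;
    in >> u.name >> u.age;
    return u;
}

int main() {
    std::string response = "HTTP/1.1 200 OK\r\n\r\nalice 42";
    // The stages only make sense one after the other: response -> body -> struct.
    User u = parse_user(extract_body(response));
    std::cout << u.name << " is " << u.age << '\n';
}
```
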
3. Prediction

If we can predict an outcome and avoid a bunch of computation, then it
can be worth taking our prediction and proceeding with that instead of
doing the full work. This happens **a lot** in CPUs, where they use
what's called [branch prediction](https://danluu.com/branch-prediction/)
to run even faster.

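In plain software the same flavour of idea shows up as "guess the answer
cheaply and only do the expensive work when the guess misses". A made-up
sketch with a one-entry predictor (loosely analogous to a branch predictor
remembering its last outcome):

``` {.cpp}
#include <iostream>

// Pretend this is the slow way to get the answer.
int slow_compute(int x) {
    int r = 0;
    for (int i = 0; i < 1'000'000; ++i) r = (r + x) % 9973;
    return r;
}

// A cheap predictor: remember the last input and its answer.
struct Predictor {
    int last_input  = -1;
    int last_output = 0;
};

int compute_with_prediction(Predictor& p, int x) {
    if (x == p.last_input) {
        return p.last_output;          // prediction hit: skip the slow work
    }
    p.last_output = slow_compute(x);   // prediction miss: pay the full cost
    p.last_input  = x;
    return p.last_output;
}

int main() {
    Predictor p;
    std::cout << compute_with_prediction(p, 7) << '\n'; // miss: slow path
    std::cout << compute_with_prediction(p, 7) << '\n'; // hit: fast path
}
```
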
Cost of Such Improvements
-------------------------

As the saying goes, every decision you make as an engineer ultimately
has a cost; let's look at the cost of these improvements.

1. Parallelism

If we have a data set which has some form of inter-dependency between
its members then we can easily run into the issue of waiting on other
things to finish.

Contrived Example:

Premise: output file contents -> search lines for some text -> sort the resulting lines

We have to do the following processes:

    print my-file.data
    search file
    sort results of the search

In bash we might do:

    cat my-file.data | grep 'Text to search for' | sort

Parallelism doesn't make sense here for one reason: this series of
processes doesn't benefit from parallelism because the 2nd and 3rd tasks
*must* wait until the previous ones finish first.

2. Pipelining

Let's say we want to do the following:

    Search file1 for some text : [search file1]
    Feed the results of the search into a sorting program : [sort]

    Search file2 for some text : [search file2]
    Feed the results of the search into a reverse sorting program : [reverse sort]

The resulting Directed Acyclic Graph looks like:

    [search file1] => [sort]

    [search file2] => [reverse sort]

Making the above linear means we effectively have to do:

    [search file1] => [sort] [search file2] => [reverse sort]
    | proc2 waiting........|

Which wastes a lot of time if the first chain is going to take a long
time. Bonus points if process 2 is extremely short.

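To make the wasted waiting concrete: the two chains are independent, so if
we refuse to flatten them into one line, neither has to wait on the other.
A rough C++ sketch of the same DAG, where the search and sort stages are
stand-in functions:

``` {.cpp}
#include <algorithm>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

// Stand-ins for the real work in each stage.
std::vector<std::string> search(const std::string& file, const std::string& text) {
    return {file + ": found '" + text + "'"};            // pretend we scanned the file
}

void sort_lines(std::vector<std::string>& lines)         { std::sort(lines.begin(), lines.end()); }
void reverse_sort_lines(std::vector<std::string>& lines) { std::sort(lines.rbegin(), lines.rend()); }

int main() {
    std::vector<std::string> out1, out2;

    // Chain 1: [search file1] => [sort]
    std::thread chain1([&] { out1 = search("file1", "needle"); sort_lines(out1); });
    // Chain 2: [search file2] => [reverse sort], started without waiting on chain 1.
    std::thread chain2([&] { out2 = search("file2", "needle"); reverse_sort_lines(out2); });

    chain1.join();
    chain2.join();
    std::cout << out1.front() << '\n' << out2.front() << '\n';
}
```
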
3. Prediction

Ok, two things up front:

- First: prediction's fault is that we could be wrong and end up having
  to do the hard computation anyway.
- Second: *this course never covers branch prediction (something that
  pretty much every CPU in the last 20 years does)*, so I'm gonna cover
  it here; ready, let's go.

For starters, let's say a basic CPU takes instructions sequentially in
memory: `A B C D`. However, this is kinda slow because there is *time*
between getting an instruction, decoding it to know what instruction it
is, and finally executing it proper. For this reason modern CPUs actually
fetch, decode, and execute (and more!) instructions all at the same time.

Instead of getting instructions like this:

    0
    AA
    BB
    CC
    DD

We actually do something more like this:

    A
    AB
    BC
    CD
    D0

If it doesn't seem like much, remember this is half an instruction saved
on a chip that is likely going to process thousands/millions of
instructions, so the savings scale really well.

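A toy sketch of that overlap (a made-up two-stage model, not how a real
CPU is implemented): every step, whatever was just fetched moves into
decode while the next instruction is fetched behind it.

``` {.cpp}
#include <iostream>
#include <string>

int main() {
    std::string program = "ABCD";    // one letter per instruction
    char decode = '0', fetch = '0';  // '0' means the stage is empty

    for (std::size_t step = 0; step <= program.size(); ++step) {
        decode = fetch;                                        // last fetch is now decoding
        fetch  = step < program.size() ? program[step] : '0';  // grab the next instruction
        std::cout << decode << fetch << '\n';                  // prints 0A, AB, BC, CD, D0
    }
}
```

With two stages, four instructions finish in 5 steps here instead of the 8
a strictly one-at-a-time scheme would need, which is roughly where the
"half an instruction" per instruction saving comes from.
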
This scheme is fine if our instructions are all coming one after the
other in memory, but if we need to branch then we likely need to jump to
a new location, like so:

    ABCDEFGHIJKL
    ^^^*    ^
       |----|

Now say we have the following code:

    if (x == 123) {
        main_call();
    }
    else {
        alternate_call();
    }

The (pseudo)assembly might look like

``` {.asm}
cmp x, 123
jne second           ; jump to the else branch when x != 123
main_branch:         ; pointless label but nice for reading
    call main_call
    jmp end
second:
    call alternate_call
end:
    ; something to do here
```

Our problem comes when we hit the `jne`. Once we've loaded that
instruction and can start executing it, we have to make a decision: load
the `call main_call` instruction or the `call alternate_call`? Chances
are that if we guess we have a 50% chance of saving time and a 50% chance
of tossing out our guess and starting the whole *get instruction =\>
decode etc.* process over again from scratch.

Solution 1:

Try to determine which branches are taken prior to running the program
and just always guess the more likely branches. If we find that the code
above takes the `main_branch` path more often, then we should always load
that branch, knowing that the loss from being wrong is offset by the gain
from the statistically more frequent correct guesses.

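One concrete form of this on the software side is telling the compiler
ahead of time which side of a branch is expected, so it can lay the likely
path out on the straight-line (fall-through) route. A small sketch using
the standard C++20 `[[likely]]`/`[[unlikely]]` attributes (the functions
are placeholders):

``` {.cpp}
#include <iostream>

void main_call()      { std::cout << "common path\n"; }
void alternate_call() { std::cout << "rare path\n"; }

void handle(int x) {
    // Hint: the first branch is the common case.
    if (x == 123) [[likely]] {
        main_call();
    } else [[unlikely]] {
        alternate_call();
    }
}

int main() {
    handle(123);
    handle(7);
}
```

GCC and Clang expose the same idea pre-C++20 as
`__builtin_expect(x == 123, 1)`, and profile-guided optimization automates
the "measure which branch is taken more often" part.
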
...