Jake's CS Notes - Machine Learning for Trading

//****************************************************************************//
//**************** Numpy Crash-Course - August 22nd, 2019 *******************//
//**************************************************************************//

- Alright, my laptop has decided to start acting up and I disapprove of its shenanigans (doesn't sleep when the lid's closed, etc.)
- So, a few announcements:
    - Please make sure you're signed up on Piazza! We won't have essential information on there, but it's certainly a valuable resource (there's a TON of posts since there's ~800 students, but the pinned posts are helpful summaries)
    - Canvas will also be used to make announcements, hand in homework, etc., as well as recorded lectures from a previous semester of the course (if you ever miss a lecture, etc.)
        - "I still think there's some value in coming here and staring at my face, though, so please come"
    - Office hours will be right after lecture, outside the classroom; TA office hours will begin next week
    - Project 1 (Martingale) is due 10 days from now; I know our online project descriptions look like legal contracts, so we'll go over it a bit during class
        - Most projects are due Sunday night "anywhere on earth" (due to the online students - it's easier than dealing with the timezone madness)
            - Late assignments will NOT be accepted; if you have an actual emergency, talk to the dean of students and they can get you an exception
        - There are also 4 CentOS servers in this class called the "buffet" servers - "they were meant to be named after Warren Buffett, but they are not, because of spelling" - and you'll have access to all of them
            - These should come pre-configured with all the libraries and proper versions of Python you need; run your code there to guarantee it'll run on our laptops (if it does NOT run there when the TAs run it, you'll get an automatic 0)
                - You don't have access YET, though; the IT department needs to approve the class roster after Phase II ends, and that won't happen until sometime next week (it's not mandatory for project 1)
    - Provisionally, Exam 1 will be on October 10th and Exam 2 will be right before Thanksgiving; these dates ARE subject to change, so don't plan around them just yet
--------------------------------------------------------------------------------

- So, today we'll give you guys a crash-course through NumPy - a math library for Python that's SUPER well optimized (and written in Fortran/C), and is basically Matlab in Python
    - "If you're trying to process thousands of trades a second, you don't want 3 nested for-loops inside each other; you want matrix math speediness!"

- LAST TIME, though (when I personally wasn't here), we talked about how the CSV files are organized as follows:
    
    Date,OpenPrice,HighPrice,LowPrice,ClosePrice,Volume,AdjustedClosePrice

    - Dates will always be in the format "YYYY-MM-DD", meaning we can sort things by date properly
    - Open/high/low/close will be in "decimal dollars.cents" format
    - Volume format will almost ALWAYS ends in 2 zeroes, since stocks are usually traded in "round orders" in increments of 100, with multiple people's smaller orders bundled together
    - Adjusted close price controls for stuff like stock-splitting (splitting the price of the stock while doubling the number of shares), so we don't actually think armageddon is breaking out when nothing's happened

- With that out of the way, there are 3 main Python libraries we'll use in this class:
    - NumPy (Matlab for python)
    - Pandas (a time-series library based on NumPy)
    - Matplotlib (Matlab-style plots and graphs in Python)
- We're using basically the latest version of all of these, and hopefully you'll be very familiar with all of them by the time the course ends
    - Why are we using Pandas instead of just using raw Python? Suppose we wanted to take a CSV file for Yahoo's stock price and print out the highest price in the last 5 years (and plot the price data)
        - We could do this by writing our own iteration function, OR we could write this complete code snippet:

            ```python
            import pandas as pd
            # Dataframe, just a 2D numpy array with fancy indexes
            df = pd.readCSV('data/Yahoo.csv',
                            index_col='Date',   # Use dates as indexes
                            parse_dates=True)   # Parse the indices as dates
            print(df['AdjClose'].max())
            df['AdjClose'].plot()               # Pandas has matplotlib built in!
            ```

        - As a side-note, since numpy is actually C-code, if you want to write fast code you do NOT want to leave C-land and come back to python; every time you go back to Python in a for-loop, that's less time you're spending in that super optimized bytecode the numpy authors wrote

- So, what IS numpy anyway?
    - Even for CS people who haven't used Matlab, this can be a little strange, so let's walk through this!
        - If you HAVE used Matlab, know that all numpy functions operate on a per-element basis
        - There are THOUSANDS of functions in numpy, but only a few of them you really need to use often
    - First, let's assume we've imported numpy as follows:

        ```python
        import numpy as np
        ```

    - Then, we'll want to create a "numpy array" that we can call numpy functions on, which we can do from existing data as follows:

        ```python
        np.array(<sequence, tuple, list, etc.>)
        ```

        - Or, if we want to make a totally empty numpy array as fast as possible (avoiding malloc calls and such):

            ```python
            np.empty(dimensions tuple, e.g. (4,9))
            ```

        - We can create it initialized to all 0s/1s with `np.zeroes()` or `np.ones()`

    - More often in this class, you might want an array filled with random numbers, and we can do that SUPER efficiently with numpy:

        ```python
        np.random.random(<dimensions>)  # Uniform random numbers in range [0.0, 1.0)

        np.random.normal(mean, std, size=(1,4)) # Generates "size"-sized array w/ random numbers from a normal distribution with the given mean/standard deviation

        np.random.randint(low, high, size=(1,4))    # Generates "size"-sized array with integers from low to high

        np.random.seed(seed)    # Set the seed for the random number, which is NOT the same as Python's seed, so that we can get predictable results
        ```

    - How can we sliceThese arrays? It's pretty similar to standard Python slicing:

        ```python
        n[2,4]      # Get item at row 2, column 4
        n[2:4, 0:2] # Get rows 2 to 4 (NOT including 4), and columns 0 to 2 (NOT including 2)
        n[:, 4]     # Get all rows, 4th column
        ```

    - Some useful information that you can get about an array:

        ```python
        n.shape   = (5,4)     # Returns a tuple of (# rows, # columns)
        n.size    = 20        # Returns the NUMBER of elements
        n.
        ```

    - ...and some other common functions you *might* want to use:

        ```python
        nd.sum()    # Sums all elements in your array together
        nd.sum(axis=0)  # Confusingly, this sums ACROSS the given dimension; so, this'd sum ACROSS your rows and give you the sum in each column

        nd.min()    # Minimum of the whole array, or
        nd.min(axis=0)  # Minimum of each column, returned as a 1D array

        .max()
        .mean()
        .std()      # Standard deviation
        ```

    - One other thing that might not be obvious: you can assign stuff to slices!

        ```python
        nd[2:4, 1:3] = nd[1:3, 0:2]
        ```

- One semi-confusing thing that numpy supports is ARRAY BROADCASTING: where we can "broadcast" a value we're assigning/getting to multiple elements in the array
    - As an example:

        ```python
        nd[:, 1] = 4    # Sets all elements in column 1 to 4
        ```

    - This also lets us do "array masking", where we can specify the indices we're interested in and then get only them, e.g.

        ```python
        data = [9, 7, 4, 1, 6, 3]
        indices = [1, 4, 1, 0]
        data[indices] = [7, 6, 7, 9]
        ```

    - Why is that useful? Because we can chain it together with other stuff!
        - For instance, we can get the indices of all the maximum numbers and then access them to do something like this:

            ```python
            array[array.max().maxidx()].runSomeOperation()
            ```

- How much faster is numpy? A LOT faster than standard python!
    - Consider this code, which gets the average of an array:

        ```python
        for i in range(array.shape[0]):
            for j in range(array.shape[1]):
                sum += array[i,j]
        print(sum/array.size)
        ```

    - This took 3 seconds to run; not bad, but THIS code took only 0.005 seconds to run!

        ```
        array.mean()
        ```

- So, numpy is EXTREMELY useful, especially for the stuff we're dealing with in this class

- Now, onto your 1st project, Martingale!
    - Professor Balch is more what Professor Byrd likes to call a "normal person" than your average Professor, and when he goes to Las Vegas (which he does surprisingly often) he like to use the "Martingale" strategy
        - So, to introduce you to Python and these libraries, we'll be testing how good this strategy actually is!
    - What's in the project? Basically, just the world's dumbest roulette wheel!
        - You give it the boolean odds you're betting on, and it returns "true" if you won and "false" if you lose
    - What do we expect you to do?
        - Basically, we expect you to run trials/"episodes" of an experiment, following the rules we give you
            - Each "episode" will be one gambler walking into the casino, using the Martingale strategy, and then leaving after they've run the strategy to completion
            - What is the Martingale strategy anyway, though? It's pretty simple!
                - You bet $1 on black; if you win, you gain a dollar, and if you lose you lose that bet (i.e. +/-$1 every time)
                - When you lose, you double your bet
                - When you win, you go back to $1 and repeat
                    - You keep doing this until you hit EITHER your win/loss limit (i.e. you've won or lost a certain amount of money)
        - So, what's your win percentage with this?
            - So, with an ideal, "fair" roulette wheel ("which you'll never find anyway"), you would have 18 black/36 total spaces = 0.5 chance of winning
                - What casinos actually do to make SURE you'll lose more often, then, is add green spaces to the wheel (often 2 or 3, depending on where you are in the world) that count as losses; so, you'd have a win percentage of "(18 black - green spaces)/(36 + green spaces)"
        - For each episode, we'll always do 1000 spins
            - If you hit your win/loss limit, you'll stop betting, but still populate the rest of the array by...
            - ...after each spin, tracking your TOTAL winnings thus far (positive or negative)
                - The win limit will always be $80
                - If you hit your limit, you'll just copy your last "actual" spin to the remaining
            - You'll then plot the array of winnings on the Y axis, and the spin # on the X axis
                - You can think of each subsequence ending with a win always having a net gain of +$1; the problem, though, is that you might lose so much money before that point that you 
        - So, what're you 2 experiments?
            - Experiment 1 is that you have an INFINITE bankroll (you'll have no loss limit at all)
                - For this one's first part, you'll run 10 completely separate episodes, and you'll plot each of their records as 10 separate lines 
                    - Matplotlib.pyplot should make this pretty easy
                - You'll then run 1000 episodes of this, but only make 2 plots:
                    - The mean winning of each episode at that point, +/- the standard deviation (3 lines total)
                    - The median +/- the standard deviation
            - Experiment 2 is the same thing EXCEPT you now have only $256 in your pocket
                - So, you do the same mean/median plots, but if you lose enough money that you can't double your bet, you'll just bet whatever you have left
                    - If you lose that, then you'll just "bet" $0 for the rest of the spins

- So, that's project 1; next time we'll get into some financial details for this course, so I'll see you all for that!