//****************************************************************************//
//**************** Numpy Crash-Course - August 22nd, 2019 *******************//
//**************************************************************************//
- Alright, my laptop has decided to start acting up and I disapprove of its shenanigans (doesn't sleep when the lid's closed, etc.)
- So, a few announcements:
- Please make sure you're signed up on Piazza! We won't have essential information on there, but it's certainly a valuable resource (there's a TON of posts since there's ~800 students, but the pinned posts are helpful summaries)
- Canvas will also be used to make announcements, hand in homework, etc., as well as to host recorded lectures from a previous semester of the course (if you ever miss a lecture, etc.)
- "I still think there's some value in coming here and staring at my face, though, so please come"
- Office hours will be right after lecture, outside the classroom; TA office hours will begin next week
- Project 1 (Martingale) is due 10 days from now; I know our online project descriptions look like legal contracts, so we'll go over it a bit during class
- Most projects are due Sunday night "anywhere on earth" (due to the online students - it's easier than dealing with the timezone madness)
- Late assignments will NOT be accepted; if you have an actual emergency, talk to the dean of students and they can get you an exception
- There are also 4 CentOS servers in this class called the "buffet" servers - "they were meant to be named after Warren Buffett, but they are not, because of spelling" - and you'll have access to all of them
- These should come pre-configured with all the libraries and proper versions of Python you need; run your code there to guarantee it'll run on our laptops (if it does NOT run there when the TAs run it, you'll get an automatic 0)
- You don't have access YET, though; the IT department needs to approve the class roster after Phase II ends, and that won't happen until sometime next week (it's not mandatory for project 1)
- Provisionally, Exam 1 will be on October 10th and Exam 2 will be right before Thanksgiving; these dates ARE subject to change, so don't plan around them just yet
--------------------------------------------------------------------------------
- So, today we'll give you guys a crash-course through NumPy - a math library for Python that's SUPER well optimized (and written in Fortran/C), and is basically Matlab in Python
- "If you're trying to process thousands of trades a second, you don't want 3 nested for-loops inside each other; you want matrix math speediness!"
- LAST TIME, though (when I personally wasn't here), we talked about how the CSV files are organized as follows:
Date,OpenPrice,HighPrice,LowPrice,ClosePrice,Volume,AdjustedClosePrice
- Dates will always be in the format "YYYY-MM-DD", meaning we can sort things by date properly
- Open/high/low/close will be in "decimal dollars.cents" format
- Volume will almost ALWAYS end in 2 zeroes, since stocks are usually traded in "round orders" in increments of 100, with multiple people's smaller orders bundled together
- Adjusted close price controls for stuff like stock-splitting (e.g. halving the price of each share while doubling the number of shares), so we don't actually think armageddon is breaking out when nothing's happened
- With that out of the way, there are 3 main Python libraries we'll use in this class:
- NumPy (Matlab for python)
- Pandas (a time-series library based on NumPy)
- Matplotlib (Matlab-style plots and graphs in Python)
- We're using basically the latest version of all of these, and hopefully you'll be very familiar with all of them by the time the course ends
- Why are we using Pandas instead of just using raw Python? Suppose we wanted to take a CSV file for Yahoo's stock price and print out the highest price in the last 5 years (and plot the price data)
- We could do this by writing our own iteration function, OR we could write this complete code snippet:
```python
import pandas as pd
# Dataframe, just a 2D numpy array with fancy indexes
df = pd.read_csv('data/Yahoo.csv',
                 index_col='Date',  # Use dates as indexes
                 parse_dates=True)  # Parse the indices as dates
print(df['AdjClose'].max())
df['AdjClose'].plot() # Pandas has matplotlib built in!
```
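- As a self-contained sketch of the same idea that runs without any file on disk (the tiny inline CSV, its made-up prices, and the `AdjClose` column name are all assumptions for illustration, NOT real Yahoo data):
```python
import io
import pandas as pd

# Made-up sample data standing in for a real CSV file (an assumption for illustration)
csv_text = """Date,AdjClose
2019-08-19,10.0
2019-08-20,12.5
2019-08-21,11.0
"""

df = pd.read_csv(io.StringIO(csv_text),
                 index_col='Date',   # Use dates as indexes
                 parse_dates=True)   # Parse the indices as dates

highest = df['AdjClose'].max()      # Highest adjusted close in the data
best_day = df['AdjClose'].idxmax()  # The date (index label) where it occurred
print(highest, best_day.date())
```
- Since the dates were parsed into the index, `idxmax()` hands back an actual timestamp rather than a row number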
- As a side-note, since numpy is actually compiled C code, if you want to write fast code you do NOT want to leave C-land and come back to Python; every trip back to Python in a for-loop is time you're NOT spending in the super optimized compiled code the numpy authors wrote
- So, what IS numpy anyway?
- Even for CS people who haven't used Matlab, this can be a little strange, so let's walk through this!
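- If you HAVE used Matlab, know that numpy's arithmetic operators and most of its math functions operate on a per-element basis (e.g. `*` multiplies element-by-element; it is NOT matrix multiplication)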
- There are THOUSANDS of functions in numpy, but only a few of them you really need to use often
- First, let's assume we've imported numpy as follows:
```python
import numpy as np
```
- Then, we'll want to create a "numpy array" that we can call numpy functions on, which we can do from existing data as follows:
```python
n = np.array([[1, 2, 3], [4, 5, 6]]) # Accepts any sequence: list, tuple, nested lists, etc.
```
- Or, if we want to make an uninitialized numpy array as fast as possible (the memory is allocated but NOT filled in, so it holds arbitrary garbage values until you assign to it):
```python
np.empty((4, 9)) # Pass the dimensions as a tuple; this is 4 rows x 9 columns
```
- We can create it initialized to all 0s/1s with `np.zeros()` or `np.ones()`
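- A quick sketch of these creation functions side by side (the shapes and values here are arbitrary choices for illustration):
```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])  # From a nested list: 2 rows, 3 columns
e = np.empty((4, 9))                  # Uninitialized: contents are arbitrary until you fill them
z = np.zeros((2, 3))                  # All 0.0
o = np.ones((2, 3), dtype=int)        # All 1, stored as integers

print(a.shape, e.shape, z.sum(), o.sum())
```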
- More often in this class, you might want an array filled with random numbers, and we can do that SUPER efficiently with numpy:
```python
np.random.random((5, 4)) # Uniform random numbers in range [0.0, 1.0), in an array with the given dimensions
np.random.normal(mean, std, size=(1,4)) # Generates "size"-shaped array w/ random numbers from a normal distribution with the given mean/standard deviation
np.random.randint(low, high, size=(1,4)) # Generates "size"-shaped array with integers from low (inclusive) up to high (NOT including high)
np.random.seed(seed) # Set the seed for numpy's random number generator, which is NOT the same as Python's seed, so that we can get predictable results
```
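- To see why seeding matters, here's a minimal sketch (the seed value 903 and the ranges are arbitrary picks for this example): re-seeding with the same number replays the exact same "random" draws
```python
import numpy as np

np.random.seed(903)                            # Arbitrary seed, chosen only for this example
first = np.random.randint(0, 10, size=(1, 4))

np.random.seed(903)                            # Re-seeding replays the same sequence
second = np.random.randint(0, 10, size=(1, 4))

print(np.array_equal(first, second))           # The two draws match

uniform = np.random.random((5, 4))             # Values fall in [0.0, 1.0)
```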
- How can we slice these arrays? It's pretty similar to standard Python slicing:
```python
n[2,4] # Get item at row 2, column 4
n[2:4, 0:2] # Get rows 2 to 4 (NOT including 4), and columns 0 to 2 (NOT including 2)
n[:, 4] # Get all rows of column 4 (i.e. the 5th column, since indexes start at 0)
```
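- Here's the same slicing on a concrete array, as a small sketch (the 5x6 array of 0..29 is made-up data so the results are easy to check by eye):
```python
import numpy as np

n = np.arange(30).reshape(5, 6)  # 5 rows x 6 columns holding 0..29

item = n[2, 4]        # Row 2, column 4 -> 2*6 + 4 = 16
block = n[2:4, 0:2]   # Rows 2-3, columns 0-1
col = n[:, 4]         # Every row of column 4

print(item)
print(block)
print(col)
```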
- Some useful information that you can get about an array:
```python
n.shape # A tuple of (# rows, # columns), e.g. (5, 4)
n.size # The total NUMBER of elements, e.g. 20
n.
```
- ...and some other common functions you *might* want to use:
```python
nd.sum() # Sums all elements in your array together
nd.sum(axis=0) # Confusingly, this sums ACROSS the given dimension; so, this'd sum ACROSS your rows and give you the sum in each column
nd.min() # Minimum of the whole array, or
nd.min(axis=0) # Minimum of each column, returned as a 1D array
nd.max() # Maximum; takes the same optional axis argument
nd.mean() # Average
nd.std() # Standard deviation
```
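- Since the `axis` argument trips a lot of people up, here's a small worked sketch (the 2x3 values are made up): `axis=0` collapses the rows and `axis=1` collapses the columns
```python
import numpy as np

nd = np.array([[1, 2, 3],
               [4, 5, 6]])  # 2 rows x 3 columns of made-up values

total = nd.sum()           # 21: every element added together
col_sums = nd.sum(axis=0)  # [5 7 9]: collapses the rows, one sum per column
row_sums = nd.sum(axis=1)  # [ 6 15]: collapses the columns, one sum per row
col_mins = nd.min(axis=0)  # [1 2 3]: minimum of each column

print(total, col_sums, row_sums, col_mins)
```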
- One other thing that might not be obvious: you can assign stuff to slices!
```python
nd[2:4, 1:3] = nd[1:3, 0:2]
```
- One semi-confusing thing that numpy supports is ARRAY BROADCASTING: where we can "broadcast" a value we're assigning/getting to multiple elements in the array
- As an example:
```python
nd[:, 1] = 4 # Sets all elements in column 1 to 4
```
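- A few more broadcasting cases, as a hedged sketch (the array sizes and values are arbitrary): a scalar stretches to fill a whole slice, and a 1D array stretches across every row
```python
import numpy as np

nd = np.zeros((3, 4), dtype=int)
nd[:, 1] = 4                 # Scalar is broadcast down column 1
nd[0, :] = [1, 2, 3, 4]      # A 4-element list fills row 0 (overwriting the 4 we just put there)
shifted = nd + 10            # Scalar is broadcast to every element
scaled = nd * np.array([1, 10, 100, 1000])  # 1D array is broadcast across each row

print(nd)
print(scaled)
```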
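- Relatedly, numpy lets us do "array masking" (indexing an array with a whole list of indices; numpy's own docs call this integer-array or "fancy" indexing), where we specify the indices we're interested in and then get only them, e.g.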
```python
data = np.array([9, 7, 4, 1, 6, 3]) # Must be a numpy array; a plain Python list can't be indexed this way
indices = [1, 4, 1, 0]
data[indices] # Returns array([7, 6, 7, 9])
```
- Why is that useful? Because we can chain it together with other stuff!
- For instance, we can get the indices of all the maximum numbers and then access them to do something like this:
```python
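max_indices = np.flatnonzero(array == array.max()) # Indices of every maximum element (for a 1D array)
array[max_indices] # Access just those elements, ready for whatever operation comes next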
```
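- As a runnable sketch of that chaining (the data is made up, and zeroing out the maximums is just an arbitrary example operation):
```python
import numpy as np

array = np.array([3, 9, 4, 9, 1])  # Made-up data with a repeated maximum

first_max = array.argmax()                      # Index of the FIRST maximum
all_max = np.flatnonzero(array == array.max())  # Indices of EVERY maximum
above_mean = array[array > array.mean()]        # Boolean mask: keeps elements above the mean (5.2)

array[all_max] = 0                              # Assign through the index array
print(first_max, all_max, above_mean, array)
```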
- How much faster is numpy? A LOT faster than standard python!
- Consider this code, which gets the average of an array:
```python
total = 0 # Named "total" so we don't shadow Python's built-in sum()
for i in range(array.shape[0]):
    for j in range(array.shape[1]):
        total += array[i, j]
print(total / array.size)
```
- This took 3 seconds to run; not bad, but THIS code took only 0.005 seconds to run!
```
array.mean()
```
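- If you want to try the comparison yourself, here's a minimal timing sketch using Python's `timeit` module (the 200x200 array size and 5 repetitions are arbitrary choices to keep it quick, and `loop_mean` is a made-up helper name; your timings will differ from the numbers above)
```python
import timeit
import numpy as np

array = np.random.random((200, 200))  # Size chosen arbitrarily to keep the example quick

def loop_mean(a):
    # The slow way: visit every element from Python
    total = 0.0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            total += a[i, j]
    return total / a.size

loop_time = timeit.timeit(lambda: loop_mean(array), number=5)
numpy_time = timeit.timeit(lambda: array.mean(), number=5)
print(f"loop: {loop_time:.4f}s, numpy: {numpy_time:.4f}s")
```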
- So, numpy is EXTREMELY useful, especially for the stuff we're dealing with in this class
- Now, onto your 1st project, Martingale!
- Professor Balch is more what Professor Byrd likes to call a "normal person" than your average Professor, and when he goes to Las Vegas (which he does surprisingly often) he likes to use the "Martingale" strategy
- So, to introduce you to Python and these libraries, we'll be testing how good this strategy actually is!
- What's in the project? Basically, just the world's dumbest roulette wheel!
- You give it the odds of winning the bet you're making, and it returns a boolean: "true" if you won and "false" if you lost
- What do we expect you to do?
- Basically, we expect you to run trials/"episodes" of an experiment, following the rules we give you
- Each "episode" will be one gambler walking into the casino, using the Martingale strategy, and then leaving after they've run the strategy to completion
- What is the Martingale strategy anyway, though? It's pretty simple!
- You bet $1 on black; if you win, you gain a dollar, and if you lose you lose that bet (i.e. +/-$1 every time)
- When you lose, you double your bet
- When you win, you go back to $1 and repeat
- You keep doing this until you hit EITHER your win limit OR your loss limit (i.e. you've won or lost a certain amount of money)
- So, what's your win percentage with this?
- So, with an ideal, "fair" roulette wheel ("which you'll never find anyway"), you would have 18 black/36 total spaces = 0.5 chance of winning
- What casinos actually do to make SURE you'll lose more often, then, is add green spaces to the wheel (usually 1 or 2, depending on where you are in the world) that count as losses; so, you'd have a win percentage of "18 black/(36 + green spaces)", e.g. 18/38 = ~0.474 with 2 green spaces
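- The chance of landing on black is just the 18 black spaces over the total number of spaces on the wheel; as a quick sketch (`win_probability` is a made-up helper name, not something from the project code):
```python
def win_probability(green_spaces):
    """Chance of landing on black with 18 black, 18 red, and some green spaces."""
    return 18 / (36 + green_spaces)

for greens in (0, 1, 2):
    print(greens, round(win_probability(greens), 4))
```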
- For each episode, we'll always do 1000 spins
- After each spin, you'll track your TOTAL winnings thus far (positive or negative) in an array
- The win limit will always be $80
- If you hit your win/loss limit, you'll stop betting, but still populate the rest of the array by copying your winnings as of the last "actual" spin into all the remaining entries
- You'll then plot the array of winnings on the Y axis, and the spin # on the X axis
- You can think of each subsequence ending with a win always having a net gain of +$1; the problem, though, is that you might lose so much money before that point that you
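- Putting those rules together, one episode could be sketched roughly like this; this is NOT the official project solution, the names `spin_wins` and `run_episode` are hypothetical, and `spin_wins` is a stand-in that uses numpy's random numbers instead of the roulette wheel you're actually given
```python
import numpy as np

def spin_wins(win_prob):
    # Hypothetical stand-in for the project's roulette wheel
    return np.random.random() < win_prob

def run_episode(win_prob, num_spins=1000, win_limit=80):
    winnings = np.zeros(num_spins, dtype=int)  # Total winnings after each spin
    total, bet = 0, 1
    for spin in range(num_spins):
        if total < win_limit:        # Still betting
            if spin_wins(win_prob):
                total += bet
                bet = 1              # Win: go back to a $1 bet
            else:
                total -= bet
                bet *= 2             # Loss: double the bet
        winnings[spin] = total       # Once the limit is hit, this just repeats the last value
    return winnings

np.random.seed(0)                    # Arbitrary seed for repeatability
episode = run_episode(18 / 38)       # Assumes a wheel with 2 green spaces
print(episode[-1])
```
- This version has no loss limit, so it matches the "infinite bankroll" setup; a limited bankroll would need an extra check before each bet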
- So, what're your 2 experiments?
- Experiment 1 is that you have an INFINITE bankroll (you'll have no loss limit at all)
- For this one's first part, you'll run 10 completely separate episodes, and you'll plot each of their records as 10 separate lines
- Matplotlib.pyplot should make this pretty easy
- You'll then run 1000 episodes of this, but only make 2 plots:
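- The mean winnings across all the episodes at each spin, along with the mean +/- the standard deviation (3 lines total)
- The median winnings across all the episodes at each spin, along with the median +/- the standard deviation (3 lines total)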
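- One way to get those lines is to stack every episode as a row of a 2D array and take statistics down the columns with `axis=0`; a sketch under that assumption (the data below is a made-up +/-$1 random walk standing in for real Martingale results):
```python
import numpy as np

np.random.seed(1)  # Arbitrary seed
# Made-up stand-in data: 1000 episodes x 1000 spins of +/-$1 random steps
steps = np.random.choice([-1, 1], size=(1000, 1000))
winnings = steps.cumsum(axis=1)        # Running total along each episode (row)

mean = winnings.mean(axis=0)           # One value per spin, averaged over episodes
median = np.median(winnings, axis=0)   # One value per spin
std = winnings.std(axis=0)             # Spread across episodes at each spin

upper, lower = mean + std, mean - std  # The other two lines of the "mean" plot
print(mean.shape, median.shape, std.shape)
```
- Each of these 1D arrays can then be handed to matplotlib's `plt.plot()` as its own line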
- Experiment 2 is the same thing EXCEPT you now have only $256 in your pocket
- So, you do the same mean/median plots, but if you lose enough money that you can't double your bet, you'll just bet whatever you have left
- If you lose that, then you'll just "bet" $0 for the rest of the spins
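- The "bet whatever you have left" rule boils down to capping each bet at your remaining money; a minimal sketch (`next_bet` is a hypothetical helper name, and it assumes you track total winnings as a negative number when you're losing):
```python
def next_bet(desired_bet, total_winnings, bankroll=256):
    """Cap the bet at whatever money is left (hypothetical helper)."""
    money_left = bankroll + total_winnings  # total_winnings is negative when losing
    return max(0, min(desired_bet, money_left))

print(next_bet(1, 0))       # Plenty of money left, so bet the full $1
print(next_bet(256, -255))  # Lost 1+2+...+128 = $255, so only $1 is left to bet
print(next_bet(2, -256))    # Completely broke, so the "bet" is $0
```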
- So, that's project 1; next time we'll get into some financial details for this course, so I'll see you all for that!