//****************************************************************************// //**************** Numpy Crash-Course - August 22nd, 2019 *******************// //**************************************************************************// - Alright, my laptop has decided to start acting up and I disapprove of its shenanigans (doesn't sleep when the lid's closed, etc.) - So, a few announcements: - Please make sure you're signed up on Piazza! We won't have essential information on there, but it's certainly a valuable resource (there's a TON of posts since there's ~800 students, but the pinned posts are helpful summaries) - Canvas will also be used to make announcements, hand in homework, etc., as well as recorded lectures from a previous semester of the course (if you ever miss a lecture, etc.) - "I still think there's some value in coming here and staring at my face, though, so please come" - Office hours will be right after lecture, outside the classroom; TA office hours will begin next week - Project 1 (Martingale) is due 10 days from now; I know our online project descriptions look like legal contracts, so we'll go over it a bit during class - Most projects are due Sunday night "anywhere on earth" (due to the online students - it's easier than dealing with the timezone madness) - Late assignments will NOT be accepted; if you have an actual emergency, talk to the dean of students and they can get you an exception - There are also 4 CentOS servers in this class called the "buffet" servers - "they were meant to be named after Warren Buffett, but they are not, because of spelling" - and you'll have access to all of them - These should come pre-configured with all the libraries and proper versions of Python you need; run your code there to guarantee it'll run on our laptops (if it does NOT run there when the TAs run it, you'll get an automatic 0) - You don't have access YET, though; the IT department needs to approve the class roster after Phase II ends, and that won't happen until sometime next week (it's not mandatory for project 1) - Provisionally, Exam 1 will be on October 10th and Exam 2 will be right before Thanksgiving; these dates ARE subject to change, so don't plan around them just yet -------------------------------------------------------------------------------- - So, today we'll give you guys a crash-course through NumPy - a math library for Python that's SUPER well optimized (and written in Fortran/C), and is basically Matlab in Python - "If you're trying to process thousands of trades a second, you don't want 3 nested for-loops inside each other; you want matrix math speediness!" - LAST TIME, though (when I personally wasn't here), we talked about how the CSV files are organized as follows: Date,OpenPrice,HighPrice,LowPrice,ClosePrice,Volume,AdjustedClosePrice - Dates will always be in the format "YYYY-MM-DD", meaning we can sort things by date properly - Open/high/low/close will be in "decimal dollars.cents" format - Volume format will almost ALWAYS ends in 2 zeroes, since stocks are usually traded in "round orders" in increments of 100, with multiple people's smaller orders bundled together - Adjusted close price controls for stuff like stock-splitting (splitting the price of the stock while doubling the number of shares), so we don't actually think armageddon is breaking out when nothing's happened - With that out of the way, there are 3 main Python libraries we'll use in this class: - NumPy (Matlab for python) - Pandas (a time-series library based on NumPy) - Matplotlib (Matlab-style plots and graphs in Python) - We're using basically the latest version of all of these, and hopefully you'll be very familiar with all of them by the time the course ends - Why are we using Pandas instead of just using raw Python? Suppose we wanted to take a CSV file for Yahoo's stock price and print out the highest price in the last 5 years (and plot the price data) - We could do this by writing our own iteration function, OR we could write this complete code snippet: ```python import pandas as pd # Dataframe, just a 2D numpy array with fancy indexes df = pd.readCSV('data/Yahoo.csv', index_col='Date', # Use dates as indexes parse_dates=True) # Parse the indices as dates print(df['AdjClose'].max()) df['AdjClose'].plot() # Pandas has matplotlib built in! ``` - As a side-note, since numpy is actually C-code, if you want to write fast code you do NOT want to leave C-land and come back to python; every time you go back to Python in a for-loop, that's less time you're spending in that super optimized bytecode the numpy authors wrote - So, what IS numpy anyway? - Even for CS people who haven't used Matlab, this can be a little strange, so let's walk through this! - If you HAVE used Matlab, know that all numpy functions operate on a per-element basis - There are THOUSANDS of functions in numpy, but only a few of them you really need to use often - First, let's assume we've imported numpy as follows: ```python import numpy as np ``` - Then, we'll want to create a "numpy array" that we can call numpy functions on, which we can do from existing data as follows: ```python np.array(<sequence, tuple, list, etc.>) ``` - Or, if we want to make a totally empty numpy array as fast as possible (avoiding malloc calls and such): ```python np.empty(dimensions tuple, e.g. (4,9)) ``` - We can create it initialized to all 0s/1s with `np.zeroes()` or `np.ones()` - More often in this class, you might want an array filled with random numbers, and we can do that SUPER efficiently with numpy: ```python np.random.random(<dimensions>) # Uniform random numbers in range [0.0, 1.0) np.random.normal(mean, std, size=(1,4)) # Generates "size"-sized array w/ random numbers from a normal distribution with the given mean/standard deviation np.random.randint(low, high, size=(1,4)) # Generates "size"-sized array with integers from low to high np.random.seed(seed) # Set the seed for the random number, which is NOT the same as Python's seed, so that we can get predictable results ``` - How can we sliceThese arrays? It's pretty similar to standard Python slicing: ```python n[2,4] # Get item at row 2, column 4 n[2:4, 0:2] # Get rows 2 to 4 (NOT including 4), and columns 0 to 2 (NOT including 2) n[:, 4] # Get all rows, 4th column ``` - Some useful information that you can get about an array: ```python n.shape = (5,4) # Returns a tuple of (# rows, # columns) n.size = 20 # Returns the NUMBER of elements n. ``` - ...and some other common functions you *might* want to use: ```python nd.sum() # Sums all elements in your array together nd.sum(axis=0) # Confusingly, this sums ACROSS the given dimension; so, this'd sum ACROSS your rows and give you the sum in each column nd.min() # Minimum of the whole array, or nd.min(axis=0) # Minimum of each column, returned as a 1D array .max() .mean() .std() # Standard deviation ``` - One other thing that might not be obvious: you can assign stuff to slices! ```python nd[2:4, 1:3] = nd[1:3, 0:2] ``` - One semi-confusing thing that numpy supports is ARRAY BROADCASTING: where we can "broadcast" a value we're assigning/getting to multiple elements in the array - As an example: ```python nd[:, 1] = 4 # Sets all elements in column 1 to 4 ``` - This also lets us do "array masking", where we can specify the indices we're interested in and then get only them, e.g. ```python data = [9, 7, 4, 1, 6, 3] indices = [1, 4, 1, 0] data[indices] = [7, 6, 7, 9] ``` - Why is that useful? Because we can chain it together with other stuff! - For instance, we can get the indices of all the maximum numbers and then access them to do something like this: ```python array[array.max().maxidx()].runSomeOperation() ``` - How much faster is numpy? A LOT faster than standard python! - Consider this code, which gets the average of an array: ```python for i in range(array.shape[0]): for j in range(array.shape[1]): sum += array[i,j] print(sum/array.size) ``` - This took 3 seconds to run; not bad, but THIS code took only 0.005 seconds to run! ``` array.mean() ``` - So, numpy is EXTREMELY useful, especially for the stuff we're dealing with in this class - Now, onto your 1st project, Martingale! - Professor Balch is more what Professor Byrd likes to call a "normal person" than your average Professor, and when he goes to Las Vegas (which he does surprisingly often) he like to use the "Martingale" strategy - So, to introduce you to Python and these libraries, we'll be testing how good this strategy actually is! - What's in the project? Basically, just the world's dumbest roulette wheel! - You give it the boolean odds you're betting on, and it returns "true" if you won and "false" if you lose - What do we expect you to do? - Basically, we expect you to run trials/"episodes" of an experiment, following the rules we give you - Each "episode" will be one gambler walking into the casino, using the Martingale strategy, and then leaving after they've run the strategy to completion - What is the Martingale strategy anyway, though? It's pretty simple! - You bet $1 on black; if you win, you gain a dollar, and if you lose you lose that bet (i.e. +/-$1 every time) - When you lose, you double your bet - When you win, you go back to $1 and repeat - You keep doing this until you hit EITHER your win/loss limit (i.e. you've won or lost a certain amount of money) - So, what's your win percentage with this? - So, with an ideal, "fair" roulette wheel ("which you'll never find anyway"), you would have 18 black/36 total spaces = 0.5 chance of winning - What casinos actually do to make SURE you'll lose more often, then, is add green spaces to the wheel (often 2 or 3, depending on where you are in the world) that count as losses; so, you'd have a win percentage of "(18 black - green spaces)/(36 + green spaces)" - For each episode, we'll always do 1000 spins - If you hit your win/loss limit, you'll stop betting, but still populate the rest of the array by... - ...after each spin, tracking your TOTAL winnings thus far (positive or negative) - The win limit will always be $80 - If you hit your limit, you'll just copy your last "actual" spin to the remaining - You'll then plot the array of winnings on the Y axis, and the spin # on the X axis - You can think of each subsequence ending with a win always having a net gain of +$1; the problem, though, is that you might lose so much money before that point that you - So, what're you 2 experiments? - Experiment 1 is that you have an INFINITE bankroll (you'll have no loss limit at all) - For this one's first part, you'll run 10 completely separate episodes, and you'll plot each of their records as 10 separate lines - Matplotlib.pyplot should make this pretty easy - You'll then run 1000 episodes of this, but only make 2 plots: - The mean winning of each episode at that point, +/- the standard deviation (3 lines total) - The median +/- the standard deviation - Experiment 2 is the same thing EXCEPT you now have only $256 in your pocket - So, you do the same mean/median plots, but if you lose enough money that you can't double your bet, you'll just bet whatever you have left - If you lose that, then you'll just "bet" $0 for the rest of the spins - So, that's project 1; next time we'll get into some financial details for this course, so I'll see you all for that!