Google Developers Notes

by Gregory Walsh


This page collects the notes I took on the Google Developers Machine Learning Recipes with Josh Gordon YouTube series. These notes are also available in the GitHub repo for this Independent Study.


NOTE: In order to use the Anaconda libraries, you need to open VS Code from Anaconda.

Episode 1:

Supervised Learning Recipe:
    Collect Training Data -> Train Classifier -> Make Predictions

    Collect Training Data: 
    In this case we are looking at fruit. 
    The Table would look something like 
    Weight | Texture | Label 
    
    Weight and Texture are FEATURES; the better the features, the easier it is to discriminate between classes.
    The last column is called a LABEL; it identifies the class each example belongs to (apple or orange).
    Features are the INPUT to a Classifier and the LABEL is the OUTPUT.
    The whole table is the TRAINING DATA. 

    The data will look like this when put in Python:

    features = [[140, "smooth"], [130, "smooth"], [150, "bumpy"], [170, "bumpy"]]
    labels = ["apple", "apple", "orange", "orange"]

    Because sklearn uses real-valued features, we need to change the strings to ints:
        0 -> bumpy
        1 -> smooth
        0 -> apple
        1 -> orange
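
    With that encoding, the Python lists above become (a minimal sketch using the mapping listed here):

    features = [[140, 1], [130, 1], [150, 0], [170, 0]]
    labels = [0, 0, 1, 1]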

    Train Classifier:

    This exercise uses a Decision Tree
    
    A Classifier is a box of rules. 
    A Learning Algorithm is the procedure that creates Classifiers.
        It finds patterns in data.
        It is exposed on the classifier object as the fit() method.
    There are many different types of classifiers
    You need to import the tree module and create the classifier:

    from sklearn import tree
    clf = tree.DecisionTreeClassifier()
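
    Putting Episode 1 together, a minimal runnable sketch (predicting a heavy, bumpy fruit should give orange):

    from sklearn import tree

    # training data, encoded as above (1 = smooth, 0 = bumpy; 0 = apple, 1 = orange)
    features = [[140, 1], [130, 1], [150, 0], [170, 0]]
    labels = [0, 0, 1, 1]

    clf = tree.DecisionTreeClassifier()
    clf = clf.fit(features, labels)   # fit() finds patterns in the training data
    print(clf.predict([[160, 0]]))    # -> [1], i.e. orange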

    Important Concepts:
        How does this work in the real world?
        How much training data do you need?
        How is the tree created?
        What makes a good feature?

Episode 2:

There are MANY types of Classifiers
    EX.
    Decision Tree - Easy to Read and Understand 
    Artificial neural network
    Support Vector Machine
    
Data Set being used is "Iris" 
    Has 3 different species; Sepal Length and Width plus Petal Length and Width are our Features, and the Species is our Label.
Goals: 
    1. Import dataset
    2. Train a classifier
    3. Predict label for new flower
    4. Visualize the tree  

1. Import dataset
    sklearn has the iris dataset built in, so we can just import it.
    Note: The linter I'm using throws errors, but everything compiles and runs fine.

2. Train the Classifier
    This was done by removing one example of each species and then fitting the tree on the remaining data, so the held-out examples could be used to check the predictions.

3. Predict label for new flower 
    After the fit, predict was run on the held-out flowers and it returned exactly the labels they had before removal, showing the training and prediction were accurate (see the sketch below).
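
    A sketch of steps 1-3, following the episode's approach of holding out one flower of each species (the test indices are one possible choice):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn import tree

    iris = load_iris()
    test_idx = [0, 50, 100]                # one flower of each species

    # training data: everything except the held-out flowers
    train_data = np.delete(iris.data, test_idx, axis=0)
    train_target = np.delete(iris.target, test_idx)

    # testing data: just the held-out flowers
    test_data = iris.data[test_idx]
    test_target = iris.target[test_idx]

    clf = tree.DecisionTreeClassifier()
    clf.fit(train_data, train_target)

    print(test_target)                     # true labels: [0 1 2]
    print(clf.predict(test_data))          # predictions should match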

4. Visualize the tree
    Graphviz was used to generate a color-coded tree, which was then exported to a PDF called "iris.pdf" in the same folder.
    This created a great visual representation of how the model makes its decisions.
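
    One way to reproduce the visualization, continuing from the sketch above and assuming the graphviz Python package is installed (the episode itself uses pydot; this is an equivalent route):

    from sklearn.tree import export_graphviz
    import graphviz

    dot_data = export_graphviz(clf, out_file=None,
                               feature_names=iris.feature_names,
                               class_names=iris.target_names,
                               filled=True, rounded=True)
    graphviz.Source(dot_data).render("iris")   # writes iris.pdf next to the script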

Important Notes:
    Every question a Tree asks must be about your Features. That means the better your features are, the better a tree you can build. 

Episode 3:

What makes a Good Feature?

Some features, like height or weight, are useful on their own, but you usually need multiple features to predict properly.
Avoid useless and redundant features; keep features independent. If you have height in inches you don't need height in centimeters.

Ideal features are:
    informative
    independent
    simple

Episode 4:

Machine Learning pipelines
- scikit-learn has a handy function for splitting data sets into a training and a testing set
- it's sklearn.model_selection.train_test_split(data_set_features, data_set_labels, test_size=test_fraction)
- this function will return 1) training_features 2) testing_features 3) training_labels and 4) testing_labels
- i.e. it returns a tuple of 4 elements
- note, the test_size argument specifies the fraction of the data you want to use for testing
- so if you pass 0.5, half the data is used for testing and the other half for training (see the example below)
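
For example, a 50/50 split of the iris data (variable names are my own):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.5)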

- recall that the .predict() method returns a list of predictions for the list of examples you pass it
- you can use sklearn.metrics.accuracy_score(test_labels, predicted_labels) to compare two lists of labels, as in the sketch below
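
Continuing from the split above, a sketch of scoring a classifier:

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    clf = DecisionTreeClassifier()
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    print(accuracy_score(y_test, predictions))   # fraction of labels predicted correctly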

- supervised learning is also known as function approximation because ultimately what you are doing is finding a function that matches your training examples well
- you start with some general form of the function (e.g. y = mx+b) and then you tune the parameters such that it best describes your training examples (i.e. change m and b until you get a line that best splits your data)

Key thing to take away from the video:
Supervised learning is just function approximation. You start with a general function and then tweak the parameters of the function based on your training examples until your function describes the training data well.

    
F(x) = y, where x = the Features and y = the Label

One way to think of learning is using training data to adjust the parameters of a model
playground.tensorflow.org
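
A toy illustration of "tuning parameters to fit training data" (made-up points, and a brute-force search rather than a real learning algorithm):

    # fit y = m*x + b by trying integer parameters and keeping the best pair
    points = [(1, 3), (2, 5), (3, 7)]          # secretly generated by y = 2x + 1

    def squared_error(m, b):
        return sum((y - (m * x + b)) ** 2 for x, y in points)

    best_m, best_b = min(
        ((m, b) for m in range(-5, 6) for b in range(-5, 6)),
        key=lambda mb: squared_error(*mb))
    print(best_m, best_b)                       # -> 2 1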

Episode 5: Writing Classifier from scratch

Creating a K-Nearest Neighbors classifier to use with the pipeline.py program.
Algorithm info from sklearn here:
https://scikit-learn.org/stable/modules/neighbors.html

K Nearest Neighbors tests the distance to the closest neighbor to classify the unknown data.
k is the number of neighbors to consider; it matters when a point is equidistant from both the red and the green side.
If k was 1 then it would look at the closest neighbor.
If k was 3 then it would look at the 3 closest neighbors and make the decision from there. 

Implementation:

1. Find the nearest neighbor by straight-line distance:
    Uses the Euclidean Distance between two points, d = sqrt((x2 - x1)^2 + (y2 - y1)^2), the formula shown in formula.png.
    With 2 features the distance lives in 2D space; with 3 or 4 features it lives in 3D or 4D. That is hard to visualize, but Euclidean distance works the same regardless of the number of dimensions.

2. Run the rest of the program with the new pipeline (see the from-scratch sketch after the notes below)

-  recall, a classifier is first trained (where it receives a bunch of training inputs along with the corresponding outputs) and then it is asked to predict the output for new inputs.
-  one basic way of implementing a classifier is to save all of your training input/output pairs, and then when you are asked to predict the output for a new input, return the output of the closest training input to your new input

- so for example, let's say I give you the following training inputs (1, 12, 130) and their respective outputs (2, 14, 140).
- and then I ask you to predict the output for a new input, 4
- well, 4 is closest to 1, and the output for 1 was 2, so we will predict the output for 4 to be 2 as well

- the inputs in the above example are 1 dimensional, but you can use the distance formula for 2, 3 or any number of dimensions
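
A from-scratch sketch in the spirit of the episode's classifier (k = 1: it just returns the label of the single nearest training example; scipy provides the Euclidean distance):

    from scipy.spatial import distance

    class ScrappyKNN:
        def fit(self, X_train, y_train):
            # "training" is just memorizing the examples
            self.X_train = X_train
            self.y_train = y_train

        def predict(self, X_test):
            return [self.closest(row) for row in X_test]

        def closest(self, row):
            # label of the training example nearest to row
            best_dist = distance.euclidean(row, self.X_train[0])
            best_index = 0
            for i in range(1, len(self.X_train)):
                dist = distance.euclidean(row, self.X_train[i])
                if dist < best_dist:
                    best_dist, best_index = dist, i
            return self.y_train[best_index]

This can be swapped into the Episode 4 pipeline in place of DecisionTreeClassifier.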


Pros: 
    Relatively simple
Cons:
    Computationally intensive
    Hard to represent relationships between features

Episode 6: Image Classifier

    Uses TensorFlow for Poets.
    High-level code.

    In order to train TensorFlow you need lots of images.
    The goal is to create a classifier that can tell the difference between 5 types of flowers.

Sources

Google Developers. “Let’s Write a Decision Tree Classifier from Scratch - Machine Learning Recipes #8.” YouTube, 13 Sept. 2017, www.youtube.com/watch?v=LDRbO9a6XPU.