Homework 0: Penguins

29 Mar 2021 in Blog / Programming / Pic16b

A quick tutorial on PIC 16’s favorite dataset. Image credit to Allison Horst.

Introduction to the Dataset

My penguins-, um, enthused Python professor just gave us this dataset all about penguins! In fact, we had an entire project dedicated to penguins last quarter in PIC 16A… Either way, let’s get started on making a cool dataset visualization about these penguins, and see if we can learn anything just from some pretty plots.

Of course, even though we spent many more hours than I’m willing to admit staring at concentrations of isotopes in penguin blood last quarter, we’d better start by giving an introduction to the data. In particular, it contains measurements on three penguin species: Chinstrap, Gentoo, and Adelie.

import pandas as pd
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/palmer_penguins.csv"
penguins = pd.read_csv(url)

penguins.head()

	studyName	Sample Number	Species	Region	Island	Stage	Individual ID	Clutch Completion	Date Egg	Culmen Length (mm)	Culmen Depth (mm)	Flipper Length (mm)	Body Mass (g)	Sex	Delta 15 N (o/oo)	Delta 13 C (o/oo)	Comments
0	PAL0708	1	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N1A1	Yes	11/11/07	39.1	18.7	181.0	3750.0	MALE	NaN	NaN	Not enough blood for isotopes.
1	PAL0708	2	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N1A2	Yes	11/11/07	39.5	17.4	186.0	3800.0	FEMALE	8.94956	-24.69454	NaN
2	PAL0708	3	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N2A1	Yes	11/16/07	40.3	18.0	195.0	3250.0	FEMALE	8.36821	-25.33302	NaN
3	PAL0708	4	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N2A2	Yes	11/16/07	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Adult not sampled.
4	PAL0708	5	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N3A1	Yes	11/16/07	36.7	19.3	193.0	3450.0	FEMALE	8.76651	-25.32426	NaN

Neat, so as you might expect, we have each row corresponding to a particular penguin, then we get all its info: species, region, physiological measurements, etc.

For our purposes, we won’t need to worry too much about cleaning (no fancy machine learning yet), but we still need to do just one small step: the species name is currently pretty bulky, so let’s just take the first word.

# replace species with just its first word
penguins['Species'] = penguins['Species'].str.split(' ').str.get(0)

penguins.head()

	studyName	Sample Number	Species	Region	Island	Stage	Individual ID	Clutch Completion	Date Egg	Culmen Length (mm)	Culmen Depth (mm)	Flipper Length (mm)	Body Mass (g)	Sex	Delta 15 N (o/oo)	Delta 13 C (o/oo)	Comments
0	PAL0708	1	Adelie	Anvers	Torgersen	Adult, 1 Egg Stage	N1A1	Yes	11/11/07	39.1	18.7	181.0	3750.0	MALE	NaN	NaN	Not enough blood for isotopes.
1	PAL0708	2	Adelie	Anvers	Torgersen	Adult, 1 Egg Stage	N1A2	Yes	11/11/07	39.5	17.4	186.0	3800.0	FEMALE	8.94956	-24.69454	NaN
2	PAL0708	3	Adelie	Anvers	Torgersen	Adult, 1 Egg Stage	N2A1	Yes	11/16/07	40.3	18.0	195.0	3250.0	FEMALE	8.36821	-25.33302	NaN
3	PAL0708	4	Adelie	Anvers	Torgersen	Adult, 1 Egg Stage	N2A2	Yes	11/16/07	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Adult not sampled.
4	PAL0708	5	Adelie	Anvers	Torgersen	Adult, 1 Egg Stage	N3A1	Yes	11/16/07	36.7	19.3	193.0	3450.0	FEMALE	8.76651	-25.32426	NaN

Professor helped me get this table looking a lot better.

I think that looks a lot better already.
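For anyone curious what the `.str.split(' ').str.get(0)` chain does to each value, here's a plain-Python sketch of the same idea on single strings. The Adelie label is copied from the table above; the Chinstrap and Gentoo labels are my assumption about how those species are written in the full dataset.

```python
# Full species labels: the Adelie one appears in the table above, the other
# two are assumed to follow the same "<Name> penguin (<Latin name>)" pattern
full_names = [
    "Adelie Penguin (Pygoscelis adeliae)",
    "Chinstrap penguin (Pygoscelis antarctica)",
    "Gentoo penguin (Pygoscelis papua)",
]

# Split each label on spaces and keep only the first piece
short_names = [name.split(" ")[0] for name in full_names]
print(short_names)
```

The pandas version does the same thing to every row of the column at once, which is why no loop is needed there.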

Making the Visualization

But now, onto the actual data visualization. I’d like to create a histogram for each species that shows the distribution of Culmen Depth for males and females in two different colors. That way, we’ll be able to see at a glance whether male and female penguins differ, and whether that difference is consistent across all the species.

First and foremost though, we’ll need matplotlib.

from matplotlib import pyplot as plt

Now let’s create a function that plots two histograms on top of each other. The idea will be to use a groupby and apply later so that we can create this double histogram for each species.

# Overlay 2 histograms to compare them
def overlaid_histogram(data1, data2, n_bins = 10, data1_name=None, data1_color="firebrick", data2_name=None, data2_color="blue", **kwargs):
    # get axes in the standard way
    fig, ax = plt.subplots()

    # create the two histograms; n_bins sets the number of bins for each one
    ax.hist(data1, bins = n_bins, color = data1_color, alpha = 1, label = data1_name or "")
    ax.hist(data2, bins = n_bins, color = data2_color, alpha = 0.75, label = data2_name or "")
    # any remaining keyword arguments (xlabel, ylabel, title, ...) go straight to the axes
    ax.set(**kwargs)
    ax.legend(loc = 'best')
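One optional refinement: as far as I understand, when `bins` is an integer each `ax.hist` call picks its bin edges from the range of its own data, so the two sets of bars may not line up exactly. A minimal sketch of one way around that is to compute a single set of edges from both samples. The helper name `shared_bin_edges` and the sample values below are made up for illustration and aren't part of the original homework.

```python
import math

def shared_bin_edges(data1, data2, n_bins=10):
    """Return n_bins + 1 evenly spaced edges spanning both samples (hypothetical helper)."""
    # Pool both samples, skipping NaN entries such as unsampled penguins
    values = [float(v) for v in list(data1) + list(data2) if not math.isnan(float(v))]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]

# Made-up culmen depths, for illustration only
males = [18.7, 19.3, 20.6, float("nan")]
females = [17.4, 18.0, 16.6]
edges = shared_bin_edges(males, females, n_bins=4)
print(edges)
```

Since `ax.hist` also accepts a sequence of edges for `bins`, passing the same `edges` list to both calls should put the male and female bars on a common grid.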

I can only use SpeedChat.

Just for fun, we can test this on a single species, say Adelie.

adelie_df = penguins[penguins['Species'] == 'Adelie'] # glad we shortened the species name...
overlaid_histogram(adelie_df[adelie_df['Sex'] == 'MALE']['Culmen Depth (mm)'], adelie_df[adelie_df['Sex'] == 'FEMALE']['Culmen Depth (mm)'],
n_bins=10,
data1_name="Adelie Males",
data2_name="Adelie Females",
xlabel="Culmen Depth (mm)",
ylabel="Frequency",
title="Adelie Gender Culmen Comparison")

I think that looks great. So now, let’s just generalize this in a function, and then run our groupby + apply.

def plot_species_histogram(df):
    species_name = df['Species'].iloc[0]
    overlaid_histogram(df[df['Sex'] == 'MALE']['Culmen Depth (mm)'], df[df['Sex'] == 'FEMALE']['Culmen Depth (mm)'],
    n_bins=10,
    data1_name=species_name+" Males",
    data2_name=species_name+" Females",
    xlabel="Culmen Depth (mm)",
    ylabel="Frequency",
    title=species_name+" Gender Culmen Comparison")

penguins.groupby("Species").apply(plot_species_histogram)

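As a side note, `apply` is really meant for functions that return something, and I believe recent pandas versions warn when the applied function reads the grouping column the way `df['Species']` does here. A hedged alternative is to loop over the groups directly, since iterating over a groupby yields `(key, sub-DataFrame)` pairs. The toy table and the `split_by_sex` helper below are made up for illustration; with the real data, the loop body would call `overlaid_histogram` instead of recording counts.

```python
import pandas as pd

# Tiny made-up stand-in for the penguins table (values are illustrative only)
toy = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "Sex": ["MALE", "FEMALE", "MALE", "FEMALE", "MALE"],
    "Culmen Depth (mm)": [18.7, 17.4, 15.7, 14.2, 19.2],
})

def split_by_sex(df, column="Culmen Depth (mm)"):
    """Return the male and female values of `column` as two Series (hypothetical helper)."""
    males = df.loc[df["Sex"] == "MALE", column]
    females = df.loc[df["Sex"] == "FEMALE", column]
    return males, females

# Iterating over a groupby yields (group key, sub-DataFrame) pairs
seen = []
for species_name, group in toy.groupby("Species"):
    males, females = split_by_sex(group)
    # With the real data, overlaid_histogram(males, females, ...) would go here
    seen.append((species_name, len(males), len(females)))

print(seen)
```

The species name comes straight from the loop variable, so there's no need to dig it out of the first row with `.iloc[0]`.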
Michael Ting

SpeedChat Enjoyer