I recently submitted an assignment for a data science module at uni, which asked us to look at two datasets; one contained information about the wholesale prices of fruit and vegetables in England for the years 2004-2012 (https://data.gov.uk/dataset/agricultral_market_reports) and the other orchard space in England and Wales (https://data.gov.uk/dataset/orchard_fruit_survey). The files were saved as as *fruitveg.csv* and *orchfruit_ap&pr_30may13.csv* respectively. I’ve included parts of my submission below which might be of interest to those looking for a simple example of how to work with data to garner insights. It includes cleaning the data, dealing with inconsistencies and ambiguity, visualising the data and finally touches on what it tells us and what we can learn from it.

*Given these two datasets, produce a graphical representation of the changes in average wholesale price and orchard space for each type of dessert apple grown in England and Wales, and discuss what your representation shows.*

**Initial cleaning of datasets in OpenRefine**

I initially opened both files in the spreadsheet program OpenOffice Calc (similar to Microsoft Exel) to get a preview of the datasets and see what kind of formats they were in.

I opened `fruitveg.csv`

as comma separated, and realised the only values which are delimited by commas pertain to beans – as we aren’t dealing with beans, only dessert apples, we can safely open the file with the delimiter set to tabs.

As we are interested in changes in the relationship between the wholesale price and orchard space for each variety over time, I think it would be sensible to have a separate column for the year; I added a new row to the top of the file with the column headers, thereby ensuring consistency with the location of the years (e.g. 2005, 2005, 2006…)

I then saved the file, and opened it up in OpenRefine, to add the additional column for years: I clicked on the Categories column dropdown and applied a text filter using the regular expression `\d{4}$`

which highlighted the rows containing the years. I then selected edit column -> add column based on column, and named the new column ‘Year’, before moving it to the far left to become the first column. Finally, the Year and Category columns had the ‘fill down’ operation applied so that they would have the values duplicated for each applicable row – this allows the file to be interpreted as a dataframe in pandas.

As we are only interested in dessert apples, we run a custom text facet on the Category column where `value = 'Dessert Apples'`

we can then select the rows which match false, and remove all matching rows, leaving us with only the dessert apple entries. I’ve included a screenshot below to see what we are now dealing with.

I noticed inconsistencies in terms of names of apple varieties in both files and changed the names of apple varieties in the two datasets to ensure consistency:

In `fruitveg.csv`

- Removed entries for the Red Pippin variety, as we don’t have any orchard space data for the relevant years in
`orchfruit_ap&pr_30may13.csv`

to plot against - Renamed ‘Cox’s Orange group’ to Cox on the assumption that they were the same variety of apple
- Renamed ‘Jonogold – group’ to Jonagold as it intuitively makes more sense and allows for merging of datasets

In `orchfruit_ap&pr_30may13.csv`

- Removed the years prior to 2004 as we don’t have data on wholesale prices prior to 2004 to plot against
- There was no data for Fiesta/Red Pippin apples (which I would have interpreted as being the same as Red Pippin) for the years between 2004 and 2015 so I removed this column
- Removed apple varieties that weren’t present in
`fruitveg.csv`

, such as Jazz and Kanzi - Removed other types of fruit like cherries and plums as we aren’t interested in them
- Renamed ‘Cox(and clones)’ to Cox on the assumption that they can be regarded as the same variety
- Renamed ‘Gala(and clones)’ to Gala on the assumption that they can be regarded as the same variety
- Renamed ‘Jonagold(and clones)’ to Jonagold on the assumption that they can be regarded as the same variety
- Renamed ‘WorcesterPearmain’ to Worcester Pearmain to ensure consistency in names across the two datasets, allowing them to be merged
- Renamed ‘EgremontRusset’ to Egremont Russet to ensure consistency in names across the two datasets, allowing them to be merged
- Renamed ‘Other dessert varieties’ to Other to ensure consistency in names across the two datasets, allowing them to be merged

I exported both files in csv format, `'appleprices.csv'`

and `'orchardspace.csv'`

(Note that in the case of `fruitveg.csv`

OpenRefine has now made the values delimited by commas, rather than tabs.)

**Import the datasets and load necessary libraries**

Here I load the Python libraries I’ll be using and import the two cleaned datasets from the data directory into two dataframes, `appleprices_df`

and `orchardspace_df`

.

1 2 3 4 5 6 7 8 9 |
# load necessary libraries import pandas as pd import numpy as np import matplotlib.pyplot as plt import math # import datasets appleprices_df = pd.read_csv('data/appleprices.csv') orchardspace_df = pd.read_csv('data/orchardspace.csv') |

**Identify and handle missing data/ambiguity**

As we are only provided with orchard space information for the years 2004, 2007, 2009 and 2012, we will only be able to produce a visualisation for these years; Therefore I’ll remove data for the other years in `appleprices_df`

as we won’t be needing it for our purposes.

1 2 3 4 5 6 7 |
# filter appleprices_df so it only contains the years 2004, 2007, 2009 and 2012 appleprices_df = appleprices_df[ (appleprices_df['Year'] == 2004) | (appleprices_df['Year'] == 2007) | (appleprices_df['Year'] == 2009) | (appleprices_df['Year'] == 2012)] appleprices_df.reset_index(drop=True, inplace=True) |

In order to produce a plot, we need to determine the average wholesale price for each variety of apple in a given year. You can see from above screenshot that there are many monthly readings for the price missing – this could be due to the particular variety not being grown during those month, or some other reason. Regardless, we will calculate an average from the values present. You can see in the code below that the mean price is calculated by taking the average of the prices for each month, and the value added to each row under the new column, ‘Mean Price’.

1 2 3 4 |
# create a mean price column appleprices_df['Mean Price'] = appleprices_df[['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']].mean(axis=1) columnNames = appleprices_df.columns # we will use this later for reindexing appleprices_df.head() |

Year | Category | Description | Quality | Units | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Mean Price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 2004 | Dessert Apples | Cox | 1st | £/kg | 0.62 | 0.63 | 0.6 | 0.58 | NaN | NaN | NaN | NaN | 0.68 | 0.51 | 0.53 | 0.58 | 0.591250 |

1 | 2004 | Dessert Apples | Cox | 2nd | £/kg | 0.40 | 0.40 | 0.4 | 0.42 | NaN | NaN | NaN | NaN | 0.39 | 0.33 | 0.30 | 0.33 | 0.371250 |

2 | 2004 | Dessert Apples | Cox | Ave | £/kg | 0.55 | 0.54 | 0.5 | 0.51 | NaN | NaN | NaN | NaN | 0.64 | 0.45 | 0.46 | 0.50 | 0.518750 |

3 | 2004 | Dessert Apples | Discovery | 1st | £/kg | NaN | NaN | NaN | NaN | NaN | NaN | 0.75 | 0.51 | 0.37 | NaN | NaN | NaN | 0.543333 |

4 | 2004 | Dessert Apples | Discovery | 2nd | £/kg | NaN | NaN | NaN | NaN | NaN | NaN | 0.66 | 0.33 | 0.25 | NaN | NaN | NaN | 0.413333 |

We have information of wholesale prices for ‘Other Early/Mid/Late Season’ apple varieties in `appleprices_df`

, and we also have orchard space data for other apple varieties in `orchardspace_df`

. It would be reasonable to assume that the orchard space data incorporates the varieties for all seasons so we will work with this assumption. For these price values we aren’t given a quality (it is recorded as -1 which we can take to mean not applicable), so we will have to address this.

It is necessary then to get an average of wholesale prices for ‘other’ apple varieties by getting the average for each of the early, mid and late varieties, and then getting the mean of the combined price. This is what we will do in the code below; we iterate over the rows in `appleprices_df`

and each time we find a record with the description (variety) of ‘Other Early Season’ we take a note of its index, adding it to the list indiciesOfOther. We can then iterate over the indices in this list and create a mean value for each year, adding it as a new row to our dataframe with the description of ‘Other’, quality of ‘Ave’ and relevant year.

You can see from the output of the code block below, that these rows have been appended to the bottom of `appleprices_df`

. Note that we haven’t recorded things like category and units – this is because it’s unnecessary – we will be removing these columns shortly anyway to arrive at our final combined dataframe.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
# deal with 'other varieties' where quality = -1 indiciesOfOther = [] for index, row in appleprices_df.iterrows(): if appleprices_df.iloc[index]['Description'] == "Other Early Season": indiciesOfOther.append(index) for index in indiciesOfOther: avg = (appleprices_df.iloc[index]['Mean Price'] + appleprices_df.iloc[index+1]['Mean Price'] + appleprices_df.iloc[index+2]['Mean Price']) / 3 otherAvg = pd.DataFrame({'Year': appleprices_df.iloc[index]['Year'], 'Description': "Other", 'Quality': "Ave", 'Mean Price': avg}, index = [len(appleprices_df.index)]) appleprices_df = pd.concat([appleprices_df, otherAvg]).reindex_axis(columnNames, axis=1) appleprices_df.tail() |

Year | Category | Description | Quality | Units | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Mean Price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

98 | 2012 | Dessert Apples | Jonagold | Ave | £/kg | NaN | 0.5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.77 | 0.73 | 0.66 | 0.665000 |

99 | 2004 | NaN | Other | Ave | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.480595 |

100 | 2007 | NaN | Other | Ave | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.533250 |

101 | 2009 | NaN | Other | Ave | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.534583 |

102 | 2012 | NaN | Other | Ave | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.738333 |

In `appleprices_df`

, we don’t know know exactly what prices recorded under ‘ave’ quality refers to. We will assume it relates to the average wholesale price of the apple variety, regardless of quality, and will use these values to calculate the mean price, as it’s likely to be the most accurate measurement. For this reason, we can discard the rows which relate to apples of 1st and 2nd quality.

You can see from the output of the code block below that we are left only with prices for apple varieties of average quality.

1 2 3 4 |
# only keep records of average quality appleprices_df = appleprices_df[appleprices_df['Quality'] == "Ave"] appleprices_df.reset_index(drop=True, inplace=True) appleprices_df.head() |

Year | Category | Description | Quality | Units | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Mean Price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 2004 | Dessert Apples | Cox | Ave | £/kg | 0.55 | 0.54 | 0.5 | 0.51 | NaN | NaN | NaN | NaN | 0.64 | 0.45 | 0.46 | 0.50 | 0.518750 |

1 | 2004 | Dessert Apples | Discovery | Ave | £/kg | NaN | NaN | NaN | NaN | NaN | NaN | 0.74 | 0.48 | 0.32 | NaN | NaN | NaN | 0.513333 |

2 | 2004 | Dessert Apples | Egremont Russet | Ave | £/kg | 0.54 | 0.55 | NaN | NaN | NaN | NaN | NaN | 0.83 | 0.63 | 0.58 | 0.58 | 0.59 | 0.614286 |

3 | 2004 | Dessert Apples | Spartan | Ave | £/kg | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.45 | 0.42 | 0.31 | 0.393333 |

4 | 2004 | Dessert Apples | Worcester Pearmain | Ave | £/kg | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.48 | 0.40 | NaN | NaN | 0.440000 |

**Put the data into an appropriate form for plotting**

To get a dataframe into a format similar to the one we were suggested to produce (with columns apple variety, year, wholesale price and orchard space), we need to prepare both dataframe for combining; They both share the columns year and apple variety so they will be combined on these properties.

In `appleprices_df`

we can delete all the columns apart from year, description and mean price. We do this in the code block below, and then sort and rename the columns to more meaningful names, which can be used in our combine operation.

1 2 3 4 5 6 7 8 9 |
# only keep year, apple variety and avg price appleprices_df = appleprices_df[['Year', 'Description', 'Mean Price']] # sort and rename columns appleprices_df.sort_values(by=['Year', 'Description'], inplace=True) appleprices_df.rename(index=str, columns={"Description": "Apple variety", "Mean Price": "Average wholesale price (£/kg)"}, inplace=True) appleprices_df.reset_index(drop=True, inplace=True) appleprices_df.head() |

Year | Apple variety | Average wholesale price (£/kg) | |
---|---|---|---|

0 | 2004 | Cox | 0.518750 |

1 | 2004 | Discovery | 0.513333 |

2 | 2004 | Egremont Russet | 0.614286 |

3 | 2004 | Gala | 0.455714 |

4 | 2004 | Jonagold | 0.433333 |

We will now use the pandas melt function to transform `orchardspace_df`

in a workable format which can be combined with `appleprices_df`

. We will also assume that the orchard space is measured in acres, as this seems the most sensible measurement. After renaming the columns you can now see that the two dataframes share two common columns, Year and Apple variety.

1 2 3 |
orchardspace_df = pd.melt(orchardspace_df, id_vars=['year']) orchardspace_df.rename(index=str, columns={"year": "Year", "variable": "Apple variety", "value": "Orchard space (acres)"}, inplace=True) orchardspace_df.head() |

Year | Apple variety | Orchard space (acres) | |
---|---|---|---|

0 | 2004 | Discovery | 301.0 |

1 | 2007 | Discovery | 189.0 |

2 | 2009 | Discovery | 177.0 |

3 | 2012 | Discovery | 157.0 |

4 | 2004 | Worcester Pearmain | 213.0 |

Finally we can combine the two dataframes into `combined_df`

which we will use to produce a plot.

1 2 3 |
# combine datasets combined_df = pd.merge(appleprices_df, orchardspace_df, on=['Year', 'Apple variety']) combined_df.head() |

Year | Apple variety | Average wholesale price (£/kg) | Orchard space (acres) | |
---|---|---|---|---|

0 | 2004 | Cox | 0.518750 | 3144.0 |

1 | 2004 | Discovery | 0.513333 | 301.0 |

2 | 2004 | Egremont Russet | 0.614286 | 308.0 |

3 | 2004 | Gala | 0.455714 | 669.0 |

4 | 2004 | Jonagold | 0.433333 | 231.0 |

**Visualise and interpret the data**

We will first create a scatterplot with all the variables plotted, and see if we can garner any insights from it, or if it raises any questions which we can explore further. In the plot below, the orchard size is represented by the actual size of the plots, and the different colours pertain to different varieties of apple. Because we aren’t focusing on the individual varieties of apples at this stage, I have omitted a legend indicating what varieties the various colors relate to.

1 |
combined_df.plot.scatter(x='Year', y='Average wholesale price (£/kg)', s=(combined_df['Orchard space (acres)']/2), c=['red', 'blue', 'orange', 'green', 'purple', 'wheat', 'pink', 'grey', 'brown', 'turquoise', 'yellow']) |

The plot is quite busy, as it maps a number of variables; we can immediately notice the correlation of wholesale price against time, which suggests that as time goes on, apples are becoming more expensive. It’s difficult to notice any relationship between wholesale price and orchard size.

Let’s have a look now at the individual apple varieties over the years and see if anything becomes apparent.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
apple1_df = combined_df[combined_df['Apple variety']=='Cox'] apple1_df.plot.scatter(x='Year', y='Average wholesale price (£/kg)', s=(apple1_df['Orchard space (acres)']/2)) plt.title('Cox', fontsize=14) apple2_df = combined_df[combined_df['Apple variety']=='Egremont Russet'] apple2_df.plot.scatter(x='Year', y='Average wholesale price (£/kg)', s=(apple2_df['Orchard space (acres)']/2)) plt.title('Egremont Russet', fontsize=14) apple3_df = combined_df[combined_df['Apple variety']=='Gala'] apple3_df.plot.scatter(x='Year', y='Average wholesale price (£/kg)', s=(apple3_df['Orchard space (acres)']/2)) plt.title('Gala', fontsize=14) apple4_df = combined_df[combined_df['Apple variety']=='Worcester Pearmain'] apple4_df.plot.scatter(x='Year', y='Average wholesale price (£/kg)', s=(apple4_df['Orchard space (acres)']/2)) plt.title('Worcester Pearmain', fontsize=14) apple5_df = combined_df[combined_df['Apple variety']=='Spartan'] apple5_df.plot.scatter(x='Year', y='Average wholesale price (£/kg)', s=(apple5_df['Orchard space (acres)']/2)) plt.title('Spartan', fontsize=14) apple6_df = combined_df[combined_df['Apple variety']=='Other'] apple6_df.plot.scatter(x='Year', y='Average wholesale price (£/kg)', s=(apple6_df['Orchard space (acres)']/2)) plt.title('Other', fontsize=14) apple7_df = combined_df[combined_df['Apple variety']=='Discovery'] apple7_df.plot.scatter(x='Year', y='Average wholesale price (£/kg)', s=(apple7_df['Orchard space (acres)']/2)) plt.title('Discovery', fontsize=14) apple8_df = combined_df[combined_df['Apple variety']=='Jonagold'] apple8_df.plot.scatter(x='Year', y='Average wholesale price (£/kg)', s=(apple8_df['Orchard space (acres)']/2)) plt.title('Jonagold', fontsize=14) |

Studying the plots of individual apple varieties gives us a couple more insights.

They confirm our initial assumption that the wholesale price of apples is consistently rising (significantly) each year.

It is evident that for some reason the price of apples (well, at least some of the varieties) took a drop in 2009. A quick Google reveals that “A wet-ish summer, following a cold winter and a mild spring when the bees worked their alchemy on the fluttering, transient blossom, means that 2009 promises to yield a bumper crop.” (Colquhoun, 2009, The Telegraph – http://www.telegraph.co.uk/foodanddrink/6218852/Apples-buy-local-buy-English-buy-often.html) This bumper crop obviously drove the cost of apples down that year.

In terms of the relationship of orchard space with wholesale price, I think it would be safe to say that there is no statistically significant correlation. It is worth noting, however, that there is little change in orchard space over time for each of the varieties. I find this somewhat surprising and would have expected some varieties to get more or less popular over the years, with their respective orchard space reflecting this.