A simple data exploration (Python/OpenRefine)

I recently submitted an assignment for a data science module at uni, which asked us to look at two datasets; one contained information about the wholesale prices of fruit and vegetables in England for the years 2004-2012 (https://data.gov.uk/dataset/agricultral_market_reports) and the other orchard space in England and Wales (https://data.gov.uk/dataset/orchard_fruit_survey). The files were saved as as fruitveg.csv and orchfruit_ap&pr_30may13.csv respectively. I’ve included parts of my submission below which might be of interest to those looking for a simple example of how to work with data to garner insights. It includes cleaning the data, dealing with inconsistencies and ambiguity, visualising the data and finally touches on what it tells us and what we can learn from it.

Given these two datasets, produce a graphical representation of the changes in average wholesale price and orchard space for each type of dessert apple grown in England and Wales, and discuss what your representation shows.

 

Initial cleaning of datasets in OpenRefine

I initially opened both files in the spreadsheet program OpenOffice Calc (similar to Microsoft Exel) to get a preview of the datasets and see what kind of formats they were in.

I opened fruitveg.csv as comma separated, and realised the only values which are delimited by commas pertain to beans – as we aren’t dealing with beans, only dessert apples, we can safely open the file with the delimiter set to tabs.

As we are interested in changes in the relationship between the wholesale price and orchard space for each variety over time, I think it would be sensible to have a separate column for the year; I added a new row to the top of the file with the column headers, thereby ensuring consistency with the location of the years (e.g. 2005, 2005, 2006…)

I then saved the file, and opened it up in OpenRefine, to add the additional column for years: I clicked on the Categories column dropdown and applied a text filter using the regular expression \d{4}$ which highlighted the rows containing the years. I then selected edit column -> add column based on column, and named the new column ‘Year’, before moving it to the far left to become the first column. Finally, the Year and Category columns had the ‘fill down’ operation applied so that they would have the values duplicated for each applicable row – this allows the file to be interpreted as a dataframe in pandas.

As we are only interested in dessert apples, we run a custom text facet on the Category column where value = 'Dessert Apples' we can then select the rows which match false, and remove all matching rows, leaving us with only the dessert apple entries. I’ve included a screenshot below to see what we are now dealing with.

I noticed inconsistencies in terms of names of apple varieties in both files and changed the names of apple varieties in the two datasets to ensure consistency:

In fruitveg.csv

  • Removed entries for the Red Pippin variety, as we don’t have any orchard space data for the relevant years in orchfruit_ap&pr_30may13.csv to plot against
  • Renamed ‘Cox’s Orange group’ to Cox on the assumption that they were the same variety of apple
  • Renamed ‘Jonogold – group’ to Jonagold as it intuitively makes more sense and allows for merging of datasets

In orchfruit_ap&pr_30may13.csv

  • Removed the years prior to 2004 as we don’t have data on wholesale prices prior to 2004 to plot against
  • There was no data for Fiesta/Red Pippin apples (which I would have interpreted as being the same as Red Pippin) for the years between 2004 and 2015 so I removed this column
  • Removed apple varieties that weren’t present in fruitveg.csv, such as Jazz and Kanzi
  • Removed other types of fruit like cherries and plums as we aren’t interested in them
  • Renamed ‘Cox(and clones)’ to Cox on the assumption that they can be regarded as the same variety
  • Renamed ‘Gala(and clones)’ to Gala on the assumption that they can be regarded as the same variety
  • Renamed ‘Jonagold(and clones)’ to Jonagold on the assumption that they can be regarded as the same variety
  • Renamed ‘WorcesterPearmain’ to Worcester Pearmain to ensure consistency in names across the two datasets, allowing them to be merged
  • Renamed ‘EgremontRusset’ to Egremont Russet to ensure consistency in names across the two datasets, allowing them to be merged
  • Renamed ‘Other dessert varieties’ to Other to ensure consistency in names across the two datasets, allowing them to be merged

I exported both files in csv format, 'appleprices.csv' and 'orchardspace.csv' (Note that in the case of fruitveg.csv OpenRefine has now made the values delimited by commas, rather than tabs.)

 

Import the datasets and load necessary libraries

Here I load the Python libraries I’ll be using and import the two cleaned datasets from the data directory into two dataframes, appleprices_df and orchardspace_df.

Identify and handle missing data/ambiguity

As we are only provided with orchard space information for the years 2004, 2007, 2009 and 2012, we will only be able to produce a visualisation for these years; Therefore I’ll remove data for the other years in appleprices_df as we won’t be needing it for our purposes.

In order to produce a plot, we need to determine the average wholesale price for each variety of apple in a given year. You can see from above screenshot that there are many monthly readings for the price missing – this could be due to the particular variety not being grown during those month, or some other reason. Regardless, we will calculate an average from the values present. You can see in the code below that the mean price is calculated by taking the average of the prices for each month, and the value added to each row under the new column, ‘Mean Price’.

 YearCategoryDescriptionQualityUnitsJanFebMarAprMayJunJulAugSepOctNovDecMean Price
02004Dessert ApplesCox1st£/kg0.620.630.60.58NaNNaNNaNNaN0.680.510.530.580.591250
12004Dessert ApplesCox2nd£/kg0.400.400.40.42NaNNaNNaNNaN0.390.330.300.330.371250
22004Dessert ApplesCoxAve£/kg0.550.540.50.51NaNNaNNaNNaN0.640.450.460.500.518750
32004Dessert ApplesDiscovery1st£/kgNaNNaNNaNNaNNaNNaN0.750.510.37NaNNaNNaN0.543333
42004Dessert ApplesDiscovery2nd£/kgNaNNaNNaNNaNNaNNaN0.660.330.25NaNNaNNaN0.413333

We have information of wholesale prices for ‘Other Early/Mid/Late Season’ apple varieties in appleprices_df, and we also have orchard space data for other apple varieties in orchardspace_df. It would be reasonable to assume that the orchard space data incorporates the varieties for all seasons so we will work with this assumption. For these price values we aren’t given a quality (it is recorded as -1 which we can take to mean not applicable), so we will have to address this.

It is necessary then to get an average of wholesale prices for ‘other’ apple varieties by getting the average for each of the early, mid and late varieties, and then getting the mean of the combined price. This is what we will do in the code below; we iterate over the rows in appleprices_df and each time we find a record with the description (variety) of ‘Other Early Season’ we take a note of its index, adding it to the list indiciesOfOther. We can then iterate over the indices in this list and create a mean value for each year, adding it as a new row to our dataframe with the description of ‘Other’, quality of ‘Ave’ and relevant year.

You can see from the output of the code block below, that these rows have been appended to the bottom of appleprices_df. Note that we haven’t recorded things like category and units – this is because it’s unnecessary – we will be removing these columns shortly anyway to arrive at our final combined dataframe.

 YearCategoryDescriptionQualityUnitsJanFebMarAprMayJunJulAugSepOctNovDecMean Price
982012Dessert ApplesJonagoldAve£/kgNaN0.5NaNNaNNaNNaNNaNNaNNaN0.770.730.660.665000
992004NaNOtherAveNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.480595
1002007NaNOtherAveNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.533250
1012009NaNOtherAveNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.534583
1022012NaNOtherAveNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.738333

In appleprices_df, we don’t know know exactly what prices recorded under ‘ave’ quality refers to. We will assume it relates to the average wholesale price of the apple variety, regardless of quality, and will use these values to calculate the mean price, as it’s likely to be the most accurate measurement. For this reason, we can discard the rows which relate to apples of 1st and 2nd quality.

You can see from the output of the code block below that we are left only with prices for apple varieties of average quality.

YearCategoryDescriptionQualityUnitsJanFebMarAprMayJunJulAugSepOctNovDecMean Price 
02004Dessert ApplesCoxAve£/kg0.550.540.50.51NaNNaNNaNNaN0.640.450.460.500.518750
12004Dessert ApplesDiscoveryAve£/kgNaNNaNNaNNaNNaNNaN0.740.480.32NaNNaNNaN0.513333
22004Dessert ApplesEgremont RussetAve£/kg0.540.55NaNNaNNaNNaNNaN0.830.630.580.580.590.614286
32004Dessert ApplesSpartanAve£/kgNaNNaNNaNNaNNaNNaNNaNNaNNaN0.450.420.310.393333
42004Dessert ApplesWorcester PearmainAve£/kgNaNNaNNaNNaNNaNNaNNaNNaN0.480.40NaNNaN0.440000

 

Put the data into an appropriate form for plotting

To get a dataframe into a format similar to the one we were suggested to produce (with columns apple variety, year, wholesale price and orchard space), we need to prepare both dataframe for combining; They both share the columns year and apple variety so they will be combined on these properties.

In appleprices_df we can delete all the columns apart from year, description and mean price. We do this in the code block below, and then sort and rename the columns to more meaningful names, which can be used in our combine operation.

YearApple varietyAverage wholesale price (£/kg) 
02004Cox0.518750
12004Discovery0.513333
22004Egremont Russet0.614286
32004Gala0.455714
42004Jonagold0.433333

We will now use the pandas melt function to transform orchardspace_df in a workable format which can be combined with appleprices_df. We will also assume that the orchard space is measured in acres, as this seems the most sensible measurement. After renaming the columns you can now see that the two dataframes share two common columns, Year and Apple variety.

YearApple varietyOrchard space (acres) 
02004Discovery301.0
12007Discovery189.0
22009Discovery177.0
32012Discovery157.0
42004Worcester Pearmain213.0

Finally we can combine the two dataframes into combined_df which we will use to produce a plot.

 YearApple varietyAverage wholesale price (£/kg)Orchard space (acres)
02004Cox0.5187503144.0
12004Discovery0.513333301.0
22004Egremont Russet0.614286308.0
32004Gala0.455714669.0
42004Jonagold0.433333231.0

Visualise and interpret the data

We will first create a scatterplot with all the variables plotted, and see if we can garner any insights from it, or if it raises any questions which we can explore further. In the plot below, the orchard size is represented by the actual size of the plots, and the different colours pertain to different varieties of apple. Because we aren’t focusing on the individual varieties of apples at this stage, I have omitted a legend indicating what varieties the various colors relate to.

The plot is quite busy, as it maps a number of variables; we can immediately notice the correlation of wholesale price against time, which suggests that as time goes on, apples are becoming more expensive. It’s difficult to notice any relationship between wholesale price and orchard size.

Let’s have a look now at the individual apple varieties over the years and see if anything becomes apparent.

Studying the plots of individual apple varieties gives us a couple more insights.

They confirm our initial assumption that the wholesale price of apples is consistently rising (significantly) each year.

It is evident that for some reason the price of apples (well, at least some of the varieties) took a drop in 2009. A quick Google reveals that “A wet-ish summer, following a cold winter and a mild spring when the bees worked their alchemy on the fluttering, transient blossom, means that 2009 promises to yield a bumper crop.” (Colquhoun, 2009, The Telegraph – http://www.telegraph.co.uk/foodanddrink/6218852/Apples-buy-local-buy-English-buy-often.html) This bumper crop obviously drove the cost of apples down that year.

In terms of the relationship of orchard space with wholesale price, I think it would be safe to say that there is no statistically significant correlation. It is worth noting, however, that there is little change in orchard space over time for each of the varieties. I find this somewhat surprising and would have expected some varieties to get more or less popular over the years, with their respective orchard space reflecting this.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.