We recently launched our first feature, Cookery Collections. It’s a web interface to a database of over 10,000 cookbooks and their corresponding recipe ingredients. It was a massive undertaking, and we did not anticipate how many obstacles we would encounter.
Gathering the data
We used a number of resources to gather our recipes and ingredients. Our primary tool was web scraping, but we had to adjust the data manually as it came in. We operated under the guidance of US and international copyright law, which stipulates that recipes and their ingredient names are not protected but the measures and methods are.
A mere listing of ingredients is not protected under copyright law. However, where a recipe or formula is accompanied by substantial literary expression in the form of an explanation or directions, or when there is a collection of recipes as in a cookbook, there may be a basis for copyright protection. – Copyright.gov
We indexed cookery titles and ISBNs from public libraries and various print publishing websites such as Penguin Random House and Bloomsbury. Book covers were gathered from web searches and retail sites such as Amazon and Indigo. Ingredient listings were gathered from individual recipe web pages and aggregate sites like Eat Your Books.
Results – Here is what we gathered
Notable issues with the data
Cookery ingredients can be either very regional or vague. For instance, we have approximately 11,000 recipes that contain the phrase ‘black beans’. If we drill deeper we expose variations of this including the following:
- Black Bean Sauce With Garlic
- Canned Black Beans
- Black Bean Paste
- Black Bean Paste With Chiles
- Black Bean Dip
- Tinned Black Beans
- Canned Seasoned Black Beans
- Black Bean Cooking Liquid
- Calypso Black Beans
- Dried Black Beans
- Cooked Black Beans
- Black Bean Garlic Sauce
- Chile Black Bean Garlic Sauce
- Canned Black Bean Puree
- Chinese Salted Black Beans
- Black Bean Salsa
- Black Beans
- etc…
Clearly ‘Tinned Black Beans’ and ‘Canned Black Beans’ are regional phrases describing the same thing, but what about ‘Black Beans’, ‘Cooked Black Beans’ or ‘Cooked Seasoned Black Beans’? Is the author referring to a regional product, or are they using a regional phrase to describe a generic food? At this stage we aren’t certain, but we will address that once we’ve processed our 1.2M recipes. Stay tuned.
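To make the ‘tinned’ versus ‘canned’ case concrete, here is a minimal sketch of how regional wording could be folded into one canonical form so that equivalent variants group together. The synonym table and the function names are our own illustration, not part of the production pipeline, and the sketch deliberately leaves the ambiguous cases (‘Black Beans’, ‘Cooked Black Beans’) in separate groups.

```python
# Hypothetical synonym table (an assumption for illustration):
# maps a regional word to the canonical word we group under.
REGIONAL_SYNONYMS = {"tinned": "canned"}


def normalize_ingredient(name: str) -> str:
    """Lowercase the name, collapse whitespace, and swap regional words."""
    words = name.lower().split()
    return " ".join(REGIONAL_SYNONYMS.get(word, word) for word in words)


def group_variants(names):
    """Group original ingredient names by their normalized form."""
    groups = {}
    for name in names:
        groups.setdefault(normalize_ingredient(name), []).append(name)
    return groups


variants = [
    "Tinned Black Beans",
    "Canned Black Beans",
    "Black Beans",
    "Cooked Black Beans",
]
groups = group_variants(variants)
for canonical, originals in groups.items():
    print(canonical, "<-", originals)
```

A lookup table like this only handles wording we already know is equivalent. Deciding whether ‘Black Beans’ implies dried, cooked or canned still needs a human or a smarter model.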
Why is it important to standardize this data?
There are a number of reasons to standardize the names of whole food ingredients. The team at Google X describe it well in this article. For our project, we need standardized whole food data so that we can map all of our upcoming features and functions to the unique food entities. Short-term goals include:
- Enable users to search for recipes containing keywords such as ‘beans’ or ‘black beans’. A user in the UK may use the phrase ‘tinned’, while a user in the Americas may use ‘canned’ or ‘jarred’
- Classify each recipe ingredient as a protein, vegetable, fruit, etc.
- Add supplemental data such as nutritional data from the USDA and other international bodies
- Map each ingredient to an (endless) list of supermarket UPCs and EANs
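The first goal above can be sketched as a keyword search that treats regional packaging words as interchangeable. This is a rough illustration only: the `PACKAGING_TERMS` set, the `search` function and the decision to treat ‘tinned’, ‘canned’ and ‘jarred’ as one family are all assumptions made for the example, not a description of our search feature.

```python
# Hypothetical set of packaging words treated as equivalent (an assumption).
PACKAGING_TERMS = {"tinned", "canned", "jarred"}


def search(query, ingredient_names):
    """Return names containing every core query word.

    If the query includes a packaging word, a name matches when it
    contains any word from PACKAGING_TERMS, not only the one typed.
    """
    query_words = query.lower().split()
    core_words = [w for w in query_words if w not in PACKAGING_TERMS]
    wants_packaged = any(w in PACKAGING_TERMS for w in query_words)

    results = []
    for name in ingredient_names:
        name_words = set(name.lower().split())
        if not all(w in name_words for w in core_words):
            continue
        if wants_packaged and not (PACKAGING_TERMS & name_words):
            continue
        results.append(name)
    return results


names = [
    "Tinned Black Beans",
    "Canned Black Beans",
    "Dried Black Beans",
    "Black Bean Paste",
]
print(search("tinned black beans", names))
print(search("beans", names))
```

Matching on whole words keeps the sketch simple, but it means ‘beans’ does not match ‘Black Bean Paste’. Handling singular and plural forms would need stemming or a similar step.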
We’re still in the process of separating the individual ingredients from their recipes. We’ve completed 385,334 of 1.2M rows, so we have a long way to go. If you have ideas on how to speed up this process, please get in touch.