How to create a database for Charity Commission data
What data is available
The NCVO research team receives a lot of requests on charities, including queries on the number of charities in a local authority area, which charities work with children and young people, what are the top 100 charities by assets, and many more.
To be able to answer these questions and provide the most up-to-date information on charities, we created a database using data that is publicly available from the Charity Commission for England and Wales. This data is provided as a data dump and updated on a monthly basis. As charities have different financial periods and are allowed to submit their accounts to the Charity Commission within ten months after the end of their financial year, these updates are crucial to us.
The data dump contains the following information:
- All registered charities in England and Wales
- Additional information on each charity
- Financial data for each charity
- Financial data for charities with an income of over £500k (‘part b’)
You can find an exact description of the tables and data fields on the Charity Commission site on data definitions.
How to access and download the data
Each month, the Charity Commission provide two zip files to download from their site.
- Charity register extract
- Table build scripts
The ‘charity register extract’ contains all the charity data described above.
The charity register extract data is stored in a ‘.bcp’ format, which stands for Bulk Copy Program. This file format is used to import data into a Microsoft SQL Server database. However, this is an enterprise piece of software and as such organisations need to pay for it. Microsoft provide a free, “Express”, version of the software but it does have size limitations (maximum of 10GB of data storage) that can have significant impact if data capacity needs are not reasonably or properly anticipated.
Most likely, the Charity Commission use Microsoft SQL Server but at NCVO we use ‘MySQL’, an open source database server, which provides an excellent free alternative. However, in order to import the data, the .bcp files first need to be translated into MySQL format like CSV.
If you are using Microsoft SQL Server, you have the option to import the data in the .bcp source files into a Microsoft SQL Server database. But as we do not use this method, we are unable to give you any guidance on this.
How to get the data into a database
David Kane, a former researcher at NCVO, has created a ‘GitHub repository’ – a digital storage space that includes all the necessary files, links, instructions and python scripts. All the files are free to download and the instructions will take you through the following steps:
- Download files from the Charity Commission
- Convert .bcp files to CSV files
- Create database for the data
- Import data in the CSV files to the database
The instructions rely on knowledge running Python programs and querying databases using SQL. If you’re not too familiar with Python and SWL, you may need to do some additional research. In terms of getting started with Python, you can install Anaconda, which provides you with most Python libraries you will need.
BCP: Bulk Copy Program – a type of file that the Charity Commission uses to save the data and SQL statements – it is designed to be used with Microsoft SQL Server.
Database server: A piece of software that shares resources between databases – essentially it houses databases and allows you to access them and control what they do.
Database: A collection of information that is organised so that it can be easily accessed, managed and updated. For example, it allows you to create and store a series of spreadsheets in one place, which links them all together and allows you to retrieve the data using a computer language (SQL).
Git: A technology that enables users to easily control different versions of files. It is very popular with software developers who work in teams and need to ensure versions of the same code do not get mixed up with each other in the process of building programs/applications/websites.
GitHub: A company who offer repositories.
Microsoft SQL Server: Microsoft’s commercial database server, also called MS SQL Server.
MySQL: An ‘Open Source’ database platform that refers to both a database server and the type of SQL it uses.
Open Source: Free-to-use software distributed on a specific licence where, loosely speaking, others’ versions of the software can be used for making profit as long as it is not exclusively re-licensed to another legal entity, even when modified.
Python: A popular computer language that is very popular in handling and analysing data – we use Python 3.5 (current version is 3.5.2).
SQL: Structured Query Language – a computer language designed specifically for interacting with databases. It can retrieve, delete and update data as well as build and delete databases.
Repository: A place to store files and folders that (in most cases) uses Git to control different versions of the files.