A new attribute-linked residential property price dataset for England and Wales, 2011–2019

Current research on residential house price variation in the UK is limited by the lack of an open and comprehensive house price database that contains both transaction price alongside dwelling attributes such as size. This research outlines one approach which addresses this deficiency in England and Wales through combining transaction information from the official open Land Registry Price Paid Data (LR-PPD) and property size information from the official open Domestic Energy Performance Certificates (EPCs). A four-stage data linkage is created to generate a new linked dataset, representing 79% of the full market sales in the LR-PPD. This new linked dataset offers greater flexibility for the exploration of house price (£/m2) variation in England and Wales at different scales over postcode units between 2011 and 2019. Open access linkage codes will allow for future updates beyond 2019.


Introduction
Comparative international analyses of house prices are constrained by differences in definition, data structure, spatial/time scales and coverage. These limit both comparative analysis and within-country analysis of housing markets [1,2]. House price data deficiencies hinder research on residential house price variation in the UK, and limit understanding of the housing market. Modelling of UK-based house price changes dates back to the 1970s [3,4] with much of the data used either UCL  A new attribute-linked residential property price dataset for England andWales, 2011-2019 Data linkage The data linkage method used here is similar to an earlier published method [35], but with greater granularity in the matching rules. Linkage between the PPD and Domestic EPC dataset is achieved though several phases dealing with successively more complex address matching challenges. Before matching, transactions in the LR-PPD without postcodes in the Domestic EPCs dataset are excluded -this accounts for 0.55% of the data -leaving a total of 23,999,656 transactions for matching. Figure 1 shows an example of the data linkage process, with the basic idea of linkage between these two datasets being to match by full postal delivery address (i.e., postcode plus detailed address strings). These two datasets both contain property information at address level but their address structures differ, thus basic data standardisation is needed. First, all address strings in the Domestic EPCs are capitalised and stored in new variables. These newly created address variables are used to achieve an initial data linkage. To deal with more complex subsequent linkage passes, 183 new variables are created in the LR-PPD and 99 new variables are created in the Domestic EPCs (Appendix A).
A matching method containing a four-stage (251 matching rules) process was designed and is outlined in Fig. 2. In the Domestic EPCs, each record is created using a unique identifier named id. Each transaction in the LR-PPD has a unique identifier named transactionid. Taking Stage 1 as an

Figure 1
An example of the data linkage process. Match rate of linked house price data in England andWales, 1995-2019. example of the matching process; all the matches are based on a temporary address string (i.e., postcode+saonpaonstreet) with the algorithm testing whether postcode+saonpaonstreet in LR-PPD is equal to any postcode +ADDRE in the Domestic EPCs. Where they match directly, records for both datasets are joined, removed from the original data and stored in a new temporary linked data table, DATA 1. For records where a match is not achieved on the first pass, the algorithm moves onto a further set of matching tests in Stage 2.
Problems emerge where one property may have more than one Domestic EPC. Where this is the case, only property transactions with just one successfully linked EPC will be moved from the temporary DATA 1 and directly stored in the final linked-EPC PPD dataset. Property transactions with successful links to more than one EPC are stored in a separate dataset, DATA 3. These data are filtered to select all Domestic EPCs for which total floor area is neither NULL nor 0 and then linked where the EPC inspection date or lodgement date is closest to the transaction date in the LR-PPD. This result will then be stored in the final linked-EPC PPD dataset. Stages 2 to 4 follow a similar process to Stage 1. The linked-EPC PPD dataset is the data linkage result. These data linkage results link back to the original Domestic EPCs and to the LR-PPD by their unique identifiers.
Following the four-stage data linkage, 16,846,834 transaction records in England and Wales between 1995 and 2019 were successfully linked with Domestic EPCs. These comprise the linked dataset. The match rate of transactions in England is shown in Fig. 3. The match rate between 2011 and 2019 is higher than 90%, while the match rate of the rest of the period is considerably lower, this is mainly due to the EPCs dataset only covering the period between 1/10/2008 and 31/8/2019. The match rate of 56.20% in 2008 is particularly low but rapidly increases to over 88% after 2010. As the match rate before 2008 is significantly lower than for the period after 2008, only the linked data between 2009 and 2019 are used to conduct the evaluation of data linkage.

Evaluation of the data linkage between 2009 and 2019
Match rates offer a crude way to quantify the matching performance, but visual comparison of the house price frequency distributions for the new linked data and original LR-PPD data reveals a clearer picture of matching performance. Histograms of the logarithm of transaction price from both datasets are shown in Fig. 4. In each graph, the distribution of the linked data (blue) is overlaid onto the distribution of the original LR-PPD dataset (white). The area of visible white bars represents the proportion of un-matched cases. Importantly, there was no significant loss of information as a result of un-matched cases in the data linkage between 2010 and 2019.
The Kolmogorov-Smirnov test (K-S test) and the Jeffreys divergence (J-divergence) can be used to quantify the extent of house price information lost. The K-S test is a nonparametric test that examines the differences in the shape of a distribution. The K-S test, statistic D, is based on the maximum absolute difference between two cumulative distribution functions. Here, the test will be used to quantify the difference of two house price distributions (original data vs. linked data). The Jeffreys divergence (J-divergence), derived from information theory, is a function used to establish the distance of one probability distribution to another [36][37][38]. To calculate the J-divergence, the data from two different samples must first be assigned to k different categories. In the case UCL  A new attribute-linked residential property price dataset for England andWales, 2011-2019

Figure 4
House price distribution of original data and linked data, 2009-2019. of this research, these categories are a simple subdivision of the log house price into bins. The J-divergence is then defined as 1 1 ln ln where k is the number of categories, p j is the proportion of data points in category j in the original house price data, and q j is the proportion of data points in category j in the linked house price data. The final divergence measure, J, ranges from 0 to 1. If the distribution of both data samples across all the categories is the same, J will be 0. Larger values of J indicate greater differences between the two distributions.
To compute the J-divergence, the original data and linked data are divided into 100 bins, the 100 bins are created based on the 100 equal intervals of log house price in the original data in a given year. The results of the J-divergence and K-S tests are shown in Fig. 5 A new attribute-linked residential property price dataset for England and Wales, 2011-2019 the K-S tests are less than 0.05 (the conventional default threshold for statistical significance), indicating a statistically significant difference between the original house price data and the linked house price data. The D statistic is relatively low (less than 0.007) after 2010. This demonstrates that the house price datasets before and after linkage are highly similar after 2010. The J-divergence results also show that the linked data exhibits relatively low information loss after 2010. Given the information lost in terms of J-divergence is slightly higher in 2010 compared to the loss after 2010, the newly created house price data from 2011 to 2019 is more representative than that for other years. Therefore we keep the 2011 to 2019 time period. With the geo-referenced data, the overall match rates between 2011 and 2019 by LA (Fig. 6)  Looking at annual match rates across LAs in England and Wales (Fig. 7), 70% of LAs represent an annual match rate over 90% from 2011 to 2019, while 98% of the LAs represent an annual match rate over 80%. Figure 7 colours the six LAs with annual match rates lower than 80%. A new attribute-linked residential property price dataset for England and Wales, 2011-2019

Figure 6
Overall match rates at local authority level between 2011 and 2019.
For example, Domestic EPCs in the City of London at postcode 'EC2Y 9BB' are not available hence transactions in 'EC2Y 9BB' cannot be successfully matched, 0.26% of house price transactions in the LR-PPD (1/1/2011-31/10/2019) fail to link for this reason. Some transactions in the LR-PPD can relate to a postcode unit which is also identified in the EPC data but contain no matching property identifiers. For example, one flat sold in 2011 in Camden failed to match because Domestic EPCs are not available for this property. The potential reasons for non-availability of property records in Domestic EPCs could be that records have been incorrectly loaded by the surveyor or that the property owner has opted out.

Data cleaning
Of the linked data, 6,753,307 records can be geo-referenced by linking the NSPL between 1/1/2011 and 31/10/2019 in England and Wales. This data comprises the transaction information in the LR-PPD together with property size (total floor area and number of habitable rooms) in the EPCs. Some properties' total floor area and number of habitable rooms are recorded in the EPCs with missing

Figure 7
Match rate across local authority in England andWales, 2011-2019. or unlikely values (e.g., total floor area records as 0.01). This data is excluded prior to analysis. All the excluded transactions along with cleaning methods are listed in Table 1, which accounts for 15.11% of the linked geo-referenced data.
After removing the transactions listed in Table 1, 5,732,838 transactions are left. This represents 79.11% of full market property sales in the LR-PPD in England and Wales between 1/1/2011 and 31/10/2019. This linked dataset, like the LR-PPD, fully covers all the regional areas, local authorities and MSOAs in England and Wales. The LR-PPD covers 99.99% of LSOAs and this is also the same for the final linked data. Although the newly linked data is not as comprehensive as the LR-PPD, it is the largest open access house price dataset in England and Wales (1/1/2011-31/10/2019) containing both the transaction price and total floor area.

Dataset access
The final linked dataset details 5,732,838 transactions in England and Wales (1/1/2011-31/10/2019). It not only adds in a property's total floor area and the number of habitable rooms, but also includes a new unique identifier (i.e., id) and other non-address fields (except LMK_KEY field) in the Domestic EPC dataset. Codes for other commonly used spatial units from Output Area (OA) to region are also included in this dataset. It contains 105 fields written in upper or lower case. All the fields written in upper case come from Domestic EPCs, the 33 remaining fields written in lower case are introduced in Github (https://github.com/Bin-Chi/Link-LR-PPD-and-Domestic-EPCs).
The linked, original EPCs and LR-PPD datasets are stored in CSV format and deposited in UKDA ReShare [40]. Postcode and address elements in the linked data stem from address information in LR-PPD, which is subject to Royal Mail copyright. The Royal Mail confirmed on 25/8/2020 that this linked data can be shared both by the first author and by the UK Data Service on the same terms as the original datasets. Therefore the linked data is under a licence that precludes commercial use. Meanwhile, the data linkage is conducted in R and stored in PostGIS. A new attribute-linked residential property price dataset for England andWales, 2011-2019 Potential dataset use and reuse The newly linked dataset offers directly useable information on house price per square metre along with transaction price, total floor area, number of habitable rooms, transaction date and commonly used geographical area identifiers at and over postcode geographical level in England and Wales. As the LR-PPD data for the most recent two months may be incomplete due to the delay between the property transaction and its registration in Land Registry [21], we suggest researchers use transactions before 31/8/2019. This could support quantitative house price research in terms of house price variation within England and Wales after 2011 at multi-geographical scales over postcode level [41]. It also can be used to explore the relationship between house price and a property's energy performance [30,31,42]. In addition, as the LR-PPD is updated monthly and the Domestic EPCs are updated two or four times a year, the open access codes will allow for future updates and thereby maintain a continuously updated dataset of residential property prices in England and Wales.
In this paper, we provide three technical validation approaches (section: Technical validation) to inform potential users of the data quality issues associated with different years in the dataset.
In Table 1, a series of rules are described which we have used to exclude potential errors in the dataset. These are our suggestions and very reasonably, users could develop their own exclusion criteria for use with the raw linked data. In this dataset, before the data linkage, all transactions designated as category B (Additional Price Paid entry) and other property types are removed. Researchers could add these entries back in by modifying the related code shared via the UK Data Service Reshare service (https://reshare.ukdataservice.ac.uk/854240/). To further benefit noncommercial users who would like to access the latest orginal linkage dataset before the technical validation process, we will annually publish a simple version of the latest raw linkage data via the Greater London Authority (GLA)'s London Datastore.
For users who would like to update the linkage dataset themselves with the linkage code, the Domestic EPCs downloaded may be different from the third version used in this paper. For example, by the time this paper was under open review in February 2021, Domestic EPCs had reached their sixth released version (1/10/2008-20/9/2020). This new version covers more variables than the third version (e.g., building's construction age band). Moreover, this sixth version has a different sample size of Domestic EPCs for the same time period compared with the third version. The reasons for this difference are complex, although one of the main reasons is that some property owners are withdrawing their EPC records from the publicly available platform. For users who use the latest linked data to explore house prices during the coronavirus pandemic, we highly recommend Neal Hudson's blog [43] to gain an understanding of how the pandemic increased the HM Land Registry time lag in registrations.

Conclusions
The linkage method was orignally created to enrich the geo-referenced house price dataset in England before 31/7/2017 [35], it still shows a smiliar performance when updated with new published house prices and covers Wales as shown in this research. Within the linkage, properties in the LR-PPD and Domestic EPC dataset have slightly different names (e.g., 'CLEATOR STREET' vs. 'CLEATER STREET'). We manually correct this type of mismatched address string for the properties located in England and record this correction within the linkage codes. This contributes to a less than 1% increase in the total matching rate. Our futher linkage research is to focus on fixing this issue in Wales and for the newly updated transactions in England.
We expect that this new house price dataset will enable new research directions in UK housing analysis. To date, most hedonic house price models have had to contend with the confounding influence of variations in dwelling size in different housing market areas. This new dataset will enable more parsimonious models of price variation to be explored where proxies for size can be dispensed with. A new attribute-linked residential property price dataset for England andWales, 2011-2019 Declarations and conflict of interest The authors declare no conflicts of interest in connection to this article.

Open data and materials availability
The datasets generated during and/or analysed during the current study are available in the repository: https://reshare. ukdataservice.ac.uk/854240/. [4] McAvinchey ID, Maclennan D. A regional comparison of house price inflation rates in Britain, 1967-76. Urban Stud. 1982:43-57.

References
[5] Wabe JS. A study of house prices as a means of establishing the value of journey time, the rate of time preference and the valuation of some aspects of environment in the London Metropolitan Region. Appl Econ. 1971;3(4):247-55.
[10] Ahlfeldt GM, Holman N, Wendland N. An assessment of the effects of conservation areas on value. English Heritage; 2012.
[11] Gray D.           In this column, variables on the left side of the symbol (=) refer to address fields in the LR-PPD, variables on the right side of the symbol (=) refer to address fields in the Domestic EPCs. Symbol (=) refers to string matching function. Table B. Continued